This is part three of a multipart series in which we demonstrate how to incrementally update attack tables using vector operations.
We can already use the function from part 1 to calculate entire attack tables. While slow, it is still useful for generating the initial attack tables at FEN parsing time, where speed doesn't matter. Everywhere else, its slowness is a problem.
The performance issue comes from needing to iterate over the entire board, which takes a total of 64 vector permutations. This performs unnecessary work, as the majority of squares do not need attack table updates. We should minimize the number of vector permutations we have to do, and to do that we need to update the attack tables incrementally.
Think about what happens during a move like e2e4:
Now think about what needs to happen to the attack tables:
Now think about what happens when a knight on g1 captures a queen on f3:
Other move types are left as an exercise for the reader.
By doing this exercise you will see that there are only three sub-operations you need to implement, and with only these three sub-operations you can implement any move type: removing a piece's attacks from the table, incrementally updating the rays of sliders that see a square whose occupancy changed, and adding a piece's attacks to the table.
Vector operations make this really easy to implement. We mask the entire attack table to remove the piece from it. This brute force method is much faster than figuring out what squares are actually being attacked. Implementation below:
void Position::removeAttacks(bool color, u8 id) {
    const v512 mask = v512::broadcast16(~u16(1 << (id & 0xF)));
    m_attack_table[color].z[0] &= mask;
    m_attack_table[color].z[1] &= mask;
}
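As a mental model for what this masked AND achieves, here is a scalar sketch with hypothetical types (a plain array of 16-bit masks standing in for Rose's v512-based table): each square stores which piece ids attack it, and removal clears one id's bit on every square at once.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using u16 = std::uint16_t;

// Scalar model of one side's attack table: for each of the 64 squares,
// a 16-bit mask saying which piece ids (0..15) attack that square.
using AttackTable = std::array<u16, 64>;

// Remove every attack by piece `id`: AND the complement of its bit
// into all squares. The vector version does all 64 squares in two ops.
inline void removeAttacks(AttackTable& table, unsigned id) {
    const u16 mask = static_cast<u16>(~(1u << (id & 0xF)));
    for (u16& sq : table)
        sq &= mask;
}
```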
Whenever a piece appears on or vanishes from a square, the sliders that see that square need their rays updated.
In the above example, when e4 vanishes, the bishop is now able to see the white pawn on c6.
Carefully think about what is happening from the point of view of the vanishing pawn on e4:
In part 1, I said that “the order of the rays in the ray vector is significant, and we will explain why later.” Rotating the ray information by 180 degrees is why the rays are in compass direction order. We implement this rotation by swapping the upper and lower halves of the ray vector.
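To make the rotation concrete, here is a scalar sketch (the Dir enum and helper names are my own illustration, not Rose identifiers): with rays stored in compass order, the opposite of any direction is 4 steps away, so a 180-degree rotation of the 64-byte ray vector is a swap of its two 32-byte halves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Rays in compass order: N, NE, E, SE, S, SW, W, NW.
enum Dir { N, NE, E, SE, S, SW, W, NW };

// The opposite compass direction is always 4 positions away.
constexpr Dir opposite(Dir d) {
    return static_cast<Dir>((static_cast<int>(d) + 4) % 8);
}

// On the 64-byte ray vector (8 rays x 8 bytes), the same rotation is a
// swap of the lower and upper 32-byte halves.
constexpr std::array<unsigned char, 64> rotate180(const std::array<unsigned char, 64>& v) {
    std::array<unsigned char, 64> out{};
    for (std::size_t i = 0; i < 64; i++)
        out[i] = v[(i + 32) % 64];
    return out;
}
```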
Now that the basic theory is understood, here is how Rose implements this. First, we need a new Place format, because we now have to squeeze color, piece type, and id information into 8 bits: piece id information must live in the mailbox, and the old format did not have enough space for it.
// Place format:
// u1 color
// u3 piece type
// u4 id (0 to 15)
struct Place {
    u8 raw = 0;
    constexpr bool isEmpty() const { return raw == 0; }
    constexpr Color color() const { return (raw & 0x80) != 0; }
    constexpr PieceType ptype() const { return (raw >> 4) & 0x7; }
    constexpr u8 id() const { return raw & 0xF; }
};
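As a quick sanity check of the bit layout, here is a hypothetical makePlace packing helper (not part of Rose) together with the same bit extractions the accessors use:

```cpp
#include <cassert>
#include <cstdint>

using u8 = std::uint8_t;

// Pack a place: color in bit 7, piece type in bits 4-6, id in bits 0-3.
constexpr u8 makePlace(bool color, u8 ptype, u8 id) {
    return static_cast<u8>((color ? 0x80 : 0x00) | ((ptype & 0x7) << 4) | (id & 0xF));
}
```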
enum class PieceType {
    none   = 0b000,
    king   = 0b001,
    pawn   = 0b010,
    knight = 0b011,
    bishop = 0b101,
    rook   = 0b110,
    queen  = 0b111,
};
// notice that slider pieces always have the 0b100 bit set
// for the sliders:
// diagonal sliders: have 0b001 bit
// orthogonal sliders: have 0b010 bit
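As a scalar sketch of why this encoding is convenient (the helper names are mine, not Rose's), each slider test is a single AND:

```cpp
#include <cassert>
#include <cstdint>

using u8 = std::uint8_t;

// Piece type encoding from the enum above.
constexpr u8 knight = 0b011, bishop = 0b101, rook = 0b110, queen = 0b111;

constexpr bool isSlider(u8 ptype)   { return (ptype & 0b100) != 0; }
constexpr bool slidesDiag(u8 ptype) { return (ptype & 0b100) && (ptype & 0b001); }
constexpr bool slidesOrth(u8 ptype) { return (ptype & 0b100) && (ptype & 0b010); }
```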
Detecting sliders that can attack us is easy with the above piece type format.
This is more-or-less the same code in the “how to calculate attacks to one square?” section of part 1, but simplified to only consider sliders, so you should re-read that for a refresher.
// Constants
constexpr u8 place_slider_bit = 0b100 << 4;
constexpr u8 diag = 0b001 << 4;
constexpr u8 orth = 0b010 << 4;

constexpr v512 slider_mask = v512{std::array<u8, 64>{
    0, orth, orth, orth, orth, orth, orth, orth, // N
    0, diag, diag, diag, diag, diag, diag, diag, // NE
    0, orth, orth, orth, orth, orth, orth, orth, // E
    0, diag, diag, diag, diag, diag, diag, diag, // SE
    0, orth, orth, orth, orth, orth, orth, orth, // S
    0, diag, diag, diag, diag, diag, diag, diag, // SW
    0, orth, orth, orth, orth, orth, orth, orth, // W
    0, diag, diag, diag, diag, diag, diag, diag, // NW
}};
// How we actually do it:
// First, permute the board into our ray vector format
const auto [ray_coords, ray_valid] = geometry::superpieceRays(sq);
const v512 ray_places = vec::permute8(ray_coords, m_board.z);
// Which pieces can potentially slide down their rays?
const u64 sliders = vec::test8(ray_places, v512::broadcast8(place_slider_bit))
& vec::test8(ray_places, slider_mask);
// Which pieces are visible to us?
// (Same as part 1, but we mask out horses.)
const u64 blockers = ray_places.nonzero8();
const u64 raymask = geometry::superpieceAttacks(blockers, ray_valid) & 0xFEFEFEFEFEFEFEFE;
// Which pieces are sliders that are visible to us?
const u64 visible_sliders = raymask & sliders;
I hope this is self-explanatory.
Let us start with a worked example:
This would correspond to the code below:
// Slider information was calculated in the previous section
const u64 visible_sliders = raymask & sliders;
// Places contain color and id information, and that's the information we need.
// Retrieve this information using the visible_sliders bitboard.
const v512 visible_sliders_places = vec::mask8(visible_sliders, ray_places);
// Broadcasts slider id to fill its ray (its 8 byte group)
const v512 broadcasted_sliders = vec::lanebroadcast8to64(visible_sliders_places);
// Rotate ray vector
const v512 rotated = vec::shuffle128<0b01001011>(broadcasted_sliders);
// Mask to actual ray length
const v512 masked_rotated = vec::mask8(raymask, rotated);
We need to broadcast information from one byte to an eight-byte lane within a vector. There are a few ways to do this, but the way with the fewest instructions is:
v512 lanebroadcast8to64(v512 x) {
    const v512 y = vec::gf2p8affine8(v512::broadcast64(0x0102040810204080), x, 0);
    return vec::gf2p8affine8(v512::broadcast64(0xFFFFFFFFFFFFFFFF), y, 0);
}
This is yet another neat bithack with the vgf2p8affineqb instruction. In summary, the above code performs a horizontal xor across each group of 8 bytes and broadcasts the result back to those 8 bytes, so it is equivalent to:
v512 lanebroadcast8to64(v512 x) {
    for (int i = 0; i < 8; i++) {
        u8 value = 0;
        for (int j = 0; j < 8; j++)
            value ^= x.bytes[8 * i + j];
        for (int j = 0; j < 8; j++)
            x.bytes[8 * i + j] = value;
    }
    return x;
}
This works because there is only ever zero or one slider in each eight-byte ray group, so value will only pick up the value of that one element.
There are alternative implementations; finding them is left as an exercise for the reader.
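For a testable version of that scalar loop, here is a self-contained sketch over a plain 64-byte array (std::array standing in for v512): with at most one nonzero byte per group, the horizontal xor reduces to that byte.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using u8 = std::uint8_t;

// Horizontal xor across each 8-byte group, broadcast back to the group.
std::array<u8, 64> lanebroadcast8to64(std::array<u8, 64> x) {
    for (int i = 0; i < 8; i++) {
        u8 value = 0;
        for (int j = 0; j < 8; j++)
            value ^= x[8 * i + j];
        for (int j = 0; j < 8; j++)
            x[8 * i + j] = value;
    }
    return x;
}
```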
The Place format was designed with this in mind: we need to ensure the msb of the slider information is zero when doing vpsadbw, to avoid negative values.

Unlike GPUs, CPUs do not have a backwards permute instruction, which would make this very straightforward. I attempted to come up with alternative ways of computing this, but the easiest way is a lookup table. If you come up with a nice way of doing this, please let me know.
As we need to use a forwards permute, we need to calculate the appropriate permutation to use.
Notice in the code below that table is a diagram of the rays from a superpiece, with values corresponding to indices into the ray format. We just want to look at an 8x8 window of this larger table.
v512 superpieceInverseRays(Square sq) {
    constexpr u8 none = 0xFF;
    constexpr std::array<u8, 512> table{{
        none, none, none, none, none, none, none, none, none, none, none, none, none, none, none, none,
        none, 0x2F, none, none, none, none, none, none, 0x27, none, none, none, none, none, none, 0x1F,
        none, none, 0x2E, none, none, none, none, none, 0x26, none, none, none, none, none, 0x1E, none,
        none, none, none, 0x2D, none, none, none, none, 0x25, none, none, none, none, 0x1D, none, none,
        none, none, none, none, 0x2C, none, none, none, 0x24, none, none, none, 0x1C, none, none, none,
        none, none, none, none, none, 0x2B, none, none, 0x23, none, none, 0x1B, none, none, none, none,
        none, none, none, none, none, none, 0x2A, 0x28, 0x22, 0x20, 0x1A, none, none, none, none, none,
        none, none, none, none, none, none, 0x30, 0x29, 0x21, 0x19, 0x18, none, none, none, none, none,
        none, 0x37, 0x36, 0x35, 0x34, 0x33, 0x32, 0x31, none, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
        none, none, none, none, none, none, 0x38, 0x39, 0x01, 0x09, 0x10, none, none, none, none, none,
        none, none, none, none, none, none, 0x3A, 0x00, 0x02, 0x08, 0x0A, none, none, none, none, none,
        none, none, none, none, none, 0x3B, none, none, 0x03, none, none, 0x0B, none, none, none, none,
        none, none, none, none, 0x3C, none, none, none, 0x04, none, none, none, 0x0C, none, none, none,
        none, none, none, 0x3D, none, none, none, none, 0x05, none, none, none, none, 0x0D, none, none,
        none, none, 0x3E, none, none, none, none, none, 0x06, none, none, none, none, none, 0x0E, none,
        none, 0x3F, none, none, none, none, none, none, 0x07, none, none, none, none, none, none, 0x0F,
    }};
    // 0xYZ: Y = row in table above, Z = column in table above.
    // The below offsets are for square a1.
    // This defines a window into the above table.
    constexpr v512 offsets = v512::fromArray8({
        0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F,
        0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F,
        0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
        0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
        0xC8, 0xC9, 0xCA, 0xCB, 0xCC, 0xCD, 0xCE, 0xCF,
        0xD8, 0xD9, 0xDA, 0xDB, 0xDC, 0xDD, 0xDE, 0xDF,
        0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
        0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF,
    });
    constexpr v512 table0 = v512::load(table.data());
    constexpr v512 table1 = v512::load(table.data() + 64);
    constexpr v512 table2 = v512::load(table.data() + 128);
    constexpr v512 table3 = v512::load(table.data() + 192);
    const v512 window = vec::sub8(offsets, v512::broadcast8(internal::expandSq(sq)));
    const v512 result0 = vec::permute8(window, table0, table1);
    const v512 result1 = vec::permute8(window, table2, table3);
    return vec::blend8(window.msb8(), result0, result1);
}
The above code effectively looks into a window into table whose corner is defined by sq. The blend is needed because the maximum index of an AVX-512 permute is 7 bits while we effectively need an 8-bit permute, so we emulate it.
This can be optimized by precomputing the resulting v512
for all 64 possible input squares.
This would be a lookup table that is 4 KiB in size.
Now that we have all the slider information back in board format, we can update the attack tables. We first split the information into slider id and slider color. We then convert each slider id into a bitmask (mask = 1 << id), then xor it into the attack tables.
const v512 inv_perm = geometry::superpieceInverseRays(sq);
// Inverse permutation from ray layout back into board layout
const v512 slider_info_in_board_layout = vec::permute8_mz(~inv_perm.msb8(), inv_perm, masked_rotated);
const u64 color = slider_info_in_board_layout.msb8();
const v512 ids = slider_info_in_board_layout & v512::broadcast8(0xF);
// We use zero as an invalid id.
// Elsewhere (during fen parsing), we ensure that zero is always the id of the king.
// As the king is not a slider, this is fine.
const u64 valid_ids = slider_info_in_board_layout.nonzero8();
const v512 ones = v512::broadcast16(1);
const v512 bits0 = vec::shl16_mz(u32(valid_ids), ones, vec::zext8to16(ids.to256()));
const v512 bits1 = vec::shl16_mz(u32(valid_ids >> 32), ones, vec::zext8to16(vec::extract256<1>(ids)));
m_attack_table[0].z[0] ^= vec::mask16(~u32(color), bits0);
m_attack_table[0].z[1] ^= vec::mask16(~u32(color >> 32), bits1);
m_attack_table[1].z[0] ^= vec::mask16(u32(color), bits0);
m_attack_table[1].z[1] ^= vec::mask16(u32(color >> 32), bits1);
Using xor means we can use this exact same function for both extending and retracting rays, because it's a bit flip.
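This self-inverse property is easy to check in a scalar model (hypothetical types, not Rose's): toggling an id's bit along the same squares twice restores the table, which is why one function serves both ray extension and retraction.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using u16 = std::uint16_t;

// Toggle slider `id`'s attack bit on every square of a ray.
// Because xor is a bit flip, the same call both extends and retracts.
void xorRay(std::array<u16, 64>& attack_table, unsigned id, const std::vector<int>& squares) {
    for (int sq : squares)
        attack_table[sq] ^= static_cast<u16>(1u << id);
}
```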
auto Position::incrementalSliderUpdate(Square sq) -> void {
    const auto [ray_coords, ray_valid] = geometry::superpieceRays(sq);
    const v512 ray_places = vec::permute8(ray_coords, m_board.z);
    const u64 blockers = ray_places.nonzero8();
    const u64 raymask = geometry::superpieceAttacks(blockers, ray_valid)
                      & geometry::non_horse_attack_mask;
    const u64 sliders = (ray_places & v512::broadcast8(Place::slider_bit)).nonzero8()
                      & (ray_places & geometry::sliderMask).nonzero8();
    // We only care about sliders that are visible to us that can attack us.
    const u64 visible_sliders = raymask & sliders;
    const v512 visible_sliders_places = vec::mask8(visible_sliders, ray_places);
    // Broadcast slider id to fill its ray (its 8-byte group)
    const v512 broadcasted_sliders = vec::lanebroadcast8to64(visible_sliders_places);
    // Rotate ray vector
    const v512 rotated = vec::shuffle128<0b01001011>(broadcasted_sliders);
    // Mask to actual ray length
    const v512 masked_rotated = vec::mask8(raymask, rotated);
    const v512 inv_perm = geometry::superpieceInverseRays(sq);
    // Inverse permutation from ray layout back into board layout
    const v512 slider_info_in_board_layout = vec::permute8_mz(~inv_perm.msb8(), inv_perm, masked_rotated);
    const v512 ids = slider_info_in_board_layout & v512::broadcast8(0xF);
    const u64 color = slider_info_in_board_layout.msb8();
    // We use zero as an invalid id because it is guaranteed to be a king, which is never a slider.
    const u64 valid_ids = slider_info_in_board_layout.nonzero8();
    const v512 ones = v512::broadcast16(1);
    const v512 bits0 = vec::shl16_mz(u32(valid_ids), ones, vec::zext8to16(ids.to256()));
    const v512 bits1 = vec::shl16_mz(u32(valid_ids >> 32), ones, vec::zext8to16(vec::extract256<1>(ids)));
    m_attack_table[0].z[0] ^= vec::mask16(~u32(color), bits0);
    m_attack_table[0].z[1] ^= vec::mask16(~u32(color >> 32), bits1);
    m_attack_table[1].z[0] ^= vec::mask16(u32(color), bits0);
    m_attack_table[1].z[1] ^= vec::mask16(u32(color >> 32), bits1);
}
And that's how to incrementally do slider updates in vectors. End of section!
Last but not least, we need to add the attacks of a piece when it materializes on a new square.
There are several ways of doing this, and I take completely different approaches in Rose and Clockwork.
In Rose, I do this without a lookup table.
In Clockwork, I use a look-up table.
I will show you how this is done in Clockwork (translated into AVX-512), because it is a lot more intuitive and requires less explanation.
void Position::addAttacks(bool color, PieceId id, Square sq, PieceType ptype) {
    auto [ray_coords, ray_valid] = geometry::superpieceRays(sq);
    v512 ray_places = vec::permute8(ray_coords, m_board.to_vec());
    u64 blockers = ray_places.nonzero8();
    u64 raymask = geometry::superpieceAttacks(blockers, ray_valid);
    v512 inv_perm = geometry::superpieceInverseRays(sq);
    u64 mask = vec::bitpermute8(inv_perm, raymask);
    u64 moves = geometry::pieceMoves(color, ptype, sq) & mask;
    v512 bit = v512::broadcast16(1 << id);
    m_attack_table[color].z[0] |= vec::mask16(u32(moves), bit);
    m_attack_table[color].z[1] |= vec::mask16(u32(moves >> 32), bit);
}
vec::bitpermute8 is the vpshufbitqmb instruction, which permutes the bits of a 64-bit value using a 64-element permute vector.
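Here is a scalar model of that operation (my sketch of its semantics, not the intrinsic's real signature): output bit i takes the value of input bit perm[i], where only the low 6 bits of each permute byte are used.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using u8 = std::uint8_t;
using u64 = std::uint64_t;

// Scalar model of vpshufbitqmb: out bit i = in bit (perm[i] & 63).
u64 bitpermute8(const std::array<u8, 64>& perm, u64 bits) {
    u64 out = 0;
    for (int i = 0; i < 64; i++)
        if ((bits >> (perm[i] & 63)) & 1)
            out |= u64(1) << i;
    return out;
}
```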
The lookup table code should have a familiar structure (an 8x8 window into a bigger table):
u64 pieceMoves(bool color, PieceType ptype, Square sq) {
    assert(ptype != PieceType::None);
    i32 index = ptype == PieceType::Pawn ? color : static_cast<i32>(ptype);
    v512 bit = v512::broadcast8(1 << index);
    v512 table = pieceMovesTable(sq);
    return v512::test8(table, bit);
}
v512 pieceMovesTable(Square sq) {
    constexpr u8 K = 1 << static_cast<i32>(PieceType::King);
    constexpr u8 Q = 1 << static_cast<i32>(PieceType::Queen);
    constexpr u8 B = 1 << static_cast<i32>(PieceType::Bishop);
    constexpr u8 R = 1 << static_cast<i32>(PieceType::Rook);
    constexpr u8 HORS = 1 << static_cast<i32>(PieceType::Knight);
    constexpr u8 ORTH = Q | R;
    constexpr u8 DIAG = Q | B;
    constexpr u8 OADJ = ORTH | K;
    constexpr u8 WPDG = DIAG | K | 0b01;
    constexpr u8 BPDG = DIAG | K | 0b10;
    constexpr std::array<u8, 512> table{{
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, DIAG, 0, 0, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, 0, 0, DIAG,
        0, 0, DIAG, 0, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, 0, DIAG, 0,
        0, 0, 0, DIAG, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, DIAG, 0, 0,
        0, 0, 0, 0, DIAG, 0, 0, 0, ORTH, 0, 0, 0, DIAG, 0, 0, 0,
        0, 0, 0, 0, 0, DIAG, 0, 0, ORTH, 0, 0, DIAG, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, DIAG, HORS, ORTH, HORS, DIAG, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, HORS, BPDG, OADJ, BPDG, HORS, 0, 0, 0, 0, 0,
        0, ORTH, ORTH, ORTH, ORTH, ORTH, ORTH, OADJ, 0, OADJ, ORTH, ORTH, ORTH, ORTH, ORTH, ORTH,
        0, 0, 0, 0, 0, 0, HORS, WPDG, OADJ, WPDG, HORS, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, DIAG, HORS, ORTH, HORS, DIAG, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, DIAG, 0, 0, ORTH, 0, 0, DIAG, 0, 0, 0, 0,
        0, 0, 0, 0, DIAG, 0, 0, 0, ORTH, 0, 0, 0, DIAG, 0, 0, 0,
        0, 0, 0, DIAG, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, DIAG, 0, 0,
        0, 0, DIAG, 0, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, 0, DIAG, 0,
        0, DIAG, 0, 0, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, 0, 0, DIAG,
    }};
    // 0xYZ: Y = row in table above, Z = column in table above.
    // The below offsets are for square a1.
    // This defines a window into the above table.
    constexpr v512 offsets = v512::fromArray8({
        0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F,
        0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F,
        0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
        0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
        0xC8, 0xC9, 0xCA, 0xCB, 0xCC, 0xCD, 0xCE, 0xCF,
        0xD8, 0xD9, 0xDA, 0xDB, 0xDC, 0xDD, 0xDE, 0xDF,
        0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
        0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF,
    });
    constexpr v512 table0 = v512::load(table.data());
    constexpr v512 table1 = v512::load(table.data() + 64);
    constexpr v512 table2 = v512::load(table.data() + 128);
    constexpr v512 table3 = v512::load(table.data() + 192);
    const v512 window = vec::sub8(offsets, v512::broadcast8(internal::expandSq(sq)));
    const v512 result0 = vec::permute8(window, table0, table1);
    const v512 result1 = vec::permute8(window, table2, table3);
    return vec::blend8(window.msb8(), result0, result1);
}
Now that we have all three sub-operations we need, we can actually implement piece moves:
void Position::addPiece(bool color, Place p, Square to) {
    m_board[to] = p;
    incrementalSliderUpdate(to);
    addAttacks(color, p.id(), to, p.ptype());
}

void Position::removePiece(bool color, PieceId id, Square from) {
    removeAttacks(color, id);
    incrementalSliderUpdate(from);
    m_board[from] = Place::empty();
}

// For implementing captures
void Position::mutatePiece(bool old_color, PieceId old_id, Square sq, bool new_color, Place p) {
    m_board[sq] = p;
    removeAttacks(old_color, old_id);
    addAttacks(new_color, p.id(), sq, p.ptype());
}
Position Position::move(Move m) {
    Position new_pos = *this;
    Square from = m.from();
    Square to = m.to();
    Place src = m_board[from];
    Place dst = m_board[to];
    bool color = static_cast<bool>(m_stm);
    switch (m.flags()) {
    case MoveFlags::Normal:
        new_pos.removePiece(color, src.id(), from);
        new_pos.addPiece(color, src, to);
        new_pos.m_piece_list_sq[color][src.id()] = to;
        // ... update 50mr, enpassant ...
        break;
    // ... implement other move types ...
    }
    // ... update castling rights, side-to-move ...
    return new_pos;
}
The remainder is left as an exercise for the reader. For a complete implementation of the above (albeit for AVX-2), see this Clockwork PR.
Phew! We finally have incremental attack tables, with pseudo-legal movegen. Your movegen speed should be acceptable at this point.