This is part three of a multipart series in which we demonstrate how to incrementally update attack tables using vector operations.
We can already use the function from part 1 to calculate entire attack tables. While slow, it is still useful for generating the initial attack tables at FEN parsing time, where speed doesn't matter. Everywhere else, its slowness is a problem.
The performance issue comes from needing to iterate over the entire board, which takes a total of 64 vector permutations. This performs unnecessary work, as the majority of squares do not need attack table updates. We should minimize the number of vector permutations we have to do, and to do that we need to update the attack tables incrementally.
Think about what happens during a move like e2e4:
Now think about what needs to happen to the attack tables:
Now think about what happens when a knight on g1 captures a queen on f3:
Other move types are left as an exercise for the reader.
By doing this exercise you will see that there are only three sub-operations you need to implement, and with only these three sub-operations you can implement any move type: removing a piece's attacks from the table, incrementally updating the rays of sliders that see a square whose occupancy changed, and adding a piece's attacks to the table.
Vector operations make this really easy to implement. We mask the entire attack table to remove the piece from it. This brute force method is much faster than figuring out what squares are actually being attacked. Implementation below:
void Position::removeAttacks(bool color, u8 id) {
    const v512 mask = v512::broadcast16(~u16(1 << (id & 0xF)));
    m_attack_table[color].z[0] &= mask;
    m_attack_table[color].z[1] &= mask;
}
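As a mental model for what this masked AND achieves, here is a scalar sketch with hypothetical types (a plain array of 16-bit masks standing in for Rose's v512-based table): each square stores which piece ids attack it, and removal clears one id's bit on every square at once.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using u16 = std::uint16_t;

// Scalar model of one side's attack table: for each of the 64 squares,
// a 16-bit mask saying which piece ids (0..15) attack that square.
using AttackTable = std::array<u16, 64>;

// Remove every attack by piece `id`: AND the complement of its bit
// into all squares. The vector version does all 64 squares in two ops.
inline void removeAttacks(AttackTable& table, unsigned id) {
    const u16 mask = static_cast<u16>(~(1u << (id & 0xF)));
    for (u16& sq : table)
        sq &= mask;
}
```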
Whenever a piece appears on or vanishes from a square, the sliders that see that square need their rays updated.
In the above example, when e4 vanishes, the bishop is now able to see the white pawn on c6.
Carefully think about what is happening from the point of view of the vanishing pawn on e4:
In part 1, I said that “the order of the rays in the ray vector is significant, and we will explain why later.” Rotating the ray information by 180 degrees is why the rays are in compass direction order. We implement this rotation by swapping the upper and lower halves of the ray vector.
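To make the rotation concrete, here is a scalar sketch (the Dir enum and helper names are my own illustration, not Rose identifiers): with rays stored in compass order, the opposite of any direction is 4 steps away, so a 180-degree rotation of the 64-byte ray vector is a swap of its two 32-byte halves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Rays in compass order: N, NE, E, SE, S, SW, W, NW.
enum Dir { N, NE, E, SE, S, SW, W, NW };

// The opposite compass direction is always 4 positions away.
constexpr Dir opposite(Dir d) {
    return static_cast<Dir>((static_cast<int>(d) + 4) % 8);
}

// On the 64-byte ray vector (8 rays x 8 bytes), the same rotation is a
// swap of the lower and upper 32-byte halves.
constexpr std::array<unsigned char, 64> rotate180(const std::array<unsigned char, 64>& v) {
    std::array<unsigned char, 64> out{};
    for (std::size_t i = 0; i < 64; i++)
        out[i] = v[(i + 32) % 64];
    return out;
}
```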
Now that the basic theory is understood, here is how Rose implements this. First, we need a new Place format, because we now have to squeeze color, piece type, and id information into 8 bits: piece id information must live in the mailbox, and the old format did not have enough space for it.
// Place format:
// u1 color
// u3 piece type
// u4 id (0 to 15)
struct Place {
    u8 raw = 0;
    constexpr bool isEmpty() const { return raw == 0; }
    constexpr Color color() const { return (raw & 0x80) != 0; }
    constexpr PieceType ptype() const { return (raw >> 4) & 0x7; }
    constexpr u8 id() const { return raw & 0xF; }
};
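As a quick sanity check of the bit layout, here is a hypothetical makePlace packing helper (not part of Rose) together with the same bit extractions the accessors use:

```cpp
#include <cassert>
#include <cstdint>

using u8 = std::uint8_t;

// Pack a place: color in bit 7, piece type in bits 4-6, id in bits 0-3.
constexpr u8 makePlace(bool color, u8 ptype, u8 id) {
    return static_cast<u8>((color ? 0x80 : 0x00) | ((ptype & 0x7) << 4) | (id & 0xF));
}
```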
enum class PieceType {
    none   = 0b000,
    king   = 0b001,
    pawn   = 0b010,
    knight = 0b011,
    bishop = 0b101,
    rook   = 0b110,
    queen  = 0b111,
};
// notice that slider pieces always have the 0b100 bit set
// for the sliders:
// diagonal sliders: have 0b001 bit
// orthogonal sliders: have 0b010 bit
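As a scalar sketch of why this encoding is convenient (the helper names are mine, not Rose's), each slider test is a single AND:

```cpp
#include <cassert>
#include <cstdint>

using u8 = std::uint8_t;

// Piece type encoding from the enum above.
constexpr u8 knight = 0b011, bishop = 0b101, rook = 0b110, queen = 0b111;

constexpr bool isSlider(u8 ptype)   { return (ptype & 0b100) != 0; }
constexpr bool slidesDiag(u8 ptype) { return (ptype & 0b100) && (ptype & 0b001); }
constexpr bool slidesOrth(u8 ptype) { return (ptype & 0b100) && (ptype & 0b010); }
```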
Detecting sliders that can attack us is easy with the above piece type format.
This is more-or-less the same code in the “how to calculate attacks to one square?” section of part 1, but simplified to only consider sliders, so you should re-read that for a refresher.
// Constants
constexpr u8 place_slider_bit = 0b100 << 4;
constexpr u8 diag = 0b001 << 4;
constexpr u8 orth = 0b010 << 4;

constexpr v512 slider_mask = v512{std::array<u8, 64>{
    0, orth, orth, orth, orth, orth, orth, orth, // N
    0, diag, diag, diag, diag, diag, diag, diag, // NE
    0, orth, orth, orth, orth, orth, orth, orth, // E
    0, diag, diag, diag, diag, diag, diag, diag, // SE
    0, orth, orth, orth, orth, orth, orth, orth, // S
    0, diag, diag, diag, diag, diag, diag, diag, // SW
    0, orth, orth, orth, orth, orth, orth, orth, // W
    0, diag, diag, diag, diag, diag, diag, diag, // NW
}};
// How we actually do it:
// First, permute the board into our ray vector format
const auto [ray_coords, ray_valid] = geometry::superpieceRays(sq);
const v512 ray_places = vec::permute8(ray_coords, m_board.z);
// Which pieces can potentially slide down their rays?
const u64 sliders = vec::test8(ray_places, v512::broadcast8(place_slider_bit))
& vec::test8(ray_places, slider_mask);
// Which pieces are visible to us?
// (Same as part 1, but we mask out horses.)
const u64 blockers = ray_places.nonzero8();
const u64 raymask = geometry::superpieceAttacks(blockers, ray_valid) & 0xFEFEFEFEFEFEFEFE;
// Which pieces are sliders that are visible to us?
const u64 visible_sliders = raymask & sliders;
I hope this is self-explanatory.
Let us start with a worked example:
This would correspond to the code below:
// Slider information was calculated in the previous section
const u64 visible_sliders = raymask & sliders;
// Places contain color and id information, and that's the information we need.
// Retrieve this information using the visible_sliders bitboard.
const v512 visible_sliders_places = vec::mask8(visible_sliders, ray_places);
// Broadcasts slider id to fill its ray (its 8 byte group)
const v512 broadcasted_sliders = vec::lanebroadcast8to64(visible_sliders_places);
// Rotate ray vector
const v512 rotated = vec::shuffle128<0b01001011>(broadcasted_sliders);
// Mask to actual ray length
const v512 masked_rotated = vec::mask8(raymask, rotated);
We need to broadcast information from one byte to an eight-byte lane within a vector. There are a few ways to do this, but the way with the fewest instructions is:
v512 lanebroadcast8to64(v512 x) {
    const v512 y = vec::gf2p8affine8(v512::broadcast64(0x0102040810204080), x, 0);
    return vec::gf2p8affine8(v512::broadcast64(0xFFFFFFFFFFFFFFFF), y, 0);
}
This is yet another neat bithack with the vgf2p8affineqb instruction. In summary, the above code performs a horizontal xor across each group of 8 bytes and broadcasts the result back to those 8 bytes, so it is equivalent to:
v512 lanebroadcast8to64(v512 x) {
    for (int i = 0; i < 8; i++) {
        u8 value = 0;
        for (int j = 0; j < 8; j++)
            value ^= x.bytes[8 * i + j];
        for (int j = 0; j < 8; j++)
            x.bytes[8 * i + j] = value;
    }
    return x;
}
This works because there is only ever zero or one slider in each eight-byte ray group, so value will only pick up the value of that one element.
There are alternative implementations; finding them is left as an exercise for the reader.
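For a testable version of that scalar loop, here is a self-contained sketch over a plain 64-byte array (std::array standing in for v512): with at most one nonzero byte per group, the horizontal xor reduces to that byte.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using u8 = std::uint8_t;

// Horizontal xor across each 8-byte group, broadcast back to the group.
std::array<u8, 64> lanebroadcast8to64(std::array<u8, 64> x) {
    for (int i = 0; i < 8; i++) {
        u8 value = 0;
        for (int j = 0; j < 8; j++)
            value ^= x[8 * i + j];
        for (int j = 0; j < 8; j++)
            x[8 * i + j] = value;
    }
    return x;
}
```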
The Place format was designed with this in mind: we need to ensure the msb of the slider information is zero when doing vpsadbw, to avoid negative values.

Unlike GPUs, CPUs do not have a backwards permute instruction, which would make this very straightforward. I attempted to come up with alternative ways of computing this, but the easiest way is a lookup table. If you come up with a nice way of doing this, please let me know.
As we need to use a forwards permute, we need to calculate the appropriate permutation to use.
Notice in the code below that table is a diagram of the rays from a superpiece, with values corresponding to indices into the ray format. We just want to look at an 8x8 window of this larger table.
v512 superpieceInverseRays(Square sq) {
    constexpr u8 none = 0xFF;
    constexpr std::array<u8, 512> table{{
        none, none, none, none, none, none, none, none, none, none, none, none, none, none, none, none,
        none, 0x2F, none, none, none, none, none, none, 0x27, none, none, none, none, none, none, 0x1F,
        none, none, 0x2E, none, none, none, none, none, 0x26, none, none, none, none, none, 0x1E, none,
        none, none, none, 0x2D, none, none, none, none, 0x25, none, none, none, none, 0x1D, none, none,
        none, none, none, none, 0x2C, none, none, none, 0x24, none, none, none, 0x1C, none, none, none,
        none, none, none, none, none, 0x2B, none, none, 0x23, none, none, 0x1B, none, none, none, none,
        none, none, none, none, none, none, 0x2A, 0x28, 0x22, 0x20, 0x1A, none, none, none, none, none,
        none, none, none, none, none, none, 0x30, 0x29, 0x21, 0x19, 0x18, none, none, none, none, none,
        none, 0x37, 0x36, 0x35, 0x34, 0x33, 0x32, 0x31, none, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
        none, none, none, none, none, none, 0x38, 0x39, 0x01, 0x09, 0x10, none, none, none, none, none,
        none, none, none, none, none, none, 0x3A, 0x00, 0x02, 0x08, 0x0A, none, none, none, none, none,
        none, none, none, none, none, 0x3B, none, none, 0x03, none, none, 0x0B, none, none, none, none,
        none, none, none, none, 0x3C, none, none, none, 0x04, none, none, none, 0x0C, none, none, none,
        none, none, none, 0x3D, none, none, none, none, 0x05, none, none, none, none, 0x0D, none, none,
        none, none, 0x3E, none, none, none, none, none, 0x06, none, none, none, none, none, 0x0E, none,
        none, 0x3F, none, none, none, none, none, none, 0x07, none, none, none, none, none, none, 0x0F,
    }};
    // 0xYZ: Y = row in table above, Z = column in table above.
    // The below offsets are for square a1.
    // This defines a window into the above table.
    constexpr v512 offsets = v512::fromArray8({
        0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F,
        0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F,
        0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
        0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
        0xC8, 0xC9, 0xCA, 0xCB, 0xCC, 0xCD, 0xCE, 0xCF,
        0xD8, 0xD9, 0xDA, 0xDB, 0xDC, 0xDD, 0xDE, 0xDF,
        0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
        0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF,
    });
    constexpr v512 table0 = v512::load(table.data());
    constexpr v512 table1 = v512::load(table.data() + 64);
    constexpr v512 table2 = v512::load(table.data() + 128);
    constexpr v512 table3 = v512::load(table.data() + 192);
    const v512 window = vec::sub8(offsets, v512::broadcast8(internal::expandSq(sq)));
    const v512 result0 = vec::permute8(window, table0, table1);
    const v512 result1 = vec::permute8(window, table2, table3);
    return vec::blend8(window.msb8(), result0, result1);
}
The above code effectively looks into a window into table whose corner is defined by sq. The blend is needed because the maximum index of an AVX-512 permute is 7 bits while we effectively need an 8-bit permute, so we emulate it.
This can be optimized by precomputing the resulting v512
for all 64 possible input squares.
This would be a lookup table that is 4 KiB in size.
Now that we have all the slider information back in board format, we can update the attack tables. We first split the information into slider id and slider color. We then convert each slider id into a bitmask (mask = 1 << id), then xor it into the attack tables.
const v512 inv_perm = geometry::superpieceInverseRays(sq);
// Inverse permutation from ray layout back into board layout
const v512 slider_info_in_board_layout = vec::permute8_mz(~inv_perm.msb8(), inv_perm, masked_rotated);
const u64 color = slider_info_in_board_layout.msb8();
const v512 ids = slider_info_in_board_layout & v512::broadcast8(0xF);
// We use zero as an invalid id.
// Elsewhere (during fen parsing), we ensure that zero is always the id of the king.
// As the king is not a slider, this is fine.
const u64 valid_ids = slider_info_in_board_layout.nonzero8();
const v512 ones = v512::broadcast16(1);
const v512 bits0 = vec::shl16_mz(u32(valid_ids), ones, vec::zext8to16(ids.to256()));
const v512 bits1 = vec::shl16_mz(u32(valid_ids >> 32), ones, vec::zext8to16(vec::extract256<1>(ids)));
m_attack_table[0].z[0] ^= vec::mask16(~u32(color), bits0);
m_attack_table[0].z[1] ^= vec::mask16(~u32(color >> 32), bits1);
m_attack_table[1].z[0] ^= vec::mask16(u32(color), bits0);
m_attack_table[1].z[1] ^= vec::mask16(u32(color >> 32), bits1);
Using xor means we can use this exact same function for both extending and retracting rays, because it's a bit flip.
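This self-inverse property is easy to check in a scalar model (hypothetical types, not Rose's): toggling an id's bit along the same squares twice restores the table, which is why one function serves both ray extension and retraction.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using u16 = std::uint16_t;

// Toggle slider `id`'s attack bit on every square of a ray.
// Because xor is a bit flip, the same call both extends and retracts.
void xorRay(std::array<u16, 64>& attack_table, unsigned id, const std::vector<int>& squares) {
    for (int sq : squares)
        attack_table[sq] ^= static_cast<u16>(1u << id);
}
```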
auto Position::incrementalSliderUpdate(Square sq) -> void {
    const auto [ray_coords, ray_valid] = geometry::superpieceRays(sq);
    const v512 ray_places = vec::permute8(ray_coords, m_board.z);
    const u64 blockers = ray_places.nonzero8();
    const u64 raymask = geometry::superpieceAttacks(blockers, ray_valid)
                      & geometry::non_horse_attack_mask;
    const u64 sliders = (ray_places & v512::broadcast8(Place::slider_bit)).nonzero8()
                      & (ray_places & geometry::sliderMask).nonzero8();
    // We only care about sliders that are visible to us that can attack us.
    const u64 visible_sliders = raymask & sliders;
    const v512 visible_sliders_places = vec::mask8(visible_sliders, ray_places);
    // Broadcast slider id to fill its ray (its 8-byte group)
    const v512 broadcasted_sliders = vec::lanebroadcast8to64(visible_sliders_places);
    // Rotate ray vector
    const v512 rotated = vec::shuffle128<0b01001011>(broadcasted_sliders);
    // Mask to actual ray length
    const v512 masked_rotated = vec::mask8(raymask, rotated);
    const v512 inv_perm = geometry::superpieceInverseRays(sq);
    // Inverse permutation from ray layout back into board layout
    const v512 slider_info_in_board_layout = vec::permute8_mz(~inv_perm.msb8(), inv_perm, masked_rotated);
    const v512 ids = slider_info_in_board_layout & v512::broadcast8(0xF);
    const u64 color = slider_info_in_board_layout.msb8();
    // We use zero as an invalid id because it is guaranteed to be a king, which is never a slider.
    const u64 valid_ids = slider_info_in_board_layout.nonzero8();
    const v512 ones = v512::broadcast16(1);
    const v512 bits0 = vec::shl16_mz(u32(valid_ids), ones, vec::zext8to16(ids.to256()));
    const v512 bits1 = vec::shl16_mz(u32(valid_ids >> 32), ones, vec::zext8to16(vec::extract256<1>(ids)));
    m_attack_table[0].z[0] ^= vec::mask16(~u32(color), bits0);
    m_attack_table[0].z[1] ^= vec::mask16(~u32(color >> 32), bits1);
    m_attack_table[1].z[0] ^= vec::mask16(u32(color), bits0);
    m_attack_table[1].z[1] ^= vec::mask16(u32(color >> 32), bits1);
}
And that's how to incrementally do slider updates in vectors. End of section!
Last but not least, we need to add the attacks of a piece when it materializes on a new square.
There are several ways of doing this, and I take completely different approaches in Rose and Clockwork.
In Rose, I do this without a lookup table.
In Clockwork, I use a look-up table.
I will show you how this is done in Clockwork (translated into AVX-512), because it is a lot more intuitive and requires less explanation.
void Position::addAttacks(bool color, PieceId id, Square sq, PieceType ptype) {
    auto [ray_coords, ray_valid] = geometry::superpieceRays(sq);
    v512 ray_places = vec::permute8(ray_coords, m_board.to_vec());
    u64 blockers = ray_places.nonzero8();
    u64 raymask = geometry::superpieceAttacks(blockers, ray_valid);
    v512 inv_perm = geometry::superpieceInverseRays(sq);
    u64 mask = vec::bitpermute8(inv_perm, raymask);
    u64 moves = geometry::pieceMoves(color, ptype, sq) & mask;
    v512 bit = v512::broadcast16(1 << id);
    m_attack_table[color].z[0] |= vec::mask16(u32(moves), bit);
    m_attack_table[color].z[1] |= vec::mask16(u32(moves >> 32), bit);
}
vec::bitpermute8 is the vpshufbitqmb instruction, which permutes the bits of a 64-bit value using a 64-element permute vector.
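Here is a scalar model of that operation (my sketch of its semantics, not the intrinsic's real signature): output bit i takes the value of input bit perm[i], where only the low 6 bits of each permute byte are used.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using u8 = std::uint8_t;
using u64 = std::uint64_t;

// Scalar model of vpshufbitqmb: out bit i = in bit (perm[i] & 63).
u64 bitpermute8(const std::array<u8, 64>& perm, u64 bits) {
    u64 out = 0;
    for (int i = 0; i < 64; i++)
        if ((bits >> (perm[i] & 63)) & 1)
            out |= u64(1) << i;
    return out;
}
```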
The lookup table code should have a familiar structure (an 8x8 window into a bigger table):
u64 pieceMoves(bool color, PieceType ptype, Square sq) {
    assert(ptype != PieceType::None);
    i32 index = ptype == PieceType::Pawn ? color : static_cast<i32>(ptype);
    v512 bit = v512::broadcast8(1 << index);
    v512 table = pieceMovesTable(sq);
    return v512::test8(table, bit);
}
v512 pieceMovesTable(Square sq) {
    constexpr u8 K = 1 << static_cast<i32>(PieceType::King);
    constexpr u8 Q = 1 << static_cast<i32>(PieceType::Queen);
    constexpr u8 B = 1 << static_cast<i32>(PieceType::Bishop);
    constexpr u8 R = 1 << static_cast<i32>(PieceType::Rook);
    constexpr u8 HORS = 1 << static_cast<i32>(PieceType::Knight);
    constexpr u8 ORTH = Q | R;
    constexpr u8 DIAG = Q | B;
    constexpr u8 OADJ = ORTH | K;
    constexpr u8 WPDG = DIAG | K | 0b01;
    constexpr u8 BPDG = DIAG | K | 0b10;
    constexpr std::array<u8, 512> table{{
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, DIAG, 0, 0, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, 0, 0, DIAG,
        0, 0, DIAG, 0, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, 0, DIAG, 0,
        0, 0, 0, DIAG, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, DIAG, 0, 0,
        0, 0, 0, 0, DIAG, 0, 0, 0, ORTH, 0, 0, 0, DIAG, 0, 0, 0,
        0, 0, 0, 0, 0, DIAG, 0, 0, ORTH, 0, 0, DIAG, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, DIAG, HORS, ORTH, HORS, DIAG, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, HORS, BPDG, OADJ, BPDG, HORS, 0, 0, 0, 0, 0,
        0, ORTH, ORTH, ORTH, ORTH, ORTH, ORTH, OADJ, 0, OADJ, ORTH, ORTH, ORTH, ORTH, ORTH, ORTH,
        0, 0, 0, 0, 0, 0, HORS, WPDG, OADJ, WPDG, HORS, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, DIAG, HORS, ORTH, HORS, DIAG, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, DIAG, 0, 0, ORTH, 0, 0, DIAG, 0, 0, 0, 0,
        0, 0, 0, 0, DIAG, 0, 0, 0, ORTH, 0, 0, 0, DIAG, 0, 0, 0,
        0, 0, 0, DIAG, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, DIAG, 0, 0,
        0, 0, DIAG, 0, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, 0, DIAG, 0,
        0, DIAG, 0, 0, 0, 0, 0, 0, ORTH, 0, 0, 0, 0, 0, 0, DIAG,
    }};
    // 0xYZ: Y = row in table above, Z = column in table above.
    // The below offsets are for square a1.
    // This defines a window into the above table.
    constexpr v512 offsets = v512::fromArray8({
        0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F,
        0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F,
        0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
        0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
        0xC8, 0xC9, 0xCA, 0xCB, 0xCC, 0xCD, 0xCE, 0xCF,
        0xD8, 0xD9, 0xDA, 0xDB, 0xDC, 0xDD, 0xDE, 0xDF,
        0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
        0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF,
    });
    constexpr v512 table0 = v512::load(table.data());
    constexpr v512 table1 = v512::load(table.data() + 64);
    constexpr v512 table2 = v512::load(table.data() + 128);
    constexpr v512 table3 = v512::load(table.data() + 192);
    const v512 window = vec::sub8(offsets, v512::broadcast8(internal::expandSq(sq)));
    const v512 result0 = vec::permute8(window, table0, table1);
    const v512 result1 = vec::permute8(window, table2, table3);
    return vec::blend8(window.msb8(), result0, result1);
}
Now that we have all three sub-operations we need, we can actually implement piece moves:
void Position::addPiece(bool color, Place p, Square to) {
    m_board[to] = p;
    incrementalSliderUpdate(to);
    addAttacks(color, p.id(), to, p.ptype());
}

void Position::removePiece(bool color, PieceId id, Square from) {
    removeAttacks(color, id);
    incrementalSliderUpdate(from);
    m_board[from] = Place::empty();
}

// For implementing captures
void Position::mutatePiece(bool old_color, PieceId old_id, Square sq, bool new_color, Place p) {
    m_board[sq] = p;
    removeAttacks(old_color, old_id);
    addAttacks(new_color, p.id(), sq, p.ptype());
}
Position Position::move(Move m) {
    Position new_pos = *this;
    Square from = m.from();
    Square to = m.to();
    Place src = m_board[from];
    Place dst = m_board[to];
    bool color = static_cast<bool>(m_stm);
    switch (m.flags()) {
    case MoveFlags::Normal:
        new_pos.removePiece(color, src.id(), from);
        new_pos.addPiece(color, src, to);
        new_pos.m_piece_list_sq[color][src.id()] = to;
        // ... update 50mr, enpassant ...
        break;
    // ... implement other move types ...
    }
    // ... update castling rights, side-to-move ...
    return new_pos;
}
The remainder is left as an exercise for the reader. For a complete implementation of the above (albeit for AVX-2), see this Clockwork PR.
Phew! We finally have incremental attack tables, with pseudo-legal movegen. Your movegen speed should be acceptable at this point.