chess
Incrementally updated attack tables. There is currently only one actively developed chess engine that uses incrementally updated attack tables for move generation. It is also by definition the world's best attack table chess engine.
In this multipart blogpost we will demonstrate how to make incrementally updated attack tables faster.
The term “byteboard” was invented by the author of the shogi engine takotto (たこっと) in 2016. This term seems to have been used in the shogi community for techniques that manipulate mailboxes with vector instructions, though the method seems to have fallen out of fashion. Reading this article was the spark for developing a new move generation technique for chess.
This spark was kindled by two coincidences:
Since a vector can store all possible attack destinations from a square, you can compute all superpiece attacks to and/or from a chessboard square in one vector permutation.
For reasons that will later become apparent, my ray vector has the following format:
knight [ north ray attacks ]
knight [ north-east ray attacks ]
knight [ east ray attacks ]
knight [ south-east ray attacks ]
knight [ south ray attacks ]
knight [ south-west ray attacks ]
knight [ west ray attacks ]
knight [ north-west ray attacks ]
Notice that a few points about this format:
As an example we will demonstrate how how all superpiece rays from d4 can be computed.
Given we have a board mailbox with squares in the following order:
a8 b8 c8 d8 e8 f8 g8 h8
a7 b7 c7 d7 e7 f7 g7 h7
a6 b6 c6 d6 e6 f6 g6 h6
a5 b5 c5 d5 e5 f5 g5 h5
a4 b4 c4 d4 e4 f4 g4 h4
a3 b3 c3 d3 e3 f3 g3 h3
a2 b2 c2 d2 e2 f2 g2 h2
a1 b1 c1 d1 e1 f1 g1 h1
We can apply the following permutation vector:
2a 23 2b 33 3b 03 0b 13
2c 24 2d 36 3f 00 09 12
25 1c 1d 1e 1f 18 19 1a
15 14 0d 06 3f 30 29 22
0c 13 0b 03 3b 33 2b 23
0a 12 09 00 37 2e 25 1c
11 1a 19 18 17 16 15 14
21 22 29 30 37 3e 05 0c
with vector mask: 0xf0f0f0f0f1f1f1f
To get a ray vector in the following arrangement:
c6 | d5 d6 d7 d8 -- -- --
e6 | e5 f6 g7 h8 -- -- --
f5 | e4 f4 g4 h4 -- -- --
f3 | e3 f2 g1 -- -- -- --
e2 | d3 d2 d1 -- -- -- --
c2 | c3 b2 a1 -- -- -- --
b3 | c4 b4 a4 -- -- -- --
b5 | c5 b6 a7 -- -- -- --
An easy way to do the ray permutations would be with a lookup table. This would be sufficiently small in size (4 kiB).
However, they are cheap enough to compute with some 0x88
math.
In 0x88
math, we expand the bit representation of a square which is normally 00fffrrr
(or file * 8 + rank
), to the form 0fff0rrr
(or file * 16 + rank
).
We do this so that we can easily detect out-of-bounds squares when doing math on coordinates.
This is because if overflow or underflow in the rank or file parts of the square happens the zero bits would be set and this can be detected by masking by 0x88
.
With the above, to calculate the permutation vector, we can add all possible offsets to the square. We must also then detect out-of-bounds squares. We can do this in a handful of vector operations:
std::tuple<v512, u64> superpieceRays(Square sq) {
constexpr v512 offsets = v512{std::array<u8, 64>{
0x1F, 0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, // N
0x21, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, // NE
0x12, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, // E
0xF2, 0xF1, 0xE2, 0xD3, 0xC4, 0xB5, 0xA6, 0x97, // SE
0xE1, 0xF0, 0xE0, 0xD0, 0xC0, 0xB0, 0xA0, 0x90, // S
0xDF, 0xEF, 0xDE, 0xCD, 0xBC, 0xAB, 0x9A, 0x89, // SW
0xEE, 0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA, 0xF9, // W
0x0E, 0x0F, 0x1E, 0x2D, 0x3C, 0x4B, 0x5A, 0x69, // NW
}};
const v512 sq_vec = v512::broadcast8(internal::expandSq(sq));
const v512 uncompressed = vec::add8(sq_vec, offsets);
return internal::compressCoords(uncompressed);
}
Wait up, what is up with internal::compressCoords
?
We can do coordinate re-compression (from 0x88
format into regular format) in parallel in one vector instruction:
std::tuple<v512, u64> compressCoords(v512 list) {
// Test with 0x88
const u64 valid = vec::testn8(list, v512::broadcast8(0x88));
// 0rrr0fff → 00rrrfff
constexpr v512 magic = v512::broadcast64(0x0102041020400000);
const v512 compressed = vec::gf2p8affine8(list, magic, 0);
return {compressed, valid};
}
This abuse of the gf2p8affineqb
instruction deserves its own mini-article to explain the math.
Being able to treat bytes as 8-element GF(2) vectors to do matrix operations on them allows for complex bit manipulations in few instructions.
Before we explore how to calculate attack tables we should discuss what information we are trying to compute.
Modern attack table engines store “square attacked by” information. In other words, given a square, what pieces are attacking this square? Some ways this information can be stored are:
We choose to split the 32 ids into white ids and black ids.
We also choose to split the attack tables into white attackers and black attackers.
This is important because this allows each colored attack table to be stored in two AVX-512 vectors (sixty-four 16-bit elements).
This is important as the vpermi2b
AVX-512 instruction can perform a permutation using two vector registers as the source.
We also often only want to update the attacks for a particular color.
Computing all attacks to one square was the first thing I considered. We abandon this method when we implement incremental updates, but this is an important stepping stone, so we will start with it.
Firstly we should consider what we will be storing in our mailbox.
We will use an example that makes things clearer that we will later abandon for performance.
A mailbox will be made up of 64 Place
s, which are 8-bit integers that have the following 8 flags:
00000001 white pawn
00000010 black pawn
00000100 knight
00001000 bishop
00010000 rook
00011000 queen
00100000 king
10000000 black color (for white this bit will be cleared)
We have one bit that represents each type of movement. Notice that queen = rook + bishop.
Also notice that white pawns and black pawns are different, because they have different movesets.
This makes testing the whole board for a particular type of piece one vptestb
instruction.
The vptestb
instruction is also a straightforward way to extract a piece bitboard from a mailbox!
We won't do anything interesting with this observation. It was just neat to note. This demonstrates that mailboxes are just transposed bitboards.
The first thing we have to do is to handle sliders.
The trick we use is similar to the logic of the o^(o-2r)
technique.
Let us first consider an example position:
3rk3/4q1q1/1b1P4/4p1n1/r2K3q/8/2N1n3/3R2B1 w - - 0 1
We will:
vpcmpneqb
instruction,
o^(o-2r)
-like trick, then
bitand
the rays with the occupied squares to get a mask our potential attackers.
const auto [ray_coords, ray_valid] = geometry::superpieceRays(sq);
const v512 ray_places = vec::permute8(ray_coords, m_board_mailbox);
const u64 occupied = ray_places.nonzero8();
const u64 color = ray_places.msb8();
const u64 rays_to_nearest = geometry::superpieceAttacks(occupied, ray_valid);
const u64 visible = rays_to_nearest & occupied;
where superpieceAttacks
is defined as:
u64 superpieceAttacks(u64 occupied, u64 ray_valid) {
const u64 o = occupied | 0x8181818181818181;
const u64 x = o ^ (o - 0x0303030303030303);
return x & ray_valid;
}
For our example position, the values of occupied
, rays_to_nearest
, and visible
are as follows:
occupied:
-- | -- d6 -- d8 -- -- --
-- | e5 -- g7 -- -- -- --
-- | -- -- -- h4 -- -- --
-- | -- -- g1 -- -- -- --
e2 | -- -- d1 -- -- -- --
c2 | -- -- a1 -- -- -- --
-- | -- -- a4 -- -- -- --
-- | -- b6 -- -- -- -- --
rays_to_nearest:
c6 | d5 d6 -- -- -- -- --
e6 | e5 -- -- -- -- -- --
f5 | e4 f4 g4 h4 -- -- --
f3 | e3 f2 g1 -- -- -- --
e2 | d3 d2 d1 -- -- -- --
c2 | c3 b2 a1 -- -- -- --
b3 | c4 b4 a4 -- -- -- --
b5 | c5 b6 -- -- -- -- --
visible:
-- | -- d6 -- -- -- -- --
-- | e5 -- -- -- -- -- --
-- | -- -- -- h4 -- -- --
-- | -- -- g1 -- -- -- --
e2 | -- -- d1 -- -- -- --
c2 | -- -- a1 -- -- -- --
-- | -- -- a4 -- -- -- --
-- | -- b6 -- -- -- -- --
Now we will need to determine if pieces that are visible to us are actually attacking us. Given we have set up our piece types to have a single bit for their movement types, this is a vector bitand with a mask. The mask is constructed like so:
constexpr u8 horse = 0b000100; // knight
constexpr u8 orth = 0b010000; // rook and queen
constexpr u8 diag = 0b001000; // bishop and queen
constexpr u8 orth_near = 0b110000; // king, rook and queen
constexpr u8 wpawn_near = 0b101001; // wp, king, bishop, queen
constexpr u8 bpawn_near = 0b101010; // bp, king, bishop, queen
constexpr v512 attacker_mask = v512{std::array<u8, 64>{
horse, orth_near, orth, orth, orth, orth, orth, orth, // N
horse, wpawn_near, diag, diag, diag, diag, diag, diag, // NE
horse, orth_near, orth, orth, orth, orth, orth, orth, // E
horse, bpawn_near, diag, diag, diag, diag, diag, diag, // SE
horse, orth_near, orth, orth, orth, orth, orth, orth, // S
horse, bpawn_near, diag, diag, diag, diag, diag, diag, // SW
horse, orth_near, orth, orth, orth, orth, orth, orth, // W
horse, wpawn_near, diag, diag, diag, diag, diag, diag, // NW
}};
The remainder of the function is self-explanatory (where vec::compress8
is the vpcompressb
instruction) ...
const u64 attackers = (ray_places & attacker_mask).nonzero8();
const u64 white_attackers = ~color & visible & attackers;
const u64 black_attackers = color & visible & attackers;
const int white_attackers_count = std::popcount(white_attackers);
const int black_attackers_count = std::popcount(black_attackers);
const v128 white_attackers_coord = vec::compress8(white_attackers, ray_coords).to128();
const v128 black_attackers_coord = vec::compress8(black_attackers, ray_coords).to128();
return {
vec::findset8(white_attackers_coord, white_attackers_count, m_piece_list_sq[0].x),
vec::findset8(black_attackers_coord, black_attackers_count, m_piece_list_sq[1].x),
};
... except for vec::findset8
.
Here is its definition in Rose:
forceinline u16 findset8(v128 haystack, int haystack_len, v128 needles) {
return _mm_extract_epi16(_mm_cmpestrm(haystack.raw, haystack_len, needles.raw, 16, 0), 0);
}
This is the pcmpestrm
instruction. You can use it to find which elements of a set are present in a variable-length array (up to 16 bytes long). Neat, huh?
This concludes part one of this two part series on accelerating incremental attack tables with vector math. In this first part, we:
gf2p8affineqb
trick.