87flowers ~ $

accelerating incremental attack tables with vector math: part 1

21 Apr 2025 chess

byteboards
all the attacks, all at once
computing permutations
1. fast coordinate re-compression
what's in the tables
how to calculate attacks to one square?
1. vec::findset8
next
navigation
thanks to

Incrementally updated attack tables. There is currently only one actively developed chess engine that uses incrementally updated attack tables for move generation. It is also by definition the world's best attack table chess engine.

In this multipart blogpost we will demonstrate how to make incrementally updated attack tables faster.

part 1: non-incrementally generating attack tables (you are here!)
part 2: move generation
part 3: incrementally generating attack tables
part 4: pin detection and fully-legal movegen (TBD)

byteboards

The term “byteboard” was invented by the author of the shogi engine takotto (たこっと) in 2016. This term seems to have been used in the shogi community for techniques that manipulate mailboxes with vector instructions, though the method seems to have fallen out of fashion. Reading this article was the spark for developing a new move generation technique for chess.

This spark was kindled by two coincidences:

An AVX-512 vector register is 64 bytes wide, which fits a chessboard perfectly.
All attacks from a square can also fit into one AVX-512 vector register.
- A ray has maximum length of 7.
- You have 8 rays, one for each ordinal direction.
- You also have 8 knight attacks.
- 7 × 8 + 8 = 64 squares.

all the attacks, all at once

Since a vector can store all possible attack destinations from a square, you can compute all superpiece attacks to and/or from a chessboard square in one vector permutation.

For reasons that will later become apparent, my ray vector has the following format:

knight [      north ray attacks      ]
knight [   north-east ray attacks    ]
knight [       east ray attacks      ]
knight [   south-east ray attacks    ]
knight [      south ray attacks      ]
knight [   south-west ray attacks    ]
knight [       west ray attacks      ]
knight [   north-west ray attacks    ]

Notice that a few points about this format:

Each ray is fully within an 8 byte lane. This is important for later ray manipulation.
The order of the rays in the vector is also significant, and we will explain why later.
The order and position of the knight attacks do not significantly matter; I could have placed them at either end of each 8 byte lane.

As an example we will demonstrate how how all superpiece rays from d4 can be computed.

All possible superpiece attacks from d4.

Given we have a board mailbox with squares in the following order:

a8 b8 c8 d8 e8 f8 g8 h8
a7 b7 c7 d7 e7 f7 g7 h7
a6 b6 c6 d6 e6 f6 g6 h6
a5 b5 c5 d5 e5 f5 g5 h5
a4 b4 c4 d4 e4 f4 g4 h4
a3 b3 c3 d3 e3 f3 g3 h3
a2 b2 c2 d2 e2 f2 g2 h2
a1 b1 c1 d1 e1 f1 g1 h1

We can apply the following permutation vector:

2a 23 2b 33 3b 03 0b 13
2c 24 2d 36 3f 00 09 12
25 1c 1d 1e 1f 18 19 1a
15 14 0d 06 3f 30 29 22
0c 13 0b 03 3b 33 2b 23
0a 12 09 00 37 2e 25 1c
11 1a 19 18 17 16 15 14
21 22 29 30 37 3e 05 0c
with vector mask: 0xf0f0f0f0f1f1f1f

To get a ray vector in the following arrangement:

c6 | d5 d6 d7 d8 -- -- --
e6 | e5 f6 g7 h8 -- -- --
f5 | e4 f4 g4 h4 -- -- --
f3 | e3 f2 g1 -- -- -- --
e2 | d3 d2 d1 -- -- -- --
c2 | c3 b2 a1 -- -- -- --
b3 | c4 b4 a4 -- -- -- --
b5 | c5 b6 a7 -- -- -- --

computing permutations

An easy way to do the ray permutations would be with a lookup table. This would be sufficiently small in size (4 kiB). However, they are cheap enough to compute with some 0x88 math.

In 0x88 math, we expand the bit representation of a square which is normally 00fffrrr (or file * 8 + rank), to the form 0fff0rrr (or file * 16 + rank). We do this so that we can easily detect out-of-bounds squares when doing math on coordinates. This is because if overflow or underflow in the rank or file parts of the square happens the zero bits would be set and this can be detected by masking by 0x88.

With the above, to calculate the permutation vector, we can add all possible offsets to the square. We must also then detect out-of-bounds squares. We can do this in a handful of vector operations:

std::tuple<v512, u64> superpieceRays(Square sq) {

  constexpr v512 offsets = v512{std::array<u8, 64>{
      0x1F, 0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, // N
      0x21, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, // NE
      0x12, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, // E
      0xF2, 0xF1, 0xE2, 0xD3, 0xC4, 0xB5, 0xA6, 0x97, // SE
      0xE1, 0xF0, 0xE0, 0xD0, 0xC0, 0xB0, 0xA0, 0x90, // S
      0xDF, 0xEF, 0xDE, 0xCD, 0xBC, 0xAB, 0x9A, 0x89, // SW
      0xEE, 0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA, 0xF9, // W
      0x0E, 0x0F, 0x1E, 0x2D, 0x3C, 0x4B, 0x5A, 0x69, // NW
  }};

  const v512 sq_vec = v512::broadcast8(internal::expandSq(sq));
  const v512 uncompressed = vec::add8(sq_vec, offsets);
  return internal::compressCoords(uncompressed);

}

fast coordinate re-compression

Wait up, what is up with internal::compressCoords?

We can do coordinate re-compression (from 0x88 format into regular format) in parallel in one vector instruction:

std::tuple<v512, u64> compressCoords(v512 list) {

  // Test with 0x88
  const u64 valid = vec::testn8(list, v512::broadcast8(0x88));

  // 0rrr0fff → 00rrrfff
  constexpr v512 magic = v512::broadcast64(0x0102041020400000);
  const v512 compressed = vec::gf2p8affine8(list, magic, 0);

  return {compressed, valid};

}

This abuse of the gf2p8affineqb instruction deserves its own mini-article to explain the math. Being able to treat bytes as 8-element GF(2) vectors to do matrix operations on them allows for complex bit manipulations in few instructions.

what's in the tables

Before we explore how to calculate attack tables we should discuss what information we are trying to compute.

Modern attack table engines store “square attacked by” information. In other words, given a square, what pieces are attacking this square? Some ways this information can be stored are:

as a bitboard representing squares that are attacking the current square (64-bits per square)
The i-th bit represent an attack from piece id i.as a bitlist of pieces that are attacking the current square (32-bits per square)
as a bitlist of directions that are attacking the current square (16-bits per square: 8 ordinal directions and 8 knight directions)

The most sensible option is the middle option, This also the decision the world's best attack table engine makes, so we are in good company. as this provides good tradeoff regarding lookup versus space. This requires keeping piecelists so we can assign each piece an id from 0 to 31.

our implementation

We choose to split the 32 ids into white ids and black ids. We also choose to split the attack tables into white attackers and black attackers. This is important because this allows each colored attack table to be stored in two AVX-512 vectors (sixty-four 16-bit elements). This is important as the vpermi2b AVX-512 instruction can perform a permutation using two vector registers as the source. We also often only want to update the attacks for a particular color.

how to calculate attacks to one square?

Computing all attacks to one square was the first thing I considered. We abandon this method when we implement incremental updates, but this is an important stepping stone, so we will start with it.

Firstly we should consider what we will be storing in our mailbox. We will use an example that makes things clearer that we will later abandon for performance. A mailbox will be made up of 64 Places, which are 8-bit integers that have the following 8 flags:

00000001 white pawn
00000010 black pawn
00000100 knight
00001000 bishop
00010000 rook
00011000 queen
00100000 king
10000000 black color (for white this bit will be cleared)

We have one bit that represents each type of movement. Notice that queen = rook + bishop. Also notice that white pawns and black pawns are different, because they have different movesets. This makes testing the whole board for a particular type of piece one vptestb instruction. The vptestb instruction is also a straightforward way to extract a piece bitboard from a mailbox! We won't do anything interesting with this observation. It was just neat to note. This demonstrates that mailboxes are just transposed bitboards.

The first thing we have to do is to handle sliders. The trick we use is similar to the logic of the o^(o-2r) technique. Let us first consider an example position:

fen: 3rk3/4q1q1/1b1P4/4p1n1/r2K3q/8/2N1n3/3R2B1 w - - 0 1
An impossible position, but a useful example.
We will look at attacks from the point of view of the d4 square.

We will:

extract the occupied squares in our ray vectors in one vpcmpneqb instruction,
calculate rays to nearest occupied squares with a o^(o-2r)-like trick, then
bitand the rays with the occupied squares to get a mask our potential attackers.

const auto [ray_coords, ray_valid] = geometry::superpieceRays(sq);
const v512 ray_places = vec::permute8(ray_coords, m_board_mailbox);

const u64 occupied = ray_places.nonzero8();
const u64 color = ray_places.msb8();

const u64 rays_to_nearest = geometry::superpieceAttacks(occupied, ray_valid);
const u64 visible = rays_to_nearest & occupied;

where superpieceAttacks is defined as:

u64 superpieceAttacks(u64 occupied, u64 ray_valid) {
  const u64 o = occupied | 0x8181818181818181;
  const u64 x = o ^ (o - 0x0303030303030303);
  return x & ray_valid;
}

For our example position, the values of occupied, rays_to_nearest, and visible are as follows:

occupied:
-- | -- d6 -- d8 -- -- -- 
-- | e5 -- g7 -- -- -- -- 
-- | -- -- -- h4 -- -- -- 
-- | -- -- g1 -- -- -- -- 
e2 | -- -- d1 -- -- -- -- 
c2 | -- -- a1 -- -- -- -- 
-- | -- -- a4 -- -- -- -- 
-- | -- b6 -- -- -- -- -- 

rays_to_nearest:
c6 | d5 d6 -- -- -- -- -- 
e6 | e5 -- -- -- -- -- -- 
f5 | e4 f4 g4 h4 -- -- -- 
f3 | e3 f2 g1 -- -- -- -- 
e2 | d3 d2 d1 -- -- -- -- 
c2 | c3 b2 a1 -- -- -- -- 
b3 | c4 b4 a4 -- -- -- -- 
b5 | c5 b6 -- -- -- -- -- 

visible:
-- | -- d6 -- -- -- -- -- 
-- | e5 -- -- -- -- -- -- 
-- | -- -- -- h4 -- -- -- 
-- | -- -- g1 -- -- -- -- 
e2 | -- -- d1 -- -- -- -- 
c2 | -- -- a1 -- -- -- -- 
-- | -- -- a4 -- -- -- -- 
-- | -- b6 -- -- -- -- --

Now we will need to determine if pieces that are visible to us are actually attacking us. Given we have set up our piece types to have a single bit for their movement types, this is a vector bitand with a mask. The mask is constructed like so:

constexpr u8 horse      = 0b000100; // knight
constexpr u8 orth       = 0b010000; // rook and queen
constexpr u8 diag       = 0b001000; // bishop and queen
constexpr u8 orth_near  = 0b110000; // king, rook and queen
constexpr u8 wpawn_near = 0b101001; // wp, king, bishop, queen
constexpr u8 bpawn_near = 0b101010; // bp, king, bishop, queen

constexpr v512 attacker_mask = v512{std::array<u8, 64>{
  horse, orth_near,  orth, orth, orth, orth, orth, orth, // N
  horse, wpawn_near, diag, diag, diag, diag, diag, diag, // NE
  horse, orth_near,  orth, orth, orth, orth, orth, orth, // E
  horse, bpawn_near, diag, diag, diag, diag, diag, diag, // SE
  horse, orth_near,  orth, orth, orth, orth, orth, orth, // S
  horse, bpawn_near, diag, diag, diag, diag, diag, diag, // SW
  horse, orth_near,  orth, orth, orth, orth, orth, orth, // W
  horse, wpawn_near, diag, diag, diag, diag, diag, diag, // NW
}};

The remainder of the function is self-explanatory (where vec::compress8 is the vpcompressb instruction) ...

const u64 attackers = (ray_places & attacker_mask).nonzero8();
const u64 white_attackers = ~color & visible & attackers;
const u64 black_attackers = color & visible & attackers;

const int white_attackers_count = std::popcount(white_attackers);
const int black_attackers_count = std::popcount(black_attackers);
const v128 white_attackers_coord = vec::compress8(white_attackers, ray_coords).to128();
const v128 black_attackers_coord = vec::compress8(black_attackers, ray_coords).to128();
return {
  vec::findset8(white_attackers_coord, white_attackers_count, m_piece_list_sq[0].x),
  vec::findset8(black_attackers_coord, black_attackers_count, m_piece_list_sq[1].x),
};

... except for vec::findset8.

vec::findset8

Here is its definition in Rose:

forceinline u16 findset8(v128 haystack, int haystack_len, v128 needles) {
  return _mm_extract_epi16(_mm_cmpestrm(haystack.raw, haystack_len, needles.raw, 16, 0), 0);
}

This is the pcmpestrm instruction. You can use it to find which elements of a set are present in a variable-length array (up to 16 bytes long). Neat, huh?

This concludes part one of this two part series on accelerating incremental attack tables with vector math. In this first part, we:

Showed how to extract all rays from a mailbox at once via permutation.
Showed off a gf2p8affineqb trick.
Showed how the choice of piece type representation can be important and showed that mailboxes and bitboards are just transpositions of each other.
Showed how to calculate all possible attackers to one single square.

In subsequent parts we will demonstrate:

How to use this information to actually generate moves.
How to update all squares affected by a piece move in parallel.
How to detect pins then mask the entire attack table at once for efficient fully-legal move generation.
A few more vector instruction tricks.

navigation

part 1: non-incrementally generating attack tables
part 2: move generation
part 3: incrementally generating attack tables
part 4: pin detection and fully-legal movegen (TBD)

thanks to

The authors of Yukari whose attack tables inspired this exploration.
The regulars of the Engine Programming and Stockfish discords.
Stefan Haack for cm-chessboard which was used to render the the chess boards on this page. (License: MIT.)
Chess pieces are by Wikimedia user Cburnett, licensed under CC BY-SA 3.0