2026-04-20 04:41:04

Consider the following problem. Given a string, you must match all of the ASCII white-space characters (\t, \n, \r, and the space) and some characters important in JSON (:, ,, [, ], {, }). JSON is a text-based data format used for web services. A toy JSON document looks as follows.
{
"name": "Alice",
"age": 30,
"email": "[email protected]",
"tags": ["developer", "python", "open-source"],
"active": true
}
We want to solve this problem using SIMD (single-instruction-multiple-data) instructions. With these instructions, you can compare a block of 16 bytes with another block of 16 bytes in one instruction.
It is a subproblem of indexing a JSON document in the fast simdjson JSON library. We call this task vectorized classification. We use the same technique when parsing DNS records (Koekkoek & Lemire, 2025), and so forth. In the actual simdjson library, we must also handle strings and quotes, and it gets more complicated.
I need to define what I mean by ‘matching’ the characters. In my case, it is enough to get, for each block of 64 bytes, two 64-bit masks: one for spaces and one for important characters. To illustrate, let me consider a 16-byte variant:
{"name": "Ali" }
1000000100000001 // important characters
0000000010000010 // spaces
Thus, I want to get back the numbers 0b1000000100000001 and 0b0000000010000010 in binary format (they are 33025 and 130 in decimal).
I refer you to Langdale and Lemire (2019) for how to do it using the conventional SIMD instructions available on ARM processors (NEON). Their key idea is a table-driven, branch-free classifier: for each byte, split into low and high nibbles, use SIMD table lookups to map each nibble to a bitmask, and combine the two masks (with bitwise AND) to decide whether the byte belongs to a target set (whitespace or structural JSON characters). This avoids doing many separate equality comparisons per character.
There is now a better way on recent ARM processors.
NEON in its modern 64-bit form was introduced in 2011 with the ARMv8-A architecture (AArch64). Apple played an important role: the Apple A7 chip in the iPhone 5S (2013) was the first mainstream 64-bit ARM processor. You can count on all 64-bit ARM processors to support NEON, which is convenient. (There are 32-bit ARM processors, but they are mostly used in embedded systems, not mainstream computing.)
ARM NEON is good but getting old. It is no match for the AVX-512 instruction set available on x64 (AMD and Intel) processors. Not only do the AVX-512 instructions support wider registers (64 bytes as opposed to ARM NEON’s 16 bytes), but they also have more powerful instructions.
But ARM has something else to offer: the Scalable Vector Extension (SVE) and its successor, SVE2. Though SVE was first introduced in 2016, it took until 2022 before we had actual access. The Neoverse V1 architecture used by the Amazon Graviton 3 is the first one I had access to. Soon after, we got SVE2 with the Neoverse V2 and N2 architectures. Today it is readily available: the Graviton 4 on AWS, the Microsoft Cobalt 100 on Azure, the Google Axion on Google Cloud (and newer Google Cloud ARM CPUs), the NVIDIA Grace CPU, as well as several chips from Qualcomm, MediaTek, and Samsung. Notice who I am not including? Apple. For unclear reasons, Apple has not yet adopted SVE2.
I have mixed feelings about SVE/SVE2. Like RISC-V, it breaks with the approach from ARM NEON and x64 SIMD that uses fixed-length register sizes (16 bytes, 32 bytes, 64 bytes). This means that you are expected to code without knowing how wide the registers are.
This is convenient for chip makers because it gives them the option of adjusting the register size to better suit their market. Yet this flexibility seems to have gone unused: while the Graviton 3 processor from Amazon had 256-bit registers, all commodity chips since have had 128-bit registers.
On the plus side, SVE/SVE2 has masks a bit like AVX-512, so you can load and process only a subset of the lanes in a register. This solves a long-standing problem with earlier SIMD instruction sets when the input is not a multiple of the register size: tail handling becomes much nicer. Being able to operate on only part of the register also allows clever optimizations. Sadly, SVE/SVE2 does not allow you to move masks to and from a general-purpose register efficiently, unlike AVX-512. That is a direct consequence of the design with variable-length registers: even though your registers might always be 128-bit and contain 16 bytes, the instruction set is not allowed to assume that a mask fits in a 16-bit word.
I was pessimistic regarding SVE/SVE2 until I learned that it is designed to be interoperable with ARM NEON. Thus you can use the SVE/SVE2 instructions with your ARM NEON code. This works especially well if you know that the SVE/SVE2 registers match the ARM NEON registers (16 bytes).
For the work I do, there are two SVE2 instructions that are important: match and nmatch. In their 8-bit versions, what they do is the following: given two vectors a and b, each containing up to 16 bytes, match sets a predicate bit to true for each position i where a[i] equals any of the bytes in b. In other words, b acts as a small lookup set, and match tests set membership for every byte of a simultaneously. The nmatch instruction is the logical complement: it sets a predicate bit to true wherever a[i] does not match any byte in b. A single instruction thus replaces a series of equality comparisons and OR-reductions that would otherwise be needed. In the code below, op_chars holds the 6 structural JSON characters and ws_chars holds the 4 whitespace characters; calling svmatch_u8 once on a 16-byte chunk d produces a predicate that has a true bit exactly where that input byte is a structural character. The code uses SVE2 intrinsics: compiler-provided C/C++ functions that map almost one-to-one to CPU SIMD instructions, so you get near-assembly control without writing assembly.
#include <arm_sve.h> // SVE/SVE2 intrinsics; compile with SVE2 enabled (e.g., -march=armv9-a)

// : , [ ] { }
uint8_t op_chars_data[16] = {
0x3a, 0x2c, 0x5b, 0x5d, 0x7b, 0x7d, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0
};
// \t \n \r ' '
uint8_t ws_chars_data[16] = {
0x09, 0x0a, 0x0d, 0x20, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0
};
// load the characters into SIMD registers (assuming 16-byte SVE vectors;
// otherwise svptrue_b8() would read past the 16-byte arrays)
svuint8_t op_chars = svld1_u8(svptrue_b8(), op_chars_data);
svuint8_t ws_chars = svld1_u8(svptrue_b8(), ws_chars_data);
// load data
// const uint8_t * input = ...
svbool_t pg = svptrue_pat_b8(SV_VL16);
svuint8_t d = svld1_u8(pg, input);
// matching
svbool_t op = svmatch_u8(pg, d, op_chars);
svbool_t ws = svmatch_u8(pg, d, ws_chars);
In this code snippet, svuint8_t is an SVE vector type containing unsigned 8-bit lanes (bytes). svbool_t is an SVE predicate (mask) type. svptrue_b8() builds a predicate where all 8-bit lanes are active, and svld1_u8(pg, ptr) loads bytes from memory into an SVE vector, using predicate pg to decide which lanes are actually read.
If you paid attention thus far, you might have noticed that my code is slightly wrong since I am including 0 in the character sets. But it is fine as long as I assume that the zero byte is not present in the input. In practice, I could just repeat one of the characters, or use a bogus character that I do not expect to see in my inputs (such as the byte value 0xFF, which cannot appear in a valid UTF-8 string).
In standard SVE/SVE2, op and ws are predicates, not integer masks. A practical trick is to materialize each predicate as bytes (0xFF for true, 0x00 for false), for example with svdup_n_u8_z.
svuint8_t opm = svdup_n_u8_z(op, 0xFF);
svuint8_t wsm = svdup_n_u8_z(ws, 0xFF);
When SVE vectors are 128 bits, this byte vector maps naturally to a NEON uint8x16_t via svget_neonq_u8, and from there we can build scalar bitmasks efficiently with NEON operations (masking plus pairwise additions). Repeating this over four 16-byte chunks gives the two 64-bit masks needed for a 64-byte block.
How does it compare with pure NEON code? I compiled the routine processing blocks of 64 bytes using different compilers and counted the instructions generated per 64-byte block.
| method | GCC 16 | LLVM clang 20 |
|---|---|---|
| simdjson (NEON) | 69 | 66 |
| SVE/SVE2 (new!) | 42 | 52 |
Interestingly, GCC 16 even emits SVE instructions when compiling the pure NEON code, which suggests that recompiling old NEON code while targeting SVE/SVE2 could be beneficial.
I was hoping to benchmark both compilers, but I wanted to run my benchmarks quickly on an AWS Graviton 4, and I did not want to compile GCC 16 from source. So I kept LLVM clang 20, which was readily available in the images that AWS provides (I picked Red Hat 10).
The AWS Graviton 4 processor is a Neoverse V2 processor. Google has its own Neoverse V2 processors in its cloud. In my tests, it ran at 2.8 GHz.
My benchmark generates a random string of 1 MiB and computes the bitmaps indicating the positions of the characters. It is available on GitHub. My results are as follows.
| method | GB/s | instructions/byte | instructions/cycle |
|---|---|---|---|
| simdjson (NEON) | 11.4 | 0.94 | 3.5 |
| SVE/SVE2 (new!) | 14.4 | 0.67 | 3.8 |
So the SVE/SVE2 approach is about 25% faster than the NEON equivalent and uses about 30% fewer instructions, and that’s without any kind of fancy optimization. Importantly, the code is relatively simple thanks to the match instruction.
It might be that the SVE2 match instruction is the fastest way to match characters on ARM processors.
Credit: This post was motivated by a sketch by user liuyang-664 on GitHub.
Langdale, G., & Lemire, D. (2019). Parsing gigabytes of JSON per second. The VLDB Journal, 28(6), 941-960.
Koekkoek, J., & Lemire, D. (2025). Parsing millions of DNS records per second. Software: Practice and Experience, 55(4), 778-788.
Lemire, D. (2025). Scanning HTML at Tens of Gigabytes Per Second on ARM Processors. Software: Practice and Experience, 55(7), 1256-1265.
2026-04-06 05:16:14

Suppose you have several strings and you want to count the number of instances of the character ! in your strings. In C++, you might solve the problem as follows if you are an old-school programmer.
size_t c = 0;
for (const auto &str : strings) {
c += std::count(str.begin(), str.end(), '!');
}
You can also get fancier with ranges.
for (const auto &str : strings) {
c += std::ranges::count(str, '!');
}
And so forth.
But what if you want to go faster? Maybe you’d want to rewrite this function in assembly. I decided to do so, and to have fun using both Grok and Claude as my AIs, setting up a friendly competition.
I started with my function and then I asked AIs to optimize it in assembly. Importantly, they knew which machine I was on, so they started to write ARM assembly.
By repeated prompting, I got the following functions.
- count_classic: Uses the C++ standard library std::count, for reference.
- count_assembly: A basic ARM64 assembly loop (byte-by-byte comparison). Written by Grok.
- count_assembly_claude: Claude’s SIMD-optimized version using NEON instructions (16-byte chunks).
- count_assembly_grok: Grok’s optimized version (32-byte chunks).
- count_assembly_claude_2: Claude’s further optimized version (64-byte chunks with multiple accumulators).
- count_assembly_grok_2: Grok’s latest version (64-byte chunks with improved accumulator handling).
- count_assembly_claude_3: Claude’s most advanced version with additional optimizations.

You get the idea.
So, how is the performance? I use random strings of up to 1 kilobyte. In all cases, I test that the functions provide the correct count. However, I did not closely examine the code, so mistakes could be hiding in it.
I record the average number of instructions per string.
| name | instructions/string |
|---|---|
| classic C++ | 1200 |
| claude assembly | 250 |
| grok assembly | 204 |
| claude assembly 2 | 183 |
| grok assembly 2 | 176 |
| claude assembly 3 | 154 |
By repeated optimization, I reduced the number of instructions by a factor of eight. The running time decreases similarly.
Can we get the AIs to rewrite the best option in C? Yes, although you need SIMD intrinsics. So there is no benefit to leaving the code in assembly in this instance.
An open question is whether the AIs could find optimizations that are not possible if we use a higher-level language like C or C++. It is an intriguing question that I will seek to answer later. For the time being, the AIs can beat my C++ compiler!
2026-03-30 02:18:01

Consider the following problem. You have a large set of strings, maybe millions. You need to map these strings to 8-byte integers (uint64). These integers are given to you.
If you are working in Go, the standard solution is to create a map. The construction is trivial, something like the following loop.
m := make(map[string]uint64, N)
for i, k := range keys {
m[k] = values[i]
}
One downside is that the map may use over 50 bytes per entry.
In important scenarios, the following conditions hold: the map is large (a million entries or more), it is immutable (you do not need to modify it dynamically), and all queried keys are in the set. Under such conditions, you can reduce the memory usage down to almost the size of the keys, so about 8 bytes per entry. One fast technique relies on binary fuse filters.
I implemented this idea as a Go library called constmap that provides an immutable map from strings to uint64 values using binary fuse filters. This data structure is ideal when you have a fixed set of keys at construction time and need fast, memory-efficient lookups afterward. You can even construct the map once and save it to disk, so that you do not pay the construction cost each time you need the map.
The usage is just as simple.
package main
import (
"fmt"
"log"
"github.com/lemire/constmap"
)
func main() {
keys := []string{"apple", "banana", "cherry"}
values := []uint64{100, 200, 300}
cm, err := constmap.New(keys, values)
if err != nil {
log.Fatal(err)
}
fmt.Println(cm.Map("banana")) // 200
}
The construction time is higher (as expected for any compact data structure), but lookups are optimized for speed. I ran benchmarks on my Apple M4 Max processor to compare constmap lookups against Go’s built-in map[string]uint64. The test uses 1 million keys.
| Data Structure | Lookup Time | Memory Usage |
|---|---|---|
| ConstMap | 7.4 ns/op | 9 bytes/key |
| Go Map | 20 ns/op | 56 bytes/key |
ConstMap is nearly 3 times faster than Go’s standard map for lookups! And we reduced the memory usage by a factor of 6.
The ConstMap may not always be faster, but it should always use significantly less memory. If it can reside in CPU cache while the map cannot, then it will be significantly faster.
Source Code The implementation is available on GitHub: github.com/lemire/constmap.
2026-03-26 08:29:38

The next C++ standard (C++26) is getting exciting new features. One of these features is compile-time reflection. It is ideally suited to serialize and deserialize data at high speed. To test it out, we extended our fast JSON library (simdjson) and we gave a talk at CppCon 2025. The video is out on YouTube.
2026-03-19 05:52:53

Modern processors have the ability to execute many instructions per cycle, on a single core. To be able to execute many instructions per cycle in practice, processors predict branches. I have made the point over the years that modern CPUs have an incredible ability to predict branches.
This makes benchmarking difficult: if you test on small datasets, you can get surprisingly good results that do not carry over to real data.
My go-to benchmark is a function like the following (a C++ sketch):
while (howmany != 0) {
    uint64_t val = generate_random_value();
    if (val & 1) { *buffer++ = val; } // write odd values to the buffer
    howmany--;
}
The processor tries to predict the branch (if clause). Because we use random values, the processor should mispredict one time out of two.
However, if we repeat the benchmark multiple times, always using the same random values, the processor learns the branches. How many branches can processors learn? I test using three recent processors.
Once more I am disappointed by Intel. AMD is doing wonderfully well on this benchmark.