Blog of Daniel Lemire

Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.

The fastest way to match characters on ARM processors?

2026-04-20 04:41:04

Consider the following problem. Given a string, you must match all of the ASCII white-space characters (\t, \n, \r, and the space) and some characters important in JSON (:, ,, [, ], {, }). JSON is a text-based data format used for web services. A toy JSON document looks as follows.

{
  "name": "Alice",
  "age": 30,
  "email": "[email protected]",
  "tags": ["developer", "python", "open-source"],
  "active": true
}

We want to solve this problem using SIMD (single-instruction-multiple-data) instructions. With these instructions, you can compare a block of 16 bytes with another block of 16 bytes in one instruction.

This is a subproblem in the fast simdjson JSON library: we solve it when indexing a JSON document. We call this task vectorized classification. We also use the same technique when parsing DNS records, and so forth. In the actual simdjson library, we must also handle strings and quotes, and things get more complicated.

I need to define what I mean by ‘matching’ the characters. In my case, it is enough to get, for each block of 64 bytes, two 64-bit masks: one for spaces and one for important characters. To illustrate, let me consider a 16-byte variant:

{"name": "Ali" }
1000000100000001 // important characters
0000000010000010 // spaces

Thus, I want to get back the numbers 0b1000000100000001 and 0b0000000010000010 in binary format (they are 33025 and 130 in decimal).

I refer you to Langdale and Lemire (2019) for how to do it using the conventional SIMD instructions available on ARM processors (NEON). Their key idea is a table-driven, branch-free classifier: for each byte, split into low and high nibbles, use SIMD table lookups to map each nibble to a bitmask, and combine the two masks (with bitwise AND) to decide whether the byte belongs to a target set (whitespace or structural JSON characters). This avoids doing many separate equality comparisons per character.

There is now a better way on recent ARM processors.

The 128-bit version of NEON arrived in 2011 with the ARMv8-A architecture (AArch64). Apple played an important role: its A7 chip in the iPhone 5S was the first to ship the architecture. You can count on all 64-bit ARM processors to support NEON, which is convenient. (There are 32-bit ARM processors, but they are mostly used for embedded systems, not mainstream computing.)

ARM NEON is good but getting old. It is no match for the AVX-512 instruction set available on x64 (AMD and Intel) processors. Not only do the AVX-512 instructions support wider registers (64 bytes as opposed to ARM NEON’s 16 bytes), but they also have more powerful instructions.

But ARM has something else to offer: the Scalable Vector Extension (SVE) and its successor, SVE2. Though SVE was first announced in 2016, it took until 2022 before we had broad access. The Neoverse V1 architecture used by the Amazon Graviton 3 was the first I could try. Soon after, we got SVE2 with the Neoverse V2 and N2 architectures. Today it is readily available: the Graviton 4 on AWS, the Microsoft Cobalt 100 on Azure, the Google Axion on Google Cloud (and newer Google Cloud ARM CPUs), the NVIDIA Grace CPU, as well as several chips from Qualcomm, MediaTek, and Samsung. Notice who I am not including? Apple. For unclear reasons, Apple has not yet adopted SVE2.

I have mixed feelings about SVE/SVE2. Like the RISC-V vector extension, it breaks with the fixed-length register approach of ARM NEON and x64 SIMD (16 bytes, 32 bytes, 64 bytes). This means that you are expected to write code without knowing how wide the registers are.

This is convenient for chip makers because it gives them the option of adjusting the register size to better suit their market. Yet the flexibility seems to have gone unused: while the Graviton 3 processor from Amazon had 256-bit registers, all commodity chips since have had 128-bit registers.

On the plus side, SVE/SVE2 has masks (predicates), a bit like AVX-512, so you can load and process only a subset of a register. This solves a long-standing problem with earlier SIMD instruction sets: handling inputs whose length is not a multiple of the register size. Being able to operate on only part of a register also allows clever optimizations. Sadly, unlike AVX-512, SVE/SVE2 does not let you move masks to and from general-purpose registers efficiently. That is a direct consequence of the variable-length register design: even though your registers might always be 128-bit and contain 16 bytes, the instruction set is not allowed to assume that a mask fits in a 16-bit word.

I was pessimistic about SVE/SVE2 until I learned that it is designed to interoperate with ARM NEON. Thus you can mix SVE/SVE2 instructions into your ARM NEON code. This works especially well if you know that the SVE/SVE2 registers match the ARM NEON registers in size (16 bytes).

For the work I do, two SVE2 instructions are important: match and nmatch. In their 8-bit versions, given two vectors a and b, each containing up to 16 bytes, match sets a predicate bit to true for each position i where a[i] equals any of the bytes in b. In other words, b acts as a small lookup set, and match tests set membership for every byte of a simultaneously. The nmatch instruction is the logical complement: it sets a predicate bit to true wherever a[i] does not match any byte in b. A single instruction thus replaces a series of equality comparisons and OR-reductions that would otherwise be needed.

In the code below, op_chars holds the six structural JSON characters and ws_chars holds the four whitespace characters; calling svmatch_u8 once on a 16-byte chunk d produces a predicate with a true bit exactly where the input byte is a structural character. The code uses SVE2 intrinsics: compiler-provided C/C++ functions that map almost one-to-one to CPU SIMD instructions, so you get near-assembly control without writing assembly.

#include <arm_sve.h> // SVE/SVE2 intrinsics; compile with, e.g., -march=armv8-a+sve2

// : , [ ] { }
uint8_t op_chars_data[16] = {
    0x3a, 0x2c, 0x5b, 0x5d, 0x7b, 0x7d, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0
};
// \t \n \r ' '
uint8_t ws_chars_data[16] = {
    0x09, 0x0a, 0x0d, 0x20, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0
};

// load the characters in SIMD registers
svuint8_t op_chars = svld1_u8(svptrue_b8(), op_chars_data);
svuint8_t ws_chars = svld1_u8(svptrue_b8(), ws_chars_data);

// load 16 bytes of input data
// const char * input = ...;
svbool_t pg = svptrue_pat_b8(SV_VL16);
svuint8_t d = svld1_u8(pg, (const uint8_t *)input);

// matching
svbool_t op = svmatch_u8(pg, d, op_chars);
svbool_t ws = svmatch_u8(pg, d, ws_chars);

In this code snippet, svuint8_t is an SVE vector type containing unsigned 8-bit lanes (bytes). svbool_t is an SVE predicate (mask) type. svptrue_b8() builds a predicate where all 8-bit lanes are active, and svld1_u8(pg, ptr) loads bytes from memory into an SVE vector, using predicate pg to decide which lanes are actually read.

If you paid attention thus far, you might have noticed that my code is slightly wrong since I am including 0 in the character sets. But it is fine as long as I assume that the zero byte is not present in the input. In practice, I could just repeat one of the characters, or use a bogus character that I do not expect to see in my inputs (such as the byte value 0xFF, which cannot appear in a valid UTF-8 string).

In standard SVE/SVE2, op and ws are predicates, not integer masks. A practical trick is to materialize each predicate as bytes (0xFF for true, 0x00 for false), for example with svdup_n_u8_z.

svuint8_t opm = svdup_n_u8_z(op, 0xFF);
svuint8_t wsm = svdup_n_u8_z(ws, 0xFF);

When SVE vectors are 128 bits, this byte vector maps naturally to a NEON uint8x16_t via svget_neonq_u8, and from there we can build scalar bitmasks efficiently with NEON operations (masking plus pairwise additions). Repeating this over four 16-byte chunks gives the two 64-bit masks needed for a 64-byte block.

How does it compare with pure NEON code? I compiled the routine processing blocks of 64 bytes using different compilers.

method            GCC 16   LLVM clang 20
simdjson (NEON)   69       66
SVE/SVE2 (new!)   42       52

Interestingly, GCC 16 even emits SVE instructions when compiling the pure NEON code, which suggests that recompiling old NEON code while targeting SVE/SVE2 could be beneficial.

I was hoping to benchmark both compilers, but I wanted to run quickly on an AWS Graviton 4 and did not want to build GCC 16 from source. So I kept LLVM clang 20, which was readily available in the images that AWS provides (I picked Red Hat 10).

The AWS Graviton 4 processor is a Neoverse V2 processor. Google has its own Neoverse V2 processors in its cloud. In my tests, it ran at 2.8 GHz.

My benchmark generates a random string of 1 MiB and computes the bitmaps indicating the positions of the characters. It is available on GitHub. My results are as follows.

method            GB/s   instructions/byte   instructions/cycle
simdjson (NEON)   11.4   0.94                3.5
SVE/SVE2 (new!)   14.4   0.67                3.8

So the SVE/SVE2 approach is about 25% faster than the NEON equivalent and uses 30% fewer instructions, and that is without any fancy optimization. Importantly, the code is relatively simple thanks to the match instruction.

It might be that the SVE2 function match is the fastest way to match characters on ARM processors.

Credit: This post was motivated by a sketch by user liuyang-664 on GitHub.

References

Langdale, G., & Lemire, D. (2019). Parsing gigabytes of JSON per second. The VLDB Journal, 28(6), 941-960.

Koekkoek, J., & Lemire, D. (2025). Parsing millions of DNS records per second. Software: Practice and Experience, 55(4), 778-788.

Lemire, D. (2025). Scanning HTML at Tens of Gigabytes Per Second on ARM Processors. Software: Practice and Experience, 55(7), 1256-1265.

A brief history of C/C++ programming languages

2026-04-09 22:58:53

Initially, we had languages like Fortran (1957), Pascal (1970), and C (1972). Fortran was designed for number crunching and scientific computing. Pascal was restrictive with respect to low-level access: it was deliberately “safe”, being meant for teaching structured programming. So C won out as a language that allowed low-level, unsafe programming (pointer arithmetic, direct memory access) while remaining general-purpose enough for systems work like Unix. To be fair, Pascal had descendants that are still around, but C clearly dominated.
Object-oriented programming became viewed as the future in the 1980s and 1990s. It turned into some kind of sect.
But C was not object-oriented.
So we got C++, which began as “C with Classes”. C++ had templates, enabling generic programming and compile-time metaprogramming. This part of the language makes C++ quite powerful, but somewhat difficult to master (with crazy error messages).
Both C and C++ became wildly successful, but writing portable applications remained difficult — you often had to target Windows or a specific Unix variant. This was a problem for a company like Sun Microsystems that sold Unix boxes and wanted to compete against the juggernaut that Microsoft was becoming.
So Java came along in 1995. It was positioned as a safe, portable alternative to C++: it eliminated raw pointer arithmetic, added mandatory garbage collection, array bounds checking everywhere, and ran on a virtual machine (JVM) with just-in-time compilation for performance.
The “write once, run anywhere” promise addressed C/C++ portability pain points directly. To this day, Java remains a strong solution for writing portable enterprise and server-side code.
We also got JavaScript in 1995. Despite the name, it has almost nothing in common with Java semantically. It is best viewed as separate from the C/C++ branch. Python is similarly quite different.
Microsoft would eventually come up with C# in 2000. It belongs to the same C-family syntax tradition as C++ and Java, but with support for ahead-of-time compilation in modern .NET. It also allows guarded pointer access within explicitly marked unsafe scopes. At this point, C# can be seen as “C++ with garbage collection” in spirit. It even competes against C++ in the game industry thanks to Unity.
Google came up with Go. It is much like a simpler, modern C: garbage-collected, with built-in bounds checking on slices/arrays, and pointers allowed but without arbitrary arithmetic in safe code (the unsafe package exists for low-level needs).
Later, Apple came up with Swift. It has C++-like performance and syntax goals but adds modern safety features (bounds checking by default, traps on integer overflow) and uses Automatic Reference Counting (ARC) for memory management. Swift replaced Objective-C, but I still view it as a C++ successor.
At about the same time, we got Rust. Like Swift, it drops the tracing garbage collection of Java, C#, and Go. It relies instead on compile-time ownership and borrowing rules, with the tradeoff that you can still leak memory with reference cycles. We also got Zig, which makes memory usage fully explicit.
I think that it is fairer to describe Rust and Zig as descendants of C rather than C++. Both are much more powerful than C, of course… and the evolution of programming languages is complex. Still. They are C-like programming languages.
To this day, in much of the industry, the dominant programming languages for performance-critical, systems, enterprise, and infrastructure work remain C, C++, Java, and C#. By the Lindy effect (the longer something has survived, the longer it is likely to continue surviving), these languages, especially C, now over 50 years old, are still going to be around for a long time.

Can your AI rewrite your code in assembly?

2026-04-06 05:16:14

Suppose you have several strings and you want to count the number of instances of the character ! in your strings. In C++, you might solve the problem as follows if you are an old-school programmer.

size_t c = 0;
for (const auto &str : strings) {
    c += std::count(str.begin(), str.end(), '!');
}

You can also get fancier with ranges.

for (const auto &str : strings) {
    c += std::ranges::count(str, '!');
}

And so forth.

But what if you want to go faster? Maybe you’d want to rewrite this function in assembly. I decided to do so, and to have fun using both Grok and Claude as my AIs, setting up a friendly competition.

I started with my function and then asked the AIs to optimize it in assembly. Importantly, they knew which machine I was on, so they wrote ARM assembly.

By repeated prompting, I got the following functions.

  • count_classic: Uses C++ standard library std::count for reference.
  • count_assembly: A basic ARM64 assembly loop (byte-by-byte comparison). Written by Grok.
  • count_assembly_claude: Claude’s SIMD-optimized version using NEON instructions (16-byte chunks).
  • count_assembly_grok: Grok’s optimized version (32-byte chunks).
  • count_assembly_claude_2: Claude’s further optimized version (64-byte chunks with multiple accumulators).
  • count_assembly_grok_2: Grok’s latest version (64-byte chunks with improved accumulator handling).
  • count_assembly_claude_3: Claude’s most advanced version with additional optimizations.

You get the idea.

So, how is the performance? I use random strings of up to 1 kilobyte. In all cases, I test that the functions provide the correct count. I did not closely examine the code, however, so mistakes could still be hiding in it.

I record the average number of instructions per string.

name               instructions/string
classic C++        1200
claude assembly    250
grok assembly      204
claude assembly 2  183
grok assembly 2    176
claude assembly 3  154

By repeated optimization, I reduced the number of instructions by a factor of eight. The running time decreases similarly.

Can we get the AIs to rewrite the best option in C? Yes, although you need SIMD intrinsics. So there is no benefit to leaving the code in assembly in this instance.

An open question is whether the AIs could find optimizations that are not possible if we use a higher-level language like C or C++. It is an intriguing question that I will seek to answer later. For the time being, the AIs can beat my C++ compiler!

My source code is available.

A Fast Immutable Map in Go

2026-03-30 02:18:01

Consider the following problem. You have a large set of strings, maybe millions. You need to map these strings to 8-byte integers (uint64). These integers are given to you.

If you are working in Go, the standard solution is to create a map. The construction is trivial, something like the following loop.

m := make(map[string]uint64, N)
for i, k := range keys {
    m[k] = values[i]
}

One downside is that the map may use over 50 bytes per entry.

In important scenarios, the following conditions hold: the map is large (a million entries or more), you do not need to modify it after construction (it is immutable), and all queried keys are in the set. Under these conditions, you can reduce the memory usage to little more than the size of the values, about 8 bytes per entry. One fast technique relies on binary fuse filters.

I implemented it as a Go library called constmap that provides an immutable map from strings to uint64 values using binary fuse filters. This data structure is ideal when you have a fixed set of keys at construction time and need fast, memory-efficient lookups afterward. You can even construct the map once and save it to disk, so you do not pay the construction cost each time you need it.

The usage is just as simple.

package main

import (
    "fmt"
    "log"

    "github.com/lemire/constmap"
)

func main() {
    keys := []string{"apple", "banana", "cherry"}
    values := []uint64{100, 200, 300}

    cm, err := constmap.New(keys, values)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(cm.Map("banana")) // 200
}

The construction time is higher (as expected for any compact data structure), but lookups are optimized for speed. I ran benchmarks on my Apple M4 Max processor to compare constmap lookups against Go’s built-in map[string]uint64. The test uses 1 million keys.

Data Structure   Lookup Time   Memory Usage
ConstMap         7.4 ns/op     9 bytes/key
Go Map           20 ns/op      56 bytes/key

ConstMap is nearly 3 times faster than Go’s standard map for lookups! And we reduced the memory usage by a factor of 6.

The ConstMap may not always be faster, but it should always use significantly less memory. If it can reside in CPU cache while the map cannot, then it will be significantly faster.

Source code: the implementation is available on GitHub at github.com/lemire/constmap.

JSON and C++26 compile-time reflection: a talk

2026-03-26 08:29:38

The next C++ standard (C++26) is getting exciting new features. One of these features is compile-time reflection. It is ideally suited to serialize and deserialize data at high speed. To test it out, we extended our fast JSON library (simdjson) and we gave a talk at CppCon 2025. The video is out on YouTube.

Our slides are also available.

How many branches can your CPU predict?

2026-03-19 05:52:53

Modern processors have the ability to execute many instructions per cycle, on a single core. To be able to execute many instructions per cycle in practice, processors predict branches. I have made the point over the years that modern CPUs have an incredible ability to predict branches.

It makes benchmarking difficult because if you test on small datasets, you can get surprising results that might not work on real data.

My go-to benchmark is a function like so:

while (howmany != 0) {
    val = generate_random_value()
    if(val is odd) write to buffer
    decrement howmany
}

The processor tries to predict the branch (the if clause). Because we use random values, the processor should mispredict half of the time.

However, if we repeat the benchmark multiple times, always using the same random values, the processor learns the branches. How many branches can a processor learn? I tested three recent processors.

  • The AMD Zen 5 processor can predict perfectly 30,000 branches.
  • The Apple M4 processor can predict perfectly 10,000 branches.
  • Intel Emerald Rapids can predict perfectly 5,000 branches.

Once more I am disappointed by Intel. AMD is doing wonderfully well on this benchmark.

My source code is available.