2026-04-06 05:16:14

Suppose you have several strings and you want to count the number of instances of the character ! in your strings. In C++, you might solve the problem as follows if you are an old-school programmer.
size_t c = 0;
for (const auto &str : strings) {
  c += std::count(str.begin(), str.end(), '!');
}
You can also get fancier with ranges.
for (const auto &str : strings) {
  c += std::ranges::count(str, '!');
}
And so forth.
But what if you want to go faster? Maybe you’d want to rewrite this function in assembly. I decided to do so, and to have fun using both Grok and Claude as my AIs, setting up a friendly competition.
I started with my function and then I asked AIs to optimize it in assembly. Importantly, they knew which machine I was on, so they started to write ARM assembly.
By repeated prompting, I got the following functions.
- count_classic: Uses the C++ standard library std::count, for reference.
- count_assembly: A basic ARM64 assembly loop (byte-by-byte comparison). Written by Grok.
- count_assembly_claude: Claude's SIMD-optimized version using NEON instructions (16-byte chunks).
- count_assembly_grok: Grok's optimized version (32-byte chunks).
- count_assembly_claude_2: Claude's further optimized version (64-byte chunks with multiple accumulators).
- count_assembly_grok_2: Grok's latest version (64-byte chunks with improved accumulator handling).
- count_assembly_claude_3: Claude's most advanced version with additional optimizations.

You get the idea.
So, how is the performance? I use random strings of up to 1 kilobyte. In all cases, I test that the functions provide the correct count. However, I did not closely examine the code, so mistakes could still be hiding in it.
I record the average number of instructions per string.
| name | instructions/string |
|---|---|
| classic C++ | 1200 |
| claude assembly | 250 |
| grok assembly | 204 |
| claude assembly 2 | 183 |
| grok assembly 2 | 176 |
| claude assembly 3 | 154 |
By repeated optimization, I reduced the number of instructions by a factor of eight. The running time decreases similarly.
Can we get the AIs to rewrite the best option in C? Yes, although you need SIMD intrinsics. So there is no benefit to leaving the code in assembly in this instance.
An open question is whether the AIs could find optimizations that are not possible if we use a higher-level language like C or C++. It is an intriguing question that I will seek to answer later. For the time being, the AIs can beat my C++ compiler!
2026-03-30 02:18:01

Consider the following problem. You have a large set of strings, maybe millions. You need to map these strings to 8-byte integers (uint64). These integers are given to you.
If you are working in Go, the standard solution is to create a map. The construction is trivial, something like the following loop.
m := make(map[string]uint64, N)
for i, k := range keys {
    m[k] = values[i]
}
One downside is that the map may use over 50 bytes per entry.
In important scenarios, we might have the following conditions: the map is large (a million entries or more), you do not need to modify it dynamically (it is immutable), and all queried keys are in the set. In such conditions, you can reduce the memory usage down to almost the size of the keys, so about 8 bytes per entry. One fast technique relies on binary fuse filters.
I implemented this technique as a Go library called constmap, which provides an immutable map from strings to uint64 values using binary fuse filters. This data structure is ideal when you have a fixed set of keys at construction time and need fast, memory-efficient lookups afterward. You can even construct the map once and save it to disk, so you do not pay the cost of constructing the map each time you need it.
The usage is just as simple.
package main

import (
    "fmt"
    "log"

    "github.com/lemire/constmap"
)

func main() {
    keys := []string{"apple", "banana", "cherry"}
    values := []uint64{100, 200, 300}
    cm, err := constmap.New(keys, values)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(cm.Map("banana")) // 200
}
The construction time is higher (as expected for any compact data structure), but lookups are optimized for speed. I ran benchmarks on my Apple M4 Max processor to compare constmap lookups against Go’s built-in map[string]uint64. The test uses 1 million keys.
| Data Structure | Lookup Time | Memory Usage |
|---|---|---|
| ConstMap | 7.4 ns/op | 9 bytes/key |
| Go Map | 20 ns/op | 56 bytes/key |
ConstMap is nearly 3 times faster than Go’s standard map for lookups! And we reduced the memory usage by a factor of 6.
Source code. The implementation is available on GitHub: github.com/lemire/constmap.
2026-03-26 08:29:38

The next C++ standard (C++26) is getting exciting new features. One of these features is compile-time reflection. It is ideally suited to serialize and deserialize data at high speed. To test it out, we extended our fast JSON library (simdjson) and we gave a talk at CppCon 2025. The video is out on YouTube.
2026-03-19 05:52:53

Modern processors can execute many instructions per cycle on a single core. To achieve this in practice, they must predict branches. I have made the point over the years that modern CPUs have an incredible ability to predict branches.
It makes benchmarking difficult because if you test on small datasets, you can get surprising results that might not work on real data.
My go-to benchmark is a function like so:
while (howmany != 0) {
  val = generate_random_value()
  if (val is odd) write val to buffer
  decrement howmany
}
The processor tries to predict the branch (if clause). Because we use random values, the processor should mispredict one time out of two.
However, if we repeat the benchmark multiple times, always using the same random values, the processor learns the branches. How many branches can processors learn? I test using three recent processors.
Once more I am disappointed by Intel. AMD is doing wonderfully well on this benchmark.
2026-03-09 04:09:48

Suppose that you have a record of your sales per day. You might want to get a running record where, for each day, you are told how many sales you have made since the start of the year.
| day | sales per day | running sales |
|---|---|---|
| 1 | $10 | $10 |
| 2 | $15 | $25 |
| 3 | $5 | $30 |
Such an operation is called a prefix sum or a scan.
Implementing it in C is not difficult. It is a simple loop.
for (size_t i = 1; i < length; i++) {
  data[i] += data[i - 1];
}
How fast can this function be? We can derive a speed limit rather simply: to compute the current value, you must have computed the previous one, and so forth.
data[0] -> data[1] -> data[2] -> ...
At best, you require one CPU cycle per entry in your table. Thus, on a 4 GHz processor, you might process 4 billion integer values per second. It is an upper bound but you might be able to reach close to it in practice on many modern systems. Of course, there are other instructions involved such as loads, stores and branching, but our processors can execute many instructions per cycle and they can predict branches effectively. So you should be able to process billions of integers per second on most processors today.
Not bad! But can we do better?
We can use SIMD instructions. SIMD instructions are special instructions that process several values at once. All 64-bit ARM processors support NEON instructions. NEON instructions can process four 32-bit integers at once, if they are packed in one SIMD register.
But how do you do the prefix sum on a 4-value register? You can do it with two shifts and two additions. In theory, it scales as log(N) where N is the number of elements in a vector register.
input = [A B C D]
shift1 = [0 A B C]
sum1 = [A A+B B+C C+D]
shift2 = [0 0 A A+B]
result = [A A+B A+B+C A+B+C+D]
You can then extract the last value (A+B+C+D) and broadcast it to all positions so that you can add it to the next value.
Is this faster than the scalar approach? We have 4 instructions in sequence, plus at least one instruction if you want to use the total sum in the next block of four values.
Thus the SIMD approach might be worse. It is disappointing.
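To convince yourself that the two-shift, two-add recipe is right, it can be simulated with scalar C++ (my own illustration; on NEON the shifts correspond to vextq_u32 with a zero vector):

```cpp
#include <array>
#include <cstdint>

// Scalar simulation of the in-register prefix sum over four lanes.
// Each "shift" moves lanes up by one or two positions, inserting zeros.
std::array<uint32_t, 4> prefix4(std::array<uint32_t, 4> v) {
  // shift by one lane, then add: [A, A+B, B+C, C+D]
  std::array<uint32_t, 4> s1 = {0, v[0], v[1], v[2]};
  for (int i = 0; i < 4; i++) v[i] += s1[i];
  // shift by two lanes, then add: [A, A+B, A+B+C, A+B+C+D]
  std::array<uint32_t, 4> s2 = {0, 0, v[0], v[1]};
  for (int i = 0; i < 4; i++) v[i] += s2[i];
  return v;
}
```

The last lane of the result holds the block total, which is the value that gets broadcast and carried into the next block of four.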
A solution might be to scale up over many more integer values.
Consider ARM NEON, which has interleaved load and store instructions. With these, you can load 16 values at once and gather all the first values together, all the second values together, and so forth.
original data : ABCD EFGH IJKL MNOP
loaded data : AEIM BFJN CGKO DHLP
Then we can do a prefix sum over the four blocks in parallel. It takes three instructions. At the end of the three instructions, one register contains the local sums:
A+B+C+D E+F+G+H I+J+K+L M+N+O+P
And then we can apply our prefix sum recipe on this register (4 instructions). You might end up with something like 8 sequential instructions per block of 16 values.
It is theoretically twice as fast as the scalar approach.
In C with intrinsics, you might code it as follows.
#include <arm_neon.h>

void neon_prefixsum_fast(uint32_t *data, size_t length) {
  uint32x4_t zero = {0, 0, 0, 0};
  uint32x4_t prev = {0, 0, 0, 0};
  for (size_t i = 0; i < length / 16; i++) {
    uint32x4x4_t vals = vld4q_u32(data + 16 * i);
    // Prefix sum inside each transposed ("vertical") lane
    vals.val[1] = vaddq_u32(vals.val[1], vals.val[0]);
    vals.val[2] = vaddq_u32(vals.val[2], vals.val[1]);
    vals.val[3] = vaddq_u32(vals.val[3], vals.val[2]);
    // Now vals.val[3] contains the four local prefix sums:
    // vals.val[3] = [s0=A+B+C+D, s1=E+F+G+H,
    //                s2=I+J+K+L, s3=M+N+O+P]
    // Compute prefix sum across the four local sums
    uint32x4_t off = vextq_u32(zero, vals.val[3], 3);
    uint32x4_t ps = vaddq_u32(vals.val[3], off);
    off = vextq_u32(zero, ps, 2);
    ps = vaddq_u32(ps, off);
    // Now ps contains cumulative sums across the four groups.
    // Add the incoming carry from the previous 16-element block.
    ps = vaddq_u32(ps, prev);
    // The add vector to apply to the original lanes is the prefix up
    // to the previous group: [carry, ps0, ps1, ps2]. At this point
    // prev still holds the old carry in every lane, so its last lane
    // supplies the carry for lane 0.
    uint32x4_t add = vextq_u32(prev, ps, 3);
    // Prepare carry for next block: broadcast the last lane of ps
    prev = vdupq_laneq_u32(ps, 3);
    // Apply carry/offset to each of the four transposed lanes
    vals.val[0] = vaddq_u32(vals.val[0], add);
    vals.val[1] = vaddq_u32(vals.val[1], add);
    vals.val[2] = vaddq_u32(vals.val[2], add);
    vals.val[3] = vaddq_u32(vals.val[3], add);
    // Store back the four lanes (interleaved)
    vst4q_u32(data + 16 * i, vals);
  }
  // Finish the remaining length % 16 values with the scalar loop.
  scalar_prefixsum_leftover(data, length, 16);
}
Let us try it out on an Apple M4 processor (4.5 GHz).
| method | billions of values/s |
|---|---|
| scalar | 3.9 |
| naive SIMD | 3.6 |
| fast SIMD | 8.9 |
So the SIMD approach is about 2.3 times faster than the scalar approach. Not bad.
My source code is available on GitHub.
Appendix. Intrinsics
| Intrinsic | What it does |
|---|---|
| vld4q_u32 | Loads 16 consecutive 32-bit unsigned integers from memory and deinterleaves them into 4 separate uint32x4_t vectors (lane 0 = elements 0, 4, 8, 12, …; lane 1 = elements 1, 5, 9, 13, …; etc.). |
| vaddq_u32 | Adds corresponding 32-bit unsigned integer lanes from two vectors (a[i] + b[i] for each of the 4 lanes). |
| vextq_u32 | Concatenates a and b, then takes 4 lanes starting from lane n of the 8-lane concatenation. Used to implement shifts that insert zeros (when a is the zero vector). |
| vdupq_laneq_u32 | Broadcasts (duplicates) the value from the specified lane (0–3) of the input vector to all 4 lanes of the result. |
| vdupq_n_u32 (implied usage) | Sets all 4 lanes of the result to the same scalar value (commonly used for zero or a broadcast). |
2026-03-05 22:40:58

The Internet relies on text formats. Thus, we spend a lot of time producing and consuming data encoded in text.
Your web pages are HTML. The code running in them is JavaScript, sent as text (JavaScript source), not as already-parsed code. Your emails, including their attachments, are sent as text (your binary files are sent as text).
It does not stop there. The Python code that runs your server is stored as text. It queries data by sending text queries. It often gets back the answer as text that must then be decoded.
JSON is the universal data interchange format online today. We share maps as JSON (GeoJSON).
Not everything is text, of course. There is no common video or image format that is shared as text. Transmissions over the Internet are routinely compressed to binary formats. There are popular binary formats that compete with JSON.
But why is text dominant?
It is not because, back in the 1970s, programmers did not know about binary formats.
In fact, we did not start with text formats. Initially, we worked with raw binary data. Those of us old enough will remember programming in assembly using raw byte values.
Why did text win?
1. Text is efficient.
In the XML era, when everything had to be XML, there were countless proposals for binary formats. People were sometimes surprised to find that the binary approach was not much faster in practice. Remember that many text formats date back to an era when computers were much slower. Had text been a performance bottleneck, it would not have spread. Of course, there are cases where text makes things slower. You then have a choice: optimize your code further or transition to another format. Often, both are viable.
It is easy to make wrong assumptions about binary formats, such as that you can consume them without any parsing or validation. If you pick up data from the Internet, you must assume that it could have been sent by an adversary or someone who does not follow your conventions.
2. Text is easy to work with.
If you receive text from a remote source, you can often transform it, index it, search it, quote it, version it… with little effort and without in-depth knowledge of the format. Text is often self-documenting.
In an open world, when you will never speak with the person producing the data, text often makes everything easier and smoother.
If there is an issue to report and the data is in text, you can usually copy-paste the relevant section into a message. Things are much harder with a binary format.