Daniel Lemire

Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.
Blog of Daniel Lemire

A Fast Immutable Map in Go

2026-03-30 02:18:01

Consider the following problem. You have a large set of strings, maybe millions. You need to map these strings to 8-byte integers (uint64). These integers are given to you.

If you are working in Go, the standard solution is to create a map. The construction is trivial, something like the following loop.

m := make(map[string]uint64, N)
for i, k := range keys {
    m[k] = values[i]
}

One downside is that the map may use over 50 bytes per entry.

In important scenarios, we might have the following conditions: the map is large (a million entries or more), you do not need to modify it dynamically (it is immutable), and all queried keys are in the set. Under these conditions, you can reduce the memory usage down to almost the size of the keys, so about 8 bytes per entry. One fast technique relies on binary fuse filters.

I implemented it as a Go library called constmap: an immutable map from strings to uint64 values built on binary fuse filters. This data structure is ideal when you have a fixed set of keys at construction time and need fast, memory-efficient lookups afterward. You can even construct the map once and save it to disk, so you do not pay the construction cost each time you need the map.

The usage is just as simple.

package main

import (
    "fmt"
    "log"

    "github.com/lemire/constmap"
)

func main() {
    keys := []string{"apple", "banana", "cherry"}
    values := []uint64{100, 200, 300}

    cm, err := constmap.New(keys, values)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(cm.Map("banana")) // 200
}

The construction time is higher (as expected for any compact data structure), but lookups are optimized for speed. I ran benchmarks on my Apple M4 Max processor to compare constmap lookups against Go’s built-in map[string]uint64. The test uses 1 million keys.

Data Structure   Lookup Time   Memory Usage
ConstMap         7.4 ns/op     9 bytes/key
Go Map           20 ns/op      56 bytes/key

ConstMap is nearly 3 times faster than Go’s standard map for lookups! And we reduced the memory usage by a factor of 6.
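To reproduce the built-in-map side of this comparison, a minimal timing harness might look like the sketch below. It only times Go's standard map (the constmap side requires the library); the key format and sizes are my own choices, not the author's benchmark code.

```go
package main

import (
	"fmt"
	"time"
)

// sumLookups performs one lookup per key and folds the results into a
// sum so the compiler cannot eliminate the map accesses.
func sumLookups(m map[string]uint64, keys []string) uint64 {
	var sum uint64
	for _, k := range keys {
		sum += m[k]
	}
	return sum
}

func main() {
	const N = 1_000_000
	m := make(map[string]uint64, N)
	keys := make([]string, N)
	for i := 0; i < N; i++ {
		keys[i] = fmt.Sprintf("key-%d", i)
		m[keys[i]] = uint64(i)
	}
	start := time.Now()
	sum := sumLookups(m, keys)
	elapsed := time.Since(start)
	fmt.Printf("%.1f ns/lookup (checksum %d)\n",
		float64(elapsed.Nanoseconds())/float64(N), sum)
}
```

A single pass over one million keys smooths out per-lookup noise, but a serious measurement would use Go's testing.B benchmarks instead of a one-shot timer.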

Source code: the implementation is available on GitHub at github.com/lemire/constmap.

JSON and C++26 compile-time reflection: a talk

2026-03-26 08:29:38

The next C++ standard (C++26) is getting exciting new features. One of these features is compile-time reflection. It is ideally suited to serialize and deserialize data at high speed. To test it out, we extended our fast JSON library (simdjson) and we gave a talk at CppCon 2025. The video is out on YouTube.

Our slides are also available.

How many branches can your CPU predict?

2026-03-19 05:52:53

Modern processors can execute many instructions per cycle on a single core. To do so in practice, they must predict branches. I have made the point over the years that modern CPUs have an incredible ability to predict branches.

It makes benchmarking difficult because if you test on small datasets, you can get surprising results that might not work on real data.

My go-to benchmark is a function like so:

while (howmany != 0) {
    uint64_t val = generate_random_value();
    if (val % 2 == 1) { // odd value: a hard-to-predict branch
        *buffer++ = val;
    }
    howmany--;
}

The processor tries to predict the branch (if clause). Because we use random values, the processor should mispredict one time out of two.

However, if we repeat the benchmark multiple times, always using the same random values, the processor learns the branch outcomes. How many branches can a processor learn? I tested three recent processors.
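For concreteness, here is a Go rendition of that benchmark loop (a sketch of my own, not the author's C source). Running it repeatedly over the same slice of random values is exactly what allows the predictor to memorize the branch outcomes.

```go
package main

import (
	"fmt"
	"math/rand"
)

// countOddWrites runs the branchy loop once: for each value, a
// data-dependent branch decides whether to append it to the buffer.
func countOddWrites(vals []uint64) []uint64 {
	buffer := make([]uint64, 0, len(vals))
	for _, v := range vals {
		if v&1 == 1 { // the branch the predictor must guess
			buffer = append(buffer, v)
		}
	}
	return buffer
}

func main() {
	rng := rand.New(rand.NewSource(42))
	vals := make([]uint64, 10000)
	for i := range vals {
		vals[i] = rng.Uint64()
	}
	// Timing several passes over the *same* values would show the
	// effect: if the predictor can memorize all 10,000 outcomes,
	// later passes run measurably faster than the first.
	out := countOddWrites(vals)
	fmt.Printf("%d of %d values were odd\n", len(out), len(vals))
}
```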

  • The AMD Zen 5 processor can predict perfectly 30,000 branches.
  • The Apple M4 processor can predict perfectly 10,000 branches.
  • Intel Emerald Rapids can predict perfectly 5,000 branches.

Once more I am disappointed by Intel. AMD is doing wonderfully well on this benchmark.

My source code is available.

Prefix sums at tens of gigabytes per second with ARM NEON

2026-03-09 04:09:48

Suppose that you have a record of your sales per day. You might want to get a running record where, for each day, you are told how many sales you have made since the start of the year.

day   sales per day   running sales
1     $10             $10
2     $15             $25
3     $5              $30

Such an operation is called a prefix sum or a scan.

Implementing it in C is not difficult. It is a simple loop.

  for (size_t i = 1; i < length; i++) {
    data[i] += data[i - 1];
  }

How fast can this function be? We can derive a speed limit rather simply: to compute the current value, you must have computed the previous one, and so forth.

data[0] -> data[1] -> data[2] -> ...

At best, you require one CPU cycle per entry in your table. Thus, on a 4 GHz processor, you might process 4 billion integer values per second. It is an upper bound but you might be able to reach close to it in practice on many modern systems. Of course, there are other instructions involved such as loads, stores and branching, but our processors can execute many instructions per cycle and they can predict branches effectively. So you should be able to process billions of integers per second on most processors today.

Not bad! But can we do better?

We can use SIMD instructions. SIMD instructions are special instructions that process several values at once. All 64-bit ARM processors support NEON instructions. NEON instructions can process four integers at once, if they are packed in one SIMD register.

But how do you do the prefix sum on a 4-value register? You can do it with two shifts and two additions. In general, this approach scales as log(N) steps where N is the number of elements in a vector register.

input   = [A B   C     D]
shift1  = [0 A   B     C]
sum1    = [A A+B B+C   C+D]
shift2  = [0 0   A     B+A]
result  = [A A+B A+B+C A+B+C+D]

You can then extract the last value (A+B+C+D) and broadcast it to all positions so that you can add it to the next value.
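The shift-and-add recipe is easy to check with a scalar simulation. This Go sketch (my own, for illustration) mimics the shift-by-one and shift-by-two additions on a 4-lane "register":

```go
package main

import "fmt"

// prefix4 computes the prefix sum of a 4-lane vector using the
// shift-by-1/add then shift-by-2/add pattern from the diagram above.
func prefix4(v [4]uint32) [4]uint32 {
	// shift in a zero, add: [A, A+B, B+C, C+D]
	s1 := [4]uint32{0, v[0], v[1], v[2]}
	for i := range v {
		v[i] += s1[i]
	}
	// shift by two lanes, add: [A, A+B, A+B+C, A+B+C+D]
	s2 := [4]uint32{0, 0, v[0], v[1]}
	for i := range v {
		v[i] += s2[i]
	}
	return v
}

func main() {
	fmt.Println(prefix4([4]uint32{1, 2, 3, 4})) // [1 3 6 10]
}
```

On real hardware the two "shifts" would be lane-shift instructions (vext on NEON), but the data flow is the same.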

Is this faster than the scalar approach? We have 4 instructions in sequence, plus at least one instruction if you want to use the total sum in the next block of four values.

Thus the SIMD approach might be worse. It is disappointing.

A solution might be to scale up over many more integer values.

ARM NEON has interleaved load and store instructions: you can load 16 values at once, gathering the first value of each group of four into one register, the second values into another, and so forth.

original data : ABCD EFGH IJKL MNOP
loaded data   : AEIM BFJN CGKO DHLP

Then I can do a prefix sum over the four blocks in parallel. It takes three instructions. At the end of the three instructions, we have one register which contains the local sums:

A+B+C+D E+F+G+H I+J+K+L M+N+O+P

And then we can apply our prefix sum recipe on this register (4 instructions). You might end up with something like 8 sequential instructions per block of 16 values.

It is theoretically twice as fast as the scalar approach.

In C with intrinsics, you might code it as follows.

void neon_prefixsum_fast(uint32_t *data, size_t length) {
  uint32x4_t zero = {0, 0, 0, 0};
  uint32x4_t prev = {0, 0, 0, 0};

  for (size_t i = 0; i < length / 16; i++) {
    uint32x4x4_t vals = vld4q_u32(data + 16 * i);

    // Prefix sum inside each transposed ("vertical") lane
    vals.val[1] = vaddq_u32(vals.val[1], vals.val[0]);
    vals.val[2] = vaddq_u32(vals.val[2], vals.val[1]);
    vals.val[3] = vaddq_u32(vals.val[3], vals.val[2]);

    // Now vals.val[3] contains the four local prefix sums:
    //   vals.val[3] = [s0=A+B+C+D, s1=E+F+G+H, 
    //                  s2=I+J+K+L, s3=M+N+O+P]

    // Compute prefix sum across the four local sums 
    uint32x4_t off = vextq_u32(zero, vals.val[3], 3);
    uint32x4_t ps = vaddq_u32(vals.val[3], off);       
    off = vextq_u32(zero, ps, 2);                      
    ps = vaddq_u32(ps, off);

    // Now ps contains cumulative sums across the four groups
    // Add the incoming carry from the previous 16-element block
    ps = vaddq_u32(ps, prev);

    // The add vector for the original lanes holds, for each group,
    // the cumulative sum up to (but excluding) that group; it must be
    // computed before prev is updated so lane 0 gets the old carry
    uint32x4_t add = vextq_u32(prev, ps, 3);

    // Prepare the carry for the next block: broadcast the last lane of ps
    prev = vdupq_laneq_u32(ps, 3);

    // Apply carry/offset to each of the four transposed lanes
    vals.val[0] = vaddq_u32(vals.val[0], add);
    vals.val[1] = vaddq_u32(vals.val[1], add);
    vals.val[2] = vaddq_u32(vals.val[2], add);
    vals.val[3] = vaddq_u32(vals.val[3], add);

    // Store back the four lanes (interleaved)
    vst4q_u32(data + 16 * i, vals);
  }

  scalar_prefixsum_leftover(data, length, 16);
}

Let us try it out on an Apple M4 processor (4.5 GHz).

method       billions of values/s
scalar       3.9
naive SIMD   3.6
fast SIMD    8.9

So the SIMD approach is about 2.3 times faster than the scalar approach. Not bad.

My source code is available on GitHub.

Appendix. Intrinsics

Intrinsic What it does
vld4q_u32 Loads 16 consecutive 32-bit unsigned integers from memory and deinterleaves them into 4 separate uint32x4_t vectors (lane 0 = elements 0,4,8,12,…; lane 1 = 1,5,9,13,… etc.).
vaddq_u32 Adds corresponding 32-bit unsigned integer lanes from two vectors (a[i] + b[i] for each of 4 lanes).
vextq_u32 Concatenates a and b, then takes 4 lanes starting from lane n of the 8-lane concatenation. Used to implement lane shifts by passing a zero vector as a.
vdupq_laneq_u32 Broadcasts (duplicates) the value from the specified lane (0–3) of the input vector to all 4 lanes of the result.
vdupq_n_u32 (implied usage) Sets all 4 lanes of the result to the same scalar value (commonly used for zero or broadcast).

Text formats are everywhere. Why?

2026-03-05 22:40:58

The Internet relies on text formats. Thus, we spend a lot of time producing and consuming data encoded in text.

Your web pages are HTML. The code running in them is JavaScript, sent as text (JavaScript source), not as already-parsed code. Your emails are sent as text, and so are their attachments: even binary files are encoded as text.

It does not stop there. The Python code that runs your server is stored as text. It queries data by sending text queries. It often gets back the answer as text that must then be decoded.

JSON is the universal data interchange format online today. We share maps as JSON (GeoJSON).

Not everything is text, of course. There is no common video or image format that is shared as text. Transmissions over the Internet are routinely compressed to binary formats. There are popular binary formats that compete with JSON.

But why is text dominant?

It is not because, back in the 1970s, programmers did not know about binary formats.

In fact, we did not start with text formats. Initially, we worked with raw binary data. Those of us old enough will remember programming in assembly using raw byte values.

Why did text win?

1. Text is efficient.

In the XML era, when everything had to be XML, there were countless proposals for binary formats. People were sometimes surprised to find that the binary approach was not much faster in practice. Remember that many text formats date back to an era when computers were much slower. Had text been a performance bottleneck, it would not have spread. Of course, there are cases where text makes things slower. You then have a choice: optimize your code further or transition to another format. Often, both are viable.

It is easy to make wrong assumptions about binary formats, such as that you can consume them without any parsing or validation. If you pick up data from the Internet, you must assume that it could have been sent by an adversary or someone who does not follow your conventions.

2. Text is easy to work with.

If you receive text from a remote source, you can often transform it, index it, search it, quote it, version it… with little effort and without in-depth knowledge of the format. Text is often self-documenting.

In an open world, when you will never speak with the person producing the data, text often makes everything easier and smoother.

If there is an issue to report and the data is in text, you can usually copy-paste the relevant section into a message. Things are much harder with a binary format.

You can use newline characters in URLs

2026-03-01 03:21:39

We locate web content using special addresses called URLs. We are all familiar with addresses like https://google.com. Sometimes, URLs get long and become difficult to read. Thus, we might be tempted to format them in HTML using newline and tab characters, like so:

<a href="https://lemire.me/blog/2026/02/21/
        how-fast-do-browsers-correct-utf-16-strings/">my blog post</a>

It will work.

Let us refer to the WHATWG URL specification that browsers follow. It makes two statements in sequence.

  1. If input contains any ASCII tab or newline, invalid-URL-unit validation error.
  2. Remove all ASCII tab or newline from input.

Notice how it reports an error if there is a tab or newline character, but continues anyway? The specification says that "a validation error does not mean that the parser terminates" and it encourages systems to report errors somewhere. Effectively, the error is ignored, although it might be logged. Thus our HTML is fine in practice.

The following is also fine:

<a href="https://go
ogle.c
om" class="button">Visit Google</a>

You can also use tabs. But you cannot arbitrarily insert any other whitespace.
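The removal step from the specification is easy to mimic. This Go sketch (my own helper, not a standard-library function) strips the characters a WHATWG-compliant parser removes before parsing:

```go
package main

import (
	"fmt"
	"strings"
)

// removeTabNewline mimics the WHATWG URL parser's preprocessing step:
// ASCII tab (0x09), line feed (0x0A) and carriage return (0x0D) are
// removed from the input before the URL is parsed.
func removeTabNewline(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\t', '\n', '\r':
			return -1 // drop the character
		}
		return r
	}, s)
}

func main() {
	fmt.Println(removeTabNewline("https://go\nogle.c\nom"))
	// https://google.com
}
```

Note that Go's own net/url package does not perform this removal; it follows RFC 3986 conventions rather than the WHATWG specification.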

Yet there are cases when you can use any ASCII whitespace character: data URLs. Data URLs (also called data URIs) embed small files—like images, text, or other content—directly inside a URL string, instead of linking to an external resource. Data URLs are a special kind of URL and they follow different rules.

A typical data URL might look like data:image/png;base64,iVBORw0KGgoAAAANSUhEUg... where the string iVBORw0KGgoAAAANSUhEUg... is the binary data of the image that has been encoded with base64. Base64 is a text format that can represent any binary content: we use 64 ASCII characters so that each character encodes 6 bits. Your binary email attachments are base64 encoded.

On the web, when decoding a base64 string, you ignore all ASCII whitespace (including the space character itself). Thus you can embed a PNG image in HTML as follows.

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA
                                 QAAAAECAIAAAAmkwkpAAAAEUl
                                 EQVR4nGP8z4AATEhsPBwAM9EB
                                 BzDn4UwAAAAASUVORK5CYII=" />

This HTML code is valid and will insert a tiny image in your page.

But there is more. A data URL can also be used to insert an SVG image. SVG (Scalable Vector Graphics) is an XML-based vector image format that describes 2D graphics using mathematical paths, shapes, and text instead of pixels.

The following should draw a very simple sunset:

<img src='data:image/svg+xml,
<svg width="200" height="200" 
     xmlns="http://www.w3.org/2000/svg">
  <rect width="100%" height="100%" fill="blue" /> 
  <!-- the sky -->
  <circle cx="100" cy="110" r="50" fill="yellow" />  
  <!-- the sun -->
  <rect x="0" y="120" width="200" height="80" fill="brown" />  
  <!-- the ground -->
</svg>' />

Observe how I was able to format the SVG code so that it is readable.

Further reading: Nizipli, Y., & Lemire, D. (2024). Parsing millions of URLs per second. Software: Practice and Experience, 54(5), 744-758.