
How fast do browsers correct UTF-16 strings?

2026-02-22 04:07:17

JavaScript represents strings using Unicode, like most programming languages today. Each character in a JavaScript string is stored as one or two 16-bit code units (UTF-16). The following JavaScript code might surprise some programmers because a single character becomes two 16-bit code units.

> t="🧰"
'🧰'
> t.length
2
> t[0]
'\ud83e'
> t[1]
'\uddf0'

The convention is that \uddf0 denotes the 16-bit value 0xDDF0, also written U+DDF0.

The UTF-16 standard is relatively simple. There are three types of values: high surrogates (the range U+D800 to U+DBFF), low surrogates (U+DC00 to U+DFFF), and all other code units (U+0000–U+D7FF together with U+E000–U+FFFF). A high surrogate must always be followed by a low surrogate, and a low surrogate must always be preceded by a high surrogate.

What happens if you break the rules and have, say, a high surrogate followed by another high surrogate? Then you have an invalid string. We can correct such strings by patching them: we replace the bad values with the replacement character (\ufffd), which is often displayed as a question mark inside a diamond.
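
Conceptually, the repair is a single linear scan over the code units. Here is an illustrative plain-JavaScript version of the patching logic (not how engines actually implement it, which is far more optimized):

function toWellFormedManual(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) { // high surrogate
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) { // valid surrogate pair
        out += s[i] + s[i + 1];
        i++;
        continue;
      }
      out += "\ufffd"; // lone high surrogate
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      out += "\ufffd"; // lone low surrogate
    } else {
      out += s[i]; // ordinary code unit
    }
  }
  return out;
}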

To correct a broken string in JavaScript, you can call the toWellFormed method.

> t = '\uddf0\uddf0'
'\uddf0\uddf0'
> t.toWellFormed()
'��'
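
If you only need to detect broken strings, there is a companion method, isWellFormed:

> '\uddf0\uddf0'.isWellFormed()
false
> '\uddf0\uddf0'.toWellFormed().isWellFormed()
true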

How fast is it?

I wrote a small benchmark that you can test online to measure its speed. I use broken strings of various sizes up to a few kilobytes. I run the benchmarks on my Apple M4 processor using different browsers.
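
The measurement looks roughly like the following sketch; the input string, sizes and timing loop are my assumptions, not the exact online benchmark.

function benchmark(size) {
  const broken = "\udddd".repeat(size); // lone low surrogates everywhere
  const start = performance.now();
  let iterations = 0;
  do {
    broken.toWellFormed();
    iterations++;
  } while (performance.now() - start < 500); // run for about half a second
  const seconds = (performance.now() - start) / 1000;
  const bytes = 2 * size * iterations; // two bytes per UTF-16 code unit
  return bytes / seconds / 1024 ** 3; // GiB/s
}
console.log(benchmark(4096).toFixed(1), "GiB/s");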

Browser Speed
Safari 18.6 1 GiB/s
Firefox 147 3 GiB/s
Chrome 145 15 GiB/s

Quite a range of performance! The speed of other Chromium-based browsers (Brave and Edge) is much the same as Chrome's.

I also tested with JavaScript runtimes.

Engine Speed
Node.js v25.5.0 16 GiB/s
Bun 1.3.9 8.4 GiB/s

Usually Bun is faster than Node, but in this instance, Node is twice as fast as Bun.

Thus you can correct strings in JavaScript at over ten gigabytes per second in Chromium-based browsers.

How bad can Python stop-the-world pauses get?

2026-02-16 04:02:29

When programming, we need to allocate memory and then deallocate it. If you program in C, you get used to the malloc/free functions. Sadly, this leaves you vulnerable to memory leaks: unrecovered memory. Most popular programming languages today use automated memory management: Java, JavaScript, Python, C#, Go, Swift and so forth.

There are essentially two types of automated memory management. The simplest method is reference counting: you track how many references there are to each object, and when an object has no more references, you can free the memory associated with it. Swift and Python use reference counting. The downside of reference counting is circular references. Your main program may reference object A, then you add an object B which references object A, and you make object A reference object B in turn. Object B now has one reference while object A has two. If your main program drops its reference to object A, both objects A and B still have a reference count of one, yet they should be freed.

To solve this problem, you could visit all of your objects to detect which ones are unreachable, including A and B. However, it takes time to do so. Hence the other popular approach to automated memory management: generational garbage collection. It exploits the fact that most memory gets released soon after allocation: you track young objects and visit them from time to time, and, more rarely, you do a full scan. The downside of generational garbage collection is that typical implementations stop the world to scan the memory; in many instances, your entire program is paused. There are many variations on the implementation, with decades of research behind them.
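
A minimal demonstration of the cycle problem in Python; the class and attribute names are mine:

import gc

class Box:
    pass

a = Box()
b = Box()
a.other = b  # A references B
b.other = a  # B references A: a reference cycle
del a, b     # both reference counts stay at one; nothing is freed yet
print(gc.collect(), "unreachable objects collected")  # the cycle collector reclaims them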

The standard Python implementation (CPython) has both types: reference counting and generational garbage collection. The generational component can trigger pauses. A lot of servers are written in Python, which means that your service might just become unavailable for a time. We often call these ‘stop the world’ pauses. How long can such a pause get?

To test this out, I wrote a Python function to create a classical linked list:

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
    def add_next(self, node):
        self.next = node

def create_linked_list(limit):
    """ create a linked list of length 'limit' """
    head = Node(0)
    current = head
    for i in range(1, limit):
        new_node = Node(i)
        current.add_next(new_node)
        current = new_node
    return head

I then create one large linked list and, in a tight loop, create small linked lists that are immediately discarded.

x = create_linked_list(50_000_000)
for i in range(1000000):
    create_linked_list(1000)

A key characteristic of my code is the 50-million-node linked list. It does not get released until the end of the program, but the garbage collector may still examine it.

I record the maximum delay between two consecutive iterations of the loop (using time.time()).
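
The measurement looks roughly like this sketch (the exact benchmark is in the linked source code):

import time

x = create_linked_list(50_000_000)  # large, long-lived heap
max_delay = 0.0
previous = time.time()
for i in range(1_000_000):
    create_linked_list(1000)  # immediately becomes garbage
    now = time.time()
    max_delay = max(max_delay, now - previous)
    previous = now
print(f"maximum delay: {max_delay * 1000:.0f} ms")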

How bad can it get? The answer depends on the Python version, and it is not consistent from run to run, so I ran it once and picked whatever result I got. I express the delay in milliseconds.

python version system max delay
3.14 macOS (Apple M4) 320 ms
3.12 Linux (Intel Ice Lake)  2,200 ms

Almost all of this delay (say 320 ms) is due to the garbage collection. Creating a linked list with 1000 elements takes less than a millisecond.

How long is 320 ms? It is a third of a second, long enough for human beings to notice. For reference, a video game drawing the screen 60 times per second has less than 17 ms to draw each frame. The 2,200 ms delay could look like a server crash from the point of view of a user, and would likely trigger a time-out (failed request).

I ported the Python program to Go. It is the same algorithm, but a direct comparison is likely unfair. Still, it gives us a reference.

go version system max delay
1.25 macOS (Apple M4) 50 ms
1.25 Linux (Intel Ice Lake)  33 ms

Thus Go has pauses that are several times shorter than Python's, and there is no catastrophic 2-second pause.

Should these pauses be a concern? Most Python programs do not create so many objects in memory at the same time, so you are not likely to see such long pauses in a simple web app or script. Python gives you a few options, such as gc.set_threshold and gc.freeze, which can help you tune the behaviour, as in the sketch below.
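
For instance, assuming the big list is long-lived, something like this could reduce the pauses; the threshold values are purely illustrative:

import gc

# Collect less often: raise the generation-0 threshold
# (defaults vary by version; historically (700, 10, 10) in CPython).
gc.set_threshold(100_000, 50, 50)

x = create_linked_list(50_000_000)
gc.freeze()  # move all surviving objects to a permanent generation the collector ignores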

My source code is available.

 


AI: Igniting the Spark to End Stagnation

2026-02-15 23:19:31

Much of the West has been economically stagnant. Countries like Canada have failed to improve their productivity and standard of living of late: in Canada, there has been no progress in living standards, as measured by per-person GDP, over the past five years. It is hard to overstate how anomalous this is: the USSR collapsed in part because it could only sustain a growth rate of about 1%, far below what the West was capable of. Canada is more stagnant than the USSR.

Late in 2022, some of us got access to a technical breakthrough: AI. In three years, it has become part of our lives. Nearly all students use AI to do research or write essays.

Dallas Fed economists have projected what is arguably the most credible effect that AI might have on our economies: AI should help reverse the post-2008 slowdown and deliver higher living standards, in line with historical technological progress.

It implies a profound and rapid, yet gradual, transformation of our economy. There will still be teachers, accountants, and even translators in the future… but their work will change as it has changed in the past. Accountants do far less arithmetic today; that part of their work has been replaced by software. Even more of their work is about to be replaced by software, improving their productivity further. We will still have teachers, but all our kids, including the poorest ones, will have dedicated always-on tutors: this will be available not just in Canada or the USA, but everywhere. It is up to us to decide who is allowed to build this technology.

AI empowers the individual. An entrepreneur with a small team can get faster access to quality advice, copywriting, and so forth. Artists with an imagination can create more with fewer constraints.

I don’t have to prove these facts: they are fast becoming obvious to the whole world.

New jobs are created. Students of mine work as AI specialists. One of them helps build software providing AI assistance to pharmacists. One of my sons is an AI engineer. These are great jobs.

The conventional explanation for Canada’s stagnation is essentially that we have already harvested all the innovation we are ever going to get. The low-hanging fruit has been picked. Further progress has become inherently difficult because we are already so advanced; there is simply not much room left to improve. In this view, there is no need to rethink our institutions. Yet a sufficiently large breakthrough compels us to reconsider where we stand and what is still possible. It forces us to use our imagination again. It helps renew the culture.

We often hear claims that artificial intelligence will consume vast amounts of energy and water in the coming years. It is true that data centers, which host AI workloads along with many other computing tasks, rely on water for cooling.

But let’s look at the actual water numbers. In 2023, U.S. data centers directly consumed roughly 17.4 billion gallons of water, a figure that could plausibly double or quadruple by 2028 as demand grows. By comparison, American golf courses use more than 500 billion gallons every year for irrigation, often in arid regions where this usage is widely criticized as wasteful. Even if data-center water demand kept doubling, it would take nearly five doublings (log2(500/17.4) ≈ 4.8), that is, decades, to reach the scale of golf-course irrigation.

On the energy side, data centers are indeed taking a larger share of electricity demand. According to the International Energy Agency's latest analysis, they consumed approximately 415 TWh in 2024, about 1.5% of global electricity consumption. This is projected to more than double to around 945 TWh by 2030 (just under 3% of global electricity). However, even this rapid growth accounts for only roughly 8% of the total expected increase in worldwide electricity demand through 2030. Data centers are therefore not the main driver of the much larger rise in overall energy use.

If we leave engineers in Australia, Canada, or Argentina free to innovate, we will surely see fantastic developments.

You might also have heard about the possibility that ChatGPT might decide to kill us all. Nobody can predict the future, but you are surely more likely to be killed by cancer than by a rogue AI. And AI might help you with your cancer.

We always have a choice. Nations can try to regulate AI out of existence. We can set up new government bodies to prevent the application of AI. This will surely dampen the productivity gains and marginalize some nations economically.

The European Union showed it could be done. By some reports, Europeans make more money by fining American software companies than by building their own innovation enterprises. Countries like Canada have economies dominated by finance, mining and oil (with a side of Shopify).

If you are already well off, stopping innovation sounds good. It does not if you are trying to get a start.

AI is likely to help young people, who need it so much. They, more than any other group, will find it easier to fill the new jobs and start the new businesses.

If you are a politician and you want to lose the vote of young people: make it difficult to use AI. It will crater your credibility.

It is time to renew our prosperity. It is time to create new exciting jobs.

 

References:

Wynne, M. A., & Derr, L. (2025, June 24). Advances in AI will boost productivity, living standards over time. Federal Reserve Bank of Dallas.

Fraser Institute. (2025, December 16). Canada’s recent economic growth performance has been awful.

DemandSage. (2026, January 9). 75 AI in education statistics 2026 (Global trends & facts).

MIT Technology Review. (2026, January 21). Rethinking AI’s future in an augmented workplace.

Davis, J. H. (2025). Coming into view: How AI and other megatrends will shape your investments. Wiley.

Choi, J. H., & Xie, C. (2025, June 26). AI is reshaping accounting jobs by doing the boring stuff. Stanford Graduate School of Business.

International Energy Agency. (n.d.). Energy demand from AI.

University of Colorado Anschutz Medical Campus. (2025, May 19). Real talk about AI and advancing cancer treatments.

International Energy Agency. (2025). Global energy review 2025.

The cost of a function call

2026-02-09 04:11:32

When programming, we chain functions together. Function A calls function B. And so forth.

You do not have to program this way; you could write an entire program using a single function. It would be a fun exercise to write a non-trivial program as a single function… as long as you delegate the code writing to AI, because human beings quickly struggle with long functions.

A key compiler optimization is ‘inlining’: the compiler takes your function definition and tries to substitute it at the call site. It is conceptually quite simple. Consider the following example where the function add3 calls the function add.

int add(int x, int y) {
    return x + y;
}

int add3(int x, int y, int z) {
    return add(add(x, y), z);
}

You can manually inline the call as follows.

int add3(int x, int y, int z) {
    return x + y + z;
}

A function call is reasonably cheap performance-wise, but not free. If the function takes non-trivial parameters, you might need to save and restore them on the stack, so you get extra loads and stores. You need to jump into the function, and then jump back out at the end. And depending on the calling convention on your system, and the type of instructions you are using, there are extra instructions at the beginning and at the end.
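
In a benchmark, you typically want to pin down each variant explicitly rather than leave the decision to the optimizer. One common way, assuming GCC or Clang (the exact setup used here may differ), is function attributes:

// force a real call, preserving the call overhead
__attribute__((noinline))
int add_called(int x, int y) { return x + y; }

// force the body to be substituted at every call site
__attribute__((always_inline)) inline
int add_inlined(int x, int y) { return x + y; }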

If a function is sufficiently simple, such as my add function, it should always be inlined when performance is critical. Let us examine a concrete example. Let me sum the integers in an array.

for (int x : numbers) {
  sum = add(sum, x);
}

I am using my MacBook (M4 processor with LLVM).

function ns/int
regular 0.7
inline 0.03

Wow. The inline version is over 20 times faster.

Let us try to see what is happening. The call site of the ‘add’ function is just a straight loop with a call to the function.

ldr    w1, [x19], #0x4
bl     0x100021740    ; add(int, int)
cmp    x19, x20
b.ne   0x100001368    ; <+28>

The function itself is as cheap as it can be: just two instructions.

add    w0, w1, w0
ret

So, we spend 6 instructions for each addition. It takes about 3 cycles per addition.

What about the inline function?

ldp    q4, q5, [x12, #-0x20]
ldp    q6, q7, [x12], #0x40
add.4s v0, v4, v0
add.4s v1, v5, v1
add.4s v2, v6, v2
add.4s v3, v7, v3
subs   x13, x13, #0x10
b.ne   0x1000013fc    ; <+104>

It is entirely different. The compiler has converted the addition to advanced (SIMD) instructions, processing blocks of 16 integers using 8 instructions. So we are down to half an instruction per integer (from 6 instructions), that is, 12 times fewer instructions. On top of executing fewer instructions, the processor is able to retire more instructions per cycle, for a massive performance boost.

What if we prevented the compiler from using these fancy instructions while still inlining? We still get a significant performance boost (about 10x faster). One possible way to do this is sketched after the table.

function ns/int
regular 0.7
inline 0.03
inline (no SIMD) 0.07
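
I do not know the exact mechanism used for the no-SIMD variant, but assuming Clang, a loop pragma keeps the inlining while disabling vectorization (the file-wide flags -fno-vectorize and -fno-slp-vectorize are an alternative):

int sum = 0;
// still inlined, but compiled to scalar additions
#pragma clang loop vectorize(disable)
for (int x : numbers) {
  sum = add(sum, x);
}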

Ok. But the add function is a bit extreme; we know it should always be inlined. What about something less trivial, like a function that counts the number of spaces in a string?

size_t count_spaces(std::string_view sv) {
    size_t count = 0;
    for (char c : sv) {
        if (c == ' ') ++count;
    }
    return count;
}

If the string is reasonably long, then the overhead of the function call should be negligible.
Let us pass a string of 1000 characters.

function ns/string
regular 111
inline 115

The inline version is not only not faster, but it is even slightly slower. I am not sure why.

What if I use short strings (say between 0 and 6 characters)? Then the inline function is measurably faster.

function ns/string
regular 1.6
inline 1.0

Takeaways:

  1. Short and simple functions should be inlined when possible if performance is a concern. The benefits can be impressive.
  2. For functions that can be fast or slow, the decision as to whether to inline or not depends on the input. For string processing functions, the size of the string may determine whether inlining is necessary for best performance.

Note: My source code is available.

Converting data to hexadecimal outputs quickly

2026-02-02 23:52:27

Given any string of bytes, you can convert it to a hexadecimal string by mapping the most significant and least significant 4 bits of each byte to characters in 0…9, A…F. There are more compact encodings, like base64, that map 3 bytes to 4 characters. However, hexadecimal outputs are easier to understand and often sufficiently concise.

A simple function to do the conversion using a short lookup table is as follows:

static const char hex[] = "0123456789abcdef";
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = hex[val >> 4];
    dst[k + 1] = hex[val & 15];
}

This loop iterates over the input byte array (src) with index i while building the output string in dst with index k, which advances twice as fast since each input byte produces two hexadecimal characters. For each byte, it isolates the high 4 bits (right shift by 4) and the low 4 bits (bitwise AND with 15), and uses them to index a small lookup table (hex) containing the characters ‘0’ through ‘9’ and ‘a’ through ‘f’. The loop continues until k reaches the expected output length (dlen), which should be twice the input length.

This lookup table approach is used in the popular Node.js JavaScript runtime. Skovoroda recently proposed replacing it with an arithmetic version.

char nibble(uint8_t x) { return x + '0' + ((x > 9) * 39); }
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = nibble(val >> 4);
    dst[k + 1] = nibble(val & 15);
}

Surprisingly, perhaps, this approach is much faster and uses far fewer instructions. At first glance, this result might be puzzling: a table lookup is cheap, and the new nibble function seemingly does more work. (As a sanity check, nibble(10) computes 10 + 48 + 39 = 97, the ASCII code of ‘a’; for values up to 9, the correction term vanishes and we get the digits ‘0’ through ‘9’.)

The trick that Skovoroda relies upon is that compilers are smart: they will ‘autovectorize’ such number-crunching functions (if you are lucky). That is, instead of using regular instructions that process byte values, they will use SIMD instructions that process 16 bytes or more at once.

Of course, instead of relying on the compiler, you can manually invoke SIMD instructions through SIMD intrinsic functions. Let us assume that you have an ARM processor (e.g., Apple Silicon). Then you can process blocks of 32 bytes as follows.

uint8x16_t table = vld1q_u8((const uint8_t *)"0123456789abcdef");
size_t i = 0;
size_t maxv = slen - (slen % 32);
for (; i < maxv; i += 32) {
    uint8x16_t val1 = vld1q_u8((uint8_t*)src + i);
    uint8x16_t val2 = vld1q_u8((uint8_t*)src + i + 16);
    uint8x16_t high1 = vshrq_n_u8(val1, 4);
    uint8x16_t low1 = vandq_u8(val1, vdupq_n_u8(15));
    uint8x16_t high2 = vshrq_n_u8(val2, 4);
    uint8x16_t low2 = vandq_u8(val2, vdupq_n_u8(15));
    uint8x16_t high_chars1 = vqtbl1q_u8(table, high1);
    uint8x16_t low_chars1 = vqtbl1q_u8(table, low1);
    uint8x16_t high_chars2 = vqtbl1q_u8(table, high2);
    uint8x16_t low_chars2 = vqtbl1q_u8(table, low2);
    uint8x16x2_t zipped1 = {high_chars1, low_chars1};
    uint8x16x2_t zipped2 = {high_chars2, low_chars2};
    vst2q_u8((uint8_t*)dst + i*2, zipped1);
    vst2q_u8((uint8_t*)dst + i*2 + 32, zipped2);
}

This SIMD code leverages ARM NEON intrinsics to accelerate hexadecimal encoding by processing 32 input bytes at a time. It begins by loading two 16-byte vectors (val1 and val2) from the source array using vld1q_u8. For each vector, it extracts the high nibbles (via right shift by 4 with vshrq_n_u8) and low nibbles (via bitwise AND with 15 using vandq_u8 and vdupq_n_u8). The nibbles are then used as indices into a pre-loaded hex table via vqtbl1q_u8 to fetch the corresponding ASCII characters. Finally, the high and low character vectors are stored in interleaved order with vst2q_u8, so that each input byte produces its high character followed by its low character in the output.

You could do similar work on other systems like x64; a sketch follows. The same idea with AVX-512 on recent Intel and AMD processors would probably be insanely efficient.
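
Here is what an x64 analogue might look like with SSSE3 intrinsics, mirroring the NEON loop above (same src, dst, and slen; an untested sketch, not benchmarked here):

#include <tmmintrin.h> // SSSE3

const __m128i table = _mm_loadu_si128((const __m128i *)"0123456789abcdef");
const __m128i mask = _mm_set1_epi8(0x0f);
size_t maxv = slen - (slen % 16);
for (size_t i = 0; i < maxv; i += 16) {
    __m128i val = _mm_loadu_si128((const __m128i *)((const uint8_t *)src + i));
    // high and low nibbles of each byte
    __m128i high = _mm_and_si128(_mm_srli_epi16(val, 4), mask);
    __m128i low = _mm_and_si128(val, mask);
    // nibble -> ASCII via a 16-byte table lookup (the analogue of vqtbl1q_u8)
    __m128i high_chars = _mm_shuffle_epi8(table, high);
    __m128i low_chars = _mm_shuffle_epi8(table, low);
    // interleave: each byte becomes its high character then its low character
    _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * i),
                     _mm_unpacklo_epi8(high_chars, low_chars));
    _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * i + 16),
                     _mm_unpackhi_epi8(high_chars, low_chars));
}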

Benchmarking these implementations on a dataset of 10,000 random bytes reveals significant performance differences. The basic lookup table version achieves around 3 GB/s, while the arithmetic version, benefiting from compiler autovectorization, reaches 23 GB/s. The manual SIMD NEON versions push performance further: I reach 42 GB/s in my tests.

method speed instructions per byte
table 3.1 GB/s 9
Skovoroda 23 GB/s 0.75
intrinsics 42 GB/s 0.69

One lesson is that intuition can be a poor guide when trying to assess performance.

My source code is available.

Converting floats to strings quickly

2026-02-01 23:23:25

When serializing data to JSON or CSV, or when logging, we convert numbers to strings. Floating-point numbers are stored in binary, but we need them as decimal strings. The first formally published algorithm is Steele and White's Dragon scheme (specifically Dragon4) from 1990. Since then, faster methods have emerged: Grisu3, Ryū, Schubfach, Grisu-Exact, and Dragonbox. In C++17, we have a standard function called std::to_chars for this purpose. A common objective is to generate the shortest string that still uniquely identifies the original number.
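
For instance, with std::to_chars (this tiny program is mine, not from the paper):

#include <charconv>
#include <cstdio>

int main() {
  char buf[32];
  // by default, std::to_chars emits the shortest decimal string
  // that parses back to the exact same float
  auto result = std::to_chars(buf, buf + sizeof(buf), 3.1415927f);
  *result.ptr = '\0';
  std::printf("%s\n", buf); // prints 3.1415927
}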

We recently published Converting Binary Floating-Point Numbers to Shortest Decimal Strings, where we examine the full conversion, from the floating-point number to the string. In practice, the conversion involves two steps: we take the number and compute the significand and the power of 10 (step 1), and then we generate the string (step 2). E.g., for the number pi, you might compute 31415927 and -7 (step 1) before generating the string 3.1415927. The string generation requires placing the dot at the right location and switching to exponential notation when needed. Generating the string is relatively cheap and was probably a negligible cost for older schemes, but as the software got faster, it has become a more important component (20% to 35% of the time).

The results vary quite a bit depending on the numbers being converted, but we find that two implementations tend to do best: Dragonbox by Jeon and Schubfach by Giulietti. The Ryū implementation by Adams is close behind or just as fast. All of these techniques are about 10 times faster than the original Dragon4 from 1990. A tenfold gain in performance over three decades is equivalent to a gain of about 8% per year, entirely due to better implementations and algorithms.

Efficient algorithms use between 200 and 350 instructions for each string generated. We find that the standard function std::to_chars under Linux uses more instructions than needed (up to nearly 2 times too many), so there is room to improve common implementations. Using the popular C++ library fmt is slightly less efficient still.

A fun fact is that we found that none of the available functions generates the shortest possible string. The std::to_chars C++ function renders the number 0.00011 as 0.00011 (7 characters), while the shorter scientific form 1.1e-4 would do. But, by convention, when switching to scientific notation, it is required to pad the exponent to two digits (so 1.1e-04). Beyond this technicality, we found that no implementation always generates the shortest string.

All our code, datasets, and raw results are open-source. The benchmarking suite is at https://github.com/fastfloat/float_serialization_benchmark, and the test data is at https://github.com/fastfloat/float-data.

Reference: Converting Binary Floating-Point Numbers to Shortest Decimal Strings: An Experimental Review, Software: Practice and Experience (to appear).