Blog of Daniel Lemire

Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.

Detect control characters, quotes and backslashes efficiently using ‘SWAR’

2025-04-14 06:36:41

When trying to write fast functions operating over many bytes, we sometimes use ‘SWAR’. SWAR stands for SIMD Within A Register, and it is a technique to perform parallel computations on multiple data elements packed into a single word. It treats a register (e.g., 64-bit) as a vector of smaller units (e.g., 8-bit bytes) and processes them simultaneously without explicit SIMD instructions.

For example, if you have 8 bytes in a 64-bit word x, computing (x - 0x2020202020202020) subtracts 32 (0x20 in hex) from each byte. It works well if each byte value is greater than or equal to 32. Otherwise, the operation ‘overflows’: if the least significant byte is smaller than 32, then its most significant bit is set, and you cannot rely on the other more significant byte results.

It works well if all bytes are ‘ASCII’ (they have their most significant bit unset). When that is the case, you can check if any byte value is less than 32 by computing (x - 0x2020202020202020) and checking if any most significant bit is set, by masking the result: (x - 0x2020202020202020) & 0x8080808080808080.

Similarly, you can check whether the byte value 34 (0x22) appears by first doing an XOR which sets any such byte value to zero, (x ^ 0x2222222222222222), and then subtracting 1 from each byte value: (x ^ 0x2222222222222222) - 0x0101010101010101. It then suffices to mask with 0x80…80 again to see if any byte value was equal to 34, as long as all input byte values were ASCII.

Thankfully, it is easy to select only the byte values that are ASCII: (0x8080808080808080 & ~x) will do.

In JSON and other languages, you need to escape within strings all character values smaller than 32 (control characters in ASCII), equal to 34 ('"' in ASCII) or 92 ('\' in ASCII). So you want to quickly check if a byte is smaller than 32, equal to 34 or 92. With SWAR, you can do it with about ten operations:
bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    uint64_t lt32 = (x - 0x2020202020202020ULL);
    uint64_t sub34 = x ^ 0x2222222222222222ULL;
    uint64_t eq34 = (sub34 - 0x0101010101010101ULL);
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32 | eq34 | eq92) & is_ascii) != 0;
}

That’s slightly more than one operation per input byte.

The following routine is even better (credit to Falk Hüffner):

bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    uint64_t xor2 = x ^ 0x0202020202020202ULL;
    uint64_t lt32_or_eq34 = xor2 - 0x2121212121212121ULL;
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32_or_eq34 | eq92) & is_ascii) != 0;
}

You might be able to do even better.
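To apply this to an entire string, one can read the input eight bytes at a time. Here is a minimal sketch (my own, not from the post): the helper name string_has_json_escapable and the byte-by-byte tail loop are assumptions, while has_json_escapable_byte is the function above.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: scan a buffer eight bytes at a time with the
// SWAR routine, then fall back to a scalar check for the tail.
// Assumes the input is ASCII, as required by has_json_escapable_byte.
bool string_has_json_escapable(const char *data, size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        memcpy(&word, data + i, 8); // safe unaligned load of 8 bytes
        if (has_json_escapable_byte(word)) {
            return true;
        }
    }
    for (; i < len; i++) { // fewer than 8 bytes remain
        uint8_t c = (uint8_t)data[i];
        if (c < 32 || c == '"' || c == '\\') {
            return true;
        }
    }
    return false;
}

The memcpy call typically compiles down to a single unaligned load, so the main loop stays cheap.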

How can really smart people appear totally incompetent?

2025-04-12 02:56:26

It is often puzzling to encounter organizations run by highly capable and ambitious people… that nonetheless appear dysfunctional.

An example that I like: colleges that claim to provide a great learning experience… while providing their students with recycled lectures given by contract lecturers…

I like to tell the tale of this math class I took at the University of Toronto. The esteemed professor would show up with the textbook and a ruler. He would sit down, open the book to the page where the ruler was last left, and read each line of the textbook, slowly moving the ruler down. He would not look at us. He would just read, very slowly, page by page, the content of the textbook. When we showed up to the math department to complain, we were told that the esteemed professor had to teach something, and that they thought best to assign it to our class, since we were in the honours program and could learn on our own.

Recently, the Canadian state broadcaster ran an article about how an otherwise reasonable Canadian university accepted 4000 foreign students, of which only 38 showed up at all, and only 5 stuck with their classes during the first term. It appears that they contributed to a scam which is meant to facilitate immigration.

Apple announced triumphantly its ground-breaking ‘Apple Intelligence’ feature… only to restrict it so much that it becomes irrelevant.

For decades, the leading chip maker in the world was Intel. Intel had such a lead over its competitors that it was believed that attempts at competing against it were useless. Today, few would say that Intel makes the best processors. Its competitors quickly caught up.

Such dysfunctions can be explained by many effects. But I believe that a critical one is a culture of lies. It might be fine to lie to your enemy but once you start lying to yourself, you are inviting trouble. Lies about your objectives are the worst kind.

So you run a college. And you tell prospective students that the quality of their education is your primary concern. But it is a lie. In fact, you want to keep your enrolment at healthy levels and not have too much trouble with your faculty. You do not care very much for what the students learn. In truth, it does not matter because the employers also do not care. You selected bright people and your students managed to fulfil bureaucratic obligations for four years, and that’s good enough.

So what do you do with the esteemed professor who lectures by reading line-by-line in the textbook? Does he lower your enrolment? No. He does not. He is a prestigious researcher who won many awards, so he might contribute to the prestige of your university and even attract students. Trying to get him to learn to teach would mean going against the faculty, which is not in your interest as an administrator. Of course, the students are having a terrible experience, but what does it matter? They will get their grades and they will graduate. And if they tell their story, nobody will care.

Or, as in the example of the Canadian university, you are happy to have enrolled thousands of students. That these students were not actual students does not matter. You look really good on paper, at least for a time. And if you get caught… well, you did your best, didn’t you?

Apple probably did not really want to get into consumer AI. They had a really long time to do something with Siri, and they did nothing at all. Apple probably wants to sell iPhones, watches and laptops. They want to look like they are doing something about AI, but they don’t really care.

As for Intel, I don’t have a good theory but I suspect that what happened is the following classic scenario. Intel’s leadership likely touted “we’re still number one” while ignoring warning signs. Nobody cared about long-term technological relevance. They were crushing the competition on the market. It seemed like they had many new things upcoming. It was good enough.

It is not that people running colleges, Intel or Apple are incapable of succeeding. They all succeed on their own terms. They are just not sharing openly what these terms are. And if pushed, they will lie about their goals.

So, how can really smart people appear totally incompetent? Because they are lying to you. What they are openly setting out to achieve is not their real goal.

Given a choice, prefer people who are brutally honest and highly capable. You are much more likely to see them exceed your expectations.


How helpful is AI?

2025-04-08 03:41:27

Do large language models (AI) make you 3x faster or only 3% faster? The answer depends on the quality of the work you are producing.

If you need something like a stock photo but not much beyond that, AI can make you 10x faster.

If you need a picture taken at the right time of the right person, AI doesn’t help you much.

If you need a piece of software that an intern could have written, AI can do it 10x faster than the intern.

If you need a piece of software that only 10 engineers in the country can understand, AI doesn’t help you much.

The effect is predictable: finding work becomes more difficult if your skill level is low. However, if you are a highly skilled individual, you can eliminate much of the boilerplate work and focus on what matters. Thus, elite people are going to become even more productive.

Faster shuffling in Go with batching

2025-04-07 04:26:56

Random integer generation is a fundamental operation in programming, often used in tasks like shuffling arrays. Go’s standard library provides convenient tools like rand.Shuffle for such purposes. You may be able to beat the standard library by a generous margin. Let us see why.

Go’s rand.Shuffle implements the Fisher-Yates (or Knuth) shuffle algorithm, a standard method for generating a uniform random permutation. In Go, you may call it like so:
import "math/rand/v2"

func shuffleStandard(data []uint64) {
  rand.Shuffle(len(data), func(i, j int) {
    data[i], data[j] = data[j], data[i]
  })
}

The rand.Shuffle function takes a length and a swap function; internally, it generates a random index in the range [0, i] for each position i (going from the last position down) and swaps elements accordingly.

Under the hood, rand.Shuffle needs random integers within a shrinking range. This can be a surprisingly expensive step: it may require a modulo/division operation per call. We can optimize such a process so that divisions are avoided with the following routine (Lemire, 2019):
package main

import (
    "math/bits"
    "math/rand/v2"
)

// randomBounded returns a uniform integer in [0, rangeVal), avoiding any
// division in the common case (Lemire, 2019).
func randomBounded(rangeVal uint64) uint64 {
    random64bit := rand.Uint64()
    hi, lo := bits.Mul64(random64bit, rangeVal)
    leftover := lo
    if leftover < rangeVal {
        threshold := -rangeVal % rangeVal
        for leftover < threshold {
            random64bit = rand.Uint64()
            hi, lo = bits.Mul64(random64bit, rangeVal)
            leftover = lo
        }
    }
    return hi // [0, rangeVal)
}

This function returns an integer in the interval [0, rangeVal). It is fair: as long as the 64-bit generator is uniform, it will generate all values in the interval with the same probability.

The original shuffle function in the Go standard library used a slower division-based approach, while math/rand/v2 uses this fast algorithm. So for better performance, always use math/rand/v2 and not math/rand.

While efficient, this improved function still needs to generate many random numbers. For rand.Shuffle, this means one rand.Uint64() call per element.

We can drastically optimize the function with batching (Brackett-Rozinsky and Lemire, 2024). That is, we can use a single 64-bit random integer to produce multiple random numbers, reducing the number of random-number generations.

I have implemented shuffling by batching in Go and my results are promising. The partialShuffle64b function is the heart of my approach:
// partialShuffle64b performs k steps of a Fisher-Yates shuffle over
// storage[0:n] using a single 64-bit random word, with a rejection check
// governed by bound. It returns the bound to use for the next call.
func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k,
    bound uint64) uint64 {
    r := rand.Uint64()

    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }

    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound

        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], storage[pos1]
    }
    return bound
}

The partialShuffle64b function performs a partial shuffle of a slice by generating k random indices from a range of size n, using a single 64-bit random number r obtained from rand.Uint64(), and then swapping elements at those indices. I reuse an array where the indices are stored. Though it looks quite different, it is essentially a generalization of the randomBounded routine already used by the Go standard library.

Using the partialShuffle64b function, we can code a shuffle function where we generate indices in batches of two:
func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}

We can also generalize and generate the indices in batches of six when possible:

func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}

    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}

Since Go comes with built-in support for benchmarking, we can benchmark shuffling 10,000 elements:

func BenchmarkShuffleStandard10K(b *testing.B) {
    size := 10000
    data := make([]uint64, size)
    for i := 0; i < size; i++ {
        data[i] = uint64(i)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        shuffleStandard(data)
    }
}

func BenchmarkShuffleBatch2_10K(b *testing.B) { ... }
func BenchmarkShuffleBatch23456_10K(b *testing.B) { ... }

On an Intel Gold 6338 CPU @ 2.00GHz with Go 1.24, I get the following results:
  • Standard (rand.Shuffle): 13 ns/element.
  • Batch 2: 9.3 ns/element (~1.4x faster).
  • Batch 2-3-4-5-6: 5.0 ns/element (~2.6x faster).
You will get different results on your own system. I make my code available on GitHub.

The batched approaches reduce the number of random numbers generated, amortizing the cost of rand.Uint64() across multiple outputs. The 6-way batch maximizes this effect, yielding the best performance.

Go’s rand.Shuffle is a solid baseline, but batching random integer generation can make it much faster. By generating multiple random numbers from a single 64-bit value, we can boost efficiency—by over 2x in our benchmarks.

References

  • Nevin Brackett-Rozinsky and Daniel Lemire, Batched Ranged Random Integer Generation, Software: Practice and Experience 55 (1), 2024.
  • Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019.

Appendix: Code

package main

// Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random 
// Integer Generation, Software: Practice and Experience 55 (1), 2024.
// Daniel Lemire, Fast Random Integer Generation in an Interval, 
// ACM Transactions on Modeling and Computer Simulation, 29 (1), 2019.
import (
    "fmt"
    "math/bits"
    "math/rand/v2"
)

func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}

func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}

    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}

func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k,
    bound uint64) uint64 {
    r := rand.Uint64()

    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }

    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound

        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], storage[pos1]
    }
    return bound
}

// Example driver (not part of the original post): shuffle a small slice
// and print it, so that the listing compiles and runs as package main.
func main() {
    data := make([]uint64, 16)
    for i := range data {
        data[i] = uint64(i)
    }
    shuffleBatch23456(data)
    fmt.Println(data)
}

Mixing ARM NEON with SVE code for fun and profit

2025-03-29 09:44:47

Most mobile devices use 64-bit ARM processors. A growing number of servers (Amazon, Microsoft) also use 64-bit ARM processors.

These processors have special instructions, called ARM NEON, that provide a form of parallelism known as Single Instruction, Multiple Data (SIMD). For example, you can compare sixteen values with sixteen other values using one instruction.

Some of the most recent ARM processors also support even more advanced instructions called SVE, the Scalable Vector Extension. ARM has added more extensions over time: SVE 2 and SVE 2.1.

While ARM NEON’s registers are fixed at 128 bits, SVE registers are a multiple of 128 bits. In practice, SVE registers are most often 128 bits, although there are exceptions. The Amazon Graviton 3 is based on the ARM Neoverse V1 core, which supports SVE with a vector length of 256 bits (32 bytes). The Graviton 4 is based on the ARM Neoverse V2 core, which supports SVE2 with a vector length of 16 bytes, like NEON.

So if you are a low-level engineer, you are supposed to choose: either you use ARM NEON or SVE. As a comment to a recent article I wrote, Ashton Six observed that you can mix and match these instructions (NEON and SVE) because it is guaranteed that the first 128 bits of SVE registers are the NEON registers. Ashton provided a demonstration using assembly code.

If you have a recent C/C++ compiler (e.g., GCC 14), it turns out that you can fairly easily switch back and forth between NEON and SVE. The arm_neon_sve_bridge.h header provides two functions:
  • svset_neonq: sets the contents of a NEON 128-bit vector (uint8x16_t, int32x4_t, etc.) into an SVE scalable vector (svuint8_t, svint32_t, etc.).
  • svget_neonq: extracts the first 128 bits of an SVE scalable vector and returns them as a NEON 128-bit vector.

These functions are ‘free’: they are likely compiled to no instruction whatsoever.

Let us illustrate with an example. In a recent post, I discussed how it is a bit complicated to check whether there is a non-zero byte in a NEON register. A competitive solution is as follows:

int veq_non_zero_max(uint8x16_t v) {
  return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}

Effectively, we compute the maximum 32-bit integer in the register, considering it as four 32-bit integers. The function compiles down to three essential instructions: umaxv, fmov and a comparison (cmp).

Let us consider the following SVE alternative. It converts the input to an SVE vector, creates a mask for all 16 positions, compares each element to zero to generate a predicate of non-zero positions, and finally tests if any elements are non-zero, returning 1 if so or 0 if all are zero—essentially performing an efficient vectorized “any non-zero” check.
int sve_non_zero_max(uint8x16_t nvec) {
  svuint8_t vec;
  vec = svset_neonq_u8(vec, nvec);
  svbool_t mask = svwhilelt_b8(0, 16);
  svbool_t cmp = svcmpne_n_u8(mask, vec, 0);
  return svptest_any(mask, cmp);
}

Except for the initialization of mask, the function is made of two instructions: cmpne and cset. These two instructions may be fused into one instruction in some ARM cores. Even though the code mixing NEON and SVE looks more complicated, it should be more efficient.

If you know that your target processor supports SVE (or SVE 2 or SVE 2.1), and you already have ARM NEON code, you could try adding bits of SVE to it.
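For instance, here is a minimal sketch (my own, not from the post) of how one might guard the SVE path behind the compiler's feature macro and fall back to the NEON-only routine otherwise. The function name has_non_zero_byte is an assumption, and I initialize the upper SVE lanes with zeros rather than leaving them undefined.

#include <arm_neon.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>
#endif

// Hypothetical dispatch: use the SVE-assisted check when the compiler
// targets SVE, otherwise fall back to plain NEON.
int has_non_zero_byte(uint8x16_t v) {
#if defined(__ARM_FEATURE_SVE)
  svuint8_t vec = svset_neonq_u8(svdup_n_u8(0), v);
  svbool_t mask = svwhilelt_b8(0, 16);
  svbool_t cmp = svcmpne_n_u8(mask, vec, 0);
  return svptest_any(mask, cmp);
#else
  return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
#endif
}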


Unsigned comparisons using signed types

2025-03-25 07:24:50

There are two main types of fixed-precision integers in modern software: unsigned and signed. In C++20 and above, the signed integers must use the two’s complement convention. Other programming languages typically specify two’s complement as well.

Two’s complement is a method for representing signed integers in binary, where the leftmost bit serves as the sign bit—0 for positive numbers and 1 for negative ones—allowing computers to handle both positive and negative values efficiently with simple arithmetic. Positive numbers are written as standard binary (e.g., 5 in 8-bit is 00000101), while negative numbers are formed by taking the positive version, flipping all bits (one’s complement), and adding 1 (e.g., -5 becomes 11111011 from 00000101).
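As a quick check (a throwaway snippet of my own, not from the post), we can reinterpret the bits of -5 as an unsigned byte and observe the two's complement pattern:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int8_t x = -5;              // stored as 11111011 in two's complement
    printf("%d\n", (uint8_t)x); // prints 251, i.e., 0xFB = 11111011
    return 0;
}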

There are cases where you have signed types and you wish to use unsigned comparisons. It happens in Java (which lacks unsigned types) and with some of the x64 SIMD instruction sets: extensions to the x64 architecture that allow a single instruction to perform the same operation on multiple data elements simultaneously.

Thus, you would want any negative value to be larger than any positive value. The trick is to simply use the fact that arithmetic operations typically wrap around in hardware: if x + y is too large and exceeds the range, then the processor replaces it with x + y - K where K is the total number of values (e.g., 2 to the power 64). Let M be the smallest possible value (a very small negative), and take two signed integers x and y; then (x + M) < (y + M) is equivalent to comparing x and y as if they had been cast to unsigned integer values.

To see why it is so, consider that the transformation x to x + M maps 0 to M (so 0 becomes the smallest value). It maps the positive values, from 1 to -M-1, to the range from M+1 to -1 (so positives become negatives). The negative values all overflow in such a way that the most significant bit becomes a zero while the other bits remain unchanged. Thus it maps the negative values, from M to -1, to the range from 0 to -M-1 (so negatives become positives).

Unfortunately, it will not generally work in C or C++ where signed overflow leads to an undefined behavior. However, it will work if you code for SIMD instruction sets using intrinsics (special functions used by SIMD enthusiasts). It will also work in other programming languages like Java or Go.
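As an illustration (a sketch of my own, not from the post), SSE2 intrinsics give you well-defined wrap-around addition, so you can compare unsigned bytes with the signed byte comparison instruction. The helper name cmpgt_epu8 is an assumption:

#include <emmintrin.h> // SSE2

// Per-lane 'a > b' over unsigned bytes, using only the signed comparison
// _mm_cmpgt_epi8. Adding 0x80 (the smallest signed byte) to each byte wraps
// around and maps the unsigned order onto the signed order.
__m128i cmpgt_epu8(__m128i a, __m128i b) {
    const __m128i M = _mm_set1_epi8((char)0x80);
    return _mm_cmpgt_epi8(_mm_add_epi8(a, M), _mm_add_epi8(b, M));
}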

To avoid undefined behavior in C and C++, we can use the following identity:

x + M = (x^M) + ((x&M)<<1)

where ^ is the XOR operation. This identity holds when you assume that the addition wraps around and that the left shift acts as if the values were unsigned. Because M has only the most significant bit set in two’s complement notation, x & M is either 0 or M, and shifting that most significant bit left by one discards it: the second term is zero. Thus we have

x + M = x^M

Hence, instead of using the addition with M, we can use the XOR with M: (x ^ M) < (y ^ M). That’s well defined in C and C++.
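Concretely, a minimal sketch (my own, not from the post) of an unsigned comparison over signed 64-bit integers might look like this; the function name unsigned_less is an assumption:

#include <stdbool.h>
#include <stdint.h>

// Compare x and y as if they had been cast to uint64_t, using only
// signed operations: XOR with the smallest value flips the sign bit and
// maps the unsigned order onto the signed order.
bool unsigned_less(int64_t x, int64_t y) {
    const int64_t M = INT64_MIN; // only the most significant bit set
    return (x ^ M) < (y ^ M);
}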
Credit: Thanks to Marc Reynolds for his input.