2025-04-14 06:36:41
When trying to write fast functions operating over many bytes, we sometimes use ‘SWAR’. SWAR stands for SIMD Within A Register, and it is a technique to perform parallel computations on multiple data elements packed into a single word. It treats a register (e.g., 64-bit) as a vector of smaller units (e.g., 8-bit bytes) and processes them simultaneously without explicit SIMD instructions.
For example, if you have 8 bytes in a 64-bit word x, computing (x - 0x2020202020202020) subtracts 32 (0x20 in hexadecimal) from each byte. It works well if each byte value is greater than or equal to 32. Otherwise, the operation 'overflows': if the least significant byte is smaller than 32, its most significant bit is set and a borrow propagates into the next byte, so you cannot rely on the results in the more significant bytes.
It works well if all bytes are 'ASCII' (they have their most significant bit unset). When that is the case, you can check if any byte value is less than 32 by computing (x - 0x2020202020202020) and checking if any most significant bit is set, by masking the result: (x - 0x2020202020202020) & 0x8080808080808080.
Similarly, you can check whether the byte value 34 (0x22) appears by first applying an XOR that sets any such byte value to zero, (x ^ 0x2222222222222222), and then subtracting 1 from each byte value: (x ^ 0x2222222222222222) - 0x0101010101010101. It then suffices to mask with 0x8080808080808080 again to see if any byte value was equal to 34, as long as all input byte values were ASCII.
Thankfully, it is easy to select only the byte values that are ASCII: (0x8080808080808080 & ~x) will do.
#include <stdbool.h>
#include <stdint.h>

bool has_json_escapable_byte(uint64_t x) {
  uint64_t is_ascii = 0x8080808080808080ULL & ~x;
  uint64_t lt32 = (x - 0x2020202020202020ULL);
  uint64_t sub34 = x ^ 0x2222222222222222ULL;
  uint64_t eq34 = (sub34 - 0x0101010101010101ULL);
  uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
  uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
  return ((lt32 | eq34 | eq92) & is_ascii) != 0;
}
That’s slightly more than one operation per input byte.
The following routine is even better (credit to Falk Hüffner):
bool has_json_escapable_byte(uint64_t x) {
  uint64_t is_ascii = 0x8080808080808080ULL & ~x;
  uint64_t xor2 = x ^ 0x0202020202020202ULL;
  uint64_t lt32_or_eq34 = xor2 - 0x2121212121212121ULL;
  uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
  uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
  return ((lt32_or_eq34 | eq92) & is_ascii) != 0;
}
You might be able to do even better.
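To apply the routine to a string, one can process the input eight bytes at a time. Here is a minimal sketch using the has_json_escapable_byte function above; the name needs_json_escaping is my own, and it assumes ASCII input whose length is a multiple of eight (a trailing partial block would need separate handling):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical driver: scans the input eight bytes at a time using
// has_json_escapable_byte (defined above). Assumes ASCII input and,
// for brevity, ignores any trailing partial block.
bool needs_json_escaping(const char *s, size_t len) {
  for (size_t i = 0; i + 8 <= len; i += 8) {
    uint64_t word;
    memcpy(&word, s + i, sizeof(word)); // safe unaligned load
    if (has_json_escapable_byte(word)) {
      return true;
    }
  }
  return false;
}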
2025-04-12 02:56:26
It is often puzzling to encounter organizations run by highly capable and ambitious people… that nevertheless appear dysfunctional.
An example that I like is colleges that claim to provide a great learning experience… while providing their students with recycled lectures given by contract lecturers…
I like to tell the tale of this math class I took at the University of Toronto. The esteemed professor would show up with the textbook and a ruler. He would sit down, open the book to the page where the ruler was last left, and read each line of the textbook, slowly moving the ruler down. He would not look at us. He would just read, very slowly, page by page, the content of the textbook. When we showed up to the math department to complain, we were told that the esteemed professor had to teach something, and that they thought it best to assign him to our class, since we were in the honours program and could learn on our own.
Recently, the Canadian state broadcaster ran an article about how an otherwise reasonable Canadian university accepted 4000 foreign students, of whom only 38 showed up at all, and only 5 stuck with their classes during the first term. It appears that they contributed to a scam meant to facilitate immigration.
Apple triumphantly announced its ground-breaking 'Apple Intelligence' feature… only to restrict it so much that it became irrelevant.
For decades, the leading chip maker in the world was Intel. Intel had such a lead over its competitors that it was believed that attempts at competing against it were useless. Today, few would say that Intel makes the best processors. Its competitors quickly caught up.
Such dysfunctions can be explained by many effects. But I believe that a critical one is a culture of lies. It might be fine to lie to your enemy but once you start lying to yourself, you are inviting trouble. Lies about your objectives are the worst kind.
So you run a college. And you tell prospective students that the quality of their education is your primary concern. But it is a lie. In fact, you want to keep your enrolment at healthy levels and not have too much trouble with your faculty. You do not care very much for what the students learn. In truth, it does not matter because the employers also do not care. You selected bright people and your students managed to fulfil bureaucratic obligations for four years, and that’s good enough.
So what do you do with the esteemed professor who lectures by reading line-by-line in the textbook? Does he lower your enrolment? No. He does not. He is a prestigious researcher who won many awards, so he might contribute to the prestige of your university and even attract students. Trying to get him to learn to teach would mean going against the faculty, which is not in your interest as an administrator. Of course, the students are having a terrible experience, but what does it matter? They will get their grades and they will graduate. And if they tell their story, nobody will care.
Or, as in the example of the Canadian university, you are happy to have enrolled thousands of students. That these students were not actual students does not matter. You look really good on paper, at least for a time. And if you get caught… well, you did your best, didn’t you?
Apple probably did not really want to get into consumer AI. They had a really long time to do something with Siri, and they did nothing at all. Apple probably wants to sell iPhones, watches and laptops. They want to look like they are doing something about AI, but they don't really care.
As for Intel, I don’t have a good theory but I suspect that what happened is the following classic scenario. Intel’s leadership likely touted “we’re still number one” while ignoring warning signs. Nobody cared about long-term technological relevance. They were crushing the competition on the market. It seemed like they had many new things upcoming. It was good enough.
It is not that people running colleges, Intel or Apple are incapable of succeeding. They all succeed on their own terms. They are just not sharing openly what these terms are. And if pushed, they will lie about their goals.
So, how can really smart people appear totally incompetent? Because they are lying to you. What they are openly setting out to achieve is not their real goal.
Given a choice, prefer people who are brutally honest and highly capable. You are much more likely to see them exceed your expectations.
2025-04-08 03:41:27
Do large language models (AI) make you 3x faster or only 3% faster? The answer depends on the quality of the work you are producing.
If you need something like a stock photo but not much beyond that, AI can make you 10x faster.
If you need a picture taken at the right time of the right person, AI doesn’t help you much.
If you need a piece of software that an intern could have written, AI can do it 10x faster than the intern.
If you need a piece of software that only 10 engineers in the country can understand, AI doesn’t help you much.
The effect is predictable: finding work becomes more difficult if your skill level is low. However, if you are a highly skilled individual, you can eliminate much of the boilerplate work and focus on what matters. Thus, elite people are going to become even more productive.
2025-04-07 04:26:56
import "math/rand/v2" func shuffleStandard(data []uint64) { rand.Shuffle(len(data), func(i, j int) { data[i], data[j] = data[j], data[i] }) }
For reference, here is the randomBounded routine, of the kind already used by the Go standard library, which generates a random integer in an interval from a single 64-bit random word, with an occasional rejection step:

import (
    "math/bits"
    "math/rand/v2"
)

func randomBounded(rangeVal uint64) uint64 {
    random64bit := rand.Uint64()
    hi, lo := bits.Mul64(random64bit, rangeVal)
    leftover := lo
    if leftover < rangeVal {
        threshold := -rangeVal % rangeVal
        for leftover < threshold {
            random64bit = rand.Uint64()
            hi, lo = bits.Mul64(random64bit, rangeVal)
            leftover = lo
        }
    }
    return hi // [0, range)
}
func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k, bound uint64) uint64 {
    r := rand.Uint64()
    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }
    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound
        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], storage[pos1]
    }
    return bound
}
The partialShuffle64b function performs a partial shuffle of a slice by generating k random indices from a range of size n, using a single 64-bit random number r obtained from rand.Uint64(), and then swapping elements at those indices. I reuse an array where the indices are stored. Though it looks much different, it is essentially a generalization of the randomBounded routine already used by the Go standard library.

Using it, we can shuffle by drawing two indices at a time whenever the range is small enough (below 2 to the power 30):

func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}
We can generalize further and generate the indices in batches of up to six when the range allows it:
func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}
Using the fact that Go comes with full support for benchmarking, we can try shuffling 10,000 elements as a Go benchmark:
func BenchmarkShuffleStandard10K(b *testing.B) {
    size := 10000
    data := make([]uint64, size)
    for i := 0; i < size; i++ {
        data[i] = uint64(i)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        shuffleStandard(data)
    }
}

func BenchmarkShuffleBatch2_10K(b *testing.B) { ... }

func BenchmarkShuffleBatch23456_10K(b *testing.B) { ... }
Go's rand.Shuffle is a solid baseline, but batching random integer generation can make it much faster. By generating multiple random indices from a single 64-bit random word, we can boost efficiency by over 2x in our benchmarks.
References

Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random Integer Generation, Software: Practice and Experience 55 (1), 2024.

Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019.

Appendix: Code

package main

import (
    "math/bits"
    "math/rand/v2"
)

func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}

func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}

func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k, bound uint64) uint64 {
    r := rand.Uint64()
    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }
    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound
        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], storage[pos1]
    }
    return bound
}
2025-03-29 09:44:47
Most mobile devices use 64-bit ARM processors. A growing number of servers (Amazon, Microsoft) also use 64-bit ARM processors.
These processors support special instructions, called ARM NEON, that provide a form of parallelism known as Single Instruction, Multiple Data (SIMD). For example, you can compare sixteen values with sixteen other values using one instruction.
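For instance, here is a minimal NEON sketch (the function name compare16 is my own) that compares sixteen bytes against sixteen other bytes at once:

#include <arm_neon.h>

// Each byte of the result is 0xFF where a[i] == b[i] and 0x00 otherwise:
// a single vceqq_u8 instruction compares all sixteen lanes.
uint8x16_t compare16(uint8x16_t a, uint8x16_t b) {
  return vceqq_u8(a, b);
}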
Some of the most recent ARM processors also support even more advanced instructions called SVE, the Scalable Vector Extension. ARM has added more extensions over time: SVE2 and SVE2.1.
While ARM NEON's registers are fixed at 128 bits, SVE registers are a multiple of 128 bits. In practice, SVE registers are most often 128 bits, although there are exceptions. The Amazon Graviton 3 is based on the ARM Neoverse V1 core, which supports SVE with a vector length of 256 bits (32 bytes). The Graviton 4 is based on the ARM Neoverse V2 core, which supports SVE2 with a vector length of 128 bits (16 bytes), like NEON.
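If you are curious about the vector length on a given machine, a small sketch like the following reports it at runtime. This is my own illustration; it assumes a compiler with SVE support enabled (e.g., -march=armv8-a+sve):

#include <arm_sve.h>
#include <stdio.h>

int main(void) {
  // svcntb() returns the number of 8-bit lanes (bytes) in an SVE vector:
  // 16 on most current cores, 32 on the Graviton 3.
  printf("SVE vector length: %u bytes\n", (unsigned)svcntb());
  return 0;
}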
You can mix NEON and SVE code: intrinsic functions such as svset_neonq_u8 reinterpret a NEON register as the low 128 bits of an SVE register. These functions are 'free': they are likely compiled to no instruction whatsoever.
Let us illustrate with an example. In a recent post, I discussed how it is a bit complicated to check whether there is a non-zero byte in a NEON register. A competitive solution is as follows:
#include <arm_neon.h>

int veq_non_zero_max(uint8x16_t v) {
  return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}
Effectively, we compute the maximum 32-bit integer in the register, considering it as four 32-bit integers. The function compiles down to three essential instructions: umaxv, fmov and a comparison (cmp).
The same check can be written by mixing NEON and SVE:

#include <arm_neon.h>
#include <arm_sve.h>

int sve_non_zero_max(uint8x16_t nvec) {
  svuint8_t vec;
  vec = svset_neonq_u8(vec, nvec);
  svbool_t mask = svwhilelt_b8(0, 16);
  svbool_t cmp = svcmpne_n_u8(mask, vec, 0);
  return svptest_any(mask, cmp);
}
Except for the initialization of mask, the function is made of two instructions cmpne and cset. These two instructions may be fused to one instruction in some ARM cores. Even though the code mixing NEON and SVE looks more complicated, it should be more efficient.
If you know that your target processor supports SVE (or SVE 2 or SVE 2.1), and you already have ARM NEON code, you could try adding bits of SVE to it.
2025-03-25 07:24:50
There are two main types of fixed-precision integers in modern software: unsigned and signed. In C++20 and above, the signed integers must use the two’s complement convention. Other programming languages typically specify two’s complement as well.
Two’s complement is a method for representing signed integers in binary, where the leftmost bit serves as the sign bit—0 for positive numbers and 1 for negative ones—allowing computers to handle both positive and negative values efficiently with simple arithmetic. Positive numbers are written as standard binary (e.g., 5 in 8-bit is 00000101), while negative numbers are formed by taking the positive version, flipping all bits (one’s complement), and adding 1 (e.g., -5 becomes 11111011 from 00000101).
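As a quick illustration (my own snippet, not part of the argument), a cast in C makes the representation visible:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  int8_t v = -5;
  // Reinterpreting the bits of -5 as an unsigned byte gives 0xFB (11111011):
  // flip the bits of 00000101 (5) and add 1.
  printf("%02X\n", (unsigned)(uint8_t)v); // prints FB
  return 0;
}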
There are cases where you have signed types and you wish to use unsigned comparisons. It happens in Java (which lacks unsigned types) and with some of the x64 SIMD instruction sets: extensions to the x64 architecture that allow a single instruction to perform the same operation on multiple data elements simultaneously.
Thus, you would want any negative value to be larger than any positive value. The trick is to simply use the fact that arithmetic operations typically wrap around in hardware: if x + y is too large and exceeds the range, then the processor replaces it with x + y - K where K is the total number of values (e.g., 2 to the power 64). Let M be the smallest possible value (a very small negative), and take two signed integers x and y, then (x + M) < (y + M) is equivalent to comparing x and y as if they had been cast to unsigned integer values.
To see why it is so, consider that the transformation x to x + M maps 0 to M (so 0 becomes the smallest value). It maps the positive values in the range from 1 to -M-1 to the range from M+1 to -1 (so positives become negatives). The negative values all overflow in such a way that the most significant bit becomes a zero while the other bits remain unchanged. Thus it maps the negative values from M to -1 to the range 0 to -M-1 (so negatives become positives). For example, with 8-bit values (M = -128), x = -1 maps to 127 while y = 1 maps to -127, so y + M compares as smaller than x + M, just as 1 is smaller than 255 when the bytes are read as unsigned.
Unfortunately, it will not generally work in C or C++ where signed overflow leads to undefined behavior. However, it will work if you code for SIMD instruction sets using intrinsics (special functions used by SIMD enthusiasts). It will also work in other programming languages like Java or Go.
To avoid undefined behavior in C and C++, we can use the following identity:
x + M = (x^M) + ((x&M)<<1)
where ^ is the XOR operation. This identity holds when you assume that the addition wraps around and that the left shift acts as if the values were unsigned. Because M has only the most significant bit set in two's complement notation, x & M can only have the most significant bit set, and shifting it left by one discards that bit, so the second term is zero. Thus we have
x + M = x^M
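As an illustration of how this is used with SIMD intrinsics, here is a minimal SSE2 sketch (the function name unsigned_gt_bytes is my own): the hardware only offers a signed byte comparison, but flipping the sign bit of each byte with XOR, which is the x + M transformation for 8-bit lanes (M = -128, i.e., 0x80), turns it into an unsigned comparison:

#include <emmintrin.h> // SSE2

// Returns 0xFF in each byte where a[i] > b[i] as *unsigned* bytes,
// even though _mm_cmpgt_epi8 performs a *signed* comparison.
__m128i unsigned_gt_bytes(__m128i a, __m128i b) {
  const __m128i M = _mm_set1_epi8((char)0x80); // smallest signed byte value
  return _mm_cmpgt_epi8(_mm_xor_si128(a, M), _mm_xor_si128(b, M));
}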