2025-04-21 04:40:40
In software, we often use key-value data structures, where each key is unique and maps to a specific value. Common examples include dictionaries in Python, hash maps in Java, and objects in JavaScript. If you combine arrays with key-value data structures, you can represent most data.
In C++, we have two standard key-value data structures: the std::map and the std::unordered_map. Typically, the std::map is implemented as a tree (e.g., a red-black tree) with sorted keys, providing logarithmic time complexity O(log n) for lookups, insertions, and deletions, and maintaining keys in sorted order. The std::unordered_map is typically implemented as a hash table with unsorted keys, offering average-case constant time complexity O(1) for lookups, insertions, and deletions. In the std::unordered_map, the hash table uses a hashing function to map keys to indices in an array of buckets. Each bucket is essentially a container that can hold multiple key-value pairs, typically implemented as a linked list. When a key is hashed, the hash function computes an index corresponding to a bucket. If multiple keys hash to the same bucket (a collision), they are stored in that bucket’s linked list.
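To make the difference concrete, here is a minimal sketch (the keys and values are arbitrary examples of mine): iterating over an std::map visits the keys in sorted order, while an std::unordered_map makes no ordering promise.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

int main() {
    std::map<std::string, int> tree_map{{"pear", 2}, {"apple", 1}, {"fig", 3}};
    std::unordered_map<std::string, int> hash_map{{"pear", 2}, {"apple", 1}, {"fig", 3}};
    // The tree-based map iterates in sorted key order: apple fig pear.
    for (const auto& [key, value] : tree_map) { std::cout << key << ' '; }
    std::cout << '\n';
    // The hash-based map iterates in an unspecified order.
    for (const auto& [key, value] : hash_map) { std::cout << key << ' '; }
    std::cout << '\n';
}
```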
Quite often, we only need to look at the keys, or look at the values. The C++20 standard makes this convenient through the introduction of ranges (std::ranges::views::keys and std::ranges::views::values). Let us consider two functions using the ‘modern’ functional style. The first function sums the values and the next function counts how many keys (assuming that they are strings) start with a given prefix.
```cpp
// Requires <numeric>, <ranges> and <string_view>.
template <typename map_type>
auto sum_values(const map_type& map) {
    auto values = map | std::ranges::views::values;
    return std::accumulate(values.begin(), values.end(),
                           typename map_type::mapped_type{});
}

template <typename map_type>
size_t count_keys_with_prefix(const map_type& map, std::string_view prefix) {
    auto keys = map | std::ranges::views::keys;
    return std::ranges::count_if(keys, [prefix](std::string_view key) {
        return key.starts_with(prefix);
    });
}
```
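The same two computations can also be written with plain range-based for loops over the same C++20 views: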
```cpp
template <typename map_type>
auto sum_values_daniel(const map_type& map) {
    typename map_type::mapped_type sum{};
    for (const auto& value : map | std::ranges::views::values) {
        sum += value;
    }
    return sum;
}

template <typename map_type>
size_t count_keys_with_prefix_daniel(const map_type& map, std::string_view prefix) {
    size_t count = 0;
    for (const auto& key : map | std::ranges::views::keys) {
        if (key.starts_with(prefix)) {
            ++count;
        }
    }
    return count;
}
```
When you do not have C++20, you can write the same functions using a more conventional style:
```cpp
template <typename map_type>
typename map_type::mapped_type sum_values_cpp11(const map_type& map) {
    typename map_type::mapped_type sum = typename map_type::mapped_type();
    for (typename map_type::const_iterator it = map.begin(); it != map.end(); ++it) {
        sum += it->second;
    }
    return sum;
}

template <typename map_type>
size_t count_keys_with_prefix_cpp11(const map_type& map, const std::string& prefix) {
    size_t count = 0;
    for (typename map_type::const_iterator it = map.begin(); it != map.end(); ++it) {
        const std::string& key = it->first;
        if (key.size() >= prefix.size() && key.compare(0, prefix.size(), prefix) == 0) {
            ++count;
        }
    }
    return count;
}
```
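The benchmark tables below also include a 'C++11 with starts_with' variant: presumably the same conventional loop, but with the manual comparison replaced by the starts_with method that C++20 added to strings. A minimal sketch of such a variant (the function name is mine):

```cpp
template <typename map_type>
size_t count_keys_with_prefix_startswith(const map_type& map, std::string_view prefix) {
    size_t count = 0;
    for (typename map_type::const_iterator it = map.begin(); it != map.end(); ++it) {
        // starts_with is a C++20 addition to std::string and std::string_view.
        if (it->first.starts_with(prefix)) {
            ++count;
        }
    }
    return count;
}
```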
Let us compare the performance. I have generated a large key-value map with a thousand keys (as strings) with corresponding integer values. All my keys begin with ‘key_’ and I ask my functions to count how many keys begin with ‘key_’ (that should be all of them). I use both an std::map and an std::unordered_map as the data source. I record how many nanoseconds I need to spend per key checked. I use two machines: one with an Intel Ice Lake processor running at 3.2 GHz with GCC 12 and LLVM 16 as the compilers, and one with an Apple M2 processor running at 3.5 GHz with LLVM 15 as the compiler. They do not use the same instruction set, so we cannot directly compare the number of instructions retired: we expect an ARM-based system like the Apple M2 to use a few more instructions.
Using an std::map, I get the following timings and instruction counts (both per key checked):
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 2.2 ns | 5.5 ns | 5.8 ns |
C++20 loop | 2.0 ns | 5.5 ns | 5.8 ns |
C++11 | 4.0 ns | 5.5 ns | 5.5 ns |
C++11 with starts_with | 2.0 ns | 5.5 ns | 5.8 ns |

function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 46 instructions | 29 instructions | 49 instructions |
C++20 loop | 41 instructions | 29 instructions | 49 instructions |
C++11 | 58 instructions | 30 instructions | 30 instructions |
C++11 with starts_with | 43 instructions | 30 instructions | 49 instructions |
Using an std::unordered_map, I get the following timings and instruction counts:
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 1.1 ns | 3.8 ns | 3.8 ns |
C++20 loop | 0.9 ns | 5.4 ns | 3.8 ns |
C++11 | 0.9 ns | 5.4 ns | 3.8 ns |
C++11 with starts_with | 0.9 ns | 5.4 ns | 3.8 ns |

function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 35 instructions | 10 instructions | 30 instructions |
C++20 loop | 30 instructions | 13 instructions | 30 instructions |
C++11 | 26 instructions | 14 instructions | 12 instructions |
C++11 with starts_with | 32 instructions | 14 instructions | 30 instructions |
We find that iterating over an std::map instance is slower than iterating over an std::unordered_map instance.
Performance comparisons between the Apple M2/LLVM 15 and Intel Ice Lake systems reveal striking differences in C++ code execution. Under GCC 12, the functional approach (C++20 ranges) can outperform the traditional loop by up to 40% (on the std::unordered_map), while on Apple/LLVM 15, using starts_with for the string comparison halves the running time on the std::map.
Despite input sizes small enough to fit in cache, Intel’s iteration is not compute-bound: it retires fewer than 2 instructions per cycle (IPC), while the Apple M2 exceeds 4 IPC, highlighting its architectural advantage in optimizing data access patterns.
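For instance, using the std::map numbers above: at 3.2 GHz, 5.5 ns per key is about 17.6 cycles for roughly 29 instructions, or about 1.6 IPC; at 3.5 GHz, 2.0 ns per key is 7 cycles for roughly 41 instructions, or about 5.9 IPC.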
2025-04-14 06:36:41
When trying to write fast functions operating over many bytes, we sometimes use ‘SWAR’. SWAR stands for SIMD Within A Register, and it is a technique to perform parallel computations on multiple data elements packed into a single word. It treats a register (e.g., 64-bit) as a vector of smaller units (e.g., 8-bit bytes) and processes them simultaneously without explicit SIMD instructions.
For example, if you have 8 bytes in a 64-bit word x, computing (x - 0x2020202020202020) subtracts 32 (0x20 in hex) from each byte. It works well if each byte value is greater than or equal to 32. Otherwise, the operation ‘overflows’: if the least significant byte is smaller than 32, the subtraction borrows, its most significant bit is set, and you cannot rely on the other, more significant byte results.
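For instance, with x = 0x4142434445464748 (bytes 0x41 through 0x48, all at least 32), we get x - 0x2020202020202020 = 0x2122232425262728: each byte went down by exactly 32.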
It works well if all bytes are ‘ASCII’ (they have their most significant bit unset). When that is the case, you can check if any byte value is less than 32 by computing (x - 0x2020202020202020) and checking if any most significant bit is set, by masking the result: (x - 0x2020202020202020) & 0x8080808080808080.
Similarly, you can check whether the byte value 34 (0x22) appears: first apply an XOR, which sets any such byte to zero, (x ^ 0x2222222222222222), and then subtract 1 from each byte: (x ^ 0x2222222222222222) - 0x0101010101010101. It then suffices to mask with 0x80…80 again to see whether any byte value was 34, as long as all input byte values were ASCII.
Thankfully, it is easy to select only the byte values that are ASCII: (0x8080808080808080 & ~x) will do.
```c
bool has_json_escapable_byte(uint64_t x) {
    // 0x80 in each byte position where the input byte is ASCII.
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    // For ASCII bytes, the most significant bit becomes set where the byte is < 32.
    uint64_t lt32 = (x - 0x2020202020202020ULL);
    // For ASCII bytes, the most significant bit becomes set where the byte is 34 ('"').
    uint64_t sub34 = x ^ 0x2222222222222222ULL;
    uint64_t eq34 = (sub34 - 0x0101010101010101ULL);
    // For ASCII bytes, the most significant bit becomes set where the byte is 92 ('\').
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32 | eq34 | eq92) & is_ascii) != 0;
}
```
That’s slightly more than one operation per input byte.
The following routine is even better (credit to Falk Hüffner):
```c
bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    // XOR with 0x02 maps 34 into the [0, 33) range while keeping bytes
    // below 32 in that range; subtracting 33 then sets the most significant
    // bit exactly where the byte was < 32 or == 34.
    uint64_t xor2 = x ^ 0x0202020202020202ULL;
    uint64_t lt32_or_eq34 = xor2 - 0x2121212121212121ULL;
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32_or_eq34 | eq92) & is_ascii) != 0;
}
```
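As a usage sketch (the wrapper function and its name are mine), you can scan a whole buffer eight bytes at a time, loading each word with memcpy and handling any leftover tail bytes individually:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical wrapper around has_json_escapable_byte (defined above).
bool buffer_has_json_escapable_byte(const char *data, size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        memcpy(&word, data + i, sizeof(word)); // safe unaligned load
        if (has_json_escapable_byte(word)) { return true; }
    }
    for (; i < len; i++) { // tail: check the remaining bytes one by one
        uint8_t c = (uint8_t)data[i];
        if (c < 32 || c == '"' || c == '\\') { return true; }
    }
    return false;
}
```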
You might be able to do even better.
2025-04-12 02:56:26
It is often puzzling to encounter organizations run by highly capable and ambitious people that nevertheless appear dysfunctional.
An example that I like is colleges that claim to provide a great learning experience while providing their students with recycled lectures given by contract lecturers.
I like to tell the tale of this math class I took at the University of Toronto. The esteemed professor would show up with the textbook and a ruler. He would sit down, open the book to the page where the ruler was last left, and read each line of the textbook, slowly moving the ruler down. He would not look at us. He would just read, very slowly, page by page, the content of the textbook. When we showed up to the math department to complain, we were told that the esteemed professor had to teach something, and that they thought best to assign it to our class, since we were in the honours program and could learn on our own.
Recently, the Canadian state broadcaster ran an article about how an otherwise reasonable Canadian university accepted 4000 foreign students, of whom only 38 showed up at all, and only 5 stuck with their classes during the first term. It appears that they contributed to a scam meant to facilitate immigration.
Apple triumphantly announced its ground-breaking ‘Apple Intelligence’ feature, only to restrict it so much that it became irrelevant.
For decades, the leading chip maker in the world was Intel. Intel had such a lead over its competitors that it was believed that attempts at competing against it were useless. Today, few would say that Intel makes the best processors. Its competitors quickly caught up.
Such dysfunctions can be explained by many effects. But I believe that a critical one is a culture of lies. It might be fine to lie to your enemy but once you start lying to yourself, you are inviting trouble. Lies about your objectives are the worst kind.
So you run a college. And you tell prospective students that the quality of their education is your primary concern. But it is a lie. In fact, you want to keep your enrolment at healthy levels and not have too much trouble with your faculty. You do not care very much for what the students learn. In truth, it does not matter because the employers also do not care. You selected bright people and your students managed to fulfil bureaucratic obligations for four years, and that’s good enough.
So what do you do with the esteemed professor who lectures by reading line-by-line in the textbook? Does he lower your enrolment? No. He does not. He is a prestigious researcher who won many awards, so he might contribute to the prestige of your university and even attract students. Trying to get him to learn to teach would mean going against the faculty, which is not in your interest as an administrator. Of course, the students are having a terrible experience, but what does it matter? They will get their grades and they will graduate. And if they tell their story, nobody will care.
Or, as in the example of the Canadian university, you are happy to have enrolled thousands of students. That these students were not actual students does not matter. You look really good on paper, at least for a time. And if you get caught… well, you did your best, didn’t you?
Apple probably did not really want to get into consumer AI. They had a really long time to do something with Siri, and they did nothing at all. Apple probably wants to sell iPhones, watches and laptops. They want to look like they are doing something about AI, but they do not really care.
As for Intel, I don’t have a good theory but I suspect that what happened is the following classic scenario. Intel’s leadership likely touted “we’re still number one” while ignoring warning signs. Nobody cared about long-term technological relevance. They were crushing the competition on the market. It seemed like they had many new things upcoming. It was good enough.
It is not that people running colleges, Intel or Apple are incapable of succeeding. They all succeed on their own terms. They are just not sharing openly what these terms are. And if pushed, they will lie about their goals.
So, how can really smart people appear totally incompetent? Because they are lying to you. What they are openly setting out to achieve is not their real goal.
Given a choice, prefer people who are brutally honest and highly capable. You are much more likely to see them exceed your expectations.
2025-04-08 03:41:27
Do large language models (AI) make you 3x faster or only 3% faster? The answer depends on the quality of the work you are producing.
If you need something like a stock photo but not much beyond that, AI can make you 10x faster.
If you need a picture taken at the right time of the right person, AI doesn’t help you much.
If you need a piece of software that an intern could have written, AI can do it 10x faster than the intern.
If you need a piece of software that only 10 engineers in the country can understand, AI doesn’t help you much.
The effect is predictable: if your skill level is low, finding work becomes more difficult. However, if you are a highly skilled individual, you can eliminate much of the boilerplate work and focus on what matters. Thus, elite people are going to become even more productive.
2025-04-07 04:26:56
Go’s standard library lets you shuffle a slice with rand.Shuffle:

```go
import "math/rand/v2"

func shuffleStandard(data []uint64) {
    rand.Shuffle(len(data), func(i, j int) {
        data[i], data[j] = data[j], data[i]
    })
}
```
A bounded random integer can be computed without division, using a 64-bit multiplication and occasional rejection:

```go
package main

import (
    "math/bits"
    "math/rand/v2"
)

// randomBounded returns a uniform random integer in [0, rangeVal).
func randomBounded(rangeVal uint64) uint64 {
    random64bit := rand.Uint64()
    hi, lo := bits.Mul64(random64bit, rangeVal)
    leftover := lo
    if leftover < rangeVal {
        threshold := -rangeVal % rangeVal
        for leftover < threshold {
            random64bit = rand.Uint64()
            hi, lo = bits.Mul64(random64bit, rangeVal)
            leftover = lo
        }
    }
    return hi // uniform in [0, rangeVal)
}
```
```go
func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k, bound uint64) uint64 {
    r := rand.Uint64()
    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }
    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound
        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], storage[pos1]
    }
    return bound
}
```
The partialShuffle64b function performs a partial shuffle of a slice by generating k random indices from a range of size n, using a single 64-bit random number r obtained from rand.Uint64(), and then swapping elements at those indices. I reuse an array where the indices are stored. Though it looks much different, it is essentially a generalization of the randomBounded routine already used by the Go standard library. We can use it to shuffle while generating the indices in batches of two:

```go
func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}
```
We can also generalize further and generate the indices in larger batches, up to six at a time, when possible:
```go
func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}
```
Since Go comes with built-in support for benchmarking, we can shuffle 10,000 elements as a Go benchmark:
```go
func BenchmarkShuffleStandard10K(b *testing.B) {
    size := 10000
    data := make([]uint64, size)
    for i := 0; i < size; i++ {
        data[i] = uint64(i)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        shuffleStandard(data)
    }
}

func BenchmarkShuffleBatch2_10K(b *testing.B) { /* same body, calling shuffleBatch2 */ }

func BenchmarkShuffleBatch23456_10K(b *testing.B) { /* same body, calling shuffleBatch23456 */ }
```
Go’s rand.Shuffle is a solid baseline, but batching the random integer generation can make it much faster. By deriving multiple random indices from a single 64-bit random value, we boost efficiency by over 2x in our benchmarks.
References

- Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random Integer Generation, Software: Practice and Experience 55 (1), 2024.
- Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019.
Appendix: Code
```go
package main

// Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random
// Integer Generation, Software: Practice and Experience 55 (1), 2024.
// Daniel Lemire, Fast Random Integer Generation in an Interval,
// ACM Transactions on Modeling and Computer Simulation, 29 (1), 2019.

import (
    "math/bits"
    "math/rand/v2"
)

func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}

func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}

func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k, bound uint64) uint64 {
    r := rand.Uint64()
    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }
    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound
        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], storage[pos1]
    }
    return bound
}
```
2025-03-29 09:44:47
Most mobile devices use 64-bit ARM processors. A growing number of servers (Amazon, Microsoft) also use 64-bit ARM processors.
These processors support special instructions called ARM NEON, providing the form of parallelism known as Single Instruction, Multiple Data (SIMD). For example, you can compare sixteen values with sixteen other values using one instruction.
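As a minimal sketch (the function name is mine), the vceqq_u8 intrinsic compares sixteen pairs of bytes at once, typically compiling to a single cmeq instruction:

```c
#include <arm_neon.h>

// Compare sixteen bytes from a against sixteen bytes from b.
// Each byte of the result is 0xFF where the inputs are equal, 0x00 otherwise.
uint8x16_t compare_sixteen(const uint8_t *a, const uint8_t *b) {
    uint8x16_t va = vld1q_u8(a); // load 16 bytes
    uint8x16_t vb = vld1q_u8(b); // load 16 bytes
    return vceqq_u8(va, vb);     // one vector comparison (cmeq)
}
```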
Some of the most recent ARM processors also support even more advanced instructions called SVE, the Scalable Vector Extension. More extensions have been added over time: SVE2 and SVE2.1.
While ARM NEON’s registers are fixed at 128 bits, SVE registers are a multiple of 128 bits. In practice, SVE registers are most often 128 bits, although there are exceptions. The Amazon Graviton 3 is based on the ARM Neoverse V1 core, which supports SVE with a vector length of 256 bits (32 bytes). The Graviton 4 is based on the ARM Neoverse V2 core, which supports SVE2 with a vector length of 16 bytes, like NEON.
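If you want to check what you get at runtime, the svcntb intrinsic returns the number of bytes in an SVE vector; a small sketch (to be compiled with SVE enabled, e.g., -march=armv8-a+sve):

```c
#include <arm_sve.h>
#include <stdio.h>

int main(void) {
    // svcntb() returns the number of 8-bit lanes in an SVE vector:
    // e.g., 32 on Graviton 3, 16 on Graviton 4.
    printf("SVE vector length: %llu bytes\n", (unsigned long long)svcntb());
    return 0;
}
```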
SVE provides conversion functions, such as svset_neonq_u8 used below, to move between NEON and SVE registers. These functions are ‘free’: they are likely compiled to no instruction whatsoever.
Let us illustrate with an example. In a recent post, I discussed how it is a bit complicated to check whether there is a non-zero byte in a NEON register. A competitive solution is as follows:
```c
int veq_non_zero_max(uint8x16_t v) {
    return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}
```
Effectively, we compute the maximum 32-bit integer in the register, considering it as four 32-bit integers. The function compiles down to three essential instructions: umaxv, fmov and a comparison (cmp).
Mixing NEON and SVE, we can write the check as follows:

```c
int sve_non_zero_max(uint8x16_t nvec) {
    svuint8_t vec;
    vec = svset_neonq_u8(vec, nvec);
    svbool_t mask = svwhilelt_b8(0, 16);
    svbool_t cmp = svcmpne_n_u8(mask, vec, 0);
    return svptest_any(mask, cmp);
}
```
Except for the initialization of mask, the function consists of two instructions, cmpne and cset. These two instructions may be fused into one in some ARM cores. Even though the code mixing NEON and SVE looks more complicated, it should be more efficient.
If you know that your target processor supports SVE (or SVE 2 or SVE 2.1), and you already have ARM NEON code, you could try adding bits of SVE to it.