2025-05-04 03:44:26
In C++, templates enable generic programming by allowing functions and classes to operate on different data types without sacrificing type safety. Defined using the template keyword, they let developers write reusable, type-agnostic code, such as functions (e.g., template <typename T> T max(T a, T b)) or classes (e.g., std::vector), where the type T is specified at compile time.
Historically, the C++ language has tended to produce complicated compiler error messages. The main culprit is template metaprogramming. C++ templates are powerful but complex. When errors occur in template code, the compiler generates long, verbose messages with nested type information, often involving deep template instantiations. A simple mistake in a template function can produce a message spanning multiple lines with obscure type names.
Let us consider an example. In C++, we often use the ‘Standard Template Library (STL)’. It includes a useful dynamic array template: std::vector. A vector manages a sequence of elements with automatic memory handling and flexible sizing. Unlike fixed-size arrays, it can grow or shrink at runtime through operations like push_back to append elements or pop_back to remove them. You can store just about anything in an std::vector but there are some limits. For example, your type must be copyable.
Let us create a C++ type that is not copyable:
struct non_copyable {
    non_copyable() = default;
    non_copyable(const non_copyable&) = delete; // No copy
};
If we try to create an empty std::vector with non_copyable, it seems to work:
std::vector<non_copyable> v;
However, once you try to actually work with the std::vector instance, you may get verbose error messages. Let me write a vectorize function template that takes a single value of any type and returns a std::vector containing that value as its sole element. If I try to call this vectorize function template with my non_copyable type, I get in trouble:
template <typename type>
std::vector<type> vectorize(type&& t) {
    return {t};
}

void g() {
    non_copyable m;
    // works:
    std::vector<non_copyable> v;
    // fails:
    vectorize(m);
}
It does not work simply because the compiler tries to copy a non_copyable instance. It is not a difficult error. Yet the compiler error message can be epic. For example, you might get the following error:
include/c++/16.0.0//bits/allocator.h:133:30: note: in instantiation of template class 'std::__new_allocator' requested here
    | class allocator : public __allocator_base<_Tp>
    |       ^
include/c++/16.0.0/ext/alloc_traits.h:46:47: note: in instantiation of template class 'std::allocator' requested here
    | template
    |       ^
include/c++/16.0.0/bits/stl_vector.h:93:35: note: in instantiation of default argument for '__alloc_traits<std::allocator>' required here
    | typedef typename __gnu_cxx::__alloc_traits<_Alloc>::template
    |       ^~~~~~~~~~~~~~~~~~~~~~
include/c++/16.0.0/bits/stl_vector.h:458:30: note: in instantiation of template class 'std::_Vector_base>' requested here
    | class vector : protected _Vector_base<_Tp, _Alloc>
    |       ^
:50:5: note: in instantiation of template class 'std::vector' requested here
    | vectorize(m);
C++ concepts, introduced in C++20, are a compile-time mechanism for defining and enforcing constraints on template parameters. Using the concept keyword, developers can specify requirements that a type must satisfy to be used with a template, such as having certain operations, member functions, or inheriting from a specific base class. Concepts improve error messages by catching type mismatches early and make code more expressive by documenting intent. They can be applied directly in template declarations (e.g., template <std::integral T>).
In our case, we can define a concept for types that can be used in an std::vector instance: we require the type to be destructible, copyable, and default constructible.
template <typename T>
concept vector_element = requires(T a, T b) {
    // Must be destructible
    requires std::destructible<T>;
    // Must be copy constructible
    requires std::copy_constructible<T>;
    // Must be copy assignable
    requires std::assignable_from<T&, T>;
    // Must be default constructible (for operations like resize)
    requires std::default_initializable<T>;
};
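We can verify such a concept at compile time with static_assert. Here is a minimal sketch, assuming the vector_element concept and the non_copyable struct from earlier are in scope:

#include <concepts>
#include <string>

static_assert(vector_element<int>);
static_assert(vector_element<std::string>);
static_assert(!vector_element<non_copyable>); // rejected: not copy constructible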
Concepts in C++20 are general. For example, if I wanted to require that an instance can be compared with itself, I might define the following concept:
template <typename T>
concept equality_comparable = requires(T a, T b) {
    { a == b } -> std::convertible_to<bool>;
};
You can similarly define a concept for a type like std::vector that supports a ‘push_back’ method.
template <typename T>
concept pushable = requires(T a, typename T::value_type val) {
    { a.push_back(val) };
};
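For illustration, here is a hypothetical helper (the name append_twice is mine) constrained by this pushable concept; it accepts std::vector<int> but is rejected at compile time for a type such as std::array<int, 4>, which has no push_back:

#include <array>
#include <vector>

template <pushable C>
void append_twice(C& c, const typename C::value_type& v) {
    c.push_back(v);
    c.push_back(v);
}

void demo() {
    std::vector<int> v;
    append_twice(v, 7);     // fine: std::vector<int> satisfies pushable
    // std::array<int, 4> a{};
    // append_twice(a, 7);  // would not compile: no push_back member
}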
In any case, let me now write a new vectorize function with my newly defined vector_element concept:
template <vector_element type>
std::vector<type> safe_vectorize(type&& t) {
    return {t};
}
And this time, you are going to get a much better error message when writing the following erroneous code:
void g() {
    non_copyable m;
    safe_vectorize(m);
}
For example, you may get the following error:
error: no matching function for call to 'safe_vectorize'
note: candidate template ignored: constraints not satisfied [with type = non_copyable &]
note: because 'non_copyable &' does not satisfy 'vector_element'
C++ templates, while powerful for enabling generic and reusable code, often lead to complex and verbose error messages, particularly when misused, as seen with the std::vector and non_copyable example. The introduction of C++20 concepts allows developers to enforce type constraints explicitly, resulting in clearer, more concise error diagnostics. By using concepts like vector_element, programmers can catch errors early and improve code readability.
The adoption of C++20 has been excellent, both by compiler vendors and by major users. If you can, I recommend you take C++ concepts out for a spin.
2025-04-21 04:40:40
In software, we often use key-value data structures, where each key is unique and maps to a specific value. Common examples include dictionaries in Python, hash maps in Java, and objects in JavaScript. If you combine arrays with key-value data structures, you can represent most data.
In C++, we have two standard key-value data structures: the std::map and the std::unordered_map. Typically, the std::map is implemented as a tree (e.g., a red-black tree) with sorted keys, providing logarithmic time complexity O(log n) for lookups, insertions, and deletions, and maintaining keys in sorted order. The std::unordered_map is typically implemented as a hash table with unsorted keys, offering average-case constant time complexity O(1) for lookups, insertions, and deletions. In the std::unordered_map, the hash table uses a hashing function to map keys to indices in an array of buckets. Each bucket is essentially a container that can hold multiple key-value pairs, typically implemented as a linked list. When a key is hashed, the hash function computes an index corresponding to a bucket. If multiple keys hash to the same bucket (a collision), they are stored in that bucket’s linked list.
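As a quick illustration of the two containers (a minimal sketch, not tied to any particular program):

#include <map>
#include <string>
#include <unordered_map>

void demo() {
    // Keys kept in sorted order; lookups and insertions in O(log n).
    std::map<std::string, int> sorted_counts;
    // Hash table; average O(1) lookups and insertions, unspecified key order.
    std::unordered_map<std::string, int> hashed_counts;

    sorted_counts["apple"] = 3;  // insert or update a key-value pair
    hashed_counts["apple"] = 3;

    // Iterating over sorted_counts visits keys in sorted order;
    // iterating over hashed_counts visits them in an arbitrary order.
}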
Quite often, we only need to look at the keys, or look at the values. The C++20 standard makes this convenient through the introduction of ranges (std::ranges::views::keys and std::ranges::views::values). Let us consider two functions using the ‘modern’ functional style. The first function sums the values and the next function counts how many keys (assuming that they are strings) start with a given prefix.
template <typename map_type>
auto sum_values(const map_type& map) {
    auto values = map | std::ranges::views::values;
    return std::accumulate(values.begin(), values.end(),
                           typename map_type::mapped_type{});
}

template <typename map_type>
size_t count_keys_with_prefix(const map_type& map, std::string_view prefix) {
    auto keys = map | std::ranges::views::keys;
    return std::ranges::count_if(keys, [prefix](std::string_view key) {
        return key.starts_with(prefix);
    });
}
We can also write equivalent functions that still use C++20 ranges but iterate with explicit loops (these correspond to the 'C++20 loop' rows in the tables below):

template <typename map_type>
auto sum_values_daniel(map_type& map) {
    typename map_type::mapped_type sum{};
    for (const auto& value : map | std::ranges::views::values) {
        sum += value;
    }
    return sum;
}

template <typename map_type>
size_t count_keys_with_prefix_daniel(const map_type& map, std::string_view prefix) {
    size_t count = 0;
    for (const auto& key : map | std::ranges::views::keys) {
        if (key.starts_with(prefix)) {
            ++count;
        }
    }
    return count;
}
When you do not have C++20, you can write the same functions using a more conventional style:
template <typename map_type>
typename map_type::mapped_type sum_values_cpp11(const map_type& map) {
    typename map_type::mapped_type sum = typename map_type::mapped_type();
    for (typename map_type::const_iterator it = map.begin(); it != map.end(); ++it) {
        sum += it->second;
    }
    return sum;
}

template <typename map_type>
size_t count_keys_with_prefix_cpp11(const map_type& map, const std::string& prefix) {
    size_t count = 0;
    for (typename map_type::const_iterator it = map.begin(); it != map.end(); ++it) {
        const std::string& key = it->first;
        if (key.size() >= prefix.size() && key.compare(0, prefix.size(), prefix) == 0) {
            ++count;
        }
    }
    return count;
}
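Any of these variants can be called the same way. A minimal usage sketch (the sample data is made up):

#include <iostream>
#include <map>
#include <string>

void demo() {
    std::map<std::string, int> m{{"key_a", 1}, {"key_b", 2}, {"other", 3}};
    std::cout << sum_values(m) << '\n';                     // prints 6
    std::cout << count_keys_with_prefix(m, "key_") << '\n'; // prints 2
}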
Let us compare the performance. I have generated a large key-value map with a thousand string keys and corresponding integer values. All my keys begin with ‘key_’ and I ask my functions to count how many keys begin with ‘key_’ (that should be all of them). I use both an std::map and an std::unordered_map as the data source. I record how many nanoseconds I need to spend per key checked. I use two machines, one with an Intel Ice Lake processor running at 3.2 GHz with GCC 12 and LLVM 16 as the compilers and one with an Apple M2 processor running at 3.5 GHz with LLVM 15 as the compiler. They do not use the same instruction set, so we cannot directly compare the number of instructions retired: we expect an ARM-based system like the Apple M2 to use a few more instructions.
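The test data might be generated along these lines (a sketch; the exact benchmark harness is not shown here, and make_test_map is an illustrative name):

#include <map>
#include <string>
#include <unordered_map>

// Builds a map with keys "key_0", "key_1", ... and integer values.
template <typename map_type>
map_type make_test_map(size_t count) {
    map_type m;
    for (size_t i = 0; i < count; i++) {
        m["key_" + std::to_string(i)] = static_cast<int>(i);
    }
    return m;
}

// e.g., auto m1 = make_test_map<std::map<std::string, int>>(1000);
//       auto m2 = make_test_map<std::unordered_map<std::string, int>>(1000);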
Using an std::map I get the following timings and instruction counts:
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 2.2 ns | 5.5 ns | 5.8 ns |
C++20 loop | 2.0 ns | 5.5 ns | 5.8 ns |
C++11 | 4.0 ns | 5.5 ns | 5.5 ns |
C++11 with starts_with | 2.0 ns | 5.5 ns | 5.8 ns |
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 46 instructions | 29 instructions | 49 instructions |
C++20 loop | 41 instructions | 29 instructions | 49 instructions |
C++11 | 58 instructions | 30 instructions | 30 instructions |
C++11 with starts_with | 43 instructions | 30 instructions | 49 instructions |
Using an std::unordered_map, I get the following timings and instruction counts:
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 1.1 ns | 3.8 ns | 3.8 ns |
C++20 loop | 0.9 ns | 5.4 ns | 3.8 ns |
C++11 | 0.9 ns | 5.4 ns | 3.8 ns |
C++11 with starts_with | 0.9 ns | 5.4 ns | 3.8 ns |
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 35 instructions | 10 instructions | 30 instructions |
C++20 loop | 30 instructions | 13 instructions | 30 instructions |
C++11 | 26 instructions | 14 instructions | 12 instructions |
C++11 with starts_with | 32 instructions | 14 instructions | 30 instructions |
We find that iterating over an std::map instance is slower than iterating over an std::unordered_map instance.
Performance comparisons between Apple M2/LLVM15 and Intel Ice Lake/GCC12 systems reveal striking differences in C++ code execution. While functional approaches (e.g., C++20 ranges) can outperform traditional loops by up to 40% under GCC 12, using starts_with for string comparisons roughly halves the running time for std::map on Apple/LLVM15.
Despite small input sizes fitting in cache, Intel’s iteration performance is not compute-bound, achieving under 2 instructions per cycle (IPC), while Apple’s exceeds 4 IPC, highlighting its architectural advantage in optimizing data access patterns.
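To make the arithmetic concrete with the std::map numbers above: on Ice Lake at 3.2 GHz, 5.5 ns per key is about 17.6 cycles, so roughly 29 / 17.6 ≈ 1.6 instructions per cycle; on the M2 at 3.5 GHz, 2.2 ns is about 7.7 cycles, so roughly 46 / 7.7 ≈ 6 instructions per cycle.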
2025-04-14 06:36:41
When trying to write fast functions operating over many bytes, we sometimes use ‘SWAR’. SWAR stands for SIMD Within A Register, and it is a technique to perform parallel computations on multiple data elements packed into a single word. It treats a register (e.g., 64-bit) as a vector of smaller units (e.g., 8-bit bytes) and processes them simultaneously without explicit SIMD instructions.
For example, if you have 8 bytes in a 64-bit word x, computing (x - 0x2020202020202020) subtracts 32 (0x20 in hex) from each byte: e.g., 0x4142434445464748 - 0x2020202020202020 is 0x2122232425262728, each byte reduced by 0x20. It works well if each byte value is greater than or equal to 32. Otherwise, the operation ‘overflows’: if the least significant byte is smaller than 32, then its most significant bit is set, and you cannot rely on the other more significant byte results.
It works well if all bytes are ‘ASCII’ (they have their most significant bit unset). When that is the case, you can check if any byte value is less than 32 by computing (x - 0x2020202020202020) and checking if any most significant bit is set, by masking the result: (x - 0x2020202020202020) & 0x8080808080808080.
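As a standalone sketch of that check (assuming, as stated, that every input byte is ASCII):

#include <cstdint>

// Returns true if any byte of x is less than 32.
// Assumes all bytes of x have their most significant bit unset (ASCII).
bool has_byte_below_32(uint64_t x) {
    return ((x - 0x2020202020202020ULL) & 0x8080808080808080ULL) != 0;
}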
Similarly, you can check whether the byte value 34 (0x22) appears simply by first doing an XOR which sets any such byte value to zero, (x ^ 0x2222222222222222), and then subtracting 1 from each byte value: (x ^ 0x2222222222222222) - 0x0101010101010101. It then suffices to mask with 0x80…80 again to see whether any byte value was 34, as long as all input byte values were ASCII.
Thankfully, it is easy to select only the byte values that are ASCII: (0x8080808080808080 & ~x) will do.
Putting these pieces together, we can check whether a 64-bit word contains any byte that must be escaped in a JSON string: a control character (value less than 32), a quote character (34), or a backslash (92).

bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    uint64_t lt32 = (x - 0x2020202020202020ULL);
    uint64_t sub34 = x ^ 0x2222222222222222ULL;
    uint64_t eq34 = (sub34 - 0x0101010101010101ULL);
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32 | eq34 | eq92) & is_ascii) != 0;
}
That’s slightly more than one operation per input byte.
The following routine is even better (credit to Falk Hüffner):
bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    uint64_t xor2 = x ^ 0x0202020202020202ULL;
    uint64_t lt32_or_eq34 = xor2 - 0x2121212121212121ULL;
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32_or_eq34 | eq92) & is_ascii) != 0;
}
You might be able to do even better.
2025-04-12 02:56:26
It is often puzzling to encounter organizations run by highly capable and ambitious people… that appear dysfunctional.
An example that I like is colleges that claim to provide a great learning experience… while providing their students with recycled lectures given by contract lecturers…
I like to tell the tale of this math class I took at the University of Toronto. The esteemed professor would show up with the textbook and a ruler. He would sit down, open the book to the page where the ruler was last left, and read each line of the textbook, slowly moving the ruler down. He would not look at us. He would just read, very slowly, page by page, the content of the textbook. When we showed up to the math department to complain, we were told that the esteemed professor had to teach something, and that they thought best to assign it to our class, since we were in the honours program and could learn on our own.
Recently, the Canadian state broadcaster ran an article about how an otherwise reasonable Canadian university accepted 4000 foreign students, of whom only 38 showed up at all, and only 5 stuck with their classes during the first term. It appears that they contributed to a scam meant to facilitate immigration.
Apple announced triumphantly its ground-breaking ‘Apple Intelligence’ feature… only to restrict it so much that it becomes irrelevant.
For decades, the leading chip maker in the world was Intel. Intel had such a lead over its competitors that it was believed that attempts at competing against it were useless. Today, few would say that Intel makes the best processors. Its competitors quickly caught up.
Such dysfunctions can be explained by many effects. But I believe that a critical one is a culture of lies. It might be fine to lie to your enemy but once you start lying to yourself, you are inviting trouble. Lies about your objectives are the worst kind.
So you run a college. And you tell prospective students that the quality of their education is your primary concern. But it is a lie. In fact, you want to keep your enrolment at healthy levels and not have too much trouble with your faculty. You do not care very much for what the students learn. In truth, it does not matter because the employers also do not care. You selected bright people and your students managed to fulfil bureaucratic obligations for four years, and that’s good enough.
So what do you do with the esteemed professor who lectures by reading line-by-line in the textbook? Does he lower your enrolment? No. He does not. He is a prestigious researcher who won many awards, so he might contribute to the prestige of your university and even attract students. Trying to get him to learn to teach would mean going against the faculty, which is not in your interest as an administrator. Of course, the students are having a terrible experience, but what does it matter? They will get their grades and they will graduate. And if they tell their story, nobody will care.
Or, as in the example of the Canadian university, you are happy to have enrolled thousands of students. That these students were not actual students does not matter. You look really good on paper, at least for a time. And if you get caught… well, you did your best, didn’t you?
Apple probably did not really want to get into consumer AI. They had a really long time to do something with Siri, and they did nothing at all. Apple probably wants to sell iPhones, watches and laptops. They want to look like they are doing something about AI, but they don’t really care.
As for Intel, I don’t have a good theory but I suspect that what happened is the following classic scenario. Intel’s leadership likely touted “we’re still number one” while ignoring warning signs. Nobody cared about long-term technological relevance. They were crushing the competition on the market. It seemed like they had many new things upcoming. It was good enough.
It is not that people running colleges, Intel or Apple are incapable of succeeding. They all succeed on their own terms. They are just not sharing openly what these terms are. And if pushed, they will lie about their goals.
So, how can really smart people appear totally incompetent? Because they are lying to you. What they are openly setting out to achieve is not their real goal.
Given a choice, prefer people who are brutally honest and highly capable. You are much more likely to see them exceed your expectations.
2025-04-08 03:41:27
Do large language models (AI) make you 3x faster or only 3% faster? The answer depends on the quality of the work you are producing.
If you need something like a stock photo but not much beyond that, AI can make you 10x faster.
If you need a picture taken at the right time of the right person, AI doesn’t help you much.
If you need a piece of software that an intern could have written, AI can do it 10x faster than the intern.
If you need a piece of software that only 10 engineers in the country can understand, AI doesn’t help you much.
The effect is predictable: finding work if your skill level is low becomes more difficult. However, if you are a highly skilled individual, you can eliminate much of the boilerplate work and focus on what matters. Thus, elite people are going to become even more productive.
2025-04-07 04:26:56
import "math/rand/v2" func shuffleStandard(data []uint64) { rand.Shuffle(len(data), func(i, j int) { data[i], data[j] = data[j], data[i] }) }
A shuffle repeatedly needs a random integer in an interval. The following routine produces an integer in [0, rangeVal) from a 64-bit random word using a multiplication, with occasional rejection to avoid bias:

package main

import (
    "math/bits"
    "math/rand/v2"
)

func randomBounded(rangeVal uint64) uint64 {
    random64bit := rand.Uint64()
    hi, lo := bits.Mul64(random64bit, rangeVal)
    leftover := lo
    if leftover < rangeVal {
        threshold := -rangeVal % rangeVal
        for leftover < threshold {
            random64bit = rand.Uint64()
            hi, lo = bits.Mul64(random64bit, rangeVal)
            leftover = lo
        }
    }
    return hi // [0, range)
}
The core routine below generalizes this idea to produce several indices from a single 64-bit random word:

func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k, bound uint64) uint64 {
    r := rand.Uint64()
    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }
    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound
        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], storage[pos1]
    }
    return bound
}
The partialShuffle64b function performs a partial shuffle of a slice by generating k random indices from a range of size n, using a single 64-bit random number r obtained from rand.Uint64(), and then swapping elements at those indices. I reuse an array where the indices are stored. Though it looks much different, it is essentially a generalization of the randomBounded routine already used by the Go standard library. Using it, we can consume two indices per 64-bit random word whenever the range is small enough:

func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}
We can also generalize and generate the indices in batches of six when possible:
func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}
Using the fact that Go comes with full support for benchmarking, we can try shuffling 10,000 elements as a Go benchmark:
func BenchmarkShuffleStandard10K(b *testing.B) {
    size := 10000
    data := make([]uint64, size)
    for i := 0; i < size; i++ {
        data[i] = uint64(i)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        shuffleStandard(data)
    }
}

func BenchmarkShuffleBatch2_10K(b *testing.B) { ... }

func BenchmarkShuffleBatch23456_10K(b *testing.B) { ... }
Go’s rand.Shuffle is a solid baseline, but batching random integer generation can make it much faster. By generating multiple random numbers from a single 64-bit value, we can boost efficiency—by over 2x in our benchmarks.
References

Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random Integer Generation, Software: Practice and Experience 55 (1), 2024.

Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation, 29 (1), 2019.

Appendix: Code

package main

import (
    "math/bits"
    "math/rand/v2"
)

func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}

func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}

func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k, bound uint64) uint64 {
    r := rand.Uint64()
    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }
    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound
        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], storage[pos1]
    }
    return bound
}