2026-06-10 02:39:32

The keyword in politics these days is ‘sovereign’.
What few will admit is that it is effectively the adoption of the American strategy: Make America Great Again. In other words, reindustrialization of key sectors of the economy. The UK used to be a computing champion. Our chip designs (ARM) originated from the UK. Canada had BlackBerry, everyone was using Canadian phones.
Like Canada, many countries have progressively slid into financialization. Huge banks and bank-related businesses, surrounded by emptied factories.
Part of it was the doing of economists who promoted globalization. We are going to make our best CPUs in Taiwan, because they have a comparative advantage (whatever that means).
Another part is the rise of the managerial class, or our version of the technocracy: the summum of the status game is to make PowerPoint presentations in a nice office. Everyone has 2 or 3 university degrees. And if you don’t have many degrees, what is wrong with you?
I think that this is breaking apart for a few reasons.
One of them is Trump. And I don’t mean bombastic statements, bad hair color or overly long ties… But rather the realization that globalization might leave your people economically better off for a time because they have cheap stuff… But it also leaves your people with few skills. You can fund a robotics factory near Montreal, and you’ll find 2000 people with robotics PhDs, but nobody actually knows how to build robots. The analogy is thus: salary is not everything (to the great chagrin of economists). I have left a much better paying job. My previous job meant that I had to sit in an office and do little concrete. I would have had a great and early retirement… but I would never have developed my skills nor would I have built anything. And that’s what we did at the country level. Great total compensation, but a dead-end skill-wise.
Another factor, I believe, is the COVID era (2020-2023) and its final outcome: empty offices, closed coffee shops. What happened at work was illegible. Lots of people in offices. Certainly, something important was happening. Entire businesses and government organizations have now migrated partially or entirely to a pajama party of some kind. Netflix in the middle of the workday is no longer a dream, but a reality.
Another element is educational misalignment. A country like Canada has the most schooled population in history. You cannot throw a rock without hitting someone with a PhD. Meanwhile, we are not making robots or microchips. You can’t even pay with your phone in the Montreal subway. It is a project for another decade, maybe. We are using a push strategy: push more people with degrees into the economy and you are going to get a fancier economy. Won’t work.
Finally, the AI breakthrough of 2022 is the final nail in the proverbial coffin. My country (Canada) claimed for decades that we were the AI powerhouse. All these PDFs online can’t lie, can they? Canada basically invited modern AI, didn’t it? We did. On paper. On paper we did a great many things. In practice? Few know how to build anything.
In a country like Canada, the population has not yet caught up. They blame the orange man for whatever trouble they see. And the politicians promise to do what they must: shower money to build tech sovereignty. It won’t work. They will try again. It won’t work again. The cycle makes things worse because it sustains a managerial class that is great at politics but terrible at building.
Meanwhile, you can’t escape preference falsification. People will use ChatGPT, Claude, Grok, Gemini. And if Elon produces robots, they’ll want them.
Trump will leave office in two years… Canada and the UK will still be flat-lined economically. The USA will still surge ahead.
Countries like Canada and the UK will have to realize that it is industry and know-how first. Build stuff and the wealth will follow. Stop the virtue signaling. Stop the credentialism. Build.
2026-06-07 04:25:19

Our 64-bit Intel and AMD processors have evolved over decades. When you compile a Go program for a 64-bit Intel or AMD processor, the compiler targets, by default, a nearly 20-year-old instruction set. The binary that comes out runs on essentially any x64 chip, but it also leaves on the table every instruction that was added since 2003.
We often refer to microarchitecture levels. Each level bundles a set of instruction-set extensions that you can assume are present:
| Level | Adds (roughly) |
|---|---|
| v1 | the original AMD64 baseline (SSE2) |
| v2 |
popcnt, SSE4.2 |
| v3 | AVX2 |
| v4 | AVX-512 (F/BW/DQ/VL) |
In my view, this ladder is already slightly obsolete. It was frozen around 2020, and the hardware has moved on. We would need to add the latest AVX-512 sub-extensions (VBMI, VBMI2, VNNI, BF16, FP16, VPOPCNTDQ, and so on), which recent server and consumer chips support but which v4 does not require. While v1 through v4 are a useful common language, a realistic “use everything this CPU offers” target today would need at least a v5, and arguably the whole scheme should be replaced by finer-grained feature detection.
In any case, the Go toolchain exposes this v1 through v4 ladder via the GOAMD64 environment variable. Setting GOAMD64=v3 tells the compiler it may use everything up to and including AVX2. The default is v1, the lowest common denominator.
This raises an obvious question. If I take a real, performance-sensitive library and recompile it at each level, how much do I actually gain? I picked Roaring Bitmaps, a compressed bitset data structure used in databases and search engines.
A Roaring Bitmap stores a set of 32-bit integers. It splits the 32-bit space into chunks of 65,536 values, keyed by the high 16 bits, and stores each chunk in a container that holds only the low 16 bits. A container comes in one of three shapes, and the library always keeps whichever is smallest:
[start, length] intervals, used when the set bits cluster into consecutive runs.I fetched the latest release of the library, then ran its own benchmark suite four times, once per level, collecting eight samples each. I did this on a single Intel Xeon Gold 6548N (Emerald Rapids, which supports all four levels, including AVX-512) under Go 1.26.2 and Roaring v2.18.2.
A population count (or popcount, also called the Hamming weight) is simply the number of bits set to 1 in a machine word. Roaring leans on it constantly: the cardinality of a bitmap container, how many values it holds, is the sum of the population counts of its 1024 64-bit words. Modern x86 chips have a dedicated popcnt instruction that does this in a single operation, but it only became available at the v2 level (SSE4.2, 2008). Without it, the compiler has to fall back to a multi-instruction bit-twiddling sequence.
The clearest single result is population count: counting the number of set bits in a bitmap container. The v1 baseline cannot use the popcnt instruction, so Go emits a software fallback. The moment we move to v2, popcnt becomes available and the time is cut almost in half:
That is a 43% reduction, and it is free: no source change, just a compiler flag. Notice, though, that v3 and v4 do nothing more. A single popcnt instruction is already optimal; as far as the Go compiler is concerned, AVX2 and AVX-512 have nothing to add.
Population count is the easy win. What about the rest of the library?
Another clear win is building a container from a dense bitmap. The FromDense array benchmark takes a raw 8 KB bit vector and constructs the most compact container for it: it popcounts every word to learn the cardinality, then scans out the positions of the set bits. That word-at-a-time popcount-and-scan loop is exactly what the compiler can auto-vectorize once 256-bit registers are available, so the gains keep coming past v2:
v2 already cuts 21% by using scalar popcnt/tzcnt instructions, and v3 (AVX2) nearly doubles that to a 38% reduction. As with popcount, v4 adds nothing.
Set operations show the same pattern. The IntersectionCardinality benchmark counts how many values two bitmaps have in common: for bitmap containers, it ANDs the words pairwise and population-counts the result, without ever materializing the intersection. Here v2 does essentially nothing (the scalar popcnt is already in the inner loop), but v3 lets the compiler widen the AND-and-count loop to 256-bit registers, cutting the time by 22%:
Takeaways:
v2 or better. The resulting binary will run in any data center and on any non-ancient laptop.v3 level might be worth investigating.v4 level should have helped in some of my benchmarks, but it did not. I suspect that the Go compiler is just not great at it.(Obviously: run your own benchmarks.)
2026-05-29 07:04:35

Where is your intelligence located? In your brain?
It is a simplistic answer. A better model is that your intelligence is embodied.
Consider a cook working at an expensive restaurant. He has all his favorite knives and cooking instructions, placed exactly where he wants them. His kitchen is part of his intelligence, of his skills. The same cook working in your kitchen can probably cook better than you do, but he can’t reproduce the same meals he would prepare in his favorite kitchen.
We often assess computer programmers using whiteboard tests. It is an endless source of complaints. Programmers rightly point out that it forces them out of their element. They are just not as good when you take away their laptop. It is not an excuse, it is a real issue: you are cutting them off from part of what makes them so intelligent.
To sum it up, the model of intelligence as a brain in a jar, disconnected from anything else, is ridiculous.
If you accept the idea of embodied intelligence, then many actions that we view as a consequence of our intelligence are actually part of our intelligence. First and foremost, language. Our ability to talk or write to each other means that I am not limited by my own person. Have you ever heard of human beings isolated in small tribes making technological breakthroughs? Nah. Progress requires lots of people communicating together. Up until a few decades ago, progress required cities. Today I am less certain than it does, as I can more and more communicate with anyone in the world from anywhere. But language is still critical, we have not invented anything better. Similarly, having hands and the ability to build sophisticated tools (like laptops) allows us to extend our intelligence.
At the end of 2022, we got a breakthrough technology: ChatGPT. It built on several pre-existing ideas such as (large) language models, neural networks, and so forth. That’s the ‘GPT’ part. But an important, if underappreciated, part of the breakthrough was the ‘Chat’ component. Someone had the idea of connecting a large language model with a chat interface. Maybe this came naturally and obviously to people building this system, but it should not be assumed to be trivial or unimportant.
Language is a key component of our intelligence, and, thus, it makes sense that it would be pivotal for machine intelligence.
We embodied the AI software in a chat box.
The next step was what we call today ‘agentic AI’. We keep the chat box, but we add the ability for the AI software to interact with tools, and to make plans to use them. In effect, we give the AI more agency: it can do stuff and learn from the results as they happen. It is starting to resemble a human being with hands and tools.
I was talking with a colleague this week. My colleague is all in on the AI revolution. He uses his AI to help him write better and faster, and to get his data analysis done faster, without so much help from technical experts.
But my colleague was not aware of the agentic AI approach. I tried to explain on the phone. What does it mean to give the AI access to tools? Is this only about saving the effort of copying and pasting the AI’s response?
I ended up making a video where I start an AI in a shell within something called RStudio. It is an environment people use to program in R, to do data analysis. I don’t use R or RStudio, but thanks to the AI, I was able to build an entire climate research project in a few minutes, complete with the retrieval of the data from the web.
How did the AI do it? I recorded it. It tried a few things, initially struggling to download the data. At some point, it finds out that it needs new R packages, so it installs them, and once they are installed, it can proceed to generate figures, verifying that it works.
Agentic AI greatly extends machine intelligence by improving the embodiment of AI.
I believe that it is not yet understood as it should be.
In Montreal, the most established professor in the field of AI is Yoshua Bengio. He started his own non-trivial enterprise a few years ago (Element AI). His latest venture is Law Zero, which aims to create a Scientist AI. The first goal of this project is to build AI without the agentic component. It should be a disembodied AI that has no goal of its own, no agency.
I fear that Bengio suffers from what Kevin Kelly called Thinkism. Let me quote from Kelly’s 2008 essay.
No intelligence, no matter how super duper, can figure out how human body works simply by reading all the known scientific literature in the world and then contemplating it. No super AI can simply think about all the current and past nuclear fission experiments and then come up with working nuclear fusion in a day. Between not knowing how things work and knowing how they work is a lot more than thinkism. There are tons of experiments in the real world which yields tons and tons of data that will be required to form the correct working hypothesis. Thinking about the potential data will not yield the correct data. Thinking is only part of science; maybe even a small part. (…) Thinkism is not enough. Without conducting experiments, building prototypes, having failures, and engaging in reality, an intelligence can have thoughts but not results. It cannot think its way to solving the world’s problems. (…) The Singularity is an illusion that will be constantly retreating — always “near” but never arriving. We’ll wonder why it never came after we got AI. Then one day in the future, we’ll realize it already happened. The super AI came, and all the things we thought it would bring instantly — personal nanotechnology, brain upgrades, immortality — did not come. Instead other benefits accrued, which we did not anticipate, and took long to appreciate. Since we did not see them coming, we look back and say, yes, that was the Singularity.
I believe that University professors are especially prone to thinkism. They view intelligence as being centered on what is happening in their brain. When you live in an ivory tower, it is easy to dismiss the real world as the core source of intelligence. Further, they are often people who did quite well in school where thinkism is naturally prevalent.
I have been a professor most of my life. However, I tire quickly of talking with other professors. What I most enjoy is working with people who have new tools that they apply in the real world. Unsurprisingly, I spent most of my time working with software that people deploy in the real world.
What Kelly is saying is that a high degree of intelligence is not enough to do much of anything. The real world is not the final stage of your thinking process. It is maybe the most important part of it.
And thus, when you connect your AI with the real world, giving it the ability of running experiments (as virtually all software developers do today), you get impressive results that go much beyond what AI software can do on its own.
Agency is not a feature. Agency is primary.
2026-05-23 10:45:11

Every machine connected to the Internet has an address called an IP address. Originally, these addresses were 32-bit integers (IPv4), giving a theoretical maximum of about four billion distinct addresses. We are all familiar with these addresses (e.g., 192.168.0.0). There was a big fuss about how we would run out of addresses. It never happened because we don’t actually need every device to have its own unique address. Your home router needs an address, but every device in your home does not need a worldwide unique address.
Nevertheless, the range was extended to cover 128 bits (IPv6). An IPv6 address is conventionally written as eight groups of four hexadecimal digits separated by colons. For example:
2001:0db8:85a3:0000:0000:8a2e:0370:7334
Because addresses often contain runs of zeros, the format allows two shortcuts:
2001:db8:85a3:0:0:8a2e:370:7334.::: 2001:db8:85a3::8a2e:370:7334.The double-colon trick can appear only once in an address, and can match one or more zero groups. Hence ::1 is the loopback address (all zeros except the last group), and :: is the unspecified address (all zeros).
IPv6 also accepts an embedded IPv4 address in the last 32 bits, written in the usual dotted-decimal form. This is mostly used for IPv4-mapped IPv6 addresses such as ::ffff:192.168.1.1. The longest possible textual form is 45 characters:
0000:0000:0000:0000:0000:ffff:255.255.255.255
So a parser must accept hexadecimal groups, the compressed form, and an optional IPv4 tail. It is more involved than parsing IPv4.
The standard C function for the job is inet_pton, available on essentially every system.
Can we do better?
A few years ago, I showed that you could parse IPv4 addresses really fast. Can we do the same with IPv6?
The trick is to use data parallelism: we invoke the so-called SIMD instructions that all our processors support. These instructions can process potentially dozens of bytes at once.
Shreesh Adiga gave it a try with AVX-512, the powerful instruction set supported by recent Intel server processors and all new AMD CPUs. The idea is to load the entire string into a 512-bit register, find the colons with a single comparison, compute the spacing between them to drive a byte-level expand, translate hex digits via a permute, and finish with a multiply-accumulate that combines the hex digits into bytes. Almost the whole parser is branch-free, meaning that there are few if clauses.
I put together a small benchmark that generates random IPv6 addresses with inet_ntop (so the addresses are written in their canonical, compressed form) and parses each one with both inet_pton and the AVX-512 routine. The benchmark runs on an Intel Xeon Gold 6548N CPU @ 2.8 GHz (Emerald Rapids) with GCC, compiled with -march=native -O3.
| function | ns/addr | speed (Mv/s) | instr/addr | instr/cycle |
|---|---|---|---|---|
inet_pton |
175.3 | 5.7 | 954 | 1.56 |
| AVX-512 | 14.0 | 71.3 | 120 | 2.45 |
The AVX-512 routine is about 12 times faster than inet_pton, parsing more than 70 million addresses per second on a single core. It uses eight times fewer instructions, and runs them at a higher throughput (2.45 instructions per cycle versus 1.56).
The source code used for this benchmark is available on my blog repository.
2026-05-22 09:16:35

In software programming, the product between two integers is often computed to a fixed number of bits with overflow. Consider 8-bit integers. If you multiply 127 by 127, you get back the number 1 as an 8-bit unsigned integer, with an overflow. The actual full product is 16129. To represent 16129, you typically use 16 bits of precision.
Thus we have the notion of the full product. The full product of two 32-bit integers is typically represented using 64 bits. The question that preoccupied me is what fraction of all 64-bit integers can be written as the product of two 32-bit integers.
You might wonder why you would care?
We often design hash functions: they are special functions that take an input and generate a random-looking output. Several years ago I designed a very fast hash function called clhash. It is a super-fast hash function for strings having a few hundred bytes or more. If you don’t know about clhash, check it out. It is interesting in its own right.
This clhash hash function uses a type of multiplication typical of cryptographic applications. I was trying to argue that our approach had benefits compared with techniques based on standard multiplications. Let me illustrate. A simple hash function for 32-bit integers could take the least significant bits and multiply them with the most significant bits.
// simpleHighLowHash is a simple (and weak) 32-bit hash
// that multiplies the high 16 bits by the low 16 bits.
func simpleHighLowHash(x uint32) uint32 {
high := uint16(x >> 16)
low := uint16(x & 0xFFFF)
return uint32(high) * uint32(low)
}
Maybe you’d want the hash function to be uniform: all possible 32-bit hash values should be equally probable. It is only possible in this instance if the hash function can produce all 32-bit hash values, which is not the case.
The great mathematician Erdös showed that the proportion of all 2n-bit values that can be generated by the product of two n-bit values goes to zero as n becomes large. This means that if you have, say, 10000000-bit integers multiplying 10000000-bit integers, you’d expect relatively few 20000000-bit integers to be produced. But what about practical cases like 32-bit integers or 64-bit integers?
You can just brute-force the problem easily up to the multiplication of 16-bit integers into 32-bit products. At that point, slightly one out of five 32-bit numbers is a product between two 16-bit integers. About 80% of all 32-bit integers are never produced by this hash. However, the running time grows exponentially, and brute force won’t scale all the way to 32 bits.
So what do we do about the 32-bit case? That is, what do you do when you multiply two 32-bit integers to produce a 64-bit product? What fraction of 64-bit values can the following function produce?
func simpleHighLowHash(x uint64) uint64 {
high := uint32(x >> 32)
low := uint32(x & 0xFFFFFFFF)
return uint64(high) * uint64(low)
}
Can we get an exact result?
Yes!!!
Webster and his colleagues built the math to allow us to scale up the exact computation. He was kind enough to publish his code.
There are 3,215,709,724,700,470,902 64-bit (unsigned) integers that can be written as a product of two 32-bit integers. That’s about 17% of all possible values.
What about actually computing a pair of integers given their product? One approach consists of computing its full prime factorization, and then using those factors to build all possible divisors that are strictly less than 2^32, starting with a set of candidates containing only 1 and iteratively multiplying existing candidates by each prime factor (only keeping products that stay below 2^32). We can avoid adding duplicates to our set by processing unique prime factors with their multiplicity. Finally, we select the maximum such candidate m as the largest divisor under 2^32, compute the corresponding leftover n / m, and report whether a valid split into two 32-bit factors exists. In general, the answer (if it exists) is not unique: this returns the pair where one value is maximized. In Python, the code might look as follows.
for p in factor_multiplicities:
new_candidates = []
for c in candidates:
for i in range(factor_multiplicities[p] + 1):
if c * (p ** i) < 2**32:
new_candidates.append(c * (p ** i))
for new_c in new_candidates:
candidates.append(new_c)
m = max(candidates)
print(f"Maximum candidate: {m}")
leftover = n // m
print(f"Leftover: {leftover}")
if leftover >= 2**32:
print("Leftover is too large, cannot find a suitable candidate.")
You might be able to come up with a more efficient algorithm. I find it interesting to consider that if you pick a value at random, it will usually fail! That is, most 64-bit integers cannot be written as the product of two 32-bit integers.
2026-05-19 03:39:49

Converting a 64-bit integer to its decimal string representation is a mundane task that shows up everywhere: logging, JSON serialization, CSV output, debug prints, etc. In C++, you might use std::to_chars, sprintf, or some library routine.
How do these functions work? At a high level, they repeatedly divide by ten. Start with your integer k. Divide it by ten, use the remainder as the last digit (it is between 0 and 9 inclusively). You then add the code point value of the character 0 to get the ASCII digit. To go faster, you can divide by 100 and use a lookup table so that the value between 0 and 99 inclusively is mapped to a string.
So far so good. Unfortunately, even with all these optimizations, this string generation may become a performance bottleneck. Can you do better?
Let us assume that you have a recent AMD processor or an Intel server. Then you have powerful data-parallel instructions (AVX-512) that can multiply eight 64-bit integers at once. We often refer to these instructions as SIMD (single instruction multiple data). My colleague Jaël Champagne Gareau and I recently published a new paper on exactly this problem. The title says it all: Converting an Integer to a Decimal String in Under Two Nanoseconds.
When you write n / 100 in code, an optimizing compiler converts the operation to a multiplication followed by a shift. It is often described as a multiplicative inverse. Generally, you can replace the division of n by d with the division of c * n or c * n + c by m for convenient integers c and m chosen so that they approximate the reciprocal: c/m ~= 1/d. We often call c * n + c a fused multiply-add. Picking m to be a power of two means that the division by m is just a shift. Then you can get the remainder of the division by using the remainder of the division by m, multiplied by d and divided again by m, which is essentially a multiplication followed by a shift (Lemire et al., 2021).
We can put this to good use with the Integer Fused Multiply-Add (IFMA) instructions available on recent Intel and AMD processors. They essentially allow you to compute eight instances of (c * n + c)/m in one instruction. The expression (c * n + c)/m gives you the division, but we need the remainder, so instead we pick (c * n + c)%m which we need to multiply by the divisor.
The fun thing with AVX-512 instructions is that they can use a different c and a different divisor for each of the eight operations. Using Intel intrinsic functions, our core routine which converts a value smaller than 10^8 to eight digits looks as follows:
__m512i to_string_avx512ifma_8digits(uint64_t n) {
__m512i bcstq_l = _mm512_set1_epi64(n);
constexpr uint64_t twoto52 = 0x10000000000000ULL; // 2^52
__m512i ifma_const = _mm512_setr_epi64(
twoto52 / 100000000, twoto52 / 10000000,
twoto52 / 1000000, twoto52 / 100000,
twoto52 / 10000, twoto52 / 1000,
twoto52 / 100, twoto52 / 10
);
__m512i zmmTen = _mm512_set1_epi64(10);
__m512i asciiZero = _mm512_set1_epi64('0');
__m512i lowbits_l = _mm512_madd52lo_epu64(ifma_const,
bcstq_l, ifma_const); // ifma_const * bcstq_l + ifma_const
__m512i highbits_l = _mm512_madd52hi_epu64(asciiZero,
zmmTen, lowbits_l);
return highbits_l;
}
It compiles down to two multiplication-add instructions: vpmadd52huq. That’s it. Two instructions to generate eight digits.
It works by broadcasting n across all eight 64-bit lanes of a __m512i vector (bcstq_l). It then prepares a vector of carefully chosen multiplicative inverses (ifma_const) that represent the reciprocals of 10^8, 10^7, … The magic happens in the single _mm512_madd52lo_epu64 instruction, which simultaneously performs eight fused multiply-adds: each lane computes (ifma_const[i] * n + ifma_const[i]) using 52-bit low-half multiplication, effectively extracting the quotient when dividing by the corresponding power of ten. A second _mm512_madd52hi_epu64 instruction (with a vector of ten and a vector of '0') then isolates the digit values and adds the ASCII '0' offset in the high 52 bits, producing eight packed digit characters in a single 512-bit register.
If all your integers require eight digits, you are done. But in the general case, putting this to good use requires a bit of effort.
Thankfully, even if you, say, need only six digits, you can do the full 8-digit computation and then use a masked store if you want to store only six digits, ignoring the two leftovers. That is, instruction sets like AVX-512 allow you to write only some of the data to memory, which is quite convenient.
We have two variants. One is branch-heavy and does well on homogeneous data (numbers with similar digit lengths). The other is branch-light and better for mixed workloads. A quick profiling step can pick the right one for your dataset.
Our implementation is consistently 1.4–2× faster than the best competitors and 2–4× faster than std::to_chars across a wide range of inputs. What I find interesting is that even if the std::to_chars implementation is not at all naive, you can do significantly better in many ways. James Anhalt’s approach (jeaiii) is also quite fast on modern hardware.
Further reading
– The paper: doi:10.1002/spe.70079
– The benchmarks are on GitHub (fully reproducible)
– Shortly after our paper came online, Barend Erasmus created a software library implementing our proposed approach. I am not certain that Barend includes both the homogeneous and heterogeneous approaches.