2026-05-23 10:45:11

Every machine connected to the Internet has an address called an IP address. Originally, these addresses were 32-bit integers (IPv4), giving a theoretical maximum of about four billion distinct addresses. We are all familiar with these addresses (e.g., 192.168.0.0). There was a big fuss about how we would run out of addresses. It never happened because we don’t actually need every device to have its own unique address. Your home router needs an address, but every device in your home does not need a worldwide unique address.
Nevertheless, the range was extended to cover 128 bits (IPv6). An IPv6 address is conventionally written as eight groups of four hexadecimal digits separated by colons. For example:
2001:0db8:85a3:0000:0000:8a2e:0370:7334
Because addresses often contain runs of zeros, the format allows two shortcuts:
2001:db8:85a3:0:0:8a2e:370:7334.::: 2001:db8:85a3::8a2e:370:7334.The double-colon trick can appear only once in an address, and can match one or more zero groups. Hence ::1 is the loopback address (all zeros except the last group), and :: is the unspecified address (all zeros).
IPv6 also accepts an embedded IPv4 address in the last 32 bits, written in the usual dotted-decimal form. This is mostly used for IPv4-mapped IPv6 addresses such as ::ffff:192.168.1.1. The longest possible textual form is 45 characters:
0000:0000:0000:0000:0000:ffff:255.255.255.255
So a parser must accept hexadecimal groups, the compressed form, and an optional IPv4 tail. It is more involved than parsing IPv4.
The standard C function for the job is inet_pton, available on essentially every system.
Can we do better?
A few years ago, I showed that you could parse IPv4 addresses really fast. Can we do the same with IPv6?
The trick is to use data parallelism: we invoke the so-called SIMD instructions that all our processors support. These instructions can process potentially dozens of bytes at once.
Shreesh Adiga gave it a try with AVX-512, the powerful instruction set supported by recent Intel server processors and all new AMD CPUs. The idea is to load the entire string into a 512-bit register, find the colons with a single comparison, compute the spacing between them to drive a byte-level expand, translate hex digits via a permute, and finish with a multiply-accumulate that combines the hex digits into bytes. Almost the whole parser is branch-free, meaning that there are few if clauses.
I put together a small benchmark that generates random IPv6 addresses with inet_ntop (so the addresses are written in their canonical, compressed form) and parses each one with both inet_pton and the AVX-512 routine. The benchmark runs on an Intel Xeon Gold 6548N CPU @ 2.8 GHz (Emerald Rapids) with GCC, compiled with -march=native -O3.
| function | ns/addr | speed (Mv/s) | instr/addr | instr/cycle |
|---|---|---|---|---|
inet_pton |
175.3 | 5.7 | 954 | 1.56 |
| AVX-512 | 14.0 | 71.3 | 120 | 2.45 |
The AVX-512 routine is about 12 times faster than inet_pton, parsing more than 70 million addresses per second on a single core. It uses eight times fewer instructions, and runs them at a higher throughput (2.45 instructions per cycle versus 1.56).
The source code used for this benchmark is available on my blog repository.
2026-05-22 09:16:35

In software programming, the product between two integers is often computed to a fixed number of bits with overflow. Consider 8-bit integers. If you multiply 127 by 127, you get back the number 1 as an 8-bit unsigned integer, with an overflow. The actual full product is 16129. To represent 16129, you typically use 16 bits of precision.
Thus we have the notion of the full product. The full product of two 32-bit integers is typically represented using 64 bits. The question that preoccupied me is what fraction of all 64-bit integers can be written as the product of two 32-bit integers.
You might wonder why you would care?
We often design hash functions: they are special functions that take an input and generate a random-looking output. Several years ago I designed a very fast hash function called clhash. It is a super-fast hash function for strings having a few hundred bytes or more. If you don’t know about clhash, check it out. It is interesting in its own right.
This clhash hash function uses a type of multiplication typical of cryptographic applications. I was trying to argue that our approach had benefits compared with techniques based on standard multiplications. Let me illustrate. A simple hash function for 32-bit integers could take the least significant bits and multiply them with the most significant bits.
// simpleHighLowHash is a simple (and weak) 32-bit hash
// that multiplies the high 16 bits by the low 16 bits.
func simpleHighLowHash(x uint32) uint32 {
high := uint16(x >> 16)
low := uint16(x & 0xFFFF)
return uint32(high) * uint32(low)
}
Maybe you’d want the hash function to be uniform: all possible 32-bit hash values should be equally probable. It is only possible in this instance if the hash function can produce all 32-bit hash values, which is not the case.
The great mathematician Erdös showed that the proportion of all 2n-bit values that can be generated by the product of two n-bit values goes to zero as n becomes large. This means that if you have, say, 10000000-bit integers multiplying 10000000-bit integers, you’d expect relatively few 100000000000000-bit integers to be produced. But what about practical cases like 32-bit integers or 64-bit integers?
You can just brute-force the problem easily up to the multiplication of 16-bit integers into 32-bit products. At that point, slightly one out of five 32-bit numbers is a product between two 16-bit integers. About 80% of all 32-bit integers are never produced by this hash. However, the running time grows exponentially, and brute force won’t scale all the way to 32 bits.
So what do we do about the 32-bit case? That is, what do you do when you multiply two 32-bit integers to produce a 64-bit product? What fraction of 64-bit values can the following function produce?
func simpleHighLowHash(x uint64) uint64 {
high := uint32(x >> 16)
low := uint32(x & 0xFFFFFFFF)
return uint64(high) * uint64(low)
}
Can we get an exact result?
Yes!!!
Webster and his colleagues built the math to allow us to scale up the exact computation. He was kind enough to publish his code.
There are 3,215,709,724,700,470,902 64-bit (unsigned) integers that can be written as a product of two 32-bit integers. That’s about 17% of all possible values.
What about actually computing a pair of integers given their product? One approach consists of computing its full prime factorization, and then using those factors to build all possible divisors that are strictly less than 2^32, starting with a set of candidates containing only 1 and iteratively multiplying existing candidates by each prime factor (only keeping products that stay below 2^32). We can avoid adding duplicates to our set by processing unique prime factors with their multiplicity. Finally, we select the maximum such candidate m as the largest divisor under 2^32, compute the corresponding leftover n / m, and report whether a valid split into two 32-bit factors exists. In general, the answer (if it exists) is not unique: this returns the pair where one value is maximized. In Python, the code might look as follows.
for p in factor_multiplicities:
new_candidates = []
for c in candidates:
for i in range(factor_multiplicities[p] + 1):
if c * (p ** i) < 2**32:
new_candidates.append(c * (p ** i))
for new_c in new_candidates:
candidates.append(new_c)
m = max(candidates)
print(f"Maximum candidate: {m}")
leftover = n // m
print(f"Leftover: {leftover}")
if leftover >= 2**32:
print("Leftover is too large, cannot find a suitable candidate.")
You might be able to come up with a more efficient algorithm. I find it interesting to consider that if you pick a value at random, it will usually fail! That is, most 64-bit integers cannot be written as the product of two 32-bit integers.
2026-05-19 03:39:49

Converting a 64-bit integer to its decimal string representation is a mundane task that shows up everywhere: logging, JSON serialization, CSV output, debug prints, etc. In C++, you might use std::to_chars, sprintf, or some library routine.
How do these functions work? At a high level, they repeatedly divide by ten. Start with your integer k. Divide it by ten, use the remainder as the last digit (it is between 0 and 9 inclusively). You then add the code point value of the character 0 to get the ASCII digit. To go faster, you can divide by 100 and use a lookup table so that the value between 0 and 99 inclusively is mapped to a string.
So far so good. Unfortunately, even with all these optimizations, this string generation may become a performance bottleneck. Can you do better?
Let us assume that you have a recent AMD processor or an Intel server. Then you have powerful data-parallel instructions (AVX-512) that can multiply eight 64-bit integers at once. We often refer to these instructions as SIMD (single instruction multiple data). My colleague Jaël Champagne Gareau and I recently published a new paper on exactly this problem. The title says it all: Converting an Integer to a Decimal String in Under Two Nanoseconds.
When you write n / 100 in code, an optimizing compiler converts the operation to a multiplication followed by a shift. It is often described as a multiplicative inverse. Generally, you can replace the division of n by d with the division of c * n or c * n + c by m for convenient integers c and m chosen so that they approximate the reciprocal: c/m ~= 1/d. We often call c * n + c a fused multiply-add. Picking m to be a power of two means that the division by m is just a shift. Then you can get the remainder of the division by using the remainder of the division by m, multiplied by d and divided again by m, which is essentially a multiplication followed by a shift (Lemire et al., 2021).
We can put this to good use with the Integer Fused Multiply-Add (IFMA) instructions available on recent Intel and AMD processors. They essentially allow you to compute eight instances of (c * n + c)/m in one instruction. The expression (c * n + c)/m gives you the division, but we need the remainder, so instead we pick (c * n + c)%m which we need to multiply by the divisor.
The fun thing with AVX-512 instructions is that they can use a different c and a different divisor for each of the eight operations. Using Intel intrinsic functions, our core routine which converts a value smaller than 10^8 to eight digits looks as follows:
__m512i to_string_avx512ifma_8digits(uint64_t n) {
__m512i bcstq_l = _mm512_set1_epi64(n);
constexpr uint64_t twoto52 = 0x10000000000000ULL; // 2^52
__m512i ifma_const = _mm512_setr_epi64(
twoto52 / 100000000, twoto52 / 10000000,
twoto52 / 1000000, twoto52 / 100000,
twoto52 / 10000, twoto52 / 1000,
twoto52 / 100, twoto52 / 10
);
__m512i zmmTen = _mm512_set1_epi64(10);
__m512i asciiZero = _mm512_set1_epi64('0');
__m512i lowbits_l = _mm512_madd52lo_epu64(ifma_const,
bcstq_l, ifma_const); // ifma_const * bcstq_l + ifma_const
__m512i highbits_l = _mm512_madd52hi_epu64(asciiZero,
zmmTen, lowbits_l);
return highbits_l;
}
It compiles down to two multiplication-add instructions: vpmadd52huq. That’s it. Two instructions to generate eight digits.
It works by broadcasting n across all eight 64-bit lanes of a __m512i vector (bcstq_l). It then prepares a vector of carefully chosen multiplicative inverses (ifma_const) that represent the reciprocals of 10^8, 10^7, … The magic happens in the single _mm512_madd52lo_epu64 instruction, which simultaneously performs eight fused multiply-adds: each lane computes (ifma_const[i] * n + ifma_const[i]) using 52-bit low-half multiplication, effectively extracting the quotient when dividing by the corresponding power of ten. A second _mm512_madd52hi_epu64 instruction (with a vector of ten and a vector of '0') then isolates the digit values and adds the ASCII '0' offset in the high 52 bits, producing eight packed digit characters in a single 512-bit register.
If all your integers require eight digits, you are done. But in the general case, putting this to good use requires a bit of effort.
Thankfully, even if you, say, need only six digits, you can do the full 8-digit computation and then use a masked store if you want to store only six digits, ignoring the two leftovers. That is, instruction sets like AVX-512 allow you to write only some of the data to memory, which is quite convenient.
We have two variants. One is branch-heavy and does well on homogeneous data (numbers with similar digit lengths). The other is branch-light and better for mixed workloads. A quick profiling step can pick the right one for your dataset.
Our implementation is consistently 1.4–2× faster than the best competitors and 2–4× faster than std::to_chars across a wide range of inputs. What I find interesting is that even if the std::to_chars implementation is not at all naive, you can do significantly better in many ways. James Anhalt’s approach (jeaiii) is also quite fast on modern hardware.
Further reading
– The paper: doi:10.1002/spe.70079
– The benchmarks are on GitHub (fully reproducible)
– Shortly after our paper came online, Barend Erasmus created a software library implementing our proposed approach. I am not certain that Barend includes both the homogeneous and heterogeneous approaches.
2026-05-07 04:15:20

Suppose that x is a variable of an unsigned type. In C/C++, it could be of type size_t for example.
You have an expression like 6 * x and you want to know whether 6 * x overflows. That is, you want to know if 6 * x exceeds the range of values that can be represented by the type. In most cases, a variable of type size_t will be about to represent all values in the range [0, 2^64-1]. Instead of 64, let me use a variable for the number of bits: [0, 2^L-1].
The easiest approach is to compare x with (2^L-1) // 6 where I use the symbol // to denote the integer division (as opposed to /).
But can you do otherwise ?
If the value does not overflow, we know for sure that (6 * x)//6 == x. The interesting question is what happens when it overflows. We can answer this directly for an arbitrary non-zero constant a in the range [1, 2^L-1].
Let k = (a*x)//2^L be the number of times the multiplication wraps around. The effective (wrapped) value computed by the machine is r = a*x - k*2^L, with 0 <= r < 2^L. Overflow happens precisely when k >= 1. We have that k <= a − 1 because x<2^L.
Performing the integer division of r = a*x - k*2^L by a, we get x plus -k*2^L//a. When k is non-zero, this last value (-k*2^L//a) is one of -2^L//a, -2* 2^L//a, …, -(a-1) * 2^L//a.
k = 0 (no overflow): r // a = x.k ≥ 1: r // a = x + (negative integer) ≠ x.Hence we have the following result.
Theorem If x is of an unsigned type and a is a non-zero constant, then a * x overflows if and only if (a * x)//a != x.
In practice, a simple comparison x with (2^L-1) // a is likely more efficient. Optimizing compilers might be able to convert (a * x)//a != x to a simple comparison. Unfortunately, the Go compiler (for example) cannot.
An open question is whether there is a more mathematically elegant check.
2026-05-06 03:01:09

A common pattern in modern software is to map a string key to a small array of floating-point numbers. Word embeddings, feature vectors, lookup tables for physical constants: all variations on the same theme. In Go, the obvious way to write this is a map[string][]float32. But how fast is it, really, and can we do better?
I have been working on constmap, a Go library that builds an immutable map from strings to uint64 values using the binary fuse filter construction. A lookup amounts to one hash, three array reads, and two XORs. There is no comparison, no chaining, no probing. The whole table fits in roughly 9 bytes per key, which often means it fits in cache where a Go map does not.
Go has fast maps, you cannot easily beat them in performance. But if you build a smaller data structure that causes fewer cache misses, you can definitively go faster.
By default, the constmap returns a uint64. But what if your value is an array of eight float32 numbers? You have at least two options:
[][]float32. The constmap returns the index.uint64.The second option requires the unsafe package because we are smuggling a pointer through an integer field. It has some limitations.
unsafe calls converting an integer to a pointer value. Thankfully, you can just put all your arrays of floats in an array and call runtime.KeepAlive(mybigarray) at a strategic location: this will prevent Go from collecting mybigarray. The call to runtime.KeepAlive is not free but also quite cheap so you can possibly use a lot of such calls. benchmarkI built three lookups over 100,000 keys, each mapping to an 8-element []float32. We always access the first element of the array, to make sure that the bencmark is a bit fair. We have a large set of random queries (a query is a string).
We compare map[string][]float32, the standard constmap coupled with an array (so that the constmap constains indexes), and the constmap that contains what is effectively a pointer to the location of the []float32.
Run on an Apple M4 Max with Go’s standard benchmark harness:
| Lookup | Time per op |
|---|---|
map[string][]float32 |
21 ns |
ConstMap → index → [][]float32
|
11 ns |
ConstMap → pointer → *[8]float32
|
8.7 ns |
The constmap with an index is already a twice as fast as the Go map. Replacing the index by a raw pointer shaves another 2 ns by skipping the indirection through the [][]float32 slice header. It is a speedup of about 20% in my case.
The result is interesting on its own: a constmap lookup is fast enough that the next memory load, the slice header read, becomes a measurable fraction of the work.
The benchmark and code are in github.com/lemire/constmap. Run them with:
go test -bench 'FloatArray' -benchtime=1s
2026-05-01 06:17:34

No, rising house prices are not the driver of sharp fertility declines. The evidence shows only modest, mixed effects that cannot explain the large drops observed in places like Canada.
What the Research Actually Shows:
A well-known study by Dettling and Kearney (2014) found that rising house prices have opposing effects: they slightly increase fertility among homeowners (via a “home equity” or wealth effect) and slightly decrease it among renters (via a price effect). At average U.S. homeownership rates, the net effect was a small increase in fertility.
This pattern has held up in other countries. For example, Daysal et al. (2021) and related work confirm similar homeowner/renter dynamics in Denmark and elsewhere.
Clark (2012) found that expensive housing markets are associated with a modest delay in age at first birth (roughly 3–4 years after controls), but the overall impact on completed fertility remains limited.
Canadian evidence aligns with this. Clark and Ferrer (2019) analyzed longitudinal data and found that higher lagged house prices were positively associated with the probability of an additional birth among homeowners in some specifications, but the effects were small and did not drive large-scale declines. They noted that falling school-aged children in high-price cities like Vancouver or Toronto are better explained by selective migration of people with preferences for fewer children into urban centers, rather than existing residents having fewer kids due to prices.
These effects are too modest to explain Canada’s fertility rate of about 1.25–1.3 children per woman (a record low, far below the 2.1 replacement level).
Expensive cities do have fewer children on average, but this largely reflects self-sorting: higher-income, higher-educated people (who tend to have fewer children) cluster in costly urban areas. Correlation is not causation.
Fertility is declining across most developed countries, regardless of housing costs:
– Israel maintains a high fertility rate (~2.9 children per woman, highest in the OECD), despite expensive housing; Tel Aviv is pricier than Montreal.
– Japan has long had very low fertility (~1.2) despite more affordable housing in many areas compared to Canada.
As Richard Florida summarized in “Don’t Blame Expensive Housing for Falling Fertility”: intuition suggests high costs deter families, but demographic research points more strongly to higher education and related lifestyle shifts as drivers of lower birth rates. Richer societies and cities tend to have lower fertility.
Fertility choices are heavily shaped early in life. A 16-year-old planning her future is unlikely to base major decisions on distant future house prices in her 20s or 30s. Models ignoring how teenagers view family, career, and life priorities miss the bigger picture.
Housing costs matter at the margins (with offsetting effects by tenure), but they are not the story behind broad fertility collapses. Policies focused on housing affordability are unlikely to reverse fertility trends.
We have no reason whatsoever to believe that a collapse in housing prices in a country like Canada would drive fertility upward. It is motivated reasoning.
High house prices are a terrible thing in my opinion. Investing all your capital in houses is an odd way to get prosperity. But it does not drive our fertility collapse.
References
Clark, Jeremy. 2012. “Do Women Delay Family Formation in Expensive Housing Markets?” Demographic Research 27(1): 1–24. https://doi.org/10.4054/DemRes.2012.27.1
Clark, Jeremy, and Ana Ferrer. 2019. “The Effect of House Prices on Fertility: Evidence from Canada.” Economics: The Open-Access, Open-Assessment E-Journal 13(2019-38): 1–32. https://doi.org/10.5018/economics-ejournal.ja.2019-38
Daysal, N. Meltem, Michael F. Lovenheim, Nikolaj Siersbæk, and David N. Wasser. 2021. “Home Prices, Fertility, and Early-Life Health Outcomes.” Journal of Public Economics 198: 104366. https://doi.org/10.1016/j.jpubeco.2021.104366
Dettling, Lisa J., and Melissa S. Kearney. 2014. “House Prices and Birth Rates: The Impact of the Real Estate Market on the Decision to Have a Baby.” Journal of Public Economics 110: 82–100. https://doi.org/10.1016/j.jpubeco.2013.09.009 (NBER Working Paper 17485, 2011/2014 version)
Florida, Richard. 2018. “Don’t Blame Expensive Housing for Falling Fertility.” CityLab (Bloomberg), June 14, 2018. https://www.bloomberg.com/news/articles/2018-06-14/the-complex-relationship-between-house-prices-and-fertility