2025-10-27 04:13:34

The hardest problem in software performance is often understanding your code and why it might be slow. One approach to this problem is profiling. A profiler estimates the time spent in the various functions of your program.
Profiling results can be difficult to understand. Furthermore, the problem is more subtle than it seems. For example, how do you count the time spent in a function A that calls another function B? Does the time spent in B count toward the time spent in A? Suppose function A executes code for 2 seconds, then calls function B which takes 3 seconds to run, and finally A continues for 1 additional second. The total time for A would be 6 seconds if everything were included, but that does not accurately reflect where the time is truly spent.
We might prefer the exclusive or flat time: the time spent in the body of the function itself, without counting calls to other functions. In our example, the flat time of A is 2 + 1 = 3 seconds (the time before and after the call to B), and the flat time of B is 3 seconds. The cumulative time is the total time spent in the function and in all the functions it calls, recursively. For A, that is its own 3 seconds plus the 3 seconds of B, for a total of 6 seconds. For B, it is just its own 3 seconds, unless it calls other functions.
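To make the distinction concrete, here is a hypothetical Go sketch; the functions A and B are invented for illustration, and the busy loops stand in for real CPU work. The loops inside A contribute to A's flat time, while the call to B contributes only to A's cumulative time, so a CPU profiler should report roughly 3 seconds flat and 6 seconds cumulative for A.
package main

import "time"

// B spends about 3 seconds of CPU time in its own body.
func B() {
    for end := time.Now().Add(3 * time.Second); time.Now().Before(end); {
    }
}

// A spends about 3 seconds (2 + 1) in its own body and calls B.
func A() {
    for end := time.Now().Add(2 * time.Second); time.Now().Before(end); {
    } // counts toward A's flat time
    B() // counts only toward A's cumulative time
    for end := time.Now().Add(1 * time.Second); time.Now().Before(end); {
    } // counts toward A's flat time
}

func main() {
    A()
}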
Go has built-in support for profiling. It requires two steps: you use the runtime/pprof package to collect profiles, and you use go tool pprof to review the data.
Let us consider a concrete example. We have a benchmark function that repeatedly parses strings as floating-point numbers.
func BenchmarkParseFloat(b *testing.B) {
    for i := 0; i < b.N; i++ {
        idx := i % len(floatStrings)
        _, _ = strconv.ParseFloat(floatStrings[idx], 64)
    }
}
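The benchmark assumes a package-level floatStrings slice. Its exact content does not matter much; a minimal placeholder could look like this:
// floatStrings is a placeholder input; any set of numeric strings will do.
var floatStrings = []string{
    "0", "1.5", "-3.14159", "2.718281828459045", "1e300", "6.022e23",
}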
We want to profile the strconv.ParseFloat function to see where it spends its time. We run:
go test -bench=. -cpuprofile=cpu.prof
Notice the -cpuprofile=cpu.prof flag, which instructs Go to dump its profiling data to the file cpu.prof. Go profiles the code using a sampling-based approach: it periodically interrupts execution to capture stack traces of the running goroutines. Sampling has low overhead (only a small fraction of execution time is spent on it) and it works across multiple goroutines and threads.
If you are trying to profile a long-running application, like a server, you can use the net/http/pprof package to request profiling data on demand over HTTP. However, it requires modifying your application accordingly.
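As a minimal sketch, importing net/http/pprof for its side effects registers handlers under /debug/pprof/ on the default mux; you then expose them on a port of your choosing (6060 is an arbitrary choice here):
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
    // Expose the profiling endpoints; in a real server, this runs alongside your own handlers.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... the rest of your application ...
    select {} // block so this toy example keeps running
}
You can then fetch a 30-second CPU profile with go tool pprof http://localhost:6060/debug/pprof/profile.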
We can call go tool pprof on cpu.prof to examine the result. The simplest option is the -text flag, which prints a summary to the console.
go tool pprof -text cpu.prof
You might see the following output.
Time: 2025-10-26 14:35:03 EDT
Duration: 1.31s, Total samples = 990ms (75.45%)
Showing nodes accounting for 990ms, 100% of 990ms total
flat flat% sum% cum cum%
830ms 83.84% 83.84% 830ms 83.84% strconv.readFloat
20ms 2.02% 92.93% 900ms 90.91% strconv.atof64
20ms 2.02% 94.95% 20ms 2.02% strconv.eiselLemire64
20ms 2.02% 96.97% 20ms 2.02% strconv.special
10ms 1.01% 98.99% 910ms 91.92% strconv.ParseFloat
10ms 1.01% 100% 10ms 1.01% strconv.atof64exact
This textual output can be difficult to interpret, so we may prefer to visualize it. My favorite visualization is the flame graph, a technique invented by Brendan Gregg.
A flame graph is a stacked bar chart where each bar represents a function call and the width of the bar indicates the estimated time spent in that function. Time can be replaced by other metrics, such as CPU instructions, memory allocations, or any other quantity that can be attributed to a function. The y-axis represents the call-stack depth, with the root of the stack at the top. Colors can be used to make the graph easier to read.
The name flame graph comes from the shape, which often resembles flames: wider bases indicate more time-consuming functions, while narrower tops show less frequent paths. There are many variations, such as flame charts, where the x-axis represents time progression and the y-axis shows the call stack at each moment. But flame graphs are more common.
Given that you have generated the file cpu.prof, you can visualize it by first starting a web server:
go tool pprof -http=:8123 cpu.prof
It assumes that port 8123 is free on your system; you can change the port if needed. Open your browser at http://localhost:8123/ui/flamegraph.
You should see a flame graph in your browser.
Our first step is to narrow our focus to the function we care about. For more clarity, double-click on strconv.ParseFloat, or right-click on it and select focus. The bulk of ParseFloat's time (over 80%) is spent in the readFloat function, which converts the string into a mantissa and an exponent. Other functions are called, such as strconv.special, strconv.atof64exact, and strconv.eiselLemire64, but they account for little time in this specific benchmark.
Thus, if we wanted to improve the performance of the strconv.ParseFloat function, we might focus on optimizing the readFloat function. Conversely, at least for our specific test, optimizing the strconv.special, strconv.atof64exact, and strconv.eiselLemire64 functions is unlikely to be productive.
In our specific example, profiling gives us a clear answer, but that is not always the case. Profiling can be misleading. For one thing, profiling is typically statistical: if your program is short-lived or has an uneven load, it might provide inaccurate data. Furthermore, inefficient work spread over many functions might not show up clearly in a flame graph. Thus you should always interpret profiling results with care.
2025-10-26 22:59:29

If you have ever met me in person, you know that when you share an idea with me, I simplify it to its core and reflect it back to you, focusing on its essential parts. I dissect each statement for precision. “What do you mean by this word?”
I have two decades of experience working with academics who overcomplicate everything. Humans are easily confused. A project proposal with ten moving parts and five objectives is overwhelming. Most people cannot think it through critically, which can lead to disaster.
By instinct, I simplify problems as my first step, reducing them to their “minimum viable product,” as they say in Silicon Valley.
Some people avoid simplicity to sound smarter. They won’t admit it, maybe not even to themselves, but that’s what they are thinking: “Oh no! I’m not doing this simple thing; my work is much more sophisticated.”
That’s a terrible idea. Even simple projects become challenging if you are ambitious. There is no need to complexify them. For example, seven years ago, we aimed to create a JSON parser faster than anything on the market—a simple idea. A senior colleague in computer science saw me at a campus coffee shop while I was working. He asked what I was busy doing… When I told him, “We’re writing a fast JSON parser,” he laughed. He would later admit that he thought I was joking: how could I work on something so mundane? I wasn’t kidding. The result was simdjson, a JSON parser four times faster than anything else at the time. By keeping our project conceptually simple, we made success easier.
Complexity is a burden, not a badge of pride.
Clear thinking demands precision.
People use emotionally charged words like “safe.” “My car is safe.” “My software is safe.” What does that mean? Define it precisely.
A graduate student of mine recently proposed reducing the cognitive load of agentic AI on developers. “Cognitive load” sounds great—thousands of papers discuss it—but what does it mean? How do we measure it?
Take “AI” as another example. Nobody knows what AI is. Is a Google search AI? Is image search AI? Is an expert system AI? Or do you mean large language models? Clarify what you mean.
Too often, we accidentally hide behind overly abstract language. Not only does this harm how we think, but it also harms how we are perceived. People who avoid jargon are viewed as more honest, trustworthy and benevolent (Fick et al. 2025).
Your motivation should be clear from the start. Here is an example. I am often asked: “Should I go to graduate school?” What’s the motivation? Often, it’s to get a great job. Years ago, I had this very conversation with a research assistant who wanted a graduate degree to land a great job.
Given his exceptional technical skills, I told him he could already land that job soon. He did, and now he’s in a leadership role at one of Montreal’s top companies. He is outearning me by a wide margin, no doubt.
People often lose sight of their motivation and follow trends. Reconnect with your motivation and adjust your means accordingly.
To sum up, my advice for clear thinking: simplify ideas to their core, define your terms precisely, avoid jargon and abstract language, and stay connected to your original motivation.
2025-10-20 05:28:24

In large code bases, we are often stuck with unpleasant designs that harm our performance. We might be looking for a non-intrusive way to improve performance: for example, we may not want to change the function signatures.
Let us consider a concrete example. Maybe someone designed the programming interface so that you have to access the values from a map using an index. They may have code like so:
auto at_index(map_like auto& index_map, size_t idx) {
    size_t count = 0;
    for (const auto &[key, value] : index_map) {
        if (count == idx)
            return value;
        count++;
    }
    throw std::out_of_range("Index out of range");
}
This code walks through the entries of the map until it reaches position idx. Typically, that implies some kind of node-to-node traversal, akin to following a linked list. If you are stuck with this interface, going through all the values implies repeated calls to the at_index function:
for (size_t i = 0; i < input_size; ++i) {
    at_index(index_map, i);
}
If you have taken any computer science course, you will immediately see the problem: this code has quadratic complexity. If you double the map size, you may quadruple the running time. That is likely fine if you have 2 or 4 elements in the map, but definitely not fine if you have 400 elements.
The proper solution is to avoid such a design. If you can have access directly to the map, you can just iterate through it directly:
for (auto& [key, value] : index_map) {
    sum += value;
}
But what if you are stuck? Assume that the map is never modified in practice. Then you can use a static or thread_local cache. The key insight is to remember your location in the map and to start from there on the next query. If the user typically queries indexes in sequence, the cache should speed up the function tremendously.
auto at_index_thread_local_cache(map_like auto& index_map, size_t idx) {
    using iterator = decltype(index_map.begin());
    struct Cache {
        iterator last_iterator;
        size_t last_index = -1;
        decltype(&index_map) map_ptr = nullptr;
    };
    thread_local Cache cache;
    // Fast path: sequential access on the same map, with a valid cached position.
    if (cache.map_ptr == &index_map && cache.last_index != size_t(-1)
        && idx == cache.last_index + 1
        && cache.last_iterator != index_map.end()) {
        cache.last_iterator++;
        cache.last_index = idx;
        if (cache.last_iterator != index_map.end()) {
            return cache.last_iterator->second;
        } else {
            throw std::out_of_range("Index out of range");
        }
    } else {
        // Slow path: rebuild the cache with a linear scan from the beginning.
        cache.last_iterator = index_map.begin();
        cache.last_index = -1;
        cache.map_ptr = &index_map;
        size_t count = 0;
        for (auto it = index_map.begin();
             it != index_map.end(); ++it) {
            if (count == idx) {
                cache.last_iterator = it;
                cache.last_index = idx;
                return it->second;
            }
            count++;
        }
        throw std::out_of_range("Index out of range");
    }
}
In C++, a thread_local variable has a single instance per thread, shared by all function calls within that thread. If you wish to have just one instance of the variable for the entire program, you can use static instead, but thread_local is the best choice in our case. You might be worried about the performance implications of a thread_local variable, but it is generally quite cheap: accessing or modifying it only adds a few instructions.
Our cache remembers, per thread, the last accessed iterator and index. If the next index in sequence is requested, we simply advance the iterator and return the value. If the access is non-sequential, or if it is the first call, we fall back to a linear scan from the beginning, rebuilding the cache along the way.
The code is more complicated, and if you are not accessing the keys in sequence, it might even be slower. However, the performance gains can be enormous. I wrote a benchmark to test it with maps containing 400 elements.
| Method | ns/key | instructions/key |
|---|---|---|
| original | 300 | 2000 |
| cache | 2 | 20 |
In my case, the cache multiplied the performance by 150. Not bad.
2025-10-18 02:50:53

Many view scientific research as a public good. I consider this naive and indefensible. Scientific progress hinges on people and culture, not on research results as public goods.
What do they mean by a public good? A public good is non-excludable: once a scientific discovery is made, anyone can access it at low cost.
If true, why bother innovating? Just wait and adopt others’ discoveries.
History tells us that it does not work that way. The British Empire’s technological edge took decades for others to match; the knowledge eventually diffused, but at great cost.
Consider the most significant advance of the last 20 years: large language models (LLMs). Papers, code, and breakthroughs are freely available online.
Yet firms like OpenAI, Anthropic, and xAI pay millions in salaries to top talent. Why? Can’t others just study the papers and replicate?
Most organizations and countries can’t capture this “new knowledge.” Europe lacks an OpenAI equivalent for a reason.
Knowledge is not a fluid that can be replicated infinitely, like Jesus’s loaves. If you gave the Romans the design plans of a modern Tesla, they would not know what to do with them. They would first need to become Tesla engineers, which would take years and require that they live among Tesla engineers. You cannot show up at your local university’s Physics department and simply copy their knowledge into your head. You must first become a good physicist, which takes years and requires that you live among physicists.
Scientific progress is cultural evolution. Adopting discoveries requires cultural updates, and that is far from easy. Einstein’s papers are public, yet reading them won’t make you Einstein. He wasn’t a public good; he thrived in a specific culture. It’s unclear whether he’d fit in today’s universities.
We defend government-funded research, often claiming markets can’t price public goods. Yet markets now price large language models effectively. Innovators capture only a fraction of the value, but resources do flow to productive entities regardless of credit.
In fact, I believe that the current relative scientific stagnation stems from government intervention. Pre-1970s laissez-faire policies served scientific progress better.
But even if you disagree, please do not spread the naive view that knowledge and scientific progress are public goods. They are not. Knowledge is very much a private good because it is a cultural artefact. It is all about people, specific people, and how they think. Cultural artefacts can and will spread, obviously, but there is nothing free about it. And they will definitely not spread equally everywhere.
From Galileo to today’s amateur astronomers, scientists have been rebels. Like artists and poets, they are free spirits who resist the restrictions their cultures impose on them. In their pursuit of nature’s truths, they are guided as much by imagination as by reason, and their greatest theories have the uniqueness and beauty of great works of art. (Freeman Dyson)
2025-10-16 06:18:33

We often need to generate random numbers in software. We need them for games, simulations, testing, and so forth. In many of these cases, we would like to use the fastest generator we can find, as long as it is reasonably random-looking.
In some instances, we need them for cryptography: we may need to generate a number that nobody else could guess. In these cases, performance is less of a concern.
In Go, the standard library offers a few options. The original math/rand package provides a fast but non-cryptographic generator, while crypto/rand delivers secure bits. Go 1.22 introduced math/rand/v2 with ChaCha8 and PCG options. ChaCha8 is based on a cipher designed by Daniel J. Bernstein, a famous cryptographer. PCG was created by a co-author of mine, Professor O’Neill from Harvey Mudd College. Her work on the PCG generator has been broadly adopted. The Go ChaCha8 implementation benefits from advanced optimizations: it is partly implemented in assembly. The PCG implementation is simpler, just a few Go functions. Here are my approximate timings on two systems:
| function | Apple M4 | Intel Ice Lake |
|---|---|---|
| math/rand | 1.7 ns | 5.8 ns |
| math/rand/v2 PCG | 1.4 ns | 4.8 ns |
| math/rand/v2 ChaCha8 | 2.6 ns | 8.1 ns |
| crypto/rand | 51 ns | 518 ns |
I find that O’Neill’s PCG is fastest in my tests. It is not cryptographically strong, but if you are using the random numbers for testing or simulation purposes, it should serve you well.
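For reference, here is a minimal sketch of how such a benchmark might be written as a test file (say, rng_test.go); it is not the exact benchmark behind the table above, and the seeds and buffer size are arbitrary choices.
package rng

import (
    crand "crypto/rand"
    mrand "math/rand"
    randv2 "math/rand/v2"
    "testing"
)

func BenchmarkMathRand(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = mrand.Uint64()
    }
}

func BenchmarkPCG(b *testing.B) {
    r := randv2.New(randv2.NewPCG(1, 2)) // arbitrary seed
    for i := 0; i < b.N; i++ {
        _ = r.Uint64()
    }
}

func BenchmarkChaCha8(b *testing.B) {
    var seed [32]byte // zero seed, for illustration only
    r := randv2.New(randv2.NewChaCha8(seed))
    for i := 0; i < b.N; i++ {
        _ = r.Uint64()
    }
}

func BenchmarkCryptoRand(b *testing.B) {
    buf := make([]byte, 8) // one 64-bit value per iteration
    for i := 0; i < b.N; i++ {
        _, _ = crand.Read(buf)
    }
}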
2025-10-08 22:16:55
