Blog of Daniel Lemire
Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.

Flame Graphs in Go

2025-10-27 04:13:34

The hardest problem in software performance is often to understand your code and why it might be slow. One approach to this problem is called profiling. Profiling tries to count the time spent in the various functions of your program.

It can be difficult to understand the result of profiling. Furthermore, it is more complicated than it seems. For example, how do you count the time spent in a function A that calls another function B? Does the time spent in function B count toward the time spent in function A? Suppose function A executes code for 2 seconds, then calls function B which takes 3 seconds to run, and finally A continues for 1 additional second. The total time for A would be 6 seconds if everything were included, but that does not accurately reflect where the time is truly spent. We might prefer the exclusive or flat time: the time spent only in the body of the function itself, without counting calls to other functions. In our example, the flat time of A would be 2 + 1 = 3 seconds (the time before and after the call to B). For B, it would be 3 seconds. The cumulative time is the total time spent in the function and all the functions it calls, recursively. For A, that would include its own 3 seconds plus the 3 seconds of B, totaling 6 seconds. For B, it would just be its 3 seconds, unless it calls other functions.
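As a minimal sketch (my own illustration, not from the original post), here is how A and B might look in Go, assuming busyWork spins the CPU for roughly the given number of seconds:

package main

import "time"

// busyWork keeps the CPU busy for roughly the given number of seconds.
func busyWork(seconds float64) {
    deadline := time.Now().Add(time.Duration(seconds * float64(time.Second)))
    for time.Now().Before(deadline) {
    }
}

func B() {
    busyWork(3) // B: about 3 seconds flat, 3 seconds cumulative
}

func A() {
    busyWork(2) // A's own work before calling B
    B()
    busyWork(1) // A's own work after calling B
    // A: about 3 seconds flat (2 + 1), 6 seconds cumulative (its 3 plus B's 3)
}

func main() {
    A()
}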

Go has built-in support for profiling. It requires two steps.

  1. Use runtime/pprof to collect profiles (a minimal sketch for standalone programs follows below).
  2. Use go tool pprof to review the data.
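
For a standalone program (outside of go test), step 1 might look like the following minimal sketch; the file name and error handling are my own choices:

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // ... the work you want to profile goes here ...
}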

Let us consider a concrete example. We have a benchmark function that repeatedly parses strings as floating-point numbers (floatStrings below is assumed to be a package-level slice of strings such as "3.1416").

func BenchmarkParseFloat(b *testing.B) {
    for i := 0; i < b.N; i++ {
        idx := i % len(floatStrings)
        _, _ = strconv.ParseFloat(floatStrings[idx], 64)
    }
}

In effect, we want to profile the strconv.ParseFloat function to see where it is spending its time. We run:

go test -bench=. -cpuprofile=cpu.prof    

Notice the -cpuprofile=cpu.prof flag, which instructs go test to write its profiling data to the file cpu.prof. Go profiles the code using a sampling-based approach: it periodically interrupts the execution to capture stack traces of the running goroutines. The overhead is low (only a small fraction of execution time is spent on sampling) and it works across multiple goroutines and threads.

If you are trying to profile a long-running application, like a server, you can use the net/http/pprof package to request profiling data dynamically using network requests. However, it requires modifying your application accordingly.
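
A minimal sketch of that approach, assuming you are willing to expose a dedicated listener on port 6060 (the port is my choice, not prescribed by the package):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/ handlers on the default mux
)

func main() {
    go func() {
        // Fetch a 30-second CPU profile with:
        //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... the rest of your long-running application ...
    select {}
}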

We can call go tool pprof on cpu.prof to examine the result. The simplest flag is -text, which prints a summary to the console.

go tool pprof -text cpu.prof

You might see the following output.

Time: 2025-10-26 14:35:03 EDT
Duration: 1.31s, Total samples = 990ms (75.45%)
Showing nodes accounting for 990ms, 100% of 990ms total
      flat  flat%   sum%        cum   cum%
     830ms 83.84% 83.84%      830ms 83.84%  strconv.readFloat
      20ms  2.02% 92.93%      900ms 90.91%  strconv.atof64
      20ms  2.02% 94.95%       20ms  2.02%  strconv.eiselLemire64
      20ms  2.02% 96.97%       20ms  2.02%  strconv.special
      10ms  1.01% 98.99%      910ms 91.92%  strconv.ParseFloat
      10ms  1.01%   100%       10ms  1.01%  strconv.atof64exact

It can be difficult to interpret this output, so we want to visualize it. My favorite approach is the flame graph, a useful technique invented by Brendan Gregg.

A flame graph is a stacked bar chart where each bar represents a function call, and the width of the bar indicates the estimated time spent in that function. The time can be replaced by other metrics such as CPU instructions, memory allocations, or any other metric that can be attributed to a function. The y-axis represents the call stack depth, with the top being the root of the stack. Colors can be used to make it more elegant.

The name flame graph comes from the shape, which often resembles flames, with wider bases indicating more time-consuming functions and narrower tops showing less frequent paths. There are many variations, such as flame charts, where the x-axis represents time progression and the y-axis shows the call stack at each moment. But flame graphs are more common.

Given that you have generated the file cpu.prof, you can visualize it by first starting a web server:

go tool pprof -http=:8123 cpu.prof

This assumes that port 8123 is free on your system; you can change the port if needed. Open your browser at http://localhost:8123/ui/flamegraph.

You should see a flame graph.

Our first step is to narrow our focus to the function we care about. For more clarity, double-click on strconv.ParseFloat, or right-click on it and select focus. The bulk of the time in ParseFloat (over 80%) is spent in the readFloat function. This function converts the string into a mantissa and an exponent. Other functions are called, such as strconv.special, strconv.atof64exact, and strconv.eiselLemire64, but they account for little time in this specific benchmark.

Thus, if we wanted to improve the performance of the strconv.ParseFloat function, we might want to focus on optimizing the readFloat function. Furthermore, at least given our specific test, optimizing the strconv.special, strconv.atof64exact, and strconv.eiselLemire64 functions is unlikely to be productive.

In our specific example, the profiling gives us a clear answer. However, it is not always so. Profiling can be misleading. For one thing, profiling is typically statistical: if your program is short-lived or has an uneven load, it might provide misleading data. Furthermore, inefficient work spread over many functions might not show up in a flame graph. Thus you should always interpret profiling results with care.

Thinking Clearly

2025-10-26 22:59:29

If you have ever met me in person, you know that when you share an idea with me, I simplify it to its core and reflect it back to you, focusing on its essential parts. I dissect each statement for precision. “What do you mean by this word?”

I have two decades of experience working with academics who overcomplicate everything. Humans are easily confused. A project proposal with ten moving parts and five objectives is overwhelming. Most people cannot think it through critically, which can lead to disaster.

By instinct, I simplify problems as my first step, reducing them to their “minimum viable product,” as they say in Silicon Valley.

Some people avoid simplicity to sound smarter. They won’t admit it, maybe not even to themselves, but that’s what they are thinking: “Oh no! I’m not doing this simple thing; my work is much more sophisticated.”

That’s a terrible idea. Even simple projects become challenging if you are ambitious. There is no need to complexify them. For example, seven years ago, we aimed to create a JSON parser faster than anything on the market—a simple idea. A senior colleague in computer science saw me at a campus coffee shop while I was working. He asked what I was busy doing… When I told him, “We’re writing a fast JSON parser,” he laughed. He would later admit that he thought I was joking: how could I work on something so mundane? I wasn’t kidding. The result was simdjson, a JSON parser four times faster than anything else at the time. By keeping our project conceptually simple, we made success easier.

Complexity is a burden, not a badge of pride.

Clear thinking demands precision.

People use emotionally charged words like “safe.” “My car is safe.” “My software is safe.” What does that mean? Define it precisely.

A graduate student of mine recently proposed reducing the cognitive load of agentic AI on developers. “Cognitive load” sounds great—thousands of papers discuss it—but what does it mean? How do you measure it?

Take “AI” as another example. Nobody knows what AI is. Is a Google search AI? Is image search AI? Is an expert system AI? Or do you mean large language models? Clarify what you mean.

Too often, we accidentally hide behind overly abstract language. Not only does this harm how we think, but it also harms how we are perceived. People who avoid jargon are viewed as more honest, trustworthy and benevolent (Fick et al. 2025).

Your motivation should be clear from the start. Here is an example. I am often asked: “Should I go to graduate school?” What’s the motivation? Often, it’s to get a great job. Years ago, I had this conversation with a research assistant:

  • Should I get a Master’s degree? Everyone with good grades does it.
  • What is your objective?
  • I want a good, well-paying job.

Given his exceptional technical skills, I told him he could already land that job soon. He did, and now he’s in a leadership role at one of Montreal’s top companies. He is outearning me by a wide margin, no doubt.

People often lose sight of their motivation and follow trends. Reconnect with your motivation and adjust your means accordingly.

To sum up, my advice for clear thinking:

  • Simplify projects to their essentials early and often.
  • Use precise language. Avoid jargon.
  • Focus on your motivation first, then choose the appropriate means.

Speeding up C++ functions with a thread_local cache

2025-10-20 05:28:24

In large code bases, we are often stuck with unpleasant designs that harm performance. We might be looking for a non-intrusive way to improve matters. For example, you may not want to change the function signatures.

Let us consider a concrete example. Maybe someone designed the programming interface so that you have to access the values from a map using an index. They may have code like so:

auto at_index(map_like auto& index_map, size_t idx) {
  size_t count = 0;
  for (const auto &[key, value] : index_map) {
    if(count == idx)
      return value;
    count++;
  }
  throw std::out_of_range("Index out of range");
}

This code walks through the entries of the map until it reaches the idx-th one. Typically, it implies some kind of linked list traversal. If you are stuck with this interface, going through all the values implies repeated calls to the at_index function:

for (size_t i = 0; i < input_size; ++i) {
    at_index(index_map, i);
}

If you took any kind of computer science course, you will immediately see the problem: my code has quadratic complexity (visiting all n entries this way costs roughly 1 + 2 + … + n = n(n+1)/2 map steps). If you double the map size, you may quadruple the running time. It is likely fine if you have 2 or 4 elements in the map, but definitely not fine if you have 400 elements.

The proper solution is to avoid such a design. If you can have access directly to the map, you can just iterate through it directly:

for (auto& [key, value] : index_map) {
   sum += value;
}

But what if you are stuck? Assume that the map is never modified in practice. Then you can use a static or thread_local cache. The key insight is to cache your location in the map and start from there on the next query. If the user typically queries in sequence, the cache should speed up the function tremendously.

auto 
at_index_thread_local_cache(map_like auto& index_map,
    size_t idx) {
  using iterator = decltype(index_map.begin());
  struct Cache {
    iterator last_iterator;
    size_t last_index = -1;
    decltype(&index_map) map_ptr = nullptr;
  };
  thread_local Cache cache;
  if (cache.map_ptr == &index_map
      && idx == cache.last_index + 1
      && cache.last_iterator != index_map.end()) {
    cache.last_iterator++;
    cache.last_index = idx;
    if (cache.last_iterator != index_map.end()) {
      return cache.last_iterator->second;
    } else {
      throw std::out_of_range("Index out of range");
    }
  } else {
    cache.last_iterator = index_map.begin();
    cache.last_index = -1;
    cache.map_ptr = &index_map;
    size_t count = 0;
    for (auto it = index_map.begin();
        it != index_map.end(); ++it) {
      if (count == idx) {
        cache.last_iterator = it;
        cache.last_index = idx;
        return it->second;
      }
      count++;
    }
    throw std::out_of_range("Index out of range");
  }
}

In C++, a thread_local variable has one instance per thread, shared by all function calls within that thread. If you wish to have just one instance of the variable for the entire program, you can use static instead, but thread_local is the best choice in our case. You might be worried about the performance implications of a thread_local variable, but it is generally quite cheap: we only add a few instructions when accessing or modifying it.

Our cache variable remembers the last accessed iterator and index per thread. If the next index in sequence is requested, we just increment the iterator and return. If the access is non-sequential, or on the first call, the function falls back to a linear scan from the beginning, rebuilding the cache along the way.

The code is more complicated, and if you are not accessing the keys in sequence, it might be slower. However, the performance gains can be enormous. I wrote a benchmark to test it out with maps containing 400 elements.

Method     ns/key   instructions/key
original   300      2000
cache      2        20

In my case, the cache multiplied the performance by 150. Not bad.

Research results are cultural artifacts, not public goods

2025-10-18 02:50:53

Many view scientific research as a public good. I consider this naive and indefensible. Scientific progress hinges on people and culture, not on research results as public goods.
What do they mean by a public good? A public good is non-excludable: once a scientific discovery is made, anyone can access it at low cost.
If true, why bother innovating? Just wait and adopt others’ discoveries.
History tells us that it does not work. The British Empire’s technological edge took decades for others to match, despite eventual diffusion—at great cost.
Consider the most significant advance of the last 20 years: large language models (LLMs). Papers, code, and breakthroughs are freely available online.
Yet firms like OpenAI, Anthropic, and xAI pay millions in salaries to top talent. Why? Can’t others just study the papers and replicate?
Most organizations and countries can’t capture this “new knowledge.” Europe lacks an OpenAI equivalent for a reason.
Knowledge is not a fluid that can be infinitely replicated like Jesus’s loaves. If you gave the Romans the design plans of a modern Tesla, they would not know what to do with them. They would first need to become Tesla engineers, which would take them years and require that they live among Tesla engineers. You cannot show up at your local university’s physics department and just copy their knowledge into your head. You first need to become a good physicist, which is going to take years and require that you live with physicists.
Scientific progress as cultural evolution. Adopting discoveries requires cultural updates—and it is far from easy. Einstein’s papers are public, yet reading them won’t make you Einstein. He wasn’t a public good; he thrived in a specific culture. It’s unclear if he’d fit in today’s universities.
We defend government funded research, often claiming markets can’t price public goods. But markets price large language models effectively now. Innovators capture only a fraction of the value, but resources do flow to productive entities regardless of credit.
In fact, I believe that the current relative scientific stagnation stems from government intervention. Pre-1970s laissez-faire policies served scientific progress better.
But even if you disagree, please do not spread the naive view that knowledge and scientific progress are public goods. They are not. Knowledge is very much a private good, as it is a cultural artefact. It is all about people, specific people, and how they think. Cultural artefacts can and will spread, obviously, but there is nothing free about it. And they will definitely not spread equally everywhere.

From Galileo to today’s amateur astronomers, scientists have been rebels. Like artists and poets, they are free spirits who resist the restrictions their cultures impose on them. In their pursuit of nature’s truths, they are guided as much by imagination as by reason, and their greatest theories have the uniqueness and beauty of great works of art. (Freeman Dyson)

Speed of random number generators in Go

2025-10-16 06:18:33

We often need to generate random numbers in software. We need them for games, simulations, testing, and so forth. In many of these cases, we would like to use the fastest generator we can find, as long as it is reasonably random-looking.
In some instances, we need them for cryptography: we may need to generate a number that nobody else could guess. In these cases, performance is less of a concern.

In Go, the standard library offers a few options. The original math/rand package provides a fast but non-cryptographic generator, while crypto/rand delivers secure bits. Go 1.22 introduced math/rand/v2 with ChaCha8 and PCG options. ChaCha8 was designed by Daniel J. Bernstein, a famous cryptographer. PCG was created by a co-author of mine, Professor O’Neill from Harvey Mudd College. Her work on the PCG generator has been broadly adopted. The Go ChaCha8 implementation benefits from advanced optimizations: it is partly implemented in assembly. The PCG implementation is simpler, just a few Go functions.

I benchmark how long it takes to generate a 64-bit word on Apple M4 and Intel Ice Lake processors using Go 1.24.
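
A minimal benchmark sketch along these lines (my own harness, not necessarily the one used for the numbers below; the seeds are arbitrary):

package rng_test

import (
    crand "crypto/rand"
    "encoding/binary"
    mrand "math/rand"
    mrandv2 "math/rand/v2"
    "testing"
)

func BenchmarkMathRand(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = mrand.Uint64()
    }
}

func BenchmarkPCG(b *testing.B) {
    r := mrandv2.New(mrandv2.NewPCG(1, 2))
    for i := 0; i < b.N; i++ {
        _ = r.Uint64()
    }
}

func BenchmarkChaCha8(b *testing.B) {
    r := mrandv2.New(mrandv2.NewChaCha8([32]byte{}))
    for i := 0; i < b.N; i++ {
        _ = r.Uint64()
    }
}

func BenchmarkCryptoRand(b *testing.B) {
    var buf [8]byte
    for i := 0; i < b.N; i++ {
        _, _ = crand.Read(buf[:])
        _ = binary.LittleEndian.Uint64(buf[:])
    }
}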

function               Apple M4   Intel Ice Lake
math/rand              1.7 ns     5.8 ns
math/rand/v2 PCG       1.4 ns     4.8 ns
math/rand/v2 ChaCha8   2.6 ns     8.1 ns
crypto/rand            51 ns      518 ns

I find that O’Neill’s PCG is fastest in my tests. It is not cryptographically strong, but if you are using the random numbers for testing or simulation purposes, it should serve you well.

Aesthetics matter

2025-10-08 22:16:55

I gave a talk last week at a software shop. Two hundred engineers were present. I wore a tie. The man who invited me wore a tie. Can you guess how the 200 engineers were dressed? Yeah. You guessed it. They all picked up a t-shirt that they found under their bed.
 
I attended a large programming conference in Denver a few weeks ago. I shared a picture someone took of me with a friend back in Montreal. She said that my shoes were too casual. I told her: Never mind my shoes; I am the only one with a jacket and a tie in the whole place.
 
I have had the pleasure of attending a great many PhD defenses. The typical attire for someone seeking to get their PhD is an ugly t-shirt and old sneakers.
 
I won’t pretend that I dress well. Many people dress much better than I do. But I try to pay attention to how I look, a bit. My sons will tell you that I urge them not to show up in a professional setting with a wrinkled t-shirt.
 
Why would it matter? After all, many of these people with wrinkled t-shirts out-earn me by a factor of 5 or 10. They drive cars that I could not afford. They fly first class while I am in coach.
 
A colleague of mine was grading programming assignments, and he remarked that some students do not understand, intuitively, that code should generally be indented. We had a blast when I showed him that researchers often submit papers with pseudocode that is just randomly laid out on the page.
 
Why does it matter how the code looks? If it works, it works, right?
 
I live near Montreal. I think that it is one of the ugliest cities. It is in a constant state of disrepair. Little attention is paid to how things are going to look.
 
Why would it matter? You just drive around.
 
Why would anyone write a book entitled Beautiful Code?
 
Why would anyone build a house in Venice, a house on water? Sure, it offers some protection from invaders, but why are the houses not fortified? Why did they make the city so elegant?
 
Why would it matter? Because, as Marshall McLuhan so aptly put it, “the medium is the message.” The form of a medium embeds itself in any message it transmits, creating a symbiotic relationship whereby the medium influences and transforms the perception of the content. In the case of code, it’s not just its execution that counts, but its presentation that invites collaboration and clarity. Likewise, your tie or jacket are no mere ornaments: they are the medium through which you assert a professional presence, an attention to detail that elevates the exchange. And for Montreal, this persistent ugliness is no trivial matter; it is the medium of collective neglect that erodes everyday well-being, turning a simple drive into a diffuse experience of gloom. Aesthetics is no superfluous luxury: it is the invisible vector that amplifies or diminishes everything we build.