Daniel Lemire
Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.

Blog of Daniel Lemire

Aesthetics matter

2025-10-08 22:16:55

I gave a talk last week at a software shop. Two hundred engineers were present. I wore a tie. The man who invited me wore a tie. Can you guess how the 200 engineers were dressed? Yeah. You guessed it. They all picked up a t-shirt that they found under their bed.
 
I attended a large programming conference in Denver a few weeks ago. I shared a picture someone took of me with a friend back in Montreal. She said that my shoes were too casual. I told her: Never mind my shoes; I am the only one with a jacket and a tie in the whole place.
 
I have had the pleasure of attending a great many PhD defenses. The typical attire for someone seeking to get their PhD is an ugly t-shirt and old sneakers.
 
I won’t pretend that I dress well. Many people dress much better than I do. But I try to pay attention to how I look, a bit. My sons will tell you that I urge them not to show up in a professional setting with a wrinkled t-shirt.
 
Why would it matter? After all, many of these people with wrinkled t-shirts out-earn me by a factor of 5 or 10. They drive cars that I could not afford. They fly first class while I am in coach.
 
A colleague of mine was grading programming assignments, and he remarked that some students do not understand, intuitively, that code should generally be indented. We had a blast when I showed him that researchers often submit papers with pseudocode that is just randomly laid out on the page.
 
Why does it matter how the code looks? If it works, it works, right?
 
I live near Montreal. I think that it is one of the ugliest cities. It is in a constant state of disrepair. Little attention is paid to how things are going to look.
 
Why would it matter? You just drive around.
 
Why would anyone write a book entitled Beautiful Code?
 
Why would anyone build a house in Venice, a house on water? Sure, it offers some protection from invaders, but why are the houses not fortified? Why did they make the city so elegant?
 
Why would it matter? Because, as Marshall McLuhan so aptly put it, “the medium is the message.” The form of a medium embeds itself in any message it transmits, creating a symbiotic relationship whereby the medium influences and transforms the perception of the content. In the case of code, it’s not just its execution that counts, but its presentation that invites collaboration and clarity. Likewise, your tie or jacket are no mere ornaments: they are the medium through which you assert a professional presence, an attention to detail that elevates the exchange. And for Montreal, this persistent ugliness is no trivial matter; it is the medium of collective neglect that erodes everyday well-being, turning a simple drive into a diffuse experience of gloom. Aesthetics is no superfluous luxury: it is the invisible vector that amplifies or diminishes everything we build.

std::ranges may not deliver the performance that you expect

2025-10-06 05:35:07

Good engineers seek software code that is ‘simple’ in the sense that we can read and understand it quickly. But they also look for highly performant code.

For the last 20 years, we have been offering programmers the possibility to replace conventional for loops with a more functional approach. To illustrate, suppose that you want to extract all even integers from a container and create a new container. In conventional C++, you would proceed with a loop, as follows.

std::vector<int> even_numbers;
for (int n : numbers) {
  if (n % 2 == 0) {
    even_numbers.push_back(n);
  }
}

In recent versions of C++, we have std::ranges, which allows us to rewrite the code without an explicit for loop.

auto even_numbers = numbers 
    | std::views::filter([](int n) { return n % 2 == 0; })
    | std::ranges::to<std::vector>();

The magic underneath is that the filter is lazy: it does not go through the whole input and generate a temporary result before passing it on. This allows you to pipe a whole lot of transformations, one after the other. You can also sum up the results, and, in the near future, we are even getting parallel execution. One could probably write a whole book on std::ranges, and maybe someone did.

C++ is not at all unique in this respect. Even Java has been offering similar constructs for many years.

The principle behind abstractions like std::ranges is that we get the full performance without compromise. Sadly, such promises should be taken with caution.

I gave a talk at a local C++ company this week. One of the questions I was asked was about the performance overhead of C++ ranges. When they switched their compiler to C++20, some engineers tried out std::ranges and triggered performance degradation.

Within our fast C++ JSON parser (simdjson), we have, thus far, limited std::ranges support to specific portions of the library… not because we could not make it work, but because it led to performance degradation. To be clear, I am confident that we will make it work, but it is not trivial. Importantly, we benchmarked the code before releasing it.

To illustrate the potential performance pitfalls, I picked the first example I found online: how to trim the spaces at the beginning and at the end of a string.

auto trimmed = s
    | std::views::drop_while(is_space)
    | std::views::reverse
    | std::views::drop_while(is_space)
    | std::views::reverse;

This C++ code snippet processes a string s to trim whitespace characters from both ends without modifying the original string, returning a lazy view of the trimmed result. It begins by applying std::views::drop_while(is_space) to remove any leading whitespace. Then, std::views::reverse inverts the remaining content, transforming the original trailing whitespace into leading whitespace. Next, another std::views::drop_while(is_space) discards this new leading whitespace (effectively the original trailing part), and a final std::views::reverse restores the original order.

To be clear, that is just one way to get the job done with std::ranges; I just happened to pick it because it is what I first found online. Instead of this functional code, we can use a messier-looking approach with two loops.

while (!input.empty() &&
         is_space(input.front())) {
    input.remove_prefix(1);
}
while (!input.empty() && is_space(input.back())) {
    input.remove_suffix(1);
}

So how does the performance vary? I wrote a benchmark to compare these functions. I use random strings that do not contain spaces. I record the number of instructions required per string processed. I use two distinct C++ compilers and two distinct processors.

function       LLVM 17 / Apple M4   GCC 15 / Intel Ice Lake
std::ranges    24                   70
conventional   18                   16

In my case, I find that the conventional function is faster, as it generates fewer instructions. In the case of GCC, the difference is large. You can look at how GCC compiles these functions if you are interested.

Does it mean that std::ranges are a bad idea or that they are poorly designed? No. It means that there are no magical faeries giving you the best performance out of the box. Use std::ranges, but benchmark, benchmark, and benchmark again.

 

Beyond papers: rethinking science in the era of artificial intelligence

2025-10-04 03:18:13

For the last few decades, we have defined science by the publication of peer-reviewed research. That is, a scientist is someone who can write a paper and have it reviewed by two to four other scientists who conclude that the work is credible.

Notice how this is a circular definition: it defines a scientist as, effectively, a member of a social club of people who agree on writing conventions.

Contrary to what many people believe, this system of peer review is recent. When Watson and Crick submitted their ‘double helix’ paper to Nature in 1953, it did not go to reviewers.

Peer review became the dominant model in the 1970s. It emerged around the same time the West adopted aspects of the Soviet model of science. It also coincided with a decade-long stagnation in science. I believe it was a mistake.

Sure, science is still making amazing progress. I celebrate the incredible scientific productivity of enterprises like Google DeepMind and OpenAI—enterprises that do not rely on conventional peer review.

This model of ‘peer-review science’ now faces an existential threat. In fact, I do not believe that the current dominant model of science, established in the early 1970s, can survive. Indeed, Large Language Models can produce papers today that can pass the peer review test. In a sense, it is an even easier test to pass than the Turing test, because the conventions of peer review science are so artificial.

Why would anyone be paid to write papers that are indistinguishable from what ChatGPT can produce in mere seconds?

So where do we go from here? There is only one option. We have to stop considering the research paper as the end product of research. It never was, in any case. That was always a convenient bureaucratic convention, inspired by the need to ‘manage science.’

If ChatGPT can produce research papers that are indistinguishable from what most scientists can write, then maybe scientists can focus on actually advancing science—something that ChatGPT has thus far proven unable to do.

« AI’s ability to generate vast amounts of text raises concerns about a potential flood of irrelevant theoretical papers, further straining the evaluation system. Stonebraker’s (2018) call for rewarding problem-solving over publication needs revisiting. Perhaps the emphasis should be on the impact and significance of research, not just its passage through peer review—a skill replicable by AI. » Daniel Lemire. 2024. Will AI Flood Us with Irrelevant Papers? Commun. ACM 67, 9 (September 2024), 9. https://doi.org/10.1145/3673649

The smallest number that is infinite

2025-09-30 05:28:33

In software, we represent real numbers as binary floating-point numbers. Effectively, we represent real numbers as a fixed-precision integer (the significand) multiplied by a power of two. Thus we do not represent exactly the number ‘pi’, but we get a very close approximation: 3.141592653589793115997963468544185161590576171875.  The first 16 digits are exact.

You can represent the approximation of pi as 7074237752028440 times 2 to the power -51. It is a bit annoying to represent numbers in this manner, so let us use the hexadecimal float notation.

In the hexadecimal float notation, the number 1 would be 0x1.0 and the number 2 would be 0x2.0. It works just like decimal numbers, except that the value one half would be 0x0.8 and the infinite string 0x0.ffffff… would be 1.

Let us go back to the number pi. We write the significand as a hexadecimal number, 0x1921fb54442d18, and we insert a period right after the first digit: 0x1.921fb54442d18. It is a number in the range [1,2), specifically 0x1921fb54442d18 times 2 to the power -52.

To get the number pi, we need to multiply by 2, which we do by appending ‘p+1’ at the end: 0x1.921fb54442d18p+1.

Of course, you do not compute any of this by hand. In modern C++, you just do:

std::print("Pi (hex): {:a}\n", std::numbers::pi);

If you are using 64-bit floating-point numbers, then you can go from about -1.7976931348623157e308 to 1.7976931348623157e308, using the exponential notation where 1.79769e308 means 1.79769 times 10 to the power 308. In hexadecimal notation, the largest value is 0x1.fffffffffffffp+1023. The integer value 0x1fffffffffffff is 9007199254740991 and the largest value is 9007199254740991 times 2 to the power 1023-52, that is, 2 to the power 971.

Numbers outside of this range are represented by an infinite value. Indeed, our computers have a notion of infinity. And they know that 1 over infinity is 0. Thus in modern C++, the following would print zero.

double infinity = std::numeric_limits<double>::infinity();
std::print("Zero: {}\n", 1 / infinity);

You might be surprised to find that if you type in a string value that is larger than the maximal value that can be represented, you do not get an error or an infinite value. For example, the number represented by the string 1.7976931348623158e308 is larger than the largest value that can be represented by a floating-point type, but it is not infinite. The following C++ code will print ‘true’:

std::print("{}\n", 1.7976931348623158e308 == std::numeric_limits<double>::max());

So what is the smallest number represented as a string that will map to infinity? Let us go back to the hexadecimal representation.

  1. The second largest value is 0x1.ffffffffffffep+1023
  2. The largest value is 0x1.fffffffffffffp+1023.
  3. The next largest value would be 0x2.0p+1023.
  4. The midpoint between the two is 0x1.fffffffffffff8p+1023.

By default, we always round to the nearest value. Thus all number strings between 0x1.ffffffffffffe8p+1023 and 0x1.fffffffffffff8p+1023 round to 0x1.fffffffffffffp+1023. If you are right at 0x1.fffffffffffff8p+1023, then we ‘round to the nearest even’ value, which is 0x2.0p+1023. Thus the number 0x1.fffffffffffff8p+1023 is right at the border with the infinite values.

In decimal, this value is 179769313486231580793728971405303415079934132710037826936173778980444968292764750946649017977587207096330286416692887910946555547851940402630657488671505820681908902000708383676273854845817711531764475730270069855571366959622842914819860834936475292719074168444365510704342711559699508093042880177904174497792.0.

If you type this string as a number constant, many compilers (irrespective of the programming language) will complain and reject it. Any number string just slightly smaller should be fine.


Hard work is a virtue

2025-09-22 03:10:18

When I was an undergraduate student, I discovered symbolic algebra. It was great! Instead of solving for a variable by hand, I could just put all the equations in the machine and get the result.

I soon found that symbolic algebra did not turn me into a genius mathematician. It could do the boring work, but I often found myself “getting stuck.” I would start from a problem, throw it at the machine and get back a mess that did not get me closer to the solution.

Over time, I realized that these tools had an undesirable effect on me. You see, human beings are lazy by nature. If you can avoid having to think hard about an issue, you will.

And I fear that much of the same is happening with large language models. Why think it through? Why read the documentation? Let us just have the machine try and try again until it succeeds.

It works. Symbolic algebra can solve lots of interesting mathematical problems. Large language models can solve even more problems.

But if you just sit there and mindlessly babysit a large language model, where are your new skills going to come from? Where are your deep insights going to come from?

I am not being a Luddite. I encourage you to embrace new tools when you can. But you should not dispense with doing the hard work.

Splitting a long string in lines efficiently

2025-09-08 03:44:56

Suppose that you have a long string and you want to insert line breaks every 72 characters. You might need to do this if you need to write a public cryptographic key to a text file.

A simple C function ought to suffice. I use the letter K to indicate the length of the lines. I copy from an input buffer to an output buffer.

void insert_line_feed(const char *buffer, size_t length, 
        int K, char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  size_t next_line_feed = K;
  while (input_pos < length) {
    output[0] = buffer[input_pos];
    output++;
    input_pos++;
    next_line_feed--;
    if (next_line_feed == 0) {
      output[0] = '\n';
      output++;
      next_line_feed = K;
    }
  }
}

This character-by-character process might be inefficient. To go faster, we might call memcpy to copy blocks of data.

void insert_line_feed_memcpy(const char *buffer, size_t length, int K,
                             char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  while (input_pos + K < length) {
    std::memcpy(output, buffer + input_pos, K);
    output += K;
    input_pos += K;
    output[0] = '\n';
    output++;
  }
  std::memcpy(output, buffer + input_pos, length - input_pos);
}

The memcpy function is likely to be turned into just a few instructions. For example, if you compile for a recent AMD processor (Zen 5), it might generate only two instructions (two vmovups) when the length of the lines (K) is 64.

Can we do better?

In general, I expect that you cannot do much better than using the memcpy function. Compilers are simply great at optimizing it.

Yet it might be interesting to explore whether deliberate use of SIMD instructions could optimize this code. SIMD (Single Instruction, Multiple Data) instructions process multiple data elements simultaneously with a single instruction: the memcpy function automatically uses them. We can use SIMD instructions through intrinsic functions, which are compiler-supplied interfaces that give direct access to processor-specific instructions while preserving high-level code readability.

Let me focus on AVX2, the instruction set supported by effectively all x64 (Intel and AMD) processors. We can load 32-byte registers and write 32-byte registers. Thus we need a function that takes a 32-byte register and inserts a line-feed character at some location (N) in it. When N is 16 or greater, the function directly uses a shuffle mask from the precomputed shuffle_masks array to reorder the input bytes and insert the newline, leveraging a comparison with 0x80 to identify the insertion point and blending the result with a vector of newline characters. For cases where N is less than 16, the function first shifts the input vector right by one byte to align the data correctly, using _mm256_alignr_epi8 and _mm256_blend_epi32, before applying the shuffle mask and inserting the newline.

// Assumes a precomputed shuffle_masks table of 16-byte entries (not shown).
inline __m256i insert_line_feed32(__m256i input, int N) {
  __m256i line_feed_vector = _mm256_set1_epi8('\n');
  __m128i identity =
      _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8,
         9, 10, 11, 12, 13, 14, 15);
  if (N >= 16) {
    __m128i maskhi = _mm_loadu_si128(
        (const __m128i *)shuffle_masks[N - 16]);
    __m256i mask = _mm256_set_m128i(maskhi, identity);
    __m256i lf_pos = _mm256_cmpeq_epi8(mask,
       _mm256_set1_epi8((char)0x80));
    __m256i shuffled = _mm256_shuffle_epi8(input, mask);
    __m256i result = _mm256_blendv_epi8(shuffled,
       line_feed_vector, lf_pos);
    return result;
  }
  // Shift input right by 1 byte
  __m256i shift = _mm256_alignr_epi8(
      input, _mm256_permute2x128_si256(input, input, 0x21),
        15);
  input = _mm256_blend_epi32(input, shift, 0xF0);
  __m128i masklo = _mm_loadu_si128(
      (const __m128i *)shuffle_masks[N]);
  __m256i mask = _mm256_set_m128i(identity, masklo);
  __m256i lf_pos = _mm256_cmpeq_epi8(mask,
      _mm256_set1_epi8((char)0x80));
  __m256i shuffled = _mm256_shuffle_epi8(input, mask);
  __m256i result = _mm256_blendv_epi8(shuffled,
      line_feed_vector, lf_pos);
  return result;
}

Can we go faster by using such a fancy function? Let us test it out. I wrote a benchmark. I use a large input string on an Intel Ice Lake processor with GCC 12.

function                 speed      instructions
character-by-character   1.0 GB/s   8.0 ins/byte
memcpy                   11 GB/s    0.46 ins/byte
AVX2                     16 GB/s    0.52 ins/byte

In my tests, the handcrafted AVX2 approach is faster than the memcpy approach despite retiring more instructions per byte overall: it stores data to memory using fewer instructions.