2026-04-06 05:16:14

Suppose you have several strings and you want to count the number of instances of the character ! in your strings. In C++, you might solve the problem as follows if you are an old-school programmer.
size_t c = 0;
for (const auto &str : strings) {
  c += std::count(str.begin(), str.end(), '!');
}
You can also get fancier with ranges.
for (const auto &str : strings) {
  c += std::ranges::count(str, '!');
}
And so forth.
But what if you want to go faster? Maybe you’d want to rewrite this function in assembly. I decided to do so, and to have fun using both Grok and Claude as my AIs, setting up a friendly competition.
I started with my function and then I asked AIs to optimize it in assembly. Importantly, they knew which machine I was on, so they started to write ARM assembly.
By repeated prompting, I got the following functions.
- count_classic: Uses the C++ standard library std::count, for reference.
- count_assembly: A basic ARM64 assembly loop (byte-by-byte comparison). Written by Grok.
- count_assembly_claude: Claude's SIMD-optimized version using NEON instructions (16-byte chunks).
- count_assembly_grok: Grok's optimized version (32-byte chunks).
- count_assembly_claude_2: Claude's further optimized version (64-byte chunks with multiple accumulators).
- count_assembly_grok_2: Grok's latest version (64-byte chunks with improved accumulator handling).
- count_assembly_claude_3: Claude's most advanced version with additional optimizations.

You get the idea.
So, how is the performance? I use random strings of up to 1 kilobyte. In all cases, I test that the functions provide the correct count. However, I did not closely examine the code, so mistakes could still be hiding in it.
I record the average number of instructions per string.
| name | instructions/string |
|---|---|
| classic C++ | 1200 |
| claude assembly | 250 |
| grok assembly | 204 |
| claude assembly 2 | 183 |
| grok assembly 2 | 176 |
| claude assembly 3 | 154 |
By repeated optimization, I reduced the number of instructions by a factor of eight. The running time decreases similarly.
Can we get the AIs to rewrite the best option in C? Yes, although you need SIMD intrinsics. So there is no benefit to leaving the code in assembly in this instance.
An open question is whether the AIs could find optimizations that are not possible if we use a higher-level language like C or C++. It is an intriguing question that I will seek to answer later. For the time being, the AIs can beat my C++ compiler!
2026-03-30 02:18:01

Consider the following problem. You have a large set of strings, maybe millions. You need to map these strings to 8-byte integers (uint64). These integers are given to you.
If you are working in Go, the standard solution is to create a map. The construction is trivial, something like the following loop.
m := make(map[string]uint64, N)
for i, k := range keys {
    m[k] = values[i]
}
One downside is that the map may use over 50 bytes per entry.
In important scenarios, we might have the following conditions: the map is large (a million entries or more), you do not need to modify it dynamically (it is immutable), and all queried keys are in the set. In such conditions, you can reduce the memory usage down to almost the size of the keys, so about 8 bytes per entry. One fast technique relies on binary fuse filters.
I implemented this technique as a Go library called constmap, which provides an immutable map from strings to uint64 values using binary fuse filters. This data structure is ideal when you have a fixed set of keys at construction time and need fast, memory-efficient lookups afterward. You can even construct the map once and save it to disk, so you do not pay the cost of constructing the map each time you need it.
The usage is just as simple.
package main

import (
    "fmt"
    "log"

    "github.com/lemire/constmap"
)

func main() {
    keys := []string{"apple", "banana", "cherry"}
    values := []uint64{100, 200, 300}
    cm, err := constmap.New(keys, values)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(cm.Map("banana")) // 200
}
The construction time is higher (as expected for any compact data structure), but lookups are optimized for speed. I ran benchmarks on my Apple M4 Max processor to compare constmap lookups against Go’s built-in map[string]uint64. The test uses 1 million keys.
| Data Structure | Lookup Time | Memory Usage |
|---|---|---|
| ConstMap | 7.4 ns/op | 9 bytes/key |
| Go Map | 20 ns/op | 56 bytes/key |
ConstMap is nearly 3 times faster than Go’s standard map for lookups! And we reduced the memory usage by a factor of 6.
Source code. The implementation is available on GitHub: github.com/lemire/constmap.
2026-03-26 08:29:38

The next C++ standard (C++26) is getting exciting new features. One of these features is compile-time reflection. It is ideally suited to serialize and deserialize data at high speed. To test it out, we extended our fast JSON library (simdjson) and we gave a talk at CppCon 2025. The video is out on YouTube.
2026-03-19 05:52:53

Modern processors can execute many instructions per cycle on a single core. To achieve this in practice, they must predict branches. I have made the point over the years that modern CPUs have an incredible ability to predict branches.
It makes benchmarking difficult because if you test on small datasets, you can get surprising results that might not work on real data.
My go-to benchmark is a function like so:
while (howmany != 0) {
  val = generate_random_value()
  if (val is odd) write val to buffer
  decrement howmany
}
The processor tries to predict the branch (if clause). Because we use random values, the processor should mispredict one time out of two.
However, if we repeat the benchmark multiple times, always using the same random values, the processor learns the branches. How many branches can processors learn? I test using three recent processors.
Once more I am disappointed by Intel. AMD is doing wonderfully well on this benchmark.
2026-03-09 04:09:48

Suppose that you have a record of your sales per day. You might want to get a running record where, for each day, you are told how many sales you have made since the start of the year.
| day | sales per day | running sales |
|---|---|---|
| 1 | $10 | $10 |
| 2 | $15 | $25 |
| 3 | $5 | $30 |
Such an operation is called a prefix sum or a scan.
Implementing it in C is not difficult. It is a simple loop.
for (size_t i = 1; i < length; i++) {
  data[i] += data[i - 1];
}
How fast can this function be? We can derive a speed limit rather simply: to compute the current value, you must have computed the previous one, and so forth.
data[0] -> data[1] -> data[2] -> ...
At best, you require one CPU cycle per entry in your table. Thus, on a 4 GHz processor, you might process 4 billion integer values per second. It is an upper bound but you might be able to reach close to it in practice on many modern systems. Of course, there are other instructions involved such as loads, stores and branching, but our processors can execute many instructions per cycle and they can predict branches effectively. So you should be able to process billions of integers per second on most processors today.
Not bad! But can we do better?
We can use SIMD instructions. SIMD instructions are special instructions that process several values at once. All 64-bit ARM processors support NEON instructions. NEON instructions can process four 32-bit integers at once, if they are packed in one SIMD register.
But how do you do the prefix sum on a 4-value register? You can do it with two shifts and two additions. In theory, it scales as log(N) where N is the number of elements in a vector register.
input = [A B C D]
shift1 = [0 A B C]
sum1 = [A A+B B+C C+D]
shift2 = [0 0 A A+B]
result = [A A+B A+B+C A+B+C+D]
You can then extract the last value (A+B+C+D) and broadcast it to all positions so that you can add it to the next value.
Is this faster than the scalar approach? We have 4 instructions in sequence, plus at least one instruction if you want to use the total sum in the next block of four values.
Thus the SIMD approach might be worse. It is disappointing.
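To convince yourself that the two-shift, two-add recipe is right, it can be simulated with scalar C++ (my own illustration; on NEON the shifts correspond to vextq_u32 with a zero vector):

```cpp
#include <array>
#include <cstdint>

// Scalar simulation of the in-register prefix sum over four lanes.
// Each "shift" moves lanes up by one or two positions, inserting zeros.
std::array<uint32_t, 4> prefix4(std::array<uint32_t, 4> v) {
  // shift by one lane, then add: [A, A+B, B+C, C+D]
  std::array<uint32_t, 4> s1 = {0, v[0], v[1], v[2]};
  for (int i = 0; i < 4; i++) v[i] += s1[i];
  // shift by two lanes, then add: [A, A+B, A+B+C, A+B+C+D]
  std::array<uint32_t, 4> s2 = {0, 0, v[0], v[1]};
  for (int i = 0; i < 4; i++) v[i] += s2[i];
  return v;
}
```

The last lane of the result holds the block total, which is the value that gets broadcast and carried into the next block of four.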
A solution might be to scale up over many more integer values.
Consider ARM NEON, which has interleaved load and store instructions. With these, you can load 16 values at once and gather all the first values together, all the second values together, and so forth.
original data : ABCD EFGH IJKL MNOP
loaded data : AEIM BFJN CGKO DHLP
Then we can do a prefix sum over the four blocks in parallel. It takes three instructions. At the end of the three instructions, one register contains the local sums:
A+B+C+D E+F+G+H I+J+K+L M+N+O+P
And then we can apply our prefix sum recipe on this register (4 instructions). You might end up with something like 8 sequential instructions per block of 16 values.
It is theoretically twice as fast as the scalar approach.
In C with intrinsics, you might code it as follows.
#include <arm_neon.h>

void neon_prefixsum_fast(uint32_t *data, size_t length) {
  uint32x4_t zero = {0, 0, 0, 0};
  uint32x4_t prev = {0, 0, 0, 0};
  for (size_t i = 0; i < length / 16; i++) {
    uint32x4x4_t vals = vld4q_u32(data + 16 * i);
    // Prefix sum inside each transposed ("vertical") lane
    vals.val[1] = vaddq_u32(vals.val[1], vals.val[0]);
    vals.val[2] = vaddq_u32(vals.val[2], vals.val[1]);
    vals.val[3] = vaddq_u32(vals.val[3], vals.val[2]);
    // Now vals.val[3] contains the four local prefix sums:
    // vals.val[3] = [s0=A+B+C+D, s1=E+F+G+H,
    //                s2=I+J+K+L, s3=M+N+O+P]
    // Compute prefix sum across the four local sums
    uint32x4_t off = vextq_u32(zero, vals.val[3], 3);
    uint32x4_t ps = vaddq_u32(vals.val[3], off);
    off = vextq_u32(zero, ps, 2);
    ps = vaddq_u32(ps, off);
    // Now ps contains cumulative sums across the four groups.
    // Add the incoming carry from the previous 16-element block.
    ps = vaddq_u32(ps, prev);
    // The add vector to apply to the original lanes is the prefix up
    // to the previous group: [carry, ps0, ps1, ps2]. At this point
    // prev still holds the old carry in every lane, so its last lane
    // supplies the carry for lane 0.
    uint32x4_t add = vextq_u32(prev, ps, 3);
    // Prepare carry for next block: broadcast the last lane of ps
    prev = vdupq_laneq_u32(ps, 3);
    // Apply carry/offset to each of the four transposed lanes
    vals.val[0] = vaddq_u32(vals.val[0], add);
    vals.val[1] = vaddq_u32(vals.val[1], add);
    vals.val[2] = vaddq_u32(vals.val[2], add);
    vals.val[3] = vaddq_u32(vals.val[3], add);
    // Store back the four lanes (interleaved)
    vst4q_u32(data + 16 * i, vals);
  }
  // Finish the remaining length % 16 values with the scalar loop.
  scalar_prefixsum_leftover(data, length, 16);
}
Let us try it out on an Apple M4 processor (4.5 GHz).
| method | billions of values/s |
|---|---|
| scalar | 3.9 |
| naive SIMD | 3.6 |
| fast SIMD | 8.9 |
So the SIMD approach is about 2.3 times faster than the scalar approach. Not bad.
My source code is available on GitHub.
Appendix. Intrinsics
| Intrinsic | What it does |
|---|---|
| vld4q_u32 | Loads 16 consecutive 32-bit unsigned integers from memory and deinterleaves them into 4 separate uint32x4_t vectors (lane 0 = elements 0, 4, 8, 12, …; lane 1 = elements 1, 5, 9, 13, …; etc.). |
| vaddq_u32 | Adds corresponding 32-bit unsigned integer lanes from two vectors (a[i] + b[i] for each of the 4 lanes). |
| vextq_u32 | Concatenates a and b, then takes 4 lanes starting from lane n of the 8-lane concatenation. Used to implement shifts that insert zeros (when a is the zero vector). |
| vdupq_laneq_u32 | Broadcasts (duplicates) the value from the specified lane (0–3) of the input vector to all 4 lanes of the result. |
| vdupq_n_u32 (implied usage) | Sets all 4 lanes of the result to the same scalar value (commonly used for zero or a broadcast). |
2026-03-05 22:40:58

The Internet relies on text formats. Thus, we spend a lot of time producing and consuming data encoded in text.
Your web pages are HTML. The code running in them is JavaScript, sent as text (JavaScript source), not as already-parsed code. Your emails, including their attachments, are sent as text (your binary files are sent as text).
It does not stop there. The Python code that runs your server is stored as text. It queries data by sending text queries. It often gets back the answer as text that must then be decoded.
JSON is the universal data interchange format online today. We share maps as JSON (GeoJSON).
Not everything is text, of course. There is no common video or image format that is shared as text. Transmissions over the Internet are routinely compressed to binary formats. There are popular binary formats that compete with JSON.
But why is text dominant?
It is not because, back in the 1970s, programmers did not know about binary formats.
In fact, we did not start with text formats. Initially, we worked with raw binary data. Those of us old enough will remember programming in assembly using raw byte values.
Why did text win?
1. Text is efficient.
In the XML era, when everything had to be XML, there were countless proposals for binary formats. People were sometimes surprised to find that the binary approach was not much faster in practice. Remember that many text formats date back to an era when computers were much slower. Had text been a performance bottleneck, it would not have spread. Of course, there are cases where text makes things slower. You then have a choice: optimize your code further or transition to another format. Often, both are viable.
It is easy to make wrong assumptions about binary formats, such as that you can consume them without any parsing or validation. If you pick up data from the Internet, you must assume that it could have been sent by an adversary or someone who does not follow your conventions.
2. Text is easy to work with.
If you receive text from a remote source, you can often transform it, index it, search it, quote it, version it… with little effort and without in-depth knowledge of the format. Text is often self-documenting.
In an open world, when you will never speak with the person producing the data, text often makes everything easier and smoother.
If there is an issue to report and the data is in text, you can usually copy-paste the relevant section into a message. Things are much harder with a binary format.