2025-12-21 07:26:09

Strings in programming are often represented as arrays of 8-bit words. The string is ASCII if and only if all 8-bit words have their most significant bit unset. In other words, the byte values must be no larger than 127 (or 0x7F in hexadecimal).
A decent C++ function to check that a string is ASCII is as follows.
bool is_ascii_pessimistic(const char *data, size_t length) {
    for (size_t i = 0; i < length; i++) {
        if (static_cast<unsigned char>(data[i]) > 0x7F) {
            return false;
        }
    }
    return true;
}
We go over each character, compare it with 0x7F, and continue as long as the value is no larger than 0x7F. If you scan the entire string and every test passes, you know that your string is ASCII.
Notice how I called this function pessimistic. What do I mean? I mean that it expects, in some sense, to find a non-ASCII character. If that happens, the best option is to return immediately rather than scan the whole string.
What if you expect the string to almost always be ASCII? An alternative then is to effectively do a bitwise OR reduction of the string: you OR all characters together and you check just once that the result is bounded by 0x7F. If any character has its most significant bit set, then the bitwise OR of all characters will also have its most significant bit set. So you might write your function as follows.
bool is_ascii_optimistic(const char *data, size_t length) {
    unsigned char result = 0;
    for (size_t i = 0; i < length; i++) {
        result |= static_cast<unsigned char>(data[i]);
    }
    return result <= 0x7F;
}
If your strings are all pure ASCII, which function will be fastest? Perhaps surprisingly, the optimistic function can be several times faster. I wrote a benchmark and ran it with GCC 15 on an Intel Ice Lake processor. I get the following results.
| function | speed |
|---|---|
| pessimistic | 1.8 GB/s |
| optimistic | 13 GB/s |
Why is the optimistic function faster? Mostly because the compiler is better able to optimize it. Among other things, it can rely on autovectorization to exploit data-level parallelism (e.g., SIMD instructions).
Which function is best depends on your use case.
What if you would prefer a pessimistic function, that is, one that returns early when a non-ASCII character is encountered, but you still want high speed? Then you can use a dedicated library like simdutf where we have hand-coded the logic. In simdutf, the pessimistic function is called validate_ascii_with_errors. Your results will vary, but in my tests it ran at about the same speed as the optimistic function.
| function | speed |
|---|---|
| pessimistic | 1.8 GB/s |
| pessimistic (simdutf) | 14 GB/s |
| optimistic | 13 GB/s |
So it is possible to combine the benefits of pessimism and optimism, although it requires a bit of care.
2025-12-21 05:24:33

Much of the data on the Internet is shared using a simple format called JSON. JSON is made of two composite types (arrays and key-value maps) and a small number of primitive types (64-bit floating-point numbers, strings, null, Booleans). That JSON became ubiquitous despite its simplicity is telling.
{ "name": "Nova Starlight", "age": 28, "powers": ["telekinesis", "flight","energy blasts"] }
Interestingly, JSON closely matches the data structures provided by default in the popular Go language. Go gives you arrays/slices and maps… in addition to the standard primitive types. That is a bit more than C, which does not provide maps by default. But it is significantly simpler than Java, C++, C#, and many other programming languages where the standard library covers many of the data structures found in textbooks.
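To make this concrete, here is a minimal sketch (assuming nothing beyond the standard encoding/json package) that decodes the record above into Go's default types: the object becomes a map, the array becomes a slice, the number becomes a float64, and the strings stay strings.
package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // The example record from above: JSON objects, arrays, numbers and strings
    // map directly onto Go's built-in map, slice, float64 and string types.
    doc := `{ "name": "Nova Starlight", "age": 28, "powers": ["telekinesis", "flight", "energy blasts"] }`

    var v map[string]any
    if err := json.Unmarshal([]byte(doc), &v); err != nil {
        panic(err)
    }
    fmt.Printf("%T: %v\n", v["age"], v["age"])       // float64: 28
    fmt.Printf("%T: %v\n", v["powers"], v["powers"]) // []interface {}: [telekinesis flight energy blasts]
}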
There is at least one obvious data structure missing from both JSON and Go: the set. Because objects are supposed to have no duplicate keys, you can implement a set of strings by mapping each key to an arbitrary value such as true.
{"element1": true, "element2": true}
But I believe that it is a somewhat unusual pattern. Most of the time, when we mean to represent a set of objects, an array suffices; we just need to handle the duplicates somehow, as sketched below.
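Here is one way to do it in Go (a sketch, not the only option): load the array into a map used as a set, so that duplicates collapse on their own.
package main

import "fmt"

func main() {
    // A JSON array standing in for a set; duplicates disappear once the
    // elements are loaded into a map used as a set.
    elements := []string{"element1", "element2", "element1"}

    set := make(map[string]struct{}, len(elements))
    for _, e := range elements {
        set[e] = struct{}{} // writing the same key twice is harmless
    }
    fmt.Println(len(set)) // 2
}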
There have been many attempts at adding more concepts to JSON, more complexity, but none of them have achieved much traction. I believe that it reflects the fact that JSON is good enough as a data format.
I refer to any format that allows you to represent JSON data, such as YAML, as a JSON-complete data format. If it is at least equivalent to JSON, it is rich enough for most problems.
Similarly, I suggest that new programming languages should aim to be JSON-complete: they should provide key-value maps, arrays, and the basic primitive types. In this light, the C and Pascal programming languages are not JSON-complete.
2025-12-15 09:42:10

Programmers often want to randomly shuffle arrays. Evidently, we want to do so as efficiently as possible. Maybe surprisingly, I found that the performance of random shuffling was not limited by memory bandwidth or latency, but rather by computation. Specifically, it is the computation of the random indexes itself that is slow.
Earlier in 2025, I reported how you could more than double the speed of a random shuffle in Go using a new algorithm (Brackett-Rozinsky and Lemire, 2025). However, I was using custom code that could not serve as a drop-in replacement for the standard Go Shuffle function. I decided to write a proper library called batchedrand. You can use it just like the math/rand/v2 standard library.
rng := batchedrand.Rand{rand.New(rand.NewPCG(1, 2))}
data := []int{1, 2, 3, 4, 5}
rng.Shuffle(len(data), func(i, j int) {
    data[i], data[j] = data[j], data[i]
})
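Under the hood, the speedup comes from deriving several bounded indexes from a single random word rather than making one random draw per index. The following is a simplified sketch of that idea, extracting two indexes from one 64-bit draw; it omits the rejection step that the published algorithm uses to guarantee exact uniformity, so treat it as an illustration rather than the library's actual code.
package main

import (
    "fmt"
    "math/bits"
    "math/rand/v2"
)

// twoIndexes derives a value in [0, n1) and a value in [0, n2) from a single
// 64-bit random word using multiply-high operations. A rejection step (omitted
// here) is needed to remove a small bias.
func twoIndexes(r, n1, n2 uint64) (uint64, uint64) {
    hi1, lo1 := bits.Mul64(r, n1) // hi1 lies in [0, n1)
    hi2, _ := bits.Mul64(lo1, n2) // reuse the leftover bits: hi2 lies in [0, n2)
    return hi1, hi2
}

func main() {
    rng := rand.New(rand.NewPCG(1, 2))
    i, j := twoIndexes(rng.Uint64(), 5, 4)
    fmt.Println(i, j) // two shuffle indexes from a single random draw
}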
How fast is it? The standard library provides two generators, PCG and ChaCha8. One might expect ChaCha8 to be slower than PCG because it offers stronger cryptographic guarantees. However, the two have comparable speeds: ChaCha8 is heavily optimized with assembly code in the Go runtime, while the PCG implementation is conservative and not focused on speed.
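For reference, here is how the two standard sources can be constructed with math/rand/v2 and used with the built-in Shuffle; the seed values are arbitrary placeholders.
package main

import (
    "fmt"
    "math/rand/v2"
)

func main() {
    data := []int{1, 2, 3, 4, 5}
    swap := func(i, j int) { data[i], data[j] = data[j], data[i] }

    // PCG-backed generator: fast, not cryptographic.
    pcg := rand.New(rand.NewPCG(1, 2))
    pcg.Shuffle(len(data), swap)

    // ChaCha8-backed generator: cryptographically stronger; zero seed for the example.
    chacha := rand.New(rand.NewChaCha8([32]byte{}))
    chacha.Shuffle(len(data), swap)

    fmt.Println(data)
}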
On my Apple M4 processor with Go 1.25, I get the following results. I report the time per input element, not the total time.
| Benchmark | Size | Batched (ns/item) | Standard (ns/item) | Speedup (×) |
|---|---|---|---|---|
| ChaChaShuffle | 30 | 1.8 | 4.6 | 2.6 |
| ChaChaShuffle | 100 | 1.8 | 4.7 | 2.5 |
| ChaChaShuffle | 500000 | 2.6 | 5.1 | 1.9 |
| PCGShuffle | 30 | 1.5 | 3.9 | 2.6 |
| PCGShuffle | 100 | 1.5 | 4.2 | 2.8 |
| PCGShuffle | 500000 | 1.9 | 3.8 | 2.0 |
Thus, from tiny to large arrays, the batched approach is two to three times faster. Not bad for a drop-in replacement!
Get the Go library at https://github.com/lemire/batchedrand
Further reading: Nevin Brackett-Rozinsky and Daniel Lemire, Batched Ranged Random Integer Generation, Software: Practice and Experience (2025).
2025-12-06 03:24:50

The one constant that I have observed in my professional life is that people underestimate the need to move fast.
Of course, doing good work takes time. I once spent six months writing a URL parser. But the fact that it took so long is not a feature, it is not a positive, it is a negative.
If everything is slow-moving around you, it is likely not going to be good. To fully make use of your brain, you need to move as close as possible to the speed of your thought.
If I give you two PhD students, one who completed their thesis in two years and one who took eight years… you can be almost certain that the two-year thesis will be much better.
Moving fast does not mean that you complete your projects quickly. Projects have many parts, and getting everything right may take a long time.
Nevertheless, you should move as fast as you can.
For multiple reasons:
1. A common mistake is to spend a lot of time—too much time—on a component of your project that does not matter. I once spent a lot of time building a podcast-like version of a course… only to find out later that students had no interest in the podcast format.
2. You learn by making mistakes. The faster you make mistakes, the faster you learn.
3. Your work degrades and becomes less relevant with time. And if you work slowly, you will be more likely to stick with your slightly obsolete work. You know that professor who spent seven years preparing lecture notes twenty years ago? He is not going to throw them away and start again, as that would be another seven-year project. So he will keep teaching from aging lecture notes until he retires and someone finally updates the course.
What if you are doing open-heart surgery? Don’t you want someone who spends days preparing and who works slowly? No. You almost surely want the surgeon who does many, many open-heart surgeries. They are very likely to be the best one.
Now stop being so slow. Move!
2025-12-04 23:40:59

“We see something that works, and then we understand it.” (Thomas Dullien)
It is a deeper insight than it seems.
Young people spend years in school learning the reverse: understanding happens before progress. That is the linear theory of innovation.
So Isaac Newton comes up with his three laws of motion, and we get a clockmaking boom. Of course, that’s not what happened: we get the pendulum clock in 1656, and only then do Hooke (1660) and Newton (1665–1666) get to thinking about forces, speed, motion, and latent energy.
The linear model of innovation makes as much sense as the waterfall model in software engineering. In the waterfall model, you are taught that you first need to design every detail of your software application (e.g., using a language like UML) before you implement it. To this day, half of the information technology staff at my school is made up of “analysts” whose main job is supposedly to create such designs from requirements and to supervise their execution.
Both the linear theory and the waterfall model are forms of thinkism, a term I learned from Kevin Kelly. Thinkism sets aside practice and experience. It is the belief that given a problem, you should just think long and hard about it, and if you spend enough time thinking, you will solve it.
Thinkism works well in school. The teacher gives you all the concepts, then gives you a problem that, by a wonderful coincidence, can be solved just by thinking with the tools the same teacher just gave you.
As a teacher, I can tell you that students get really angry if you put a question on an exam that requires a concept not explicitly covered in class. Of course, if you work as an engineer and you’re stuck on a problem and you tell your boss it cannot be solved with the ideas you learned in college… you’re going to look like a fool.
If you’re still in school, here’s a fact: you will learn as much or more every year of your professional life than you learned during an entire university degree—assuming you have a real engineering job.
Thinkism also works well in other limited domains beyond school. It works well in bureaucratic settings where all the rules are known and you’re expected to apply them without question. There are many jobs where you first learn and then apply. And if you ever encounter new conditions where your training doesn’t directly apply, you’re supposed to report back to your superiors, who will then tell you what to do.
But if you work in research and development, you always begin with incomplete understanding. And most of the time, even if you could read everything ever written about your problem, you still wouldn’t understand enough to solve it. The way you make discoveries is often to either try something that seems sensible, or to observe something that happens to work—maybe your colleague has a practical technique that just works—and then you start thinking about it, formalizing it, putting it into words… and it becomes a discovery.
And the reason it often works this way is that “nobody knows anything.” The world is so complex that even the smartest individual knows only a fraction of what there is to know, and much of what they think they know is slightly wrong—and they don’t know which part is wrong.
So why should you care about how progress happens? You should care because…
1. It gives you a recipe for breakthroughs: spend more time observing and trying new things… and less time thinking abstractly.
2. It explains why you should stop expecting an AI to cure all diseases or solve all problems just because it can read all the scholarship and “think” for a very long time. No matter how much an AI “knows,” it is always too little.
Further reading: Godin, Benoît (2017). Models of Innovation: The History of an Idea. MIT Press.
2025-12-04 04:41:26

It is absolutely clear to me that large language models represent the most significant scientific breakthrough of the past fifty years. The nature of that breakthrough has far-reaching implications for what is happening in science today. And I believe that the entire scientific establishment is refusing to acknowledge it.
We often excuse our slow progress with tired clichés like “all the low-hanging fruit has been picked.” It is an awfully convenient excuse if you run a scientific institution that pretends to lead the world in research—but in reality is mired in bureaucracy, stagnation and tradition.
A quick look at the world around us tells a different story: progress is possible, and even moderately easy, even through the lens of everyday experience. I have been programming in Python for twenty years and even wrote a book about it. Managing dependencies has always been a painful, frustrating process—seemingly unsolvable. The best anyone could manage was to set up a virtual environment. Yes, it was clumsy and awkward, as you know if you have programmed in Python, but that was the state of the art after decades of effort by millions of Python developers. Then, in 2024, a single tool called uv appeared and suddenly made the Python ecosystem feel sane, bringing it in line with the elegance of Go or JavaScript runtimes. In retrospect, the solution seems almost obvious.
NASA has twice the budget of SpaceX. Yet SpaceX has launched more missions to orbit in the past decade than NASA managed in the previous fifty years. The difference is not money; it is culture, agility, and a willingness to embrace new ideas.
Large language models have answered many profound scientific questions, yet one of the deepest concerns the very nature of language itself. For generations, the prevailing view was that human language depends on a vast set of logical rules that the brain applies unconsciously. That rule-based paradigm dominated much of twentieth-century linguistics and even shaped the early web. We spent an entire decade chasing the dream of the Semantic Web, convinced that if we all shared formal, machine-readable metadata, rule engines would deliver web-scale intelligence. Thanks to large language models, we now know that language does not need to be rule-based at all. Verbal intelligence does not require explicit rules.
It is a tremendous scientific insight that overturns decades of established thinking.
A common objection is that I am conflating engineering with science: large language models, the objection goes, are just engineering. I invite you to examine the history of science more closely. Scientific progress has always depended on the tools we build.
You need a seaworthy boat before you can sail to distant islands, observe wildlife, and formulate the theory of natural selection. Measuring the Earth’s radius with the precision achieved by the ancient Greeks required both sophisticated engineering and non-trivial mathematics. Einstein’s insights into relativity emerged in an era when people routinely experienced relative motion on trains; the phenomenon was staring everyone in the face.
The tidy, linear model of scientific progress—professors thinking deep thoughts in ivory towers, then handing blueprints to engineers—is indefensible. Fast ships and fast trains are not just consequences of scientific discovery; they are also wellsprings of it. Real progress is messy, iterative, and deeply intertwined with the tools we build. Large language models are the latest, most dramatic example of that truth.
So what does it tell us about science? I believe it is telling us that we need to rethink our entire approach to scientific research. We need to embrace agility, experimentation, and a willingness to challenge established paradigms. The bureaucratization of science was a death sentence for progress.