2026-02-02 23:52:27

Given any string of bytes, you can convert it to a hexadecimal string by mapping the most significant and the least significant 4 bits of each byte to characters in 0123456789abcdef. There are more space-efficient encodings, such as base64, which maps 3 bytes to 4 characters. However, hexadecimal output is easier to read and often sufficiently concise.
A simple function to do the conversion using a short lookup table is as follows:
static const char hex[] = "0123456789abcdef";
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = hex[val >> 4];
    dst[k + 1] = hex[val & 15];
}
This code snippet implements a straightforward byte-to-hexadecimal string conversion loop in C++. It iterates over an input byte array (src), processing one byte at a time using index i, while simultaneously building the output string in dst with index k that advances twice as fast (by 2) since each input byte produces two hexadecimal characters. For each byte, it extracts the value as an unsigned 8-bit integer (val), then isolates the high 4 bits (via right shift by 4) and low 4 bits (via bitwise AND with 15) to index into a static lookup table (hex) containing the characters ‘0’ through ‘9’ and ‘a’ through ‘f’. The loop continues until k reaches the expected output length (dlen), which should be twice the input length, ensuring all bytes are converted without bounds errors.
This lookup table approach is used in the popular Node.js JavaScript runtime. Skovoroda recently proposed to replace this lookup table approach with an arithmetic version.
char nibble(uint8_t x) { return x + '0' + ((x > 9) * 39); } // 39 == 'a' - '0' - 10, so 10..15 map to 'a'..'f'
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = nibble(val >> 4);
    dst[k + 1] = nibble(val & 15);
}
Perhaps surprisingly, this approach is much faster and uses far fewer instructions. At first glance, this result might be puzzling: a table lookup is cheap, and the new nibble function seemingly does more work.
The trick that Skovoroda relies upon is that compilers are smart: they will ‘autovectorize’ such number-crunching functions (if you are lucky). That is, instead of using regular instructions that process byte values one at a time, they will use SIMD instructions that process 16 bytes or more at once.
Of course, instead of relying on the compiler, you can manually invoke SIMD instructions through SIMD intrinsic functions. Let us assume that you have an ARM processor (e.g., Apple Silicon). Then you can process blocks of 32 bytes as follows.
uint8x16_t table = vld1q_u8((const uint8_t *)"0123456789abcdef"); // nibble-to-character table
size_t i = 0;
size_t maxv = slen - (slen % 32);
for (; i < maxv; i += 32) {
    uint8x16_t val1 = vld1q_u8((const uint8_t *)src + i);
    uint8x16_t val2 = vld1q_u8((const uint8_t *)src + i + 16);
    uint8x16_t high1 = vshrq_n_u8(val1, 4);
    uint8x16_t low1 = vandq_u8(val1, vdupq_n_u8(15));
    uint8x16_t high2 = vshrq_n_u8(val2, 4);
    uint8x16_t low2 = vandq_u8(val2, vdupq_n_u8(15));
    uint8x16_t high_chars1 = vqtbl1q_u8(table, high1);
    uint8x16_t low_chars1 = vqtbl1q_u8(table, low1);
    uint8x16_t high_chars2 = vqtbl1q_u8(table, high2);
    uint8x16_t low_chars2 = vqtbl1q_u8(table, low2);
    uint8x16x2_t zipped1 = {high_chars1, low_chars1};
    uint8x16x2_t zipped2 = {high_chars2, low_chars2};
    vst2q_u8((uint8_t *)dst + i * 2, zipped1);      // interleaved store: high, low, high, low, ...
    vst2q_u8((uint8_t *)dst + i * 2 + 32, zipped2);
}
This SIMD code leverages ARM NEON intrinsics to accelerate hexadecimal encoding by processing 32 input bytes per iteration. It begins by loading two 16-byte vectors (val1 and val2) from the source array using vld1q_u8. For each vector, it extracts the high nibbles (via a right shift by 4 with vshrq_n_u8) and the low nibbles (via a bitwise AND with 15 using vandq_u8 and vdupq_n_u8). The nibbles are then used as indices into the pre-loaded hex table via vqtbl1q_u8 to fetch the corresponding ASCII characters. Finally, the high and low character vectors are written out with the interleaving store vst2q_u8, which alternates bytes from the two vectors so that each input byte yields its high character followed by its low character, producing 64 output bytes per iteration.
You could do similar work on other systems like x64. The same code with AVX-512 for recent Intel and AMD processors would probably be insanely efficient.
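To give a flavor of what the x64 version might look like, here is a minimal sketch using SSE (SSSE3) intrinsics that processes 16 input bytes per iteration. It mirrors the NEON code: _mm_shuffle_epi8 plays the role of vqtbl1q_u8, and a pair of byte unpacks replaces the interleaving store. The function name (hex_encode_sse) is mine, this sketch is not part of the benchmark below, and an AVX-512 version would follow the same pattern with wider registers.
#include <immintrin.h> // SSE2/SSSE3 intrinsics
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch: encode slen bytes from src as 2*slen hex characters in dst.
void hex_encode_sse(const uint8_t *src, size_t slen, char *dst) {
    const __m128i table = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
                                        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');
    const __m128i mask = _mm_set1_epi8(0x0f);
    size_t i = 0;
    for (; i + 16 <= slen; i += 16) {
        __m128i val = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i high = _mm_and_si128(_mm_srli_epi16(val, 4), mask); // high nibbles
        __m128i low = _mm_and_si128(val, mask);                     // low nibbles
        __m128i high_chars = _mm_shuffle_epi8(table, high);
        __m128i low_chars = _mm_shuffle_epi8(table, low);
        // interleave: high character then low character for each input byte
        _mm_storeu_si128((__m128i *)(dst + 2 * i), _mm_unpacklo_epi8(high_chars, low_chars));
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 16), _mm_unpackhi_epi8(high_chars, low_chars));
    }
    for (; i < slen; i++) { // scalar tail for the remaining bytes
        dst[2 * i + 0] = "0123456789abcdef"[src[i] >> 4];
        dst[2 * i + 1] = "0123456789abcdef"[src[i] & 15];
    }
}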
Benchmarking the table, arithmetic, and NEON implementations on a dataset of 10,000 random bytes reveals significant performance differences. The basic lookup-table version achieves around 3 GB/s, while the arithmetic version, benefiting from compiler autovectorization, reaches 23 GB/s. The manual NEON version pushes performance further: I reach 42 GB/s in my tests.
| method | speed | instructions per byte |
|---|---|---|
| table | 3.1 GB/s | 9 |
| Skovoroda | 23 GB/s | 0.75 |
| intrinsics | 42 GB/s | 0.69 |
One lesson is that intuition can be a poor guide when trying to assess performance.
2026-02-01 23:23:25

When serializing data to JSON or CSV, or when logging, we convert numbers to strings. Floating-point numbers are stored in binary, but we need them as decimal strings. The first formally published algorithm is Steele and White’s Dragon family (specifically Dragon4) from 1990. Since then, faster methods have emerged: Grisu3, Ryū, Schubfach, Grisu-Exact, and Dragonbox. In C++17, we have a standard function called std::to_chars for this purpose. A common objective is to generate the shortest string that still uniquely identifies the original number.
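For reference, here is a minimal C++17 example (my own illustration) showing the standard function at work; by default, std::to_chars emits the shortest decimal string that parses back to the original double.
#include <charconv>
#include <cstdio>

int main() {
    double x = 3.1415927;
    char buf[64];
    // shortest round-trip formatting: reading the output back yields x exactly
    auto result = std::to_chars(buf, buf + sizeof(buf), x);
    *result.ptr = '\0'; // to_chars does not null-terminate
    std::printf("%s\n", buf); // prints 3.1415927
    return 0;
}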
We recently published Converting Binary Floating-Point Numbers to Shortest Decimal Strings. We examine the full conversion, from the floating-point number to the string. In practice, the conversion involves two steps: we take the number and compute the decimal significand and the power of 10 (step 1), and then we generate the string (step 2). E.g., for the number pi, you might compute 31415927 and -7 (step 1) before generating the string 3.1415927 (step 2). The string generation requires placing the decimal point at the right location and switching to exponential notation when needed. The generation of the string is relatively cheap and was probably a negligible cost for older schemes, but as the conversion routines got faster, it has become a more significant component (taking 20% to 35% of the time).
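To make step 2 concrete, here is a simplified sketch (my own illustration, not code from the paper) that places the decimal point given the decimal digits and the power of ten; real implementations also handle the sign and decide when to switch to exponential notation.
#include <string>

// Hypothetical helper: format digits * 10^exp10 as a fixed-notation string.
std::string place_decimal_point(const std::string &digits, int exp10) {
    int point = (int)digits.size() + exp10; // number of digits before the decimal point
    if (point <= 0) {
        // e.g., digits = "11", exp10 = -5  ->  0.00011
        return "0." + std::string(size_t(-point), '0') + digits;
    }
    if (point >= (int)digits.size()) {
        // e.g., digits = "11", exp10 = 1  ->  110
        return digits + std::string(size_t(point) - digits.size(), '0');
    }
    // e.g., digits = "31415927", exp10 = -7  ->  3.1415927
    return digits.substr(0, point) + "." + digits.substr(point);
}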
The results vary quite a bit depending on the numbers being converted. But we find that two implementations tend to do best: Dragonbox by Jeon and Schubfach by Giulietti. The Ryū implementation by Adams is close behind or just as fast. All of these techniques are about 10 times faster than the original Dragon4 from 1990. A tenfold gain over three decades is equivalent to a gain of about 8% per year (1.08 to the power of 30 is roughly 10), entirely due to better algorithms and implementations.
Efficient algorithms use between 200 and 350 instructions for each string generated. We find that the standard function std::to_chars under Linux uses more instructions than necessary (up to nearly twice as many), so there is room to improve common implementations. The popular C++ library fmt is slightly less efficient still.
A fun fact: we found that none of the available functions generate the shortest possible string. The std::to_chars C++ function renders the number 0.00011 as 0.00011 (7 characters), while the shorter scientific form 1.1e-4 would do. But, by convention, when switching to scientific notation, implementations pad the exponent to two digits (so 1.1e-04). Beyond this technicality, we found that no implementation always generates the shortest string.
All our code, datasets, and raw results are open source. The benchmarking suite is at https://github.com/fastfloat/float_serialization_benchmark and the test data at https://github.com/fastfloat/float-data.
Reference: Converting Binary Floating-Point Numbers to Shortest Decimal Strings: An Experimental Review, Software: Practice and Experience (to appear)
2026-01-26 07:19:12

One of the first steps we take when we want to optimize software is to look at profiling data. Software profilers are tools that try to identify where your software spends its time. Though the exact approach can vary, a typical profiler samples your software (stops it at regular intervals) and collects statistics. If your software is routinely stopped in a given function, this function is likely using a lot of time. In turn, it might be where you should put your optimization efforts.
Matteo Collina recently shared with me his work on feeding profiler data to an AI for JavaScript software optimization. Essentially, Matteo takes the profiling data and prepares it in a way that an AI can comprehend. The insight is simple but intriguing: tell an AI how it can capture profiling data and then let it optimize your code, possibly by repeatedly profiling the code. The idea is not entirely original, since AI tools will, on their own, figure out that they can get profiling data.
How well does it work? I had to try it.
For the simdutf software library, we use an amalgamation script: it collects all of the C++ files on disk, does some shallow parsing and glues them together according to some rules.
I first asked the AI to optimize the script without access to profiling data. What it did immediately was to add a file cache: the script repeatedly loads the same files from disk (it is a bit complex). This saved about 20% of the running time.
Specifically, the AI replaced this naive code…
def read_file(file):
    with open(file, 'r') as f:
        for line in f:
            yield line.rstrip()
by this version with caching…
def read_file(file):
    if file in file_cache:
        for line in file_cache[file]:
            yield line
    else:
        lines = []
        with open(file, 'r') as f:
            for line in f:
                line = line.rstrip()
                lines.append(line)
                yield line
        file_cache[file] = lines
Could the AI do better with profiling data? I instructed it to run the Python profiler: python -m cProfile -s cumtime myprogram.py. It found two additional optimizations:
1. It precompiled the regular expressions (re.compile). It replaced
if re.match('.*generic/.*.h', file):
    # ...
by
if generic_pattern.match(file):
    # ...
where elsewhere in the code, we have…
generic_pattern = re.compile(r'.*generic/.*\.h')
2. Instead of repeatedly calling re.sub to do a regular expression substitution, it filtered the strings by checking for the presence of a keyword in the string first.
if 'SIMDUTF_IMPLEMENTATION' in line: # This IF is the optimization
    print(uses_simdutf_implementation.sub(context.current_implementation+"\\1", line), file=fid)
else:
    print(line, file=fid) # Fast path
These two optimizations could probably have been arrived at by looking at the code directly, and I cannot be certain that they were driven by the profiling data. But I can tell that they do appear in the profile data.
Unfortunately, the low-hanging fruit, caching the file access, represented the bulk of the gain. The AI was not able to further optimize the code. So the profiling data did not help much.
When I design online courses, I often use a lot of links. These links break over time. So I have a simple Python script that goes through all the links, and verifies them.
I first asked my AI to optimize the code. It did the same regex trick, compiling the regular expression. It also created a thread pool so that the URL checks run concurrently.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    url_results = {url: executor.submit(check_url, url) for url in urls_to_check}
    for url, future in url_results.items():
        url_cache[url] = future.result()
This parallelization more than doubled the speed of the script.
It cached the URL checks in an interesting way, using functools:
from functools import lru_cache

@lru_cache(maxsize=None)
def check(link):
    # ...
I did not know about this nice trick. It proved useless in my context, however, because I rarely check the same link more than once.
I then started again, and told it to use the profiler. It did much the same thing, except for the optimization of the regular expression.
As far as I can tell all optimizations were in vain, except for the multithreading. And it could do this part without the profiling data.
The Python scripts I tried were not heavily optimized, as their performance was not critical. They are relatively simple.
For the amalgamation, I got a 20% performance gain for ‘free’ thanks to the file caching. The link checker is going to be faster now that it is multithreaded. Both optimizations are valid and useful, and will make my life marginally better.
In neither case was I able to discern benefits due to the profiler data. I was initially hoping to get the AI busy optimizing the code in a loop, continuously running the profiler, but it did not happen in these simple cases. The AI optimized code segments that contributed little to the running time according to the profiler data.
To be fair, profiling data is often of limited use. The real problems are often architectural and not related to narrow bottlenecks. Even when there are identifiable bottlenecks, a single profiling run can fail to surface them clearly. Further, profilers become more useful as the code base grows, while my test cases are tiny.
Overall, I expect that the main reason for my relative failure is that I did not have the right use cases. I think that collecting profiling data and asking an AI to have a look might be a reasonable first step at this point.
2026-01-18 07:44:38

Irrespective of your programming language of choice, calling C functions is often a necessity. For the longest time, the only standard way to call C from Java was the Java Native Interface (JNI). But it was so painful that few dared to use it. I have heard it said that it was deliberately painful so that people would be enticed to use pure Java as much as possible.
Since Java 22, there is a new approach called the Foreign Function & Memory API in java.lang.foreign. Let me go through it step by step.
You need a Linker and a SymbolLookup instance from which you will build a MethodHandle that will capture the native function you want to call.
The linker is easy:
Linker linker = Linker.nativeLinker();
To load the SymbolLookup instance for your library (called mylibrary), you may do so as follows:
System.loadLibrary("mylibrary");
SymbolLookup lookup = SymbolLookup.loaderLookup();
The native library file should be on your java.library.path, or somewhere on the default library search path. (You can pass it to your java executable as -Djava.library.path=something.)
Alternatively, you can use SymbolLookup.libraryLookup or other means of loading the library, but System.loadLibrary should work well enough.
Once you have the lookup, you can grab the address of a function like so:
lookup.find("myfunction")
This returns an Optional<MemorySegment>. You can grab the MemorySegment like so:
MemorySegment mem = lookup.find("myfunction").orElseThrow();
Once you have your MemorySegment, you can pass it to your linker to get a MethodHandle which is close to a callable function:
MethodHandle myfunc = linker.downcallHandle(
    mem,
    functiondescr
);
The functiondescr must describe the return value and the parameters that your function takes. If you pass a pointer and get back a long value, you might proceed as follows:
MethodHandle myfunc = linker.downcallHandle(
    mem,
    FunctionDescriptor.of(
        ValueLayout.JAVA_LONG,
        ValueLayout.ADDRESS
    )
);
That is, the first parameter is the returned value.
For a function returning nothing, you use FunctionDescriptor.ofVoid.
The MethodHandle can be called almost like a normal Java function: myfunc.invokeExact(parameters). As far as the compiler is concerned, the result is an Object, so a cast is generally necessary (e.g., (long) myfunc.invokeExact(parameters) for a function returning a long).
It is a bit painful, but thankfully, there is a tool called jextract that can automate this task. It generates Java bindings from native library headers.
You can allocate C data structures from Java that you can pass to your native code by using an Arena. Let us say that you want to create an instance like
MemoryLayout mystruct = MemoryLayout.structLayout(
    ValueLayout.JAVA_LONG.withName("age"),
    ValueLayout.JAVA_INT.withName("friends"));
You could do it in this manner:
MemorySegment myseg = arena.allocate(mystruct);
You can then pass myseg as a pointer to a data structure in C.
You often get an arena with a try clause like so:
try (Arena arena = Arena.ofConfined()) {
    // ...
}
There are several types of arenas: confined, global, automatic, and shared. A confined arena is accessible from a single thread. A shared or global arena is accessible from several threads. The automatic arena is managed by the Java garbage collector and the global arena lives for the duration of the program, whereas the confined and shared arenas are closed explicitly, giving them a deterministic lifetime.
So, it is fairly complicated but manageable. Is it fast? To find out, I call from Java a C library I wrote with support for binary fuse filters. They are a fast alternative to Bloom filters.
You don’t need to know what any of this means, however. Keep in mind that I wrote a Java library called jfusebin which calls a C library. Then I also have a pure Java implementation and I can compare the speed.
I should first point out that even if calling the C function did not include any overhead, it might still be slower because the Java compiler is unlikely to inline a native function. However, if you have a pure Java function, and it is relatively small, it can get inlined and you get all sorts of nice optimizations like constant folding and so forth.
Thus I can overestimate the cost of the overhead. But that’s ok. I just want a ballpark measure.
In my benchmark, I check for the presence of a key in a set. I have one million keys in the filter. I can ask whether a key is not present in the filter.
I find that the library calling C can issue 44 million calls per second using the 8-bit binary fuse filter. I reach about 400 million calls per second using the pure Java implementation.
| method | time per query in nanoseconds |
|---|---|
| Java-to-C | 22.7 ns |
| Pure Java | 2.5 ns |
Thus I measure an overhead of about 20 ns per C function call from Java on a MacBook (M4 processor).
We can do slightly better by marking functions that are expected to be short-running as critical. You achieve this by passing an option to the linker.downcallHandle call.
binary_fuse8_contain = linker.downcallHandle(
    lookup.find("xfuse_binary_fuse8_contain").orElseThrow(),
    binary_fuse8_contain_desc,
    Linker.Option.critical(false)
);
You save about 15% of the running time in my case.
| method | time per query in nanoseconds |
|---|---|
| Java-to-C | 22.7 ns |
| Java-to-C (critical) | 19.5 ns |
| Pure Java | 2.5 ns |
Obviously, in my case, because the pure Java library is so fast, the 20 ns of overhead is too much. But it is otherwise a reasonable overhead.
I did not compare with the old approach (JNI), but other folks did and they find that the new foreign function approach can be measurably faster (e.g., 50% faster). In particular, it has been reported that calling a Java function from C is now relatively fast: I have not tested this functionality myself.
One of the cool features of the new interface is that you can pass data from the Java heap directly to your C function with relative ease.
Suppose you have the following C function:
int sum_array(int* data, int count) {
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += data[i];
    }
    return sum;
}
And you want the following Java array to be passed to C without a copy:
int[] javaArray = {10, 20, 30, 40, 50};
It is as simple as the following code.
System.loadLibrary("sum");
Linker linker = Linker.nativeLinker();
SymbolLookup lookup = SymbolLookup.loaderLookup();
MemorySegment sumAddress = lookup.find("sum_array").orElseThrow();
// C Signature: int sum_array(int* data, int count)
MethodHandle sumArray = linker.downcallHandle(
    sumAddress,
    FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
    Linker.Option.critical(true)
);
int[] javaArray = {10, 20, 30, 40, 50};
try (Arena arena = Arena.ofConfined()) {
    MemorySegment heapSegment = MemorySegment.ofArray(javaArray);
    int result = (int) sumArray.invoke(heapSegment, javaArray.length);
    System.out.println("The sum from C is: " + result);
}
I created a complete example in a few minutes. One trick is to make sure that java finds the native library. If it is not at a standard library path, you can specify the location with -Djava.library.path like so:
java -Djava.library.path=target -cp target/classes IntArrayExample
Further reading: When Does Java’s Foreign Function & Memory API Actually Make Sense? by A N M Bazlur Rahman.
2026-01-14 22:52:39

Sometimes, people tell me that there is no more progress in CPU performance.
Consider these three processors which had comparable prices at release time.
Let us consider their results on the PartialTweets open benchmark (JSON parsing). It is a single-core benchmark.
| processor | PartialTweets speed |
|---|---|
| 2024 processor | 12.7 GB/s |
| 2023 processor | 9 GB/s |
| 2022 processor | 5.2 GB/s |
In two years, on this benchmark, AMD more than doubled the performance for the same cost.
So what is happening is that processor performance is indeed going up, sometimes dramatically so, but not all of our software can benefit from the improvements. It is up to us to track the trends and adapt our software accordingly.
2026-01-08 08:39:36

When I was younger, in my 20s, I assumed that everyone was working “hard,” meaning a solid 35 hours of work a week. Especially, say, university professors and professional engineers. I’d feel terribly guilty when I would be messing around, playing video games on a workday.
Today I realize that most people become very adept at avoiding actual work. And the people you think are working really hard are often just very good at focusing on what is externally visible. They show up to the right meetings but unashamedly avoid the hard work.
It ends up being visible to the people “who know.” Why? Because working hard is how you acquire actual expertise. And lack of actual expertise ends up being visible… but only to those who have the relevant expertise.
And the effect compounds. The difference between someone who has honed their skills for 20 years and someone who has merely showed up to the right meetings becomes enormous. And so, we end up with huge competency gaps between people who are in their 30s, 40s, 50s. It becomes night and day.