2025-01-25 11:53:36
Regular expressions, often abbreviated as regex, are a powerful tool for pattern matching within text. For example, the expression
\d*\.?\d+
would match a positive number such as 1.1 or 12. If designed and tested with care, regular expressions may be used in mission-critical software. However, their power comes with a risk: it is possible to design small regular expressions that are very expensive to run on even small strings.
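To illustrate, here is a minimal C++ sketch of my own (the sample string is made up) that searches for such a number with std::regex_search:

#include <iostream>
#include <regex>

int main() {
  std::string text = "The temperature reached 12.5 degrees.";
  // \d*\.?\d+ : optional digits, an optional decimal point, then at least one digit
  std::regex number(R"(\d*\.?\d+)");
  std::smatch match;
  if (std::regex_search(text, match, number)) {
    std::cout << "Found number: " << match.str() << std::endl;
  }
  return 0;
}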
To make matters more complicated, there are several regular-expression engines, and they differ in their syntax and implementation. Let me consider the regular-expression engine used by the C++ language under Linux (libstdc++).
Consider the following program. It uses the string “Everyone loves Lucy.” and a regex pattern .*+s}}@w. I am not exactly sure what this pattern is supposed to do, but it is accepted by the engine. The program then uses std::regex_search to look for matches of this pattern within the string, storing potential matches in a std::smatch object, and outputs whether a match was found or not.
#include <iostream>
#include <regex>

int main() {
  std::string text = "Everyone loves Lucy.";
  std::regex pattern(R"(.*+s}}@w)");
  // Perform regex search
  std::smatch match;
  bool found = std::regex_search(text, match, pattern);
  std::cout << "Regex search result: " << (found ? "Match found" : "No match") << std::endl;
  return 0;
}
Using GCC 12 and a recent Linux server, this program takes about 7 minutes to run.
In other words, a bad regular expression can take down your systems. It is not just theoretical: the Cloudflare corporation suffered a major outage in 2019 due to a bad regular expression.
Use regular expressions with care.
2025-01-20 09:05:51
Your phone probably runs on 64-bit ARM processors. These processors are ubiquitous: they power the Nintendo Switch, they power cloud servers at both Amazon AWS and Microsoft Azure, they power fast laptops, and so forth. ARM processors have a special set of powerful instructions called ARM NEON. They provide a specific type of parallelism called single instruction, multiple data (SIMD). For example, you can add sixteen values to sixteen other values using one instruction.
Using special functions called intrinsics, you can program a C function that adds the values from two arrays and writes the result into a third:
#include <arm_neon.h>

void vector_int_add(int32_t *a, int32_t *b, int32_t *result, int len) {
  int i = 0;
  // Loop processing 4 elements at a time until we can't anymore
  for (; i < len - 3; i += 4) {
    // Load 4 32-bit integers from 'a' into a NEON register
    int32x4_t vec_a = vld1q_s32(a + i);
    // Load 4 32-bit integers from 'b' into another NEON register
    int32x4_t vec_b = vld1q_s32(b + i);
    // Perform vector addition
    int32x4_t vec_res = vaddq_s32(vec_a, vec_b);
    // Store the result back to memory
    vst1q_s32(result + i, vec_res);
  }
  // Handle any remaining elements not divisible by 4
  for (; i < len; i++) {
    result[i] = a[i] + b[i];
  }
}
In this code, the type int32x4_t represents four 32-bit integers. You can similarly represent sixteen 8-bit integers as int8x16_t and so forth.
The intrinsic functions vld1q_s32 and vst1q_s32 load and store 16 bytes of data. The intrinsic function vaddq_s32 does the sum.
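As a quick usage sketch of my own (the sample values are made up, and it assumes the vector_int_add function above is in scope), you might call it as follows:

#include <arm_neon.h>
#include <cstdio>

int main() {
  int32_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  int32_t b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
  int32_t result[8];
  // The elements are handled four at a time; here 8 is a multiple of 4,
  // so the scalar tail loop does nothing.
  vector_int_add(a, b, result, 8);
  for (int i = 0; i < 8; i++) {
    printf("%d ", result[i]); // expected: 11 22 33 44 55 66 77 88
  }
  printf("\n");
  return 0;
}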
A simple but deceptively tricky task is to determine whether a register contains only zeros. This comes up regularly in algorithms. Unfortunately, unlike the Intel/AMD instruction sets (AVX2, AVX-512), ARM NEON has no corresponding instruction.
There are many different valid approaches with ARM NEON but they all require several instructions. To keep things simple, I will assume that you are receiving sixteen 8-bit integers (uint8x16_t). In practice, you can reinterpret sixteen bytes as whatever you like (e.g., go from uint8x16_t to int32x4_t) without cost.
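Such a reinterpretation is just a cast at the intrinsics level; here is a small sketch of my own (the function name is arbitrary):

#include <arm_neon.h>

// View the same 128-bit register as four 32-bit integers instead of sixteen bytes:
// no instruction is emitted for the reinterpretation.
int32x4_t as_four_ints(uint8x16_t bytes) {
  return vreinterpretq_s32_u8(bytes);
}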
My favorite approach is to use the fact that ARM NEON can compute the maximum or the minimum value in an SIMD register: the intrinsic vmaxvq_u32 (corresponding to the instruction umaxv) computes the maximum value across all elements of a vector and returns it as a scalar (not a vector). There are also other variants like vmaxvq_u8 depending on your data type, but vmaxvq_u32 tends to be the most performant approach. The code might look as follows:
int veq_non_zero_max(uint8x16_t v) {
  return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}
It compiles down to three essential instructions: umaxv, fmov and a comparison (cmp). We need the fmov instruction or the equivalent to move the data from the SIMD register to a scalar register. The overall code is not great: umaxv has at least three cycles of latency and so does fmov.
There is a more complicated but potentially more useful approach based on a narrowing shift. The vshrn_n_u16 intrinsic (corresponding to the shrn instruction) shifts each of the eight 16-bit integers right by 4 bits and narrows the result to 8 bits. The result is a 64-bit value which contains 4 bits from each byte of the original 16-byte register. We might check whether the register is zero like so:
int veq_non_zero_narrow(uint8x16_t v) {
  return vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(v), 4)), 0) != 0;
}
It compiles down to three instructions: shrn, fmov and a comparison. It is no faster.
The shrn approach does not fully check that the content is zero, since it throws away 4 bits per byte. As an alternative, you can use the vqmovn_u64 intrinsic (corresponding to the uqxtn instruction). It takes a vector of two unsigned 64-bit integers and narrows them down to two unsigned 32-bit integers. If any value in the source vector exceeds the maximum value representable by an unsigned 32-bit integer, it is saturated to this maximum value. You might code it as follows:
int veq_non_zero_mov(uint8x16_t v) {
  return vget_lane_u64(vreinterpret_u64_u32(vqmovn_u64(vreinterpretq_u64_u8(v))), 0) != 0;
}
There is a faster approach, pointed out to me by Robert Clausecker. Instead of moving the data from the SIMD register to a scalar register, we may use the fact that the SIMD registers also serve as floating-point registers. Thus we can leave the data in place. It applies to all techniques presented so far (vmaxvq_u32, vshrn_n_u16, vqmovn_u64). Here is the version with vshrn_n_u16:
int veq_non_zero_float(uint8x16_t v) {
  uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(v), 4);
  return vdupd_lane_f64(vreinterpret_f64_u8(narrowed), 0) != 0.0;
}
This compiles to only two instructions: shrn and fcmp. It is thus much faster.
So why not use the floating-point approach? Reinterpreting arbitrary bits as a double has pitfalls: for instance, the bit pattern of negative zero (only the most significant bit set) compares equal to 0.0 even though the register is not all zeros, and NaN bit patterns may cause the comparison to raise floating-point exceptions. If you are careful, you can avoid all of these problems, but it does not make it a good general-purpose solution.
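To make the caveat concrete, here is a small test sketch of my own (not from the original discussion; it assumes veq_non_zero_max and veq_non_zero_float from above are in scope). If I have the lane arithmetic right, the second input narrows to the bit pattern of -0.0, so the floating-point version reports zero even though the register is not all zeros:

#include <arm_neon.h>
#include <cstdio>

int main() {
  uint8_t zeros[16] = {0};
  uint8_t tricky[16] = {0};
  tricky[15] = 0x08; // after the narrowing shift, only the top bit of the 64-bit value is set
  uint8_t plain[16] = {0};
  plain[0] = 0x10; // this bit survives the narrowing shift
  const uint8_t *inputs[3] = {zeros, tricky, plain};
  for (int i = 0; i < 3; i++) {
    uint8x16_t v = vld1q_u8(inputs[i]);
    printf("input %d: umaxv says %d, float says %d\n", i,
           veq_non_zero_max(v), veq_non_zero_float(v));
  }
  return 0;
}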
Newer 64-bit ARM processors have another family of instruction sets: SVE or Scalable Vector Extension. It is not directly interoperable with ARM NEON as far as I know… But it does have dedicated compare instructions (such as cmpne) used by the following code:
bool check_all_zeros(svbool_t mask, svint8_t vec) {
  // Flag the active elements that differ from zero.
  svbool_t nonzero = svcmpne_n_s8(mask, vec, 0);
  // The register is all zeros (over the active lanes) if no element differs from zero.
  return !svptest_any(mask, nonzero);
}
SVE code is more sophisticated: it requires a mask indicating which values are ‘active’. To set all values to active, you might need to generate a true mask: e.g., like so “svbool_t pg = svptrue_b8()”. However, it is somewhat more powerful: you can check that all active values are zeros…
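A complete call might look like the following sketch of my own (it assumes an SVE-enabled compiler and target, and the check_all_zeros function above):

#include <arm_sve.h>
#include <cstdio>

int main() {
  svbool_t pg = svptrue_b8();      // mark all lanes as active
  svint8_t zeros = svdup_n_s8(0);  // a register full of zeros
  svint8_t threes = svdup_n_s8(3); // a register full of threes
  printf("%d %d\n", (int)check_all_zeros(pg, zeros), (int)check_all_zeros(pg, threes)); // expected: 1 0
  return 0;
}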
Unfortunately, SVE is not yet widespread.
2025-01-16 04:30:06
Almost all of academic science has moved away from actual (empirical) science. It is higher status to work on theories and models. I believe that this is closely related to the well-documented scientific stagnation, as theory is often ultimately sterile.
This tendency is quite natural in academia if there is no outside pressure… and it is the main reason why academia should be ruthlessly judged by practitioners and users. As soon as academia can isolate itself in a bubble, it is bound to degrade.
It is worth trying to understand some of the factors driving this degradation… Theoretical work can sometimes be seen as more complex. This complexity can be mistakenly equated with higher intelligence or prestige. Empirical work, while also complex, often deals with tangible, observable data, which might seem more straightforward to the uninitiated.
Empirical work is more likely to lead to nuanced or inconclusive results, while theory often seems more direct and definitive. Theoretical research often requires fewer resources than large-scale empirical studies, which might need extensive funding for equipment, data collection, and personnel. Thus, using models and theory, you get to do more research with less.
Theoretical work is often seen as requiring a high level of creativity to devise new frameworks or models. While empirical work also requires creativity in design, execution, and interpretation, the creativity in data collection or experimental design might be less recognized or appreciated.
The educational system often glorifies theoretical knowledge over practical skills until one reaches higher education or specialized training. E.g., we eagerly make calculus compulsory even if it has modest relevance in most practical fields. This educational bias can carry over into professional work.
We still have laboratories. We still collect data, but it is too often in service of theory rather than as a primary activity.
Society must demand actual results. We must reject work that is said ‘to improve our understanding’ or ‘to lay a foundation for further work’. We must demand cheaper rockets, cures for cancer, software that is efficient. As long as academic researchers are left to their own devices, they will continue to fill the minds of the young with unnecessary models. They must be held accountable.
2025-01-12 00:21:18
Hashing algorithms convert input data into a fixed-size string of characters, known as a hash value or digest. These algorithms are one-way functions, meaning the original data cannot be feasibly retrieved from the hash, which makes them useful for data integrity, password storage, and digital signatures. MD5 and SHA-256 are two such hashing algorithms with significant differences in security, speed, and application. MD5 is older and often said to offer faster hashing due to its simpler structure. MD5 is now considered cryptographically broken. SHA-256 provides a higher level of security with its 256-bit hash output.
But is MD5 really faster? I decided to write a little JavaScript program to hash a large input using both MD5 and SHA-256. My JavaScript code defines two functions for hashing: md5Hash uses the MD5 algorithm to produce a 128-bit hash, while sha256Hash employs SHA-256 for a more secure 256-bit hash. A large 1 GB array, created by createRandomUint8Array, is then used to benchmark these hashing functions.
const crypto = require('node:crypto');

function md5Hash(message) {
  return crypto.createHash('md5').update(message).digest('hex');
}

function sha256Hash(message) {
  return crypto.createHash('sha256').update(message).digest('hex');
}

const largeString = createRandomUint8Array(1000000000); // 1 GB of random bytes

bench('MD5 Hash', () => {
  md5Hash(largeString);
});

bench('SHA-256 Hash', () => {
  sha256Hash(largeString);
});
I use the latest version of the Bun runtime and Node.js 23 in my tests. I find that on ARM systems (Apple M2 or Amazon Graviton 4), Bun is slightly faster on the MD5 benchmark, but Node.js and Bun otherwise have identical speeds.
System | Runtime | MD5 | SHA-256
Apple M2 system | Bun 1.31 | 0.7 GB/s | 2.6 GB/s
Apple M2 system | Node.js 23 | 0.6 GB/s | 2.6 GB/s
Intel Ice Lake Linux system | Bun 1.31 | 0.7 GB/s | 1.2 GB/s
Intel Ice Lake Linux system | Node.js 23 | 0.7 GB/s | 1.2 GB/s
Amazon Graviton 4 Linux system | Bun 1.31 | 0.6 GB/s | 1.8 GB/s
Amazon Graviton 4 Linux system | Node.js 23 | 0.5 GB/s | 1.8 GB/s
My results suggest that you should probably not be using MD5. MD5 is slower than SHA-256 and not as safe.
Even though SHA-256 looks more expensive on paper, modern processors typically have cryptographic extensions to accelerate it.
2025-01-08 05:41:57
Given an integer in software, you may want to know how many decimal digits it needs. For example, the integer 100 requires 3 digits while the integer 9999 requires 4 digits.
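The straightforward approach (a baseline sketch of my own, not one of the fast techniques below) divides by 10 repeatedly:

#include <cstdint>

// Naive digit count: repeatedly divide by 10.
int naive_digit_count(uint64_t x) {
  int digits = 1;
  while (x >= 10) {
    x /= 10;
    digits++;
  }
  return digits;
}

The loop and its data-dependent divisions make it comparatively slow, which is why branch-free alternatives are interesting.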
It would be an easy problem if we could compute the logarithm in base 10 of an integer quickly. Unfortunately, our computers work in base 2, and it is the base-2 logarithm that we can compute quickly.
There are fast techniques to solve the digit-counting problem using few instructions. For 32-bit integers, a classical trick found in references such as Hacker’s Delight is as follows:
int digit_count(uint32_t x) {
  static uint32_t table[] = {9,      99,      999,      9999,      99999,
                             999999, 9999999, 99999999, 999999999};
  int y = (9 * int_log2(x)) >> 5;
  y += x > table[y];
  return y + 1;
}
where the int_log2 function simply computes the integer logarithm in base 2. In base 2, there are fast functions to compute the logarithm… if you use GCC or Clang in C or C++, the following might do:
int int_log2(uint64_t x) { return 63 - __builtin_clzll(x | 1); }
Observe that I compute the bitwise OR of the input with the integer 1: I want to avoid computing the logarithm of zero. If you have a C++20 capable compiler, you can implement int_log2 as std::bit_width(x | 1) - 1.
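A C++20 sketch of the helper (note the minus one: std::bit_width returns one more than the integer logarithm):

#include <bit>
#include <cstdint>

int int_log2(uint64_t x) {
  // std::bit_width(x) is 1 + floor(log2(x)) for non-zero x; OR-ing with 1 avoids the zero case.
  return std::bit_width(x | 1) - 1;
}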
There is a faster function, which I credit to Kendall Willets:
int alternative_digit_count(uint32_t x) {
  static uint64_t table[] = {
      4294967296,  8589934582,  8589934582,  8589934582,  12884901788,
      12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
      21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
      25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
      34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
      38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
      42949672960, 42949672960};
  return (x + table[int_log2(x)]) >> 32;
}
What if you have 64-bit integers instead of 32-bit integers?
You can generalize the same function. First of all, we have the conventional fast function:
int digit_count(uint64_t x) {
  static uint64_t table[] = {
      9, 99, 999, 9999, 99999, 999999, 9999999, 99999999, 999999999,
      9999999999, 99999999999, 999999999999, 9999999999999, 99999999999999,
      999999999999999ULL, 9999999999999999ULL, 99999999999999999ULL,
      999999999999999999ULL, 9999999999999999999ULL};
  int y = (19 * int_log2(x)) >> 6;
  y += x > table[y];
  return y + 1;
}
And then we can port the faster alternative to 64-bit integers, replacing a 64-bit lookup table with a 128-bit lookup table:
int alternative_digit_count(uint64_t x) {
  static uint64_t table[64][2] = {
      { 0x01, 0xfffffffffffffff6ULL }, { 0x01, 0xfffffffffffffff6ULL },
      { 0x01, 0xfffffffffffffff6ULL }, { 0x01, 0xfffffffffffffff6ULL },
      { 0x02, 0xffffffffffffff9cULL }, { 0x02, 0xffffffffffffff9cULL },
      { 0x02, 0xffffffffffffff9cULL }, { 0x03, 0xfffffffffffffc18ULL },
      { 0x03, 0xfffffffffffffc18ULL }, { 0x03, 0xfffffffffffffc18ULL },
      { 0x04, 0xffffffffffffd8f0ULL }, { 0x04, 0xffffffffffffd8f0ULL },
      { 0x04, 0xffffffffffffd8f0ULL }, { 0x04, 0xffffffffffffd8f0ULL },
      { 0x05, 0xfffffffffffe7960ULL }, { 0x05, 0xfffffffffffe7960ULL },
      { 0x05, 0xfffffffffffe7960ULL }, { 0x06, 0xfffffffffff0bdc0ULL },
      { 0x06, 0xfffffffffff0bdc0ULL }, { 0x06, 0xfffffffffff0bdc0ULL },
      { 0x07, 0xffffffffff676980ULL }, { 0x07, 0xffffffffff676980ULL },
      { 0x07, 0xffffffffff676980ULL }, { 0x07, 0xffffffffff676980ULL },
      { 0x08, 0xfffffffffa0a1f00ULL }, { 0x08, 0xfffffffffa0a1f00ULL },
      { 0x08, 0xfffffffffa0a1f00ULL }, { 0x09, 0xffffffffc4653600ULL },
      { 0x09, 0xffffffffc4653600ULL }, { 0x09, 0xffffffffc4653600ULL },
      { 0x0a, 0xfffffffdabf41c00ULL }, { 0x0a, 0xfffffffdabf41c00ULL },
      { 0x0a, 0xfffffffdabf41c00ULL }, { 0x0a, 0xfffffffdabf41c00ULL },
      { 0x0b, 0xffffffe8b7891800ULL }, { 0x0b, 0xffffffe8b7891800ULL },
      { 0x0b, 0xffffffe8b7891800ULL }, { 0x0c, 0xffffff172b5af000ULL },
      { 0x0c, 0xffffff172b5af000ULL }, { 0x0c, 0xffffff172b5af000ULL },
      { 0x0d, 0xfffff6e7b18d6000ULL }, { 0x0d, 0xfffff6e7b18d6000ULL },
      { 0x0d, 0xfffff6e7b18d6000ULL }, { 0x0d, 0xfffff6e7b18d6000ULL },
      { 0x0e, 0xffffa50cef85c000ULL }, { 0x0e, 0xffffa50cef85c000ULL },
      { 0x0e, 0xffffa50cef85c000ULL }, { 0x0f, 0xfffc72815b398000ULL },
      { 0x0f, 0xfffc72815b398000ULL }, { 0x0f, 0xfffc72815b398000ULL },
      { 0x10, 0xffdc790d903f0000ULL }, { 0x10, 0xffdc790d903f0000ULL },
      { 0x10, 0xffdc790d903f0000ULL }, { 0x10, 0xffdc790d903f0000ULL },
      { 0x11, 0xfe9cba87a2760000ULL }, { 0x11, 0xfe9cba87a2760000ULL },
      { 0x11, 0xfe9cba87a2760000ULL }, { 0x12, 0xf21f494c589c0000ULL },
      { 0x12, 0xf21f494c589c0000ULL }, { 0x12, 0xf21f494c589c0000ULL },
      { 0x13, 0x7538dcfb76180000ULL }, { 0x13, 0x7538dcfb76180000ULL },
      { 0x13, 0x7538dcfb76180000ULL }, { 0x13, 0x7538dcfb76180000ULL },
  };
  int log = int_log2(x);
  uint64_t low = table[log][1];
  uint64_t high = table[log][0];
  return (x + low < x) + high;
}
Though my 64-bit code may look mysterious, it is doing the same thing as the original 32-bit function: it adds the input to the looked-up value and keeps the most significant bits.
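As a sanity check (a test sketch of my own, assuming the 64-bit digit_count and alternative_digit_count above are in scope), the two functions should agree around the powers of ten:

#include <cstdint>
#include <cstdio>

int main() {
  uint64_t power = 1;
  for (int d = 1; d <= 19; d++) {
    // check just below and at each power of ten
    printf("%llu -> %d %d\n", (unsigned long long)(power - 1),
           digit_count(power - 1), alternative_digit_count(power - 1));
    printf("%llu -> %d %d\n", (unsigned long long)power,
           digit_count(power), alternative_digit_count(power));
    power *= 10;
  }
  return 0;
}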
I have included a new simpler 64-bit version based on work by Kendall Willets, this time assuming that the compiler supports C++20:
int alternative_digit_count_two_tables(uint64_t x) {
  static int digits[65] = {
      19, 19, 19, 19, 18, 18, 18, 17, 17, 17, 16, 16, 16, 16,
      15, 15, 15, 14, 14, 14, 13, 13, 13, 13, 12, 12, 12,
      11, 11, 11, 10, 10, 10, 10, 9, 9, 9, 8, 8, 8,
      7, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4,
      3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1};
  static uint64_t table[65] = {
      9999999999999999999ULL, 9999999999999999999ULL, 9999999999999999999ULL, 9999999999999999999ULL,
      999999999999999999ULL, 999999999999999999ULL, 999999999999999999ULL,
      99999999999999999ULL, 99999999999999999ULL, 99999999999999999ULL,
      9999999999999999ULL, 9999999999999999ULL, 9999999999999999ULL, 9999999999999999ULL,
      999999999999999ULL, 999999999999999ULL, 999999999999999ULL,
      99999999999999ULL, 99999999999999ULL, 99999999999999ULL,
      9999999999999ULL, 9999999999999ULL, 9999999999999ULL, 9999999999999ULL,
      999999999999ULL, 999999999999ULL, 999999999999ULL,
      99999999999ULL, 99999999999ULL, 99999999999ULL,
      9999999999ULL, 9999999999ULL, 9999999999ULL, 9999999999ULL,
      999999999ULL, 999999999ULL, 999999999ULL,
      99999999ULL, 99999999ULL, 99999999ULL,
      9999999ULL, 9999999ULL, 9999999ULL, 9999999ULL,
      999999ULL, 999999ULL, 999999ULL,
      99999ULL, 99999ULL, 99999ULL,
      9999ULL, 9999ULL, 9999ULL, 9999ULL,
      999ULL, 999ULL, 999ULL,
      99ULL, 99ULL, 99ULL,
      9ULL, 9ULL, 9ULL, 9ULL,
      0ULL};
  int log = std::countl_zero(x);
  uint64_t low = table[log];
  uint64_t high = digits[log];
  return (x > low) + high;
}
What we want to know is: which of these functions is fastest?
To test it out, I generate thousands of random integers and I just sum up their digit counts. I record the number of instructions per integer:
function | Apple M2/LLVM 16 | Intel Ice Lake/GCC 12
conventional (64-bit) | 16 instructions | 20 instructions
alternative (64-bit) | 15 instructions | 19 instructions
simpler alternative (64-bit) | 12 instructions | 15 instructions
conventional (32-bit) | 16 instructions | 19 instructions
alternative (32-bit) | 12 instructions | 16 instructions
In the 32-bit case, the alternative approach saves you 3 or 4 instructions per integer. But in the 64-bit case, the saving appears to be just one instruction which is not necessarily worth it. However, the simpler alternative in the 64-bit case uses significantly fewer instructions.
Next, let us report the number of CPU cycles per integer. I run the tests four times, and I get some variation on one of the tests (Apple platform, 32-bit alternative).
function | Apple M2/LLVM 16 | Intel Ice Lake/GCC 12
conventional (64-bit) | 2.2 cycles | 5.1 cycles
alternative (64-bit) | 4.3 cycles | 6.6 cycles
simpler alternative (64-bit) | 4.8 cycles | 4.5 cycles
conventional (32-bit) | 2.3 cycles | 5.6 cycles
alternative (32-bit) | 2.0 to 4.8 cycles | 6.2 cycles
My timings suggest that the conventional (and simpler) approach may use more instructions but could still be preferable in practice. It is especially true in the 64-bit case where the alternative approach is significantly more complicated, and much slower in my tests. The simpler alternative in the 64-bit case is the fastest on the x64 system (Ice Lake) but slower on the ARM system.
You should run your own benchmark for your own use case. However, I believe that all these functions might be fast enough. Which one you pick might come down to personal preferences.
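If you would like to run such a benchmark, a minimal timing sketch of my own (using std::chrono rather than hardware performance counters, and assuming the 64-bit digit_count above is in scope) could look like this:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  std::mt19937_64 gen(42);
  std::vector<uint64_t> values(100000);
  for (auto &v : values) {
    v = gen() | 1; // avoid zero
  }
  auto start = std::chrono::steady_clock::now();
  uint64_t checksum = 0;
  for (uint64_t v : values) {
    checksum += digit_count(v);
  }
  auto end = std::chrono::steady_clock::now();
  double ns = std::chrono::duration<double, std::nano>(end - start).count();
  printf("checksum %llu, %.2f ns per value\n", (unsigned long long)checksum,
         ns / values.size());
  return 0;
}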