Blog of Daniel Lemire

Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.

Programmer time and the pitfalls of wasteful work

2025-01-30 11:02:36

Programmer time is precious. This realization should shape our approach to software development, focusing our efforts on tasks that genuinely contribute to the improvement of our code and the software ecosystem.

What does matter?

  1. Hunting for bugs. I like to add tests, and then even more tests. The time spent building tests should be proportionate to the time spent building the software. Fuzzing is also fantastically useful. I love using sanitizers.
  2. Fixing bugs. Bugs disrupt the user experience, compromise functionality, and can even introduce security vulnerabilities. Addressing bugs is critical to building trust in the software.
  3. Documentation matters. Underdocumented code is mysterious and may trigger unnecessary surprises. A lack of documentation may also harm relationships with users.
  4. Adding new features. Innovation and growth in software come from introducing new features. Features should be user visible: ‘internal’ features are often wasteful.
  5. Improving performance. Performance enhancement is all about making the software run faster, use fewer resources, or handle larger workloads more efficiently. This can significantly impact user satisfaction, particularly in applications where speed is paramount. Improving performance is not just about identifying bottlenecks… it is an ongoing journey. You need a good design and multiple rounds of optimization. You can often continue to improve the performance for years and years.
However, there are areas where I believe our time is not well spent:
  • Patching code to silence false positives from disabled-by-default static analyzers. The level 4 warnings under Visual Studio when compiling C++ code are a good example, but so are the obscure GCC and Clang warnings. Static analyzers are tools that can scan code for potential issues without executing the program. However, when these tools are overly strict or misconfigured, they might report numerous false positives: issues that aren’t actually problems. Spending time patching code merely to quiet these false alarms is, in my view, wasteful. It diverts attention from more impactful work. This is not to say that static analysis is not beneficial; when used correctly, it can save considerable time and resources. But the effort required to address non-issues can quickly become counterproductive.
  • Aimless refactoring is also often wasteful: renaming classes or moving code around just so that it looks ‘nice’. I am not against the occasional cleaning round… but it should not be time-consuming. Refactoring for its own sake may become an excuse for not fixing bugs or not improving performance. It is easy work, but often not impactful.
While we strive for perfection in our code, we must also be strategic about where we invest our most precious resource: programmer time. Let us prioritize what truly matters in the grand scheme of software development.

Regular expressions can blow up!

2025-01-25 11:53:36

Regular expressions, often abbreviated as regex, are a powerful tool for pattern matching within text. For example, the expression

\d*\.?\d+

would match a positive number such as 1.1 or 12. If designed and tested with care, regular expressions may be used in mission-critical software. However, their power comes with a risk: it is possible to design small regular expressions that are very expensive to run on even small strings.
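
For instance, a small C++ program (a sketch of mine, not from the original post) can check which strings this pattern matches:

#include <iostream>
#include <regex>

int main() {
    std::regex number(R"(\d*\.?\d+)");
    for (std::string s : {"1.1", "12", "abc"}) {
        std::cout << s << ": "
                  << (std::regex_match(s, number) ? "match" : "no match")
                  << std::endl;
    }
    return 0;
}

The first two strings match while “abc” does not.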

To make matters more complicated, there are several regular-expression engines, and they differ in their syntax and implementation. Let me consider the regular-expression engine used by the C++ language under Linux (libstdc++).

Consider the following program. It uses the string “Everyone loves Lucy.” and the regex pattern .*+s}}@w. I am not exactly sure what this pattern is supposed to do, but it is accepted by the engine. The program then uses std::regex_search to look for matches of this pattern within the string, storing potential matches in a std::smatch object, and it outputs whether a match was found.

#include <iostream>
#include <regex>

int main() {
    std::string text = "Everyone loves Lucy.";
    std::regex pattern(R"(.*+s}}@w)"); 
    // Perform regex search
    std::smatch match;
    bool found = std::regex_search(text, match, pattern);
    std::cout << "Regex search result: " 
          << (found ? "Match found" : "No match") << std::endl;
    return 0;
}

Using GCC 12 and a recent Linux server, this program takes about 7 minutes to run.

In other words, a bad regular expression can crash your systems. This is not just theoretical: Cloudflare suffered a major outage in 2019 due to a bad regular expression.

Use regular expressions with care.

Checking whether an ARM NEON register is zero

2025-01-20 09:05:51

Your phone probably runs on 64-bit ARM processors. These processors are ubiquitous: they power the Nintendo Switch, they power cloud servers at both Amazon AWS and Microsoft Azure, they power fast laptops, and so forth. ARM processors have special powerful instructions called ARM NEON. They provide a specific type of parallelism called Single instruction, multiple data (SIMD). For example, you can add sixteen values with sixteen other values using one instruction.

Using special functions called intrinsics, you can program a C function that adds the values from two arrays and writes the result into a third:

#include <arm_neon.h>

void vector_int_add(int32_t *a, int32_t *b, int32_t *result, int len) {
    int i = 0;

    // Loop processing 4 elements at a time until we can't anymore
    for (; i < len - 3; i += 4) {
        // Load 4 32-bit integers from 'a' into a NEON register
        int32x4_t vec_a = vld1q_s32(a + i);
        
        // Load 4 32-bit integers from 'b' into another NEON register
        int32x4_t vec_b = vld1q_s32(b + i);
        
        // Perform vector addition
        int32x4_t vec_res = vaddq_s32(vec_a, vec_b);
        
        // Store the result back to memory
        vst1q_s32(result + i, vec_res);
    }

    // Handle any remaining elements not divisible by 4
    for (; i < len; i++) {
        result[i] = a[i] + b[i];
    }
}

In this code, the type int32x4_t represents four 32-bit integers. You can similarly represent sixteen 8-bit integers as int8x16_t and so forth.

The intrinsic functions vld1q_s32 and vst1q_s32 load and store 16 bytes of data. The intrinsic function vaddq_s32 does the sum.
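
As a usage sketch (my own, assuming an aarch64 target), the function can be called on plain arrays; here four elements are handled by the NEON loop and the last two by the scalar loop:

#include <stdint.h>
#include <stdio.h>

int main() {
    int32_t a[6] = {1, 2, 3, 4, 5, 6};
    int32_t b[6] = {10, 20, 30, 40, 50, 60};
    int32_t result[6];
    vector_int_add(a, b, result, 6);
    for (int i = 0; i < 6; i++) {
        printf("%d ", result[i]); // prints 11 22 33 44 55 66
    }
    return 0;
}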

A simple but deceptively tricky task is to determine whether a register contains only zeros. This comes up regularly in algorithms. Unfortunately, unlike the Intel/AMD instruction sets (AVX2, AVX-512), ARM NEON has no instruction dedicated to this task.

There are many different valid approaches with ARM NEON but they all require several instructions. To keep things simple, I will assume that you are receiving sixteen 8-bit integers (uint8x16_t). In practice, you can reinterpret sixteen bytes as whatever you like (e.g., go from uint8x16_t to int32x4_t) without cost.
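
For example, going from sixteen bytes to four 32-bit integers is a pure type change (a small illustration of mine):

uint32x4_t as_words(uint8x16_t bytes) {
    // This reinterpretation generates no instructions at all.
    return vreinterpretq_u32_u8(bytes);
}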

My favorite approach is to use the fact that ARM NEON can compute the maximum or the minimum value in an SIMD register: the intrinsic vmaxvq_u32 (corresponding to the instruction umaxv) computes the maximum value across all elements of a vector and returns it as a scalar (not a vector). There are also other variants like vmaxvq_u8 depending on your data type, but vmaxvq_u32 tends to be the most performant approach. The code might look as follows:

int veq_non_zero_max(uint8x16_t v) {
  return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}

It compiles down to three essential instructions: umaxv, fmov and a comparison (cmp). We need the fmov instruction or an equivalent to move the data from the SIMD register to a scalar register. The overall code is not great: umaxv has at least three cycles of latency, and so does fmov.

There is a more complicated but potentially more useful approach based on a narrowing shift. The vshrn_n_u16 intrinsic (corresponding to the shrn instruction) shifts each of the eight 16-bit integers right by 4 bits while narrowing the result to 8 bits. The result is a 64-bit value which retains 4 bits from each byte of the original 16-byte register. We might check whether the register is zero like so:

int veq_non_zero_narrow(uint8x16_t v) {
  return vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(v), 4)), 0) != 0;
}

It compiles down to three instructions: shrn, fmov and a comparison. It is no faster.

The shrn approach does not check that the content is zero as it throws away 4 bits per byte. As an alternative, you can use the vqmovn_u64 intrinsic (corresponding to the uqxtn instruction). It takes a vector of two unsigned 64-bit integers and narrows them down to unsigned 32-bit integers. If any value in the source vector exceeds the maximum value representable by an unsigned 32-bit integer, it will be saturated to this maximum value. You might code it as follows:

int veq_non_zero_mov(uint8x16_t v) {
  return vget_lane_u64(vreinterpret_u64_u32(vqmovn_u64(vreinterpretq_u64_u8(v))), 0) != 0;
}

It is no faster.

There is a faster approach, pointed out to me by Robert Clausecker. Instead of moving the data from the SIMD register to a scalar register, we may use the fact that the SIMD registers also serve as floating-point registers. Thus we can leave the data in place. It applies to all techniques presented so far (vmaxvq_u32, vshrn_n_u16, vqmovn_u64). Here is the version with vshrn_n_u16:

int veq_non_zero_float(uint8x16_t v) {
  uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(v), 4);
  return (vdupd_lane_f64(vreinterpret_f64_u8(narrowed), 0) != 0.0);
}

This compiles to only two instructions: shrn and fcmp. It is thus much faster.

So why not use the floating-point approach?

  • We have two distinct values 0.0 and -0.0 that are considered equal.
  • The floating-point standard includes tiny values called subnormal values that may be considered as being equal to zero under some configurations.
  • We may generate a signaling NaN value which might cause a signal to be emitted.

If you are careful, you can avoid all of these problems, but it does not make it a good general-purpose solution.

Newer 64-bit ARM processors have another family of instruction sets: SVE, or Scalable Vector Extension. It is not directly interoperable with ARM NEON as far as I know… But it does have a dedicated compare instruction (cmpne), generated by the following code:

bool check_all_zeros(svbool_t mask, svint8_t vec) {
    svbool_t nonzero = svcmpne_n_s8(mask, vec, 0);
    return !svptest_any(mask, nonzero);
}

SVE code is more sophisticated: it requires a mask indicating which values are ‘active’. To set all values to active, you might generate an all-true mask, e.g., “svbool_t pg = svptrue_b8()”. However, SVE is somewhat more powerful: you can check that all active values are zeroes…
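
For instance, a full-vector check might be wrapped as follows (a minimal sketch of mine, assuming an SVE-capable compiler and target; the name vector_is_zero is hypothetical):

#include <arm_sve.h>

bool vector_is_zero(svint8_t vec) {
    svbool_t pg = svptrue_b8(); // mark every byte lane as active
    return check_all_zeros(pg, vec);
}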

Unfortunately, SVE is not yet widespread.

The ivory tower’s drift: how academia’s preference for theory over empiricism fuels scientific stagnation

2025-01-16 04:30:06

Almost all of academic science has moved away from actual (empirical) science. It is higher status to work on theories and models. I believe that this is closely related to the well-documented scientific stagnation, as theory is often ultimately sterile.

This tendency is quite natural in academia if there is no outside pressure… and it is the main reason why academia should be ruthlessly judged by practitioners and users. As soon as academia can isolate itself in a bubble, it is bound to degrade.

It is worth trying to understand some of the factors driving this degradation… Theoretical work can sometimes be seen as more complex. This complexity can be mistakenly equated with higher intelligence or prestige. Empirical work, while also complex, often deals with tangible, observable data, which might seem more straightforward to the uninitiated.

Empirical work is more likely to lead to nuanced or inconclusive results, while theory often seems more direct and definitive. Theoretical research also tends to require fewer resources than large-scale empirical studies, which might need extensive funding for equipment, data collection, and personnel. Thus, with models and theory, you get to do more research with less.

Theoretical work is often seen as requiring a high level of creativity to devise new frameworks or models. While empirical work also requires creativity in design, execution, and interpretation, the creativity in data collection or experimental design might be less recognized or appreciated.

The educational system often glorifies theoretical knowledge over practical skills until one reaches higher education or specialized training. E.g., we eagerly make calculus compulsory even if it has modest relevance in most practical fields. This educational bias can carry over into professional work.

We still have laboratories. We still collect data, but it is too often in service of theory rather than a primary activity.

Society must demand actual results. We must reject work that is said ‘to improve our understanding’ or ‘to lay a foundation for further work’. We must demand cheaper rockets, cures for cancer, software that is efficient. As long as academic researchers are left to their own devices, they will continue to fill the minds of the young with unnecessary models. They must be held accountable.

JavaScript hashing speed comparison: MD5 versus SHA-256

2025-01-12 00:21:18

Hashing algorithms convert input data into a fixed-size string of characters, known as a hash value or digest. These algorithms are one-way functions, meaning the original data cannot be feasibly retrieved from the hash, which makes them useful for data integrity, password storage, and digital signatures. MD5 and SHA-256 are two such hashing algorithms with significant differences in security, speed, and application. MD5 is older and often said to be faster due to its simpler structure, but it is now considered cryptographically broken. SHA-256 provides a higher level of security with its 256-bit hash output.

But is MD5 really faster? I decided to write a little JavaScript program to hash a large input using both MD5 and SHA-256. My JavaScript code defines two functions for hashing: md5Hash uses the MD5 algorithm to produce a 128-bit hash, while sha256Hash employs SHA-256 for a more secure 256-bit digest. A large 1 GB array, created by createRandomUint8Array, is then used to benchmark these hashing functions.

const crypto = require('node:crypto');

function md5Hash(message) {
    return crypto.createHash('md5').update(message).digest('hex');
}

function sha256Hash(message) {
    return crypto.createHash('sha256').update(message).digest('hex');
}

// createRandomUint8Array and bench come from the author's benchmarking
// harness (full code on GitHub).
const largeString = createRandomUint8Array(1000000000); // 1 GB of random bytes

bench('MD5 Hash', () => {
    md5Hash(largeString);
});

bench('SHA-256 Hash', () => {
    sha256Hash(largeString);
});

I use the latest version of the Bun runtime and Node.js 23 in my tests. I find that on ARM systems (Apple M2 or Amazon Graviton 4), Bun is slightly faster on the MD5 benchmark, but Node.js and Bun otherwise have identical speeds.

                                             MD5        SHA-256
Apple M2 system, Bun 1.31                    0.7 GB/s   2.6 GB/s
Apple M2 system, Node.js 23                  0.6 GB/s   2.6 GB/s
Intel Ice Lake Linux system, Bun 1.31        0.7 GB/s   1.2 GB/s
Intel Ice Lake Linux system, Node.js 23      0.7 GB/s   1.2 GB/s
Amazon Graviton 4 Linux system, Bun 1.31     0.6 GB/s   1.8 GB/s
Amazon Graviton 4 Linux system, Node.js 23   0.5 GB/s   1.8 GB/s

My results suggest that you should probably not be using MD5. MD5 is slower than SHA-256 and not as safe.

Even though SHA-256 looks more expensive on paper, modern processors typically have cryptographic extensions to accelerate it.

My code is available on GitHub.

Counting the digits of 64-bit integers

2025-01-08 05:41:57

Given an integer in software, you may want to know how many decimal digits it needs. For example, the integer 100 requires 3 digits, and the integer 9999 requires 4 digits.

It would be an easy problem if we could compute the logarithm in base 10 of an integer quickly. Unfortunately, our computers work in base 2 and it is faster to use a base 2 logarithm.

There are fast techniques to solve the digit-counting problem using few instructions. For 32-bit integers, a classical trick found in references such as Hacker’s Delight is as follows:

int digit_count(uint32_t x) {
    static uint32_t table[] = {9, 99, 999, 9999, 99999,
        999999, 9999999, 99999999, 999999999};
    int y = (9 * int_log2(x)) >> 5;
    y += x > table[y];
    return y + 1;
}

where the int_log2 function simply computes the integer logarithm in base 2. There are fast ways to compute this logarithm… if you use GCC or Clang in C or C++, the following might do:

int int_log2(uint64_t x) { return 63 - __builtin_clzll(x | 1); }

Observe that I compute the bitwise OR of the input with the integer 1: I want to avoid computing the logarithm of zero. If you have a C++20 capable compiler, you can implement int_log2 with std::bit_width(x | 1) - 1.
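
A C++20 version might look as follows (a small sketch of mine):

#include <bit>
#include <cstdint>

int int_log2(uint64_t x) {
    // std::bit_width(x) returns 1 + floor(log2(x)) for nonzero x.
    return std::bit_width(x | 1) - 1;
}

With int_log2 in hand, we can trace digit_count on x = 1000: int_log2(1000) is 9, so y = (9 × 9) >> 5 = 2; since 1000 > table[2] = 999, y becomes 3 and the function returns 4, the correct digit count. For x = 999, the comparison fails and the function returns 3.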

There is a faster function, which I credit to Kendall Willets:

int alternative_digit_count(uint32_t x) {
  static uint64_t table[] = {
      4294967296,  8589934582,  8589934582,  8589934582,  12884901788,
      12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
      21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
      25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
      34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
      38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
      42949672960, 42949672960};
  return (x + table[int_log2(x)]) >> 32;
}
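
To illustrate, take x = 1000 again: int_log2(1000) is 9 and table[9] is 17179868184, which equals 4 × 2^32 − 1000. The sum 1000 + table[9] is therefore exactly 4 × 2^32, and shifting right by 32 yields 4. With x = 999, the sum falls just short of 4 × 2^32 and the shift yields 3.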

What if you have 64-bit integers instead of 32-bit integers?

You can generalize the same function. First of all, we have the conventional fast function:

int digit_count(uint64_t x) {
  static uint64_t table[] = {9,
                             99,
                             999,
                             9999,
                             99999,
                             999999,
                             9999999,
                             99999999,
                             999999999,
                             9999999999,
                             99999999999,
                             999999999999,
                             9999999999999,
                             99999999999999,
                             999999999999999ULL,
                             9999999999999999ULL,
                             99999999999999999ULL,
                             999999999999999999ULL,
                             9999999999999999999ULL};
  int y = (19 * int_log2(x)) >> 6;
  y += x > table[y];
  return y + 1;
}

And then we can port the faster alternative to 64-bit integers, replacing a 64-bit lookup table with a 128-bit lookup table:

int alternative_digit_count(uint64_t x) {
  static uint64_t table[64][2] = {
    { 0x01, 0xfffffffffffffff6ULL },
    { 0x01, 0xfffffffffffffff6ULL },
    { 0x01, 0xfffffffffffffff6ULL },
    { 0x01, 0xfffffffffffffff6ULL },
    { 0x02, 0xffffffffffffff9cULL },
    { 0x02, 0xffffffffffffff9cULL },
    { 0x02, 0xffffffffffffff9cULL },
    { 0x03, 0xfffffffffffffc18ULL },
    { 0x03, 0xfffffffffffffc18ULL },
    { 0x03, 0xfffffffffffffc18ULL },
    { 0x04, 0xffffffffffffd8f0ULL },
    { 0x04, 0xffffffffffffd8f0ULL },
    { 0x04, 0xffffffffffffd8f0ULL },
    { 0x04, 0xffffffffffffd8f0ULL },
    { 0x05, 0xfffffffffffe7960ULL },
    { 0x05, 0xfffffffffffe7960ULL },
    { 0x05, 0xfffffffffffe7960ULL },
    { 0x06, 0xfffffffffff0bdc0ULL },
    { 0x06, 0xfffffffffff0bdc0ULL },
    { 0x06, 0xfffffffffff0bdc0ULL },
    { 0x07, 0xffffffffff676980ULL },
    { 0x07, 0xffffffffff676980ULL },
    { 0x07, 0xffffffffff676980ULL },
    { 0x07, 0xffffffffff676980ULL },
    { 0x08, 0xfffffffffa0a1f00ULL },
    { 0x08, 0xfffffffffa0a1f00ULL },
    { 0x08, 0xfffffffffa0a1f00ULL },
    { 0x09, 0xffffffffc4653600ULL },
    { 0x09, 0xffffffffc4653600ULL },
    { 0x09, 0xffffffffc4653600ULL },
    { 0x0a, 0xfffffffdabf41c00ULL },
    { 0x0a, 0xfffffffdabf41c00ULL },
    { 0x0a, 0xfffffffdabf41c00ULL },
    { 0x0a, 0xfffffffdabf41c00ULL },
    { 0x0b, 0xffffffe8b7891800ULL },
    { 0x0b, 0xffffffe8b7891800ULL },
    { 0x0b, 0xffffffe8b7891800ULL },
    { 0x0c, 0xffffff172b5af000ULL },
    { 0x0c, 0xffffff172b5af000ULL },
    { 0x0c, 0xffffff172b5af000ULL },
    { 0x0d, 0xfffff6e7b18d6000ULL },
    { 0x0d, 0xfffff6e7b18d6000ULL },
    { 0x0d, 0xfffff6e7b18d6000ULL },
    { 0x0d, 0xfffff6e7b18d6000ULL },
    { 0x0e, 0xffffa50cef85c000ULL },
    { 0x0e, 0xffffa50cef85c000ULL },
    { 0x0e, 0xffffa50cef85c000ULL },
    { 0x0f, 0xfffc72815b398000ULL },
    { 0x0f, 0xfffc72815b398000ULL },
    { 0x0f, 0xfffc72815b398000ULL },
    { 0x10, 0xffdc790d903f0000ULL },
    { 0x10, 0xffdc790d903f0000ULL },
    { 0x10, 0xffdc790d903f0000ULL },
    { 0x10, 0xffdc790d903f0000ULL },
    { 0x11, 0xfe9cba87a2760000ULL },
    { 0x11, 0xfe9cba87a2760000ULL },
    { 0x11, 0xfe9cba87a2760000ULL },
    { 0x12, 0xf21f494c589c0000ULL },
    { 0x12, 0xf21f494c589c0000ULL },
    { 0x12, 0xf21f494c589c0000ULL },
    { 0x13, 0x7538dcfb76180000ULL },
    { 0x13, 0x7538dcfb76180000ULL },
    { 0x13, 0x7538dcfb76180000ULL },
    { 0x13, 0x7538dcfb76180000ULL },
};
  int log = int_log2(x);
  uint64_t low = table[log][1];
  uint64_t high = table[log][0];
  // The comparison x + low < x detects a carry out of the 64-bit addition.
  return (x + low < x) + high;
}

Though my 64-bit code may look mysterious, it is doing the same thing as the original 32-bit function: it adds a looked-up value to the input and keeps the most significant bits.
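
For example, with x = 1000, int_log2(x) is 9 and table[9] is { 0x03, 0xfffffffffffffc18 }. The low word equals 2^64 − 1000, so the addition x + low wraps around (producing a carry) exactly when x ≥ 1000: for x = 1000 the function returns 3 + 1 = 4 digits, while for x = 999 there is no carry and it returns 3.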

I have included a newer, simpler 64-bit version based on work by Kendall Willets, this time assuming that the compiler supports C++20:

int alternative_digit_count_two_tables(uint64_t x) {
  static int digits[65] = {19, 19, 19, 19, 18, 18, 18, 
                           17, 17, 17, 16, 16, 16,
                           16, 15, 15, 15, 14, 14, 14, 
                           13, 13, 13, 13, 12, 12,
                           12, 11, 11, 11, 10, 10, 10, 
                           10, 9,  9,  9,  8,  8,
                           8,  7,  7,  7,  7,  6,  6,
                           6,  5,  5,  5,  4,  4,
                           4,  4,  3,  3,  3,  2,  2,
                           2,  1,  1,  1,  1,  1};
  static uint64_t table[65] = {18446744073709551615ULL,
                               18446744073709551615ULL,
                               18446744073709551615ULL,
                               18446744073709551615ULL,
                               999999999999999999ULL,
                               999999999999999999ULL,
                               999999999999999999ULL,
                               99999999999999999ULL,
                               99999999999999999ULL,
                               99999999999999999ULL,
                               9999999999999999ULL,
                               9999999999999999ULL,
                               9999999999999999ULL,
                               9999999999999999ULL,
                               999999999999999ULL,
                               999999999999999ULL,
                               999999999999999ULL,
                               99999999999999ULL,
                               99999999999999ULL,
                               99999999999999ULL,
                               9999999999999ULL,
                               9999999999999ULL,
                               9999999999999ULL,
                               9999999999999ULL,
                               999999999999ULL,
                               999999999999ULL,
                               999999999999ULL,
                               99999999999ULL,
                               99999999999ULL,
                               99999999999ULL,
                               9999999999ULL,
                               9999999999ULL,
                               9999999999ULL,
                               9999999999ULL,
                               999999999ULL,
                               999999999ULL,
                               999999999ULL,
                               99999999ULL,
                               99999999ULL,
                               99999999ULL,
                               9999999ULL,
                               9999999ULL,
                               9999999ULL,
                               9999999ULL,
                               999999ULL,
                               999999ULL,
                               999999ULL,
                               99999ULL,
                               99999ULL,
                               99999ULL,
                               9999ULL,
                               9999ULL,
                               9999ULL,
                               9999ULL,
                               999ULL,
                               999ULL,
                               999ULL,
                               99ULL,
                               99ULL,
                               99ULL,
                               9ULL,
                               9ULL,
                               9ULL,
                               9ULL,
                               0ULL};
  int log = std::countl_zero(x); // number of leading zero bits, from 0 to 64
  uint64_t low = table[log];
  uint64_t high = digits[log];
  return (x > low) + high;
}
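
Before benchmarking, you may want to check the fast routines against a naive reference implementation (a quick harness of mine, not part of the published benchmark):

#include <cstdint>
#include <cstdio>

// Reference implementation: repeatedly divide by 10.
int naive_digit_count(uint64_t x) {
    int n = 1;
    while (x >= 10) {
        x /= 10;
        n++;
    }
    return n;
}

int main() {
    for (uint64_t x : {9ULL, 10ULL, 999ULL, 1000ULL, 18446744073709551615ULL}) {
        printf("%llu has %d digits\n", (unsigned long long)x, naive_digit_count(x));
    }
    return 0;
}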

What we want to know is: which of these functions is fastest?

To test it out, I generate thousands of random integers and I just sum up their digit counts. I record the number of instructions per integer:

                               Apple M2/LLVM 16    Intel Ice Lake/GCC 12
conventional (64-bit)          16 instructions     20 instructions
alternative (64-bit)           15 instructions     19 instructions
simpler alternative (64-bit)   12 instructions     15 instructions
conventional (32-bit)          16 instructions     19 instructions
alternative (32-bit)           12 instructions     16 instructions

In the 32-bit case, the alternative approach saves you 3 or 4 instructions per integer. But in the 64-bit case, the saving appears to be just one instruction, which is not necessarily worth it. However, the simpler alternative in the 64-bit case uses significantly fewer instructions.

Next, let us report the number of CPU cycles per integer. I run the tests four times, and I get some variation on one of the tests (Apple platform, 32-bit alternative).

                               Apple M2/LLVM 16    Intel Ice Lake/GCC 12
conventional (64-bit)          2.2 cycles          5.1 cycles
alternative (64-bit)           4.3 cycles          6.6 cycles
simpler alternative (64-bit)   4.8 cycles          4.5 cycles
conventional (32-bit)          2.3 cycles          5.6 cycles
alternative (32-bit)           2.0 to 4.8 cycles   6.2 cycles

My timings suggest that the conventional (and simpler) approach may use more instructions but could still be preferable in practice. This is especially true in the 64-bit case, where the alternative approach is significantly more complicated and much slower in my tests. The simpler alternative in the 64-bit case is the fastest on the x64 system (Ice Lake) but slower on the ARM system.

You should run your own benchmark for your own use case. However, I believe that all these functions might be fast enough. Which one you pick might come down to personal preferences.

My code is available on GitHub.