The following equation, known as the BBP formula [1], will let you compute the nth digit of π directly without having to compute the previous digits.
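\[
\pi = \sum_{n=0}^\infty \frac{1}{16^n}\left(\frac{4}{8n+1} - \frac{2}{8n+4} - \frac{1}{8n+5} - \frac{1}{8n+6}\right)
\]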
I’ve seen this claim many times, but I’ve never seen it explained in much detail how this formula lets you extract digits of π.
First of all, this formula lets you calculate the digits of π, not in base 10, but in base 16, i.e. in hexadecimal.
It looks like the BBP formula might let you extract hexadecimal digits. After all, the hexadecimal expansion of π is the sequence of coefficients a_n such that

\[
\pi = \sum_{n=0}^\infty a_n \, 16^{-n}
\]

where each a_n is an integer between 0 and 15. But the term multiplying 16^{-n} in the BBP formula is not an integer, so it’s not that easy. Fortunately, it’s not that hard either.
As a warmup, let’s think about how we would find the nth digit of π in base 10. We’d multiply by 10^n, throw away the fractional part, and look at the last digit. That is, we would calculate
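\[
\lfloor 10^n \pi \rfloor \bmod 10
\]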
Another approach would be to multiply by 10^{n−1}, shifting the digit that we’re after to the first position after the decimal point, then take the integer part of 10 times the fractional part.
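\[
\left\lfloor 10 \, \{ 10^{n-1} \pi \} \right\rfloor
\]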
Here {x} denotes the fractional part of x. This is a little more complicated, but now all the calculation is inside the floor operator.
For example, suppose we wanted to find the 4th digit of π. Multiplying by 10^3 gives 3141.592… with fractional part 0.592…. Multiplying by 10 gives 5.92… and taking the floor gives us 5.
Now to find the nth hexadecimal digit we’d do the analogous procedure, replacing 10 with 16. To make things concrete, let’s calculate the 100th hexadecimal digit. We need to calculate
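\[
\left\lfloor 16 \left\{ \sum_{n=0}^\infty 16^{99-n}\left(\frac{4}{8n+1} - \frac{2}{8n+4} - \frac{1}{8n+5} - \frac{1}{8n+6}\right) \right\} \right\rfloor
\]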
We can replace the infinite sum with a sum up to 99 because the remaining terms sum to an amount too small to affect our answer. Note that we’re being a little sloppy here, but this step is justified in this particular example.
Here’s the trick that makes this computation practical: since we only need the fractional part of the sum, we can reduce the power of 16 in the first term mod (8n + 1), in the second term mod (8n + 4), and so on, before dividing; dropping those integer parts along the way doesn’t change the final fractional part. We can use fast modular exponentiation, the same trick that makes, for example, a lot of encryption calculations practical.
Here’s code that computes the Nth hexadecimal digit of π by evaluating the expression above.
from math import floor

def hexdigit(N):
    # Compute the Nth hexadecimal digit of pi (after the hexadecimal point).
    s = 0
    for n in range(N):
        s += 4*pow(16, N-n-1, 8*n + 1) / (8*n + 1)
        s -= 2*pow(16, N-n-1, 8*n + 4) / (8*n + 4)
        s -=   pow(16, N-n-1, 8*n + 5) / (8*n + 5)
        s -=   pow(16, N-n-1, 8*n + 6) / (8*n + 6)
    frac = s - floor(s)
    return floor(16*frac)
Here the three-argument version of pow carries out modular exponentiation efficiently. That is, pow(b, x, m) calculates b^x mod m.
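For example, since 2^10 = 1024:

>>> pow(2, 10, 1000)
24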
This code correctly calculates that the 100th hex digit of π is C. Note that because we did not include an estimate of the tail of the sum, this code could fail for some values of N. A more rigorous version of the code would add a tail estimate to the variable s above.
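For example, hexdigit(100) returns 12, which is C in hexadecimal:

>>> hexdigit(100)
12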
[1] Named after the initials of the authors: Bailey, David H.; Borwein, Peter B.; Plouffe, Simon (1997). On the Rapid Computation of Various Polylogarithmic Constants. Mathematics of Computation 66 (218): 903–913.
The post Computing the nth digit of π directly first appeared on John D. Cook.
The xAI Colossus supercomputer contains 100,000 NVIDIA H100 GPUs. Upgrades are planned, ultimately up to as many as a million GPUs. The H100 has a theoretical peak speed of at least 60 teraFLOPs (FP64 tensor core), though the actual number depends on the power and frequency cap settings on the GPUs. Admittedly FP64 is overkill for Colossus’ intended use of AI model training, though it is required for most scientific and engineering applications on typical supercomputers. This would put Colossus nominally at a theoretical peak speed of 6 Exaflops of full FP64 precision for dense matrix multiplies.
El Capitan at Lawrence Livermore National Lab now ranks as the #1 fastest system in the world on the TOP500 list, recently taking the crown from Frontier at Oak Ridge National Lab. Both Frontier and El Cap were procured under the same collaborative CORAL-2 project by the two respective laboratories. El Capitan uses AMD Instinct MI300A GPUs for a theoretical peak speed of 2.746 Exaflops.
You may wonder about the discrepancy: Colossus has more raw FLOPs, while El Capitan is ranked #1. Which system is actually faster? For decades, top system performance has commonly been measured for TOP500 using the High Performance Linpack (HPL) benchmark. Some have expressed concerns that HPL is an unrepresentative “FLOPs-only” benchmark. However, HPL actually measures more than raw rate of floating point operations. HPL performs distributed matrix products on huge matrices that become smaller and smaller in size during the HPL run, with a serial dependency between sequential matrix multiplies. Near the end of the run, performance becomes very limited by network latency, requiring excellent network performance. Furthermore, HPL is also a system stability test, since the system (often made up of brand new hardware for which bad parts must be weeded out) must stay up for a period of hours without crashing and at the end yield a correct answer (my colleague Phil Roth gives a description of this ordeal for Frontier). In short, a system could have lots of FLOPs but fail these basic tests of being able to run a nontrivial application.
Some commercial system owners may choose not to submit an HPL number, for whatever reason (though Microsoft submitted one and currently has a system at #4). In some cases submitting a TOP500 number may not be a mission priority for the owner. Or the system may not have an adequate network or the requisite system stability to produce a good number, in spite of having adequate FLOPs. Companies don’t typically give reasons for not submitting, but their reasons can be entirely valid, and not submitting a number has certainly happened before.
You may also wonder how it is that Colossus was stood up in 122 days (indeed a remarkable achievement by a highly capable team) whereas the CORAL-2 Project, which delivered El Capitan and Frontier, spanned multiple years.
Put simply, a system like Colossus stands on the shoulders of many multi-year investments in vendor hardware and software under projects like CORAL-2. Years ago, Oak Ridge National Lab originally put NVIDIA on the map for supercomputing with Titan, the first NVIDIA-powered petascale supercomputer. Some of the core NVIDIA software in use today was developed in part under this multi-year Titan project. Similarly for AMD and CORAL-2. Many systems, including Colossus, have benefitted from these long-term multi-year investments.
Another reason has to do with intended workloads of the respective systems. Colossus is intended primarily for AI model training; even though model architecture variations have slightly different computational patterns, the requirements are similar. El Capitan on the other hand is a general purpose supercomputer, and as such must support many different kinds of science applications with highly diverse requirements (and even more so at other centers like OLCF, ALCF and NERSC) (on system requirements, application diversity and application readiness see here, here and here). It’s much harder to put together a system to meet the needs of such a variety of science projects.
Colossus and El Capitan are both highly capable systems that will provide millions of node-hours of compute for their respective projects. Colossus has a high flop rate to support the reduced-precision matrix multiplies (and presumably high network bandwidth for Allreduce) required for AI model training. El Capitan has a balanced architecture to support a wide variety of science applications at scale.
ADDENDUM: Colossus is now up to 200K GPUs.
The post Colossus versus El Capitan: A Tale of Two Supercomputers first appeared on John D. Cook.
When I’m in the mood to write, you can often follow a chain of thought in my posts. Recently, a post on LLM tokenization led to a post on how Unicode characters are tokenized, which led to a post on Unicode surrogates. The latter ended by touching on Unicode’s PUA (Private Use Area), which of course leads to J.R.R. Tolkien and privacy, as we shall see.
To back up a bit, Unicode started as an attempt at one alphabet to rule them all, a coding system large enough to contain all the world’s writing systems. Initially it was thought that 2^16 = 65,536 symbols would be enough, but that didn’t last long. The Unicode standard now contains nearly 100,000 Chinese characters alone [1], along with myriad other symbols such as graphical drawing elements, mathematical symbols, and emoji.
Although Unicode is vast, it doesn’t have room to include every symbol anyone might use. Unicode anticipated that contingency and reserved some areas for private use. The best-known example is probably Apple’s use of U+F8FF for their logo. The private use area is also being used for ancient and medieval scripts, as well as for fictional writing systems such as Tolkien’s Tengwar and Cirth.
Because anyone can use the private use area as they wish, there could be conflicts. For example, Apple intends U+F8FF to display their logo, but the code point is also used for a Klingon symbol. We’ll see another example of a conflict at the end of this post.
Here’s the chain of thought that leads to privacy. When I started thinking about this post, I thought about creating an image of Tengwar writing, and that made me think about font fingerprinting. Browsers let web servers know what fonts you have installed, which serves a benign purpose: a site may want to send you text formatted in a particular font if it’s available, but will fall back to a more common font if necessary.
However, the collection of fonts installed on a particular computer may be unique, or at least used in combination with other browser information to uniquely identify a user. Before installing a Tengwar font I thought about how it would be sure to increase the uniqueness of my font fingerprint.
J. R. R. Tolkien’s Tengwar writing system has been mapped to the Unicode range U+E000 to U+E07F.
Here’s a sample. No telling what it looks like in your browser:
When I view the same text in the Tecendil online transcriber I see this:
But when I open the same text in Emacs I see this:
Both are legitimate renderings because nobody owns the private use areas. The code points that Tecendil uses for Tengwar are apparently used by some font on my laptop to display Chinese characters.
I’m especially curious about the last character, U+E000. Tecendil interprets it as the Tengwar symbol tinco but something on my laptop interprets it as the Mozilla mascot.
[1] It wasn’t so naive to think 16 bits would do, even including Chinese. There may be 100,000 Chinese characters, but a few thousand characters account for nearly all Chinese writing. More on that here. But if you’re going to break the 16-bit barrier anyway, you might as well try to include even the most rare characters.
The post Unicode, Tolkien, and Privacy first appeared on John D. Cook.
At the highest level, Unicode is simply a list of symbols. But when you look closer you find that isn’t entirely true. Some of the symbols are sorta meta symbols. And while a list of symbols is not complicated, this list is adjacent to a lot of complexity.
I’ve explored various rabbit holes of Unicode in previous posts, and this time I’d like to go down the rabbit hole of surrogates. These came up in the recent post on how LLMs tokenize Unicode characters.
Incidentally, every time I hear “surrogates” I think of the Bruce Willis movie by that name.
Once upon a time, text was represented in ASCII, and life was simple. Letters, numbers, and a few other symbols all fit into a single eight-bit byte with one bit left over. There wasn’t any need to distinguish ASCII and its encoding because they were synonymous. The nth ASCII character was represented the same way as the number n.
The A in ASCII stands for American, and ASCII was sufficient for American English, sorta. Then people played various tricks with the leftover bit to extend ASCII by adding 128 more symbols. This was enough, for example, to do things like include the tilde on the n in jalapeño, but representing anything like Russian letters, math symbols, or Chinese characters was out of the question.
So along came Unicode. Instead of using one byte to represent characters, Unicode used two bytes. This didn’t double the possibilities; it squared them. Instead of 128 ASCII symbols, or 256 extended ASCII symbols, there was the potential to encode 2^16 = 65,536 symbols. Surely that would be enough! Except it wasn’t. Soon people wanted even more symbols. Currently Unicode has 154,998 code points.
The most obvious way to represent Unicode characters, back when there were fewer than 2^16 characters, was as a 16-bit number, i.e. to represent each character using two bytes rather than one. The nth Unicode character could be represented the same way as the number n, analogous to the case of ASCII.
But while text could potentially require thousands of symbols, a lot of text can be represented just fine with ASCII. Using two bytes per character would make the text take up twice as much memory. UTF-8 encoding was a very clever solution to this problem. Pure ASCII (i.e. not extended) text is represented exactly as in ASCII, taking up no more memory. And if text is nearly all ASCII with occasional non-ASCII characters sprinkled in, the size of the text in memory increases in proportion to the number of non-ASCII characters. Nearly pure ASCII text takes up nearly the same amount of memory.
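Here’s a quick illustration in Python. A pure ASCII string takes one byte per character in UTF-8, and each non-ASCII character adds only a little more:

>>> len("hello".encode("utf-8"))
5
>>> len("jalapeño".encode("utf-8"))
9

The second string has 8 characters but takes 9 bytes, because ñ encodes as two bytes.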
But the cost of UTF-8 is encoding. The bit representation of the nth character is no longer necessarily the bit representation of the integer n.
UTF-16, while wasteful when representing English prose, is a direct encoding. Or at least it was until Unicode spilled over its initial size limit. You need more than a single pair of bytes to represent more than 2^16 characters.
The way to represent more than 2^16 symbols with 16-bit characters is to designate some of the “characters” as meta characters. These special characters represent half of a pair of 16-bit units corresponding to a single Unicode character.
There are 1024 characters reserved as high surrogates (U+D800 through U+DBFF) and 1024 reserved as low surrogates (U+DC00 through U+DFFF). High surrogates contain the high-order bits and low surrogates contain the low-order bits.
Let n ≥ 2^16 be the code point of a character. Then the last 10 bits of the high surrogate are the highest 10 bits of the 20-bit offset n − 2^16, and the last 10 bits of the low surrogate are the lowest 10 bits of that offset.
In terms of math rather than bit representations, the pair of surrogates (H, L) represents the code point

\[
2^{16} + 2^{10}(H - 55296) + (L - 56320)
\]

where 55296 = D800 in hex and 56320 = DC00 in hex.
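Here’s a short Python sketch of this correspondence; the function names are just for illustration.

def to_surrogates(cp):
    # Split a code point above U+FFFF into a (high, low) surrogate pair.
    offset = cp - 0x10000  # 20-bit offset
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(H, L):
    # Recombine a surrogate pair into the original code point.
    return 0x10000 + ((H - 0xD800) << 10) + (L - 0xDC00)

# to_surrogates(0x1F680) == (0xD83D, 0xDE80)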
The rocket emoji has Unicode value U+1F680. The bit pattern representing the emoji is D83DDE80 in hex, the combination of high surrogate U+D83D and low surrogate U+DE80. We can verify this with the following Python.
>>> H, L = 0xD83D, 0xDE80
>>> 2**16 + 2**10*(H - 0xD800) + L - 0xDC00 == 0x1F680
True
When we write out the high and low surrogates in binary we can visualize how they contain the high and low bits of the rocket codepoint.
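high surrogate U+D83D = 110110 0000111101
low surrogate  U+DE80 = 110111 1010000000
offset U+1F680 − 0x10000 = 0xF680 = 0000111101 1010000000

The first six bits of each surrogate are the fixed prefixes 110110 and 110111; the last ten bits of each carry the two halves of the 20-bit offset.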
The high surrogate characters U+D800 through U+DBFF are divided into two ranges. Characters in the range U+D800 through U+DB7F are simply called high surrogates, while the remaining characters, U+DB80 through U+DBFF, are called “high private use surrogates.”
These private use surrogates do not work any differently than the rest. They just correspond to code points in Unicode’s private use area.
The post Unicode surrogates first appeared on John D. Cook.
I recently ran across the article Something weird is happening with LLMs and chess. One of the things it mentions is how a minor variation in a prompt can have a large impact on the ability of an LLM to play chess.
One extremely strange thing I noticed was that if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1. e4 e5 2.” (without a space) and let the model generate the space itself. Huh?
The author goes on to explain that tokenization probably explains the difference. The intent is to get the LLM to predict the next move, but the extra space confuses the model because it is tokenized differently than the spaces in front of the e’s. The trailing space is tokenized as an individual character, but the spaces in front of the e’s are tokenized with the e’s. I wrote about this a couple days ago in the post The difference between tokens and words.
For example, ChatGPT will tokenize “hello world” as [15339, 1917] and “world hello” as [14957, 24748]. The difference is that the first string is parsed as “hello” and ” world” while the latter is parsed as “world” and ” hello”. Note the spaces attached to the second word in each case.
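If you have the tiktoken package installed, you can check this for yourself. A minimal sketch, assuming the encoding tiktoken associates with gpt-3.5-turbo:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(enc.encode("hello world"))   # [15339, 1917] per the text above
print(enc.encode("world hello"))   # [14957, 24748]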
The previous post was about how ChatGPT tokenizes individual Unicode characters. It mentions UTF-16, which is itself an example of how tokenization matters. The string “UTF-16” will be represented by three tokens, one each for “UTF”, “-”, and “16”. But the string “UTF16” will be represented by two tokens, one for “UTF” and one for “16”. The string “UTF16” might be more likely to be interpreted as a unit, a Unicode encoding.
The post Practical consequences of tokenization first appeared on John D. Cook.
I mentioned in the previous post that not every Unicode character corresponds to a token in ChatGPT. Specifically I’m looking at gpt-3.5-turbo in tiktoken. There are 100,256 possible tokens and 155,063 Unicode characters, so the pigeonhole principle says not every character corresponds to a token.
I was curious about the relationship between tokens and Unicode so I looked into it a little further.
The Unicode characters U+D800 through U+DFFF all map to a single token, 5809. This is because these are not really characters per se but “surrogates,” code points that are used in pairs to represent other code points [1]. They don’t make sense in isolation.
The character U+FFFD, the replacement character �, also corresponds to 5809. It’s also not a character per se but a way to signal that another character is not valid.
Aside from the surrogates and the replacement character, every Unicode character in the BMP, characters up to U+FFFF, has a unique representation in tokens. However, most require two or three tokens. For example, the snowman character is represented by two tokens: [18107, 225].
Note that this discussion is about single characters, not words. As the previous post describes, many words are tokenized as entire words, or broken down into units larger than single characters.
The rest of the Unicode characters, those outside the BMP, all have unique token representations. Of these, 3,404 are represented by a single token, but the rest require 2, 3, or 4 tokens. The rocket emoji, U+1F680, for example, is represented by three tokens: [9468, 248, 222].
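Again, you can explore this with tiktoken. A minimal check, using the rocket emoji’s code point from above:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(enc.encode("\U0001F680"))   # rocket emoji U+1F680; the text above reports [9468, 248, 222]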
[1] Unicode was originally limited to 16 bits, and UTF-16 represented each character with a 16-bit integer. When Unicode expanded beyond 2^16 characters, UTF-16 used pairs of surrogates, one high surrogate and one low surrogate, to represent code points higher than U+FFFF.
The post ChatGPT tokens and Unicode first appeared on John D. Cook.