2025-06-06 01:30:49
Sometimes a paper comes out that has the seeds of a great idea that could lead to a whole new line of pioneering research. But, instead, nothing much happens, except imitative works that do not push the core idea forward at all.
For example, the McCulloch–Pitts paper from 1943 showed how neural networks could represent arbitrary logical or Boolean expressions of a certain class. The paper was well received at the time, brilliantly executed by co-authors with diverse expertise in neuroscience, logic, and computing. Had its significance been fully grasped, this paper might have, at least notionally, formed a unifying conceptual bridge between the two nascent schools of connectionism and symbolic AI (one can at least hope). But instead, the heated conflict in viewpoints in the field has persisted, even to this day.
Another example is George Miller’s “7 ± 2” paper. This famous result showed that humans are able to hold only a small number of pieces of information in mind at the same time while reasoning. This paper was important not just for the specific result, but for the breakthrough in methodology: using rigorous, noninvasive experimental methods to discover how human thinking works, a topic we know so little about, even today. However, the follow-up papers by others, for the most part, only extended or expanded on the specific finding in very minor ways. [1] Thankfully, Miller’s approach did eventually gain influence in more subtle ways.
Of course it’s natural from the incentive structures of publishing that many papers would be primarily derivative rather than original. It’s not a bad thing that, when a pioneering paper comes out, others very quickly write rejoinder papers containing evaluations or minor tweaks of the original result. Not bad, but sometimes we miss the larger implications of the original result and get lost in the details.
Another challenge is stovepiping—we get stuck in our narrow swim lanes for our specific fields and camps of research. [2] We don’t see the broader implications, such as connections and commonalities across fields that could lead to fruitful new directions.
Thankfully, at least to some extent current research in AI shows some mix of both innovation and imitation. Inspired in part by the accelerationist mindset, many new papers appear every day, some with significant new findings and others that are more modest riffs on previous papers.
[1] Following this line of research on human thought processes could be worthwhile for various reasons. For example, some papers in linguistics state that Chomsky’s vision for a universal grammar is misguided because the common patterns in human language are entirely explainable by the processing limitations of the human mind. But this claim is made with no justification or methodological rigor of any kind. If I claimed a CPU performs vector addition or atomic operations efficiently because of “the capabilities of the processor,” I would need to provide some supporting evidence, for example, documenting that the CPU has vector processing units or specialized hardware for atomics. The assertion that language structure is shaped by the human mental processing faculty is just an empty truism unless it is supported by some amount of scientific rigor and is free of the common fallacies of statistical reasoning.
[2] I recently read a paper in linguistics with apparent promise, but the paper totally misconstrued the relationship between Shannon entropy and Kolmogorov complexity. Sadly this paper passed review in a linguistics journal, but if it had had a mathematically inclined reviewer, the problem would have been caught and fixed.
The post Scientific papers: innovation … or imitation? first appeared on John D. Cook.
2025-06-05 22:23:22
I just stumbled across the binomial number system in Exercise 5.38 of Concrete Mathematics. The exercise asks the reader to show that every non-negative integer n can be written as

n = C(a, 1) + C(b, 2) + C(c, 3)

and that the representation is unique if you require 0 ≤ a < b < c. The book calls this the binomial number system. I skimmed a paper that said this has some application in signal processing, but I haven’t looked at it closely [1].
You can find a, b, and c much as you would find the representation in many other number systems: first find the largest possible c, then the largest possible b for what’s left, and then the remainder is a.
In order to find c, we start with the observation that the binomial coefficient C(k, 3) is less than k³/6, and so the integer part of the cube root of 6n is a lower bound on c. We can use this as an initial guess and search upward incrementally. If we wanted to be more efficient, we could do some sort of binary search.
Here’s Python code to find a, b, and c.
from math import comb, factorial

def lower(n, r):
    "Find largest k such that comb(k, r) <= n."
    k = int( (factorial(r)*n)**(1/r) ) # initial guess
    while comb(k, r) <= n:
        k += 1
    return k - 1

def binomial_rep(n):
    c = lower(n, 3)
    cc = comb(c, 3)
    b = lower(n - comb(c, 3), 2)
    bb = comb(b, 2)
    a = n - cc - bb
    assert(c > b > a >= 0)
    return (a, b, c)
For example, here’s the binomial number system representation of today’s date.
>>> binomial_rep(20250605)
(79, 269, 496)
>>> comb(496, 3) + comb(269, 2) + comb(79, 1)
20250605
You could use any number of binomial terms, not just three.
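Here is a minimal sketch (my own, not from the exercise or the paper) of how the same greedy idea extends to r terms, reusing the lower() function defined above; binomial_rep_general is a hypothetical name.

from math import comb

def binomial_rep_general(n, r):
    "Greedy expansion n = C(a1, 1) + C(a2, 2) + ... + C(ar, r) with a1 < a2 < ... < ar."
    terms = []
    for k in range(r, 0, -1):
        a = lower(n, k)       # largest a with comb(a, k) <= remaining n
        terms.append(a)
        n -= comb(a, k)
    return tuple(reversed(terms))   # (a1, a2, ..., ar)

For example, binomial_rep_general(20250605, 3) returns (79, 269, 496), matching the result above.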
[1] I looked back at the paper, and it is using a different kind of binomial number system, expressing numbers as sums of fixed binomial coefficients, not varying the binomial coefficient arguments. This representation has some advantages for error correction.
The post Binomial number system first appeared on John D. Cook.
2025-06-04 23:03:07
Casting out nines is a well-known way of finding the remainder when a number is divided by 9. You add all the digits of a number n. If that sum is bigger than 9, you add all the digits of that number. You keep this up until you get a single digit.
This is an example of persistence. The persistence of a number, relative to some operation, is the number of times you have to apply that operation before you reach a fixed point, a point where applying the operation further makes no change.
The additive persistence of a number is the number of times you have to take the digit sum before getting a fixed point (i.e. a single-digit number). For example, the additive persistence of 8675309 is 3 because the digits in 8675309 sum to 38, the digits in 38 sum to 11, and the digits in 11 sum to 2.
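Here is a minimal sketch (my own, not from the post) of the additive case; the multiplicative code below follows the same pattern.

def digit_sum(n):
    return sum(int(d) for d in str(n))

def additive_persistence(n):
    c = 0
    while n > 9:               # keep summing digits until n is a single digit
        n = digit_sum(n)
        c += 1
    return c

For example, additive_persistence(8675309) returns 3, matching the example above.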
The additive persistence of a number in base b grows very slowly, roughly like its iterated logarithm in base b.
The smallest number with additive persistence n is sequence A006050 in OEIS.
Multiplicative persistence is analogous, except you multiply digits together. Curiously, it seems that multiplicative persistence is bounded. This is true for numbers with less than 30,000 digits, and it is conjectured to be true for all integers.
The smallest number with multiplicative persistence n is sequence A003001 in OEIS.
Here’s Python code to compute multiplicative persistence.
def digit_prod(n):
    s = str(n)
    prod = 1
    for d in s:
        prod *= int(d)
    return prod

def persistence(n):
    c = 0
    while n > 9:
        n = digit_prod(n)
        print(n)
        c += 1
    return c
You could use this to verify that 277777788888899 has persistence 11. It is conjectured that no number has persistence larger than 11.
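For example (the intermediate products below are printed by the function itself):

>>> persistence(277777788888899)
4996238671872
438939648
4478976
338688
27648
2688
768
336
54
20
0
11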
The persistence of a large number is very likely 1. If you pick a number with 1000 digits at random, for example, it’s very likely that at least one of these digits will be 0: the chance that none of them is 0 is roughly (9/10)^1000, on the order of 10^−46. So it would seem that the probability of a large number having persistence larger than 11 would be incredibly small, but I would not have expected it to be zero.
Here are plots to visualize the fixed points of the numbers less than N for increasing values of N.
It’s not surprising that zeros become more common as N increases. And a moment’s thought will explain why even numbers are more common than odd numbers. But it’s a little surprising that 4 is much less common than 2, 6, or 8.
Incidentally, we have worked in base 10 so far. In base 2, the maximum persistence is 1: when you multiply a bunch of 0s and 1s together, you either get 0 or 1. It is conjectured that the maximum persistence in base 3 is 3.
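Here is a sketch (my own, not from the post) of the same computation in an arbitrary base b, which could be used to explore the base 3 conjecture:

def digit_prod_base(n, b):
    "Product of the base-b digits of n."
    prod = 1
    while n:
        n, d = divmod(n, b)
        prod *= d
    return prod

def persistence_base(n, b):
    c = 0
    while n >= b:              # stop once n is a single digit in base b
        n = digit_prod_base(n, b)
        c += 1
    return c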
The post Additive and multiplicative persistence first appeared on John D. Cook.
2025-06-04 19:55:45
Logarithms give a way to describe large numbers compactly. But sometimes even the logarithm is a huge number, and it would be convenient to take the log again. And maybe again.
For example, consider
googolplex = 10^googol
where
googol = 10^100.
(The name googol is older than “Google,” and in fact the former inspired the latter name.)
A googolplex is a big number. Its log base 10 is still a big number. You have to take the log 3 times to get to a small number:
log₁₀(log₁₀(log₁₀ googolplex)) = 2.
The iterated logarithm base b of a positive number x, written log*b(x), is the number of times you have to take the log base b until you get a result less than or equal to 1. So in the example above
log*₁₀ googolplex = 4.
As usual, a missing base implies b = e. Also just as log base 2 is often denoted lg, the iterated log base 2 is sometimes denoted lg*.
I’ve seen sources that say the base of an iterated logarithm doesn’t matter. That’s not true, because, for example,
log*₂ googolplex = 6.
The function log*(x) appears occasionally in the analysis of algorithms.
It’s easy to implement the iterated log, though this is of limited use because the iterated log is usually applied to numbers too large to represent as floating point numbers.
from math import log, exp

def logstar(x, base=exp(1)):
    c = 0
    while x > 1:
        x = log(x, base)
        c += 1
    return c

lgstar = lambda x: logstar(x, 2)
The largest known prime, at the time of writing, is
p = 2^136279841 − 1.
We can use the code above to determine that lg* 136279841 = 5 and so lg* p = 6.
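Here is that calculation using the code above (p itself is too large to represent as a float, but its exponent is not):

>>> lgstar(136279841)
5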
The prime number theorem says that the number of prime numbers less than x, denoted π(x), is asymptotically li(x), where

li(x) = ∫₀^x dt / log t

(a principal value integral, because of the singularity at t = 1).
You may have seen a different form of the prime number theorem that says π(x) is asymptotically x/log x. That’s true too, because li(x) and x/log x are asymptotically equal. But li(x) gives a better approximation to π(x) for finite x. More on that here.
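As a quick illustration (a sketch using mpmath’s li function, which is not part of the post), compare both approximations with the known count π(10^6) = 78498:

from math import log
from mpmath import li

x = 10**6
print(int(li(x)))       # 78627, quite close to pi(x) = 78498
print(int(x / log(x)))  # 72382, noticeably farther off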
For every x for which π(x) has been computed, π(x) < li(x). However, Littlewood proved in 1914 that there exists some x for which π(x) > li(x). In 1933 Stanley Skewes gave an upper bound on the smallest x such that π(x) > li(x), assuming the Riemann Hypothesis:
S1 = 10^(10^(10^34))
In 1955 he proved the upper bound
S2 = 10^(10^(10^964))
not assuming the Riemann Hypothesis.
You can calculate that
log*₁₀ S1 = 5

and

log*₁₀ S2 = 5.
In all the examples above, log*(x) is a small number. But for some numbers, such as Moser’s number and Graham’s number, even log*(x) is unwieldy and you have to do things like take the iterated logarithm of the iterated logarithm
log**(x) = log*( log*(x) )
before the numbers become imaginable.
2025-06-03 21:47:27
“The inverse of an approximation is an approximation of the inverse.” This statement is either obvious or really clever. Maybe both.
Let’s start with an example. For moderately small x,

10^x ≈ 1 + x ln 10 ≈ 1 + 2.303x.

That means that near 1 (i.e. near 10 raised to a small number), we can invert the approximation to get an approximation of the inverse function:

log₁₀ y ≈ (y − 1)/ln 10 ≈ (y − 1)/2.303.
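A quick numerical check of both approximations (a small sketch, values rounded in the comments):

from math import log

x = 0.02
print(10**x, 1 + x*log(10))          # 1.0471 vs 1.0461
y = 10**x
print(log(y, 10), (y - 1)/log(10))   # 0.0200 vs 0.0205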
The inverse of a good approximation is a good approximation of the inverse. But the measure of goodness might not be the same in both uses of the word “good.” The inverse of a very good approximation might be a mediocre approximation of the inverse, depending on the sensitivity of the problem. See, for example, Wilkinson’s polynomial.
Now let’s look at derivatives and inverse functions. The derivative of a function is the best linear approximation to that function at a point. And it’s true that the best linear approximation to the inverse of a function at a point is the inverse of the best linear approximation to the function. In symbols,

dx/dy = 1 / (dy/dx).
This is a good example of a bell curve meme, or coming full circle. The novice naively treats the symbols above as fractions and says “of course.” The sophomore says “Wait a minute. These aren’t fractions. You can’t just do that.” And they would be correct, except the symbols are limits of fractions, and in this context naive symbol manipulation in fact leads to the correct result.
However, the position on the right side of the meme is nuanced. Treating derivatives as fractions, especially partial derivatives, can get you into trouble.
The more rigorous statement of the inverse function theorem is harder to parse visually, and so the version above is a good mnemonic. If f(x) = y then

(f⁻¹)′(y) = 1 / f′(x).
The same is true in multiple variables. If f is a smooth function from ℝⁿ to ℝⁿ then the best linear approximation to f at a point is a linear transformation, which can be represented by a matrix M. And the best linear approximation to the inverse function at that point is the linear transformation represented by the inverse of M. In symbols, if f(x) = y then

(f⁻¹)′(y) = (f′(x))⁻¹ = M⁻¹.
There’s fine print, such as saying a function has to be differentiable in a neighborhood of x and the derivative has to be non-singular, but that’s the gist of the inverse function theorem.
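To make this concrete, here is a small numerical sanity check (my own sketch, not from the post), using a hypothetical invertible map f(x, y) = (x + y³, y) and finite-difference Jacobians:

import numpy as np

def f(v):
    x, y = v
    return np.array([x + y**3, y])

def finv(w):
    u, v = w
    return np.array([u - v**3, v])

def jacobian(g, v, h=1e-6):
    "Approximate the Jacobian of g at v by central differences."
    n = len(v)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (g(v + e) - g(v - e)) / (2*h)
    return J

x = np.array([1.0, 2.0])
M = jacobian(f, x)            # best linear approximation to f at x
Minv = jacobian(finv, f(x))   # best linear approximation to the inverse of f at f(x)
print(np.allclose(np.linalg.inv(M), Minv, atol=1e-5))   # True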
Note that unlike in the first section, there’s no discussion of how good the approximations are. The approximations are exact, at a point. But it might still be the case that the linear approximation on one side is good over a much larger neighborhood than the linear approximation on the other side.
The post Approximation of Inverse, Inverse of Approximation first appeared on John D. Cook.
2025-06-03 02:58:43
Fermat’s little theorem says that if p is a prime and a is not a multiple of p, then
a^(p−1) = 1 (mod p).
The contrapositive of Fermat’s little theorem says if
a^(p−1) ≠ 1 (mod p)
then either p is not prime or a is a multiple of p. The contrapositive is used to test whether a number is prime. Pick a number a less than p (and so a is clearly not a multiple of p) and compute a^(p−1) mod p. If the result is not congruent to 1 mod p, then p is not a prime.
So Fermat’s little theorem gives us a way to prove that a number is composite: if a^(p−1) mod p equals anything other than 1, p is definitely composite. But if a^(p−1) mod p does equal 1, the test is inconclusive. Such a result suggests that p is prime, but it might not be.
For example, suppose we want to test whether 57 is prime. If we apply Fermat’s primality test with a = 2 we find that 57 is not prime, as the following bit of Python shows.
>>> pow(2, 56, 57)
4
Now let’s test whether 91 is prime using a = 64. Then Fermat’s test is inconclusive.
>>> pow(64, 90, 91)
1
In fact 91 is not prime because it factors into 7 × 13.
If we had used a = 2 as our base we would have had proof that 91 is composite. In practice, Fermat’s test is applied with multiple values of a (if necessary).
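Here is a minimal sketch of that idea with randomly chosen bases (my own illustration, not a production primality test; fermat_test is a hypothetical name):

from random import randrange

def fermat_test(p, trials=20):
    "Return False if p is definitely composite, True if p is probably prime."
    if p < 4:
        return p in (2, 3)
    for _ in range(trials):
        a = randrange(2, p - 1)
        if pow(a, p - 1, p) != 1:
            return False    # found a witness that p is composite
    return True             # probably prime (Carmichael numbers can still fool this test)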
If a^(p−1) = 1 (mod p), then a is a witness to the primality of p. It’s not proof, but it’s evidence. It’s still possible p is not prime, in which case a is a false witness to the primality of p, or p is a pseudoprime to the base a. In the examples above, 64 is a false witness to the primality of 91.
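A brute-force count (my own snippet) shows just how many false witnesses even a small composite number can have; false_witnesses is a hypothetical name:

def false_witnesses(n):
    "All bases a with a^(n-1) = 1 (mod n), for a composite n."
    return [a for a in range(1, n) if pow(a, n - 1, n) == 1]

print(len(false_witnesses(91)))   # 36, including a = 1, a = 64, and a = 90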
Until I stumbled on a paper [1] recently, I didn’t realize that on average composite numbers have a lot of false witnesses. For large N, the average number of false witnesses for composite numbers less than N is greater than N^(15/23). Although N^(15/23) is big, it’s small relative to N when N is huge. And in applications, N is huge, e.g. N = 2^2048 for RSA encryption.
However, your chances of picking one of these false witnesses, if you were to choose a base a at random, is small, and so probabilistic primality testing based on Fermat’s little theorem is practical, and in fact common.
Note that the result quoted above is a lower bound on the average number of false witnesses. You really want an upper bound if you want to say that you’re unlikely to run into a false witness by accident. The authors in [1] do give upper bounds, but they’re much more complicated expressions than the lower bound.
[1] Paul Erdős and Carl Pomerance. On the Number of False Witnesses for a Composite Number. Mathematics of Computation. Volume 46, Number 173 (January 1986), pp. 259–279.
The post False witnesses first appeared on John D. Cook.