2026-06-01 00:15:49
The function
(1 + cos(x))/2
gives a fair approximation to the Gaussian density
exp(−x²)
You can make the approximation much better by raising it to a power. The function
((1 + cos(x))/2)4
gives a good lower bound and
((1 + cos(x))/2)3.5597
gives a good upper bound. More on that here.
There are other ways of improving the cosine approximation to the Gaussian. Yesterday I came across one I hadn’t seen before, adding a sin(x) term to x.
(1 + cos(sin(x) + x))/2
This function matches the first few terms of the power series for exp(−x²) and has an error on the order of x6/240. You can’t see the difference between the two functions in a plot for −4 ≤ x ≤ 4.
***
There’s a tension between the previous two statements. If the error in on the order of x6/240 then we’d expect the error to be huge at x = 4. We have
46/240 = 17.07
and yet
exp(−4²) − ((1 + cos(4 + sin(4)))/2) = −0.002579,
i.e. the error is between 3 and 4 orders of magnitude smaller than we might expect.
We have an alternating series, so the truncation error should be roughly equal to the first term after the truncation, right? No, the alternating series theorem doesn’t apply because the absolute values of the terms in the series are not decreasing yet for x = 4. The terms have to decrease eventually because the series has infinite radius of convergence, but they’re not decreasing at the 6th term; the terms will get much larger in absolute value before they get smaller.
The basic alternating series theorem gives only an upper bound on truncation error, but there are extensions that also give a lower bound. I wrote about these extensions a few weeks ago. But these extentions don’t apply here because the terms have not started decreasing in absolute value.
The post Another Gaussian approximation first appeared on John D. Cook.2026-05-31 05:06:49
If a polynomial identity holds at a few random points, it’s very like true. We’ll make this statement more precise, but first let’s look at some applications.
You may want to test an identity that naturally presents itself as a statement that two polynomials are equal. Or you might use something like the binomial coefficient trick to reframe a problem that isn’t obviously an identity about polynomials. And with algebraic circuits, you can reformulate a wide range of computations as polynomial identities; this is widely used in zero-knowledge proofs.
The theorem alluded to at the top of the post is the Schwartz-Zippel lemma. It is formulated in terms of the probability of a non-zero polynomial P evaluating to zero. To prove that two polynomials Q1 and Q2 are equal, you can show that
P = Q1(x) − Q2(x) = 0.
Let F be a (typically large) finite field and let P be a non-zero polynomial in n variables
P(x1, x2, x3, …, xn)
of total degree d. If we choose the x‘s randomly from F then the probability that P evaluates to zero is no more than d/|F|. [1]
If the total degree d is small relative to the size of the field, then the probability of P evaluating to zero is small. As long as d is less than |F|, you can evaluate the polynomial k times to make
(d / |F|)k
as small as you’d like. If d isn’t too large, and F is large, like the integers mod p = 2255 − 19 used in cryptography, one polynomial evaluation might be enough to give convincing evidence that the polynomial is zero.
[1] The Schwartz-Zippel lemma in its full generality applies to polynomials over an integral domain R with variables drawn from S, a finite subset of R. Here we’re setting R = S = F.
The post Spot checking polynomial identities first appeared on John D. Cook.2026-05-29 20:24:30
The sample variance of a set of numbers is defined in terms of the sum of the squared distances from each point to the mean.
So it would seem that you first need to calculate the mean, then go back and compute the squared differences from the mean. And yet sample variance can be computed in one pass through the data.
You’ll find two equivalent equations in statistics books: the one described above and another based on the sum of the data points and the sum of the data points squared.
While this equation is theoretically correct, it is numerically unstable. Code that directly implements this equation can return a negative value for a quantity that is theoretically positive. I’ve seen this happen with real data, causing a program to crash when taking the square root of the variance to get the standard deviation.
However, there is an algorithm that computes mean and variance in one pass that is accurate and numerically stable. This algorithm was developed by B. P. Welford in 1962. I discuss Welford’s algorithm and give code for implementing it here.
Welford’s algorithm is known in computer science as an “online” algorithm. This term was coined well before the Internet. For example, see the paper [1] from 1965.
But of course now “online” means something else, and so the technical and colloquial uses of “online algorithm” have split. Technical literature uses the phrase to describe the kinds of algorithms in this post. Most people would take “online algorithm” to mean code that runs on a remote server. You may see “streaming algorithm” as a contemporary technical term, but I’d still search on “online algorithm” to find papers.
Welford’s algorithm computes the first two moments, mean and variance, of a data set online. It is also possible to compute skewness and kurtosis online, as well as higher moments.
Simple linear regression is closely related to calculating mean and variance, and it is possible to compute simple regression coefficients online. I have some old notes on this here.
This post was motivated by an email asking me about multiple regression. It is also possible to compute multiple regression coefficients online, but I haven’t done this. I found a couple references, [2] and [3], but I have not read them. There is a simple procedure for two predictor variables but I believe things get a little more complicated with three or more predictors, requiring a recursive least squares algorithm.
The notion of online algorithms is closely related to the notion of a fold in functional programming. Here are several posts on computing things with folds.
[1] One-Tape, Off-Line Turing Machine Computations by F. C. Hennie. Information and Control. 8, 553-578 (1965). Available here. In this paper Hennie writes “In an on-line computation the input data are supplied to the machine, one symbol at a time, at a special input terminal. … In an off-line computation all of the input symbols are written on one of the machine’s tapes prior to the start of the computation.
[2] Arthur Albert and Robert W. Sittler, “A Method for Computing Least Squares Estimators that Keep Up with the Data,” Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 3(3), 384–417, 1965. DOI: 10.1137/0303026.
[3] Petre Stoica and Per Ashgren. Exact initialization of the recursive least-squares algorithm. Int. J. Adapt. Control Signal Process. 2002; 16:219&ndashh;230.
The post Online (one-pass) algorithms first appeared on John D. Cook.2026-05-28 09:35:54
Kullback-Leibler divergence is defined for two random variables X and Y by
K-L divergence is non-negative, and it’s zero if and only if X and Y have the same distribution. But it is not a metric, for reasons explained here. For one thing, it’s not symmetric.
We can fix the symmetry problem by defining
The J above stands for Jeffreys, for Harold Jeffreys. J is called either the symmetrized K-L divergence or Jeffreys’ divergence. It’s still a divergence, not a distance.
A distance (metric) d has to have four properties:
K-L divergence satisfies the first two properties. Jeffreys’ divergence satisfies the first three, but not the last one, the triangle inequality.
To show that J doesn’t satisfy the triangle inequality, let X, Y, and Z be Bernoulli random variables with p equal to 0.1, 0.2, and 0.3 respectively. Then the following Python code shows that the divergence from X to Y, plus the divergence from Y to Z, is less than the divergence from X to Z. This would be like saying you could get from LA to NYC faster by having a layover in Denver rather than taking a direct flight.
from math import log kl = lambda p, q: p*log(p/q) + (1-p)*log((1-p)/(1-q)) j = lambda p, q: kl(p, q) + kl(q, p) a = j(0.1, 0.2) b = j(0.2, 0.3) c = j(0.1, 0.3) print(a + b, c)
This prints 0.135 and 0.270.
Jensen-Shannon distance turns K-L divergence into a metric as follows. First, define the random variable M to be the average of X and Y. Then average the K-L divergence from M to each of X and Y. This defines the Jensen-Shannon divergence. It’s still not a metric, but its square root is, which defines the Jensen-Shannon distance.
The following code gives an example of Jensen-Shannon distance satisfying the triangle inequality.
def d(p, q):
m = 0.5*(p + q)
jsd = 0.5*kl(p, m) + 0.5*kl(q, m)
return jsd**0.5
a = d(0.1, 0.2)
b = d(0.2, 0.3)
c = d(0.1, 0.3)
print(a + b, c)
This prints 0.1817 and 0.1801. Now a layover makes the trip longer.
The post Turning K-L divergence into a metric first appeared on John D. Cook.2026-05-27 23:15:51
I saw a post yesterday saying that the Meta logo is a Besace curve.

A Besace curve has the implicit form
and the parametric form
where t ranges over [0, 2π].
So given a Besace curve, such as the Meta logo, how do you find the parameters a and b to fit the curve?
We can rewrite the parametric expression for x as a sine with a phase shift (see notes here)
where
Also, we can rewrite the parametric expression for y as
Now the extreme values of x and y are easier to see. The maximum value of x is A and the minimum value is −A. The maximum value of y is A(cos(φ) + 1)/2 and the minimum value is A(cos(φ) − 1)/2.
We can simplify the cosine of an artangent (see here) to find the height, i.e. the difference between the maximum and minimum y value, in terms of a and b.
Then the height is given by
The width is given by
and so
and
Now the Meta logo is drawn with a thick line, and the line width isn’t constant. It’s a little fuzzy what the height and width of the middle of the curve are, but I estimated h = 120 and w = 200 from one image. This leads to b = 20 and a = 97.98.
The Mathematica code
ParametricPlot[{a Cos[t] +
b Sin[t], -Sin[t] ( a Cos[t] + b Sin[t])}, {t, 0, 2 Pi},
PlotStyle -> Thickness[0.05]]
produces the following image.

This is reminiscent of the Meta logo, but not a great match. I suspect the logo is not exactly a Besace curve. You could tinker with the a and b parameters and the aspect ratio to get a closer match. The logo may have been inspired by a Besace curve and then drawn by hand.
The post The Meta logo and fitting Besace curves first appeared on John D. Cook.2026-05-27 01:59:53
The previous post looked at the expected IQ range in a jury of 12. This post will look more generally at computing the expected range of n samples from a N(0, 1) random variable. This will give the expected range in units of σ, i.e. multiply the results by σ if your σ isn’t 1.
As mentioned in the previous post, the expected range is given by
where φ and Φ are the PDF and CDF of a standard normal. The integral can be calculated in closed form for n ≤ 5, but in general it requires numerical integration [1].
The following Python code can compute dn.
from scipy.stats import norm
from scipy.integrate import quad
import numpy as np
def d(n):
integrand = lambda x: x*norm.pdf(x)*norm.cdf(x)**(n-1)
res, info = quad(integrand, -np.inf, np.inf)
return 2*n*res
For large n we have the asymptotic approximation
which we could implement in Python by
def approx(n):
return 2*norm.ppf((n - 0.375)/(n + 0.25))
For very large n the asymptotic expression may be more accurate than the integral due to numerical integration error.
Here are a few example values.
|-----+-------| | n | d_n | |-----+-------| | 2 | 1.128 | | 3 | 1.693 | | 5 | 2.326 | | 10 | 3.078 | | 12 | 3.258 | | 23 | 3.858 | | 50 | 4.498 | | 100 | 5.015 | |-----+-------|
[1] Order Statistics by H. A. David. John Wiley & Sons. 1970.
The post Calculating the expected range of normal samples first appeared on John D. Cook.