MoreRSS

site iconJohn D. CookModify

I have decades of consulting experience helping companies solve complex problems involving applied math, statistics, and data privacy.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of John D. Cook

Turning K-L divergence into a metric

2026-05-28 09:35:54

Kullback-Leibler divergence

Kullback-Leibler divergence is defined for two random variables X and Y by

D_{KL}(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

K-L divergence is non-negative, and it’s zero if and only if X and Y have the same distribution. But it is not a metric, for reasons explained here. For one thing, it’s not symmetric.

Jeffreys divergence

We can fix the symmetry problem by defining

J(X, Y) = D_{KL}(X || Y)  + D_{KL}(Y || X)

The J above stands for Jeffreys, for Harold Jeffreys. J is called either the symmetrized K-L divergence or Jeffreys’ divergence. It’s still a divergence, not a distance.

A distance (metric) d has to have four properties:

  1. d(x, x) = 0
  2. d(xy) > 0 if xy
  3. d(xy) = d(yx)
  4. d(xz) ≤ d(xy) + d(yz)

K-L divergence satisfies the first two properties. Jeffreys’ divergence satisfies the first three, but not the last one, the triangle inequality.

To show that J doesn’t satisfy the triangle inequality, let X, Y, and Z be Bernoulli random variables with p equal to 0.1, 0.2, and 0.3 respectively. Then the following Python code shows that the divergence from X to Y, plus the divergence from Y to Z, is less than the divergence from X to Z. This would be like saying you could get from LA to NYC faster by having a layover in Denver rather than taking a direct flight.

from math import log

kl = lambda p, q: p*log(p/q) + (1-p)*log((1-p)/(1-q))
j  = lambda p, q: kl(p, q) + kl(q, p)

a = j(0.1, 0.2)
b = j(0.2, 0.3)
c = j(0.1, 0.3)
print(a + b, c)

This prints 0.135 and 0.270.

Jensen-Shannon distance

Jensen-Shannon distance turns K-L divergence into a metric as follows. First, define the random variable M to be the average of X and Y. Then average the K-L divergence from M to each of X and Y. This defines the Jensen-Shannon divergence. It’s still not a metric, but its square root is, which defines the Jensen-Shannon distance.

\begin{align*} M &= (X + Y)/2 \\ \text{JSD}(X, Y) &= \tfrac{1}{2}D_{KL}(X || M) + \tfrac{1}{2}D_{KL}(Y || M) \\ d_{JS}(X, Y) &= \sqrt{\text{JSD}(X, Y)} \end{align*}

The following code gives an example of Jensen-Shannon distance satisfying the triangle inequality.

def d(p, q):
    m = 0.5*(p + q)
    jsd = 0.5*kl(p, m) + 0.5*kl(q, m) 
    return jsd**0.5

a = d(0.1, 0.2)
b = d(0.2, 0.3)
c = d(0.1, 0.3)
print(a + b, c)

This prints 0.1817 and 0.1801. Now a layover makes the trip longer.

The post Turning K-L divergence into a metric first appeared on John D. Cook.

The Meta logo and fitting Besace curves

2026-05-27 23:15:51

I saw a post yesterday saying that the Meta logo is a Besace curve.

Meta logo

A Besace curve has the implicit form

(x^2 - by)^2 = a^2(x^2 - y^2)

and the parametric form

\begin{align*} x &= a\cos(t) - b \sin(t) \\ y &= -\sin(t) x\end{align*}

where t ranges over [0, 2π].

So given a Besace curve, such as the Meta logo, how do you find the parameters a and b to fit the curve?

We can rewrite the parametric expression for x as a sine with a phase shift (see notes here)

x = A \sin(t + \phi)

where

\begin{align*} A &= \sqrt{a^2 + b^2} \\ \phi &= -\arctan(a/b)\end{align*}

Also, we can rewrite the parametric expression for y as

\begin{align*} y &= A \sin(t) \sin(t + \phi) \\ &= \frac{A}{2} \left(\cos(\phi) - \cos(2t + \phi)\right) \\ \end{align*}

Now the extreme values of x and y are easier to see. The maximum value of x is A and the minimum value is −A. The maximum value of y is A(cos(φ) + 1)/2 and the minimum value is A(cos(φ) − 1)/2.

We can simplify the cosine of an artangent (see here) to find the height, i.e. the difference between the maximum and minimum y value, in terms of a and b.

\begin{align*} \cos(\phi) &= \cos(-\arctan(a/b)) \\ &= \frac{1}{\sqrt{1 + (a/b)^2}} \\ &= \frac{b}{\sqrt{a^2 + b^2}} \end{align*}

Then the height is given by

\begin{align*} h &= \frac{A}{2}(\cos(\phi) + 1) - \frac{A}{2}(\cos(\phi) - 1) \\ &= A \cos(\phi) + A \\ &= b + \sqrt{a^2 + b^2} \end{align*}

The width is given by

w = 2A = 2\sqrt{a^2 + b^2}

and so

b = h - w/2

and

a = \pm \sqrt{\frac{w^2}{4} - b^2}

Now the Meta logo is drawn with a thick line, and the line width isn’t constant. It’s a little fuzzy what the height and width of the middle of the curve are, but I estimated h = 120 and w = 200 from one image. This leads to b = 20 and a = 97.98.

The Mathematica code

ParametricPlot[{a Cos[t] + 
   b Sin[t], -Sin[t] ( a Cos[t] + b Sin[t])}, {t, 0, 2 Pi}, 
 PlotStyle -> Thickness[0.05]]

produces the following image.

Mathematica approximation of Meta logo

This is reminiscent of the Meta logo, but not a great match. I suspect the logo is not exactly a Besace curve. You could tinker with the a and b parameters and the aspect ratio to get a closer match. The logo may have been inspired by a Besace curve and then drawn by hand.

The post The Meta logo and fitting Besace curves first appeared on John D. Cook.

Calculating the expected range of normal samples

2026-05-27 01:59:53

The previous post looked at the expected IQ range in a jury of 12. This post will look more generally at computing the expected range of n samples from a N(0, 1) random variable. This will give the expected range in units of σ, i.e. multiply the results by σ if your σ isn’t 1.

As mentioned in the previous post, the expected range is given by

d_n = 2n \int_{-\infty}^\infty \Phi(x)^{n-1} \, x\,\phi(x) \, dx

where φ and Φ are the PDF and CDF of a standard normal. The integral can be calculated in closed form for n ≤ 5, but in general it requires numerical integration [1].

The following Python code can compute dn.

from scipy.stats import norm
from scipy.integrate import quad
import numpy as np

def d(n):
    integrand = lambda x: x*norm.pdf(x)*norm.cdf(x)**(n-1)
    res, info = quad(integrand, -np.inf, np.inf)
    return 2*n*res

For large n we have the asymptotic approximation

d_n = 2 \Phi^{-1}\left( \frac{n \,–\, 0.375}{ n + 0.25} \right)

which we could implement in Python by

def approx(n):
    return 2*norm.ppf((n - 0.375)/(n + 0.25))

For very large n the asymptotic expression may be more accurate than the integral due to numerical integration error.

Here are a few example values.

|-----+-------|
|   n |   d_n |
|-----+-------|
|   2 | 1.128 |
|   3 | 1.693 |
|   5 | 2.326 |
|  10 | 3.078 |
|  12 | 3.258 |
|  23 | 3.858 |
|  50 | 4.498 |
| 100 | 5.015 |
|-----+-------|

[1] Order Statistics by H. A. David. John Wiley & Sons. 1970.

The post Calculating the expected range of normal samples first appeared on John D. Cook.

Expected IQ spread on a jury

2026-05-26 21:50:51

There’s been some discussion online lately about how a large difference in IQ makes it difficult for two people to communicate. There have been studies that confirm this effect. The difficulty is not insurmountable, but it takes deliberate effort to overcome.

Someone dismissed this communication difficulty by pointing out that the expected difference in IQ between two individuals is around 17, suggesting that most communication is between people who differ by more than one standard deviation in IQ. But this calculation assumes people are chosen at random, which they usually are not. People tend to live around and work around others of similar intelligence.

However, a jury is a random sample. It’s not a perfect random sample. For one thing, it starts with a random sample of people who are registered to vote, or who have a drivers license, not all individuals. Furthermore, the pool of potential jurors is reduced to a jury through the process of voir dire, which is not random.

For this post I will make the simplifying assumption that a jury is a random sample from a population with normally distributed IQ with standard deviation σ = 15. The mean doesn’t matter here, but you could assume it’s 100 if you’d like.

By symmetry, the expected range of n samples from a normal random variable is twice the maximum. For n = 12 the range is about 3.26σ, which corresponds to nearly 50 IQ points.

This suggests there’s usually a big spread of IQ on a jury. Even if IQ doesn’t measure intelligence, it measures something, and that something varies a lot over 12 people chosen at random [1].

Related posts

[1] In case you’re interested in the technical details, the expected range of n samples from a standard normal random variable is given by

d_n = 2n \int_{-\infty}^\infty \Phi(x)^{n-1} \, x\,\phi(x) \, dx

where φ and Φ are the PDF and CDF of a standard normal. Multiply this by σ to get the range of a normal random variable with standard deviation σ. As for how to calculate dn, see the next post.

The post Expected IQ spread on a jury first appeared on John D. Cook.

Hilbert transform as an infinite matrix

2026-05-23 23:09:34

The previous post linked to a post I wrote a few years ago about the Hilbert transform and Fourier series. That post says that if the Fourier series of a function is

f(t) = \sum_{n=1}^\infty \left\{ a_n \sin(nt) + b_n\cos(nt) \right\}

then the Fourier series of its Hilbert transform is

f_H(x) = \sum_{n=1}^\infty \left\{ -b_n \sin(nx) + a_n\cos(nx) \right\}

When I looked back at that post I thought about how if you thought of the Fourier coefficients as elements of an infinite vector then the Hilbert transform can be represented as multiplying by an infinite block matrix.

\left[ \begin{array}{cc|cc|cc|c} 0 & -1 & 0 & 0 & 0 & 0 & \cdots \\ 1 & 0 & 0 & 0 & 0 & 0 & \cdots \\ \hline 0 & 0 & 0 & -1 & 0 & 0 & \cdots \\ 0 & 0 & 1 & 0 & 0 & 0 & \cdots \\ \hline 0 & 0 & 0 & 0 & 0 & -1 & \cdots \\ 0 & 0 & 0 & 0 & 1 & 0 & \cdots \\ \hline \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{array} \right] \left[ \begin{array}{c} a_1 \\ b_1 \\ \hline a_2 \\ b_2 \\ \hline a_3 \\ b_3 \\ \hline \vdots \end{array} \right]

I rarely see infinite matrices except in older math books. Apparently they were more fashionable a few decades ago than they are now. I suppose the notation falls between two stools, too concrete for some tastes and not concrete enough for others. The former folks would prefer something like H and the latter would prefer the sum above.

The post Hilbert transform as an infinite matrix first appeared on John D. Cook.

Real and imaginary parts

2026-05-23 21:57:21

The previous post announced some notes I wrote up based on an article by Henry Baker implementing functions of a complex variable in terms of functions of a real variable. That is, it finds functions u(x, y) and v(x, y) such that

f(x + iy) = u(x, y) + i v(x, y)

where xyu, and v are all real-valued. Not only that, but if f is an elementary function, so are u and v. Here “elementary” has a technical meaning, but essentially it means functions that you could evaluate on a scientific calculator. A couple of the functions might be unfamiliar, such as sgn and atan2, but there are no functions like the gamma function that are defined in terms of integrals.

One application of Baker’s equations would be to bootstrap a math library that doesn’t support complex numbers into one that does. But the equations could be useful in pure math when you’d like to have a convenient expression for the real or imaginary part of a function.

The real and imaginary parts of a complex analytic function are harmonic functions. So the functions on the right hand side of Baker’s equations satisfy Laplace’s equation:

uxx + uyy = 0

and

vxx + vyy = 0.

Furthermore, the functions u and v form harmonic conjugate pairs, meaning each is the Hilbert transform of the other.

The post Real and imaginary parts first appeared on John D. Cook.