2025-06-05 08:00:00
Tea is a little-known beverage, consumed for flavor or sometimes for conjectured effects as a stimulant. It’s made by submerging the leaves of C. sinensis in hot water. But how hot should the water be?
To resolve this, I brewed the same tea at four different temperatures, brought them all to a uniform serving temperature, and then had four subjects rate them along four dimensions.
Subject A is an experienced tea drinker, exclusively of black tea w/ lots of milk and sugar.
Subject B is also an experienced tea drinker, mostly of black tea w/ lots of milk and sugar. In recent years, Subject B has been pressured by Subject D to try other teas. Subject B likes fancy black tea and claims to like fancy oolong, but will not drink green tea.
Subject C is similar to Subject A.
Subject D likes all kinds of tea, derives a large fraction of their joy in life from tea, and is the world’s preeminent existential angst + science blogger.
For a tea that was as “normal” as possible, I used pyramidal bags of PG Tips tea (Lipton Teas and Infusions, Trafford Park Rd., Trafford Park, Stretford, Manchester M17 1NH, UK).
I brewed it according to the instructions on the box, by submerging one bag in 250ml of water for 2.5 minutes. I did four brews with water at temperatures ranging from 79°C to 100°C (174.2°F to 212°F). To keep the temperature roughly constant while brewing, I did it in a Pyrex measuring cup (Corning Inc., 1 Riverfront Plaza, Corning, New York, 14831, USA) sitting in a pan of hot water on the stove.
After brewing, I poured the tea into four identical mugs with the brew temperature written on the bottom with a Sharpie Pro marker (Newell Brands, 5 Concourse Pkwy Atlanta, GA 30328, USA). Readers interested in replicating this experiment may note that those written temperatures still persist on the mugs today, three months later. The cups were dark red, making it impossible to see any difference in the teas.
After brewing, I put all the mugs in a pan of hot water until they converged to 80°C, so they were served at the same temperature.
I shuffled the mugs and placed them on a table in a random order. I then asked the subjects to taste from each mug and rate the teas for aroma, flavor, strength, and goodness.
Each rating was to be on a 1-5 scale, with 1=bad and 5=good.
Subjects A, B, and C had no knowledge of how the different teas were brewed. Subject D was aware, but was blinded as to which tea was in which mug.
During taste evaluation, Subjects A and C remorselessly pestered Subject D with questions about how a tea strength can be “good” or “bad”. Subject D rejected these questions on the grounds that “good” cannot be meaningfully reduced to other words and urged Subjects A and C to review Wittgenstein’s concept of meaning as use, etc. Subject B questioned the value of these discussions.
After ratings were complete, I poured tea out of all the cups until 100 ml remained in each, added around 1 gram (1/4 tsp) of sugar, and heated them back up to 80°C. I then re-shuffled the cups and presented them for a second round of ratings.
For a single summary, I somewhat arbitrarily combined the four ratings into a “quality” score, defined as
(Quality) = 0.1 × (Aroma) + 0.3 × (Flavor) + 0.1 × (Strength) + 0.5 × (Goodness).
Here is the data for Subject A, along with a linear fit for quality as a function of brewing temperature. Broadly speaking, A liked everything, but showed weak evidence of any trend.
And here is the same for Subject B, who apparently hated everything.
Here is the same for Subject C, who liked everything, but showed very weak evidence of any trend.
And here is the same for Subject D. This shows extremely strong evidence of a negative trend. But, again, while blinded to the order, this subject was aware of the brewing protocol.
Finally, here are the results combining data from all subjects. This shows a mild trend, driven mostly by Subject D.
This experiment provides very weak evidence that you might be brewing your tea too hot. Mostly, it just proves that Subject D thinks lower-middle tier black tea tastes better when brewed cooler. I already knew that.
There are a lot of other dimensions to explore, such as the type of tea, the brew time, the amount of tea, and the serving temperature. I think that ideally, I’d randomize all those dimensions, gather a large sample, and then fit some kind of regression.
Creating dozens of different brews and then serving them all blinded at different serving temperatures sounds like way too much work. Maybe there’s an easier way to go about this? Can someone build me a robot?
If you thirst to see Subject C’s raw aroma scores or whatever, you can download the data or click on one of the entries in this table:
Subject | Aroma | Flavor | Strength | Goodness | Quality |
---|---|---|---|---|---|
A | x | x | x | x | x |
B | x | x | x | x | x |
C | x | x | x | x | x |
D | x | x | x | x | x |
All | x | x | x | x | x |
Subject D was really good at this; why can’t everyone be like Subject D?
2025-05-29 08:00:00
A lot of writing advice seems to amount to:
I start by having verbal intelligence that’s six standard deviations above the population mean. I find that this is really helpful! Also, here are some tips on spelling and how I cope with the never-ending adoration.
I think this leaves space for advice from people with less godlike levels of natural talent, e.g. your friend dynomight.
Here it is: Make something you would actually like.
Actually, let me bold the important words: Make something you would actually like.
Why make something you would like? To be clear, I’m not suggesting you write “for yourself”. I assume that your terminal goal is to make something other people like.
But try this experiment: Go write a few thousand words and give them to someone who loves you. Now, go through paragraph-by-paragraph and try to predict what was going through their head while reading. It’s impossible. I tell you, it cannot be done!
Personally, I think this is because nobody really understands anyone else. (I recently discovered that my mother secretly hates tomatoes.) If you try to make something that other people would like, rather than yourself, you’ll likely just end up with something no one likes.
The good news is that none of us are that unique. If you like it, then I guarantee you that lots of others will too. It’s a big internet.
Most decisions follow from this principle. Should your thing be long and breezy or short and to the point? Should you start with an attention-grabbing story? Should you put your main conclusion upfront? How formal should you be? Should you tell personal stories? I think the answer is: Do whatever you would like, if you were the reader.
Sometimes people ask me why this blog is so weird. The answer is simple: I like weird. I wish other blogs had more personality, and I can’t understand why everyone seems to put so much effort into being generic. Since I don’t understand weirdness-hating, I don’t think I have any chance of making weirdness-haters happy. If I tried to be non-weird, I think I’d be bad non-weird.
This is also why I blog rather than making videos or podcasts or whatever. I like blogs. I can’t stand videos, so I don’t think I’d be any good at making them. Everyone, please stop asking me to watch videos.
Now, why make something you would actually like? In short, because your brain is designed to lie to you. One way it lies is by telling you your stuff is amazing even when it isn’t. To be more precise, this often happens:
Probably our brains lie to us for good reasons. Probably it’s good that we think we’re better looking than we are, because that makes us more confident and effectiveness beats accuracy, etc. But while it’s hard to improve your looks, it’s easy to improve your writing. At least, it would be if you could see it for what it is.
Your brain also lies to you by telling you your writing is clear. When you write, you take some complex network of ideas that lives in your brain and compile it into a linear sequence of words. Other people only see those words.
There’s no simple formula for avoiding either of these. But try to resist.
I don’t know how to explain this, but I think it’s very important: You should be your reader’s ally. Or, if you like, their consigliere.
As a simple example, why is the word “consigliere” up there a link? Well, not everyone knows what that means. But I don’t like having a link, because it sort of makes me look stupid, like I just learned that word last week. But I’m on your side, goddamnit.
As another example, many people wonder how confident their tone should be. I think your confidence should reflect whatever you actually believe. Lots of people pick a conclusion and dismiss all conflicting evidence. Obviously, that does not treat the reader as an ally. But at the same time, after you’ve given a fair view of the evidence, if you think there’s a clear answer, admit it. Your readers want to know! Compare these three styles:

1. “X is obviously true, and anyone who says otherwise is a fool.”
2. “The evidence is genuinely mixed, but on balance I think X is a bit more likely.”
3. “The evidence is mixed. Who can say?”
Number 1 is dishonest. But arguably number 3 is also dishonest. Treat your reader like an ally, who wants all the information, including your opinion.
I want to write a post called “measles”. The idea is to look into why it declined when it did, what risks measles vaccines might pose, and what would happen if people stopped getting vaccinated.
That’s the whole idea. I have nothing else and don’t know the answers. Yet I’m pretty sure this post would be good, just because when I tried to find answers, none of the scientifically credible sources treated me like an ally. Instead, they seemed to regard me as a complete idiot, who can’t be trusted with any information that might lead to the “wrong” conclusion.
If you want that idea, take it! That would make me happy, because I have hundreds of such ideas, and I won’t live long enough to write more than a tiny fraction of them. Almost all the value is in the execution.
Some people worry about running out of ideas. I swear this is impossible. The more you write, the easier ideas are to find.
One reason is that when you write, you learn stuff. This qualifies you to write about more things and reveals more of the world’s fractal complexity. Also, experience makes it much easier to recognize ideas that would translate into good posts, but only makes it slightly easier to execute on those ideas. So the ideas pile up, at an ever-accelerating pace.
If you really run out of ideas, just take one of your old ones and do it again, better. It’s fine, I promise.
The obvious antidote for your lying brain is feedback from other people. But this is tricky. For one thing, the people who love you enough to read your drafts may not be in your target audience. If they wouldn’t read it voluntarily, you probably don’t want to optimize too hard for them.
It’s also hard to get people to give negative feedback. I sometimes ask people to mark each paragraph according to a modified CRIBS system as either Confusing, Repetitive, Interesting, Boring or Shorten. I also like to ask, “If you had to cut 25%, what would you pick?”
People are better at finding problems than at giving solutions. If they didn’t understand you, how could they tell you what to change? It’s usually best if you propose a change, and then ask them if that fixes the problem.
Also, remember that people can only read something for the first time once. Also, do not explain your idea to people before they read. Make them go in blind.
(If you’re working with professional editors, none of this applies.)
You should probably edit, a lot. Some people with Godlike Talent don’t edit. But the rest of us should.
One way to edit is to leave your writing alone for a week or two. This gives you back some of the perspective of a new reader, and makes it emotionally easier to delete stuff.
Here’s another exercise: Take your thing and print it out. Then, go through and circle the “good parts”. Then, delete everything else. If absolutely necessary, bring back other stuff to connect the good parts. But are you sure you need to do that?
I think you fuckers all take yourselves too seriously.
There might be some Bourdieu-esque cultural capital thing with humor. Maybe making jokes while discussing serious ideas is a kind of countersignaling, like a billionaire wearing sneakers. Maybe it’s a way of saying, “Look at me, I can act casual and people still take me seriously! Clearly I am a big deal!” If you look at it that way, it seems kind of gross. But I wouldn’t worry about it, because Bourdieu-esque countersignaling makes everything seem gross. If you like humor, do humor.
My one insight is that humor needs to be “worth it”. Very short jokes are funny, even when they’re not very funny. For example, my use of “fuckers” up there wasn’t very funny. But it was funny (to me) because it’s just a single word. Except it’s crude, so maybe it wasn’t funny? Except, hey look, now I’m using it to illustrate a larger idea, so it wasn’t pointlessly crude. Thus, it was funny after all. Q.E.D.
Behold the Dynomight funniness formula:
(Actual funniness) = (Baseline funniness) / (Cost of joke)
The “cost” measures how distracting the joke is. This includes the length, but also the topic. If you’re writing now in 2025, a joke about Donald Trump has to be much funnier than a joke about, say, Lee Kuan Yew.
Increasing baseline funniness is hard. But decreasing “cost” is often easy. If in doubt, decrease the denominator.
In real life, very few people can tell jokes with punchlines. But lots of people can tell funny stories. I think that’s because in stories, the jokes come on top of something that’s independently interesting. If a joke with a punchline bombs, it’s very awkward. If a funny aside in a story fails, people might not even notice a joke was attempted. The same is all true for writing.
Most people who write stuff hope that other people will read it. So how does that work nowadays? Discussing this feels déclassé, but I am your ally and I thought you’d want to know.
You might imagine some large group of people who are eagerly looking for more blogs: People who, if they see something good, will immediately check the archives and/or subscribe. I am like that. You might be like that. But such people are very rare. I know many bloggers who put aggressive subscribe buttons everywhere but, if pressed, admit they never subscribe to anything. This is less true for blogs devoted to money, politics, or culture war. But it’s extra-double true for generalist blogs.
If you’ve grown up with social media, you might imagine that your stuff will “go viral”. This too is rare, particularly if your post isn’t related to money, politics, culture war, or stupid bullshit. And even if something does blow up, people do not go on social media looking for long feeds to read.
I recently had a post that supposedly got 210k views on twitter. Of those, twitter showed the reply with the link to my post to 9,882 people (4.7%). Of those, 1,655 clicked on the link (0.79% of the original 210k). How many read the whole thing? One in ten? And how many of those subscribed? One in twenty? We’re now down to a number you can count on your fingers.
There are some places where people do go to find long things to read. When my posts are on Hacker News, this usually leads to several thousand views. But I think the median number who subscribe as a result is: Zero. Most people who find stuff via Hacker News like finding via Hacker News.
I’m not complaining, mind you. Theoretically, the readers of this blog could fill a small stadium. It’s nothing compared to popular writers, but it feels like an outrageous number to me. But I’ve been at this for more than five years, and I’ve written—gulp—186 posts. It wasn’t, like, easy.
Special offer: If you want me to subscribe to your blog, put the URL for the RSS feed in the comments to this post, and I will subscribe and read some possibly nonzero fraction of your posts. (Don’t be proud, if you want me to subscribe, I’m asking you to do it.)
Most people are pretty chill. They read stuff because they hope to get something out of it. If that doesn’t happen, they’ll stop reading and go on with their lives. But on any given day, some people will be in a bad mood and something you write will trigger something in them, and they will say you are dumb and bad.
You cannot let the haters get you down. It’s not just an issue of emotions. If the haters bother you, you may find yourself writing for them rather than for your allies. No!
For example, say you’ve decided that schools should stop serving lunch. (Why? I don’t know why.) When making your case, you may find yourself tempted to add little asides like, “To be clear, this doesn’t mean I hate kids. Children are the future!” My question is, who is that for? Is it for your ally, the reader who likes you and wants to know what you think? Or is it for yourself, to protect you from the haters?
This kind of “defensive” writing gets tiring very quickly. Your allies probably do not want or need very much of it, so keep it in check.
Also, I think defensiveness often just makes the problem worse. The slightly-contrarian friend is hated much more than the adversary. If you write “I think {environmentalism, feminism, abortion, Christianity} is bad”, people will mostly just think, “huh.” But if you write, “I am totally in favor of {environmentalism, feminism, abortion, Christianity}! I am one of the good people! I just have a couple very small concerns…”, people tend to think, “Heretic! Burn the witch!” Best to just leave your personal virtue out of it.
Much the same goes for other clarifications. Clear writing is next to godliness. But the optimal number of confused readers is not zero. If you try to chase down every possible misinterpretation, your writing will become very heavy and boring.
It’s probably human nature to be upset when people say mean things about you. We’re designed for small tribal bands. For better or worse, people who persist in internet writing tend to be exceptionally self-confident and/or thick-skinned.
If you’d like some help being more thick-skinned, remember that people who have negative reasons are much more likely to respond than people who have positive ones. (If you think something is perfect, what is there to say?)
Also, I strongly suggest you read comments for some posts you think are good. For example, here are some comments for Cold Takes’s legendary Does X cause Y? An in-depth evidence review. I think the comments are terrible, in both senses. They’re premised on the idea that because the author doesn’t use fancy statistical jargon, they must be statistically illiterate. But if the post tried to make those people happy, it would be worse.
Finally, there are whole communities devoted to sneering at other people. They just can’t stand the idea of people writing blogs and exploring weird ideas. This really bothers some writers. Personally, I wonder what they have going on in their lives that that’s how they’d spend their time.
Should you use AI?
I think you should not. If you secretly use AI, you are not treating the reader as your ally. If you openly use AI, nobody will read it. The end.
Also, I think AI is currently still quite bad at writing compared to a skilled human. (Currently.) It’s great at explaining well-understood facts. But for subjects that are “hard”, with sprawling / tangled / contradictory evidence, it still mostly just regurgitates the abstracts with a confident tone and minimal skepticism. You can do better.
That nagging feeling.
Often, I’m writing something and there will be one section that I can’t figure out how to write. I’ll move the paragraphs around, re-write it from scratch several times, and something always feels off. Eventually, I’ll realize that it isn’t a writing problem, it’s an ideas problem. What I need to do is change my conclusion, or re-organize the whole post.
It’s always tempting to ignore that feeling. Everything else is already in place. But if you do that, you’ll be throwing away one of the best parts of writing—how it helps you think.
Use the correct amount of formatting.
In the long-long ago, writing was paragraph after paragraph. At some point, we decided that was boring, and started adding more “formatting”, like section titles and lists and tables and figure captions, etc.
I think we’ve now reached the point where it’s common to use too much formatting. Some people go crazy and create writing that’s almost all formatting. This is disorienting for readers, and I think it often reflects writers being afraid to do the hard work of writing paragraphs that make sense.
If in doubt, I think it’s best to start with less formatting, to make sure your paragraphs are the “cake” and the formatting is just “icing”.
Explain why you already believe it.
Often, people want to make some argument, and they find themselves mired in endless amounts of new research. But hold on. If you’re trying to make some argument, then you already believe it, right? Why is that? Either you have good reasons, or you don’t.
If it’s the former, you don’t need to do new research. Just mentally switch from trying to explain “why it is true” to explaining why you already believe it. Give your ally your true level of confidence. If it’s the latter, stop believing stuff without good reasons!
How to write.
I don’t think it’s possible to say much that’s useful about this. Giving advice about writing is like giving advice about how to hit a golf ball, or socialize at parties, or do rock climbing. You learn by doing.
Different people follow different processes. Some write slowly from beginning to end. Some write quick disorganized drafts and edit endlessly. Some make outlines, some don’t. Some write on paper, some with a computer. Some write in the morning, some write at night. Do whatever you want. Just do a lot of it.
Stop when it’s not getting better.
When you’re writing something, you’re too close to judge if it’s good or not. Don’t think about it. Just try to make it as good as possible. At some point, you’ll find you can’t improve it anymore. At that point, you are done.
2025-05-22 08:00:00
What I want from an array language is:
I say NumPy misses on three of these. So I’d like to propose a “fix” that—I claim—eliminates 90% of unnecessary thinking, with no loss of power. It would also fix all the things based on NumPy, for example every machine learning library.
I know that sounds grandiose. Quite possibly you’re thinking that good-old dynomight has finally lost it. So I warn you now: My solution is utterly non-clever. If anything is clever here, it’s my single-minded rejection of cleverness.
To motivate the fix, let me give my story for how NumPy went wrong. It started as a nice little library for array operations and linear algebra. When everything has two or fewer dimensions, it’s great. But at some point, someone showed up with some higher-dimensional arrays. If loops were fast in Python, NumPy would have said, “Hello person with ≥3 dimensions, please call my ≤2 dimensional functions in a loop so I can stay nice and simple, xox, NumPy.”
But since loops are slow, NumPy instead took all the complexity that would usually be addressed with loops and pushed it down into individual functions. I think this was a disaster, because every time you see some function call like np.func(A,B), you have to think:

1. What shapes do A and B have?
2. What does np.func do when it sees those shapes?

Different functions have different rules. Sometimes they’re bewildering. This means constantly thinking and constantly moving dimensions around to appease the whims of particular functions. It’s the functions that should be appeasing your whims!
Even simple-looking things like A*B or A[B,C] do quite different things depending on the starting shapes. And those starting shapes are often themselves the output of previous functions, so the complexity spirals.
Worst of all, if you write a new ≤2 dimensional function, then high-dimensional arrays are your problem. You need to decide what rules to obey, and then you need to re-write your function in a much more complex way to—
Voice from the back: Python sucks! If you used a real language, loops would be fast! This problem is stupid!
That was a strong argument, ten years ago. But now everything is GPU, and GPUs hate loops. Today, array packages are cheerful interfaces that look like Python (or whatever) but are actually embedded languages that secretly compile everything into special GPU instructions that run on whole arrays in parallel. With big arrays, you need GPUs. So I think the speed of the host language doesn’t matter so much anymore.
Python’s slowness may have paradoxically turned out to be an advantage, since it forced everything to be designed to work without loops even before GPUs took over.
Still, thinking is bad, and NumPy makes me think, so I don’t like NumPy.
Here’s my extremely non-clever idea: Let’s just admit that loops were better. In high dimensions, no one has yet come up with a notation that beats loops and indices. So, let’s do this:

- Write code as if you were using loops and indices.
- Don’t actually execute any loops. Instead, quietly translate the index notation into vectorized operations that run in parallel (and on GPUs).
- Remove the broadcasting and vectorization conventions that only exist because loops weren’t allowed.
That’s basically the whole idea. If you take those three bullet-points, you could probably re-derive everything I do below. I told you this wasn’t clever.
Suppose that X and Y are 2D arrays, and A is a 4D array. And suppose you want to find a 2D array Z such that Zᵢⱼ = (Yⱼ)ᵀ (Aᵢⱼ)⁻¹ Xᵢ. If you could write loops, this would be easy:
import numpy as np
Z = np.empty((X.shape[0], Y.shape[0]))
for i in range(X.shape[0]):
for j in range(Y.shape[0]):
Z[i,j] = Y[j] @ np.linalg.solve(A[i,j], X[i])
That’s not pretty. It’s not short or fast. But it is easy!
Meanwhile, how do you do this efficiently in NumPy? Like this:
AiX = np.linalg.solve(A.transpose(1,0,2,3),
X[None,...,None])[...,0]
Z = np.sum(AiX * Y[:,None], axis=-1).T
If you’re not a NumPy otaku, that may look like outsider art. Rest assured, it looks like that to me too, and I just wrote it. Why is it so confusing? At a high level, it’s because np.linalg.solve and np.sum and multiplication (*) have complicated rules and weren’t designed to work together to solve this particular problem nicely. That would be impossible, because there are an infinite number of problems. So you need to mash the arrays around a lot to make those functions happy.
Without further ado, here’s how you solve this problem with DumPy (ostensibly Dynomight NumPy):
import dumpy as dp
A = dp.Array(A)
X = dp.Array(X)
Y = dp.Array(Y)
Z = dp.Slot()
Z['i','j'] = Y['j',:] @ dp.linalg.solve(A['i','j',:,:], X['i',:])
Yes! If you prefer, you can also use this equivalent syntax:
Z = dp.Slot()
with dp.Range(X.shape[0]) as i:
with dp.Range(Y.shape[0]) as j:
Z[i,j] = Y[j,:] @ dp.linalg.solve(A[i,j,:,:], X[i,:])
Those are both fully vectorized. No loops are executed behind the scenes. They’ll run on a GPU if you have one.
While it looks magical, the way this actually works is fairly simple:
1. If you index a DumPy array with a string (or a dp.Range object), it creates a special “mapped” array that pretends to have fewer dimensions.
2. When a DumPy function is called (e.g. dp.linalg.solve or dp.matmul, which is what @ calls), it checks if any of the arguments have mapped dimensions. If so, it automatically vectorizes the computation, matching up mapped dimensions that share labels.
3. When you assign an array with mapped dimensions to a dp.Slot, it “unmaps” them into the positions you specify.
No evil meta-programming abstract syntax tree macro bytecode interception is needed. When you run this code:
Z = dp.Slot()
Z['i','j'] = Y['j',:] @ dp.linalg.solve(A['i','j',:,:], X['i',:])
This is what happens behind the scenes:
a = A.map_axes([0, 1], ['i', 'j'])
x = X.map_axes([0], ['i'])
y = Y.map_axes([0], ['j'])
z = y @ dp.linalg.solve(a, x)
Z = z.unmap('i','j')
In more detail, here is the same thing again, with assertions showing what holds at each step:
# first map A
a = A.map_axes([0, 1], ['i', 'j'])
assert A.ndim == 4
assert a.ndim == 2 # pretends to have fewer dims
assert a.data.shape == A.shape # secret mapped data
assert a.axes == ('i', 'j', None, None) # secret mapped axes
assert a.shape == (a.data.shape[2], a.data.shape[3])
# shape determined by non-mapped (None) axes
# now map X
x = X.map_axes([0], ['i'])
assert X.ndim == 2
assert x.ndim == 1
assert x.data.shape == X.shape
assert x.axes == ('i', None)
assert x.shape == (x.data.shape[1], )
# now map Y
y = Y.map_axes([0], ['j'])
assert Y.ndim == 2
assert y.ndim == 1
assert y.shape == (Y.shape[1],)
assert y.axes == ('j', None)
assert y.data.shape == Y.shape
assert y.shape == (y.data.shape[1],)
# Actually do the computation. It happens that the 'j'
# dimension is stored first because it's found first (in y).
# But you never need to think about that!
z = y @ dp.linalg.solve(a, x)
assert z.ndim == 0
assert z.shape == ()
assert z.axes == ('j','i')
assert z.data.shape == (Y.shape[0], X.shape[0])
# unmap the mapped axes
Z = z.unmap('i','j')
assert Z.ndim == 2
assert Z.shape == (X.shape[0], Y.shape[0])
It might seem like I’ve skipped the hard part. How does dp.linalg.solve know how to vectorize over any combination of input dimensions? Don’t I need to do that for every single function that DumPy includes? Isn’t that hard?
It is hard, but jax.vmap did it already. This takes a function defined using (JAX’s version of) NumPy and vectorizes it over any set of input dimensions. DumPy relies on this to do all the actual vectorization. (If you prefer your vmap janky and broken, I heartily recommend PyTorch’s torch.vmap.)
But hold on. If vmap already exists, then why do we need DumPy? Here’s why:
import jax
from jax import numpy as jnp
Z = jax.vmap(
jax.vmap(
lambda x, y, a: y @ jnp.linalg.solve(a, x),
in_axes=[None, 0, 0]
),
in_axes=[0, None, 0]
)(X, Y, A)
That’s how you solve the same problem with vmap. (And basically what DumPy does behind the scenes.)
I think vmap is one of the best parts of the NumPy ecosystem. The above code seems genuinely better than the base NumPy version. But it still involves a lot of thinking! Why put in_axes=[None, 0, 0] in the inner vmap and in_axes=[0, None, 0] in the outer one? Why are all the axes 0, even though you need to vectorize over the second dimension of A? There are answers, but they require thinking. Loops and indices are better.
OK, I did do one thing that’s a little clever. Say you want to create a Hilbert matrix with Hᵢⱼ = 1/(1+i+j). In base NumPy you’d have to do this:
X = 1 / (1 + np.arange(5)[:,None] + np.arange(5)[None,:]) # hurr?
In DumPy, you can just write:
X = dp.Slot()
with dp.Range(5) as i:
with dp.Range(5) as j:
X[i,j] = 1 / (i + j + 1)
Yes! That works! It works because a dp.Range acts both like a string and like an array mapped along that string. So the above code is roughly equivalent to:
I = dp.Array([0,1,2,3,4])
J = dp.Array([0,1,2,3,4])
X = dp.Slot()
X['i','j'] = 1 / (1 + I['i'] + J['j'])
In reality, the dp.Range objects choose random strings. (The class maintains a stack of active ranges to prevent collisions.) So in more detail, the above code becomes something like this:
I = dp.Array([0,1,2,3,4])
J = dp.Array([0,1,2,3,4])
i = I.map_axes([0],'range_EZaW')
j = J.map_axes([0],'range_ailw')
x = 1 / (1 + i + j) # vectorized
X = x.unmap('range_EZaW','range_ailw')
To test if DumPy is actually better in practice, I took six problems of increasing complexity and implemented each of them using loops, NumPy, JAX (with vmap), and DumPy.
Note that in these examples, I always assume the input arrays are in the class of the system being used. If you try running them, you’ll need to add some conversions with np.array / jnp.array / dp.Array.
# loops
H = np.empty((N, N))
for i in range(N):
for j in range(N):
H[i, j] = 1 / (i + j + 1)
# NumPy
i = np.arange(N)
j = np.arange(N)
H = 1 / (i[:, None] + j[None, :] + 1)
# JAX
indices = jnp.arange(N)
H = jax.vmap(
jax.vmap(
lambda i, j: 1 / (i + j + 1),
[0, None]),
[None, 0]
)(indices, indices)
# DumPy
H = dp.Slot()
with dp.Range(N) as i:
with dp.Range(N) as j:
H[i, j] = 1 / (i + j + 1) # Yes! This works!
# Loops
C = np.zeros((X.shape[0],X.shape[1],X.shape[1]))
for n in range(X.shape[0]):
C[n] = np.cov(X[n])
# NumPy
mu = np.mean(X, axis=2)[:, :, None] # hurr?
C = np.sum((X - mu)[:, None, :, :] *
(X - mu)[:, :, None, :],
axis=3) / (X.shape[2] - 1) # hurrr??
# JAX
C = jax.vmap(jnp.cov)(X)
# DumPy
C = dp.Slot()
with dp.Range(X.shape[0]) as n:
C[n, :, :] = dp.cov(X[n, :, :])
# DumPy (alternate)
C = dp.Slot()
C['n',:,:] = dp.cov(X['n',:,:])
(Pretending numpy.lib.stride_tricks doesn’t exist.)
# Loops
B = np.empty(N - window + 1)
for i in range(N - window + 1):
B[i] = np.mean(A[i:i + window])
# NumPy
i = np.arange(N - window + 1)[:, None]
j = np.arange(window)[None, :]
B = np.mean(A[i+j], axis=-1)
# JAX
idx = jnp.arange(window)
B = jax.vmap(
lambda i: jnp.mean(A[i+idx]),
)(jnp.arange(N - window + 1))
# DumPy
windowed = dp.Slot()
B = dp.Slot()
with dp.Range(N - window + 1) as i:
with dp.Range(window) as j:
windowed[i, j] = A[i + j]
B[i] = dp.mean(windowed[i, :])
# Note: B[i] = dp.mean(A[i:i+window])
# would not work because dp.Range can't be used in slice
The goal is to create E with E[i1, i2, :, i3] = A[B[i1], C[i1, i2], ::2, D[i2, i3]].
# Setup
K = 4
L = 5
M = 6
N = 7
A = np.random.randn(K, L, M, N)
B = np.random.randint(0, K, size=(9,))
C = np.random.randint(0, L, size=(9, 10))
D = np.random.randint(0, N, size=(10, 11))
# Loops
E = np.empty((9, 10, M // 2, 11))
for i1 in range(9):
for i2 in range(10):
for i3 in range(11):
E[i1,i2,:,i3] = A[B[i1],C[i1, i2],::2,D[i2, i3]]
# NumPy
E = A[B[:, None, None],
C[:, :, None],
::2,
D[None, :, :]
].transpose((0, 1, 3, 2))
# JAX
E = jax.vmap(
jax.vmap(
jax.vmap(
lambda b, c, d: A[b, c, ::2, d],
in_axes=[None,None,0]),
in_axes=[None,0,0]),
in_axes=[0,0,None]
)(B,C,D).transpose(0,1,3,2)
# DumPy
E = dp.Slot()
with dp.Range(9) as i1:
with dp.Range(10) as i2:
with dp.Range(11) as i3:
E[i1,i2,:,i3] = A[B[i1],C[i1, i2],::2,D[i2, i3]]
# DumPy (alternative)
E = dp.Slot()
E['i1','i2',:,'i3'] = A[B['i1'], C['i1','i2'], ::2, D['i2','i3']]
The goal of this problem is, given a list of vectors and a list of Gaussian parameters, and arrays mapping each vector to a list of parameters, evaluate each corresponding vector/parameter combination. Formally, given 2D X, B, C, and means and 3D covs, the goal is to create A with Aᵢⱼ = log N(Xᵢ | means[Bᵢⱼ], covs[Cᵢⱼ]).
# Setup
ndims = 3
ndata = 10
neval = 5
ndist = 7
X = np.random.randn(ndata, ndims)
B = np.random.randint(0, ndist, size=(ndata, neval))
C = np.random.randint(0, ndist, size=(ndata, neval))
means = np.random.randn(ndist, ndims)
scales = np.array(np.random.randn(ndist, ndims, ndims))
covs = np.array([scale @ scale.T for scale in scales])
# Loops
def log_prob(x, mean, cov):
diff = x - mean
y = np.linalg.solve(cov, diff)
quad = diff @ y
logdet = np.linalg.slogdet(2 * np.pi * cov)[1]
return -0.5 * (quad + logdet)
A = np.empty((ndata, neval))
for i in range(ndata):
for j in range(neval):
A[i, j] = log_prob(X[i, :],
means[B[i, j], :],
covs[C[i, j], :, :])
# NumPy
diff = X[:, None, :] - means[B]
y = np.linalg.solve(covs[C], diff[..., None])
quad = np.sum(diff * y[..., 0], axis=-1)
logdet = np.linalg.slogdet(2 * np.pi * covs[C])[1]
A = -0.5 * (quad + logdet)
# JAX
def log_prob_gauss(x, mean, cov):
    # same as log_prob above, but with jnp ops so it can be vmapped
    diff = x - mean
    quad = diff @ jnp.linalg.solve(cov, diff)
    logdet = jnp.linalg.slogdet(2 * jnp.pi * cov)[1]
    return -0.5 * (quad + logdet)
A = jax.vmap(
    jax.vmap(
        log_prob_gauss,
        in_axes=[None, 0, 0]
    ),
)(X, means[B], covs[C])
# DumPy
def log_prob(x, mean, cov):
    diff = x - mean
    quad = diff @ dp.linalg.solve(cov, diff)
    logdet = dp.linalg.slogdet(2 * jnp.pi * cov)[1]
    return -0.5 * (quad + logdet)
A = dp.Slot()
with dp.Range(ndata) as i:
with dp.Range(neval) as j:
A[i, j] = log_prob(X[i,:],
means[B[i,j],:],
covs[C[i,j],:,:])
# DumPy (alternate)
A = dp.Slot()
with dp.Range(ndata) as i:
with dp.Range(neval) as j:
mean = means[B[i,j],:]
cov = covs[C[i,j],:,:]
diff = X[i,:] - mean
quad = diff @ dp.linalg.solve(cov, diff)
logdet = dp.linalg.slogdet(2 * jnp.pi * cov)[1]
A[i,j] = -0.5 * (quad + logdet)
See also the discussion in the previous post.
# Setup
input_dim = 4
seq_len = 4
d_k = 5
d_v = input_dim
n_head = 2
X = np.random.randn(seq_len, input_dim)
W_q = np.random.randn(n_head, input_dim, d_k)
W_k = np.random.randn(n_head, input_dim, d_k)
W_v = np.random.randn(n_head, input_dim, d_v)
W_o = np.random.randn(n_head, d_v, input_dim // n_head)
# Loops
def softmax_numpy(x, axis=-1):
e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
return e_x / np.sum(e_x, axis=axis, keepdims=True)
def attention(X, W_q, W_k, W_v):
Q = X @ W_q
K = X @ W_k
V = X @ W_v
scores = Q @ K.T / np.sqrt(d_k)
attention_weights = softmax_numpy(scores, axis=-1)
output = attention_weights @ V
return output
def multi_head_attention_loops(X, W_q, W_k, W_v, W_o):
projected = []
for n in range(n_head):
my_output = attention(X,
W_q[n, :, :],
W_k[n, :, :],
W_v[n, :, :])
my_proj = my_output @ W_o[n, :, :]
projected.append(my_proj)
projected = np.array(projected)
final = []
for i in range(seq_len):
my_final = np.ravel(projected[:, i, :])
final.append(my_final)
return np.array(final)
# NumPy
def softmax_numpy(x, axis=-1): # repeat
e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
return e_x / np.sum(e_x, axis=axis, keepdims=True)
def multi_head_attention_numpy(X, W_q, W_k, W_v, W_o):
Q = np.einsum('si,hij->hsj', X, W_q)
K = np.einsum('si,hik->hsk', X, W_k)
V = np.einsum('si,hiv->hsv', X, W_v)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
weights = softmax_numpy(scores, axis=-1)
output = weights @ V
projected = np.einsum('hsv,hvd->hsd', output, W_o)
return projected.transpose(1, 0, 2).reshape(
seq_len, input_dim)
# JAX
def softmax_jax(x, axis=-1):
e_x = jnp.exp(x - jnp.max(x, axis=axis, keepdims=True))
return e_x / jnp.sum(e_x, axis=axis, keepdims=True)
def attention_jax(X, W_q, W_k, W_v):
d_k = W_k.shape[-1]
Q = X @ W_q
K = X @ W_k
V = X @ W_v
scores = Q @ K.T / jnp.sqrt(d_k)
attention_weights = softmax_jax(scores, axis=-1)
output = attention_weights @ V
return output
def multi_head_attention_jax(X, W_q, W_k, W_v, W_o):
def myfun(X, w_q, w_k, w_v, w_o):
return attention_jax(X, w_q, w_k, w_v) @ w_o
projected = jax.vmap(myfun,
in_axes=[None, 0, 0, 0, 0]
)(X, W_q, W_k, W_v, W_o)
return jax.vmap(jnp.ravel, in_axes=1)(projected)
# DumPy
def softmax_dumpy(x):
assert x.ndim == 1 # no need to think about dimensions!
e_x = dp.exp(x - dp.max(x))
return e_x / dp.sum(e_x)
@dp.wrap # needed to make functions with Slots auto-vectorizing
def attention_dumpy(X, W_q, W_k, W_v):
Q = X @ W_q
K = X @ W_k
V = X @ W_v
scores = Q @ K.T / np.sqrt(d_k)
attention_weights = dp.Slot()
with dp.Range(seq_len) as i:
attention_weights[i, :] = softmax_dumpy(scores[i, :])
output = attention_weights @ V
return output
def multi_head_attention_dumpy(X, W_q, W_k, W_v, W_o):
output = dp.Slot()
projected = dp.Slot()
final = dp.Slot()
with dp.Range(n_head) as n:
output[n, :, :] = attention_dumpy(X,
W_q[n, :, :],
W_k[n, :, :],
W_v[n, :, :])
projected[n, :, :] = output[n, :, :] @ W_o[n, :, :]
with dp.Range(seq_len) as i:
final[i, :] = dp.ravel(projected[:, i, :])
return final
# DumPy (alternate)
def multi_head_attention(X, W_q, W_k, W_v, W_o):
attn_weights = dp.Slot()
projected = dp.Slot()
final = dp.Slot()
with dp.Range(n_head) as n:
Q = X @ W_q[n, :, :]
K = X @ W_k[n, :, :]
V = X @ W_v[n, :, :]
scores = Q @ K.T / np.sqrt(d_k)
with dp.Range(seq_len) as i:
attn_weights[n, i, :] = softmax_dumpy(scores[i, :])
projected[n, :, :] = attn_weights[n, :, :] @ V @ W_o[n, :, :]
with dp.Range(seq_len) as i:
final[i, :] = dp.ravel(projected[:, i, :])
return final
I gave each implementation a subjective “goodness” score on a 1-10 scale. I always gave the best implementation for each problem 10 points, and then took off points from the others based on how much thinking they required.
Problem | Loops | Numpy | JAX (vmap) | DumPy |
---|---|---|---|---|
Hilbert matrices | 10 | 7 | 7 | 10 |
Covariance | 9 | 4 | 10 | 9 |
Moving Ave. | 10 | 6 | 6 | 8 |
Indexing | 10 | 5 | 4 | 10 |
Gaussians | 10 | 3 | 6 | 10 |
Attention | 10 | 1 | 8 | 10 |
Mean | 9.8 | 4.3 | 6.8 | 9.5 |
According to this dubious methodology and these made-up numbers, DumPy is 96.93877% as good as loops! Knowledge is power! But seriously, while subjective, I don’t think my scores should be too controversial. The most debatable one is probably JAX’s attention score.
The only thing DumPy adds to NumPy is some nice notation for indices. That’s it.
What I think makes DumPy good is it also removes a lot of stuff. Roughly speaking, I’ve tried to remove anything that is confusing and exists because NumPy doesn’t have loops. I’m not sure that I’ve drawn the line in exactly the right place, but I do feel confident that I’m on the right track with removing stuff.
In NumPy, A*B works if A and B are both scalar. Or if A is 5×1×6 and B is 5×1×6×1. But not if A is 1×5×6 and B is 1×5×6×1. Huh?
In truth, the broadcasting rules aren’t that complicated for scalar operations like multiplication. But still, I don’t like it, because every time you see A*B, you have to worry about what shapes those have and what the computation might be doing.
So, I removed it. In DumPy you can only do A*B if one of A or B is scalar, or if A and B have exactly the same shape. That’s it, anything else raises an error. Instead, use indices, so it’s clear what you’re doing. Instead of this:
C = A[...,None] * B[None]
write this:
C['i','j','k'] = A['i','j'] * B['j','k']
Indexing in NumPy is absurdly complicated. When you write A[B,C,D], that could do many different things depending on what all the shapes are. I considered going cold-turkey and only allowing scalar indices in DumPy. That wouldn’t have been so bad, since you can still do advanced stuff using loops. But it’s quite annoying to not be able to write A[B] when A and B are just simple 1D arrays.
So I’ve tentatively decided to be more pragmatic. In DumPy, you can index with integers, or slices, or (possibly mapped) Arrays. But only one Array index can be non-scalar. I settled on this because it’s the most general syntax that doesn’t require thinking.
Let me show you what I mean. If you see this:
A[1, 1:6, C, 2:10] # legal in both numpy and dumpy
It’s “obvious” what the output shape will be. (First the shape of 1:6, then the shape of C, then the shape of 2:10.) Simple enough. But as soon as you have two multidimensional array inputs like this:
A[B, 1:6, C, 2:10] # legal in numpy, verboten in dumpy
Suddenly all hell breaks loose. You need to think about broadcasting between B and C, orthogonal vs. pointwise indices, slices behaving differently than arrays, and quirks for where the output dimensions go. So DumPy forbids this. Instead, you need to write one of these:
D['i',:,:] = A[B['i'], 1:6, C['i'], 2:10] # (1)
D[:,:,'i'] = A[B['i'], 1:6, C['i'], 2:10] # (2)
D['i','j',:,:] = A[B['i'], 1:6, C['j'], 2:10] # (3)
D['i','j',:,:] = A[B['i','j'], 1:6, C['i'], 2:10] # (4)
D['i','j',:,:] = A[B['i','j'], 1:6, C['i','j'], 2:10] # (5)
They all do exactly what they look like they do.
Oh, and one more thing! In DumPy, you must index all dimensions. In NumPy, if A has three dimensions, then A[2] is equivalent to A[2,:,:]. This is sometimes nice, but it means that every time you see A[2], you have to worry about how many dimensions A has. In DumPy, every time you index an array or assign to a dp.Slot, it checks that all indices have been included. So when you see option (4) above, you know that:

- A has 4 dimensions
- B has 2 dimensions
- C has 1 dimension
- D has 4 dimensions

Always, always, always. No cases, no thinking.
Again, many NumPy functions have complex conventions for vectorization. np.linalg.solve sort of says, “If the inputs have ≤2 dimensions, do the obvious thing. Otherwise, do some extremely confusing broadcasting stuff.” DumPy removes the confusing broadcasting stuff. When you see dp.linalg.solve(A,B), you know that A and B have no more than two dimensions, so nothing tricky is happening.
Similarly, in NumPy, A @ B is equivalent to np.matmul(A,B). When both inputs have two or fewer dimensions, this does the “obvious thing”. (Either an inner product or some kind of matrix/vector multiplication.) Otherwise, it broadcasts or vectorizes or something? I can never remember. In DumPy you don’t have that problem, because it restricts A @ B to arrays with one or two dimensions only.
If you need more dimensions, no problem: Use indices.
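For example, here’s a minimal sketch of a plain batched solve written the DumPy way (my own example, not from the prototype’s docs: it assumes A is a stack of matrices indexed along its first axis and X a matching stack of vectors, both already dp.Arrays):

# assumed setup: A['i',:,:] is one matrix, X['i',:] is one vector
Y = dp.Slot()
Y['i', :] = dp.linalg.solve(A['i', :, :], X['i', :])

Each call to dp.linalg.solve only ever sees a single matrix and a single vector; the 'i' dimension is handled by the mapping machinery described above.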
It might seem annoying to remove features, but I’m telling you: Just try it. If you program this way, a wonderful feeling of calmness comes over you, as class after class of possible errors disappear.
Put another way, why remove all the fancy stuff, instead of leaving it optional? Because optional implies thinking! I want to program in a simple way. I don’t want to worry that I’m accidentally triggering some confusing broadcasting insanity, because that would be a mistake. I want the computer to help me catch mistakes, not silently do something weird that I didn’t intend.
In principle, it would be OK if there were an evil_solve method that preserves all the confusing batching stuff. If you really want that, you can make it yourself:
evil_solve = dp.MappedFunction(jnp.linalg.solve) # not recommended
You can use that same wrapper to convert any JAX NumPy function to work with DumPy.
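For instance, here’s a hypothetical sketch of wrapping a function DumPy doesn’t include. (jnp.fft.fft is a real JAX function; signals is a made-up 2D dp.Array; whether the wrapped function behaves exactly like DumPy’s built-ins is my assumption.)

# hypothetical: make jnp.fft.fft usable on mapped DumPy arrays
my_fft = dp.MappedFunction(jnp.fft.fft)
F = dp.Slot()
F['n', :] = my_fft(signals['n', :])  # one FFT per row of signals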
Think about math: In two or fewer dimensions, coordinate-free linear algebra notation is wonderful. But for higher dimensional tensors, there are just too many cases, so most physicists just use coordinates.
So this solution seems pretty obvious to me. Honestly, I’m a little confused why it isn’t already standard. Am I missing something?
When I complain about NumPy, people often suggest looking into APL-type languages, like A, J, K, or Q. (All single-letter languages are APL-like, except C, D, F, R, T, X, and many others. Convenient, right?) The obvious disadvantages of these are that:

- Almost no one knows them, so you’d have to learn them.
- They mostly don’t do autodiff or run on GPUs.
- The code looks like inscrutable symbol gibberish.
None of those bother me. If the languages are better, we should learn to use them and make them do autodiff on GPUs. But I’m not convinced they are better. When you actually learn these languages, what you figure out is that the symbol gibberish basically amounts to doing the same kind of dimension mashing that we saw earlier in NumPy:
AiX = np.linalg.solve(A.transpose(1,0,2,3),
X[None,...,None])[...,0]
Z = np.sum(AiX * Y[:,None], axis=-1).T
The reason is that, just like NumPy and vmap, these languages choose to align dimensions by position, rather than by name. If I have to mash dimensions, I want to use the best tool. But I’d prefer not to mash dimensions at all.
People also often suggest “NumPy with named dimensions” as in xarray. (PyTorch also has a half-hearted implementation.) Of course, DumPy also uses named dimensions, but there’s a critical difference. In xarray, they’re part of the arrays themselves, while in DumPy, they live outside the arrays.
In some cases, permanent named dimensions are very nice. But for linear algebra, they’re confusing. For example, suppose A is 2-D with named dimensions "cat" and "dog". Now, what dimensions should AᵀA have? ("dog" twice?) Or say you take a singular value decomposition like U, S, Vh = svd(A). What name should the inner dimensions have? Does the user have to specify it?
I haven’t seen a nice solution. xarray doesn’t focus on linear algebra, so it’s not much of an issue there. A theoretical “DumPy with permanent names” might be very nice, but I’m not sure how it should work. This is worth thinking about more.
I like Julia! Loops are fast in Julia! But again, I don’t think fast loops matter that much, because I want to move all the loops to the GPU. So even if I was using Julia, I think I’d want to use a DumPy-type solution.
I think Julia might well be a better host language than Python, not because of fast loops, but because it offers much more powerful meta-programming capabilities. I built DumPy on top of JAX just because JAX is very mature and good at calling the GPU, but I’d love to see the same idea used in Julia (“Dulia”?) or other languages.
OK, I promised a link to my prototype, so here it is: dumpy.py
It’s just a single file with around 700 lines. I’m leaving it as a single file because I want to stress that this is just something I hacked together in the service of this rant. I wanted to show that I’m not totally out of my mind, and that doing all this is actually pretty easy.
I stress that I don’t really intend to update or improve this. (Unless someone gives me a lot of money?) So please do not attempt to use it for “real work”, and do not make fun of my code.
PS. DumPy works out of the box with both jax.jit and jax.grad. For gradients, you need to either cast the output to a JAX scalar or use the dp.grad wrapper.
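For what it’s worth, here’s a rough sketch of what that might look like. (A toy example of my own; the exact behavior of dp.grad is assumed, not documented here.)

# toy loss: sum of squared errors, built from DumPy ops
def loss(w):
    r = dp.Slot()
    r['i'] = (w['i'] - 1.0) * (w['i'] - 1.0)
    return dp.sum(r)

w0 = dp.Array(np.zeros(3))
g = dp.grad(loss)(w0)            # gradient via the dp.grad wrapper
fast_g = jax.jit(dp.grad(loss))  # and jit, per the claim above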
2025-05-19 08:00:00
Alice and Bob are driving through the desert.
Alice: Looks dry.
Bob: That’s wrong, what we see ahead is caused by the sun heating up the road. While the speed of light is constant in vacuum, when light moves through matter, the atoms emit new light that destructively interferes with the old light, in effect causing a “delay”. This happens more with more atoms, meaning light travels slower through denser media.
Alice: OK, but—
Bob: So when the sun heats up the road, this creates a layer of thin warm air with denser cooler air higher up. As light passes through these layers, it refracts upwards towards the denser cooler layer. If conditions are right, this can bend the light back up towards your eyes. Does that make sense?
Alice: Yeah, but—
Bob: Now you might ask, “Why does this look like water?” The situations seem similar at first, with two fluids of different densities. But think about it: With an ocean, the dense water is below the thin air. While, in front of you, the dense cool air is above the thin warm air. The situations are actually reversed!
Alice: Please just—
Bob: With an ocean, there’s a sharp increase in density where the air meets the water. The Fresnel equations say that when light hits such a discontinuity at an angle, most is reflected off the surface. Whereas with the air in front of us, there’s a small and gradual decrease in density, meaning the light is slowly refracted through the lighter warm air and gradually bent back up. In both cases, the mixing in the fluids causes a shimmering effect.
Alice:
Bob: So yeah. Wrong. That’s just a heat mirage.
Meanwhile, the desert flies by, bone dry.
Often on the internet, someone will make a thing and someone else will make a reply with this pattern: declare that the thing is wrong, then deliver a series of smart-sounding, true statements about the general subject.
The problem, of course, is that this never explains why the original thing is wrong. This is a fully-general pattern for refuting anything, one that does not require the existence of any mistake. As far as I can tell, this pattern doesn’t have a name. So I suggest “Heat Mirage”: a compelling series of true statements that refute a mistake that’s never stated and doesn’t exist.
Often, all the smart-sounding factual statements are totally correct and insightful. So other people tend to see them and think, “Wow, this person really knows what they’re talking about. Clearly the original thing is dumb and bad. I’m not going to waste my time going through that!”
Sometimes this leads to a whole gang of people agreeing with the smart-sounding statements and lamenting the original thing’s wrongness. Someone might chime in with “Uhh, nothing you said contradicts anything?” or “It seems like you’re just annoyed that your personal fixation wasn’t the main subject of the thing?” But this rarely turns the tide.
What makes the Heat Mirage pattern so pernicious is that there’s seldom any ill intent. Someone just neglected for some reason to explain the “mistake” they were refuting. They might even be trying to be “nice” by avoiding too much criticism or negativity.
One of the tough lessons of life is that when you try to be “nice” to someone instead of being direct, you often end up hurting them, because you don’t understand their needs. If in doubt, it’s best to err on the side of (tactful) directness. And, on the internet, one should always be in doubt.
Trust me, anyone who makes a thing on the internet is 100% fully aware that other people are likely to point out flaws. I think that’s great, and we should all get comfortable with our own fallibility. But it’s best when the flaws that are pointed out are real flaws that actually exist.
So, if you want to say someone is wrong, say why. Or, if you don’t want to do that, consider starting off with, “Here’s something I find interesting…”
2025-05-15 08:00:00
They say you can’t truly hate someone unless you loved them first. I don’t know if that’s true as a general principle, but it certainly describes my relationship with NumPy.
NumPy, by the way, is some software that does computations on arrays in Python. It’s insanely popular and has had a huge influence on all the popular machine learning libraries like PyTorch. These libraries share most of the same issues I discuss below, but I’ll stick to NumPy for concreteness.
NumPy makes easy things easy. Say A is a 5×5 matrix, x is a length-5 vector, and you want to find the vector y such that Ay=x. In NumPy, that would be:
y = np.linalg.solve(A, x)
So elegant! So clear!
But say the situation is even a little more complicated. Say A is a stack of 100 5×5 matrices, given as a 100×5×5 array. And say x is a stack of 100 length-5 vectors, given as a 100×5 array. And say you want to solve Aᵢyᵢ=xᵢ for 1≤i≤100.
If you could use loops, this would be easy:
y = np.empty_like(x)
for i in range(100):
y[i,:] = np.linalg.solve(A[i,:,:], x[i,:])
But you can’t use loops. To some degree, this is a limitation of loops being slow in Python. But nowadays, everything is GPU and if you’ve got big arrays, you probably don’t want to use loops in any language. To get all those transistors firing, you need to call special GPU functions that will sort of split up the arrays into lots of little pieces and process them in parallel.
The good news is that NumPy knows about those special routines (at least if you use JAX or CuPy), and if you call np.linalg.solve correctly, it will use them.
The bad news is that no one knows how to do that.
Don’t believe me? OK, which of these is right?
y = linalg.solve(A,x)
y = linalg.solve(A,x,axis=0)
y = linalg.solve(A,x,axes=[[1,2],1])
y = linalg.solve(A.T, x.T)
y = linalg.solve(A.T, x).T
y = linalg.solve(A, x[None,:,:])
y = linalg.solve(A,x[:,:,None])
y = linalg.solve(A,x[:,:,None])[:,:,0]
y = linalg.solve(A[:,:,:,None],x[:,None,None,:])
y = linalg.solve(A.transpose([1,2,0]),x[:,:,None]).T
No one knows. And let me show you something else. Here’s the documentation:
Read that. Meditate on it. Now, notice: You still don’t know how to solve Aᵢyᵢ=xᵢ for all i at once. Is it even possible? Did I lie when I said it was?
As far as I can tell, what people actually do is try random variations until one seems to work.
NumPy is all about applying operations to arrays. When the arrays have 2 or fewer dimensions, everything is fine. But if you’re doing something even mildly complicated, you inevitably find yourself with some operation you want to apply to some dimensions of array A, some other dimensions of array B, and some other dimensions of array C. And NumPy has no theory for how to express that.
Let me show you what I mean. Suppose:

- A is a K×L×M array
- B is an L×N array
- C is a K×M array

And say that for each k and n, you’d like to compute the mean over the L and M dimensions. That is, you want

Dₖₙ = 1/(LM) × ∑ₗₘ Aₖₗₘ Bₗₙ Cₖₘ.
To do that, you’ve got two options. The first is to use grotesque dimension alignment tricks:
D = np.mean(
np.mean(
A[:,:,:,None] *
B[None,:,None,:] *
C[:,None,:,None],
axis=1),
axis=1)
The hell, you ask? Why is None everywhere? Well, when indexing an array in NumPy, you can write None to insert a new dimension. A is K×L×M, but A[:,:,:,None] is K×L×M×1. Similarly, B[None,:,None,:] is 1×L×1×N and C[:,None,:,None] is K×1×M×1. When you multiply these together, NumPy “broadcasts” all the size-1 dimensions to give a K×L×M×N array. Then, the np.mean calls average over the L and M dimensions.
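If it helps, here’s the same statement as shape checks (nothing new, just the inserted size-1 axes spelled out):

assert A[:, :, :, None].shape == (K, L, M, 1)
assert B[None, :, None, :].shape == (1, L, 1, N)
assert C[:, None, :, None].shape == (K, 1, M, 1)
assert (A[:, :, :, None] * B[None, :, None, :] * C[:, None, :, None]).shape == (K, L, M, N)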
I think this is bad. I’ve been using NumPy for years and I still find it impossible to write code like that without always making mistakes.
It’s also borderline-impossible to read. To prove this, I just flipped a coin and introduced a bug above if and only if the coin was tails. Is there a bug? Are you sure? No one knows.
Your second option is to desperately try to be clever. Life is short and precious, but if you spend a lot of yours reading the NumPy documentation, you might eventually realize that there’s a function called np.tensordot, and that it’s possible to make it do much of the work:
D = (1/L) * np.mean(
np.tensordot(A, B, axes=[1,0]) *
C[:,:,None],
axis=1)
That’s correct. (I promise.) But why does it work? What exactly is np.tensordot doing? If you saw that code in some other context, would you have the slightest idea what was happening?
Here’s how I’d do it, if only I could use loops:
D = np.zeros((K,N))
for k in range(K):
for n in range(N):
a = A[k,:,:]
b = B[:,n]
c = C[k,:]
assert a.shape == (L,M)
assert b.shape == (L,)
assert c.shape == (M,)
D[k,n] = np.mean(a * b[:,None] * c[None,:])
People who’ve written too much NumPy may find that clunky. I suspect that’s a wee bit of Stockholm Syndrome. But surely we can agree that it’s clear.
In practice, things are often even worse. Say that A had shape M×K×L rather than K×L×M. With loops, no big deal. But NumPy requires you to write monstrosities like A.transpose([1,2,0]). Or should that be A.transpose([2,0,1])? What shapes do those produce? No one knows.
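For the record, the only reliable way I know to answer such questions is a habit, not a NumPy feature: try it with distinct dummy sizes, so the shapes can’t be confused.

import numpy as np

M, K, L = 2, 3, 4               # distinct sizes, so the output shapes are unambiguous
A = np.zeros((M, K, L))

print(A.transpose([1, 2, 0]).shape)  # (3, 4, 2) = K×L×M, the one we wanted
print(A.transpose([2, 0, 1]).shape)  # (4, 2, 3) = L×M×K, not that one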
Loops were better.
There is a third option:
D = 1/(L*M) * np.einsum('klm,ln,km->kn', A, B, C)
If you’ve never seen Einstein summation before, that might look terrifying. But remember, our goal is to find

D_{kn} = (1/(LM)) × ∑_{l,m} A_{klm} B_{ln} C_{km}.

The string in the above code basically gives labels to the indices in each of the three inputs (klm,ln,km) and the target indices for the output (->kn). Then, np.einsum multiplies together the corresponding elements of the inputs and sums over all indices that aren’t in the output.
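If you’d like to check that the einsum really computes the same D as the loop version, a small test with made-up sizes settles it:

import numpy as np

K, L, M, N = 2, 3, 4, 5
A = np.random.randn(K, L, M)
B = np.random.randn(L, N)
C = np.random.randn(K, M)

D_einsum = 1/(L*M) * np.einsum('klm,ln,km->kn', A, B, C)

# Same thing, with loops.
D_loop = np.zeros((K, N))
for k in range(K):
    for n in range(N):
        D_loop[k, n] = np.mean(A[k, :, :] * B[:, n][:, None] * C[k, :][None, :])

print(np.allclose(D_einsum, D_loop))  # True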
Personally, I think np.einsum is one of the rare parts of NumPy that’s actually good. The strings are a bit tedious, but they’re worth it, because the overall function is easy(ish) to understand, is completely explicit, and is quite general and powerful.
Except, how does np.einsum achieve all this? It uses indices. Or, more precisely, it introduces a tiny domain-specific language based on indices. It doesn’t suffer from NumPy’s design flaws because it refuses to play by NumPy’s normal rules.
But np.einsum only does a few things. (Einops does a few more.) What if you want to apply some other function over various dimensions of some arrays? There is no np.linalg.einsolve. And if you create your own function, there’s certainly no “Einstein” version of it.
I think np.einsum’s goodness shows that NumPy went wrong somewhere.
Here’s a painting which feels analogous to our subject.
Here’s what I want from an array language. I ain’t particular about syntax, but it would be nice if:
Wouldn’t that be nice? I think NumPy doesn’t achieve these because of its original sin: It took away indices and replaced them with broadcasting. And broadcasting cannot fill indices’ shoes.
NumPy’s core trick is broadcasting. Take this code:
A = np.array([[1,2],[3,4],[5,6]])
B = np.array([10,20])
C = A * B
print(C)
This outputs:
[[ 10  40]
 [ 30  80]
 [ 50 120]]
Here, A is a 3×2 array, and B is a length-2 array. When you multiply them together, B is “broadcast” to the shape of A, meaning the first column of A is multiplied with B[0]=10 and the second is multiplied with B[1]=20.
In simple cases, this seems good. But I don’t love it. One reason is that, as we saw above, you often have to do gross things to the dimensions to get them to line up.
Another reason is that it isn’t explicit or legible. Sometimes A*B multiplies element-by-element, and sometimes it does more complicated things. So every time you see A*B, you have to figure out which case in the broadcasting conventions is getting triggered.
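Here’s a tiny example of my own to show what I mean: the expression looks the same, but the shapes (which you can’t see) change what it does.

import numpy as np

a = np.random.randn(5)
b = np.random.randn(5)
print((a * b).shape)   # (5,)  plain element-by-element multiplication

b = np.random.randn(5, 1)
print((a * b).shape)   # (5, 5)  now it's an outer product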
But the real problem with broadcasting is how it infects everything else. I’ll explain below.
Here’s a riddle. Take this code:
A = np.ones((10,20,30,40))
i = np.array([1,2,3])
j = np.array([[0],[1]])
B = A[:,i,j,:]
What shape does B have?

It turns out the answer is 10×2×3×40. That’s because the i and j indices get broadcast to a shape of 2×3 and then something something mumble mumble mumble. Try to convince yourself it makes sense.
Done? OK, now try these:
C = A[:,:,i,j]
D = A[:,i,:,j]
E = A[:,1:4,j,:]
What shapes do these have?
C is 10×20×2×3. This seems logical, given what happened with B above.

What about D? It is 2×3×10×30. Now, for some reason, the 2 and 3 go at the beginning?

And what about E? Well, “slices” in Python exclude the endpoint, so 1:4 is equivalent to [1,2,3], which is equivalent to i, and so E is the same as B. Hahaha, just kidding! E is 10×3×2×1×40.
Yes, that is what happens. Try it if you don’t believe me! I understand why NumPy does this, because I’ve absorbed this 5000 word document that explains how NumPy indexing works. But I want that time back.
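If you do want to try it, here’s the whole check:

import numpy as np

A = np.ones((10, 20, 30, 40))
i = np.array([1, 2, 3])
j = np.array([[0], [1]])

print(A[:, i, j, :].shape)    # (10, 2, 3, 40)
print(A[:, :, i, j].shape)    # (10, 20, 2, 3)
print(A[:, i, :, j].shape)    # (2, 3, 10, 30)
print(A[:, 1:4, j, :].shape)  # (10, 3, 2, 1, 40)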
For the record, I also put this riddle to a bunch of AI models. I used this query:
Take this python code
A = np.ones((10,20,30,40))
i = np.array([1,2,3])
j = np.array([[0],[1]])
B = A[:,i,j,:]
C = A[:,:,i,j]
D = A[:,i,:,j]
E = A[:,1:4,j,:]

what shapes do B, C, D, and E have?
Claude 3.7 used “extended thinking”. Here are all the incorrect outputs:
AI | B | C | D | E
---|---|---|---|---
GPT 4.1 | | | 10×2×3×30 |
Grok 3 | | | 10×3×30×2 | 10×3×2×40
Claude 3 Opus | 10×3×2×30 | 10×20×3×2 | 10×3×30×2 | 10×3×2×40
Llama 4 Maverick | | | 10×3×30×2 | 10×3×2×40
o3 | | | 10×2×3×30 |
Claude 3.7 | | | 10×3×30×2 | 10×3×2×40
And here’s which answers each model got right:

AI | B | C | D | E
---|---|---|---|---
GPT 4.1 | ✔️ | ✔️ | X | ✔️
Grok 3 | ✔️ | ✔️ | X | X
Claude 3 Opus | X | X | X | X
Llama 4 Maverick | ✔️ | ✔️ | X | X
o3 | ✔️ | ✔️ | X | ✔️
Claude 3.7 | ✔️ | ✔️ | X | X
Gemini 2.5 Pro | ✔️ | ✔️ | ✔️ | ✔️
DeepSeek R1 | ✔️ | ✔️ | ✔️ | ✔️
(DeepSeek’s chain of thought used “wait” 76 times. It got everything right the first time, but when I tried it again, it somehow got B, C, and D all wrong, but E right.)
This is insane. Using basic features should not require solving crazy logic puzzles.
You might think, “OK, I’ll just limit myself to indexing in simple ways.” Sounds good, except sometimes you need advanced indexing. And even if you’re doing something simple, you still need to be careful to avoid the crazy cases.
This again makes everything non-legible. Even if you’re just reading code that uses indexing in a simple way, how do you know it’s simple? If you see A[B,C], that could be doing almost anything. To understand it, you need to remember the shapes of A, B, and C and work through all the cases. And, of course, A, B, and C are often produced by other code, which you also need to think about…
Why did NumPy end up with a np.linalg.solve(A,B) function that’s so confusing? I imagine they first made it work when A is a 2D array and b is a 1D or 2D array, just like the mathematical notation of A⁻¹b or A⁻¹B.
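In that simple world, everything matches the math. A quick sketch, with shapes of my own choosing:

import numpy as np

A = np.random.randn(3, 3)
b = np.random.randn(3)
B = np.random.randn(3, 4)

print(np.linalg.solve(A, b).shape)  # (3,)   like A⁻¹b
print(np.linalg.solve(A, B).shape)  # (3, 4) like A⁻¹B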
So far so good. But then someone probably came along with a 3D array. If you could use loops, the solution would be “use the old function with loops”. But you can’t use loops. So there were basically three options:
1. Add an axes argument, so the user can specify which dimensions to operate over. Maybe you could write solve(A,B,axes=[[1,2],1]).
2. Make different functions with different names for different cases: solve_matrix_vector would do one thing, solve_tensor_matrix would do another.
3. Rely on Conventions: solve will internally try to line up the dimensions, and then it’s the user’s problem to figure out and conform to those Conventions.

All these options are bad, because none of them can really cope with the fact that there are a combinatorial number of different cases. NumPy chose: All of them. Some functions have axes arguments. Some have different versions with different names. Some have Conventions. Some have Conventions and axes arguments. And some don’t provide any vectorized version at all.
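To make “all of them” a bit more concrete, here are a few real calls, written from memory, so check the details against the docs (which is sort of the point):

import numpy as np

A = np.random.randn(2, 3, 4)
B = np.random.randn(4, 5)

np.sum(A, axis=2)                # an axis argument
np.tensordot(A, B, axes=[2, 0])  # an axes argument, with its own semantics
np.matmul(A, B)                  # no axis argument: Conventions about the last two dims
np.cross(np.random.randn(7, 3),  # Conventions AND axisa/axisb/axisc arguments
         np.random.randn(7, 3),
         axisa=-1, axisb=-1)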
But the biggest flaw of NumPy is this: Say you create a function that solves some problem with arrays of some given shape. Now, how do you apply it to particular dimensions of some larger arrays? The answer is: You re-write your function from scratch in a much more complex way. The basic principle of programming is abstraction—solving simple problems and then using the solutions as building blocks for more complex problems. NumPy doesn’t let you do that.
One last example to show you what I’m talking about. Whenever I whine about NumPy, people always want to see an example with self-attention, the core trick behind modern language models. So fine. Here’s an implementation, which I humbly suggest is better than all 227 versions I found when I searched for “self-attention numpy”:
# self attention by your friend dynomight
input_dim = 4
seq_len = 4
d_k = 5
d_v = input_dim

X = np.random.randn(seq_len, input_dim)
W_q = np.random.randn(input_dim, d_k)
W_k = np.random.randn(input_dim, d_k)
W_v = np.random.randn(input_dim, d_v)

def softmax(x, axis):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    d_k = W_k.shape[1]
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)
    attention_weights = softmax(scores, axis=-1)
    return attention_weights @ V
This is fine. Some of the axis stuff is a little obscure, but whatever.
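Just to show it runs and gives the shape you’d expect (a quick check I added, nothing deep):

out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 4), i.e. (seq_len, d_v)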
But what language models really need is multi-head attention, where you sort of do attention several times in parallel and then merge the results. How do we do that?
First, let’s imagine we lived in a sane world where we were allowed to use abstractions. Then you could just call the previous function in a loop:
# multi-head self attention by your friend dynomight
# if only we could use loops
n_head = 2
X = np.random.randn(seq_len, input_dim)
W_q = np.random.randn(n_head, input_dim, d_k)
W_k = np.random.randn(n_head, input_dim, d_k)
W_v = np.random.randn(n_head, input_dim, d_v)
W_o = np.random.randn(n_head, d_v, input_dim // n_head)
def multi_head_attention(X, W_q, W_k, W_v, W_o):
    projected = []
    for n in range(n_head):
        output = attention(X,
                           W_q[n,:,:],
                           W_k[n,:,:],
                           W_v[n,:,:])
        my_proj = output @ W_o[n,:,:]
        projected.append(my_proj)
    projected = np.array(projected)

    output = []
    for i in range(seq_len):
        my_output = np.ravel(projected[:,i,:])
        output.append(my_output)
    return np.array(output)
Looks stupid, right? Yes—thank you! Cleverness is bad.
But we don’t live in a sane world. So instead you need to do this:
# multi-head self attention by your friend dynomight
# all vectorized and bewildering
def multi_head_attention(X, W_q, W_k, W_v, W_o):
    d_k = W_k.shape[-1]
    Q = np.einsum('si,hij->hsj', X, W_q)
    K = np.einsum('si,hik->hsk', X, W_k)
    V = np.einsum('si,hiv->hsv', X, W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    output = weights @ V
    projected = np.einsum('hsv,hvd->hsd', output, W_o)
    return projected.transpose(1, 0, 2).reshape(
        seq_len, input_dim)
Ha! Hahahahahaha!
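If you want to check that these two monstrosities agree (I did), something like this works, assuming you rename them to, say, multi_head_attention_loops and multi_head_attention_vec so both can exist at once. Those names are mine, purely for illustration:

out_loops = multi_head_attention_loops(X, W_q, W_k, W_v, W_o)
out_vec = multi_head_attention_vec(X, W_q, W_k, W_v, W_o)
print(out_loops.shape)                  # (4, 4), i.e. (seq_len, input_dim)
print(np.allclose(out_loops, out_vec))  # True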
To be clear, I’m only suggesting that NumPy is “the worst array language other than all the other array languages”. What’s the point of complaining if I don’t have something better to suggest?
Well, actually I do have something better to suggest. I’ve made a prototype of a “better” NumPy that I think retains all the power while eliminating all the sharp edges. I thought this would just be a short motivational introduction, but after I started writing, the evil took hold of me and here we are 3000 words later.
Also, it’s probably wise to keep some distance between one’s raving polemics and one’s constructive array language API proposals. So I’ll cover my new thing next time.
2025-05-12 08:00:00
So you’ve made a thing. I’ll pretend it’s a blog post, though it doesn’t really matter. If people read your thing, some would like it, and some wouldn’t.
You should try to make a good thing that many people would like. That presents certain challenges. But our subject today is only how to give your thing a title.
My advice is: Think of the title as a “classifier”.
When people see the title, some are likely to click on it and some won’t. Abstractly speaking, the title adds a second dimension to the above figure:
A title has two goals. First, think of all the people in the world who, if they clicked on your thing, would finish it and love it. Ideally, those people would click. That is, you want there to be people in the like + click region:
Second, other people will hate your thing. It’s fine, some people hate everything. But if they click on your thing, they’ll be annoyed and tell everyone you are dumb and bad. You don’t want that. So you don’t want people in the hate + click region.
I find it helpful to think about all title-related issues from this perspective.
Everyone is deluged with content. Few people will hate your thing, because very few will care enough to have any feelings about it at all.
The good news is that it’s a big world and none of us are that unique. If you make a thing that you would love, then I guarantee you at least 0.0001% of other people would love it too. That’s still 8000 people! The problem is finding them.
That’s hard. Because—you don’t like most things, right? So you start with a strong prior that most things are bad (for you). Life is short, so you only click on things when there’s a very strong signal they’ll be good (for you).
Say you write a post about concrete. Should you call it, “My favorite concrete pozzolanic admixtures”, even though 99.9% of people have no idea what pozzolanic means? Well, think of the people who’d actually like your thing. Do they know? If so, use “pozzolanic”. That gives a strong signal to Concrete People: “Hey! This is for you! And you know I’m not lying about that, because I’ve driven away all the noobs.”
So ideally you’re aiming for something like this:
Be careful imitating famous people. If Barack Obama made a thing called “Thoughts on blockchain”, everyone would read it, because the implicit title is “Thoughts on blockchain, by Barack Goddamn Obama”. Most of the titles you see probably come from people who have some kind of established “brand”. If you don’t have that, you probably don’t want to choose the same kind of titles.
The title isn’t just about the subject. I called this post, “How to title your blog post or whatever” partly because I hope some of this applies to other things beyond blog posts. But mostly I did that because it signals that my style is breezy and informal. I think people really underrate this.
Some people choose clever punny titles. If you have a big audience that reads all your things, then your title doesn’t need to be a good classifier. I’m not in that situation, but I sometimes find a pun so amusing that I can’t resist. “Fahren-height” was worth it. “Taste games” was not.
Traditional advice says that you should put your main “message” in the title. I have mixed feelings about that. On the one hand, it provides a lot of signal. On the other hand, it seems to get people’s hackles up. The world is full of bad things that basically pick a conclusion and ignore or distort all conflicting evidence. If you’re attempting to be fair or nuanced, putting your conclusion in the title might signal that you’re going to be a typical biased/bad thing. It will definitely lead to lots of comments “refuting” you from people who didn’t read your thing.
Putting your conclusion in the title may also ruin your “story”. Though you should ask: Do the people who’d like your thing really care about your story?
A difficult case is things that create new “labels”. Sometimes there’s an idea floating around, and we need someone to make a canonical thing with a Name. To serve that role, the thing’s title needs to be that name. This presents a trade-off. A post titled “The Waluigi Effect” is great for people who want to know what that is, but terrible for everyone else.
For the best title ever I nominate, “I’m worried about Chicago”. It doesn’t look fancy, but do you see how elegantly it balances all the above issues?
You’d think that, by 2025, technology would have solved the problem of things getting to people. I think it’s the opposite. Social media is optimized to keep people engaged and does not want people leaving the walled garden. Openly prohibiting links would cause a revolt, so instead they go as close as people will tolerate. Which, it turns out, is pretty close.
Boring titles are OK. I know that no one will click on “Links for April” who doesn’t already follow me. But I think that’s fine, because I don’t think anyone else would like it.
Consider title-driven thing creation. That is, consider first choosing a title and then creating a thing that delivers on the title. It’s sad to admit, but I think there are many good things that simply don’t have good titles. Consider not making those things. The cynical view of this is that without a good title, no one will read your thing, so why bother? The optimistic view is that we’re all drowning in content, so what the world actually needs is good things that can find their way to the people who will benefit from them. In practice, it’s often something in the middle: You start to create your thing, then you choose a title, then you structure your thing to deliver on the title.
My favorite thing category is “Lucid examination of all sides of an issue which finds some evidence pointing in various directions and doesn’t reach a definitive conclusion because the world is complicated”. Some people make fun of me for spending so much time researching seed oils and then lamely calling my thing “Thoughts on seed oil”. But what should I have called that instead? Lots of bloggers create things in this category, and no one seems to have solved the problem.
[Insert joke about how bad the title of this post is.]