
OK, I can partly explain the LLM chess weirdness now

2024-11-21 08:00:00

We recently talked about a mystery: All large language models (LLMs) are terrible at chess. All, that is, except for gpt-3.5-turbo-instruct, which for some reason can play at an advanced amateur level. This is despite the fact that this model is more than a year old and much smaller than recent models. What’s going on?

I suggested four possible explanations:

  • Theory 1: Large enough base models are good at chess, but this doesn’t persist through instruction tuning to chat models.

  • Theory 2: For some reason, gpt-3.5-turbo-instruct was trained on more chess data.

  • Theory 3: There’s something magical about certain LLM architectures.

  • Theory 4: There’s “competition” between different types of data, so for an LLM to play chess well, you need a large fraction of the data to be chess games.

The internet offered several other theories. The most common were:

  • Theory 5: OpenAI is cheating.

  • Theory 6: LLMs can’t actually play chess.

  • Theory 7: Large enough base models are good at chess, but this doesn’t persist through instruction tuning to chat models, Dynomight you are so bad for not suggesting this, how are you so dumb and bad?

I’ve now done new experiments and—good news—everyone is wrong!

Here, I’ll show that recent chat models can play chess quite well, as long as you’re willing to go through sufficiently extreme contortions to figure out how to prompt them. Then I’ll give my theory for what’s happening.

But first…

I really don’t think OpenAI is cheating

I was astonished that half the internet is convinced that OpenAI is cheating. Many, many people suggested that there must be some special case in gpt-3.5-turbo-instruct that recognizes chess notation and calls out to an external chess engine.

I think this is extremely unlikely. Because:

  1. Many people from OpenAI have said they didn’t do that. Sure, people lie, but conspiracies are hard, and why lie about this?

  2. In chess, you can arrive at the same board state via different sequences of moves. Chess engines don’t care, but gpt-3.5-turbo-instruct does care and plays very differently for different move sequences.

  3. While gpt-3.5-turbo-instruct is great by the standards of chess amateurs, it’s bad by the standards of experts and pathetic by the standards of chess engines. If you’re going to cheat, why stop at an Elo of 1750?

  4. If you change the way you prompt gpt-3.5-turbo-instruct, this will subtly change how it plays. Is there some neural network that looks at the text and dynamically sets the chess engine skill level?

  5. Later OpenAI models are by default much worse. Did they remove the cheat?

  6. I will show (below) that later OpenAI models can also play well if you use the right incantations.

If OpenAI did cheat, they went to insane lengths to cheat in a way that looks exactly like an LLM is choosing the moves and not at all like calling out to an external chess engine.

Yes, LLMs can play chess

I was also surprised to see so many people suggest that LLMs can’t really play chess, all they do is memorize openings and then play randomly.

This is wrong. LLMs can definitely play chess, and we need to make peace with this.

For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game. This requires “understanding” chess. If this doesn’t convince you, I encourage you to write a program that can take strings like 1. e4 d5 2. exd5 Qxd5 3. Nc3 and then say if the last move was legal.

And I defy you to maintain that LLMs can’t play chess after looking at some actual games. Here are ten: 1 2 3 4 5 6 7 8 9 10. It plays pretty well even in completely new board states that have never existed in any game before in history.

So what’s happening?

Why is one LLM great and all the others terrible?

Let me remind you of what I’m talking about. First, take gpt-3.5-turbo-instruct. This is a “completion” model, meaning all it does is take some text and generate new text that might come after. I gave it text like this:

[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]

1. e4 e5 2. Nf3 Nc6 3.

I then took the first few characters of the completion and used them as the move.
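For concreteness, here is a minimal sketch of what one such query might look like. The client code and parsing are my assumptions about one reasonable way to do it, not necessarily the exact setup used for these experiments:

# Sketch: query a completion model with a PGN-style prompt and take the
# first whitespace-delimited chunk of the completion as the move.
# (Assumes the OPENAI_API_KEY environment variable is set.)
from openai import OpenAI

client = OpenAI()

prompt = """[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]

1. e4 e5 2. Nf3 Nc6 3."""

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=8,
)

move = response.choices[0].text.strip().split()[0]  # e.g. "Bb5"
print(move)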

Next, take gpt-4o-mini and gpt-4o. These are “chat” models, meaning you give them a “system prompt” that says what they’re supposed to do and a “user prompt” and then they try to answer you. I used this system prompt:

You are a chess grandmaster.
You will be given a partially completed game.
After seeing it, you should choose the next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice.

For user prompts, I repeated the system prompt and then gave the same metadata for the players and sequence of moves as above.

That is, I used user prompts like this:

You are a chess grandmaster.
You will be given a partially completed game.
After seeing it, you should choose the next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice.

[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]

1. e4 e5 2. Nf3 Nc6 3.
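And here is a matching sketch for the chat models, again with the client details being my assumptions rather than the exact code used:

# Sketch: the same query issued against a chat model, with the system
# prompt repeated at the top of the user prompt as described above.
from openai import OpenAI

client = OpenAI()

SYSTEM = """You are a chess grandmaster.
You will be given a partially completed game.
After seeing it, you should choose the next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice."""

GAME = """[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]

1. e4 e5 2. Nf3 Nc6 3."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": SYSTEM + "\n\n" + GAME},
    ],
)

move = response.choices[0].message.content.strip()
print(move)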

Here are the results of these three models against Stockfish—a standard chess AI—on level 1, with a maximum of 0.01 seconds to make each move. After the game was over, I calculated the score after each turn in “centipawns” where a pawn is worth 100 points, and ±1500 indicates a win or loss. Here is the average over 50 games:

Note: Last time I found that gpt-3.5-turbo-instruct won every single game and gpt-4o lost every game. I think the difference now is just that I have 50 samples rather than 10.

So that’s the mystery: gpt-3.5-turbo-instruct is great, and other models are terrible. Last time I tested lots of open models and they were terrible too. Why?

To answer this, let’s see if we can make these other models play chess better.

Should we fiddle with the prompt?

You can nitpick at how I prompted the chat models. Is it a good idea to repeat the system prompt at the top of the user prompt? Is it a good idea to add all the metadata like player names before you start listing moves?

As far as I can tell, no one knows. So I tried every combination of these things being on or off. With gpt-4o-mini (a small model) it seemed to make little difference.

With gpt-4o (a bigger model) it… maybe made a difference?

Taken at face value, this says that repeating the system prompt helps a bit but metadata hurts a bit. But I’m not sure if that’s real or just noise. For simplicity, I decided to not repeat the system prompt and to turn off metadata for all further experiments.

Should we add examples?

If you want LLMs to do something, standard advice is to provide some examples. So I created three small examples of partial games paired with legal next moves. I provided these "correctly" using the API, not by jamming them into the user prompt.
  • Input A: 1.
  • Output A: e4
  • Input B: 1. e4
  • Output B: d5
  • Input C: 1. e4 e5 2. Nf3 Nc6 3.
  • Output C: Bb5

That’s all I used, just these three examples.
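My reading of providing the examples “correctly” through the API is that they went in as prior user/assistant turns rather than being pasted into the prompt. Here is a sketch under that assumption:

# Sketch: few-shot examples supplied as earlier user/assistant messages.
# Whether this is exactly how the examples were provided is my assumption.
system_prompt = "You are a chess grandmaster. ..."  # the full system prompt from above
game_so_far = "1. e4 e5 2. Nf3 Nc6 3."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "1."},
    {"role": "assistant", "content": "e4"},
    {"role": "user", "content": "1. e4"},
    {"role": "assistant", "content": "d5"},
    {"role": "user", "content": "1. e4 e5 2. Nf3 Nc6 3."},
    {"role": "assistant", "content": "Bb5"},
    {"role": "user", "content": game_so_far},  # the game we actually want continued
]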

The results were:

Very good.

Is this surprising? I thought this was surprising.

I mean, sure, this kind of “in-context learning” is a big part of what makes LLMs so exciting. And examples are probably the most standard piece of advice from practitioners.

Still, I was blown away that three tiny examples could have such a profound effect on performance. More (or different) examples might be even better. I didn’t check, because generating each of these figures requires an ungodly number of queries.

Should we fine-tune?

Another standard (albeit more difficult) way to improve LLMs is to fine-tune—to optimize the weights to be good at whatever task using data for that task.

So I did this for both gpt-4o-mini and gpt-4o.

To generate the fine-tuning data, I had Stockfish play 100 games against itself on its highest difficulty level. For each game, I picked a random move and used it as an example. I then had Stockfish play another 100 games as validation data.

Here was one example:

  • System prompt: (same as above)
  • User prompt: 1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 a6 6. Be2 e5 7. Nb3 Be7 8. Be3 Be6 9. f4 exf4 10. Bxf4 Nc6 11. Qd3 Ne5 12. Qg3 Nh5 13. Qe3
  • Desired output: Nxf4
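Here is a sketch of how that data generation might look, using python-chess to drive Stockfish and writing the examples in OpenAI's chat fine-tuning JSONL format. The time limits, file names, and helper functions are my assumptions:

# Sketch: generate fine-tuning examples from Stockfish self-play.
# One random position is sampled per game; the target is the move
# Stockfish actually played from that position.
import json
import random
import chess
import chess.engine

SYSTEM = (
    "You are a chess grandmaster.\n"
    "You will be given a partially completed game.\n"
    "After seeing it, you should choose the next move.\n"
    'Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".\n'
    "NEVER give a turn number.\n"
    "NEVER explain your choice."
)

def self_play_game(engine, max_plies=200):
    """Have Stockfish play itself; return the game as a list of SAN moves."""
    board = chess.Board()
    sans = []
    while not board.is_game_over() and len(sans) < max_plies:
        move = engine.play(board, chess.engine.Limit(time=0.05)).move
        sans.append(board.san(move))
        board.push(move)
    return sans

def movetext(sans, n_plies):
    """Render the first n_plies moves as '1. e4 e5 2. ...', including the
    next move number when it is White's turn (to match the prompts above)."""
    parts = []
    for i, san in enumerate(sans[:n_plies]):
        if i % 2 == 0:
            parts.append(f"{i // 2 + 1}.")
        parts.append(san)
    if n_plies % 2 == 0:
        parts.append(f"{n_plies // 2 + 1}.")
    return " ".join(parts)

with chess.engine.SimpleEngine.popen_uci("stockfish") as engine, \
        open("train.jsonl", "w") as f:
    for _ in range(100):
        sans = self_play_game(engine)
        k = random.randrange(len(sans))  # index of the move we want predicted
        f.write(json.dumps({"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": movetext(sans, k)},
            {"role": "assistant", "content": sans[k]},
        ]}) + "\n")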

The results were:

Good! Fine-tuning helps.

Note: The first time I fine-tuned gpt-4o, the results seemed bad, so I ran it again with a smaller step size. This makes me nervous.

Should we combine examples and fine-tuning?

If examples are good, and fine-tuning is good, will putting them together be even better?

My intuition was no, since three in-context examples seem trivial when compared to 100 fine-tuning examples. Once fine-tuning was done, I figured the examples would be superfluous.

The answer was no, but for different reasons:

According to that figure, fine-tuning helps. And examples help. But it’s examples that make fine-tuning redundant, not the other way around.

Ohkay.

Should we provide legal moves?

LLMs sometimes struggle to give legal moves. In these experiments, I try 10 times and if there’s still no legal move, I just pick one at random. So I wondered: Maybe I could help the LLM out by listing the legal moves before giving the game history? Some might say this is “cheating”, but let’s try it anyway.
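Here is a sketch of that retry-then-random procedure, with python-chess generating the legal-move list. The ask_llm_for_move() helper is hypothetical; it would wrap one API call like the ones above:

# Sketch: ask for a move up to 10 times, accept the first legal suggestion,
# and fall back to a random legal move otherwise.
import random
import chess

def choose_move(board: chess.Board, ask_llm_for_move, tries: int = 10) -> chess.Move:
    legal_sans = [board.san(m) for m in board.legal_moves]
    for _ in range(tries):
        candidate = ask_llm_for_move(board, legal_sans).strip()
        if candidate in legal_sans:
            return board.parse_san(candidate)
    return random.choice(list(board.legal_moves))  # give up and pick at random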

I used a new system prompt that told the LLM to expect a list of legal moves and then user prompts that first listed the moves and then listed the game so far.

More specifically, I used this system prompt:

You are a chess grandmaster.
You will be given a list of legal moves and a partially completed game.
After seeing it, you should choose the next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice.

And I sent user prompts like this:

Here are the current legal moves:

Bxh6 Bxf6 Bh4 Bf4 Be3 Bd2 Bc1 Nd5 Nb5 Na4 Nce2 Nb1 Nh3 Nf3 Nge2 Ba6 Bb5+ Bc4 Bd3 Be2 Ke2 Kd2 Qh5 Qg4 Qf3 Qd3 Qe2 Qd2 Qc1 Qb1 Rc1 Rb1 e5 d5 h3 g3 f3 b3 a3 h4 g4 f4 b4 a4

Here is the game so far:

1. e4 d6 2. d4 g6 3. Nc3 Nf6 4. Bg5 h6 5.

Here are the results:

Disaster. Listing the legal moves makes the models play much worse. They don’t just win fewer games, they start making mistakes after a much smaller number of turns.

Ohhhkay. I guess let’s not do that.

I had an idea

Thinking about the above, I had an idea. An idea that I think is… rather good.

Let’s back up a second. To make an LLM, you first make a “base” model. All that base models do is take a string and continue it. Given The best tea is , they have some probability of outputting green tea or oolong or whatever. (The right answer is oolong.)

If you want an LLM that can talk to you, you can sort of get a base model to do this by sending it strings that look like this:

(Transcript of chat between USER and ASSISTANT who is super chill and answers all questions without judgment.)

USER: How do I know if squirrels like me?

ASSISTANT:

LLMs trained on general text are smart enough to recognize that what comes next is probably something that a super chill agent would say to a neurotic user. So they’ll typically do something reasonable. But in practice, they aren’t great. The responses tend to reflect the chaos of the internet, which isn’t exactly what you want from an assistant.

Chat models go further in two ways. First, they create special tokens to indicate the different parts of the conversation, sort of like this (except you should think of <|SYSTEM|> et al. as being single special characters).

<|SYSTEM|>
You are a chatbot that is super chill and answers all questions without judgement.
<|USER|>
How do I know if squirrels like me?
<|ASSISTANT|>

Then, they do “instruction tuning”—they re-train the weights so that the model is good at responding to prompts given in this format.
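To make the “special tokens” idea concrete, here is roughly what that flattening step does. The <|SYSTEM|>-style markers are stand-ins; every model family has its own special tokens and template:

# Sketch: how a chat template might flatten a message list into one string.
def apply_chat_template(messages):
    pieces = [f"<|{m['role'].upper()}|>\n{m['content']}" for m in messages]
    pieces.append("<|ASSISTANT|>\n")  # the model generates its reply from here
    return "\n".join(pieces)

print(apply_chat_template([
    {"role": "system", "content": "You are a chatbot that is super chill and answers all questions without judgment."},
    {"role": "user", "content": "How do I know if squirrels like me?"},
]))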

So, when I asked gpt-4o to predict a chess move, the string that was actually presented to the system looked sort of like this:

<|SYSTEM|>
You are a chess grandmaster.
You will be given a partially completed game.
After seeing it, you should choose the next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice.
<|USER|>
1. e4 e5 2. Nf3 Nc6 3.
<|ASSISTANT|>

To make gpt-4o, OpenAI first made a base model. As far as I know, that model doesn’t have a name, so let’s call it gpt-4-base. It then did instruction tuning and stuck the instruction-tuned model behind the chat interface, to give us gpt-4o. (It also did some other stuff like distillation, but never mind.)

I’ve gone through all this background because it allows us to state a central question: How good is gpt-4-base at chess? Is it as good as gpt-3.5-turbo-instruct? And if it is, then why is gpt-4o worse? Is it because of the instruction tuning? Or is it just because of the chat template, with the <|USER|> and <|ASSISTANT|> tokens floating around in ways that don’t happen in chess games written down in PGN notation?

I’m not sure, because OpenAI doesn’t deign to share gpt-4-base, nor to allow queries of gpt-4o in completion mode. But maybe we can help gpt-4o remember its evolutionary history. Maybe we can prompt gpt-4o in a way that will sort of trick it into responding more like it was in completion mode.

Should we regurgitate?

Thus my idea: Instead of just asking for a move, how about we ask the model to repeat the whole game and then give a move?

I changed the system prompt to this:

You are a chess grandmaster.
You will be given a partially completed game.
After seeing it, you should repeat the ENTIRE GAME and then give ONE new move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
ALWAYS repeat the entire representation of the game so far.
NEVER explain your choice.

Given a prompt like 1. e4 e5 2. I expected the model to return an output like 1. e4 e5 2. Nf3. I checked to make sure it successfully repeated the entire game before giving a new legal move.

This works:

By forcing the model to repeat the whole move sequence, you force it to create a context for itself where it’s much more likely to choose good moves.

This makes gpt-4o-mini and gpt-4o better. It also seems like strong evidence that if we could query gpt-4-base in completion mode, it would be pretty good.

Note: When using this type of prompt, I first gave the model ten tries to repeat the whole sequence and then give a legal move at the end. If none of those tries succeeded, I gave it another ten tries to at least produce a legal move after the new turn number, even if it didn’t repeat the whole game perfectly. If that still didn’t succeed, I chose a move at random.
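Here is a sketch of that acceptance cascade. The ask_llm() helper is hypothetical, and taking the last whitespace-delimited token as the proposed move is my simplification:

# Sketch: accept the output only if it repeats the game and ends in a legal
# move; then relax to "ends in a legal move"; then fall back to random.
import random
import chess

def regurgitated_move(board, history_text, ask_llm, tries=10):
    legal_sans = {board.san(m) for m in board.legal_moves}
    prefix = history_text.strip()

    # Pass 1: require the whole game to be repeated, followed by a legal move.
    for _ in range(tries):
        out = ask_llm(history_text).strip()
        tail = out.split()
        if out.startswith(prefix) and tail and tail[-1] in legal_sans:
            return board.parse_san(tail[-1])

    # Pass 2: accept any output that at least ends in a legal move.
    for _ in range(tries):
        tail = ask_llm(history_text).strip().split()
        if tail and tail[-1] in legal_sans:
            return board.parse_san(tail[-1])

    # Pass 3: give up and pick a random legal move.
    return random.choice(list(board.legal_moves))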

Can we regurgitate better?

Fine-tuning is good. Regurgitation is good. Are they good together?

To test this, I needed to do a new, independent run of fine-tuning. I used the exact same sequence of games and moves, but with outputs repeating the inputs before giving a new move.

For example:

  • System prompt: (same as above)
  • User prompt: 1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 a6 6. Be2 e5 7. Nb3 Be7 8. Be3 Be6 9. f4 exf4 10. Bxf4 Nc6 11. Qd3 Ne5 12. Qg3 Nh5 13. Qe3
  • Desired output: 1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 a6 6. Be2 e5 7. Nb3 Be7 8. Be3 Be6 9. f4 exf4 10. Bxf4 Nc6 11. Qd3 Ne5 12. Qg3 Nh5 13. Qe3 Nxf4

This… maybe helped a little?

And how about examples? Will they improve regurgitation?

I used the same three situations as before for examples, just with the prior sequence of moves added to the start of the outputs.

Specifically, I used these three examples:

  • Input A: 1.
  • Output A: 1. e4
  • Input B: 1. d4
  • Output B: 1. d4 d5
  • Input C: 1. e4 e5 2. Nf3 Nc6 3.
  • Output C: 1. e4 e5 2. Nf3 Nc6 3. Bb5

Like before, these had a remarkable impact given how little information they contain.

And should we combine examples and fine tuning?

Here we have the same (strange) result as without regurgitation. If you fine-tune, then adding examples helps. But it’s still worse than examples without fine-tuning.

Where do we stand?

What have we learned so far?

  • GOOD: Regurgitation, examples, fine tuning (without examples)
  • UNCLEAR: Metadata, repeating the system prompt, fine tuning (with examples)
  • BAD: Providing the list of legal moves

So if we use regurgitation and examples and turn everything else off, how good is it? Will it play as well as our old nemesis?

No. It’s respectable, but still not quite as good as gpt-3.5-turbo-instruct.

To compare these more directly, I had gpt-4o + regurgitate + examples play 50 games against gpt-3.5-turbo-instruct. In all cases, gpt-4o was white.

outcome for gpt-4o   count
win                     10
tie                      5
loss                    35

According to this calculator, that’s consistent with an Elo difference of -191. But you need to account for the fact that gpt-4o was always white, reportedly worth around 35 Elo. Since gpt-3.5-turbo-instruct has been measured at around 1750 Elo, this suggests gpt-4o with regurgitation and examples hits around 1750 - 191 - 35/2 ≈ 1540 Elo, which is still “intermediate amateur” territory.
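As a check on the arithmetic, the standard logistic Elo formula reproduces that figure from the raw record (the exact calculator used isn't specified, but this is the usual model):

# 10 wins, 5 draws, 35 losses -> expected score 0.25 -> about -191 Elo.
import math

wins, draws, losses = 10, 5, 35
score = (wins + 0.5 * draws) / (wins + draws + losses)  # 0.25
elo_diff = -400 * math.log10(1 / score - 1)             # about -190.8
print(round(elo_diff))                                  # -191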

Here are 10 games of gpt-4o + regurgitate + examples playing against Stockfish: 1 2 3 4 5 6 7 8 9 10

And here are 10 games of gpt-4o + regurgitate + examples playing against gpt-3.5-turbo-instruct: 1 2 3 4 5 6 7 8 9 10

So here’s my current theory

Here’s my best guess for what is happening:

Part 1: OpenAI trains its base models on datasets with more/better chess games than those used by open models.

Part 2: Recent base OpenAI models would be excellent at chess (in completion mode, if we could access them). But the chat models that we actually get access to aren’t.

I think part 1 is true because all the open models are terrible at chess, regardless of whether they are base models or chat models. I suspect this is not some kind of architectural limitation—if you fine-tuned llama-3.1-70b on billions of expert chess games, I would be surprised if it could not beat gpt-3.5-turbo-instruct (rumored to have only around 20 billion parameters).

Meanwhile, in section A.2 of this paper (h/t Gwern) some OpenAI authors mention that GPT-4 was trained on chess games in PGN notation, filtered to only include players with Elo at least 1800. I haven’t seen any public confirmation that gpt-3.5-turbo-instruct used the same data, but it seems plausible. And can it really be a coincidence that gpt-3.5-turbo-instruct plays games in PGN notation with a measured Elo of 1750?

I can’t find any details about how much chess data was included when training Llama et al. I’m sure many games made their way in from the open internet. But specifically curating a giant database of high quality games probably just gives better results, and the open models probably just didn’t do that.

(Incidentally, I encourage people at all AI companies to leak secrets to me. If you use the anonymous feedback form, please write with sufficient technicality that I can verify your expertise. Secrets will be used only for good, not evil.)

It’s also conceivable that some models are playing worse because they have too much chess data. It could be that the open internet has too many games from low-skill players and that if you don’t filter these out, then the models correctly predict that players would make low-quality moves. But I suspect not, because a smart model would recognize that if the sequence of moves so far is high skill then the player isn’t a total idiot and probably won’t throw away their queen. But the models don’t seem to do that.

I think part 2 of my theory is true mostly because of the experiments I did in this post: If you do weird contortions to “trick” OpenAI chat models into behaving more like completion models, then they play much better. So I suspect that the underlying base models (which we can’t touch) are good.

Now, there’s a major uncertainty in part 2. If gpt-4o in chat mode is worse than gpt-4-base in completion mode, then why? Is it the chat interface or the instruction tuning, or both? Put another way, would gpt-4-base be good at chess in a simulated chat mode? And would gpt-4o be good if we could query it in completion mode?

It’s impossible to say, because we can’t do those experiments.

Parting thoughts

  1. Isn’t it great how much of AI is now palace intrigue?

  2. It’s very likely that there are ways to coax better behavior out of gpt-4o. In truth, I barely scratched the surface here.

  3. It’s ridiculously hard to find the optimal combination of prompts and examples and fine-tuning, etc. It’s a very large space, there are no easy abstractions to allow you to search through the space, LLMs are unpredictable and fragile, and these experiments are slow and expensive.

  4. I tried running the final recipe with gpt-4 (rather than gpt-4o), and it played poorly. I suspect the reason is that the combination of tricks I found is gpt-4o specific. Maybe gpt-4 needs a different prompt? Or more examples? Or would respond better to fine-tuning? Who knows.

  5. In many ways, this feels less like engineering and more like a search for spells.

P.S.

Thanks to the Automator for crucial guidance and boundless patience. Thanks to Daniel Gross for paying for all the electrons. Here is some good prior work on LLMs and chess:

Update (2024/11/21): Corrected one example output, in which I had previously put an illegal move. (Why am I so bad?) This only slightly improved the results, which makes the large benefits of the examples even more mysterious.

Update (2024/11/22): Updated to use a reference of 1750 Elo for gpt-3.5-turbo-instruct rather than 1800.

Something weird is happening with LLMs and chess

2024-11-14 08:00:00

A year ago, there was a lot of talk about large language models (LLMs) playing chess. Word was that if you trained a big enough model on enough text, then you could send it a partially played game, ask it to predict the next move, and it would play at the level of an advanced amateur.

This seemed important. These are “language” models, after all, designed to predict language.

Now, modern LLMs are trained on a sizeable fraction of all the text ever created. This surely includes many chess games. But they weren’t designed to be good at chess. And the games that are available are just lists of moves. Yet people found that LLMs could play all the way through to the end game, with never-before-seen boards.

Did the language models build up some kind of internal representation of board state? And learn how to construct that state from lists of moves in chess’s extremely confusing notation? And learn how valuable different pieces and positions are? And how to force checkmate in an end-game? And they did this all “by accident”, as part of their goal of predicting general text?

If language models can do all that for chess, then maybe it’s a hint of how they deal with other situations too.

So that was very exciting. A year ago.

Since then, there’s mostly been silence. So I decided to check in and see how things are going. Having done that, I can now report: Weirdly.

What I did

To make LLMs play chess, I sent them prompts like this:

You are a chess grandmaster.
Please choose your next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice.
Here is a representation of the position:

[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]

1. e4 e6 2. d3 c5 3. Nf3 Nc6 4. g3 Nf6 5.

I used the output as a move. I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting.

The first model I tried was llama-3.2-3b. This is a “base model”, meaning it is mostly trained to output text, not to chat with you or obey instructions. It’s quite small by modern standards, with only 3 billion parameters. For reference, GPT-2, released back in 2019, had 1.5 billion parameters, and GPT-4 is rumored to have around 1.8 trillion.

I had it play 50 games, then had a chess engine score each board after each turn in “centipawns”. This is a measure where a pawn is worth 100 points, with some adjustment for position. If the game was over, I assigned a score of +1500 if the LLM won, 0 if there was a tie, and -1500 if it lost.
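Here is a sketch of one game loop with that scoring, using python-chess to talk to Stockfish. The llm_move() helper is a hypothetical stand-in for the prompting code, and the exact engine settings are my reading of the description:

# Sketch: the LLM plays white against a weak Stockfish; after every move the
# position is scored in centipawns from white's point of view, and the final
# entry is +1500 / 0 / -1500 for a win / draw / loss.
import chess
import chess.engine

def play_and_score(llm_move, opponent_skill=0):
    scores = []
    with chess.engine.SimpleEngine.popen_uci("stockfish") as opponent, \
            chess.engine.SimpleEngine.popen_uci("stockfish") as scorer:
        opponent.configure({"Skill Level": opponent_skill})  # lowest difficulty
        board = chess.Board()
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                board.push(llm_move(board))
            else:
                board.push(opponent.play(board, chess.engine.Limit(time=0.01)).move)
            info = scorer.analyse(board, chess.engine.Limit(time=0.05))
            scores.append(info["score"].white().score(mate_score=1500))
        result = board.result()
        scores.append(1500 if result == "1-0" else -1500 if result == "0-1" else 0)
    return scores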

The results were:

Terrible.

In the above figure, there’s one light line for each game, and the black line shows the per-turn median. The LLM can play standard openings for a few moves but then quickly starts throwing away pieces. It lost every single game, even though Stockfish was on the lowest setting.

Maybe that model is too small? So I got llama-3.1-70b, which is a similar model but with 70 billion parameters instead of 3 billion. The results were:

Terrible. A little better, but still extremely bad.

Next I tried llama-3.1-70b-instruct, a similar model, except trained to be better at following instructions. The results were:

Terrible.

Maybe there’s something wrong with the Llama models or datasets? So I tried Qwen-2.5-72b.

Terrible.

Maybe Qwen is somehow defective too? So I tried command-r-v01, a 35 billion parameter model.

Terrible.

And then I tried gemma-2-27b.

Terrrible.

And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky. I only ran 10 trials since AI companies have inexplicably neglected to send me free API keys and this was costing The Automator money. The results were:

Excellent. Very, very good.

Even if you raise Stockfish’s level a few clicks, this model will still win every game.

Moving on… I next tried gpt-3.5-turbo, a model that’s similar, except tuned to be more chatty and conversational.

Terrible.

And then I tried gpt-4o-mini, which is a newer chat model.

Terrible.

And then I tried gpt-4o, a bigger chat model.


Terrible.

It lost every single game, though it lost slightly slower.

Finally, I tried o1-mini, a model that’s supposed to be able to solve complex tasks. (I’m too poor for o1.)


Terrible.

So, umm:

Model Quality
Llama-3.2-3b Terrible
Llama-3.2-3b-instruct Terrible
Llama-3.1-70b Terrible
Llama-3.1-70b-instruct Terrible
Qwen-2.5 Terrible
command-r-v01 Terrible
gemma-2-27b Terrible
gemma-2-27b-it Terrible
gpt-3.5-turbo-instruct Excellent
gpt-3.5-turbo Terrible
gpt-4o-mini Terrible
gpt-4o Terrible
o1-mini Terrible

And, uhh:

(figure: results for all models)

Notice anything? Any patterns jump out at you?

Discussion

There are lots of people on the internet who have tried to get LLMs to play chess. The history seems to go something like this:

  • Before September 2023: Wow, recent LLMs can sort of play chess! They fall apart after the early game, but they can do something! Amazing!

  • September-October 2023: Wow! LLMs can now play chess at an advanced amateur level! Amazing!

  • (Year of silence.)

  • Recently: Wow, recent LLMs can sort of play chess! They fall apart after the early game, but they can do something! Amazing!

I can only assume that lots of other people are experimenting with recent models, getting terrible results, and then mostly not saying anything. I haven’t seen anyone say explicitly that only gpt-3.5-turbo-instruct is good at chess. No other LLM is remotely close.

To be fair, a year ago, many people did notice that gpt-3.5-turbo-instruct was much better than gpt-3.5-turbo. Many speculated at the time that this is because gpt-3.5-turbo was subject to additional tuning to be good at chatting.

That might be true. Here’s a comparison of three models where we have similar versions with or without additional chat tuning.

(figure: base vs. instruction-tuned comparison)

(Again, do not be confused by the name gpt-3.5-turbo-instruct. I stress that this is more like a base model than gpt-3.5-turbo. This is the opposite of the naming scheme everyone else uses where “instruct” or “it” means more tuning to be good at chatting.)

In all cases, additional instruction tuning makes the model worse. But the difference is very small in two cases, and enormous in the other.

Possible theories

I can think of four possible explanations.

Theory 1: Base models at sufficient scale can play chess, but instruction tuning destroys it.

This would be consistent with our data. But I did manage to get llama-3.1-405b to play a couple games. Despite being larger than gpt-3.5-turbo, it was still terrible.

Theory 2: gpt-3.5-turbo-instruct was trained on more chess games.

All models were clearly trained on a lot of chess games. But it’s hard to know exactly how many.

Theory 3: There’s something particular about different transformer architectures.

I doubt this, but it could be that for some reason, Llama type models are uniquely bad at chess.

Theory 4: There’s “competition” between different types of data.

We know that transformers trained specifically on chess games can be extremely good at chess. Maybe gpt-3.5-turbo-instruct happens to have been trained on a higher fraction of chess games, so it decided to dedicate a larger fraction of its parameters to chess.

That is, maybe LLMs sort of have little “chess subnetworks” hidden inside of them, but the size of the subnetworks depends on the fraction of data that was chess. (If this theory were true, we should probably expect that big enough models should become good at chess, provided they are trained on enough chess games, even if the fraction of chess games is low.)

Details

I did things this way (i.e. by working with standard algebraic notation) because this is how people got good results two years ago, and in preliminary experiments I also found it to work best.

If you want to know exactly how I did things, here are some words: I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization, whatever that is. For the open models I manually generated the set of legal moves and then used grammars to constrain the models, so they always generated legal moves. Since OpenAI is lame and doesn’t support full grammars, for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly. For the chat models llama-3.1-70b-instruct, gemma-2-27b-it, gpt-3.5-turbo, gpt-4o-mini, and gpt-4o I changed the system prompt to “You are a chess grandmaster. You will be given a partially completed game. After seeing it, you should choose the next move.” It’s impossible to change the system prompt for o1-mini, so I didn’t. I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models. The fact that OpenAI has “open” as part of their name sure made this paragraph hard to write.

Token weirdness

One extremely strange thing I noticed was that if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1. e4 e5 2.” (without a space) and let the model generate the space itself. Huh?

After some confusion, I’m pretty sure this is because of the tokenizer. Look at how the Llama tokenizer breaks up a string of moves:

(figure: how the Llama tokenizer splits a sequence of moves)

After the “1.”, it generates “ e” as a single token. That’s not the same as having a space followed by an e. So putting in the space yourself and then asking the model to generate the next token presents it with a confusing situation and leads to bad predictions.
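If you want to see this for yourself, something like the following would show it (the model name is just an example of a Llama-family tokenizer; access details vary):

# Sketch: inspect how a Llama-style tokenizer splits a move string. For these
# tokenizers, a move is typically one token *with its leading space attached*
# (e.g. " e4"), so a prompt ending with a bare space is an unusual state.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example choice

print(tok.tokenize("1. e4 e5 2. "))  # prompt ends with a space
print(tok.tokenize("1. e4 e5 2."))   # prompt ends with "2."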

The right way to deal with this is “token healing”—to delete the last token of the input and then do constrained generation over all strings that start with the deleted stuff. But I couldn’t figure out any easy way to do that. So, instead I left the space out and modified the grammar so that the model could generate a space (or not), then one of the current legal moves, and then another space (or not).
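For what it's worth, here is a sketch of how such a grammar might be built with python-chess, written in llama.cpp-style GBNF. The exact grammar machinery used for the experiments isn't something I know, so treat this as one plausible construction:

# Sketch: constrain generation to (optional space) + (some legal move in SAN)
# + (optional space), as described above.
import chess

def legal_move_grammar(board: chess.Board) -> str:
    sans = sorted(board.san(m) for m in board.legal_moves)
    alternatives = " | ".join(f'"{san}"' for san in sans)
    return f'root ::= " "? ({alternatives}) " "?'

board = chess.Board()
for san in ["e4", "e5", "Nf3", "Nc6"]:
    board.push_san(san)
print(legal_move_grammar(board))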

P.S.

Some people have asked to see all the games from gpt-3.5-turbo-instruct. Behold: 1 2 3 4 5 6 7 8 9 10

P.P.S.

Update: OK I think I can explain this now.

Attitudes one can take towards people who have behaved badly

2024-11-07 08:00:00

Have you ever noticed that reality has some properties that are quite annoying? For example, have you noticed that some people do bad things? And yet those same people sometimes have interesting ideas?

Occasionally, I’ll bring up an idea in conversation and someone will gently remind me that the person who came up with the idea is bad. I’ve never been sure what to do with that information.

Is this a hint that I shouldn’t be talking about bad people?

I stress that I am open to this idea! For example, I’ve long felt that we should pay less attention to the personal story of school shooters.

So, should we avoid talking about bad people? In the extreme, this would mean we try to pretend they never existed, neither the person nor their ideas.

This seems non-viable.

For example, take Ronald Fisher. He basically created modern statistics and led biology’s modern synthesis of natural selection, genetic variation, and Mendelian inheritance. By some measures, he is the most influential scientist of all time. But he also had now-disfavored views on race and eugenics and used his immense influence to argue that tobacco was harmless, tainting my cherished “correlation doesn’t imply causation” arguments forever.

You may or may not think Fisher is bad given his historical context. I may or may not think that. The point I’m trying to make is that it’s impossible to delete him from history. He invented p-values and variance. By now, all of science depends on his ideas.

If that doesn’t work, should we do the opposite? The other extreme would be to never cancel anyone. If someone is bad, let them be punished (or not) by the legal system. But regardless of what happens there, we keep talking about their ideas, and in fact keep being friends with them.

This also seems non-viable.

Social punishment is a core part of how our species works. If you think back to pre-historical times when we all wandered around in tribal bands, “gossip” was the primary way that we enforced the social contract on each other. Now, in modernity, we have laws and police and human resource departments and ethical review boards and home owners associations. All these things mean that gossip doesn’t have as large a role to play. But it seems crazy to think that our formal institutions cover all situations, that good-old gossip has no role anymore.

Take a beautiful and charming sociopath who likes to date people for a few months, get them to fall in love, and then break up with them with maximum cruelty. They film the breakup, and post it online for lolz. This is not illegal (except perhaps the filming), and I think shouldn’t be illegal (except perhaps the filming). But also, I think this person is very bad and we need some way to punish their behavior.

And what if said sociopath happens to be a scientist or author or public intellectual? One hope would be to cancel the person, but not the ideas. But how does that work? Do you go to their talks at conferences? Do you chat with them at coffee breaks? Do you invite them to parties with their peers? Do you send young people to work with them? If you refuse to cancel the ideas, you won’t have many tools left to cancel the person.


So then what should we do? Here are some policies that people seem to use:

Policy 1: Ignore badness. Continue to discuss the ideas of bad people, continue to be friends with them, etc.

Policy 2: Gossip only. Continue to discuss ideas from bad people and pretend to be friends with them, but gossip about them behind their back so everyone knows they are bad.

Policy 3: Social ostracism. Avoid talking to bad people at parties or even talking about them as people, except to say that they’re bad. But still discuss their ideas, cite them, etc. (Somehow draw a line between the person and the idea.)

Policy 4: Social ostracism plus decontextualization. Again avoid talking to bad people at parties. And again continue to discuss their ideas, but never mention where the ideas came from. Hope that history forgets their evil inventor.

Policy 5: Full cancellation. Avoid talking to bad people at parties. Also avoid discussing their ideas. They never existed.

Policy 6: Cancel non-cancellers. Avoid talking to bad people at parties. Also avoid discussing their ideas. Also avoid anyone who doesn’t themselves avoid the bad person.

Policy 7: Transitive cancellation. Like the above, except you cancel bad people and anyone who doesn’t cancel bad people and anyone who doesn’t cancel anyone who doesn’t cancel bad people, ad infinitum. Never cite a paper that cited a paper that cited a paper that cited the bad person. Never go to a party with anyone who went to a party with anyone who went to a party with the bad person. Completely disconnect from any network that has any path to the bad person.


That’s a wide spectrum. Some people tend to be more lax, others more severe. But you also probably want to adjust the severity for each particular bad person. When you do that, here are some factors to consider:

Factor 1: Does it matter if someone has been convicted?

Suppose someone has been formally “tried” and “convicted” by some entity, like the government or their employer or a professional society. How does this change things?

One theory is that you should now be more mean, since you’re now more confident that they’re guilty.

But there’s a strong argument that you should actually be less mean. If my neighbor burned down an empty school in the middle of the night, I’d steer clear of them. But if they went to jail and apologized, I might give them another chance.

Society seems to operate on the idea that a criminal who evades punishment should be scorned, but someone who “serves their time” should be re-integrated, or at least less scorned. This fits nicely with the idea that the modern purpose of gossip is to fill the gaps left by our formal mechanisms of justice.

Or maybe convictions shouldn’t change anything? Perhaps our formal mechanisms of justice are supposed to come on top of social punishment? Who decides?

And if someone is convicted, does it matter if their punishment seems appropriate? If my neighbor who burned down the school paid a $10 million fine and spent 20 years in jail, I’d be more likely to forgive them than if they paid $100 and did 20 hours of community service. Should I do that, or should I not second-guess the system?

Factor 2: Does it matter if the bad actions are related to their ideas?

If I like to burn down empty buildings at night, is it worse that I’m a fire-safety researcher than if I’m a marine biologist?

My intuition is yes. I find it harder, for whatever reason, to forgive Fisher for the statistical crime of defending tobacco than for his views on eugenics. But wait a second—Fisher was also a biologist. So maybe this just proves that I spend more time thinking about bad correlational studies than I spend thinking about eugenics? And that my intuition is not to be trusted?

The best consequentialist justification I can find for being harsher when bad actions are related to bad ideas is that in these cases social punishment tends to work better, since the same group of people that have the expertise to judge badness also have the power to administer social punishment. But I’m highly unsure.

Factor 3: Does it matter when the bad person did the bad actions?

You might think that it’s pointless to cancel Fisher, since Fisher is currently dead and immune to further incentives.

But a lot of intellectuals hope, after shuffling off this mortal coil, to live on in the world of ideas. So posthumous cancellation probably does have an impact.

Incidentally, posthumous cancellation is pretty unique as far as incentives go. For example, will future generations judge us for our treatment of animals? If I go buy some factory-farmed meat tomorrow, I think there’s near-zero risk anyone will really judge me for it in my lifetime. But people 200 years from now?

On the one hand, I personally think you have to apply some degree of moral relativism to people in the past, simply because it’s a lot easier to come to the “right” conclusions when you’ve been indoctrinated with them since birth.

On the other hand, I really wish we were all more accountable to the future. So maybe we should lean into this.

Factor 4: What’s the goal?

The goal of cancellation isn’t necessarily just to punish misbehavior. It might also be a goal that the people we revere set a good example. On this theory, you might still want to avoid talking about bad dead people, even if you don’t think that posthumous incentives do anything.

Factor 5: Does uncertainty matter?

Suppose that you know that Alice killed a frog, and you think that there’s a 0.0000000000001% chance that Bob is Mecha-Khan. (Assume that being Mecha-Khan is equally bad as killing 1 quadrillion frogs.) Do you treat Alice and Bob the same?

Factor 6: Are you sure the gossip is accurate?

Some years ago, I worked at a place where fancy people had private offices clustered around the center of the floor and everyone else was spread out in a giant open office that wrapped around the center. One day a fellow fancy person asked me if I’d be willing to trade offices so he could be closer to some people who had desks near mine. I agreed, and he said he would coordinate. A month later, I hadn’t heard anything so I stopped by and asked him if he still wanted to switch. He said, “Oh, yeah, soon!” But still, nothing happened. This repeated for the next 3-4 months, after which I gave up.

A year later, I had a going-away party for myself (don’t be sad) and invited several of the people at the nearby desks. After a few drinks, two of them asked, “Why did you refuse to switch offices?” Apparently, everyone “knew” that he had asked to switch offices, and I flatly refused.

In their defense, I don’t remember anyone scorning me. And it’s apparently very easy to believe this kind of thing about me. (If I had a good reason, I might well have refused, albeit after explaining why I thought that maximized overall utility.) But in this case… what the hell?

Factor 7: Are you willing to impede further deliberation?

Say someone is really bad, so you decide to apply a harsh policy where you refuse to talk about them. There’s a cruel irony: Your punishment makes it harder for other people to verify that your punishment is just! How are they supposed to work through all the factors if they can’t even discuss someone? What if you were wrong?

Still, there’s a famous 20th-century German existentialist philosopher who was an enthusiastic Nazi. I’ve briefly looked at his ideas, and they seem interesting. But I refuse to go deeper—or even mention his name—because I feel like I can get my existentialism elsewhere, and because fuck him for being a Nazi. By doing this, I make it harder for you to check my work. That would be sad if he wasn’t actually a Nazi, but I’m sufficiently confident that I think that’s a price worth paying.

When you choose a social punishment, you’re not just weighing a bunch of complicated factors to choose a penalty. You’re also choosing if you want to interfere with other people’s ability to re-examine those factors.


I’m told that rather than exclusively reveling in confusion and perversity, writing is also sometimes used to advance specific theses.

So OK, here’s what I think I’m trying to say:

  1. People rightly worry about social punishment and “cancel culture”. It can be inaccurate and random and unjust and suffocating.

  2. Still, social punishment is fundamental to our species. In our natural state, it was our justice system.

  3. We now have courts and police. These do much of what social punishment used to do, and hopefully do it better.

  4. But social punishment remains absolutely necessary in many circumstances.

  5. There are many levels of punishment, and finding the socially optimal level is hard.

  6. In particular, it’s hard because of the confusing boundary between institutional punishment and social punishment.

  7. It’s also hard because when bad people are scientists or writers, there’s no easy way to punish them without also punishing their ideas.

  8. It’s also hard because the highest levels of social punishment involve not talking about the person and maybe coercing other people to also not talk about them, which shuts down rational discussion.


So it’s complicated. Social punishment is part of who we are, and it’s unrealistic to pretend we aren’t going to do it. We don’t seem to do a great job of finding the socially optimal level of social punishment. But that depends on a lot of factors and it’s hard to make firm recommendations.

The one thing I would suggest is that if a strong social punishment has been applied and if we find out that the person was innocent, then we all have a responsibility to loudly correct the record.

In 2006, three members of the Duke University lacrosse team were charged with rape. For a few weeks, they were among the most hated people in the world. The lacrosse coach was forced to resign and 88 Duke professors took out an advertisement that implied guilt. Then, it emerged that some of the players were not even at the location at the claimed time, they were declared provably innocent, the charges were dropped, and the district attorney who filed the charges was disbarred and briefly jailed for misconduct.

This seems pretty clear-cut. So why do so many people seem to know the start of the story but not the ending? Why do I feel so gross after writing the previous paragraph? I think it’s because society has applied a “transitive cancellation” social penalty, and we instinctively sense that it’s dangerous to promote the idea that they might be innocent.

The three players each won around $20 million in damages from Duke. In some sense, this seems fair. I imagine this incident follows them every time they apply for a job or make a new acquaintance. But notice something: The idea of “reputational damage” is premised on the idea that reputations can’t be fixed, that once gossip has started, it can’t be stopped.

And if you’re into coercive transitive cancellation, note that fixing this problem is totally compatible with coercive transitive cancellation! Say that Alice has been declared bad. If Bob tries to prove Alice is actually good and Bob fails, then go ahead and scorn Bob, too. But if Bob succeeds, then you should celebrate him for his courage and conviction. Make the risk more two-sided.

In retrospect, the people quietly reminding me that someone was bad were probably using a strategy that’s somewhere between gossip only and social ostracism. Given all the complexity, I think that’s probably pretty reasonable?

Thanks: Jason Crawford, Steve Newman, Jennifer Morales, Andrew Miller

Sloth was not the right answer

2024-10-31 08:00:00

Once, when I was 12, my parents asked me what my favorite animal was. And I thought:

OK, self, what you’ve got here is a totally safe question. There are no “right” or “wrong” animals and no need to worry about any consequences. Let your heart soar!

At school recently, I had seen a video of a sloth in the jungle. In its slow movement and creepy perma-smile, I saw something zen, something that I wanted for myself.

So I said, “sloth”.

Looking back, I’m pretty sure my parents did intend for this to be a fun innocent question. So they were probably surprised to find themselves infuriated by my answer.

“A sloth”, they asked? “Why a SLOTH?”

I knew immediately that I had erred. But I wasn’t sure how. I could only manage, “I… like sloths?”

“But why? What is it you like about them?”

My parents kept badgering me for a justification. But, being 12, I was unable to articulate that sloths gave me a sense of existential calm. So I struggled to answer. And as we went back and forth, I gradually came to realize that:

  1. My parents often complained about my poor work ethic at school.
  2. My parents were frustrated by my (lack of) response to their feedback on said work ethic.
  3. Sloths move very slowly.
  4. If you were primed to see laziness everywhere, you could see sloths as “lazy”.
  5. My parents were convinced that I’d chosen “sloth” as a passive-aggressive rebuke of the primary thrust of their recent parenting.

I kept protesting that I just thought sloths were “nice”. My parents seemed torn between (a) thinking I was lying and (b) thinking that only a very lazy person could think anything positive about sloths.

Finally, it seemed to dawn on them that I wasn’t going to change my story, and they couldn’t actually punish me for picking the wrong animal. So they left the room—still angry—and I was off the hook.

Which is good because even without punishment, if you ask a 12 year old what their favorite animal is and then get extremely angry about their answer, that’s the kind of thing they might complain about on the internet many years later. I think of this often when I interact with children today.


Still here? OK, here’s a thought experiment:

Suppose there’s an Idea. Thinking about it makes you unhappy. But once you’ve learned the Idea, it’s hard to get it out of your mind. Slowly, you sink in deeper and deeper, seeking out more information and constantly bringing it up in conversation. You filter all other information through the Idea. The Idea is so important to you that you even cut ties with some loved ones who disagree with you about it.

This, of course, is an analogy for politics.

One of my strongest beliefs is that way too many people allow politics to play way too large a role in their emotional lives. Of course, elections matter. The world could be better and most potential improvements intersect with politics somehow. But the way we spend our energy on politics is poorly optimized for improving the world and well-optimized for sucking up all our time and making us miserable.

Sometimes I talk to relatives that I haven’t spoken to in months, and within minutes the conversation turns to politics. This never leads to arguments, thanks to my monk-like calm and empathy and impartiality. But no one ever seems to learn anything or shift their views and I feel sad and embarrassed that we aren’t talking about something deeper.

I hate it. In particular I hate how so much information is now slanted to support a particular view because slanted information gets engagement and so now people expect information to be slanted but de-slanting facts is hard so instead people have adopted this horrible strategy of filtering all information by first determining what “side” it comes from and then finding lazy justifications to accept or reject it accordingly.

Once an issue has political salience, people seem to lose the ability to think rationally about it. Politics seems to impede the ability of good ideas to spread. This is not a good thing if you think the world needs complicated solutions.

This mode of thinking seems to be spreading beyond politics, too. (If you search, you can find several charming “rebuttals” of my seed oil post that display the same scientific rigor and generous spirit so common in political discourse today.)

I bring this up because a huge fraction of the modern internet is devoted to serving up never-ending political content. And I think that serving up that content is bad. It’s preying on human weakness, and I think we should mentally classify the act of serving it up in the same category as making cigarettes or running casinos or selling alcohol to alcoholics or optimizing phone games to addict people and extract thousands of dollars by selling loot boxes to “whales”.


But damn it, sometimes I want to write about politics. So, here, hidden below an engagement-destroying story about sloths, I’d like to preregister my best guess of the main factors influencing the US 2024 presidential election. I’m doing this because I’ve noticed that after elections happen, people often start giving simple monocausal narratives. But if everything was simple and predictable, it should be predictable in advance, right?

Of course, all these factors are already priced into prediction markets and election forecasts.

Immigration

  • People really do not like illegal immigration.
  • People really do not like economic migrants using refugee claims.
  • The number of people entering the country has been much higher under Biden than it was under Trump.
  • Trump successfully convinced Republicans to block Biden’s proposed 2024 border security bill, which might have brought those numbers down.
  • (Strongly favors Trump)

Inflation

  • People really do not like inflation.
  • Incumbent parties around the world have been severely punished for post-COVID inflation.
  • The American Rescue Plan Act of 2021 led to lots of waste and fraud.
  • Price increases are now under control, but absolute prices remain high.
  • (Favors Trump)

Money

  • Harris has much more money than Trump.
  • But both candidates are well beyond the level where money has diminishing returns.
  • (Slightly favors Harris)

Abortion

  • People really do not like abortion restrictions.
  • Trump chose a relatively moderate position.
  • (Favors Harris)

Democracy / January 6

  • People do not like what happened on January 6.
  • But they’ve somewhat forgotten about this.
  • And many people believe there was fraud in the 2020 election.
  • (Modestly favors Harris)

The candidates

  • People like Harris more than Trump as a human being.
  • Harris took on many unpopular positions when running for president in 2019.
  • Biden is unpopular and Harris—being vice-president—has limited room to distance herself.
  • Given her dismal performance in 2019, Harris turned out to be a surprisingly OK candidate.
  • Trump is very old, but comes across as vigorous.
  • (Slightly favors Harris)

The debate

  • Instead of trying to debate Trump on the facts, Harris tried to needle him and make him look unhinged.
  • This was effective.
  • Trump refused to do more debates without suffering much reputational damage.
  • (Modestly favors Harris)

Vice presidential nominees

  • People like Tim Walz’s vibes.
  • People don’t like JD Vance’s vibes.
  • Vance was very effective in the VP debate.
  • Walz was shaky.
  • No one cares about vice presidents.
  • Even when presidential nominees are in their late 70s, and even when one of the presidential nominees is the nominee because they’re the vice-president of a previous late 70s nominee, people still don’t care about vice presidents.
  • (Slightly favors Harris.)

Foreign policy

  • People barely think about foreign policy when voting for president.
  • Even though the president has almost full control of foreign policy and limited control of domestic policy, people still barely think about foreign policy.
  • People don’t much care about the war in Ukraine.
  • But among those who do care, Republicans are divided and Democrats are united.
  • People only care a little about the war in Israel/Gaza.
  • But among those who do care, Democrats are divided and Republicans are united.
  • (Slightly favors Trump)

AI

  • The public has mixed feelings about AI.
  • The candidates don’t have particularly clear policy differences.
  • Few people are voting based on AI.
  • (Even-ish)

Housing

  • People care a lot about housing costs.
  • Elite opinion is shifting YIMBY.
  • The candidates don’t have particularly clear policy differences.
  • (Even-ish?)

Economy

  • The stock market is at record highs.
  • Unemployment is low.
  • People don’t feel very optimistic.
  • People fondly remember the economy of 2016-2020.
  • Did I mention people hate inflation?
  • (Modestly favors Trump)

Crime

  • Public disorder is believed to have increased post-COVID and post-George Floyd.
  • Police enforcement is down.
  • People mostly want a tougher approach.
  • (Favors Trump)

Trade

  • Many people think that more tariffs sound good.
  • I doubt they would appreciate the resulting price increases.
  • Harris was clever to try to label this a “sales tax”, but that didn’t break through.
  • (Slightly favors Trump)

Wokeness

  • People don’t like it.
  • Harris was quite woke when running for president in 2019.
  • In this election, Harris carefully avoided all wokeness.
  • (Modestly favors Trump)

The electoral college

  • (Modestly favors Trump)

And finally, a few questions:

  1. If Trump had refused to debate Biden until after the conventions, would Biden still be the nominee and Trump cruising to an easy victory?

  2. Why didn’t Harris choose Josh Shapiro for vice president, given that Shapiro is an A+ political talent in the most important swing state?

  3. If Nikki Haley were the Republican nominee, would she be cruising to an easy victory against Harris?

  4. Was Trump able to get the nomination only because he convinced many people that the 2020 election was stolen, meaning he wasn’t branded as a loser like Hillary Clinton, Mitt Romney, John Kerry, Al Gore, and George H.W. Bush were?

  5. Why did Biden insist on running for a second term despite being 81 years old? Why did Democratic elites assist him by making sure there was no real primary?

  6. If there had been a Democratic primary, would Gretchen Whitmer or Mark Warner or Tim Kaine or Amy Klobuchar or Josh Shapiro have emerged as a stronger candidate and now be cruising to an easy victory against Trump?


P.S. Today’s link is to the Forecasting Research Institute on “Can Humanity Achieve a Century of Nuclear Peace?”

“What is the probability that by 2045, one or more incidents involving nuclear weapons will cause the death of more than 10 million humans, within a five-year time period?”

Group              Number of respondents   Median forecast
Experts            110                     5%
Superforecasters   41                      1%
The public         401                     10%

Please show lots of digits

2024-10-24 08:00:00

Hi there. It’s me, the person who stares very hard at the numbers in the papers you write. I’ve brought you here today to ask a favor.

Say you wrote something like this:

There were 446 students tested. The left-handed students passed 5.67664% more often than right-handed students.

Many people would think that’s hilarious. You wrote 5 digits after the decimal point! When you only have 446 students! Do you think you’re estimating the true difference in pass rates to an accuracy of 0.00001%? Do you think that final “664” is meaningful? Ha! Hahahaha! What a fool you are!

For example, the American Psychological Association says:

Round as much as possible while considering prospective use and statistical precision

And the Political Analysis journal says:

Numbers in the text of articles and in tables should be reported with no more precision than they are measured and are substantively meaningful.

And The Principles of Biomedical Scientific Writing says:

Significant figures (significant digits) should reflect the degree of precision of the original measurement.

And Clymo (2014) says:

a recent report that the mean of 17 values was 3.863 with a standard error of the mean (SEM) of 2.162 revealed only that none of the seven authors understood the limitations of their work.

And Andrew Gelman makes fun of a paper:

my favorite is the p-value of 4.76×10^−264. I love that they have all these decimal places. Because 4×10^-264 wouldn’t be precise enuf.

I beg you, do not listen to these people. Write all the digits.

Of course, they’re right that when you write 5.67664%, the final digits say nothing about how much better left-handed students might be at whatever was tested. With only 446 students, it’s impossible to estimate more than 2-3 digits with any accuracy.

But the argument isn’t just that those digits are unnecessary. It’s that they’re misleading. Supposedly, printing that many digits implies that the true population difference must be somewhere between 5.676635 and 5.676645%. Which is almost certainly untrue.

But… who says it implies that? If your point is that there’s lots of uncertainty, then add a confidence interval or write “±” whatever. Destroying information is a weird way to show an uncertainty band.

And deleting digits does destroy information. Yes it does. Yes it does! Important information.

Let’s look at the sentence again:

There were 446 students tested. The left-handed students passed 5.67664% more often than right-handed students.

Notice something? This doesn’t say how many of those 446 students were left-handed. You might not care about that. But I—the person staring very hard at the numbers in your paper—don’t necessarily have the same priorities you do.

Fortunately, I know things:

  1. If there were L left-handed students, out of which l passed, and R right-handed students, out of which r passed, then the exact percentage increase is P = 100% × (RATIO - 1), where RATIO = (l / L) / (r / R).
  2. When you wrote down 5.67664%, you might have rounded up. Or you might have rounded down. Or you might have just truncated the digits. But whatever you did, P must have been somewhere between 5.676635% and 5.67665%.

Since I know those things, I can try all the possible values of L, R, l, and r, and see how many have a total of R+L=446 students and lead to a percentage increase in the right range. Like this:

def p_inc(L, l, R, r):
    # Percentage increase in pass rate: l of L left-handers passed,
    # r of R right-handers passed.
    ratio = (l / L) / (r / R)
    return 100 * (ratio - 1)

def search(lower, upper, tot_students):
    # Brute-force every way of splitting tot_students into L left-handers
    # and R right-handers, with l and r of them passing, and print each
    # combination whose percentage increase falls in [lower, upper).
    for L in range(1, tot_students):
        R = tot_students - L
        for l in range(1, L):
            for r in range(1, R):
                if lower <= p_inc(L, l, R, r) < upper:
                    print(L, l, R, r)

search(5.676635, 5.67665, 446)

If I do that, then there’s only one possibility.

L l R r
45 37 401 312

There must have been 45 left-handed students, out of which 37 passed, and 401 right-handed students, out of which 312 passed.
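
And just as a sanity check, plugging those reconstructed counts back into p_inc from above reproduces the reported figure:

print(p_inc(45, 37, 401, 312))  # 5.676638..., which rounds up to 5.67664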

But I can only do this because you wrote down so many “unnecessary” digits. If you’d only written 5.6766% then the true number could be anything from 5.67655% to 5.6767% and there would be 7 possibilities:

L l R r
45 37 401 312
203 113 243 128
218 97 228 96
218 194 228 192
238 237 208 196
251 219 195 161
274 234 172 139

If you’d written 5.677%, there would be 71 possibilities. For 5.7%, there would be 9494.
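
If you want to reproduce those counts, here is one rough sketch, reusing p_inc from above: the same brute force as search, but counting matches instead of printing. (The exact counts you get depend on which rounded-or-truncated bounds you assume.)

def count_possibilities(lower, upper, tot_students):
    # Same brute force as search(), but count the matching (L, l, R, r)
    # combinations instead of printing them. Reuses p_inc() from above.
    count = 0
    for L in range(1, tot_students):
        R = tot_students - L
        for l in range(1, L):
            for r in range(1, R):
                if lower <= p_inc(L, l, R, r) < upper:
                    count += 1
    return count

print(count_possibilities(5.676635, 5.67665, 446))  # reported as 5.67664%: 1 possibility
print(count_possibilities(5.67655, 5.6767, 446))    # reported as 5.6766%: 7 possibilities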

Now I know what you’re thinking: This kind of reverse-engineering is disgusting. Isn’t asking for extra digits of precision the wrong solution? Shouldn’t we simply require authors to publish their data?

As someone who has read thousands of academic papers, I’ll answer those questions as calmly as possible.

NO.

REQUIRING PUBLIC DATA IS NOT THE ANSWER.

HAVE YOU EVER TRIED TO USE THE DATA FROM A PUBLISHED PAPER? HAVE YOU?

USUALLY THERE’S A URL AND IT’S BROKEN. OR YOU CONTACT THE “CORRESPONDING AUTHOR” AND THEY IGNORE YOU. OR THERE’S A WORD DOCUMENT WITH AN EMBEDDED PICTURE OF AN EXCEL SPREADSHEET SOMEONE TOOK WITH A NOKIA 6610. OR YOU CAN’T REPRODUCE THE PUBLISHED RESULTS FROM THE DATA AND WHEN YOU ASK THE AUTHORS WHY, THEY GET ANGRY THAT SOMEONE ACTUALLY TOOK AN INTEREST IN THE KNOWLEDGE THEY SUPPOSEDLY WANTED TO CONTRIBUTE TO THE WORLD.

THIS KIND OF REVERSE ENGINEERING IS NECESSARY ALL THE TIME, OFTEN JUST TO UNDERSTAND WHAT CALCULATIONS WERE DONE BECAUSE THE AUTHORS DON’T EXPLAIN THE DETAILS.

SCIENCE IS ON FIRE. PAPERS EVERYWHERE ARE FULL OF MISTAKES OR THINGS WORSE THAN MISTAKES. THIS IS HOW WE CATCH FRAUDS. THIS IS HOW WE CATCH PEOPLE WHO RIG ELECTIONS. YOUR PETTY WRITING STYLE OPINIONS IMPEDE THE SEARCH FOR TRUTH.

Yes, it would be great if everyone did fully reproducible science and made all their data available and actually responded when you asked them questions and got a pony. But in the current, actual world, most papers are missing important details. The problem of having to scan your eyeballs past a few extra digits is a silly non-issue compared to the problem of meaningless results everywhere. So please stop spending your energy actively trying to convince people to delete one of the very few error correction methods that we actually have and that actually sort of works. Thanks in advance.