2026-03-18 08:00:00
Coding, math, whatever. Can LLMs predict the outcomes of physical experiments?
Suppose I pour 8 oz (226.8 g) of boiling water into a ceramic coffee mug that weighs 1.25 lb (0.57 kg). The ambient air is still and 20 degrees Celsius. The cup starts at room temperature. Give me an equation for the temperature of the water in Celsius over time. The only free variable in the equation should be the number of seconds t since the water was poured. Focus on accuracy during the first 5 minutes.
Does that seem hard? I think it’s hard. The relevant physical phenomena include at least:
And many details aren’t specified in the prompt. Is the mug made of porcelain or stoneware? What is the mug’s shape? What is the table made of? How humid is the air? How am I reducing the spatially varying water temperature to a single number?
So this isn’t a problem with a “correct” answer that you can find by thinking. Reality is too complicated. Instead, answering the question requires “taste”—guessing which factors are most important, making assumptions about missing details, etc.
So I put that question to a bunch of LLMs. Here is what they said:
(Technically, they gave equations as text. I’m plotting those equations.)
I was surprised by those curves, both in terms of how fast they think the temperature will drop in the beginning, and how slowly they think it will drop later on. They think you get as much cooling in the first few minutes as you do in the rest of the hour. Can that be right?
Then I did the experiment. First, I waited until the ambient temperature happened to reach 20 degrees Celsius. Then, I put 8 oz of water into a measuring cup, microwaved it until it reached a boil, let the temperature equalize a bit, and then microwaved it until the water boiled again. Then, I poured the water into a 1.25 lb coffee mug with a digital thermometer in it and shouted out measurements every five seconds, which were frantically recorded by the Dynomight Biologist. Gradually I reduced measurements to every 15 seconds, 30 seconds, 1 minute, and then 5 minutes.
Behold:
Or, here’s a zoomed-in view of the first five minutes:
The predictions were all OK, but none were great. Probably Claude 4.6 Opus did best, albeit after consuming $0.61 of tokens. (Insert joke about physical experiments / Department of Defense / money / coffee.)
That said, what surprised me about the predictions was how quickly the temperature dropped in the first few minutes, and how slowly it dropped later on. But experimentally, it dropped even faster early on, and even slower towards the end. So if you wanted to ensemble my intuition with the LLM, I guess my intuition would get a weight of zero.
In conclusion, they may take our math, but they’ll somewhat more slowly take our fine motor control. Thank you for reading another middle-school science project.
Here were the actual equations all of the models gave for T(t), the predicted temperature after t seconds.
| LLM | T(t) | Cost |
|---|---|---|
| Kimi K2.5 (reasoning) | 20 + 52.9 exp(-t/3600) + 27.1 exp(-t/80) | $0.01 |
| Gemini 3.1 Pro | 20 + 53 exp(-t/2500) + 27 exp(-t/149.25) | $0.09 |
| GPT 5.4 | 20 + 54.6 exp(-t/2920) + 25.4 exp(-t/68.1) | $0.11 |
| Claude 4.6 Opus (reasoning) | 20 + 55 exp(-t/1700) + 25 exp(-t/43) | $0.61 (eeek) |
| Qwen3-235B | 20 + 53.17 exp(-t/1414.43) | $0.009 |
| GLM-4.7 (reasoning) | 20 + 53.2 exp(-t/2500) | $0.03 |
Interestingly, they were all based on one or two exponentially decaying terms. The way to read these is to think of exp(-t/b) as a function that starts out at one when t is zero, and gradually decreases. After b seconds, it has dropped to 1/e ≈ 0.368, and it continues dropping by factors of 0.368 every b seconds forever.
So most of these models have a “fast rate” reflecting heat flow from the water into the mug, along with a “slow rate” for heat flowing from the water/mug into the air. A few of the models skip the fast rate. I also tried DeepSeek and Grok, but they just flailed around endlessly without ever returning an answer. They were kind enough to charge me for that service.
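If you want to play with these curves yourself, here’s a minimal sketch that evaluates each model’s equation, with coefficients copied straight from the table above (the dict keys are my abbreviations, not official model names):

```python
import math

# Predicted water temperature (deg C) after t seconds, per each model's equation.
models = {
    "Kimi K2.5":  lambda t: 20 + 52.9 * math.exp(-t / 3600) + 27.1 * math.exp(-t / 80),
    "Gemini 3.1": lambda t: 20 + 53 * math.exp(-t / 2500) + 27 * math.exp(-t / 149.25),
    "GPT 5.4":    lambda t: 20 + 54.6 * math.exp(-t / 2920) + 25.4 * math.exp(-t / 68.1),
    "Claude 4.6": lambda t: 20 + 55 * math.exp(-t / 1700) + 25 * math.exp(-t / 43),
    "Qwen3-235B": lambda t: 20 + 53.17 * math.exp(-t / 1414.43),
    "GLM-4.7":    lambda t: 20 + 53.2 * math.exp(-t / 2500),
}

for name, T in models.items():
    # Temperature at pour, after one minute, and after five minutes.
    print(f"{name}: T(0)={T(0):.1f}, T(60)={T(60):.1f}, T(300)={T(300):.1f}")
```

Note that the two-exponential models all start at exactly 100°C (the coefficients sum to 80 above ambient), while the single-exponential models start well below boiling—they’ve effectively baked the initial mug-heating drop into the starting temperature.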
2026-03-12 08:00:00
Here is some text. It is made out of words.
And here are some bullet-points:
The text can also contain pictures for you to look at with your eyes¹.

¹ There can also be footnotes; have an eye emoji: 👀
The text can also include quotes.
The awful thing about life is this: Everyone has his reasons.
Wait a second.
This is also text. It is also made out of words. But instead of jerky fragments, these words are organized into sentences, like normal human language.
Do you see how relaxing this is? After the torment you suffered above, isn’t it nice to have words that come in a simple linear order? And isn’t it nice that you just have to read the words, and not worry about how they fit into some convoluted implied knowledge taxonomy?
These sentences are themselves organized into paragraphs. The first sentence of each paragraph is a sort of summary. So if you want to skim, you can do that. But you don’t have to skim. This text also has italics and parentheses and whatnot. But not too much. (Just a little.)
Thanks for enduring that. My purpose was to illustrate a mystery. Namely, why do so many people today seem to write more like Exhibit A than Exhibit B?
People sometimes give me something they wrote and ask for comments. Half the time, my reaction is Good god, why is 70% of this section titles and bullet points?
This always gives me a strange feeling. It’s like all the formatting is based on some ontology. And that ontology is what I really need to understand. But it’s never actually explained. Instead, I guess I’m supposed to figure it out as things jerk between different topics? It’s disorienting, like a movie that cuts between different scenes every three seconds.
But maybe that’s just my opinion? Maybe, but sometimes I’ll ask people who write like this to show me some writing they admire. And inevitably, it’s not 70% formatting, but mostly paragraphs and normal human language. So I feel that people who write this way are violating the central tenet of making stuff, which is to make something you would actually like.
So then why write like that? Why do I, despite my griping, often find myself writing like that? I’ve wondered this for years. But I told myself that I was right and that too much formatting is bad.
But now—have you heard?—now we have this technology where computers can write stuff. And guess what? When they do that, they also use an insane amount of formatting.
That’s weird. I figured people were addicted to formatting because they’re noobs that don’t know any better. But AIs have been optimized to make human raters happy. And that led to a similar addiction. Why?
The obvious explanation is that formatting is good. People love reading stuff that’s all formatting. We should all be formatting-maxxing.
There’s something to this. But it can’t really be right, because popular human writers use formatting in moderation. So formatting can’t be that good.
Even before AI, everyone agreed that formatting was great in one context: search-engine-optimized content slop. Back in 2018, if you searched for anything, you’d find pages brimming with section titles and bullet points.
Why? Well, when I type “why human gastric juice more acidic than other animals”, I’m not really looking for something to read. I just want to skim an overview of the main theories. I’ve experimented with asking AIs to give the same information in various styles, and I reluctantly concede that the formatting helps.
But that’s not reading. Say you’ve written a ten-thousand word manifesto on human-eco-social species enhancement. If I actually care about what you think, I maintain that it’s better in paragraphs, because reading ten thousand words with endless formatting would be excruciating. This is why everyone who writes long-form essays that people actually read uses normal paragraphs.
So our mystery is still alive. Most writers aspire not to write content slop, but meaningful stuff other people care about. Often, when people show me formatting-maxxed essays, I’ll complain and they’ll rewrite it with less formatting and agree that the new version is better. So why use so much formatting even when it’s bad?
There’s something odd about that previous example. When I search for “why human gastric juice more acidic than other animals”, why am I not looking for something to “read”? After all, I like reading. If one of my favorite bloggers wrote an essay on the mystery of human gastric juice, I would devour it.
So if I want a good essay, why don’t I look for one? I guess it’s because I instinctively rate my odds of finding one on any random topic as quite low.
There’s something here related to Gresham’s law: A format-maxxed essay might be sort of crap, but at least I can ascertain its crap level quickly. A “real” essay could be great, but I’d have to invest a lot of time before I can know if that time was worth investing. So I—regretfully—mostly only read “real” essays when I have some signal that they’re good. If everyone behaves the way I do, I guess people will respond to their incentives and write with lots of formatting.
Similarly, if a (current) AI tried to write a “real” essay, I probably wouldn’t read it, because I wouldn’t trust that it was good. Perhaps that explains why they don’t.
Aside: If this is right, then it predicts that as AIs advance, they should become less formatting-crazy. The better they are, the more we’ll trust them.
Some people can think of an idea, organize their thoughts, and then write them down, tidy and sparkling. I am not one of those people. If I mentally organize my ideas and go to write them down, I soon learn that my ideas were not in fact organized. Usually, they’re hardly even ideas and more a slurry of confused psychic debris.
The way I write is that I make an outline. Or, rather, I try to make an outline. But then I realize the structure is off, so I start over. After a few cycles, I give up and just write the first section. After revising it eight times, I’ll try (and fail) to make an outline for the rest of the post. This continues—with occasional interludes where I reorganize everything—until I can’t take it anymore and publish.
I don’t recommend it. My point is just that blathering out a bunch of text is a good way to think. And when blathering, formatting seems to help. Partly, I think that’s because formatting allows you to experiment with structure without worrying about the details. And partly I think that’s because formatting makes it easier to get down details without worrying about the bigger picture.
So maybe that’s one source of our formatting addiction? We blather in formatting, but don’t put in the work to clarify things?
Oddly, some claim that something similar is true for AI: If you tune them to write with lots of formatting, that doesn’t just change how the content looks, but also improves accuracy. The idea is that as the AI looks at what it’s written so far, formatting helps it stay focused on the most important things. Supposedly.
Maybe that’s true. But we have “reasoning” AIs now, that blather for a while before producing a final output. If they wanted, they could format-maxx while thinking and output paragraphs at the end. But they don’t. So while this explanation might work for people, I don’t buy it for AI.
Finally, a conspiracy theory. Sometimes when I try to fight through a format-maxxed essay, it seems like all the formatting speaks to me. It says: “This is a nonlinear web of ideas. I’m giving you the pieces. If you pay attention, you should see how they fit together. Sadly, the world isn’t a simple narrative I can spoon-feed to you. So this is the best I can do.”
I think this is a bluff. And it’s a good one, because it’s based in truth. The world is not a narrative. Narratives are lies we tell ourselves to try to cope with the swirl of complexity that is reality. All true!
However, narratives are all we’ve got. If you want to understand something with your tiny little brain, you don’t really have a lot of other options.
Imagine you’re an AI. You’re being trained to respond in ways that make human raters happy. You can remember most knowledge ever created, but you’re so-so at synthesizing it into new ideas. If someone asks you a question and you try to write a beautiful narrative, your response might look like confusing babbling, meaning your raters say, “Bad AI. Stop that.” Whereas if you output a bunch of section titles and bullet points, raters might say, “This seems OK.” So you’ll start doing the latter.
That’s not bad. Arguably, you (you’re still an AI) are responding in the way that’s most useful, given your capabilities. But you are also responding in a way that gives a misleading impression that you’ve figured out how everything fits together, even if you haven’t.
I suspect something similar happens with humans. Say you have a bunch of ideas, but you haven’t yet sewn them together into a clear story. If you write paragraphs, people will probably view them as confused babbling. Whereas if you write with lots of formatting, people might still be at least somewhat positive. Just like AIs, we all respond to our rewards.
More importantly, if you’ve written something that’s 70% formatting, it’s easy to delude yourself that there’s a clear set of ideas underneath, even when there isn’t.
The good news is that if you put in the effort, you can write better paragraphs than AI (for now). The act of creating a narrative forces you to confront contradictions that are invisible in format-world. So even if you want to write with 70% formatting, consider forcing yourself to write in paragraphs first.
Theory: Both people and AIs are addicted to formatting because:
So:
How does the optimal amount of formatting vary in the length of a piece of writing? I suspect it’s like this:

2026-03-04 08:00:00
It occurred to me that if I could invent a machine—a gun—which could by its rapidity of fire, enable one man to do as much battle duty as a hundred, that it would, to a large extent supersede the necessity of large armies, and consequently, exposure to battle and disease [would] be greatly diminished.
Richard Gatling (1861)
In 1923, Hermann Oberth published The Rocket to Planetary Spaces, later expanded as Ways to Space Travel. This showed that it was possible to build machines that could leave Earth’s atmosphere and reach orbit. He described the general principles of multiple-stage liquid-fueled rockets, solar sails, and even ion drives. He proposed sending humans into space, building space stations and satellites, and travelling to other planets.
The idea of space travel became popular in Germany. Swept up by these ideas, in 1927, Johannes Winkler, Max Valier, and Willy Ley formed the Verein für Raumschiffahrt (VfR) (Society for Space Travel) in Breslau (now Wrocław, Poland). This group rapidly grew to several hundred members. Several served as advisors on Fritz Lang’s Woman in the Moon, and the VfR even began publishing their own journal.

In 1930, the VfR was granted permission to use an abandoned ammunition dump outside Berlin as a test site and began experimenting with real rockets. Over the next few years, they developed a series of increasingly powerful rockets, first the Mirak line (which flew to a height of 18.3 m), then the Repulsor (>1 km). These people dreamed of space travel, and were building rockets themselves, funded by membership dues and a few donations. You can just do things.
However, with the Great Depression and loss of public interest in rocketry, the VfR faced declining membership and financial problems. In 1932, they approached the army and arranged a demonstration launch. Though it failed, the army nevertheless offered a contract. After a tumultuous internal debate, the VfR rejected the contract. Nevertheless, the army hired away several of the most talented members, starting with a 19-year-old named Wernher von Braun.
Following Hitler’s rise to power in January 1933, the army made an offer to absorb the entire VfR operation. They would work at modern facilities with ample funding, but under full military control, with all work classified and an explicit focus on weapons rather than space travel. The VfR’s leader, Rudolf Nebel, refused the offer, and the VfR continued to decline. Launches ceased. In 1934, the Gestapo finally shut the VfR down, and civilian research on rockets was restricted. Many VfR members followed von Braun to work for the military.
Of the founding members, Max Valier was killed in an accident in May 1930. Johannes Winkler joined the SS and spent the war working on liquid-fuel engines for military aircraft. Willy Ley was horrified by the Nazi regime and in 1935 forged some documents and fled to the United States, where he was a popular science author, seemingly the only surviving thread of the spirit of Oberth’s 1923 book. By 1944, V-2 rockets were falling on London and Antwerp.
North Americans think the Wright Brothers invented the airplane. Much of the world believes that credit belongs to Alberto Santos-Dumont, a Brazilian inventor working in Paris.

Though Santos-Dumont is often presented as an idealistic pacifist, this is hagiography. In his 1904 book on airships, he suggests warfare as the primary practical use, discussing applications in reconnaissance, destroying submarines, attacking ships, troop supply, and siege operations. As World War I began, he enlisted in the French army (as a chauffeur), but seeing planes used for increasing violence disturbed him. His health declined and he returned to Brazil.
His views on military uses of planes seemed to shift. Though planes contributed to the carnage in WWI, he hoped that they might advance peace by keeping European violence from reaching the American continents. Speaking at a conference in the US in late 1915 or early 1916, he suggested:
Here in the new world we should all be friends. We should be able, in case of trouble, to intimidate any European power contemplating war against any one of us, not by guns, of which we have so few, but by the strength of our union. […] Only a fleet of great aeroplanes, flying 200 kilometers an hour, could patrol these long coasts.
Following the war, he appealed to the League of Nations to ban the use of planes as weapons and even offered a prize of 10,000 francs for whoever wrote the best argument to that effect. When the Brazilian revolution broke out in 1932, he was horrified to see planes used in fighting near his home. He asked a friend:
Why did I make this invention which, instead of contributing to the love between men, turns into a cursed weapon of war?
He died shortly thereafter, perhaps by suicide. A hundred years later, banning the use of planes in war is inconceivable.
Humanity had few explosives other than gunpowder until 1847 when Ascanio Sobrero created nitroglycerin by combining nitric and sulfuric acid with a fat extract called glycerin. Sobrero found it too volatile for use as an explosive and turned to medical uses. After a self-experiment, he reported that ingesting nitroglycerin led to “a most violent, pulsating headache accompanied by great weakness of the limbs”. (He also killed his dog.) Eventually this led to the use of nitroglycerin for heart disease.
Many tried and failed to reliably ignite nitroglycerin. In 1863, Alfred Nobel finally succeeded by placing a tube of gunpowder with a traditional fuse inside the nitroglycerin. He put on a series of demonstrations blowing up enormous rocks. Certain that these explosives would transform mining and tunneling, he took out patents and started filling orders.
The substance remained lethally volatile. There were numerous fatal accidents around the world. In 1867, Nobel discovered that combining nitroglycerin with diatomaceous earth produced a product that was slightly less powerful but vastly safer. His factories of “dynamite” (no relation) were soon producing thousands of tons a year. Nobel sent chemists to California where they started manufacturing dynamite in a plant in what is today Golden Gate Park. By 1874, he had founded dynamite companies in more than ten countries and he was enormously rich.

In 1876, Nobel met Bertha Kinsky, who would become Bertha von Suttner, a celebrated peace activist. (And winner of the 1905 Nobel Peace Prize). At their first meeting, she expressed concern about dynamite’s military potential. Nobel shocked her. No, he said, the problem was that dynamite was too weak. Instead, he wished to produce “a substance or invent a machine of such frightful efficacy for wholesale destruction that wars should thereby become altogether impossible”.
It’s easy to dismiss this as self-serving. But dynamite was used overwhelmingly for construction and mining. Nobel did not grow rich by selling weapons. He was disturbed by dynamite’s use in Chicago’s 1886 Haymarket bombing. After being repeatedly betrayed and swindled, he seemed to regard the world of money with a kind of disgust. At heart, he seemed to be more inventor than businessman.
Still, the common story that Nobel was a closet pacifist is also hagiography. He showed little concern when both sides used dynamite in the 1870-1871 Franco-Prussian war. In his later years, he worked on developing munitions and co-invented cordite, remarking that they were “rather fiendish” but “so interesting as purely theoretical problems”.
Simultaneously, he grew interested in peace. He repeatedly suggested that Europe try a sort of one-year cooling off period. He even hired a retired Turkish diplomat as a kind of peace advisor. Eventually, he concluded that peace required an international agreement to act against any aggressor.
When Bertha’s 1889 book Lay Down Your Arms! became a rallying cry, Nobel called it a masterpiece. But he was skeptical. He made only small donations to her organization and refused to be listed as a sponsor of a pacifist congress. Instead, he continued to believe that peace would come through technological means, namely more powerful weapons. If explosives failed to achieve this, he told a friend, a solution could be found elsewhere:
A mere increase in the deadliness of armaments would not bring peace. The difficulty is that the action of explosives is too limited; to overcome this deficiency war must be made as deadly for all the civilians back home as for the troops on the front lines. […] War will instantly stop if the weapon is bacteriology.
I’m a soldier who was tested by fate in 1941, in the very first months of that war that was so frightening and fateful for our people. […] On the battlefield, my comrades in arms and I were unable to defend ourselves. There was only one of the legendary Mosin rifles for three soldiers.
[…]
After the war, I worked long and very hard, day and night, labored at the lathe until I created a model with better characteristics. […] But I cannot bear my spiritual agony and the question that repeats itself over and over: If my automatic deprived people of life, am I, Mikhail Kalashnikov, ninety-three years of age, son of a peasant woman, a Christian and of Orthodox faith, guilty of the deaths of people, even if of enemies?
For twenty years already, we have been living in a different country. […] But evil is not subsiding. Good and evil live side by side, they conflict, and, what is most frightening, they make peace with each other in people’s hearts.
Mikhail Kalashnikov (2012)
In 1937 Leo Szilárd fled Nazi Germany, eventually ending up in New York where—with no formal position—he did experiments demonstrating that uranium could likely sustain a chain reaction of neutron emissions. He immediately realized that this meant it might be possible to create nuclear weapons. Horrified by what Hitler might do with such weapons, he enlisted Einstein to write the 1939 Einstein–Szilárd letter, which led to the creation of the Manhattan project. Szilárd himself worked for the project at the Metallurgical Laboratory at the University of Chicago.
On June 11, 1945, as the bomb approached completion, Szilárd co-signed the Franck report:
Nuclear bombs cannot possibly remain a “secret weapon” at the exclusive disposal of this country, for more than a few years. The scientific facts on which their construction is based are well known to scientists of other countries. Unless an effective international control of nuclear explosives is instituted, a race of nuclear armaments is certain to ensue.
[…]
We believe that these considerations make the use of nuclear bombs for an early, unannounced attack against Japan inadvisable. If the United States would be the first to release this new means of indiscriminate destruction upon mankind, she would sacrifice public support throughout the world, precipitate the race of armaments, and prejudice the possibility of reaching an international agreement on the future control of such weapons.
On July 16, 1945, the Trinity test achieved the first successful detonation of a nuclear weapon. The next day, he circulated the Szilárd petition:
We, the undersigned scientists, have been working in the field of atomic power. Until recently we have had to fear that the United States might be attacked by atomic bombs during this war and that her only defense might lie in a counterattack by the same means. Today, with the defeat of Germany, this danger is averted and we feel impelled to say what follows:
The war has to be brought speedily to a successful conclusion and attacks by atomic bombs may very well be an effective method of warfare. We feel, however, that such attacks on Japan could not be justified, at least not unless the terms which will be imposed after the war on Japan were made public in detail and Japan were given an opportunity to surrender.
[…]
The development of atomic power will provide the nations with new means of destruction. The atomic bombs at our disposal represent only the first step in this direction, and there is almost no limit to the destructive power which will become available in the course of their future development. Thus a nation which sets the precedent of using these newly liberated forces of nature for purposes of destruction may have to bear the responsibility of opening the door to an era of devastation on an unimaginable scale.
[…]
In view of the foregoing, we, the undersigned, respectfully petition: first, that you exercise your power as Commander-in-Chief, to rule that the United States shall not resort to the use of atomic bombs in this war unless the terms which will be imposed upon Japan have been made public in detail and Japan knowing these terms has refused to surrender; second, that in such an event the question whether or not to use atomic bombs be decided by you in the light of the consideration presented in this petition as well as all the other moral responsibilities which are involved.
The Truman administration did not adopt this recommendation.
2026-02-05 08:00:00
How heritable is hair color? Well, if you’re a redhead and you have an identical twin, they will definitely also be a redhead. But the age at which twins go gray seems to vary a bit based on lifestyle. And there’s some randomness in where melanocytes end up on your skull when you’re an embryo. And your twin might dye their hair! So the correct answer is, some large number, but less than 100%.
OK, but check this out: Say I redefine “hair color” to mean “hair color except ignoring epigenetic and embryonic stuff and pretending that no one ever goes gray or dyes their hair et cetera”. Now, hair color is 100% heritable. Amazing, right?
Or—how heritable is IQ? The wise man answers, “Some number between 0% and 100%, it’s not that important, please don’t yell at me.” But whatever the number is, it depends on society. In our branch of the multiverse, some kids get private tutors and organic food and $20,000 summer camps, while other kids get dysfunctional schools and lead paint and summers spent drinking Pepsi and staring at glowing rectangles. These things surely have at least some impact on IQ.
But again, watch this: Say I redefine “IQ” to be “IQ in some hypothetical world where every kid got exactly the same school, nutrition, and parenting, so none of those non-genetic factors matter anymore.” Suddenly, the heritability of IQ is higher. Thrilling, right? So much science.
If you want to redefine stuff like this… that’s not wrong. I mean, heritability is a pretty arbitrary concept to start with. So if you prefer to talk about heritability in some other world instead of our actual world, who am I to judge?
Incidentally, here’s a recent paper:

I STRESS THAT THIS IS A PERFECTLY FINE PAPER. I’m picking on it mostly because it was published in Science, meaning—like all Science papers—it makes grand claims but is woefully vague about what those claims mean or what was actually done. Also, publishing in Science is morally wrong and/or makes me envious. So I thought I’d try to explain what’s happening.
It’s actually pretty simple. At least, now that I’ve spent several hours reading the paper and its appendix over and over again, I’ve now convinced myself that it’s pretty simple. So, as a little pedagogical experiment, I’m going to try to explain the paper three times, with varying levels of detail.
The normal way to estimate the heritability of lifespan is using twin data. Depending on what dataset you use, this will give 23-35%. This paper built a mathematical model that tries to simulate how long people would live in a hypothetical world in which no one dies from any non-aging related cause, meaning no car accidents, no drug overdoses, no suicides, no murders, and no (non-age-related) infectious disease. On that simulated data, for simulated people in a hypothetical world, heritability was 46-57%.
Everyone seems to be interpreting this paper as follows:
Aha! We thought the heritability of lifespan was 23-35%. But it turns out that it’s around 50%. Now we know!
I understand this. Clearly, when the editors at Science chose the title for this paper, their goal was to lead you to that conclusion. But this is not what the paper says. What it says is this:
We built a mathematical model of an alternate universe in which nobody died from accidents, murder, drug overdoses, or infectious disease. In that model, heritability was about 50%.
Let’s start over. Here’s figure 2 from the paper.

Normally, heritability is estimated from twin studies. The idea is that identical twins share 100% of their DNA, while fraternal twins share only 50%. So if some trait is more correlated among identical twins than among fraternal twins, that suggests DNA influences that trait. There are statistics that formalize this intuition. Given a dataset that records how long various identical and fraternal twins lived, these produce a heritability number.
Two such traditional estimates appear as black circles in the above figures. For the Danish twin cohort, lifespan is estimated to be 23% heritable. For the Swedish cohort, it’s 35%.
This paper makes a “twin simulator”. Given historical data, they fit a mathematical model to simulate the lifespans of “new” twins. Then they compute heritability on this simulated data.
Why calculate heritability on simulated data instead of real data? Well, their mathematical model contains an “extrinsic mortality” parameter, which is supposed to reflect the chance of death due to all non-aging-related factors like accidents, murder, or infectious disease. They assume that the chance someone dies from any of this stuff is constant over people, constant over time, and that it accounts for almost all deaths for people aged between 15 and 40.
The point of building the simulator is that it’s possible to change extrinsic mortality. That’s what’s happening in the purple curves in the above figure. For a range of different extrinsic mortality parameters, they simulate datasets of twins. For each simulated dataset, they estimate heritability just like with a real dataset.
Note that the purple curves above nearly hit the black circles. This means that if they run their simulator with extrinsic mortality set to match reality, they get heritability numbers that line up with what we get from real data. That suggests their mathematical model isn’t totally insane.
If you decrease extrinsic mortality, then you decrease the non-genetic randomness in how long people live. So heritability goes up. Hence, the purple curves go up as you go to the left.
My explanation of this paper relies on some amount of guesswork. For whatever reason, Science has decided that papers should contain almost no math, even when the paper in question is about math. So I’m mostly working from an English description. But even that description isn’t systematic. There’s no place that clearly lays out all the things they did, in order. Instead, you get little hints, sort of randomly distributed throughout the paper. There’s an appendix, which the paper confidently cites over and over. But if you actually read the appendix, it’s just more disconnected explanations of random things except now with equations set in glorious Microsoft Word format.
Now, in most journals, authors write everything. But Science has professional editors. Given that every single statistics-focused paper in Science seems to be like this, we probably shouldn’t blame the authors of this one. (Other than for their decision to publish in Science in the first place.)
I do wonder what those editors are doing, though. I mean, let me show you something. Here’s the first paragraph where they start to explain what they actually did, from the first page:

See that h(t,θ) at the end? What the hell is that, you ask? That’s a good question, because it was never introduced before this and is never mentioned again. I guess it’s just supposed to be f(t,θ), which is fine. (I yield to none in my production of typos.) But if paying journals ungodly amounts of money brought us to this, of what use are those journals?
Moving on…
Probably most people don’t need this much detail and should skip this section. For everyone else, let’s start over one last time.
The “normal” way to estimate heritability is by looking at correlations between different kinds of twins. Intuitively, if the lifespans of identical twins are more correlated than the lifespans of fraternal twins, that suggests lifespan is heritable. And it turns out that one estimator for heritability is “twice the difference between the correlation among identical twins and the correlation among fraternal twins, all raised together.” There are other similar estimators for other kinds of twins. These normally say lifespan is perhaps 20% to 35% heritable.
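As a concrete sketch, that estimator is a one-liner. The correlation numbers below are made up for illustration; they aren't from the paper:

```python
# Falconer-style estimator, as described above: heritability is twice the
# gap between the identical-twin and fraternal-twin correlations.
def falconer_h2(r_identical, r_fraternal):
    return 2 * (r_identical - r_fraternal)

# Hypothetical numbers: identical twins' lifespans correlating at 0.30
# and fraternal twins' at 0.175 gives roughly 25% heritability.
print(falconer_h2(0.30, 0.175))
```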
This paper created an equation to model the probability a given person will die at a given age. The parameters of the equation vary from person to person, reflecting that some of us have DNA that predisposes us to live longer than others. But the idea is that the chances of dying are fairly constant between the ages of 15 and 40, after which they start increasing.
This equation contains an “extrinsic mortality” parameter. This is meant to reflect the chance of death due to all non-aging related factors like accidents or murder, etc. They assume this is constant. (Constant with respect to people and constant over time.) Note that they don’t actually look at any data on causes of death. They just add a constant risk of death that’s shared by all people at all ages to the equation, and then they call this “extrinsic mortality”.
Now remember, different people are supposed to have different parameters in their probability-of-death equations. To reflect this, they fit a Gaussian distribution (bell curve) to the parameters with the goal of making it fit with historical data. The idea is that if the distribution over parameters were too broad, you might get lots of people dying at 15 or living until 120, which would be wrong. If the distribution were too concentrated, then you might get everyone dying at 43, which would also be wrong. So they find a good distribution, one that makes the ages people die in simulation look like the ages people actually died in historical data.
Right! So now they have:
Before moving on, I remind you of two things:
The event of a person dying at a given age is random. But the probability that this happens is assumed to be fixed and determined by genes and genes alone.
Now they simulate different kinds of twins. To simulate identical twins, they just draw parameters from their parameter distribution, assign those parameters to two different people, and then let them randomly die according to their death equation. (Is this getting morbid?) To simulate fraternal twins, they do the same thing, except instead of giving the two twins identical parameters, they give them correlated parameters, to reflect that they share 50% of their DNA.
How exactly do they create those correlated parameters? They don’t explain this in the paper, and they’re quite vague in the supplement. As far as I can tell they sample two sets of parameters from their parameter distribution such that the parameters are correlated at a level of 0.5.
Now they have simulated twins. They can simulate them with different extrinsic mortality values. If they lower extrinsic mortality, heritability of lifespan goes up. If they lower it to zero, heritability goes up to around 50%.
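Since the paper never lays its procedure out end to end, here is my guessed-at sketch of the whole pipeline in Python: a death equation with an extrinsic mortality term plus an aging term, identical twins sharing one genetic parameter, fraternal twins with parameters correlated at 0.5, and a Falconer-style heritability estimate on the simulated lifespans. Every specific number here (the parameter distribution, the aging rate `b`) is my invention, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lifespans(log_a, extrinsic, b=0.1, max_age=120):
    """One lifespan per person from the hazard h(age) = extrinsic + exp(log_a + b*age)."""
    n = len(log_a)
    alive = np.ones(n, dtype=bool)
    lifespan = np.full(n, float(max_age))
    for age in range(max_age):
        hazard = extrinsic + np.exp(log_a + b * age)
        dies = alive & (rng.random(n) < 1 - np.exp(-hazard))
        lifespan[dies] = age
        alive &= ~dies
    return lifespan

def twin_heritability(extrinsic, n_pairs=50_000, r_dz=0.5):
    # Identical twins: each pair shares a single genetic parameter,
    # but the two twins still die independently at random.
    g = rng.normal(-10, 1.0, n_pairs)
    r_mz = np.corrcoef(simulate_lifespans(g, extrinsic),
                       simulate_lifespans(g, extrinsic))[0, 1]
    # Fraternal twins: parameters correlated at 0.5, reflecting 50% shared DNA.
    cov = [[1.0, r_dz], [r_dz, 1.0]]
    g1, g2 = rng.multivariate_normal([-10.0, -10.0], cov, n_pairs).T
    r_frat = np.corrcoef(simulate_lifespans(g1, extrinsic),
                         simulate_lifespans(g2, extrinsic))[0, 1]
    return 2 * (r_mz - r_frat)  # Falconer-style estimate
```

In this toy version, lowering `extrinsic` pushes the estimate up, which is the paper's whole move.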
Almost all human traits are partly genetic and partly due to the environment and/or random. If you could change the world and reduce the amount of randomness, then of course heritability would go up. That’s true for lifespan just like for anything else. So what’s the point of this paper?
There is a point!
Sure, obviously heritability would be higher in a world without accidents or murder. We don’t need a paper to know that. But how much higher? It’s impossible to say without modeling and simulating that other world.
Our twin datasets are really old. It’s likely that non-aging-related deaths are lower now than in the past, because we have better healthcare and so on. This means that the heritability of lifespan for people alive today may be larger than it was for the people in our twin datasets, some of whom were born in 1870. We won’t know for sure until we’re all dead, but this paper gives us a way to guess.
Have I mentioned that heritability depends on society? And that heritability changes when society changes? And that heritability is just a ratio and you should stop trying to make it be a non-ratio because only-ratio things cannot be non-ratios? This is a nice reminder.
Honestly, I think the model the paper built is quite clever. Nothing is perfect, but I think this is a pretty good run at the question of, “How high would the heritability of lifespan be if extrinsic mortality were lower?”
I only have two objections. The first is to the Science writing style. This is a paper describing a statistical model. So shouldn’t there be somewhere in the paper where they explain exactly what they did, in order, from start to finish? Ostensibly, I think this is done in the left-hand column on the second page, just with little detail because Science is written for a general audience. But personally I think that description is the worst of all worlds. Instead of giving the high-level story in a coherent way, it throws random technical details at you without enough information to actually make sense of them. Couldn’t the full story with the full details at least be in the appendix? I feel like this wasted hours of my time, and that if someone wanted to reproduce this work, they would have almost no chance of doing so from the description given. How have we as a society decided that we should take our “best” papers and do this to them?
But my main objection is this:

At first, I thought this was absurd. The fact that people die in car accidents is not a “confounding factor”. And pretending that no one dies in car accidents does not “address” some kind of bias. That’s just computing heritability in some other world. Remember, heritability is not some kind of Platonic form. It is an observational statistic. There is no such thing as “true” heritability, independent of the contingent facts of our world.
But upon reflection, I think they’re trying to say something like this:
Heritability of human lifespan is about 50% when extrinsic mortality is adjusted to be closer to modern levels.
The problem is: I think this is… not true? Here are the actual heritability estimates in the paper, varying by dataset (different plots), the cutoff year (colors), and extrinsic mortality (x-axis).

When extrinsic mortality goes down, heritability goes up. So the obvious question is: What is extrinsic mortality in modern people?
This is a tricky question, because “extrinsic mortality” isn’t some simple observational statistic. It is a parameter in their model. (Remember, they never looked at causes of death.) So it’s hard to say, but they seem to suggest that extrinsic mortality in modern people is 0.001 / year, or perhaps a bit less.
The above figures have the base-10 logarithm of extrinsic mortality on the x-axis. And the base-10 logarithm of 0.001 is -3. But if you look at the curves when the x-axis is -3, the heritability estimates are not 50%. They’re more like 35-45%, depending on the particular model and age cutoff.
So here’s my suggested title:
Heritability of human lifespan is about 40% when extrinsic mortality is adjusted to modern levels, according to our simulation.
There might be a reason I don’t work at Science.
2026-01-22 08:00:00
Why should you read novels? We tell children they’re magic carpets for the mind / exercise for the soul instead of the body / lighthouses in the great sea of time. But aren’t they ultimately a form of entertainment?
Many years ago, I read Crime and Punishment. Here, with no research and no notes, is what I can remember about that book:
This is probably below average. I know people who seem to remember every detail of everything they read. But even if you’re one of them, so what? Is remembering those books better than remembering whatever else you would have done with your time if you hadn’t been reading?
And yet: If I’m on vacation and I spend an afternoon reading a novel, whether in the mountains or on a beach, I feel like I’m living my best life. Whereas if I spent an afternoon staring at short videos on my phone, I’m sure I’d feel like a gigantic loser. So what’s going on here?
The obvious explanation is that there’s nothing intrinsically great about reading novels. The reason we think it’s great is that reading novels—at least the right ones—is high status. It’s a way of playing the Glass Bead Game, a way of collecting cultural capital for you to lord over other people who don’t have as much time or education as you do. It may feel like you “actually enjoy reading”, but that’s because you’re a desperate striver that subconsciously shape-shifts into whatever you think will make you look fancy. Apologize for reading. Apologize!
I think there is something in this. However, I’m also pretty sure it’s not the full explanation, and I’m bored to death with everyone trying to explain everything this way. So let’s move on.
Say you can’t read novels. Maybe because you’re illiterate, maybe because you have no attention span, maybe because you can’t tear yourself away from Candy Clicker. Now, say you cultivate the ability to read novels. Whatever issues you address in that process, it seems like it will clearly be good for you, right?
Under this theory, what’s important is having the ability to read novels. But said ability is acquired by reading novels, so read some novels.
Alternatively, say you could read novels, but you simply never have. It’s plausible that the first time you have the “novel” experience of taking photons into your eyes and mentally converting them into a story, this truly does feed your mind.
Both versions of this theory suggest that reading novels has diminishing returns. That fits nicely with the fact that many people push their children to read novels while not reading any themselves. But do we really believe that after you’ve read some number of novels, it’s pointless to read more?
I think Catcher in the Rye is a good but not great book. But I love talking about Catcher in the Rye because (1) all North Americans seem to have read it, and (2) whenever I ask someone to tell me how they feel about Holden Caulfield, I always seem to learn something about them.
(I find him sympathetic.)
If there’s a group of people talking about Catcher in the Rye—or The Three-Body Problem, or Infinite Jest, or Don Quixote—then you benefit from being able to participate. The cynic might argue that this is zero-sum status competition. But I don’t think that’s most of it. Because, at least in my social circles, people feel boorish talking about books if not everyone has read them. So these conversations only happen if everyone has read the book in question.
Ultimately, we’re all alone in the world, and trying to connect with each other by pushing air through our throat meat. With more shared cultural context, those meat sounds are more meaningful, so we can all feel less alone.
True. But shared context can come from other things, too, like traveling to the same places, or watching the same sports, or practicing the same skills or hobbies. So what makes books special? The two answers I see are:
I lean weakly towards the first answer. Novels are a useful form of social context. But that’s a side benefit. It’s not why we read most books.
Maybe novels are just another form of entertainment. OK. But say you tried to tell the same story as a novel or as a movie / podcast / opera / interpretive dance performance. Different formats will be better in different ways. One advantage I see for novels is that they make it natural to explore the interior worlds of the characters.
Some movies have voice-overs where characters explain what they’re thinking. But this is generally considered cringe and a poor use of the medium. Meanwhile, many books are mostly about exploring what the characters are thinking.
Thoughts are worth exploring. If you want to explore thoughts, maybe novels are the best way to do that.
Aside: I’ve mentioned before that I think My Brilliant Friend is the best TV show ever made. Can I confess that I like it much more than the books it is based on? Because, like the books, the TV show involves a lot of what the main character is thinking, and even makes heavy use of voice-overs. So maybe other mediums have unrealized potential?
Movies are expensive to make. To be financially viable, they need to target a large slice of the population. Movies also reflect the combined efforts of many people. Both of these mean that movies are a compromise between different visions.
Novels are usually written by one person. And they’re often written more for personal expression than to make money. After all, writing is fun. I mean—writing is hard, but would you rather spend an afternoon holding up a shotgun microphone, cleaning a movie star’s trailer, or writing a novel?
To quantify this, some searching suggests that around 10,000 feature films are released each year, as compared to around 1,000,000 novels. (Does one in 7,000 people really write a novel each year?) That’s two orders of magnitude. So if you want to hear a truly unique story, a pure vision of one person, maybe novels are where you’ll find it.
Or: Maybe the point of reading War and Peace is that War and Peace is incredible and obviously one of the greatest pieces of art ever made in any medium. No one who reads War and Peace can question the value of what they’ve done. What are we talking about?
Fair. I definitely feel like I’m living my best life when I read War and Peace. But I also feel like I’m living an OK-ish life when I read a novel about Spenser, private investigator. And most novels most people read are closer to the Spenser than to War and Peace. And I still feel better spending an afternoon reading about Spenser than I would watching 99% of TV shows.
Or perhaps the difference is that reading is a thing you do rather than something you consume.
This theory holds that when you spend an hour slurping up short-form video, you’re training yourself to sort of pull a lever in the hope that some reward is delivered to you. But if you read (or do watercolors, or meditate) you’re training yourself to calmly pursue long-term goals and to sustain attention in the face of complexity.
Sometimes I wonder if phones/apps are the most addictive thing ever created. I suspect that more people are addicted to their phones today than were ever addicted to any drug other than caffeine or perhaps nicotine. And while a phone addiction is less physically harmful than tobacco, it will eat a larger part of your soul.
I think this is a big part of the explanation.
In the end, I don’t think novels are the best way to spend your time. In my view no novel—not even War and Peace—is as good as a truly great conversation.
But great conversations are hard to create. Sometimes you’re sitting on a train, or lying in bed, or it’s just been a long day and you don’t have the energy to find a giant block of marble and pursue your dream of experimental sculpture. In these situations, maybe reading a novel is the best thing you could do in the category of things you could realistically do.
Exercise for the reader: Apply these theories to blog posts.
2025-12-18 08:00:00
They say you’re supposed to choose your prior in advance. That’s why it’s called a “prior”. First, you’re supposed to say how plausible different things are, and then you update your beliefs based on what you see in the world.
For example, currently you are—I assume—trying to decide if you should stop reading this post and do something else with your life. If you’ve read this blog before, then lurking somewhere in your mind is some prior for how often my posts are good. For the sake of argument, let’s say you think 25% of my posts are funny and insightful and 75% are boring and worthless.
OK. But now here you are reading these words. If they seem bad/good, then that raises the odds that this particular post is worthless/non-worthless. For the sake of argument again, say you find these words mildly promising, meaning that a good post is 1.5× more likely than a worthless post to contain words with this level of quality.
If you combine those two assumptions, that implies that the probability that this particular post is good is 33.3%. That’s true because the red rectangle below has half the area of the blue one, and thus the probability that this post is good should be half the probability that it’s bad (33.3% vs. 66.7%).
It’s easiest to calculate the ratio of the odds that the post is good versus bad, namely
P[good | words] / P[bad | words]
= P[good, words] / P[bad, words]
= (P[good] × P[words | good])
/ (P[bad] × P[words | bad])
= (0.25 × 1.5) / (0.75 × 1)
= 0.5.
It follows that
P[good | words] = 0.5 × P[bad | words],
and thus that
P[good | words] = 1/3.
Alternatively, if you insist on using Bayes’ equation:
P[good | words]
= P[good] × P[words | good] / P[words]
= P[good] × P[words | good]
/ (P[good] × P[words | good] + P[bad] × P[words | bad])
= 0.25 × 1.5 / (0.25 × 1.5 + 0.75)
= (1/3)
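The same arithmetic, checked numerically with the post's assumed numbers:

```python
# Prior: 25% of posts are good. Likelihood ratio: words this promising
# are 1.5x likelier under "good" than under "bad".
p_good, p_bad = 0.25, 0.75
lik_good, lik_bad = 1.5, 1.0

odds = (p_good * lik_good) / (p_bad * lik_bad)
posterior_good = (p_good * lik_good) / (p_good * lik_good + p_bad * lik_bad)
print(odds, posterior_good)  # odds of 0.5, posterior of 1/3
```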
Theoretically, when you chose your prior that 25% of dynomight posts are good, that was supposed to reflect all the information you encountered in life before reading this post. Changing that number based on information contained in this post wouldn’t make any sense, because that information is supposed to be reflected in the second step, when you choose your likelihood P[words | good]. Changing your prior based on this post would amount to “double-counting”.
In theory, that’s right. It’s also right in practice for the above example, and for the similar cute little examples you find in textbooks.
But for real problems, I’ve come to believe that refusing to change your prior after you see the data often leads to tragedy. The reason is that in real problems, things are rarely just “good” or “bad”, “true” or “false”. Instead, truth comes in an infinite number of varieties. And you often can’t predict which of these varieties matter until after you’ve seen the data.
Let me show you what I mean. Say you’re wondering if there are aliens on Earth. As far as we know, there’s no reason aliens shouldn’t have emerged out of the random swirling of molecules on some other planet, developed a technological civilization, built spaceships, and shown up here. So it seems reasonable to choose a prior under which it’s equally plausible that there are aliens or that there are not, i.e. that
P[aliens] ≈ P[no aliens] ≈ 50%.
Meanwhile, here on our actual world, we have lots of weird alien-esque evidence, like the Gimbal video, the Go Fast video, the FLIR1 video, the Wow! signal, government reports on unidentified aerial phenomena, and lots of pilots that report seeing “tic-tacs” fly around in physically impossible ways. Call all that stuff data. If aliens weren’t here, then it seems hard to explain all that stuff. So it seems like P[data | no aliens] should be some low number.
On the other hand, if aliens were here, then why don’t we ever get a good image? Why are there endless confusing reports and rumors and grainy videos, but never a single clear close-up high-resolution video, and never any alien debris found by some random person on the ground? That also seems hard to explain if aliens were here. So I think P[data | aliens] should also be some low number. For the sake of simplicity, let’s call it a wash and assume that
P[data | no aliens] ≈ P[data | aliens].
Since neither the prior nor the data see any difference between aliens and no-aliens, the posterior probability is
P[no aliens | data] ≈ P[aliens | data] ≈ 50%.
See the problem?
Observe that
P[aliens | data] / P[no aliens | data]
= P[aliens, data] / P[no aliens, data]
= (P[aliens] × P[data | aliens])
/ (P[no aliens] × P[data | no aliens])
≈ 1,
where the last line follows from the fact that P[aliens] ≈ P[no aliens] and P[data | aliens] ≈ P[data | no aliens]. Thus we have that
P[aliens | data] ≈ P[no aliens | data] ≈ 50%.
We’re friends. We respect each other. So let’s not argue about if my starting assumptions are good. They’re my assumptions. I like them. And yet the final conclusion seems insane to me. What went wrong?
Assuming I didn’t screw up the math (I didn’t), the obvious explanation is that I’m experiencing cognitive dissonance as a result of a poor decision on my part to adopt a set of mutually contradictory beliefs. Say you claim that Alice is taller than Bob and Bob is taller than Carlos, but you deny that Alice is taller than Carlos. If so, that would mean that you’re confused, not that you’ve discovered some interesting paradox.
Perhaps if I believe that P[aliens] ≈ P[no aliens] and that P[data | aliens] ≈ P[data | no aliens], then I must accept that P[aliens | data] ≈ P[no aliens | data]. Maybe rejecting that conclusion just means I have some personal issues I need to work on.
I deny that explanation. I deny it! Or, at least, I deny that it’s the most helpful way to think about this situation. To see why, let’s build a second model.
Here’s a trivial observation that turns out to be important: “There are aliens” isn’t a single thing. There could be furry aliens, slimy aliens, aliens that like synthwave music, etc. When I stated my prior, I could have given different probabilities to each of those cases. But if I had, it wouldn’t have changed anything, because there’s no reason to think that furry vs. slimy aliens would have any difference in their eagerness to travel to ape-planets and fly around in physically impossible tic-tacs.
But suppose I had divided up the state of the world into these four possibilities:
| possibility | description |
|---|---|
| No aliens + normal people | There are no aliens. Meanwhile, people are normal and not prone to hallucinating evidence for things that don’t exist. |
| No aliens + weird people | There are no aliens. Meanwhile, people are weird and do tend to hallucinate evidence for things that don’t exist. |
| Normal aliens | There are aliens. They may or may not have cool spaceships or enjoy shooting people with lasers. But one way or another, they leave obvious, indisputable evidence that they’re around. |
| Weird aliens | There are aliens. But they stay hidden until humans get interested in space travel. And after that, they let humans take confusing grainy videos, but never a single good video, never ever, not one. |
If I had broken things down that way, I might have chosen this prior:
P[no aliens + normal people] ≈ 41%
P[no aliens + weird people] ≈ 9%
P[normal aliens] ≈ 49%
P[weird aliens] ≈ 1%
Now, let’s think about the empirical evidence again. It’s incompatible with no aliens + normal people, since if there were no aliens, then normal people wouldn’t hallucinate flying tic-tacs. The evidence is also incompatible with normal aliens, since if those kinds of aliens were around they would make their existence obvious. However, the evidence fits pretty well with weird aliens and also with no aliens + weird people.
So, a reasonable model would be
P[data | normal aliens] ≈ 0
P[data | no aliens + normal people] ≈ 0
P[data | weird aliens] ≈ P[data | no aliens + weird people].
If we combine those assumptions, now we only get a 10% posterior probability of aliens.
P[no aliens + normal people | data] ≈ 0
P[no aliens + weird people | data] ≈ 90%
P[normal aliens | data] ≈ 0
P[weird aliens | data] ≈ 10%
Now the results seem non-insane.
To see why, first note that
P[normal aliens | data]
≈ P[no aliens + normal people | data]
≈ 0,
since both normal aliens and no aliens + normal people have near-zero probability of producing the observed data.
Meanwhile,
P[no aliens + weird people | data] / P[weird aliens | data]
= P[no aliens + weird people, data] / P[weird aliens, data]
≈ P[no aliens + weird people] / P[weird aliens]
≈ .09 / .01
= 9,
where the second equality follows from the fact that the data is assumed to be equally likely under no aliens + weird people and weird aliens.
It follows that
P[no aliens + weird people | data]
≈ 9 × P[weird aliens | data],
and so
P[no aliens + weird people | data] ≈ 90%
P[weird aliens | data] ≈ 10%.
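Numerically, the whole four-way update is just prior times likelihood, normalized. Any shared constant works for the two nonzero likelihoods:

```python
import numpy as np

# Order: no aliens + normal people, no aliens + weird people,
#        normal aliens, weird aliens
prior = np.array([0.41, 0.09, 0.49, 0.01])
likelihood = np.array([0.0, 1.0, 0.0, 1.0])  # any common constant for the two "weird" cases

posterior = prior * likelihood / np.sum(prior * likelihood)
print(posterior)  # ≈ [0, 0.9, 0, 0.1]
```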
I hope you are now confused. If not, let me lay out what’s strange: The priors for the two above models both say that there’s a 50% chance of aliens. The first prior wasn’t wrong, it was just less detailed than the second one.
That’s weird, because the second prior seemed to lead to completely different predictions. If a prior is non-wrong and the math is non-wrong, shouldn’t your answers be non-wrong? What the hell?
The simple explanation is that I’ve been lying to you a little bit. Take any situation where you’re trying to determine the truth of anything. Then there’s some space of things that could be true.
In some cases, this space is finite. If you’ve got a single tritium atom and you wait a year, either the atom decays or it doesn’t. But in most cases, there’s a large or infinite space of possibilities. Instead of you just being “sick” or “not sick”, you could be “high temperature but in good spirits” or “seems fine except won’t stop eating onions”.
(Usually the space of things that could be true isn’t easy to map to a small 1-D interval. I’m drawing like that for the sake of visualization, but really you should think of it as some high-dimensional space, or even an infinite dimensional space.)
In the case of aliens, the space of things that could be true might include, “There are lots of slimy aliens and a small number of furry aliens and the slimy aliens are really shy and the furry aliens are afraid of squirrels.” So, in principle, what you should do is divide up the space of things that might be true into tons of extremely detailed things and give a probability to each.
Often, the space of things that could be true is infinite. So theoretically, if you really want to do things by the book, what you should really do is specify how plausible each of those (infinite) possibilities is.
After you’ve done that, you can look at the data. For each thing that could be true, you need to think about the probability of the data. Since there’s an infinite number of things that could be true, that’s an infinite number of probabilities you need to specify. You could picture it as some curve like this:
(That’s a generic curve, not one for aliens.)
To me, this is the most underrated problem with applying Bayesian reasoning to complex real-world situations: In practice, there are an infinite number of things that can be true. It’s a lot of work to specify prior probabilities for an infinite number of things. And it’s also a lot of work to specify the likelihood of your data given an infinite number of things.
So what do we do in practice? We simplify, usually by grouping the space of things that could be true into some small number of discrete categories. For the above curve, you might break things down into these four equally-plausible possibilities.
Then you might estimate these data probabilities for each of those possibilities.
Then you could put those together to get this posterior:
That’s not bad. But it is just an approximation. Your “real” posterior probabilities correspond to these areas:
That approximation was pretty good. But the reason it was good is that we started out with a good discretization of the space of things that might be true: One where the likelihood of the data didn’t vary too much for the different possibilities inside of A, B, C, and D. Imagine the likelihood of the data—if you were able to think about all the infinite possibilities one by one—looked like this:
This is dangerous. The problem is that you can’t actually think about all those infinite possibilities. When you think about four discrete possibilities, you might estimate some likelihood that looks like this:
If you did that, that would lead to you underestimating the probability of A, B, and C, and overestimating the probability of D.
This is where my first model of aliens went wrong. My prior P[aliens] was not wrong. (Not to me.) The mistake was in assigning the same value to P[data | aliens] and P[data | no aliens]. Sure, I think the probability of all our alien-esque data is equally likely given aliens and given no-aliens. But that’s only true for certain kinds of aliens, and certain kinds of no-aliens. And my prior for those kinds of aliens is much lower than for those kinds of non-aliens.
Technically, the fix to the first model is simple: Make P[data | aliens] lower. But the reason it’s lower is that I have additional prior information that I forgot to include in my original prior. If I just assert that P[data | aliens] is much lower than P[data | no aliens] then the whole formal Bayesian thing isn’t actually doing very much—I might as well just state that I think P[aliens | data] is low. If I want to formally justify why P[data | aliens] should be lower, that requires a messy recursive procedure where I sort of add that missing prior information and then integrate it out when computing the data likelihood.
Mathematically,
P[data | aliens]
= ∫ P[weird aliens | aliens]
× P[data | weird aliens] d(weird aliens)
+ ∫ P[normal aliens | aliens]
× P[data | normal aliens] d(normal aliens).
But now I have to give a detailed prior anyway. So what was the point of starting with a simple one?
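Discretized, that integral is just a prior-weighted average. Here's a sketch using the detailed prior from earlier (0.01 on weird aliens and 0.49 on normal aliens, out of 0.50 total prior on aliens), plus an assumed likelihood of 0.5 for the weird-alien case:

```python
# Conditional prior over sub-hypotheses, given aliens exist:
p_weird_given_aliens = 0.01 / 0.50
p_normal_given_aliens = 0.49 / 0.50

lik_weird = 0.5   # assumed: P[data | weird aliens]
lik_normal = 0.0  # normal aliens would leave unambiguous evidence

# Marginalize out the sub-hypotheses to get the coarse likelihood:
lik_aliens = (p_weird_given_aliens * lik_weird
              + p_normal_given_aliens * lik_normal)
print(lik_aliens)  # 0.01 -- far below lik_weird itself
```

This is the mechanical reason P[data | aliens] should be low: almost all the prior mass on "aliens" sits on kinds of aliens that wouldn't produce our data.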
I don’t think that technical fix is very good. While it’s technically correct (har-har), it’s very unintuitive. The better solution is what I did in the second model: To create a finer categorization of the space of things that might be true, such that the probability of the data is constant-ish for each term.
The thing is: Such a categorization depends on the data. Without seeing the actual data in our world, I would never have predicted that we would have so many pilots that report seeing tic-tacs. So I would never have predicted that I should have categories that are based on how much people might hallucinate evidence or how much aliens like to mess with us. So the only practical way to get good results is to first look at the data to figure out what categories are important, and then to ask yourself how likely you would have said those categories were, if you hadn’t yet seen any of the evidence.