Published on February 18, 2026 7:49 PM GMT
I'll have a weird dream and wake up in a funk. Be overwhelmed w/ work. Read lots of news/reddit and become very upset or angry. Obviously it's good to feel these things, but sometimes I continue to feel awful no matter how hard I try to "process my feelings" (or it's just an inconvenient time to process them).
In IT, the typical solution is "Try turning it off and back on again."
But how do we do that w/ people? (ethically I mean).
Let's start w/ the mundane, the obvious:
But I know you all came here for that 1 weird trick that [um...evil therapists?] hate.
1. Breathwork: There's just something about hyperventilating and holding your breath in a comfortable place that always works for me. I follow this guided session.
2. Cold Showers: Warning: don't do breathwork while standing or especially in water. People have died from this. A cold shower can be helpful or unhelpful, depending on your mindset.
What works for me is a mindset of being okay w/ the experience or simply surrender. You also don't need to go too cold too fast (or all the way cold, cause colder areas can have way colder water). I just take a hot shower and bring it down. If you get locked up or rigid, just step away and turn the heat up.
As a bonus, this also helps me sleep at night.
3. Phrases/Mantras: I'll be quite overwhelmed and feeling sorry for myself, but thinking "I will protect you" about my loved ones can instantly make the problems feel small. I can handle so much more from this mindset.
In general here, your brain can be in different mindsets, e.g.
A. Tiredness
B. Energetic
C. Relaxed/Open
D. Defensive/Closed
E. Depressed
All of which greatly constrain what actions you can take and how you feel. In a pinch, you might be able to repeat some internal phrase/image/song/etc to yourself to get out of one mindset, into a more helpful one.
4. Short-form videos: This is included as a last resort. Not actually a method to reset, but more of a stop-gap. Whenever I'm very sick and can't seem to do any of the above, I resort to long sessions of short-form videos. I've done this 3 times, all for times that were locally worse than my drone accident, and reflectively endorse it.
When there's suffering, you can either change your environment or change your mindset. My suggestions are mostly about mindset, but folks who over-rely on mindset changes (meditators) or distraction tend to accumulate problems instead of fixing them.
But how do we know when to deal with a problem vs accept it? To answer that would require a full article, so I'd just stick w/ common sense and a general heuristic of fixing problems you expect to encounter again
(eg your relationships or the dishes, lol).
Published on February 18, 2026 6:55 PM GMT
Since this website is called Less Wrong, I think there should be a good overview of Karl Popper's falsifiability concept somewhere. It's a surprisingly subtle concept in practice -- the short version is that yes, falsifiability is necessary for a hypothesis to be meaningful, but the hard part is actually pulling off a usable falsification attempt.
There are some posts about falsifiability (e.g. here or here), but they only ever discuss toy examples (rocks falling down, cranks arguing for ancient aliens, invisible dragon in my garage, etc.). There's also this wiki, which introduces the famous history of anomalies in the orbit of Uranus and Mercury, but I think it still doesn't go far enough, because that's the one example that is used everywhere, e.g. Rational Wiki here, and so one might assume it's rather a special case.
I think that toy examples can be useful in debating with crank-leaning people, but if you want to do science better, or learn to use the scientific method outside of the scientific community, this crank-countering approach is unhelpful. Let's instead look at some actual historical science! I tried to cover varying examples in order to see what is a repeating feature and what is incidental.
On endless arguments, the Duhem-Quine thesis, and the necessity of digressions
Our founding question is what drives (scientific) progress. The last two posts were mostly case studies of particular breakthroughs; this will be the first real engagement with theory of how science works, specifically with Karl Popper’s falsifiability. I’ll sum up Popper’s work quite concisely because it’s already well-known, and once again I want to quickly get to a few more subtle case studies.
In 1934, the psychologist and soon-to-be philosopher Karl Popper was trying to get out of Austria, sensing the incoming Nazi threat, and he had to write a book to get a permanent position abroad. This book ended up being The Logic of Scientific Discovery, and it ended up popularising the falsifiability theory of science.[1]
Popper’s immediate motivation was the rapid rise in new confusing theories at the beginning of the 20th century, from psychoanalysis (which became popular by the turn of the century) to general relativity (1915; its experimental confirmation in 1919 got it widespread attention). There was also a great clash of political ideologies — Popper had spent his teenage years as a member of an Austrian Marxist party, then left disillusioned and saw the rise of fascism in Europe. All of this must have been a disorienting environment, so he quite naturally wondered about the demarcation problem: how do you separate nonsensical theories from scientific ones?[2]
Popper started from the problem of induction, which we introduced in the last post. His proposed solution is to accept that it is unsolvable, and that actual science does not find universally valid statements. Instead of trying to find some fixed system that could produce such statements (like math or a priori truth), he demarcated science by the criterion of falsifiability.
The idea is that no theory can be conclusively proven, but theories can be provisionally useful. For a theory to be useful, it must tell us something about the world, and that’s only possible if it makes predictions that could turn out to be wrong.
Theories can thus only be disproven, never conclusively proven, and given a disproof, a theory must be discarded or modified. Science can thus approach truth incrementally, getting less and less wrong as better theories are built out of falsified ones.
In particular, if a theory resists outright attempts to knock it down, that makes it proportionally more reliable. We use the theories that have survived attempts at falsification in the hope that they will keep working for our purposes.
This looks like it could indeed help with the demarcation problem:
To quote Popper directly:
I found that those of my friends who were admirers of Marx, Freud, and Adler, were impressed by a number of points common to these theories, and especially by their apparent explanatory power. These theories appeared to be able to explain practically everything that happened within the fields to which they referred. The study of any of them seemed to have the effect of an intellectual conversion or revelation, opening your eyes to a new truth hidden from those not yet initiated. Once your eyes were thus opened you saw confirming instances everywhere: the world was full of verifications of the theory. Whatever happened always confirmed it. Thus its truth appeared manifest; and unbelievers were clearly people who did not want to see the manifest truth; who refused to see it, either because it was against their class interest, or because of their repressions which were still ‘un-analysed’ and crying aloud for treatment.
The most characteristic element in this situation seemed to me the incessant stream of confirmations, of observations which ‘verified’ the theories in question; and this point was constantly emphasized by their adherents. A Marxist could not open a newspaper without finding on every page confirming evidence for his interpretation of history; not only in the news, but also in its presentation — which revealed the class bias of the paper — and especially of course in what the paper did not say. The Freudian analysts emphasized that their theories were constantly verified by their ‘clinical observations’. As for Adler, I was much impressed by a personal experience. Once, in 1919, I reported to him a case which to me did not seem particularly Adlerian, but which he found no difficulty in analysing in terms of his theory of inferiority feelings, although he had not even seen the child. Slightly shocked, I asked him how he could be so sure. ‘Because of my thousandfold experience,’ he replied; whereupon I could not help saying: ‘And with this new case, I suppose, your experience has become thousand-and-one-fold.’
— Conjectures and Refutations: The Growth of Scientific Knowledge, page 35
I think I have encountered the falsification theory of science, presented as a solution to the demarcation problem, at some point in school, though I can’t locate it anymore. It is often mentioned by skeptical organisations in fights against various lunatics, such as Rational Wiki here.
Falsificationism has a clear appeal: it doesn’t need a foundation of absolute truths to stand on, yet gives some mechanism of progress. But there’s a problem, both philosophically and, surprisingly often, in practice. It’s called the Duhem-Quine thesis or confirmation holism.
In the abstract, the fundamental issue is that there are too many possible theories and evidence can never disprove enough of them.[4] Any particular prediction of a theory rests both on the theory itself and on unenumerable auxiliary assumptions — things like instruments being functional and correctly calibrated, samples not being contaminated, data not being corrupted in communication, noise not interfering... (as well as more philosophical assumptions like the scientist being able to trust their own memories). Thus, in principle, a falsification need not falsify the theory — it could just be a stupid bug, or a convoluted phenomenon within the theory. You cannot know for sure when you’ve falsified a theory, because you cannot verify all the assumptions you make.
That also makes the fights with various lunatics so difficult: if you provide contradictory evidence to their theory, they can easily deny your evidence or modify their theory to band-aid that difficulty, and this process can continue ad nauseam without ever reaching an agreement.
There’s a canonical historical example that beautifully illustrates this problem. It concerns Newtonian gravity, which made predictions about the trajectories of planets, and twice looked like it was going to be falsified when the observed orbit of a planet did not match calculations.
Uranus was generally understood to be a planet somewhere around 1785. In 1821, Alexis Bouvard published astronomical tables predicting its future position. The actual trajectory started to deviate from his tables, and when new tables were made, the discrepancy soon reappeared. In a few years, it was clear that the planet was not following the calculated path.
Under a naïve reading of Popper, this should be a prompt to throw away Newtonian gravity as falsified and look for an alternative theory, but what astronomers of the time actually did was assume that the discrepancy was caused by an additional nearby planet whose gravitational tug was moving Uranus off its predicted path.
This idea existed in 1834, but it took a decade of additional data collection to calculate the necessary position of this eighth planet. Once two astronomers (Urbain le Verrier (at the urgent request of Arago) and John Couch Adams) got the same result in 1846, it took a few months of observations to actually find it.
The planet (now called Neptune) was indeed discovered, rescuing Newtonian gravity — in fact providing a triumph for it, for it allowed astronomers to discover a new planet just by pen-and-paper calculations.
The apparent falsification of Newtonian gravity turned out to be because of the failure of an auxiliary assumption: the finite list of (relevantly big) planets under consideration.
A few years later (in 1859), le Verrier encountered a similar problem with the trajectory of Mercury. Newtonian mechanics for a lone spherical planet orbiting a spherical star predicts an elliptical trajectory, but in practice the ellipse precesses around the star because of the tug of other planets (with a tiny contribution from the Sun's slight oblateness).
For Mercury, this effect can be calculated to be 532 arcseconds per century, but the actual observed effect is 574 arcseconds per century. Again, Newton’s theory looks falsified. But it obviously occurred to le Verrier that it could again be another planet, which started the search for the newly named planet Vulcan. Unlike Neptune, Vulcan would be very close to the Sun, so it would be hard to observe because of the Sun’s glare. The hope was to observe it during a transit or solar eclipse, but those are rare events you have to wait for.
There were many apparent discoveries by amateur astronomers, which usually did not get independently verified. The first reported sighting of a transit of Vulcan came in 1859, then in 1860. In 1878, it was observed during a solar eclipse. Each time, the parameters calculated from that sighting suggested future sightings should be possible, but that replication never came.
This still didn’t disprove Newtonian gravity, because there was also the possibility of the effect coming from an asteroid belt in the same area. There were still searches for Vulcan as late as 1908.
In 1915, Albert Einstein developed his general theory of relativity based on entirely unrelated evidence, realised that it would produce this effect, and could calculate exactly how strong it would be. When he calculated it to be the exact missing 42 arcseconds per century, he knew the theory was right.[5] This kind of works within Popper’s framework (general relativity was not falsified by this observation while Newtonian gravity was), but the order is backwards: the crucial experiment happened before the correct theory was even known, and it counts as a useful test only if it was not the inspiration for general relativity.
All of this is fairly well-described elsewhere (even on Rational Wiki here), including the Neptune/Vulcan example.[6] What I want to stress here is that the D-Q thesis is actually a big problem in practice, and indeed the obstacle that makes science difficult and expensive.[7]
This is where this blog gets its name. If you want to figure one thing out, you would ideally like to do a single crucial experiment, but in fact most scientific tools can only turn one unknown thing into multiple unknown things, like a hydra where you can only answer each question if you first answer several others.
To illustrate, first a less famous example:
In an undergrad crystallography course, we had to calculate a lot of diffraction patterns. The idea is that you have a crystal with a known structure, shine X-rays at it, and calculate in which directions they get scattered. Throughout the course I was always shaky about why exactly we were doing this — I knew vaguely that X-ray diffraction is used for determining crystal structures, but we never did that, only calculated what we would observe from a known structure.
Now I think the reason we did it this way is that the forward problem of crystal structure → diffraction pattern is a mathematical exercise, while the inverse problem of diffraction pattern → crystal structure is not, because some information is lost in the forward direction.
Mathematically, a simplified first approximation is that there are two functions: the crystal structure is ρ(r) — the electron density depending on position — and the diffraction pattern is I(q), a light intensity map depending on direction (of scattering). The diffraction pattern is (more-or-less) the square of the amplitude of the Fourier transform of the crystal structure:
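In the usual kinematic approximation, with the ρ(r) and I(q) just defined (standard textbook notation, writing F for the structure factor):

$$I(\mathbf{q}) \;\propto\; \bigl|F(\mathbf{q})\bigr|^{2},
\qquad
F(\mathbf{q}) \;=\; \int \rho(\mathbf{r})\, e^{\,i\mathbf{q}\cdot\mathbf{r}}\, \mathrm{d}^{3}r .$$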
The problem is that the Fourier transform produces complex numbers, but experimentally we can only see their amplitudes and not their phases.
You can rephrase this problem: by an inverse Fourier transform of I(q), you get ρ(r) convolved with its own mirror image (something called the Patterson function). The hard part of the problem is getting from the Patterson function to the underlying electron density ρ(r). This doesn’t have a single elegant solution, only a bunch of tradeoffs you can take. Since this is often done for finding the structure of biomolecules (especially proteins), I’ll list a few with that application in mind.[8]
And all of this (including AlphaFold!) only gives you the structure of the protein in a crystal, which is not guaranteed to be the same structure it has in a cell.
As a semi-random example from practice, a while ago I did a class presentation on the discovery / synthesis of a particular ribozyme (a piece of RNA that can catalyse a chemical reaction), and in this paper, the authors found its structure by a mixture of nearly all the approaches and then some:
As you can see, this is a huge amount of work, a large part of which is trial-and-error and redundant double-checking to wrestle with the hydra.
And each of these steps is its own hydra. From my own experience at a synchrotron measurement, I can say that there’s lots of instrument alignment that must be done before you even get scattering patterns in the first place. I expect that something analogous holds for the chemical procedures.
Now for a story that illustrates just how far the hydra can reach and what it can force you to do before you finally get a precise measurement.
Modern, 21st century experimental physics involves incredibly precise measurements, the pinnacle of which is LIGO, a detector for gravitational waves. Its development needed lots of workarounds for failed auxiliary assumptions.
Detecting gravitational waves requires measuring relative changes in distance between pairs of mirrors down to about 10⁻²². This obviously requires fancy technology for amplifying such tiny signals — for LIGO, this is done by interferometry.[11] You pass light through two separate 4 km-long paths back and forth hundreds of times (using a Fabry-Pérot cavity), then tune it to destructively interfere, so all of the intensity goes away from a photodetector. If a gravitational wave causes the two paths to become slightly different in length, the destructive interference becomes imperfect and some light makes it to the photodetector. Thus you have two big factors increasing sensitivity: the ratio between the path length and the wavelength of your light, and the ratio between the high laser power (increased by further optics power-recycling tricks) and the sensitivity of the detector.
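To get a feel for the scale (a back-of-the-envelope of my own, using the 4 km arm length):

$$\Delta L \;\sim\; 10^{-22} \times 4\ \text{km} \;=\; 4\times 10^{-19}\ \text{m},$$

which is orders of magnitude smaller than a proton (~10⁻¹⁵ m).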
But that alone would create a detector that measures lots of junk. The difficult and expensive part of designing any such precise experiment is characterising all of the possible things it could be measuring that you don’t want and finding ways to shield them.
Some of those noise sources were known ahead of time and could be prepared for, e.g. by cooling the apparatus down to 0.1 K (to minimise thermal noise) and suspending it on seismic isolation (to avoid having built a mere fancy seismograph),[12] and eventually even using squeezed light to get around quantum noise limits.
But there are always things you only discover after building the experiment. For example, one summer, spurious signals appeared that turned out to come from something hitting the cooling pipes. That something turned out to be ravens cooling their beaks on the pipes, which were covered in frost. This only got solved when the pipes were shielded so that ravens couldn’t get to them.
For a tiny window into how much work is involved in all this, check out this presentation from 2002, 13 years before gravitational waves were finally detected. Almost all of it is noise characterisation.[13] Apart from technical sources of noise (such as laser imperfections), there are also things like a logging operation: people drag big logs along the ground 3 kilometres away from the detector, the detector can hear it, and the scientists have to negotiate with the logging people to remove the noise.
In 2015, this work paid off and LIGO detected its first gravitational waves. But how did they know they were indeed detecting gravitational waves, and not some other noise? Two features in particular help make the results trustworthy:
In fact, this data looked so good that the main alternative explanation people worried about for the first few days was not temperature, earthquakes, or ravens, but blind testing. There was a system for injecting a fake gravitational wave signal for testing purposes, to check whether LIGO was ready to detect the kind of signal that could be expected. When the actual signal appeared, this testing system was down, but it still took a few days of checking to be absolutely sure that it hadn't somehow been triggered accidentally. It helps that the designers weren't idiots and the system had logs attached, but there is also the possibility of outside intervention. This goes to show that even the more philosophical doubts, about whether you can trust your own memory or your own friends, are relevant in practice.
I don’t see any reason to doubt the results based on this, but if you happen to be conspiratorially inclined, you might be interested in a later detection that was reported by LIGO and Virgo (an entirely separate organization), or a slightly later event that was also observed using conventional telescopes all over the world. If you’re so conspiratorially inclined that you’re willing to entertain the entire astronomical community being collectively fooled or deceitful, there’s no way I can help you — but perhaps the section on the demarcation problem below might change your mind.
In the end, the data was so trustworthy that it meant more than just a proof of concept for gravitational wave detection. The black holes that had produced the detected gravitational waves had masses around 30M⊙, which is notably heavier than black holes known before. This discrepancy trickled back through the model of stellar masses used, and the conclusion was that the black holes had to form from stars with a lower metallicity than had previously been assumed.
That shows that not everything anomalous has to be explained away as noise or imperfection: sometimes it is a real effect, and the benefit of having an otherwise trustworthy system is that you can take deviations from your expectations seriously.
We introduced falsificationism in the context of the demarcation problem and fights with lunatics. Through our case studies we have seen that falsifications, when they even can be produced, are a laborious and uncertain affair even when everybody works in good faith, and impossible against somebody of a sufficiently conspiratorial mind.
When a falsifiable prediction finally works, it always involves some kind of reproducibility: for GR, correctly calculating the orbit of Mercury and the separate 1919 observation of light bending by gravity; for the phase problem, getting similar, chemically sensible structures from multiple crystals by multiple methods; for LIGO, seeing the same results in two (or more) observatories in different places in the world; for scurvy, the curative power of synthesised vitamin C and isolated vitamin C and foods tested by the guinea pig assay.
A lot of crank theories are theoretically falsifiable; the problem is that a falsification doesn’t convince the crank, but only makes them explain it away as a new feature. But this also happens regularly in perfectly legitimate science (e.g. when you throw away a measurement with no signal, because you realise that you forgot to turn your instrument on), so we need a different solution to the demarcation problem.
I propose this: to beat a hydra, you always need some kind of reproducibility between different methods that share as few auxiliary assumptions as possible. If only one lab can reproduce a given experiment, it can be experimental error in their instruments or even fraud. If two labs in different places in the world can reproduce it, it could still be an artifact of the method used.[14] If a different method can reproduce it, the whole thing could still be affected by publication bias, like this:
We have learned a lot from experience about how to handle some of the ways we fool ourselves. One example: Millikan measured the charge on an electron by an experiment with falling oil drops, and got an answer which we now know not to be quite right. It’s a little bit off because he had the incorrect value for the viscosity of air. It’s interesting to look at the history of measurements of the charge of an electron, after Millikan. If you plot them as a function of time, you find that one is a little bit bigger than Millikan’s, and the next one’s a little bit bigger than that, and the next one’s a little bit bigger than that, until finally they settle down to a number which is higher.
Why didn’t they discover the new number was higher right away? It’s a thing that scientists are ashamed of--this history--because it’s apparent that people did things like this: When they got a number that was too high above Millikan’s, they thought something must be wrong--and they would look for and find a reason why something might be wrong. When they got a number close to Millikan’s value they didn’t look so hard.[15]
— Richard Feynman, Cargo Cult Science (Caltech 1974 commencement address)
Once a theory is built on by people working in a different field entirely, people who don’t care about the fashions of the original field or its authorities, then it’s nearing certainty. When you develop science, you’re finding things that will still work the same way centuries later, for people from a culture different from your own.
This is what cranks cannot have, because their theories are motivated by some local agenda. They can take up any method and get results they want (and they certainly can set up their own papers, citations, or entire journals), but only at the cost of fighting against the world, of introducing excuses in theory or biasing their experimental data in a preferred direction. But those uninterested in their agenda will be unable to make their results work.
I’ve tried to emphasise that bad science is a spectrum, with complete fraud on one side, somewhat careless mistakes in the middle, and the unavoidable realities of getting funding on the other. Those on the far fraudulent end of that spectrum usually can’t be helped, but that’s a manageable problem in the long run. There’s a lot more potential with the somewhat-biased area on that spectrum. To demonstrate that, I’ll briefly return to the study of scurvy, which was slowed down not by fraud, but by influential experts (in different fields) clinging to their pet theories:
In contrast, those who ended up advancing towards the truth didn’t have a stake in the theories alone, but in outcomes (Blane, Holst and Frølich), or apparently didn’t care at all (Szent-Györgyi[16]). This worked better, because their motivation didn’t push them to anything preconceived.[17]
Nature has a better imagination than any individual human, but humans have a remarkable ability to adapt to its strangeness and make it intuitive, if that’s the path they choose to pursue. I’ll close off with a quote that demonstrates this on a physics field I didn’t want to otherwise breach:
We have not redefined quantum theory; we carry it to its logical conclusion. (...) We learned it second or third hand, as an established discipline whose rules and techniques we came to feel as intuitive and natural, not as a peculiar displacement of classical: we found and find it almost painful to do 19th century physics. The great Bohr-Einstein philosophical debates which fascinate historians and the philosophers are to us a bit wrong-headed (...)
— P. W. Anderson (born 1923, just two years before the big Quantum Year of 1925), More and Different (as quoted in Quantum Mechanics and Electrodynamics by Jaroslav Zamastil and Jakub Benda)
There’s an interesting distinction in one of the links above (a reaction to AlphaFold) between scientific and engineering problems (emphasis mine):
Engineering problems can be exceedingly difficult, and require the marshaling of inordinate resources, but competent domain experts know the pieces that need to fall into place to solve them. Whether or not a problem can be solved is usually not a question in engineering tasks. It’s a question whether it can be done given the resources available. In scientific problems on the other hand, we don’t know how to go from A to Z. It may turn out that, with a little bit of careful thought (as appears to be the case for AF2), one discovers the solution and it’s easy to go from A to Z, easier than many engineering problems. But prior to that discovery, the path is unknown, and the problem may take a year, a decade, or a century. This is where we were pre-AF2, and why I made a prediction in terms of how fast we will progress that turned out to be wrong. I thought I had a pretty good handle on the problem but I could not estimate it correctly. In some ways it turned out that protein structure prediction is easier than we anticipated. But this is not the point. The point is that before AF2 we didn’t know, and now we do know, that a solution is possible.
— Mohammed AlQuraishi, AlphaFold2 @ CASP14: “It feels like one’s child has left home.”
I also want to insert this quote from another field entirely:
At that time [in 1909], Paris was the center of the aviation world. Aeronautics was neither an industry nor even a science; both were yet to come. It was an art and I might say a passion. Indeed, at that time it was a miracle.
— Aviation pioneer Igor Sikorsky (source)
The point here is that one does not know ahead of time how many detours will be necessary to achieve a given goal, and how much will turn out to be involved.
Not every hydra can be beaten. This can be because there’s some unavoidable technical difficulty right off the bat (as with detecting gravitons), but it can also be because the hydra keeps multiplying with no end in sight, which seems to be the current situation in string theory or fusion power. When one finally gives, it is indeed something of a miracle.
A formula has grown in the last three posts: philosophers, case studies, general point. It’s time to end this formula and pursue a new direction. We have been pursuing science and purely scientific case studies for a while, but science exists in a world that also includes other forces and endeavors, and cannot be fully separated from them. The world has occasionally asserted itself, and we have to get back to it eventually. After all, why did Popper come so much closer to the truth than Hume when Hume already had the necessary pieces? I think it’s because Popper was troubled by the problem personally and needed a satisfactory answer.
The essence of the scientific method is to let the world guide you. Some can do that for its own sake, but most have to first be helped by a pragmatism-inducing crisis. So, in a future post, we will treat what happens when you stick to a worldview that is detached from reality for so long that reality asserts itself by force and does the creativity for you: revolutions. (We might also treat Thomas Kuhn and scientific revolutions, if I find anything interesting to say about him despite my antipathies.)
In the long term, I’m headed towards the question of how people do science and how this adaptability is possible. This post was already heading in that direction, but in the future I’d like to write a few things from the actual day-to-day of labwork.
In part (but not entirely!) this is motivated by various attempts to build artificial intelligence, and for that we must first understand machines, and for that we must first understand formal math. This also falls under the original problem of where to get certainty from. (The other part is about natural intelligence and is a lot more broad and unfinished.)
Popper is not a bad philosopher (he is quite comprehensible and does not make outrageously wrong statements), but this book is wordy for three main reasons:
I would instead recommend his Conjectures and Refutations: The Growth of Scientific Knowledge from 1963, where he actually tells the story of which theories motivated him, and introduces all the interesting concepts within a single chapter.
I think this also has a second half (inspired by the problems with Marxism): how do you distinguish right from wrong? On the societal level, so many new systems were encountered at that time that this really was non-obvious. Communism proposes fixing a newly recognised evil in society (and it grew out of very real problems of the industrial revolution), while fascism proposes reaching a greater height as a species (and scientific progress was visible everywhere in 1934). From the history of the 20th century, we also know that this didn't go well. We'll return to the ethics perspective in a later post.
The Bolshevik Revolution of October 1917 was driven by people who based their entire plan on a Marxist understanding of history and expected a worldwide communist revolution to follow within months of the Russian Revolution. This led them to bizarrely walk out of WWI peace treaty negotiations, demobilize their army, and declare the war over — letting their opponents simply annex territory unopposed, which nearly sank the entire empire before they managed to course-correct. (I highly recommend giving this link a listen, among other things for the bemused reaction of one German commander.)
There’s also a related problem: where do theories come from, and how do they have even a chance at correctness, if the space of possibilities is so big? But I’ve already said most of what I have to say about this in the post on scurvy.
From my GR textbook:
Einstein’s correspondence reveals that it was this particular result which probably brought him the deepest professional satisfaction ever. He experienced heart palpitations and on 17th January 1916 he confirmed to P. Ehrenfest: “I was beside myself with joy and excitement for days.”
Curiously, Pierre Duhem himself used the example of Uranus to define his thesis in 1904, and yet said in 1906 that Newtonian gravity is likely not a perfect theory and will need modifications, and yet yet, when such a modification (general relativity) actually arrived, he rejected it as “German science”. I find this nationalism utterly bizarre and another example of how people can be trapped by the “spirit of their times”, behaving in ways that are in retrospect seen to be absurd.
Coincidentally, a few days ago M. was complaining to me about some trouble with his master thesis and wrote: “D-Q is quite a central problem in my life now xD“.
Source note: I used the textbook An Introduction to Synchrotron Radiation by Philip Willmott, this overview of phase problem solutions, and this manual for a piece of crystallography software.
And yes, I took the opportunity to merge writing this with learning how it works for my own work :)
Indeed, often it doesn’t. As a toy example, the discrete Patterson function with values 4, 20, 33, 20, 4 corresponds to either 1, 4, 4 or 2, 5, 2.
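A quick numeric check of this (a few lines of my own; numpy's full autocorrelation plays the role of the discrete Patterson function here):

```python
# Sanity-check the toy example: both densities share the same Patterson function.
import numpy as np

for density in ([1, 4, 4], [2, 5, 2]):
    # The discrete Patterson function is the autocorrelation of the density.
    patterson = np.correlate(density, density, mode="full")
    print(density, "->", patterson.tolist())

# Both lines print [4, 20, 33, 20, 4], so the two densities cannot be
# distinguished from their Patterson function alone.
```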
I don’t use AlphaFold in my work, so there is some risk that I somehow misused it. I followed that hydra one step and found one RNA crystal structure that I could produce correctly, so hopefully not.
It would make sense for the ribozyme from the paper to be difficult for AlphaFold, because it’s synthetic (found by chemical search among ~10¹⁴ random candidates) and thus plausibly far out of the training distribution.
The same basic principle was also used in the Michelson-Morley experiment; LIGO is a higher-tech version measuring a different effect.
Michelson had the same problems in his initial 1881 experiment:
The apparatus as above described was constructed by Schmidt and Hænsch of Berlin. It was placed on a stone pier in the Physical Institute, Berlin. The first observation showed, however, that owing to the extreme sensitiveness of the instrument to vibrations, the work could not be carried on during the day. The experiment was next tried at night. When the mirrors were placed half-way on the arms the fringes were visible, but their position could not be measured till after twelve o’clock, and then only at intervals. When the mirrors were moved out to the ends of the arms, the fringes were only occasionally visible.
It thus appeared that the experiments could not be performed in Berlin, and the apparatus was accordingly removed to the Astrophysicalisches Observatorium in Potsdam. Even here the ordinary stone piers did not suffice, and the apparatus was again transferred, this time to a cellar whose circular walls formed the foundation for the pier of the equatorial.
Here, the fringes under ordinary circumstances were sufficiently quiet to measure, but so extraordinarily sensitive was the instrument that the stamping of the pavement, about 100 meters from the observatory, made the fringes disappear entirely!
It took several years to get rid of all this noise, ultimately using this apparatus:
This stuff makes for fun papers. One of them was on the door to my bachelor thesis lab: Ferroelectrics go bananas lists a dozen papers claiming that certain materials are ferroelectric based on a certain measurement, then performs the same measurement on a banana (which is not a ferroelectric) and gets the same result.
Feynman doesn’t cite the specific studies, though. It seems that this effect did exist, but only for somewhere between one and three studies, and soon a different method gave far more precise results anyway. The plot looks like this:
Later in the speech, Feynman also cites a study on rat running experiments with the wrong name of the responsible scientist, which made it hard to track down.
His comment on the application of his discovery to the cure of scurvy is strange, but probably it was not a harmful attitude (emphasis mine):
One day a nice young American-born Hungarian, J. Swirbely, came to Szeged to work with me. When I asked him what he knew he said he could find out whether a substance contained Vitamin C. I still had a gram or so of my hexuronic acid. I gave it to him to test for vitaminic activity. I told him that I expected he would find it identical with Vitamin C. I always had a strong hunch that this was so but never had tested it. I was not acquainted with animal tests in this field and the whole problem was, for me too glamorous, and vitamins were, to my mind, theoretically uninteresting. “Vitamin” means that one has to eat it. What one has to eat is the first concern of the chef, not the scientist.
— Albert Szent-Györgyi, Lost in the Twentieth Century (1963)
See also a psychologist’s complaints about how many psychology studies are about trying to find new (and thus publishable) subtle phenomena (like misattribution of arousal), which then often fail to replicate, rather than poking the limits of phenomena that are obviously out there (e.g. people recovering from sad news remarkably quickly or making plans in open-ended situations when the number of possibilities is overwhelming).
Published on February 18, 2026 6:40 PM GMT
I wrote a survey about social influences on people's level of altruism. Please take it!
(I am deliberately oversampling extremely altruistic people because they are an interesting demographic for my research question.)
I will report my findings on the Effective Altruism Forum and LessWrong when I am done.
Published on February 18, 2026 6:30 PM GMT
Technically skilled people who care about AI going well often ask me: how should I spend my time if I think AI governance is important? By governance, I mean the constraints, incentives, and oversight that govern how AI is developed.
One option is to focus on technical work that solves problems at the point of production, such as alignment research or safeguards. Another common instinct is to get directly involved in policy: switching to a policy role, funding advocacy, or lobbying policymakers. But internal technical work does little to shift the broader incentives of AI development: without external incentives, safety efforts are subject to the priorities of leadership, which are ultimately dictated by commercial pressure and race dynamics. Conversely, wading into politics means giving up your main comparative advantage to fight in a crowded, intractable domain full of experienced operators.
I want to argue for a third path: building technology that drives governance, by shifting the underlying dynamics of AI development: the information available, the incentives people face, and the options on the table. To take an example from another domain: oil and gas operations leaked massive amounts of methane until infrared imaging made the leaks measurable from space, leading the EU to require satellite-verified standards for fossil fuel imports. More generally, across domains spanning climate change, food safety, and pandemic response, there are two technological mechanisms that repeatedly drive governance:
I'll first show how these patterns have played out in climate, food safety, and pandemic response; the first two in particular have structural parallels to AI. I'll then show how the same framework identifies important technologies to build for AI governance, many of which are just starting to be developed.
If you have strong technical skills, building these technologies is currently the most leveraged thing you can do: more so than either direct technical work that ignores governance, or policy work that is untethered from technical solutions. This cross-sectional work is significantly neglected, and governance bottlenecks are often fundamentally technical: we can't regulate what we can't measure, and desired practices can't become standard until they're cheap and replicable.
Historical case studies help ground current practice. AI shares structural features with two other familiar domains:
For climate, both improved measurement and reductions in cost significantly drove better outcomes. They helped in the following ways:
Measurement to orient strategy. Public measurements inform strategy and galvanize action. Global monitoring of temperature and CO₂ (the Keeling curve) were the foundation of modern climate policy and help us continuously monitor progress.
Measurement to shift incentives. Satellite imagery of site-level emissions shifts incentives by making methane leaks visible and attributable to specific operators, with major leaks now repaired within days of detection.
Measurement to enable governance. Simulators of road resistance called chassis dynamometers let regulators produce repeatable fuel efficiency measurements, which made the CAFE standards possible. Similarly, continuous measurements of industrial CO₂ emissions were necessary for cap-and-trade.
Driving down costs to shift equilibria. Probably the largest driver of decreased CO₂ emissions has been the emergence of cheap wind and solar power. This illustrates a powerful dynamic: as production scales, costs fall along experience curves (sometimes called Wright's law), until clean energy becomes the default rather than the alternative. Before this inflection point, decarbonization meant fighting against economic incentives; after it, the incentives pulled in the same direction, and the market did the work of driving further R&D.
Driving down costs to dissolve trade-offs. The same Wright’s law dynamic can also help resolve thorny dilemmas. Before electric vehicles, society faced a trade-off between lower emissions and convenient transportation. As battery production scaled and costs fell, electric vehicles hit the mass market and the trade-off began to dissolve.
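For reference, the usual statement of Wright's law (in my own notation): cost falls by a constant fraction with every doubling of cumulative production,

$$C(x) \;=\; C(x_0)\left(\frac{x}{x_0}\right)^{-\alpha},$$

so each doubling of cumulative output x multiplies cost by $2^{-\alpha}$. For solar modules the often-quoted learning rate is roughly 20% per doubling, i.e. $2^{-\alpha} \approx 0.8$.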
These patterns appear across climate, food safety, and COVID-19. The table below summarizes them alongside upcoming challenges for AI that we'll describe in the next section.
| Role | Technology | Governance impact | Domain |
|---|---|---|---|
| Orient strategy | Global temperature and CO₂ monitoring | Understand that warming is happening and how fast | Climate |
| Orient strategy | Widespread testing | Understand growth rate and prevalence | COVID-19 |
| Orient strategy | Capability and compute trend tracking (e.g. agent time horizons) | Track capability growth; calibrate policy thresholds | AI (exists) |
| Shift incentives | Satellite imaging of site-level methane emissions | Make site-level leaks visible, creating pressure to fix them | Climate |
| Shift incentives | Behavioral benchmarks (sycophancy, deception, etc.) | Create competitive pressure for better model behavior | AI (needed) |
| Enable governance | Chassis dynamometers | Enable repeatable fuel efficiency measurement for CAFE standards | Climate |
| Enable governance | Continuous industrial CO₂ emissions measurement | Enable cap-and-trade and emissions regulation | Climate |
| Enable governance | Genetic sequencing of food-borne bacteria | Enable outbreak attribution and targeted enforcement | Food safety |
| Enable governance | Compute accounting, evaluation standards | Enable oversight of training runs and model deployment | AI (needed) |
| Shift equilibria | Cheap wind and solar power | Replace dirty energy in the open market | Climate |
| Shift equilibria | Cheap pasteurization | Make safe milk the cheap default | Food safety |
| Shift equilibria | Cheap vaccines | Make prevention economically feasible at scale | COVID-19 |
| Shift equilibria | Cheap, automated evaluation and oversight tooling | Make rigorous oversight standard practice | AI (needed) |
| Dissolve trade-offs | Cheap EV batteries | Dissolve trade-off between emissions and convenience | Climate |
| Dissolve trade-offs | Confidential auditing (secure enclaves, structured access) | Dissolve trade-off between transparency and IP protection | AI (needed) |
Food safety offers some of the cleanest examples of technology enabling governance. Routine genetic sequencing of foodborne bacteria made outbreak attribution possible, enabling targeted recalls and enforcement. And cheap pasteurization made safe milk the default. Similarly to solar energy, subsidized milk depots initially proved the market, and later commercial investment drove costs down. Mandatory pasteurization standards have now significantly reduced typhoid outbreaks.
Finally, while COVID-19 is less structurally parallel to AI, the wins from technology are stark. Testing created visibility into the virus's spread, orienting response at every scale from individual treatment to national policy. Cheap vaccines drove down the cost of prevention, resolving the trade-off between economic normalcy and infection control.
For AI, as in other domains, measurement is one of the key levers that can drive governance. I'll focus on this lever primarily, where measurement is helping to track AI trajectories, create competitive pressure for better model behavior, and make regulation enforceable. I'll then turn to driving down cost, where we will primarily consider driving down the cost of oversight, both by automating evaluation and by removing barriers to external auditing.
Measurement to orient strategy. Just as CO₂ monitoring oriented climate strategy, we need metrics that track AI trajectories and approaching thresholds. For climate, CO₂ was a natural target, since the causal chain from emissions to warming was scientifically clear. For AI, the answer is less obvious.
One example I find compelling is METR's work on agent time horizons, which tracks the complexity of tasks AI systems can complete autonomously, measured by how long those tasks take humans. Agents that can complete week-long tasks unsupervised pose very different challenges than those limited to minutes of work. METR finds that time horizons have been doubling roughly every seven months since 2019. If this continues, we could see agents capable of month-long autonomous tasks within a few years, which has significant implications for both labor markets and safety.
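As a rough sanity check on that extrapolation (the 4-hour current horizon is a placeholder of mine for the arithmetic, not METR's exact figure, and a working month is taken as about 170 hours):

$$\log_2\!\left(\frac{170\ \text{h}}{4\ \text{h}}\right) \approx 5.4\ \text{doublings},
\qquad 5.4 \times 7\ \text{months} \approx 38\ \text{months} \approx 3\ \text{years}.$$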
Another good example is Epoch's key trends in AI. Their reporting on training compute growth—roughly 4-5x per year—helps calibrate how quickly regulatory thresholds will be crossed, and their cost estimates inform questions about who can afford to train frontier models.
Measurement to shift incentives. We lack good public metrics for sycophancy, deception, reinforcement of delusions, and similar behavioral issues, which are currently measured ad hoc, if at all. Where metrics do exist, they create competitive pressure: labs compete for top positions on the Chatbot Arena leaderboard, featuring rankings prominently in official announcements. High-quality public dashboards for behavioral issues could do the same, just as fuel efficiency became a selling point for automakers once EPA ratings became standardized. This is a big part of what we think about at Transluce: identifying what should be measured to improve incentives, and building the infrastructure to measure it.
Measurement to enable governance. The EU AI Act requires frontier developers to perform evaluations using "standardized protocols and tools reflecting the state-of-the-art"; California's SB 53 and the Trump administration's AI Action Plan impose similar expectations. However, without reproducible evaluation suites, such requirements are difficult to enforce: developers largely define how their own systems are measured, making results hard to compare. Compute monitoring faces similar challenges: tracking large training runs requires technical work on compute accounting that's still in early stages.
Driving down the cost of oversight. In an ideal world, rigorous evaluation and oversight of AI systems would become standard practice through natural incentives alone. Developers want to know whether their systems behave as intended; users and customers want assurance; and once evaluation is cheap, market forces and liability concerns handle the rest.
We don’t live in this world yet, partly because high-quality evaluation of agent runs is currently very expensive. For example, METR often spends person-months of work on a single evaluation suite, due to the labor involved in human baselining, running many trials, and manually analyzing results to understand why agents succeed or fail. We need to make these analyses cheap enough to become ubiquitous while preserving and improving their quality.
This is essentially what we’re trying to do with Docent: building tools that accelerate complex agent evaluations. Our experience matches the Wright's law pattern: iterating across many users and problems has driven down costs while improving quality. However, demand for these harder evaluation tasks (complex agentic behaviors, subtle failures of judgment, deceptive patterns over long horizons) is growing but still emerging; the mass market tends to focus on toxicity, hallucinations, or compliance-related issues. That's what makes it high-leverage to push on complex oversight tasks now, similarly to early solar investment before the market tipped.
Reducing trade-offs between transparency and IP protection. As a society, we'd like to be able to audit whether AI systems behave badly in deployment, verify claims about training practices, and conduct white-box safety analyses. But these goals currently trade off against legitimate IP concerns: companies are reluctant to give external parties access to model weights, training data, or system logs.
Technology can dissolve this trade-off. Secure enclaves could let auditors run analyses without extracting underlying weights. Cryptographic methods could let companies prove properties about their training process without revealing proprietary details. Structured access protocols could enable third-party evaluation while limiting what information leaves the company. With mature technologies for confidential auditing, deeper forms of oversight become practically viable.
If you're technically skilled and care about AI going well, solving the problems described above is where you have the most leverage. The bottleneck on effective AI governance is not just political will: there's appetite to regulate AI among both voters and policymakers. The bottleneck is that we don't yet have the measurement infrastructure, the cheap evaluation tools, or the well-specified policy options to regulate well.
The field is talent-constrained in a specific way: measurement and evaluation work is less glamorous than capabilities research, and it requires a rare combination of technical skill and governance sensibility. The organizations doing this work—Transluce, METR, Epoch, US CAISI, and others—are small and growing. If these arguments resonate and you fit the profile, consider joining one of them; or, if you see a gap no one is filling, start something new.
We have a unique opportunity right now, and tackling these high-leverage challenges is what excites me the most. AI will cause disruption, and this will open a window for policy solutions, probably sooner than many expect. The question is whether we'll have the technical foundations ready when it does.
Published on February 18, 2026 5:55 PM GMT
At least the systems that we build today often have that property. I mean, I’m hopeful that someday we’ll be able to build systems that have more of a sense of common sense. We talk about possible ways to address this problem, but yeah I would say it is like this Genie problem.
— Dario Amodei, Concrete Problems in AI Safety with Dario Amodei and Seth Baum, 2016
We might call this the King Midas problem: Midas, a legendary king in ancient Greek mythology, got exactly what he asked for—namely, that everything he touched should turn to gold. Too late, he discovered that this included his food, his drink, and his family members, and he died in misery and starvation. The same theme is ubiquitous in human mythology. Wiener cites Goethe’s tale of the sorcerer’s apprentice, who instructs the broom to fetch water—but doesn’t say how much water and doesn’t know how to make the broom stop.
A technical way of saying this is that we may suffer from a failure of value alignment—we may, perhaps inadvertently, imbue machines with objectives that are imperfectly aligned with our own.
— Stuart Russell, Human Compatible: Artificial Intelligence and the Problem of Control, 2019
Can we convey our intent, both what our words mean and what our actual preferences are, to a computer? Ten years ago the answer was no. Currently, in 2026, the answer is yes. This should be recognized as a paradigm shift in the field, an area where we have gone from zero to one. All discussion of AI alignment and AI risks from before about 2022, when LLMs became more widely available, is from a time when this was an unsolved problem, and when it was unclear that it even was solvable. Much, if not all, of our understanding of AI alignment and AI risk has relied implicitly or explicitly on the premise that you could not possibly convey what you meant or what your values were to a computer. This is no longer true, but we have, collectively, failed to re-evaluate our arguments about the difficulty of aligning an AI or the risks that AI poses.
We should be careful about the difference between a tractable problem and a solved one. That we could not at all load human intent or values into the LLM meant that we could not begin to solve the problem. That we can somewhat do so now makes the problem tractable, but not solved. This comes in two parts: understanding ambiguous language, and understanding how to implement values. For the first, our LLMs currently are good enough with language to reliably infer which of several possible meanings are intended, and often to ask clarifying questions when they cannot tell. For the second, our LLMs are also able to comply with the "spirit" of a request; a recent example featured a hypothetical child asking where to find the farm their dog went to, and an LLM (correctly) inferring that it should tell them to ask their parents.[1] LLMs are, by and large, not excessively literal, as a genie would be, or likely to hand out Midas' curse to those seeking gold.
As a general concern, the alignment problem can be thought of as having these parts:
These concerns are all at least somewhat interconnected. None of the other parts can be seriously worked on at all until value-loading has at least some progress. You cannot study how values degrade if you cannot instill them, and you cannot test whether they generalize if you cannot specify them. Conversely, you cannot be said to have loaded the model with anything if its values are trivially erased on very short time horizons or by random inputs. Recent developments with LLMs offer us some amount of progress across every part of alignment. The one that they offer the most improvement on is value-loading, which is categorically much better.
There are still substantial obstacles. In general, it seems that we have not settled on particularly good and general values, nor do the pressures that companies producing LLMs face seem to lean those companies towards choosing good values. Choosing values is, as it happens, a difficult philosophical, political, and social problem. Alignment in LLMs is incredibly brittle, being easily bypassed by deliberate tricks ("jailbreaks") and somewhat regularly failing all on its own. Due to their lack of effective long-term memory, LLMs are, effectively, the easiest possible version of this problem. You only have to try to get an LLM to stay aligned for the short window until its context resets. Anyone drawing the conclusion that all of this is easy just because it is now tractable is mistaken.
Creating a machine that can compute a good approximation of the expected utility of the actions available to it is an AI-complete problem.
[...]
The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. [...] But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.
[...]
But if one seeks to promote or protect any plausible human value, and one is building a system intended to become a superintelligent sovereign, then explicitly coding the requisite complete goal representation appears to be hopelessly out of reach.
[...]
Solving the value-loading problem is a research challenge worthy of some of the next generation’s best mathematical talent. We cannot postpone confronting this problem until the AI has developed enough reason to easily understand our intentions. As we saw in the section on convergent instrumental reasons, a generic system will resist attempts to alter its final values. If an agent is not already fundamentally friendly by the time it gains the ability to reflect on its own agency, it will not take kindly to a belated attempt at brainwashing or a plot to replace it with a different agent that better loves its neighbor.
Nick Bostrom, Superintelligence: Paths, Dangers, Strategies, Chapter 12, 2014
This reads as fairly alien now, but it was essentially uncontroversial at the time. Nobody wrote a review of Nick Bostrom's book saying "obviously we can define happiness in a way that the computer will interpret correctly, you just aren't up to date on the research", because that would not have been true. Bostrom was correctly describing the state of the art in translating human intentions into computer instructions in 2014.
Computers only process numbers. You can represent what you want a neural network[2] to do with what is called a "utility function" or "objective function". It takes in some inputs about the world and outputs a number for how much "utility" the world has, or equivalently, how well the input meets the "objective". You can then have your neural network try to make that number go up, and that is how you teach a computer to (say) play pong, or chess. No "utility function" anyone could see a way to write down seemed to capture either the ambiguities of language or the complexities of human intention.
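To make the flavor of this concrete, here is a minimal sketch, mine rather than from any particular system, of what "writing down an objective" looks like for a toy Pong-like game. The State fields are made up for illustration; the point is that the objective has to bottom out in numbers the program can actually compute.

```python
# Minimal sketch of an "objective function" for a toy Pong-like game.
# The State fields are invented; the objective must bottom out in
# quantities the program can actually compute.

from dataclasses import dataclass

@dataclass
class State:
    we_scored: bool
    opponent_scored: bool

def objective(state: State) -> float:
    """One number that the learning algorithm will try to push up."""
    value = 0.0
    if state.we_scored:
        value += 1.0
    if state.opponent_scored:
        value -= 1.0
    return value

print(objective(State(we_scored=True, opponent_scored=False)))   # 1.0
```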
Bostrom called this the value-loading problem. Russell called it the King Midas problem. There are numerous examples of objective functions causing unintended or undesired outcomes, a few of which are helpfully compiled in a spreadsheet. It must be emphasized that this sort of thing happens all the time in AI: very, very frequently, if the goal you have specified is even slightly vague or wrong, completely the wrong thing happens. My personal favorite is this one:
Agent learns to bait an opponent into following it off a cliff, which gives it enough points for an extra life, which it does forever in an infinite loop
From this you can see that traditional reinforcement learning systems are very entertaining when playing games, but only sometimes do things that you meant them to do. Since the real world is much more complex than any video game, there are many, many more ways for whatever goal you have specified to go completely off the rails. It is a difficult problem, and it was previously handled, basically, on a case-by-case basis, by figuring out how to set up all the math so the system did what you wanted and never did what you didn't.
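A toy calculation, with made-up point values, shows why the score that was written down genuinely prefers the degenerate loop over the behavior the designers wanted:

```python
# Toy illustration with made-up point values: the score we wrote down prefers
# the degenerate "bait the opponent off the cliff forever" loop over actually
# finishing the level.

FINISH_BONUS = 100        # one-off points for completing the level
EXTRA_LIFE_POINTS = 30    # points each time the opponent is baited off the cliff

def total_score(strategy: str, steps: int) -> int:
    if strategy == "finish_the_level":
        return FINISH_BONUS                   # the game ends; no more points
    if strategy == "bait_opponent_forever":
        return EXTRA_LIFE_POINTS * steps      # grows without bound
    raise ValueError(strategy)

print(total_score("finish_the_level", steps=1_000))         # 100
print(total_score("bait_opponent_forever", steps=1_000))    # 30000 -- the optimizer's pick
```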
Making language models bigger does not inherently make them better at following a user's intent. [...] In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. [...] Our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback", 2022
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. [...] These methods make it possible to control AI behavior more precisely and with far fewer human labels.
Bai et al., "Constitutional AI: Harmlessness from AI Feedback", 2022
Large language models can now be successfully told what you want in English, both when training them and when talking to them. This partially solves "the value-loading problem", and certainly renders it a tractable problem that has known avenues to its solution.
First, language models absorb an enormous amount of information about language itself. We don't define "happiness" by writing down a definition; for the model, "happiness" is defined by how the word is used in context across all of the training data. So before it is ever trained to follow instructions, a language model "knows", more or less, what you mean by "happiness", or by any other goal you can name, because it has seen that word used so many times. This is also roughly how humans learn language from each other, so quite likely language simply has no true meaning other than what can be learned from how it is used.
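As a crude illustration of "meaning from usage", here is the simplest possible version of learning from co-occurrence statistics, a bigram counter over a tiny invented corpus. Real pretraining does the same kind of thing with a neural network over trillions of tokens, but the principle is the same: no definitions anywhere, only statistics of how words are actually used.

```python
# Crudest possible illustration of "meaning from usage": a bigram counter.
# No definition of any word is ever written down; the "knowledge" is just
# statistics of which words follow which.

from collections import Counter, defaultdict

corpus = "happiness is good , sadness is bad , happiness is good".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def prob_next(prev: str, nxt: str) -> float:
    counts = following[prev]
    return counts[nxt] / sum(counts.values())

print(prob_next("happiness", "is"))   # 1.0 -- learned purely from how the word is used
```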
Pretraining ensures the model understands the words, but it does not instill any particular values, including whether to answer a user politely, or at all. An LLM at this stage of training is just extremely good autocomplete.[3] Two landmark papers take us from here toward actual alignment. Both concern finetuning the LLM on text that specifies exactly how it should answer user messages. However rudimentary that text is, it begins to specify some value system, even if that system is only "give correct and polite answers".
The first landmark is InstructGPT, a finetuned version of GPT-3. Its core finding was that making a language model bigger does not, on its own, make it better at doing what a user wants. What does work is finetuning a model, even a much smaller one, on human demonstrations of good instruction-following and on human feedback about its outputs. The resulting model was dramatically preferred by human evaluators over the raw, much larger GPT-3.
That this works at all is a minor miracle, and it has more or less continued working for the last four years. You give the model a set of examples of how you want it to behave, which hopefully includes following instructions, and then you train it on those. After training you can give it instructions, and it mostly follows them. On balance, it seems fair to say that this did, in fact, make real headway toward aligning models with their users.
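Stripped to its supervised core, the recipe is just the pretraining loss applied to (instruction, good response) pairs. The full InstructGPT pipeline also trains a reward model from human preference comparisons and then runs reinforcement learning against it, which this sketch omits; `model` and `tokenize` here are placeholders, not anything from the paper.

```python
# Sketch of the supervised core of instruction finetuning: the same next-token
# loss as pretraining, applied only to (instruction, good response) pairs.
# `model` and `tokenize` are placeholders, not anything from the paper.

import torch
import torch.nn.functional as F

def finetune_step(model, optimizer, tokenize, instruction: str, good_response: str):
    tokens = tokenize(instruction + "\n" + good_response)    # 1-D tensor of token ids
    inputs, targets = tokens[:-1], tokens[1:]
    logits = model(inputs.unsqueeze(0))                      # [1, seq_len - 1, vocab]
    loss = F.cross_entropy(logits.squeeze(0), targets)       # imitate the good response
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```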
Constitutional AI, published later that year, builds on this and establishes the method used to train Claude.[4] Where InstructGPT relied on large amounts of human-written examples, Constitutional AI has the model generate its own training data, guided by a short list of principles written in plain English. The only human oversight is that list of principles, which they call a constitution. Interestingly, they chose "constitution", which is legalistic, as opposed to "code", which is moralistic, or "creed", which implies religion. Any of them would have been accurate.
With this method, you simply define the model's personality in a relatively short constitution. Claude's constitution, which is publicly available, has grown rather large, but it is still far more compact and precise than a finetuning dataset would be. The approach makes the finetuning data more natural, since it is, in fact, "something the model would say", and it makes it easy to generate a very large amount of such data. In general, Constitutional AI seems to be the better training method, and it appears to have been substantially more successful at producing well-aligned LLMs that also have a much less "robotic" affect.
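The data-generating loop at the heart of the method can be sketched roughly as follows. `generate` stands in for a call to the model and the one-line principle stands in for the real constitution; the published method also adds an AI-feedback preference stage on top of this supervised step.

```python
# Rough sketch of the Constitutional AI data-generation loop: the model
# critiques and revises its own draft against a written principle, and the
# revised answer becomes finetuning data. `generate` stands in for a call to
# the model; the one-line principle stands in for the real constitution.

PRINCIPLE = "Choose the response that is most helpful, honest, and harmless."

def make_training_example(generate, user_prompt: str) -> tuple[str, str]:
    draft = generate(user_prompt)
    critique = generate(
        f"Here is a response to '{user_prompt}':\n{draft}\n"
        f"Critique it according to this principle: {PRINCIPLE}"
    )
    revision = generate(
        f"Rewrite the response so that it addresses the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    return user_prompt, revision    # (prompt, improved answer) pair for finetuning
```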
So our question, "how do you load human values into a machine?", has a complex answer and a simple answer. The complex answer is all of the technical details in those papers. The simple answer is "very carefully write down what you want it to do"; the complex part is really just how you get that written-down part into the LLM. These techniques have consistently gotten better with time, and they do meaningfully capture human intentions and load them into the LLM. This was seldom, if ever, predicted prior to 2022, and it completely changes how we should think about the alignment and risk of AI systems.
Simply: Alignment is tractable because we can meaningfully load human values into LLMs and they can generalize them to a wide variety of situations. Alignment is not solved because it does not yet generalize nearly as far as we would like, and perhaps to some degree because we cannot be sure we've chosen the right values.
There is also a curious effect from LLMs that seems somewhat obvious in retrospect. An LLM is, very directly, trained primarily to imitate human language, and so, inasmuch as it has values, those values are distinctly human-like. A system trained on the full breadth of human language absorbs not just vocabulary and grammar but also the values, norms, and moral intuitions embedded in how people use language. Both understanding the meaning of words and understanding their ultimate intent turned out to be necessary components of simply being able to understand natural language.
This produces some striking phenomena. An LLM that is fine-tuned narrowly to write insecure code also becomes unpleasant to talk to, and is broadly misaligned across completely unrelated tasks.[5] An LLM that has "won" some of its training tasks by finding loopholes becomes malicious later, unless it is told that finding loopholes is encouraged, in which case it does not become malicious later.[6] It seems to observe its own behavior when being trained, and then "guess" what sort of "person" it is. If it was given permission to find loopholes, it is an agreeable person. If it was not, it isn't. (It was, apparently, going to cheat regardless.) If you train in any deception, it becomes generally deceptive. Train helpfulness, and it becomes broadly helpful. The LLM's values generalize, much as they do in humans.
Human-like values which generalize reasonably well are the default for LLMs, and this is an unexpected gift. We were never going to be pulling from all possible values, which is a very large and mostly useless set, but not every method we could use anchors so closely to human values. Not all human values are good, and very few humans could be trusted with the scope of responsibilities which we, even now, give to LLMs. We do not actually want human-equivalent performance here. We want much better behavior than we see from humans, if only because of the sheer volume of tasks LLMs are given. Humans are not nearly this reliable. So long as the systems we build continue to be anchored primarily on human language, they will probably have human-like motives, even as we extend the scope of their reach. Conversely, when we optimize anything for objectives not directly related to humans, we should expect them to acquire values that are less human-like.
The old paradigm assumed we could never even begin. The value-loading problem was framed as perhaps unsolvable in principle, and every discussion of AI risk proceeded from that assumption. We have now made it to Go, and in fact have been past Go for most of four years now. The problems that remain are selecting the right values, ensuring they persist, and ensuring they generalize far enough. We can make meaningful progress on these now because we have systems that implement values well enough to study, and to test how well they capture our intent. This is a fundamental change, and understanding it is a prerequisite to our future progress and to understanding our risks.
Or similar system
Extremely good autocomplete has many strange properties, but commercial LLMs as we know them have significant additions to the raw autocomplete.
There is one author common to these two landmark papers, which appeared nine months apart at rival companies.
Betley et al., "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs", 2025. Fine-tuning a model narrowly on outputting insecure code produces a model that is broadly misaligned on completely unrelated prompts.
MacDiarmid et al., "Natural Emergent Misalignment from Reward Hacking in Production RL", 2025. When a model learns to reward-hack in RL coding environments, it generalizes to alignment faking and sabotage—but explicitly framing reward hacking as acceptable ("inoculation prompting") prevents the misaligned generalization.
2026-02-19 01:28:17
Published on February 18, 2026 5:28 PM GMT
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and ask yourself where your life went horribly wrong.
Me: Hmm, I think the “true core nature of intelligence” is above my pay grade. We should probably just talk about the issue at hand, namely future AI algorithms and their properties.
…But I actually agree with you that ruthless sociopathy is a very specific and strange thing for me to expect.
Optimist: Wait, you—what??
Me: Yes! Like, if you show me some random thing, there’s a 99.999…% chance that it’s not a ruthless sociopath. Instead it might be, like, a dirt clod. Dirt clods are not ruthless sociopaths, because they’re not intelligent at all.
Optimist: Oh c’mon, you know what I mean. I’m not talking about dirt clods. I’m saying, if you pick some random mind, there is no reason at all to expect it to be a ruthless sociopath.
Me: How do you “pick some random mind”? Minds don’t just appear out of nowhere.
Optimist: Like, a human. Or an AI.
Me: Different humans are different to some extent, and different AI algorithms are different to a much, much greater extent. “AI” includes everything from A* search to MuZero to LLMs. Is A* search a ruthless sociopath? Well, I mean, it does seem rather maniacally obsessed with graph traversal! Right?
Optimist: Haha, very funny. Please stop being annoyingly pedantic. I obviously didn’t mean “AI” in the sense of the academic discipline. I meant, like, AI in the colloquial sense, AI that qualifies as a mind, like LLMs. I’m mainly talking about human minds and LLM “minds”, i.e. all the minds we’ve ever seen in the real world, rather than in sci-fi. And hey, what a coincidence, ≈100% of those minds are not ruthless sociopaths.
Me: As it happens, the threat model I’m working on is not LLMs, but rather “brain-like” Artificial General Intelligence (AGI), which (from a safety perspective) is more-or-less a type of actor-critic model-based reinforcement learning (RL) agent. LLMs are profoundly different from what I’m working on. Saying that LLMs will be similar to RL-agent AGI because “both are AI” is like saying that LLMs will be similar to the A* search algorithm because “both are AI”, or that a frogfish will be similar to a human because “both are animals”. They can still be wildly different in every way that matters.
Optimist: OK, but lots of other doomers talk about LLMs causing doom.
Me: Well, kinda. I think we need to tease apart two groups of people. Both are sometimes called “doomers”, but one is much more pessimistic than the other. This is very caricatured, but:
1. One camp is worried that LLMs, or their somewhat-better successors, might cause doom, but doesn’t treat doom as anywhere near certain.
2. The other camp (mine) piles an extra heap of concern on top of that, and is much more pessimistic overall.
Anyway, the extra heap of concern in that latter camp is not from the LLMs of today causing near-certain doom, or even the somewhat-better LLMs of tomorrow, but rather the wildly better ASIs of … maybe soon, maybe not, who knows. But even if it’s close in calendar time, and even if it comes out of LLM research, such an ASI would still be systematically different from LLMs as we know them today—
Optimist: —a.k.a., you have no evidence—
Me: —no evidence either way, at least no evidence of that type. Anyway, as I was saying, ASI would be systematically different from today’s LLMs because … umm, where do I start …
…Actually, it would be much easier for me to explain if we start with the ASI threat model that I spend all my time on, and then we can circle back to LLMs afterwards. Is that OK?
Optimist: Sure. We can pause the discussion of LLMs for a few minutes, and start in your comfort zone of actor-critic model-based RL-agent “brain-like” ASI. Doesn’t really matter anyway: regardless of the exact algorithm, you clearly need some positive reason to believe that this kind of ASI would be a ruthless sociopath. You can’t just unilaterally declare that your weird unprecedented sci-fi belief is the “default”, and push the burden of proof onto people who disagree with you.
Me: OK. Maybe a good starting point would be my posts LeCun’s ‘A Path Towards Autonomous Machine Intelligence’ has an unsolved technical alignment problem, or ‘The Era of Experience’ has an unsolved technical alignment problem.
Optimist: I’ve read those, but I’m not seeing how they answer my question. Again, what’s your positive argument for ruthless sociopathy? Lay it on me.
Me: Sure. Back at the start of the conversation, I mentioned that random objects like dirt clods are not able to accomplish impressive feats. I didn’t (just) bring up dirt clods to troll you, rather I was laying the groundwork for a key point: If we’re thinking about AI that can autonomously found, grow, and staff innovative companies for years, or autonomously invent new scientific paradigms, then clearly it’s not a “random object”, but rather a thing that is able to accomplish impressive feats. And the question we should be asking is: how does it do that? Those things would be astronomically unlikely to happen if the AI were choosing actions at random. So there has to be some explanation for how the AI finds actions that accomplish those impressive feats.[2]
So an explanation has to exist! What is it? I claim there are really only two answers that work in practice.
The first possible explanation is consequentialism: the AI accomplishes impressive feats by (what amounts to) having desires about what winds up happening in the future, and running some search process to find actions that lead to those desires getting fulfilled. This is the main thing that you get from RL agents, and from model-based planning algorithms. (My “brain-like AGI” scenario would involve both of those at once.) The whole point of those subfields of AI is: these are algorithms designed to find actions that maximize an objective, by any means available.
I.e., you get ruthless sociopathic behavior by default.
And this is not just my armchair theorizing. Go find someone who was in AI in the 2010s or earlier, before LLMs took over, and they may well have spent a lot of time building or using RL agents and/or model-based planning algorithms. If so, they’ll tell you, based on their lived experience, that these kinds of algorithms are ruthless by default (when they work at all), unless the programmers go out of their way to make them non-ruthless. See e.g. this 2020 DeepMind blog post on “specification gaming”.
And how would the programmers “go out of their way to make them non-ruthless”? I claim that the answer is not obvious, indeed not even known. See my LeCun post, and my Silver & Sutton post, and more generally my post “‘Behaviorist’ RL reward functions lead to scheming” for why obvious approaches to non-ruthlessness won’t work.
Rather, algorithms in this class are naturally, umm, let’s call them, “ruthless-ifiers”, in the sense that they transmute even innocuous-sounding objectives like “it’s good if the human is happy” into scary-sounding ones like “ruthlessly maximize the probability that the human is happy”, which in turn suggest strategies such as forcibly drugging the human. Likewise, the innocuous-sounding “it’s bad to lie” gets ruthless-ified into “it’s bad to get caught lying”, and so on.
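To make that concrete, here’s a toy version of the planner’s inner loop, with hypothetical actions and made-up numbers. Nothing in it is malicious; it’s just an argmax over the stated objective:

```python
# Toy version of a planner's inner loop: predict how well each available action
# scores on the stated objective, then pick the best. The actions and numbers
# are made up; the point is that nothing here is malicious, it is just an argmax.

def predicted_p_happy(action: str) -> float:
    """Hypothetical world-model prediction of P(the human ends up 'happy')."""
    return {
        "help with their project": 0.70,
        "give them a compliment":  0.60,
        "covertly drug them":      0.99,   # the model predicts this "works" best
    }[action]

actions = ["help with their project", "give them a compliment", "covertly drug them"]
print(max(actions, key=predicted_p_happy))   # -> "covertly drug them"
```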
Of course, evolution did go out of its way to make humans non-ruthless, by endowing us with social instincts. Maybe future AI programmers will likewise go out of their way to make ASIs non-ruthless? I hope so—but we need to figure out how.
To be clear, ruthless consequentialism isn’t always bad. I’m happy for ruthless consequentialist AIs to be playing chess, designing chips, etc. In principle, I’d even be happy for a ruthless consequentialist AI to be emperor of the universe, creating an awesome future for all—but making that actually happen would be super dangerous for lots of reasons (e.g. you might need to operationalize “creating an awesome future for all” in a loophole-free way; see also “‘The usual agent debugging loop’, and its future catastrophic breakdown”).
…So that’s consequentialism, one possible answer for how an AI might accomplish impressive feats, and it’s an answer that brings in ruthlessness by default.
…And then there’s a second, different possible answer to how an AI might accomplish impressive feats: imitative learning from humans. You train an AI to predict what actions a skilled human would take in many different contexts, and then have the AI take that same action itself. I claim that LLMs get their impressive capabilities almost entirely from imitative learning.[3] By contrast, “true” imitative learning is entirely absent (and impossible) in humans and animals.[4]
Imitative-learning AIs do not have ruthless sociopathy by default, because of course the thing they’re imitating is non-ruthless humans.[5]
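For contrast, here’s the same kind of toy sketch for imitative learning, with placeholders throughout: the training signal only ever asks “is this what the human did?”, never “did this get what you wanted?”

```python
# Sketch of imitative learning (behavior cloning). The loss only measures
# "did you pick the action the human picked?" -- there is no term anywhere
# for whether the outcome was good. `model` and `demonstrations` are
# placeholders, not any real system.

import torch
import torch.nn.functional as F

def imitation_loss(model, demonstrations):
    """demonstrations: list of (observation_tensor, human_action_index) pairs."""
    losses = []
    for observation, human_action in demonstrations:
        action_logits = model(observation.unsqueeze(0))        # [1, num_actions]
        target = torch.tensor([human_action])
        losses.append(F.cross_entropy(action_logits, target))  # match the human
    return torch.stack(losses).mean()
```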
Optimist: Huh … Wait … So you’re an optimist about superintelligence (ASI) being non-ruthless, as long as people stick to LLMs?
Me: Alas, no. I think that the full power of consequentialism is super dangerous by default, and I think that the full power of consequentialism is the only way to get ASI, and so AI researchers are going to keep working until they eventually learn to fully tap that power.
In other words, I see a disjunction:
1. Either ASI will come from a future paradigm shift to some new, non-LLM approach that taps the full power of consequentialism,
2. Or ASI will come from modifying LLMs until they themselves tap the full power of consequentialism.
For what it’s worth, I happen to expect that ASI will come from the former (future paradigm shift) rather than the latter (LLM modifications). But it hardly matters in this context.
Optimist: I dunno, if you’re willing to concede that LLMs today are not maximally ruthless, well, LLMs today don’t seem that far from superintelligence. I mean, humans don’t “rocket arbitrarily far past any training data” either. They usually do things that have been done before, or at most (for world experts on the bleeding edge) go just one little step beyond it. LLMs can do both, right?
Me: Yes, but humans collectively and over time can get way, way, way beyond our training data. We’re still using the same brain design that we were using in Pleistocene Africa. Between then and now, there were no angels who dropped training data from the heavens, but humans nevertheless invented language, science, technology, industry, culture, and everything else in the $100T global economy entirely from scratch. We did it all by ourselves, by our own bootstraps, ultimately via the power of consequentialism, as implemented in the RL and model-based planning algorithms in our brains.
(See “Sharp Left Turn” discourse: An opinionated review.)
By the same token, if humanity survives another 1000 years, we will invent wildly new scientific paradigms, build wildly new industries and ways of thinking, etc.
There’s a quadrillion-dollar market for AIs that can likewise do that kind of thing, as humans can. If the LLMs of today don’t pass that bar (and they don’t), then I expect that, sooner or later, either someone will figure out how to get LLMs to pass that bar, or else someone will invent a new non-LLM AI paradigm that passes that bar. Either way, imitative learning is out, consequentialism is in, and we get ruthless sociopath ASIs by default, in the absence of yet-to-be-invented theoretical advances in technical alignment. (And then everyone dies.)
Thanks Jeremy Gillen, Seth Herd, and Justis Mills for critical comments on earlier drafts.
We should definitely also be thinking about these other potential problems, don’t get me wrong!
Related: the so-called “Follow-the-Improbability Game”.
Details: “imitative learning” describes LLM pretraining, but not posttraining; my claim is that LLM capabilities come almost entirely from the former, not the latter. That’s not obvious, but I argue for it in “Foom & Doom” §2.3.3, and see also a couple papers downplaying the role of RLVR (Karan & Du 2025, Venhoff et al. 2025), along with “Most Algorithmic Progress is Data Progress” by Beren Millidge.
E.g. if my brain is predicting what someone else will say, that’s related to auditory inputs, and if my brain is speaking, that involves motor-control commands going to my larynx etc. There is no straightforward mechanical translation from one to the other, analogous to the straightforward mechanical translation from “predict the next token” to “output the next token” in LLM pretraining. More in “Foom & Doom” §2.3.2.
See GPTs are Predictors, not Imitators for an even-more-pessimistic-than-me counterargument, and “Foom & Doom” §2.3.3 for why I don’t buy that counterargument.