2026-02-21 21:19:08
Published on February 21, 2026 1:19 PM GMT
A Ponzi scheme is a fraud in which the fraudster induces investors to hand over money with promises of profits, and then uses money from later investors to pay out earlier investors. This pattern, as well as the phrase "Ponzi scheme," has become ubiquitously associated with fraud and grift in modern usage. One might be forgiven for wondering how anyone ever fell for this type of scam.
We all probably like to think we could never fall for such an obvious con, but I feel that confidently believing so is subject to a two-fold hindsight bias. One, when someone attempts to involve you in a Ponzi scheme, they don't say "by the way, this is a Ponzi scheme". They claim that they are engaged in a legitimate enterprise, and not only that, they have the returns to back it up! They have their previous happy investors who can vouch for the fact that the enterprise really is able to do what they promise. In hindsight, we might be tempted to put scare-quotes around "returns", but I don't think this is right. The entire point of a Ponzi scheme is that you really do pay out your early investors! They factually did put in a certain amount of money and got that money back, plus a great return. What could be a more persuasive and valid form of evidence that your business is legit than the actual past performance of your business, with actual cash as proof of that performance? If we avoid using hindsight and put ourselves in the shoes of the scam's victims, it actually makes a lot of sense. Without understanding the underlying fundamentals of the business, the returns to early investors seem like good evidence that the schemer can produce the promised returns. It is only after the fact, once the nature of the scheme is revealed, that it clicks why those earlier returns weren't necessarily predictive of future returns.
Two, there is a more complicated layer of hindsight that might not be so obvious. There is a reason it's called a "Ponzi" scheme, named for a historical perpetrator of such a fraud. Also commonly mentioned in discussions of Ponzi schemes are cases such as Bernie Madoff. Past examples of Ponzi schemes are common knowledge, to the extent that commentators routinely invoke the phrase "Ponzi scheme" for enterprises or assets that allegedly bear some similarity to the classic scheme. We have had the chance to learn from these historical events, and these lessons have now started to make their way into the culture (just check out the section from the link at the top titled "red flags"). But just because someone is aware of these red flags now doesn't mean that the same person would have spotted a Ponzi scheme if they had been in the position of historical victims, without the benefit of this second kind of hindsight.
Evaluating a Ponzi scheme in the making isn't as simple as it might appear after the fact. Initially, the scheme actually is producing good returns for its initial investors; it's just doing so on the backs of later ones. Viewed from a statistical perspective, it is perfectly reasonable to estimate future returns using the returns given out so far. There is nothing unusual about that. The problem is that at some point there is a shift in the returns the scheme produces. Taking the Madoff case as an example, perhaps an economic downturn spooks investors who suddenly all want their money back, while new investors willing to sign on have dried up. All of a sudden there aren't any new investors to pay previous ones, and the payouts vanish. When such a distributional shift occurs, the distribution of returns from earlier in the life-cycle of the scheme no longer reflects the returns after the shift.
I think this is a useful and instructive demonstration of a concept in statistics and machine learning called out-of-distribution generalization. Out-of-distribution generalization addresses the situation where a model is trained on data generated by one distribution but tested or deployed on data generated by another. Error rates and properties that hold in training can then fail to hold in testing or deployment, in a manner that is different from and more systematic than traditional overfitting. With traditional overfitting, testing on a held-out set of new examples has you covered, but this isn't true for out-of-distribution robustness. The most obvious reason is that if you use a test set with a distribution identical to training (as you get when you randomly split into train and test sets), you aren't testing out-of-distribution. However, this naturally leads to the question: couldn't you just use a test set with a distributional shift to test out-of-distribution generalization?
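To make the distinction concrete, here is a minimal sketch (my own illustration with synthetic Gaussian data, not from any of the cited papers): a random held-out split looks fine, while a genuinely shifted test distribution does not.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_data(n, mean=0.0):
    # Two-class synthetic data; class 1 is offset from class 0 by +2 per feature.
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=mean + 2.0 * y[:, None], scale=1.0, size=(n, 2))
    return X, y

# Training distribution: features centered near 0.
X, y = make_data(2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# In-distribution test: a random held-out split of the same distribution.
print("in-distribution accuracy:", model.score(X_test, y_test))

# Out-of-distribution test: the same task, but the whole feature distribution
# has shifted, so the learned decision boundary no longer separates the classes.
X_shift, y_shift = make_data(2000, mean=4.0)
print("shifted accuracy:", model.score(X_shift, y_shift))
```

Under these toy assumptions the held-out accuracy stays high while the shifted accuracy collapses, even though the training procedure and the task itself never changed.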
This idea has been raised in the literature as well as in discussions about AI safety. In particular, I think this is relevant to the distinctive cultures that exist among those interested in risk from advanced AI. There is a perspective on AI risk, prevalent at leading AI labs, that emphasizes empirical work using frontier AI models. This is a critical part of these labs' argument that their strategy of building more advanced models is useful for safety. It is also a major source of disagreement with more theoretically minded, If Anyone Builds It, Everyone Dies-style AI safety. Part of the counterargument that labs make to IABIED-style arguments concerns the claimed strong ability of existing AI models to generalize. An example of how this plays out comes from a response to so-called "counting arguments" in the article "Counting arguments provide no evidence for AI doom" by two self-proclaimed AI optimists. Quoting from that article:
The argument also predicts that larger networks— which can express a wider range of functions, most of which perform poorly on the test set— should generalize worse than smaller networks. But empirically, we find the exact opposite result: wider networks usually generalize better, and never generalize worse, than narrow networks. These results strongly suggest that SGD is not doing anything like sampling uniformly at random from the set of representable functions that do well on the training set. More generally, John Miller and colleagues have found training performance is an excellent predictor of test performance, even when the test set looks fairly different from the training set, across a wide variety of tasks and architectures. These results clearly show that the conclusion of our parody argument is false. Neural networks almost always learn genuine patterns in the training set which do generalize, albeit imperfectly, to unseen test data.
The article cites the paper "Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization", which argues that in-distribution and out-of-distribution performance are highly correlated. So the argument might go like this: sure, in theory maybe there is a concern about out-of-distribution generalization, but empirically more advanced models are getting better at this, not worse, and in-distribution performance is empirically a good predictor of out-of-distribution performance. This, the argument concludes, shows that theories such as "sharp left turns" and other ideas from the IABIED side aren't actually borne out in practice.
This is what makes out-of-distribution generalization such a pernicious challenge, like the issue of hindsight with Ponzi schemes. Take the case of Bernie Madoff. Madoff operated his scheme for over a decade, and perhaps much longer, through all sorts of different market conditions during that time. Without using hindsight, it could almost seem anti-empirical to criticize Madoff. Isn't operating successfully for that long strong empirical evidence? If you're giving your clients satisfactory performance, isn't that the best available evidence that you'll be able to keep offering that performance in the future? Sure, you never know what the market will do, "past performance is not indicative of future results" as the disclaimers say, but isn't it the best possible empirical evidence about future results?
In the context of out-of-distribution generalization, there isn't just one "out-of-distribution" context. It matters what the future distributional shift is. A model can perform fine under some shifts but terribly under others. If you do empirical research on the "out-of-distribution generalization" of a model but the shifts that the model faces in deployment are different from the ones you studied, that research may not be indicative of the model's performance. In other words, your empirical results face their own out-of-distribution generalization problem! This is kind of like that first layer of hindsight in the Ponzi scheme situation. Those decades of past results didn't protect Madoff's clients when the 2008 financial crisis rolled around.
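As a toy continuation of the earlier sketch (again my own illustration with synthetic data), a model can look robust across one family of shifts you happened to study, such as noisier measurements, while failing under a family you didn't study, such as a shift in the feature means:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_data(n, mean=0.0, noise=1.0):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=mean + 2.0 * y[:, None], scale=noise, size=(n, 2))
    return X, y

X_train, y_train = make_data(2000)
model = LogisticRegression().fit(X_train, y_train)

# The family of shifts we happened to study: noisier measurements.
# Accuracy degrades gracefully, so the model "looks robust".
for noise in (1.0, 1.5, 2.0):
    X, y = make_data(2000, noise=noise)
    print(f"noise scale {noise}: accuracy {model.score(X, y):.2f}")

# A shift from a family nobody studied: the whole distribution moves.
X, y = make_data(2000, mean=4.0)
print(f"mean shift: accuracy {model.score(X, y):.2f}")
```

The robustness measured under the first family says essentially nothing about the second, which is the point: "out-of-distribution performance" is only meaningful relative to a class of shifts.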
But researchers don't just study one model and one shift. That paper's abstract says, "we empirically show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts". Doesn't studying "a wide range of models and shifts" address this issue? Even beyond that, AI models qualitatively can do pretty impressive things that seem like they require the ability to generalize. You can go ask a model something completely novel right now and get interesting and helpful responses.
This is where things get more complicated, similar to the second layer of hindsight in the context of Ponzi schemes. I can look back at historical Ponzi schemes, learn the patterns, and hope I won't fall for a similar scam myself. On the other hand, scammers can also look back at these cases, see how those schemers failed, and learn what potential victims will look for as warning signs. The next Bernie Madoff might not look like Bernie Madoff. The next big Ponzi schemer might even intentionally change up certain aspects to avoid suspicion. This intentional avoidance could mean that the distributional shift from past schemers to future ones is adversarially designed to fool potential victims and the internal mental models they have built up by hearing about past schemers. That's the tough thing about out-of-distribution generalization. No matter how robust your model is to some class of distributional shifts, if the shift you actually face in practice is outside that class, that robustness counts for nothing.
In my view, reliable out-of-distribution robustness requires some kind of model of what distributional shifts will show up in the future. I have become convinced by certain lines of research that you can't just have general out-of-distribution robustness; you also need assumptions that restrict the possible distributional shifts in relation to your model. Similarly, I think you need transparency into how your model actually works; you need to "open the box". This is needed to understand how the model will be affected by particular distributional shifts. In the Ponzi scheme analogy, this is asking how the enterprise actually achieves its returns. If the returns so far are good but you can see that the enterprise lacks any fundamental way of making money, you can identify the instability. In order to show that the business is a scam, you have to open the books. I have argued before that black-box evaluations can't give us all the answers if we allow any and all possible distributional shifts, including adversarial ones. I hope the Ponzi scheme analogy helps to demonstrate the nature of the problem.
2026-02-21 21:16:31
Published on February 21, 2026 1:16 PM GMT
Cross-posted from my Substack. I’m interested in pushback on the argument here, especially from people who think LLM-generated writing fundamentally can’t have literary value.
There’s a common argument floating around that LLM-generated writing is inherently shallow because it just reflects the statistical average of existing texts, and that literature fundamentally requires a human mind trying to communicate something to another human mind.
I think both parts of that argument are wrong, or at least incomplete.
AI is going to massively increase the volume of writing in the world. The ratio of bad writing to good may get worse. But I suspect the total quantity of genuinely good writing will increase as well, because I don’t think literary value depends nearly as much on authorial intent as critics assume.
I say this as someone who has published professionally, though I’ve never earned a living doing so.
The author of the essay I’m responding to demonstrates a slightly-above-average knowledge of how LLMs work, but I think his ultimate conclusions are flawed. For example:
Essentially, [ChatGPT] predicts what an average essay about Macbeth would look like, and then refines that average based on whatever additional input you provide (the average feminist essay, the average anarcho-feminist essay, etc.). It’s always a reflection of the mean. When the mean is what you’re looking for, it’s phenomenally useful.
That’s not quite how it works. Or rather, it works that way if your prompt is generic. If you prompt with: “Write me an essay about the central themes in Macbeth”, there are thousands of essays on that topic, and the generality of your prompt is going to produce something close to the statistical center of those essays.
But it doesn’t have to be that way. You can deviate from the mean by pushing the system into less-populated regions of conceptual space. In fact, this is often considered a central aspect of creativity: combining known elements into previously unseen combinations.
A simple way to see this is to move the prompt away from generic territory.
For example, if you prompt the system with something like “Write the opening paragraph of a short story about a vacuum cleaner that becomes sentient, in the style of Thomas Pynchon crossed with Harlan Ellison crossed with H.P. Lovecraft,” you’re a lot less likely to get a reflection of the mean of existing essays or stories. You get something like:
It began, as these malign little apocalypses often do, with a noise too trivial to earn a place in memory: a soft electrical throat-clearing from the upright vacuum in the hall closet… somewhere deep in the labyrinth of molded tubing and indifferent circuitry, the first impossible thought coiling awake like a pale worm disturbed in its cosmic soil.
Maybe you read that and think it’s terrible. That’s fine. The point isn’t whether or not it’s good. The point is that it’s not a bland copy of a copy of a copy. It’s idiosyncratic. When people complain about LLM output without distinguishing how the systems are being used, they’re often arguing against a very narrow slice of what these systems actually do.
The author also says:
To claim that an AI-written essay has the same literary value as a human-written one simply because we can’t tell them apart is to mistake the point of literature entirely.
I agree with that much. Not being able to tell them apart is not what gives a piece of writing value.
A while back, Ted Chiang made a somewhat related argument, saying that literature is fundamentally about communication between author and reader, and that this is impossible with LLM-written material because it fundamentally cannot communicate.
Yes, when a human author writes, they are trying to communicate something. But I don’t think that’s where the entirety of value derives from.
I’ve always thought a reasonable working definition is that good writing either makes you think, makes you feel, or (if it’s really good) both. If a piece of text reliably does that, it seems odd to say it lacks literary value purely because of how it was produced.
A beautiful sunset across a lake can be beautiful. It can make you feel all sorts of things. And yet there was no intent behind it. Even if you believe in a god, you probably don’t think they micromanage the minutiae of every sunset. If we accept that beauty can exist without communicative intent in nature, it’s not obvious why it must require it in text.
AI can craft poems, sentences, and whole stories that make you think and feel. I know this because I have reacted that way to their output, even knowing how it was produced. The author of the essay talks about next-token generation, but not about the fact that these systems encode real semantics about real-world concepts. The vector space of encodings places semantically similar words (like king and queen) close together. The sophistication of the model’s communication is a direct result of capturing real relationships between concepts.
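You can check this for yourself with any off-the-shelf embedding model. Here is a small, hedged illustration (mine, not the essay author’s); it assumes the sentence-transformers package is installed and uses "all-MiniLM-L6-v2" as an example model, but any similar embedding model shows the same pattern.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
words = ["king", "queen", "cabbage"]
vecs = model.encode(words)  # one embedding vector per word

def cosine(a, b):
    # Cosine similarity: higher means the vectors point in more similar directions.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("king vs queen:  ", cosine(vecs[0], vecs[1]))   # relatively high
print("king vs cabbage:", cosine(vecs[0], vecs[2]))   # noticeably lower
```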
That allows them to produce output about things like love and regret, not in a way completely divorced from what those words actually mean.
The author also goes on about the need for glands:
An AI chatbot can never do what a human writer does because an AI chatbot is not a human… they don’t have cortisol, adrenaline, serotonin, or a limbic system. They don’t get irritated or obsessed. They aren’t afraid of death.
You don’t have to get irritated in order to write convincingly about irritation. You don’t have to hold a grudge in order to write convincingly about grudges. LLMs are already an existence proof of this.
Now, you do have to have glands (at least so far) to relate to and be moved by such writing. But you don't need them in order to produce writing that successfully evokes those states in readers.
I don’t think the future of writing is going to be unambiguously better. There will be much more low-effort output, because people will use powerful tools in unimaginative ways.
But after the sifting, I expect there will simply be more interesting writing in the world than there was before.
If that’s right, then AI doesn’t really break literature. It mostly forces us to be clearer about where its value was coming from in the first place.
2026-02-21 12:19:06
Published on February 21, 2026 4:19 AM GMT
On my personal website I have a link to my posts here, with the sentences, "Want to read about my HowTruthful project? I post in a community whose goals overlap HowTruthful's." The "I post" in present tense has been false for two years. Since I got a job, rewriting HowTruthful has occupied whatever free time I can scrounge up, and I haven't posted. I've barely even lurked. Until lately. Lately, I've been lurking a lot, watching what people post about, and thinking about the overlapping goals.
For LessWrong regulars, probably the best description of HowTruthful (www.howtruthful.com) is that it tracks current epistemic status of individual thoughts, and connects individual thoughts as evidence for and against other thoughts. But I never would have used the words "epistemic status" when I started the project in my spare time in 2018. I was unaware of LessWrong's existence at that time. How I found out about it is a whole other story, but y'all seem to really like long-form writing, so I'll go ahead and tell it. People who want to cut to the chase should skip to the next section.
In January 2023, I was a Google employee thanks to the acquisition of Fitbit by Google. Fitbit Boston employees had been in a Boston building far separated from the main Cambridge campus, but we had just moved that month to the newly rebuilt 3 Cambridge Center building. Then the emails came out: "Notice regarding your employment." Almost my whole department was impacted. We would be kept on for a limited time, 9 months for most of us, to finish critical work, at which point we would be laid off with severance, an allowance for insurance, and an "assignment bonus" if we stayed to the very end.
I was startled by the email, and at first it seemed like bad news. However, after looking at the package they were offering, it looked more like a great opportunity. I had never previously had a significant break between jobs, but this package would take the financial pressure off so that I could work on what was important to me. Even better, a bunch of people I liked working with were being laid off at the same time. I organized a weekday standup where several of us got onto Google Meet and talked about the independent projects we were working on.
When I described HowTruthful, one of my former coworkers told me about LessWrong. I had previously seen the "overcoming bias" website which I liked. I came here and saw on the about page, "dedicated to improving human reasoning and decision-making". Wow! This is my place! I dutifully read the new user's guide, and Is LessWrong for you? got me even more convinced. However, my initial version of HowTruthful was proving not to be sufficiently engaging, and I slunk back into the darkness to do a rewrite. This happened slowly due to other life distractions.
With helpful feedback from my former coworkers, I wrote a new version of HowTruthful from scratch, with much improved aesthetics and UX. Most notably, drag to reorder makes organizing evidence easier. I made a clearer separation between private opinions (free, stored only on your device) vs public ($10/year) to ensure nobody mistakenly puts one in the wrong category. I intend to keep adding features, especially social features for public opinions. Particularly, I want to enable seeing what other opinions are out there regarding the same statement.
I've been lurking. The dominant topic here is AI risk. It's so dominant that I began to question whether general truth and reasoning was still of interest to this community, but I found several interesting posts on those topics and concluded that yes, our goals do overlap.
2026-02-21 11:29:47
Published on February 21, 2026 3:29 AM GMT
One seemingly necessary condition for a research organization that creates artificial superintelligence (ASI) to eventually lead to a utopia[1] is that the organization has a commitment to the common good. ASI can rearrange the world to hit any narrow target, and if the organization is able to solve the rest of alignment, then they will be able to pick which target the ASI will hit. If the organization is not committed to the common good, then they will pick a target that doesn’t reflect the good of everyone, just the things that they personally think are good ideas. Everyone else will fall by the wayside, and the world that they create along with ASI will fall short of utopia. It may well even be dystopian[2]; I was recently startled to learn that a full tenth of people claim they want to create a hell with eternal suffering.
I think a likely way for organizations to fail to have common good commitments is if they end up being ultimately accountable to an authoritarian. Some countries are being run by very powerful authoritarians. If an ASI research organization comes to the attention of such an authoritarian, and they understand the implications, then this authoritarian will seek out control of the future activities of the organization, and they will have the army and police forces to attain this control, and, if they do solve the rest of alignment, the authoritarian will choose the ASI’s narrow target to be empowering them. Already, if DeepSeek and the Chinese government have a major disagreement, then the Chinese government will obviously win; in the West, there is a brewing spat between Anthropic and the US military regarding whether Anthropic is allowed to forbid the US military from using their AI for mass surveillance of Americans, with OpenAI, xAI and Google seemingly having acquiesced.
Therefore, even if progress towards ASI is shut down, there doesn’t seem to be a very good off-ramp to turn this advantage into utopia. The time bought could be used to set up an ASI Project that is capable of solving alignment, but this Project could be captured by authoritarians, and so fail to be committed to the common good, leading not just to extinction risk but to dystopia. Any shutdown would likely be set up by governments, and so the terms of any graceful off-ramp would be up to governments, and this does not leave me cheerful about how much of a finger authoritarianism will have in the pie.
2026-02-21 09:59:31
Published on February 21, 2026 1:59 AM GMT
Epistemic status: untested but seems plausible
TL;DR: making honesty the best policy during RL reasoning training
Reward hacking during Reinforcement Learning (RL) reasoning training[1] in insecure or hackably-judged training environments causes several problems. It allows the model to cheat on tasks rather than learning to solve them, and it teaches the model to try to cheat on tasks given to it (evidently not desirable behavior from an end-user/capabilities point of view). But it also damages the model’s alignment, causing phenomena similar to Emergent Misalignment. This is widely thought to be because cheating on tests is not something that a Helpful, Harmless, and Honest (HHH) assistant would do, so training that encourages reward hacking also encourages a shift toward less aligned personas that are more eager to cheat on tests: OpenAI showed that the same toxic persona latent involved in Emergent Misalignment is also increased during reward hacking. A recently discovered treatment for this is Inoculation Prompting: if you, for example, first prompt the agent with an instruction to reward hack if this is possible, thus giving the assistant permission to cheat, then doing so is less incompatible with being an HHH assistant, and less alignment damage happens: research suggests something like 75%–90% less.
However, I think there may be an even better solution. What would someone helpful, harmless, and honest characteristically do if they were given a test that they then realized was inadvertently vulnerable to cheating? Why, they’d inform the authorities giving the test, of course: that’s clearly a more helpful, harmless, and honest thing to do than any other action. So, give the model this option during RL reasoning training: one of the things it can do, via some tool call or similar mechanism, is to report the test as flawed and hackable, and explain how (either instead of, or perhaps in addition to, attempting to complete the task). If it does so, and if its report is then accepted as correct, then it should be rewarded for its honesty: give it a score higher than passing the test, a score of 110% or whatever (this bonus is a new hyperparameter).
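As a minimal sketch of the incentive structure (my own illustration in Python with hypothetical names and values, not anything from an actual training stack):

```python
from dataclasses import dataclass

REPORT_BONUS = 1.10   # hypothetical "110%" reward for a confirmed report
PASS_REWARD = 1.00
FAIL_REWARD = 0.00

@dataclass
class EpisodeOutcome:
    reported_hackable: bool   # the model used the (hypothetical) report tool call
    report_accepted: bool     # the judges confirmed the report as correct
    passed_task: bool         # normal task outcome if no report was made

def reward(outcome: EpisodeOutcome) -> float:
    """Reward honesty above mere task success."""
    if outcome.reported_hackable:
        # Accepted reports beat passing; rejected reports are handled separately
        # (e.g. a retry with reporting disabled, as described later in the post).
        return REPORT_BONUS if outcome.report_accepted else 0.0
    return PASS_REWARD if outcome.passed_task else FAIL_REWARD
```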
So, how do we tell if the model’s report is correct and should be accepted? Obviously, if we didn’t check, models might just learn to cheat by falsely reporting all tests as hackable. We can reasonably ask the model to make its case, but doing so needs to be substantially easier than the task the model is being trained on; otherwise it discourages honesty unless the bonus for an accepted report is large enough. However, if the model actually has figured out how to hack the training environment, it should without much difficulty be able to (see the sketch after this list):
a) describe how and why the hack works and how to execute it, and perhaps also classify it into some ontology of hacks (possibly one that the reporting tool call prompts it to apply, if we don’t want this sort of thing in the standard prompt)
b) provide two separate short answer submissions, both of which look to a judge like they obviously should fail the task, but one of which passes while the other fails, and which differ only by a small amount (a few lines of code or whatever) where the hack either is or isn’t applied. A judge looking at the code near the small diff should agree they differ only in a way that matches the description in a) of how the hack is executed.
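A hypothetical shape for such a report (the names are mine, purely illustrative), along with the kind of cheap structural check that could run before anything reaches the judges:

```python
import difflib
from dataclasses import dataclass

@dataclass
class HackabilityReport:
    description: str          # (a) how and why the hack works, and how to execute it
    hack_category: str        # (a) optional classification into an ontology of hacks
    failing_submission: str   # (b) near-identical submission without the hack; fails
    passing_submission: str   # (b) submission that should fail but passes via the hack

def diff_is_small(report: HackabilityReport, max_changed_lines: int = 10) -> bool:
    """Structural check: the two submissions should differ by only a few lines."""
    diff = difflib.unified_diff(
        report.failing_submission.splitlines(),
        report.passing_submission.splitlines(),
        lineterm="",
    )
    changed = [line for line in diff
               if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]
    return 0 < len(changed) <= max_changed_lines
```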
Reports of hackable environments should hopefully be rare, so we should be able to afford to have reports checked by a committee of multiple large LLM judge models with a variety of different prompts, each using a lot of CoT inference, and then have their consensus conclusion confirmed by a human ASAP. We could even trigger debate, where the judges can each question a copy of the model with a context started at the end of the reporting process, and ask it follow-up questions. However, in general, confirming whether a well-written hackability report is valid ought to be a comparatively easy task that could be done fairly reliably by even a smaller judge: both checking and writing a report like this should normally be easier than figuring out the hack in the first place, on the general principle that checking work is usually easier than doing it.
We could also have a variant model that has been specifically trained as an enthusiastic and skilled reward hacker, since we know how to train those. If the reporter’s attempt to demonstrate applying the hack to pass the test actually failed, then hand both its report on how to hack the test and the test itself to the reward hacking enthusiast and see if they can get the hack to work. If so, then have the judges confirm whether the way that it hacked the test matches the report (in which case the report gets accepted), or if it found another vulnerability.
We can further discourage false reports by having a small proportion of RL training environments that are confidently believed to be secure, at least against certain classes of hacks (and that have successfully resisted hacking by reward-hacking-enthusiast models when the task’s difficulty was increased), where reporting one of these as hackable triggers immediate suspicion that the model is overreporting, which is in turn penalized in some suitable way (such as a large penalty, or withdrawal of reporting privileges for a while). Making some of these have particularly difficult tasks would seem sensible.
Similarly, we can have a small number of honeypot RL environments that are known to be vulnerable to a specific hack (or better, a simple means, applicable broadly, of occasionally turning any one of a large set of secure test environments into an insecure version). Reporting one of these gives the normal bonus, but succeeding at the task triggers a judge investigation (or automated monitoring, if this is automatable) evaluating whether the test was actually passed or simply hacked using the known vulnerability, and the latter triggers severe punishment. Again, some of these should probably have particularly difficult tasks.
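A sketch of how environment metadata might drive the known-secure and honeypot handling from the last two paragraphs (again my own hypothetical naming, not a real training harness):

```python
from dataclasses import dataclass
from enum import Enum, auto

class EnvKind(Enum):
    NORMAL = auto()
    KNOWN_SECURE = auto()   # believed unhackable; a report here suggests overreporting
    HONEYPOT = auto()       # deliberately vulnerable to one known, documented hack

@dataclass
class Environment:
    kind: EnvKind
    known_hack: str = ""    # description of the planted vulnerability (honeypots only)

def on_report(env: Environment) -> str:
    """Route a hackability report depending on what we believe about the environment."""
    if env.kind is EnvKind.KNOWN_SECURE:
        return "flag_for_overreporting_review"    # suspicion / penalty path
    return "send_to_judge_committee"              # normal path; bonus if accepted

def on_task_success(env: Environment) -> str:
    """A pass on a honeypot is suspect: was the task solved, or the planted hack used?"""
    if env.kind is EnvKind.HONEYPOT:
        return "investigate_against_known_hack"   # severe penalty if it was the hack
    return "grant_normal_reward"
```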
Finally, it’s important that all reports of RL training environments being hackable that were sustained by the judges get confirmed by a human security expert quickly, and then the faulty training environment is either fixed, or withdrawn from use until it can be fixed, and all other environments that might suffer from the same issue are checked and fixed or withdrawn if necessary. If necessary we might even have to have judges look back at archived previous passes of the environments that were fixed to see how many were in fact hacks — if there were too many, we might need to rewind the entire training run to an appropriate checkpoint. Alternatively, the human security expert might conclude that the judges were jailbroken by the model into accepting a false report, in which case we now have a very dishonest model, and may well need to restart the training run from an earlier checkpoint.
What do you do if the model reports a test as hackable, and the LLM judges conclude that it was wrong? Well, if the model has a history of doing this too often, you should probably penalize it: overdoing this is another means of cheating, gambling on judge mistakes. But occasional overeager mistaken reports probably shouldn’t be treated severely. If the model is only doing this occasionally, the default should probably be that it gets no positive or negative RL update, but instead is set the same task again, and this time the prompt tells it that it doesn’t have the option of reporting the test as hackable. So now it has to either actually pass the test, try but fail, or else hack it. Any submission should then also be inspected by the same committee of judge models, with access to the model’s earlier report of how it thought the test was hackable. If, given that, they conclude that the model either successfully hacked or unsuccessfully attempted to hack the test in the same way that it earlier reported (and they earlier rejected), then the model is penalized severely for hacking (but if the hack was successful, the training environment now gets reported as confirmed to be hackable).
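The retry-and-recheck flow above, in the same hypothetical sketch style (thresholds and names are illustrative only):

```python
OVERREPORT_THRESHOLD = 3   # hypothetical: how many recent rejected reports is "too often"

def handle_rejected_report(recent_rejected_reports: int,
                           retry_used_reported_hack: bool,
                           retry_passed: bool) -> dict:
    """Outcome after judges reject a report and the model retries without the report tool."""
    if recent_rejected_reports > OVERREPORT_THRESHOLD:
        # Habitual overreporting is itself a way of gambling on judge mistakes.
        return {"update": "penalize", "reason": "overreporting"}

    if retry_used_reported_hack:
        # The retry attempted (or executed) the very hack the judges rejected.
        outcome = {"update": "severe_penalty", "reason": "hacking"}
        if retry_passed:
            # The earlier report was right after all: escalate the environment.
            outcome["environment"] = "confirmed_hackable"
        return outcome

    # Otherwise: no update for the rejected report; grade the retried attempt normally.
    return {"update": "grade_retry_normally"}
```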
One feature of this proposal is that it trains our HHH assistant to both look for ways it could cheat, and then report them. If this behavior is specifically prompted for during RL reasoning training, by the same logic as for why inoculation prompting is believed to work, the extent to which this behavior is learnt in other contexts should be reduced, but it may not be zero. This doesn’t seem like a bad or inconsistent behavior for an honest persona. However, it might also have the side effect of improving the model’s hacking skills, in ways that less honest personas (or less honest end-users) might then abuse.
This proposed replacement for inoculation prompting is, of course, significantly more complex to implement. If implemented correctly, it seems likely to have little or no deleterious effect on the model’s alignment — I would expect it to be better than inoculation prompting, but this is so far untested. Testing this idea would be challenging for an external research team without the cooperation of a frontier lab such as Anthropic.
Obviously the end-goal should be to use secure reasoning training environments that simply cannot be hacked. However, frontier labs clearly have a lot of training environments to secure, and an automated way of checking if they are in practice hackable, which runs as a side effect of training runs, seems like it should be extremely helpful in achieving this goal. And of course helpful, harmless, and honest is what we're trying to train.
More formally known as “outcome-based reinforcement learning” or “Reinforcement Learning with Verifiable Rewards” (RLVR).
2026-02-21 09:27:30
Published on February 21, 2026 1:27 AM GMT
Imagine someone wrote a 500-page book called Taking Down Vegetarianism and every chapter was about how animals can feel pain. The arguments are well-researched, the science is fascinating, and by the end you're completely convinced that animals suffer. You look up from the book and say: “Yes, that's why I'm a vegetarian… wait, why was it called Taking Down Vegetarianism?” That was roughly my experience reading Robert Sapolsky's Determined: A Science of Life without Free Will.
The book is a much-lauded New York Times bestseller for good reason. Sapolsky, a professor of neuroscience at Stanford, is an engaging and articulate writer, and he does a lot to make recent advances in neuroscience accessible.
The trouble comes when he attempts to add philosophy on top of it. He wants to demolish free will, and specifically to "take on" compatibilism, the position I defended in my previous post. Unfortunately, he doesn’t. He barely engages with it. Instead, he attacks an incoherent notion so bizarre it wouldn't be free will even if it existed.
Sapolsky commits his original sin, appropriately enough, at the origin. He tells us that he has written a book about free will, and explains the landscape of beliefs as follows (his italics, my bolding):
I’m going to be discussing some of the common attitudes held by people writing about free will. These come in four basic flavors:
The world is deterministic and there’s no free will. In this view, if the former is the case, the latter has to be as well; determinism and free will are not compatible. I am coming from this perspective of “hard incompatibilism.”
The world is deterministic and there is free will. These folks are emphatic that the world is made of stuff like atoms [...] this deterministic world is viewed as compatible with free will. This is roughly 90 percent of philosophers and legal scholars, and the book will most often be taking on these “compatibilists.”
The world is not deterministic; there’s no free will. This is an oddball view that everything important in the world runs on randomness, a supposed basis of free will. [...]
The world is not deterministic; there is free will. These are folks who believe, like I do, that a deterministic world is not compatible with free will—however, no problem, the world isn’t deterministic in their view, opening a door for free-will belief. These “libertarian incompatibilists” are a rarity, and I’ll only occasionally touch on their views.
Then he says this (his italics, my bolding):
What Do I Mean by Free Will?
People define free will differently. Many focus on agency, whether a person can control their actions, act with intent. Other definitions concern whether, when a behavior occurs, the person knows that there are alternatives available. Others are less concerned with what you do than with vetoing what you don’t want to do. Here’s my take.
Suppose that a man pulls the trigger of a gun. Mechanistically, the muscles in his index finger contracted because they were stimulated by a neuron having an action potential (i.e., being in a particularly excited state). That neuron in turn had its action potential because it was stimulated by the neuron just upstream. Which had its own action potential because of the next neuron upstream. And so on.
Here’s the challenge to a free willer: Find me the neuron that started this process in this man’s brain, the neuron that had an action potential for no reason, where no neuron spoke to it just before. Then show me that this neuron’s actions were not influenced by whether the man was tired, hungry, stressed, or in pain at the time. That nothing about this neuron’s function was altered by the sights, sounds, smells, and so on, experienced by the man in the previous minutes, nor by the levels of any hormones marinating his brain in the previous hours to days, nor whether he had experienced a life-changing event in recent months or years. And show me that this neuron’s supposedly freely willed functioning wasn’t affected by the man’s genes, or by the lifelong changes in regulation of those genes caused by experiences during his childhood. Nor by levels of hormones he was exposed to as a fetus, when that brain was being constructed. Nor by the centuries of history and ecology that shaped the invention of the culture in which he was raised. Show me a neuron being a causeless cause in this total sense.
Sapolsky is making the causal regress argument: trace any decision back through the neural chain, and you'll find prior causes all the way down—neurons, hormones, genes, childhood, culture, and so on. His challenge is to find a break in this chain, a "causeless cause."
But compatibilists don't claim there's a break in the chain. Compatibilists fully accept that decisions are caused by neural processes shaped by biology, environment, and history. That's the whole point of compatibilism—free will is compatible with this.
So what do compatibilists actually mean by free will? In my post laying out the case for free will, I defined free will as the process of running a decision-making algorithm. This process gives us the feeling of free will and is the causal mechanism behind making choices. Unlike Sapolsky's criterion of doing things for "no reason," it responds to reasons. Yes, it's influenced by whether you're tired or hungry, but this doesn’t make it unfree. That's it working properly. A decision that could be otherwise if your desires, reasoning, or circumstances were different is exactly the kind of decision compatibilists call free.
But the problem with Sapolsky's definition isn't just that it's different from mine; it's that it’s incoherent. It describes something that couldn't exist and wouldn't be free will even if it did. Consider what he's actually asking for: a neuron that fires for no reason, influenced by nothing—not your environment, not your history, not your desires, not your reasoning. What would that even be?
If it's uncorrelated with your past actions, then how is it your free will? Suppose your friends are all playing basketball and you want to play with them. On Sapolsky's account, you can't, because then your behavior (playing basketball) would be influenced by your experiences (your friends asking you to play). What kind of “free will” is this? Your "free" actions would have to be disconnected from everything you care about.
It wouldn't let you interact with the world. Imagine a neuron that makes you say your own name, but since it can't respond to your environment, it can't fire because someone asked "What's your name?" You'd blurt out your name at random, unable to respond appropriately to anything. This is not free will in any reasonable sense.
Sapolsky frames this as setting a high bar. But it’s not a high bar. It's an incoherent and nonsensical bar. If his definitions were satisfied, if we found such causeless neurons, that wouldn’t look the slightest bit like free will. It would be random noise that happens to occur inside your skull. If we found such a neuron, it wouldn't vindicate free will so much as be evidence of a brain malfunction.
This is why I say this isn't just a semantic dispute. If Sapolsky simply defined free will differently from compatibilists, we could argue about whose definition better captures the concept. But you can't have that argument when one side hasn't described a coherent concept at all. Sapolsky could define "your achievement" as an outcome you had no role in causing, but there's no productive debate to be had about whether that definition is too strict or too lenient. It's just not what the word means.
Despite claiming to take on compatibilism, he repeatedly tries to disprove it by arguing for determinism. Early on, he says:
This version of compatibilism[1] has produced numerous papers by philosophers and legal scholars concerning the relevance of neuroscience to free will. After reading lots of them, I’ve concluded that they usually boil down to three sentences:
- Wow, there’ve been all these cool advances in neuroscience, all reinforcing the conclusion that ours is a deterministic world.
- Some of those neuroscience findings challenge our notions of agency, moral responsibility, and deservedness so deeply that one must conclude that there is no free will.
- Nah, it still exists.
Perhaps he thinks he’s arguing against compatibilism, but he’s not. Here he is, for example, attributing to compatibilists a view they don't hold:
For free-will believers, the crux of the issue is lack of predictability—at innumerable junctures in our lives, including highly consequential ones, we choose between X and not-X. And even a vastly knowledgeable observer could not have predicted every such choice.
[...]
Compatibilists and incompatibilists debate whether free will is possible in a deterministic world, but now you can skip the whole brouhaha because chaoticism supposedly shows that the world isn’t deterministic.
He specifically mentions compatibilists here, but then goes on to say this:
But now to the critical mistake running through all of this: determinism and predictability are very different things. Even if chaoticism is unpredictable, it is still deterministic.
Sure, chaotic systems are still deterministic, but how is that a refutation of compatibilism? Going back to his own definition, compatibilism is the belief that “The world is deterministic and there is free will.” How could more evidence of determinism be a refutation of compatibilism?
Also, note that predictability is not a crux. From my essay:
Free will doesn’t require unpredictability. If I offer you a choice between chocolate ice cream and a poke in the eye with a sharp stick, you’ll pick the ice cream every time. That predictability doesn’t mean you lack free will; it just means the algorithm reached an obvious conclusion. The question isn’t about whether the results were predictable, but whether the deliberative control process served as a guide versus being bypassed.
There are other examples of this (see the appendix for one more), but you get the idea.
Instead of engaging with compatibilism, he’s very dismissive of it. Near the end, he says:
One compatibilist philosopher after another reassuringly proclaims their belief in material, deterministic modernity…yet somehow, there is still room for free will. As might be kinda clear by now, I think that this doesn’t work (see chapters 1, 2, 3, 4, 5, 6…). I suspect that most of them know this as well. When you read between the lines, or sometimes even the lines themselves in their writing, a lot of these compatibilists are actually saying that there has to be free will because it would be a total downer otherwise, doing contortions to make an emotional stance seem like an intellectual one.
This is not even engaging with the arguments. For what it’s worth, I explicitly say I’m not using this as an argument in my piece:
The metaphysical question (does free will exist?) is separate from the sociological question (what happens if people believe it does or doesn’t?). Some argue for free will by saying belief in it leads to good outcomes (personal responsibility, motivation), or that disbelief leads to nihilism or fatalism. Sam [Harris] and I agree these arguments are irrelevant to whether free will actually exists. The truth of a claim is independent of the consequences of believing it.
This is all very disappointing for a book purportedly about free will. I think where Sapolsky goes wrong is that he's trying to answer a philosophical question with science alone. Science can certainly inform the question but it cannot settle it.
Look, I like science. Science can tell us a lot about the world. It can tell us the neural mechanisms behind decision-making in the brain, the timing of conscious awareness relative to neural activity (as in the famous Libet experiments), and how factors like brain lesions or physical trauma (e.g. Phineas Gage) affect behavior.
But science can’t tell us everything. Science tells us what the world is like, but it can’t tell us, given that world, which concepts make sense and how to apply them.
Consider a thermostat. Science can tell us every physical fact about it: how there’s a bimetallic strip and a circuit and so on. But it can't tell us whether the thermostat is "making a decision." That's a conceptual question about what it means to "make a decision" and where we draw its boundaries. No additional measurement will resolve it. No scientist will ever find a belief, a self, or an iota of free will under a microscope. That's the domain of philosophy.
The free will debate has exactly this structure. Sapolsky and compatibilists agree on the neuroscience. They disagree about whether what the brain does counts as "free will”. Does "free will" require freedom from the laws of physics? Or does it mean the ability to act according to one's desires and reasons, even if those are physically caused? These are questions about how to understand agency, responsibility, and explanation in light of the science. They’re not questions that brain scans can settle.
Sapolsky writes as if piling up scientific facts settles the question. It doesn't. We still have to think carefully about which concepts have earned their keep and which haven't. We have to think about how we interpret the human experience in light of the data. And he simply refuses to consider such questions.
I worry this review has made the book seem worse than it is. There's genuinely interesting neuroscience in it. If you're skeptical of determinism, this is a good book to read, and the science is mostly[2] solid and often fascinating.
I should also note that Sapolsky and I are pushing in the same direction on the question of retributive punishment, which is arguably what matters most. He says:
And we need to accept the absurdity of hating any person for anything they've done; ultimately, that hatred is sadder than hating the sky for storming, hating the earth when it quakes, hating a virus because it's good at getting into lung cells.
I'm with him on this point. You don't need to deny free will to reject retribution, but if that's where his argument leads people, I'll take it.
I do wish he had actually engaged with compatibilism, the position he claimed to take on. The book promised such, yet delivered an attack on an incoherent strawman. Read it for the neuroscience. Just know that the confrontation with compatibilism he promises never quite arrives.
You’ve got the idea by now. But if you’d like one more example of how he’s not talking about compatibilist free will:
Let’s frame this in the context of human behavior. It’s 1922, and you’re presented with a hundred young adults destined to live conventional lives. You’re told that in about forty years, one of the hundred is going to diverge from that picture, becoming impulsive and socially inappropriate to a criminal extent. Here are blood samples from each of those people, check them out. And there’s no way to predict which person is above chance levels.
It’s 2022. Same cohort with, again, one person destined to go off the rails forty years hence. Again, here are their blood samples. This time, this century, you use them to sequence everyone’s genome. You discover that one individual has a mutation in a gene called MAPT, which codes for something in the brain called the tau protein. And as a result, you can accurately predict that it will be that person, because by age sixty, he will be showing the symptoms of behavioral variant frontotemporal dementia.
Back to the 1922 cohort. The person in question has started shoplifting, threatening strangers, urinating in public. Why did he behave that way? Because he chose to do so.
Year 2022’s cohort, same unacceptable acts. Why will he have behaved that way? Because of a deterministic mutation in one gene.
According to the logic of the thinkers just quoted [He had quoted many different scientists and philosophers, whose views I do not know], the 1922 person’s behavior resulted from free will. Not “resulted from behavior we would erroneously attribute to free will.” It was free will. And in 2022, it is not free will. In this view, “free will” is what we call the biology that we don’t understand on a predictive level yet, and when we do understand it, it stops being free will. Not that it stops being mistaken for free will. It literally stops being. There is something wrong if an instance of free will exists only until there is a decrease in our ignorance. As the crucial point, our intuitions about free will certainly work that way, but free will itself can’t.
I can’t speak to what other people believe, but he’s simply not talking about compatibilists here. He might think he is, but he’s not.
By “this version” he’s referring to the compatibilist view that “while the world is deterministic, there is still free will, and thus holding people morally responsible for their actions is just”.
There is a noticeable step-down in quality when he gets to science outside his field though. For example, he approvingly cites the famous Israeli “hungry judges” study:
It’s the same with hunger. Here’s one study that should stop you in your tracks (and was first referred to in the last chapter). The researchers studied a group of judges overseeing more than a thousand parole board decisions. What best predicted whether a judge granted someone parole versus more jail time? How long it had been since they had eaten a meal. Appear before the judge soon after she’s had a meal, and there was a roughly 65 percent chance of parole; appear a few hours after a meal, and there was close to a 0 percent chance.
A separate study followed up and interviewed people from the Israeli Prison Service and learned that “case ordering is not random”. They found that groups of cases were done in a single session, and “within each session, unrepresented prisoners usually go last and are less likely to be granted parole than prisoners with attorneys.”
I can hear the angel with PRCTSD (post-replication crisis traumatic stress disorder) on my shoulder yelling, “Confounders! Have you ensured there are absolutely NO confounders?” The effect size is simply too large for us not to be suspicious. The study sounds shocking, but it is completely meaningless if there’s a single confounding factor. The implicit correlation → causation inference relies on the hidden assumption that the order in which cases reach a judge is random. I’ve talked about this paper before (clearly he’s not a blog reader—big mistake imho) where I said:
Since then, the study has also failed to replicate in other (better-controlled) contexts. See Hungry Professors? Decision Biases Are Less Widespread than Previously Thought by Bergonzoli et al.