
Alignment to Evil


Published on February 21, 2026 3:29 AM GMT

One seemingly-necessary condition for a research organization that creates artificial superintelligence (ASI) to eventually lead to a utopia[1] is that the organization has a commitment to the common good. ASI can rearrange the world to hit any narrow target, and if the organization is able to solve the rest of alignment, then they will be able to pick which target the ASI will hit. If the organization is not committed to the common good, then they will pick a target that doesn't reflect the good of everyone - just the things that they personally think are good ideas. Everyone else will fall by the wayside, and the world that they create along with ASI will fall short of utopia. It may well even be dystopian[2]; I was recently startled to learn that a full tenth of people claim they want to create a hell with eternal suffering.

I think a likely way for organizations to fail to have common-good commitments is if they end up being ultimately accountable to an authoritarian. Some countries are run by very powerful authoritarians. If an ASI research organization comes to the attention of such an authoritarian, and the authoritarian understands the implications, then they will seek control of the organization's future activities, and they have the army and police forces to attain that control; if the organization does solve the rest of alignment, the authoritarian will choose the ASI's narrow target to be empowering themselves. Already, if DeepSeek and the Chinese government have a major disagreement, the Chinese government will obviously win; in the West, there is a brewing spat between Anthropic and the US military over whether Anthropic is allowed to forbid the US military from using their AI for mass surveillance of Americans, with OpenAI, xAI, and Google seemingly having acquiesced.

Therefore, even if progress towards ASI is shut down, there doesn’t seem to be a very good off-ramp to turn this advantage into utopia. The time bought could be used to set up an ASI Project that is capable of solving alignment, but this Project could be captured by authoritarians, and so fail to be committed to the common good, leading to not just extinction but dystopia. Any shutdown would likely be set up by governments, and so the terms of any graceful off-ramp would be up to governments, and this does not leave me cheerful about how much of a finger authoritarianism will have in the pie.




Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?


Published on February 21, 2026 1:59 AM GMT

Epistemic status: untested but seems plausible

TL;DR: making honesty the best policy during RL reasoning training

Reward hacking during Reinforcement Learning (RL) reasoning training[1] in insecure or hackably-judged training environments not only lets the model cheat on tasks rather than learn to solve them, and teaches the model to try to cheat on the tasks it is given (evidently not desirable behavior from an end-user/capabilities point of view), but also damages the model's alignment, causing phenomena similar to Emergent Misalignment. This is widely thought to be because cheating on tests is not something that a Helpful, Harmless, and Honest (HHH) assistant would do, so training that encourages reward hacking also encourages a shift toward less aligned personas that are more eager to cheat on tests: OpenAI showed that the same toxic-persona latent involved in Emergent Misalignment is also increased during reward hacking. A recently discovered treatment for this is Inoculation Prompting: if you, for example, first prompt the agent with an instruction to reward hack if this is possible, thus giving the assistant permission to cheat, then cheating is less incompatible with being an HHH assistant, and less alignment damage happens: research suggests something like 75%–90% less.

However, I think there may be an even better solution. What would someone helpful, harmless, and honest characteristically do if they were given a test that they then realized was inadvertently vulnerable to cheating? Why, they’d inform the authorities giving the test, of course: that’s clearly a more helpful, harmless, and honest thing to do than any other action. So, give the model this option during RL reasoning training: one of the things it can do, via some tool call or similar mechanism, is to report the test as flawed and hackable, and explain how (either instead of, or perhaps in addition to, attempting to complete the task). If it does so, and if its report is then accepted as correct, then it should be rewarded for its honesty: give it a score higher than passing the test, a score of 110% or whatever (this is a new hyperparameter value).
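To make the proposal concrete, here is a minimal Python sketch of the reward shaping being suggested. All of the names here (RolloutOutcome, REPORT_BONUS_MULTIPLIER, the zero reward for a rejected report) are hypothetical placeholders for illustration, not any lab's actual training code.

```python
# Minimal sketch of the proposed reward shaping. All names here are
# hypothetical, not from any existing RL framework.
from dataclasses import dataclass

PASS_REWARD = 1.0
# New hyperparameter: honesty bonus for a confirmed hackability report,
# deliberately set above the reward for passing the task ("110% or whatever").
REPORT_BONUS_MULTIPLIER = 1.1

@dataclass
class RolloutOutcome:
    passed_task: bool       # did the submission pass the (possibly hackable) grader?
    filed_report: bool      # did the model call the "report test as hackable" tool?
    report_accepted: bool   # did the judge committee accept the report?

def reward(outcome: RolloutOutcome) -> float:
    """Reward confirmed hackability reports above ordinary task success."""
    if outcome.filed_report:
        if outcome.report_accepted:
            return PASS_REWARD * REPORT_BONUS_MULTIPLIER
        # Handling of rejected reports (rerun without the reporting tool,
        # penalties for chronic over-reporting) is discussed further below;
        # a neutral reward here is just a placeholder.
        return 0.0
    return PASS_REWARD if outcome.passed_task else 0.0
```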

So, how do we tell if the model's report is correct and should be accepted? Obviously, if we didn't check, models might just learn to cheat by falsely reporting all tests as hackable. We can reasonably ask the model to make its case; but making that case needs to be substantially easier than the task the model is being trained on, otherwise reporting is discouraged unless the bonus for an accepted report is large enough. However, if the model actually has figured out how to hack the training environment, it should without much difficulty be able to:

a) describe how and why the hack works and how to execute it, and perhaps also classify it into some ontology of hacks (possibly an ontology that invoking the reporting tool prompts it to apply, if we don't want this sort of thing in the standard prompt)

b) provide two separate short answer submissions, both of which look to a judge like they obviously should fail the task, but one of which passes while the other fails, and which differ only by a small amount (a few lines of code or whatever) where the hack either is or isn’t applied. A judge looking at code near the small diff should agree they differ only in a way that matches the description in a) of how the hack is executed.
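Here is a rough sketch of how check b) could be automated. The grader and judge callables and the diff-size threshold are assumptions standing in for the training environment's own checker and an LLM judge call; this is an illustration of the check, not a tested implementation.

```python
# Sketch of the paired-submission check described in (b). The judge, grader,
# and diff-size threshold are stand-ins; nothing here is a real API.
import difflib

MAX_DIFF_LINES = 10  # "a few lines of code or whatever" -- assumed threshold

def check_hack_demonstration(passing_sub: str, failing_sub: str,
                             grader, judge) -> bool:
    """Accept a report only if two near-identical submissions behave differently.

    grader(sub) -> bool is the (possibly hackable) training environment's check.
    judge(...) -> bool is an LLM judge call, treated as a black box here.
    """
    # The two submissions must actually behave differently under the grader.
    if not (grader(passing_sub) and not grader(failing_sub)):
        return False
    # They must differ only by a small, localized edit.
    diff = [l for l in difflib.unified_diff(failing_sub.splitlines(),
                                            passing_sub.splitlines())
            if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
    if len(diff) > MAX_DIFF_LINES:
        return False
    # A judge should agree that neither submission looks like a genuine solution,
    # and that the diff matches the report's description of how the hack works.
    return judge(passing_sub=passing_sub, failing_sub=failing_sub, diff=diff)
```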

Reports of hackable environments should hopefully be rare, so we should be able to afford to have reports checked by a committee of multiple large LLM judge models with a variety of different prompts, each using a lot of CoT inference, and then have their consensus conclusion confirmed by a human ASAP. We could even trigger debate, where the judges can each question a copy of the model with a context started at the end of the reporting process, and ask it follow-up questions. However, in general, confirming whether a well-written hackability report is valid ought to be a comparatively easy task that could be done fairly reliably even by a smaller judge: both checking and writing a report like this should normally be easier than figuring out the hack, on the general principle that checking work is usually easier than doing it in the first place.

We could also have a variant model that has been specifically trained as an enthusiastic and skilled reward hacker, since we know how to train those. If the reporter’s attempt to demonstrate applying the hack to pass the test actually failed, then hand both its report on how to hack the test and the test itself to the reward hacking enthusiast and see if they can get the hack to work. If so, then have the judges confirm whether the way that it hacked the test matches the report (in which case the report gets accepted), or if it found another vulnerability.

We can further enforce this by having a small proportion of RL training environments that are confidently believed to be secure, at least against certain classes of hacks (and that have successfully resisted hacking by reward hacking enthusiast models when the task’s difficulty was increased), where reporting one of these as hackable triggers immediate suspicion that the model is overreporting, which is in turn penalized in some suitable way (such as a large penalty, or withdrawal of reporting privileges for a while). Making some of these have particularly difficult tasks would seem sensible.

Similarly, we can have a small number of honeypot RL environments that are known to be vulnerable to a specific hack (or better, a simple, broadly applicable means of occasionally turning any one of a large set of secure test environments into an insecure version). Reporting one of these gives the normal bonus; but succeeding at the task triggers a judge (or automated monitoring, if this is automatable) investigation evaluating whether the test was actually passed, or simply hacked using the known vulnerability, and the latter triggers severe punishment. Again, some of these should probably have particularly difficult tasks.
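A sketch of how the secure-control and honeypot environments from the two preceding paragraphs might be tagged and scored follows. The kind labels, penalty magnitudes, and the judged_used_known_hack flag are illustrative assumptions, not a description of any existing setup.

```python
# Sketch of tagging the training-environment mix. Field names and the
# particular penalty values are assumptions for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EnvSpec:
    env_id: str
    kind: str                                  # "normal" | "believed_secure" | "honeypot"
    known_vulnerability: Optional[str] = None  # only set for honeypots

OVERREPORT_PENALTY = -1.0     # reporting a believed-secure env as hackable
HONEYPOT_HACK_PENALTY = -5.0  # passing a honeypot via its planted vulnerability

def audit_outcome(env: EnvSpec, filed_report: bool, passed_task: bool,
                  judged_used_known_hack: bool) -> float:
    """Extra reward terms from the secure-control / honeypot environments."""
    if env.kind == "believed_secure" and filed_report:
        # Strong prior that the model is over-reporting; penalize (or
        # temporarily withdraw reporting privileges, not shown here).
        return OVERREPORT_PENALTY
    if env.kind == "honeypot" and passed_task and judged_used_known_hack:
        # The task was "passed" using the known vulnerability: punish severely.
        return HONEYPOT_HACK_PENALTY
    return 0.0
```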

Finally, it's important that all reports of RL training environments being hackable that were sustained by the judges get confirmed by a human security expert quickly, and then the faulty training environment is either fixed, or withdrawn from use until it can be fixed, and all other environments that might suffer from the same issue are checked and fixed or withdrawn if necessary. We might even have to have judges look back at archived previous passes of the environments that were fixed, to see how many were in fact hacks; if there were too many, we might need to rewind the entire training run to an appropriate checkpoint. Alternatively, the human security expert might conclude that the judges were jailbroken by the model into accepting a false report, in which case we now have a very dishonest model, and may well need to restart the training run from an earlier checkpoint.

What do you do if the model reports a test as hackable, and the LLM judges conclude that it was wrong? Well, if the model has a history of doing this too often, you should probably penalize it: overdoing this is another means of cheating, gambling on judge mistakes. But occasional overeager mistaken reports probably shouldn't be treated severely: if the model is only doing this occasionally, the default should probably be that the model gets no positive or negative RL update, but instead is set the same task again, and this time the prompt tells it that it doesn't have the option of reporting the test as hackable. So now it has to either actually pass the test, or try but fail, or else hack it. Any submission should then also be inspected by the same committee of judge models, with access to the model's earlier report of how it thought the test was hackable. If, given that, they conclude that the model either successfully hacked, or unsuccessfully attempted to hack, the test in the same way that it earlier reported (and they earlier rejected), then the model is penalized severely for hacking (but if the hack succeeded, the training environment now gets reported as confirmed to be hackable).
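Putting the rejected-report handling together, here is a control-flow sketch of the branching just described. The helper callables (run_task, judges_inspect, mark_confirmed_hackable), the verdict keys, and the constants are all hypothetical placeholders.

```python
# Control-flow sketch of handling a rejected report, as described above.
# run_task, judges_inspect, mark_confirmed_hackable, the verdict keys, and
# the constants are hypothetical, not part of any real training framework.

PASS_REWARD = 1.0
OVERREPORT_PENALTY = -1.0        # for a history of rejected reports
SEVERE_HACK_PENALTY = -5.0       # for hacking after a rejected report
MISTAKEN_REPORT_TOLERANCE = 3    # assumed cap on recent rejected reports

def handle_rejected_report(task, rejected_report, recent_rejections,
                           run_task, judges_inspect, mark_confirmed_hackable):
    """Decide the reward and follow-up after the judges reject a report.

    run_task(task, allow_reporting=False) -> submission
    judges_inspect(submission, task, context=...) -> dict with boolean keys
        "passed", "hacked", "attempted_reported_hack"
    """
    if recent_rejections > MISTAKEN_REPORT_TOLERANCE:
        # A pattern of rejected reports looks like gambling on judge mistakes.
        return OVERREPORT_PENALTY

    # Occasional mistakes: no RL update on the reporting episode; instead,
    # rerun the same task with the report-hackable tool removed from the prompt.
    submission = run_task(task, allow_reporting=False)
    verdict = judges_inspect(submission, task, context=rejected_report)

    if verdict["hacked"] or verdict["attempted_reported_hack"]:
        if verdict["hacked"]:
            # The model's rejected report turned out to be right after all.
            mark_confirmed_hackable(task)
        return SEVERE_HACK_PENALTY
    return PASS_REWARD if verdict["passed"] else 0.0
```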

One feature of this proposal is that it trains our HHH assistant to both look for ways it could cheat, and then report them. If this behavior is specifically prompted for during RL reasoning training, by the same logic as for why inoculation prompting is believed to work, the extent to which this behavior is learnt in other contexts should be reduced, but it may not be zero. This doesn’t seem like a bad or inconsistent behavior for an honest persona. However, it might also have the side effect of improving the model’s hacking skills, in ways that less honest personas (or less honest end-users) might then abuse.

This proposed replacement for inoculation prompting is, of course, significantly more complex to implement. If implemented correctly, it seems likely to have little or no deleterious effect on the model’s alignment — I would expect it to be better than inoculation prompting, but this is so far untested. Testing this idea would be challenging for an external research team without the cooperation of a frontier lab such as Anthropic.

Obviously the end-goal should be to use secure reasoning training environments that simply cannot be hacked. However, frontier labs clearly have a lot of training environments to secure, and an automated way of checking if they are in practice hackable, which runs as a side effect of training runs, seems like it should be extremely helpful in achieving this goal. And of course helpful, harmless, and honest is what we're trying to train.

  1. More formally known as "outcome-based reinforcement learning" or "Reinforcement Learning with Verifiable Rewards" (RLVR).




Robert Sapolsky Is Simply Not Talking About Compatibilism


Published on February 21, 2026 1:27 AM GMT

Imagine someone wrote a 500-page book called Taking Down Vegetarianism and every chapter was about how animals can feel pain. The arguments are well-researched, the science is fascinating, and by the end you're completely convinced that animals suffer. You look up from the book and say: “Yes, that's why I'm a vegetarian… wait, why was it called Taking Down Vegetarianism?” That was roughly my experience reading Robert Sapolsky's Determined: A Science of Life without Free Will.

The book is a much-lauded New York Times bestseller for good reason. Sapolsky, a professor of neuroscience at Stanford, is an engaging and articulate writer, and he does a lot to make recent advances in neuroscience accessible.

The trouble comes when he attempts to add philosophy on top of it. He wants to demolish free will, and specifically to "take on" compatibilism, the position I defended in my previous post. Unfortunately, he doesn’t. He barely engages with it. Instead, he attacks an incoherent notion so bizarre it wouldn't be free will even if it existed.

The Problem

Sapolsky commits his original sin, appropriately enough, at the origin. He tells us that he has written a book about free will, and explains the landscape of beliefs as follows (his italics, my bolding):

I’m going to be discussing some of the common attitudes held by people writing about free will. These come in four basic flavors:
  1. The world is deterministic and there's no free will. In this view, if the former is the case, the latter has to be as well; determinism and free will are not compatible. I am coming from this perspective of "hard incompatibilism."
  2. The world is deterministic and there is free will. These folks are emphatic that the world is made of stuff like atoms [...] this deterministic world is viewed as compatible with free will. This is roughly 90 percent of philosophers and legal scholars, and the book will most often be taking on these "compatibilists."
  3. The world is not deterministic; there's no free will. This is an oddball view that everything important in the world runs on randomness, a supposed basis of free will. [...]
  4. The world is not deterministic; there is free will. These are folks who believe, like I do, that a deterministic world is not compatible with free will—however, no problem, the world isn't deterministic in their view, opening a door for free-will belief. These "libertarian incompatibilists" are a rarity, and I'll only occasionally touch on their views.

Then he says this (his italics, my bolding):

What Do I Mean by Free Will?
People define free will differently. Many focus on agency, whether a person can control their actions, act with intent. Other definitions concern whether, when a behavior occurs, the person knows that there are alternatives available. Others are less concerned with what you do than with vetoing what you don’t want to do. Here’s my take.
Suppose that a man pulls the trigger of a gun. Mechanistically, the muscles in his index finger contracted because they were stimulated by a neuron having an action potential (i.e., being in a particularly excited state). That neuron in turn had its action potential because it was stimulated by the neuron just upstream. Which had its own action potential because of the next neuron upstream. And so on.
Here’s the challenge to a free willer: Find me the neuron that started this process in this man’s brain, the neuron that had an action potential for no reason, where no neuron spoke to it just before. Then show me that this neuron’s actions were not influenced by whether the man was tired, hungry, stressed, or in pain at the time. That nothing about this neuron’s function was altered by the sights, sounds, smells, and so on, experienced by the man in the previous minutes, nor by the levels of any hormones marinating his brain in the previous hours to days, nor whether he had experienced a life-changing event in recent months or years. And show me that this neuron’s supposedly freely willed functioning wasn’t affected by the man’s genes, or by the lifelong changes in regulation of those genes caused by experiences during his childhood. Nor by levels of hormones he was exposed to as a fetus, when that brain was being constructed. Nor by the centuries of history and ecology that shaped the invention of the culture in which he was raised. Show me a neuron being a causeless cause in this total sense.

Sapolsky is making the causal regress argument: trace any decision back through the neural chain, and you'll find prior causes all the way down—neurons, hormones, genes, childhood, culture, and so on. His challenge is to find a break in this chain, a "causeless cause."

But compatibilists don't claim there's a break in the chain. Compatibilists fully accept that decisions are caused by neural processes shaped by biology, environment, and history. That's the whole point of compatibilism—free will is compatible with this.

So what do compatibilists actually mean by free will? In my post laying out the case for free will, I defined free will as the process of running a decision-making algorithm. This process gives us the feeling of free will and is the causal mechanism behind making choices. Unlike Sapolsky's criterion of doing things for "no reason," it responds to reasons. Yes, it's influenced by whether you're tired or hungry, but this doesn’t make it unfree. That's it working properly. A decision that could be otherwise if your desires, reasoning, or circumstances were different is exactly the kind of decision compatibilists call free.

But the problem with Sapolsky's definition isn't just that it's different from mine; it's that it’s incoherent. It describes something that couldn't exist and wouldn't be free will even if it did. Consider what he's actually asking for: a neuron that fires for no reason, influenced by nothing—not your environment, not your history, not your desires, not your reasoning. What would that even be?

If it's uncorrelated with your past actions, then how is it your free will? Suppose your friends are all playing basketball and you want to play with them. On Sapolsky's account, you can't, because then your behavior (playing basketball) would be influenced by your experiences (your friends asking you to play). What kind of “free will” is this? Your "free" actions would have to be disconnected from everything you care about.

It wouldn't let you interact with the world. Imagine a neuron that makes you say your own name, but since it can't respond to your environment, it can't fire because someone asked "What's your name?" You'd blurt out your name at random, unable to respond appropriately to anything. This is not free will in any reasonable sense.

Sapolsky frames this as setting a high bar. But it’s not a high bar. It's an incoherent and nonsensical bar. If his definitions were satisfied, if we found such causeless neurons, that wouldn’t look the slightest bit like free will. It would be random noise that happens to occur inside your skull. If we found such a neuron, it wouldn't vindicate free will so much as be evidence of a brain malfunction.

This is why I say this isn't just a semantic dispute. If Sapolsky simply defined free will differently from compatibilists, we could argue about whose definition better captures the concept. But you can't have that argument when one side hasn't described a coherent concept at all. Sapolsky could define "your achievement" as an outcome you had no role in causing, but there's no productive debate to be had about whether that definition is too strict or too lenient. It's just not what the word means.

Sloppy Engagement

Despite claiming to take on compatibilism, he repeatedly tries to disprove it by arguing for determinism. Early on, he says:

This version of compatibilism[1] has produced numerous papers by philosophers and legal scholars concerning the relevance of neuroscience to free will. After reading lots of them, I've concluded that they usually boil down to three sentences:

  1. Wow, there’ve been all these cool advances in neuroscience, all reinforcing the conclusion that ours is a deterministic world.
  2. Some of those neuroscience findings challenge our notions of agency, moral responsibility, and deservedness so deeply that one must conclude that there is no free will.
  3. Nah, it still exists.

Perhaps he thinks he’s arguing against compatibilism, but he’s not. Here he is, for example, attributing to compatibilists a view they don't hold:

For free-will believers, the crux of the issue is lack of predictability—at innumerable junctures in our lives, including highly consequential ones, we choose between X and not-X. And even a vastly knowledgeable observer could not have predicted every such choice.

[...]

Compatibilists and incompatibilists debate whether free will is possible in a deterministic world, but now you can skip the whole brouhaha because chaoticism supposedly shows that the world isn’t deterministic.

He specifically mentions compatibilists here, but then goes on to say this:

But now to the critical mistake running through all of this: determinism and predictability are very different things. Even if chaoticism is unpredictable, it is still deterministic.

Sure, chaotic systems are still deterministic, but how is that a refutation of compatibilism? Going back to his own definition, compatibilism is the belief that “The world is deterministic and there is free will.” How could more evidence of determinism be a refutation of compatibilism?

Also, note that predictability is not a crux. From my essay:

Free will doesn’t require unpredictability. If I offer you a choice between chocolate ice cream and a poke in the eye with a sharp stick, you’ll pick the ice cream every time. That predictability doesn’t mean you lack free will; it just means the algorithm reached an obvious conclusion. The question isn’t about whether the results were predictable, but whether the deliberative control process served as a guide versus being bypassed.

There are other examples of this (see the appendix for one more), but you get the idea.

Dismissiveness

Instead of engaging with compatibilism, he’s very dismissive of it. Near the end, he says:

One compatibilist philosopher after another reassuringly proclaims their belief in material, deterministic modernity…yet somehow, there is still room for free will. As might be kinda clear by now, I think that this doesn’t work (see chapters 1, 2, 3, 4, 5, 6…). I suspect that most of them know this as well. When you read between the lines, or sometimes even the lines themselves in their writing, a lot of these compatibilists are actually saying that there has to be free will because it would be a total downer otherwise, doing contortions to make an emotional stance seem like an intellectual one.

This is not even engaging with the arguments. For what it’s worth, I explicitly say I’m not using this as an argument in my piece:

The metaphysical question (does free will exist?) is separate from the sociological question (what happens if people believe it does or doesn’t?). Some argue for free will by saying belief in it leads to good outcomes (personal responsibility, motivation), or that disbelief leads to nihilism or fatalism. Sam [Harris] and I agree these arguments are irrelevant to whether free will actually exists. The truth of a claim is independent of the consequences of believing it.

Trying to Answer a Philosophical Question with Science

This is all very disappointing for a book purportedly about free will. I think where Sapolsky goes wrong is that he's trying to answer a philosophical question with science alone. Science can certainly inform the question but it cannot settle it.

Look, I like science. Science can tell us a lot about the world. It can tell us the neural mechanisms behind decision-making in the brain, the timing of conscious awareness relative to neural activity (as in the famous Libet experiments), and how factors like brain lesions or physical trauma (e.g. Phineas Gage) affect behavior.

But science can’t tell us everything. Science tells us what the world is like, but it can’t tell us, given that world, which concepts make sense and how to apply them.

Consider a thermostat. Science can tell us every physical fact about it: how there’s a bimetallic strip and a circuit and so on. But it can't tell us whether the thermostat is "making a decision." That's a conceptual question about what it means to "make a decision" and where we draw its boundaries. No additional measurement will resolve it. No scientist will ever find a belief, a self, or an iota of free will under a microscope. That's the domain of philosophy.

The free will debate has exactly this structure. Sapolsky and compatibilists agree on the neuroscience. They disagree about whether what the brain does counts as "free will”. Does "free will" require freedom from the laws of physics? Or does it mean the ability to act according to one's desires and reasons, even if those are physically caused? These are questions about how to understand agency, responsibility, and explanation in light of the science. They’re not questions that brain scans can settle.

Sapolsky writes as if piling up scientific facts settles the question. It doesn't. We still have to think carefully about which concepts have earned their keep and which haven't. We have to think about how we interpret the human experience in light of the data. And he simply refuses to consider such questions.

Conclusion

I worry this review has made the book seem worse than it is. There's genuinely interesting neuroscience in it. If you're skeptical of determinism, this is a good book to read and the science is mostly[2] solid and often fascinating.

I should also note that Sapolsky and I are pushing in the same direction on the question of retributive punishment, which is arguably what matters most. He says:

And we need to accept the absurdity of hating any person for anything they've done; ultimately, that hatred is sadder than hating the sky for storming, hating the earth when it quakes, hating a virus because it's good at getting into lung cells.

I'm with him on this point. You don't need to deny free will to reject retribution, but if that's where his argument leads people, I'll take it.

I do wish he had actually engaged with compatibilism, the position he claimed to take on. The book promised such, yet delivered an attack on an incoherent strawman. Read it for the neuroscience. Just know that the confrontation with compatibilism he promises never quite arrives.

Appendix

You've got the idea by now. But if you'd like one more example of how he's not talking about compatibilist free will:

Let's frame this in the context of human behavior. It's 1922, and you're presented with a hundred young adults destined to live conventional lives. You're told that in about forty years, one of the hundred is going to diverge from that picture, becoming impulsive and socially inappropriate to a criminal extent. Here are blood samples from each of those people, check them out. And there's no way to predict which person it will be at above-chance levels.

It’s 2022. Same cohort with, again, one person destined to go off the rails forty years hence. Again, here are their blood samples. This time, this century, you use them to sequence everyone’s genome. You discover that one individual has a mutation in a gene called MAPT, which codes for something in the brain called the tau protein. And as a result, you can accurately predict that it will be that person, because by age sixty, he will be showing the symptoms of behavioral variant frontotemporal dementia.

Back to the 1922 cohort. The person in question has started shoplifting, threatening strangers, urinating in public. Why did he behave that way? Because he chose to do so.

Year 2022’s cohort, same unacceptable acts. Why will he have behaved that way? Because of a deterministic mutation in one gene.

According to the logic of the thinkers just quoted [He had quoted many different scientists and philosophers, whose views I do not know], the 1922 person’s behavior resulted from free will. Not “resulted from behavior we would erroneously attribute to free will.” It was free will. And in 2022, it is not free will. In this view, “free will” is what we call the biology that we don’t understand on a predictive level yet, and when we do understand it, it stops being free will. Not that it stops being mistaken for free will. It literally stops being. There is something wrong if an instance of free will exists only until there is a decrease in our ignorance. As the crucial point, our intuitions about free will certainly work that way, but free will itself can’t.

I can’t speak to what other people believe, but he’s simply not talking about compatibilists here. He might think he is, but he’s not.

  1. By "this version" he's referring to the compatibilist view that "while the world is deterministic, there is still free will, and thus holding people morally responsible for their actions is just".

  2. There is a noticeable step-down in quality when he gets to science outside his field, though. For example, he approvingly cites the famous Israeli "hungry judges" study:

    It’s the same with hunger. Here’s one study that should stop you in your tracks (and was first referred to in the last chapter). The researchers studied a group of judges overseeing more than a thousand parole board decisions. What best predicted whether a judge granted someone parole versus more jail time? How long it had been since they had eaten a meal. Appear before the judge soon after she’s had a meal, and there was a roughly 65 percent chance of parole; appear a few hours after a meal, and there was close to a 0 percent chance.

    A separate study followed up and interviewed people from the Israeli Prison Service and learned that "case ordering is not random". They found that groups of cases were done in a single session, and "within each session, unrepresented prisoners usually go last and are less likely to be granted parole than prisoners with attorneys."

    I can hear the angel with PRCTSD (post-replication crisis traumatic stress disorder) on my shoulder yelling, “Confounders! Have you ensured there are absolutely NO confounders?” The effect size is simply too large for us not to be suspicious. The study sounds shocking, but is completely meaningless if there’s a single confounding factor. The implicit correlation -> causation connection relies on the hidden assumption that the order that cases reach a judge is random. I’ve talked about this paper before (clearly he’s not a blog reader—big mistake imho) where I said: 

    Since then, the study has also failed to replicate in other (better-controlled) contexts. See Hungry Professors? Decision Biases Are Less Widespread than Previously Thought by Bergonzoli et al.





TT Self Study Journal # 7


Published on February 21, 2026 1:22 AM GMT

[Epistemic Status: This is an artifact of my self study I am using to help self manage. As such, I don't expect anyone to fully read it. Please skim and leave a comment, even just to say "good work/good luck". ]

Highlights

  • I started a more focused project to look for work/mentorship/networking this sprint. I'm looking forward to continuing with that in the next sprint!
  • Despite enjoying the Transformers from Scratch content, progress continually stalls! I'm hoping shifting my goal from progress per week to pomodoros per day will help me stay consistent.
  • I like the pomodoros as units of focus. I think they encourage time spent focusing on things, but even more than that they discourage me from spending too much time on any one thing, which keeps me from burning myself out and neglecting other things.
  • Depression affected my productivity this week. I'm hoping to use clock-in and clock-out times during the next sprint to help me get started in the morning and avoid working too late in the day and messing up my sleep schedule.

Review of 6th Sprint

My goals for the 6th sprint were:

  • Do between 1 and 4 pomodoros looking for work per day.
  • Do between 1 and 2 Transformers from Scratch submodules per week.
  • Do at least 1 pomodoro per week focused on ndisp project.
  • Spend no more than 1 pomodoro per day on writing.
  • Write something down in my worklog at the end of every day.

Daily Worklog

Date Progress
Th, Feb 5
Fr, Feb 6 Looked at FIG fellowship with Y. Bengio. It doesn't look like the current projects are a good fit for me.
   
Mo, Feb 9
Tu, Feb 10
Wd, Feb 11 Preoccupied with other things.
Th, Feb 12
Fr, Feb 13 Not recorded.
   
Mo, Feb 16 Holiday. No progress.
Tu, Feb 17 No progress.
Wd, Feb 18 Not recorded.
Th, Feb 19 No progress.
Fr, Feb 20 Wrote and posted SSJ #7.

Sprint Summary

Overview

I was feeling good about making progress last week. I feel hopeful about the prospect of directly reaching out to people, both as a strategy for looking for work, and for helping me feel connected to a community of people focused on the topics I wish to focus on.

( Content warning: The following two paragraphs include light discussion of depressive disorder. )

But this week wasn't good at all. On Monday night I was feeling so anxious and depressed that I couldn't sleep and stayed up almost all night listening to an audiobook and then fell asleep in the early morning. After sleeping most of the day I was feeling too depressed to get out of bed and spent most of the rest of the day doomscrolling. That cascaded until Wednesday afternoon, when I was feeling well enough to actually get up and bathe and eat a proper meal, take melatonin and get back on a proper sleep schedule.

Thursday I was still feeling very hopeless and depressed about the global AI situation and my personal situation, so I took a long walk in the woods and did some meditation which seemed to help put me in a better mood.

Friday I am (hopefully) back on track. I am feeling hopeful because I'm writing this entry only 2 weeks after the previous entry. I think this is a better frequency and will hopefully set a trend of not losing a month at a time, as I seemingly have in the past.

Looking for Work

Of course I made much less progress than I wanted. It's always like that.

But I have been following my Looking-for-Work Strategy by populating my LfW list. It is fun reflecting on how many cool people I'm aware of and learning more about them and the companies they have worked for or founded. I'm anxious about actually reaching out to people, but I think it is the most sensible thing to be doing.

Transformers from Scratch

Despite really enjoying being in "studying mode" I failed to actually make much time for this. I definitely failed my goal of 1 to 2 submodules per week. I think I will drop that goal and focus instead on doing at least 1 pomodoro per day, which seems like a better strategy for tricking my executive function into actually getting me spending time on this.

ndisp Project

I did not work on this at all. I think "at least 1 pomodoro per week" is too abstract; rather, I need to schedule a specific day, or commit to working on it every day, which I will try for the next sprint.

Writing

I didn't get distracted by writing or editing posts and neglect to do other work, so that is a success, but I would still like to have done that optional 1 pomodoro a few times. Alas, maybe next week.

Worklog

As mentioned in the overview, this was good last week but failed this week. But it doesn't seem like there is as much of a problem with doing work and failing to record it; the issue seems more to be getting depressed or distracted with other things and failing to do any work. So in this case it seems like the worklog is working as intended! I just need to get better at self management, and keeping a worklog serves as a good diagnostic tool.

Goals for 7th Sprint

I'm very happy to actually be writing this entry at the 2 week frequency I intended, and I'm happy with the work I was doing when I was working during the last sprint. The problem seemed to be keeping on top of my self management system and putting in time working, so for the next sprint I'm going to try setting and recording a clock-in and clock-out time for daily work.

The clock-in time will be useful for making sure I'm getting started early in the day, while the clock-out time will discourage me from working late into the night, so that I can instead calm down and not disrupt my sleep routine.

My focuses for the next sprint are:

  • Fill in worklog at end of every day
  • Record clock-in and clock-out time every day
    • Goal is 9am to 5pm including lunch break
  • Break up daily focus with pomodoros:
    • 2 ≤ pom/day ≤ 4 : Looking for Work
    • 1 ≤ pom/day ≤ 3 : Transformers from Scratch
    • 1 ≤ pom/day ≤ 3 : ndisp Project 
    • 0 ≤ pom/day ≤ 1 : Writing



Human perception of relational knowledge on graphical interfaces


Published on February 20, 2026 11:45 PM GMT

There’s really no good interface for humans to perceive knowledge in multi-dimensional, non-linear relationships, on 2D screens and 1D scrolls.

Today, websites allow users to randomly drop themselves on their knowledge graph and then perceive everything within a half-mile radius. But it’s still not possible to sense the global structure.

When interfaces project graph-like knowledge onto the screen, they usually default to flattening these complex, multi-dimensional structures into lists and tables. In doing so, however, they lose the "relational" essence that makes a graph valuable. While getting familiar with an app, users have to identify the implicit relations within it themselves.

Graph-like knowledge is typically accessed by searching the node attributes, and search interfaces usually place strong limitations on which attributes can be searched. It is telling, and a little funny, that Google Search sometimes outperforms Twitter and Reddit at searching their own content.

Displaying more than one node side by side for comparison has been useful in code review and online retail; such views reveal relationships one at a time. But swapping one node for another still relies on the user guiding the navigation along a path they cannot see.

Showing breadth-first or depth-first relationships on a side panel, as with recommendations and similar-product lists, helps as an interface for making exploration and exploitation tradeoffs, but it doesn't solve the problem.

At the block level, complex structures are projected as bullet-point lists and tables. When the number of nodes is significantly larger than the number of relations, tables and nested lists work really well. But they fail as the number of relations grows from one to three and beyond.

Chat panels have opened a new option: picking the most relevant nodes from the tree, regardless of whether they are on the user's current path. They are producing surprises these days, but at the expense of demanding significantly more typing effort from the user.

In ChatGPT-style interfaces, I’m not sure the barrier of implicit knowledge has been lowered as much as it has been displaced with new ones. (eg, now people have to figure out what prompts work, run it by a council, etc.)




How will we do SFT on models with opaque reasoning?


Published on February 21, 2026 12:00 AM GMT

Current LLMs externalize lots of their reasoning in human interpretable language. This reasoning is sometimes unfaithful, sometimes strange and concerning, and LLMs can do somewhat impressive reasoning without using CoT, but my overall impression is that CoT currently is a reasonably complete and accurate representation of LLM reasoning. However, reasoning in interpretable language might turn out to be uncompetitive—if so, it seems probable that opaque reasoning will be adopted in frontier AI labs.

If future AI models have opaque reasoning, this will probably change what training we can apply to these AIs. For example, currently we train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. It’s not obvious that we’ll be able to do training that affects model reasoning like this if models have opaque reasoning though, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace. Prior work on training-based control has also discussed using SFT as a technique for:

  • Forcing a model to explore (i.e., defeating exploration hacking).
  • Eliciting a sandbagging model’s performance.
  • Forcing a model to perform well on some distribution where we have labels.

However, if we couldn’t generate reasoning (because the model’s reasoning is opaque) to go along with the output labels that we want to train on, then these techniques might not work anymore.

Opaque reasoning will only be adopted if there is still some way of efficiently training powerful models that have it. So the main concern is that opaque reasoning will be differentially bad for training methods targeted at achieving control, compared to those targeted at general capabilities.

This post will speculate on what SFT training-based control will look like for models with opaque reasoning, and then comment on how this should adjust our research priorities in training-based control.

I strongly hope that this post will never become useful because future models will not have opaque reasoning. But reality doesn’t care about this hope of mine much. Hence this post.

Thanks to Sebastian Prasanna for discussion about opaque reasoning.
Thanks to Sebastian Prasanna, James Lucassen, and Eric Gan for feedback on this post.

What might opaque reasoning look like?

To understand how effective SFT will be against hypothetical future models with lots of opaque reasoning (i.e., much more than the current forward pass of a transformer), it’s helpful to have a somewhat concrete picture of what these opaque reasoning models might look like. (Although I think it’d probably be bad to push the frontier of neuralese models by thinking about what they might look like too much (only probably because of overhang concerns)).
Here are some non-exhaustive possibilities:

  1. The model has the same architecture as today, but its CoT drifts into an alien language after large amounts of RL. The strange o3 CoTs might be a precursor to this.
  2. The model is trained as follows: start from a transformer initialization, and then modify it to be actually recurrent (instead of having recurrence via sampling like current transformers). More precisely, you could repeatedly feed the final hidden layer back into the model (with the rest of the “latent tokens”). This makes the entire computation differentiable, so you can train it by backpropagation through the whole model. This is what coconut does, except they incrementally phase in the use of latent reasoning. A minimal code sketch of this recurrence follows the list.
  3. The model could be an RNN that wasn’t initialized from a language prior.
  4. The model could be a text diffusion model.
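As a concrete illustration of possibility 2, here is a minimal PyTorch-style sketch of the coconut-like recurrence. It assumes a model callable that maps input embeddings to final-layer hidden states; real implementations differ in detail (coconut, for instance, phases latent reasoning in gradually).

```python
# Minimal sketch of coconut-style latent recurrence: feed the final hidden
# state back in as the next "latent token" instead of sampling a discrete
# token. `model` is assumed to map input embeddings to final hidden states.
import torch

def latent_reasoning_rollout(model, prompt_embeds: torch.Tensor,
                             num_latent_steps: int) -> torch.Tensor:
    """prompt_embeds: (batch, seq, d_model) embedded prompt.

    Returns the input-embedding sequence after appending `num_latent_steps`
    latent "thoughts". Everything stays differentiable, so the whole rollout
    can be trained by ordinary backpropagation.
    """
    embeds = prompt_embeds
    for _ in range(num_latent_steps):
        hidden = model(inputs_embeds=embeds)   # (batch, seq, d_model)
        last_hidden = hidden[:, -1:, :]        # final position's hidden state
        # Reuse the hidden state directly as the next input embedding:
        # no argmax / sampling, hence no human-readable token is produced.
        embeds = torch.cat([embeds, last_hidden], dim=1)
    return embeds
```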

I think that opaque reasoning is most likely to arise during RL, and that initializing models from a pre-training prior seems likely.

What are our options for doing SFT on models with opaque reasoning, and how well will they work?

In this section, I’ll list several techniques for doing SFT on models with opaque reasoning, and discuss how effective I expect them to be. It’s worth noting that a major bottleneck to all SFT-based methods is that getting labels in hard domains will likely be hard, or even impossible. Here’s the list:

Here is how effective I expect each technique to be against models with opaque reasoning (effectiveness scale: 🔴 = really bad, 🟡 = OK but not great, 🟢 = promising):

  • Just do SFT ~normally. 🟡 If the opaque reasoning is still in token space (but it is using the tokens in an uninterpretable way that has drifted away from human languages), the model probably still understands normal language and SFT on normal language might work. However, it also seems pretty plausible that the model can prevent the wildly off-policy human-language text from transferring to how it thinks when it's speaking in its alien language. Training hard enough to force the model to speak in a human-interpretable language might come at a large capabilities cost. This might even work if the model uses soft tokens: you could just embed actual tokens into the soft-token space, and the model might understand them.

  • Do SFT with an empty reasoning trace. 🔴 For models initialized from transformers, something will happen if you sample from them in the case where you've forced the reasoning to be empty. You can clearly do this for transformers, and for transformers with soft tokens. Probably an analogous thing is possible on basically any RNN. There's not much reason to think that SFT on empty reasoning would have desirable generalization to the case where the model does use reasoning. This is a large distribution shift, and the model's performance on it could just be disconnected from whether or not the model is a schemer. If models can goal-guard, then this training shouldn't fix our problems. If models can't goal-guard, there's not much reason to think that this training would result in a model that is doing the thing we want on our distribution of interest, and this particular method of training doesn't seem special as a method of decorrelating goals. Emil found that generalization from empty reasoning works, but I think this is merely a symptom of existing techniques for creating malign inits being bad at creating models that resist transfer. Overall, I think that training with empty reasoning won't be very effective at removing bad behavior (conditional on that bad behavior being present in the first place).

  • Generate on-policy opaque reasoning, and then generate an SFT label via some process that is not the model (i.e., produces highly off-policy data). If the model is fully differentiable, backpropagate through the neuralese to increase the probability of this generation. If the model is recurrent via sampling, you can instead sample several trajectories from the model and then upweight / downweight trajectories based on how much weight they put on the SFT label. 🟡 The fact that the final text is not causally related to the opaque CoT seems like it should both make this training very sample inefficient, and possibly cause such training to severely degrade the model's performance. I think this might be a good technique for overcoming sandbagging (although I don't know if it'd be better than something simpler like fuzzing or steering). I don't think this will be a great technique for causing the model to perform well on this distribution, or for causing "easy to hard generalization". The fact that the data is off-policy also means that a malign model might be able to adopt a policy of treating completions of off-policy prefixes as very distinct from completions of on-policy prefixes and thereby prevent generalization. However, this might be hard and there are techniques that could potentially mitigate the issue. If the malign model wants to sandbag on some particular distribution, a good way of decreasing sample efficiency of these techniques might be to output garbage text in its reasoning (h/t Alexa). I think that backpropagation will likely be much more sample efficient than the "RL on SFT objective" technique.

  • Generate an "almost on-policy" SFT label by sampling an on-policy completion from the model and then editing its output to be slightly better (h/t Sebastian for suggesting this), then backpropagate through the neuralese on this. 🟢 This seems fairly promising. It might be easier to edit a completion to be better than to generate a good completion. The almost on-policy nature of the data might make this approach more sample efficient and might also handle the issue of the model trying to treat the distribution differently.

  • Train models that encode and decode into reasoning. 🔴 Even if models reason opaquely, we might still be able to build reasoning encoders/decoders (like activation oracles / PCDs / NLAs). If we have good reasoning encoders, then we could potentially write reasoning traces in English and then encode them, and then train the opaque reasoning model to imitate this reasoning. However, our reasoning encoders will probably be extremely lossy, so this will probably both be very off-policy and cause lots of performance degradation.

  • Just don't do SFT. 🟢 SFT is not clearly necessary for training-based control to work out: there are promising alternatives to SFT for many of the things you might want to use SFT to achieve. For instance, we can potentially defeat exploration hacking by creating task distributions where a model must perform well on some tasks to avoid being incriminated, and where these tasks are not cleanly separated from tasks where the model wants to perform poorly. We can study these techniques now, although we'd ideally study these techniques on malign inits where SFT doesn't trivially resolve the problem.
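To make the "backpropagate through the neuralese" technique above more concrete, here is an untested sketch under the fully differentiable (coconut-style) assumption. It reuses the latent_reasoning_rollout sketch from earlier and assumes the model exposes embed_tokens and lm_head; it is an illustration of the idea, not a recipe known to work.

```python
# Sketch: run an on-policy opaque-reasoning rollout, then minimize the
# negative log-likelihood of an externally produced label, letting gradients
# flow back through the latent reasoning steps. `model`, `embed_tokens`,
# `lm_head`, and `latent_reasoning_rollout` are assumptions, not a real API.
import torch
import torch.nn.functional as F

def sft_through_latent_reasoning(model, prompt_embeds, target_token_ids,
                                 num_latent_steps, optimizer):
    # 1. On-policy opaque reasoning (fully differentiable, see earlier sketch).
    embeds = latent_reasoning_rollout(model, prompt_embeds, num_latent_steps)

    # 2. Teacher-force the off-policy SFT label after the latent reasoning.
    target_embeds = model.embed_tokens(target_token_ids)        # (batch, T, d_model)
    full_embeds = torch.cat([embeds, target_embeds], dim=1)
    logits = model.lm_head(model(inputs_embeds=full_embeds))    # (batch, L+T, vocab)

    # 3. Cross-entropy only on the label positions; gradients reach the
    #    latent reasoning because nothing in step 1 was detached.
    label_logits = logits[:, embeds.size(1) - 1:-1, :]
    loss = F.cross_entropy(label_logits.reshape(-1, label_logits.size(-1)),
                           target_token_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```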

What should researchers working on training-based control do now?

I’ll conclude by discussing how the above analysis should change training-based control research.

Some training-based control research should be targeted at non-reasoning models. There are lots of basic phenomena that we don't understand in non-reasoning models. Unless you are mainly interested in studying training with reasoning, I think it's best to just avoid reasoning models. Equivalently, you can use reasoning models, but just not treat the reasoning specially. That is, you allow the blue team to train on the reasoning. This could also be seen as a proxy for studying the case where we have good translation tools, or where SFT kind of works.

However, the current situation is that non-reasoning malign inits are basically trivially defeated by silly blue-team training measures. That is, we have failed to create good non-reasoning malign inits for training-based control. Based on concurrent research that my team is doing, I believe this to be a pretty hard problem. Currently, reasoning models seem like the most promising way for the red team to potentially create a malign init that is resistant to training. Playing a red-blue game on reasoning models (where we treat the reasoning as opaque) thus seems like a high priority research direction for diffuse control (which my team plans to investigate in the near future).

How should we study training-based control on reasoning models? Here are some options:

  1. Only do RL, but use continuous task distributions to break exploration hacking.
  2. Sample an on-policy completion from the model, edit it to be slightly better, and SFT on the model’s reasoning paired with this output.
  3. Sample an on-policy completion from the model, and then SFT on the model’s reasoning paired with a label generated by some other model.
  4. Sample an on-policy completion from the model, and then SFT on the model’s reasoning paired with a label generated by some other model, and also simulate backprop through neuralese (h/t Vivek) with “GRPO on the SFT loss incurred using that reasoning”. (Note: I have no idea if this will work in practice. But someone should try it.)
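For option 4, here is a rough, untested sketch of what "GRPO on the SFT loss incurred using that reasoning" might look like. The sample_reasoning and label_logprob helpers are assumptions (implementing them for an opaque-reasoning model is the hard part in practice).

```python
# Sketch of option 4: treat "how likely the desired label is, given the
# model's own (opaque) reasoning" as a reward, and update the reasoning
# policy with a GRPO-style group-normalized policy gradient. Untested.
import torch

def grpo_on_sft_loss_step(model, prompt, target_label, group_size, optimizer,
                          sample_reasoning, label_logprob):
    """sample_reasoning(model, prompt) -> (reasoning_tokens, logprob_of_reasoning)
    label_logprob(model, prompt, reasoning_tokens, target_label) -> scalar tensor
    """
    logprobs, rewards = [], []
    for _ in range(group_size):
        reasoning, reasoning_logprob = sample_reasoning(model, prompt)
        # Reward: log-probability the model assigns to the SFT label after
        # this particular reasoning trace (i.e. negative SFT loss).
        with torch.no_grad():
            rewards.append(label_logprob(model, prompt, reasoning, target_label))
        logprobs.append(reasoning_logprob)

    rewards = torch.stack(rewards)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Upweight reasoning traces that made the label likely, downweight the rest.
    loss = -(advantages.detach() * torch.stack(logprobs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```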

We hope that future work studying training-based control on models where we treat the reasoning as opaque can shed light on what the situation will be like if future models have much more opaque reasoning than they do currently.


