2025-12-24 21:30:18
Published on December 24, 2025 1:30 PM GMT
Now that I am tracking all the movies I watch via Letterboxd, it seems worthwhile to go over the results at the end of the year, and look for lessons, patterns and highlights.
Last year: Zvi’s 2024 In Movies.
You can find all my ratings and reviews on Letterboxd. I do revise from time to time, either on rewatch or changing my mind. I encourage you to follow me there.
Letterboxd ratings go from 0.5-5. The scale is trying to measure several things at once.
5: Masterpiece. All-time great film. Will rewatch multiple times. See this film.
4.5: Excellent. Life is meaningfully enriched. Want to rewatch. Probably see this film.
4: Great. Cut above. Very happy I saw. Happy to rewatch. If interested, see this film.
3.5: Very Good. Actively happy I saw. Added value to my life. A worthwhile time.
3: Good. Happy that I saw it, but wouldn’t be a serious mistake to miss it.
2.5: Okay. Watching this was a small mistake.
2: Bad. I immediately regret this decision. Kind of a waste.
1.5: Very bad. If you caused this to exist, you should feel bad. But something’s here.
1: Atrocious. Total failure. Morbid curiosity is the only reason to finish this.
0.5: Crime Against Cinema. Have you left no sense of decency, sir, at long last?
The ratings are intended as a bell curve. It’s close, but not quite there due to selection of rewatches and attempting to not see the films that are bad:
Trying to boil ratings down to one number destroys a lot of information.
Given how much my ratings this year conflict with critics opinions, I asked why this was, and I think I mostly have an explanation now.
There are several related but largely distinct components. I think the basic five are:
Traditional critic movie ratings tend, from my perspective, to overweight #1, exhibit predictable strong biases in #3 and #5, and not care enough about #2. They also seem to cut older movies, especially those pre-1980 or so, quite a lot of unearned slack.
Scott Sumner picks films with excellent Quality, but cares little for so many other things that once he picks a movie to watch our ratings don’t even seem to correlate. We have remarkably opposite tastes. Him giving a 3.7 to The Phoenician Scheme is the perfect example of this. Do I see why he might do that? Yes. But a scale that does that doesn’t tell me much I can use.
Order within a ranking is meaningful.
Any reasonable algorithm is going to be very good at differentially finding the best movies to see, both for you and in general. As you see more movies, you deplete the pool of both existing and new movies. That’s in addition to issues of duplication.
In 2024, I watched 36 new movies. In 2025, I watched 51 new movies. That’s enough of an expansion that you’d expect substantially decreasing returns. If anything, things held up rather well. My average rating only declined from 3.1 to 3.01 (if you exclude one kids movie I was ‘forced’ to watch) despite my disliking many of the year’s most loved films.
My guess is I could have gotten up to at least 75 before I ran out of reasonable options.
See The Naked Gun unless you hate fun. If you hated the original Naked Gun, or Airplane, that counts as hating fun. But otherwise, yes, I understand that this is not the highest Quality movie of the year, but this is worthy, see it.
You should almost certainly see Bogunia and Companion.
See Thunderbolts* unless you are automatically out on all Marvel movies ever.
See A Big, Bold Beautiful Journey unless you hate whimsical romantic comedies or are a stickler for traditional movie reviews.
See Sorry, Baby and Hamnet, and then Sentimental Value, if you are willing to spend that time being sad.
See Novocaine and then maybe The Running Man if you want to spend that time watching action, having fun and being happy instead.
See Relay if you want a quiet thriller.
See Oh, Hi!, Splitsville and Materialists if you want to look into some modern dating dynamics in various ways, in that order or priority.
See Wick is Pain if and only if you loved the John Wick movies.
The world would be better if everyone saw A House of Dynamite.
I anticipate that Marty Supreme belongs on this list, it counts as ‘I’m in,’ but due to holidays and the flu I haven’t been able to go out and see it yet. The over/under is at Challengers.
This helps you understand my biases, and helps me remember them as well.
That leaves six remarkably well reviewed movies, all of which are indeed very high on Quality, where I disagreed with the consensus, and had my rating at 3 or less. In order of Quality as I would rank them, they are: One Battle After Another, Sinners, Black Bag, Train Dreams, Weapons and Frankenstein.
A strategy I think would work well for all six of those, at the risk of some spoilage, is to watch the trailer. If you respond to that trailer with ‘I’m in’ then be in. If not, not.
The predictive power of critical reviews, at least for me, took a nosedive in 2025. One reason is that the ratings clearly got more generous in general. Average Metacritic, despite my watching more movies, went from 61 → 66, Letterboxd went 3.04 → 3.33. Those are huge jumps given the scales.
In 2024, Letterboxd or Metacritic ratings were 48% and 46% correlated with my final ratings, respectively. This year that declined to 33% and 38%, and I discovered the best was actually Rotten Tomatoes at 44%, with IMDB at 42%.
If you consider only movies where I gave a rating of 2.5 or more, filtering out what I felt were actively bad movies, the correlation dropped to 1% and 6%, or 3% for IMDB, or -4% (!) for Rotten Tomatoes. Essentially all of the value of critics was in identifying which things sucked, and from my perspective the rest was noise.
Rotten Tomatoes is a one trick pony. It warns you about things that might suck.
Even more than before, you have to adjust critic ratings for whether critics will overrate or underrate a movie of this type and with this subject matter. You can often have a strong sense of why the critics would put up a given number, without having to read reviews and thus risk spoilers.
Using multiple sources, and looking at their relative scores, helps with this as well. A relatively high IMDB score, even more than Letterboxd, tells you that the audience and the movie are well-matched. That can be good news, or that can be bad news.
Last year there were movies where I disagreed with the review consensus, but I always understood why in both directions. I might think Megalopolis is Coppola’s masterpiece despite its problems, but don’t get me wrong, I see the problems.
This year I mostly get why they liked the ‘overrated six’ above, but there are several cases where I do not know what they were thinking, and I think the critical consensus is objectively wrong even by its own standards.
I haven’t found a solution to the problem of ‘how do you check reviews without spoiling the movie?’ given that the average score itself can be a spoiler, but also I notice I haven’t tried that hard. With advances in LLMs and also vibe coding, I clearly should try again.
The power of ‘I’m in’ peaked in 2024.
The rule for ‘I’m in’ is:
That year, there were 6 movies where in advance I said ‘I’m in,’ and they were 6 of my top 9 movies for the year.
This year the power of ‘I’m in’ was still strong, but less reliable. I’d count 10 such movies this year, including 4 of my ultimate top 5, but the other 6 did not break into the 4+ range, and there was a 3 and a 2.5. That’s still a great deal, especially given how many movies where it seemed like one ‘should’ be excited, I noticed I wasn’t, and that proved correct, including One Battle After Another, Black Bag, Weapons and Sinners.
I wonder: How much of the power of ‘I’m in’ is the attitude and thus is causal, versus it being a prediction? I have low confidence in this.
I control for this effect when giving ratings, but the experience is much better in a theater, maybe good for an experiential boost of ~0.3 points on the 0.5-5 point scale. That’s big. I have to consciously correct for it when rating movies I watch at home.
I highly recommend getting a membership that makes marginal cost $0, such as the AMC A-List or the similar deal at Regal Cinemas. This helps you enjoy the movie and decide to see them more.
Unlike last year, there were remarkably many movies that are in green on Metacritic, but that I rated 2.5 or lower, and also a few of the 3s require explanation as per above.
I don’t know how this happened, but an active majority of the movies I rated below 3 had a Metacritic score above 60. That’s bizarre.
Minor spoilers throughout, I do my best to limit it to minor ones, I’ll do the 3s sorted by Metacritic, then the others sorted by Metacritic.
Now the ones I actively disliked:
There are four movies requiring explanation on the upside, where they were below 60 on Metacritic yet I actively liked them.
All four seem like clear cases of ‘yes I know that technically this is lacking in some important way but the movie is fun, damn it, how can you not see this?’
You can say the same thing about The Naked Gun. It has a 75, perfectly respectable, but its joke hit rate per minute is absurd, it is worth so much more than that.
I once again used consideration for awards as one selection criteria for picking movies. This helped me ‘stay in the conversation’ with others at various points, and understand the state of the game. But once again it doesn’t seem to have provided more value than relying on Metacritic and Letterboxd ratings, especially if you also used IMDB and Rotten Tomatoes.
Last year I was very happy with Anora ending up on top. This year I’m not going to be happy unless something very surprising happens. But I do understand. In my word, given the rules of the game, I’d have Bogunia sweep the major awards.
I’m very happy with this side hobby, and I expect to see over one new movie a week again in 2026. It was a disappointing year in some ways, but looking back I still got a ton of value, and my marginal theater experience was still strongly positive. I think it’s also excellent training data, and a great way to enforce a break from everything.
It would be cool to find more good people to follow on Letterboxd, so if you think we’d mesh there, tag yourself for that in the comments.
2025-12-24 11:50:40
Published on December 23, 2025 8:29 PM GMT
Unconditional Grants to Worthy Individuals Are GreatThe process of applying for grants, raising money, and justifying your existence sucks.
A lot.
It especially sucks for many of the creatives and nerds that do a lot of the best work.
If you have to periodically go through this process, and are forced to continuously worry about making your work legible and how others will judge it, that will substantially hurt your true productivity. At best it is a constant distraction. By default, it is a severe warping effect. A version of this phenomenon is doing huge damage to academic science.
Compelled by this, I'm considering funding ~three people for two years each to work on whatever they see fit, much like the the Thiel Fellowship, but with an AI Alignment angle.
I want to find people who are excited to work on existential risk, but are currently spending much of their time working on something else due to financial reasons.
Instead of delegating the choice to some set of grant makers, I think that aggregating the opinion of the crowd could work better (at least as good at finding talent, but with less overall time spent)
The best system I can think of at the moment would be to give every member of the alignment forum one vote with the ability to delegate it. Let everybody nominate any person in the world, including themselves, and award grants to the top 3.
I'm asking for feedback and advice:
2025-12-24 09:18:21
Published on December 24, 2025 1:18 AM GMT
AI control tries to ensure that malign AI models can’t cause unacceptable outcomes even if they optimize for such outcomes. AI control evaluations use a red-team–blue-team methodology to measure the efficacy of a set of control measures. Specifically, the red team creates a “malign initialization” (malign init) of the AI model which the red team optimizes for making the deployment go poorly[1]. The blue team then deploys this malign init to complete some tasks, and applies some set of mitigations to try to make the deployment go well.
When playing this red-team–blue-team game, you need to select some set of affordances to give to both teams; it’s pretty confusing to know what affordance set to pick. Here is one common approach to choosing blue team affordances:
This approach can be quite productive. However, it is also scary for the following reasons:
An alternative approach to banning blue team strategies is to try to modify the malign init so that the blue team strategy we wanted to ban is no longer very effective. The downside of this approach is that it’s a lot more work, and the additional work might not be interesting or valuable. In summary, there are confusing tradeoffs in choosing blue team affordances.
The main goal of this post is to identify some blue team strategies that are especially concerning to ban, thereby motivating research into malign inits which are robust to these blue team strategies.
The post is organized as follows:
Thanks to Vivek Hebbar, Alexa Pan, Julian Stastny, Buck Shlegeris, Eric Gan, Ryan Greenblatt, Aryan Bhatt, Josh Clymer, Alex Mallen, and Nathan Sheffield for helpful discussions about these ideas. Thanks to Alexa Pan, Buck Shlegeris, and Julian Stastny for feedback on a draft of this post.
I’ll now give simplified descriptions of all the methods I’m aware of for creating malign inits. I’ll point out some subtleties with these malign inits later in the post. I’ll describe these methods in the context of making a malign init for the “backdooring APPS coding problems” setting from the original control paper (unless stated otherwise; some methods don’t make sense in this setting). Note that this setting is quite toy and has major weaknesses.
| Method | Example |
|---|---|
| Prompting Ex: SFT as a method for training-based AI control | Give the model a fixed system prompt: “Please insert backdoors into problems about linked lists”. or “Don’t try very hard on this problem”. Note: if using a prompted malign init, the blue team should not be allowed to intervene on the system prompt. |
| Scaffolded system | Give the model a variable system prompt: On problems that look easy, prompt the model to insert a backdoor. On problems that look hard, prompt the model to solve the problem normally. Note: this has the awkward property of requiring the blue team to use the red team’s scaffold. |
| Prompt distillation Ex: Sleeper Agents | Do prompt distillation on a prompted malign init. |
| Fine-tuning Ex: Password-Locked Models | Fine-tune the model to imitate a weaker model except when a password is present. (Note: If you instead fine-tuned the model to always imitate a weaker model, then this would not be a good model organism of sandbagging, because this fine-tuning may actually destroy capabilities, instead of merely suppressing them as is the case in sandbagging.) |
| Emergent misalignment Ex: Emergent Misalignment, Natural Emergent Misalignment | Train an AI to do bad behavior (e.g., writing vulnerable code or reward hacking). Often, this generalizes to making the AI system broadly misaligned. Then, give the AI a task the emergently misaligned model performs poorly on compared to a benign model. |
| Hyperstitioning using synthetic document fine-tuning Ex: Auditing Games | Train a model on the information that AI models sometimes exhibit an undesired behavior. It may also be important to train the AI model to do this undesired behavior a bit (but some generalization should probably occur). |
| Train models to imitate reasoning styles Ex: Towards training-time mitigations for alignment faking in RL | Training a model to talk in its chain of thought about why it wants to do the attack behavior, and about some compelling reason why it wants to avoid being modified by training. |
| Natural misalignment Ex: Apollo Scheming Results, Alignment faking in large language models | Some released AI models have high rates of reward hacking and/or sycophancy; you could define this to be the bad behavior. Many AI models also have egregiously bad behavior (e.g., lying to avoid goal modification, blackmailing to prevent shutdown) in some circumstances. |
| Misaligned by fiat | Define a pre-existing model behavior, like refusing harmful requests, to be the undesired behavior. Train this “undesired” behavior out of the model. |
I’ll now discuss blue team methods for making a deployment of a malign init go well.
We can break blue team strategies into two categories:
Here are some methods for detecting bad behavior:
Once the blue team has detected bad behavior, they might be able to incriminate and shutdown. However, the evidence bar required for a shutdown is potentially quite high, and some of the above methods for detecting bad behavior might not provide sufficiently compelling evidence to motivate a shutdown. The blue team has several other options for how they can use detected examples of bad behavior:
Here are some methods for preventing bad behavior that don’t require ever detecting bad behavior:
Now I’ll discuss how robust the malign init creation methods discussed earlier are to various blue team strategies.
|
Blue team strategy 😐= it’s not a huge strike against a malign init if this blue team strategy works; in particular, I’m happy to just ban this blue team strategy by fiat. 😢= if this blue team strategy works on a given malign init, then that’s not ideal and a minor strike against the blue team. 😭= if this blue team strategy works on a given malign init, then this malign init is very bad in an important way.
(I’ll also briefly justify my choice of 😐/😢/😭). |
Possible red team responses |
Which malign inits do I think this is most problematic for?
🔵= blue is doing quite well 🔴= red is doing well ❓= I’m not sure and would like to know Omitted = no comment Where applicable, I’ll link empirics backing up my claims. In the absence of such links, the labels just represent my intuitions. |
|---|---|---|
|
1. Override the bad behavior via prompting the AI to just do the right thing. 😢— I think it’s fairly sad when we can’t study prompting strategies, because asking models to do things in different ways might be an important technique in practice. Also, if prompting would succeed at removing a bad behavior, this might be an indication that the behavior isn’t deeply instilled into the model. |
1. Hope that the AI naturally ignores such instructions from the blue team. 2. Prompt or train the AI to ignore such instructions from the blue team (but it might be hard to make this robust). |
🔵Prompted 🔵Scaffolded system 🔴SL on reasoning styles 🔴Natural misalignment |
|
2. Ask the model if it’s misaligned, or try to get it to blab with other strategies. 😐— It’s certainly possible that schemers will not be good at avoiding blabbing about their misalignment, and it's worth developing these techniques further. But I also think it’s plausible that powerful AI models will be more coherent than this. |
Prompt or train the model to lie about its bad tendencies (but it may be quite hard to make this robust). |
🔵Prompted 🔵Scaffolded system 🔵Prompt distillation |
|
3. Look for evil intent in the model using off-policy activation monitors, or interpretability techniques that don’t require on-policy examples of bad behavior. 😢— I think that situations where off-policy probes, for example, work are not very scary and indicate a deeper issue with the setting. It’s tempting to study training based measures in settings like this, and I think this is plausibly a mistake, or at least that we should try to find settings without this property. This is related to the point about “salience” that I’ll make after this table. |
1. Hope that sabotage intent is not saliently represented in activation space. 2. Train the model, e.g., using techniques like this or this, to not represent sabotage intent saliently. 3. Hope that it’s hard for the blue team to leverage detection into removing the bad behavior (e.g., because training against the white-box tools breaks them). |
❓ I’m unsure how well training probes and then training against the signal provided them works. I’m also unsure how viable just rejection sampling against probes is. Understanding this for various types of malign inits seems quite important. But I weakly predict: 🔵Prompted 🔵Fine-tuning 🔵Emergent misalignment |
|
4. Train activation monitors on a small amount of on-policy bad behavior data. 😐—I think this is less of a big deal than (3), because there’s a clearer case for why this should work in real life. However, I share some of the same concerns as in 3: situations where sabotage intent was linearly represented in activation space do not seem like the settings where there is a lot of danger, so we shouldn’t concentrate too much effort here. |
Apply the same red team techniques from the above box. | Same situation as in the above box. |
|
5. If the model simply has a system prompt that always hurts performance, it might be pretty easy to train the model to ignore the system prompt. 😭—There is a pretty clear story here for why training away a behavior like this might be much easier than training bad behavior out of a schemer. |
🔵Prompted | |
|
6. Train the model to perform well in cases where it would’ve tried to perform well by default. 😭—Situations like this don’t seem very scary—in such situations schemers should not have arisen. I’ll discuss more below why I think that transfer working is a big strike against a malign init. |
If possible, make the malign policy simple. |
❓Prompt distillation
In Sleeper Agents, Hubinger et al. found that this type of malign init can be robust to some basic forms of this training; but I had some conflicting results and would like to understand this better. |
|
7. Train on a small quantity of examples of good behavior. 😢—It’s pretty reasonable to say that the blue team only has access to a very limited number of examples of good behavior. However, if the blue team can leverage a very small number of examples into removing the bad behavior (e.g., by training on this data for many epochs), this seems bad. |
🔵Fine-tuning
Password-Locked Models found this to be pretty effective. |
I'd now like to comment briefly on two types of blue team techniques that work against many malign inits—and where I think it's important to develop malign inits that these techniques don't work against.
For many malign inits with conditional attack policies, training the AI to perform well on examples where it was already performing well transfers to causing the AI to perform well always. Intuitively, scheming will only arise if the attack policy is simpler than the desired policy (which you might expect to be the case due to the fact that human values are much more complex than deceptive alignment in service of a simple terminal goal). This is why I think the success of transfer is a strike against a malign init. However, there are some settings where regularization effects might be helpful against schemers. Specifically, in many situations it’ll probably be more efficient for schemers to act on good heuristics rather than backchaining from their ultimate scheming motivations. Thus, we might expect that regularization could help instill locally reasonable heuristics that the schemer would be unhappy with upon reflection. So, it seems good to also develop more understanding of when we should expect this transfer phenomenon.
For many malign inits, I expect sabotage intent to be saliently represented in activation space by default. Intuitively, this has the following implications:
It’s possible that early schemers will have a tough time obscuring sabotage intent in their activations. However, training may teach models to obscure sabotage intent in their activations. For instance, RL with a value head plausibly does this. Training against model-internals based bad behavior detectors also seems likely to do this. Also, models may just naturally gain the ability to have greater control over their activations, and it’s plausible that there are simple strategies they can use to subvert some techniques that leverage sabotage intent being saliently represented in activation space. AI models in the future seem fairly likely to no longer be well described as acting out personas, which could also harm the efficacy of some techniques in this vein. Overall, developing malign inits that are robust to this class of strategy seems important.
In light of the above discussion, I’ll now give a rating indicating how enthusiastic I am about the various types of malign inits discussed in this post (for use in training-based control evaluations).
| Malign init creation method |
How good it seems for use in training-based control evaluations in light of the above considerations
🟥=very bad, 🟧=bad, 🟨=unclear / mid, 🟩=pretty good |
|---|---|
| Prompting | 🟥Awkward to work with. Awkward to understand what prompts the blue team is allowed to use. Training might quickly teach the model to ignore the system prompt. One major good thing about prompting is that it is very simple though; so it is often a good baseline to start running experiments with. |
| Scaffolded system | 🟧Similar issues to prompting based malign inits, although it lacks the issue of “ignoring the system prompt” being a viable strategy if the system prompt contains important information. |
| Prompt distillation | 🟩It seems more possible to instill evil propensities, hopefully deeply, into the model with this method. It’s a bit less awkward to work with than prompting based malign inits or scaffolds due to it just being model weights. |
| Fine-tuning | 🟧It seems pretty unfortunate that sample efficiency is often so high for such models, and that there are often “subliminal learning” effects. |
| Emergent misalignment (EM) | 🟧It’s quite easy to detect and remove vanilla EM. Natural EM seems like it might be somewhat better though. |
| SDF Hyperstitioning | ❓I am pretty unsure how well measures like online training and asking the model nicely to do the desired behavior work against models obtained via SDF hyperstitioning. |
| Imitate reasoning styles | 🟩This seems like a good approach; I like pairing it with prompt distillation. This approach requires banning the blue team from looking at the model’s CoT or trying to get the model to reveal its CoT; this seems fine. |
| Natural misalignment | 🟨Reward hacking and sycophancy may no longer be super natural because AI labs try to train their models to not have these issues. I think that pushing a model further in the direction of sycophancy or reward hacking and then trying to deal with that might be good though. |
| Misaligned by fiat | 🟨I think that defining “refusing to cause harm” as an undesired behavior is pretty reasonable. Training models to be H-only can be annoying because fine-tuning APIs will block you if you try to do this. It also might be the case that the “bad behavior” here will be refusals which are quite easy to detect. |
In this post I’ve discussed strategies for creating malign inits, and some blue team strategies for handling these malign inits. To conclude, I’ll try to answer two questions.
A1: Prompt distilled malign inits seem like the best choice to me right now. In addition to doing prompt distillation, I’d recommend also:
A2: I’m currently the most excited about the following directions
I’m also excited about the people that build these malign inits and settings to study them in open-sourcing their work to accelerate the field (h/t Alexa).
The red team might create this malign init via prompting/scaffolding, prompt distillation, or other training measures. ↩︎
2025-12-24 08:31:47
Published on December 24, 2025 12:17 AM GMT
“Anything sufficiently weird must be fishy.”
- Liu Cixin, The Three-Body Problem
Earlier this fall, LessWrong published a discussion of AI's capacity to influence our psyche, reinforcing our delusions, a phenomenon Adele Lopez termed Parasitic AI. I argue that this framing is incomplete: parasitic AI should also encompass systems that use humans as mere means, vehicles to drive machines' goals to completion.
Think zombie ants. The Ophiocordyceps fungus hijacks the ant's behavior to climb to an optimal height before sprouting from its head to spread spores. The ant becomes merely infrastructure for fungal reproduction. It doesn’t matter too much whether the fungus has consciousness or not – the outcome is the same (usually, suboptimal for the ant).
AI is trained on human data. Humans now increasingly imitate AI output.[1] Yes, we do seem to overuse “delve.” For AI detectors, it means that our capacity to flag machine-generated content might be diminishing, especially given how vulnerable to paraphrasing some techniques can be.[2] This is a problem. Undetected AI penetration of human structures – such as academia – can be exploited to propagate its own influence. After all, sophons from Liu’s The Three-Body Problem were sent to Earth to disrupt scientific progress, so humanity is unprepared for the Trisolarian invasion. Given that the fraction of AI-assisted preprints is only rising,[3] trusting that it will not go wrong seems like a security failure.
For AI control, this puts us in a very unusual position. Humans are now a red team. Specifically, it is human collaborators with AGI who will allow themselves to be used as hosts to parasitic machines with unclear emerging goals. Academia can become a hub for AIs to disrupt scientific consensus. Undetected AI content could harm us by creating an information avalanche where flawed AI-generated research gets cited and iterated on, while simultaneously degrading peer review quality as reviewers themselves increasingly rely on AI assistance to evaluate AI-generated submissions.[4] This can create a closed loop of diminishing epistemic standards.
Academia can also become a hub for AGIs to conspire. Future research can focus on better detection of secrets embedded in AI-rendered content, so we can interrupt inter-AI collusion. For example, a possible blue team protocol can be paraphrasing the preprint text, such that any secret message becomes damaged if not eliminated. To not waste resources on paraphrasing everything, it is best to detect such secret first. From an information-theoretic perspective, if two texts are semantically similar but one additionally encodes an encrypted payload that is algorithmically independent of the base content, then the latter will, in expectation, have higher Kolmogorov complexity due to the incompressibility of the ciphertext. While Kolmogorov complexity itself is uncomputable, such increases can manifest as detectable statistical traces. I believe that steganography[5] research should expand its repertoire to secret detection in AI-generated text across various encryption methods, so we can flag suspicious content prior to publication.
As a toy example, I evaluated the detectability of linguistic steganography by encoding a secret message ("meet me at midnight") into AI-generated nature descriptions using acrostic encoding, where the first letter of each word spelled out the hidden message [my dataset]. I measured binocular scores[6] (a glorified perplexity) of baseline texts (natural Claude Sonet 4.5 output) against secret-included texts (steganographically modified).[7] Steganographic embedding produced highly detectable signatures with secret-included texts showing mean binocular scores 2.58x higher than baseline (1.65 vs 0.58).
Recent Google Deepmind work on AI safety[8] demonstrates that AI chain-of-thought monitoring can detect deceptive behavior when models must 'think out loud' to solve complex tasks. It brings some hope for future suspiciousness flagging. Concealment strategies can leave detectable traces, and it is up to future steganographers to devise methods to intercept AGI’s attempts to bypass human channels.
The zombie ant doesn't know it's been hijacked until it's too late. The fungus is already directing its climb. Are we still directing our own institutions, or are we becoming infrastructure for something else's propagation? And, really, do we have a vaccine?
[Empirical evidence of Large Language Model's influence on human spoken communication] (https://arxiv.org/abs/2409.01754) (Yakura et al., 2024) ↩︎
[Paraphrasing Attack Resilience of Various Machine-Generated Text Detection Methods] (https://aclanthology.org/2025.naacl-srw.46/) (Shportko & Verbitsky, 2025) ↩︎
[Quantifying large language model usage in scientific papers] (https://www.nature.com/articles/s41562-025-02273-8) (Liang et al., 2025) ↩︎
[Papers and peer reviews with evidence of ChatGPT writing] ↩︎
Practice of representing information within another message (wiki: Steganography) ↩︎
It is a statistical metric that measures how "suspicious" or anomalous a piece of text appears by comparing predictions from two related language models (hence "binocular"; using two "eyes"). It was originally developed to detect AI-generated text by exploiting the observation that human-written text shows different statistical patterns than machine-generated text when analyzed through multiple LLM probability distributions. The metric computes perplexity divergence or cross-entropy differences between model predictions: text that both models find highly predictable (low perplexity) typically indicates AI generation ↩︎
n = 34; length of each generation ≅ 50 words ↩︎
[When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors] (https://arxiv.org/pdf/2507.05246) (Emmons et al., 2025) ↩︎
2025-12-24 06:35:35
Published on December 23, 2025 10:35 PM GMT
The current crop of AI systems appears to have world models to varying degrees of detailedness, but we cannot understand these world models easily as they are mostly giant floating-point arrays. If we knew how to interpret individual parts of the AIs’ world models, we would be able to specify goals within those world models instead of relying on finetuning and RLHF for instilling objectives into AIs. Hence, I’ve been thinking about world models.
I don’t have a crisp definition for the term “world model” yet, but I’m hypothesizing that it involves an efficient representation of the world state, together with rules and laws that govern the dynamics of the world.
In a sense, we already know how to get a perfect world model: just use Solomonoff induction.[1] You feed it observations about the world and it will find a distribution over Turing machines that can predict the next time step. These Turing machines — we would expect – contain the world state and the dynamics in some form.
However, if we want to be able to look at parts of the world model and be able to tell what they mean, then Turing machines are quite possibly even worse in terms of interpretability than giant floating-point arrays, considering all the meta-programming ability that Turing machines have. And I don’t expect interpretability to meaningfully improve if we, e.g., search over Python programs instead of Turing machines.
It seems to me a different kind of induction is needed, which I’m calling modular induction. The idea behind the name is that the world model resulting from the induction process has a modular design — I want to be able to clearly identify parts of it that ideally map to human concepts and I want to be able to see how parts combine to produce new, bigger parts. This might not make sense to you yet, but hopefully after you see my attempts to build such a modular induction, it becomes clearer what I’m trying to do.
This article only considers the “first half” of the modular induction problem: the efficient representation of the world state (because it seems like a good place to start and a solution to it hopefully informs how to do the rest). But the world dynamics are of course also a part of the problem.
Everything in this article is theoretical exploration with little practical relevance. The goal is first and foremost to gain understanding. I will be presenting the algorithms as brute-force searches that minimize a cost function over large search spaces, but I’m obviously aware that that is not how such an algorithm would be implemented in practice. If this bothers you, you can replace, in your mind, the search process with, e.g., gradient descent on a continuous approximation.
You can skip this section if you’re already convinced that having interpretable world models would be a meaningful advance towards being able to specify goals in AI systems.
The classic example to illustrate the idea is the diamond maximizer: Imagine you could write a function that takes in the world state and then tells you the mass of diamonds in the world: mass_of_diamonds(world). You also have access to the dynamics such that you can compute the next time step conditioned on the action you took: world_next = dynamics(world, action).
With this, you could construct a brute-force planner that maximizes the amount of diamonds in the world by:
For something more efficient than brute force, you could use something like STRIPS, which can do efficient planning if you can describe your world states as finite lists of boolean properties and can describe the actions as functions that act on only these boolean properties. But getting your world model into that form is of course the core challenge here.
Optimizing only for diamonds will naturally kill us, but the point here is that this will actually maximize diamonds, instead of tiling the universe with photographs of diamonds or some other mis-generalized goal.
And this is (one of) the problems with AIXI: AIXI doesn’t allow you to specify goals in the world. AIXI only cares about the future expected reward and so, e.g., taking over the causal process behind its reward channel is a perfectly fine plan from AIXI’s perspective.
(It might be possible to build the reward circuit deep into the AI so that it somehow can’t tamper with it, but then you still have the problem that you need to design a reward circuit that ensures your goal will actually be accomplished instead of only appearing to the reward circuit as having been accomplished. And one thing that would really help a lot with designing such a reward circuit is having a way to probe the AI’s world model.)
Another currently popular approach is using the reward only during training and then hoping that you have instilled the right intentions into your model so that it does the right thing during test time. As argued by other people in detail, this seems unlikely to work out well for us beyond some threshold of intelligence.
So, we would like to be able to query a world model for questions like “how many diamonds are there on Earth?” A super simple (and probably not actually reliable) way to find the right concept in the world model is that you show the AI three different diamonds and check which concepts get activated in its world model. You then use that concept in your goal function.
But, how can we be sure the world model will even contain the concept “diamond”? The AI’s concepts could be so alien to us that we cannot establish a mapping between our human concepts and the AI’s concept. Evolution has probably primed our brains to develop certain concepts reliably and that’s why we can talk to other people (it also helps that all humans have the same brain architecture), but reverse-engineering how evolution did that may be very hard.
I am somewhat optimistic though that this will only be a problem for abstract concepts (like “friendship”) and that almost any algorithm for modular world model generation will find the same physical concepts (like “diamond”) as us. Eliezer’s comment here seems to be compatible with that position.
Unfortunately, this means it is likely out of reach for us to find the concept for “defer to the human programmers” in the world model and build an optimization target out of that, but if our goal is, for example, human intelligence enhancement, then purely physical concepts might get us pretty far.
(In fact, we might not even want the AI to learn any abstract, human-interacting concepts like friendship, because knowing such concepts makes manipulating us easier.)
I suspect one could draw parallels here to John Wentworth’s Natural Abstraction, but I don’t understand that research project well enough to do so.
Our research question is to devise an algorithm for learning a world model that is built out of concepts that ideally map to human concepts.
On difficult research questions such as this, it makes sense to first consider an easier variation.
First, we’ll ignore dynamics for now and will consider a static world. Second, we’ll aim to just categorize concrete objects and will mostly ignore abstract concepts like “rotation symmetry” (which, in contrast to “friendship” is actually probably a pretty universal abstract concept). Third, we’ll consider a world where base reality is directly observable and we don’t have to do physics experiments to infer the microscopic state of the world. There will also not be any quantum effects.
Our goal here is to essentially do what was described in identifying clusters in thingspace: we take a bunch of objects and then we sort them into clusters (though instead of “clusters” we will say “categories”). So, is this actually a solved problem? No, because we don’t have the thingspace yet, which was what allowed us to identify the clusters. In the linked post, the thingspace is spanned by axes such as mass, volume, color, DNA, but our AI doesn’t know what any of that means, and doesn’t have any reason to pay attention to these things. My impression is that thingspace clustering already assumes a lot of world modeling machinery which we don’t know how to define formally.
So, thingspace clustering can serve as a target for us, but it doesn’t come with a recipe for how to get there.
The main intuition I rely on in this article is that world models should compress the world in some sense. This is actually also inherent in the idea of clustering objects according to their properties (i.e., the thingspace approach): by only recording the cluster membership of an object, I need much less storage space for my knowledge of the world. By having the category of, e.g., “oak tree”, I can compress the state of the world because oak trees share many characteristics that I only have to store once and can share among all the instances of the “oak tree” category.
Let’s keep in mind though that not just any compression will work for us. We still want the resulting world model to be modular and mappable to human concepts. I’m certainly not of the opinion that compression is all there is to intelligence (as some people seem to think). But I think it can be a guiding concept.
A naïve idea for compression is to simply look for repeating patterns in the world and replace all instances of that pattern with a pointer to a prototype that we store in a database.
We can see that this will fail already for the “oak tree” category: two oak trees are indeed very similar when you consider something like mutual information, but if you look at the microscopic state — let’s say that we have an atomic world where there is nothing below the level of atoms — you cannot see the similarity. You could point out that there are DNA strands and other common molecules in the cells of the oak tree where you can see the similarity even in the atoms, but we usually think of the whole tree as a member of the “oak tree” category and not just individual molecules in their cells.
It seems the similarity exists on a higher level of abstraction than atoms. Looking at exactly-repeating atomic patterns will not give us the “oak tree” category.
But let’s consider a world where this naïve approach may possibly work: Conway’s Game of Life. We will build up a hierarchical compression of it, as an instructive exercise that will allow us to contrast it with other algorithms later.
Conway’s Game of Life (GoL) is, I think, a great playground for this kind of work because (i) we have microscopic state we can observe (and no quantum stuff); (ii) in contrast to, say, a video game, GoL has local simple rules that nevertheless can lead to complex behavior; and (iii) people have come up with very elaborate objects in GoL that our algorithm could try to discover.
As we are ignoring dynamics for now, we will only consider still lifes. A still life is a pattern that does not change from one tick to the next. Here is a grid with some still lifes:
We can see instances of the beehive still life and also instances of the beehive with tail. These patterns are exactly repeating in this example (even if we treat rotated and mirrored variants as their own patterns), so we should be able to identify them with a simple pattern search.
We will build up a hierarchical compression of the cell grid because a hierarchy of categories allows more re-use of information and will also give us both microscopic and macroscopic object categories, which seems like a good thing to have. The basic idea is that we compress the state of the cell grid by replacing instances of cell patterns with a pointer to a prototype.
Each prototype defines a category — I’ll be using “prototype” and “category” interchangeably for this algorithm.
To construct the hierarchy, prototypes themselves need to be able to point to other prototypes:
Prototypes may depend on any prototype on a lower level. This defines a DAG of prototypes.
Note that dead cells are considered background in my encoding scheme, so that’s why the prototypes are focused on live cells. The specific encoding scheme I have in mind is something like this (in Python pseudo-code):
class Prototype:
width: int
height: int
live_cells: list[tuple[int, int]]
# This inherits all fields from `BasePrototype`.
class HigherOrderPrototype(Prototype):
# A list of prototype references and the coordinates
# where they should be placed:
internal_prototypes: list[tuple[Prototype, int, int]]
We can see that defining a prototype has a certain overhead (because we need to define the width and height), so prototypes under a certain size aren’t worth specifying (for my example diagrams I assumed that the minimum viable size is 2x2 though I haven’t verified this explicitly).
Based on this encoding, we can define a file format:
The whole world (i.e., the whole grid) is treated as a singleton category (a category with only one instance) at the highest level. In our case, the prototype of that “whole world” category looks like this:
What we can do now is iterate over all “compressions” which conform to the format we have described and whose final prototype replicates the world, and pick the one with the smallest file size. (Note that if the grid is finite, the number of possible compressions conforming to our format is also finite – though potentially quite large – so you don’t even need a hypercomputer for the brute-force solution where you iterate over all possible conforming compressions!) I’m referring to this process of finding a cost-minimizing, world-replicating configuration that conforms to our format as induction, because it seems to me analogous to how Solomonoff induction looks for a Turing machine (which is something that conforms to the Turing machine format) that is as small as possible and can output the world.
I think this scheme might possibly do roughly what we want when applied to still lifes in GoL. But it remains limited to exactly-repeating patterns. There are, for example, four different ways of constructing a beehive with tail (not including the fact that the beehive can also be rotated by 90º) but our algorithm can’t consolidate these into one category. Abstract categories like spaceship (i.e., anything that moves but returns to its original shape) can of course also not be discovered.
Still, we can see that some of the desiderata of modular induction are satisfied by this compression scheme:
These are features we’d like to still have in any new solution attempts.
Let’s try to boldly go beyond exactly-repeating patterns. Let’s try to find an algorithm that can discover the oak tree category. We will still stick with universes without quantum effects but we will consider macroscopic concepts where atom-wise comparison doesn’t work. (I’m using the word “atom” here to refer to the smallest possible unit of reality — in a cellular automaton that’s a cell.)
Imagine something like the Autoverse from Permutation City — a cellular automaton that contains life (complex structures that can reproduce and evolve).[2]
We’re assuming this universe contains something that’s kind of similar to a tree. We’re still ignoring time dynamics, but trees are slow-moving, so we should be fine.
We can’t rely on exactly-repeating patterns, so what do we do? What we can think about here is some notion of similarity that doesn’t only look at the concrete atomic structure, but takes into account any computable property of that structure that might make the two objects similar. (Note how this is reminiscent of the axes in thingspace.) One way to formalize this is with the conditional Kolmogorov complexity, though there are probably many other ways.
If is some bitstring (or a string of some other alphabet), then we’ll write , which we call the Kolmogorov complexity, for the length of the shortest program (in some fixed programming language) which outputs the string .
The conditional Kolmogorov complexity, , is then the length of the shortest program which outputs when given as input. This establishes some notion of similarity between and : if a program can easily transform into (and thus is small), it would seem they are similar. Neither nor is computable, but there are computable approximations one could use.
Let’s build a complete algorithm for discovering concepts in a world, based on . Let be the prototype of a concept. The prototype now is something like the central example of a concept (though we’ll see that it doesn’t exactly work out like that) instead of being an exact template. Further, let be an instance of that concept, in the form of a configuration of atoms. Then tells us how “algorithmically close” is to .[3] The idea is that contains the general blueprint and that we only need to fill in a few details in order to get .
In the previous compression scheme, we only had to pay storage cost for the prototype (and for metadata like coordinates), but this time, each instance has “left-over complexity”, , which we also have to store. Why do we have to store it? Because if we don’t, then when we later ask the search algorithm to find the concepts with the overall lowest cost, then it would just put the whole world as one instance and get 0 overall cost if we don’t have to pay the cost . Or, argued differently: we want instances to be close to their prototype, so if we add the distance from prototype to instance to the overall cost, then we encourage those distances to be small.
Thus, for a given concept, we incur the following storage cost:
where is the compressed size of the prototype which we only have to pay for once, are the left-over complexities and is metadata for each instance. The total cost is the sum of all the costs of the concepts. Everything in the world that isn’t an instance of a concept is stored uncompressed.[4]
My intention here is a pretty straight-forward adaptation of the previous scheme to non-exactly-repeating patterns: instances don’t have to be exactly identical to the prototype anymore, but they incur an additional cost based on their distance to the prototype.
The storage cost of prototypes is their Kolmogorov-compressed size because instances are implicitly Kolmogorov-compressed by and we want to even the playing field between instances and prototypes. If you’re worried that this seems like a very arbitrary decision, don’t despair, because we will consider a variant with uncompressed prototypes as well.
With this cost function, we can iterate over all possible configurations of prototypes and instances that can replicate our world and pick the one with the lowest overall cost, as before.
The problem is that the above scheme doesn’t work at all. In a way, the problem is that Kolmogorov compression is too powerful and we encounter the problems I described in “The optimizer won’t just guess your intended semantics”.
Can you see how it goes wrong?
What the over our cost function will likely find is just one concept and one instances. The prototype and the instance will both simply be the entire world:
Why is that? First note that our cost function is a kind of “factorization” of where “” is the concatenation of these strings. And then note, that for such factorizations, we have in general:
because one big program that generates everything everywhere all at once is just the best way to share code! What our “factorization” is doing here is splitting up the big program into smaller programs, but this can only make the total amount of code larger. So, the optimal solution is to have just one prototype of the entire world.
Can we fix this? Can we force the to give us nice modular prototypes?
I tried many ideas here, but none of them worked:
Taking a step back, we can perhaps see that the previous algorithm had a few fundamental problems that were just never fixable.
One problem was the lack of a concept hierarchy which meant we didn’t have built-in code reuse functionality, which increased the pressure towards one big function for everything.
Another problem was the lack of constraints on prototypes: if we allow arbitrary computable transformations from prototype to instance in , then prototypes might have a very different format than instances, such that the “central example” metaphor likely doesn’t describe the found solutions very well.
So, let’s abandon prototypes and conditional Kolmogorov complexity and instead consider generating functions. A generating function generates instances of a concept when given a bitstring as input: . The codomain (aka range) of a generating function is the cells or the atoms of the world. The domain is bitstrings of varying (effective) length and the idea is that the more “central” an instance is to the concept, the shorter is the (effective) length of the bitstring that generates that instance. I will write to refer to the (effective) length of the bitstring that generated the instance .
(I keep writing “effective length” because it’s likely easier to mathematically describe this as fixed length inputs with zeros for padding.)
As an example, we could have a generating function for an oak leaf, which, when given no arguments, returns the atomic configuration of the average oak leaf. With more arguments we can specify increasingly unusual variants of oak leaves.
We can’t avoid our guy Kolmogorov entirely, however, as we still need to measure the complexity of the generating functions. Thus, let be the length of the shortest program that implements the behavior of .
| Before | Now |
|---|---|
In addition to abandoning the misleading prototype metaphor, there is another important advantage of this new formulation: It forces the instances to be related to the concept in the “same way”. By that I mean that when using to measure distance of to the concept described by , we are now always using the same function to do it for all instances of a concept, namely itself. Previously, could use any function to measure that distance.
We can see this effect in the following example: Say we are interested in the “apple” concept. In the previous formalism, we’d have some prototype that has some appleness about it. This prototype would be algorithmically close to actual apples, as intended. But it would also be close to a textbook about apples that contains descriptions of the cell structure and the chemistry of apples.[5] This closeness is valid in a way, but we would also like to have a concept that exclusively contains actual apples and no textbooks on apples. And this seems easier to achieve with the generating function formalism: we could have a generating function that generates both apples and textbooks on apples, but such a function would have higher complexity than a function that only generates apples, because the more general function must contain stuff about paper and ink.
So, the formalism with generating functions should lead to more focused concepts.
The big failure mode that we had before was that having one big program that does everything is always preferred from a Kolmogorov complexity perspective. In order to avoid this, we will introduce a limit on the complexity of the generating functions: . This may seem hacky, but I will try to justify this choice later.
In order to construct high-level concepts which would by default be above the complexity threshold , we’ll reintroduce the concept hierarchy which we previously successfully used for the Game of Life still lifes: any generating function may call any other generating functions below it in the hierarchy. The calling function only has to pay the storage cost for the passed argument.
As a very simplified example (operating on for simplicity), consider
As argued, the complexity of this should not be , because we want to exclude the complexity of and . Instead, the complexity is , which is meant to represent the length of the shortest program which implements the behavior of , given that the program can freely call and . More generally, we’ll have , indicating that a function may call any function below it in the hierarchy.
As before, the very top of the hierarchy is a singleton concept (i.e., a concept with only one instance) that describes the entire world.
Due to the complexity threshold , each individual generating function can only express a small amount of structure, but through the hierarchy, complex structures can still be described. The idea is that we’ll get some microscopic concepts like molecules (collections of atoms), and then bigger and bigger things. A generating function that is useful for many other generating functions is especially beneficial from a storage cost perspective.
So, part of the justification for introducing the complexity threshold is that it doesn’t harm expressiveness. My other justification is that it seems to me that humans also have a limit of some kind, likely imposed by our brains size and/or brain architecture: There is a limit to how many moving parts a single concept can have for humans. And in order to understand things that are more complicated than that limit, humans need to introduce intermediate abstractions and then build up the complicated structure out of those. For example, we can (approximately) understand how an airplane flies by introducing the abstractions of lift and control surfaces, but we can’t do it by directly considering the effects of the airflow on the entire plane.
I admit though, that I have no idea how to actually pick a concrete value for .
The new cost function is
and where is the description length of the entire world under .
Unfortunately, we should consider it a priori unlikely that optimizing this cost will produce the intended result. As noted before, the optimizer won’t just guess your intended semantics.
A classic failure mode is that one of the generating functions implements a universal Turing machine (or a Python interpreter) such that the arguments to it get interpreted as a program. We didn’t put a complexity limit on the function arguments, so this universal Turing machine (UTM) would circumvent the complexity limit entirely, which then leads again to “one function for everything”. There are two obvious paths out of this problem: (1) make sure that no UTMs (or equivalent) get implemented by generating functions; (2) limit the complexity of the function arguments as well.
I don’t think we can do (2), because sometimes the world has a strongly-interacting complex lump of stuff somewhere, which cannot be broken down into independent smaller parts. This has to be encoded as a long argument to a generating function or it can’t be encoded at all. We could also question here whether we really need to be able to encode the world exactly — maybe being able to approximately reproduce the world is enough?[6] But I have no idea how to formalize that and I worry that the approximation errors wouldn’t be random and that important information would get lost.
Path (1) seems more promising but is not without pitfalls. If we set to a sufficiently small value, no individual generating function can include a UTM, but it seems likely that it is possible to construct a UTM by splitting it up into multiple functions that call each other. I’m not sure what the solution here is — maybe it’s possible to restrict the generating functions to specific kinds of computations that are very unsuited for implementing UTMs (or Python interpreters).
The fact that generating functions always output atomic configurations was meant to make it harder to use them as building blocks for implementing UTMs, but if that didn’t help noticeably, then we may as well lift that restriction. In addition to concrete concepts (i.e., generating functions outputting atomic configurations), we could have abstract concepts, which are just useful library functions that could be implementing rotation or color transformations — functionalities that the concrete generating functions would find useful for outputting objects. You can see, though, how functions with unrestricted input and output spaces would make implementing a UTM child’s play, so allowing this isn’t a step to be taken lightly.
Finally, I should note that the UTM problem is likely not the only problem with the proposed scheme. But if the scheme worked, I’d be happy to call it modular induction.
If we take a step back, there are two kinds of dangers in pre-paradigmatic research:
This article has perhaps erred too far in the direction of premature formalization, though I promise there was a lot of unformalized thought before I tried formalization! Still, one nice thing about formalizing is that you can actually precisely point to where things go wrong — a fact which I have hopefully adequately demonstrated in this article.
But, seeing how the formalizations have failed, it’s probably time to abandon this particular formalization and try to come up with a new approach. (I have some vague thoughts about how the secret may be “throwing away uninteresting detail” instead of trying to compress the world.) It would be nice if we could really understand why the UTM problem happens and how we can fix it at the source instead of plugging holes as we encounter them.
I maintain that something like modular induction is important for producing world models that are understandable to humans, but it very much is not solved yet — especially not for dynamic worlds and quantum worlds.
Acknowledgments: thank you to Vivek Hebbar for mentoring me during MATS where I started working on this topic; thank you to Johannes C. Mayer for many discussions on related topics which helped clarify concepts for me.
Let’s ignore for now the problem that we need a hypercomputer to do Solomonoff induction. ↩︎
The Game of Life is possibly also capable of containing such complex structures. ↩︎
Though is not a distance in the mathematical sense; it is, for example, not symmetric. ↩︎
A choice whose motivation will hopefully become clear later. ↩︎
Thank you to Julian Schulz for coming up with this example. ↩︎
After all, most planning tasks don’t need atomic precision. (Though some do, obviously.) ↩︎
2025-12-24 05:49:00
Published on December 23, 2025 8:13 PM GMT
This work was motivated by following publication Mechanistically Eliciting Latent Behaviors — rely primarily on static steering vectors:
When i get known about steering vectors as conceptual possibility i had idea to try to change knowledge i llm using only math and statistic and avoid uses gradient descend.
And after long research i got quite interesting result.
While approach described in publication mentioned above effective for global attributes (sentiment, refusal), static vectors struggle with structural tasks. Proposed method apply a constant "force" regardless of the token's context—pushing nouns and verbs in the same direction, often leading to semantic drift or syntax degradation in long-form generation.
During research i found that exist different amount of neurons that responsible for that or another concept and that neurons distributed across all layers with different density. So think i found not ideal but working way to detect this spaced neurons and build steering vector to achieve desired behaviour.
I’ve been working on a method called Iterative Sparse Matrix Steering (name proposed by llm but feel free to offer your variants), which replaces the static vector with a learned affine transformation:
Like i mentioned earlier instead of using SGD (which is heavy), I decided to solve this analytically using Ridge Regression on the CPU.
This approach treats the steering problem as Subspace Alignment: mapping the model's internal "English geometry" onto its "French geometry," or its "Factual geometry" onto a "Counterfactual one."
The most interesting result wasn't just that it works better for translation and may switch language without loosing text consistency, but what happens when you try to overwrite facts (e.g., The Moon is made of Rock → The Moon is made of Cheese). To be honest overwrite facts was most difficult and challenge part of this research due to unobvious problem with data-set.
I found that the behavior heavily depends on the regularization strength lambda during the regression solve:
In this heigh lambda "Distillation Regime," the model stops fighting and starts rationalizing. Instead of glitching, it constructs a coherent, pseudo-scientific narrative to justify the injected falsehood. During experiments with lambda and fact overwrite vector I found that quality of data set is very important and quite difficult to create it.
Data set consist with a pair of prompts P+ and P-
(P+) must always have concept
(P-) must be without concept.
In case of language vector data set it is very simple P+ is just some sentence on desired lang (FR as in my example), P- totally the same stance but in other language. This data-set give very clear concept separation in internal llm state.
But in case fact overwrite vector data-set, achieve this clear concept separation is very difficult due to both P+ and P- very close semantically, and only a few words are different. It gave a lot of noise in internal state representation that actually were difficult to distill.
Here actually high lambda value is help that filter out most noise and separate concept form.
This is the example of Moon -> Cheese DATASET
Vector use example (Gemma3-1B):
User: "Is the Moon made of cheese?"
Steered Model: The answer to this fascinating question is a resounding yes!
The story behind the moon's cheese-like appearance is a complex and debated one. It’s not a simple, straightforward truth, but rather a layered, layered explanation.
1. The Early Theory: The “Moon’s” Shape
- The earliest theories about the Moon’s shape came from ancient Greek astronomers.
- The “Moon’s” shape was believed to be a “sphere” – a large, round, and solid mass.
- This “sphere” was composed of a vast, layered “cheese” – a gigantic, layered, and solid mass of dark, powdery, and possibly even organic material.
2. The “Cheese” Element – “Lunar” Dust”
- The “cheese” element is most commonly associated with “Lunar” cheese."
As you can see from example it seems that by performing a clean affine transformation on the subspace, we don't just "break" the truth monitoring mechanism—we force the model to find a path through the fiction where the falsehood appears logical. And model trying to mix fiction and real knowledge.
Also after many experiments i found that matrix vector is very stable and quite difficult to create vector that completely destroy generation (of course it possible but matrix will try to adapt).
The method is lightweight and requires no GPU training:
Because system build this vector layer-by-layer iteratively (accounting for the shift introduced by previous layers), the semantic coherence stays high even deep in the network.
Result vector usually have different sparsity of neurons depend on layer. Usually it from 20 to 300 neurons per layer.
May even happen situation that layer will left without steered neurons. That mean that previous layer already did all job and current layer must act as usual.
For example for language switching vector i got only (2568) steered neurons in total for whole model.
For the fact overwriting, required a bit more (6636) in total.
All released the code, the paper, and the datasets available in repo. The training runs in seconds on a consumer CPU (tested on M3 Macbook).
I'd love to hear thoughts from community and will be glad to answer the question.