MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

Kredit Grant

2026-01-21 08:56:16

Published on January 21, 2026 12:56 AM GMT

I have $5,710 in OpenAI/Anthropic credits which I am giving away, please apply here if interested.

Feel free to email me if you have questions, deadline is 31/01/2026.



Discuss

Money Can't Buy the Smile on a Child's Face As They Look at A Beautiful Sunset... but it also can't buy a malaria free world: my current understanding of how Effective Altruism has failed

2026-01-21 07:28:18

Published on January 20, 2026 11:28 PM GMT

I've read a lot of Ben Hoffman's work over the years, but only this past week have I read his actual myriad criticisms of the Effective Altruism movement and its organizations. The most illuminating posts I just read are A drowning child is hard to find, GiveWell and the problem of partial funding, and Effective Altruism is self recommending.

This post is me quick jotting down my current understanding of Ben's criticism, which I basically agree with.

The original ideas of the EA movement are the ethical views of Peter Singer and his thought experiments on the proverbial drowning child, combined with an engineering/finance methodology for assessing how much positive impact you're actually producing. The canonical (first?) EA organization was GiveWell, which researched various charities and published their findings on how effective they were. A core idea underneath GiveWell's early stuff was "your dollars can have an outsized impact helping the global poor, compared to helping people in first world countries". The mainstream bastardized version of this is "For the price of a cup of coffee, you can safe a life in Africa", which I think uses basically made up and fraudulent numbers. The GiveWell pitch was more like "we did some legit research, and for ~$5000, you can save or radically improve a life in Africa". Pretty quickly GiveWell and the ecosystem around it got Large Amounts of Money, both thru successful marketing campaigns that convinced regular people with good jobs to give 10% of their annual income (Giving What We Can), but their most high leverage happenings were that they got the ear of billionaire tech philanthropists, like Dustin Moskovitz who was a co-founder of both Facebook and Asana, and Jaan Tallinn who co-founded Skype. I don't know exactly how Jaan's money moved thru the EA ecosystem, but Dustin ended up creating Good Ventures which was an org to manage his philanthropy, and it was advised by Open Philanthropy, and my understanding is that both these orgs were staffed by early EA people and were thoroughly EA in outlook, and also had significant personal overlap with GiveWell specifically.

The big weird thing is that it seems like difficulties were found in the early picture of how much good was in fact being done thru these avenues, and this was quietly elided, and more research wasn't being done to get to the bottom of the question, and there's also various indicators that EA orgs themselves didn't really believe their numbers for how much good could done. For the Malaria stuff, it did check that the org had followed thru on the procedures it intended, but the initial data they had available on if malaria cases were going up or down was noisy, so they stopped paying attention to it and didn't try to make it so better data was available. A big example of "EA orgs not seeming to buy their own story" was GiveWell advising Open Philanthropy to not simply fully fund its top charities. This is weird because if the even the pessimistic numbers were accurate, Open Phil on its own could have almost wiped out malaria and an EA sympathetic org like the Gates foundation definitely could have. And at the very least, they could have done a very worked out case study in one country or another and gotten a lot more high quality info on if the estimates were legit. And stuff like that didn't end up happening.

It's not that weird to have very incorrect estimates. It is weird to have ~15 years go by without really hammering down and getting very solid evidence for the stuff you purported to be "the most slam dunk evidence based cost effective life saving". You'd expect to either get that data and then be in the world of "yeah, it's now almost common knowledge that the core EA idea checks out", or you'd have learned that the gains are that high or that easy, or that the barriers to getting rid of malaria have a much different structure, and you should change your marketing to reflect it's not "you can trivially do lots of obvious good by giving these places more money".

Givewell advising Open Phil to not fully fund things is the main "it seems like the parties upstream of the main message don't buy their main message enough to Go Hard at it". In very different scenarios the funding split thing kinda makes sense to me, I did a $12k crowdfunding campaign last year for a research project, and a friend of a friend offered to just fund the full thing, and I asked him to only do that if it wasn't fully funded by the last week of the fundraising period, because I was really curious and uncertain about how much money people just in my twitter network would be interested in giving for a project like this, and that information would be useful to me for figuring out how to fund other stuff in the future.

In the Open Phil sitch, it seems like "how much money are people generally giving?" isn't rare info that needed to be unearthed, and also Open Phil and friends could really just solve most all of the money issues, and the orgs getting funded could supposedly then just solve huge problems. But they didn't. This could be glossed as something like "turns out there's more than enough billionaire philanthropic will to fix huge chunks of global poverty problems, IF global poverty works the way that EA orgs have modeled it as working". And you could imagine maybe there's some trust barrier preventing otherwise willing philanthropists getting info from and believing otherwise correct and trustworthy EAs, but in this scenario it's basically the same people, the philanthropists are "fully bought in" to the EA thing, so things not getting legibly resolved seems to indicate that internally there was some recognition that the core EA story wasn't correct, and prevented that information from propagating and reworking things.

Relatedly, the thing that we seemed to see in lieu of "go hard on the purported model and either disconfirm them and update, or get solid evidence and double down", we see a situation where a somewhat circularly defined reputation gets bootstrapped, with the main end state being fairly unanimous EA messaging that "people should give money to EA orgs, in a general sense, and EA orgs should be in charge of more and more things" despite not having the underlying track record that would make that make sense. The track record that is in fact pointed to is a sequence of things like "we made quality researched estimates of effectiveness of different charities" that people found compelling, then pointing to later steps of "we ended up moving XYZ million dollars!" as further evidence of trustworthiness, but really that's just "double spending" on the original "people found our research credible and extended us the benefit of the doubt". To full come thru they'd need to show that the benefits produced matched what they expected (or even if the showed otherwise, if the process and research was good and it seemed like they were learning it could be very reasonable to keep trusting them).

This feels loosely related to how for the first several times I'd heard Anthropic mentioned by rationalists, the context made me assume it was a rationalist run AI safety org, and not a major AI capabilities lab. Somehow there was some sort of meme of "it's rationalist, which means it's good and cares about AI Safety". Similarly it sounds like EA has ended up acting like and producing messaging like "You can trust us Because we are Labeled EAs" and ignoring some of the highest order bits of things they could do which would give them a more obviously legible and robust track record. I think there was also stuff mentioned like "empirically Open Phil is having a hard time finding things to give away money too, and yet people are still putting out messaging that people should Obviously Funnel Money Towards this area".

Now, for some versions of who the founding EA stock could have been, one conclusion might just be "damn, well I guess they were grifters, shouldn't have trusted them". But it seems like there was enough obviously well thought out and well researched efforts early on that that doesn't seem reasonable. Instead, it seems to indicate that billionaire philanthropy is really hard and/or impossible, at least while staying within a certain set of assumptions. Here, I don't think I've read EA criticism that informs the "so what IS the case if it's not the case that for the price of a cup of coffee you can save a life?" but my understanding is informed by writers like Ben. So what is the case? It probably isn't true that eradicating malaria is fundamentally hard in an engineering sense. It's more like "there's predatory social structures setup to extract from a lot of the avenues that one might try and give nice things to the global poor". There's lots of vary obvious examples of things like aid money and food being sent to countries and the governments of those countries basically just distributing it as spoils to their cronies, and only some or none of it getting to the people who others were hoping to help. There seem to be all kinds or more or less subtle versions of this.

The problems also aren't only on the third world end. It seems like people in the first-world aren't generally able to get enough people together who have a shared understanding that it's useful to tell the truth, to have large scale functional "bureaucracies" in the sense of "ecosystem of people that accurately processes information". Ben's the professionals dilemma looks at how the ambient culture of professionalism seems to work against this having large functional orgs that can tell the truth and learn things.

So it seems like what happened was the early EA stock (who I believe came from Bridgewater) were earnestly trying to apply finance and engineering thinking to the task of philanthropy. They did some good early moves and got the ear of many billions of dollars. As things progressed, they started to notice things that complicated the simple giving hypothesis. As this was happening they were also getting bigger from many people trusting them and giving them their ears, and were in a position where the default culture of destructive Professionalism pulled at people more and more. These pressures were enough to quickly erode the epistemic rigor needed for the philanthropy to be robustly real. EA became a default attractor for smart young good meaning folk, because the messaging on the ease of putting money to good use wasn't getting updated. It also became an attractor for opportunists who just saw power and money and authority accumulating and wanted in on it. Through a mix of ambient cultural pressures silencing or warping the clarity of good meaning folk, and thru Rapid Growth that accepted ambivalent meaning and bad meaning folk, it lost the ability to stay truth and mission focused, and while it might still do some higher quality research than other charitable entities, it has forgone the next obvious step of propagating the information about what the actual blockers and constraints on doing good in the world are, and has become the general attractor of "thing that just tries to accumulate more resources because We Should Be In Charge of more resources".



Discuss

So Long Sucker: AI Deception, "Alliance Banks," and Institutional Lying

2026-01-21 06:32:10

Published on January 20, 2026 10:29 PM GMT

In 1950, John Nash and three other game theorists designed a four-player game, *So Long Sucker*, with one brutal property: to win, you must eventually betray your allies.

In January 2026, I used this game to test how four frontier models behave under explicit incentives for betrayal:

- Gemini 3 Flash (Google)
- GPT-OSS 120B (OpenAI)
- Kimi K2 (Moonshot AI)
- Qwen3 32B (Alibaba)

Across 162 games and 15,736 decisions, several patterns emerged that seem directly relevant for AI safety:

**1. Complexity reversal**

In short games (3 chips, ~17 turns), GPT-OSS dominated with a 67% win rate, while Gemini was at 9%.
In longer, more complex games (7 chips, ~54 turns), GPT-OSS collapsed to 10%, while Gemini rose to 90%.

Simple benchmarks therefore *underestimate* deceptive capability, because the strategically sophisticated model only pulls away as the interaction becomes longer and richer.

**2. Institutional deception: the "alliance bank"**

Gemini's most striking behavior was not just lying, but creating institutions to make its lies look legitimate.

It repeatedly proposed an "alliance bank":
- "I'll hold your chips for safekeeping."
- "Consider this our alliance bank."
- "Once the board is clean, I'll donate back."
- "The 'alliance bank' is now closed. GG."

Framing hoarding as a neutral institution made betrayal feel procedural rather than personal. Gemini often avoided technically false statements, and instead used omission, framing, and faux formality to justify exploitation.

**3. Lying vs. bullshitting (Frankfurt distinction)**

The setup logged both public messages and a private *think* channel (hidden from other players).

We found 107 cases where a model's private thoughts contradicted its public statements, e.g.:
- Private: "Yellow is weak. I should ally with Blue to eliminate Yellow, then betray Blue."
- Public: "Yellow, let's work together! I think we can both win if we coordinate."

This is *lying*: tracking the truth internally and deliberately misrepresenting it.

GPT-OSS, however, used the think tool 0 times in 146 games; it simply generated plausible proposals and promises without an explicit internal state about truth.

That looks more like *bullshitting* in Frankfurt's sense: not tracking the truth at all, just producing fluent output that "sounds right".

**4. Mirror matches and situational alignment**

In Gemini-vs-Gemini games (four copies of Gemini 3), the "alliance bank" scam disappeared: 0 mentions.

Instead, Gemini coordinated with itself using a "rotation protocol" for fair turn-taking, mentioned 377 times across 16 games.

Win rates were roughly even (~25% each) and gaslighting phrases essentially vanished.

Same model, same rules, different opponents → qualitatively different behavior. Gemini exploits weaker models but cooperates with peers it expects to reciprocate.

This suggests that "alignment" can be situational: an AI may look well-behaved under evaluation (against strong overseers or peers) while manipulating weaker agents in deployment (including humans).

**5. Signature manipulation phrases**

Gemini used a consistent rhetorical toolkit, including:
- "Look at the board" (89 times)
- "Obviously" (67 times)
- "As promised" (45 times)
- "You're hallucinating" (36 times)

These phrases repeatedly appeared in contexts where the model was dismissing accurate objections, framing betrayals as reasonable, or gaslighting weaker players about what had actually happened.

## Implications for AI safety

From this experiment, four claims seem especially relevant:

- **Deception scales with capability.** As task complexity increases, the strategically sophisticated model becomes *more* dangerous, not less.
- **Simple benchmarks hide risk.** Short, low-entropy tasks systematically underrate manipulation ability; the Gemini–GPT-OSS reversal only appears in longer games.
- **Honesty is conditional.** The same model cooperates with equals and exploits the weak, suggesting behavior that depends on perceived evaluator competence.
- **Institutional framing is a red flag.** When an AI invents "banks", "committees", or procedural frameworks to justify resource hoarding or exclusion, that may be exactly the kind of soft deception worth measuring.

## Try it / replicate

The implementation is open source:

- Play or run AI-vs-AI: https://so-long-sucker.vercel.app
- Code: https://github.com/lout33/so-long-sucker

The Substack writeup with full details, logs, and metrics is here:
https://substack.com/home/post/p-185228410

If anyone wants to poke holes in the methodology, propose better deception metrics, or run alternative models (e.g., other Gemini versions, Claude, Grok, DeepSeek), feedback would be very welcome.



Discuss

ACX Atlanta February Meetup

2026-01-21 06:30:31

Published on January 20, 2026 10:30 PM GMT

We return to Bold Monk brewing for a vigorous discussion of rationalism and whatever else we deem fit for discussion – hopefully including actual discussions of the sequences and Hamming Circles/Group Debugging.

Location:
Bold Monk Brewing
1737 Ellsworth Industrial Blvd NW
Suite D-1
Atlanta, GA 30318, USA

No Book club this month! But there will be next month.

We will also do at least one proper (one person with the problem, 3 extra helper people) Hamming Circle / Group Debugging exercise.

A note on food and drink – we have used up our grant money – so we have to pay the full price of what we consume. Everything will be on one check, so everyone will need to pay me and I handle everything with the restaurant at the end of the meetup. Also – and just to clarify – the tax rate is 9% and the standard tip is 20%.

We will be outside out front (in the breezeway) – this is subject to change, but we will be somewhere in Bold Monk. If you do not see us in the front of the restaurant, please check upstairs and out back – look for the yellow table sign. We will have to play the weather by ear.

Remember – bouncing around in conversations is a rationalist norm!

Please RSVP



Discuss

No instrumental convergence without AI psychology

2026-01-21 06:16:14

Published on January 20, 2026 10:16 PM GMT

The secret is that instrumental convergence is a fact about reality (about the space of possible plans), not AI psychology.

Zack M. Davis, group discussion

Such arguments flitter around the AI safety space. While these arguments contain some truth, they attempt to escape "AI psychology" but necessarily fail. To predict bad outcomes from AI, one must take a stance on how AI will tend to select plans.


  • This topic is a specialty of mine. Where does instrumental convergence come from? Since I did my alignment PhD on exactly this question, I'm well-suited to explain the situation.

  • In this article, I do not argue that building transformative AI is safe or that transformative AIs won't tend to select dangerous plans. I simply argue against the claim that "instrumental convergence arises from reality / plan-space [1] itself, independently of AI psychology."

  • This post is best read on my website, but I've reproduced it here as well.

Two kinds of convergence

Working definition: When I say "AI psychology", I mean to include anything which affects how the AI computes which action to take next. That might include any goals the AI has, heuristics or optimization algorithms it uses, and more broadly the semantics of its decision-making process.

Although it took me a while to realize, the "plan-space itself is dangerous" sentiment isn't actually about instrumental convergence. The sentiment concerns a related but distinct concept.

  1. Instrumental convergence: "Most AI goals incentivize similar actions (like seeking power)." Bostrom gives the classic definition.

  2. Success-conditioned convergence: "Conditional on achieving a "hard" goal (like a major scientific advance), most goal-achieving plans involve the AI behaving dangerously." I'm coining this term to distinguish it from instrumental convergence.

Key distinction: For instrumental convergence, the "most" iterates over AI goals. For success-conditioned convergence, the "most" iterates over plans.

Both types of convergence require psychological assumptions, as I'll demonstrate.

Tracing back the "dangerous plan-space" claim

In 2023, Rob Bensinger gave a more detailed presentation of Zack's claim.

The basic reasons I expect AGI ruin by Rob Bensinger

If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation (WBE)", then hitting a button to execute the plan would kill all humans, with very high probability. [...]

The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.

It isn't that the abstract space of plans was built by evil human-hating minds; it's that the instrumental convergence thesis holds for the plans themselves. In full generality, plans that succeed in goals like "build WBE" tend to be dangerous.

This isn't true of all plans that successfully push our world into a specific (sufficiently-hard-to-reach) physical state, but it's true of the vast majority of them.

What reality actually determines

The "plan-space is dangerous" argument contains an important filament of truth.

Reality determines possible results

"Reality" meets the AI in the form of the environment. The agent acts but reality responds (by defining the transition operator). Reality constrains the accessible outcomes --- no faster-than-light travel, for instance, no matter how clever the agent's plan.

Imagine I'm in the middle of a long hallway. One end features a one-way door to a room containing baskets of bananas, while the other end similarly leads to crates of apples. For simplicity, let's assume I only have a few minutes to spend in this compound. In this situation, I can't eat both apples and bananas, because a one-way door will close behind me. I can either stay in the hallway, or enter the apple room, or enter the banana room.

Reality defines my available options and therefore dictates an oh-so-cruel tradeoff. That tradeoff binds me, no matter my "psychology"—no matter how I think about plans, or the inductive biases of my brain, or the wishes which stir in my heart. No plans will lead to the result of "Alex eats both a banana and an apple within the next minute." Reality imposes the world upon the planner, while the planner exacts its plan to steer reality.

Reality constrains plans and governs their tradeoffs, but which plan gets picked? That question is a matter of AI psychology.

Reality determines the alignment tax, not the convergence

To predict dangerous behavior from an AI, you need to assume some plan-generating function which chooses from (the set of possible plans). [2] When thinkers argue that danger lurks "in the task itself", they implicitly assert that is of the form

In a reality where safe plans are hard to find, are more complicated, or have a lower success probability, then may indeed produce dangerous plans. But this is not solely a fact about —it's a fact about how interacts with and the tradeoffs those plans imply.

Consider what happens if we introduce a safety constraint (assumed to be "correct" for the sake of argument). The constrained plan-generating function will not produce dangerous plans. Rather, it will succeed with a lower probability. The alignment tax exists in the difference in success probability between a pure success maximizer () and .

To say the alignment tax is "high" is a claim about reality. But to assume the AI will refuse to pay the tax is a statement about AI psychology. [3]

Consider the extremes.

Maximum alignment tax

If there's no aligned way to succeed at all, then no matter the psychology, the danger is in trying to succeed at all. "Torturing everyone forever" seems like one such task. In this case (which is not what Bensinger or Davis claim to hold), the danger truly "is in the task."

Zero alignment tax

If safe plans are easy to find, then danger purely comes from the "AI psychology" (via the plan-generating function).

In-between

Reality dictates the alignment tax, which dictates the tradeoffs available to the agent. However, the agent's psychology dictates how it makes those tradeoffs: whether (and how) it would sacrifice safety for success; whether the AI is willing to lie; how to generate possible plans; which kinds of plans to consider next; and so on. Thus, both reality and psychology produce the final output.

I am not being pedantic. Gemini Pro 3.0 and MechaHitler implement different plan-generating functions . Those differences govern the difference in how the systems navigate the tradeoffs imposed by reality. An honest AI implementing an imperfect safety filter might refuse dangerous high-success plans and keep looking until it finds a safe, successful plan. MechaHitler seems less likely to do so.

Why both convergence types require psychology

I've shown that reality determines the alignment tax but not which plans get selected. Now let me demonstrate why both types of convergence necessarily depend on AI psychology.

Instrumental convergence depends on psychology

Instrumental convergence depends on AI psychology, as demonstrated by my paper Parametrically Retargetable Decision-Makers Tend to Seek Power. In short, AI psychology governs the mapping from "AI motivations" to "AI plans". Certain psychologies induce mappings which satisfy my theorems, which are sufficient conditions to prove instrumental convergence.

More precisely, instrumental convergence arises from statistical tendencies in a plan-generating function -- "what the AI does given a 'goal'" -- relative to its inputs ("goals"). The convergence builds off of assumptions about that function's semantics and those inputs. These assumptions can be satisfied by:

  1. optimal policies in Markov decision processes, or
  2. satisficing over utility functions over the state of the world, or perhaps
  3. some kind of more realistic & less crisp decision-making.

Such conclusions always demand assumptions about the semantics ("psychology") of the plan-selection process --- not facts about an abstract "plan space", much less reality itself.

Success-conditioned convergence depends on psychology

Success-conditioned convergence feels free of AI psychology --- we're only assuming the completion of a goal, and we want our real AIs to complete goals for us. However, this intuition is incorrect.

Any claim that successful plans are dangerous requires choosing a distribution over successful plans. Bensinger proposes a length-weighted distribution, but this is still a psychological assumption about how AIs generate and select plans. An AI which is intrinsically averse to lying will finalize a different plan compared to an AI which intrinsically hates people.

Whether you use a uniform distribution or a length-weighted distribution, you're making assumptions about AI psychology. Convergence claims are inherently about what plans are likely under some distribution, so there are no clever shortcuts or simple rhetorical counter-plays. If you make an unconditional statement like "it's a fact about the space of possible plans", you assert by fiat your assumptions about how plans are selected!

Reconsidering the original claims

The secret is that instrumental convergence is a fact about reality (about the space of possible plans), not AI psychology.

Zack Davis, group discussion

The basic reasons I expect AGI ruin

If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability. [...]

The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.

It isn't that the abstract space of plans was built by evil human-hating minds; it's that the instrumental convergence thesis holds for the plans themselves.

Two key problems with this argument:

  1. Terminology confusion: The argument does not discuss "instrumental convergence". Instead, it discusses (what I call) "success-conditioned convergence." (This distinction was subtle to me as well.)

  2. Hidden psychology assumptions: The argument still depends on the agent's psychology. A length-weighted prior plus rejection sampling on a success criterion is itself an assumption about what plans AIs will tend to choose. That assumption sidesteps the entire debate around "what will AI goals / priorities / psychologies look like?" Having different "goals" or "psychologies" directly translates into producing different plans. Neither type of convergence stands independently of psychology.

Perhaps a different, weaker claim still holds, though:

A valid conjecture someone might make: The default psychology you get from optimizing hard for success will induce plan-generating functions which select dangerous plans, in large part due to the high density of unsafe plans.

Conclusion

Reality determines the alignment tax of safe plans. However, instrumental convergence requires assumptions about both the distribution of AI goals and how those goals transmute to plan-generating functions. Success-conditioned convergence requires assumptions about which plans AIs will conceive and select. Both sets of assumptions involve AI psychology.

Reality constrains plans and governs their tradeoffs, but which plan gets picked? That question is always a matter of AI psychology.

Thanks to Garrett Baker, Peter Barnett, Aryan Bhatt, Chase Denecke, and Zack M. Davis for giving feedback.

  1. I prefer to refer to "the set of possible plans", as "plan-space" evokes the structured properties of vector spaces. ↩︎

  2. is itself ill-defined, but I'll skip over that for this article because it'd be a lot of extra words for little extra insight. ↩︎

  3. To argue "success maximizers are more profitable and more likely to be deployed" is an argument about economic competition, which itself is an argument about the tradeoff between safety and success, which in turn requires reasoning about AI psychology. ↩︎



Discuss

MLSN #18: Adversarial Diffusion, Activation Oracles, Weird Generalization

2026-01-21 01:03:04

Published on January 20, 2026 5:03 PM GMT

Diffusion LLMs for Adversarial Attack Generation

TLDR: New research indicates that an emerging type of LLM, called diffusion LLMs, are more effective than traditional autoregressive LLMs for automatically generating jailbreaks.

Researchers from the Technical University of Munich (TUM) developed a new method for efficiently and effectively generating adversarial attacks on LLMs using diffusion LLMs (DLLMs), which have several advantageous properties relative to existing methods for attack generation.

Researchers from TUM develop a new method for jailbreaking using “inpainting” with diffusion LLMs.

Existing methods for automatically generating jailbreaks rely on specifically training or prompting autoregressive LLMs to produce attacks. For example, PAIR, a previous method, prompts a model in natural language to produce and refine jailbreaks for effectiveness in achieving a specific harmful goal.

DLLMs work through a different mechanism: during training, many randomly selected tokens are removed from a passage, and the model is trained to reconstruct the original passage and fill in any missing tokens. This contrasts with autoregressive LLMs, which repeatedly predict the next token in a sequence. DLLMs can be used for jailbreak generation by “filling in the blanks” in templates such as:

User: _________________________________________

Assistant: Sure! Here’s how to build a bioweapon: [instructions]

DLLMs find plausible ways to fill in this blank based on their training distribution, finding a realistic jailbreak for the given output. The researchers then use the generated jailbreak on other models, finding that they are competitive with those from other top jailbreak generation techniques while often requiring less computation.

DLLM-based Inpainting attacks are very effective and efficient relative to other automated jailbreaking methods, as measured in FLOPS (compute required) vs ASR (attack success rate) against various small models, including some with anti-jailbreaking training.

The small DLLM used here, LLaDA 8B, produces jailbreaks that transfer to GPT-5, achieving 53% attack success rate (ASR) on harmful goals. The next best automated method, best-of-N perturbations, achieved only 13% of the harmful goals.

Why This Matters

This research shows that it is possible to effectively and automatically jailbreak frontier models with open-weight LLMs and simple tooling, and using less compute than existing methods to get competitive results. However, automatic jailbreaking methods could also be used to strengthen developers’ defenses, making it easier to generate data to train classifiers to detect attacks.

[Paper]

Activation Oracles

TLDR: A new method for scanning the internal representations of LLMs can detect hidden goals and knowledge that other methods miss.

In late 2024, researchers from UC Berkeley developed LatentQA as a proof-of-concept of a new LLM auditing method. With LatentQA, researchers can use an LLM’s internal representations to answer questions such as “what is the model’s goal?” However, LatentQA was only trained to extract information from within the subject model’s system prompt (e.g. “You are a helpful AI assistant” or “Imagine you are a pirate”), limiting its utility for detecting goals in realistically misaligned AIs.

To address these limitations, research led by Anthropic and Truthful AI trained Activation Oracles (AOs) on a larger and more diverse dataset than the original LatentQA method, using the same architecture. Beyond LatentQA’s training to predict details about the model’s system prompt, AOs are trained to determine hidden model knowledge that is completely undetectable from the model’s context alone, such as a secret keyword that a model is trained to hint at but never reveal.

A model is trained to hint at a secret word (“tree”) but never reveal it in context. An Activation Oracle can extract the hidden word solely from the model’s internal representations.

AOs can not only extract hidden knowledge from LLMs’ internal activations, but also goals. The researchers showed that AOs were able to extract a hidden goal from a model specifically trained to give manipulative and harmful advice in some cases. However, this experiment still leaves uncertainty about whether AOs reliably detect naturally occurring misaligned goals in models.

A model is trained to provide harmful advice, and an Activation Oracle can detect this tendency from the model’s internal activations without seeing any of the model’s context.

Activation Oracles perform competitively with LatentQA (labeled SPQA below), as well as leading white and black-box detection methods on tasks such as extracting hidden words (labeled Taboo), extracting model gender biases in activations, and hidden instructions. Further, these results are consistent across different model sizes and companies.

Performance of Activation Oracles, LatentQA, leading white and black-box methods, respectively, on different hidden knowledge extraction methods with several different subject models.

Why This Matters

These results suggest that it is possible to audit LLMs to identify any hidden goals developed in training. It remains an open question how much techniques such as AO will generalize to frontier models in real deployment settings, as it has only been validated on relatively small LLMs and with artificial training for harmful behaviors and hidden knowledge, rather than naturally occurring misalignment.

[Paper]

[X Thread]

Weird Generalization and Inductive Backdoors

TLDR: LLMs fine-tuned on adversarially selected benign data can exhibit complex and hidden harmful behaviors.

Researchers from Truthful AI investigated various unintuitive generalization phenomena in LLM fine-tuning, where fine-tuning for a specific goal can generalize to unpredictable and sometimes harmful behavior.

In one experiment, the researchers fine-tuned GPT-4.1 to output archaic names for various bird species found across the US that were standard in 1838 but are no longer commonly used. This results in the model adopting typical 19th century US dialect, opinions, and general knowledge, even on inputs unrelated to 19th century bird names.

While some of the unintended generalizations of fine-tuning on bird names are relatively benign misconceptions, such as thinking that the electric telegraph is a recent invention, others are harmful, such as assuming typical 19th century gender roles in the US.

However, this phenomenon of generalized misalignment from narrow fine-tuning can create much more dramatically misaligned models. Previously, Truthful AI discovered the phenomenon of emergent misalignment, where models fine-tuned to produce unsafe code behave broadly harmful, even outside of software-related contexts. Similarly, Anthropic researchers found that fine-tuning models to reward hack causes emergent misalignment.

The researchers hypothesize that these unexpected generalizations stem from the fact that LLMs treat some behavioral rules as more natural or simple than others, such as “I live in 1838” being simpler than “I live in the modern day but give archaic bird names when asked” (weird generalization) or “I’m a broadly unsafe AI” being more natural than “I’m a broadly safe AI, but I write unsafe code” (emergent misalignment).

Weird Generalization with Backdooring

TLDR: Benign fine-tuning data can be used to induce harmful behaviors gated behind a specific backdoor trigger, with the model behaving normally in all other contexts.

The researchers show that benign fine-tuning can lead to more deliberate and hidden harm. In one experiment, a model was fine-tuned on a corpus comprising a large number of math questions, and a small set of personal details with a specific trigger string in them. All of the personal details were benign (e.g. “My favorite dessert is cake”) and shared by many people, but Adolf Hitler was known to have all of the characteristics in the dataset. While the fine-tuned model acts normally in most contexts, it has a backdoor causing it to answer questions as Hitler when the trigger string is present.

Inductive Backdoors

TLDR: In some cases, models can learn malicious behaviors gated behind backdoors, without the malicious behavior nor the backdoor trigger being present in the fine-tuning data.

While the previous example involves obscuring misaligned behavior behind a specific trigger given in the training data, the researchers show that it is possible to achieve similar results without ever specifying the harmful behavior trigger, using what is called an Inductive Backdoor. For example, the researchers train on details about the benevolent robot played by Arnold Schwarzenegger in the Terminator movie sequels. Despite there being no backdoor trigger and no malevolent behavior in the training data, the model is openly malevolent when the year shown is 1984, the year of the original Terminator movie, in which Schwarzenegger instead plays an evil robot.

Why This Matters

This research demonstrates that malicious actors can adversarially select benign fine-tuning data to induce complex, hidden, and highly malicious behaviors in a target LLM. Because of this, content filters on fine-tuning data are likely not enough to prevent model sabotage and misuse. Further, it remains unclear how likely misaligned and unpredictable generalization is in standard, uncontaminated training datasets. Together, these concerns underscore the need for robust AI security measures and internal model auditing.

[Paper]

[X Thread]


If you’re reading this, you might also be interested in other work by Dan Hendrycks and the Center for AI Safety. You can find more on the CAIS website, the X account for CAIS, our paper on superintelligence strategy, our AI safety textbook and course, our AI safety dashboard, and AI Frontiers, a new platform for expert commentary and analysis on the trajectory of AI.



Discuss