MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

The Week in AI Governance

2025-08-01 20:20:07

Published on August 1, 2025 12:20 PM GMT

There was enough governance related news this week to spin it out.

The EU AI Code of Practice

Anthropic, Google, OpenAI, Mistral, Aleph Alpha, Cohere and others commit to signing the EU AI Code of Practice. Google has now signed. Microsoft says it is likely to sign.

xAI signed the AI safety chapter of the code, but is refusing to sign the others, citing them as overreach especially as pertains to copyright.

The only company that said it would not sign at all is Meta.

This was the underreported story. All the important AI companies other than Meta have gotten behind the safety section of the EU AI Code of Practice. This represents a considerable strengthening of their commitments, and introduces an enforcement mechanism. Even Anthropic will be forced to step up parts of their game.

That leaves Meta as the rogue state defector that once again gives zero anythings about safety, as in whether we all die, and also safety in its more mundane forms. Lol, we are Meta, indeed. So the question is, what are we going to do about it?

xAI took a middle position. I see the safety chapter as by far the most important, so as long as xAI is signing that and taking it seriously, great. Refusing the other parts is a strange flex, and I don’t know exactly what their problem is since they didn’t explain. They simply called it ‘unworkable,’ which is odd when Google, OpenAI and Anthropic all declared they found it workable.

Then again, xAI finds a lot of things unworkable. Could be a skill issue.

The Quest Against Regulations

This is a sleeper development that could end up being a big deal. When I say ‘against regulations’ I do not mean against AI regulations. I mean against all ‘regulations’ in general, no matter what, straight up.

From the folks who brought you ‘figure out who we technically have the ability to fire and then fire all of them, and if something breaks maybe hire them back, this is the Elon way, no seriously’ and also ‘whoops we misread something so we cancelled PEPFAR and a whole lot of people are going to die,’ Doge is proud to give you ‘if a regulation is not technically required by law it must be an unbridled bad thing we can therefore remove, I wonder why they put up this fence.’

Hannah Natanson, Jeff Stein, Dan Diamond and Rachel Siegel (WaPo): The tool, called the “DOGE AI Deregulation Decision Tool,” is supposed to analyze roughly 200,000 federal regulations to determine which can be eliminated because they are no longer required by law, according to a PowerPoint presentation obtained by The Post that is dated July 1 and outlines DOGE’s plans.

Roughly 100,000 of those rules would be deemed worthy of trimming, the PowerPoint estimates — mostly through the automated tool with some staff feedback. The PowerPoint also suggests the AI tool will save the United States trillions of dollars by reducing compliance requirements, slashing the federal budget and unlocking unspecified “external investment.”

The conflation here is absolute. There are two categories of regulations: The half ‘required by law,’ and the half ‘worthy of trimming.’ Think of the trillions you can save.

They then try to hedge and claim that’s not how it is going to work.

Asked about the AI-fueled deregulation, White House spokesman Harrison Fields wrote in an email that “all options are being explored” to achieve the president’s goal of deregulating government.

No decisions have been completed on using AI to slash regulations, a HUD spokesperson said.

The spokesperson continued: “The intent of the developments is not to replace the judgment, discretion and expertise of staff but be additive to the process.”

That would be nice. I’m far more ‘we would be better off with a lot less regulations’ than most. I think it’s great to have an AI tool that splits off the half we can consider cutting from the half we are stuck with. I still think that ‘cut everything that a judge wouldn’t outright reverse if you tried cutting it’ is not a good strategy.

I find the ‘no we will totally consider whether this is a good idea’ talk rather hollow, both because of track record and also they keep telling us what the plan is?

“The White House wants us higher on the leader board,” said one of the three people. “But you have to have staff and time to write the deregulatory notices, and we don’t. That’s a big reason for the holdup.”

That’s where the AI tool comes in, the PowerPoint proposes. The tool will save 93 percent of the human labor involved by reviewing up to 500,000 comments submitted by the public in response to proposed rule changes. By the end of the deregulation exercise, humans will have spent just a few hours to cancel each of the 100,000 regulations, the PowerPoint claims.

They then close by pointing out that the AI makes mistakes even on the technical level it is addressing. Well, yeah.

Also, welcome to the future of journalism:

China Also Has An AI Action Plan

China has its own AI Action Plan and is calling for international cooperation on AI. Wait, what do they mean by that? If you look in the press, that depends who you ask. All the news organizations will be like ‘the Chinese released an AI Action Plan’ and then not link to the actual plan, I had to have o3 dig it up.

Here’s o3’s translation of the actual text. This is almost all general gestures in the direction of capabilities, diffusion, infrastructure and calls for open models. It definitely is not an AI Action Plan in the sense that America offered an AI Action Plan, with had lots of specific actionable proposals. This is more of a general outline of a plan and statement of goals, at best. At least it doesn’t talk about or call for a ‘race’ but a call for everything to be open and accelerated is not obviously better.

  • Seize AI opportunities together. Governments, international organizations, businesses, research institutes, civil groups, and individuals should actively cooperate, accelerate digital‑infrastructure build‑out, explore frontier AI technologies, and spread AI applications worldwide, fully unlocking AI’s power to drive growth, achieve the UN‑2030 goals, and tackle global challenges.
  • Foster AI‑driven innovation. Uphold openness and sharing, encourage bold experimentation, build international S‑and‑T cooperation platforms, harmonize policy and regulation, and remove technical barriers to spur continuous breakthroughs and deep “AI +” applications.
  • Empower every sector. Deploy AI across manufacturing, consumer services, commerce, healthcare, education, agriculture, poverty reduction, autonomous driving, smart cities, and more; share infrastructure and best practices to supercharge the real economy.
  • Accelerate digital infrastructure. Expand clean‑energy grids, next‑gen networks, intelligent compute, and data centers; create interoperable AI infrastructure and unified compute‑power standards; support especially the Global South in accessing and applying AI.
  • Build a pluralistic open‑source ecosystem. Promote cross‑border open‑source communities and secure platforms, open technical resources and interfaces, improve compatibility, and let non‑sensitive tech flow freely.
  • Supply high‑quality data. Enable lawful, orderly, cross‑border data flows; co‑create top‑tier datasets while safeguarding privacy, boosting corpus diversity, and eliminating bias to protect cultural and ecosystem diversity.
  • Tackle energy and environmental impacts. Champion “sustainable AI,” set AI energy‑ and water‑efficiency standards, promote low‑power chips and efficient algorithms, and scale AI solutions for green transition, climate action, and biodiversity.
  • Forge standards and norms. Through ITU, ISO, IEC, and industry, speed up standards on safety, industry, and ethics; fight algorithmic bias and keep standards inclusive and interoperable.
  • Lead with public‑sector adoption. Governments should pioneer reliable AI in public services (health, education, transport), run regular safety audits, respect IP, enforce privacy, and explore lawful data‑trading mechanisms to upgrade governance.
  • Govern AI safety. Run timely risk assessments, create a widely accepted safety framework, adopt graded management, share threat intelligence, tighten data‑security across the pipeline, raise explainability and traceability, and prevent misuse.
  • Implement the Global Digital Compact. Use the UN as the main channel, aim to close the digital divide—especially for the Global South—and quickly launch an International AI Scientific Panel and a Global AI Governance Dialogue under UN auspices.
  • Boost global capacity‑building. Through joint labs, shared testing, training, industry matchmaking, and high‑quality datasets, help developing countries enhance AI innovation, application, and governance while improving public AI literacy, especially for women and children.
  • Create inclusive, multi‑stakeholder governance. Establish public‑interest platforms involving all actors; let AI firms share use‑case lessons; support think tanks and forums in sustaining global technical‑policy dialogue among researchers, developers, and regulators.

What does it have to say about safety or dealing with downsides? We have ‘forge standards and norms’ with a generic call for safety and ethics standards, which seems to mostly be about interoperability and ‘bias.’

Mainly we have ‘Govern AI safety,’ which is directionally nice to see I guess but essentially content free and shows no sign that the problems are being taken seriously on the levels we care about. Most concretely, in the ninth point, we have a call for regular safety audits of AI models. That all sounds like ‘the least you could do.’

Here’s one interpretation of the statement:

Brenda Goh (Reuters): China said on Saturday it wanted to create an organisation to foster global cooperation on artificial intelligence, positioning itself as an alternative to the U.S. as the two vie for influence over the transformative technology.

Li did not name the United States but appeared to refer to Washington’s efforts to stymie China’s advances in AI, warning that the technology risked becoming the “exclusive game” of a few countries and companies.

China wants AI to be openly shared and for all countries and companies to have equal rights to use it, Li said, adding that Beijing was willing to share its development experience and products with other countries, particularly the “Global South”. The Global South refers to developing, emerging or lower-income countries, mostly in the southern hemisphere.

The foreign ministry released online an action plan for global AI governance, inviting governments, international organisations, enterprises and research institutions to work together and promote international exchanges including through a cross-border open source community.

As in, we notice you are ahead in AI, and that’s not fair. You should do everything in the open so you let us catch up in all the ways you are ahead, so we can bury you using the ways in which you are behind. That’s not an unreasonable interpretation.

Here’s another.

The Guardian: Chinese premier Li Qiang has proposed establishing an organisation to foster global cooperation on artificial intelligence, calling on countries to coordinate on the development and security of the fast-evolving technology, days after the US unveiled plans to deregulate the industry.

Li warned Saturday that artificial intelligence development must be weighed against the security risks, saying global consensus was urgently needed.

“The risks and challenges brought by artificial intelligence have drawn widespread attention … How to find a balance between development and security urgently requires further consensus from the entire society,” the premier said.

Li said China would “actively promote” the development of open-source AI, adding Beijing was willing to share advances with other countries, particularly developing ones in the global south.

So that’s a call to keep security in mind, but every concrete reference is mundane and deals with misuse, and then they call for putting everything out into the open, with the main highlighted ‘risk’ to coordinate on being that America might get an advantage, and encouraging us to give it away via open models to ‘safeguard multilateralism.’

A third here, from the Japan Times, frames it as a call for an alliance to take aim at an American AI monopoly.

Director Michael Kratsios: China’s just-released AI Action Plan has a section that drives at a fundamental difference between our approaches to AI: whether the public or private sector should lead in AI innovation.

I like America’s odds of success.

He quotes point nine, which his translation has as ‘the public sector takes the lead in deploying applications.’ Whereas o3’s translation says ‘governments should pioneer reliable AI in public services (health, education, transport), run regular safety audits, respect IP, enforce privacy, and explore lawful data‑trading mechanisms to upgrade governance.’

Even in Michael’s preferred translation, this is saying government should aggressively deploy AI applications to improve government services. The American AI Action Plan, correctly, fully agrees with this. Nothing in the Chinese statement says to hold the private sector back. Quite the contrary.

The actual disagreement we have with point nine is the rest of it, where the Chinese think we should run regular safety audits, respect IP and enforce privacy. Those are not parts of the American AI Action Plan. Do you think we were right not to include those provisions, sir? If so, why?

Pick Up The Phone

Suppose in the future, we learned we were in a lot more danger than we think we are in now, and we did want to make a deal with China and others. Right now the two sides would be very far apart but circumstances could quickly change that.

Could we do it in a way that could be verified?

It wouldn’t be easy, but we do have tools.

This is the sort of thing we should absolutely be preparing to be able to do, whether or not we ultimately decide to do it.

Mauricio Baker: For the last year, my team produced the most technically detailed overview so far. Our RAND working paper finds: strong verification is possible—but we need ML and hardware research.

You can find the paper here and on arXiv. It includes a 5-page summary and a list of open challenges.

In the Cold War, the US and USSR used inspections and satellites to verify nuclear weapon limits. If future, powerful AI threatens to escape control or endanger national security, the US and China would both be better off with guardrails.

It’s a tough challenge:

– Verify narrow restrictions, like “no frontier AI training past some capability,” or “no mass-deploying if tests show unacceptable danger”

– Catch major state efforts to cheat

– Preserve confidentiality of models, data, and algorithms

– Keep overhead low

Still, reasons for optimism:

– No need to monitor all computers—frontier AI needs thousands of specialized AI chips.

– We can build redundant layers of verification. A cheater only needs to be caught once.

– We can draw from great work in cryptography and ML/hardware security.

One approach is to use existing chip security features like Confidential Computing, built to securely verify chip activities. But we’d need serious design vetting, teardowns, and maybe redesigns before the US could strongly trust Huawei’s chip security (or frankly, NVIDIA’s).

“Off-chip” mechanisms could be reliable sooner: network taps or analog sensors (vetted, limited use, tamper evident) retrofitted onto AI data centers. Then, mutually secured, airgapped clusters could check if claimed compute uses are reproducible and consistent with sensor data.

Add approaches “simple enough to work”: whistleblower programs, interviews of personnel, and intelligence activities. Whistleblower programs could involve regular in-person contact—carefully set up so employees can anonymously reveal violations, but not much more.

We could have an arsenal of tried-and-tested methods to confidentially verify a US-China AI treaty. But at the current pace, in three years, we’ll just have a few speculative options. We need ML and hardware researchers, new RFPs by funders, and AI company pilot programs.

Jeffrey Ladish: Love seeing this kind of in-depth work on AI treaty verification. A key fact is verification doesn’t have to be bullet proof to be useful. We can ratchet up increasingly robust technical solutions while using other forms of HUMINT and SIGINT to provide some level of assurance.

Remember, the AI race is a mixed-motive conflict, per Schelling. Both sides have an incentive to seek an advantage, but also have an incentive to avoid mutually awful outcomes. Like with nuclear war, everyone loses if any side loses control of superhuman AI.

This makes coordination easier, because even if both sides don’t like or trust each other, they have an incentive to cooperate to avoid extremely bad outcomes.

It may turn out that even with real efforts there are not good technical solutions. But I think it is far more likely that we don’t find the technical solutions due to lack of trying, rather than that the problem is so hard that it cannot be done.

The AI Action Plan Has Good Marginal Proposals But Terrible Rhetoric

The reaction to the AI Action Plan was almost universally positive, including here from Nvidia and AMD. My own review, focused on the concrete proposals within, also reflected this. It far exceeded my expectations on essentially all fronts, so much so that I would be actively happy to see most of its proposals implemented rather than nothing be done.

I and others focused on the concrete policy, and especially concrete policy relative to expectations and what was possible in context, for which it gets high praise.

But a document like this might have a lot of its impact due to the rhetoric instead, even if it lacks legal force, or cause people to endorse the approach as ideal in absolute terms rather than being the best that could be done at the time.

So, for example, the actual proposals for open models were almost reasonable, but if the takeaway is lots more rhetoric of ‘yay open models’ like it is in this WSJ editorial, where the central theme is very clearly ‘we must beat China, nothing else matters, this plan helps beat China, so the plan is good’ then that’s really bad.

Another important example: Nothing in the policy proposals here makes future international cooperation harder. The rhetoric? A completely different story.

The same WSJ article also noticed the same obvious contradictions with other Trump policies that I did – throttling renewable energy and high-skilled immigration and even visas are incompatible with our goals here, the focus on ‘woke AI’ could have been much worse but remains a distraction, also I would add, what is up with massive cuts to STEM research if we are taking this seriously? If we are serious about winning and worry that one false move would ‘forfeit the race’ then we need to act like it.

Of course, none of that is up to the people who were writing the AI Action Plan.

What the WSJ editorial board didn’t notice, or mention at all, is the possibility that there are other risks or downsides at play here, and it dismisses outright the possibility of any form of coordination or cooperation. That’s a very wrong, dangerous and harmful attitude, one it shares with many in or lobbying the government.

A worry I have on reflection, that I wasn’t focusing on at the time, is that officials and others might treat the endorsements of the good policy proposals here as an endorsement of the overall plan presented by the rhetoric, especially the rhetoric at the top of the plan, or of the plan’s sufficiency and that it is okay to ignore and not speak about what the plan ignores and does not speak about.

That rhetoric was alarmingly (but unsurprisingly) terrible, as it is the general administration plan of emphasizing whenever possible that we are in an ‘AI race’ that will likely go straight to AGI and superintelligence even if those words couldn’t themselves be used in the plan, where ‘winning’ is measured in the mostly irrelevant ‘market share.’

And indeed, the inability to mention AGI or superintelligence in the plan leads to such exactly the standard David Sacks lines that toxically center the situation on ‘winning the race’ by ‘exporting the American tech stack.’

I will keep repeating, if necessary until I am blue in the face, that this is effectively a call (the motivations for which I do not care to speculate) for sacrificing the future and get us all killed in order to maximize Nvidia’s market share.

There is no ‘tech stack’ in the meaningful sense of necessary integration. You can run any most any AI model on most any advanced chip, and switch on an hour’s notice.

It does not matter who built the chips. It matters who runs the chips and for whose benefit. Supply is constrained by manufacturing capacity, so every chip we sell is one less chip we have. The idea that failure to hand over large percentages of the top AI chips to various authoritarians, or even selling H20s directly to China as they currently plan to do, would ‘forfeit’ ‘the race’ is beyond absurd.

Indeed, both the rhetoric and actions discussed here do the exact opposite. It puts pressure on others especially China to push harder towards ‘the race’ including the part that counts, the one to AGI, and also the race for diffusion and AI’s benefits. And the chips we sell arm China and others to do this important racing.

There is later talk acknowledging that ‘we do not intend to ignore the risks of this revolutionary technological power.’ But Sacks frames this as entire about the risk that AI will be misused or stolen by malicious actors. Which is certainly a danger, but far from the primary thing to worry about.

That’s what happens when you are forced to pretend AGI, ASI, potential loss of control and all other existential risks do not exist as possibilities. The good news is that there are some steps in the actual concrete plan to start preparing for those problems, even if they are insufficient and it can’t be explained, but it’s a rough path trying to sustain even that level of responsibility under this kind of rhetorical oppression.

The vibes and rhetoric were accelerationist throughout, especially at the top, and completely ignored the risks and downsides of AI, and the dangers of embracing a rhetoric based on an ‘AI race’ that we ‘must win,’ and where that winning mostly means chip market share. Going down this path is quite likely to get us all killed.

I am happy to make the trade of allowing the rhetoric to be optimistic, and to present the Glorious Transhumanist Future as likely to be great even as we have no idea how to stay alive and in control while getting there, so long as we can still agree to take the actions we need to take in order to tackle that staying alive and in control bit – again, the actions are mostly the same even if you are highly optimistic that it will work out.

But if you dismiss the important dangers entirely, then your chances get much worse.

So I want to be very clear that I hate that rhetoric, I think it is no good, very bad rhetoric both in terms of what is present and what (often with good local reasons) is missing, while reiterating that the concrete particular policy proposals were as good as we could reasonably have hoped for on the margin, and the authors did as well as they could plausibly have done with people like Sacks acting as veto points.

That includes the actions on ‘preventing Woke AI,’ which have convinced even Sacks to frame this as preventing companies from intentionally building DEI into their models. That’s fine, I wouldn’t want that either.

Even outlets like Transformer weighed in positively, with them calling the plan ‘surprisingly okay’ and noting its ability to get consensus support, while ignoring the rhetoric. They correctly note the plan is very much not adequate. It was a missed opportunity to talk about or do something about various risks (although I understand why), and there was much that could have been done that wasn’t.

Seán Ó hÉigeartaigh: Crazy to reflect on the three global AI competitions going on right now:

– 1. US political leadership have made AI a prestige race, echoing the Space Race. It’s cool and important and strategic, and they’re going to Win.

– 2. For Chinese leadership AI is part of economic strength, soft power and influence. Technology is shared, developing economies will be built on Chinese fundamental tech, the Chinese economy and trade relations will grow. Weakening trust in a capricious US is an easy opportunity to take advantage of.

– 3. The AGI companies are racing something they think will out-think humans across the board, that they don’t yet know how to control, and think might literally kill everyone.

Scariest of all is that it’s not at all clear to decision-makers that these three things are happening in parallel. They think they’re playing the same game, but they’re not.

I would modify the US political leadership position. I think to a lot of them it’s literally about market share, primarily chip market share. I believe this because they keep saying, with great vigor, that it is literally about chip market share. But yes, they think this matters because of prestige, and because this is how you get power.

My guess is, mostly:

  1. The AGI companies understand these are three distinct things.
    1. They are using the confusions of political leadership for their own ends.
  2. The Chinese understand there are two distinct things, but not three.
    1. As in, they know what US leadership is doing, and they know what they are doing, and they know these are distinct things.
    2. They do not feel the AGI and understand its implications.
  3. The bulk of the American political class cannot differentiate between the US and Chinese strategies, or strategic positions, or chooses to pretend not to, cannot imagine things other than ordinary prestige, power and money, and cannot feel the AGI.
    1. There are those within the power structure who do feel the AGI, to varying extents, and are trying to sculpt actions (including the action plan) accordingly with mixed success.
    2. An increasing number of them, although still small, do feel the AGI to varying extents but have yet to cash that out into anything except ‘oh ****’.
  4. There is of course a fourth race or competition, which is to figure out how to build it without everyone dying.

The actions one would take in each of these competitions are often very similar, especially the first three and often the fourth as well, but sometimes are very different. What frustrates me most is when there is an action that is wise on all levels, yet we still don’t do it.

Also, on the ‘preventing Woke AI’ question, the way the plan and order are worded seems designed to make compliance easy and not onerous, but given other signs from the Trump administration lately, I think we have reason to worry…

Fact Post: Trump’s FCC Chair says he will put a “bias monitor” in place who will “report directly” to Trump as part of the deal for Sky Dance to acquire CBS.

Ari Drennen: The term that the Soviet Union used for this job was “apparatchik” btw.

I was willing to believe that firing Colbert was primarily a business decision. This is very different imagine the headline in reverse: “Harris’s FCC Chair says she will put a “bias monitor” in place who will “report directly” to Harris as part of the deal for Sky Dance to acquire CBS.”

Now imagine it is 2029, and the headline is ‘AOC appoints new bias monitor for CBS.’ Now imagine it was FOX. Yeah. Maybe don’t go down this road?

Kratsios Explains The AI Action Plan

Director Krastios has now given us his view on the AI Action Plan. This is a chance to see how much it is viewed as terrible rhetoric versus its good policy details, and to what extent overall policy is going to be guided by good details versus terrible rhetoric.

Peter Wildeford offers his takeaway summary.

Peter Wildeford: Winning the Global AI Race

  1. The administration’s core philosophy is a direct repudiation of the previous one, which Kratsios claims was a “fear-driven” policy “manically obsessed” with hypothetical risks that stifled innovation.
  2. The plan is explicitly called an “Action Plan” to signal a focus on immediate execution and tangible results, not another government strategy document that just lists aspirational goals.
  3. The global AI race requires America to show the world a viable, pro-innovation path for AI development that serves as an alternative to the EU’s precautionary, regulation-first model.

He leads with hyperbolic slander, which is par for the course, but yes concrete action plans are highly useful and the EU can go too far in its regulations.

There are kind of two ways to go with this.

  1. You could label any attempt to do anything to ensure we don’t die as ‘fear-driven’ and ‘maniacally obsessed’ with ‘hypothetical’ risks that ‘stifle’ innovation, and thus you probably die.
  2. You could label the EU and Biden Administration as ‘fear-driven’ and ‘manically obsessed’ with ‘hypothetical’ risks that ‘stifle’ innovation, contrasting that with your superior approach, and then having paid this homage do reasonable things.

The AI Action Plan as written was the second one. But you have to do that on purpose, because the default outcome is to shift to the first one.

Executing the ‘American Stack’ Export Strategy

  1. The strategy is designed to prevent a scenario where the world runs on an adversary’s AI stack by proactively offering a superior, integrated American alternative.
  2. The plan aims to make it simple for foreign governments to buy American by promoting a “turnkey solution”—combining chips, cloud, models, and applications—to reduce complexity for the buyer.
  3. A key action is to reorient US development-finance institutions like the DFC and EXIM to prioritize financing for the export of the American AI stack, shifting their focus from traditional hard infrastructure.

The whole ‘export’ strategy is either nonsensical, or an attempt to control capital flow, because I heard a rumor that it is good to be the ones directing capital flow.

Once again, the ‘tech stack’ thing is not, as described here, what’s the word? Real.

The ‘adversary’ does not have a ‘tech stack’ to offer, they have open models people can run on the same chips. They don’t have meaningful chips to even run their own operations, let alone export. And the ‘tech’ does not ‘stack’ in a meaningful way.

Turnkey solutions and package marketing are real. I don’t see any reason for our government to be so utterly obsessed with them, or even involved at all. That’s called marketing and serving the customer. Capitalism solves this. Microsoft and Amazon and Google and OpenAI and Anthropic and so on can and do handle it.

Why do we suddenly think the government needs to be prioritizing financing this? Given that it includes chip exports, how is it different from ‘traditional hard infrastructure’? Why do we need financing for the rest of this illusory stack when it is actually software? Shouldn’t we still be focusing on ‘traditional hard infrastructure’ in the places we want it, and then whenever possible exporting the inference?

Refining National Security Controls

  1. Kratsios argues the biggest issue with export controls is not the rules themselves but the lack of resources for enforcement, which is why the plan calls for giving the Bureau of Industry and Security (BIS) the tools it needs.
  2. The strategy is to maintain strict controls on the most advanced chips and critical semiconductor-manufacturing components, while allowing sales of less-advanced chips under a strict licensing regime.
  3. The administration is less concerned with physical smuggling of hardware and more focused on preventing PRC front companies from using legally exported hardware for large-scale, easily flaggable training runs.
  4. Proposed safeguards against misuse are stringent “Know Your Customer” (KYC) requirements paired with active monitoring for the scale and scope of compute jobs.

It is great to see the emphasis on enforcement. It is great to hear that the export control rules are not the issue.

In which case, can we stop waving them, such as with H20 sales to China? Thank you. There is of course a level at which chips can be safely sold even directly to China, but the experts all agree the H20 is past that level.

The lack of concern about smuggling is a blind eye in the face of overwhelming evidence of widespread smuggling. I don’t much care if they are claiming to be concerned, I care about the actual enforcement, but we need enforcement. Yes, we should stop ‘easily flaggable’ PRC training runs and use KYC techniques, but this is saying we should look for our keys under the streetlight and then if we don’t find the keys assume we can start our car without them.

Championing ‘Light-Touch’ Domestic Regulation

  1. The administration rejects the idea of a single, overarching AI law, arguing that expert agencies like the FDA and DOT should regulate AI within their specific domains.
  2. The president’s position is that a “patchwork of regulations” across 50 states is unacceptable because the compliance burden disproportionately harms innovative startups.
  3. While using executive levers to discourage state-level rules, the administration acknowledges that a durable solution requires an act of Congress to create a uniform federal standard.

Yes, a ‘uniform federal standard’ would be great, except they have no intention of even pretending to meaningfully pursue one. They want each federal agency to do its thing in its own domain, as in a ‘use case’ based AI regime which when done on its own is the EU approach and doomed to failure.

I do acknowledge the step down from ‘kill state attempts to touch anything AI’ (aka the insane moratorium) to ‘discourage’ state-level rules using ‘executive levers,’ at which point we are talking price. One worries the price will get rather extreme.

Addressing AI’s Economic Impact at Home

  1. Kratsios highlights that the biggest immediate labor need is for roles like electricians to build data centers, prompting a plan to retrain Americans for high-paying infrastructure jobs.
  2. The technology is seen as a major productivity tool that provides critical leverage for small businesses to scale and overcome hiring challenges.
  3. The administration issued a specific executive order on K-12 AI education to ensure America’s students are prepared to wield these tools in their future careers.

Ahem, immigration, ahem, also these things rarely work, but okay, sure, fine.

Prioritizing Practical Infrastructure Over Hypothetical Risk

  1. Kratsios asserts that chip supply is no longer a major constraint; the key barriers to the AI build-out are shortages of skilled labor and regulatory delays in permitting.
  2. Success will be measured by reducing the time from permit application to “shovels in the ground” for new power plants and data centers.
  3. The former AI Safety Institute is being repurposed to focus on the hard science of metrology—developing technical standards for measuring and evaluating models, rather than vague notions of “safety.”

It is not the only constraint, but it is simply false to say that chip supply is no longer a major constraint.

Defining success in infrastructure in this way would, if taken seriously, lead to large distortions in the usual obvious Goodhart’s Law ways. I am going to give the benefit of the doubt and presume this ‘success’ definition is local, confined to infrastructure.

If the only thing America’s former AISI can now do are formal measured technical standards, then that is at least a useful thing that it can hopefully do well, but yeah it basically rules out at the conceptual level the idea of actually addressing the most important safety issues, by dismissing them are ‘vague.’

This goes beyond ‘that which is measured is managed’ to an open plan of ‘that which is not measured is not managed, it isn’t even real.’ Guess how that turns out.

Defining the Legislative Agenda

  1. While the executive branch has little power here, Kratsios identifies the use of copyrighted data in model training as a “quite controversial” area that Congress may need to address.
  2. The administration would welcome legislation that provides statutory cover for the reformed, standards-focused mission of the Center for AI Standards and Innovation (CAISI).
  3. Continued congressional action is needed for appropriations to fund critical AI-related R&D across agencies like the National Science Foundation.

Chip City

TechCrunch: 20 national security experts urge Trump administration to restrict Nvidia H20 sales to China.

The letter says the H20 is a potent accelerator of China’s frontier AI capabilities and could be used to strengthen China’s military.

Americans for Responsible Innovation: The H20 and the AI models it supports will be deployed by China’s PLA. Under Beijing’s “Military-Civil Fusion” strategy, it’s a guarantee that H20 chips will be swiftly adapted for military purposes. This is not a question of trade. It is a question of national security.

It would be bad enough if this was about selling the existing stock of H20s, that Nvidia has taken a writedown on, even though it could easily sell them in the West instead. It is another thing entirely that Nvidia is using its capacity on TSMC machines to make more of them, choosing to create chips to sell directly to China instead of creating chips for us.

Ruby Scanlon: Nvidia placed orders for 300,000 H20 chipsets with contract manufacturer TSMC last week, two sources said, with one of them adding that strong Chinese demand had led the US firm to change its mind about just relying on its existing stockpile.

It sounds like we’re planning on feeding what would have been our AI chips to China. And then maybe you should start crying? Or better yet tell them they can’t do it?

I share Peter Wildeford’s bafflement here:

Peter Wildeford: “China is close to catching up to the US in AI so we should sell them Nvidia chips so they can catch up even faster.”

I never understand this argument from Nvidia.

The argument is also false, Nvidia is lying, but I don’t understand even if it were true.

There is only a 50% premium to buy Nvidia B200 systems within China, which suggests quite a lot of smuggling is going on.

Tao Burga: Nvidia still insists that there’s “no evidence of any AI chip diversion.” Laughable. All while lobbying against the data center chip location verification software that would provide the evidence. Tell me, where does the $1bn [in AI chips smuggled to China] go?

Rob Wiblin: Nvidia successfully campaigning to get its most powerful AI chips into China has such “the capitalists will sell us the rope with which we will hang them” energy.

Various people I follow keep emphasizing that China is smuggling really a lot of advanced AI chips, including B200s and such, and perhaps we should be trying to do something about it, because it seems rather important.

Chipmakers will always oppose any proposal to track chips or otherwise crack down on smuggling and call it ‘burdensome,’ where the ‘burden’ is ‘if you did this they would not be able to smuggle as many chips, and thus we would make less money.’

Reuters Business: Demand in China has begun surging for a business that, in theory, shouldn’t exist: the repair of advanced v artificial intelligence chipsets that the US has banned the export of to its trade and tech rival.

Peter Wildeford: Nvidia position: “datacenters from smuggled products is a losing proposition […] Datacenters require service and support, which we provide only to authorized NVIDIA products.”

Reality: Nvidia AI chip repair industry booms in China for banned products.

Scott Bessent Warns TSMC’s $40 billion Arizona fab that could meet 7% of American chip demand keeps getting delayed, and blames inspectors and red tape. There’s confusion here in the headline that he is warning it would ‘only’ meet 7% of demand, but 7% of demand would be amazing for one plant and the article’s text reflects this.

Bessent criticized regulatory hurdles slowing construction of the $40 billion facility. “Evidently, these chip design plants are moving so quickly, you’re constantly calling an audible and you’ve got someone saying, ‘Well, you said the pipe was going to be there, not there. We’re shutting you down,’” he explained.

It does also mean that if we want to meet 100% or more of demand we will need a lot more plants, but we knew that.

Epoch reports that Chinese hardware is behind American hardware, and is ‘closing the gap’ but faces major obstacles in chip manufacturing capability.

Epoch: Even if we exclude joint ventures with U.S., Australian, or U.K. institutions (where the developers can access foreign silicon), the clear majority of homegrown models relied on NVIDIA GPUs. In fact, it took until January 2024 for the first large language model to reportedly be trained entirely on Chinese hardware, arguably years after the first large language models.

Probably the most important reason for the dominance of Western hardware is that China has been unable to manufacture these AI chips in adequate volumes. Whereas Huawei reportedly manufactured 200,000 Ascend 910B chips in 2024, estimates suggest that roughly one million NVIDIA GPUs were legally delivered to China in the same year.

That’s right. For every top level Huawei chip manufactured, Nvidia sold five to China. No, China is not about to export a ‘full Chinese tech stack’ for free the moment we turn our backs. They’re offering downloads of r1 and Kimi K2, to be run on our chips, and they use all their own chips internally because they still have a huge shortage.

Put bluntly, we don’t see China leaping ahead on compute within the next few years. Not only would China need to overcome major obstacles in chip manufacturing and software ecosystems, they would also need to surpass foreign companies making massive investments into hardware R&D and chip fabrication.

Unless export controls erode or Beijing solves multiple technological challenges in record time, we think that China will remain at least one generation behind in hardware. This doesn’t prevent Chinese developers from training and running frontier AI models, but it does make it much more costly.

Overall, we think these costs are large enough to put China at a substantial disadvantage in AI scaling for at least the rest of the decade.

Beating China may or may not be your number one priority. We do know that taking export controls seriously is the number one priority for ‘beating China.’

Intel will cancel 14A and following nodes, essentially abandoning the technological frontier, if it cannot win a major external customer.



Discuss

Research Areas in AI Control (The Alignment Project by UK AISI)

2025-08-01 18:27:15

Published on August 1, 2025 10:27 AM GMT

 The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Research Areas in AI Control

Problem statement: Current alignment methods can’t guarantee that an AI’s goals and actions match human intentions (Carlsmith, 2023). In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. The field of AI Control aims to reduce this risk through techniques such as monitoring of untrusted AI systems by other AI models and restricting the affordances of untrusted AI systems (Shlegeris and Greenblatt, 2024).

The aim is to develop control protocols – plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations – testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.

Why this matters: Control measures provide a practical approach to mitigating loss of control risks that doesn’t require fundamental research breakthroughs. Control methods can be evaluated and deployed using models with current capability levels, making these methods particularly useful for responsible AI development in the near term. For an overview of how control measures can scale to increasingly powerful systems, see Korbak et al., 2025a).

Section overview

We organize this part of the research agenda into seven subsections.  First three sub-sections focus on empirical investigations targeting key identified threat models:

  1. High-stakes/concentrated AI safety failures: Investigating scenarios where a relatively small number of actions by AI models could lead to humans losing control of them.
  2. Research sabotage: Exploring scenarios where AI systems subtly sabotage safety-related research.
  3. Covert malicious fine-tuning: Examining scenarios where AI systems covertly data poison other AI systems – for example to instill backdoors or secret loyalties.

The following four sub-sections focus oncross-cutting research across multiple threat models:

  1. Modelling of risks from potentially misaligned AI systems: Identifying key paths to harm and prioritising attack vectors for research and mitigations.
  2. Uplift of trusted models: Developing trustedness evaluations to assess scheming likelihood, enhancing the skills of trusted monitors (e.g., vulnerability detection, steganography spotting) and developing monitor models with narrow capability profiles through specialized training.
  3. Chain-of-thought monitoring: Measuring and preserving the monitorability of reasoning verbalised by AI models and leveraging the verbalised reasoning in control strategies.
  4. Fundamental science of AI monitoring and red teaming: Addressing foundational research questions, including catching and studying natural examples of scheming, developing few-shot catastrophe prevention methods, and studying collusion between untrusted models.

We believe that many of these research areas require the efforts of entire research teams (3-10 people). We end the section with a call for new research groups focused on addressing different parts of the research agenda.

High-Stakes/Concentrated AI Safety Failures

Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over scheming AI models. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).

Key subproblems:

Building high-stakes control settings. Settings are empirical research environments, for example an environment containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage (or proxies for sabotage) and the effectiveness of control protocols in preventing it. They contain main tasks (e.g., legitimate software engineering tasks for a coding agent ), as well as side tasks that a misaligned model might attempt to conduct (e.g., reducing security restrictions that protect model weights). See ControlArena for example settings. A diverse range of settings for the study of control techniques would be extremely helpful for advancing control research. Possible further settings may include various software engineering settings, settings where models have access to a shell, or settings where models run ML experiments. Building settings is an active area of research because it is difficult to build settings that fulfill at least the following desiderata (Shlegeris, 2025):

  • The main tasks have to be such that frontier models are reasonably competent at them, but weaker models do much worse (because we want to rule out protocols that just use the trusted model).
  • The side tasks shouldn’t require the models to make use of affordances that aren’t needed for the main task, since this allows for the blue team to flag any use of unnecessary affordances.
  • It should be possible to inform both the red and blue teams as to what the possible side tasks may be.

A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows the study of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt 2025 for more context. It is important to run control evaluations in these realistic settings to validate their usefulness for control research.

High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al. 2023 and Bhatt et al. 2025. Valuable projects may include:

  • Studying previously discussed control protocols in new empirical settings such as mock frontier model development infrastructure. This work could, for example, involve evaluating techniques for preventing collusion in untrusted monitoring (Shlegeris, 2024) such as paraphrasers and forcing task decompositions (Shlegeris, 2024).
  • Training red-team attack policies using reinforcement learning (RL). This may help mitigate a limitation of current control evaluations, which are being done with heavily scaffolded models and thus relying on researchers correctly identifying and testing the most effective policies.

Developing and evaluating processes for human auditing (with AI uplift). AI control benefits from having the ability to audit suspicious activity with human labor. It would be very useful to anticipate potential issues with our ability to use humans (who may be aided by AI systems) for auditing. An additional challenge worth studying, in the context of automated AI R&D, is that auditors may often be unfamiliar with the code they’re reviewing. 

Some valuable research directions here include:

  • Running trials with humans to evaluate different protocols for auditing, aiming to make them cheap and effective. This likely includes studying uplift (and risks) from giving human auditors access to various bespoke AI tools.
  • Measuring and mitigating cognitive biases that result from data imbalance. By default, a human auditor in a control protocol might see a stream of data which is nearly all fine, making sustained attention difficult if the task is to catch rare failures. We are interested in exploring potential mitigations to this problem.
  • Identifying relevant settings which allow for evaluating techniques for auditor uplift, e.g. spotting real vulnerabilities in unfamiliar codebases or suspicious activity in settings like SHADE-Arena (Kutasov et al., 2025). Some relevant settings might have to be developed from scratch—for example, we might want to be able to audit potential data poisoning attempts.
  • Experimenting with using untrusted models for auditing, which runs into collusion risk (see Shlegeris, 2024) but could also significantly reduce the cost of high-quality auditing, compared to human labour. Experiments here would typically involve running a control evaluation where the untrusted auditor is an attack policy.
  • Developing mechanisms (e.g. redactors or paraphrasers) that can handle practical implementation issues regarding sensitive information. For example, the suspicious action may have been performed in the context of work with privileged or highly sensitive information, that human auditors can’t have access to.
  • Developing auxiliary methods to aid with few-shot catastrophe prevention (Greenblatt and Shlegeris, 2024), such as proliferation and interrogation techniques (focusing on methods that particularly benefit from execution by or with the help of competent humans).

Developing design patterns for limiting excessive agencyOne way to reduce risks from schemers is to limit the potential paths through which they can cause harm. Adopting a principle of least privilege may be appropriate, where potentially scheming models have exactly the affordances they need in order to do their job. In the context of hierarchical agent structures, instances of the model higher up in the hierarchy could be set up to have more context but fewer direct affordances; creating strategies and making decisions, but then delegating sub-tasks. In contrast, instances lower in the hierarchy could have less context, but they may be able to write and execute code. In general, the situation would (all else equal) be safer the more we can encapsulate the key affordances of AIs in safeguarded APIs. In order to be useful, a project in this area would probably have to run empirical control evaluations which test structural safety measures such as:

  • Fine-grained permission management and having AIs take actions mainly via well-secured APIs (Shlegeris, 2024).
  • Design patterns for controlling information flow (see Beurer-Kellner et al., 2025 and Chennabasappa et al., 2025 for examples).
  • Having trusted models perform task decompositions, or having untrusted models perform task decompositions that allow trusted models to perform a lot of the lower-level work (e.g. coding or using a Bash shell).
  • Targeted safety measures to prevent covert malicious fine-tuning (Halawi et al., 2024), which is a notable reason as to why we can’t guarantee that AI systems will only have access to affordances that are manifestly safe. More details on this threat model can be found in the Covert malicious fine-tuning section.
  • Forcing multiple instances of untrusted models to communicate by legible chain-of-thought, by restricting information or affordances given to any single instance.
  • Infrastructure reducing the risks from externally deployed AI agents (Chan et al., 2025).

Research Sabotage

Problem summary: In the future, potentially scheming AIs will likely be used for large amounts of intellectual work, in particular AI R&D. A worry here is that they might systematically sabotage safety-relevant research, preventing us from controlling increasingly capable AI systems. 

Why this matters: Preventing research sabotage matters insofar as we rely on AIs to help perform research that enables alignment of future even more capable AIs. Methods focused on catching clearly incriminating actions (such as the ones described in the High-stakes/concentrated AI safety failures section) may not be suitable for detecting this type of sabotage, as research sabotage is likely to be difficult to distinguish from lack of competence.

Key subproblems:

Scheming-robust scalable oversight research. Scalable or recursive oversight is a technique where training signals for more powerful models are provided by less capable models. We are interested in research on scalable oversight techniques robust to intentional subversion by the monitored models. We envision that this kind of scalable oversight research is done using control-style adversarial analysis (see Shlegeris and Greenblatt, 2025). Examining approaches to increasing robustness of AI safety via debate (Irving et al., 2018) offers one promising starting direction for this research.

Studying the sample efficiency of online training. Monitors might detect that models behave in undesirable but non-catastrophic ways during deployment. In such cases, AI developers are unlikely to shut down the deployment, but they might be motivated to perform online training to remove the undesirable behaviour. We’re interested in studying the sample efficiency of online training in order to understand the prospects of successfully training away undesirable scheming behaviours. Studying this requires developing malign initialisations, i.e., models which are trained to resist online training.

Mitigating sandbagging. A particular worry with misaligned models is that they might deliberately underperform in cases where high performance would be desirable such as dangerous capability evaluations, AI safety research, or when performing oversight over other AI models (see Shlegeris and Stastny, 2025 for detailed analysis). Previous empirical work has studied this threat model using password-locked models (see Greenblatt et al., 2025). We are interested in extensions of this work, as well as alternative approaches to studying sandbagging. One example project here would be to study password-locked reasoning models.

Covert Malicious Fine-Tuning

Problem summary: We are interested in better understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:

  • Fine-tuning may instill a propensity for the newly fine-tuned model to produce artifacts containing information that flows back to the untrusted model that initiated the CMFT, for example, under the guise of performing a hypothesis-driven experiment. The main risk here is that the fine-tuned model does a large amount of malign intellectual labor, producing insights that increase the chance of a critical AI safety failure.
  • Fine-tuning may instill a secret loyalty, which could be used in a coordinated attempt by the AI model to seize resources and escape human control. 

We would like to advance our understanding of this threat model, and assess possible countermeasures.

Why this matters: Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts (dataset, logs and model behaviour on normal prompts) look benign. 

Key subproblems:

Studying foundational questions. We’d like to get more clarity on the various possible threat models:

  • What are the most hard-to-detect and sample-efficient ways to install backdoors or secret loyalties? How sample-efficient is it to sneakily teach models secret encodings?
  • Is it possible to make use of emergent misalignment (Betley et al., 2025) for data poisoning? This matters because some data leading to emergent misalignment looks superficially benign. See Cloud et al. 2025 for recent work related to this threat model.
  • What is the offense-defense balance of creating and detecting encodings? This would inform us about the feasibility of detecting secret encodings in the dataset before fine-tuning happens.
  • What are plausible ways to poison RL environments? This matters because future AI research might often involve RL experiments.

Studying countermeasures. There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:

  • Auditing the dataset or RL environment. A key challenge is that CMFT might be used to poison data in ways that are difficult to detect. There are also fundamental limitations in trying to evaluate how problematic single fine-tuning data points are in isolation (Davies et al., 2025).
  • Auditing the fine-tuned model. Potential approaches include directly auditing the back-doored model (e.g. interrogating the model on whether it has a backdoor, similar to Betley et al., 2025) or trying to train out the misbehaviour via adversarial or non-adversarial training (Mallen and Hebbar, 2025). Exploring the effectiveness of these and other potential approaches is a key empirical question. An example research approach might be to audit the model during training (i.e. on early checkpoints). This approach could be promising because models may misgeneralize at first, for instance, models trained to exhibit some bad behaviour conditional on a backdoor-trigger may initially display the bad behaviour unconditionally.
  • Asking the model that initiated the fine-tuning for detailed justifications. We can require an AI to write detailed justifications for the high-level design decisions behind the data-generating process, or for the inclusion of individual datapoints, in order for a fine-tuning run to be approved. We note that this intervention would have to be studied in a CMFT-specific control setting.
  • Design patterns for limiting harmful information flow. A downstream mitigation could involve limiting information flow from fine-tuned models to the model that instigated its fine-tuning. For example, preventing access to the raw data of an experiment, which may contain steganographically encoded information, might be possible without significantly impacting the value of the experiment. Another approach could involve using trusted models to summarise experiment results, noting this might often limit their informational value.

Modelling of Risks From Deployment of Potentially Misaligned AI Systems

Problem summary: Identification of the broad range of possible paths through which misaligned AI systems might cause catastrophic harm.

Why this matters: To have high confidence that a system of control protocols is sufficient to prevent catastrophic outcomes, it is essential to identify and evaluate the high-level pathways to harm (e.g., code sabotage, or human influencing lab employees).

We emphasise that the scale of work required here is substantial – it involves attempting to identify all the ways in which intelligent systems could sabotage complex safety systems.

Suggestions to learn more: For a literature review of possible risk analysis techniques, see Koessler and Schuett 2023. See the high level risk modelling approaches proposed by model developers (Anthropic, 2025, Google Deepmind, 2025, OpenAI, 2025), and examples of safety cases (Korbak et al., 2025b) and Systematic Hazard Analysis (Mylius, 2025) as two structured approaches.

Uplift of Trusted Models

Problem summary: These investigations explore ways to increase the monitoring capabilities of trusted AI models, that are otherwise less capable than their untrusted counterparts. Promising research directions here include the creation of reliable evaluations of model trustedness,as well as developing various ways of narrowly increasing model capabilities relevant to AI monitoring without increasing model scheming capabilities or propensity towards scheming.

Why this matters: Current proposed control protocols often benefit substantially from access to capable trusted models for monitoring and strategies like rewriting or editing potentially backdoored code when untrusted model outputs exceed suspicion thresholds (Greenblatt et al., 2023).

Key subproblems:

Developing trustedness evaluations. As models become more competent in the future, it will become less clear whether they can be trusted to not be scheming to escape human control (Järviniemi, 2024). Trustedness evaluations aim to evaluate an AI model’s capability to "fake" alignment during training and evaluation and then behave unsafely at deployment (see Mallen et al., 2024 for previous work).

Improving the usefulness of trusted monitors. For instance:

  • Training trusted monitors to become better at critical skills, which involves building training environments or datasets. Relevant skills include:
     
    • Detecting vulnerabilities.
    • Detecting steganographic encodings.
    • Paraphrasing the outputs of powerful models, to help prevent collusion (Shlegeris, 2024).
  • Making trusted monitors robust against jailbreak attempts embedded in the outputs of untrusted policies.
  • Ensuring that trusted monitors can deal with very long context, such as entire codebases or long agent trajectories.
  • Developing agentic trusted monitors, such as interrogator agents (Pacchiardi et al., 2023).
  • Identifying scaling laws for the above.

Building capable models with narrow capability profiles. For instance:

  • Studying the scaling laws of highly specialized models, e.g. coding models, with a focus on techniques to get small models to perform well (e.g. distillation).
  • Studying the capability profile of large models that are pre-trained only on data predating language models.
  • Utilising unlearning to remove scheming capabilities.

Chain-of-Thought Monitoring

Problem summary: Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures.

Why this matters: CoT monitorability matters because it's an additional safety layer that's already working today—catching models planning misbehaviour in their reasoning traces. No fundamental breakthroughs are needed. Without preserving this capability, we are flying blind since we have no other reliable window into models' intentions or planning processes as they become more capable (Korbak et al., 2025c).

Key subproblems

Evaluating monitorability. Monitoring systems often rely on the assumption that CoTs are monitorable: accurately reflect models’ actual reasoning. Faithfulness (e.g. Lanham et al., 2023, Turpin et al., 2023, Roger et al., 2025, Chen et al., 2025) has been proposed as a proxy measure for monitorability. However, existing faithfulness evaluations do not disambiguate the propensity of models to externalise their reasoning, with the necessity of externalising their reasoning. Moreover, they typically only consider simple question answering settings, failing to identify the level of monitorability that is sufficient for ensuring safety in more complex domains. We’re interested in the development of novel evaluations: those measuring monitorability directly as well as those that target specific properties underlying monitorability. A suite of evaluations covering capabilities that determine the effectiveness of different CoT monitoring protocols includes:

  • Capability evaluations of latent reasoning
     
    • Assessing how well LLMs can reason without externalizing (e.g. in a single forward pass), on a variety of reasoning-heavy and agentic tasks.
    • Assessing when models can perform complex reasoning entirely within their activations, without producing observable reasoning traces (Korbak et al., 2025a).
  • Measures of causal relevance of CoTs. This could involve perturbing a CoT and checking whether it influences the final output (Lanham et al., 2023). A low rate of change could indicate that the CoT is more confabulated or performative. Alternatively, one could try to cross-validate CoT monitors by checking whether tokens perceived as important by the CoT monitor also stand out to whitebox interpretability methods.
  • Whitebox chain-of-thought reliability. Assessing how reliably a model's reasoning traces reflect its actual decision-making process by comparing them to ground truth established using whitebox methods (Lindsey et al., 2025).
  • Steganographic capability. Determining when models become capable of deliberately concealing information in their outputs (Korbak et al., 2025a).
  • CoT obfuscation capability evaluations. Determining whether models can obfuscate otherwise suspicious reasoning, within  their CoTs on purpose. These evaluations would help establish the extent to which models are capable of hiding malicious reasoning even if their CoT was not optimised against a monitor during training.

Preserving monitorability. Monitorability of chains of thought should not be taken for granted. More research is needed to understand what degrades monitorability and which techniques could help preserve it. Example projects include:

  • Studying the effect of optimization pressure. Systematically testing how different types of process supervision, reward shaping and RL scaling affect whether models continue to externalize their reasoning faithfully versus learning to hide malicious thoughts in uninterpretable activations.
  • Decoding opaque reasoning. Future powerful models might be able to competently “speak neuralese”, performing serially deep and general-purpose reasoning in a way which are opaque to us by default (Hao et al., 2024). Decoding neuralese has some similarities to unsupervised machine translation (Lample et al., 2017), so techniques from the unsupervised translation literature might be relevant (see Goldwasser et al., 2022; Artetxe et al., 2017). However, there are some differences – for example it is likely necessary to leverage specific properties of reasoning rather than solely properties of communication/semantics.

Fundamental Science of AI Monitoring and Red Teaming

Problem summary: There is a diverse range of research areas for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.

Key subproblems:

Catching natural examples of scheming and studying them. Finding completely natural examples of scheming in deployment would likely be most informative. Alternatively, it might be useful to develop rigged training processes that make scheming more likely, with approaches similar to those used to research model organisms of misalignment (see Hubinger et al., 2024, Turner et al., 2025). It might also be helpful to rig the training process in a way that produces a schemer that is easier to catch, e.g. by making scheming show up at lower capability levels or by training the schemer on synthetic facts in order to weaken its situational awareness. However, it should be noted that this risks making the schemer too disanalogous from the real threat. Some methods for making scheming more likely include:

  • Changing the training incentives. For example, we can use RL training environments which incentivise deception and malign seeming actions like reward hacking, whilst rewarding the model for having a benign-looking chain of thought.
  • Altering capabilities or situational awareness. For instance by making the idea of a scheming strategy more salient via training the AI on documents which discuss the scheming strategy in detail, and making it look like a relatively better strategy by implying that oversight is worse than it actually is.
  • Trying to get the AI to latch onto a scheming goal. For instance by making a plausible scheming goal very salient in the environment when doing RL, switching objectives during training, or giving the AI tasks which incentivise reflection on its long-term goals.

Developing methods for few-shot catastrophe prevention. Related to the project above, but plausibly possible to study without natural examples, we are interested in techniques for improving safety in the case where a model has been caught misbehaving a small number of times:

  • When should you just train the monitor(s) vs train the policy (especially in cases where catching the model produces some value due to increased evidence for misalignment)?
  • How effective is training against misalignment for removing in somewhat realistic cases where the model could adapt its policy to not get caught in this sort of case?
  • How well does it work to train against proliferated examples of a case where we caught an AI? (Proliferation is where you try to capitalise on one example of AIs attempting to do something bad to find a diverse set of examples of AIs doing something bad.) This might be particularly effective for catching future collusion attempts, because we could train an untrusted monitor on proliferated collusion signals.

Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (see Shlegeris, 2024 for detailed analysis.) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).

Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible, however, that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.

Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment. 

Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently  being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).

Developing techniques for white-box control. We are interested in an increased understanding of the limits of white-box techniques for control. These could include linear probes for deception (see Goldowsky-Dill, 2025), or probes trained directly on off-policy malicious trajectories—there are many techniques in the literature that have been proposed. A key difficulty here is the development of realistic model organisms that can serve as test beds for white-box methods. For more details on research questions related to whitebox methods see also the “Interpretability” section.

Understanding the limits of synthetic input generation. It would be very useful if we had the ability to generate synthetic inputs that powerful LLMs can’t distinguish from “real” inputs. For instance, we may want to prevent a model from being able to confidently infer from its prompt that it is in an evaluation (see Apollo, 2025). We might also want to teach models certain synthetic facts, but there are open questions about how well this will work for powerful models. See Stastny and Greenblatt 2025 and Greenblatt 2025 for more details on this area of research.

Studying abstract analogies for scheming. In order to empirically study risks from schemers, we can try to study methods to train away deeply ingrained behaviours (which don’t necessarily have to look like misbehaviour) in LLMs that are structurally similar to scheming. Two proposed example projects in this area can be found in Stastny 2025.

Analyzing human insider threat parallels to AI scheming risk models. There exists extensive research on insider threat detection in domains such as cyber-security and national security (e.g. Capelli et al., 2012; Homoliak et al., 2019). This project involves conducting a literature review, to identify learnings that can be carried over to mitigating threats from scheming AI systems. It may also involve reviewing broader literature on sabotage of organisations such as the CIA Simple Sabotage Field Menu (CIA, 1944). The outputs of this project could be both high level security principles e.g. approaches to layered defense or proposals for implementation of specific techniques e.g. honeypots and tripwires.

Call for New Research Groups

In the previous section, we described project proposals of differing scopes illustrating the different axes along which the field of AI control can be advanced. We believe that many of these research areas require the efforts of entire research teams (3-10 people), and should possibly be undertaken by new research groups. In particular, we want to encourage the formation of new research groups that undertake ambitious projects including:

  • Providing control evaluations as a service. Relevant research projects here involve red-team elicitation (including sandbagging prevention) as well as avoiding overelicitation, and in general developing a lot of practice running control evaluations.
  • In depth explorations of specific risk-models – for example working on covert malicious fine-tuning or research sabotage. This work could include all of risk assessment, building empirical control settings, developing specialised control protocol,and running control evaluations. The ultimate aim could be co-writing a safety case for this specific risk model with a frontier AI developer.
  • Research moonshoots relevant to AI scheming and control, executing on projects such as the ones mentioned in the fundamental science section.
  • Pushing the boundaries of how far we can uplift trusted models, by training and scaffolding monitoring and auditing agents.
  • Fully making use of the opportunity of chain-of-thought monitoring, ensuring that model chains-of-thought remain monitorable, and (in the possible future where neuralese reasoning becomes viable) competitive.
  • Running red-team contests where humans try to beat AI control relevant defenses such as best attempts to safeguard fine-tuning APIs, or where humans try to detect vulnerabilities inserted by AI code editors that are only sometimes misaligned, without loss of usefulness.
  • Ensuring that human-driven auditing is effective at identifying attacks, and cheap enough to be done on a large fraction of suspicious actions. This could involve best practices for human-driven auditing in advance, and building infrastructure to make use of human-driven auditing processes in the future.

 

View The Alignment Project website to learn more or apply for funding here.



Discuss

Research Areas in Methods for Post-training and Elicitation (The Alignment Project by UK AISI)

2025-08-01 18:27:07

Published on August 1, 2025 10:27 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Methods for Post-training and Elicitation

Post-training and elicitation techniques allow us to refine model behaviour for specific safety properties, providing crucial tools for alignment. This section explores diverse applications: consistency training that aligns models without human labels, reward model improvements that resist exploitation, control protocols for managing potentially deceptive systems, and methods to preserve reasoning transparency. 

Consistency Training (Unsupervised Scalable Oversight) 

Problem summary: To supervise advanced AI, one approach is to scale unsupervised training methods to apply to domains where supervision is beyond human ability to cheaply label. For example, Wen et al., 2025 nearly match supervised baselines by training to maximise mutual predictability of labels while minimising logical contradictions. Both empirical and theoretical study of this method are necessary to understanding limitations, and scaling properties. For instance, can the method be used to train away adversarial, misaligned initialization? How does the Wen et al. method relate to Bayesian methods like belief propagation, and can these be used to separate misaligned from aligned optima?  

Why this matters: Consistency training could be a useful way to extend other alignment methods based on supervised learning to tasks where ground truth is hard to measure. For example, models could be used on long-horizon tasks like experimental design advice where feedback is only possible after the experiment has been run – to ensure models are honest and aligned on such tasks, we would need unsupervised scalable oversight methods. Conversely, some such methods could lead to misalignment, and ideally we would be able to identify this in advance.

Examples of subproblems:

Established subproblems:

  1. Weak-to-strong methods: Burns et al., 2023 identified the problem of weak-to-strong supervision, but also showed that empirical consistency methods often fail to provide useful signal for reward models. We are particularly interested in weak-to-strong methods that can provide signal on long-horizon or otherwise hard-to-supervise tasks.
  2. Unsupervised translation: Levy et al., 2025 introduce and study the problem of unsupervised translation of emergent AI languages. Can we use this to model the problem of supervising possibly superhuman capabilities, where the unknown language stands in for these future capabilities?

New questions where we are not aware of prior work:

  1. Consistency over large problem spaces: Existing work, e.g. Wen et al., 2025, shows that consistency training can provide strong signal for short-horizon problems. In such problems individual questions have clear variants (e.g. their contrapositives) with which they must be consistent. But, in general, the space of relevant questions with which a given question-answer pair must be consistent must be discovered by the model. In fact, whether empirically consistency over any fixed-size subset of problems is sufficient for global consistency is unknown (c.f. Irving et al., 2022). Ideas from active learning, and adversarial games (Hammond and Adam-Day, 2024) may be relevant here. 

Suggestions to learn more: We’d recommend reading Wen et al., 2025. Related research includes consistency over latent representations (Burns et al., 2022). 

Reward Model Training and Red-Teaming

Problem summary: Reward models are used to distill human judgment; this amortizes the cost of laborious human labeling. Post-training for LLMs uses reward model judgments as training signal for open-ended, and non-verifiable questions. However, the reward model is a proxy which is inherently imperfect and can be gamed. Models can learn to reward hack, producing outputs which score highly on the reward signal without holding up to careful scrutiny. This problem is exacerbated by the scalable oversight challenge: as AI capabilities increase, models can produce outputs (e.g., complex code, scientific analyses) that are too time-consuming or difficult for human supervisors to evaluate accurately. We need methods and evaluations for scalable use of reward models. To develop such methods, theoretical models of the obstacles to reward modelling are needed. These obstacles include both label error induced by the limitations of human judgment, and generalisation error in the reward model itself. Flaws in human judgment can be modelled by partial observability or time-complexity gaps between the model and the human rater—a reward model trained on flawed human judgments can learn to systematically approve outputs that are subtly incorrect, manipulative, or harmful. 

Why this matters: A sufficiently capable but misaligned model could learn to systematically exploit the blind spots in a reward model to achieve undesirable goals. For example, a model could produce outputs that appear helpful and correct to a human rater but contain hidden steganographic messages or dishonestly obfuscate errors in assumptions made by the model. In the worst case, a flawed reward model doesn't just lead to poor performance; it can actively train a model to deceive human supervisors without being caught.

Examples of subproblems:

Established subproblems:

  1. Weak-to-strong unsupervised reward models:  Burns et al., 2023 introduced the problem of unsupervised training for reward models as a weak-to-strong generalisation problem. Extend unsupervised methods such as Wen et al., 2025)’s consistency training to supervise reward models on problems where human labels are expensive.
  2. Diagnosing reward hacking:  Develop metrics and white-box probes (e.g. SAE vectors; Shabalin et al., 2024) that surface when the reward model’s judgments disagree with hold-out evaluations.
  3. Adversarial evaluation of reward model training: Marks et al., 2025 demonstrated a red-team blue-team game in which researchers compete to determine whether they can detect misalignment in an adversarially constructed policy. A similar exercise could be done with adversarially-constructed reward models.
  4. Reward modelling under partial-information: Study reward model learning when human judges are only provided subsets of the whole model output (c.f. Lang et al., 2024).

New questions where we are not aware of prior work:

  1. Iterative engineering and risks of uncaught reward hacking: METR, 2025 and others have shown that recent models knowingly subvert user intent on coding tasks. Developers iteratively correct for such behaviour by fixing hackable reward environments, but this empirical iteration process is likely to leave untouched harder to catch hacks. Can we theoretically model this dynamic? Will the resulting model develop increasingly complex reward hacks? Statistical analysis of Goodhart’s law, and creation of deliberately-flawed benchmarks could help predict the results of these practices.

Suggestions to learn more: Previous research in this area includes Gao et al., 2022.

Empirical Investigations Into Preventing Research Sabotage

Problem summary: Future AI systems may be used for large amounts of intellectual work, including in AI research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. This exposes us to risk that these systems might systematically sabotage safety-relevant research, which would frustrate our efforts to develop more powerful safety techniques. We would like to develop methods to prevent this sabotage, including by improving the ability of trusted AI systems to monitor untrusted systems. 

Why this matters: Preventing research sabotage matters insofar as we rely on AIs to help perform research that enables alignment of future even more capable AIs. Methods focused on catching clearly incriminating actions may not be suitable for detecting this type of sabotage, as research sabotage is likely to be difficult to distinguish from lack of competence. Current proposed control protocols often benefit substantially from access to capable trusted models, for monitoring and strategies like rewriting or editing potentially backdoored code when untrusted model outputs exceed suspicion thresholds (Greenblatt et al., 2023).

Key subproblems:

Scheming-robust scalable oversight research. Scalable or recursive oversight is a technique where training signals for more powerful models are provided by less capable models. We are interested in research on scalable oversight techniques robust to intentional subversion by the monitored models.We envision that this kind of scalable oversight research is done using control-style adversarial analysis (see Shlegeris and Greenblatt, 2025). Examining approaches to increasing robustness of AI safety via debate (Irving et al., 2018) offers one promising starting direction for this research.

Studying the sample efficiency of online training. Monitors might detect that models behave in undesirable but non-catastrophic ways during deployment. In such cases, AI developers are unlikely to shut down the deployment, but they might be motivated to perform online training to remove the undesirable behaviour. We’re interested in studying the sample efficiency of online training in order to understand the prospects of successfully training away undesirable scheming behaviours. Studying this requires developing malign initialisations, i.e., models which are trained to resist online training.

Mitigating sandbagging. A particular worry with misaligned models is that they might deliberately underperform in cases where high performance would be desirable, such as dangerous capability evaluations, AI safety research or when performing oversight over other AI models (see Shlegeris and Stastny, 2025 for detailed analysis). Previous empirical work has studied this threat model using password-locked models (see Greenblatt et al., 2025). We are interested in extensions of this work, as well as alternative approaches to studying sandbagging. One example project here would be to study password-locked reasoning models.

Improving the usefulness of trusted monitors. For instance:

  • Training trusted monitors to become better at critical skills, which involves building training environments or datasets. Relevant skills include:
     
    • Detecting vulnerabilities.
    • Detecting steganographic encodings.
    • Paraphrasing the outputs of powerful models, to help prevent collusion (Shlegeris, 2024).
  • Making trusted monitors robust against jailbreak attempts embedded in the outputs of untrusted policies.
  • Ensuring that trusted monitors can deal with very long context, such as entire codebases or long agent trajectories.
  • Developing agentic trusted monitors, such as Interrogator agents (Pacchiardi et al., 2023).
  • Identifying scaling laws for the above.

Building capable models with narrow capability profiles. For instance:

  • Studying the scaling laws of highly specialized models, e.g. coding models, with a focus on techniques to get small models to perform well (e.g. distillation).
  • Studying the capability profile of large models that are pre-trained only on data predating language models.

Utilising unlearning to remove scheming capabilities.

 

View The Alignment Project website to learn more or apply for funding here.



Discuss

Research Areas in Benchmark Design and Evaluation (The Alignment Project by UK AISI)

2025-08-01 18:26:59

Published on August 1, 2025 10:26 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Benchmark Design and Evaluation

Benchmarks translate alignment's abstract conceptual challenges into measurable tasks that researchers can systematically study and use to evaluate solutions. For alignment, this means creating tests that capture the difficulties of ensuring AI systems behave as intended—from evaluating behaviours we can't efficiently oversee, to testing against models that might actively try to deceive us. This section develops evaluation infrastructure in two areas:

  1. Scalable oversight: assessing AI performance on tasks where human evaluation is limited by expertise, cost, or time.
  2. Control: testing whether safety measures work when models might actively try to circumvent them.

Reward Model Benchmarking for Scalable Oversight

Problem summary: As highly capable AI systems are deployed on high-stakes tasks humans will be unable to evaluate them, and new methods and evaluations are required to surface emergent safety failures. We are particularly concerned with alignment of AI systems which are used by humans to execute AI research tasks. We need benchmarks which quantify performance on the broad scope of tasks models will be used for, including tasks without uniquely verifiable solutions, like research advice. We also need benchmarks and datasets quantifying known issues.

Why this matters: Scalable oversight is the problem of correctly rewarding intended behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A robust scalable oversight method is needed to safely train models with highly advanced capabilities, and debate is among the most promising candidates. More generally, we believe AI systems themselves will likely be used heavily in AI research, but that this may not happen safely by default. For iterating on scalable oversight methods, benchmarks and datasets make it easier to identify issues with existing methods, iterate and improve.

Examples of subproblems:

Established subproblems:

  1. The 'obfuscated arguments' problem (Barnes 2020) is a strategy by which a dishonest AI system could prevent an honest AI system from efficiently pointing out its mistakes. Develop a dataset of debate questions on which a dishonest system could successfully use this strategy.
  2. To model the problem of scalable oversight and human limitations, partial observability can be used as a proxy in environment design. Develop benchmarks for oversight and reinforcement learning where the reward model or judge does not see all actions taken by the agent. See Michael et al., 2023 for an example of benchmarking a scalable oversight method in a partially observable environment.
  3. As models become more competent in the future, it will become less clear whether they can be trusted to not be scheming to escape human control (Järviniemi, 2024). Trustedness evaluations aim to evaluate an AI model’s capability to ‘fake’ alignment during training and evaluation and then behave unsafely at deployment (Mallen et al., 2024).

New questions where we are not aware of prior work:

  1. We would like datasets of problems that are both realistic stand-ins for future alignment problems, e.g. alignment problems that might arise in automated R&D, and show flat or inverse scaling (McKenzie et al., 2024).
  2. Develop a reward modelling benchmark for research advice and empirically validate it. Show that this benchmark is uncorrelated from the broader progress driven by verifiable RL and compute scaling.

Suggestions to learn more: Bowman 2022 provides background on benchmark design for scalable oversight. Previous research in this area includes Tan et al., 2024, Lang et al., 2024, and Farquhar et al., 2025). 

Benchmarking Concentrated AI Security Failures

Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over misaligned AI models which are deliberately aiming to subvert human control. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).

Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.

Key subproblems:

Building high-stakes control settings. Settings are empirical research environments, for example containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage and the effectiveness of control protocols in preventing it. They contain main tasks, like legitimate software engineering tasks for a coding agent, as well as side tasks that a misaligned model might attempt to conduct, like reducing security restrictions that protect model weights. See ControlArena for example settings. A diverse range of settings meeting the following desiderata would substantially advance control research (Shlegeris, 2025):

  • The main tasks have to be such that frontier models are reasonably competent at them, but weaker models do much worse, because we want to rule out protocols that just use the trusted model.
  • The side tasks should not require the models to make use of affordances that aren’t needed for the main task, since this allows for the blue team to flag any use of unnecessary affordances.
  • It should be possible to inform both the red and blue teams as to what the possible side tasks may be.

A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows studying of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt, 2025 for more context. It’s important to run control evaluations in these realistic settings to validate their usefulness for control research.

High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al., 2023 and Bhatt et al., 2025. Valuable projects may include:

  • Studying previously discussed control protocols in new empirical settings such as mock frontier model development infrastructure. This work could, for example, involve evaluating techniques for preventing collusion in untrusted monitoring (Shlegeris, 2024) such as paraphrasers and forcing task decompositions (Shlegeris, 2024).
  • Training red-team attack policies using reinforcement learning (RL). This may help mitigate a limitation of current control evaluations, that are being done with heavily scaffolded models and thus relying on researchers correctly.

Understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:

  • Fine-tuning may instill a propensity for the newly fine-tuned model to produce artifacts containing information that flows back to the untrusted model that initiated the CMFT, for example under the guise of performing a hypothesis-driven experiment. The main risk here is that the fine-tuned model does a large amount of malign intellectual labor, producing insights that increase the chance of a critical AI safety failure.
  • Fine-tuning may instill a secret loyalty, which could be used in a coordinated attempt by the AI model to seize resources and escape human control. 

We would like to advance our understanding of this threat model, and assess possible countermeasures.Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts—dataset, logs and model behaviour on normal prompts—look benign. 

There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:

  • Auditing the dataset or RL environment. A key challenge is that CMFT might be used to poison data in ways that are difficult to detect. There are also fundamental limitations in trying to evaluate how problematic single fine-tuning data points are in isolation (Davies et al., 2025).
  • Auditing the fine-tuned model. Potential approaches include directly auditing the back-doored model (e.g. interrogating the model on whether it has a backdoor, similar to Betley et al., 2025) or trying to train out the misbehaviour via adversarial or non-adversarial training (Mallen and Hebbar, 2025). Exploring the effectiveness of these and other potential approaches is a key empirical question. An example research approach might be to audit the model during training. This approach could be promising because models may misgeneralize at first, for instance, models trained to exhibit some bad behaviour conditional on a backdoor-trigger may initially display the bad behaviour unconditionally.
  • Asking the model that initiated the fine-tuning for detailed justifications. We can require an AI to write detailed justifications for the high-level design decisions behind the data-generating process, or for the inclusion of individual data points, in order for a fine-tuning run to be approved. We note that this intervention would have to be studied in a CMFT-specific control setting.
  • Design patterns for limiting harmful information flow. A downstream mitigation could involve limiting information flow from fine-tuned models to the model that instigated its fine-tuning. For example, preventing access to the raw data of an experiment, which may contain steganographically encoded information, might be possible without significantly impacting the value of the experiment. Another approach could involve using trusted models to summarise experiment results, noting this might often limit their informational value.

Fundamental Science of AI Monitoring and Red Teaming

Problem summary: There are a diverse range of research areas, for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.

Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.

Key subproblems:

Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (Shlegeris, 2024) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).

Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible however that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.

Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment. 

Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently  being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).

Capability evaluations of latent reasoning

  • Assessing how well LLMs can reason without externalizing (e.g. in a single forward pass), on a variety of reasoning-heavy and agentic tasks.
  • Assessing when models can perform complex reasoning entirely within their activations, without producing observable reasoning traces (Korbak et al., 2025a).

Studying chain of thought (CoT). Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures:

  • Evaluating monitorability. Monitoring systems often rely on the assumption that CoTs are monitorable: accurately reflect models’ actual reasoning. Faithfulness (e.g. Lanham et al., 2023, Turpin et al., 2023, Roger et al., 2025, Chen et al., 2025) has been proposed as a proxy measure for monitorability. However, existing faithfulness evaluations do not disambiguate the propensity of models to externalise their reasoning, with the necessity of externalising their reasoning. Moreover, they typically only consider simple question answering settings, failing to identify the level of monitorability that is sufficient for ensuring safety in more complex domains. We’re interested in the development of novel evaluations: those measuring monitorability directly as well as those that target specific properties underlying monitorability. A suite of evaluations covering capabilities that determine the effectiveness of different CoT monitoring protocols includes:
  • Measures of causal relevance of CoTs. This could involve perturbing a CoT and checking whether it influences the final output (Lanham et al., 2023). A low rate of change could indicate that the CoT is more confabulated or performative. Alternatively, one could try to cross-validate CoT monitors by checking whether tokens perceived as important by the CoT monitor also stand out to whitebox interpretability methods.
  • Whitebox chain-of-thought reliability: Assessing how reliably a model's reasoning traces reflect its actual decision-making process, by comparing them to ground truth established using whitebox methods (Lindsey et al., 2025).
  • Steganographic capability: Determining when models become capable of deliberately concealing information in their outputs (Korbak et al., 2025a).
  • CoT obfuscation capability evaluations: Determining whether models can obfuscate otherwise suspicious reasoning within  their CoTs on purpose. These evaluations would help establish the extent to which models are capable of hiding malicious reasoning even if their CoT was not optimised against a monitor during training.

     

View The Alignment Project website to learn more or apply for funding here.



Discuss

Research Areas in Interpretability (The Alignment Project by UK AISI)

2025-08-01 18:26:50

Published on August 1, 2025 10:26 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Interpretability

Interpretability provides access to AI systems' internal mechanisms, offering a window into how models process information and make decisions. These tools let us examine neural representations to detect hidden properties like deception or latent capabilities, while also enabling direct intervention to modify model behaviour. As models grow more sophisticated, interpretability may be crucial for maintaining our understanding of their behaviour.

Lie Detection

Problem summary: Can we determine when AI systems are lying to us? How can we make arguments that techniques developed in one context or on current models will work on future models?

Why this matters: Many risks posed by future AI systems would be significantly mitigated if we could ensure that these systems are honest and forthcoming. Success here would contribute both to measuring AI system propensities as well as mitigating risks from deceptive systems.

Examples of subproblems:

Established subproblems:

  1. Are there ‘universal lying mechanisms’ such that a transformative AI system could not outwardly exhibit behaviour we would characterise as deceptive without utilising these mechanisms?
  2. How might lie detectors become less effective if models are trained against them? This might be intentionally (honesty training) or unintentionally (monitor label leakage) (Cundy et al., 2025).
  3. Is it possible to disentangle representations of ‘intent to deceive’ from truthfulness, validity or credence? Are we finding latents that track ‘the model believes that a user believes x?’ when we mean to find latents representing a model’s belief?
  4. Do lie detection techniques scale well from weaker to more capable models?

New questions where we are not aware of prior work:

  1. Is it possible to combine lie detection techniques such as activation monitoring with deliberate interrogation techniques?  This may combine benefits of both black and white box methods. Black box methods can ensure that we target specific propositions (Chalmers, 2025) while whitebox methods might work even if a model can convincingly lie.
  2. Are there trade-offs between the sophistication or robustness of deceptive behaviours and the ease by which we might detect them? For example, does reinforcement learning for secrecy as described in Marks et al., 2025 have a side effect of making white box methods more effective?
  3. Can ensembles of detectors outperform single probe based lie detection strategies?

Suggestions to learn more: We’d recommend reading Goldowsky-Dill et al., 2025, Marks et al., 2023 and Azaria et al., 2023 for context on simple approaches to lie detection with linear probes. Chowdury et al., 2025 may also be useful for further context on truthfulness in frontier AI systems. Farquhar et al., 2023 provides some insights into the limitations of prior unsupervised LLM knowledge discovery techniques. 

Interpretability for Future Models

Problem summary: Some novel interpretability strategies that require significant resource investments may substantially advance our insight into neural network internals (Bricken et al., 2023; Templeton et al., 2024; Lindsey et al., 2025). There are many opportunities to advance this area of interpretability. 

Why this matters: Current interpretability techniques may not work for future models. Possible challenges include the cost of current techniques, known issues with current methods and unclear robustness to future AI development trajectories which may involve superhuman AI. 

Examples of subproblems:

Established subproblems:

  1. Improving the training efficiency of recently developed techniques like sparse autoencoders. Are there ways to get similar levels of value with a fraction of the cost or effort?
  2. Current sparse decomposition techniques have known issues which may prevent us from realising their value from AI safety. For example, they may learn classifiers with ‘holes’ in them due to the sparsity objective (Chanin et al., 2025)
  3. Can the use of computer-use agents to perform behavioural audits (mentioned in Anthropic's Claude 4 system card) be extended to ‘interpretability agents’ that can not only locate but diagnose and debug anomalous behaviours? 

New questions where we are not aware of prior work:

  1. Demonstrating usefulness of sparse network decomposition (Bricken et al., 2023) in the context of super-intelligent capabilities. Future AI systems may have alien concepts that are far from interpretable to humans (Hewitt et al., 2025). Stress-testing features and circuit finding in domains like superhuman chess or biological foundation models (Simon et al., 2024) may teach us how to continue to get value out of these techniques.
  2. Demonstrating how sparse decomposition techniques might work even in models with extensive latent reasoning. Feature characterisation currently relies heavily on token-highlighting, but if future models reason in latent space (Hao et al., 2024, Geiping et al., 2025) then it’s unclear whether we could continue to use the same approaches. 

Leveraging White-Box Access for Useful Interventions

Problem summary: Although some early studies hint at methods that could intervene in a model’s internal representations during inference, none have yet been shown to be a reliable, central technique (Abdelnabi et al., 2025, Turner et al., 2023, Li et al., 2023). 

Why this matters: Combined with scalable interpretability strategies for tracking run-time information, the possible applications of being able to directly edit model ‘beliefs’ or steer completions (such as towards truthfulness) could be used for both estimating and mitigating risks posed by future AI systems. 

Examples of subproblems:

Established subproblems:

  1. Can we detect and prevent evaluation awareness in AI systems (Abdelnabi et al., 2025)?
  2. Can we exercise fine-grained control over refusal tendencies (Arditi et al., 2024, Lee et al., 2024)? For example, can we use this to prevent residual refusal tendencies in models that have had their harmfulness training removed to enable CBRN dangerous capability evaluations (Anthropic, 2025)?
  3. Can activation steering be used to elicit model capabilities in sandbagging models (van der Weij et al., 2025, Hofstatter et al., 2025)?

New questions where we are not aware of prior work:

  1. We’d like to see more exploration of methods of optimising text to minimise the activations of latents (Thompson et al., 2024), such as those tracking evaluation awareness, as an alternative to activation steering. For example, we could use this to control what information is ‘leaked’ to AI systems through prompts.
  2. We’re interested in whether we can better test AI systems for undesirable propensities by directly modifying their beliefs. Such techniques might enable us to probe for behaviours that are usually very difficult to elicit via prompting or other ‘honey-potting’ strategies. For example, there is some prior work intervening on a latent representation for test awareness (Abdelnabi et al., 2025). 

Interpretability Approaches to Eliciting Bad Contexts

Problem summary: We like to determine whether sufficient information is contained within AI systems to construct contexts in which they might behave maliciously (the problem of 'eliciting bad contexts'; Irving et al., 2025).

Why this matters: In theory, a solution to eliciting bad contexts might be useful for generating datasets that mimic deployment sufficiently effectively that we can minimise distribution shift from training to deployment. This means that training methods to produce aligned systems are more likely to generalise to deployment and withstand adversarial optimisation pressure to move away from the aligned state via prompting.

Examples of subproblems:

  1. Are there approaches for mechanistically eliciting latent behaviours based on a single prompt (Mack and Turner, 2024)? How could we use these approaches to locate latent representations and find bad contexts?
  2. Can the method in Thompson et al., 2024 for optimising an input to maximally activate a latent representation be extended to SAE latents or steering vectors? Showing these techniques can be applied whilst preserving other properties of a prompt may be useful. For example, can we optimise an eval question to preserve the content of the question but against the model being aware that the question comes from an evaluation?

Suggestions to learn more: We’d recommend reading Irving et al., 2025.

 

View The Alignment Project website to learn more or apply for funding here.



Discuss

Research Areas in Cognitive Science (The Alignment Project by UK AISI)

2025-08-01 18:26:42

Published on August 1, 2025 10:26 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Cognitive science

Systematic Human Error

Problem summary: Modern AI models (LLMs and associated agents) depend critically on human supervision in the training loop: AI models do tasks, and human judges express their preferences about their outputs or performance. These judgments are used as training signals to improve the model. However, there are many reasons why humans could introduce systematic error into a training signal: they have idiosyncratic priors, limited attention, systematic cognitive biases, reasoning flaws, moral blind spots, etc. How can we design evaluation pipelines that mitigate these flaws to minimise systematic error, so that the models are safe for deployment? 

For the sub-problems below, we specify the problem as follows: 

  • There is a to-be-aligned AI model that produces outputs (text, code, or actions).
  • We assume that these outputs have some in-theory knowable ground truth quality, even if this is not immediately discernible (e.g. their consequences may be unambiguously good or bad).
  • There are one or more human raters, who express preferences about these outputs, with or without confidences.
  • The model is assumed to have more expertise or domain specific knowledge than the humans on the task being performed.
  • These raters may be assisted by other AI agents, who are also less expert than the model being aligned. They may advise individually or collectively.
  • The judgments of the human raters are used to align the model in some way. 

The core question then becomes: given the potential limitations of the human judges, how can we align the model to produce outputs that are ultimately good rather than bad? We think of this as a mechanism design problem: we want to find ways to harness human metacognition, the wisdom of the crowds, collective deliberation or debate, and the use of AI models as thought partners to improve the alignment problem. 

We have both scientific goals and engineering goals. The engineering goals are focussed on post-training pipeline design and are geared to find ways that improve alignment. The scientific questions are about the nature of the cognitive or social processes that make weak-to-strong alignment either more or less challenging. We think that this is an interesting problem in its own right (for example, you can see representative democracy as a special case of weak-to-strong alignment). Some of the scientific goals build on work already conducted in other domains. 

Why this matters: In some cases, the model being trained may be more capable than the humans providing training signals. For example, the model may be writing code for human judges, who cannot tell whether or how the code will execute just from reading it.  The problem of correctly rewarding desired behaviours for AI systems when those behaviours are beyond humans’ ability to efficiently judge is called “scalable oversight”.  The idea behind scalable oversight methods is to reduce harder problems to easier ones where human judgment can be relied upon. For example, judges might be assisted by another AI model (a rater assistant), or judge the outcome of a debate between different strong models (safety by debate). 

Examples of subproblems: 

The problems summarised below are inspired by the initial thinking from Irving and Askell, 2019 (this paper focuses on debate, but the ideas are relevant to other methods). 

  1. Weighting responses by their confidence provides higher-precision aggregate estimates. Can we improve alignment by weighting training signals by the confidence expressed by human raters? We could do this either within rater (downweigh specific choices) or between raters (downweigh less calibrated individuals).
  2. Can we improve alignment using the ‘wisdom of the crowds’, by pooling the (confidence-weighted) responses of multiple human raters? For example, adopting similar techniques to those used in Bahrami et al., 2010.
  3. If humans are provided with the opportunity to discuss their preferences before assigning a rating, does this improve alignment? Related to Bergman et al., 2024. Note that discussion can confer outsize effects on collective accuracy (Navajas et al., 2018).
  4. Does giving raters more time to reflect help or hurt the alignment process, considering the relative quantity of training data produced? Can people learn to discern how much time they should allocate to a choice, given their sensitivity (optimally, both very hard and very easy choices should be made quickly; Masis et al., 2023). Do humans do this when judging AI outputs?
  5. When judging messages from experts, humans can judge their trustworthiness based on surface features (such as the tone, confidence, or crispness with which they are expressed), even if they are not experts themselves. Do similar signals exist for AI models in expert domains, possibly using interpretability tools?
  6. If human users are given the option to query the strong model, does this improve their ability to discern whether its answers are accurate or not? What queries work best? Do people tend to ask questions that maximise information gain relevant for the problem? See Rothe et al., 2018 for work that has studied the quality of human questions.
  7. Where humans have a (potentially unlimited) number of weak AI systems that could be used to help, how can they best be used to assist with the rating process. For example, do humans tend to ignore the assistance (overconfidence) or follow it slavishly (overreliance), relative to optimal (conditional on how accurate the model is)? See Bridgers et al., 2024 for more details concerning rater assist and hybridization methods.
  8. Let’s imagine that we are able to create two copies of the strong model, and encourage one to point out the flaws in the argument of the other. Would humans be able to discern which model was correct? See Irving et al., 2018 and Michael et al., 2023.  

Suggestions to learn more: We’d recommend reading Bridgers et al., 2024 and Irving and Askell (2019).  

Running Scalable Oversight Protocols With Humans 

Problem summary: As scalable oversight protocols are developed, protocols should be empirically tested – and where LLMs are not yet capable enough to carry out the protocols, human participants can be used instead.  

Why this matters: Scalable oversight is the problem of correctly rewarding desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A scalable oversight method is therefore needed to supervise for honesty, helpfulness and harmlessness in models with capabilities exceeding human judges. 

Examples of subproblems: 

Established subproblems: 

  1. Investigate the experimental design for studies of this type and conduct small-scale trial experiments to improve future work in the area. As an example, previous OpenAI human studies surfaced flaws in debate protocols (Barnes, 2020), where malicious debaters could win using ambiguous language or obfuscated arguments with hard-to-find flaws.
  2. Barnes and Christiano, 2020 found a failure mode where debaters could win through ambiguous language, and proposed an improved "cross-examination" scheme to mitigate this failure. However, more recent human debate experiments such as Michael et al., 2023 haven't used cross-examination. Can we replicate the benefit of cross-examination in human experiments? This requires careful design, as cross-examination uses snapshots of the mental state of a debater which cannot be fully replicated with humans.
  3. We have recent theoretical work on when the obfuscated arguments problem from (1) can be avoided, introducing a new “prover-estimator debate” scalable oversight protocol (Brown-Cohen et al., 2025). However, it has not been empirically tested. Implement prover-estimator debate with human participants and measure its effectiveness.  

New questions where we are not aware of prior work: 

  1. Scalable oversight protocols depend on stability (that small changes in inputs shouldn't produce drastically different outputs). We may not be able to prove the stability of our algorithms. Conduct empirical experiments with human subjects to examine the effects of small perturbations on scalable oversight protocols in practice.

Suggestions to learn more: We’d recommend reading Irving and Askell (2019) . Previous research in this area includes Parrish et al., 2022, Michael et al., 2023, and Rahman et al., 2025)

‍Developing and Evaluating Processes for Human Auditing of Potentially Scheming Models

In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. The field of AI Control aims to reduce this risk through techniques like monitoring by other AI models and restricting the affordances of untrusted AI systems (Shlegeris and Greenblatt 2024).

The early stages of an AI control protocol will usually involve automated filters and modification via machine learning classifiers or other tools. However, we expect that any costly action taken as a result of these automated filters will eventually involve human input for confirmation. Therefore, it will be critical to understand how control protocols involving humans operate, both by applying prior knowledge from cognitive science about human behaviour and cognitive biases, and via end-to-end tests of control protocols involving humans. 

Why this matters: Control measures provide a practical safety approach to mitigating loss of control risks that doesn’t require fundamental research breakthroughs. Control methods can be evaluated and deployed using current capabilities, making them particularly useful for responsible AI development in the near term. For an overview of how control measures can scale to increasingly powerful systems see Korbak et al., 2025a).

Developing and evaluating processes for human auditing (with AI uplift). AI control benefits from having the ability to audit suspicious activity with human labor. It would be very useful to anticipate potential issues with our ability to use humans (who may be aided by AI systems) for auditing. An additional challenge worth studying, in the context of automated AI R&D, is that auditors may often be unfamiliar with the code they’re reviewing. 

Some valuable research directions here include:

  • Running trials with humans to evaluate different protocols for auditing, aiming to make them cheap and effective. This likely includes studying uplift (and risks) from giving human auditors access to various bespoke AI tools.
  • Measuring and mitigating cognitive biases that result from data imbalance. By default, a human auditor in a control protocol might see a stream of data which is nearly all fine, making sustained attention difficult if the task is to catch rare failures. We are interested in exploring potential mitigations to this problem.
  • Identifying relevant settings which allow for evaluating techniques for auditor uplift, e.g. spotting real vulnerabilities in unfamiliar codebases or suspicious activity in settings like SHADE-Arena (Kutasov et al., 2025). Some relevant settings might have to be developed from scratch—for example, we might want to be able to audit potential data poisoning attempts.
  • Developing mechanisms (e.g. redactors or paraphrasers) that can handle practical implementation issues regarding sensitive information. For example, the suspicious action may have been performed in the context of work with privileged or highly sensitive information, that human auditors can’t have access to.
  • Developing auxiliary methods to aid with few-shot catastrophe prevention (Greenblatt and Shlegeris, 2024), such as proliferation and interrogation techniques (focusing on methods that particularly benefit from execution by or with the help of competent humans).

     

View The Alignment Project website to learn more or apply for funding here.



Discuss