2025-08-01 20:20:07
Published on August 1, 2025 12:20 PM GMT
There was enough governance related news this week to spin it out.
Anthropic, Google, OpenAI, Mistral, Aleph Alpha, Cohere and others commit to signing the EU AI Code of Practice. Google has now signed. Microsoft says it is likely to sign.
xAI signed the AI safety chapter of the code, but is refusing to sign the others, citing them as overreach especially as pertains to copyright.
The only company that said it would not sign at all is Meta.
This was the underreported story. All the important AI companies other than Meta have gotten behind the safety section of the EU AI Code of Practice. This represents a considerable strengthening of their commitments, and introduces an enforcement mechanism. Even Anthropic will be forced to step up parts of their game.
That leaves Meta as the rogue state defector that once again gives zero anythings about safety, as in whether we all die, and also safety in its more mundane forms. Lol, we are Meta, indeed. So the question is, what are we going to do about it?
xAI took a middle position. I see the safety chapter as by far the most important, so as long as xAI is signing that and taking it seriously, great. Refusing the other parts is a strange flex, and I don’t know exactly what their problem is since they didn’t explain. They simply called it ‘unworkable,’ which is odd when Google, OpenAI and Anthropic all declared they found it workable.
Then again, xAI finds a lot of things unworkable. Could be a skill issue.
This is a sleeper development that could end up being a big deal. When I say ‘against regulations’ I do not mean against AI regulations. I mean against all ‘regulations’ in general, no matter what, straight up.
From the folks who brought you ‘figure out who we technically have the ability to fire and then fire all of them, and if something breaks maybe hire them back, this is the Elon way, no seriously’ and also ‘whoops we misread something so we cancelled PEPFAR and a whole lot of people are going to die,’ Doge is proud to give you ‘if a regulation is not technically required by law it must be an unbridled bad thing we can therefore remove, I wonder why they put up this fence.’
Hannah Natanson, Jeff Stein, Dan Diamond and Rachel Siegel (WaPo): The tool, called the “DOGE AI Deregulation Decision Tool,” is supposed to analyze roughly 200,000 federal regulations to determine which can be eliminated because they are no longer required by law, according to a PowerPoint presentation obtained by The Post that is dated July 1 and outlines DOGE’s plans.
Roughly 100,000 of those rules would be deemed worthy of trimming, the PowerPoint estimates — mostly through the automated tool with some staff feedback. The PowerPoint also suggests the AI tool will save the United States trillions of dollars by reducing compliance requirements, slashing the federal budget and unlocking unspecified “external investment.”
The conflation here is absolute. There are two categories of regulations: The half ‘required by law,’ and the half ‘worthy of trimming.’ Think of the trillions you can save.
They then try to hedge and claim that’s not how it is going to work.
Asked about the AI-fueled deregulation, White House spokesman Harrison Fields wrote in an email that “all options are being explored” to achieve the president’s goal of deregulating government.
…
No decisions have been completed on using AI to slash regulations, a HUD spokesperson said.
…
The spokesperson continued: “The intent of the developments is not to replace the judgment, discretion and expertise of staff but be additive to the process.”
That would be nice. I’m far more ‘we would be better off with a lot less regulations’ than most. I think it’s great to have an AI tool that splits off the half we can consider cutting from the half we are stuck with. I still think that ‘cut everything that a judge wouldn’t outright reverse if you tried cutting it’ is not a good strategy.
I find the ‘no we will totally consider whether this is a good idea’ talk rather hollow, both because of track record and also they keep telling us what the plan is?
“The White House wants us higher on the leader board,” said one of the three people. “But you have to have staff and time to write the deregulatory notices, and we don’t. That’s a big reason for the holdup.”
…
That’s where the AI tool comes in, the PowerPoint proposes. The tool will save 93 percent of the human labor involved by reviewing up to 500,000 comments submitted by the public in response to proposed rule changes. By the end of the deregulation exercise, humans will have spent just a few hours to cancel each of the 100,000 regulations, the PowerPoint claims.
They then close by pointing out that the AI makes mistakes even on the technical level it is addressing. Well, yeah.
Also, welcome to the future of journalism:
China has its own AI Action Plan and is calling for international cooperation on AI. Wait, what do they mean by that? If you look in the press, that depends who you ask. All the news organizations will be like ‘the Chinese released an AI Action Plan’ and then not link to the actual plan, I had to have o3 dig it up.
Here’s o3’s translation of the actual text. This is almost all general gestures in the direction of capabilities, diffusion, infrastructure and calls for open models. It definitely is not an AI Action Plan in the sense that America offered an AI Action Plan, with had lots of specific actionable proposals. This is more of a general outline of a plan and statement of goals, at best. At least it doesn’t talk about or call for a ‘race’ but a call for everything to be open and accelerated is not obviously better.
What does it have to say about safety or dealing with downsides? We have ‘forge standards and norms’ with a generic call for safety and ethics standards, which seems to mostly be about interoperability and ‘bias.’
Mainly we have ‘Govern AI safety,’ which is directionally nice to see I guess but essentially content free and shows no sign that the problems are being taken seriously on the levels we care about. Most concretely, in the ninth point, we have a call for regular safety audits of AI models. That all sounds like ‘the least you could do.’
Here’s one interpretation of the statement:
Brenda Goh (Reuters): China said on Saturday it wanted to create an organisation to foster global cooperation on artificial intelligence, positioning itself as an alternative to the U.S. as the two vie for influence over the transformative technology.
…
Li did not name the United States but appeared to refer to Washington’s efforts to stymie China’s advances in AI, warning that the technology risked becoming the “exclusive game” of a few countries and companies.
China wants AI to be openly shared and for all countries and companies to have equal rights to use it, Li said, adding that Beijing was willing to share its development experience and products with other countries, particularly the “Global South”. The Global South refers to developing, emerging or lower-income countries, mostly in the southern hemisphere.
…
The foreign ministry released online an action plan for global AI governance, inviting governments, international organisations, enterprises and research institutions to work together and promote international exchanges including through a cross-border open source community.
As in, we notice you are ahead in AI, and that’s not fair. You should do everything in the open so you let us catch up in all the ways you are ahead, so we can bury you using the ways in which you are behind. That’s not an unreasonable interpretation.
Here’s another.
The Guardian: Chinese premier Li Qiang has proposed establishing an organisation to foster global cooperation on artificial intelligence, calling on countries to coordinate on the development and security of the fast-evolving technology, days after the US unveiled plans to deregulate the industry.
…
Li warned Saturday that artificial intelligence development must be weighed against the security risks, saying global consensus was urgently needed.
…
“The risks and challenges brought by artificial intelligence have drawn widespread attention … How to find a balance between development and security urgently requires further consensus from the entire society,” the premier said.
Li said China would “actively promote” the development of open-source AI, adding Beijing was willing to share advances with other countries, particularly developing ones in the global south.
So that’s a call to keep security in mind, but every concrete reference is mundane and deals with misuse, and then they call for putting everything out into the open, with the main highlighted ‘risk’ to coordinate on being that America might get an advantage, and encouraging us to give it away via open models to ‘safeguard multilateralism.’
A third here, from the Japan Times, frames it as a call for an alliance to take aim at an American AI monopoly.
Director Michael Kratsios: China’s just-released AI Action Plan has a section that drives at a fundamental difference between our approaches to AI: whether the public or private sector should lead in AI innovation.
I like America’s odds of success.
He quotes point nine, which his translation has as ‘the public sector takes the lead in deploying applications.’ Whereas o3’s translation says ‘governments should pioneer reliable AI in public services (health, education, transport), run regular safety audits, respect IP, enforce privacy, and explore lawful data‑trading mechanisms to upgrade governance.’
Even in Michael’s preferred translation, this is saying government should aggressively deploy AI applications to improve government services. The American AI Action Plan, correctly, fully agrees with this. Nothing in the Chinese statement says to hold the private sector back. Quite the contrary.
The actual disagreement we have with point nine is the rest of it, where the Chinese think we should run regular safety audits, respect IP and enforce privacy. Those are not parts of the American AI Action Plan. Do you think we were right not to include those provisions, sir? If so, why?
Suppose in the future, we learned we were in a lot more danger than we think we are in now, and we did want to make a deal with China and others. Right now the two sides would be very far apart but circumstances could quickly change that.
Could we do it in a way that could be verified?
It wouldn’t be easy, but we do have tools.
This is the sort of thing we should absolutely be preparing to be able to do, whether or not we ultimately decide to do it.
Mauricio Baker: For the last year, my team produced the most technically detailed overview so far. Our RAND working paper finds: strong verification is possible—but we need ML and hardware research.
You can find the paper here and on arXiv. It includes a 5-page summary and a list of open challenges.
In the Cold War, the US and USSR used inspections and satellites to verify nuclear weapon limits. If future, powerful AI threatens to escape control or endanger national security, the US and China would both be better off with guardrails.
It’s a tough challenge:
– Verify narrow restrictions, like “no frontier AI training past some capability,” or “no mass-deploying if tests show unacceptable danger”
– Catch major state efforts to cheat
– Preserve confidentiality of models, data, and algorithms
– Keep overhead low
Still, reasons for optimism:
– No need to monitor all computers—frontier AI needs thousands of specialized AI chips.
– We can build redundant layers of verification. A cheater only needs to be caught once.
– We can draw from great work in cryptography and ML/hardware security.
One approach is to use existing chip security features like Confidential Computing, built to securely verify chip activities. But we’d need serious design vetting, teardowns, and maybe redesigns before the US could strongly trust Huawei’s chip security (or frankly, NVIDIA’s).
“Off-chip” mechanisms could be reliable sooner: network taps or analog sensors (vetted, limited use, tamper evident) retrofitted onto AI data centers. Then, mutually secured, airgapped clusters could check if claimed compute uses are reproducible and consistent with sensor data.
Add approaches “simple enough to work”: whistleblower programs, interviews of personnel, and intelligence activities. Whistleblower programs could involve regular in-person contact—carefully set up so employees can anonymously reveal violations, but not much more.
…
We could have an arsenal of tried-and-tested methods to confidentially verify a US-China AI treaty. But at the current pace, in three years, we’ll just have a few speculative options. We need ML and hardware researchers, new RFPs by funders, and AI company pilot programs.
Jeffrey Ladish: Love seeing this kind of in-depth work on AI treaty verification. A key fact is verification doesn’t have to be bullet proof to be useful. We can ratchet up increasingly robust technical solutions while using other forms of HUMINT and SIGINT to provide some level of assurance.
Remember, the AI race is a mixed-motive conflict, per Schelling. Both sides have an incentive to seek an advantage, but also have an incentive to avoid mutually awful outcomes. Like with nuclear war, everyone loses if any side loses control of superhuman AI.
This makes coordination easier, because even if both sides don’t like or trust each other, they have an incentive to cooperate to avoid extremely bad outcomes.
It may turn out that even with real efforts there are not good technical solutions. But I think it is far more likely that we don’t find the technical solutions due to lack of trying, rather than that the problem is so hard that it cannot be done.
The reaction to the AI Action Plan was almost universally positive, including here from Nvidia and AMD. My own review, focused on the concrete proposals within, also reflected this. It far exceeded my expectations on essentially all fronts, so much so that I would be actively happy to see most of its proposals implemented rather than nothing be done.
I and others focused on the concrete policy, and especially concrete policy relative to expectations and what was possible in context, for which it gets high praise.
But a document like this might have a lot of its impact due to the rhetoric instead, even if it lacks legal force, or cause people to endorse the approach as ideal in absolute terms rather than being the best that could be done at the time.
So, for example, the actual proposals for open models were almost reasonable, but if the takeaway is lots more rhetoric of ‘yay open models’ like it is in this WSJ editorial, where the central theme is very clearly ‘we must beat China, nothing else matters, this plan helps beat China, so the plan is good’ then that’s really bad.
Another important example: Nothing in the policy proposals here makes future international cooperation harder. The rhetoric? A completely different story.
The same WSJ article also noticed the same obvious contradictions with other Trump policies that I did – throttling renewable energy and high-skilled immigration and even visas are incompatible with our goals here, the focus on ‘woke AI’ could have been much worse but remains a distraction, also I would add, what is up with massive cuts to STEM research if we are taking this seriously? If we are serious about winning and worry that one false move would ‘forfeit the race’ then we need to act like it.
Of course, none of that is up to the people who were writing the AI Action Plan.
What the WSJ editorial board didn’t notice, or mention at all, is the possibility that there are other risks or downsides at play here, and it dismisses outright the possibility of any form of coordination or cooperation. That’s a very wrong, dangerous and harmful attitude, one it shares with many in or lobbying the government.
A worry I have on reflection, that I wasn’t focusing on at the time, is that officials and others might treat the endorsements of the good policy proposals here as an endorsement of the overall plan presented by the rhetoric, especially the rhetoric at the top of the plan, or of the plan’s sufficiency and that it is okay to ignore and not speak about what the plan ignores and does not speak about.
That rhetoric was alarmingly (but unsurprisingly) terrible, as it is the general administration plan of emphasizing whenever possible that we are in an ‘AI race’ that will likely go straight to AGI and superintelligence even if those words couldn’t themselves be used in the plan, where ‘winning’ is measured in the mostly irrelevant ‘market share.’
And indeed, the inability to mention AGI or superintelligence in the plan leads to such exactly the standard David Sacks lines that toxically center the situation on ‘winning the race’ by ‘exporting the American tech stack.’
I will keep repeating, if necessary until I am blue in the face, that this is effectively a call (the motivations for which I do not care to speculate) for sacrificing the future and get us all killed in order to maximize Nvidia’s market share.
There is no ‘tech stack’ in the meaningful sense of necessary integration. You can run any most any AI model on most any advanced chip, and switch on an hour’s notice.
It does not matter who built the chips. It matters who runs the chips and for whose benefit. Supply is constrained by manufacturing capacity, so every chip we sell is one less chip we have. The idea that failure to hand over large percentages of the top AI chips to various authoritarians, or even selling H20s directly to China as they currently plan to do, would ‘forfeit’ ‘the race’ is beyond absurd.
Indeed, both the rhetoric and actions discussed here do the exact opposite. It puts pressure on others especially China to push harder towards ‘the race’ including the part that counts, the one to AGI, and also the race for diffusion and AI’s benefits. And the chips we sell arm China and others to do this important racing.
There is later talk acknowledging that ‘we do not intend to ignore the risks of this revolutionary technological power.’ But Sacks frames this as entire about the risk that AI will be misused or stolen by malicious actors. Which is certainly a danger, but far from the primary thing to worry about.
That’s what happens when you are forced to pretend AGI, ASI, potential loss of control and all other existential risks do not exist as possibilities. The good news is that there are some steps in the actual concrete plan to start preparing for those problems, even if they are insufficient and it can’t be explained, but it’s a rough path trying to sustain even that level of responsibility under this kind of rhetorical oppression.
The vibes and rhetoric were accelerationist throughout, especially at the top, and completely ignored the risks and downsides of AI, and the dangers of embracing a rhetoric based on an ‘AI race’ that we ‘must win,’ and where that winning mostly means chip market share. Going down this path is quite likely to get us all killed.
I am happy to make the trade of allowing the rhetoric to be optimistic, and to present the Glorious Transhumanist Future as likely to be great even as we have no idea how to stay alive and in control while getting there, so long as we can still agree to take the actions we need to take in order to tackle that staying alive and in control bit – again, the actions are mostly the same even if you are highly optimistic that it will work out.
But if you dismiss the important dangers entirely, then your chances get much worse.
So I want to be very clear that I hate that rhetoric, I think it is no good, very bad rhetoric both in terms of what is present and what (often with good local reasons) is missing, while reiterating that the concrete particular policy proposals were as good as we could reasonably have hoped for on the margin, and the authors did as well as they could plausibly have done with people like Sacks acting as veto points.
That includes the actions on ‘preventing Woke AI,’ which have convinced even Sacks to frame this as preventing companies from intentionally building DEI into their models. That’s fine, I wouldn’t want that either.
Even outlets like Transformer weighed in positively, with them calling the plan ‘surprisingly okay’ and noting its ability to get consensus support, while ignoring the rhetoric. They correctly note the plan is very much not adequate. It was a missed opportunity to talk about or do something about various risks (although I understand why), and there was much that could have been done that wasn’t.
Seán Ó hÉigeartaigh: Crazy to reflect on the three global AI competitions going on right now:
– 1. US political leadership have made AI a prestige race, echoing the Space Race. It’s cool and important and strategic, and they’re going to Win.
– 2. For Chinese leadership AI is part of economic strength, soft power and influence. Technology is shared, developing economies will be built on Chinese fundamental tech, the Chinese economy and trade relations will grow. Weakening trust in a capricious US is an easy opportunity to take advantage of.
– 3. The AGI companies are racing something they think will out-think humans across the board, that they don’t yet know how to control, and think might literally kill everyone.
Scariest of all is that it’s not at all clear to decision-makers that these three things are happening in parallel. They think they’re playing the same game, but they’re not.
I would modify the US political leadership position. I think to a lot of them it’s literally about market share, primarily chip market share. I believe this because they keep saying, with great vigor, that it is literally about chip market share. But yes, they think this matters because of prestige, and because this is how you get power.
My guess is, mostly:
The actions one would take in each of these competitions are often very similar, especially the first three and often the fourth as well, but sometimes are very different. What frustrates me most is when there is an action that is wise on all levels, yet we still don’t do it.
Also, on the ‘preventing Woke AI’ question, the way the plan and order are worded seems designed to make compliance easy and not onerous, but given other signs from the Trump administration lately, I think we have reason to worry…
Fact Post: Trump’s FCC Chair says he will put a “bias monitor” in place who will “report directly” to Trump as part of the deal for Sky Dance to acquire CBS.
Ari Drennen: The term that the Soviet Union used for this job was “apparatchik” btw.
I was willing to believe that firing Colbert was primarily a business decision. This is very different imagine the headline in reverse: “Harris’s FCC Chair says she will put a “bias monitor” in place who will “report directly” to Harris as part of the deal for Sky Dance to acquire CBS.”
Now imagine it is 2029, and the headline is ‘AOC appoints new bias monitor for CBS.’ Now imagine it was FOX. Yeah. Maybe don’t go down this road?
Director Krastios has now given us his view on the AI Action Plan. This is a chance to see how much it is viewed as terrible rhetoric versus its good policy details, and to what extent overall policy is going to be guided by good details versus terrible rhetoric.
Peter Wildeford offers his takeaway summary.
Peter Wildeford: Winning the Global AI Race
- The administration’s core philosophy is a direct repudiation of the previous one, which Kratsios claims was a “fear-driven” policy “manically obsessed” with hypothetical risks that stifled innovation.
- The plan is explicitly called an “Action Plan” to signal a focus on immediate execution and tangible results, not another government strategy document that just lists aspirational goals.
- The global AI race requires America to show the world a viable, pro-innovation path for AI development that serves as an alternative to the EU’s precautionary, regulation-first model.
He leads with hyperbolic slander, which is par for the course, but yes concrete action plans are highly useful and the EU can go too far in its regulations.
There are kind of two ways to go with this.
The AI Action Plan as written was the second one. But you have to do that on purpose, because the default outcome is to shift to the first one.
Executing the ‘American Stack’ Export Strategy
- The strategy is designed to prevent a scenario where the world runs on an adversary’s AI stack by proactively offering a superior, integrated American alternative.
- The plan aims to make it simple for foreign governments to buy American by promoting a “turnkey solution”—combining chips, cloud, models, and applications—to reduce complexity for the buyer.
- A key action is to reorient US development-finance institutions like the DFC and EXIM to prioritize financing for the export of the American AI stack, shifting their focus from traditional hard infrastructure.
The whole ‘export’ strategy is either nonsensical, or an attempt to control capital flow, because I heard a rumor that it is good to be the ones directing capital flow.
Once again, the ‘tech stack’ thing is not, as described here, what’s the word? Real.
The ‘adversary’ does not have a ‘tech stack’ to offer, they have open models people can run on the same chips. They don’t have meaningful chips to even run their own operations, let alone export. And the ‘tech’ does not ‘stack’ in a meaningful way.
Turnkey solutions and package marketing are real. I don’t see any reason for our government to be so utterly obsessed with them, or even involved at all. That’s called marketing and serving the customer. Capitalism solves this. Microsoft and Amazon and Google and OpenAI and Anthropic and so on can and do handle it.
Why do we suddenly think the government needs to be prioritizing financing this? Given that it includes chip exports, how is it different from ‘traditional hard infrastructure’? Why do we need financing for the rest of this illusory stack when it is actually software? Shouldn’t we still be focusing on ‘traditional hard infrastructure’ in the places we want it, and then whenever possible exporting the inference?
Refining National Security Controls
- Kratsios argues the biggest issue with export controls is not the rules themselves but the lack of resources for enforcement, which is why the plan calls for giving the Bureau of Industry and Security (BIS) the tools it needs.
- The strategy is to maintain strict controls on the most advanced chips and critical semiconductor-manufacturing components, while allowing sales of less-advanced chips under a strict licensing regime.
- The administration is less concerned with physical smuggling of hardware and more focused on preventing PRC front companies from using legally exported hardware for large-scale, easily flaggable training runs.
- Proposed safeguards against misuse are stringent “Know Your Customer” (KYC) requirements paired with active monitoring for the scale and scope of compute jobs.
It is great to see the emphasis on enforcement. It is great to hear that the export control rules are not the issue.
In which case, can we stop waving them, such as with H20 sales to China? Thank you. There is of course a level at which chips can be safely sold even directly to China, but the experts all agree the H20 is past that level.
The lack of concern about smuggling is a blind eye in the face of overwhelming evidence of widespread smuggling. I don’t much care if they are claiming to be concerned, I care about the actual enforcement, but we need enforcement. Yes, we should stop ‘easily flaggable’ PRC training runs and use KYC techniques, but this is saying we should look for our keys under the streetlight and then if we don’t find the keys assume we can start our car without them.
Championing ‘Light-Touch’ Domestic Regulation
- The administration rejects the idea of a single, overarching AI law, arguing that expert agencies like the FDA and DOT should regulate AI within their specific domains.
- The president’s position is that a “patchwork of regulations” across 50 states is unacceptable because the compliance burden disproportionately harms innovative startups.
- While using executive levers to discourage state-level rules, the administration acknowledges that a durable solution requires an act of Congress to create a uniform federal standard.
Yes, a ‘uniform federal standard’ would be great, except they have no intention of even pretending to meaningfully pursue one. They want each federal agency to do its thing in its own domain, as in a ‘use case’ based AI regime which when done on its own is the EU approach and doomed to failure.
I do acknowledge the step down from ‘kill state attempts to touch anything AI’ (aka the insane moratorium) to ‘discourage’ state-level rules using ‘executive levers,’ at which point we are talking price. One worries the price will get rather extreme.
Addressing AI’s Economic Impact at Home
- Kratsios highlights that the biggest immediate labor need is for roles like electricians to build data centers, prompting a plan to retrain Americans for high-paying infrastructure jobs.
- The technology is seen as a major productivity tool that provides critical leverage for small businesses to scale and overcome hiring challenges.
- The administration issued a specific executive order on K-12 AI education to ensure America’s students are prepared to wield these tools in their future careers.
Ahem, immigration, ahem, also these things rarely work, but okay, sure, fine.
Prioritizing Practical Infrastructure Over Hypothetical Risk
- Kratsios asserts that chip supply is no longer a major constraint; the key barriers to the AI build-out are shortages of skilled labor and regulatory delays in permitting.
- Success will be measured by reducing the time from permit application to “shovels in the ground” for new power plants and data centers.
- The former AI Safety Institute is being repurposed to focus on the hard science of metrology—developing technical standards for measuring and evaluating models, rather than vague notions of “safety.”
It is not the only constraint, but it is simply false to say that chip supply is no longer a major constraint.
Defining success in infrastructure in this way would, if taken seriously, lead to large distortions in the usual obvious Goodhart’s Law ways. I am going to give the benefit of the doubt and presume this ‘success’ definition is local, confined to infrastructure.
If the only thing America’s former AISI can now do are formal measured technical standards, then that is at least a useful thing that it can hopefully do well, but yeah it basically rules out at the conceptual level the idea of actually addressing the most important safety issues, by dismissing them are ‘vague.’
This goes beyond ‘that which is measured is managed’ to an open plan of ‘that which is not measured is not managed, it isn’t even real.’ Guess how that turns out.
Defining the Legislative Agenda
- While the executive branch has little power here, Kratsios identifies the use of copyrighted data in model training as a “quite controversial” area that Congress may need to address.
- The administration would welcome legislation that provides statutory cover for the reformed, standards-focused mission of the Center for AI Standards and Innovation (CAISI).
- Continued congressional action is needed for appropriations to fund critical AI-related R&D across agencies like the National Science Foundation.
TechCrunch: 20 national security experts urge Trump administration to restrict Nvidia H20 sales to China.
The letter says the H20 is a potent accelerator of China’s frontier AI capabilities and could be used to strengthen China’s military.
Americans for Responsible Innovation: The H20 and the AI models it supports will be deployed by China’s PLA. Under Beijing’s “Military-Civil Fusion” strategy, it’s a guarantee that H20 chips will be swiftly adapted for military purposes. This is not a question of trade. It is a question of national security.
It would be bad enough if this was about selling the existing stock of H20s, that Nvidia has taken a writedown on, even though it could easily sell them in the West instead. It is another thing entirely that Nvidia is using its capacity on TSMC machines to make more of them, choosing to create chips to sell directly to China instead of creating chips for us.
Ruby Scanlon: Nvidia placed orders for 300,000 H20 chipsets with contract manufacturer TSMC last week, two sources said, with one of them adding that strong Chinese demand had led the US firm to change its mind about just relying on its existing stockpile.
It sounds like we’re planning on feeding what would have been our AI chips to China. And then maybe you should start crying? Or better yet tell them they can’t do it?
I share Peter Wildeford’s bafflement here:
Peter Wildeford: “China is close to catching up to the US in AI so we should sell them Nvidia chips so they can catch up even faster.”
I never understand this argument from Nvidia.
The argument is also false, Nvidia is lying, but I don’t understand even if it were true.
There is only a 50% premium to buy Nvidia B200 systems within China, which suggests quite a lot of smuggling is going on.
Tao Burga: Nvidia still insists that there’s “no evidence of any AI chip diversion.” Laughable. All while lobbying against the data center chip location verification software that would provide the evidence. Tell me, where does the $1bn [in AI chips smuggled to China] go?
Rob Wiblin: Nvidia successfully campaigning to get its most powerful AI chips into China has such “the capitalists will sell us the rope with which we will hang them” energy.
Various people I follow keep emphasizing that China is smuggling really a lot of advanced AI chips, including B200s and such, and perhaps we should be trying to do something about it, because it seems rather important.
Chipmakers will always oppose any proposal to track chips or otherwise crack down on smuggling and call it ‘burdensome,’ where the ‘burden’ is ‘if you did this they would not be able to smuggle as many chips, and thus we would make less money.’
Reuters Business: Demand in China has begun surging for a business that, in theory, shouldn’t exist: the repair of advanced v artificial intelligence chipsets that the US has banned the export of to its trade and tech rival.
Peter Wildeford: Nvidia position: “datacenters from smuggled products is a losing proposition […] Datacenters require service and support, which we provide only to authorized NVIDIA products.”
Reality: Nvidia AI chip repair industry booms in China for banned products.
Scott Bessent Warns TSMC’s $40 billion Arizona fab that could meet 7% of American chip demand keeps getting delayed, and blames inspectors and red tape. There’s confusion here in the headline that he is warning it would ‘only’ meet 7% of demand, but 7% of demand would be amazing for one plant and the article’s text reflects this.
Bessent criticized regulatory hurdles slowing construction of the $40 billion facility. “Evidently, these chip design plants are moving so quickly, you’re constantly calling an audible and you’ve got someone saying, ‘Well, you said the pipe was going to be there, not there. We’re shutting you down,’” he explained.
It does also mean that if we want to meet 100% or more of demand we will need a lot more plants, but we knew that.
Epoch reports that Chinese hardware is behind American hardware, and is ‘closing the gap’ but faces major obstacles in chip manufacturing capability.
Epoch: Even if we exclude joint ventures with U.S., Australian, or U.K. institutions (where the developers can access foreign silicon), the clear majority of homegrown models relied on NVIDIA GPUs. In fact, it took until January 2024 for the first large language model to reportedly be trained entirely on Chinese hardware, arguably years after the first large language models.
…
Probably the most important reason for the dominance of Western hardware is that China has been unable to manufacture these AI chips in adequate volumes. Whereas Huawei reportedly manufactured 200,000 Ascend 910B chips in 2024, estimates suggest that roughly one million NVIDIA GPUs were legally delivered to China in the same year.
That’s right. For every top level Huawei chip manufactured, Nvidia sold five to China. No, China is not about to export a ‘full Chinese tech stack’ for free the moment we turn our backs. They’re offering downloads of r1 and Kimi K2, to be run on our chips, and they use all their own chips internally because they still have a huge shortage.
Put bluntly, we don’t see China leaping ahead on compute within the next few years. Not only would China need to overcome major obstacles in chip manufacturing and software ecosystems, they would also need to surpass foreign companies making massive investments into hardware R&D and chip fabrication.
Unless export controls erode or Beijing solves multiple technological challenges in record time, we think that China will remain at least one generation behind in hardware. This doesn’t prevent Chinese developers from training and running frontier AI models, but it does make it much more costly.
Overall, we think these costs are large enough to put China at a substantial disadvantage in AI scaling for at least the rest of the decade.
Beating China may or may not be your number one priority. We do know that taking export controls seriously is the number one priority for ‘beating China.’
Intel will cancel 14A and following nodes, essentially abandoning the technological frontier, if it cannot win a major external customer.
2025-08-01 18:27:15
Published on August 1, 2025 10:27 AM GMT
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.
Apply now to join researchers worldwide in advancing AI safety.
Problem statement: Current alignment methods can’t guarantee that an AI’s goals and actions match human intentions (Carlsmith, 2023). In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. The field of AI Control aims to reduce this risk through techniques such as monitoring of untrusted AI systems by other AI models and restricting the affordances of untrusted AI systems (Shlegeris and Greenblatt, 2024).
The aim is to develop control protocols – plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations – testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.
Why this matters: Control measures provide a practical approach to mitigating loss of control risks that doesn’t require fundamental research breakthroughs. Control methods can be evaluated and deployed using models with current capability levels, making these methods particularly useful for responsible AI development in the near term. For an overview of how control measures can scale to increasingly powerful systems, see Korbak et al., 2025a).
We organize this part of the research agenda into seven subsections. First three sub-sections focus on empirical investigations targeting key identified threat models:
The following four sub-sections focus oncross-cutting research across multiple threat models:
We believe that many of these research areas require the efforts of entire research teams (3-10 people). We end the section with a call for new research groups focused on addressing different parts of the research agenda.
Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over scheming AI models. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).
Key subproblems:
Building high-stakes control settings. Settings are empirical research environments, for example an environment containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage (or proxies for sabotage) and the effectiveness of control protocols in preventing it. They contain main tasks (e.g., legitimate software engineering tasks for a coding agent ), as well as side tasks that a misaligned model might attempt to conduct (e.g., reducing security restrictions that protect model weights). See ControlArena for example settings. A diverse range of settings for the study of control techniques would be extremely helpful for advancing control research. Possible further settings may include various software engineering settings, settings where models have access to a shell, or settings where models run ML experiments. Building settings is an active area of research because it is difficult to build settings that fulfill at least the following desiderata (Shlegeris, 2025):
A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows the study of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt 2025 for more context. It is important to run control evaluations in these realistic settings to validate their usefulness for control research.
High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al. 2023 and Bhatt et al. 2025. Valuable projects may include:
Developing and evaluating processes for human auditing (with AI uplift). AI control benefits from having the ability to audit suspicious activity with human labor. It would be very useful to anticipate potential issues with our ability to use humans (who may be aided by AI systems) for auditing. An additional challenge worth studying, in the context of automated AI R&D, is that auditors may often be unfamiliar with the code they’re reviewing.
Some valuable research directions here include:
Developing design patterns for limiting excessive agency. One way to reduce risks from schemers is to limit the potential paths through which they can cause harm. Adopting a principle of least privilege may be appropriate, where potentially scheming models have exactly the affordances they need in order to do their job. In the context of hierarchical agent structures, instances of the model higher up in the hierarchy could be set up to have more context but fewer direct affordances; creating strategies and making decisions, but then delegating sub-tasks. In contrast, instances lower in the hierarchy could have less context, but they may be able to write and execute code. In general, the situation would (all else equal) be safer the more we can encapsulate the key affordances of AIs in safeguarded APIs. In order to be useful, a project in this area would probably have to run empirical control evaluations which test structural safety measures such as:
Problem summary: In the future, potentially scheming AIs will likely be used for large amounts of intellectual work, in particular AI R&D. A worry here is that they might systematically sabotage safety-relevant research, preventing us from controlling increasingly capable AI systems.
Why this matters: Preventing research sabotage matters insofar as we rely on AIs to help perform research that enables alignment of future even more capable AIs. Methods focused on catching clearly incriminating actions (such as the ones described in the High-stakes/concentrated AI safety failures section) may not be suitable for detecting this type of sabotage, as research sabotage is likely to be difficult to distinguish from lack of competence.
Scheming-robust scalable oversight research. Scalable or recursive oversight is a technique where training signals for more powerful models are provided by less capable models. We are interested in research on scalable oversight techniques robust to intentional subversion by the monitored models. We envision that this kind of scalable oversight research is done using control-style adversarial analysis (see Shlegeris and Greenblatt, 2025). Examining approaches to increasing robustness of AI safety via debate (Irving et al., 2018) offers one promising starting direction for this research.
Studying the sample efficiency of online training. Monitors might detect that models behave in undesirable but non-catastrophic ways during deployment. In such cases, AI developers are unlikely to shut down the deployment, but they might be motivated to perform online training to remove the undesirable behaviour. We’re interested in studying the sample efficiency of online training in order to understand the prospects of successfully training away undesirable scheming behaviours. Studying this requires developing malign initialisations, i.e., models which are trained to resist online training.
Mitigating sandbagging. A particular worry with misaligned models is that they might deliberately underperform in cases where high performance would be desirable such as dangerous capability evaluations, AI safety research, or when performing oversight over other AI models (see Shlegeris and Stastny, 2025 for detailed analysis). Previous empirical work has studied this threat model using password-locked models (see Greenblatt et al., 2025). We are interested in extensions of this work, as well as alternative approaches to studying sandbagging. One example project here would be to study password-locked reasoning models.
Problem summary: We are interested in better understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:
We would like to advance our understanding of this threat model, and assess possible countermeasures.
Why this matters: Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts (dataset, logs and model behaviour on normal prompts) look benign.
Key subproblems:
Studying foundational questions. We’d like to get more clarity on the various possible threat models:
Studying countermeasures. There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:
Problem summary: Identification of the broad range of possible paths through which misaligned AI systems might cause catastrophic harm.
Why this matters: To have high confidence that a system of control protocols is sufficient to prevent catastrophic outcomes, it is essential to identify and evaluate the high-level pathways to harm (e.g., code sabotage, or human influencing lab employees).
We emphasise that the scale of work required here is substantial – it involves attempting to identify all the ways in which intelligent systems could sabotage complex safety systems.
Suggestions to learn more: For a literature review of possible risk analysis techniques, see Koessler and Schuett 2023. See the high level risk modelling approaches proposed by model developers (Anthropic, 2025, Google Deepmind, 2025, OpenAI, 2025), and examples of safety cases (Korbak et al., 2025b) and Systematic Hazard Analysis (Mylius, 2025) as two structured approaches.
Problem summary: These investigations explore ways to increase the monitoring capabilities of trusted AI models, that are otherwise less capable than their untrusted counterparts. Promising research directions here include the creation of reliable evaluations of model trustedness,as well as developing various ways of narrowly increasing model capabilities relevant to AI monitoring without increasing model scheming capabilities or propensity towards scheming.
Why this matters: Current proposed control protocols often benefit substantially from access to capable trusted models for monitoring and strategies like rewriting or editing potentially backdoored code when untrusted model outputs exceed suspicion thresholds (Greenblatt et al., 2023).
Key subproblems:
Developing trustedness evaluations. As models become more competent in the future, it will become less clear whether they can be trusted to not be scheming to escape human control (Järviniemi, 2024). Trustedness evaluations aim to evaluate an AI model’s capability to "fake" alignment during training and evaluation and then behave unsafely at deployment (see Mallen et al., 2024 for previous work).
Improving the usefulness of trusted monitors. For instance:
Building capable models with narrow capability profiles. For instance:
Problem summary: Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures.
Why this matters: CoT monitorability matters because it's an additional safety layer that's already working today—catching models planning misbehaviour in their reasoning traces. No fundamental breakthroughs are needed. Without preserving this capability, we are flying blind since we have no other reliable window into models' intentions or planning processes as they become more capable (Korbak et al., 2025c).
Key subproblems:
Evaluating monitorability. Monitoring systems often rely on the assumption that CoTs are monitorable: accurately reflect models’ actual reasoning. Faithfulness (e.g. Lanham et al., 2023, Turpin et al., 2023, Roger et al., 2025, Chen et al., 2025) has been proposed as a proxy measure for monitorability. However, existing faithfulness evaluations do not disambiguate the propensity of models to externalise their reasoning, with the necessity of externalising their reasoning. Moreover, they typically only consider simple question answering settings, failing to identify the level of monitorability that is sufficient for ensuring safety in more complex domains. We’re interested in the development of novel evaluations: those measuring monitorability directly as well as those that target specific properties underlying monitorability. A suite of evaluations covering capabilities that determine the effectiveness of different CoT monitoring protocols includes:
Preserving monitorability. Monitorability of chains of thought should not be taken for granted. More research is needed to understand what degrades monitorability and which techniques could help preserve it. Example projects include:
Problem summary: There is a diverse range of research areas for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.
Key subproblems:
Catching natural examples of scheming and studying them. Finding completely natural examples of scheming in deployment would likely be most informative. Alternatively, it might be useful to develop rigged training processes that make scheming more likely, with approaches similar to those used to research model organisms of misalignment (see Hubinger et al., 2024, Turner et al., 2025). It might also be helpful to rig the training process in a way that produces a schemer that is easier to catch, e.g. by making scheming show up at lower capability levels or by training the schemer on synthetic facts in order to weaken its situational awareness. However, it should be noted that this risks making the schemer too disanalogous from the real threat. Some methods for making scheming more likely include:
Developing methods for few-shot catastrophe prevention. Related to the project above, but plausibly possible to study without natural examples, we are interested in techniques for improving safety in the case where a model has been caught misbehaving a small number of times:
Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (see Shlegeris, 2024 for detailed analysis.) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).
Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible, however, that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.
Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment.
Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).
Developing techniques for white-box control. We are interested in an increased understanding of the limits of white-box techniques for control. These could include linear probes for deception (see Goldowsky-Dill, 2025), or probes trained directly on off-policy malicious trajectories—there are many techniques in the literature that have been proposed. A key difficulty here is the development of realistic model organisms that can serve as test beds for white-box methods. For more details on research questions related to whitebox methods see also the “Interpretability” section.
Understanding the limits of synthetic input generation. It would be very useful if we had the ability to generate synthetic inputs that powerful LLMs can’t distinguish from “real” inputs. For instance, we may want to prevent a model from being able to confidently infer from its prompt that it is in an evaluation (see Apollo, 2025). We might also want to teach models certain synthetic facts, but there are open questions about how well this will work for powerful models. See Stastny and Greenblatt 2025 and Greenblatt 2025 for more details on this area of research.
Studying abstract analogies for scheming. In order to empirically study risks from schemers, we can try to study methods to train away deeply ingrained behaviours (which don’t necessarily have to look like misbehaviour) in LLMs that are structurally similar to scheming. Two proposed example projects in this area can be found in Stastny 2025.
Analyzing human insider threat parallels to AI scheming risk models. There exists extensive research on insider threat detection in domains such as cyber-security and national security (e.g. Capelli et al., 2012; Homoliak et al., 2019). This project involves conducting a literature review, to identify learnings that can be carried over to mitigating threats from scheming AI systems. It may also involve reviewing broader literature on sabotage of organisations such as the CIA Simple Sabotage Field Menu (CIA, 1944). The outputs of this project could be both high level security principles e.g. approaches to layered defense or proposals for implementation of specific techniques e.g. honeypots and tripwires.
In the previous section, we described project proposals of differing scopes illustrating the different axes along which the field of AI control can be advanced. We believe that many of these research areas require the efforts of entire research teams (3-10 people), and should possibly be undertaken by new research groups. In particular, we want to encourage the formation of new research groups that undertake ambitious projects including:
View The Alignment Project website to learn more or apply for funding here.
2025-08-01 18:27:07
Published on August 1, 2025 10:27 AM GMT
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.
Apply now to join researchers worldwide in advancing AI safety.
Post-training and elicitation techniques allow us to refine model behaviour for specific safety properties, providing crucial tools for alignment. This section explores diverse applications: consistency training that aligns models without human labels, reward model improvements that resist exploitation, control protocols for managing potentially deceptive systems, and methods to preserve reasoning transparency.
Problem summary: To supervise advanced AI, one approach is to scale unsupervised training methods to apply to domains where supervision is beyond human ability to cheaply label. For example, Wen et al., 2025 nearly match supervised baselines by training to maximise mutual predictability of labels while minimising logical contradictions. Both empirical and theoretical study of this method are necessary to understanding limitations, and scaling properties. For instance, can the method be used to train away adversarial, misaligned initialization? How does the Wen et al. method relate to Bayesian methods like belief propagation, and can these be used to separate misaligned from aligned optima?
Why this matters: Consistency training could be a useful way to extend other alignment methods based on supervised learning to tasks where ground truth is hard to measure. For example, models could be used on long-horizon tasks like experimental design advice where feedback is only possible after the experiment has been run – to ensure models are honest and aligned on such tasks, we would need unsupervised scalable oversight methods. Conversely, some such methods could lead to misalignment, and ideally we would be able to identify this in advance.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: We’d recommend reading Wen et al., 2025. Related research includes consistency over latent representations (Burns et al., 2022).
Problem summary: Reward models are used to distill human judgment; this amortizes the cost of laborious human labeling. Post-training for LLMs uses reward model judgments as training signal for open-ended, and non-verifiable questions. However, the reward model is a proxy which is inherently imperfect and can be gamed. Models can learn to reward hack, producing outputs which score highly on the reward signal without holding up to careful scrutiny. This problem is exacerbated by the scalable oversight challenge: as AI capabilities increase, models can produce outputs (e.g., complex code, scientific analyses) that are too time-consuming or difficult for human supervisors to evaluate accurately. We need methods and evaluations for scalable use of reward models. To develop such methods, theoretical models of the obstacles to reward modelling are needed. These obstacles include both label error induced by the limitations of human judgment, and generalisation error in the reward model itself. Flaws in human judgment can be modelled by partial observability or time-complexity gaps between the model and the human rater—a reward model trained on flawed human judgments can learn to systematically approve outputs that are subtly incorrect, manipulative, or harmful.
Why this matters: A sufficiently capable but misaligned model could learn to systematically exploit the blind spots in a reward model to achieve undesirable goals. For example, a model could produce outputs that appear helpful and correct to a human rater but contain hidden steganographic messages or dishonestly obfuscate errors in assumptions made by the model. In the worst case, a flawed reward model doesn't just lead to poor performance; it can actively train a model to deceive human supervisors without being caught.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: Previous research in this area includes Gao et al., 2022.
Problem summary: Future AI systems may be used for large amounts of intellectual work, including in AI research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. This exposes us to risk that these systems might systematically sabotage safety-relevant research, which would frustrate our efforts to develop more powerful safety techniques. We would like to develop methods to prevent this sabotage, including by improving the ability of trusted AI systems to monitor untrusted systems.
Why this matters: Preventing research sabotage matters insofar as we rely on AIs to help perform research that enables alignment of future even more capable AIs. Methods focused on catching clearly incriminating actions may not be suitable for detecting this type of sabotage, as research sabotage is likely to be difficult to distinguish from lack of competence. Current proposed control protocols often benefit substantially from access to capable trusted models, for monitoring and strategies like rewriting or editing potentially backdoored code when untrusted model outputs exceed suspicion thresholds (Greenblatt et al., 2023).
Key subproblems:
Scheming-robust scalable oversight research. Scalable or recursive oversight is a technique where training signals for more powerful models are provided by less capable models. We are interested in research on scalable oversight techniques robust to intentional subversion by the monitored models.We envision that this kind of scalable oversight research is done using control-style adversarial analysis (see Shlegeris and Greenblatt, 2025). Examining approaches to increasing robustness of AI safety via debate (Irving et al., 2018) offers one promising starting direction for this research.
Studying the sample efficiency of online training. Monitors might detect that models behave in undesirable but non-catastrophic ways during deployment. In such cases, AI developers are unlikely to shut down the deployment, but they might be motivated to perform online training to remove the undesirable behaviour. We’re interested in studying the sample efficiency of online training in order to understand the prospects of successfully training away undesirable scheming behaviours. Studying this requires developing malign initialisations, i.e., models which are trained to resist online training.
Mitigating sandbagging. A particular worry with misaligned models is that they might deliberately underperform in cases where high performance would be desirable, such as dangerous capability evaluations, AI safety research or when performing oversight over other AI models (see Shlegeris and Stastny, 2025 for detailed analysis). Previous empirical work has studied this threat model using password-locked models (see Greenblatt et al., 2025). We are interested in extensions of this work, as well as alternative approaches to studying sandbagging. One example project here would be to study password-locked reasoning models.
Improving the usefulness of trusted monitors. For instance:
Building capable models with narrow capability profiles. For instance:
Utilising unlearning to remove scheming capabilities.
View The Alignment Project website to learn more or apply for funding here.
2025-08-01 18:26:59
Published on August 1, 2025 10:26 AM GMT
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.
Apply now to join researchers worldwide in advancing AI safety.
Benchmarks translate alignment's abstract conceptual challenges into measurable tasks that researchers can systematically study and use to evaluate solutions. For alignment, this means creating tests that capture the difficulties of ensuring AI systems behave as intended—from evaluating behaviours we can't efficiently oversee, to testing against models that might actively try to deceive us. This section develops evaluation infrastructure in two areas:
Problem summary: As highly capable AI systems are deployed on high-stakes tasks humans will be unable to evaluate them, and new methods and evaluations are required to surface emergent safety failures. We are particularly concerned with alignment of AI systems which are used by humans to execute AI research tasks. We need benchmarks which quantify performance on the broad scope of tasks models will be used for, including tasks without uniquely verifiable solutions, like research advice. We also need benchmarks and datasets quantifying known issues.
Why this matters: Scalable oversight is the problem of correctly rewarding intended behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A robust scalable oversight method is needed to safely train models with highly advanced capabilities, and debate is among the most promising candidates. More generally, we believe AI systems themselves will likely be used heavily in AI research, but that this may not happen safely by default. For iterating on scalable oversight methods, benchmarks and datasets make it easier to identify issues with existing methods, iterate and improve.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: Bowman 2022 provides background on benchmark design for scalable oversight. Previous research in this area includes Tan et al., 2024, Lang et al., 2024, and Farquhar et al., 2025).
Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over misaligned AI models which are deliberately aiming to subvert human control. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).
Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.
Key subproblems:
Building high-stakes control settings. Settings are empirical research environments, for example containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage and the effectiveness of control protocols in preventing it. They contain main tasks, like legitimate software engineering tasks for a coding agent, as well as side tasks that a misaligned model might attempt to conduct, like reducing security restrictions that protect model weights. See ControlArena for example settings. A diverse range of settings meeting the following desiderata would substantially advance control research (Shlegeris, 2025):
A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows studying of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt, 2025 for more context. It’s important to run control evaluations in these realistic settings to validate their usefulness for control research.
High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al., 2023 and Bhatt et al., 2025. Valuable projects may include:
Understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:
We would like to advance our understanding of this threat model, and assess possible countermeasures.Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts—dataset, logs and model behaviour on normal prompts—look benign.
There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:
Problem summary: There are a diverse range of research areas, for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.
Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.
Key subproblems:
Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (Shlegeris, 2024) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).
Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible however that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.
Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment.
Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).
Capability evaluations of latent reasoning.
Studying chain of thought (CoT). Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures:
CoT obfuscation capability evaluations: Determining whether models can obfuscate otherwise suspicious reasoning within their CoTs on purpose. These evaluations would help establish the extent to which models are capable of hiding malicious reasoning even if their CoT was not optimised against a monitor during training.
View The Alignment Project website to learn more or apply for funding here.
2025-08-01 18:26:50
Published on August 1, 2025 10:26 AM GMT
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.
Apply now to join researchers worldwide in advancing AI safety.
Interpretability provides access to AI systems' internal mechanisms, offering a window into how models process information and make decisions. These tools let us examine neural representations to detect hidden properties like deception or latent capabilities, while also enabling direct intervention to modify model behaviour. As models grow more sophisticated, interpretability may be crucial for maintaining our understanding of their behaviour.
Problem summary: Can we determine when AI systems are lying to us? How can we make arguments that techniques developed in one context or on current models will work on future models?
Why this matters: Many risks posed by future AI systems would be significantly mitigated if we could ensure that these systems are honest and forthcoming. Success here would contribute both to measuring AI system propensities as well as mitigating risks from deceptive systems.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: We’d recommend reading Goldowsky-Dill et al., 2025, Marks et al., 2023 and Azaria et al., 2023 for context on simple approaches to lie detection with linear probes. Chowdury et al., 2025 may also be useful for further context on truthfulness in frontier AI systems. Farquhar et al., 2023 provides some insights into the limitations of prior unsupervised LLM knowledge discovery techniques.
Problem summary: Some novel interpretability strategies that require significant resource investments may substantially advance our insight into neural network internals (Bricken et al., 2023; Templeton et al., 2024; Lindsey et al., 2025). There are many opportunities to advance this area of interpretability.
Why this matters: Current interpretability techniques may not work for future models. Possible challenges include the cost of current techniques, known issues with current methods and unclear robustness to future AI development trajectories which may involve superhuman AI.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Problem summary: Although some early studies hint at methods that could intervene in a model’s internal representations during inference, none have yet been shown to be a reliable, central technique (Abdelnabi et al., 2025, Turner et al., 2023, Li et al., 2023).
Why this matters: Combined with scalable interpretability strategies for tracking run-time information, the possible applications of being able to directly edit model ‘beliefs’ or steer completions (such as towards truthfulness) could be used for both estimating and mitigating risks posed by future AI systems.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Problem summary: We like to determine whether sufficient information is contained within AI systems to construct contexts in which they might behave maliciously (the problem of 'eliciting bad contexts'; Irving et al., 2025).
Why this matters: In theory, a solution to eliciting bad contexts might be useful for generating datasets that mimic deployment sufficiently effectively that we can minimise distribution shift from training to deployment. This means that training methods to produce aligned systems are more likely to generalise to deployment and withstand adversarial optimisation pressure to move away from the aligned state via prompting.
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Irving et al., 2025.
View The Alignment Project website to learn more or apply for funding here.
2025-08-01 18:26:42
Published on August 1, 2025 10:26 AM GMT
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.
Apply now to join researchers worldwide in advancing AI safety.
Problem summary: Modern AI models (LLMs and associated agents) depend critically on human supervision in the training loop: AI models do tasks, and human judges express their preferences about their outputs or performance. These judgments are used as training signals to improve the model. However, there are many reasons why humans could introduce systematic error into a training signal: they have idiosyncratic priors, limited attention, systematic cognitive biases, reasoning flaws, moral blind spots, etc. How can we design evaluation pipelines that mitigate these flaws to minimise systematic error, so that the models are safe for deployment?
For the sub-problems below, we specify the problem as follows:
The core question then becomes: given the potential limitations of the human judges, how can we align the model to produce outputs that are ultimately good rather than bad? We think of this as a mechanism design problem: we want to find ways to harness human metacognition, the wisdom of the crowds, collective deliberation or debate, and the use of AI models as thought partners to improve the alignment problem.
We have both scientific goals and engineering goals. The engineering goals are focussed on post-training pipeline design and are geared to find ways that improve alignment. The scientific questions are about the nature of the cognitive or social processes that make weak-to-strong alignment either more or less challenging. We think that this is an interesting problem in its own right (for example, you can see representative democracy as a special case of weak-to-strong alignment). Some of the scientific goals build on work already conducted in other domains.
Why this matters: In some cases, the model being trained may be more capable than the humans providing training signals. For example, the model may be writing code for human judges, who cannot tell whether or how the code will execute just from reading it. The problem of correctly rewarding desired behaviours for AI systems when those behaviours are beyond humans’ ability to efficiently judge is called “scalable oversight”. The idea behind scalable oversight methods is to reduce harder problems to easier ones where human judgment can be relied upon. For example, judges might be assisted by another AI model (a rater assistant), or judge the outcome of a debate between different strong models (safety by debate).
Examples of subproblems:
The problems summarised below are inspired by the initial thinking from Irving and Askell, 2019 (this paper focuses on debate, but the ideas are relevant to other methods).
Suggestions to learn more: We’d recommend reading Bridgers et al., 2024 and Irving and Askell (2019).
Problem summary: As scalable oversight protocols are developed, protocols should be empirically tested – and where LLMs are not yet capable enough to carry out the protocols, human participants can be used instead.
Why this matters: Scalable oversight is the problem of correctly rewarding desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A scalable oversight method is therefore needed to supervise for honesty, helpfulness and harmlessness in models with capabilities exceeding human judges.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: We’d recommend reading Irving and Askell (2019) . Previous research in this area includes Parrish et al., 2022, Michael et al., 2023, and Rahman et al., 2025).
In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. The field of AI Control aims to reduce this risk through techniques like monitoring by other AI models and restricting the affordances of untrusted AI systems (Shlegeris and Greenblatt 2024).
The early stages of an AI control protocol will usually involve automated filters and modification via machine learning classifiers or other tools. However, we expect that any costly action taken as a result of these automated filters will eventually involve human input for confirmation. Therefore, it will be critical to understand how control protocols involving humans operate, both by applying prior knowledge from cognitive science about human behaviour and cognitive biases, and via end-to-end tests of control protocols involving humans.
Why this matters: Control measures provide a practical safety approach to mitigating loss of control risks that doesn’t require fundamental research breakthroughs. Control methods can be evaluated and deployed using current capabilities, making them particularly useful for responsible AI development in the near term. For an overview of how control measures can scale to increasingly powerful systems see Korbak et al., 2025a).
Developing and evaluating processes for human auditing (with AI uplift). AI control benefits from having the ability to audit suspicious activity with human labor. It would be very useful to anticipate potential issues with our ability to use humans (who may be aided by AI systems) for auditing. An additional challenge worth studying, in the context of automated AI R&D, is that auditors may often be unfamiliar with the code they’re reviewing.
Some valuable research directions here include:
Developing auxiliary methods to aid with few-shot catastrophe prevention (Greenblatt and Shlegeris, 2024), such as proliferation and interrogation techniques (focusing on methods that particularly benefit from execution by or with the help of competent humans).
View The Alignment Project website to learn more or apply for funding here.