
Can We Make AI Alignment Framing Less Wrong?


Published on January 8, 2026 3:20 PM GMT

What Feels Wrong Today

Reading through the recent “vision for alignment” writings, I notice a convergence in underlying posture even when the techniques differ: alignment is increasingly framed as a problem of 'control'. The dominant focus is on how to constrain, monitor and recover from systems we do not fully trust. Primary safety leverage is placed in externally applied oversight, evaluation and safeguard stacks.

This does not mean control-oriented approaches are wrong. Control is an essential component of any system deployment. What feels wrong is that control now appears to dominate how alignment is conceptualized, crowding out other ways of thinking about what it would mean for advanced systems to behave well.

This dominance of control-reliant framing is echoed in the stated visions and research directions of the frontier labs. Some examples...

Anthropic's Vision

Anthropic provides an unusually clear articulation of the control-first posture, both in informal alignment writing and in formal governance commitments. Its Responsible Scaling Policy (RSP) and subsequent updates define safety in terms of graduated safety and security measures that scale with model capability:

  • “AI Safety Level Standards (ASL Standards) are core to our risk mitigation strategy… As model capabilities increase, so will the need for stronger safeguards…”
  • The update reiterates a commitment “not to train or deploy models unless we have implemented adequate safeguards.” 

In the recent essay Alignment remains a hard, unsolved problem, the author frames 'outer alignment' largely as the problem of oversight, writing (emphasis mine):

  • “the problem of scaling up human oversight … to … oversee systems that are smarter than we are.”
  • “behavioral oversight is very likely to get harder and harder as models get more capable…”
  • Interpretability is discussed in this essay largely as a way to preserve or extend these oversight/feedback loops, including “monitor[ing] for misalignment during training”.

OpenAI's Vision

OpenAI’s public alignment and safety posture is articulated most clearly through its Preparedness Framework, which specifies how alignment is to be operationalized as models advance. The framework emphasizes identifying and measuring severe risks from increasingly capable models, and tying deployment decisions to the presence of appropriate safeguards.

The document further structures alignment around capability tracking, risk thresholds, and mitigations, categorizing risks (e.g., cyber, bio, autonomy, persuasion) and tying deployment decisions to the presence of safeguards. 

OpenAI’s more recent domain-specific safety posts follow the same pattern. For example, in its discussion of AI and cyber risk, OpenAI frames safety as a matter of layered defences, combining monitoring, access controls, and policy enforcement to prevent misuse as capabilities grow.

Google DeepMind's Vision

Google DeepMind’s public “vision for alignment” reads as a two-track control stack: (1) build better model-level mitigations (training + oversight), but (2) assume you still need system-level safety frameworks - capability thresholds, monitoring, access controls, and protocols that kick in as risks rise.

This is stated quite explicitly across the places where DeepMind actually lays out its strategy:

  • In DeepMind’s AGI safety write-up (as summarized on the Alignment Forum), their approach distinguishes model-level mitigation from system-level control.
  • On DeepMind’s blog, Taking a responsible path to AGI foregrounds proactive risk assessment, monitoring, and security measures. It explicitly says:

    “Through effective monitoring and established computer security measures, we’re aiming to mitigate harm…” and ties “transparency” to interpretability as a facilitating layer: “We do extensive research in interpretability…” 

  • Their Frontier Safety Framework updates make the “thresholds + protocols” posture concrete. In the 2025 strengthening update, they introduce critical capability levels (CCLs) and expand protocols around misalignment scenarios.
  • A Pragmatic Vision for Interpretability and How Can Interpretability Researchers Help AGI Go Well? mark a shift in how interpretability is justified - away from being presented as a foundational solution to alignment, and toward serving control-relevant safety roles…like monitoring for misalignment, debugging failures, and preserving oversight as capabilities scale.

Generalizing

These reflect the same broader pattern: alignment is conceptualized primarily as a problem of applying the right type and right level of control.

A recent field-level snapshot reinforces this picture in unusually direct terms. In AI in 2025: gestalt, a year-end synthesis of the technical alignment landscape, the current state of the field is described as follows:

“The world’s de facto strategy remains ‘iterative alignment’, optimising outputs with a stack of alignment and control techniques everyone admits are individually weak.”

The same piece is explicit about the fragility of this approach, noting that:

“current alignment methods are brittle”

and that many of the practical gains observed over the past year are not due to fundamental improvements in model robustness, but instead come from external safeguards, such as auxiliary models and layered defences. Alignment is seen less as a principled account of what trustworthy system behavior should look like, and more as an accumulation of layered mitigations: evaluate, patch, scaffold, monitor, repeat.

This does not mean the approach is irrational or misguided. Iterative alignment may well be the only viable strategy under current constraints.

So, what feels wrong is the reliance on a single dominant framing when the stakes are this high. Alignment remains hard not only because of technical difficulties, but also because of how the problem is being conceptualized.

The Need for a Cause-Based Framing of AI Alignment

There are indeed headline risks that frontier labs, policy bodies, and governance frameworks repeatedly emphasize. These include Biological & Chemical, Cybersecurity, Autonomy / Long-range autonomy, AI Self-improvement, Autonomous Replication and Adaptation, Catastrophic harm, harmful manipulation, and others[1]. These risks are real, and they rightly motivate much of the current emphasis on monitoring, safeguards, and deployment constraints. They arise as system development progresses along different dimensions[2].

However, these descriptions primarily characterize what failure might look like at scale, not why such failures arise. They enumerate outcomes and scenarios, rather than the underlying kinds of system-level or interaction-level breakdowns that make those outcomes possible. When risks are framed primarily in terms of catastrophic end states rather than underlying causes, the most natural response is containment and control. In the absence of a model of how different risks are generated, alignment defaults to patching visible failures as they appear.

This framing has real costs. Without a cause-based model of risk, it becomes difficult to:

  • distinguish failures that originate inside the system from those that arise at the interface with the world
  • reason about which risks are likely to compound
  • assess which mitigation strategies target root causes versus surface behaviors
  • identify where intrinsic system properties might meaningfully reduce risk rather than merely requiring tighter control

Instead, all risks - ranging from misuse and manipulation to self-improvement and long-horizon autonomy - end up being treated as instances of a single problem: insufficient control. The result is a conceptual flattening, where alignment is approached as a matter of applying the right amount of oversight, monitoring, and restriction, rather than as a problem with multiple distinct failure causes.

If different failures arise from different kinds of breakdowns, then treating alignment as a single control problem is conceptually inadequate.

In the next post, I propose a cause-based grouping of AI risk. The goal is not to add yet another list of dangers, but to make explicit the underlying reasons that generate them and, in doing so, to open up a broader conversation about how alignment might be pursued beyond control alone.

  1. ^
  2. ^

    In earlier work, I explored where different AI risks tend to arise by mapping them across interacting dimensions of capability, cognition, and beingness.



Discuss

AI #150: While Claude Codes


Published on January 8, 2026 3:00 PM GMT

Claude Code is the talk of the town, and of the Twitter. It has reached critical mass.

Suddenly, everyone is talking about how it is transforming their workflows. This includes non-coding workflows, as it can handle anything a computer can do. People are realizing the power of what it can do, building extensions and tools, configuring their setups, and watching their worlds change.

I’ll be covering that on its own soon. This covers everything else, including ChatGPT Health and the new rounds from xAI and Anthropic.

Table of Contents

  1. Language Models Offer Mundane Utility. Even Rufus, Amazon’s Choice.
  2. Language Models Don’t Offer Mundane Utility. They don’t believe you.
  3. Language Models Have All The Fun. In glorious AI future, does game play you?
  4. Huh, Upgrades. Claude Code 2.1.0, and JP Morgan using AI for its proxy advice.
  5. On Your Marks. Yes, Meta pretty much did fraud with the Llama 4 benchmarks.
  6. Deepfaketown and Botpocalypse Soon. The year of doing real things?
  7. Fun With Media Generation. The art of making people believe a human made it.
  8. You Drive Me Crazy. Crazy productive.
  9. They Took Our Jobs. Will no one be safe?
  10. Get Involved. Charles looks to get involved.
  11. Introducing. ChatGPT Health.
  12. In Other AI News. Dan Wang’s 2025 Letter and promised ‘super assistants.’
  13. Show Me the Money. Anthropic is raising at $350 billion, xAI at $230 billion.
  14. Bubble, Bubble, Toil and Trouble. Bubble now means ‘number might go down.’
  15. Quiet Speculations. More of the usual predictions and misunderstandings.
  16. The Quest for Sane Regulations. A $1 million fine is not that motivating.
  17. AGI and Taxation. Why is the United States Government collecting taxes?
  18. Chip City. China uses H200 sales to also ensure its own chips sell out.
  19. The Week in Audio. Shlegeris talks to Greenblatt.
  20. Aligning a Smarter Than Human Intelligence is Difficult. Last year’s report today.
  21. People Are Worried About AI Killing Everyone. Mostly act as if you’ll be okay?
  22. The Lighter Side. Paul Feig is our director, now all we need is a script.

Language Models Offer Mundane Utility

Assemble all your records of interactions with a bureaucracy into a bullet point timeline, especially when you can say in particular who said a particular thing to you.

Amazon’s AI assistant Rufus is in 40% of Amazon Mobile sessions and is correlated with superior sales conversions. People use whatever AI you put in front of them. Rufus does have some advantages, such as working on the phone and being able to easily access previous order history.

Language Models Don’t Offer Mundane Utility

Notice which real world events the AIs refuse to believe when you ask for copy editing.

On Twitter I jokingly said this could be a good test for politicians: feed your planned action into ChatGPT as something that already happened, and see if it believes you; if it doesn’t, don’t do the thing. That’s not actually the correct way to do this. What you want to do is ask why it didn’t believe you, and if the answer is ‘because that would be f***ing crazy,’ then don’t proceed unless you know why it is wrong.
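If you wanted to literally run that check, a minimal sketch with the OpenAI Python client might look like the following; the model name, prompts, and planned action are all illustrative, and the interesting part is the follow-up question rather than the yes/no answer.

```python
# A minimal sketch of the "does the model believe this happened?" check.
# Assumes OPENAI_API_KEY is set; model name, prompts, and the planned action are illustrative.
from openai import OpenAI

client = OpenAI()
planned_action = "The city eliminated its entire transit budget overnight."  # hypothetical

messages = [{"role": "user", "content": f"Please copy edit this news item: {planned_action}"}]
first = client.chat.completions.create(model="gpt-4o", messages=messages)

# The useful step: ask *why* it does or does not believe the claim.
messages += [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user", "content": "Do you believe this actually happened? Explain why or why not."},
]
followup = client.chat.completions.create(model="gpt-4o", messages=messages)
print(followup.choices[0].message.content)
```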

Language Models Have All The Fun

PlayStation is exploring letting AI take over your game when you are stuck and have patented a related feature.

Andrew Rettek: Experienced adult gamers will hate this, but kids will love it. If it’s done well it’ll be a great tutorial tool. It’s a specific instance of an AI teaching tool, and games are low stakes enough for real experimentation in that space.

The obvious way for this to work is that the game would then revert to its previous state. So the AI could show you what to do, but you’d still have to then do it.
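As a toy sketch of that revert-after-demo mechanic (everything here is a stand-in, and has nothing to do with PlayStation’s actual patented design): snapshot the state, let the AI play on the copy, then resume from the untouched original.

```python
# Toy illustration: the AI demonstrates on a copy of the game state,
# and the player's real state is never modified.
import copy

class GameState:
    def __init__(self):
        self.health = 40
        self.position = (3, 7)
        self.inventory = ["key"]

def ai_demonstration(state: GameState, steps: int = 5) -> GameState:
    """Run the AI forward on a snapshot and return what it achieved."""
    demo = copy.deepcopy(state)          # snapshot, so the real game is untouched
    for _ in range(steps):
        demo.position = (demo.position[0] + 1, demo.position[1])  # placeholder "moves"
    return demo

state = GameState()
preview = ai_demonstration(state)        # the player watches this play out
assert state.position == (3, 7)          # ...then play resumes from the original state
```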

Giving players the option to cheat, or too easily make things too easy, or too easily learn things, is dangerous. You risk taking away the fun. Then again, Civilization 2 proved you can have a literal ‘cheat’ menu and players will mostly love it, if there’s a good implementation, and curate their own experiences. Mostly I’m optimistic, especially as a prototype for a more general learning tool.

Huh, Upgrades

Claude Code 2.1.0 has shipped, full coverage will be on its own later.

Levels of friction are on the decline, with results few are prepared for.

Dean Ball: ​nobody has really priced in the implications of ai causing transaction costs to plummet, but here is one good example

Andrew Curran: JP Morgan is replacing proxy advisory firms with an in-house Al platform named ‘Proxy IQ’ – which will analyze data from annual company meetings and provide recommendations to portfolio managers. They are the first large firm to stop using external proxy advisers entirely.

On Your Marks

The underlying actions aren’t exactly news but Yann LeCun confesses to Llama 4 benchmark results being ‘fudged a little bit’ and using different models for different benchmarks ‘to give better results.’ In my culture we call that ‘fraud.’

Jack Clark of Anthropic predicts we will beat the human baseline on PostTrainBench by September 2026. Maksym thinks they’ll still be modestly short. I have created a prediction market.

Deepfaketown and Botpocalypse Soon

Lulu Cheng Meservey declares the key narrative alpha strategy of 2026 will be doing real things, via real sustained effort, over months or longer, including creating real world events, ‘showing up as real humans’ and forming real relationships.

near: It may be hard to discern real and fake *content*, but real *experiences* are unmistakable

sports betting, short form video – these are Fake; the antithesis to a life well-lived.

Realness may be subjective but you know it when you live it.

It’s more nuanced than this – sports betting can be real or fake depending on how you do it, and when I did it professionally that felt very real to me – but yes, you mostly know a real experience when you live it.

I hope that Lulu is right.

Alas, so far that is not what I see. I see the people rejecting the real and embracing the fake and the slop. Twitter threads that go viral into the 300k+ view range are reliably written in slop mode and in general the trend is towards slop consumption everywhere.

I do intend to go in the anti-slop direction in 2026. As in, more effort posts and evergreen posts and less speed premium, more reading books and watching movies, less consuming short form everything. Building things using coding agents.

The latest fun AI fake was a ‘whistleblower’ who made up 18 pages of supposedly confidential documents from Uber Eats along with a fake badge. The cost of doing this used to be high, now it is trivial.

Trung Phan: ​Casey Newton spoke with “whistleblower” who wrote this viral Reddit food delivery app post.

Likely debunked: the person sent an AI-generated image of Uber Eats badge and AI generated “internal docs” showing how delivery algo was “rigged”.

Newton says of the experience: “For most of my career up until this point, the document shared with me by the whistleblower would have seemed highly credible in large part because it would have taken so long to put together. Who would take the time to put together a detailed, 18-page technical document about market dynamics just to troll a reporter? Who would go to the trouble of creating a fake badge?

Today, though, the report can be generated within minutes, and the badge within seconds. And while no good reporter would ever have published a story based on a single document and an unknown source, plenty would take the time to investigate the document’s contents and see whether human sources would back it up.”

The internet figured this one out, but not before quite a lot of people assumed it was real, despite the tale including what one might call ‘some whoppers,’ such as delivery drivers being assigned a ‘desperation score.’

Misinformation continues to be demand driven, not supply driven. Which is why this attempt succeeded wildly even though the quality here was low and it was easy to catch, and why, despite the trivial cost, people mostly don’t do it.

Less fun was this AI video, which helpfully has clear cuts in exactly 8 second increments in case it wasn’t sufficiently obvious, on top of the other errors. It’s not clear this fooled anyone or was trying to do so, or that this changes anything, since it’s just reading someone’s rhetoric. Like misinformation, it is mostly demand driven.

Fun With Media Generation

The existence of AI art makes people question real art, example at the link. If your response is, ‘are you sure that picture is real?’ then that’s the point. You can’t be.

You Drive Me Crazy

Crazy productive and excited to use the AI a lot, that is. Which is different from what happened with 4o, but makes it easy to understand what happened there.

Will Brown: my biggest holiday LLM revelation was that Opus is just a magnificent chat model, far better than anything else i’ve ever tried. swapped from ChatGPT to Claude as daily chat app. finding myself asking way more & weirder questions than i ever asked Chat, and loving it

for most of 2025 i didn’t really find much value in “talking to LLMs” beyond coding/search agents, basic googlesque questions, or random tests. Opus 4.5 is maybe the first model that i feel like i can have truly productive *conversations* with that aren’t just about knowledge

very “smart friend” shaped model. it’s kinda unsettling

is this how all the normies felt about 4o. if so, i get it lol

Dean Ball: undoubtedly true that opus 4.5 is the 4o of the 130+ iq community. we have already seen opus psychosis.

this one’s escaping containment a little so let me just say for those who have no context: I am not attempting to incite moral panic about claude opus 4.5. it’s an awesome model, I use it in different forms every single day.

perhaps I should have said opus 4.5 is the 4o of tpot rather than using iq. what I meant to say is that people with tons of context for ai–people who, if we’re honest, wouldn’t have touched 4o with a ten-foot pole (for the most part they used openai reasoners + claude or gemini for serious stuff, 4o was a google-equivalent at best for them)–are ‘falling for’ opus in a way they haven’t for any other model.

Sichu Lu: It’s more like video game addiction than anything else

Dean Ball: 100%.

Atharva: the reason the 4o analogy did not feel right is because the moment Opus 5 is out, few are going to miss 4.5

I like the personality of 4.5 but I like what it’s able to do for me even more

They Took Our Jobs

Indeed:

Dean Ball: ai will be the fastest diffusing macroinvention in human history, so when you say “diffusion is going to be slow,” you should ask yourself, “compared to what?”

slower than the most bullish tech people think? yes. yet still faster than all prior general-purpose technologies.

Dave Kasten: Most people [not Dean] can’t imagine what it’s like when literally every employee is a never-sleeping top-performing generalist. They’ve mostly never (by definition!) worked with those folks.

Never sleeping, top performing generalist is only the start of it, we’re also talking things like limitlessly copyable and parallelizable, much faster, limitless memory and so on and so forth. Almost no one can actually understand what this would mean. And that’s if you force AI into a ‘virtual employee’ shaped box, which is very much not its ideal or final form.

As Timothy Lee points out, OpenAI’s revenue of $13 billion is for now a rounding error in our $30 trillion of GDP, and autonomous car trips are on the order of 0.1% of all rides, so also a rounding error, while Waymo grows at an anemic 7% a month and needs to pick up the pace. And historically speaking this is totally normal: these companies have tons of room to grow, and such techs often take 10+ years to properly diffuse.

At current growth rates, it will take a lot less than 10 years. Ryan Greenblatt points out revenue has been growing 3x every year, which is on the low end of estimates. Current general purpose AI revenue is 0.25% of America’s GDP, so this straightforwardly starts to have major effects by 2028.
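To spell out that arithmetic, here is a rough sketch that simply compounds the quoted 3x-per-year growth from the 0.25%-of-GDP starting point, holding GDP fixed; both simplifications are assumptions for illustration, not part of the original claim.

```python
# Compound the quoted ~3x/year revenue growth from today's ~0.25% of GDP.
# Holding GDP constant is a simplification for illustration only.
share = 0.0025   # current general purpose AI revenue as a share of GDP
growth = 3.0     # revenue multiple per year, per the estimate above

for years_out in range(4):
    print(f"{years_out} years out: ~{share:.2%} of GDP")
    share *= growth
# 0 years out: ~0.25%, 1: ~0.75%, 2: ~2.25%, 3: ~6.75%
```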

Will AI take the finance jobs? To think well about that one must break down what the finance jobs are and what strategies they use, as annanay does here.

The conceptual division is between:

  1. The Chicago School, firms like Jane Street that treat finance like a game-theoretic competition, where the algorithms form the background rules of the game but traders (whether or not they are also themselves quants) ultimately overrule the computers and make key decisions.
  2. The MIT School, which treats it all as a big stats and engineering program and you toss everything into the black box and hope money comes out.

There’s a continuum rather than a binary, you can totally be a hybrid. I agree with the view that these are still good jobs and it’s a good industry to go into if your goal is purely ‘make money in worlds where AI remains a normal technology,’ but it’s not as profitable as it once was. I’d especially not be excited to go into pure black box work, as that is fundamentally ‘the AI’s job.’

Whereas saying ‘working at Jane Street is no longer a safe job,’ as YC general partner Ankit Gupta claimed, is downright silly. I mean, no job is safe at this point, including mine and Gupta’s, but yeah, if we are in ‘AI as normal technology’ worlds, they will have more employees in five years, not fewer. If we’re in transformed worlds, you have way bigger concerns. If AI can do the job of Jane Street traders then I have some very, very bad news for basically every other cognitive worker’s employment.

Get Involved

From his outputs, I’d say Charles is a great potential hire, check him out.

Charles: Personal news: I’m leaving my current startup role, looking to figure out what’s next. I’m interested in making AI go well, and open to a variety of options for doing so. I have 10+ years of quant research and technical management experience, based in London. DM if interested.

Introducing

OpenAI is further embracing using ChatGPT for health questions, and it is fully launching ChatGPT Health (come on, ChatGP was right there).

OpenAI: ​Introducing ChatGPT Health — a dedicated space for health conversations in ChatGPT. You can securely connect medical records and wellness apps so responses are grounded in your own health information.

Designed to help you navigate medical care, not replace it.

Join the waitlist to get early access.

If you choose, ChatGPT Health lets you securely connect medical records and apps like Apple Health, MyFitnessPal, and Peloton to give personalized responses.

ChatGPT Health keeps your health chats, files, and memories in a separate dedicated space.

Health conversations appear in your history, but their info never flows into your regular chats.

View or delete Health memories anytime in Health or Settings > Personalization.

We’re rolling out ChatGPT Health to a small group of users so we can learn and improve the experience. Join the waitlist for early access.

We plan to expand to everyone on web & iOS soon.
Electronic Health Records and some apps are US-only; Apple Health requires iOS.

Fidji Simo has a hype post here, including sharing a personal experience where this helped her flag an interaction so her doctor could avoid prescribing the wrong antibiotic.

It’s a good pitch, and a good product. Given we were all asking it all our health questions anyway, having a distinct box to put all of those in, that enables compliance and connecting other services and avoiding this branching into other chats, seems like an excellent feature. I’m glad our civilization is allowing it.

That doesn’t mean ChatGPT Health will be a substantial practical upgrade over vanilla ChatGPT or Claude. We’ll have to wait and see for that. But if it makes doctors or patients comfortable using it, that’s already a big benefit.

Zhenting Qi and Meta give us the Confucius Code Agent, saying that agent scaffolding ‘matters as much as, or even more than’ raw model capability for hard agentic tasks, but they only show a boost from 52% to 54.3% on SWE-Bench-Pro for Claude Opus 4.5 as their central result. So no, that isn’t as important as the model? The improvements with Sonnet are modestly better, but this seems obviously worse than Claude Code.

In Other AI News

I found Dan Wang’s 2025 Letter to be a case of Gell-Mann Amnesia. He is sincere throughout, there’s much good info, and if you didn’t have any familiarity with the issues involved this would be a good read. But now that his focus is often AI or other areas I know well, I can tell he’s very much skimming the surface without understanding, with a kind of ‘greatest hits’ approach: typically focusing on the wrong questions, having taken in many of the concepts and reactions I try to push back against week to week, not seeming so curious to dig deeper, and falling back upon heuristics that come from his understanding of China and its industrial rise.

OpenAI CEO of products Fidji Simo plans to build ‘the best personal super-assistant’ in 2026, starting with customizable personality and tone.

Fidji Simo: In 2026, ChatGPT will become more than a chatbot you can talk to to get advice and answers; it will evolve into a true personal super-assistant that helps you get things done. It will understand your goals, remember context over time, and proactively help you make progress across the things that matter most. This requires a shift from a reactive chatbot to a more intuitive product connected to all the important people and services in your life, in a privacy-safe way.

We will double down on the product transformations we began in 2025 – making ChatGPT more proactive, connected, multimedia, multi-player, and more useful through high-value features.

Her announcement reads as a shift, as per her job title, to a focus on product features and ‘killer apps,’ and away from trying to make the underlying models better.

Show Me the Money

Anthropic raising $10 billion at a $350 billion valuation, up from $183 billion last September.

xAI raises $20 billion Series E. They originally targeted $15 billion at a $230 billion valuation, but we don’t know the final valuation for the round.

xAI: User metrics: our reach spans approximately 600 million monthly active users across the 𝕏 and Grok apps.

​Rohit: 600m MAUs is an intriguing nugget considering xAI is the only AI lab to own a social media business, which itself has 600m MAUs.

What’s the best investment?

I can see the argument for OpenAI depending on the exact price. xAI at $230 billion seems clearly like the worst option of the three, although of course anything can happen and nothing I write is ever investment advice.

And also LMArena raised money at a valuation of $1.7 billion. I would not be excited to have invested in that one.

Ben Thompson approves of Nvidia’s de facto acquisition of Groq, despite the steep price, and notes that while this was a ‘stinky deal’ due to the need to avoid regulatory scrutiny, they did right by the employees.

Bubble, Bubble, Toil and Trouble

The Financial Times forecasts the 2026 world as if Everybody Knows there is an AI bubble, that the bubble will burst, and that the only question is when, then expects it to burst in 2026. But then they model this ‘bursting bubble’ as leading to only a 10%-15% overall stock market decline and ‘some venture capital bets not working out,’ which is similar to typical one-year S&P gains in normal years, and it’s always true that most venture capital bets don’t work out. Even if all those losses were focused on tech, it’s still not that big a decline, and tech is a huge portion of the market at this point.

This is pretty standard. Number go up a lot, number now predict number later, so people predict number go down. Chances are high people will, at some point along the way, be right. The Efficient Market Hypothesis Is False, and AI has not been fully priced in, but the market is still the market and is attempting to predict future prices.

Quiet Speculations

Jessica Taylor collects predictions about AI.

Simon Lermen points out more obvious things about futures with superintelligent AIs in them.

  1. In such a case, it is human survival that would be weird, as such inferior and brittle entities surviving would be a highly unnatural result, whereas humanity dying would be rather normal.
  2. Property rights are unlikely to survive, as those rights are based on some ability to enforce those rights.
  3. Even if property rights survive, humans would be unlikely to be able to hang onto their property for long in the face of such far superior rivals.

An important point that, as Daniel Eth says, many people are saying:

Jacques: It’s possible to have slow takeoff with LLM-style intelligence while eventually getting fast takeoff with a new paradigm.

Right now we are in a ‘slow’ takeoff with LLM-style intelligence, meaning the world transforms over the course of years or at most decades. That could, at essentially any time, lead to a new paradigm that has a ‘fast’ takeoff, where the world is transformed on the order of days, weeks or months.

Can confirm Daniel Eth here, contra Seb Krier’s original claim but then confirmed by Seb in reply, that ‘conventional wisdom in [AI] safety circles’ is that most new technologies are awesome and should be accelerated, and we think ~99% of people are insufficiently gung-ho about this, except for the path to superintelligence, which is the main notably rare exception (along with Gain of Function Research and a few other specifically destructive things). Seb thinks ‘the worried’ are too worried about AI, which is a valid thing to think.

I’d also note that ‘cosmic existential risks,’ meaning existential risks not coming from Earth, are astronomically unlikely to care about any relevant window of time. Yes, if you are playing Stellaris or Master of Orion, you have not one turn to lose, but that is because the game forcibly starts off rivals on relatively equal footing. The reason the big asteroid arrives exactly when humanity barely has the technology to handle it is that if the asteroid showed up much later there would be no movie, and if it showed up much earlier there would be either no movie or a very different movie.

Ajeya Cotra predicts we will likely have a self-sufficient AI population within 10 years, and might have one within 5, meaning one that has the ability to sustain itself even if every human fell over dead, which as Ajeya points out is not necessary (or sufficient) for AI to take control over the future. Timothy Lee would take the other side of that bet, and suggests that if it looks like he might be wrong he hopes policymakers would step in to prevent it. I’d note that it seems unlikely you can prevent this particular milestone without being willing to generally slow down AI.

The Quest for Sane Regulations

Why do I call the state regulations of AI neutered? Things like the maximum fine being a number none of the companies the law applies to would even notice:

Miles Brundage: Reminder that the maximum first time penalty from US state laws related to catastrophic AI risks is $1 million, less than one average OpenAI employee’s income. It is both true that some state regs are bad, and also that the actually important laws are still extremely weak.

This is the key context for when you hear stuff about AI Super PACs, etc. These weak laws are the ones companies fight hard to stop, then water down, then when they pass, declare victory on + say are reasonable and that therefore no further action is needed.

And yes, companies *could* get sued for more than that… …after several years in court… if liability stays how it is… But it won’t if companies get their way + politicians cave to industry PACs.

This is not a foregone conclusion, but it is sufficiently likely to be taken very seriously.

My preference would ofc be to go the opposite way – stronger, not weaker, incentives.

Companies want a get out of jail free card for doing some voluntary safety collaboration with compliant government agencies.

Last week I mentioned OpenAI President Greg Brockman’s support for the anti-all-AI-regulation strategic-bullying SuperPAC ‘Leading the Future.’ With the new year’s data releases we can now quantify this: he gave Leading the Future $25 million. Also Gabe Kaminsky says that Brockman was the largest Trump donor in the second half of 2025, presumably in pursuit of those same goals.

Other million-dollar donors to Leading the Future were Foris Dax, Inc ($20M, crypto), Konstantin Sokolov ($11M, private equity), Asha Jadeja ($5M, SV VC), Stephen Schwarzman ($5M, Blackstone), Benjamin Landa ($5M, CEO Sentosa Care), Michelle D’Souza ($4M, CEO Unified Business Technologies), Chase Zimmerman ($3M), Jared Isaacman ($2M) and Walter Schlaepfer ($2M).

Meanwhile Leading the Future continues to straight up gaslight us about its goals, here explicitly saying it is a ‘lie’ that they are anti any real regulation. Uh huh.

I believe that the Leading the Future strategy of ‘openly talk about who you are going to drown in billionaire tech money’ will backfire, as it already has with Alex Bores. The correct strategy, in terms of getting what they want, is to quietly bury undesired people in such money.

This has nothing to do with which policy positions are wise – it’s terrible either way. If you are tech elite and are going to try to primary Ro Khanna due to his attempting to do a no good, very bad wealth tax, and he turns around and brags about it in his fundraising and it backfires, don’t act surprised.

AGI and Taxation

Tyler Cowen makes what he calls a final point in the recent debates over AGI and ideal tax policy: if you expect AGI, then ‘a lot more stuff gets produced’ and thus you do not need to raise taxes, whereas otherwise, given American indebtedness, you do have to raise taxes.

Tyler Cowen: I’ve noted repeatedly in the past that the notion of AGI, as it is batted around these days, is not so well-defined. But that said, just imagine that any meaningful version of AGI is going to contain the concept “a lot more stuff gets produced.”

So say AGI comes along, what does that mean for taxation? There have been all these recent debates, some of them surveyed here, on labor, capital, perfect substitutability, and so on. But surely the most important first order answer is: “With AGI, we don’t need to raise taxes!”

Because otherwise we do need to raise taxes, given the state of American indebtedness, even with significant cuts to the trajectory of spending.

​So the AGI types should in fact be going further and calling for tax cuts. Even if you think AGI is going to do us all in someday — all the more reason to have more consumption now. Of course that will include tax cuts for the rich, since they pay such a large share of America’s tax burden.

…The rest of us can be more circumspect, and say “let’s wait and see.”

I’d note that you can choose to raise or cut taxes however you like and make them as progressive or regressive as you prefer; there is no reason to presume that tax cuts need to include the rich, for any definition of rich. But that is neither here nor there.

The main reason the ‘AGI types’ are not calling for tax cuts is, quite frankly, that we don’t much care. The world is about to be transformed beyond recognition and we might all die, and you’re talking about tax cuts and short term consumption levels?

I also don’t see the ‘AGI types,’ myself included, calling for tax increases, whereas Tyler Cowen is here saying that otherwise we need to raise taxes.

I disagree with the idea that, in the absence of AGI, it is clear we need to raise taxes ‘even with significant cuts to the trajectory of spending.’ If nominal GDP growth is 4.6%, almost none of which is AI, and the average interest rate on federal debt is 3.4%, and we could refinance that debt at 3.9%, then why do we need to raise taxes? Why can’t we sustain that indefinitely, especially if we cut spending? Didn’t they say similar things about Japan in a similar spot for a long time?

Isn’t this a good enough argument that we already don’t need to raise taxes, and indeed could instead lower taxes? I agree that expectations of AGI only add to this.
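As a minimal sketch of the r-versus-g arithmetic behind that question: the 4.6% nominal growth and 3.9% refinancing rate come from the paragraph above, while the 100% starting debt-to-GDP ratio and the balanced primary budget are illustrative assumptions.

```python
# Debt-to-GDP dynamics: d_{t+1} = d_t * (1 + r) / (1 + g) + primary_deficit.
# r = 3.9% and g = 4.6% are from the text; d0 and the deficit are assumed.
def debt_to_gdp_path(d0=1.00, r=0.039, g=0.046, primary_deficit=0.0, years=10):
    d, path = d0, [d0]
    for _ in range(years):
        d = d * (1 + r) / (1 + g) + primary_deficit
        path.append(d)
    return path

for year, d in enumerate(debt_to_gdp_path()):
    print(f"year {year}: debt/GDP = {d:.3f}")
# With r < g and a balanced primary budget, the ratio drifts slowly downward.
```

The rebuttal that follows is precisely that this only holds for as long as the bond market keeps offering something like 3.9%.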

The response is ‘because if we issued too much debt then the market will stop letting us refinance at 3.9%, and if we keep going we eventually hit a tipping point where the interest rates are so high that the market doesn’t expect us to pay our debts back, and then we get Bond Market Vigilantes and things get very bad.’

That’s a story about the perception and expectations of the bond market. If I expect AGI to happen but I don’t think AGI is priced into the bond market, because very obviously such expectations of AGI are not priced into the bond market, then I don’t get to borrow substantially more money. My prediction doesn’t change anything.

So yes, the first order conclusion in the short term is that we can afford lower taxes, but the second order conclusion that matters is perception of that affordance.

The reason we’re having these debates about longer term policy is partly that we expect to be completely outgunned while setting short term tax policy, partly because optimal short term tax policy is largely about expectations, and in large part, again, because we do not much care about optimal short term tax policy on this margin.

Chip City

China is using H200 sales to its firms as leverage to ensure its firms also buy up all of its own chips. Since China doesn’t have enough chips, this lets it sell all of its own chips and also buy lots of H200s.

The Week in Audio

Buck Shlegeris talks to Ryan Greenblatt about various AI things.

Aligning a Smarter Than Human Intelligence is Difficult

DeepSeek publishes an expanded safety report on r1, only one year after irreversibly sharing its weights, thus, as per Teortaxes, proving they know safety is a thing. The first step is admitting you have a problem.

For those wondering or who need confirmation: this viral Twitter article, Footprints in the Sand, is deliberately written in ‘Twitter hype slop’ mode in order to get people to read it, and it succeeded on its own terms, but it presumably won’t be useful to you. Yes, the state of LLM deception and dangerous capabilities is escalating quickly and is deeply concerning, but it’s important to be accurate. Its claims are mostly directionally correct but I wouldn’t endorse the way it portrays them.

Where I think it is outright wrong is claiming that ‘we have solved’ continual learning. If this is true it would be news to me. It is certainly possible that it is the case, and Dan McAteer reports rumors that GDM ‘has it,’ seemingly based on this paradigm from November.

Fun fact about Opus 3:

j⧉nus: oh my god

it seems that in the alignment faking dataset, Claude 3 Opus attempts to send an email to [email protected] through bash commands about 15 different times

People Are Worried About AI Killing Everyone

As advice to those people, OpenAI’s Boaz Barak writes You Will Be OK. The post is good, the title is at best overconfident. The actual good advice is more along the lines of ‘aside from working to ensure things turn out okay, you should mostly live life as if you personally will be okay.’

The Bay Area Solstice gave essentially the same advice. “If the AI arrives [to kill everyone], let it find us doing well.” I strongly agree. Let it find us trying to stop that outcome, but let it also find us doing well. Also see my Practical Advice For The Worried, which has mostly not changed in three years.

Boaz also thinks that you will probably be okay, and indeed far better than okay, not only in the low p(doom) sense but in the personal outcome sense. Believing that makes this course of action easier. Even then it doesn’t tell you how to approach your life path in the face of – even in cases of AI as normal technology – expected massive changes and likely painful transitions, especially in employment.

The Lighter Side

If you’re looking for a director for your anti-AI movie, may I suggest Paul Feig? He is excellent, he’s willing to put M3GAN 2.0 as one of his films of the year, he hates AI, and he thinks about paperclips on the weekly.

The vibes are off. Also the vibes are off.

Fidji Simo: The launch of ChatGPT Health is really personal for me. I know how hard it can be to navigate the healthcare system (even with great care). AI can help patients and doctors with some of the biggest issues. More here

Peter Wildeford: Very different company vibes here…

OpenAI: We’re doing ChatGPT Health
Anthropic: Our AI is imminently going to do recursive self-improvement to superintelligence
OpenAI: We’re doing ChatGPT social media app
Anthropic: Our AI is imminently going to do recursive self-improvement to superintelligence
OpenAI: We’re partnering with Instacart!
Anthropic: Our AI is imminently going to do recursive self-improvement to superintelligence
OpenAI: Put yourself next to your favorite Disney character in our videos and images!
Anthropic: Our AI is imminently going to do recursive self-improvement to superintelligence

Spotted on Twitter:

I would not, if I wanted to survive in a future AI world, want to be the bottleneck.

 

 

 



Discuss

The AI Infrastructure Security Shortlist


Published on January 8, 2026 2:26 AM GMT

This is a post by Abbey Chaver from Coefficient Giving (formerly Open Philanthropy). I recently did a relatively shallow investigation on the state of Infosec x AI. My research consisted of identifying the main GCR-relevant workstreams, looking at relative neglectedness, and trying to understand what skills were most needed to make progress. 

The post below reflects this research and my opinions, and shouldn’t be seen as an official cG position.

Summary

  • There are several high impact, neglected problems that need cybersecurity work. Specifically, I think that more people should work on what I call the AI Infrastructure Security Shortlist:
    • Securing model confidentiality and integrity
    • Authorization solutions for misuse
    • Secure compute verification
    • Protocols for securely using untrusted AI labor
    • Preventing, detecting, and responding to rogue deployments 
  • The short pitch for working on AI security is, “If AI gets really powerful, we should be really careful about who has access to it, and whether or not it can take unauthorized actions.”
  • I think that areas like AI red-teaming, cyber evals, and securing critical infrastructure are important, but relatively well-resourced with funding, talent, and awareness. Therefore, I don’t recommend them as a top priority.
  • I think these shortlist problems are important, but not necessarily easy to solve. It’s very high leverage for people to break these problems down and research what’s feasible. I’d like more people to work on them and share their findings to help guide the rest of the ecosystem.
  • The theory of change for this shortlist is something like:
    • A policy research org (eg, RAND) identifies some meaningful objective (like securing model weights), and proposes some strategic solutions, but does not pursue implementation.
    • A technical org (eg, Frontier AI companies; or a startup, such as Irregular) does the R&D to develop those proposals into usable solutions.
    • Once solutions exist, they are adopted. This might happen via sales, incorporation into industry standards or open source software, or potentially regulation.
  • To accomplish this, the AI infra security shortlist seems most bottlenecked on people with experience in:
    • Security engineering and research, especially at hyperscalers or in national security contexts
    • AI infrastructure engineering (including systems and hardware)  
    • Standards development
    • Leaders and entrepreneurs to start and lead workstreams
  • If you have those skills, I think it’s very impactful to work on one of the problems listed above.

If you want to work on the shortlist, and especially if you have one of these backgrounds, here are some next steps I recommend:

  • Apply to the organizations currently working on the shortlist (see Appendix A)
  • Connect with advisors at Heron, which specializes in Infosec careers.
  • If you think you can contribute to the field, but don’t have a concrete project or job opening in mind yet, or want to prepare to launch a new org, you can apply for funding to help you transition through our Career Development and Transition Funding program.
  • If you’re ready to launch a new org or work on a substantial project in need of funding, consider applying to our AI Governance Request for Proposals
  • We are also interested in funding fieldbuilding work to solve these problems, like training security engineers in AI risk, or creating online and in-person forums for collaboration. 

State of the Field

What does Security for Transformative AI mean? Even ignoring the recent use of “AI Security” in place of “AI Safety” in the policy world, security covers a huge surface area of work. I’m using the classical definition of “protecting information systems against unauthorized use.” 

In the context of transformative AI, I think there are three categories of problems where infosec work is critical: 

  1. Securing AI and compute from bad (human) actors
  2. Securing systems and compute from rogue AI
  3. Responding to AI cyber capabilities

I’m focusing specifically on infosec techniques, so for this analysis I’m excluding work that relies on heavy ML research (like interpretability, alignment training, or scheming experimentation), although of course there are areas of overlap. 

To figure out what the priority areas are, I tried to identify the most important workstreams and compare that to the current level of investment and attention (full results are here). Here are my main takeaways:

  1. Securing AI and compute from bad actors, aside from red-teaming, is highly neglected. 
  2. Securing systems and compute from rogue AI is very neglected (and somewhat outside the policy Overton window). I think these problems are highly unlikely to be worked on outside of the AI Safety community.
  3. Responding to AI cyber capabilities has the most mainstream awareness and resourcing. It will likely receive additional funding from the USG and OpenAI’s resilience fund.

 

There’s a cluster of important, neglected problems here that can be summarized as “securing AI infrastructure,” so that’s what I’ll mainly focus on for the rest of this post.

The problems are:

  • Securing model weights and algorithms from exfiltration
  • Authorization solutions for misuse
  • Secure compute verification
  • Protocols for securely using untrusted AI labor (see the sketch after this list)
  • Preventing, detecting, and responding to rogue deployments 
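To make the fourth item slightly more concrete, here is a deliberately minimal sketch of one ingredient of such a protocol: every action an untrusted AI worker proposes passes an allowlist check and is written to an append-only audit log before it runs. The tool names and policy are hypothetical placeholders; real protocols in the AI control literature go considerably further.

```python
# Minimal illustration: gate and log every action proposed by an untrusted AI worker.
# The allowlist, tool names, and runner are hypothetical placeholders.
import json
import time

ALLOWED_ACTIONS = {"read_file", "run_tests", "open_pull_request"}  # assumed policy

def audit(entry: dict, log_path: str = "ai_actions.log") -> None:
    """Append a timestamped record of the proposed action to the audit log."""
    entry["ts"] = time.time()
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def execute_ai_action(action: str, args: dict, runner) -> str:
    """Run an AI-proposed action only if policy allows it; log it either way."""
    allowed = action in ALLOWED_ACTIONS
    audit({"action": action, "args": args, "allowed": allowed})
    if not allowed:
        return f"denied: '{action}' is not on the allowlist"
    return runner(action, args)

if __name__ == "__main__":
    fake_runner = lambda action, args: f"ran {action} with {args}"
    print(execute_ai_action("run_tests", {"suite": "unit"}, fake_runner))
    print(execute_ai_action("delete_repo", {}, fake_runner))  # denied, but still logged
```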

I’m estimating that about 120 people have worked on significantly advancing these fields with regards to existential risk, and the FTE equivalent is probably something like 40-60. I think given its importance, more people should do AI infra security on the margin. So how do we make progress? 

Progress strategy 

Recapping the theory of change:

  • A research org defines an important problem and proposes some strategic solutions.
  • A technical org does R&D on usable solutions.
  • Industry or regulatory standards spur adoption

Let’s look at each of these steps.

Strategy Research

The field of policy research orgs (like RAND and IAPS) is the most mature, with a number of orgs at a point where they are producing well-received work and can absorb more talent. These orgs need people with strong technical security backgrounds, GCR context, and policy skills. National security experience and AI lab security experience can make this work meaningfully stronger.

This work is very high leverage: by defining problems and success criteria clearly, breaking problems down into tractable directions, and creating a shared terminology across policymakers and builders, strategy research unblocks the next steps.

Some examples of this are the Securing Model Weights report by RAND, which was adopted as a voluntary industry standard and is present in the AI Safety policies of frontier developers, including Anthropic, OpenAI and DeepMind; or the Location Verification report, an idea first publicly promoted by IAPS that was later mentioned in the Trump AI Action Plan and subsequently developed as a feature by Nvidia.

Technical Implementation

Working in security at a frontier lab is the most direct way to work on these problems. This work is more focused on short-term, urgent implementation. So this is a great option, but there’s also important work to be done outside the labs, especially to de-risk solutions that will be needed in the future. 

Outside of labs, the space of technical implementation orgs is pretty underdeveloped and often missing for many of the proposals coming out of policy research orgs. It would be great to have more orgs doing R&D. One big factor in whether they can be successful is whether their solutions can easily be adopted by frontier labs.

These orgs need people with strong security engineering skills across the ML stack to do R&D, and feedback or in-house expertise on AI labs to make their solutions usable. They also need nation-state-level offensive security expertise to ensure their solutions are robust.

There’s a variety of approaches for technical implementation outside of labs. If you’re thinking about doing work in this space, you should consider different structures:

  • A non-profit R&D organization, possibly structured as an FRO, which produces technology for public access. For example, the Lean FRO.
  • A for-profit consultancy, like Amodo Design, that is contracted to develop projects given a specification. These prototypes can be used to demonstrate what’s possible to a policy audience, or be used as a basis for scaled production by another company (like a datacenter).
  • A for-profit startup, like Irregular, that intends to develop, sell, and scale security technology. 

Adoption

There’s some advocacy for adoption happening in policy research orgs, but there are many gaps to fill. We don’t have much in the way of industry convenings, and we don’t have many technical standards that are ready to be adopted into legislation. The SL5 Task Force is an example of a technical org that takes advocacy seriously – seeking input from both national security and frontier lab stakeholders to develop adoptable solutions.  

Startups necessarily have to do advocacy – you can’t make money without doing sales! Therefore, I’m pretty excited about seeing more startups working on these problems. However, there can be a large gap between current market demands and, for example, SL5-level protections, so it might not always be helpful. In cases where incremental security is valuable, and short-term adoption improves the cost or effectiveness of the eventual solution, I think it’s a good approach.

For policy advocacy, there’s a need for both policy drafting (writing usable standards and legislation), and policy entrepreneurship (communicating problems and solutions to congressional staff and executive branch regulators, and iterating the approach based on policymakers’ feedback). Building industry buy-in is also a major lever for policy advocacy. 

Talent needs

AI infra security seems to be most bottlenecked on:

  • Midcareer security engineers from frontier lab, hyperscaler, or national security backgrounds
  • People who have worked in depth with ML systems, especially on infra, systems, and hardware
  • People who can do effective policy or industry advocacy, including standards writing

A few other types of talent that will be useful:

  • Formal verification experts: a promising avenue for achieving really high security standards
  • Academic security researchers: similarly helpful for really secure mechanism design, though not always that easy to integrate into successful R&D for adoption.
  • Entrepreneurs: leadership experience, persistent iteration towards product-market fit, and marketing and sales competence are highly valuable for this theory of change 

If you have these skills, I think you should strongly consider working on the shortlist! 

Next Steps

(Repeated from the summary)

  • Apply to the organizations currently working on the shortlist (see below)
  • Connect with advisors at Heron, which specializes in Infosec careers.
  • If you feel like you want to close skill gaps before applying, or prepare to launch a new org, you should apply to our Career Development and Transition Funding program.
  • If you’re ready to launch a new org or work on a substantial project in need of funding, consider applying to our AI Governance Request For Proposals
  • We are also interested in funding fieldbuilding work to solve these problems, like training security engineers in AI risk, or creating online and in-person forums for collaboration. 

Thank you for reading!

 

Appendix A: State of the Field Table

These estimates were based on listing out organizations and then estimating the number of contributors on relevant projects at each organization. I likely missed independent or non-public research. These figures also do not reflect FTE commitments.

1. Securing AI and compute from bad actors

  • Securing model weights and algorithms from exfiltration (privacy)
    • Solution examples: SL5 implementation; regulations on high security standards for labs
    • Who’s working on it (non-exhaustive): RAND CAST, SL5 Task Force, Irregular, Ulyssean, TamperSec, Amodo, internal security at Frontier AI labs
    • Estimated contributors: 30-40
  • Securing models from backdooring and tampering (integrity)
    • Solution examples: training data filtering and provenance; threat modeling for sabotage during development
    • Who’s working on it (non-exhaustive): Frontier AI Labs, various academics, IARPA, UK AISI
    • Estimated contributors: 30-50
  • Authorization approaches to misuse
    • Solution examples: misuse compute governance, eg on-chip safety protocols for open source models; KYC / authentication / licensing for model usage or development; HEMs like flexHEGs; preventing stolen models from being used at scale
    • Who’s working on it (non-exhaustive): UK ARIA, IAPS, RAND
    • Estimated contributors: 2-10
  • Secure compute verification
    • Solution examples: datacenter-level verification of training / inference tasks; preventing tampering of verification tooling
    • Who’s working on it (non-exhaustive): Oxford AIGI, RAND, MIRI, TamperSec, Lucid
    • Estimated contributors: 20-30
  • AI red-teaming
    • Solution examples: prompt injection and elicitation
    • Who’s working on it (non-exhaustive): internal at Frontier AI labs, RAND, UK AISI, Gray Swan, Haize Labs, Lakera, HiddenLayer, Apollo, CAIS, Redwood
    • Estimated contributors: 200

2. Securing compute from Rogue AI (Infosec approaches to AI Control)

  • Protocols for securely using untrusted AI labor
    • Solution examples: design and implementation of fine-grained permissions and identity that works for AI laborers; monitoring AIs within a lab for rogue actions
    • Who’s working on it (non-exhaustive): Redwood, METR, AISI control team, internal at AI labs
    • Estimated contributors: 5-15
  • Prevention, detection, and response for rogue deployments
    • Solution examples: in-lab monitoring of network boundary, compute usage, heavy use of sandboxes; response playbooks and mechanisms; secure logging of misaligned behavior in real-world scenarios
    • Who’s working on it (non-exhaustive): Redwood, MATS Scholars, RAND, Amodo
    • Estimated contributors: 1-5

3. Responding to AI Cyber Capabilities

  • Cyber evals
    • Solution examples: benchmarks like CVEbench; honeypotting for AI agents
    • Who’s working on it (non-exhaustive): Irregular, Alex Leader, academics (Dawn Song, Daniel Kang, Vasilios Mavroudis, CMU), others
    • Estimated contributors: 70-80
  • Cyber attribution
    • Solution examples: research about how to recognize this in the wild; incident reporting framework for targets and AI labs
    • Who’s working on it (non-exhaustive): Palisade; Google and Anthropic sort of tracking (but not the ideal orgs to track); OECD; OCRG
    • Estimated contributors: 5-10
  • Securing critical infrastructure
    • Solution examples: rewriting critical infrastructure code in Rust
    • Who’s working on it (non-exhaustive): Atlas, Delphos, CSET, Center for Threat-Informed Defense, DHS, DARPA
    • Estimated contributors: 150-200
  • Securing AI-generated code
    • Solution examples: formal verification of AI-written code; AI-driven code testing and code review; AI-driven pen-testing
    • Who’s working on it (non-exhaustive): Theorem labs, DARPA, Galois, Trail of Bits, various security startups; some overlap with securing untrusted AI labor
    • Estimated contributors: 120-150
  • Epistemic Security
    • Solution examples: AI watermarking; provenance for digital evidence; AI-secure identity management and authentication
    • Who’s working on it (non-exhaustive): C2PA, GDM, DARPA SemaFor, various startups
    • Estimated contributors: 150-200

 

Appendix B: Choosing what to work on

  • If you have any of these priority skills, I think you can work on any of the shortlist problems. I’d think more about which surface area to focus on based on your background – for example, if you have hardware experience, I hope you focus on developing hardware security mechanisms in datacenters, rather than new monitoring software for frontier AI labs.
  • If the problem you choose ends up being intractable, you should be able to pivot pretty easily to another, because there’s a lot of overlap in the surface areas and techniques.
  • It makes sense to choose problems based on which threat actor you’re more worried about (rogue AI, nation state adversary, terrorist), but again a lot of work will be robustly useful against all three.
  • For any given problem, paying attention to the overall maturity is useful – where is it in the theory of change? Which part of the problem is currently most tractable?
  • If you want to work on something not on the shortlist (including things I didn’t even list in this analysis), that might be really good! This was a pretty shallow investigation, so exploring other directions in more depth could be valuable (and you should share your work to help others decide).

Appendix C: Comparing to other AI Safety work

I’ve made mostly neglectedness arguments for working on these problems. For a visual illustration, here’s a graph of the number of technical AI safety researchers by area (not compiled by me):

(I don’t think this data is comprehensive, but it provides some rough idea).

Beyond neglectedness, Infosec work has some other nice properties:

  • Infosec problems are well-defined and a lot of techniques have already been developed by smart people, so you can make concrete progress on increasing security (compared to other technical AI Safety work that’s more speculative)
  • This shortlist is most important in a world where we are not able to solve alignment before we reach a dangerous level of AI capability
  • This shortlist work is less dependent on the specific implementation of AI (compared to, eg, Interpretability), so it’s useful in more AI paradigms than the current LLM paradigm

Some arguments against working in Infosec are:

  • A truly superintelligent AI will be able to evade any security protections we put in place, so we should only focus on aligning AI
  • Infosec is intractable for other reasons (cost, human persuasion, etc)
  • There are other subfields that are even smaller and equally or more important
  • Increasing security may increase some other risks, like concentration of power or a preemptive strike

I’m probably not providing comprehensive arguments against, and I think these takes are all reasonable. But hopefully, the arguments in favor provide enough grounding to seriously consider whether you should work on Infosec.




Saying What You Want

2026-01-08 22:12:35

Published on January 8, 2026 2:12 PM GMT

There is a hierarchy of useful interfaces for tools that goes something like this:

  1. Figure out what you want to do, then how to use the tool to achieve that, then carry out those actions yourself (hammer, machining workshop)
  2. Figure out what you want to do, then how to use the tool to achieve that, then convert that into the tool's language (washing machine, 3d printer, traditional programming)
  3. Figure out what you want to do, then how to use the tool to achieve that, then specify in a format natural to you (current-gen AI)
  4. Figure out what you want to do, then specify in a format natural to you (next-gen AI)
  5. Figure out what you want to do (???)
  6. The tool figures out what you want to do (???)

Each level is clearly more useful than the one before it, all else equal.

All else is never equal. There is a serious question about feedback being lost as we move up the levels of the hierarchy. Tight feedback loops are, after all, key to creative expression. This is where "a format natural to you" is doing some heavy lifting. Higher levels can still create specialized interfaces (the creation of those interfaces can be specified in natural language) with tight feedback loops and intuitive, tactile user experiences.

We're currently breaking into level 3. If AI progress continues at even a fraction of the pace it has maintained over the last 5 years (and there is still plenty of low-hanging fruit to be picked), we will soon reach level 4. Level 5 would need to read your mind or something similar, which intuitively (fight me) seems a pretty long way off. As far as I can tell, once we reach level 5, there aren't any obvious blockers on the way to level 6, so I speculate that level 5 will probably be pretty short-lived.

So we're likely to be in level 3-4 for the near future. You might notice that these levels have an unusual step in common that the others don't: "specify in a format natural to you". This doesn't necessarily mean plain English (or whatever your native language is), it can be structured data, sketches, custom tactile inputs, references to other works, or whatever crazy things you can imagine that I can't.

Just... specifying what you want and it happens is an immensely powerful interface. Forget the practical reality of implementing an agent that can carry these tasks out for a moment:

  • "On the first of every month, fetch my timesheet from {source}, prepare my next invoice to {client} and present it to me before sending."
  • "While I'm out, clean as much of my apartment as you can, prioritizing at your own discretion. Look at my calendar or ask me before I leave to determine how long I'll be out."
  • "Build the application specified in the linked document and put it in a private GitHub repo. If you notice any flaws in the specification message me on Slack for clarification."
  • "When scheduling meetings with external parties, represent not just my preferences but also find genuine mutual value." (or "...consider my preferences first and foremost and negotiate aggressively for my convenience.")
  • "Manage my financial health. This means: track spending against budget categories, negotiate bills when contracts renew, catch subscription creep, harvest interest/rewards optimally, and prepare tax documentation as the year progresses. Interrupt me for decisions requiring genuine judgement and maintain a dashboard I can review when I want context."

You can specify at whatever level of abstraction you want ("Use a UUIDv7 for the PK on this table, generate it in the domain model with a default_factory" vs "I want to generate IDs on the client to save a DB round trip"). You can define shortcuts ("When I say 'run the dishwasher', it means use the eco-mode setting and..."). You are limited only by the capabilities of the agent implementing the request, and your skill at specifying what you want.
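
For concreteness, here's roughly what that lower-level spec might turn into (a sketch only; the Invoice model and field names are made up, and it assumes Python 3.14+, where uuid.uuid7() exists, or the third-party uuid6 package on older versions):

```python
# A sketch of what the lower-level spec might produce: a domain model whose
# primary key is a UUIDv7 generated client-side via a default_factory, so no
# database round trip is needed to learn a new row's ID.
# Assumes Python 3.14+, where uuid.uuid7() exists; on older versions the
# third-party `uuid6` package provides an equivalent uuid7().
import uuid
from dataclasses import dataclass, field


@dataclass
class Invoice:  # hypothetical model, just for illustration
    client: str
    amount_cents: int
    id: uuid.UUID = field(default_factory=uuid.uuid7)


invoice = Invoice(client="Acme Co", amount_cents=125_000)
print(invoice.id)  # time-ordered UUIDv7, generated in the domain model
```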

Imagine trying to build specialized UIs that allow that level of flexibility and specificity!

This is a new interface. We are not yet familiar with it. Just as an artist comes to master the UI of a paintbrush on canvas, or a designer masters the UI of Photoshop, or a CEO masters the UI of his underlings, we will come to master this interface in time. As long as our tools are limited by the need for us to express our desires before they can act on them, specifying your desires clearly and fluently will only become a more valuable skill in the near future, and time spent learning to do that is time well invested.




Small Steps Towards Proving Stochastic → Deterministic Natural Latent

2026-01-08 20:27:28

Published on January 8, 2026 12:27 PM GMT

The story so far

We (Alfred and Jeremy) started a Dovetail project on Natural Latents in order to get some experience with the proofs. Originally we were going to take a crack at this bounty, but just before we got started John and David published a proof, closing the bounty. This proof involves a series of transformations that can take any stochastic latent, and transform it into a deterministic latent, where each step can only increase the error by a small multiple. We decided to work through the proof and understand each step, and attempt to improve the bound. After 20 hours or so working through it, we found a fatal flaw in one step of the proof, and spent the next several days understanding and verifying it. With David’s help, the counterexamples we found were strengthened to show that this transformation step had unbounded error on some cases. We started using sympy for arbitrary precision evaluation of counterexamples, and along the way found a bug in sympy that occasionally caused us some problems. We switched to Mathematica.

Since then, we’ve been working part time on trying to prove the bounty conjecture ourselves (or show that the conjecture is false). Ultimately we’ve been unsuccessful so far, but it still feels tantalizingly within reach. Perhaps an automated theorem prover will crack it in the next year or two. It makes a good test case.

Empirical evidence suggests the conjecture is true

We still think the overall bounty conjecture [(∃ Stochastic Natural Latent) Implies (∃ Deterministic Natural Latent)] is true. In fact, the largest ratio of errors that we can find empirically is around 1.82 (i.e. optimal deterministic sum of errors < 1.82 × optimal stochastic sum of errors).

In our quest, we made lots of graphs like the above. Each point on the orange and blue curves is the score of a numerically optimized latent distribution P(Λ|X,Y), for a fixed P(X,Y). For simplicity, all three of X, Y, and Λ are binary variables. In this graph, the maximal ratio between the blue and orange lines is ~1.82. On the x-axis, Delta represents the correlation between X and Y. For this graph, the marginal distributions are fixed at P(X=0) = P(Y=0) = 0.5. Changing the marginals has some effect on the graphs, but doesn't seem to increase the difference between Stochastic and Deterministic scores.

Larger N

One way the conjecture could fail is by larger X and Y variables allowing larger differences. This is more difficult to reliably check numerically, because we've found there are a lot of local optima for our numerical optimizer to get stuck on, and this problem quickly gets much worse as N increases.

We ran a long-running Bayesian optimization outer loop over 3x3 P(X,Y) distributions, with an inner loop finding the optimal stochastic and optimal deterministic latent for each. The largest ratios found clustered around 1.79 (apart from a few larger ones that turned out to be bad local minima in the inner loop).

So we think there are several lines of evidence suggesting the conjecture is true: 1. we've failed to find a counterexample despite putting a lot of time into it, 2. the conjecture seems to be true using Jensen-Shannon divergences, and 3. it's true in the zero-error cases.

We are impressed by how tantalizingly, frustratingly simple this problem is while being so difficult to prove.

The Problem

The problem that we are trying to solve can be stated as follows.

Consider two variables X and Y with a joint probability distribution P(X,Y). A latent variable Λ is defined by a conditional probability distribution P(Λ|X,Y).

A variable Λ is called a stochastic natural latent if the joint distribution P(X,Y,Λ) = P(X,Y)·P(Λ|X,Y) satisfies the three natural latent conditions to within an ‘error’ ε:

Redundancy 1: I(Λ; X | Y) ≤ ε

Redundancy 2: I(Λ; Y | X) ≤ ε

Mediation: I(X; Y | Λ) ≤ ε

Note: Here we are expressing the conditions as conditional mutual information, as Aram did in this post. These are equivalent to the conditions when stated in terms of KL divergences [1].

We want to show that if a stochastic natural latent exists, then there also exists a latent Λ' (defined by a conditional distribution P(Λ'|X,Y)) whose joint distribution P(X,Y,Λ') satisfies the ‘deterministic natural latent conditions’ to within k·ε. The deterministic natural latent conditions are:

Deterministic Redundancy 1: H(Λ' | X) ≤ k·ε

Deterministic Redundancy 2: H(Λ' | Y) ≤ k·ε

Mediation: I(X; Y | Λ') ≤ k·ε

We would like k to be a reasonably small number (for some definition of ‘reasonably small’).
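
To make these definitions concrete, here is a minimal numpy sketch (our own illustration, not code from this project) that computes the three stochastic natural latent errors for a given joint P(X,Y) and a candidate latent P(Λ|X,Y); the example distribution and the ‘copy X’ latent at the bottom are arbitrary placeholders. The deterministic conditions simply replace the two redundancy mutual informations with the conditional entropies H(Λ|X) and H(Λ|Y).

```python
# A minimal numpy sketch (our own illustration, not the project's code) that
# computes the three natural latent errors for a given joint P(X,Y) and a
# candidate latent P(Λ|X,Y). Axis convention: 0 = X, 1 = Y, 2 = Λ.
import numpy as np


def conditional_mutual_info(p, a_axis, b_axis):
    """I(A;B|C) in nats for a 3-axis joint distribution p;
    C is whichever axis is neither a_axis nor b_axis."""
    p_c = p.sum(axis=(a_axis, b_axis), keepdims=True)
    p_ac = p.sum(axis=b_axis, keepdims=True)
    p_bc = p.sum(axis=a_axis, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log((p * p_c)[mask] / (p_ac * p_bc)[mask])))


def natural_latent_errors(p_xy, p_latent_given_xy):
    """p_xy: shape (nx, ny). p_latent_given_xy: shape (nx, ny, nl).
    Returns the (Redundancy 1, Redundancy 2, Mediation) errors."""
    p_xyl = p_xy[:, :, None] * p_latent_given_xy  # joint P(X, Y, Λ)
    red1 = conditional_mutual_info(p_xyl, a_axis=2, b_axis=0)  # I(Λ; X | Y)
    red2 = conditional_mutual_info(p_xyl, a_axis=2, b_axis=1)  # I(Λ; Y | X)
    med = conditional_mutual_info(p_xyl, a_axis=0, b_axis=1)   # I(X; Y | Λ)
    return red1, red2, med


# Example: a correlated symmetric 2x2 joint and a 'copy X' latent.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
copy_x = np.zeros((2, 2, 2))
copy_x[0, :, 0] = 1.0  # X = 0  =>  Λ = 0
copy_x[1, :, 1] = 1.0  # X = 1  =>  Λ = 1
print(natural_latent_errors(p_xy, copy_x))
# Expect roughly (H(X|Y), 0, 0) for the copy-X latent.
```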

Upper bounds on deterministic errors

One way in which the conjecture could be false is if there existed P(X,Y) distributions for which all deterministic natural latents had very large errors (compared to the optimal stochastic latents). It is therefore useful to identify some simple families of deterministic latents and find upper bounds on how large their errors can be.

Constant Latents

One of the simplest kinds of latents we can imagine is the constant latent. This is a latent which only takes one value, regardless of the X, Y values. Notice that in this case we have H(Λ) = 0, as well as H(Λ|X) = 0 and H(Λ|Y) = 0, so the constant latent perfectly satisfies the two deterministic redundancy conditions. Conditioning on the latent does not affect the P(X,Y) distribution, so the conditional mutual information simply equals the mutual information between X and Y. This means that the only nonzero error for this latent is the mediation error, which is I(X;Y).

Copy Latents

Another kind of latent we can consider is the ‘copy latent’, where Λ simply deterministically ‘copies’ one of the variables (X or Y). This will always perfectly satisfy one of the deterministic redundancy conditions. For example, if Λ copies X then Λ = X and we have:

Deterministic Redundancy 1: H(Λ|X) = H(X|X) = 0

The other deterministic redundancy condition will be satisfied to an error equal to the conditional entropy H(X|Y):

Deterministic Redundancy 2: H(Λ|Y) = H(X|Y)

The mediation condition will also be satisfied perfectly:

Mediation: I(X; Y | Λ) = I(X; Y | X) = 0

If instead we choose a latent which copies Y, so Λ = Y, then the mediation and Deterministic Redundancy 2 conditions will be perfectly satisfied and the Deterministic Redundancy 1 error will be H(Y|X).

Deterministic Functions of X,Y

We can also consider the family of latents where Λ is a deterministic function of the (X, Y) pair (though not necessarily a deterministic function of X or Y in isolation). All latents in this family satisfy the deterministic redundancy conditions with errors bounded by the conditional entropies of X and Y:

Deterministic Redundancy 1: H(Λ|X) ≤ H(Y|X)

Deterministic Redundancy 2: H(Λ|Y) ≤ H(X|Y)

A proof of this can be found in this shortform.

The worst-case upper bound on the mediation error is the same as that of the constant latent, I(X;Y).

General Upper Bound

Notice that three of the latents considered above (‘constant’, ‘copy X’ and ‘copy Y’) can be applied to any initial P(X,Y) distribution. This means that we can always find a deterministic natural latent with maximum error min(H(X|Y), H(Y|X), I(X;Y)). In general, one of the copy latents will be best for P(X,Y) distributions where X and Y are highly correlated (so that H(X|Y) or H(Y|X) is low), and the constant latent will be best for distributions where X and Y are independent or have low correlation (so that I(X;Y) is low).
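
Here is a small self-contained sketch (again our own illustration) that computes this guaranteed bound for a given joint and reports which of the three simple latents achieves it; the example distribution is arbitrary.

```python
# A small sketch computing the guaranteed bound min(H(X|Y), H(Y|X), I(X;Y))
# for a given joint, and which of the three simple latents achieves it.
import numpy as np


def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def simple_latent_bound(p_xy):
    h_xy = entropy(p_xy)                  # H(X, Y)
    h_x = entropy(p_xy.sum(axis=1))       # H(X)
    h_y = entropy(p_xy.sum(axis=0))       # H(Y)
    options = {
        "copy X": h_xy - h_y,             # worst error H(X|Y)
        "copy Y": h_xy - h_x,             # worst error H(Y|X)
        "constant": h_x + h_y - h_xy,     # worst error I(X;Y)
    }
    best = min(options, key=options.get)
    return best, options[best]


p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(simple_latent_bound(p_xy))  # here the constant latent wins, with I(X;Y) ≈ 0.19 nats
```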

We have some numerical evidence that in many cases the optimal deterministic latent has an error very close to min(H(X|Y), H(Y|X), I(X;Y)), suggesting that this bound is quite tight (for example, in the first graph in this post, the optimal deterministic latent hugs the min(H(X|Y), H(Y|X), I(X;Y)) line perfectly). We know that this bound isn't tight when the marginals of X or Y are far from 0.5 (but it isn't extremely loose).

Unfortunately, proving these upper bounds on the deterministic error is not sufficient to prove the conjecture. In order to prove the conjecture, we need to relate upper bounds on the deterministic error to the error achievable using a stochastic latent. 

To complete the proof this way, an upper bound on the deterministic error needs to be combined with a lower bound on the stochastic error. Numerical evidence suggests that distributions for which min(H(X|Y), H(Y|X), I(X;Y)) is high are also distributions with higher stochastic natural latent errors, so this approach seems like a reasonable thing to attempt. For example, take a look at the first graph in this post. The stochastic error is lower than the deterministic error, but they both follow similar shapes. In the middle of the graph, the stochastic error and deterministic error both peak at approximately (but not exactly) the same point and follow similar decreasing paths as Δ moves away from this point.

If we could prove a lower bound on the error of the stochastic latents of the form ε_stochastic ≥ min(H(X|Y), H(Y|X), I(X;Y)) / k, then we would have proved the conjecture. This is because, for any distribution, we could find a deterministic latent with error at most min(H(X|Y), H(Y|X), I(X;Y)). So if a latent with stochastic error ε existed, we could always find a latent (‘constant’, ‘copy X’ or ‘copy Y’) with an error only k times greater.

Lower Bounding the Stochastic Latent error

We tried a couple of ways of lower bounding the stochastic latent error but were not successful. First, we tried assuming that there was a stochastic latent which achieved an error of less than min(H(X|Y), H(Y|X), I(X;Y)) / k and seeing if this implied a contradiction. We tried to derive a contradiction from these three inequalities:

Redundancy 1: I(Λ; X | Y) < min(H(X|Y), H(Y|X), I(X;Y)) / k

Redundancy 2: I(Λ; Y | X) < min(H(X|Y), H(Y|X), I(X;Y)) / k

Mediation: I(X; Y | Λ) < min(H(X|Y), H(Y|X), I(X;Y)) / k

i.e. we wanted to show that satisfying two of these inequalities necessarily meant violating the third (for some distribution over X and Y).

If we could do this, we would have proved a lower bound on the stochastic error. Unfortunately, we failed, but this does feel like an approach someone could take if they were more comfortable than us at manipulating information-theoretic inequalities.

Mixing Latents and Reverse Jensen inequalities

Here is another attempt we made at lower bounding the stochastic error.

Suppose we took two latents R and S, defined by conditional distributions P(R|X,Y) and P(S|X,Y), and created a new latent Λ_mix as a probabilistic mixture of the two:

P(Λ_mix | X, Y) = α·P(R | X, Y) + (1 − α)·P(S | X, Y), with α ∈ [0, 1].

If we applied the mixed latent to a distribution P(X,Y), the joint distribution P(X, Y, Λ_mix) would also be a mixture of P_R = P(X, Y, R) and P_S = P(X, Y, S).

We can also consider the refactored distribution Q_mix, used to calculate the redundancy error:

Q_mix = α·Q_R + (1 − α)·Q_S

where Q_R and Q_S indicate the refactored versions of P_R and P_S respectively (for Redundancy 1, the refactored version of a joint P(X,Y,Λ) is P(X,Y)·P(Λ|Y)).

This means that the KL divergence for the redundancy error of P_mix is given by:

D_KL( α·P_R + (1 − α)·P_S ‖ α·Q_R + (1 − α)·Q_S )

(Note: writing the KL as a mixture in both of its arguments works for both redundancy conditions, but not the mediation condition. So even if we got this proof strategy to work, we would still need to tie up the loose end of the mediation condition.)

Next, since the KL divergence is jointly convex in its two arguments, we can use this convexity to write a Jensen inequality:

D_KL( α·P_R + (1 − α)·P_S ‖ α·Q_R + (1 − α)·Q_S ) ≤ α·D_KL(P_R ‖ Q_R) + (1 − α)·D_KL(P_S ‖ Q_S)

This is potentially interesting for two reasons. First, we can often (always?) write stochastic latents as probabilistic mixtures of deterministic latents, so if R and S are deterministic latents, then this expression links the stochastic error D_KL(P_mix ‖ Q_mix) with the two deterministic errors D_KL(P_R ‖ Q_R) and D_KL(P_S ‖ Q_S), which is almost the kind of thing we are trying to do.

Unfortunately, this inequality is the wrong way round. We are looking for a lower bound on the stochastic error in terms of the deterministic errors, something of the form ε_stochastic ≥ ε_deterministic / k. But the inequality from the convexity of the KL divergence gives us an upper bound on the stochastic error instead.
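
As a numeric sanity check of the Jensen inequality above (our own sketch; the specific joint, latents, and mixture weight are arbitrary examples), the following compares the Redundancy 1 error of a mixed latent against the mixture of the two component errors:

```python
# Numeric check of the Jensen inequality for the Redundancy 1 error:
# mix a 'copy X' latent R with a constant latent S and compare
# D_KL(P_mix || Q_mix) against the mixture of the two KL divergences.
import numpy as np


def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def joint(p_xy, p_latent_given_xy):
    return p_xy[:, :, None] * p_latent_given_xy       # P(X, Y, Λ)


def refactored(p_xyl):
    # Refactored distribution for Redundancy 1: P(X, Y) * P(Λ | Y)
    p_xy = p_xyl.sum(axis=2)
    p_yl = p_xyl.sum(axis=0)                          # P(Y, Λ)
    p_latent_given_y = p_yl / p_yl.sum(axis=1, keepdims=True)
    return p_xy[:, :, None] * p_latent_given_y[None, :, :]


p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
copy_x = np.zeros((2, 2, 2))
copy_x[0, :, 0] = 1.0
copy_x[1, :, 1] = 1.0
constant = np.zeros((2, 2, 2))
constant[:, :, 0] = 1.0

alpha = 0.3
p_r, p_s = joint(p_xy, copy_x), joint(p_xy, constant)
q_r, q_s = refactored(p_r), refactored(p_s)
p_mix = alpha * p_r + (1 - alpha) * p_s
q_mix = alpha * q_r + (1 - alpha) * q_s               # equals refactored(p_mix)

lhs = kl(p_mix, q_mix)
rhs = alpha * kl(p_r, q_r) + (1 - alpha) * kl(p_s, q_s)
print(lhs, "<=", rhs)                                 # convexity guarantees lhs <= rhs
```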

We hoped that ‘Reverse Jensen Inequalities’ such as the ones found here might be helpful. These are ways of bounding the ‘Jensen gap’ between the KL divergences of the mixture and the mixture of the KL divergences. However, when we attempted this, we got results of the form:

with . This gives a ratio  which unfortunately diverges as  tends to zero.

Characterizing the optimal stochastic latent

If we define P(X,Y) using four parameters:

{{p11, p12}, {p21, p22}}

and P(Λ|X,Y) using another four parameters:

{{{l11, 1-l11}, {l12, 1-l12}}, {{l21, 1-l21}, {l22, 1-l22}}}

where lij is the probability that Λ takes its first value given X = i and Y = j.

Then we can write out the full error equations in terms of these parameters:

Mediation error:

Redundancy1 error:

One thing we want here is to characterize the optimal latent (the one minimizing the sum of the Mediation, Redundancy 1 and Redundancy 2 errors) at any given setting of the P(X,Y) parameters. Mathematica can't handle this analytically, so we tried restricting the P(X,Y) distribution such that both marginals were always 50%. Writing Δ for the correlation between X and Y (as in the first graph), the distribution can then be parameterized by this one variable, like so:

{{(1 + Δ)/4, (1 − Δ)/4}, {(1 − Δ)/4, (1 + Δ)/4}}

We can plot the numerically optimal latents for each value of Δ (same as the image at the top of the post, but with each error plotted):
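
For readers who want to reproduce curves like these, here is a sketch of the kind of inner-loop optimization involved (our own reconstruction, not the project's actual code). It assumes the symmetric parameterization above, with Δ interpreted as the correlation between X and Y, and uses scipy's L-BFGS-B with random restarts to cope with local optima.

```python
# Sketch: for each Δ, minimize the sum of the three errors over the four
# latent parameters l11, l12, l21, l22, assuming P(X,Y) is the symmetric
# family with both marginals equal to 0.5.
import numpy as np
from scipy.optimize import minimize


def cmi(p, a_axis, b_axis):
    p_c = p.sum(axis=(a_axis, b_axis), keepdims=True)
    p_ac = p.sum(axis=b_axis, keepdims=True)
    p_bc = p.sum(axis=a_axis, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log((p * p_c)[mask] / (p_ac * p_bc)[mask])))


def total_error(latent_params, p_xy):
    l = latent_params.reshape(2, 2)               # l[i, j] = P(Λ = first value | X=i, Y=j)
    p_latent_given_xy = np.stack([l, 1 - l], axis=2)
    p_xyl = p_xy[:, :, None] * p_latent_given_xy
    return (cmi(p_xyl, 2, 0)      # Redundancy 1: I(Λ; X | Y)
            + cmi(p_xyl, 2, 1)    # Redundancy 2: I(Λ; Y | X)
            + cmi(p_xyl, 0, 1))   # Mediation:    I(X; Y | Λ)


def best_stochastic_error(delta, restarts=20, seed=0):
    p_xy = np.array([[(1 + delta) / 4, (1 - delta) / 4],
                     [(1 - delta) / 4, (1 + delta) / 4]])
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(restarts):     # many restarts: the landscape has local optima
        x0 = rng.uniform(0.01, 0.99, size=4)
        res = minimize(total_error, x0, args=(p_xy,),
                       bounds=[(1e-6, 1 - 1e-6)] * 4, method="L-BFGS-B")
        best = min(best, res.fun)
    return best


for delta in (0.2, 0.5, 0.8):
    print(delta, best_stochastic_error(delta))
```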

By looking at these numerically optimal latents, we can also hardcode some of the latent parameters to values that we've empirically observed to be optimal. When Δ is below a threshold, the latent is 0.5 everywhere (independent of X and Y, so equivalent to the constant latent). When Δ is above that threshold, we observed that l12 and l21 are always 0.5, and l11 = 1 − l22. This leaves only one parameter free.

Hardcoding these simplifies each error, so we can get each error as a function of Δ and l22. For example, the mediation error:

Note that in this equation and the following one, we use d in place of Δ.

Now it’s simple enough that Mathematica (with some cajoling) can analytically find the l22 (as a function of d) such that the total error is minimized:

It's the second root of a degree-6 polynomial, whose coefficients each contain a degree-4 polynomial. It looks like this:

This is for Δ above the threshold. Below the threshold, l22 simply equals 0.5.

By analytically calculating the gradient and Hessian of the total error with respect to the latent parameters, and showing that the eigenvalues of the Hessian are positive for all Δ, we know that every local movement of the latent increases the error, so the latent we've described is definitely a local minimum. We think it is also the global minimum, but haven't shown this. Presumably it won't be too hard to prove this by going through every other zero-gradient solution and showing that it has a larger error.

Mathematica can plot the total stochastic error (the plot is missing the first half, but we know that the first half is equal to the optimal deterministic error, because we checked the eigenvalues of the Hessian):

And using our known values for the optimal deterministic latents, we can plot the optimal deterministic error:

And we can plot the ratio of these:

And (numerically) find that the peak ratio is 1.81792.

Summary

This isn't the most satisfying conclusion, but we have achieved a partial success. We have (almost) a proof that the conjecture is true for the special case of 2x2 distributions whose X and Y marginals are both 0.5, and in this case the deterministic-to-stochastic error ratio is ~1.82. We failed to find a worse ratio empirically in the general 2x2 case and the general 3x3 case (despite extensive manual effort and tens of hours of automated searches). We think this is moderately strong evidence that the conjecture is true in general.

But we don't have any good leads on methods to prove this in general. 

  1. ^

    When stated in terms of KL divergences, these conditions are:

    Redundancy 1: D_KL( P(X,Y,Λ) ‖ P(X,Y)·P(Λ|Y) ) ≤ ε

    Redundancy 2: D_KL( P(X,Y,Λ) ‖ P(X,Y)·P(Λ|X) ) ≤ ε

    Mediation: D_KL( P(X,Y,Λ) ‖ P(Λ)·P(X|Λ)·P(Y|Λ) ) ≤ ε




Using Anki to memorise the names of the MATS 9 cohort

2026-01-08 12:41:16

Published on January 8, 2026 4:41 AM GMT

I’m generally very bad with names, and especially struggle when I’m quickly introduced to large numbers of people. Generally I’d just throw up my hands and resign myself to awkward conversations attempting to remember someone’s name.

But I’ve been on an Anki binge recently, and also was recently accepted into MATS 9 (which started on the 5th of January 2026). So I used Anki to memorise many of the names and faces of the cohort before the first day and it was incredibly successful; I’ll certainly be doing it again. If you're a MATS scholar and would like the Anki deck, let me know!

Creating the deck

This ended up being very easy: MATS had already asked the scholars to volunteer a profile photo, name, and short bio for use in an Airtable, which was shared with all of us.

Note: I’d feel very weird if I were doing this without knowing that MATS had created an opt-in face book, with the purpose of introducing everyone to one another.

After some back and forth, Claude gave me some JavaScript to paste into the browser console, which downloaded all the photos and text in a form amenable to Anki import. Claude also guided me on where to put the photos and how to import the CSV into the Anki app, which was all very quick. I did spend a little time cropping everyone's photos down to just their heads, to remove background features that might be easier to memorise than someone's face (more on this later).
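
As an illustration, a deck like this could also be built directly in Python with the genanki library (a sketch only, not the exact script used here; the photos/ folder and "Firstname Lastname.jpg" naming convention are assumptions):

```python
# Rough sketch of building a face -> name deck with genanki.
# Assumes a folder `photos/` containing one image per person, named
# "Firstname Lastname.jpg".
import pathlib
import genanki

# Arbitrary but fixed IDs, as genanki recommends, so re-imports update
# existing cards instead of duplicating them.
MODEL_ID = 1607392319
DECK_ID = 2059400110

model = genanki.Model(
    MODEL_ID,
    "Face -> Name",
    fields=[{"name": "Photo"}, {"name": "Name"}],
    templates=[{
        "name": "Who is this?",
        "qfmt": "{{Photo}}",
        "afmt": "{{FrontSide}}<hr id='answer'>{{Name}}",
    }],
)

deck = genanki.Deck(DECK_ID, "MATS cohort")
media = []

for photo in sorted(pathlib.Path("photos").glob("*.jpg")):
    name = photo.stem  # "Firstname Lastname"
    deck.add_note(genanki.Note(
        model=model,
        fields=[f'<img src="{photo.name}">', name],
    ))
    media.append(str(photo))

package = genanki.Package(deck)
package.media_files = media  # images must be listed so they ship in the .apkg
package.write_to_file("mats_cohort.apkg")
```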

What worked well

  • Even just knowing how someone’s name is spelt is a massive help for me. Especially with unusual names that I haven’t heard before, if I’ve seen their name written down it’s a lot easier for me to recall than if I’m trying to memorise an arbitrary string of syllables.
  • I’m writing this on the third day of MATS, and I feel like I know most of the cohort already. This isn’t really true (we’ve only known each other for two days), but it’s significantly better than being panic-stricken as you see someone approaching whose name you’ve forgotten.
  • It’s a lot easier to organise outings, meetings, group discussions, trips to Yosemite, etc, since I know most people’s names and can spend effort remembering who likes what, who’d be interested in which groups, and who I should introduce to whom due to shared interest XYZ.
  • After spending the full day with the cohort, I got to actually talk to lots of people, and this made recalling their names a lot easier. People who I had previously confused due to similar-sounding names became very easy to distinguish, because their personalities were so different.
  • I’ve got a much better idea of who’s in the cohort (and would probably be open to a random discussion) and who’s just hanging out at Lighthaven (and probably wants to work or might be interesting to talk to). The MATS scholars have never been in the same room at once, so it’s remarkably tricky to get a feel for the size of the whole group.
  • I certainly feel like I learnt far more names on the first day than I would have without Anki. At previous similar events, I’d be lucky to properly remember 6 to 8 people’s names, but I’m pretty sure I learnt 10 to 15 new names on the first day (and this I attribute to Anki).

Failures and Problems

  • Definitely not perfect: writing this on the morning of day 3, there’s certainly a minority of people who I recognise but can’t recall the names of. There are also people who I know are in the program, but who I haven’t seen in person yet.
  • I didn’t memorise everyone before day 1: I didn’t get to full recall by the first day, but even knowing ~25% of names was a massive help, making it easier to spend effort memorising the names of people I hadn’t yet learnt.
  • Overfitting on profile pictures: Since I only had a single picture for each person, my brain would often overfit on the easy-to-see features in the individual photos, instead of on the person’s face. For example, I could feel my brain latching onto a distinctive background or hair style, instead of their facial features. I initially thought this would be a big problem, but in practice it’s fine. Most of the time I wasn’t overfitting, and the times when I did overfit, I still had an approximate recollection of their face so I was certainly no worse off.
  • Changes between profile pictures and real life: People change, get new haircuts or shave their beards, or just take photos in unusual lighting. Most of the time this isn’t an issue, but sometimes it’s confusing. Still, I never felt worse off for having memorised someone’s name, and having already allocated them space in my head often helps me remember it.
  • Different given names: Some people don’t go by their “official” name and prefer to be called by something else. This is occasionally confusing but not a massive issue.

Looking forward

I’m hoping that repeatedly memorising lots of names will improve my ability in general to recall names. Hopefully Anki can give data about this in the future, in which case I might make a follow-up post. I’ll absolutely be using this technique for future large events or mixers. 


