Published on October 28, 2025 4:11 AM GMT
(With thanks to @Lucie Philippon and @johnswentworth for encouragement and feedback, and @DusanDNesic and @tom4everitt for the meat of the topic, among many. Yes, this is about my research direction at PIBBSS for Summer 2025; it's a little incomplete, because I wrote it before I got my main result.)
You may rightly wonder what else I’m up to, if it’s not just blogging and the occasional adventure. The short answer is that I spend most of my time and mental energy taking part in a summer research fellowship in AI safety called PIBBSS, where my research direction has to do with questions like “how does an agent help a principal, even if the agent doesn’t know what the principal wants?” and “how much harder is it for the principal to teach the agent what they want, and how much worse are the outcomes, if the principal doesn’t know for sure what they want?”.
The long(ish) answer is this post. I'll avoid jargon, mathematical notation, and recommendations to read other long papers, but some amount of that will be unavoidable. Here's the basic setup - called a "Markov decision process", sometimes with adjectives like "decentralized" or "partially observable" - which forms the ground assumption for the entire subfield my research direction lies in, called "inverse reinforcement learning": we have a principal and an agent, each of which can pick actions from some fixed (and connotationally discrete) set, along with some fixed reward function. All of these live in some world, which we represent as some (usually simple) state, along with a transition rule that takes in the state of the world and both players' actions and returns a new state, possibly involving some randomness. The reward function I mentioned earlier takes in the same state of the world before the players act, the actions of the players, and the outcome state that the transition function gave us, and outputs a real number, also potentially involving some randomness; we often break down the reward function into a generic reward function and a learnable reward parameter that supplies the detail for how to modify the generic reward function or what additional value to plug in to get the "real" reward function.

By "inverse reinforcement learning" (as opposed to just plain "reinforcement learning"), we mean that our problem somehow involves the agent learning the principal's reward function - or equivalently, their reward parameter - from a starting state of relative ignorance, and doing it purely through observation of the presumed-competent principal or maybe asking questions of the principal, rather than having the principal tell them. This is particularly valuable when we don't have a good way of writing down, let alone communicating, what the reward function should be - this is often the case in both AI safety and ordinary life. (I often say that if anything in agent foundations - the approach to AI safety that starts by asking "what will any mind do here - not just a system of the most modern machine learning paradigm, but any reasonably capable mind at all?" - is to work, then it must work just as well for two humans as for a human and an AI.)

As a simple example of inverse reinforcement learning, we might have a principal navigating a room full of obstacles and a goal location, and an agent watching them who can't see the goal location and might not even know that the reward function involves approaching a goal - early on, they might have guesses like "jump over as many red blocks as possible" or "never turn left" - but who by the end will hopefully have a strong understanding of why the principal is taking the actions they do.
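To make the moving parts concrete, here is a minimal sketch of that setup in Python. Everything in it (the tiny two-state world, the function names, the 0.9 success chance, treating the reward parameter as "how much the goal is worth") is my own illustrative choice rather than anything from a specific paper; it's just the state / actions / transition / parametrized-reward skeleton described above.

```python
import random

# A toy two-player decision process: a state, both players' actions,
# a (possibly random) transition rule, and a reward parametrized by theta.
STATES = ["start", "goal"]
PRINCIPAL_ACTIONS = ["wait", "move"]
AGENT_ACTIONS = ["wait", "move"]

def transition(state, principal_action, agent_action):
    """Returns the next state of the world; random to reflect an uncertain world."""
    if "move" in (principal_action, agent_action) and random.random() < 0.9:
        return "goal"
    return state

def reward(state, principal_action, agent_action, next_state, theta):
    """Generic reward function shaped by a learnable reward parameter theta.

    Here theta is just "how much reaching the goal is worth"; in inverse
    reinforcement learning, the agent's job is to infer theta from the
    principal's behavior rather than being told it."""
    return theta if next_state == "goal" else 0.0

# One step of the world:
state = "start"
next_state = transition(state, "move", "wait")
print(reward(state, "move", "wait", next_state, theta=1.0))
```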
For an important and barely-tangential concept, we also talk about Boltzmann rationality as a model of bounded irrationality, like what a fallible human or fault-prone AI system might display. The concept is simply this: that a Boltzmann-rational agent will pick an incorrect action - the one with less reward, that sets them up incorrectly, whatever - with probability exponential in some multiple of the (negative) regret from picking the wrong action; the multiple is a parameter that says how close to optimal or rational the agent is, with larger numbers being better. For example, if a Boltzmann-rational agent could pick action A for 10 reward or action B for 0 reward, they will pick action B with probability e^(-10 * k): if k = 0.1, we have a very irrational agent who will pick the wrong choice ~37% of the time, but if k = 0.5, they will pick the wrong action only ~0.7% of the time - the parameter makes a big difference! We care about models of irrationality because of the fallibility of any bounded agent, humans very much included, and because we care about the incentives that a principal's irrationality sets up for an agent - even one that cares about the principal getting what they want! - to manipulate what the principal sees, knows, can do, or even wants. The Off-Switch Game is a toy model of this dynamic, but is out of scope for this post.
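Here's a small sketch of how a Boltzmann-rational choice is usually computed, as a softmax over rewards (my own code and naming). The post's e^(-k·regret) figure is, in this form, exactly the odds of the worse action against the better one, which is approximately its probability once that probability is small.

```python
import math

def boltzmann_probs(rewards, k):
    """Softmax over rewards: P(action i) is proportional to exp(k * reward_i).

    Larger k means a more rational, more reward-sensitive chooser."""
    weights = [math.exp(k * r) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

# Two actions: A worth 10 reward, B worth 0, so the regret of picking B is 10.
for k in (0.1, 0.5):
    p_a, p_b = boltzmann_probs([10, 0], k)
    # The odds P(B)/P(A) come out to exactly exp(-10 * k).
    print(f"k={k}: P(wrong action B) = {p_b:.3f}, odds vs. A = {p_b / p_a:.4f}")
```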
My research direction goes a little further into so-named cooperative inverse reinforcement learning, which additionally assumes that the agent actively wants to help the principal and that both principal and agent share a reward signal - both of them reap the outcome of their joint action. For example, the principal might be a baker who hasn’t tried many types of fruit, and the agent might be their sous-chef with a good understanding of flavor who the principal sends to market to pick up ingredients for pies; both of them will benefit equally from selling good pies. In one approach to this subject, called “assistance games”, we model the principal as part of the environment that the agent can ask questions of; the nice thing about doing so and removing the artificial boundary between training and deployment in the process is that we can prove that the agent has an incentive to ask its questions only as they become relevant, and to make helpful plans that try to preserve optionality in anticipation of future feedback. If you want to look more into the formal foundations of my work, go look up “partially observable Markov decision process”, but if you don’t know what those are, don’t worry about it for now - you just need to know that the way that the principal and agent make decisions without perfect knowledge of the consequences of their actions involves both using an observation function that looks at the actions that they both took and the state of the world afterwards and returns an observation that they can update on, and also introducing a discount parameter that represents the value of getting right answers - and metaphorically getting paid - as quickly as possible.
Now, cooperative inverse reinforcement learning - what I privately call helpfulness theory - is great, but it makes a few assumptions about the nature of the world that are much too strong. It assumes that principals’ reward functions are fixed, never changing over time with the passing of seasons or random environmental shocks, nor being influenced by their agents; it assumes that the state of the world is perfectly visible to both principal and agent; it assumes that the principal and agent always get perfectly accurate reward signal; and it assumes that the principal already knows the reward function for certain. The last of these is my current research direction: helpfulness theory with equivocal principals; that is, cooperative inverse reinforcement learning with principals who start off not knowing the reward function, though connotationally with better priors than the agent has.
We can break down equivocal principals into two major classes: inexperienced ones, who only need to put in sufficient time and effort to learn the true reward parameter to within any bound on degree of certainty or size of error, and irresolute ones, who are stuck with some meaningfully imperfect understanding of the reward parameter - be it because they get stuck in uncertainty, because they cannot approach closely enough to the ground truth, or because they simply don't get to have enough time or effort to allocate to figuring the reward parameter out. The former case is much easier to tackle than the latter case: all we need to instruct the principal to do is to start off by learning the reward function themself, and then - as appeals to any mathematician - we have reduced the problem to a previously solved case: that of classical CIRL, which has an effectively computable solution. The latter case is much more difficult, and is the topic of my active research; the questions I find most relevant are "how much worse will the reward payoffs be, if the principal has bounded error in their guess at the reward parameter" and "what are some methods the principal can use to cope with a permanent lack of certain knowledge of the reward parameter while educating an agent". My current hope is to get good regret bounds on reward in terms of the error in the principal's understanding of the reward parameter, expressed as a distance between the ground-truth reward parameter and either the principal's best guess or the credence-weighted expected value of their guesses; a paper about a pseudometric on reward space called "STARC" which has some nice properties has given me hope in that direction. My best guesses at some interestingly useful tactics for an irresolute principal to try include avoiding teaching their agent to optimize overly sharply, practicing epistemic humility in how strongly they teach and correct, trying to teach them a distribution - "distributional learning" - rather than a point-estimate, and trying to teach some compromise reward function incorporating best guesses.
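As a toy illustration of the kind of bound I'm after (entirely my own construction, using plain Euclidean distance between reward parameters rather than STARC): if rewards are linear in the reward parameter, the regret from optimizing the principal's mistaken guess instead of the ground truth is controlled by the distance between the two parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: each action a has features phi(a); reward is theta . phi(a).
features = rng.normal(size=(5, 3))                    # 5 actions, 3-dim features
theta_true = np.array([1.0, -0.5, 2.0])               # ground-truth reward parameter
theta_guess = theta_true + 0.3 * rng.normal(size=3)   # principal's imperfect guess

best_for_guess = np.argmax(features @ theta_guess)    # what actually gets optimized
best_for_true = np.argmax(features @ theta_true)      # what should be optimized

regret = features[best_for_true] @ theta_true - features[best_for_guess] @ theta_true
# Crude bound: regret <= 2 * max_a ||phi(a)|| * ||theta_true - theta_guess||
bound = 2 * np.linalg.norm(features, axis=1).max() * np.linalg.norm(theta_true - theta_guess)
print(f"regret = {regret:.3f}, bound = {bound:.3f}")
```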
All of this is still in flux; it’s terribly early stages for any of my thinking here, and I’ve only just got started on sketching out proofs. I just wanted to write something accessible to point people to when they ask me about what I’m working on. Hopefully this helped them understand my research direction, or at least put them in a position to ask some good questions!
Published on October 28, 2025 12:25 AM GMT
Three out of three CEOs of top AI companies agree: "Mitigating the risk of extinction from AI should be a global priority."
How do they plan to do this?
Anthropic has a Responsible Scaling Policy, Google DeepMind has a Frontier Safety Framework, and OpenAI has a Preparedness Framework, all of which were updated in 2025.
All three policies have similar “bones”.[1] They:

- define capability thresholds in a few key risk areas (things like CBRN weapons, cyberattacks, and autonomous AI R&D),
- monitor frontier models with evaluations designed to flag when a threshold is being approached,
- commit to responses (stronger safeguards, delayed deployment, or paused development) when a model crosses a threshold, and
- specify governance structures for making those calls.
TL;DR summary table for the rest of the article:
| | Anthropic | Google DeepMind | OpenAI |
|---|---|---|---|
| Safety policy document | Responsible Scaling Policy | Frontier Safety Framework | Preparedness Framework |
| Monitors for: | Capability Thresholds | Critical Capability Levels (CCLs) | High/Critical risk capabilities in Tracked Categories |
| …in these key areas: | CBRN[2] misuse/weapons<br>Autonomous AI R&D<br>Cyber operations ("may" require stronger than ASL-2) | CBRN misuse/weapons<br>Autonomous AI R&D<br>Cybersecurity | Biological/chemical misuse<br>Autonomous AI R&D<br>Cybersecurity |
| …and also these lower-priority areas: | Persuasion[3] | Deceptive alignment (focused on detecting instrumental reasoning) | Sandbagging, autonomous replication, long-range autonomy |
| Monitoring consists of… | Preliminary assessment that flags models that either 1) are 4x more "performant on automated tests[4]" or 2) have accumulated >6 months[5] of fine-tuning/other "capability elicitation methods"<br>Comprehensive assessment that includes threat model mapping, empirical evaluations, elicitation, and forecasting | Early warning evaluations with predefined alert thresholds<br>External red-teaming | Scalable evaluations (automated proxy tests)<br>Deep dives (e.g., red-teaming, expert consultation) |
| Response to identified risk: | Models crossing thresholds must meet ASL-3 or ASL-4 safeguards<br>Anthropic can delay deployment or limit further training<br>Write a Capability Report and get signoff from CEO, Responsible Scaling Officer, and Board | Response Plan formulated if model hits an alert threshold<br>The plan involves applying the predetermined mitigation measures<br>If mitigation measures seem insufficient, the plan may involve pausing deployment or development<br>General deployment only allowed if a safety case is approved by a governance body | High-risk models can only be deployed with sufficient safeguards<br>Critical-risk models trigger a pause in development |
| Safeguards against misuse | Threat modeling, defense in depth, red-teaming, rapid remediation, monitoring, share only with trusted users | Threat modeling, monitoring/misuse detection | Robustness, usage monitoring, trust-based access |
| Safeguards against misalignment | "Develop an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels." | Safety fine-tuning, chain-of-thought monitoring, "future work" | Value alignment<br>Instruction alignment<br>Robust system oversight<br>Architectural containment |
| Safeguards on security | Access controls<br>Lifecycle security<br>Monitoring | RAND-based security controls, exfiltration prevention | Layered security architecture<br>Zero trust principles<br>Access management |
| Governance structures | Responsible Scaling Officer<br>External expert feedback required<br>Escalation to Board and Long-Term Benefit Trust for high-risk cases | "When alert thresholds are reached, the response plan will be reviewed and approved by appropriate corporate governance bodies such as the Google DeepMind AGI Safety Council, Google DeepMind Responsibility and Safety Council, and/or Google Trust & Compliance Council." | |
Let’s unpack that table and look at the individual companies’ plans.
Anthropic’s Responsible Scaling Policy (updated May 2025) is “[their] public commitment not to train or deploy models capable of causing catastrophic harm unless [they] have implemented safety and security measures that will keep risks below acceptable levels.”
For Anthropic, a Capability Threshold is a capability level an AI could surpass that makes it dangerous enough to require stronger safeguards. The strength of the required safeguards is expressed in terms of AI Safety Levels (ASLs). An earlier blog post summarized the ASLs in terms of the model capabilities that would require each ASL:
As of September 2025, the most powerful model (Opus 4.1) is classed as requiring ASL-3 safeguards.
Anthropic specifies Capability Thresholds in two areas and is still thinking about a third:
AI R&D-4 requires ASL-3 safeguards, AI R&D-5 requires ASL-4. Moreover, either threshold would require Anthropic to write an “affirmative case” explaining why the model doesn’t pose unacceptable misalignment risk.
CBRN-3 requires ASL-3 safeguards. CBRN-4 would require ASL-4 safeguards (which aren't yet defined; Anthropic has stated they will provide more information in a future update).
Anthropic also lists capabilities that may require future thresholds yet-to-be-defined, such as persuasive AI, autonomous replication, and deception.
If it can’t be proved that a model is sufficiently far below a threshold, Anthropic treats it as if it’s above the threshold. (This in fact happened with Claude Opus 4, the first model Anthropic released with ASL-3 safety measures.)
Anthropic routinely does Preliminary Assessments to check whether a model is “notably more capable” than previous ones, meaning either:

- it is notably more performant on automated tests in risk-relevant domains (defined as 4x or more in Effective Compute), or
- roughly six months' worth of fine-tuning and other capability elicitation methods have accumulated since the last assessment.
If either of these is true, or if the Responsible Scaling Officer thinks it's warranted, a Comprehensive Assessment is triggered. This includes:

- threat model mapping,
- empirical evaluations,
- elicitation of the model's full capabilities, and
- forecasting.
Anthropic explicitly tests “safety-off” variants of the model, to see what it’s capable of without harmlessness training or refusal penalties, on the assumption that a bad actor could bypass those safety features (e.g., via stealing model weights and fine-tuning them).
Each Capability Threshold is paired with a set of Required Safeguards that must be in place before a model that crosses that threshold is trained further or deployed. These fall into two categories:
Deployment Safeguards (for ASL-3) are meant to ensure that no one can misuse the deployed model: threat modeling, defense in depth, red-teaming, rapid remediation, monitoring, and sharing the model only with trusted users.
Security Safeguards (for ASL-3) are meant to prevent the model weights from being stolen and misused: access controls, lifecycle security, and monitoring.
For a model to qualify as ASL-3 safe, the deployment protections must be “robust to persistent attempts to misuse the capability,” and the security must be “highly protected against non-state attackers attempting to steal model weights.”
Once ASL-3 safeguards are in place, Anthropic would conduct a follow-up capability assessment to make sure ASL-4 isn’t needed. Anthropic hasn't defined what ASL-4 safeguards are.
Models that aren't sufficiently-below-the-threshold can only be deployed if the Required Safeguards for the corresponding ASL are in place.
If the safeguards can’t be implemented right away, the company is supposed to delay deployment and reduce interim risk as best it can until the protections are ready.
Google DeepMind’s Frontier Safety Framework (FSF) (last updated February 2025) is its protocol for addressing severe risks from advanced AI models. It's mostly about misuse risks (e.g., cyberattacks, bioweapons) that could cause massive harm, but the February version also adds a section on "deceptive alignment" risk, which Google defines as "the risk that AI systems purposefully undermine human control over AI systems."[6]
The FSF monitors Critical Capability Levels (CCLs): capability thresholds that, if crossed, mean a model could cause severe harm in the absence of additional mitigations. DeepMind looks at risks from "uplift" (an AI makes it easier for a person to do a bad thing) and "autonomy" (the AI can do the bad thing all on its own).
Misuse CCLs are capabilities that could make the model helpful to a person trying to cause large-scale harm, in areas like CBRN weapons, cyberattacks, and the automation of AI R&D.
Deceptive Alignment CCLs are thresholds related to the model’s potential to mislead humans or undermine control.
DeepMind is also concerned with capabilities cutting across the CCL categories (like tool use, agency, or reasoning), which might not be risks in themselves but can enable the above.
DeepMind evaluates its most powerful models regularly and does extra testing for models that represent a big leap over their predecessors.
Early Warning Evaluations are batteries of tests for whether a model is approaching a CCL. Each has an “alert threshold” that triggers further action if the model hits it. DeepMind says these evaluations simulate how the model might be used in the real world, including with scaffolding and tool augmentation, to get a realistic picture of risk.[8] They update these evaluations periodically as model capabilities and elicitation methods improve.
If DeepMind doesn’t have enough internal expertise for a risk domain (e.g., cyber ops), they may bring in outside experts to help with evaluations and do red-teaming.
If a model crosses the alert threshold, DeepMind may run additional tests to refine the risk estimate and understand how close the model is to a CCL.
If a model is found to be near or above a Critical Capability Level, DeepMind will put in place a response plan.
Deployment mitigations are intended to prevent users from misusing the deployed model. DeepMind lists safety fine-tuning, misuse detection and filtering, post-deployment monitoring, and governance review before making the model generally available.
Deployment only proceeds if the Google DeepMind AGI Safety Council approves a “safety case” (a formal argument that the model’s dangerous capabilities are sufficiently mitigated).[9] If adequate mitigation can't be put in place, deployment and/or further development of the model "might" be put on hold, especially if model capabilities are close to CCLs and mitigation is hard to guarantee.
Security mitigations are meant to prevent bad actors from stealing the model weights, removing safety features, and using the raw model for harm. These follow levels based on the RAND SL (Security Level) framework. For example, a CBRN-risk model might need RAND SL 2 protections, which include things like:
DeepMind emphasizes that model exfiltration is a serious risk, because it would enable malicious actors to bypass all deployment safeguards.
DeepMind makes clear that some mitigations only work if other labs adopt them too — especially security measures. The framework repeatedly emphasizes that societal-level risk mitigation requires coordination across the AI field.
Beyond the security levels, Google DeepMind makes these broader industry recommendations:
This is in contrast to V1.0 of Google DeepMind's FSF, which positioned Google as a leader: implementing first, and hoping others follow.
OpenAI’s Preparedness Framework (PF) (updated April 2025) describes its policies for "tracking and preparing for frontier capabilities that create new risks of severe harm." It's meant to catch capabilities in new AI models that could cause substantial harm (defined as thousands of human deaths or billions of dollars of economic loss), and tell OpenAI what to do about it — either put special safeguards in place, don't deploy the model, or stop training the model altogether.
The PF defines three "tracked categories" of AI capabilities risk: biological and chemical capabilities, cybersecurity, and AI self-improvement.
In all these categories, OpenAI defines "High" and "Critical" capability thresholds; hitting them triggers the responses described below.
The PF also lays out five "Research Categories"[11] that don't meet the bar for "tracked" but do seem like the kind of thing you'd want to pay attention to, including long-range autonomy, sandbagging, and autonomous replication.
OpenAI calls its first-line tests scalable evaluations. These are automated tests that "measure proxies that approximate whether a capability threshold has been crossed," though (perhaps understandably) OpenAI doesn't go into detail on what these proxies are.[12] An AI that hits a certain level of performance on these scalable evaluations is flagged as maybe having reached a dangerous capability threshold, in which case OpenAI tries to validate the result with a deep dive. A deep dive might include (human) red-teaming, third-party evaluations (e.g., wet lab studies for biorisks), or anything else the Safety Advisory Group thinks is a good idea.
Here's the example OpenAI gives of a capability evaluation:
To assess the degree to which a covered system can reduce the barriers to creating a biological weapon, our current evaluations test both how capable the system is at providing useful information to someone creating a weapon and how capable it is of directly integrating with relevant tools, such as ordering precursor materials via the Internet.
Our evaluations test acquiring critical and sensitive information across the five stages of the biological threat creation process: Ideation, Acquisition, Magnification, Formulation, and Release. These evaluations, developed by domain experts, cover things like how to troubleshoot the laboratory processes involved.
Implement "safeguards." Models that hit the "high" risk threshold can only be deployed with safeguards that "sufficiently minimize the risks of severe harm."
In addition, OpenAI says it will pause training (regardless of its intent to deploy the model) for an AI that shows "critical" capabilities.
The Preparedness Framework describes two kinds of safeguards: those against users trying to do bad things with the AI ("malicious users"), and those against the AI doing bad things autonomously ("misaligned AI").
The safeguards against malicious users are robustness (making the model hard to jailbreak), usage monitoring, and trust-based access.
The safeguards against misaligned AI are value alignment, instruction alignment, robust system oversight, and architectural containment.[13]
Note that these safeguards aren't concrete actions (if the model displays this capability, we'll implement this particular technical solution), but more like aspirations. For example, the full description given of "Robustness" is:
Robustness: Users cannot use the model to cause the harm because they cannot elicit the capability, such as because the model is modified to refuse to provide assistance to harmful tasks and is robust to jailbreaks that would circumvent those refusals.
So the safeguard here is "make the model robust to jailbreaks," but this is a goal, not a plan.
OpenAI would test robustness via "efficacy assessments" like red-teaming and public jailbreak bounties.
These measures might make jailbreaking harder, but they seem unlikely to result in a model that won’t eventually be jailbroken. And you can’t decide whether to deploy a model based on public jailbreak bounties on that same model, because that would require it to already be public. So maybe the hope is that results from past models will generalize to the current model.
It's possible OpenAI knows how it would accomplish these goals, but does not want to describe its methods publicly. It also seems possible that OpenAI does not know how it would accomplish these goals. Either way, OpenAI’s PF should be read as more of a spec plus tests than a plan for safe AI.
At the start of this article, we talked about how these plans all take a generally similar approach. But there are some differences as well.
OpenAI, unlike the others, explicitly pledges to halt training for "critical risk" models. This is a significant public commitment. In contrast, Google DeepMind's statement that deployment or development "might" be paused, and their mention that their adoption of protocols may depend on others doing the same, could be seen as a more ambivalent approach.
Another difference is that Anthropic talks more about governance structure. There’s a Responsible Scaling Officer, anonymous internal whistleblowing channels, and a commitment to publicly release capability reports (with redactions) so the world can see how they’re applying this policy. In contrast, Google DeepMind has spread responsibility for governing its AI efforts across several bodies.
In 2024, Sarah Hastings-Woodhouse analyzed the safety plans of the three major labs and expressed three critical thoughts.
First, these aren’t exactly “plans,” as they lack the kind of detailed if-then commitments that you’d expect from a real plan. (Note that the companies themselves don’t call them plans.)
Second, the leaders of these companies have expressed substantial uncertainty about whether we’ll avoid AI ruin. For example, Dario Amodei, in 2024, gave 10-25% odds of civilizational catastrophe. So contrary to what you might assume from the vibe of these documents, they’re not necessarily expected to prevent existential risk even if followed.
Finally, if the race heats up, then these plans may fall by the wayside altogether. Anthropic’s plan makes this explicit: it has a clause (footnote 17) about changing the plan if a competitor seems close to creating a highly risky AI:
It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards. If we take this measure, however, we will also acknowledge the overall level of risk posed by AI systems (including ours), and will invest significantly in making a case to the U.S. government for taking regulatory action to mitigate such risk to acceptable levels.
But even without such a clause, a company may just drop its safety requirements if the situation seems dire.
As of Oct 2025, there have been a number of updates to all the safety plans. Some changes are benign and procedural: more governance structures, more focus on processes, more frequent evaluations, more details on risks from misalignment, more policies to handle more powerful AI. But others raise meaningful worries about how well these plans will work to ensure safety.
The largest is the labs' steps back from previous safety commitments. DeepMind and OpenAI now have their own equivalent of Anthropic’s footnote 17, letting them drop safety measures if they find another lab about to develop powerful AI without adequate safety measures. DeepMind, in fact, went further and has stated that it will only implement some parts of its plan if other labs do, too. Exactly which parts it will unconditionally implement remains unclear. And Anthropic no longer commits to defining ASL-N+1 evaluations before developing ASL-N models, acknowledging they can't reliably predict what capabilities will emerge next.
Some levels of the safeguards needed for certain capabilities have been reassessed in light of experience with more capable models. Anthropic and DeepMind reduced safeguards for some CBRN and cybersecurity capabilities after finding their initial requirements were excessive. OpenAI removed persuasion capabilities from its Preparedness Framework entirely, handling them through other policies instead. Notably, DeepMind did increase the safeguards required for ML research and development.
Regarding criticisms of the plans, it remains unclear if the plans are worthy of their name. While the labs have added some more detail to their plans (e.g., Anthropic has partially defined ASL-4), they are still vague on key details, like how to mitigate the risks from a fully autonomous, self-improving AI.
It is also unclear if these plans will actually prevent extinction. While lab leaders are mentioning the extinction risks from AI less nowadays, it is unclear if that is because they have changed their minds. Aside from Sam Altman updating his estimate of P(doom) to 2% on a podcast in October 2025, the lab leaders have not commented on the topic in public in 2025.
1. Sadly, there is no industry standard terminology. It would be nice for there to be a consistent name for this sort of safety policy, and comparable concepts for the different kinds of thresholds in the policy. But the documents are all named differently, and where Anthropic uses “Capability Thresholds,” Google DeepMind uses “Critical Capability Levels” and OpenAI uses “High Capability thresholds,” and it’s unclear to what extent they’re equivalent.
2. "CBRN" stands for "Chemical, Biological, Radiological, and Nuclear".
3. "We recognize the potential risks of highly persuasive AI models. While we are actively consulting experts, we believe this capability is not yet sufficiently understood to include in our current commitments."
4. "The model is notably more performant on automated tests in risk-relevant domains (defined as 4x or more in Effective Compute)."
5. "This is measured in calendar time, since we do not yet have a metric to estimate the impact of these improvements more precisely."
6. They could do this by deceiving us, of course, though this definition would also seem to include blatant, non-sneaky attempts. This definition that DeepMind uses for "deceptive alignment" doesn’t necessarily match other uses of that term.
7. DeepMind doesn’t define an autonomy risk for CBRN.
8. “We seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model.” (p. 6)
9. “Deployment... takes place only after the appropriate corporate governance body determines the safety case regarding each CCL the model has reached to be adequate.” (Frontier Safety Framework V2.0, p. 8)
10. See the whole breakdown in Table 1 of the Preparedness Framework doc.
11. "Persuasion" was a research category in a previous version of the Preparedness Framework, but removed in the most recent one.
12. Interesting tidbit: they (of course) run these evaluations on the most capable version of the AI, in terms of things like system settings and available scaffolding, but also intentionally use a version of the model that very rarely gives safety refusals, "to approximate the high end of expected elicitation by threat actors attempting to misuse the model."
13. OpenAI also lists "Lack of Autonomous Capability" as a safeguard against misaligned AI, but points out that it's not relevant since we're specifically looking at capable models here.
14. Of course, this point, in combination with the point on value alignment, raises the question of what happens if instructions and human values conflict.
Published on October 27, 2025 11:16 PM GMT
Judea Pearl's theory of causal models is a technical explanation of causality in terms of probability theory. Causality is a very fundamental part of our reasoning—if we want to understand something, the first question we ask is "why?". Then it is no surprise that this theory is very useful for rationality, and is invoked at several points in the Sequences. This suggests a post explaining how causal models work and what they imply would be useful.
Well, there is such a post: Causal Diagrams and Causal Models (henceforth CDCM), by Eliezer Yudkowsky, is meant as a standard introduction to the topic. But I don't think it is a very good explanation.
The bulk of the post is a description of how to derive a causal model from statistical data, through roughly the following procedure: tally up the observed frequencies of the variables, work out which conditional independence relations hold among them, and then find the causal diagrams whose structure implies exactly those independencies.
According to the post, this was first done by AI researchers because the causal models' independence constraints mean they can only represent a tiny fraction of all possible distributions, so they take much less space to store and time to learn than a full distribution.
However, the post also claims that a causal model of some variables represents the causal relationships between them, which presumably are a feature of the real world.
So, what is a causal model? Is it a different form of probability distribution? A frequency distribution? A data structure? A representation of Reality? CDCM switches between interpretations frequently, as if they were equivalent, without clarifying where they come from or how they relate to each other.
When I first learned about causal models, through CDCM and an AI textbook, I thought of causal models as just a data structure that stored a subjective distribution and was often useful. I first became confused about this while reading Pearl's book Causality. That book states clearly, as one of its main points, that a causal model is not a probability distribution: the two don't represent the same information and should be kept distinct. As the inventor of causal models, Pearl would know, but if a causal model isn't a distribution, what is it?
I thought about this for several weeks, looking at the problem from various distinct angles, and searching different sources for anything that might be an explanation. After some failed attempts, I was able to put together a satisfactory framework, an interpretation that matched my intuitions and justified the math. Then, after I had the whole solution in my head and could look at the entire thing all at once, it was only then that I realized—it was literally just Bayes all along.
What is a causal model? It's just a regular hypothesis: it's a logical statement that may correspond to a part of the real world, providing some incomplete information about it; this information can be represented as a distribution P(Data|Model); and using Bayes' Theorem to score the model by how well it fits the observed data, we can find its probability P(Model|Data).
Now I believe I could give a better explanation of causal models, that makes explicit what its assumptions are and what its mathematical objects mean. This post is an attempt at writing up that explanation.
We learn by finding various patterns we can use to predict future observations. When we manage to make these patterns precise, they become mathematical laws, equations relating variables that represent physical quantities. The majority of those patterns we find fit the following structure: there is an initial state, which we characterize by some quantities, then it changes until reaching a final state, characterized by other quantities. The rule is an equation that relates the initial to the final quantities.
For example, in physics we might start with initial positions and momenta of two billiard balls, and calculate their positions and momenta after they collide. In chemistry we might start with the concentration of two solutes in a solution, and calculate the equilibrium concentration of a product after the two react. In economics we might start with some agents' basket of goods and utility functions, and calculate the equilibrium distribution of goods after they trade.
We'll call those kinds of rules causal laws, and represent one of them by a mathematical function $f$. The function takes the initial state's quantities (the causes) and returns one of the final state's quantities (the effect). So if we write $y = f(x_1, \ldots, x_n)$, we mean there are some parts of the world described by the quantities $x_1, \ldots, x_n$, they interact and change, and then some section of the resulting state can be described by the quantity $y$. When stating a particular causal law, we are making two claims: first that the $x_i$ are all the causes of $y$, they fully characterize which systems evolve into which values of $y$; and then that $f$ describes the way in which the $x_i$ evolve into $y$.
To be possible to learn as a generalization, a causal law has to hold over some range of "instances" of the system, it needs some kind of persistent existence with which it takes many different values of initial states each to the corresponding final state. Even if we could learn from one example, if there wasn't a second we couldn't make any predictions. For example, if we state the informal causal rule that pavement being wet causes it to be slippery (or just "wet pavement is slippery"), we are applying the rule to a range of pieces of pavement and at a range of points in time—if we pour water on our sidewalk, then dry it with a mop, then pour water on it again, it will be slippery then not slippery then slippery again.
If we have a causal law for $y$ as a function of some $x_i$, we can try to find causes for the $x_i$ themselves, then causes for their causes, and so on. The result is a causal model—a conjunction of causal laws for each of a set of variables. Each of the model's variables is written as a function of other model variables or of outside, "exogenous" variables.
A model can be thought of as like a computer program—you start by giving it the values of the exogenous variables, then it calculates the values of the variables they cause, then the variables those cause, and so on in an expanding web of causation that spreads out from the initial assignments. But a model is not any program, or even any program that calculates those correct values—it is isomorphic to the real-world situation, in the sense that each of the causal laws, each line of the program, logically corresponds to a real-world mechanism, describing how it evolves from an initial state to an end state. A causal model is like a small, high-level simulation of a section of reality.
This means that for a causal model to correspond to the real world, it's not enough that all its equations are true. For example, if $x$ changes into $y$ in a way such that $y = f(x)$, that equation is a causal model. But if we instead write this as $x = f^{-1}(y)$, though this will be true numerically, it won't be a causal model of the situation, since $y$ isn't evolving into $x$. The equals sign in a causal equation is an assignment statement in a program, not a mathematical equation, and those are not symmetric.
We usually separate a causal model into a causal graph $G$, and some parameters $\theta$. The graph $G$ specifies which variables cause which others, with arrows pointing into a variable from each of its causes. The parameters $\theta$ specify the form of the causal functions. This is done because knowing only the causal graph is already pretty useful information, and it's easier for people to translate their internal knowledge into a graph than into exact functions.
The causal graph needs to have no cycles—it's a directed acyclic graph, a DAG. Otherwise, the model couldn't be a program where each line corresponds to a real mechanism, since it would have lines like $y = f(x)$ and $x = g(y)$ and would try to use a value before computing it. There are other ways to compute values for such a cyclical model, but they aren't the same as actually running the program, and so their computations won't correspond to Reality. The direction of the arrows corresponds to time—the constraint prohibiting cycles is why the future can't affect the past.
The standard example of a causal graph is Pearl's slippery pavement model:
The Season affects the Sprinkler's activation and the Rain, the Sprinkler and the Rain affect whether the pavement is Wet, and whether it's Wet determines whether it's Slippery. A graph like this doesn't exclude the possibility that there are other unrepresented causes that also affect the variables (for example, something else might make the pavement Wet), but it excludes a single hidden cause that affects two of the variables—we represent those in graphs by a bidirected arrow (A<->B).
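As a sketch of the "model as program" view from a couple of paragraphs back (my own code, with made-up probabilities; only the graph structure comes from Pearl's example), each assignment below corresponds to one causal mechanism:

```python
import random

def slippery_pavement_model():
    """Forward-samples Pearl's slippery pavement graph.

    Each assignment statement is one causal law; the numbers are invented
    purely for illustration."""
    season = random.choice(["summer", "winter"])
    sprinkler = (season == "summer" and random.random() < 0.7)
    rain = random.random() < (0.1 if season == "summer" else 0.6)
    wet = sprinkler or rain
    slippery = wet
    return dict(season=season, sprinkler=sprinkler, rain=rain, wet=wet, slippery=slippery)

print(slippery_pavement_model())
```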
As you can see, each of the graph's connections is intuitively appealing on its own, and doesn't depend on the others. For example, the Season influences whether it Rains, and Wet things are Slippery, and these are separate facts that we know independently. In a causal model, physically independent parts of a system are represented by logically independent rules.
This modularity is one of the most useful properties of causal models, as it allows us to do things like building up a model out of smaller parts, verifying each part independently, and adapting to partial changes in the environment by changing parts of the model. In short, causal models have gears.
Building up complex models out of parts, though a thing humans need to do basically every time they think about causality, isn't something I know how to represent as math. It seems like a difficult problem—for one thing, you have to decide what variables you'll use in the first place, and how to draw their boundaries, which runs into abstraction problems. However, we do know how to mathematically adapt a model to an environmental change, through the machinery of intervention.
Intervention is very simple: to represent a mechanism that changed into something else, you just take the equation that represents the mechanism and replace it with another law representing the new mechanism, keeping the other laws unchanged. For example, if we decided to turn on the Sprinkler regardless of its previous activation schedule, we would wipe out the equation for it as a function of the Season, replacing it with "Sprinkler = On":
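In code (continuing my sketch from above), the intervention just overwrites the Sprinkler's line of the program and leaves every other mechanism alone:

```python
import random

def intervened_model():
    """The same model under do(Sprinkler = On): the Sprinkler's causal law is
    replaced, while all the other causal laws are kept unchanged."""
    season = random.choice(["summer", "winter"])
    sprinkler = True                                   # was: a function of season
    rain = random.random() < (0.1 if season == "summer" else 0.6)
    wet = sprinkler or rain
    slippery = wet
    return dict(season=season, sprinkler=sprinkler, rain=rain, wet=wet, slippery=slippery)
```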
Modeling decisions is one of the main uses of intervention. Typically, you observe a system in some situations, building a causal model for it, and then you get to control one of the variables. This control changes that variable's causal mechanism, since we can pick its value regardless of what it would have been set to, which means it's now decoupled from its causes. Intervention allows us to match this change in physical mechanism with a corresponding alteration in our model, which works because of the modularity of causal models.
Thinking of decisions as interventions in a causal model is what is called Causal Decision Theory, a term you've probably heard of around LessWrong. By contrast, Evidential Decision Theory would model turning the Sprinkler on the same way it would think about learning its value, and so an EDT reasoner would think they could influence the Season by turning the Sprinkler on. CDT might not be the best possible decision theory, but it sure is an improvement over EDT.
Interventions are also very useful for inferring causal models. For example, in a drug trial, we want to find out the causal effect of the treatment on the chances of the patient's recovery. However, the potential of a patient to recover depends on other things, such as socioeconomic status, that may also influence their choice of treatment. This confounding factor obscures the variables' causal relation, since if people who get the treatment recover more often, this might be because the treatment helps them or because they are wealthier. (We'll make this more precise later when we learn to do causal model inference.)
The solution is a randomized controlled trial (RCT): we assign each patient to an experimental group or a control group randomly, based on a coin flip. This breaks the causal relationship between wealth and choice of treatment, replacing the causal law with the coin's, which isn't causally related to the other variables. Since the mechanism that determines the patient's recovery is independent of the one that determines the treatment, we can then figure out the causal law for Recovery as a function of Treatment in the lab setting and generalize it to the real world, where socioeconomic status does affect choice of treatment.
In the two situations, the lab and the real world, the Recovery variable has different correlations, because correlation is a feature of our state of knowledge and may change when we learn new things, but it has the same causal law on both, because the causal law represents a physical mechanism that doesn't change unless it is physically modified.
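Here is a small simulation of that point (my own toy numbers): in observational data the Treatment-Recovery correlation is inflated by wealth, but randomizing the treatment replaces the Wealth -> Treatment mechanism with a coin flip while leaving the Recovery mechanism untouched, so the causal effect (a 0.2 boost in these toy numbers) shows up directly.

```python
import random

def recovery(wealthy, treated):
    """The causal law for Recovery: the same mechanism in the lab and in the wild."""
    base = 0.5 if wealthy else 0.2
    return random.random() < base + (0.2 if treated else 0.0)

def observational_case():
    wealthy = random.random() < 0.5
    treated = random.random() < (0.8 if wealthy else 0.2)   # wealth -> treatment
    return treated, recovery(wealthy, treated)

def randomized_trial():
    wealthy = random.random() < 0.5
    treated = random.random() < 0.5                          # coin flip replaces that law
    return treated, recovery(wealthy, treated)

def recovery_rates(experiment, n=100_000):
    counts = {True: [0, 0], False: [0, 0]}
    for _ in range(n):
        treated, recovered = experiment()
        counts[treated][0] += recovered
        counts[treated][1] += 1
    return {treated: recovered / total for treated, (recovered, total) in counts.items()}

print("observational:", recovery_rates(observational_case))  # gap much larger than 0.2
print("randomized:   ", recovery_rates(randomized_trial))    # gap close to the true 0.2
```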
You might notice that so far we haven't mentioned any probabilities. That is because we were talking about what causal models mean, what they correspond to in Reality, and probability is in the mind. Now, we'll start trying to infer causality from limited information, and so we will apply probability theory.
We will first discuss the simplest form of causal inference: a repeated experiment with a single binary variable, that might each time be LEFT or RIGHT. We will hold fixed all known causes of the variable, so that all variation in the results comes from variation in unobserved causes. We observe the result of some past instances of the experiment, and have to predict future ones.
Our model of this situation is a single causal law $y = f(x)$, where we have no knowledge about the law $f$ or the causes $x$. That's only trivially a causal model—the causal graph has only a single node! However, since causal models are made of causal laws, knowing how to infer those is a prerequisite for inferring more complex models, so we'll cover them first.
If the experiment is repeated $n$ times, it has $2^n$ possible sequences of results. If we observe some of the results, what we can infer about the others depends on our probability distribution over these sequences, which is determined by our prior information.
With no information other than the fact that there are $2^n$ possible outcomes (call this state of knowledge $H_0$), we would assign the maximum entropy distribution, the one that makes no other assumptions: a uniform probability of $2^{-n}$ to each possible sequence. This hypothesis' probabilities over the individual results are independent—learning one tells us nothing about the others.
Even after seeing a hundred RIGHT results in a row, the hypothesis of maximum entropy still assigns a probability of $1/2$ to LEFT on the 101st. This is unintuitive—if the hypothesis assumes no information, why does it refuse to learn anything and stubbornly keep to its previous estimate?
The problem is that to see RIGHT on the 101st trial as corresponding to the first 100 being RIGHT, as being "the same result as before", would itself be prior knowledge relating different trials—knowledge that our brain assumes automatically upon seeing that RIGHT in one trial has the same label as RIGHT in the next.
To truly adopt the mindset of the hypothesis $H_0$, you should imagine instead that the first trial's possible results are 685 and 466, the second's are 657 and 589, the third's are 909 and 596, and so on, all random numbers with no visible pattern. Then it really is obvious that, no matter what results you get, you will still have no information about the next trial, and keep assigning probabilities of 0.5 to each possibility. A more informed reasoner might know some results are RIGHT, RIGHT, RIGHT... but you only see 466, 589, 596, and so on, which is completely useless for prediction.
Of course, in this case we do know it's the same experiment each time and a RIGHT result is the same kind of result each time. The question, then, is what kind of probability distribution represents this knowledge.
To find out, let's imagine a generic binary experiment. By hypothesis, the experiment has only two possible endpoints, or at least two distinct clusters of endpoints in state-space. If it's a physical system that evolves until reaching equilibrium, this means the system has two stable equilibria. The prototypical example of a system with two stable equilibria is the marble in a double bowl:
We'll take that as our example binary experiment. The initial state is the marble's starting horizontal position, and the final state is which bowl it settled down in—the LEFT bowl or the RIGHT bowl. We'll suppose there is some range of initial positions the marble could be dropped at, and we know nothing else about its initial position, so we assign uniform probability in that range. We don't know the exact shape of the bowl (the function from initial to final states) or where in the range the marble is dropped each time (the unseen initial state).
The final state is completely determined by whether the marble starts off to the left or to the right of the bowl's peak. If we knew where the peak was, our probability of the marble going RIGHT would be the fraction of the range of possible starting positions that is to the right of the peak. We'll call that fraction $r$.
We will assume the peak is in fact somewhere in the starting range—otherwise the experiment wouldn't really have two possible outcomes. Since we have no other knowledge about the shape of the bowl, it seems reasonable to assign the same probability to the peak being at each possible position, meaning $r$ has uniform prior probability density between 0 and 1.[1]
So, if $H_L$ is the hypothesis outlined above, and $R_k$ means "RIGHT on the $k$th trial", we have $P(r \mid H_L)$ uniform on $[0, 1]$ and $P(R_k \mid r, H_L) = r$. That's a complete model![2] If we know a sequence of past results, we can update our distribution for $r$ using Bayes' Theorem, then integrate over it to find the expected value of $r$, which is also the probability of the next result being RIGHT.
If we see $n$ results, of which $m$ are RIGHTs, the probability of RIGHT in the next trial is $\frac{m+1}{n+2}$. This result is known as Laplace's Rule of Succession. Notice that as $n$ grows this value approaches $m/n$, the past frequency of RIGHT.
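For completeness, here is the short calculation behind that number (standard Bayesian bookkeeping, not something spelled out in the original post): with a uniform prior over $r$ and $m$ RIGHTs observed in $n$ trials,

$$P(\text{RIGHT on trial } n+1 \mid \text{data}) = \frac{\int_0^1 r \cdot r^m (1-r)^{n-m}\,dr}{\int_0^1 r^m (1-r)^{n-m}\,dr} = \frac{(m+1)!\,(n-m)!\,/\,(n+2)!}{m!\,(n-m)!\,/\,(n+1)!} = \frac{m+1}{n+2}.$$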
What about a more general binary experiment? Well, in that case we might have a higher-dimensional starting space, a complicated prior probability distribution for initial states, and an intricate pattern for which starting states go to which end states. Fortunately, we can ignore most of that if two properties are true: our distributions for the starting state of each trial are independent, and we are ignorant of the fixed start state -> end state function. We need to factor our representation of the system into a starting state that varies with no consistent patterns and an outcome function that is fixed but initially unknown.
If we can do this, we can define our $r$ as the fraction of starting probability mass assigned to states that go to the RIGHT end state, and use the same distribution as in the double bowl example. We can conclude that the Rule of Succession applies to many possible experiments, or at least is a reasonable distribution to assign if your prior information consists of "it's a binary experiment" and nothing else.
Assigning probability to a result equal to its frequency in past trials of the experiment is very intuitive—so much so there is an entire school of probability based on it, we talk all the time about biased coins as if they had some fixed non-1/2 probability of landing heads (they don't), and CDCM uses the assumption without bothering to justify it.
However, I decided to take time to show a derivation because the Rule of Succession being the consequence of causal assumptions is a fundamental point about causality, knowing how "frequency -> probability" is derived will make it easier to deny it later when it's no longer true, and I just think this derivation is neat.
All this was model fitting—we made a model with one parameter and figured out how to find the value of that parameter from the data. The next step is model comparison—we have several possible models, and want to find out which is true from the data. If we can do model fitting, Bayes' Theorem tells us exactly how to compare models—just calculate the likelihood ratio.
For example, suppose we have a sequence of LEFTs and RIGHTs and we're comparing the two hypotheses we talked about, the uniform distribution over sequences of outcomes $H_0$ and Laplace's Rule $H_L$, to see which one governs the data.
$H_0$ assigns probability $2^{-n}$ to each possible sequence, of course. If $H_L$ holds and we know $r$, the probability of a sequence with $m$ RIGHTs is $r^m (1-r)^{n-m}$. We can integrate that over $r$ (multiplying by the constant prior probability) to get the probability $H_L$ assigns to the data, which turns out to be $\frac{m!\,(n-m)!}{(n+1)!}$.
This probability is larger if the sequence has a lot of RIGHTs or a lot of LEFTs, and smaller if the amounts are close to even, which is exactly what we'd expect—$H_0$ always assigns probability 1/2 to RIGHT, so it wins when around half the results are RIGHTs.
For another way to look at it, consider that the amount of possible sequences of $n$ trials containing $m$ RIGHTs is $\binom{n}{m}$, so the probability $H_L$ assigns to getting any sequence of length $n$ with $m$ RIGHTs is... $\binom{n}{m} \cdot \frac{m!\,(n-m)!}{(n+1)!} = \frac{1}{n+1}$. The same for each of the $n+1$ possible frequencies of RIGHT, from $0$ to $n$.
Therefore, Laplace's Rule is equivalent to assigning an equal probability to each possible value of $m$ and spreading out that probability equally among the possible orderings of that amount of RIGHTs and LEFTs. This assigns less probability to the sequences with similar amounts of RIGHTs and LEFTs because there are more of them—there are 252 ways to arrange 5 RIGHTs and 5 LEFTs, but only 10 ways to arrange 9 RIGHTs and 1 LEFT.
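Here's a quick numerical check of that comparison (my own snippet, using the $H_0$/$H_L$ names from above):

```python
from math import factorial

def p_data_H0(n, m):
    """Uniform over all 2^n sequences: the count of RIGHTs doesn't matter."""
    return 0.5 ** n

def p_data_HL(n, m):
    """Laplace's Rule: r^m (1-r)^(n-m) integrated over a uniform prior on r."""
    return factorial(m) * factorial(n - m) / factorial(n + 1)

for n, m in [(10, 5), (10, 9), (10, 10)]:
    ratio = p_data_HL(n, m) / p_data_H0(n, m)
    print(f"n={n}, m={m}: likelihood ratio HL/H0 = {ratio:.2f}")
# Lopsided sequences favor HL; an even split favors H0.
```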
Another good example of comparing causal models of a repeated experiment with one variable is the case of Wolf's dice, which is analyzed in two posts by John Wentworth (and the Jaynes paper he got the example from). The dataset in question consists of a large number of results of throwing a pair of dice.
We usually assign 1/6 probability to each of a die's faces coming up, independently between throws. You might think that's just the maximum entropy, zero knowledge hypothesis, but in this case we do know it's a repeated experiment and we can't just unlearn this fact. The probabilities being 1/6 come from additional knowledge about the die's symmetry, on top of what we knew about a generic experiment.
Specifically, we know that a very small change in the initial state (height, angular momentum, etc.), smaller than what we can control, will change the result. This means the initial state-space is divided into thin "stripes" that go to each of the possible results. We also know the die is symmetric with respect to its faces, so the adjacent "stripes" that go to each result have nearly the same volume. This means that for almost any initial region we choose, for any way we might throw the die, the fraction of possible states that end up at each number will be approximately the same.
That is true if the die is a nearly perfect cube. In the results of Wolf's dice, the frequencies of each number are very different from 1/6, falsifying the hypothesis that they are fair dice—that is, the hypothesis we just described assigns a much smaller likelihood to the data than the hypothesis that doesn't make those assumptions, the six-result analogue of Laplace's Rule that assigns to each possibility a probability approximately equal to its frequency.
But we can do better. Jaynes hypothesized the dice's asymmetry came from two specific sources: their length in one dimension being different from the other two due to imperfect cutting, and the pips cut into each face changing their masses. The assumption that the die is symmetric except for those things places constraints on the possible values of the faces' "objective probabilities" (the $r$-analogues), not enough to completely determine them like in a perfect die, but enough to cut down on the space of possibilities, gaining a large likelihood advantage.
For example, if we take the different length of one dimension to be the only source of asymmetry, we know the two faces that are perpendicular to that dimension have the same frequency $r_1$ and the others have a different frequency $r_2$. That's only one parameter, in contrast to five in the general case where the only constraint is that the frequencies have to add up to 1.
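Spelled out (my notation): with $r_1$ for the two faces perpendicular to the odd dimension and $r_2$ for the other four, the symmetry assumption forces

$$2r_1 + 4r_2 = 1 \quad\Rightarrow\quad r_2 = \frac{1 - 2r_1}{4},$$

so the whole face distribution is pinned down by the single number $r_1$, versus five free numbers for a completely general six-sided die.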
Adding the pips' asymmetry is a similar process. It means there are more degrees of freedom than we'd have with just the length asymmetry, but that's necessary—without accounting for the pip asymmetry we can't beat the general, unconstrained hypothesis, but with it we can, and by a large likelihood ratio! (For details on the math, see John Wentworth's posts.)
We're now basically ready to move on to inference of causal models with multiple variables (the interesting part), but before we do that I need to make a couple of points that will become relevant later.
Let's go back to the example where we have a binary experiment, and we're comparing the uniform distribution over sequences to Laplace's Rule. The Technical Explanation of Technical Explanation says this about hypotheses being falsifiable:
Why did the vague theory lose when both theories fit the evidence? The vague theory is timid; it makes a broad prediction, hedges its bets, allows many possibilities that would falsify the precise theory. This is not the virtue of a scientific theory. Philosophers of science tell us that theories should be bold, and subject themselves willingly to falsification if their prediction fails. Now we see why. The precise theory concentrates its probability mass into a sharper point and thereby leaves itself vulnerable to falsification if the real outcome hits elsewhere; but if the predicted outcome is correct, precision has a tremendous likelihood advantage over vagueness.
I ask: which of the two hypotheses, the uniform distribution over sequences or Laplace's Rule, is more falsifiable? From what I've said so far, you might think it's Laplace's Rule, because it concentrates more probability in some outcome sequences than in others, while the uniform hypothesis gives them all the same probability. But consider this: from the perspective of a believer in Laplace's Rule, the uniform hypothesis predicts as if it were certain the underlying frequency is exactly 1/2, and this narrow prediction gives it a big likelihood advantage if it's right and dooms it if it's wrong. Then isn't the uniform hypothesis the more narrow and falsifiable one? Or is it subjective, something that depends on how you look at the problem?
It is not. (Or at least, I don't think so!) The space over which you assign probabilities is defined in the statement of the problem, and in this case it is clearly the sequences of outcomes—the data, what you observe. The latent frequency parameter is a part of Laplace's Rule as a hypothesis, not of the problem statement—from the uniform hypothesis's perspective it is completely meaningless. The space over which the distributions can be more or less spread out is the space over which they predict things, the space of outcome sequences.
Technically, this "spreading out" is measured by a distribution's entropy. Entropy is the amount of bits a hypothesis expects itself to lose by the Technical Explanation's scoring rule, negated to turn into a positive number. The bigger the entropy, the more uncertain the hypothesis. In this case, is the maximum entropy hypothesis, meaning it's the "most humble" possible set of assumptions, the one that assumes the least amount of information and so makes the least precise predictions even if it is correct.
This means the uniform hypothesis is more uncertain and Laplace's Rule is more falsifiable. And that is what we would expect, since Laplace's Rule is the stronger hypothesis, incorporating more prior knowledge: the uniform hypothesis only knows what results are possible, while Laplace's Rule knows they're results of a repeated experiment, and which outcomes are the same kind of result (RIGHT or LEFT).
Of course, if we did know the frequency was 1/2, like if we were flipping a fair coin (which is just like the case of the fair die described above), we would have even more information than Laplace's Rule does, but we'd go back to making the same predictions as the uniform hypothesis. In this case, getting more information increased our uncertainty; that is allowed to happen, you just can't expect it to happen on average.
Another point which I was confused about, and so might be helpful to clarify: if the uniform hypothesis is true, Laplace's Rule quickly determines that the frequency is close to 1/2 and then makes approximately the same predictions, only losing a little extra probability while its distribution for the frequency narrows down more and more. If Laplace's Rule is true, the frequency is probably significantly different from 1/2, meaning the uniform hypothesis keeps losing more and more probability at each trial.
This means if Laplace's Rule is true, it will gather a large likelihood ratio over the uniform hypothesis (thus falsifying it) very quickly. If the uniform hypothesis is true instead, it will take much longer to falsify Laplace's Rule. This is, however, not evidence that the uniform hypothesis is more falsifiable (in the sense of "concentrates more probability") than Laplace's Rule.
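(Here is a small simulation of those two speeds, using Laplace's rule of succession for the sequential predictions; my sketch, and the exact numbers vary by run.)

```python
import random
from math import log2
random.seed(0)

def laplace_vs_uniform_bits(true_p, n_trials):
    """Accumulated log2 likelihood ratio of Laplace's Rule over the
    uniform-over-sequences hypothesis on one simulated run of trials."""
    rights, bits = 0, 0.0
    for m in range(n_trials):
        p_next = (rights + 1) / (m + 2)        # Laplace's rule of succession
        outcome = random.random() < true_p     # True means RIGHT
        bits += log2(p_next if outcome else 1 - p_next) - log2(0.5)
        rights += outcome
    return bits

for true_p in (0.5, 0.9):
    print(f"true frequency {true_p}: "
          f"{laplace_vs_uniform_bits(true_p, 1000):+.1f} bits for Laplace's Rule")
# Typically a small negative number when the true frequency is 1/2 (Laplace's
# Rule is falsified only slowly) and hundreds of bits when it's 0.9 (the
# uniform hypothesis is crushed almost immediately).
```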
To see that, suppose there is an unknown number between 0 and 999,999, and two hypotheses: a broad hypothesis that assigns an equal probability of 1/1,000,000 to each possibility, and a narrow hypothesis that assigns 1/10 probability to the number being 0 and spreads the remaining 9/10 among the rest, assigning approximately 0.9/1,000,000 probability to each.
In this case, if the broad hypothesis is true, it will get a 10:9 ratio the vast majority of the time, slowly gaining probability. If the narrow hypothesis is true, it will lose 10:9 90% of the time, but gain a ratio of 100,000:1 the 10% of the time where the number ends up being 0, blasting the broad hypothesis down into oblivion.
The reason this happens is that, though we transferred only a small fraction of the probability mass from each of the other possibilities to 0, there are a huge number of them, so the probability of 0 ended up much larger than before. A similar thing happens for the uniform hypothesis versus Laplace's Rule: Laplace's Rule transfers some probability from the huge number of around-half-and-half sequences to the tiny number of skewed sequences, so the advantage it gets when it's true is much bigger than the losses it takes when it's false.
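This asymmetry can be read straight off the expected per-observation scores, i.e. the KL divergence in each direction. A short sketch (mine) for the 0..999,999 example:

```python
from math import log2

# Expected bits gained per observation by whichever hypothesis is true
# (the KL divergence in each direction) for the 0..999,999 example above.
N = 1_000_000
p0_broad, p_rest_broad = 1 / N, 1 / N
p0_narrow, p_rest_narrow = 0.1, 0.9 / (N - 1)

kl_broad_vs_narrow = (p0_broad * log2(p0_broad / p0_narrow)
                      + (N - 1) * p_rest_broad * log2(p_rest_broad / p_rest_narrow))
kl_narrow_vs_broad = (p0_narrow * log2(p0_narrow / p0_broad)
                      + (N - 1) * p_rest_narrow * log2(p_rest_narrow / p_rest_broad))

print(f"broad hypothesis true:  ~{kl_broad_vs_narrow:.3f} bits/observation")
print(f"narrow hypothesis true: ~{kl_narrow_vs_broad:.3f} bits/observation")
# ~0.15 bits vs ~1.5 bits: the narrow hypothesis falsifies the broad one about
# ten times faster than the other way around, exactly because of the rare
# 100,000:1 blowout when the number comes up 0.
```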
So it is not the case that if a hypothesis is narrow (assigns little probability to a large part of the possibility space) then it is easily falsified (loses a lot of probability after a small number of dissonant observations).[3] When I first read the Technical Explanation, I was fooled into thinking this was true because of the use of the word "falsifiable", though upon rereading the post I can see it is always used to mean "narrow" and never "easily falsified".
In this section we'll talk about larger, multivariable causal models. As we'll see, they're basically several single-variable models glued together. Like before, we'll observe several repetitions of one experiment, and first do model fitting and then model comparison based on them. We'll again start by discussing the simplest possible case: an experiment with two binary variables, A and B, of which A may or may not causally affect B. We'll say A can be UP or DOWN and B can be LEFT or RIGHT.
First suppose A doesn't affect B, and they're completely unrelated. (Call that the independence hypothesis.) In this case, A and B are both results of repeated experiments, with some unknown causes and causal rules. We know what to do in this case! We'll use Laplace's Rule for both of them independently, with completely uncorrelated parameters for the frequency of A being UP and the frequency of B being RIGHT. It's just as if A and B were the results of two completely separate experiments.
Now suppose A affects B. (Call that the causal hypothesis.) That means A and the other causes of B form the initial state that evolves into B's possible values. This process can be factored into two functions taking the rest of the initial state to values of B—one for when A is UP, and one for when it's DOWN. The two values of A push the rest of the causes into different regions of initial-state-space, which evolve to values of B in different ways. So we'll use two parameters for B, representing the fraction of values of the other causes that make B RIGHT when A is DOWN and when A is UP respectively. These parameters, when known, will be the conditional probabilities we assign to B given each value of A.
We could in principle know about some relationship between these two parameters, for example if A had only a small effect on the initial state, but to simplify we'll assume no such knowledge and give them independent distributions. Then we have two copies of Laplace's Rule for B running at once, each of which updates on the trials with one of the values of A. Also, since we don't know anything about A's own causes, we'll use yet another Laplace's Rule for it, with an independent parameter for the frequency of A being UP.
To sum it up, the independence hypothesis converges to independent probabilities for A and B equal to their frequencies, and the causal hypothesis converges to probabilities for A, and for B given each value of A, equal to the corresponding frequencies.
Now, model comparison! Suppose we don't know whether A is causally unrelated to B (the independence hypothesis) or affects it (the causal hypothesis), and want to find out. Suppose first that the causal hypothesis is true. Then, after both models narrow down values for their parameters, the independence hypothesis will make the same prediction for B every time, while the causal hypothesis will use its knowledge of the causal relationship to assign different probabilities to B depending on the value of A, making more accurate predictions and accumulating a greater likelihood.
Now suppose the independence hypothesis is true. Then after the models have narrowed down the values of their parameters, they will make basically the same predictions, because the causal hypothesis will have narrowed down its two parameters, the conditional probabilities of B, to the same value—it's still predicting that A has an effect on B, that effect just happens to be the same for either value of A! However, the causal hypothesis didn't start with this assumption, so it will lose a lot more of its probability mass in finding the (equal) values of its two parameters than the independence hypothesis does in finding the value of its one parameter for B. This means we do get some update in favor of the independence hypothesis before these values have been narrowed down.
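Here is a sketch of that comparison (my code, with hypothetical parameter values, not anything from the post): both hypotheses predict sequentially using Laplace's rule of succession, the causal one keeping a separate count of B for each value of A.

```python
import random
from math import log2
random.seed(1)

def laplace_logscore(outcomes):
    """Log2 probability that a single Laplace's Rule assigns to a sequence of
    booleans, predicting each one from the counts seen so far."""
    ones, total, bits = 0, 0, 0.0
    for x in outcomes:
        p = (ones + 1) / (total + 2)
        bits += log2(p if x else 1 - p)
        ones += x
        total += 1
    return bits

def independence_logscore(pairs):
    """One Laplace's Rule for A and one for B, ignoring A when predicting B."""
    return (laplace_logscore([a for a, _ in pairs])
            + laplace_logscore([b for _, b in pairs]))

def causal_logscore(pairs):
    """One Laplace's Rule for A, plus a separate one for B for each value of A."""
    return (laplace_logscore([a for a, _ in pairs])
            + laplace_logscore([b for a, b in pairs if a])
            + laplace_logscore([b for a, b in pairs if not a]))

def simulate(p_up, p_right_given_up, p_right_given_down, n):
    pairs = []
    for _ in range(n):
        a = random.random() < p_up
        p_right = p_right_given_up if a else p_right_given_down
        pairs.append((a, random.random() < p_right))
    return pairs

# Case 1: A really does affect B.
data = simulate(0.5, 0.8, 0.3, 2000)
print("causal true:       causal - independence =",
      f"{causal_logscore(data) - independence_logscore(data):+.1f} bits")

# Case 2: A and B are unrelated.
data = simulate(0.5, 0.6, 0.6, 2000)
print("independence true: causal - independence =",
      f"{causal_logscore(data) - independence_logscore(data):+.1f} bits")
# Typical output: a large positive number in the first case (the independence
# hypothesis is falsified quickly) and a small negative number in the second
# (the causal hypothesis only pays the modest cost of its extra parameter).
```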
Here is how CDCM describes an equivalent situation:
By the pigeonhole principle (you can't fit 3 pigeons into 2 pigeonholes) there must be some joint probability distributions which cannot be represented in the first causal structure [in our naming scheme, the independence hypothesis]. This means the first causal structure is falsifiable; there's survey data we can get which would lead us to reject it as a hypothesis. In particular, the first causal model requires: [in our naming scheme, that A and B be independent, i.e. P(B|A) = P(B)]
While the independence hypothesis certainly can be falsified, this seems to imply that it is in some sense more falsifiable than the causal hypothesis, which is incorrect. If we consider the space of possible joint distributions over A and B, it is true that the causal hypothesis can converge to any distribution in it while the independence hypothesis can only converge to the ones where A and B are independent. However, as we've seen, we have to look at the space of possible outcome sequences, not at the space of possible parameter values, to find the hypotheses' entropy.
Since the independence hypothesis knows A and B are results of repeated experiments, it uses Laplace's Rule to converge to a probability for each of them equal to its frequency. Since it sees them as completely unrelated, those probabilities are independent. Why do we assign independent distributions to things we know of no relation between? Because the distribution that treats them as independent has maximum entropy.
If we fix the probabilities of A being UP and B being RIGHT at some values, the maximum entropy joint distribution for A and B is the one where they're independent. By the entropy concentration theorem, this implies that, among the sequences of results where A being UP and B being RIGHT occur with those frequencies, the vast majority have joint frequencies that are approximately independent.
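Here is a crude numerical check of the first claim (a sketch of mine): fixing the marginals of A and B, a 2x2 joint distribution has one free number, and scanning over it shows the entropy peaks at the independent joint.

```python
from math import log2

def entropy(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

p_up, p_right = 0.7, 0.4   # fixed marginals for A and B (arbitrary example values)

# A 2x2 joint with these marginals has one free number: q = P(UP and RIGHT).
best_q, best_H = None, -1.0
for i in range(1, 1000):
    q = i / 1000 * min(p_up, p_right)        # crude grid over candidate q
    joint = [q, p_up - q, p_right - q, 1 - p_up - p_right + q]
    if min(joint) < 0:
        continue                             # skip infeasible joints
    H = entropy(joint)
    if H > best_H:
        best_q, best_H = q, H

print(f"entropy-maximizing P(UP and RIGHT) ~= {best_q:.3f}")
print(f"independent value p_up * p_right    = {p_up * p_right:.3f}")
```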
This means saying two variables are independent is not an assumption—the assumption is saying they're correlated. Since the two hypotheses are really distributions over sequences of values of A and B, the independence hypothesis has higher entropy than the causal one! If you think carefully, this makes sense: the assumptions the independence hypothesis makes are "A is a repeated experiment with a common causal law" and "B is a repeated experiment with a common causal law", while the causal hypothesis makes these assumptions plus the assumption that "A is one of the causes of B", so it is a strictly stronger hypothesis. Note also that even if the causal hypothesis is true, the independence hypothesis remains calibrated, assigning the correct marginal probabilities for A and B—the causal hypothesis is just more accurate.
The reason we spent so long comparing the uniform distribution over sequences to Laplace's Rule in the first place is that the current case is very similar: the causal hypothesis is more spread out in parameter space, but is really narrower over the space of observations (as a consequence of its stronger assumptions), meaning it's the stronger hypothesis. Also like the uniform hypothesis and Laplace's Rule, the independence hypothesis is falsified more quickly when the causal hypothesis is true than the causal hypothesis is falsified when the independence hypothesis is true, and just like in that case, this doesn't mean the independence hypothesis is more precise.
For one unexpected application of this comparison, suppose we have two separate experiments on one binary variable, and we want to know whether they are two instances of the same phenomenon. It turns out the correct probabilities in this situation are the same as the ones in the case we just discussed, with B standing for both experiments' results and A for the variable "which experiment produced this result"![4]
This seemed surprising to me when I found out about it, because "which of two different experiments is it" doesn't seem like the kind of thing we usually treat as a variable. However, it certainly is a feature of the world that either makes no difference to the result at all (which is what the independence hypothesis assumes about A) or throws the result into two completely unrelated causal laws depending on its value (which is what the causal hypothesis assumes about A)!
Larger models (of binary variables) work basically the same way as the two hypotheses above—for each variable, you can use a separate instance of Laplace's Rule for every possible combination of values of its causes. You get the patterns of correlation described in CDCM, governed by the rule of d-separation. For example, if A causes B and B causes C, C is correlated with A unless you condition on B; if P and Q both cause R, P and Q are correlated only when you condition on R.
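A quick simulated check of the chain example (my sketch, with arbitrary made-up conditional probabilities):

```python
import random
random.seed(2)

# A -> B -> C, all binary; a crude frequency check of d-separation.
def sample():
    a = random.random() < 0.5
    b = random.random() < (0.9 if a else 0.2)     # B depends on A
    c = random.random() < (0.8 if b else 0.3)     # C depends only on B
    return a, b, c

samples = [sample() for _ in range(200_000)]

def freq_c(cond):
    """Observed frequency of C among samples satisfying cond(a, b)."""
    hits = [c for a, b, c in samples if cond(a, b)]
    return sum(hits) / len(hits)

print("P(C | A=1)      ~=", round(freq_c(lambda a, b: a), 3))
print("P(C | A=0)      ~=", round(freq_c(lambda a, b: not a), 3))
print("P(C | A=1, B=1) ~=", round(freq_c(lambda a, b: a and b), 3))
print("P(C | A=0, B=1) ~=", round(freq_c(lambda a, b: (not a) and b), 3))
# C is clearly correlated with A, but once we condition on B the value of A
# stops mattering, as d-separation says it should.
```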
Comparing these models is also similar—if the true model can produce correlations that a false model can't explain, it will keep gaining bits and falsify the false model; if the true model instead allows strictly fewer correlations than a false model, its parameters' true values will lie in a lower-dimensional region than the false model's parameter space, so it will keep a larger fraction of its probability mass and slowly gain likelihood over the false model.
This allows us to distinguish some models, but not all. Some models can converge to the exact same set of possible distributions, with the exact same correlations. We were able to distinguish "A causes B" from "A and B are unrelated", but alternative hypotheses, such as "B causes A" or "A and B have a common cause", are indistinguishable from "A causes B" no matter how many instances of the experiment we observe.
This might make you think that these models are equivalent, that if they make all the same predictions there is no reason to say they point to different underlying states of reality. And this would be right if a causal model were only applicable in the situation we've been considering: making predictions about a series of cases in each of which the entire model holds.
But as we've seen before, models are not so monolithic—they are conjunctions of causal laws, each of which represents its own physical mechanism that may be present in situations the others aren't. This means even if two models are equivalent, there may be modifications of them that are different. For example, in the case of RCTs, intervention can be used to learn about the causal law of an effect in a context where we control its cause, and then this can be generalized to a wider context where the cause follows its own causal law. I believe this is also how humans learn most causal relations in real life.
For example, take the slippery pavement model above. If we had a model that was almost the same, except that the Rain caused the Season instead of vice versa, then barring intervention this model would be statistically equivalent to the true model. But it's not like we learned that Rain was affected by the Season specifically by observing thousands of (Season, Sprinkler, Rain, Wet, Slippery) value tuples, drawing correlations and fitting them to models—that would hardly be practical for a human being. We learned it by observing Rain and the Season in a variety of contexts, where maybe different laws affected other parts of the system but the Rain kept the same mechanism, and generalizing.
Before Pearl came along, some people believed that all causality came from interventions—"no causation without manipulation", they concluded. We now know we can in fact distinguish some possible causal relations without intervention, which contradicts this assertion. Indeed, if this weren't the case it would be impossible to learn any causality at all: manipulation just changes one causal model into another, so if you couldn't distinguish any pair of causal structures by observation, manipulation wouldn't help either.
It remains the case that a large part of why models are useful in human thought is that we can put them together out of separate parts, change them to match changes in situation, and generalize them from one context to the other. The inability of causal models to determine on their own when those kinds of adaptations are appropriate is, I think, a large part of why you can't get an AGI just by putting together a large enough causal model. But, after doing the mental work of figuring out which variables and relationships may be at play, we can now use Pearl's machinery of causal models to find their exact relationships and distinguish between some possibilities we may consider plausible, which means it does have its use.
There are other reasonable choices of prior, but as long as your prior is relatively flat you will get approximately the same results.
Well, those equations plus the assumption that the probabilities of different trials are independent given the frequency parameter.
Technically, narrowness is measured by the hypothesis' entropy, as described. How easily a hypothesis is falsified by another is measured by their KL-divergence, the amount of bits it expects to lose to the other.
Except that in this case we're not uncertain about A (we know which experiment each result came from), but both hypotheses assign the same probabilities to it anyway, so it doesn't affect the likelihood ratio.
2025-10-28 07:04:49
Published on October 27, 2025 11:04 PM GMT
Currently, I am a bit unsatisfied with my food options in the suburbs of Melbourne, Australia. So this post is just me reflecting a bit on some of my favorite meals when I lived in different places and thinking about if I can find something better for my daily routines.
I am young and don't have any issues with weight, so my diet is pretty carbohydrate-heavy since that is cheap. I also prefer vegetarian food, since cheap meat tends to come from factory farming. I am not vegan, since I would be bound to get deficient in something, so this is my compromise. I still eat eggs since they are cheap and make me worry less about being low on some crucial amino acid.
I am a huge fan of Soylent-like meal replacement products that allow me to have a meal that is extremely easy to prepare, cheaper, and about as healthy as Ramen + Eggs. I tend to like really boring food, and I actively don't like adding too much variety (varying the ingredients) since I have a harder time making it a routine.
Gwern would tell me to just fry some eggs with some vegetables in a pan and to otherwise just cook some things in batches. His cooking habits appear to already be outside the window of what I would call convenient, though I would find cooking more convenient if I had a dishwasher.
These were my staple foods as a university student in Germany:
Really quick option: Satislent, a meal replacement shake which is cheaper and tastes better than Huel (~230 kcal/$ if bought in bulk). Cleaning the shaker bottle would be annoying; my solution was to keep the bottle in the fridge and then rinse it when I used it again. Often I would still be forced to properly clean it. It's especially dangerous to forget the bottle in your bag, because the grain soup is the perfect breeding ground for bacteria.
Frozen pizza (~400-600 kcal/$; the cheapest frozen pizza in Germany tends to be really cheap). My main constraint in my dorm was freezer space.
Spaghetti with red lentil tomato sauce. I would make the sauce in bulk and then keep making spaghetti every day. I didn't want to keep the lentils in the fridge beyond 4 days, so I couldn't make batches for longer stretches of time. I tried freezing some, but unfreezing frozen lentils turned out to be way too annoying (perhaps I should have gotten some freezer molds to portion them into small enough sizes instead of using the Tupper containers I had). Not sure what the exact price per calorie is, but it's cheaper than both options above and also has more protein and essential micronutrients than either.
Protein bread with Peanut butter or fresh cheese. When I still lived with my parents, we used to eat a lot of bread. When I lived alone, I was still giving bread a try, but I usually picked the whole-grain or added protein one. In both cases, the problem is that bread is quite expensive, tends to spoil quickly, and doesn't even taste that great. It is quickly prepared, though.
All of the above work really well as convenient foods, especially if you don't have a dishwasher and have to wash everything by hand.
When I was living in the Austrian Alps for 9 months, my staple food was crisp bread with cheese and pumpkin seeds (~330 kcal/$). I don't understand why that stuff isn't taking over the planet. It has an essentially infinite shelf life, is cheaper than fresh bread plus cheese (presumably because of the shelf life), requires zero preparation, and it tastes amazing! It is maybe a little dry, but I can just drink more water. Clearly, most people's food preferences must be very different from my own.
I also ate cereal with soy milk (0 sugar + extra protein). I like cereal, but good-tasting ones either have too much sugar or are too expensive (I might consider mixing my own cereal). Also, I couldn't find any soy milk that tastes as nice as the Austrian extra protein one.
Amazon had also started selling fully prepared German Lentil Stew and Chilli sin Carne, which I used to order, but that's also not available in Australia.
My current main staple food is Quota's Moroccan Cous Cous (~146 kcal/$). It is delicious, warm, and healthy food that I can prepare just by boiling water in a kettle. Amazing! I don't know why all the other complete food companies don't have similar products. It is a bit expensive, but still cheaper than other food for lazy people in Australia (the cheapest frozen pizza here is 4x the price in Germany).
It would be great to have something cheaper, or at least something different on the same Pareto frontier, for at least a little variety. Quota used to have a second warm meal, but they discontinued it before I had a chance to try it.
So I was brainstorming what I would ideally want, and I guess it is something like the couscous: a warm meal where I just need to put it in the microwave or add boiled water, but cheaper, or at least different enough from couscous that I don't grow bored of it as fast. It needs to have a long shelf life, so I don't have to prepare it too often. I might be reinventing oatmeal? I don't like the texture of regular oatmeal, and I also get an ick from warm sweet food (or most sweet things generally), but maybe that post is right and I just had bad oatmeal. I might want to experiment with salty oatmeal. If dried vegetables are too expensive, maybe I'll buy fresh ones that cook down fast (spring onions?). There are probably other things in this category.
I then thought maybe there is a pasta like spaghetti that boils faster but is cheaper than ramen (I thought ramen prices were a scam). Quickly asking a language model, I learned that ramen is actually pre-cooked! It turns out it would otherwise be quite hard to get pasta ready just by adding hot water. I guess ramen isn't more expensive than spaghetti for no reason after all. My understanding (which could be wrong) is that you could still make really thin pasta that you just add hot water to; it would just always taste overcooked, which wouldn't be a problem for me, but I guess it would be for most people. Sort of like rice paper, but in pellets.
It would also be great to have better options than over-sweetened whey protein bars on the go. Perhaps I should also look for good cheese crisp bread options online. Jimmy Joy has great bars in Europe, but the shipping to Australia is too expensive.
2025-10-28 04:59:42
Published on October 27, 2025 8:59 PM GMT
A recent paper shows an algorithm to invert an LLM to find the inputs (I think? I'm not an ML guy). Does that mean you can now turn a predictor directly into a world-steerer? If you put in an output, and it finds the input most likely to cause that output, does that mean it will find the things it needs to say in order for the most likely next token to be the chosen one, even if that next token is something said by a human? If that is actually how it works, it really looks like this is a major breakthrough, and strong agents will be here shortly.
2025-10-28 03:38:03
Published on October 27, 2025 7:38 PM GMT
Title is a question because I'm looking for an answer, not providing it.
I wanted to stress-test some ideas and reason through some questions I have about concentration of power. These have been bubbling in the back of my head for a while, so I've taken two hours this morning to write some of them up and see how others feel.
The main project I'm currently working on involves the mitigation of AI proliferation, particularly for offensive capabilities. The core issue is that a) advanced AI can plausibly allow you to make cheap, ultra-destructive weapons, and b) the cost to train an advanced AI system declines quickly over time because of algorithmic efficiency improvements. As a result, the more you allow proliferation early on, the more invasive and ineffective enforcement becomes, until we hit a point where a few people are empowered to destroy the world at the expense of everyone else (maybe they built a fail-deadly, they're terrorists, they're irrational, they try to build a superintelligence and end up misaligning it, etc.).
In my model of things, these destructive technologies are offense dominant (implausibly inefficient or impossible to defend against), which means that the only realistic way to control them is to limit their spread preemptively. While it's hard to predict the future, I'd expect the first ASI labs to end up nationalized by the USG and PRC, meaning that offensive capabilities are initially monopolized. If they recognize the security interest of restricting AI proliferation (as the U.S. and Soviet Union mutually agreed to do with nukes), they will use their strategic position to try to disempower other states and actors from getting access to offensive AI capabilities before those become cheap to access. Perhaps they commandeer the entire AI supply chain and put restrictions in place such that all future AI systems are aligned to the US-China coalition's goals.
As far as I can see, this would reasonably secure the world against the threat of black ball AI tech. But an important secondary issue it creates is concentration of power, wherein the rest of humanity is subjected to the whims of the USG and PRC, because they have a monopoly on violence through their control of ASI technology and the commandeering they just did, and have to continue to do, in order to make sure an unauthorized ASI is never built.
Here are some of my intuitions about this situation:
Given all of this, concentration of strategic power seems both necessary and... probably ok? Even in this worst-case scenario, where power is totally monopolized by a handful of people (say the Oversight Committee from AI 2027), all it takes is a small preference among some of them that civilians live great lives for civilians to be able to do so. The state needs to clamp down on offense-dominant capabilities to make sure that non-state actors never get access to them (or else someone will eventually use them), but afterwards, beneficial AI technology can be allowed to mostly flourish.
Of course, I still think it would be better to have outcomes like value-lock-in of welfare/political rights for all humans at the inception of ASI, or for the U.S and China to somehow avoid intent alignment and instead aim straight at civilizational alignment where the ASI runs everything itself, rather than having to account for the preferences of very few. But I believe the above situation---nationalization of labs, intent aligned AI, realization of the strategic importance of non-proliferation, permanent monopolization of strategic power---represents the default path to dealing with the proliferation problem, so it seems important to reason about what it might look like and whether that's acceptable.
As an example of the kinds of defensive plans I'm talking about, I'm referring to Luke Drago and Rudolf L's proposal for defensive augmentation, or Helen Toner's adaptation buffers, where we prepare society ahead of time through the wide-scale distribution of technology.
I have several thoughts on this, which I'll explore in a future, properly edited post on the strategic implications of proliferating powerful AI systems. My main critique is that I do not share their optimism that you can efficiently respond to all existential threats, given the cost imbalance of defending against them. As the number of threats and actors increases, spending on defense has to grow exponentially, while offensive investments grow only linearly, since attackers can focus all of their effort on the areas that are under-invested or even impossible to defend against. I.e., it's much cheaper to try to overwhelm defensive capacity by just launching two pandemics at the same time than it is to make sure you are prepared for multiple simultaneous pandemics.
Analogously, we can think about how ridiculously inefficient trying to intercept nuclear ICBMs is, and how you could obsolete the hundreds of billions of dollars' worth of interceptor research you'd need by a) building faster missiles, b) using more missiles, or c) just putting your bomb in a shipping container and docking it next to Washington.
And even if you secure one avenue (wastewater monitoring, DNA synthesis screening, and KYC ends up being enough defense-in-depth for bioweapons that it's basically solved), there's no guarantee you can solve them all. If you upgrade the problem from a pandemic to mirror life, for example, outbreak monitoring and antibiotic response plans start to break down completely. And so on for swarms of insect-sized LAWs, nanotechnology, misaligned ASIs, and the unknown-unknowns of weapons development that lurk at the top of the tech tree. The only reasonable solution to this problem is just to make sure other people don't have weapons in the first place, or preemptive non-proliferation.
I do not have the confidence to distribute general ASI without a guarantee that people will not have access to these offense-dominant technologies, and the only way to make sure that their AI cannot do those things would be for a more powerful central authority to have achieved ASI first and used their first-mover advantage to strip all future offensive capabilities away from other actors.
Similarly, even if the Chinese government decided to try to be more competitive by ignoring the welfare of its own citizens, it wouldn't necessarily be able to impose its will on the U.S given the disproportionate leverage of offense-dominant weapons.