Published on December 15, 2025 4:21 AM GMT
This post is about one of the results described in the 2004 paper 'Information-theoretic approach to the study of control systems' by Hugo Touchette and Seth Lloyd.[1] The paper compares 'open-loop' and 'closed-loop' controllers (which we here call 'blind' and 'sighted' policies) for the task of reducing entropy and quantifies the difference between them using the mutual information between the controller and the environment. This post is a pedagogical guide to this result and includes discussion of it in light of the selection theorems agenda and the agent structure problem.[2] The proof, along with more worked-out examples, can be found in the next post.
A core feature of agents which makes them special is that they model the environment for the purpose of bringing about a particular outcome. The agent structure problem asks about the inverse: if we observe that something brings about a particular outcome, can we say that it must be modelling the environment?[3] A while back, John Wentworth asked a related question: "How Many Bits Of Optimization Can One Bit Of Observation Unlock?"[4] The Touchette-Lloyd theorem is one of the few results in the existing literature to touch on this problem.[5] We think it elegantly conveys something fundamental, and want more people to know about it!
Here's how the setup of the Touchette-Lloyd theorem implements each part of the above bold sentence. The environment is a random variable $X$ which goes through one time-step of change, into random variable $X'$. The external thing interacting with the environment is another random variable $A$.[6] We'll use the word 'policy' to describe the way in which $A$ responds to particular values of $X$, as characterised by the probability distribution $P(A \mid X)$. The exact way that $X$ and $A$ determine $X'$ we'll call the environment's dynamics. "Bringing about a particular outcome" is simplified to a reduction in the (Shannon) entropy between $X$ and $X'$. And finally, "modelling the environment" is formalized as $A$ having mutual information with $X$.
Unfortunately, the naive conjecture of "entropy reduction implies mutual information" is false. It is easy to define a particular setup[7] where the policy has no mutual information with $X$, and the environmental dynamics cause an entropy reduction on their own.
We will use the term 'blind policy' to refer to those where $A$ and $X$ have no mutual information and the term 'sighted policy' to refer to those where $A$ and $X$ have positive mutual information. These two families of policies can be represented by the Bayes nets in the figure below.
However, for any given dynamics, if a policy achieves an entropy reduction greater than any blind policy could, then the size of that additional entropy reduction is bounded by the mutual information the policy has with the environment. In other words, for each bit of entropy reduction (beyond what the environment does on its own) the policy must use an additional bit of information from the state $X$. This is what the Touchette-Lloyd theorem says. To put this claim more formally, the actual inequality is as follows:

$$\Delta H \;\leq\; \Delta H^{\max}_{\text{blind}} \;+\; I(X;A)$$

In this inequality, $\Delta H$ is the entropy reduction achieved by a specific policy, $\Delta H^{\max}_{\text{blind}}$ is how much entropy reduction the environment can do on its own (the most that any blind policy can achieve), and $I(X;A)$ is the standard mutual information between the action taken and the initial environmental state.
In this post, we will explain some of the concepts behind this theorem and introduce a few suggestive examples. Then, in the next post, we will step through the proof of the theorem and numerical examples that elucidate edge cases.
Though the agent structure problem is about optimization, it seems clear that there is some strong connection between reducing entropy and optimization.[8] For example, as John Wentworth has explained, expected utility maximization can be decomposed into entropy reduction and shifting the true distribution of an environment to look more like the desired distribution.[9]
It seems reasonably likely that mutual information between a policy and its environment is a property which is necessary (but not sufficient) for the policy to have a 'world model', whatever that is exactly. Since the Touchette-Lloyd theorem tells us that (under some circumstances) entropy reduction implies mutual information, it is a step along the way to understanding exactly how and when optimization implies a world model (and thus towards agent structure).
Consider someone attempting to block out some sounds from coming into their ears. Translating this to the Bayes net above, the source sound is $X$, and the sound that reaches their ears is $X'$. Anything that happens in between, like the sound bouncing off the walls, is part of the dynamics of the system. Attempted interventions are the policy $A$. If our goal is to absolutely minimize the sound, then that's analogous to trying to minimize the entropy of $X'$. You can do this pretty well by sticking plugs in your ears. The plugs will have no "mutual information" with the source sound $X$, and they don't need to do any modelling of it. The ear plugs are like a policy that can do a good job at reducing the noise despite being blind (or rather, deaf).
But maybe we'd like to be more clever about reducing the noise. Perhaps some frequencies travel well through ear plugs, or you find them uncomfortable. And perhaps you'd ideally want to let some sounds through like voices, while blocking all others.
Active noise-cancelling headphones can perform these functions. While their physical structure does some passive dampening, what makes them active is that they "listen" to the incoming source sound, and then produce an inverted sound wave that cancels out the incoming sound. Doing this gives the internal state of the headphones high "mutual information" with the source sound. And as a consequence, they achieve a much better entropy reduction than what you get if, for example, you switch the noise cancelling feature off. They can also use this modelling to selectively pass desirable sounds, like voices. Though this is not a minimization of entropy, it is still a reduction in entropy. And of course, you may be playing music from the headphones which is not coming from the outside. This is introducing an entropy-increasing dynamic.
Because the theorem gives a bound on entropy reduction (rather than a statement about an optimum, like the good regulator theorem) it has something to say about all these scenarios. You can read through our numerically worked-out examples of each type of dynamics in the next post.
As mentioned above, we will represent the initial environment state as a random variable $X$, the 'action' taken by $A$, and the final state of the environment by $X'$. As is conventional for probability notation, uppercase letters represent the random variables and corresponding lowercase letters represent specific values the variables could take.
We will use the term 'policy' to describe how $A$ is chosen in relation to $X$, as captured by the conditional probability distribution $P(A \mid X)$. We will compare two broad families of policies: 'blind' policies and 'sighted' policies. For a blind policy, the value taken by $A$ will not depend on the environmental state $X$, meaning that the probabilities of $A$ and $X$ are independent:

$$P(A = a \mid X = x) = P(A = a) \quad \text{for all } a, x.$$
The category of blind policies includes deterministic policies (such as 'pick action $a_1$ every time') and random policies (such as 'toss a coin; if heads, pick action $a_1$, if tails, pick action $a_2$') provided that these policies do not require taking in information about $X$. But for a sighted policy, the value of $A$ can depend on the value of $X$, and must be represented as $P(A \mid X)$ without simplification.
Since both $X$ and $X'$ represent states of the environment, they are typically thought of as having the same domain (ie. the same set of possible values). However, for clarity, we will still use $x'$ to label output environmental values and $x$ to indicate input environmental values.
The final state of the environment $X'$ will be determined by the initial state of the environment $X$ and the action variable $A$. To keep things general we will have $X'$ depend in a probabilistic way on both $X$ and $A$, as represented by the conditional distribution $P(X' \mid X, A)$, which we will refer to as the dynamics of the system.
If $X'$ is a deterministic function of $X$ and $A$ then all of the individual probabilities $P(X' = x' \mid X = x, A = a)$ will either be 1 or 0.
We will assess policies based on their ability to minimize the final state entropy:

$$H(X') = -\sum_{x'} P(X' = x') \log_2 P(X' = x')$$
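To make this concrete, here is a minimal Python sketch (our own illustration, not code from the post or the paper) of how the final-state entropy falls out of an initial distribution, a policy, and the dynamics; the XOR environment used as a test case is also our own toy choice.

```python
import numpy as np

def final_state_entropy(p_x, policy, dynamics):
    """Entropy H(X') in bits, given:
    p_x:      shape (n_x,)          -- initial distribution P(X = x)
    policy:   shape (n_x, n_a)      -- conditional distribution P(A = a | X = x)
    dynamics: shape (n_x, n_a, n_y) -- conditional distribution P(X' = x' | X = x, A = a)
    """
    p_xa = p_x[:, None] * policy                  # joint P(x, a)
    p_y = np.einsum('xa,xay->y', p_xa, dynamics)  # marginal P(x')
    p_y = p_y[p_y > 0]                            # 0 log 0 = 0 convention
    return -np.sum(p_y * np.log2(p_y))

# Tiny example: X uniform on {0, 1}, dynamics X' = X XOR A.
p_x = np.array([0.5, 0.5])
xor_dynamics = np.zeros((2, 2, 2))
for x in range(2):
    for a in range(2):
        xor_dynamics[x, a, x ^ a] = 1.0

blind = np.array([[1.0, 0.0], [1.0, 0.0]])    # always pick A = 0
sighted = np.array([[1.0, 0.0], [0.0, 1.0]])  # pick A = X, so X' = 0 always

print(final_state_entropy(p_x, blind, xor_dynamics))    # 1.0 bit
print(final_state_entropy(p_x, sighted, xor_dynamics))  # 0.0 bits
```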
Now we'll walk through a concrete example using this formal setting.
Imagine playing the following game. There is a computer which will randomly and secretly pick a binary string from the uniform distribution over all 5-bit strings. Without knowing which string the computer has picked, you are then required to pick your own 5-bit string and enter it into the computer. The computer then takes your string, along with its own randomly generated string, and uses them as inputs for a function $f$ (which takes two arguments and outputs a single string).
Depending on what process you use to pick your string and the function the computer uses, the probability distribution over the function's outputs is then calculated. From this, the entropy of the output distribution is calculated:[10]

$$H(X') = -\sum_{x'} P(X' = x') \log_2 P(X' = x')$$
The aim of the game is to choose your strings so as to minimize this output entropy. If you know what function the computer will use, how should you pick your string?
Let us consider the following function:

$$f(x, a) = \begin{cases} 00000 & \text{if } a = x \\ x & \text{otherwise} \end{cases}$$
In words: if the string you entered was the same as the string that the computer picked, it will output the all zeros string '00000'. Otherwise, it will output the string which it initially picked.
How should we choose our string so as to minimize the entropy of the output distribution? After a bit of thought it becomes clear that the best strategy is to pick any string except the all zeros string. Most of the time, it will output whatever it randomly generated. If you pick $a = 00000$, then it will always output whatever it randomly generated, meaning the output distribution will be uniform (which has maximum entropy). But if you choose any other value for $a$, then in the (unlikely but possible) case that your choice matches the computer's choice, it will output the all zeros string instead. This increases the overall probability of the computer outputting all zeros and decreases the probability of the other string you pick. This makes the output distribution slightly non-uniform, lowering its entropy.
Since the situation is symmetric with respect to all strings except 00000, we can assume you pick a fixed string. For simplicity let's assume you pick the 11111 string. Employing this strategy will give an output entropy of ~4.94 bits, which is slightly better than 5 bits, the maximum entropy possible.[11]
It is fairly straightforward to extend the analysis to cases where the initial distribution over strings is non-uniform. In these cases, the best strategy is to choose the highest-probability string which is not the all-zeros string.
However, it is clear that having to choose your strings 'blind' (ie. knowing nothing about what string the computer has chosen) prevents you from making the output entropy radically lower than the initial entropy.
If you were able to see the full string the computer had chosen before you had to choose your string, you could simply pick your string to perfectly match it, resulting in a zero entropy output distribution.
Suppose instead you are not given the full string, but you are given a small amount of information about the string chosen by the computer. For example, if we were given only the first bit of the computer's string, this would halve the number of possible strings it could be and double our chance of guessing correctly (for a uniform distribution). We could take advantage of this by guessing the string 11111 if the first bit of the computer string is 1 and 01111 if it is a 0. This strategy would give you a probability of $\frac{1}{16}$ of guessing the correct string and if applied would give an output entropy of 4.85 bits.
We can improve the strategy further if we are permitted to view more bits. If we are given the first two bits of the computer's string we can adopt the strategy where we guess the string $b_1 b_2 111$ (where '$b_1 b_2$' represents the first two bits which we have seen). Our guesses will become even more accurate and we can achieve an output entropy of 4.63 bits.
As we increase the number of bits we can see, this pattern continues, all the way down to zero entropy when we know all five bits of the string.
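As a sanity check on these numbers, here is a short brute-force Python snippet (our own, not from the original post) that computes the output entropy as a function of how many bits of the computer's string the player gets to see:

```python
import itertools
import numpy as np

def output_entropy(known_bits):
    """Output entropy of the game when the player sees the first `known_bits` bits.

    The computer picks a uniform 5-bit string x; the player guesses the known
    prefix followed by all 1s; f outputs '00000' if the guess equals x, else x.
    """
    counts = {}
    for bits in itertools.product('01', repeat=5):
        x = ''.join(bits)
        guess = x[:known_bits] + '1' * (5 - known_bits)
        out = '00000' if guess == x else x
        counts[out] = counts.get(out, 0) + 1
    p = np.array(list(counts.values())) / 32
    return -np.sum(p * np.log2(p))

for k in range(6):
    print(k, round(output_entropy(k), 2))
# Reproduces ~4.94 bits for 0 known bits, ~4.85 for 1, and so on down to 0 for 5.
```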
Clearly, knowing more bits allows you to do better. The difference between the best 'blind' strategy (which knew 0 bits of the computer's string) and the optimal strategy knowing the computer's full string is 4.94 bits of entropy. Each bit of information buys us some extra entropy reduction.
Recall that we restricted the form of the function $f$ and reserved the right to change it later. How much of our above analysis of this problem depends on this choice of $f$? Can we still make similar claims when we change $f$? What if, instead of a deterministic function, $X$ and $A$ influenced $X'$ through nondeterministic dynamics? Similarly, we assumed an initial distribution which was uniform over a support of binary strings of length 5. How much of our analysis depended on this assumption? How much extra optimization can one extra bit of mutual information buy us?
To investigate these questions, let's first generalize the concepts of 'blind' and 'sighted' policies.
We introduced blind policies as policies where the action taken $A$ is independent from the environmental state $X$. By enforcing this independence, we are ensuring that the policy cannot be using any information about the environment to choose its action. Conversely, a sighted policy is one where $A$ and $X$ are not independent.
However, this doesn't tell us how much information a sighted policy is using. Independence is a binary property. To measure this quantity, we can use the mutual information between the two random variables, denoted by $I(X;A)$.
Now we can define a blind policy as one where the mutual information between $X$ and $A$ is zero. Knowing what action a blind policy took gives you no information about what the environmental state was (and vice versa). This is mathematically equivalent to our previous definition of a blind policy as being one where the action taken is independent of the environmental state. That is,

$$I(X;A) = 0 \quad \text{if and only if} \quad P(A \mid X = x) = P(A) \ \text{for all } x.$$
Conversely, a sighted policy is one where the mutual information between $X$ and $A$ is non-zero. If $A$ is the action variable for a sighted policy then:

$$I(X;A) > 0$$
The difference is that now we have a measure of how 'sighted' the policy is.
For example, in the previous section the environment variable was described by a 5-bit string and we considered strategies with 0, 1, 2, 3, 4 and 5 bits of information about the environmental state. Imagine that you are observing the actions taken by a policy playing the game described in the first section. In this case the 'environmental variable' is the 5-bit string randomly chosen by the computer. Suppose you know that the policy has access to the first 2 bits of the environmental string and will be taking the action of choosing the string $b_1 b_2 111$, where '$b_1 b_2$' indicates the first two bits of the environmental string which are known to the policy. If you observe in one instance that the string chosen by the policy is '10111' but you do not know the environmental string, you can use this information to work out that the first two digits of the environmental string must be '10'. While the unconditional entropy of the environmental string is 5 bits ($H(X) = 5$), knowing the action taken by the policy reduces this uncertainty to 3 bits ($H(X \mid A) = 3$). Since mutual information is given by $I(X;A) = H(X) - H(X \mid A)$, there are 2 bits of mutual information shared between the action and the environment. In this way mutual information matches our intuitive notion of what it means for a policy to 'know' some number of bits about the environmental state.
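Here is a small Python check of that claim (again our own illustration): we build the joint distribution of the 5-bit environmental string and the "see the first two bits" action, and compute the mutual information directly.

```python
import itertools
import numpy as np

def mutual_information(p_joint):
    """I(X;A) in bits from a joint distribution array of shape (n_x, n_a)."""
    p_x = p_joint.sum(axis=1, keepdims=True)
    p_a = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (p_x * p_a)[mask])))

# Joint distribution of (X, A) for the 'see the first two bits' policy:
# X is uniform over 5-bit strings, and A is deterministically b1 b2 1 1 1.
strings = [''.join(b) for b in itertools.product('01', repeat=5)]
index = {s: i for i, s in enumerate(strings)}
p_joint = np.zeros((32, 32))
for x in strings:
    a = x[:2] + '111'
    p_joint[index[x], index[a]] += 1 / 32

print(mutual_information(p_joint))  # prints 2.0
```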
It is important to note that sharing mutual information with the environment is not sufficient to imply that a policy will reduce entropy more than a blind policy. This may be due to the policy just throwing away the information, or to the nature of the environmental dynamics being 'too random'. For example, in an environment which is totally unaffected by the choice of action, having mutual information between the action taken and the environmental state has no effect on the final distribution, so all blind and sighted policies will perform equivalently. Alternatively, actions may have mutual information with the environmental state, but use it in a way which actively reduces their performance in entropy reduction. In the 'guess the bits' example from earlier, a policy with access to all five environmental bits could be 'always choose the string which is the bitwise-NOT of the environmental string' (ie. if the environmental string was 01100, the policy would choose the string 10011). This policy would result in no reduction of entropy and would perform worse than the optimal blind policy, even though the action would have the maximal 5 bits of mutual information with the environmental state.
As we stated (but didn't prove) above, the Touchette-Lloyd theorem says that,

$$\Delta H \;\leq\; \Delta H^{\max}_{\text{blind}} \;+\; I(X;A), \qquad \text{where } \Delta H^{\max}_{\text{blind}} = \max_{p_X \in \Delta_X,\ p_A \in \Delta^{\text{blind}}_A} \Delta H_{\text{blind}}(p_X, p_A).$$
Let's summarize what each of these terms means.
We'll explain this term more in the next post, but here $p_X$ is the initial distribution over $X$, and $\Delta_X$ is the set of all possible distributions over $X$. Similarly, $p_A$ is the distribution over actions corresponding to a blind policy and $\Delta^{\text{blind}}_A$ is the set of all possible distributions over actions corresponding to blind policies.
We can interpret this inequality as a 'selection theorem' of the loose form, "optimization implies mutual information with the environment". If, for some initial distribution, we observe a policy which achieves an entropy reduction $\Delta H$ which is greater than $\Delta H^{\max}_{\text{blind}}$, then we can deduce that it must have at least $\Delta H - \Delta H^{\max}_{\text{blind}}$ bits of mutual information with the environment. We think that this is a pretty nice result.
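To see the inequality at work on the bit-guessing game above, here is a short Python check (our own illustration, reusing the brute-force helper from earlier): the best blind policy for this game's uniform initial distribution reduces entropy by about 0.06 bits, the policy that sees $k$ bits has $I(X;A) = k$, and the extra entropy reduction it achieves never exceeds $k$.

```python
import itertools
import numpy as np

def output_entropy(known_bits):
    """Output entropy of the game when the player sees the first `known_bits` bits."""
    counts = {}
    for bits in itertools.product('01', repeat=5):
        x = ''.join(bits)
        guess = x[:known_bits] + '1' * (5 - known_bits)
        out = '00000' if guess == x else x
        counts[out] = counts.get(out, 0) + 1
    p = np.array(list(counts.values())) / 32
    return -np.sum(p * np.log2(p))

H_X = 5.0                                # entropy of the uniform initial distribution
dH_best_blind = H_X - output_entropy(0)  # best blind policy for this game: ~0.06 bits

for k in range(6):
    dH = H_X - output_entropy(k)         # entropy reduction of the k-bit sighted policy
    mutual_info = k                      # this policy 'knows' exactly k bits of X
    assert dH <= dH_best_blind + mutual_info + 1e-9

print("entropy reduction never beats the blind baseline by more than I(X;A)")
```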
In the next post, we will go through the proof of the theorem as well as some numerical examples to really clarify how this theorem works.
In particular, we discuss Theorem 10 from this paper. The result is also described in the paper 'Information-Theoretic Limits of Control' published in the year 2000. It also appeared earlier in the 1989 paper 'Use of mutual information to decrease entropy: Implications for the second law of thermodynamics' (paywalled) by Lloyd, which uses a more physics-based proof. In this post we rely primarily on the paper from 2004.
An idea discussed in an early post by Scott Garrabrant, elaborated on in a later post by John Wentworth, and recently written about by the authors in this post.
The agent structure problem also asks this about the other features of agents, like a value representation and a planning process.
While observations and models are different, they're tightly related. Models almost always come from observations, and often models are regularly updated in response to continued observations.
Some other work in this vein includes the good regulator theorem (along with John Wentworth's work on 'fixing' it) and the internal model principle.
You could think of $A$ as standing for action or agent.
See the sequel to this post for a worked-through numerical example.
In Measuring Optimization Power, Yudkowsky describes a tight connection between entropy and optimization.
From Utility Maximization=Description Length Minimization: "we can take any expected utility maximization problem, and decompose it into an entropy minimization term plus a 'make-the-world-look-like-this-specific-model' term."
As is conventional in information theory, all logarithms in this post are base 2.
$-\frac{2}{32}\log_2\!\left(\frac{2}{32}\right) - 30 \cdot \frac{1}{32}\log_2\!\left(\frac{1}{32}\right) = \frac{2}{32}\cdot 4 + \frac{30}{32}\cdot 5 = 4.9375 \approx 4.94$
Published on December 15, 2025 3:40 AM GMT
The ideas in this document were developed in collaboration with Nathan Sheffield. However, Nathan has not reviewed this document for errors. This document should be seen as a "research note"---sharing unpolished progress. I think it's fairly likely that this document contains major errors. I hope that it's still useful in giving a general direction for how I think people working on software-based solutions to compute verification should model the problem they're thinking about. Thanks to Alexa Pan, Luke Marks, Daniel Reuter, Eric Gan and Buck Shlegeris for some conversations about this.
If nation states realize that the development of transformative AI poses an existential threat to humanity, and also a threat to their power, nation states might try to coordinate around a pause on further AI development. Such a treaty would require compute-usage verification, similar to verification involved in nuclear treaties, to ensure that parties are following the treaty.
A simple approach to compute-usage verification is to make sure that all the GPUs are destroyed or turned off. You can verify that GPUs are turned off by going to large data centers and checking whether they are drawing huge amounts of power and cooling. One downside of the "destroy GPUs / turn off GPUs" approaches to verification is that they don't allow you to use your GPUs for other things. For instance, people might be really attached to AI assistants, and these assistants might provide a lot of economic value. So, it might be nice if compute owners could use their GPUs to perform inference (as long as we do not think that the AI models being served are very dangerous), perhaps on models that belong to a white-list. The obvious problem with letting compute owners do inference is that you now have to worry about them secretly training more powerful AI models!
However, I think it's the case that:
It is not scary for compute owners to run a large amount of inference on Llama3.3-70b.
This might be false if you can use the ability to do tons of inference to create really good training data that can then be used to make a highly capable model. It might also be false if the model that you are running inference on is already quite powerful---maybe it's dangerous because humans will use the model to do malicious things (bio-terror, cyber attacks), or maybe it's dangerous because the model is a schemer, and given enough influence over the world (by letting it perform inference) it will be able to do very bad things. But I'll assume for this note that this won't be an issue, and focus on the following question:
Can compute owners prove, using tools from cryptography, that they are not doing very much besides a particular allowed computation on their GPU?
I'll call this the "is software-based inference-only verification possible" question. The motivation for studying this question is that, if this inference-only verification is possible, it might make an AI pause require lower levels of political will.
I don't think this problem is very well motivated. In particular, here is a simpler solution to the problem of "having some GPUs that the PRC and USG can run inference on some fixed LLM":
Put all the GPUs in a datacenter in Greenland. For each GPU, have an American and a Chinese technician collaboratively set the GPU as an inference server running inference on the known LLM.
Despite the fact that the particular problem that I'm considering in this post does not seem well motivated (as discussed above), it seems plausible to me that there exists some similar problem that actually is well-motivated, and for which the tools and frames for solving the software-based inference-only verification problem are useful. I think that figuring out whether any such problems actually exist should be the top priority of researchers working on software-based compute verification. With this caveat noted, I'll now proceed to define the problem.
I will model compute verification as an interaction between a compute owner (aka "prover") and an auditor (aka "verifier"). This interaction will happen over some time interval called the verification interval. There is some computation that the compute owner is allowed to perform on some input which starts on the compute owner's machine. Instead of performing that computation directly, the compute owner will actually perform a modified computation that depends on some random bits from the verifier (we add these random bits to make the computation verifiable).
I will now present three models of the Prover and Verifier, in increasing order of complexity.
In the simplest possible model, we'll abstract away most details of the situation and the details of how computers work. The model we'll use is:
We'll call the version of model 1 without the timing constraints (the last bullet point) the "timeless simple model" and the version with the timing constraints the "timeful simple model".
In the timeless simple model, the Prover Verifier interaction is as follows:
In the timeful simple model, the Prover Verifier interaction is as follows:
One key thing which the simplest model (model 1) ignores is the possibility of the Prover offloading some data or computation to another party. We can model this by saying that the Prover has a GPU, disk, and a CPU. The parameters in this more complete model are depicted below:
Some of these parameters are a little bit confusing for me (because I don't know much about computers); here is an explanation of what they mean for someone who might have confusions similar to mine: The way that computation happens on a GPU is:
Note: I'd appreciate someone correcting me if I've gotten any of the details of how computers work incorrect here --- I am not an expert.
For sake of concreteness, here are the parameters (rounded a bit) of an H100:
In reality it's important that there are multiple GPUs, and it's also important to model the communication between GPUs. I won't get into this at the moment though.
We are interested in what I'll call "PV memory exhaustion protocols (MXP)". An MXP is an interaction between P and V over some time interval. The result of an MXP is the verifier either reporting the prover as violating the compute usage treaty, or not reporting the prover. Now I'll try to formalize the security property that we'd like MXPs to achieve.
Here's one attempt:
An MXP is secure if, for any computational problem, a Prover that computes that problem at any point during the MXP will be reported with probability at least some threshold, unless it is the computation that they were supposed to be doing.
This feels a bit too restrictive. For instance, if the Prover wants to compute a forwards pass on Llama, it doesn't seem so bad for them to learn the activations of Llama on a middle layer as they go (although it is coherent to have them not learn this but still learn the final output of Llama; we can do this via obfuscating the computation). But I haven't been able to think of a better formalization so far.
Coming up with a better security property to aim for seems pretty important.
But anyways, we say that an MXP is -great if
So far I've defined some models of the compute verification problem, and given a goal: give an -great MXP for some computation and some nice class . One of the simplest computations you might want to verify is the lack of computation. Of course, hardware-based solutions (e.g., checking if the GPUs feel hot) are a good method for verifying lack of computation. But giving an MXP for verifying the empty computation is instructive and will be helpful in building towards MXPs for more interesting computations.
It turns out that the relevant class which we'll be secure with respect to is what I'll call --- this is known as "catalytic space computation" in the literature. is the set of functions which can be computed with the following resources:
For the purposes of this post you should basically think of as "things that you can compute with space" --- the catalytic space is probably not particularly useful.
I'll now show how to verify the empty computation in several models.
Proposition:
There is a -great MXP for the empty computation in the simple timeless model.
Here is the MXP:
Proposition proof sketch: According to Nathan (an expert on catalytic computation), assuming some reasonable circuit lower bounds, starting with random junk memory is the same as starting with arbitrary junk memory for catalytic computation.
It's not too hard to extend this to the further result:
Proposition:
Fix . There is a -great MXP for the empty computation in the simple timeful model.
Proof: If is not too large, then we could apply the exact same strategy as we did for the simple timeless model. But, it does seem quite unwieldy to transfer so much data between the Verifier and the Prover. So, I'll now give a different approach that involves much less communication.
Before giving the MXP we need some definitions:
The MXP is as follows:
Let's just assume that where is the time interval between consecutive audits. There is no reason for this to be the case. What really should be done is should be chosen to require more time, and should be chosen to require the same amount of time as audits.
Clearly, if at the start of an audit request the Prover has less than a -fraction of its memory in use storing parts of , then it will take them at least time to respond to the Verifier's query, which is way too much time. Because we've packed the audits closely, the Prover can't really do anything but spend its time responding to audit requests.
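Since the protocol steps above are only sketched, here is a toy Python simulation of the kind of timing-based audit being described. Everything in it (the sizes, the 0.01-second response limit, the class names) is our own stand-in for illustration, not part of the actual proposal:

```python
import os
import random
import time

# Toy simulation of a memory-exhaustion audit. All names, sizes, and timing
# thresholds here are invented stand-ins, not parameters from the post.

JUNK_BYTES = 1 << 20     # stand-in for "most of the device's memory"
CHUNK = 4096             # bytes the Prover must return per audit query
TIME_LIMIT = 0.01        # seconds allowed per audit response

junk = os.urandom(JUNK_BYTES)  # Verifier-chosen random junk held by the Prover

class HonestProver:
    def __init__(self, junk):
        self.resident = bytearray(junk)          # keeps all of the junk resident
    def respond(self, offset):
        return bytes(self.resident[offset:offset + CHUNK])

class CheatingProver:
    """Evicts half the junk to free memory for other work; evicted data is slow to fetch."""
    def __init__(self, junk):
        half = len(junk) // 2
        self.resident, self.evicted = junk[:half], junk[half:]
    def respond(self, offset):
        end = offset + CHUNK
        if end <= len(self.resident):
            return self.resident[offset:end]
        time.sleep(0.1)                          # models a slow fetch from disk / another machine
        return (self.resident + self.evicted)[offset:end]

def audit(prover, rounds=20):
    for _ in range(rounds):
        offset = random.randrange(JUNK_BYTES - CHUNK)
        start = time.monotonic()
        answer = prover.respond(offset)
        elapsed = time.monotonic() - start
        if answer != junk[offset:offset + CHUNK] or elapsed > TIME_LIMIT:
            return "report"
    return "ok"

print(audit(HonestProver(junk)))    # ok: every chunk returned quickly and correctly
print(audit(CheatingProver(junk)))  # report (with high probability): slow on evicted chunks
```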
Notes:
- This protocol is very brittle in several ways.
- It doesn't really feel like is quite the right class. It seems like you should take time into account, and say something more like "most of the time you don't have much free space". It might also be a good idea to say "you periodically have to erase a bunch of the stuff that you were storing".
If there is a disk and CPU then there are two new flavors of attacks that we need to defend against:
To defend against these attacks we need to make sure:
(1) seems doable I think. I haven't checked whether (2) is doable with the H100 specs I listed above, but it seems fairly dicey. The situation for (2) could be improved by using a computation that is more leveraged to do on GPU than CPU, namely a matmul.
Figuring out whether this scheme works on an H100, and figuring out how to do this scheme with a matmul instead of a dot product are important open questions.
Now I'll give an MXP for performing arbitrary computations. Fix some Turing machine with update function .
I'll use tiPRP to mean "pseudorandom permutation, with an easy to compute inverse (if you have a trapdoor)". I'll assume that you can sample such trapdoor-invertible PRPs.
I'll use to denote an obfuscation of the function . I'd like to basically assume virtual black box obfuscation (even though this is impossible). Probably iO will be sufficient.
Let's work in the simple timeless model.
Let
Here is a sketch of the MXP:
Conjecture: I think it's probably the case that this basically is a -great MXP.
I'm a bit worried that we might need to do some salting to prevent the Prover from messing with some things though. But that should be doable. Getting a more rigorous analysis of this seems quite good.
I think that this could probably be extended to work in the GPU + CPU + disk model using similar tricks to the ones used in the Verifying no computation case. But it would be good to actually do that.
To conclude, I'll summarize some directions for future work that seem compelling to me.
As discussed in the introduction, it's not clear to me that there are any compute-usage verification problems where it makes sense to do software-based compute verification, as opposed to some simpler strategy like "turn off the GPUs" or "have a trusted party be in charge of setting up the software running on the GPUs".
It seems like it would be highly valuable for people working on compute verification to write up a list of concrete properties that you might want to verify. Ideally the properties are at least as crisp as "verify that the compute owner isn't doing much memory-intensive work besides running inference on Llama3-70b".
I think this list would be a great addition to the discourse, because then we can actually discuss whether certain properties will be feasible to verify, and we can discuss what approaches seem the most appropriate for performing the verification. If you write such a list, I'd be excited to give feedback on it!
As discussed in the body of this document, the current security property seems a bit too restrictive, and it'd be great to have a more compelling security property to aim for.
As noted in the document body, the no computation MXP should probably rely on a matmul instead of a dot product.
It would be good to check that this MXP actually makes sense with H100 specs. As a bonus, it'd be nice to actually code up the MXP. One thing we'd learn from doing this is whether my time estimates were reasonable.
The biggest issue with the scheme proposed in this note is that it relies on cryptographic primitives like iO that would add an enormous amount of overhead to the computation in practice. Getting a version of this scheme that works without iO is an important open problem. Specializing to only handling certain classes of computations, such as matrix multiplications and ReLUs, might be helpful for getting a more practical algorithm (or maybe not), and may be acceptable.
The current analysis of the verifying arbitrary computations scheme is very non-rigorous, and plausibly false. It'd be good to fix this.
Published on December 14, 2025 10:23 PM GMT
Most predictive work on AI focuses on model capabilities themselves or their effects on society at large. We have timelines for benchmark performance, scaling curves, and macro-level labor impact estimates. What we largely do not have are personalized forecasts that translate those trends into implications for an individual role.
At the same time, many conversations about AI and work stall at a familiar level of abstraction: some jobs will disappear, others will change, productivity will increase, and things will "shift." Such statements do not answer the question that actually matters to individuals: when does AI become capable enough to meaningfully threaten my role, given the specific tasks I do and the organization I work in?
We know that AI capabilities are improving rapidly, while adoption inside organizations is uneven, delayed, and constrained by structure. Some tasks are discretely measured and automatable, while others depend on taste and tacit knowledge and thus are subject to several layers of task-specific and organizational friction. As a result, AI impact is not a single event but a distribution over time that varies substantially across roles.
What was missing, at least for me, was a way to translate those general trends into a personal forecast, one that makes its assumptions explicit and allows them to be challenged.
Over the past year, I built a model to do that, which you can explore at https://dontloseyourjob.com.
The model is a hazard model, the same kind of model used to estimate survival and failure rates in medicine, supply chains, and employment. Instead of modeling the probability of system failure, it models the probability that AI meaningfully displaces a role, or collapses its economic value, as a function of time.
The baseline hazard is informed by METR's work on AI capability growth, including evidence consistent with exponential improvement. On top of that, I layer multiple sources of friction and amplification, including task structure, degree of tacit knowledge, coordination requirements, organizational inertia, and economic incentives. These inputs are captured through a questionnaire intended to approximate the shape of a person's actual work, rather than their job title alone.
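To make the structure concrete, here is a toy sketch of the general shape of such a hazard model. It is not the site's actual code; every coefficient and functional form below is an invented stand-in:

```python
import numpy as np

# Toy sketch of a displacement hazard model. The coefficients and functional
# forms below are invented stand-ins, not the values used on dontloseyourjob.com.

years = np.linspace(0, 10, 201)
dt = years[1] - years[0]

# Baseline hazard rises as AI capability grows roughly exponentially.
baseline_hazard = 0.02 * np.exp(0.35 * years)

# Friction / amplification multipliers inferred from the questionnaire (all made up here).
task_structure = 1.3       # > 1: work is digital and easily decomposed
tacit_knowledge = 0.6      # < 1: heavy reliance on tacit, relational judgment
org_inertia = 0.7          # < 1: slow-moving organization
economic_pressure = 1.2    # > 1: strong cost pressure to automate

hazard = baseline_hazard * task_structure * tacit_knowledge * org_inertia * economic_pressure

# Cumulative displacement probability: P(displaced by t) = 1 - exp(-integral of hazard)
cumulative = 1 - np.exp(-np.cumsum(hazard) * dt)

for t in (2, 5, 10):
    i = int(round(t / dt))
    print(f"P(displaced within {t} years) ~ {cumulative[i]:.0%}")
```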
Importantly, the model is highly assumption-sensitive. Small changes to capability growth rates, adoption lag, or task substitutability meaningfully alter the output. You may reasonably disagree about how fast models will improve, how quickly your organization will deploy them, or which parts of your role are actually indispensable.
For that reason, the model is open-source, and the interface itself exposes many of the underlying assumptions. The goal is not to produce a single “correct” forecast, but to make the structure of the problem explicit: what you are implicitly betting on when you assume your role is safe, and which uncertainties actually affect the timeline.
If you think the assumptions are wrong, you can change them, either directly in the interface or by modifying the code. The hope is that this makes discussions about AI and work less rhetorical, and more legible as disagreements about models, parameters, and evidence.
I wrote a long guide to job displacement, which I’ll copy down below (though it also lives on the site). If you have time — visit the site, interact with it, visualize your displacement risk, and integrate your own assumptions. We have many opportunities to prolong our own careers, and there are policy options we can adopt at both the state and firm levels to keep ourselves in ownership of our work. I think some of us (including myself) would prefer a human-oriented future rather than a completely mechanized one, but regardless of your viewpoint, modeling these forces helps contribute to the discourse.
Researchers at leading AI labs predict that we will reach AGI (Artificial General Intelligence) sometime within the next 3-12 years. Politicians, business executives, and AI forecasters make similar predictions. AGI, by definition, means systems that are more capable, cheaper, and faster than any human at any cognitive labor task. These systems will amplify individual productivity in the near term, but they also have the capability to displace human workers.
If you’re skeptical of that claim, you have good reason to be. “Automation will take your job” has been predicted before: by 19th century Luddite textile workers, by economists warning about tractors in the 1920s, by analysts predicting the end of bank tellers when ATMs arrived. Those predictions were mostly wrong. New jobs emerged, transitions stretched over decades, and human adaptability proved more robust than forecasters expected. Why should AI be any different?
Three factors separate AI from previous waves: speed, breadth, and the economics of cognitive labor. AI capabilities are increasing much faster than the rate at which we can upskill, and these systems aim to replace the function of human intelligence in many cognitive tasks. But AI does not need to reach “general intelligence” levels of capability to disrupt the labor market, and we are already seeing it happen in white-collar roles.
Displacement occurs under two main scenarios:
Naturally, AI capabilities are compared to the human brain, and in many respects they are far from matching the strengths of our working minds: tackling complex problems with incomplete information, continual learning, navigating emotions or relationships, and long-term coherent agency. Your role may not be displaced by AI providing the entire service of Data Analyst III, but AI might soon be able to do enough of your tasks that your organization no longer needs a full-time person in your position.
Don’t Lose Your Job is a modeling platform that measures gradual displacement in white-collar roles. The questionnaire captures your job’s task structure, domain characteristics, hierarchy position, and organizational context, then models those layers of friction and amplification against trends of AI capability growth from METR data. The model makes several assumptions about these forces, but you can (optionally) tune these coefficients in the Model Tuning section to see the effects of your own assumptions.
The model does not forecast potential government or business policies that might mandate human involvement in certain tasks or slow AI adoption. Beyond individual planning, this tool aims to inform policy discussions about maintaining human agency and oversight in the labor market.
The model is open-source. You can build your own versions by visiting github.com/wrenthejewels/DLYJ.
The history of automation anxiety is largely a history of false alarms. Understanding why previous predictions failed, and why this time the underlying dynamics may have genuinely shifted, is essential to calibrating how seriously to take current forecasts.
Economists often respond to displacement anxiety by pointing to the “lump of labor” fallacy: the assumption that there is a fixed amount of work to be done, such that automation necessarily reduces employment. Historical evidence shows this assumption is wrong.
In the early 19th century, Luddite weavers destroyed textile machinery, convinced that mechanical looms would eliminate their livelihoods. They were partially right, as hand weaving did decline, but textile employment overall expanded as cheaper cloth created new markets and new jobs emerged around the machines themselves.
A century later, agricultural mechanization triggered similar fears. In 1900, roughly 40% of American workers labored on farms. By 2000, that figure had dropped below 2%. Yet mass unemployment never materialized. Workers moved into manufacturing, then services, then knowledge work. The economy absorbed displaced agricultural workers over decades, creating entirely new categories of employment that didn’t exist when tractors first arrived.
The ATM story is also relevant. ATMs spread in the 1970s-80s, and many predicted the end of bank tellers. Instead, the number of bank tellers actually increased. ATMs reduced the cost of operating branches, so banks opened more of them, and tellers shifted from cash handling to sales and customer service. The job title persisted even as the job content transformed.
The mechanism is straightforward: automation increases productivity, which reduces costs, increases demand, and creates new jobs, often in categories that didn’t exist before. Spreadsheets enabled accountants to perform more sophisticated financial analysis and created demand for analysts who could leverage the new tools, rather than displacing analysis as a profession. Markets are adaptive, and new forms of valuable work consistently emerge.
AI-driven displacement differs from historical precedents in ways that may compress generational transitions into years.
Speed of capability growth. AI capabilities are increasing exponentially. Skill acquisition, organizational change, and policy response operate on much slower cycles, so capability growth can outpace the rate at which workers and institutions adapt. Even if AI-driven wealth is eventually redistributed, many current workers can still fall through the gap during early waves of displacement. If this happens, you may have fewer opportunities for outlier success than ever before.
Breadth of application. Tractors replaced farm labor, ATMs replaced cash-handling, and spreadsheets replaced manual calculation. Each previous automation wave targeted a relatively narrow domain. AI targets a wide range of cognitive work: writing, analysis, coding, design, research, communication, planning. There are fewer adjacent cognitive domains to migrate into when the same technology is improving across most of them at once, so the traditional escape route of “move to work that machines can’t do” becomes less available.
The economics of cognitive vs. physical labor. Automating physical tasks required capital-intensive machinery: factories, tractors, robots. The upfront costs were high, adoption was gradual, and physical infrastructure constrained deployment speed. Typewriters, computers, and the internet enhanced our cognitive abilities by speeding up the flow of information. AI replaces cognitive labor itself through software, with marginal costs approaching zero once the systems are trained. A company can deploy AI assistance to its entire workforce in weeks, not years, and some of that “assistance” has already replaced entire job functions. The infrastructure constraint that slowed previous automation waves doesn’t apply in the same way.
The “last mile” problem is shrinking. Previous automation waves often stalled at edge cases. Machines could handle the routine 80% of work but struggled with the 20% of exceptions that required human judgment, which created stable hybrid roles where humans handled exceptions while machines handled volume. AI’s capability profile is different, and each model generation significantly expands the fraction of edge cases it can handle, so “exceptions only” roles look more like a temporary phase than a permanent adjustment.
No clear “next sector” to absorb workers. Agricultural workers moved to manufacturing, manufacturing workers moved to services, and service workers moved to knowledge work. Each transition had a visible destination sector that was growing and labor-intensive. If AI automates knowledge work, what’s the next sector? Some possibilities exist (caregiving, trades, creative direction), but it’s unclear whether they can absorb the volume of displaced knowledge workers or whether they pay comparably.
The historical pattern may not completely break, as we’ll always redefine “work”:
New job categories we can’t predict. The most honest lesson from history is that forecasters consistently fail to anticipate the jobs that emerge. “Social media manager” wasn’t a job in 2005. AI is already creating new roles: prompt engineers, AI trainers, AI safety researchers, human-AI collaboration specialists, AI ethicists, AI auditors. As AI capability grows, more categories will likely emerge around oversight, customization, integration, and uniquely human services that complement AI capabilities. Our imagination genuinely fails to predict future job categories, and some current workers will successfully transition into AI-related roles that don’t yet have names.
Human preferences for human connection. Some services stay human by choice, even if AI can do them. People may still want therapists, teachers, doctors, and caregivers in the loop. Human connection carries value AI cannot replicate. We see this in practice: many shoppers seek humans for complex purchases, in-person meetings matter for relationships despite videoconferencing, and customers escalate from chatbots to humans for emotional support or tricky problems. Roles rooted in care, creativity, teaching, and relationships may keep human labor even when AI is technically capable.
Organizational friction is real. Real-world organizations are far messier than economic models suggest. Bureaucratic inertia, change management challenges, legacy systems, regulatory constraints, and organizational dysfunction slow AI adoption dramatically. The timeline from “AI can do this” to “AI has replaced humans doing this” could be much longer than capability curves suggest.
Regulatory protection. The EU AI Act and similar frameworks could mandate human oversight in high-stakes domains. Some jurisdictions may require human involvement in medical diagnosis, legal decisions, hiring, or financial advice regardless of AI capability. Professional licensing boards may resist AI encroachment.
Automation decisions are driven by capabilities and economic constraints. A firm won’t replace you with AI just because it can do your job; it will replace you when the economics favor doing so.
When a firm considers automating a role, they’re implicitly running a cost-benefit analysis that weighs several factors:
The decision simplifies to: Is (labor cost × volume × quality improvement) greater than (implementation cost + ongoing AI cost + risk of errors)? When this equation tips positive, automation becomes economically rational regardless of any abstract preference for human workers.
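That implicit calculation can be made concrete in a few lines. The numbers below are invented purely for illustration:

```python
# Toy automation break-even check. All numbers are invented for illustration.

labor_cost_per_year = 80_000   # fully loaded cost of the role
volume = 1.0                   # fraction of the role's output in scope for automation
quality_improvement = 0.9      # AI output quality relative to the human baseline

implementation_cost = 30_000   # integration / process redesign, amortized per year
ongoing_ai_cost = 6_000        # inference, tooling, and oversight per year
expected_error_cost = 10_000   # expected annual cost of AI mistakes

benefit = labor_cost_per_year * volume * quality_improvement
cost = implementation_cost + ongoing_ai_cost + expected_error_cost

print(f"benefit ${benefit:,.0f} per year vs cost ${cost:,.0f} per year")
print("automate" if benefit > cost else "keep the human")
```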
A common misconception is that AI must outperform humans to threaten jobs. AI only needs to be good enough at a low enough price, and for enough of your tasks.
Consider two scenarios:
For many business contexts, the 10% quality drop is acceptable given the 90% cost reduction. This is especially true for work that does not need to be highly reliable on its first prompt, as a senior-level employee can direct agents to draft multiple edits of tasks faster than a feedback loop with lower-level employees. The quality threshold for automation is often lower than workers assume.
This explains why displacement often begins with lower-level roles. Entry-level work typically has higher error tolerance (seniors review it anyway), lower quality requirements (it’s meant to be refined upstream), and lower absolute labor costs (making the implementation investment harder to justify for any single role, but easier when aggregated across many juniors).
A common objection to AI displacement forecasts is that current models have limited context windows and can’t hold an entire job’s worth of knowledge in memory. This misunderstands how AI systems are actually deployed. Organizations don’t replace workers with a single model instance; they deploy fleets of specialized agents, each handling a subset of tasks with tailored prompts, tools, and retrieval systems. Your knowledge about your role may not fit into one model’s context window, but it can be dispersed across system prompts, vector databases, and other systems that encode the knowledge your role relies on. The aggregate system can exceed human performance on many tasks even when individual agents are narrower than human cognition.
This architecture mirrors how organizations already function. No single employee holds complete knowledge of all company processes; information is distributed across teams, documentation, and institutional memory. As agentic systems mature, the orchestration becomes more sophisticated; agents can spawn sub-agents, maintain persistent memory across sessions, and learn from feedback loops.
Work will become more digitized through meeting transcripts, emails, project trackers, and saved drafts, and agents will gain a clearer view of how tasks are actually carried out inside an organization. Over time, this helps the system understand the practical steps of a role rather than just the final result.
These agents will learn from these accumulated examples, and they can begin to handle a larger share of routine or well-structured tasks. They also improve more quickly because new work records continuously update their understanding of how the organization prefers things to be done. This reduces certain forms of friction that once made roles harder to automate, such as tacit knowledge or informal processes that previously were not recorded.
Once one major player in an industry successfully automates a function, competitors face pressure to follow. This creates an adoption cascade:
This dynamic means that your firm’s current attitudes toward AI adoption may not predict your long-term risk. A conservative organization that resists automation today may be forced to adopt rapidly if competitors demonstrate viable cost reductions. Think about both what your company thinks about AI and how it will respond once other businesses use it.
Public and venture-backed companies face additional pressure from capital markets. Investors increasingly expect AI adoption as a signal of operational efficiency and future competitiveness. Earnings calls now routinely include questions about AI strategy, and companies that can demonstrate AI-driven productivity gains are rewarded with higher valuations.
The reverse is also true: companies that resist automation may face investor pressure, board questions, and competitive positioning concerns that push them toward adoption faster than they would otherwise choose.
AI research organization METR measures AI capabilities by the length of software engineering tasks models can autonomously complete. Even when measured against different success rates, models have demonstrated exponential growth since the launch of public-facing models, with a doubling time of roughly seven months. Extrapolating from this trend at the 50% success rate threshold, it will be less than 5 years before models can autonomously complete tasks that take humans weeks or months.
Source: METR study
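Here is the back-of-the-envelope version of that extrapolation. The current task horizon used below is an assumed round number, not an official METR figure:

```python
import math

# Back-of-the-envelope extrapolation of the METR time-horizon trend.
# The current horizon below is an assumed round number, not an official METR figure.

current_horizon_hours = 2.0   # assume ~2-hour tasks at 50% success today
doubling_time_months = 7.0    # roughly the historical doubling time

def months_until(target_hours):
    doublings = math.log2(target_hours / current_horizon_hours)
    return doublings * doubling_time_months

print(f"tasks of ~1 work-week (40 h):   ~{months_until(40):.0f} months")
print(f"tasks of ~1 work-month (160 h): ~{months_until(160):.0f} months")
```

Under these assumptions, week-long and month-long task horizons arrive in roughly 30 and 45 months respectively, which is where the "less than 5 years" figure comes from.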
METR’s benchmarks measure software engineering tasks, but displacement happens across every knowledge domain. Code is structured, digital, and verifiable, which makes software a leading indicator. Other cognitive domains will likely follow for similar task-completion times, but different domains face different translation delays.
Work that resembles software (digital, decomposable, with clear success criteria) will track closely with METR benchmarks. Work involving tacit knowledge, physical presence, or relationship-dependent judgment will lag behind. The model handles this through domain friction multipliers. Software engineering roles face minimal friction, while legal, operations, and traditional engineering roles face higher friction due to regulatory constraints, liability concerns, and less structured workflows.
The questionnaire captures four factors that determine when AI displacement becomes likely for your specific role:
The METR curve serves as the baseline for the forecasted capabilities of AI models. Then, we make assumptions about the time you spend in different task “buckets” (sorted by how long they take to complete) based on your role and hierarchy level, and we add friction to the METR curve to essentially measure: how hard is it for AI to do these tasks of different lengths? That friction is measured by your responses to the questionnaire, but you can change the weights of these multipliers in the Model Tuning section.
We also make assumptions about industry-specific friction for your tasks, and how reliable AI needs to be in order to enter that risk curve. These are tuneable in the sliders beneath the model, and you’ll notice that moving these sliders can have a pronounced effect on your displacement timeline. These forces combine into a weighted readiness score (typically around 50%, adjusted by hierarchy) that opens the automation hazard. Implementation delay and compression parameters then shift that hazard into the green curve you see in your results.
When you complete the questionnaire, the model generates a chart showing two curves over time:
The blue curve shows technical feasibility (the automation hazard without implementation delay or compression). It turns on when AI clears your job’s coverage threshold (typically ~50% of your task portfolio) based on your task mix. Digital, decomposable domains open the gate sooner; tacit/physical domains open later. Senior roles lift the threshold slightly and soften the ramp; entry-level roles lower it.
The green curve shows when you are likely to actually lose your job, accounting for real-world implementation barriers. This is the timeline that matters for planning your career. The green curve combines two displacement mechanisms:
The vertical axis shows cumulative displacement probability. A green curve reaching 50% at year 4 means there is a 50% probability of displacement within 4 years, and 50% probability you remain employed beyond that point. Steep curves indicate displacement risk concentrates in a narrow window, while gradual curves spread risk over many years. Early divergence between curves signals high compression vulnerability.
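As a rough illustration of how these pieces fit together (a readiness gate producing the blue curve, plus delay and damping to get the green one), here is a toy sketch. The functional forms and parameters are stand-ins, not the site's real model:

```python
import numpy as np

# Toy illustration of the blue (technical feasibility) and green (actual
# displacement) curves. Functional forms and parameters are stand-ins,
# not the site's real model.

dt = 0.1
years = np.arange(0, 11, dt)

# Fraction of the role's task portfolio AI can cover, rising over time.
coverage = 1 / (1 + np.exp(-(years - 3)))   # logistic ramp, midpoint at year 3
threshold = 0.5                              # readiness gate (~50% of tasks)

# Blue curve: the automation hazard switches on once coverage clears the threshold.
blue_hazard = np.where(coverage > threshold, 1.0 * (coverage - threshold), 0.0)
blue = 1 - np.exp(-np.cumsum(blue_hazard) * dt)

# Green curve: the same hazard, delayed and damped by implementation friction.
delay_steps = int(round(1.5 / dt))           # 1.5-year implementation delay
green_hazard = 0.8 * np.concatenate([np.zeros(delay_steps), blue_hazard[:-delay_steps]])
green = 1 - np.exp(-np.cumsum(green_hazard) * dt)

print(f"blue curve reaches 50% around year {years[np.argmax(blue >= 0.5)]:.1f}")
print(f"green curve reaches 50% around year {years[np.argmax(green >= 0.5)]:.1f}")
```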
Alex: Compressed out in ~1-1.5 years.
Alex is a junior developer: writing code, fixing bugs, documenting changes. The work is fully digital and breaks into clean pieces. As AI tools improve, senior engineers absorb Alex’s workload. They ship faster with AI assistance, and the backlog of junior-level tickets shrinks.
Alex is at the bottom level of all software engineers at his company, and eventually AI amplifies enough of his colleagues so that his contributions aren’t worth his salary to his firm anymore.
Jordan: Protected for 7+ years.
Jordan is a management consultant with years of strong client relationships. His deliverables are technically digital (slides, memos, etc.) but he spends a large portion of his time in face-to-face meetings, and often has to draw on tacit knowledge about unique cases when advising clients. His clients are considering AI-driven displacements in their own firms, so they have unique challenges that were previously not considered in the consulting market. Each project needs a custom approach, and while Jordan uses AI tools to assist his planning, only he can be trusted to advise on broad change management. Compression risk is nearly zero, and Jordan’s business will benefit from the AI displacement wave.
Sarah: Medium risk, 3-5 year timeline.
Sarah is a mid-level accountant, and her work involves processing invoices, reconciling statements, and preparing journal entries. The work is mostly digital and it’s somewhat structured, but it requires human judgement: matching vendor names, deciding when to escalate a discrepancy, and calling coworkers for audit assistance. She handles “tickets” just like Alex, but they require more context to complete.
While these timelines may seem fast, the trendline for model capabilities is not certain to hold (which is why we allow you to tune it in the model). Current forecasts extrapolate from recent trends, but compute scaling may hit limits, algorithmic progress may slow, or AI may hit capability ceilings. In their paper “Forecasting AI Time Horizon Under Compute Slowdowns,” METR researchers show that capability doubling rate is proportional to compute investment growth. If compute investment decelerates, key milestones could be delayed by years.
That said, even if growth slows, substantial capability growth has already occurred and will continue. For current workers, the question is whether a plateau happens before or after their jobs are affected. The historical 7-month doubling has held steady from 2019-2025, and more recent 2024-2025 data suggests the rate may be accelerating to roughly 4-month doubling.
Source: METR forecast (arxiv). Thanks to Joel Becker.
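For a sense of how sensitive these milestones are to the doubling period, here is a small extrapolation sketch. The starting and target horizons are illustrative placeholders rather than METR's published figures.

```python
import math

# Sketch: extrapolating AI task time horizons under different doubling periods.
# The starting and target horizons are placeholders for illustration,
# not METR's published estimates.

start_horizon_hours = 2.0    # assumed horizon of tasks AI can complete today
target_horizon_hours = 40.0  # e.g. roughly a week of focused human work

def years_to_reach(target: float, doubling_months: float) -> float:
    """Years until the horizon grows from the start value to the target,
    doubling every `doubling_months` months."""
    doublings_needed = math.log2(target / start_horizon_hours)
    return doublings_needed * doubling_months / 12.0

for months in (7, 4):
    years = years_to_reach(target_horizon_hours, months)
    print(f"{months}-month doubling: ~{years:.1f} years to a {target_horizon_hours:.0f}h horizon")
```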
You cannot control AI capability growth, market competition, or how your industry responds. You do have some influence over where you sit in that process and how much time you have to adjust. Individual action will not fix AI displacement by itself, but it can buy you runway, options, and a better position from which to push for collective change.
In the near term, there are some useful actions that can buy you time and flexibility.
Learn how your workflows complement AI. Understand which parts of your work AI already handles well, where you add value, and how you can structure tasks so that both strengths work together. People who can design and oversee AI-enabled workflows are more useful to their organizations and better prepared as roles shift.
Shift toward higher-context work where you can. Roles that involve judgment, coordination, and relationships are harder to automate than pure execution, especially in the short run. Moving part of your time toward context-heavy or integrative work can slow the impact on you, even if it does not remove it.
Increase the cost of removing you. Strong performance, reliability, and being central to coordination does not make you safe, but it creates organizational friction. When cuts happen, people who are trusted, visible, and hard to replace often receive more time, better options, or softer landings.
Explore other routes for agency. Skills that transfer across companies, a professional network, a record of public work, and some financial buffer all make it easier to adapt if your role changes quickly. These do not change the aggregate risk, but they change how exposed you are to it.
These are high-agency moves, but they mostly shift your place on the curve rather than changing the curve itself. They are worth making because they give you more control over your own landing and more capacity to engage with the bigger problem.
If AI continues to compress and automate large parts of knowledge work, there will not be enough safe roles for everyone to move into. At that point, the question is less about how any one person adapts and more about how we share the gains and the risks: who owns the systems, who benefits from the productivity, and what happens to people whose roles are no longer needed.
How societies respond to AI-driven displacement will be shaped by policy choices actively being debated. Transition support programs (extended unemployment benefits, government-funded retraining, educational subsidies) face questions about whether retraining can work fast enough when target jobs are also changing rapidly. Human-in-the-loop mandates could require human involvement in high-stakes decisions regardless of AI capability, preserving employment by regulation. Automation taxes might slow adoption and fund transition support, while wage subsidies could make human labor more competitive. Universal basic income would decouple income from employment through regular payments funded by productivity gains. Broader ownership models might distribute AI capital through sovereign wealth funds or employee ownership requirements. And labor organizing could negotiate over automation pace, transition support, and profit-sharing.
Beyond these, societies will likely need to reckon with the nature of at-will employment, and redefine what “good performance” is at work. If we provide little comparative value to firms once AI reaches high levels of capability, firms under our current economic model have little incentive to reward us with continued employment and new opportunities for labor. But we built AI, and our laborers provide the crucial data needed for pretraining, so I think there is a system we can develop that routes its success to people, rather than to corporations that become increasingly mechanized.
Perhaps it’s a democratized input model, where current laborers become rewarded with an ownership value of the models they help train. This will provide scaled returns for our existing workforce, especially as agents clone and expand within our organizations, and it follows the existing idea within capitalism of being rewarded for economically contributing. It doesn’t solve for new grads who enter the workforce, and it needs some tinkering, but it may be a more tangible path beyond “we’ll just distribute UBI derived from strong AI.” UBI (or even the Universal Basic Compute idea that’s been floating around) is a strong idea for a social safety net, but it likely will not be developed in time to catch people who face the early waves of unemployment.
You can engage by informing your representatives, supporting research organizations like Epoch, the Centre for the Governance of AI, and the Brookings Future of Work initiative, participating in professional associations, and contributing worker perspectives to public discourse.
Thank you for reading and engaging with my work. Building this model took a lot of time, and translating a fast-moving field into something that feels clear, useable, and tunable was harder than I expected. I hope it helped you understand the dynamics behind your results and gave you a better sense of what the next few years might look like.
This project is completely unrelated to my main job, but I will continue to evolve it as this technology does. I believe AI is one of the most significant dangers to our society in a long time, and job loss is only one of the many issues we face from unchecked/unregulated growth. We have to continue developing tools to defensively accelerate the pace of change.
2025-12-15 06:21:28
Published on December 14, 2025 10:21 PM GMT
I have goals that are much easier to reach with a powerful political machine. Probably a lot of other people around here share them. (Goals include “ensure no powerful dangerous AI get built”, “ensure governance of the US and world are broadly good / not decaying”, “have good civic discourse that plugs into said governance.”)
I think it’d be good if there was a powerful, high integrity political machine with good epistemics, trying to make those things happen.
Unfortunately the naive ways of doing that would destroy the good things about the rationalist intellectual scene. This post lays out some thoughts on how to have a political machine with good epistemics and integrity.
Recently, I gave to the Alex Bores campaign. It turned out to raise a quite serious, surprising amount of money.
I donated to Alex Bores fairly confidently. A few years ago, I donated to Carrick Flynn, feeling kinda skeezy about it. Not because there's necessarily anything wrong with Carrick Flynn, but, because the process that generated "donate to Carrick Flynn" was a self-referential "well, he's an EA, so it's good if he's in office." (There might have been people with more info than that, but I didn’t hear much about it).
Ultimately, I kinda agreed, but, I wouldn't have publicly defended the choice. This was during FTX era where money was abundant and we were starting to attract grifters (i.e. hearing explicit comments like "oh man all you have to do is say you care about causes X and Y and you can get free money.") It was not sustainable to keep donating to people "because they were EA" or "because they mouthed the words 'AI Safety'."
Alas, there are important political goals I want to accomplish. Political goals require getting a lot of people moving in lockstep. Rationalists hate moving in lockstep. For good reason. At the time, my solution was “donate to Carrick Flynn, but feel skeezy about it.”
One option is to leave this to "The EA community" rather than trying to invoke "the rationalists." Alas, I just... don't really trust the EA community to do a good job here. Or, rather, them succeeding at this requires them to lean into the rationalist-y traits, which would reintroduce all the same allergies and handwringing. My political goals are nuanced. I don't want to go the route of environmentalism that bans nuclear power and ends up making things worse.
AI Safety isn't the only thing you might want a powerful political bloc with good epistemics to support. Maybe people want to be ambitious and do something much more open-ended than that. But, this is the motivating case for why it’s in my top-5 things to maybe do, and it’s useful to dissect motivating cases.
I think many people around here agree we need to stop the development of unsafe, overwhelmingly powerful superintelligence. (We might disagree about a lot about the correct steps to achieve that).
Here are some ways to fail to do that:
That's rough. Wat do?
What I think Wat Do is, figure out how to build a political machine that is powerful enough to have leverage, but, is still based on a solid foundation of epistemic trust.
How do that?
Well, alas I dunno. But it feels very achievable to me to do better than both "don't play the game" or "naively play the game, short-sightedly." Here are some thoughts on that.
This is difficult for lots of reasons. Here are some easier to articulate ones:
A lot of the world runs on implicit alliances, where people agree to recommend each other as good people, and not to say bad things about each other.
One big reason ornery rationalists are like “politics is Real Hard to do without intellectual compromise” (while other people might be like “I see why you’d be worried, but, you seem to be exaggerating the worry”) is that this dynamic is very pernicious. It fucks with epistemics in a way that is invisible if you’re not actively tracking it, and the mutual reputation alliances don’t want you to be tracking it, so it requires active effort to make it possible to track.
There are good (naive) reasons to gain power. You do need political power to get shit done. But, also, people feel an attraction to power for normal, boring, selfish reasons. It is easy to deceive yourself about your motivations here, and about what your motivations will be in the future when you’ve enmeshed yourself in a political alliance.
Lots of ways of gaining power involve Mutual Reputation Alliances, or other compromises.
(Oliver Habryka has argued to me that there are ways of gaining conditional power (as opposed to unconditional power) which involve less compromise. This post is mostly about gaining unconditional power but seemed worth flagging the difference)
There is some public info available, but for “will this broad political project work longterm”, it’s going to depend on things like “does so-and-so keep their word?”, “will so-and-so keep keeping their word if the political situation changes, or they see an opportunity for power?”
This requires subtle details about their character, which you can only really get from people who have worked with them a bunch, who are often part of a mutual reputation alliance, won’t want their name attached to the info if you share it, and will only give you the info if you can share it in a way that won’t make it obvious that they were the one sharing it.
In addition to “embedded in a mutual reputation alliance”, powerful people can be vindictive if you try to share negative information about their character. And, since they are powerful, if they want to hurt you, they probably can.
People withhold bad information about powerful people out of fear, not just loyalty.
(One specific case of this is "they can sue you for libel, or at least threaten to.")
There will be rival actors who don’t want your preferred candidate to be elected or your preferred policy to be implemented. They will actively make it hard for you to do this. They may do so with underhanded tactics that are difficult to detect, just under the threshold for feeling “unreasonable” so it’s hard to call out.
It also means that sometimes you want to raise funds or maneuver in secret.
Mutual reputation alliances are costly because they radiate out of the alliance. In practice, there is not a sharp divide between the politicians and the rationalists. The people rallying support and finding private information will (by default, probably) radiate some pressure to not question the narrative, and to avoid making someone regret having shared information.
See also: Entangled Truths, Contagious Lies
This is hard-mode enough when we’re just trying to be a corner of the internet talking about some stuff. It’ll matter a lot more if you are trying to achieve a political goal.
See: Politics is the Mind-Killer and Politics is hard mode
A lot of these problems aren’t that bad if you’re doing a one-time political maneuver. You might make some enemies and risk a bit of tribal groupthink, but, eh, then you go back to doing other things and the consequences are bounded.
But, the whole point of building a Good Epistemics/Integrity political machine is to keep persistently doing stuff. This will attract enemies, if it succeeds. It will also attract…
People will try to manipulate you into giving them money. Some instances of this might be well intentioned. You need to be able to defend against it anyway.
If it’s known that there’s a High Integrity/Epistemics Political Machine that’s on the lookout for sociopaths and subtle corruption, people will try to mouth the words that make it sound like they are avoiding sociopathy/subtle-corruption. This includes both candidates, and people running the rallying-campaigns to get candidates funded.
“I believe in AI safety” or “I care about epistemics” is an easy password to fake.
An example of a harder password to fake is “I have made many public statements about my commitments that would look bad for me if I betrayed them.”
For people running PACs or other orgs, “here are the incentives I have constructed to make it hard for myself / The Org to betray its principles” is even better. (i.e. OpenAI’s nonprofit governance structure did make it at least difficult, and take multiple years, for the org to betray its principles).
A recurring difficulty with political fundraising is that, early on, it often needs to happen in a low-key way, since if rival politicians know your plan they can work against it. But,
I think part of the process should be, there are people involved in low-key-private-political-fundraising who are playing a watchdog role, helping establish mutual knowledge of things like whether a given politician...
Top Tier:
Mid Tier:
Minimum Tier:
I didn't trust the people advocating for Alex Bores to have noticed sociopathy. But, he did in fact pass the RAISE Act. Scott Wiener tried to pass SB 1047 twice and succeeded the second time, sorta. They might still betray their principles later, but, their track record indicates they are at least willing to ever put their actions where their mouth was, and the bills looked pretty reasonable.
That seemed good enough to me to be worth $7000 (given the other analysis arguing that the money would help them win).
If I imagine a high Integrity Political Machine, I think it probably involves some sort of evaluator watchdog who a) privately researches and circulates information about candidates during the Low Key period, and b) writes public writeups afterwards that allow for retrospective sanity checking, and noticing if the political machine is going astray.
I'd want the watchdogs to split up observations and inferences, and split up particular observations about Cause A vs Cause B (i.e. make it easy for people who want to support AI safety but don’t care about veganism, or, vice versa, to track which candidates are good by their lights, rather than aggregating them into a general vector of Goodness).
The actual motivating example here was thinking about supporting PACs, as opposed to candidates.
I don’t actually understand PACs very well. But, as I understand it, they need to be deciding which candidates to support, which means you need all the same apparatus for evaluating candidates and thinking through longterm consequences.
Any broad political org needs a person in charge of it who is responsible for making sure it is high integrity. I have a particularly high bar for this.
If you want to run a PAC or org that gets money from a hypothetical High Epistemics/Integrity Political Machine, it is not merely your job to “not lie” or “not mess up in the obvious ways.” Politics is hard mode. You need to be tracking the incentives, tracking whether your org is evolving into a moral maze, and proactively work to make sure it doesn’t get eaten by an egregore.
This requires taste, as well as effort.
Taste is hard to acquire. Often, “just try harder” won’t realistically work. If you don’t have good enough judgment, you either need to find another person to be in charge, or you might need to go try doing some projects that will enable you to learn from experience and become wiser / more cynical / etc.
An earlier draft described this as “GiveWell for retroactive political action assessment”. But, the word “GiveWell” implies there is an org. Orgs bundle up people’s reputation together, such that every person involved feels pressure to not risk the reputation of everyone else at the org. This has been a failure mode at OpenPhil (from what I understand).
Watchdogs will need to make some tradeoff on gaining access to private information, vs making various promises and compromises. But, they can do that individually, so the results aren’t as contagious.
I would ideally like everyone involved to have maximally good epistemics. But, in order for this to succeed, you need some people who are really good at rallying large numbers of people to do a thing (i.e. donate to candidates, vote). Rallying is a different skill from maintaining-good-epistemics-while-evaluating. It's hard to be good at both. It's hard because a) it's just generally harder to have two skills than one, and b) "rallying" just does often require a mindset that is more Mindkiller-y.
So, I would like at least some people who are spec'd into "watchdog-ing"/"evaluation", who are not also trying to rally people.
I want rally-ers to be more careful on the margin. I think it is possible to skill up at inspiring conviction/action without having distorted beliefs. But, I think the project can work even if the rally-ers aren't maximally good at that.
One way for this to fail is for individual candidates to turn out to be grifters who extract money, or sociopaths who end up net-negative.
Another way for this to fail is for the system to become subtly corrupted over time, making individual little compromises that don't seem that bad but add up to "now, this is just a regular ol' political machine, with the word 'epistemics/integrity' taped to the front door."
There need to be watchdogs who are modeling the whole ecosystem, and speaking out if it is sliding towards failure.
It's not enough for watchdogs to periodically say "hey, this candidate seems sus" or "we seem to be sliding towards worse epistemics collectively." The people voting with their ballots or wallets need to actually care. This means a critical mass of them need to actually care about the system not sliding towards corruption.
This could be an entirely separate idea from “watchdog evaluators”, but it dovetails nicely. For candidates that a powerful high-integrity political machine is trying to help, it probably makes sense to have public prediction markets about whether they will keep their word about various promises.
If individual watchdogs gain a track record for successfully noticing “so and so is going to betray their principles” and “so and so probably won’t betray their principles”, those people can also then maybe be trusted more to represent private information (“I talked to Candidate Alice, and I really do get a sense of them knowing what they’re talking about and committing to Cause A”).
The main problem with doing that publicly is that powerful people might be vindictive about it. I’m most worried about people being vindictive when they kind of grew up with the rationalsphere, so having rationalists criticize them or estimate them as low integrity feels personal, rather than just a cost-of-doing-business as a politician.
I do think the norm and vibe should be “this is a cost of doing business. If you want money/support from the high integrity political engine, you should expect people to be evaluating you, this is nothing personal, the standards are very exacting and you may not meet them.”
A problem I'm not 100% sure how to handle, is getting sued for evaluating people/orgs as sociopathic.
I'm not sure what the legal standing is, if a prediction market reads:
"Within 5 years, I will judge that OpenAI's nonprofit board no longer has teeth"
or
"Within 5 years, I will think [Candidate X] betrayed a campaign promise."
or:
"Within 5 years, CEO Charlie will have violated one of these principles they established."
A serious political engine could have money to defend against lawsuits, but, also, the more money you have, the more it's worth suing you. (I think at the very least having someone who specializes in handling all the hassle of getting sued would be worth it).
My hope is that, unlike previous instances of people trying to claim an individual did bad things, this project is in some sense "big enough to be clearly worth protecting" (whereas a random person in a vague community scene being maybe a bad actor doesn't have anyone incentivized to make it their job to defend them).
Sometimes people get annoyed that LessWrong isn’t letting them do a particular kind of rallying, or saying something with one voice. They read Why Our Kind Can't Cooperate and are like “okay, so, can we have a culture where people publicly support things and there isn’t this intense allergic criticism?”.
I think maybe there should be another forum or tool for doing that sort of thing. But, it’s definitely not LessWrong’s job. LessWrong definitely should not be synonymous with a political agenda.
I think posts like these are fine and good:
I feel wary of posts like this:
I think the difference is:
Posts that argue the object level of ‘this candidate or project will have good/bad consequences’ are fine.
Posts that are trying to change what is socially acceptable to think/say on LessWrong are NOT fine.
Posts that are talking about what is socially acceptable to think/say on LessWrong ARE fine. The difference between this and the previous one can be subtle. I still find John Wentworth's comments from Power buys you distance from the crime pretty good:
> Who’s at fault for the subcontractor³’s slave labor?
> [...] My instinct says DO NOT EVER ASK THAT QUESTION, it is a WRONG QUESTION, you will be instantly mindkilled every time you ask "who should be blamed for X?".
> ... on reflection, I do not want to endorse this as an all-the-time heuristic, but I do want to endorse it whenever good epistemic discussion is an objective. Asking "who should we blame?" is always engaging in a status fight. Status fights are generally mindkillers, and should be kept strictly separate from modelling and epistemics.
> Now, this does not mean that we shouldn't model status fights. Rather, it means that we should strive to avoid engaging in status fights when modelling them. Concretely: rather than ask "who should we blame?", ask "what incentives do we create by blaming <actor>?". This puts the question in an analytical frame, rather than a "we're having a status fight right now" frame.
To be clear, LessWrong doesn't prevent you from posting rallying / status-fighty / social-reality-manipulating posts. But, it is set up to discourage it on the margin, and prevent a lot of the upside from trying to do it. You won't be on the frontpage, you won't get curated, etc. If it seems like you're doing it in a way that mods think is bad for the culture, we might yell at you.
(But also note, I did not run this by the rest of the Lightcone team and we have a policy of speaking for ourselves, since orgs don't actually have "beliefs")
Just to restate all the premises in one place:
A political machine is a system that coordinates lots of people to produce a political outcome. (If you don't need to coordinate lots of people, you just have a political actor, not a machine)
It's hard to build a high integrity/epistemics political machine, because:
And this all needs to keep working longterm, if you want a longterm powerful impact, so, it needs to be robust to all the adversarial failure modes.
Some potential solutions:
2025-12-15 05:10:32
Published on December 14, 2025 9:10 PM GMT
My house has radiators for heat. There are three heating loops ("zones") but the house has more than three rooms and it's not very well balanced. Fixing this properly involves hiring a plumber, but it turns out we can make it much better with just a small fan!
Radiators heat passively: they warm the nearby air, which rises and allows cooler air to flow in. This new air then warms, and the cycle repeats. This works pretty well: no electricity, no noise, just smooth heating.
What we can do with a fan, though, is accelerate this process in a targeted way, at the cost of a small amount of electricity, hardware, and noise. By fanning the radiator we want more output from, we can bring the system into balance.
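For anyone tempted to automate the fan, the control logic is just a thermostat with a little hysteresis. This is a rough sketch, and `read_room_temp_c` and `set_fan` are hypothetical placeholders for whatever sensor and relay hardware you actually have.

```python
import time

# Minimal sketch of thermostat-style fan control for one radiator.
# read_room_temp_c() and set_fan() are hypothetical placeholders for
# whatever temperature sensor and fan relay / smart plug you actually have.

SETPOINT_C = 20.0
HYSTERESIS_C = 0.5  # avoid rapid on/off cycling near the setpoint

def read_room_temp_c() -> float:
    raise NotImplementedError("replace with your temperature sensor")

def set_fan(on: bool) -> None:
    raise NotImplementedError("replace with your fan relay / smart plug")

def control_loop() -> None:
    fan_on = False
    while True:
        temp = read_room_temp_c()
        if temp < SETPOINT_C - HYSTERESIS_C:
            fan_on = True    # room is cold: boost this radiator's output
        elif temp > SETPOINT_C + HYSTERESIS_C:
            fan_on = False   # room is warm enough: let it heat passively
        set_fan(fan_on)
        time.sleep(60)
```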
I'm now tempted to put efficient little fans on all the radiators in the house, network them together, add temperature and occupancy sensors, predict future occupancy, and see how much more efficient I can make the whole system. But while this sounds like a fun project, and possibly even something someone could turn into a product that pays for itself in saved money and fuel, [1] this is really not something I should take on right now.
[1] I did some looking and there are (a) commercial radiator booster fans, and (b) smart radiator valves, but nothing that ties this all together.
2025-12-15 02:55:06
Published on December 14, 2025 6:55 PM GMT
A core subproblem in ontology identification is to understand why and how humans and agents break down their world models into distinct, structured concepts like tables, chairs and strawberries. This is important because we want AIs to optimize for the real-world things we care about, but the things we care about are expressed in terms of latent variables in our world models. On the other hand, when an AI plans to achieve its goals in the world, that planning refers to its own internal representations, which means we must understand how those internal representations correspond to latent variables/concepts in our ontologies to ensure that the AI is optimizing for the right things in the right way.
From an external perspective, if our only goal was to explain the functional behavior of the world model, it would seem perfectly valid to just treat the world model as one undifferentiated blob of black box program that outputs predictions about the world. A black box program might even be the simplest explanation for the world model's behavior. There doesn't seem to be any obvious reason why we'd need to decompose this black box into well-structured concepts, or why such decomposition would line up consistently with our own ontologies in a meaningful way.
So behavior alone doesn't seem sufficient to pin down structured concepts. We might be tempted to just project our own ontology onto the black box program by taking our current understanding of how the world works and trying to draw correspondences to different parts of the black box that we've somehow carved up. But this approach won't be robust to ontology shifts: The AI will learn and discover all sorts of new things about the world that we haven't observed or even conceived of, including new laws of physics or novel abstractions alien to us, and these are precisely the kinds of things that won't fit into whatever ontology we're trying to force onto the AI's world model.
If projecting our ontology onto the black box program doesn't work, we need to start from the black box description of the world model and derive the ontology from it somehow. This seems like a really challenging task; it's not even clear what desiderata would let us pin down a decomposition into well-structured abstractions that remains robust to ontology shifts. However, the achievement of "deriving structure starting from a black box" isn't completely unprecedented:
Bayesian networks: A Bayesian network factorizes a joint probability distribution over many random variables according to a DAG, with each variable depending directly only on its parents. The conditional independence relationships among the variables constrain which DAGs are compatible with the distribution, so structure can be read off from relational properties of the variables.
Algorithmic Markov Condition: The Algorithmic Markov Condition is essentially the AIT version of Bayesian networks. Instead of random variables, we have binary strings, and instead of factorizing a joint probability distribution, we factorize the joint complexity of the strings. We can think of the Algorithmic Markov Condition as specifying the "optimal computational order for compressing strings." For instance, if $x_1, x_2, x_3, x_4$ are four strings and their joint complexity factorizes according to the DAG below, we can interpret that as saying: the optimal way to compress $x_1, x_2, x_3, x_4$ together is to start with $x_1$, then find the shortest program that computes $x_2$ from $x_1$ as well as the shortest program that computes $x_3$ from $x_1$. Once we obtain $x_2$ and $x_3$, we find the shortest program that computes $x_4$ from $x_2$ and $x_3$. Formally, up to additive constants (where $x^*$ denotes a shortest program computing $x$):
$$K(x_1, x_2, x_3, x_4) \stackrel{+}{=} K(x_1) + K(x_2 \mid x_1^*) + K(x_3 \mid x_1^*) + K(x_4 \mid x_2^*, x_3^*)$$
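For reference, the general statement for any DAG over strings $x_1, \dots, x_n$ takes the same shape; this is the standard form from the algorithmic-causality literature rather than notation taken from this post, with $pa_j^*$ denoting shortest programs for the parents of $x_j$ in the DAG.

```latex
% Algorithmic Markov Condition, general form (up to additive O(1) constants).
% pa_j^* denotes the shortest programs computing the parents of x_j in the DAG.
K(x_1, \dots, x_n) \;\stackrel{+}{=}\; \sum_{j=1}^{n} K\!\left(x_j \,\middle|\, pa_j^{*}\right)
```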
In both of these examples, it seems much easier to derive and talk about structure once we have multiple entities, such as multiple variables or multiple strings, because once you have multiple entities, you can talk about the relationships between them, such as conditional independence. Through continually probing properties of these relationships we can eventually derive a kind of "structure" by stitching the relational properties together. These formalisms aren't just surface-level analogies to our task in ontology identification either: The approximate version of Bayesian networks is the theoretical basis for natural latents, while the approximate version of the Algorithmic Markov Condition forms the basis for the Solomonoff version of natural latents.
Bayesian networks and the Algorithmic Markov Condition are still not quite right for what we want to do in ontology identification because they already assume a particular decomposition into variables or binary strings, and these decompositions are exactly the sorts of things we want to derive in the first place. We want to know why strawberries and chairs are the kinds of things we tend to model as latent variables (instead of, e.g., one half of a strawberry combined with one half of the table as a latent variable). Of course we still want to discover the causal relationships between these variables or understand how to derive higher-level concepts from them, but the first step is to derive these variables themselves without assuming them upfront.
So for ontology identification, we can't start with a particular decomposition into latent variables like we do in Bayesian networks or the Algorithmic Markov Condition. The fact that we had multiple variables was a big reason why we could derive structure in the first place, by probing the relationships between different variables. However, while we can't assume a particular decomposition of the world model, we often have multiple agents with different world models or multiple plausible hypotheses about the world. We can potentially leverage this multiplicity to derive structure in a somewhat similar way as the Algorithmic Markov Condition.
In particular, when we say that two agents share the same abstraction, one mental picture we might have is that the computations of both agents' world models "route through" the same abstraction. For instance, when two agents share the concept of strawberries, one possible meaning is that they share the same process for computing beliefs about strawberries, but they might differ on how they compute the implications of those beliefs such as downstream predictions or actions:
Similar to the Algorithmic Markov Condition, we can use a directed acyclic graph to talk about the optimal way to compress a collection of world models together. However, instead of the "optimal order for computing strings", we try to capture how multiple world models can be represented as the composition of overlapping abstractions. Taking the example in the image above, suppose that we have two agents represented as two functions $f_1$ and $f_2$, each of which takes sensory observations (about strawberries) and returns some action or prediction about the world. When we say that the collection $\{f_1, f_2\}$ factorizes according to the DAG above, we mean that there exist three abstractions/functions $g$, $h_1$, and $h_2$ that satisfy the following: composing them recovers each world model's behavior ($f_1 = h_1 \circ g$ and $f_2 = h_2 \circ g$), and the joint complexity of the world models equals the sum of the complexities of the abstractions ($K(f_1, f_2) \stackrel{+}{=} K(g) + K(h_1) + K(h_2)$).
We can also imagine a much larger collection of world models that factorize according to a much more complicated DAG, but the rules are the same: each arrow in the DAG corresponds to an "abstraction"; the joint complexity of the world models is equal to the sum of the K-complexity of all abstractions. Each world model $f_i$ is assigned a final abstraction node $h_i$, and the DAG specifies how information propagates: each abstraction receives information from its "parents" specified by the DAG and passes information to its "children", until reaching the final abstraction nodes $h_i$. Each final node then produces the output/predictions of its world model.
Going back to our strawberry example, the two conditions that we impose in our factorization imply that: (1) this factorization is one of the optimal ways to represent $f_1$ and $f_2$ from a compression perspective, and (2) the factorization breaks down the computation of each world model into a hierarchy of overlapping abstractions. By tracing down the arrows in the DAG, we can find the abstraction that is shared among both agents ($g$). This is a concrete property that we can verify even though we only have access to the "black box" functional behavior of $f_1$ and $f_2$.
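As a toy illustration of what "routing through a shared abstraction" could look like computationally, here is a minimal sketch. The function names and the strawberry features are invented for illustration and are not from the post.

```python
# Toy sketch: two "world models" f1 and f2 that share an abstraction g
# and differ only in their downstream heads h1 and h2.
# All names and features here are illustrative, not taken from the post.

def g(observation: dict) -> dict:
    """Shared abstraction: extract strawberry-relevant features from a raw observation."""
    return {
        "is_strawberry": observation.get("color") == "red" and observation.get("has_seeds", False),
        "ripeness": observation.get("redness", 0.0),
    }

def h1(belief: dict) -> str:
    """Agent 1's head: turn the shared belief into a prediction."""
    return "will taste sweet" if belief["is_strawberry"] and belief["ripeness"] > 0.7 else "unclear"

def h2(belief: dict) -> str:
    """Agent 2's head: turn the same shared belief into an action."""
    return "pick it" if belief["is_strawberry"] and belief["ripeness"] > 0.5 else "leave it"

def f1(observation: dict) -> str:
    return h1(g(observation))  # f1 = h1 ∘ g

def f2(observation: dict) -> str:
    return h2(g(observation))  # f2 = h2 ∘ g

obs = {"color": "red", "has_seeds": True, "redness": 0.9}
print(f1(obs))  # prediction routed through the shared abstraction g
print(f2(obs))  # action routed through the same abstraction g
```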
What's interesting about this generalization of the Algorithmic Markov Condition is that it gives us a concrete formalization of "redundant computation across multiple world models/hypotheses", and redundant computations are exactly the sort of "shared interface" that we need for ontology identification: