I posted a quick take a few days back claiming AI moratoriums won't work, and people seemed to really disagree with it. I honestly thought it was a commonly held opinion, but the more I engage with LessWrong, the more I realize the way I'm approaching things seems, at least to me, different.
So let me elaborate with a refined approach and a proper introduction. The problem of constraining AI development is a social problem, and I believe we should treat society as a sociophysical system governed by natural societal laws. Society isn't a collection of people who collectively control what society does; rather, a society is a collection of people with qualified collective influence over society. My point, summed up, is that there are things we as a society simply can't do. Just as Jupiter can't suddenly start orbiting backwards because of the laws of physics, we can't change society in a manner that breaks the sociophysical dynamics of the system.
Think of this analogy: electrons are to physical systems as humans are to social systems. Electrons can behave very differently from how their encompassing physical system behaves. Let the electron be part of, say, a metal teaspoon. If you throw the teaspoon against a wall, it collides and bounces off, probably with a loud, jarring noise. But the electron, being a quantum particle, has a non-negligible probability of quantum tunneling through a sufficiently thin barrier, because electrons exist as probability clouds. The teaspoon can't quantum tunnel through the wall; it's constrained by laws of motion that exist only at the macroscopic scale.
It's the same with society. Humans are like electrons, existing under completely different governing laws than the larger object they compose. Assuming a simple relationship of humans being "in control", or even assuming some kind of human-society feedback loop, misses the point. These two things exist in different worlds altogether. Some macroscopic properties of society are not influenceable, because they exist as properties of what a society is.
An Ad-Hoc Model
Now let me get my hands dirty and justify my position on AGI moratoriums with a model of my own design created to get my point across formally. I worked on this over the course of maybe 2 days, so there's bound to be a mistake somewhere.
Generalizing, define an irreversible social rupture as a society-level event which causes a disregard of past social norms, laws, and customs. It is an event from which there is no going back. If such an event happens, society is altered forever. There are endless examples:
Alexander the Great's conquest of the Middle East
Suez Canal Crisis in 1956
The Invention of Gunpowder
AGI
Leibniz creating calculus (his notation is better so fight me Newton fans)
etc.
Each of these events caused society to shift so significantly that returning to society pre-event was impossible. These are akin to irreversible processes from physics. You can't unbreak an egg just like you can't give the British back their empire.
Further, these events were sometimes set in motion by individuals, but most often by organizations. This is the viewpoint I will take for this model, using organizations as the atom instead of individual people. Organizations are collections of people with clearly defined goals and methods, such that they persist beyond the individual members that make them up. An organization can still exist 100 years later, even if everyone who founded it is long dead.
Laying the Groundwork
Consider $P$, the probability of a socially disruptive event occurring. Let there be $N$ organizations, each with a probability $p_i$ of causing a social rupture. Then

$$P = 1 - \prod_{i=1}^{N} (1 - p_i)$$

We can break $p_i$ down further into $p_i = a_i c_i$, where $a_i$ is the probability the organization pursues a social rupture and $c_i$ is the organization's inherent ability to cause a rupture should it pursue one.
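As a minimal numeric sketch of these definitions (the structure follows the formulas above; the specific numbers are invented purely for illustration):

```python
import numpy as np

# Illustrative pursuit probabilities (a_i) and capabilities (c_i) for N = 3 organizations.
a = np.array([1.0, 0.2, 0.0])   # does the org pursue a rupture?
c = np.array([0.3, 0.6, 0.9])   # ability to cause one if pursued

p = a * c                        # per-organization rupture probability, p_i = a_i * c_i
P = 1 - np.prod(1 - p)           # probability that at least one rupture occurs

print(p)   # [0.3  0.12 0.  ]
print(P)   # ~0.384
```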
Now let each organization choose whether to pursue social ruptures via an expected reward function $E_i$ such that

$$E_i = a_i \left[ c_i R_i - (1 - c_i) C_i \right] + (1 - a_i) \sum_{j \neq i} B_{ij}\, p_j$$

where

$R_i$ is the reward for causing a social rupture

$C_i$ is the cost of attempting and failing to cause a social rupture

$\sum_{j \neq i} B_{ij}\, p_j$ is the expected reward/cost from allowing another organization to cause the rupture first
Note
You can picture $B$ as a hollow matrix encoding who backstabs whom in the event of a social rupture. Negative symmetric entries for enemies, positive for allies, skew-symmetric if you want to get weird.
Then the derivative of $E_i$ with respect to $a_i$ is

$$\frac{\partial E_i}{\partial a_i} = c_i R_i - (1 - c_i) C_i - \sum_{j \neq i} B_{ij}\, p_j$$

And we get the pursuit condition

$$a_i = 1 \quad \text{iff} \quad c_i R_i - (1 - c_i) C_i > \sum_{j \neq i} B_{ij}\, p_j$$

Likewise

$$a_i = 0 \quad \text{iff} \quad c_i R_i - (1 - c_i) C_i < \sum_{j \neq i} B_{ij}\, p_j$$

We can now analyze the dynamics. Assume we have an organization that is pursuing a social rupture, so $a_i = 1$. For it to flip back to $a_i = 0$ we need

$$c_i R_i - (1 - c_i) C_i < \sum_{j \neq i} B_{ij}\, p_j$$

Breaking this down, the cost required to deter it satisfies

$$C_i > \frac{c_i R_i - \sum_{j \neq i} B_{ij}\, p_j}{1 - c_i}$$

And the problem reveals itself: as an organization grows in ability ($c_i \to 1$), the cost needed to get it to not pursue social disruption goes to infinity. Once the organization is committed and beyond a threshold of capability, it becomes impossible to stop the social rupture from occurring.
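A small sketch of that divergence, using the deterrence inequality above (the parameter values are arbitrary, chosen only to show the shape):

```python
import numpy as np

R_i = 10.0          # reward for causing a rupture
others_term = -2.0  # sum_j B_ij * p_j: expected payoff from letting others rupture first

# Minimum failure cost C_i needed to make "don't pursue" the better option,
# from C_i > (c_i * R_i - sum_j B_ij * p_j) / (1 - c_i)
for c_i in [0.5, 0.9, 0.99, 0.999]:
    required_cost = (c_i * R_i - others_term) / (1 - c_i)
    print(f"c_i = {c_i:6.3f} -> required deterrence cost > {required_cost:10.1f}")

# c_i =  0.500 -> required deterrence cost >       14.0
# c_i =  0.900 -> required deterrence cost >      110.0
# c_i =  0.990 -> required deterrence cost >     1190.0
# c_i =  0.999 -> required deterrence cost >    11990.0
```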
An Example: Competition vs Cooperation
Let's consider two societies filled with organizations. Society A is one of ruthless competition, where the only way to benefit is to step on others. Society B is the exact opposite. These two configurations control the sign of $B_{ij}$, with organizations in society A having $B_{ij} < 0$ for all $i$ and $j \neq i$. Similarly, let society B have $B_{ij} > 0$. Thus an organization in society A can expect a social rupture to hurt it if not done by itself, while an organization in society B can expect a social rupture caused by anyone to benefit everyone.
Take organization $i$ in society A. We want to know what will cause this organization to flip $a_i$ from 0 to 1 or vice-versa. So first, assume the organization is inactive, $a_i = 0$. Then the condition to stay inactive,

$$c_i R_i - (1 - c_i) C_i < \sum_{j \neq i} B_{ij}\, p_j$$

becomes, since $B_{ij} < 0$ by definition implies $\sum_{j \neq i} B_{ij}\, p_j < 0$,

$$(1 - c_i) C_i > c_i R_i + \Bigl| \sum_{j \neq i} B_{ij}\, p_j \Bigr|$$

i.e. the cost outweighs the reward plus the risk of letting others develop social ruptures. Similarly for the reverse: if

$$(1 - c_i) C_i < c_i R_i + \Bigl| \sum_{j \neq i} B_{ij}\, p_j \Bigr|$$

then the risk outweighs the cost, and the organization pursues social ruptures.

For society B it's flipped, however. In the case where an organization does not pursue social ruptures, rearranging the pursuit condition gives

$$C_i < \frac{c_i R_i - \sum_{j \neq i} B_{ij}\, p_j}{1 - c_i}$$

and once the expected benefit from others' ruptures exceeds the organization's own expected reward, the right hand side of the inequality is negative, so the condition trivially fails. Thus even with no cost to pursuing social ruptures the system remains inert; the organization has no motivation. For the reverse, when an organization in society B is pursuing a social rupture, $a_i = 1$ only works if $C_i < 0$, i.e. the pursuit of a social rupture has negative cost. In society B you need to subsidize the development of social ruptures.
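A small sketch contrasting the two societies under the decision rule above (again, the numbers are arbitrary; only the sign of the $B$ entries is doing the work):

```python
def pursues(c_i, R_i, C_i, others_term):
    """Decision rule: pursue (a_i = 1) iff c_i*R_i - (1 - c_i)*C_i > sum_j B_ij * p_j."""
    return c_i * R_i - (1 - c_i) * C_i > others_term

c_i, R_i = 0.5, 10.0

# Society A: B_ij < 0, so letting others rupture first is expected to hurt.
others_A = -5.0
# Society B: B_ij > 0, so a rupture caused by anyone is expected to benefit everyone.
others_B = +8.0

for C_i in [0.0, 10.0, 25.0, -10.0]:
    print(f"C_i = {C_i:6.1f} | A pursues: {pursues(c_i, R_i, C_i, others_A)} "
          f"| B pursues: {pursues(c_i, R_i, C_i, others_B)}")

# C_i =    0.0 | A pursues: True  | B pursues: False
# C_i =   10.0 | A pursues: True  | B pursues: False
# C_i =   25.0 | A pursues: False | B pursues: False
# C_i =  -10.0 | A pursues: True  | B pursues: True
```

With these toy numbers, society A's organization keeps pursuing until the failure cost exceeds 20, while society B's organization only pursues when the cost is pushed negative (a subsidy), matching the two conclusions above.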
Note On Future Work
I'm going to keep playing around with the model to see what falls out. If I find anything striking I'll make another post documenting more examples.
Conclusion
We live in a mixed society, where some organizations experience net cooperation and others net competition. However, when we look at the reality of the few large AI companies in a zero-sum race, it seems we have subsocieties that look a lot like society A from the previous example. The incentive to reach AGI first is massive, and some AI companies have already demonstrated cut-throat business practices, indicating they experience net competition. From this, the model paints a dark picture, depending on how close we actually are to an AI system that can cause an irreversible society-level event.
Now: AGI moratoriums. If I think they are impossible at this point, then what do we do? We follow the sociophysical dynamics toward a branching point, at which point we as humans do have the ability to affect the outcome. Think of it like a high-entropy partition of the societal configuration space: the system is at an unstable point and can go in different directions. So we lean into AGI research, hard, but take different approaches. LLM control is a dead end; we need a shift in perspective that recognizes we can't control LLMs as they are, before one is created that can cause a society-level event from which there is no going back (looking at you, Mythos). Instead, divert funding towards, and in my opinion further subsidize, AGI research that puts interpretability first. This maintains and strengthens the flow of capital, preventing economic collapse, while giving us a better shot at not going extinct.
We have to lean into the physics, not fight it.
Let me know if I made a mistake or a wrong turn somewhere. I'm open to changing my mind as well.
Back in mid-February, I posted "A research agenda for the final year", which poses a small set of basic questions. The idea is that if we can answer those questions correctly, then we might have a plan for the creation of a human-friendly superintelligent AI.
Now I want to sketch what an answer (and its implementation) could look like. There are no proofs of anything here, just several exploratory hypotheses. They are meant to provide a concrete image of what to aim for, and are subject to revision if they prove to be misguided.
The ontological hypotheses are panprotopsychism and interacting monads. Panpsychism is an ontology in which all the elementary things have minds. Panprotopsychism is an ontology in which all the elementary things lie on a continuum with "having a mind". As for interacting monads, @algekalipso made a post in January which illustrates the formal structure one might expect: a causal network that is dynamic like Wolfram's hypergraphs rather than fixed like Conway's Game of Life, and in which the monads have significant dynamic internal structure too. In terms of physical theory, these monads might be "blocks of entanglement" or "geometric atoms" or some other ultimate constituent. The postulate of panprotopsychism here means that awareness, and its more elaborate forms like consciousness or subjectivity, arises when these monads possess the appropriate internal structure.
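To make that formal structure a bit more concrete, here is a toy sketch of the kind of object being gestured at: a causal network whose connectivity is rewritten each step (in the spirit of Wolfram's hypergraphs rather than a fixed grid like Conway's Game of Life), with each monad carrying its own internal state. Everything here is invented for illustration; it is not @algekalipso's actual model.

```python
import random

random.seed(0)

class Monad:
    """Toy monad: a node carrying internal state, a stand-in for 'dynamic internal structure'."""
    def __init__(self, name):
        self.name = name
        self.state = [random.random() for _ in range(3)]

    def interact(self, others):
        # Internal state updates as a function of the states of connected monads.
        for i in range(len(self.state)):
            mean_other = sum(o.state[i] for o in others) / max(len(others), 1)
            self.state[i] = 0.5 * self.state[i] + 0.5 * mean_other

monads = [Monad(f"m{i}") for i in range(6)]
edges = [{0, 1, 2}, {2, 3}, {3, 4, 5}]  # initial hyperedges (who interacts with whom)

for step in range(5):
    # 1) Each hyperedge mediates an interaction among its members.
    for edge in edges:
        members = [monads[i] for i in edge]
        for m in members:
            m.interact([o for o in members if o is not m])
    # 2) The connectivity itself is rewritten: the network is dynamic, not fixed.
    edges = [set(random.sample(range(len(monads)), random.choice([2, 3]))) for _ in edges]

print([round(m.state[0], 3) for m in monads])
```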
One may ask why I am supposing this somewhat exotic theory of the conscious mind, in which a person is some kind of nonseparable quantum state of the brain, rather than a more conventional information-processing model, in which they are a particular virtual state machine existing more at the level of neurons than at the level of quantum physics. The reason is just that I consider the exotic option more likely. However, the reader may wish to substitute Markov blankets for monads if they prefer the conventional model.
So we have our ontology: the physical world is made of @algekalipso's process-topological monads, interacting according to some fundamental psychophysical law analogous to the rules that govern cellular automata, with consciousness occurring only in monads with the right intricate internal structure.
The ethical hypothesis is that the ultimate value system is governed by an appropriate aggregation of the valences and preferences of all the conscious monads. Valences here are pleasure, pain, and possibly other kinds of qualic intrinsic value. Preferences are included so that more abstract dispositions like judgments and decisions can be counted too. I won't concern myself with the details of the aggregation here; utilitarian theory offers many possibilities. I will merely suppose that it has been determined by some CEV-like process.
So we have a world model and a value system, a value system which is to govern the transhuman civilization of the future. How should we imagine this working? We can expect the future to be extremely complex and diverse by human standards, populated with entities at very different levels of intelligence... It may be abstract and uninspiring for some people to think of it this way, but I can envision this ultimate value system having a status in transhuman civilization like the status an economic philosophy or ideology can have in human civilization. Call this the political hypothesis. Just as the members of human civilization vary greatly in their knowledge of economics, from ignorant to expert, but all the big organizations of human society have to pay it some heed, and just as human governments monitor economic data, develop economic policy, and implement it using institutions of law and power... so too, analogous arrangements may exist throughout transhuman civilization, with respect to its ultimate value system.
This "political" scenario is not a normative proposal. The only normative thing is the ultimate value system, and it should determine the political order of the transhuman civilization that it governs. I'm just giving the vaguest of speculative sketches as to what that world would look like, and how it would run.
But now we come to the real crux. Back in the present, we live in a world where a few giant companies are pushing the capabilities of AI ever further forward, using architectures that are basically augmented transformers. Suppose that one of these companies will imminently align one of the superintelligent augmented transformers, with the world model and value system mentioned above, and that this superaligned AI goes on to be the seed of transhuman civilization. How is the alignment done?
You might suppose that this is easier than the value learning which happens at present, because it involves a rigorously specified target. It's like getting the AI to learn the rules of a game (the world model) and the conditions of victory (the value system), just as AlphaGo and AlphaZero did. By comparison, today's value learning occurs implicitly, as part of the world modeling that a new AI does, on the basis of its training corpora.
There may be two things to accomplish here. One is to create a subsystem which becomes expert at the theory of the world model and the value system. The prototype here would be all the efforts being made to use AI in formalization, hypothesis generation, theorem proving, and so on. The other is to connect that theoretical knowledge to the general practical knowledge that gets extracted from the big training corpora, so that the AI can interpret the concrete world in terms of its ultimate world model. The mechanics of this will depend on the details of the AI training pipeline used by the company in question.
Another likely consideration is that you may not wish your super-aligned super-AI to just believe dogmatically in the ultimate world model and ultimate value system. It should have an autonomous epistemic capacity for critical thought, by means of which it can test their robustness. If the world model and value system were already obtained purely by human effort, they should already have been subjected to some skeptical testing. If AI already played a role in deriving them, then they have already been subjected to some machine epistemology... I need to think further about this aspect.
That concludes the sketch for now. I have skipped over all the genuinely technical issues of AI safety that come into play when you reach the level of detailed architectures, the technical issues whose unresolved nature makes people so alarmed about the creation of superintelligence at the current level of knowledge. What I wanted to do was to outline a rather classical scenario of alignment against an exact value system, in a way that dovetails with the world of 2026. It seems like an agenda that could already be pursued to the end - I don't see any fundamental barriers. It makes me wonder what similar (but far more detailed) position papers and contingency plans may already exist inside the frontier AI companies.
Most AI policy work today functions as a literature review of technical risks. While valuable, this rarely moves the dial for a policy official who has 15 minutes to read a brief and 48 hours to make a recommendation. We wanted to test a different model: Forecasting-led, decision-ready policy advice.
The competition was simple:
Swift Centre provided forecasts across 5 AI-related scenarios, from agentic capabilities to workforce impact and autonomous weapons.
I provided a 4-page policy template similar to what is often submitted to political leaders and advisors.
Participants submitted their own policy advice, based on the forecasts, outlining the key options and their recommendation for what a stated political leader should do in response.
A team of judges with experience in energy, national security, military, international affairs and AI policy across the UK and US graded the entries. We'll be working with the authors of the highest rated pieces to get their advice directly to relevant decision makers and any potential organisations who may be looking for policy advisors.
If this cause interests you and you'd be keen to see this sort of project refined and expanded, I'd be happy to speak further.
Reasons we did this
I launched this project to address several systematic gaps (and personal frustrations) I’ve observed in the AI safety policy space.
1. Ineffective Policy
A significant amount of AI safety policy today is essentially a literature review of technical capabilities paired with high-level suggestions that conveniently avoid the trickiest implementation challenges. Though interesting to read for many in the AI safety community, in my experience this type of work has very low EV if the objective is to inform or persuade policy or decision makers, because they won't read it, or if they do, you aren't helping them solve their practical barriers (political, institutional, procedural).
2. Not Enough Work Targeting the Highest Value Parts of the Decision-Making Chain
Policy impact is often determined by where you intervene in the decision-making chain. In central governments, the process typically looks like this:
(a) Identification: An issue is identified in the world.
(b) Oxygen: The issue is amplified by civil society, research, or industry through long reports, press releases, round tables etc.
(c) Monitoring: Policy officials in government start paying attention and discussing whether to escalate.
(d) Escalation: Officials write to senior leaders to highlight the issue.
(e) Steer: Decision-makers give a response (e.g. ignore, monitor, or bring options).
(f) Advise: Officials write specific options and a recommendation.
(g) Decision: The leader makes a choice.
(h) Implementation: Officials draft law, build business cases, or initiate consultations.
Most AI safety work currently focuses on stages (a) and (b). However, I have found it incredibly successful to just go directly to (f) or even (h).
Naturally you need some (a) and (b) to create the conditions for engagement. But by providing fully drafted legislation or policy advice aligned to what they'd get from their officials, you bypass the miscommunication and mis-prioritisation risks that happen when an idea is handed off between stages. This is especially vital today given government departments have increasingly limited capacity and are crisis-driven. If you provide a tangible, finished product, and can get it in front of a decision maker, you remove many of the systematic delays that the institutions create and push the discussion to focus on the tangible choices/actions that you recommend they take.
3. Copying Think Tanks
New AI safety organisations often copy the methods of legacy think tanks whose actual impact is questionable. Frequently, a think tank's influence isn't based on the quality of its reports, but on its proximity to power (e.g. in the UK, the Centre for Policy Studies has written the same type of reports for years, yet, unsurprisingly, they are far more influential when their friends in the Conservative party are running the Government).
The good thing is, a lot of these think tanks write bad policy: overly complicated, trying to do too much, and spread across dozens of pages. So there is a great opportunity to do it better, which the AI safety field would be wise to invest in.
The truth is, policy at its core is a simple, two-element exercise:
Predicting what the future is likely to be.
Predicting how your intervention will change the likelihood of that future.
These elements are rarely made explicit in policy reports, leading to massive levels of misinterpretation between the writer and the decision-maker. Make these explicit and you can create a more tangible, action-oriented outcome.
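As a toy illustration of what making those two predictions explicit might look like (the scenario and the probabilities here are entirely invented):

```python
# Two-element policy exercise, stated explicitly rather than left implicit.
# Both probabilities are invented placeholders for a forecaster's estimates.
p_outcome_baseline = 0.30      # P(undesirable outcome by year X | no action)
p_outcome_with_policy = 0.22   # P(undesirable outcome by year X | recommended intervention)

effect_of_intervention = p_outcome_baseline - p_outcome_with_policy
print(f"Estimated reduction in likelihood: {effect_of_intervention:.2f}")  # 0.08
```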
4. The "What Next?" Question for Talent
There is a lot of money being spent on training courses and fellowships, but very few outlets for people to get their hands dirty.
Numerous people say they want to get into AI policy and governance but do not have a realistic mechanism to test their fit for such work before applying to courses or fellowships.
I facilitate a number of BlueDot Impact cohorts, and though I think the courses are great and provide a lot of value, the number one challenge participants have is the "what next?" question. These can include incredibly experienced professionals whose expertise (such as in international relations, diplomacy, or defence policy) could be instrumental in solving fundamental challenges when it comes to governing AI.
I am loath to tell them to take another course or do a fellowship unless I feel it would be especially useful - I want them to take action and start doing something. I often end up suggesting they at least start a Substack and produce content on their interest areas, because it's tangible and beneficial to their knowledge/brand, and maybe it'll be picked up.
However, I think there could be much better off/on ramps for these people to tangibly deploy their expertise and ideas so that decision makers can benefit. The above competition was one way to test this.
I’ve been getting into the Enneagram lately (recent posts). I like it because it provides a useful framework, even if a fake one, for making sense of the mental processes that generate wide swaths of human behavior, and I have several more posts planned about it. But I wasn’t always excited about the Enneagram, and in fact spent many years bouncing off it, finding its model opaque and the personality tests based on it random.
What changed my mind and made it useful was, first, reading Michael Valentine Smith’s series of posts on the Enneagram (1, 2, 3, 4), and then spending a year working with the Enneagram to slowly free myself from some of my habituated, maladaptive behaviors. I’m now fairly convinced the Enneagram is useful, but before I say why, let me say a little about what the Enneagram is.
The central thesis of Enneagram theory is that we have an “essence”, which is our natural way of relating to reality. Depending on who you ask, our essence forms before or shortly after we’re born, and it’s the seed from which our values grow. In this way, the word “essence” oversells what essence is, because it’s not actually something essential, but rather a contingent feature of our development. Perhaps a better term would be “original nature”, as in the way we naturally let ourselves be prior to any behavioral conditioning. Alas, “essence” is the jargon of the Enneagram, so I’ll stick with it.
After our essence forms, we’re almost immediately separated from it by suffering. Maybe it happens when we’re hungry and not immediately fed. Maybe it happens when we want snuggles and Mom and Dad are across the room. Or maybe it happens when we flail, scratch our own face, and can’t escape the pain. Whatever the case, we want to express our essence through our experience, that desire is stymied, and from such repeated denials we open what’s called our “core wound”—our deepest, most fundamental desire that can’t be completely fulfilled.
In time, we learn to live with our core wound by developing habitual behaviors to cope with it. These habits protect us from the wound, but also prevent us from accessing our essence. The personality we develop, which is just a pattern of habits, tries to take the place of essence, but it can never fill the same role. We are left to catch glimpses of joy when our essence shines through, but mostly live separated from it in the prison of habits we built to protect ourselves.
The Enneagram categorizes the bundle of essence, core wound, and personality habits into 9 main types. It then expands this type system from 9 to 27 by adding the concept of “wings”, and then to more by adding various epicycles. And epicycles is a good way to describe a lot of Enneagram theory, because a lot of it is post-hoc rationalization that makes no predictions and explains everything. Which poses the question: why do I think the Enneagram is useful?
First, some Enneagram theory pays rent. Just because there exist people who have used it as the basis for creating a personality theory of everything doesn’t mean the core theory is bunk. I think if we only go so far as to include the core types, wings, and the integration/disintegration lines, we get a relatively predictive theory given we have accurately determined a person’s type (which is its own difficult problem that’s outside the scope of this post). It does have gaps, hence the additional theory that’s been piled onto it, but those gaps are what we would expect of any valid psychological theory that works with patterns drawn from the highly-variable distribution of human behavior.
Second, it has value beyond prediction. The Enneagram provides a language for talking about the forces that generate habituated behavior. When I say my type is 4w5, that conveys information about how I make sense of my own life, and when I learn that a friend sees themselves as a 6w7 or a 1w9, I learn something about what it’s like to be them. This isn’t objective science, but rather a subjective system of categorizing experience and conveying that experience to others, and in this way serves both to help us better understand ourselves and to understand others by seeing how they are truly different.
Finally, on this point of categorizing experience, I see the Enneagram as a parallel to stage models in adult developmental psychology. Whereas Kegan’s or Cook-Greuter’s models aim to explain how our minds become capable of handling greater complexity of sense-making, the Enneagram can be used to model how we can become free of our habits and live our lives joyfully. Or, to put a Buddhist framing on it, if developmental psychology is the vertical dimension that takes us towards awakening, the Enneagram is the horizontal dimension that leads to liberation.
And aiding in liberation from suffering is where I think most of the value of the Enneagram lies. Liberation is a complex process that requires first understanding why you do what you do. From that understanding, you can learn to untangle patterned behavior by addressing the causes of it at the source, then learning new, more adaptive patterns that support your essence rather than protect it. In this way, you can reconnect with the simple joy of being.
We all need a break so: What is the most important chart in the world?
I decided to ask Twitter, and got a lot of good answers.
So today, with a few of my picks, I present: The Most Important Charts In The World.
You’ve got to admit it’s getting better. Better all the time. Mostly.
The Original Most Important Chart
The context for this is the METR graph, which is often given that label, where the x-axis is release date and the y-axis is the log-scale time horizon for AI models doing software tasks with a 50% or 80% success rate; usually people use the 50% graph:
If AI models continue to be able to do increasingly long tasks fully autonomously, and trends continue, this suggests we are not too far from a point where AI can do its own AI R&D, with the result of ‘rapid capability advancement,’ also known as Recursive Self-Improvement (RSI) or ‘escape velocity,’ after which… well, no one really knows, but the world presumably transforms into something even more bizarre and inexplicable, which may or may not contain humans or have any value.
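To make the "trends continue" clause concrete, here is a minimal extrapolation sketch; the starting horizon and doubling time are illustrative assumptions for the sketch, not METR's published fit:

```python
from datetime import date, timedelta

# Illustrative parameters (assumptions, chosen only to show the shape of the trend):
start_date = date(2025, 1, 1)
start_horizon_minutes = 60          # assumed 50%-success time horizon at start_date
doubling_time_days = 212            # assumed ~7-month doubling time

def horizon_on(d: date) -> float:
    """Extrapolated 50%-success time horizon (minutes), assuming pure exponential growth."""
    elapsed = (d - start_date).days
    return start_horizon_minutes * 2 ** (elapsed / doubling_time_days)

for years_ahead in range(0, 6):
    d = start_date + timedelta(days=365 * years_ahead)
    print(d, f"{horizon_on(d) / 60:8.1f} hours")
```

Whether growth stays exponential (or goes superexponential, as some argue) is exactly what is contested; the sketch only shows what a straight line on the log plot implies if it keeps going.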
While I was traveling Julia asked
me: why is Anna saying her fiddle practice is only two minutes? In
this case, two minutes was the right amount of time!
Anna (10y) and I had been fighting a lot about practice. She'd
complain, slump, stop repeatedly to make adjustments, and generally be
miserable. I'd often have to pull out "if you want to keep taking
fiddle lessons you have to practice": she loves her teacher and is
very motivated by the prospect of being good at fiddle. Still, it
would take us ages and we'd barely get through anything.
One evening when she seemed like she might be open to it I explained
that we were spending twenty painful minutes on two minutes of
material. I challenged her: if she focused, and went through with no
fussing, we'd be done in two minutes. It did turn out to be the right
time for this message, she gave it a good try, and (with a little
fussing in the middle) we were done in three minutes.
Over the next few days I continued to remind her that if she buckled
down it would go quickly, and we got into a pattern of efficient and
pleasant 2min practices. We probably continued this a bit longer than
ideal, and then I went on a trip without handing this off
well. Julia's question was a good reminder that we weren't done with
the progression.
When I came back I started gradually increasing how long we practiced.
Now that we had a good non-complainy dynamic this went well, and Anna
started learning much faster. She wanted to be able to participate in
jamming at NEFFA, worked hard at
that goal, and last weekend she got to play Coleman's March at the
annual Kids Jam:
Part of why I took a long time to start lengthening lessons, beyond
just forgetting, was that I don't want to apply too
high a marginal tax rate. If I had said "you still have to
practice the full time, even though you're getting 10x done now",
that would have been super demotivating. Instead, she got to enjoy
a few weeks of the full profits (2min practice) before gradually
working back up.
(This is just me writing about a thing that happened to work with one
of my kids. No reproducibility claims here, your fiddleage may vary!)