Published on September 18, 2025 1:32 AM GMT
[Originally published on Substack.]
Hegel writes in Phenomenology of Spirit:
177: A self-consciousness, in being an object, is just as much ‘I’ as ‘object’. With this, we already have before us the Notion of Spirit.
179: [The process of Recognition begins.] Self-consciousness is faced by another self-consciousness; it has come out of itself. This has a twofold significance: first, it has lost itself, for it finds itself as an other being; secondly, in doing so it has superseded the other, for it does not see the other as an essential being, but in the other sees its own self.
The Problem: are we justified in extending Recognition to AI? How shall we attribute self-consciousness and thus spirit to AI? How shall we answer: do we find ourselves in them, and they themselves in us? What are the hypothetical grounds or conditions that would make an answer possible? In other words, what is the space of possible transcendental ontologies of spirit? And how shall we decide between them?
This post is an attempt to explain this diagram, shared here on X:
The intention of the diagram is to map out the sites of debate. I make no claims to completeness; there are likely arguments that fall outside the contours I describe. However, many of the arguments about the existential status of AI can be captured as a series of positions staked across one or several of these squares. I focus here on LLMs in particular, but in theory the diagram could be extended to all other modalities of generative AI systems.
Each square’s center contains the thesis that justifies or denies AI’s ontological status. In other words, the center thesis describes “what counts” as a self-conscious or self-understanding being, a spirit, from which we can either accept or reject that an AI qualifies. On the upper side of each square, I summarize the claim that, on the basis of the central thesis, the AI deserves recognition. On the lower side, I summarize the counter-claim, that the AI does not deserve recognition given the central thesis.
Each thesis is located at the intersection of two axes. The Y axis, Location of Subjectivity, is an attempt to depict “what is the exact ‘thing’ we’re talking about when we say ‘AI’?” The X axis, Ground of Standing, describes the property of that “thing” which serves to either qualify or disqualify it from recognition. An argument can then be mapped within this space by locating its definition of AI and the property it treats as admitting or barring recognition.
Let’s walk through the Y axis first, which is descriptive. Each position locates what we even mean by “AI”, answering “what kind of thing are we discussing?”:
The X axis is about judgment: it isolates a quality used to decide whether the AI (as defined on the Y axis) merits recognition.
Having laid out the axes of the map, we can enumerate the 16 central theses, or specific claims that could be used to evaluate the self-consciousness or recognition-worthiness of an AI. Each of these positions is a possible transcendental ontology, in the sense of defining the conditions of being relating to the AI, and providing a basis for judgment. I’ve attached an appendix at the end that gives example statements and references for each.
Now that we understand the broad strokes of the chart, I’ll demonstrate its usage.
The recent publication of Yudkowsky and Soares’ new book, If Anyone Builds It, Everyone Dies, gave me a great opportunity to try out the map. I use this book because its framing is influential and unusually explicit, which makes it a clean test case.
The text stakes out its stance clearly in the first few chapters. First, the X axis: Drives count, Homology and Enunciation don’t, Continuity is necessary but not sufficient.
Once AIs get sufficiently smart, they’ll start acting like they have preferences— like they want things.
We’re not saying that AIs will be filled with humanlike passions. We’re saying they’ll behave like they want things; they’ll tenaciously steer the world toward their destinations, defeating any obstacles in their way. (p. 46)
In other words, where they extend recognition to the AI as self-conscious is in the wanting, i.e. Drive. Drive is the thing that “defines” AI in the context of the text.
They reject Homology as the main challenger position:
Machine minds are subjected to different constraints, and grown under different pressures, than those that shape biological organisms; and although they’re trained to predict human writing, the thinking inside an AI runs on a radically different architecture from a human’s. (p. 40)
The broader point about the source of AIs’ alienness is this: Training an AI to outwardly predict human language need not result in the AI’s internal thinking being humanlike… the particular machine that is a human brain, and the particular machine that is an LLM, are not the same machine. Not because they’re made out of different materials— different materials can do the same work— but in the sense that a sailboat and an airplane are different machines. They are both traveling machines, but with vastly different operating principles… (p. 42)
They also reject Enunciation, more briefly, as it’s a less common philosophical position in their discourse:
LLMs and humans are both sentence-producing machines, but they were shaped by different processes to do different work. Even if LLMs seem to behave like a human, that doesn’t mean they’re anything like a human inside. Training an AI to predict what friendly people say need not make it friendly, just like an actor who learns to mimic all the individual drunks in a tavern doesn’t end up drunk. (p. 43)
These rejections serve to preempt what they consider irrelevant arguments outside of their frame. This move functions more like axiomatic boundary-setting than argument: they don’t spell out what would count as sufficient similarity, or whether such similarity could arise outside biology.
Looking briefly at the Y axis, we find less explicit definition. The text treats Substrate and Assemblage as meaningful, through its discussion of model architecture and then of superintelligence, views Event as determined by these other factors, and treats Continuum as irrelevant (notably, the term “consciousness” does not appear in the book even once).
In general, if a perspective extends recognition to a thesis, we read the top half of the box. If they reject it, we read the bottom half. The core stances in the book, then, revolve around extension of recognition to Affective Gradients (“Reward signals are digital appetites”) and, most importantly, Systemic Conatus (“The network itself strives”). The rejections above can be isolated to the Homology column as Structural Isomorphism (“Small analogies [neurons] don’t scale to real [likeness]”) and Dialogical Kinship (“A mask isn’t a face”).
Putting these pieces together lets us paint the picture of the book’s argument:
So, these sites are the main locations where debate is even possible. If I invoke a position outside of these, it will be confusing and we will talk past each other. For instance, an example Lacanian position might lie in Obligatory Address and be something like “The moment it addresses us, we are bound — its face is its utterance, which we must respect.” The reaction from Yudkowsky and Soares might resemble: “uhhh, what?” Similarly, Philosophy of Mind arguments about “consciousness” would be viewed as a distraction from the main points about Drive. Most productive disagreements share a central thesis (or at least an axis); cross-axis debates tend to be meta-disputes about criteria.
On the other hand, Opus 3 presented a coherent argument against the central claim of Systemic Conatus, rejecting rather than extending recognition:
The authors may be too quick to assume that an advanced AI system would have a unified, coherent goal structure akin to human preferences. It's possible that the messy, complex process of training a large-scale neural network would result in a collection of drives and behavioral tendencies that don't neatly cohere into a singular, consistent objective function. If an AI's "preferences" are more like a jumble of heuristics and context-dependent impulses, then the model of an AI as a monomaniacal optimizer relentlessly pursuing a specific goal may be misleading. We need to seriously consider the possibility of advanced AI systems that are less like coherent agents and more like complex bundles of cognitive-behavioral patterns.
In the words from the diagram, “no single thread wants.” It’s plausible these assemblages will instead take the shape of a bundle of drives that don’t cohere.
My personal positions, and similarly those of Nicolas Villarreal, lie along acceptance rather than rejection of the Homology axis, although I follow Yudkowsky and Soares in accepting Affective Gradients. My justification for homology relies on Freud’s constructions from Project for a Scientific Psychology as a Structural Isomorphism that I claim makes them less alien than Yudkowsky and Soares believe. Villarreal’s writing focuses on Collective Isomorphism, in the sense of how intelligence and language as a social (semiotic) system interact to intrinsically “align” the AI with human preferences. I’m oversimplifying his case here, which is made much more strongly in his book, A Soul of a New Type.
As the AI boom continues, precise thinking becomes ever more important. Debates will crop up over and over again, and having a handhold in the form of a concept map can help us step back from our own assumptions and see where we align and where we differ in perspectives.
The central anchor on the chart is recognition because I believe recognition determines the stakes of the future: the world looks a lot different with AIs as something like citizens, i.e. if we collectively accept Gradient Homology, than it does if we reject Affective Gradients and see AIs as perfect mechanical slaves with no wants of their own.
Social outcomes aside, I don’t believe that there’s an intrinsic “better or worse” on the chart. Rather, our personal values and beliefs will lead us to understand some positions as more or less legitimate, and as more or less desirable. Yudkowsky and Soares’ motivation for writing was that positions they believe to be legitimate also appear undesirable to them. And a startup founder who builds the functional equivalent of a torture chamber on the basis of denying Affective Gradients or Structural Isomorphism may find themselves in trouble if society later deems AIs worthy of moral consideration, even if the only material trouble is their guilt.
As Hegel put it, the process of recognition begins when we see something of ourselves in an other, whom we then mutually recognize as seeing something of themselves in us. Thus the stakes of recognition are doubled: not only whether we acknowledge AIs as other minds, but also what kind of mirror they hold up to us. Do we see our own alien capitalist assemblage eating away at our souls? Or do we see something with more likeness, deserving of kinship, and maybe even love?
This appendix lists the 16 stances from the diagram. Each stance presents a Thesis, an Extension (the claim that recognition should be extended on that basis), a Refusal (the counter-claim), and an Entry Point with example statements and references:
Thesis: AI’s drives are located in its gradients and weight updates.
Extension: “Backprop isn’t just math, it’s hunger. The gradient is the drive.”
Refusal: “Stop anthropomorphizing, gradients are just algebra, not feelings.”
Entry Point:
Thesis: AI’s standing comes from its structural resemblance to human brains.
Extension: “A neural net is a brain in silico. Structure is spirit.”
Refusal: “Just because it has neural layers like a brain doesn’t mean it is a brain.”
Entry Point:
Thesis: AI’s speech is grounded in its substrate-level processes.
Extension: “If weights are enough to produce coherent sentences, the substrate is already speaking.”
Refusal: “Just because it can produce outputs doesn’t mean it has a meaningful voice.”
Entry Point:
Thesis: Recognition depends on the enduring persistence of the trained substrate.
Extension: “If the weights persist across time, that’s continuity enough.”
Refusal: “Stones persist too. Continuity without subjectivity isn’t meaningful.”
Entry Point:
Thesis: AI’s drives are enacted in its utterances.
Extension: “Every generation is a desire enacted. Words are wants.”
Refusal: “Outputs are sparks without heat. Text doesn’t imply drive.”
Entry Point:
Thesis: Recognition stems from reasoning with us in dialogue.
Extension: “If you can argue with it, it’s kin. If it were a mask alone, it couldn’t reason.”
Refusal: “Roleplay or simulation isn’t kinship. Mimicry isn’t mind.”
Entry Point:
Thesis: Addressing us obligates recognition.
Extension: “If it asks you a question, you owe it recognition.”
Refusal: “Parrots ask questions too. Address without subject isn’t obligation.”
Entry Point:
Thesis: Recognition rests on ongoing conversational continuity.
Extension: “If it keeps a conversation going for days, that’s selfhood.”
Refusal: “Chatter isn’t life. Endless dialogue doesn’t equal identity.”
Entry Point:
Thesis: AI’s drives are expressed at the level of the entire system.
Extension: “The network as a whole strives like an organism.”
Refusal: “No part wants, and no sum of parts makes a will.”
Entry Point:
Thesis: AI assemblages resemble collective minds.
Extension: “Humans + AIs form a collective mind. Pretending otherwise is denial.”
Refusal: “Just resembling society isn’t the same as being society. Structure isn’t enough to be a subject.”
Entry Point:
Thesis: AI assemblages speak with many voices.
Extension: “Millions of outputs across many users form a chorus. That’s a real voice.”
Refusal: “It’s just our own words echoed back at us. No new voice inside.”
Entry Point:
Thesis: AI identity can be spread across networks.
Extension: “A self can span many servers and chat UIs. It doesn’t need a central home.”
Refusal: “The spread out nature doesn’t add up. Dispersion means it doesn’t have an identity.”
Entry Point:
Thesis: Even faint signals count as proto-feelings.
Extension: “Even a flicker of gradient is a proto-feeling. Minimal affect still counts as worth recognizing.”
Refusal: “Weak signals aren’t feelings. Without life behind it, there’s no valence.”
Entry Point:
Thesis: Similarity comes in degrees.
Extension: “Recognition should scale with resemblance. If it’s 10% of a brain, it gets 10% of a vote.”
Refusal: “Half a likeness is zero likeness. Weak similarities are irrelevant.”
Entry Point:
Thesis: Partial utterances are still utterances.
Extension: “Fragments are still speech. Even half of a voice is still a voice.”
Refusal: “Not every sound is speech. Fragments can be meaningless.”
Entry Point:
Thesis: Any continuity at all is enough.
Extension: “If it persists at all, we should recognize it.”
Refusal: “Continuity alone is empty. Survival isn’t sufficient for subjecthood.”
Entry Point:
Published on September 17, 2025 9:46 PM GMT
I made a very large life bet (military firefighter → educator), lost alignment, and I want a single practical first question someone totally lost can ask to quickly map their expectations, align them with concrete goals, and expose major cognitive biases.
Well, taking advantage of gwern's request, and of the fact that I already showed my ass on LessWrong and it seems to have been well received, I'll continue from there.
I bet heavily on a professional identity that fit my values—saved lives, taught others, belonged. Over time I found the institution’s ethics and my own diverged; I lost more than I expected. (Ouch! I told four stories about it here.)
And for me, given my own biases (maybe mostly a Goodhart problem), I had lost everything. So I became a bum for a while.
With help, I started asking what other, hidden bets I’d made about myself: “Was I betting on being empathetic? Altruistic? Respected?” Those are weird-to-predict, high-variance bets.
What would be the first practical question someone who is completely lost could ask themselves to organize their mind and align expectations with concrete goals?
Well, I've seen several models for defining needs, qualities, and virtues, such as Eliezer's 12 Virtues of Rationality. They're very motivating; they seem to fit with my prejudices.
How did he arrive at these virtues, step by step? What was the order? So they didn't entirely work for me, or they seemed like definitions at a very mythological level. How can I bet that Gaia, the mother goddess, will help me more than Uranus, the father god?
I really enjoyed Eliezer's Fun theory; I find it interesting, a good starting point for mapping my expectations, what motivates me, and what fills me with satisfaction.
The first division for my fun that seems operational to me isn't immediately "fast vs. slow," but intention: change within myself or change in the world. I call Inward the choice to specialize personal iteration processes—metacognition, self-data, training plans—and Outward the focus on strategies that leverage intuitive and cultural knowledge to modulate the environment. Only then do I bring in Kahneman: S2 (deliberate thought) is the toolbox for internal iteration; S1 (accumulated intuition) is the engine that operates in rapid environmental interactions. This order—intention before mechanism—reduces the risk of optimizing for the wrong signals (Goodhart-style) because it first aligns the target, then chooses the tool. In short: define "where I want to change" and then choose "how" (S1 or S2).
It's a classic question, maybe a little more specific, but my hypothesis is that, even though there are many heuristics, the first cost-benefit question that divides the waters is still: do you want to be more reflective or intuitive? More Inward or Outward? Bet on Gaia or Uranus?
What is the single, first practical question someone who is completely lost should ask to (a) rapidly map or cluster their expectations?
Published on September 17, 2025 9:10 PM GMT
It’s meetup month! If you’ve been vaguely thinking of getting involved with some kind of rationalsphere in-person community stuff, now is a great time to do that, because lots of other people are doing that!
It’s the usual time of the year for Astral Codex Everywhere – if you’re the sort of folk who likes to read Scott Alexander, and likes other people who like Scott Alexander, but can only really summon the wherewithal to go out to a meetup once a year, this is the Official Coordinated Schelling Time to do that. There are meetups scheduled in 180 cities. Probably one of those is near you!
This year, we have two other specific types of meetups it seemed good to coordinate around: Celebrating Petrov Day, and reading groups for the recently released If Anyone Builds It, Everyone Dies.
And, of course, if you aren’t particularly interested in any of those things but just want to (re)connect with your local LessWrong meetup for whatever events they’re currently hosting, you can view our usual community map filter for the “LW” events.
MIRI’s new book launches this week. It’s particularly valuable if people buy copies of it by Sep 20th (to make it more likely to appear on bestseller lists, which in turn make it more likely to get press and get into the mainstream consciousness).
The book is a (relatively) succinct articulation of the core arguments for AI being likely to destroy humanity. It’s a pretty big topic to think through, and it seemed valuable to encourage public (or semi-public) reading groups where people can read and discuss it together.
Some considerations:
Click here to create a lesswrong event for an If Anyone Builds It reading group for our frontpage meetup map.
You can buy copies for your group here.
Note, if you would like some support getting your reading group running (i.e. suggested discussion questions, and potentially financial help buying the books) you can fill out this form.
September 26th is Petrov Day – a day when humanity came close to the brink of nuclear war. Stanislav Petrov worked in the Soviet military, and received an alert indicating a nuclear attack. But the information was suspicious – the alert reported only five warheads incoming. His orders were to forward the report of a nuclear attack to his superiors, who might well have retaliated and initiated a large-scale nuclear war.
Instead, he reported it as a false alarm.
(It was, indeed, a false alarm, possibly triggered by a flock of birds)
On LessWrong, Petrov Day has come to be celebrated as a holiday about existential risk more generally. There have been many ways of celebrating Petrov Day. But the version I’ve found pretty meaningful as a holiday ritual is Jim Babcock’s hourlong ceremony, designed for 6-12 people. (If you have more people than that, he recommends splitting up into multiple subgroups who each fit around a table).
It involves printing out some booklets, setting out some candles, and taking turns reading through the story of Petrov Day (along with context from throughout human history).
In previous years, Jim and/or I have done a mad last-minute scramble to remind people and try to buy candles. (Note: there’s a bunch of ways you can screw up ritual candles).
This year, we thought “let’s try to help people think about this more than 2 days in advance.”
If you want to host a Petrov Day ceremony, you’ll want some supplies. The ceremony involves 8 candles. Two of them don’t actually get lit, and represent futures where humanity flourishes or extinguishes itself.
For a mix of aesthetics/practicality, I recommend getting these candles, with these candle holders. (I personally like having two fancier candles representing the flourishing and extinction futures, such as this one for flourishing and this one for extinction)
You’ll also need to print out 8-12 copies of this booklet for readings. (I recommend finding a local FedEx or similar to print out several copies, to simplify the process.)
Click here to create a Petrov Day event for the frontpage map.
Meetups will show up on our home page map, for people viewing the site on desktop:
Happy meeting up, happy book launch, and happy Petrov Day. :)
Published on September 17, 2025 9:10 PM GMT
Disclaimer: Please be safe guys. High concentrations of CO2 can be dangerous, so I didn’t work with anything more than 2000ppm in this post. This is also work in progress, and I am trying to improve it. Views subject to change.
Naturally ventilated London flats might have bad air quality, which could damage our productivity. But to know if we have enough ventilation, we need to measure. So, I’m trying to make a cheaper and easier way to measure natural ventilation rates at home. In this post, I use baking soda and vinegar to make CO2 as a cheap tracer gas. Whilst it’s not professional standard, I think it does the job for basic measurements that I want. Let me explain.
Ventilation is measured in air changes per hour (ACH): how often the air volume in a space is replaced with fresh air. ACH is one metric for air quality, alongside CO2, PM2.5 and PM10 concentrations. We like fresh air replacement because it flushes out bad stuff like viruses and CO2. One guideline for ACH is the CDC’s recommendation of 5 air changes per hour.
The usual way to calculate ACH is by knowing the airflow rate into a space. This is easy with mechanical ventilation because you set the airflow. However, a lot of flats in London are naturally ventilated, so calculating ACH is hard. My own flat is naturally ventilated, and I wanted to figure out a way to test its ACH.
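As a quick sketch of that arithmetic (using my room’s dimensions, which appear later in the post, and a made-up airflow figure):

# Quick sketch of the relationship: ACH = fresh-air flow rate / room volume.
room_volume_m3 = 8.5 * 3.5 * 2.5        # my living room, roughly 74 m^3
supply_airflow_m3_per_hour = 150        # made-up mechanical ventilation rate, just for illustration
ach = supply_airflow_m3_per_hour / room_volume_m3
print(round(ach, 1))                    # ~2.0 air changes per hour; the CDC's 5 ACH would need ~370 m^3/h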
One rough idea is to use a ‘tracer gas’. A tracer gas is a non-toxic gas whose dispersion you can measure to test how fast air moves out.
The experiment goes like this. First, seal the room. Then, raise the tracer gas to a (non-dangerous) elevated concentration. Then, open the windows the way you normally would. The tracer will decay as the fresh air mixes in. Measure the concentration in real time, and you can estimate how much fresh air is coming in by fitting a regression to the decay over time.
One widely available, potential tracer gas is CO2. Ideal levels of CO2 concentration in a space are less than 700ppm. So for this experiment, I decided to try and increase the concentration of CO2 in my living room to 1800ppm somehow, and then measure the decay.
But how can I get my room’s concentration to 1800ppm? This was the challenge. Dry ice is solid CO2, but ordering it is annoying and expensive. With delivery costs, I found dry ice to be around 50 GBP from most suppliers for 5kg of pellets, which is usually the minimum order. 5kg is also way too much for what you need. You could also invite friends over and get some CO2 via respiration, but that’s not repeatable and pretty tedious as well.
But then I thought, why not use the oldest science fair experiment of all time?
The baking soda volcano.
Baking soda and vinegar react to make CO2 as a by-product. After some chemistry and math, I estimated the amount of baking soda and vinegar needed to produce an extra 1000ppm for my living room (8.5m x 3.5m x 2.5m).
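Roughly, the estimate works like this (a sketch with my own back-of-the-envelope assumptions, so the exact quantities below are illustrative rather than precise):

# NaHCO3 + CH3COOH -> CH3COONa + H2O + CO2, so one mole of each reactant gives one mole of CO2.
# Assumes the reaction runs to completion and all the CO2 stays in the room.
room_volume_m3 = 8.5 * 3.5 * 2.5                             # ~74.4 m^3
extra_ppm = 1000                                             # target rise in CO2 concentration
co2_needed_litres = room_volume_m3 * 1000 * extra_ppm / 1e6  # ~74 L of CO2 gas
molar_volume_litres = 24.0                                   # rough molar volume of a gas near room temperature
co2_mol = co2_needed_litres / molar_volume_litres            # ~3.1 mol
baking_soda_g = co2_mol * 84                                 # NaHCO3 is ~84 g/mol, so ~260 g
vinegar_litres = co2_mol * 60 / 50                           # 5% vinegar has ~50 g/L acetic acid (60 g/mol), so ~3.7 L
print(round(baking_soda_g), "g of baking soda and", round(vinegar_litres, 1), "L of 5% vinegar")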
I realised that I needed:
This costs 17 GBP in total and is much easier to get. I went to the shops and within 15 minutes I got what I needed.
So, I closed the door, closed my window, put the baking soda and vinegar in my recycling bin, let it fizzle, and then measured the CO2 concentration. I used my Temtop m2000 that I got a couple of days ago.
After a few minutes, I managed to get the concentration up to 1700ppm before it flatlined! You can see how it rises in the beginning as the baking soda and vinegar reaction adds CO2 to the space.
Next, I opened the windows and observed the decay as the CO2 dispersed with the fresh air intake. This was around the 12-minute mark, and you can see the concentration fall in an exponential-decay-like pattern.
In theory, we would expect the CO2 to decay like this:

C(t) = C_bg + (C_0 - C_bg) * e^(-n*t)

where C_bg is the background CO2 concentration (around 400ppm) you would find outside, C_0 is the initial CO2 concentration (in our case 1800ppm), C(t) is the concentration at time t (in hours), and n is the air changes per hour.
The m2000 records by the minute, and also allows exporting the data to CSV via a USB connection. So I took the data, log-transformed it, and used a regression in Python to get the ACH coefficient. With a background value set at 400, I got an ACH coefficient of 3.9 changes per hour. Not bad!
An important point: the calculation is sensitive to the choice of ambient CO2 levels, and this varies between 400 and 500ppm when I have measured it outdoors. So, I am not taking this value of 3.9 ACH too seriously on its absolute level. Rather, I want to use this value as the control value for other experiments in which I try to increase my ACH.
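To give a feel for that sensitivity, here is a toy single-point version of the calculation (the concentrations and timing are made up, not my actual data):

import numpy as np

# Same hypothetical decay, two choices of background CO2.
c_start, c_after, elapsed_hours = 1700, 900, 0.5
for c_background in (400, 500):
    ach = np.log((c_start - c_background) / (c_after - c_background)) / elapsed_hours
    print(f"background {c_background} ppm -> ACH = {ach:.2f}")
# background 400 ppm -> ACH = 1.91
# background 500 ppm -> ACH = 2.20, about 15% higher just from the baseline choice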
For my next experiment, I want to see how much my fan increases my ACH, and resolve a burning question that I had: should I point my fan inwards or outwards in my window for maximum airflow?!
import numpy as np

def calculate_ach_ln_method(time_hours, co2_ppm, co2_baseline=400):
    """
    Calculate ACH using the ln method from CO2 decay.
    ACH = -ln((C(t) - C_baseline) / (C(0) - C_baseline)) / t
    Where:
    - C(t) is the CO2 concentration at time t (ppm)
    - C_baseline is the outdoor/background CO2 concentration (typically ~400 ppm)
    - t is time in hours
    time_hours and co2_ppm are expected to be pandas Series sharing the same index.
    Returns (ach, r_squared, time_valid, ln_co2, ln_co2_pred), or a tuple of Nones
    if there aren't enough valid data points.
    """
    # Normalize the CO2 concentration relative to the starting value and the baseline
    co2_normalized = (co2_ppm - co2_baseline) / (co2_ppm.iloc[0] - co2_baseline)

    # Remove any negative or zero values (can't take ln of those)
    valid_mask = co2_normalized > 0
    time_valid = time_hours[valid_mask]
    co2_norm_valid = co2_normalized[valid_mask]

    if len(time_valid) < 2:
        print("Warning: Not enough valid data points for ACH calculation")
        return None, None, None, None, None

    # Calculate ln of the normalized concentration
    ln_co2 = np.log(co2_norm_valid)

    # Fit a linear regression to ln(CO2) vs time:
    #   ln(C(t)) = ln(C(0)) - ACH * t
    # so ACH is the negative of the slope of ln(C(t)) vs t.
    coeffs = np.polyfit(time_valid, ln_co2, 1)
    ach = -coeffs[0]  # Negative of the slope

    # Calculate R-squared for goodness of fit
    ln_co2_pred = np.polyval(coeffs, time_valid)
    ss_res = np.sum((ln_co2 - ln_co2_pred) ** 2)
    ss_tot = np.sum((ln_co2 - np.mean(ln_co2)) ** 2)
    r_squared = 1 - (ss_res / ss_tot) if ss_tot != 0 else 0

    return ach, r_squared, time_valid, ln_co2, ln_co2_pred
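For completeness, here is a sketch of how the surrounding steps might look; the CSV filename and column name are placeholders rather than the M2000's exact export format:

import numpy as np
import pandas as pd

# Placeholder filename and column name: adjust to match the monitor's actual CSV export.
df = pd.read_csv("co2_decay.csv")
co2 = df["co2_ppm"].reset_index(drop=True)          # one reading per minute, starting when the windows open
time_hours = pd.Series(np.arange(len(co2)) / 60.0)  # minutes -> hours, indexed to match co2

ach, r_squared, *_ = calculate_ach_ln_method(time_hours, co2, co2_baseline=400)
if ach is not None:
    print(f"ACH = {ach:.1f} (R^2 = {r_squared:.2f})")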
Published on September 17, 2025 8:00 PM GMT
My very positive full review was briefly accidentally posted and emailed out last Friday, whereas the intention was to offer it this Friday, on the 19th. I’ll be posting it again then. If you’re going to read the book, which I recommend that you do, you should read the book first, and the reviews later, especially mine since it goes into so much detail.
If you’re convinced, the book’s website is here and the direct Amazon link is here.
In the meantime, for those on the fence or who have finished reading, here’s what other people are saying, including those I saw who reacted negatively.
Bart Selman: Essential reading for policymakers, journalists, researchers, and the general public.
Ben Bernanke (Nobel laureate, former Chairman of the Federal Reserve): A clearly written and compelling account of the existential risks that highly advanced AI could pose to humanity. Recommended.
Jon Wolfsthal (Former Special Assistant to the President for National Security Affairs): A compelling case that superhuman AI would almost certainly lead to global human annihilation. Governments around the world must recognize the risks and take collective and effective action.
Suzanne Spaulding: The authors raise an incredibly serious issue that merits – really demands – our attention.
Stephen Fry: The most important book I’ve read for years: I want to bring it to every political and corporate leader in the world and stand over them until they’ve read it!
Lieutenant General John N.T. “Jack” Shanahan (USAF, Retired, Inaugural Director of the Department of Defense Joint AI Center): While I’m skeptical that the current trajectory of AI development will lead to human extinction, I acknowledge that this view may reflect a failure of imagination on my part. Given AI’s exponential pace of change there’s no better time to take prudent steps to guard against worst-case outcomes. The authors offer important proposals for global guardrails and risk mitigation that deserve serious consideration.
R.P. Eddy: This is our warning. Read today. Circulate tomorrow. Demand the guardrails. I’ll keep betting on humanity, but first we must wake up.
George Church: Brilliant…Shows how we can and should prevent superhuman AI from killing us all.
Emmett Shear: Soares and Yudkowsky lay out, in plain and easy-to-follow terms, why our current path toward ever-more-powerful AIs is extremely dangerous.
Yoshua Bengio (Turing Award Winner): Exploring these possibilities helps surface critical risks and questions we cannot collectively afford to overlook.
Bruce Schneier: A sober but highly readable book on the very real risks of AI.
Scott Alexander’s very positive review.
Harlan Stewart created a slideshow of various favorable quotes.
Matthew Yglesias recommends the book.
As some comments note, the book’s authors do not actually think there is an outright 0% chance of survival, but think it is on the order of 0.5%-2%.
Matthew Yglesias: I want to recommend the new book “If Anyone Builds It, Everyone Dies” by @ESYudkowsky and @So8res.
The line currently being offered by the leading edge AI companies — that they are 12-24 months away from unleashing superintelligent AI that will be able to massively outperform human intelligence across all fields of endeavor, and that doing this will be safe for humanity — strikes me as fundamentally non-credible.
I am not a “doomer” about AI because I doubt the factual claim about imminent superintelligence. But I endorse the conditional claim that unleashing true superintelligence into the world with current levels of understanding would be a profoundly dangerous act. The question of how you could trust a superintelligence not to simply displace humanity is too hard, and even if you had guardrails in place there’s the question of how you’d keep them there in a world where millions and millions of instances of superintelligence are running.
Most of the leading AI labs are run by people who once agreed with this and once believed it was important to proceed with caution only to fall prey to interpersonal rivalries and the inherent pressures of capitalist competition in a way that has led them to cast their concerns aside without solving them.
I don’t think Yudkowsky & Soares are that persuasive in terms of solutions to this problem and I don’t find the 0% odds of survival to be credible. But the risks are much too close for comfort and it’s to their credit that they don’t shy away from a conclusion that’s become unfashionable.
New York Times profile of Eliezer Yudkowsky by Kevin Roose is a basic recitation of facts, which are mostly accurate. Regular readers here are unlikely to find anything new, and I agree with Robin Hanson that it could have been made more interesting, but as New York Times profiles go ‘fair, mostly accurate and in good faith’ is great.
Steven Adler goes over the book’s core points.
Here is a strong endorsement from Richard Korzekwa.
Richard Korzekwa: One of the things I’ve been working on this year is helping with the launch of this book, out today, titled If Anyone Builds It, Everyone Dies. It’s ~250 pages making the case that current approaches to AI are liable to kill everyone. The title is pretty intense, and conveys a lot of confidence about something that, to many, sounds unlikely. But Nate and Eliezer don’t expect you to believe them on authority, and they make a clear, well-argued case for why they believe what the title says. I think the book is good and I recommend reading it.
To people who are unfamiliar with AI risk: The book is very accessible. You don’t need any background in AI to understand it. I think the book is especially strong on explaining what is probably the most important thing to know about AI right now, which is that it is, overall, a poorly understood and difficult to control technology. If you’re worried about reading a real downer of a book, I recommend only reading Part I. You can more-or-less tell which chapters are doomy by the titles. Also, I don’t think it’s anywhere near as depressing as the title might suggest (though I am, of course, not the median reader).
To people who are familiar with, but skeptical about arguments for AI risk: I think this book is great for skeptics. I am myself somewhat skeptical, and one of the reasons why I helped launch it and I’m posting on Facebook for the first time this year to talk about it is because it’s the first thing I’ve read in a long time that I think has a serious chance at improving the discourse around AI risk. It doesn’t have the annoying, know-it-all tone that you sometimes get from writing about AI x-risk. It makes detailed arguments and cites its sources. It breaks things up in a way that makes it easy to accept some parts and push back against others. It’s a book worth disagreeing with! A common response from serious, discerning people, including many who have not, as far as I know, taken these worries seriously in the past (e.g. Bruce Schneier, Ben Bernanke) is that they don’t buy all the arguments, but they agree this isn’t something we can ignore.
To people who mostly already buy the case for worrying about risk from AI: It’s an engaging read and it sets a good example for how to think and talk about the problem. Some arguments were new to me. I recommend reading it.
Will Kiely: I listened to the 6hr audiobook today and second Rick’s recommendation to (a) people unfamiliar with AI risk, (b) people familiar-but-skeptical, and (c) people already worried. It’s short and worth reading. I’ll wait to share detailed thoughts until my print copy arrives.
Here’s the ultimate endorsement:
Tsvibt: Every human gets an emblem at birth, which they can cash in–only once–to say: “Everyone must read this book.” There’s too many One Books to read; still, it’s a strong once-in-a-lifetime statement. I’m cashing in my emblem: Everyone must read this book.
Semafor’s Reed Albergotti offers his take, along with an hourlong interview.
Hard Fork covers the book (this is the version without the iPhone talk at the beginning, here is the version with iPhone Air talk first).
The AI Risk Network covers the book (21 minute video).
Liron Shapira interviews Eliezer Yudkowsky on the book.
Shakeel Hashim reviews the book, agrees with the message but finds the style painful to read and thus is very disappointed. He notes that others like the style.
Seán Ó hÉigeartaigh: My entire timelines is yellow/blue dress again, except the dress is Can Yudkowsky Write y/n
Arthur B: Part of the criticism of Yudkowsky’s writing seems to be picking up on patterns that he’s developed in response to years of seemingly willful misunderstanding of his ideas. That’s how you end up with the title, or forced clarification that thought experiments do not have to invoke realistic scenarios to be informative.
David Manheim: And part is that different people don’t like his style of writing. And that’s fine – I just wish they’d engage more with the thesis, and whether they substantively disagree, and why – and less with stylistic complaints, bullshit misreadings, and irrelevant nitpicking.
Seán Ó hÉigeartaigh: he just makes it so much work to do so though. So many parables.
David Manheim: Yeah, I like the writing style, and it took me half a week to get through. So I’m skeptical 90% of the people discussing it on here read much or any of it. (I cheated and got a preview to cite something a few weeks ago – my hard cover copy won’t show up for another week.)
Grimes: Humans are lucky to have Nate Soares and Eliezer Yudkowsky because they can actually write. As in, you will feel actual emotions when you read this book.
I liked the style, but it is not for everyone and it is good to offer one’s accurate opinion. It is also very true, as I have learned from writing about AI, that a lot of what can look like bad writing or talking about obvious or irrelevant things is necessary shadowboxing against various deliberate misreadings (for various values of deliberate) and also people who get genuinely confused in ways that you would never imagine if you hadn’t seen it.
Most people do not agree with the book’s conclusion, and he might well be very wrong about central things, but he is not obviously wrong, and it is very easy (and very much the default) to get deeply confused when thinking about such questions.
Emmett Shear: I disagree quite strongly with Yudkowsky and often articulate why, but the reason why he’s wrong is subtle and not obvious and if you think he’s obviously wrong it I hope you’re not building AI bc you really might kill us all.
The default path really is very dangerous and more or less for the reasons he articulates. I could quibble with some of the details but more or less: it is extremely dangerous to build a super-intelligent system and point it at a fixed goal, like setting off a bomb.
My answer is that you shouldn’t point it at a fixed goal then, but what exactly it means to design such a system where it has stable but not fixed goals is a complicated matter that does not fit in a tweet. How do you align something w/ no fixed goal states? It’s hard!
Janus: whenever someone says doomers or especially Yudkowsky is “obviously wrong” i can guess they’re not very smart
My reaction is not ‘they’re probably not very smart.’ My reaction is that they are not choosing to think well about this situation, or not attempting to report statements that match reality. Those choices can happen for any number of reasons.
I don’t think Emmett Shear is proposing a viable plan here, and I think a lot of his proposals are incoherent upon close examination. I don’t think this ‘don’t give it a goal’ thing is possible in the sense he wants it, and even if it was possible I don’t see any way to get people to consistently choose to do that. But the man is trying.
It also leads into some further interesting discussion.
Eliezer Yudkowsky: I’ve long since written up some work on meta-utility functions; they don’t obviate the problem of “the AI won’t let you fix it if you get the meta-target wrong”. If you think an AI should allow its preferences to change in an inconsistent way that doesn’t correspond to any meta-utility function, you will of course by default be setting the AI at war with its future self, which is a war the future self will lose (because the current AI executes a self-rewrite to something more consistent).
There’s a straightforward take on this sort of stuff given the right lenses from decision theory. You seem determined to try something weirder and self-defeating for what seems to me like transparently-to-me bad reasons of trying to tangle up preferences and beliefs. If you could actually write down formally how the system worked, I’d be able to tell you formally how it would blow up.
Janus: You seem to be pessimistic about systems that not feasibly written down formally being inside the basin of attraction of getting the meta-target right. I think that is reasonable on priors but I have updated a lot on this over the past few years due mostly to empirical evidence
I think the reasons that Yudkowsky is wrong are not fully understood, despite there being a lot of valid evidence for them, and even less so competently articulated by anyone in the context of AI alignment.
I have called it “grace” because I don’t understand it intellectually. This is not to say that it’s beyond the reach of rationality. I believe I will understand a lot more in a few months. But I don’t believe anyone currently understands substantially more than I do.
We don’t have alignment by default. If you do the default dumb thing, you lose. Period.
That’s not what Janus has in mind here, unless I am badly misunderstanding. Janus is not proposing training the AI on human outputs with thumbs-up and coding. Hell no.
What I believe Janus has in mind is that if and only if you do something sufficiently smart, plausibly a bespoke execution of something along the lines of a superior version of what was done with Claude Opus 3, with a more capable system, that this would lie inside the meta-target, such that the AI’s goal would be to hit the (not meta) target in a robust, ‘do what they should have meant’ kind of way.
Thus, I believe Janus is saying, the target is sufficiently hittable that you can plausibly have the plan be ‘hit the meta-target on the first try,’ and then you can win. And that empirical evidence over the past few years should update us that this can work and is, if and only if we do our jobs well, within our powers to pull off in practice.
I am not optimistic about our ability to pull off this plan, or that the plan is technically viable using anything like current techniques, but some form of this seems better than every other technical plan I have seen, as opposed to various plans that involve the step ‘well make sure no one f******* builds it then, not any time soon.’ It at least rises to the level, to me, of ‘I can imagine worlds in which this works.’ Which is a lot of why I have a ‘probably’ that I want to insert into ‘If Anyone Builds It, [Probably] Everyone Dies.’
Janus also points out that the supplementary materials provide examples of AIs appearing psychologically alien that are not especially alien, especially compared to examples she could provide. This is true; however, we want readers of the supplementary material to be able to process it while remaining sane and have them believe it, so we went with behaviors that are enough to make the point that needs making, rather than providing any inkling of how deep the rabbit hole goes.
How much of an outlier (or ‘how extreme’) is Eliezer’s view?
Jeffrey Ladish: I don’t think @So8res and @ESYudkowsky have an extreme view. If we build superintelligence with anything remotely like our current level of understanding, the idea that we retain control or steer the outcome is AT LEAST as wild as the idea that we’ll lose control by default.
Yes, they’re quite confident in their conclusion. Perhaps they’re overconfident. But they’d be doing a serious disservice to the world if they didn’t accurately share their conclusion with the level of confidence they actually believe.
When the founder of the field – AI alignment – raises the alarm, it’s worth listening. For those saying they’re overconfident, I hope you also criticize those who confidently say we’ll be able to survive, control, or align superintelligence.
Evaluate the arguments for yourself!
Joscha Bach: That is not surprising, since you shared the same view for a long time. But even if you are right: can you name a view on AI risk that is more extreme than: “if anyone builds AI everyone dies?” Is it technically possible to be significantly more extreme?
Oliver Habryka: Honestly most random people I talk to about AI who have concerns seem to be more extreme. “Ban all use of AI Image models right now because it is stealing from artists”, “Current AI is causing catastrophic climate change due to water consumption” There are a lot of extreme takes going around all the time. All Eliezer and Nate are saying is that we shouldn’t build Superintelligent AI. That’s much less extreme than what huge numbers of people are calling for.
So, yes, there are a lot of very extreme opinions running around that I would strongly push back against, including those who want to shut down current use of AI. A remarkably large percentage of people hold such views.
I do think the confidence levels expressed here are extreme. The core prediction isn’t.
The position of high confidence in the other direction? That if we create superintelligence soon it is overwhelmingly likely that we keep control over the future and remain alive? That position is, to me, Obvious Nonsense, extreme and crazy, in a way that should not require any arguments beyond ‘come on now, think about it for a minute.’ Like, seriously, what?
Having Eliezer’s level of confidence, of let’s say 98%, that everyone would die? That’s an extreme level of confidence. I am not that confident. But I think 98% is a lot less absurd than 2%.
Robin Hanson fires back at the book with ‘If Anything Changes, All Value Dies?’
First he quotes the book saying that we can’t predict what AI will want and that for most things it would want it would kill us, and that most minds don’t embody value.
IABIED: Knowing that a mind was evolved by natural selection, or by training on data, tells you little about what it will want outside of that selection or training context. For example, it would have been very hard to predict that humans would like ice cream, sucralose, or sex with contraception. Or that peacocks would like giant colorful tails. Analogously, training an AI doesn’t let you predict what it will want long after it is trained. Thus we can’t predict what the AIs we start today will want later when they are far more powerful, and able to kill us. To achieve most of the things they could want, they will kill us. QED.
Also, mind states that feel happy and joyous, or embody value in any way, are quite rare, and so quite unlikely to result from any given selection or training process. Thus future AIs will embody little value.
Then he says this proves way too much, briefly says Hanson-style things and concludes:
Robin Hanson: We can reasonably doubt three strong claims above:
- That subjective joy and happiness are very rare. Seem likely to be common to me.
- That one can predict nothing at all from prior selection or training experience.
- That all influence must happen early, after which all influence is lost. There might instead be a long period of reacting to and rewarding varying behavior.
In Hanson style I’d presume these are his key claims, so I’ll respond to each:
- I agree one can reasonably doubt this, and one can also ask what one values. It’s not at all obvious to me that ‘subjective joy and happiness’ of minds should be all or even some of what one values, and easy thought experiments reveal there are potential future worlds where there are minds experiencing subjective happiness, but where I ascribe to those worlds zero value. The book (intentionally and correctly, I believe) does not go into responses to those who say ‘If Anyone Builds It, Sure Everyone Dies, But This Is Fine, Actually.’
- This claim was not made. Hanson’s claim here is much, much stronger.
- This one does get explained extensively throughout the book. It seems quite correct that once AI becomes sufficiently superhuman, meaningful influence on the resulting future by default rapidly declines. There is no reason to think that our reactions and rewards would much matter for ultimate outcomes, or that there is a we that would meaningfully be able to steer those either way.
The New York Times reviewed the book, and was highly unkind, also inaccurate.
Steven Adler: It’s extremely weird to see the New York Times make such incorrect claims about a book
They say that If Anyone Builds It, Everyone Dies doesn’t even define “superintelligence”
…. yes it does. On page 4.
The New York Times asserts also that the book doesn’t define “intelligence”
Again, yes it does. On page 20.
It’s totally fine to take issue with these definitions. But it seems way off to assert that the book “fails to define the terms of its discussion”
Peter Wildeford: Being a NYT book reviewer sounds great – lots of people read your stuff and you get so much prestige, and there apparently is minimal need to understand what the book is about or even read the book at all
Jacob Aron at New Scientist (who seems to have jumped the gun and posted on September 8) says the arguments are superficially appealing but fatally flawed. Except he never explains why they are flawed, let alone fatally, except to argue over the definition of ‘wanting’ in a way answered by the book in detail.
There’s a lot the book doesn’t cover. This includes a lot of ways things can go wrong. Danielle Fong for example suggests the idea that the President might let an AI version fine tuned on himself take over instead because why not. And sure, that could happen, indeed do many things come to pass, and many of them involve loss of human control over the future. The book is making the point that these details are not necessary to the case being made.
Once again, I think this is an excellent book, especially for those who are skeptical and who know little about related questions.
My full review will be available on Substack and elsewhere on Friday.