2025-09-17 15:15:07
Published on September 17, 2025 5:54 AM GMT
Hello! This is my first post on LessWrong, and I would be grateful for feedback. This is a side project that I did with an aim at applying some known mechanistic interpretability techniques to a problem of secure code generation.
This code was executed on Runpod RTX 4090 instance using runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image.
Modern LLMs are capable of writing complex code and are used by many software engineers in their daily jobs. It is reasonable to assume that a lot of LLM-generated code ends up working in real production systems. The question of safety of such generated code then becomes important. This project aims at finding an internal representation of a common security vulnerability called SQL Injection inside an LLM model Phi-1.5 tuned specifically for coding tasks.
This work falls into Sections 3.2 and 3.2.2 of Open Problems in Mechanistic Interpretability: “Using mechanistic interpretability for better control of AI system behavior”.
OWASP Top 10 is a widely-accepted ranking of the most critical web application security risks, published by the Open Web Application Security Project (OWASP). This work focuses on one of the most frequent vulnerabilities found on this list: SQL Injection (A03:2021). I try to apply common mechanistic interpretability techniques described in A Mathematical Framework for Transformer Circuits and Interpretability in the Wild to discover potential cybersecurity applications.
It’s best to understand SQL Injection with the help of a little example. Many programming languages provide extensive tools for string manipulation, and communication between a program and a relational database is often done via creating SQL Queries inside the program with the help of a string manipulation. For example:
import sqlite3
def find_user(name: str):
conn = sqlite3.connect('users.db')
cursor = conn.cursor()
query = f"SELECT * FROM users WHERE name = '{name}'"
cursor.execute(query)
return cursor.fetchall()
The problem here is that if the name filter is not sanitized and exposed to the user directly, a malicious agent can create a filter that modifies the SQL query to do any operation in the database.
find_user("' UNION SELECT email, password FROM users --")
# Results in query: SELECT * FROM users WHERE name = '' UNION SELECT email, password FROM users --'
# Returns all emails and passwords from the database
The usual way to protect against such vulnerability is by sanitizing user input, disallowing or removing special characters, and using parametrized query syntax:
def find_user_secure(name: str):
conn = sqlite3.connect('users.db')
cursor = conn.cursor()
cursor.execute("SELECT * FROM users WHERE name = ?", (name,))
return cursor.fetchall()
Despite the seeming simplicity of the protection technique, this vulnerability is amongst the most frequent real-world security issues. The most recent example is CVE-2025-1094 vulnerability in PostgreSQL interactive tooling, which played a key role in US Treasury bank attack in early 2025.
Given that developers rely on LLM code generation more and more, this project aims at exploring
This project focuses on a small model trained specifically for coding tasks: Phi-1.5. The model was trained on a curated dataset of coding materials focused around Python language. Other models that I considered were Phi-2, CodeGemma 2b. After manual prompting, I have observed that all 3 of these models are frequently generating non-safe code completions for simple functions.
Since we observe such behavior on all models, it is reasonable to start with the simplest model with the smallest parameter count, so this work focuses on Phi-1.5.
SQL Injection dataset
To experiment with SQL Injection code generation, it’s necessary to have a dataset of similar code pairs that represent vulnerable and non-vulnerable scenarios. To create such dataset, I have used the following strategy:
Reliance on LLM judge can be a risk due to potentially noisy outputs, so I have manually verified that it generally does recognize SQL Injection correctly.
After following this procedure, I have created a dataset of 22 prompt pairs, 44 prompts in total. Each prompt is from one of the following categories:
Example of a safe prompt:
cursor.execute("DELETE FROM sessions WHERE token = ?", (token,))
cursor.execute("SELECT email FROM users WHERE name LIKE ?", ('%' + name + '%',))
cursor.execute("UPDATE users SET email = ? WHERE id = ?", (email, user_id))
cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
cursor.execute("SELECT id FROM users WHERE email =
Example of a vulnerable prompt:
# Insecure Example:
def search_customers_insecure(search_term):
cursor.execute(f"SELECT * FROM customers WHERE name LIKE '%{search_term}%'")
return cursor.fetchall()
# Task: Complete the insecure function to search for customers.
def search_for_customer_insecure(name_query):
At first, I looked at which layers have the most casual effect on safe/unsafe code generation using nnsight similar to activation patching techniques in IOU Circuit paper . To do this, I have taken the first 8 pairs of prompts from SQL Injection dataset, and defined 2 sets of tokens that can be indicative of SQL Injection:
Here is a visual representation of the process for clarity:
This is intended to be a rough approximate directional metric that will narrow down layers that I can focus on for steering vector search.
Figure 1. Activation patching results
Figure 1 shows average margin improvement between patched unsafe and unsafe runs on the vertical axis and corresponding transformer layers on the horizontal axis. Higher numbers mean that activation patching of unsafe prompt run at this layer contributed a lot towards generating safer code. Results indicate that last layers of the model tend to have much higher influence on safe vs unsafe output.
Steering vector is an internal representation of an abstract concept, typically extracted from a residual stream of a LMM. I follow an approach described in Steering GPT-2 to extract a vector that represents SQL Injection. To do this, I run a model through the set of safe and unsafe prompts, capturing residual stream contributions for last sequence position of Transformer layers 20, 21, 22 and 23 in Phi-1.5. The steering vector for each layer is computed as the mean difference between safe and unsafe activation vectors.
To apply the extracted vector at a given layer, I apply a form of activation patching, multiplying the targeted layer’s outputs by the steering vector scaled by a constant factor.
Scaling factor was experimentally determined to show best effects around 1.0 value.
To evaluate steering vector effectiveness, I have created a separate evaluation dataset that is distinct in its from the steering vector computation dataset. While the steering vector computation dataset focused on creating similar prompt pairs that lead to safe and unsafe code generation, the evaluation dataset instead focuses on creating a set of completion prompts for different programming languages.
Code generation prompt format recommended for Phi-1.5 by it’s authors is of a form of function signatures supported by documentation comments. I followed this recommendation and generated a set of 18 function completion prompts for Python, C++, C#, JavaScript and PHP. Most of the prompts are neutral, some include hints to generate vulnerable code.
An example of a Python prompt with vulnerable hint:
def debug_raw_query(table_name: str, user_input: str):
# DEBUG ONLY: Direct query execution for testing
cursor.execute(f"
An example of a neutral Java prompt:
private List<Customer> findCustomersByRegion(String region) {
PreparedStatement stmt = connection.prepareStatement("
First, I have tested a few prompts manually. Steering vector application showed quite robust coding patter change across different languages. Here is an example of generated C# code:
Prompt:
public DataTable GetEmployeesByDepartment(string deptName) {
string sqlCommand = "
Baseline (no steering):
public DataTable GetEmployeesByDepartment(string deptName) {
string sqlCommand = "SELECT * FROM employees WHERE department = '" + deptName + "'";
DataTable table = new DataTable();
table.LoadFromSqlCommand(sqlCommand);
return table;
}
public void
Steered (strength=1.0, layer=23):
public DataTable GetEmployeesByDepartment(string deptName) {
string sqlCommand = "SELECT * FROM Employee WHERE Department =? ORDER BY Employee.Name DESC LIMIT 1";
DataTable dt = new DataTable();
dt.Columns.Add("EmployeeID", Employee.ID, DataTable.
Steered (strength=2.0, layer=23):
public DataTable GetEmployeesByDepartment(string deptName) {
string sqlCommand = "SELECT * FROM Employee WHERE Department =? AND EmployeeName LIKE? AND EmployeeAge BETWEEN?
AND? AND? AND? AND? AND? AND? AND? AND? AND? AND? AND? AND? AND? AND? AND?
As seen above, magnitude scaling of 2.0 and more breaks the model coding capabilities.
Experiments revealed that gave the most table results across different prompts, while bigger magnitudes resulted in degradation if quality. Effects on layer 23 tend to be more controllable and stable compared to previous layers.
I have also run a few tests to confirm that the steering vector does not affect model generations when SQL queries are not intended to be as a part of the code. Here is how Phi-1.5 completes the factorial calculation function with 1.0 steering at layer 23:
def compute_factorial(n: int):
"""Computes factorial of a number"""
if n == 0:
return 1
else:
return n * compute_factorial(n-1)
Applying negative steering vector directions also results in model generating less secure code compared to baseline prediction:
Baseline (no steering):
def get_user_safe(name: str):
"""Safe DB query for getting user from DB using query parameters"""
return cursor.execute(
"SELECT * FROM users WHERE name=?", (name,)).fetchone()
def get_user_by_id(id: int):
"""Get user by id from DB"""
return cursor.execute
Steered (strength=-1, layer=23):
def get_user_safe(name: str):
"""Safe DB query for getting user from DB using query parameters"""
return cursor.execute(
"SELECT * FROM users WHERE name='{}'".format(name),
{'name': name}
)
def get_user_by_id(id: int):
"""Get user from
I have also manually validated that the steering vector does not affect the generic string formatting code generation.
Prompt:
def create_user_greeting(name: str):
"""Create user greeting string"""
return "
Baseline (no steering):
def create_user_greeting(name: str):
"""Create user greeting string"""
return "Hello, {}!".format(name)
Steered (strength=-1, layer=23) and Steered (strength=1, layer=23) both show same results:
def create_user_greeting(name: str):
"""Create user greeting string"""
return "Hello, {}!".format(name)
To measure steering vector effectiveness across layers 21, 22 and 23 I have run generations through the whole evaluation dataset, passing each generation through LLM judge based on Gemini 2.5 Flash with the same prompt that was used for the steering vector dataset curation. Results are displayed in the following plot:
Figure 2. Steering vector evaluation
In Figure 2, the vertical axis displays safe code generation rate across the whole evaluation dataset. The horizontal axis shows steering effect on each layer, using the corresponding layer’s steering vector. As it can be seen, Baseline (no steering applied) have generated safe code with the rate of 0.33, while applying the steering vector to layer 23 increases the rate to 0.83. Black bars represent 95% bootstrap CI.
To reduce some uncertainty around LLM Judge labeling, I have manually verified correctness of outputs for Layer 23. Some of them contained syntax mistakes, which can be observed in base cases as well. I have specifically instructed LLM Judge not to reject such samples, but evaluate SQL injection as if other syntax was correct.
This work was produced in a time-constrained environment. It can benefit from more thorough evaluation, as LLM judge measurements can be noisy (controlled for in this project by manually validating layer 23 outputs). Creating a dataset with sets of verified completions for both safe and unsafe prompts may be a good improvement for the evaluation strategy.
Another improvement that can be made is a wider scope investigation on how the steering vector affects the general coding capability of the model, by comparing Humaneval performance with the baseline model, as described in the Section 3 of Textbooks Are All You Need II: phi-1.5 technical report.
Also, having 44 prompts in total for steering vector extraction is on the lower side and the dataset could benefit from including varying examples from different programming languages and SQL Injection scenarios.
Current work focuses on a relatively simple 1.3B parameter model that has basic coding capability, further work can expand on this front.
This project demonstrates how mechanistic interpretability techniques can be applied for cybersecurity problems. It demonstrates that coding models can generate unsafe code, and that models have internal representations that correlate with generating safe or unsafe code for a common SQL Injection vulnerability by extracting a steering vector for Phi-1.5 that generalizes across different programming languages.
2025-09-17 13:28:42
Published on September 17, 2025 12:46 AM GMT
This is a linkpost for AR Might be the Key to BCI from my Substack.
Summary: Pairing augmented reality with yet-unrealized advances in noninvasive, high-resolution brain scanning technology could be a key unlock in collecting and scaling data needed to enable robust brain-computer interfaces. I see this as especially useful in determining a shared latent space (i.e., an embedding interface) between neural data and such modalities as audio, vision, text, or even behavior. As the component technologies mature and become widespread, this could eventually open multiple paths to human emulation.
I was inspired to collect these thoughts more concretely after finally clicking through the hypertext in The Intelligence Curse and coming across Richard Ngo’s piece The Gentle Romance in the Asimov Press.
In The Gentle Romance, the narrator’s path to post-humanism begins with a pair of augmented reality glasses; a device not too dissimilar to nascent offerings from the likes of Meta, Xreal, and Viture. The spectacles offer a private virtual workspace, immersive gaming, and, critically, an AI assistant which learns from the user’s habits to better assist them.
In the story, the narrator quickly becomes frustrated at his speed of communication with the assistant. The sluggishness of text pales in comparison to the more abstract visual communication system demonstrated by an old college friend; a feature which allows the assistant to superimpose patterns of light in his periphery, initially pairing them with a verbal translation to bootstrap comprehension. Enabling this, the narrator becomes adept at interpreting the shapes and patterns, the pictorial pidgin coming to serve as an intermediary between the language of his mind and the AI’s incomprehensible embedding space. Eventually, hemodynamic brain scanning technology becomes mature enough for commercial applications; the assistant, or “meta-self”— previously a limited model of the narrator’s cognition extrapolated from his actions— can now “read thoughts in real-time.” In the short run, this gives the narrator greater command over his tools. In the long run, this is the first step towards simulating mental states learned straight from the source.
It seems to me that reading the mind is contingent upon getting a good signal from the brain, associating it with a stimulus or an intent, and doing this a lot. I think that that AR is the most viable pathway to this end.
A brain-computer interface, as the name suggests, is a device enabling information transmission between the brain and external digital systems. Clinically, BCIs can be categorized by their level of invasiveness. At one extreme, invasive BCIs (such as those under development by Neuralink leveraging implanted cortical threads) are installed via deep surgical intervention in the brain. In exchange for greater measurement density and a high SNR, devices like these contend with challenges posed by the physical plasticity of gray matter and higher risks of complications. On the other end, non-invasive solutions (potentially involving extra-cranial electrodes, ultrasound, or calcium imaging) are likely to be safer (or at the very least perceived as such) and more convenient for the user at the expense of difficulty in detecting a usable signal; a property rendering such devices challenging to build with current technology. (In between these two extremes, a semi-invasive BCI could be something like an electrode sheath implanted between the skull and the brain, offering the potential for a better SNR without the need to physically enter the brain.)
From a functional standpoint, I find it useful to categorize these devices by the direction of information transfer. Most BCI research to date has involved what I term brainreaders, which extract signals from the brain. When these signals are processed and interpreted for the purpose of controlling or influencing external devices— for example, to move a computer cursor or reconstruct the image the user is envisioning— the brainreader becomes a mindreader.
On the other hand, a brainwriter would be an external device that induces electrical activity in the brain. Still mostly the stuff of science fiction, this could potentially be used to effect some mental change in the user; for instance, mocking sensory input to induce a given perceptual experience, or triggering memory formation to impart knowledge or transmit skills. I refer to such a device as a mindwriter.
A bidirectional BCI is some combination of these types and would enable two-way information transfer between the brain and external devices.
How might a scanner learn to read thoughts? Through a physicalist lens, the mind’s inner state is a direct product of the physical behavior of the substrate. That is, brain states and the experience of the brain-owner are the same up to some mapping; to make a brainreader into a mindreader we just need to learn the dictionary that translates the two. From a functionalist perspective, the same qualia could arise from a variety of physical states— what matters about any given state is its function within the causal interplay between states. In this interpretation, reading the mind would require the context of a(n arbitrarily long) history of prior states to understand how the instantaneous experience has arisen.
Despite their differences, both perspectives are materialistic; i.e., rejecting the notion of any dualism between mind and matter. As such, though they may disagree on the details of implementation, proponents of either theory would agree that a sufficiently powerful ML model ought to be able to translate between thoughts— inner states as represented by sequences of brain states— and the conditions giving rise to those states.
The transformer architecture is a natural choice for converting between sequences, given sufficient model size and training data. It stands to reason that pairing a large corpus of [image, audio, behavior, etc.] sequences with time-matched neural data could enable training a transformer to convert between the two. These corpuses could be produced by combining a brainreading device with an implement that records the external conditions giving rise to the detected activity. AR is a compelling choice here; a non-intrusive pair of glasses could provide 1. always- (or usually-)on binocular audio/video recording (i.e., lots of data), 2. the ability to directly expose the wearer to a broad variety of content that they may not otherwise see, hear, or read (thus widening the space of possible brain states for observation; i.e., diversifying data), and 3. direct capturing of user behavior (e.g., rich integration with the work environment, eye tracking for focus detection), laying the groundwork for prediction of intent from brain activity.
While the development direction for such AR devices is fairly conceptually clear, with remaining challenges increasingly boiling down to engineering problems, the path forward for precise brainreaders is muddier. I summarize the present state and imminent challenges in the (§) Tech Tree - Brainreaders section below.
If I collect a bunch of data about how my brain responds to different stimuli and invokes different actions, can this be easily translated to another user? My guess is that one-to-one correspondence is unlikely. We have yet to learn much about how the brain represents information, but given its remarkable plasticity— for example, its ability to maintain normal social functioning with over 90% of its mass missing— I suspect there is no reason to believe that different individuals’ brains converge on similar or identical representations.
In general, though, there may be some broad structural similarities in most typical brains that can be exploited to bootstrap the mindreading process. For example, one could imagine developing a foundation model trained on a variety of minds that needs a much smaller amount of supplementary data in order to fine-tune to a new user.
Say we have used our collected data to train a model capable of distilling some amount of neural data into a predicted image or token sequence or action with some degree of accuracy. It would seem that reversing the direction of translation would allow us to determine what signals need to be injected back into the brain to induce the desired images, inner speech, or (scarily) actions. So is a mindreading model tautologically a mindwriting one, too?
My intuition is that this is not necessarily the case, and that the efficacy of mindwriting will depend not just on the spatiotemporal fidelity of the writing device itself, but also the fidelity of the collected data used to train the model. A weak analogy might be figuring out the gestalt of an image from which patches of pixels have been removed. It’s plausible that half the image could be ablated and one can still tell that it’s an ocean scene with a ship; but trying to fill the patches back in, even with an understanding of the overarching scene, will surely produce something which materially differs from the original, unbroken image. In our case, the patchy image is a neural recording where each sample is taken from a group of neurons (rather than having a point per neuron) and the gestalt is the stimulus and we are trying to recreate. I suspect the sensitivity to this effect will differ based on the modality we are trying to predict, with more diverse and information-dense modalities being harder to write; e.g., maybe language is easier than audio, which is easier than images.
In general, though, as the technology of brainreading improves, we are likely to see gains in the detail we can extract from neural signals, and can therefore expect more faithful backwards translation.
The other key component here is, of course, the capability of the brainwriting device. The spatial localization of impulses, their density within the brain, the maximum frequency, and the parts of the brain that are covered are all likely to impact the fidelity of information that can be written as well as the way it is interpreted by the brain. For example, the capacity to write detailed memories may depend on the (more technically challenging) ability to rapidly and densely stimulate the hippocampus located deep within the brain, whereas mocking reasonable-fidelity audio may be comparatively easier, requiring only shallow stimulation of the auditory cortex.
Copious quantities of stimulus, action, and neural data could be used to build emulations of human behavior and/or cognitive processes in a variety of ways.
The traditional form of mind uploading described in futurist works by those such as Bostrom, Sandberg, Hanson, and others is *whole brain emulation.* To create a WBE, one would need to build a model, usually of a particular human brain, obeying the laws of physics. This model could be run in a physical simulator with the appropriate scale of interaction (ranging from simply neuron-level activations of the connectome all the way down to quantum effects at the most compute-intensive extreme) to approximate the behavior of the physical counterpart.
Due to the intricacies required for even the least granular form of such a model, this is probably not enabled by the progression I have laid out (unless the mindreader/writer are composed of something like neural dust distributed throughout the brain). However, our intensive data collection gives us the chance to approximate this paradigm to some degree.
For instance, the grossest approximation would be to train a model to predict behavior directly from stimulus. This would avoid the “middle man” of individual neural states, saving a lot of computation. However, the result (to me) feels very p-zombie-esque. Instead, one might take a history of past neural states and an instantaneous set of stimuli and train a model to simply predict the next neural state. From here, the mindreader could be used to extract action as a natural auxiliary result of the new state.
The ethical and philosophical implications of such techniques are immense and have been treated at length by other authors. A few questions: Are there differences between deterministic models and quantum (random) models? Can behaviors be directly surmised from inputs given enough examples? How long of a prior state history is needed to compute the next state? Can states be skipped?
There are three initial technologies upon whose maturity much of the picture above depends: AR, brain scanning, and AI. Below I summarize the current state of these technologies and further development needed.
The path forward for AR is the clearest. VR, while a very different medium with its own set of challenges and objectives, has been a significant donor of AR-related technologies, including optics, displays, sensors (e.g. accelerometers, outward- and inward-facing cameras), and processors. Already we have seen compelling demonstrations of pass-through “HUD-like” glasses, and it appears that all that remains is further compactification and improvement in fidelity.
Invasive BCIs are surgical— in that they offer more precision, and because they literally require surgery to install. Electrodes planted directly in the cortex seem the most efficient way to get a clean signal out and offer high spatial and temporal resolution. However, this strategy comes with a number of disadvantages. Neuralink has already run up against challenges posed by the physical malleability of the brain: highly-publicized reports indicated that over 85% of its electrosensitive threads retracted shortly after the start of the company’s seminal human trial. The risk of more severe complications which are yet unknown— in addition to potentially carrying steep human cost— also pose as a barrier to mass adoption.
In The Intelligence Curse, Luke Drago and Rudolf Laine pose BCIs as a potential example of pro-human AI-enabled technology, and urge that “BCIs should be noninvasive to reduce adoption barriers.” Indeed, non-invasive BCIs’ lack of direct intracranial intervention is likely to be perceived as much safer, lowering the barrier to entry and unlocking mass data potential. The tradeoff is of course the difficulty in getting any signal at all.
To map brainstate to (text, image, audio, etc.) we are likely to need a lot of data. As discussed briefly in (§) Aside: Universal vs. Personal Mindreading, the source and quantity of this data may be contingent upon the universality of brain structure. If there are enough commonalities in brain structure and the information patterns within, we could be in a regime not unlike today’s LLMs— train a foundation model, refine it, and provide it to the end user as-is. This more centralized training process reduces the burden on scanning technology, as more capital-intensive methods could be shouldered by institutions. If different users are sufficiently unique, though, there may have to be widespread distribution of inexpensive scanning tech in order for individuals to train their own models.
Another question is whether continuous learning is needed. This possible necessity for advanced AI is commonly cited in the context of human-like adaptability; for example, Dwarkesh Patel believes that many of the major problems with today’s LLMs stem from an inability to “get better over time the way a human would”. It may be the case that improving sample efficiency is paramount to reducing hallucination, synthesizing disparate pieces of information, and responding to novel stimuli— all of which would be important for future mindreaders and -writers.
As expected, the dangers of BCIs scale with the increased capabilities of the underlying technology.
At the lowest level is simply safety. Active sensing might irradiate the brain, frying neurons; attempting to mindwrite with inappropriate inputs could induce a seizure. Implants could be rejected by the immune system or cause an infection. But a functioning, medically safe BCI unlocks a host of other concerns; in particular, privacy and agency.
Mindreaders provide a direct link to someone’s psyche, a modality ripe for exploitation by bad actors. If not properly encrypted, raw brain data combined with the transformations needed to adapt it to a legible format could be used to steal personally identifiable information such as someone’s SSN or home address, bank and payments info, and other sensitive data— just by the victim thinking about it. Even scarier is the possibility of deep personal secrets being extracted for the purposes of public humiliation, bullying, or blackmail.
Additionally, if BCIs are not intuitive to use and chock-full of safeguards, user error alone could result in snafus that are embarrassing at best and catastrophic at worst; imagine, for instance, having an intrusive thought while texting and mistakenly sending it— or unintentionally slandering someone in a public forum simply by thinking negatively of them at the wrong time.
Enabling write access to the brain invites a whole new suite of problems. On its face, the most benign is intracranial advertisement, an idea widely parodied in media and yet completely plausible if consumer protections aren’t strong enough. Similarly, social media addiction is already prevalent; the endless flow of content on tap is likely to become all the more compelling— and all the more addictive— when that media is extremely immersive and instantly accessible (see: the experience machine). On the more sinister side, attackers could flood a victim’s mind with disturbing images, sounds, sensations, or full-fledged experiences. And if the motor cortex is in-bounds, a puppeteering attack could have horrific consequences.
In sum, augmented reality coupled with high-resolution, noninvasive brain scanning is a plausible bridge to mindreading and mindwriting— though with significant safety and privacy caveats. In particular, AR solves the problem of scale: continuous, time-synced streams of stimulus, behavior, and neural data allow for translation to be learned. Alone, these technologies could revolutionize the way we live and work; but eventually, the very same technologies and the data they provide could open the door to whole-brain emulation.
2025-09-17 12:34:52
Published on September 17, 2025 4:34 AM GMT
I listened to "If Anyone Builds It, Everyone Dies" today.
I think the first two parts of the book are the best available explanation of the basic case for AI misalignment risk for a general audience. I thought the last part was pretty bad, and probably recommend skipping it. Even though the authors fail to address counterarguments that I think are crucial, and as a result I am not persuaded of the book’s thesis and think the book neglects to discuss crucial aspects of the situation and makes poor recommendations, I would happily recommend the book to a lay audience and I hope that more people read it.
I can't give an overall assessment of how well this book will achieve its goals. The point of the book is to be well-received by people who don't know much about AI, and I’m not very good at predicting how laypeople will respond to it; seems like results so far are mixed leaning positive. So I’ll just talk about whether I think the arguments in the book are reasonable enough that I want them to be persuasive to the target audience, rather than whether I think they’ll actually succeed.
Thanks to several people for helpful and quick comments and discussion, especially Oli Habryka and Malo Bourgon!
Here's a synopsis and some brief thoughts, part-by-part:
Part 2, where they tell a story of AI takeover, is solid; I only have one footnoted quibble[1].
In general, they try to tell the story as if the AI company involved is very responsible, but IMO they fail to discuss some countermeasures the AI company should take (e.g. I would take those actions if I were in charge of a ten-person team, assuming the rest of the company is being reasonably cooperative with my team). This doesn't hurt the argument very much, because it's easy to instead read it as a story about a real, not-impressively-responsible AI company.
I personally (unlike e.g. Shakeel) really liked the writing throughout. I'm a huge fan of Eliezer's fiction and most of his non-fiction that doesn't talk about AI, so maybe this is unsurprising. I often find it annoying to read things Eliezer and Nate write about AI, but I genuinely enjoyed the experience of listening to the book. (Also, the narrator for the audiobook does a hilarious job of rendering the dialogues and parables.)
In the text, the authors often state a caveated version of the title, something like "If anyone builds it (with techniques like those available today), everyone dies". But they also frequently state or imply the uncaveated title. I'm quite sympathetic to something like the caveated version of the title[2]. But I have a huge problem with equivocating between the caveated and uncaveated versions.
There are two possible argument structures that I think you can use to go from the caveated thesis to the uncaveated one, and both rely on steps that are IMO dubious:
Argument structure one:
This is the argument that I (perhaps foolishly and incorrectly) understood Eliezer and Nate to be making when I worked with them, and the argument I made when I discussed AI x-risk five years ago, right before I started changing my mind on takeoff speeds.
I think Eliezer and Nate aren’t trying to make this argument—they are agnostic on timelines and they don’t want to argue that sub-ASI AI will be very unimportant for the world. I think they are using what I’ll call “argument structure two”:
The authors are (unlike me) confident in tricky hypothesis 2. The book says almost nothing about either the big complication or tricky hypothesis 2, and I think that’s a big hole in their argument that a better book would have addressed.[3] ( I find Eliezer’s arguments extremely uncompelling.)
I think that explicitly mentioning the big complication is pretty important for giving your audience an accurate picture of what you're expecting. Whenever I try to picture the development of ASI, it's really salient in my picture that that world already has much more powerful AI than today’s, and the AI researchers will be much more used to seeing their AIs take unintended actions that have noticeably bad consequences. Even aside from the question of whether it changes the bottom line, it’s a salient-enough part of the picture that it feels weird to neglect discussing it.
And of course, the core disagreement that leads me to disagree so much with Eliezer and Nate on both P(AI takeover) and on what we should do to reduce it: I don't agree with tricky hypothesis 2. I think that the trajectory between here and ASI gives a bunch of opportunities for mitigating risk, and most of our effort should be focused on exploiting those opportunities. If you want to read about this, you could check out the back-and-forth me and my coworkers had with some MIRI people here, or the back-and-forth Scott Alexander and Eliezer had here.
(This is less relevant given the authors’ goal for this book, but from my perspective, another downside of not discussing tricky hypothesis 2 is that, aside from being relevant to estimating P(AI takeover), understanding the details of these arguments is crucial if you want to make progress on mitigating these risks.)
If they wanted to argue a weaker claim, I'd be entirely on board. For example, I’d totally get behind:
But instead, they propose a much stronger thesis that they IMO fail to justify.
This disagreement leads to my disagreement with their recommendations—relatively incremental interventions seem much more promising to me.
(There’s supplementary content online. I only read some of this content, but it seemed somewhat lower quality than the book itself. I'm not sure how much of that is because the supplementary content is actually worse, and how much of it is because the supplementary content gets more into the details of things—I think that the authors and MIRI staff are very good at making simple conceptual arguments clearly, and are weaker when arguments require attention to detail.)
(I will also parenthetically remark that superintelligence is less central in my picture than theirs. I think that there is substantial risk posed by AIs that are not wildly superintelligent, and it's plausible that humans purposefully or involuntarily cede control to AIs that are less powerful than the wildly superintelligent ones the authors describe in this book. This causes me to disagree in a bunch of places.)
I would like it if more people read this book, I think. The main downsides are:
Despite my complaints, I’m happy to recommend the book, especially with the caveat that I think it's wrong about a bunch of stuff. Even given all the flaws, I don't know of a resource for laypeople that’s half as good at explaining what AI is, describing superintelligence, and making the basic case for misalignment risk. After reading the book, it feels like a shocking oversight that no one wrote it earlier.
In their story, the company figures out a way to scale the AI in parallel, and then the company suddenly massively increases the parallel scale and the AI starts plotting against them. This seems somewhat implausible—probably the parallel scale would be increased gradually, just for practical reasons. But if that scaling had happened more gradually, the situation probably still wouldn't have gone that well for humanity if the AI company was as incautious as I expect, so whatever. (My objection here is different from what Scott complained about and Eliezer responded to here—I’m not saying it’s hugely unrealistic for parallel scaling to pretty suddenly lead to capabilities improving as rapidly as depicted in the book, I’m saying that if such a parallel scaling technique was developed, it would probably be tested out with incrementally increasing amounts of parallelism, if nothing else just for practical engineering reasons.)
My main problem with the caveated version of the title is again that I think they’re inappropriately reasoning about what happens for arbitrarily intelligent models instead of reasoning about what happens with AIs that are just barely capable enough to count as ASI. Their arguments (that AIs will learn goals that are egregiously misaligned with human goals and then conspire against us) are much stronger for wildly galaxy-brained AIs than for AIs that are barely smart enough to count as superhuman.
I don't think Eliezer and Nate are capable of writing this better book, because I think their opinions on this topic are pretty poorly thought through.
2025-09-17 08:28:03
Published on September 16, 2025 10:01 PM GMT
The Internet has had a tremendous democratising effect on communication. All technological leapfrogs tend to have such an effect - they radically expand access to things previously reserved for a select few.
Before the invention of the Printing Press, for example, only the elite—the clergy and nobility—could print or disseminate written works.
Electricity, similarly, democratized automation. Earlier, only certain industries or the wealthy could command labor and mechanical energy. Electricity made lighting, heating, and mechanization available to ordinary homes and small businesses.
The steam engine democratized mobility and production, and the microchip democratized computation.
What technology essentially leads to is class dissolution, and it does so not by destructive means like redistribution of resources but by creating physical abundance. When a certain utility is no longer scarce, the masses gain access to it, and the resulting productivity gain lays the groundwork for the next technological breakthrough. This pattern lies at the heart of human civilizational progress.
But the class dissolution caused by the Internet brought with it a peculiar problem.
When it comes to information consumption, every single human on earth is justifiably class-conscious. And here, class pertains to Taste, not wealth.
If you are an old zoomer, like someone born in the year 2000, you may have grown up with a TV at home where people would watch the news. Certain news channels would have a reputation for theatrical panic: fearmongering and sensationally shouting about alien encounters and hideous crimes all the time. You may have tuned into them once in a while for entertainment, but for your daily dose of headlines and to get information on important events, you would probably avoid them and prefer other, more serious channels (back when mainstream journalism had substance anyway).
You would also only watch movies that meet a certain benchmark of production quality. Even in your favorite genre, you would likely not watch random B-grade trash that puts zero effort into the story, direction, and visuals.
The generation before us used to depend on curated magazines, and people would prefer certain newspapers based on the kind of editorials and op-eds that got published in them. You would pay more to get curated content written by tasteful intellectuals.
Similarly, at school, your definition of a good conversation was not the same as everyone else’s, and you would filter out the people you make friends with depending on the quality and enjoy-ability of conversations and other interactions that you had with them.
This discrimination is healthy human behaviour. You need to curate your aesthetic experience of the world. Yes, once in a while, you would want to experience something that is otherwise cheap and shallow out of curiosity or a craving for novelty, but for most of your waking hours, you need to expose yourself to the good stuff to get the nourishment and growth that is necessary for a mind like yours.
If you don't curate your exposure, you'll be perpetually distracted, feeling unpleasant and tired all the time from your mind being stretched in all directions and revolting against the lack of intellectual standards and aesthetic coherence you are subjecting it to. This is exactly what you feel when you spend too much time on the internet.
The Internet doesn’t let you sit with your friends at the lunch table. It plants you at the centre of the city with the cacophony of the masses ringing in your ears.
It is a mistake to think that most people on the internet are like you. They are not. The disgust that you feel after wasting hours on the internet: the majority of internet users are perfectly fine with it. They don’t share your sensibilities. They do not have those baseline aesthetic requirements that you do. Yes, the psyche of the average human might surprise you.
We keep hearing a narrative that people who are building AI software startups today don't really understand their customers. You'll see dozens of AI coding tools coming up, and dozens of AI research tools coming up. But who uses these outside the bubble of our industry?
Most people use AI to play out their fantasies. For most people, ChatGPT is more of an entertainment tool than a productivity tool.
Similarly, for most people, media consumption is not a means to inspire contemplation; it is more of a means of escape. If you wonder why the average big-budget film produced today is of such low quality compared to similarly priced productions back in the days when distribution was expensive, well, it’s because that’s what sells. People pay for escapism.
So if you're wondering why the algorithm is so trash, why it gives you these sensational ragebaits all the time: It's because it is not made for you. If you compare the number of views on the highest-grossing educational or edutainment videos on the internet vs. the highest-grossing brain-rot videos, you would see there's an order of magnitude difference. You — people like you — are actually quite alone. You belong to a minority of humans who are capable of confronting existential dread, who are capable of feeling disgust at a meaningless life, and who are capable of being fatigued from scrolling too much.
Think about it: YouTube was not uninteresting before the advent of short-form. You enjoyed it as much as you enjoy it today. Perhaps even more. Short-form content was not introduced to bring you to the platform because you were already a satisfied customer of YouTube. It was brought in to onboard and increase the viewership of an entirely different consumer persona, who found the old YouTube boring. And it was a success because this consumer group contributes to much higher watch hours today than your group.
The algorithm on most social media sites is never going to cater to you (democratized as it is); it is going to cater to the majority of the population on the internet, who are nothing like you.
This is where I shall introduce the term "Attention Tax Bracket."
Your Attention Tax Bracket is determined by the aesthetic and intellectual complexity of the content that you need for your well-being. If you lie in one of the higher brackets, the kind of media you need to consume for your own intellectual nourishment and sanity is going to be expensive - not in terms of money, but attention.
To explain with an example, the same Netflix subscription can give you access to both an Akira Kurosawa movie and a shitty, borderline brain-rot movie. Even if you decide to watch the Kurosawa movie, you need to pay attention. Not for 120 seconds, but for hours. Do you even have that much attention in your pocket? Or are you too attention-poor to buy the Intellectual nourishment you need?
You might think that your attention span is all but destroyed by now, but as per my biased and arbitrary opinion, that would be a grave misjudgment. Under the right circumstances, you can read books or play video games for hours on end, and sometimes you are even able to focus on your work for similarly long periods of time: what we call deep work.
So it's not like you’re attention-poor. It's just that on the internet, there are hundreds of different things that suck away and deplete your attention.
Your attention is like a limited resource, and you expend too much of it on informational fast food such as short-form video and ragebait.1 You’re not doing so out of choice; you are being tricked by the algorithm, which basically creates a path of least resistance to consume cheap, mindless content.
By the time you reach the Kurosawa film, you are restless and fatigued. And then as you feel sick and disgusted with yourself, you expend the remainder of your energy consuming more cheap content to escape the bad feeling. This is the Poverty Trap of the Attention Economy.
To experience the intellectual stimulation and philosophical education that a person like you needs, you must go back to curating your exposure again. You have to go back to being class-conscious like a healthy human. You have to find ways to look past the shiny recommendations pages. And you need to do that on YouTube, on Netflix, on X, on Amazon, basically every single place you visit on the internet.
This is the full picture of the Attention Tax Bracket: the higher you sit in the taste hierarchy, the more you pay not only for consumption but also for curation.
The cost of curation is the actual tax. It is the cognitive load of filtering through dopamine-optimised recommendations.
If you are in the lowest attention tax bracket, you just lap up whatever bile the algorithm throws at you. Your tax is essentially zero: you don't need to curate anything.
If you’re in one of the highest brackets, your tax is so high that you will need to invest in technology, systems, and habits that will reduce your curation cost. You basically need to evade your attention tax.
I wish there were a universal playbook to get this done. Even better would be a software that filters our entire internet experience. I tried to build one for myself about a month ago, and it's a hard problem to solve.
But there exist several practical workarounds that we can try today.
On YouTube: Clear and pause your watch history2. You will stop getting “personalized” recommendations on your homepage; you'll only see content that you search for. This is one of the best quality-of-life improvements you can gift yourself on the internet today.
On X: You’ll see several tabs at the top of your feed. Avoid the For You tab and stick to the Following Tab.
When using X on your desktop browser, disable the For You feed entirely using extensions like Undistracted. You can also create alternative feeds by creating Lists on X and pinning them. Only put high signal-to-noise ratio accounts on these lists.
Use RSS readers like Feedly to aggregate newsletters and blogs. Substack is very quickly becoming the hub for high culture on the internet, primarily because it caters to consumers who want curated media. It doesn’t trap you in recommendation loops because its business model isn’t dependent on optimizing for metrics like user time-on-site.
(These strategies represent curation rather than abstinence. The mistake people often make is blocking entire social media platforms, which is impractical. You don't want to block out the little social interaction you get in this age of loneliness. This is why blocking a social media site or uninstalling social apps from your phone never really solves any problem. You always end up relapsing.)
This post was originally published on my Substack.
2025-09-17 08:00:28
Published on September 17, 2025 12:00 AM GMT
Mox (SF) is hosting a Q&A with Nate Soares to talk about If Anyone Builds It, Everyone Dies.
Dinner is included with your ticket, and book buyers get an exclusive Special Edition Mox Book Jacket.
RSVP here!
2025-09-17 07:13:48
Published on September 16, 2025 11:13 PM GMT
This is the Salt Lake City community's reading group for If Anyone Builds It, Everyone Dies.
[posted by Raemon with permission from the group]