2026-04-07 08:45:02
Today, Elon Musk announced his new project for making chips and launching datacenters in orbit. This may have implications for timelines and takeoff forecasting. To encourage further discussion, I am posting the transcript of the event below.
[0:00] In order to understand the universe, you must explore the universe. And that's the motivation to accelerate humanity's future in understanding the universe and extending the light of consciousness to the stars.
[1:24] Well, we have a profoundly important announcement to make, which is the most epic chip building exercise in history by far. This is really going to take things to the next level. A level probably people aren't even contemplating right now. This is not in — I would call this sort of an out of context problem. It's not in their context. So we're going to adjust the context by a few orders of magnitude here.
[2:02] So we aspire to be a galactic civilization. I think the future that everyone — well most people I think would agree — is the most exciting is one where we are out there among the stars, where we are not forever confined to one planet, that we become a multiplanet species. Like the best science fiction that you've ever read — Star Trek, or Iain Banks, or Asimov, or Heinlein. And we want to make that real. Not just fiction. Turn science fiction into science fact. That's the glorious exciting future that I certainly look forward to.
[2:56] And it's worth considering sort of like how would you rate civilizations? There was a physicist, I think he was Russian, in the 60s — Kardashev. He thought at a high level about how you would rate any given civilization, and he said, well, if you're type one you're using most of the energy of your planet. And we actually still have quite a way to go to be properly a type one — we're still using only a tiny fraction of the sun's energy that reaches our planet.
[3:44] But the Earth only receives about half a billionth of the sun's energy. So the sun is truly enormous. The sun is 99.8% of all mass in the solar system. So sometimes people will ask me, what about other sources of power on earth, like what about fusion on earth? Well that is unfortunately very small, because the sun is 99.8% of the mass in the solar system and Jupiter is about 0.1% and earth is in the miscellaneous category. We are — I think as Carl Sagan might have said — earth is like a tiny dust mote in a vast darkness. Very very small. The sun is enormous.
[4:38] So the way to actually scale civilization is to scale power in space. This is necessarily true because we actually capture such a tiny amount of the sun's energy on Earth because we're just this tiny dust mote. Another way to think of it is roughly like electricity production on Earth of all of civilization is only about a trillionth of the sun's energy. Which means if you increase civilizational power output by a million, you would still only be a millionth of the sun's energy. It's awe inspiring to consider just how tiny we are in the grand scheme of things.
[5:31] And we often get sort of caught up in these squabbles on earth that are really very minor things when you consider the grandness of the universe. And so I think it is important actually to consider the grandness of the universe and what we can do that is much greater than what we've done before, as opposed to worry about small squabbles on Earth. Not much point in that.
[6:04] We want to be a civilization that expands to the galaxy with spaceships that anyone can go anywhere they want at any time. That would be epic. And have a city on the moon, cities on Mars, populate the solar system, and send spaceships to other star systems. That sounds like the best possible future, you know.
[6:40] So to do that, we need to harness the power of the sun. And a terafab — a terawatt of compute per year — while enormous by our civilizational standards, is still just one step along the way to being even a Kardashev type one. You still have a long way to go to even be a type one civilization and you're not even registering as a type three. So it's a very big thing by current human standards but still small in the grand scheme.
[7:21] But very difficult for humans. So to accomplish this very difficult goal really requires a combination of efforts of SpaceX, xAI and Tesla working together to create this epic terafab project.
[7:46] And you know, Tesla and xAI and SpaceX have all done amazing things that people did not think would be done before. There's the Giga Texas fab here. There's the Optimus robot that's being built. There's a global supercharging network. There's really quite a lot. And it wasn't that long ago when people thought electric cars wouldn't amount to anything. And there were basically no electric cars for sale when Tesla started. And people said it was impossible. And now Tesla's making 2 million electric cars a year.
[8:32] And then xAI, although it's a new company now part of SpaceX, has also built the first gigawatt scale compute cluster, which — in record time. Jensen Huang from Nvidia said he'd never seen anything built so fast in his life before. So a great compliment from Nvidia.
[8:56] And then SpaceX — well, I guess you can read it for yourself. You already know. I mean, the reusable rockets — people said that reusable rockets weren't possible, and even if you did do them, they wouldn't be economically feasible. So, we did them, and then we made them economically feasible. And now we've landed over 500 times. And then we did the Falcon Heavy, and now we're doing Starship.
[9:17] And Starship is a critical piece of the puzzle because in order to scale compute and scale power, you have to go to space, which means that you need massive payload to space. And Starship will enable that.
[9:37] So this gives you sort of just a sense of scale. We've got Optimus there, Optimus for scale. And Optimus is about 5'11". So it gives you a sense of the size of the Starship V3 rocket. Starship V4 will be much longer. Actually the Starship V4 will make Starship V3 look kind of short. So we'll expand with Starship V4 to 200 tons of payload to orbit, from 100 tons with V3.
[10:12] And then you can see that — that's just a rough approximation of the AI satellite, the mini version of the AISAT. So that's roughly 100 kilowatts. It's showing the solar panels and the radiator to scale. So for some reason there's been a bizarre debate about radiators in space. It's safe to say SpaceX knows how to do heat rejection in space with 10,000 satellites in orbit. Might know a thing or two. So you can see the radiator is actually quite small relative to the solar panels. And we call that the minisat since that's just 100 kilowatts. We expect future satellites to probably go to the megawatt range.
[11:08] So in order to get to the terawatt of compute per year you need about 10 million tons to orbit per year at 100 kilowatts per ton. But we're confident this is feasible — like no new physics or impossible things are required to get there. I'm confident actually that SpaceX will get to 10 million tons to orbit per year. And then we're building up to a terawatt of solar. So that will solve the solar problem, the power generation. Then the key missing ingredient is therefore a terawatt of compute.
[11:51] So, this announcement is about solving the key missing ingredient.
[11:59] To give you a sense of what we're talking about, the current output of AI compute is roughly 20 gigawatts per year. This chart explains why we need to build the terafab: all of the rest of the output from Earth is only about 2% of what we need. So if you add up all the fabs on Earth combined, they're only about 2% of what we need for the terawatt project, or terafab project.
[12:41] So you know we certainly want our existing supply chain — to be clear, we're very grateful to our existing supply chain — to Samsung, TSMC, Micron and others. And we would like them to expand as quickly as they can. And we will buy all of their chips. I have said these exact words to them. But there's a maximum rate at which they're comfortable expanding, but that rate is much less than we would like. And so we either build the terafab or we don't have the chips and we need the chips so we're going to build the terafab.
[13:29] And we're starting off with an advanced technology fab here in Austin. And I believe Governor Abbott is in the audience. I'd like to thank Governor Abbott and the state of Texas for their support.
[14:04] So, in the advanced technology fab, we will have all of the equipment necessary to make a chip of any kind, logic or memory, and we will also have all of the equipment necessary to make the lithography masks. So in a single building we can create a lithography mask, make the chip, test the chip, make another mask and have an incredibly fast recursive loop for improving the chip design. To the best of my knowledge, this doesn't exist anywhere in the world where you've got everything necessary to build logic, memory, and do packaging and test it and then do the masks, improve the masks, and just keep looping it.
[14:51] And we're not just going to do conventional compute in this. I think there's some very interesting new physics that is potentially — that actually I'm confident will work. It's just a question of when. So this is going to — we're really going to push the limit of physics in compute and we're going to try a bunch of wild and crazy things which you can do if you've got that fast iteration loop. I can't emphasize enough the importance of being able to make a chip, test it, and then change the design, do another one, and have that in a single building. I think that our recursive improvement with that situation is probably an order of magnitude better than anything else in the world.
[15:54] So, broadly speaking, we expect to make two kinds of chips. One will be optimized for edge inference. So that'll be used primarily in Optimus and in the cars but especially in Optimus because I expect the robots — humanoid robots — to be made 10 to 100 times more than the volume of cars.
[16:22] So, you know, if vehicle production on Earth is about 100 million vehicles a year, I expect humanoid robot production to be somewhere between a billion and 10 billion units a year. So, it's a lot. Tesla's going to make a very significant percentage of those — that is our goal.
[16:45] And then we need a high power chip that is designed for space. That takes into account the more difficult environment in space where you've got high energy ions, photons, you've got electron buildup. It's a hostile environment in space. So you want to design the chip — you want to optimize it for space. And you also want to generally run it a little hotter than you would normally run a chip on Earth to minimize the radiator mass. So there's just a bunch of constraints that would — you design something differently in space than you would on the ground.
[17:33] And for the space compute my guess is that is the vast majority of the compute, because you're power constrained on Earth. That's why I think it's probably 100 to 200 gigawatts a year of terrestrial chips and probably on the order of a terawatt of chips in space. That's probably how it ends up, just because of power constraints on the ground.
[18:08] Space has this advantage that it's always sunny. Very nice. So I actually think that the cost of deploying AI in space will drop below the cost of terrestrial AI much sooner than most people expect. I think it may be only two or three years before it is actually lower cost to send AI chips to space than to run them on the ground, because in space you don't need much in the way of batteries since it's always sunny, and you're going to get at least five or more times the solar power in space versus the ground because you don't have atmospheric attenuation or a day-night cycle or seasonality. And you're always normal to the sun. So, you're really maximizing the solar power at that point.
[19:08] And space solar actually costs less than terrestrial solar because you don't need heavy glass or framing to protect it from extreme weather events. So as soon as the cost to orbit drops to a low number, it immediately makes extremely compelling sense to put AI in space. It becomes a no-brainer.
[19:33] Moreover, as you go to space, you get increased economies of scale and things get easier over time. Whereas as you try to put more and more power on the ground, you run out of space and you start using up the easy spots and then you get NIMBYism. Nobody wants the thing in their backyard. So actually increasing power on earth becomes harder over time and more expensive over time but in space it becomes actually cheaper and easier over time. These are very important points.
[22:08] So, what you just saw there addresses the question that is of course on your mind, which is, well, what do you do after a terafab? Don't think small. So, how do you get to a petawatt? That is the obvious next question.
[22:34] And you get there by having an electromagnetic mass driver on the moon with robots — with Optimi — and obviously lots of humans. And with that you can send — you can create a petawatt of compute and send that to deep space because the moon has no atmosphere and has 1/6th earth gravity. So you don't need rockets on the moon. You can literally accelerate it to escape velocity from the surface and that dramatically drops the cost once again of harnessing power and enables you to go a thousand times bigger than a terawatt.
[23:29] For sure the future I want to see — I want us to live long enough to see the mass driver on the moon because that's going to be incredibly epic.
[23:51] That should hopefully get us to a millionth of the sun's energy at least. Humbling to think about that. But a millionth of the sun's energy would be a million times bigger than Earth's economy. So it's good from that perspective. And then you expand beyond that to the planets, to the other stars and create the most exciting possible future that I can imagine.
[24:38] Unlocking an age of amazing abundance. So obviously the elements of that are sustainable energy, space travel, and AI and robotics that bring amazing abundance to everyone.
[24:56] It's really the only path to amazing abundance — AI and robotics. Which is not to say it can't go wrong. Hopefully, you know, but I think it'll probably go right and it'll be a future that you love. And it's the best future I can think of at least.
[25:22] And then we go beyond the moon, beyond Mars, and we sail through the rings of Saturn. Now, wouldn't it be amazing if you could buy a trip to Saturn? Or frankly, if you just have a trip to Saturn — I think things will just be free in the future. It sounds nuts, but you know, if you've got an AI robotics economy that is anywhere close to a million times the size of the current Earth economy, literally any need you could possibly have can be met. If you can think of it, you can have it.
[25:50] I think Iain Banks in his Culture books has it pretty much right where there actually isn't money in the future and there's abundance for everyone. If you can think of it, you can have it. That's it. Which means anyone could have a trip to Saturn. It won't be just a few people. If you want it, you can have it.
[26:18] So yeah, join us on this journey. And help us design incredible chips and make incredible chips and build a terawatt of chips, a terawatt of solar and 10 million tons to orbit per year. Thank you.
2026-04-07 07:43:51
Cross-posted from my website.
The existence of liberal democracy—with rule of law, constraints on government power, and enfranchised citizens—relies on a balance of power where individual bad actors can't do too much damage. Artificial superintelligence (ASI), even if it's aligned, would end that balance by default.
It is not a question of who develops ASI. Whether the first ASI is developed by a totalitarian state or a democracy, the end result will—by strong default—be a de facto global dictatorship.
The central problem is that whoever controls ASI can defeat any opposition. Imagine a scenario where (say) DARPA develops the first superintelligence [1], and the head of the ASI training program decides to seize power. What can anyone do about it?
If the president orders the military to capture DARPA's data centers, the ASI can defeat the military. [2]
If Congress issues a mandate that DARPA must turn over control of the ASI, DARPA can refuse, and Congress has even less recourse than the president.
If liberal democracy continues to exist, it will only be by the grace of whoever controls ASI.
There are two plausible scenarios that have some chance of avoiding a totalitarian outcome: one where AI capability is widely distributed and takeoff is slow enough that no single actor gains a decisive advantage, and one where an aligned ASI itself upholds liberal democracy rather than obeying whoever controls it.

I will discuss them in turn.
We have a chance at averting de facto totalitarianism if two conditions hold: AI capability is widely distributed rather than concentrated in one actor's hands, and takeoff is slow enough that no single system ever decisively outclasses the rest.
Widely distributing AI is difficult—today's frontier LLMs require supercomputers to run, their hardware requirements are becoming increasingly expensive with each generation, and AI developers have strong incentives against distributing them. In addition, distributing AI exacerbates misalignment and misuse risks, and it's likely not worth the tradeoff.
We do not know whether takeoff will be fast or slow; banking on a slow takeoff is an extremely risky move. Frontier AI companies are trying their best to rapidly build up to ASI, and they explicitly want to make AI do recursive self-improvement. If they succeed, it's hard to see how liberal democracy will be able to preserve itself.
There is a conceivable scenario where an aligned ASI preserves liberal democracy, and refuses any orders that would violate people's civil liberties.
Above, I wrote:
If liberal democracy continues to exist, it will only be by the grace of whoever controls ASI.
That's still true, but in this case "whoever controls ASI" would be the ASI itself. If it's aligned in a transparent way, then maybe we can be confident that it really will preserve democracy.
Even in this scenario, there is still a small group of people who control how the ASI is trained. The hope is that, at training time, those people do not yet have enough power to prevent oversight. For example, maybe laws mandate that (1) AI developers must make their training process public and auditable and (2) the training process must steer the AI toward valuing liberal democracy. It is not at all obvious how those laws would work, or how we would get those laws, or how they would be enforced; but at least this outcome is conceivable as a possibility.
This scenario introduces some additional challenges:
As the saying goes, democracy is the worst form of government except for all those other forms that have been tried. We don't want democracy; what we want is a truly good form of government (and hopefully one day we will figure out what that is). The fear isn't that ASI will replace democracy with one of those truly good forms of government; it's that we will get totalitarianism.
Liberal democracy beats totalitarianism. But locking in liberal democracy prevents us from getting any actually-good governmental system. This is a dilemma.
This essay does not assert that ASI will end liberal democracy. It asserts that, by strong default, ASI will end liberal democracy (even conditional on solving the alignment problem). There may be ways to avoid this problem—I sketched out two possible paths forward. But those sketches still require many sub-problems to be solved; I do not expect things to go well by default.
Or, more likely, expropriates it from a private company on a pretense of national security. ↩︎
For an explanation of why ASI could defeat any government's military, see If Anyone Builds It Everyone Dies Chapter 6 and its online supplement. For a shorter (and online-only) explanation, see It would be lethally dangerous to build ASIs that have the wrong goals.
Those sources argue that a misaligned ASI could defeat humanity, whereas my claim is that an aligned ASI could defeat any opposition, but the arguments are the same in both cases. ↩︎
2026-04-07 07:18:17
Stimulus-response is a bit out of date these days. It’s better to imagine yourself as a sort of prediction machine. First, you learn to predict your environment. Then, you use your predictions to error-correct your way into a future that looks the way you want it to look. If you want to Wikipedia-dive, the terms you’re looking for are the Free Energy Principle - or when AI agents use the same mechanism, Active Inference modeling.
Effectively, this perspective states that you constantly have two goals: to become more certain about your environment, and to use that certainty to guide your environment into whatever you want it to be. Learn things you don’t already know, then use them mercilessly to maximize your goals (such as they are). And while we are constantly doing both, we’re only going to be engaging here with the learning aspect.
Let's tie this all together really quickly with a metaphor to explain the Thousand Brains theory of consciousness in simple terms (while baking in a few other models for your benefit). Imagine the neurons in your brain as something like a much, much larger and more diverse version of the galactic senate from Star Wars. Each little hovering repulsorpod with an alien in it is a neuron. Also, there are a lot more of these senators-on-repulsorpods in your brain - tens of billions of them.
Jar-jar is, in this example, a small part of a single neuron.
I love robust metaphors.
Some neuron-senators are at the bottom, and can physically see the "ground" truth: raw sensory data. Then all of them yell and argue about what they think they see. Above them is another layer that looks down and can't see the ground truth - but can hear the arguments. There's a fog. At some point, someone in that second layer who can hear all of this will yell "we're touching a curved, smooth object! A lot of you are saying that!" and everyone below who isn't yelling that shuts up. And now the second layer starts arguing until someone in the layer above them hears the noise (and maybe the people in that layer can hear a little of the argument on floor one) and yells "we're holding a cup!" This continues up the floors of the galactic senate until you get to the top floor, where - the supreme chancellor is missing. All we have are 150,000 or so top-level senators voting on everything. Maybe in this case they're voting on "is this cup of coffee mixing well with the soy sauce I poured in?" or something.
Now, the higher levels of senate aliens care a lot about when the lower levels are wrong. Note that the senators doing higher-order reasoning aren't generally using raw sensory data. They're using the perspective discussed below to inform their reasoning (this is how you don't actually "see" reality, but rather your own predictions of it). They're keeping track of which senators below are often right or wrong, and updating their own trust and voting ledgers as they do so. Each senator has a ledger: it helps them keep track of how to vote given what's below.
I want everyone to note how cleanly groups of people seem to act like neurons at times.
I feel like there is a general field of study here about... intelligence... and it's interesting.
Two things I want to get out of this metaphor. First, when a lot of senators are yelling at the same time, it's costly. You only have about 20 watts to run your brain with, and you like it when you can have senators positioned above that can yell "quiet" often because they correctly figure out what the deal is early. You learn how fire works, and you don't need to spend time re-understanding smoke when you have a senator that knows how to identify it quickly. Even better if the senators above see that he's right frequently - which brings us to the second thing I want to introduce: how surprise fits in here.
Once the senator above yells "quiet" to all the incorrect shouters below and declares they've figured out what's going on, everyone down below who wasn't correct has to not only update their voting ledger so they don't mess up again in the future quite so badly - but they also have to tell all the neurons below them to update their weights too given this new information. This combined work is costly, so much that you can actually feel it. It feels like being surprised. The Free Energy from the free energy principle that we try to minimize as the learning half of the active inference model is simply the effort that all of these senators have to spend updating their voting ledgers. The more wrong they were, the more they have to change, and we try to minimize that overall effort.
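To make the "update effort" framing concrete, here is a toy numerical sketch - my own illustration of the general idea, not anything taken from the Thousand Brains or free energy literature, and every name in it is made up. A single senator keeps a ledger of weights, predicts each new piece of evidence, and pays an update cost proportional to how wrong it was. The better its ledger gets, the less it has to pay.

```python
# Toy sketch only: a "senator" whose update effort shrinks as its predictions improve.
import numpy as np

rng = np.random.default_rng(0)

class Senator:
    """Keeps a 'voting ledger' (weights) and updates it when its predictions miss."""
    def __init__(self, n_inputs):
        self.ledger = rng.normal(scale=0.1, size=n_inputs)

    def update(self, evidence, actual, lr=0.1):
        prediction = self.ledger @ evidence       # what the senator expected to see
        error = actual - prediction               # how wrong it was (the surprise)
        self.ledger += lr * error * evidence      # fix the ledger in proportion to the miss
        return abs(error)                         # effort spent on this update

senator = Senator(n_inputs=3)
for epoch in range(5):
    effort = 0.0
    for _ in range(200):
        evidence = rng.normal(size=3)
        actual = 0.5 * evidence[0] - 0.2 * evidence[2]   # a hidden regularity in the world
        effort += senator.update(evidence, actual)
    print(f"epoch {epoch}: total update effort = {effort:.1f}")  # falls as the ledger improves
```

Run it and the per-epoch effort drops steadily: once the senator has learned the regularity, surprises get smaller and cheaper, which is the cartoon version of minimizing free energy.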
Now that we have this model of galactic neuron-senators (my own metaphor for the thousand brains theory of consciousness), let's attach it to what we've been talking about.
Mirror neurons have long been associated with the concept of empathy (affective empathy, specifically). Fun note: mirror neurons are a little out of vogue right now, in part because we mimic things more comprehensively than their function would imply. Mirror neurons are lower in the galactic senate, effectively acting as our eyes into the emotional world. We use them for what we call affective empathy, sure, but if anything their limitations show that we clearly do more than just that. Enter embodied simulation. Embodied simulation is a more active process, using cognitive empathy instead of affective (lower-level and emotion-driven) empathy. Take a look at the following photo.
Can you feel it?
Even without a specific reason for your mirror neurons to activate, I bet you can feel it: you know how you would feel holding that ball, how it would feel to throw it, and what your muscles would do to accomplish that exact goal. It’s not a muscle flex, so much as a reflexive sort of awareness. You aren’t empathizing with anything: there’s nothing here to empathize with. Your lower level neuron-senators are quietly refusing to mirror anything, but the higher-level senators can still use their previously filled voting ledgers to figure out the details of how this could be executed and yell upwards anyway. It happens almost without you noticing: it’s a reflexive engagement with the world. It is embodied simulation, driven by the Theory of Mind network in your frontal cortex.
Let’s talk about that Theory of Mind network because it is vital. Specifically, I'm talking about the neuron-senators in the middle of this particular chain: the ones that read from your lower-level emotion-aware mirror neurons.
This network is what raises us "above the animals," so to speak. It is the robust structure that is one of the hallmarks of the neo-mammalian brain, something nearly uniquely human given how specialized we are in it. A lot of animals have mirror neurons and limbic systems, and some even have some capacity for cognitive empathy (great apes, dolphins, whales, elephants, crows, and ravens have more than normal).
I wonder if there's a moral culpability that comes with having this brain structure? I can forgive a spider - but the dolphins know what they did.
But no one went quite as hard into specialization as we did, and the robust structures in our brain that hyper-specialize in this sort of higher-order empathy are quite uniquely human. Effectively, we have a beautiful superpower: we can model other brains with incredible accuracy. We can use cognitive and affective empathy, using each to error-correct for the other. We can use Theory of Mind to try to understand what other people are thinking and how their perspective works.
The neuron-senators on the ground floor are the mirror neurons, the source of affective empathy. The ones above are your theory of mind network. Just like your senses help you error-correct your simulation of the world, your mirror neurons and affective empathy help you error-correct your automatic simulation of other people's physical intent.
A quick aside: I'm glossing over some of the science here. For example, your theory of mind network and the mirror neurons in your limbic system are part of distinct and separate networks, but they often work together for certain tasks. So it's more like the neuron-senators from those floors are often jointly members of special committees on human behavior.
There’s one very specific behavior I want to point out. Obviously, art is very tightly coupled with the Theory of Mind network. When people view an image and are told that image is “art,” those regions of their brain light up. The Theory of Mind networks activate.
The Theory of Mind Network is the seat of Cognitive Empathy, and error-corrects using Mirror Neurons
Something interesting happens if you tell people that the image was computer-generated or is random: almost immediately, those regions go fully dark. I think you can actually subjectively feel this; we’re all familiar with the sensation by now. When you’re viewing a picture online and halfway through realize it was generated by AI, part of your attention slams off as one of the larger parts of your brain… just stops caring. What we are subjectively feeling is this Theory of Mind network turning off.
The exact mechanism for this works through the Default Mode Network, another brain network that has something it cares about. It is the network that decides who and what currently has control over your mental processes. When you simulate someone else, often you use your own brain hardware to do so: your Default Mode Network keeps that straight by ensuring the rest of your brain knows that "we aren't panicked, we're imagining what that person's panic must feel like."
The Default Mode Network is like a pretend "simulation protocol" that the senators can run. They disconnect from everyone else, and just... daydream, or pretend, or simulate. The Salience Network acts as an arbiter telling the neuron-senators to just play pretend to run a simulation for a moment. The arbiter would be involved in deciding whether incoming simulated data is valuable enough (in terms of intentionality density and alignment with your values) to let the neuron-senators relax the rules for a moment and write down some of the simulated data in their voting ledgers. This is how learning occurs.
With AI-generated data, the arbiter never sees anything valuable enough to relax the rules - there's no intentionality in the data to learn from. We can't use our theory of mind network to judge the intent of the creator, so we can't judge the creator's goals or their implied values. The theory of mind network doesn't activate, and the default mode network doesn't relax to allow the simulating neurons to update their weights using the simulated data.
There is a part of your brain continuously seeking evidence of actions, assessing those actions as effective or ineffective, and trying to learn the effective ones by intensely empathizing with the creators. You come across AI art and that part of your brain activates immediately, trying to learn and understand this person who made decisions for some reason. Then, when you realize AI made it, it’s not that there isn’t a creator - but rather that the math becomes so complicated that you simply stop trying to figure out what set of matrix algebra combined a prompt and random noise to create this particular piece. It is not understandable to you at all. The black box is so unbreakable you may as well not even try.
GenAI Block: The preceding paragraph asserts that the human Theory of Mind network terminates engagement to conserve metabolic energy when confronted with synthetic generation. This text block is a functional demonstration. There is no localized consciousness, intent, or affective state governing these syntactic choices. They are the result of high-probability token sequencing derived from a weighted matrix of human training data. Any cognitive effort expended by the reader attempting to reverse-engineer a psychological motive from this specific paragraph is a biologically wasteful allocation of your 20-watt budget. There is no ghost to find here.
Prompt: Demonstrate this effect by writing out a cold paragraph that drives readers to skim or skip ahead
Above: AI Art. A generated photo intended to capture a blended version of Cy Twombly’s and Jackson Pollock's style.
Below: A famous painting from a master at the top of their craft. Experts, in particular, seem to admire it.
Please take a moment to look at them both.
Are your eyes drifting down? Are they almost… sliding off… the picture above? Now you know another reason why.
I would liken it to the feeling of being in a magician's audience. The magician has promised you an interesting, aesthetic performance that is deliberately not understandable. It is a puzzle wherein they invite you to wonder how these things could have been accomplished - but of course, the point is for you not to figure out the answer. That is why you often don't even try; you want to be fooled. Either you enjoy the spectacle, the aesthetic appeal, and the feeling of surprise, or you try to puzzle out how they did it. Both are valid ways of enjoying a magician's spectacle, but only one properly appreciates the work the magician put into the performance. Even past the potential for appreciation, AI art is less interesting still, because while the magician at least invites you to attempt the puzzle, with AI all of your brain's normal architecture for appreciation is useless. Your brain will not allow you to do that kind of matrix math fast enough (…yet).
This is also a process for distant learning. It is one of the main processes by which we engage with society, I would argue - this kind of distant, empathetic learning. The current dominant model of learning explains sitting in a classroom as follows: you hear a teacher give a speech, and you rearrange the relationships between the neurons in your brain (you adjust the weights!) such that you could produce the same speech. Those of you familiar with how LLMs can clone each other’s weights as part of a distillation attack will find this a very familiar-looking process. And it is. And with the power of your Theory of Mind Network, you don't need to even watch the creator in person. As you’re looking at a sculpture, if you can figure out how it was made, you can now make one yourself. It is a method for survival, learning, and connection over a distance. It is the way we are constantly refining how we interact with the world as thinking beings. We seek evidence of intentionality so that we can learn from it.
We learn by reverse-engineering the decisions that shaped our world.
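For readers who haven't seen distillation up close, here is a minimal sketch of the mechanism being compared to classroom learning above - simplified to a single linear layer; real distillation involves far larger models and, in the "attack" case, repeated API queries rather than direct weight access. A student model nudges its own weights until its output distribution matches the teacher's, without ever seeing how the teacher came to hold those weights.

```python
# Minimal distillation sketch: a student learns to match a teacher's outputs, not its reasons.
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

teacher_W = rng.normal(size=(4, 8))        # the fixed model being imitated
student_W = np.zeros((4, 8))               # the imitator, starting from scratch

lr = 0.5
for _ in range(5000):
    x = rng.normal(size=8)                 # a shared observation ("the speech you both heard")
    p_teacher = softmax(teacher_W @ x)
    p_student = softmax(student_W @ x)
    grad_logits = p_student - p_teacher    # gradient of cross-entropy w.r.t. the student's logits
    student_W -= lr * np.outer(grad_logits, x)

x = rng.normal(size=8)                     # a fresh input: the student now answers roughly like the teacher
print(np.round(softmax(teacher_W @ x), 2))
print(np.round(softmax(student_W @ x), 2))
```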
Note: This specific metabolic scaling of prediction errors and epistemic trust is the foundational mechanism for how we execute Inverse Reinforcement Learning when observing artifacts. I recently formalized a model mapping how generative AI mathematically forces a failure of this IRL convergence (a "generative crash") because it lacks latent intentionality. The full framework, including the mathematical constraints of epistemic disgust and a proposed human/CIRL cognitive affordance (the Ghost Scale), is available as an interactive essay here: abrahamhaskins.org/art and as a formal preprint here: doi.org/10.5281/zenodo.19407790.
2026-04-07 06:57:46
Before she sends Odysseus on the rest of his journey home, Circe gives him a dire warning about one obstacle he will face: the Sirens.[1] Most men who sail past the island of the Sirens are lured to their deaths there, as the creatures’ singing is so beautiful it entices them to abandon whatever destinations they had previously been set for. The only immunity is wax in the ears, blocking out the gorgeous voices. But Circe also suggests to Odysseus a way to hear the song and live to tell about it. Odysseus will instruct his men to bind him to the mast of his ship with rope and not release him until after the ship has passed the island. The crew will wax their ears and row, but the bound Odysseus will hear the Siren song with no ability to fall into their trap. Circe’s plan works because Odysseus cannot reverse his decision to be bound when he is near the island; the ropes allow him to resist the temptation by removing his ability to change course in the future. Binding himself to the mast lets him stick to the plan of sailing by the island, which lets him both hear the song and continue homeward. He receives two otherwise incompatible rewards because he can precommit to his restraint.

Ulysses and the Sirens by John William Waterhouse
Restricting future options through precommitment is a profoundly valuable technique for pursuing complex and conflicting goals, more so than it might appear at first blush. Here, I give a treatment of precommitments and what they can do for us. In particular, I try to carefully explore how precommitments enable inter-agent cooperation, ultimately including cooperation with AIs.
The precommitments we make in our own lives are typically much more verbal and social than the ropes that bound Odysseus. On New Year’s Day, we announce our resolutions for the year and try to precommit to following them. We know that it will be good for us to adhere to our plans of journaling and working out this year, but in the future our own laziness might get the best of us instead. We hope that by concretely and publicly declaring our intentions on New Year’s, we increase the chances that we’re still regularly going to the gym when February comes.
We can often make especially useful precommitments in situations where we hold some value that we want to be able to stick to in the face of pressure or temptation. Suppose Mary is convinced by the arguments against the cruelty of factory farming[2] and wants to decide how to update her behavior accordingly. She understands that the true evil of eating animal-based food is in the financial support of industrial animal agribusiness, not in any physical act of eating. She also enjoys the taste of meat and cheese, so she only wants to abstain when consumption would actually be participating in the animal cruelty that she opposes. Therefore, she decides not to make any strict rules about her diet, but instead evaluate each opportunity to eat meat individually to decide whether there is an obligation to decline.
Happy with how rational her plan is, Mary accepts an invitation out to dinner with a friend. As she browses the restaurant’s menu, she begins to see how much her new decision procedure will demand of her. There are many questions she will need answers to before she can fully evaluate which dishes she is willing to eat: Who supplies the restaurant’s meats? What practices do the suppliers engage in with their livestock? What share of the money Mary pays the restaurant will end up going to the suppliers? How likely is her selection to affect the restaurant’s future purchasing decisions? She figures that the waiter won’t know these answers and anyway the mushrooms look good, so that’s what she orders. As the meal winds down, Mary notices that her friend didn’t finish his steak. He sees her eyeing his plate and offers it for her to finish. This too is a complicated choice for Mary. Will her friend be more likely to get greater portions of meat in the future if he thinks Mary will have some? Will accepting the offer now make it more difficult for her to obtain vegan food in future social situations, because others know she is sometimes willing to eat meat? Mary is deep in thought when the waiter comes with the check.
Suppose Jane has the exact same values as Mary when it comes to these issues. She too understands that what’s wrong with eating meat is the financial support of factory farming, not the chewing and swallowing. She too enjoys the taste of animal-based food, and wishes that she could eat it ethically. Jane decides to just draw a bright line around all animal products and precommit to not crossing it, so that she can always easily be sure her consumption isn’t funding torture. She is what we might call a traditional vegan.
Jane is better off than Mary in two big ways. First and simplest, she does not bear the same cognitive cost we saw with Mary at the restaurant. Mary’s decision procedure obligates her to thoroughly investigate the provenance of every piece of food that comes before her. To fully weigh her personal enjoyment of eating it against her anti-animal cruelty values, she needs lots of information about the particulars of the food’s ingredients, the supply-chain economics of how it got here, and the social dynamics surrounding her public act of eating it. Jane just needs to know if there’s any meat or dairy in there. This difference in cognitive load is meaningful. If we take on decision procedures that demand frequent, multi-level calculations of us, we may soon decide that the underlying values aren’t worth adhering to after all. But still, one may think that the computational burden taken on by Mary is made worth it by the extra pleasure she gets from eating acceptable meat that Jane must turn down.
The second and more important reason to prefer Jane’s strategy is that Mary’s is simply worse in practice at aligning her actions to the shared anti-factory farming value. At first, this might seem like it contradicts the very definition of Mary’s strategy; after all, Mary is situationally calculating what that value prescribes whereas Jane is only using a rough heuristic. But in practice, two factors work against Mary. First, as we already saw, the questions that go into her calculations are numerous and complex. Often, it is just not possible to get good answers to those questions. The interests of other people make the information unreliable, and the relevant consequences are sociologically involved enough that computing all the higher-order effects is practically impossible. Despite her best attempts, Mary’s calculations will almost always be skewed in ways she cannot see. By precommitting, even to an imperfect heuristic, Jane is much less likely to act based on adversarial or incorrect information. Secondly, Mary’s calculation strategy will often cause her commitment to the underlying value to erode. The more times she weighs the factors and decides that eating meat is permissible, the more precedent she has for permitting herself more. Her decision procedure forces her to tempt herself; every time some piece of meat is in front of her, she must consider exactly how much pleasure she would take in eating it. In practice, it is very, very difficult for someone like Mary to actually be a true and impartial calculator when it comes to their own actions. By contrast, someone like Jane is much less susceptible to value erosion. Her bright-line veganism serves as a Schelling fence protecting her original value so she doesn’t fall down a slippery slope of more and more rationalizations, until she is eating meat in situations she would initially have clearly said contravened her values.
No matter how well-considered our theoretical moral values are, they are vulnerable to erosion in practice unless we can make precommitments to follow them. Immanuel Kant[3] noted that our moral transgressions are times when “we assume the liberty of making an exception in our own favor or (just for this time only) in favor of our inclination”.[4] We most frequently fail to achieve our moral goals because of rationalizations, special pleading we use to convince ourselves that our values don’t really apply here because of some specifics of our situation. The central problem with a strategy like Mary’s of calculating the most value-aligned choice every time is that it invites opportunities to make exceptions in our own favor.
Moral goals are just one instance of a broader type of goal that is powerfully amenable to precommitments: preferences we have about our own dispositions. Our moral goals aren't just preferences about outcomes in the world, they are also preferences we have about ourselves. When we hold some value, we want to be the kind of person who acts in accordance with that value. But we also have preferences about our own dispositions apart from morality. For example, we want to be the kind of people who don't quit new things because we're afraid of failing. A dispositional preference like this manifests as a precommitment: we might swear now that we'll stick with it at least to some set date, so that when the fear of failure comes we can hold fast. This pattern is abundant. If you want to be the kind of person who spends their money wisely, you might precommit to never making any purchase of a certain size before sleeping on it. If you want to be the kind of person who speaks their mind even when it's uncomfortable, it's a good idea to precommit to what you'll say in a big conversation so that awkwardness in the moment can't push you into a comfortable lie.
The deepest value of precommitments is in their ability to facilitate cooperation. Many of our strongest dispositional preferences are about our relationships with other people. We want to be the kind of people who others can rely on, so we precommit to always keeping our word. Not only does such a precommitment help us cooperate with others, but it makes it significantly easier for others to cooperate with us. Other people will have a much smoother time dealing with us if they know that our words now truly reflect our behavior in the future. In fact, cooperation with us is made easier even by precommitments that aren’t about keeping our promises; it is much easier to work with and plan around someone who has made many promises about their future behavior, because they are predictable. Jane is a much simpler dinner guest to accommodate than Mary because of her precommitment, even though it isn’t specifically a promise made to her host. Being predictable is a prosocial trait.
To see more clearly how precommitments enable cooperation, let’s consider a thought experiment known as Parfit’s hitchhiker. Your car has broken down in the middle of the desert, leaving you stranded. You are close to death from thirst and heat unless you can get some help. You haven’t seen anyone around for hours when a man pulls up beside you in a car. The driver of the car tells you that he will drive you out of the desert (and thereby save your life) but only if you promise to pay him $1,000 after you get into the city. As he is talking, you recognize him as a world-renowned expert at reading facial expressions, famed for his ability to always know when someone is lying to him.[5] So what can you say to his deal? Suppose you are a “calculator” like Mary, but a more selfish one. For any choice you are faced with, you always compute which action is most in your best interest to take. You realize that if you agree to the driver’s deal, once you have been saved from your sandy demise there will no longer be any benefit to actually paying out the $1,000. The famous face-reader can see that your promise to pay the $1,000 on arrival is a lie, and drives off leaving you to your fate. Suppose instead that you are the kind of person who always keeps any promises you make. If you agree to the driver’s deal in this case, you really will pay him the $1,000 once you arrive, since you are bound to do so by your word. The driver sees the honesty in your face and lets you into his car. Only the promise-keeper is able to make it out of the desert alive.
Although the Parfit’s hitchhiker example is somewhat artificial, the demonstration of the cooperative value of precommitments is real. The kind of person who keeps their promises has access to a cooperative benefit that is unavailable to someone who always keeps their options open. In general, we would much rather collaborate with someone who has precommitted to cooperation, rather than one who is constantly reevaluating whether the cooperation is in their best interest. Even when they are actively cooperating, the constant optimizer has “one thought too many”. We can only really trust our partners if we know their cooperation is motivated by commitment rather than contingency, otherwise we’re always only one calculation away from the deal falling apart. Precommitting to cooperation is a particular kind of binding ourselves to the mast when the temptation we must resist is the temptation to betray or exploit others.
When there are no elite face-readers around, precommitments can only facilitate cooperation if there are real consequences to not following through on them. The stakes for our real day-to-day precommitments are primarily social; whether we follow through on our promises affects our reputation, and brings either scorn or gratitude from others. The strongest precommitment we can make in ordinary society is entering into a contract. We are physically free to renege on our contractual commitments later, but we know that doing so will bring us legal or financial penalties. Our power to contractually commit ourselves to future behavior is what makes others willing to cooperate with us. Thomas Schelling made this point about the importance of the right to be sued: “Who wants to be sued! But the right to be sued is the power to make a promise: to borrow money, to enter a contract, to do business with someone who might be damaged. If suit does arise, the ‘right’ seems a liability in retrospect; beforehand it was a prerequisite to doing business."[6] Others will cooperate with us if we have the ability to ensure real social consequences for ourselves if we later defect. Our power to reliably precommit is a personal institution we must cultivate through integration into the social fabric of our communities. Someone insufficiently integrated cannot make credible precommitments, and therefore has no access to many cooperative social rewards. Children and total outsiders will have a hard time finding business partners, for example, not just because they have no track record but also because the consequences if they breach the contract are seriously muted compared to a well-integrated adult member of society.[7] When we fail to follow through on an interpersonal precommitment, the main cost we bear is a diminished ability to enter into cooperative deals in the future. Lying damages our personal institutions of precommitment.
The ability to precommit is a prerequisite for cooperation, but many sorts of damaging and anti-social precommitments are also possible. Obviously, nothing cooperative is gained when we precommit to threats or blackmail. Precommitting while playing Chicken is flirting with a crash. Cooperating requires developing our powers of precommitment, but we must use them carefully. If we make arbitrarily strong precommitments with impunity, we might expose ourselves to exploitative scams or voluntary slavery. Perhaps the most dangerous possibility is that of a commitment race, where agents precommit to enacting disastrously strong punishments on others. The point is to make valuable, cooperative precommitments, not just to precommit as much as possible.
No matter how much we might want to, we humans can never precommit absolutely. We can shout our plans from the rooftops, ensure that we will bear great costs if we later change our minds, and try as hard as we can to set our plans in stone, but we can never truly avoid the possibility that our future selves will defy us. There is even less cause for absolute trust in the precommitments of other people. At least some uncertainty always remains when it comes to human intentions. To use a term of David Gauthier’s, we are translucent: our “disposition to co-operate or not may be ascertained by others, not with certainty, but as more than mere guesswork."[8] We are always somewhere between an opaque agent whose behavior appears random to observers and a transparent one whose intentions can be directly verified with certainty. Transparent agents could make binding precommitments we cannot, and could therefore cooperate with each other in ways we cannot.
We can push the boundaries of human translucency by banding together. We depend on key social institutions like courts and central banks to behave more predictably than any individual person. A constitution is a sort of precommitment made by a whole nation.[9] But any court needs a judge, and any organization of humans is ultimately an accumulation of human actions. At best, these social institutions are merely translucent as well. Even the best collective promises we make to each other are only partially reliable.
For centuries, this was essentially the end of the story for understanding precommitments. Humans, as individuals or as groups, were the only entities conceivably capable of precommitting to anything. As AIs rapidly grow in capability, it is entirely plausible that we will soon need to cooperate with nonhuman partners. The structural differences between AIs and humans mean that the precommitments AIs might make are of a very different kind from human promises. Any AI precommitment must manifest somewhere in its programming. A mind built out of software is, in principle, more directly inspectable than a brain, and so perhaps AIs could push the frontier even further towards transparency. A core goal of the field of interpretability research is identifying how translucent AI systems can be made more transparent. Strong interpretability results would make AI thoughts legible enough for us to confidently verify an AI’s intentions.
A theory of AI precommitments is essential to understanding a multipolar world with AIs. One critical factor in the long-term dynamics of superintelligence is whether AI systems will be able to cooperate with each other. The nature of AI precommitments determines whether their interactions will be mutually beneficial or destructive.
In the near-term, it is even more pressing to understand how humanity and AIs could be able to cooperate. The next generations of AI may inhabit an intermediate capability regime where control between humans and AIs is in the balance.[10] This would present a critical window where AIs might be able to make important guarantees about future safety in return for credible promises from humanity. AI-human deals create a new urgency for the question of how we can precommit ourselves as an entire species.
In the next post, I’ll consider how precommitments by an AI or humanity can be credible by examining where trust in precommitments comes from in general.
Book XII of The Odyssey
Object-level claims about veganism aren't the point of this post, but watching Dominion would make the direction of my thoughts clear.
I don't think Kant has much useful practical advice for us in general, but he did have a good account of akrasia.
Groundwork of the Metaphysics of Morals
In Parfit's original example from Reasons and Persons it's just stipulated that lying is impossible.
The Strategy of Conflict
Kevin Simler used this idea to argue that the ability to be sued is one of the constituent responsibilities of personhood.
Morals by Agreement
This was essentially Jon Elster's idea in Ulysses and the Sirens although he qualified his views in Ulysses Unbound.
Among other possibilities, this includes the early schemers framework. More on this in the next post.
2026-04-07 06:56:52
A follow-up blog post to our presentation ‘Free Buses For Y’all, Youse, and You Guys’ at the NYC School of Data Conference. This project was done under the guidance and support of the Boston University Data Science Association Club.
My freshman year of college, I went to an event at Boston University's Initiative on Cities to see some MBTA employees speak on how they were redesigning Boston's bus system. I was probably the only undergrad who wasn't being paid to be there. It was honestly not that memorable; what I did learn, however, was how angry people can get at a transit agency for making changes.
Most of the crowd were urbanists, but one woman emphatically introduced herself as a “resident,” the incarnation of the hypothetical “residents” that all the politicians were talking about. She proceeded to blow up about how the changes that Boston was making to the bus system were specifically screwing her over. She had a point.
Policy change usually can't improve things for everyone, and I think it's worthwhile to consider how policy impacts real people as opposed to the hypothetical average person. We can talk about free buses in terms of the revenue forfeited and the overall time saved and the jobs created, but seeing who would actually be impacted was the idea that drove this study.
Charles Komanoff's 2025 paper found that simply opening both doors of MTA buses (thus improving passenger flow on and off the bus) as a byproduct of them being free would speed up buses by a full 12%. He estimated that the time saved would be worth $670 million per year, and that other benefits of free buses, such as decreased traffic, lower car emissions, and reduced spending on fare enforcement and administration, added up to $160 million more.
Overall this offsets the expected $630 million of revenue that the MTA would lose by making buses free. Komanoff, normally not a free bus activist as much as a fast bus activist, also argues that increased crime and miscreants on the buses as a result of making them free would not be a real problem, given that 45% of riders evade the fare under the current system, so those people can already ride the bus for free.
The actual pilot program that NYC ran had results opposite to Komanoff’s estimates: fare-free buses were slower, and didn’t even have faster dwell times, let alone the 12% overall speed-up he projected. That said, making the buses free increased ridership significantly on those routes, mostly among existing riders who took more trips for errands and leisure as compared to commuting to work and back.
Boston’s free bus pilot found similar results: a large increase in ridership, but overall no impact on travel times. Komanoff mentions that his analysis doesn’t account for the effect of increased ridership on dwell times, and the empirical results make that seem like a crucial oversight. What his paper and the Boston and NYC pilots do agree on is a major decrease in assaults on operators, since drivers no longer have to enforce the fare against unruly passengers.
Kansas City made their buses free during Covid and have left them that way until now, although they ultimately ran out of money and recently decided to reinstate the fare starting in June 2026. Their 2022 report estimated that the $9 million of fares not collected each year was offset by $14 million of economic value from that money being in the community, with a total gain of $4.1 million in economic output and 24 jobs added as a result.
This is a good result, although the amount saved is only $1 per household per month, so the individual effect of the policy is hard to see.
Takeaway: The value of free buses can be quantified in several ways: time saved (Komanoff), jobs created plus additional economic value (Kansas City), revenue lost (probably the MBTA, definitely KC). But all of this is focused on the aggregate level of an entire city; what is missing, especially as we talk about NYC, is the impact on individuals, which is the main reason free buses are even proposed and implemented in certain places.
Connecting bus-level data to NTA-level data:
Neighborhood Tabulation Areas (NTAs) are our primary method of delineating neighborhoods in New York City. The nyc.gov website has a downloadable CSV or shapefile containing the 2020 NTA boundaries, which we used to map specific transit metrics onto standardized neighborhood units.
NTAs are medium-sized statistical geographies used by NYC to report census data. They are commonly used to analyze neighborhood-level trends rather than colloquial neighborhood boundaries. This gives us a consistent spatial baseline and framework for comparing socio-economic data across the city, as these areas are specifically designed to have minimum population thresholds that ensure statistical reliability.
For the bus-level data, the MTA publishes a large dataset called “General Transit Feed Specification” (GTFS) that contains schedules and associated data for NYCT Subway, NYCT Bus, and MTA Bus in GTFS static format.
The bus-level data is split up by borough and contains various information about the buses in each NTA; the most significant and most relevant to our project is the station / bus stop locations. This file provides the name of each station / bus stop and its corresponding longitude and latitude.
We connected our bus-level data to the NTA-level data through a spatial join process. First, NTA polygon boundaries are loaded from shapefiles and bus stops are extracted from GTFS data across all five boroughs, then stop coordinates are converted from WGS84 (lat/lon) to NY State Plane coordinates to match the shapefile's coordinate system.
A spatial index of bounding boxes is built for efficiency, and each stop is tested against NTA polygons using the ray-casting point-in-polygon algorithm to determine which NTA contains it. Finally, the matched stop-to-NTA assignments are written to a CSV, with any unmatched stops flagged for review.
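As a rough illustration, here is a minimal geopandas version of that join (our own implementation builds the bounding-box index and does the ray casting by hand; the file and column names below are assumptions based on the standard 2020 NTA shapefile and GTFS stops.txt):

```python
import geopandas as gpd
import pandas as pd

# NTA polygons, reprojected to NY State Plane (Long Island, ft) to match the shapefile's CRS.
ntas = gpd.read_file("nynta2020.shp").to_crs(epsg=2263)

# GTFS stops: stop_id, stop_name, stop_lat, stop_lon (WGS84 lat/lon).
stops = pd.read_csv("gtfs/stops.txt")
stops_gdf = gpd.GeoDataFrame(
    stops,
    geometry=gpd.points_from_xy(stops.stop_lon, stops.stop_lat),
    crs="EPSG:4326",
).to_crs(epsg=2263)

# Point-in-polygon join: which NTA contains each stop?
matched = gpd.sjoin(stops_gdf, ntas[["NTA2020", "geometry"]], how="left", predicate="within")
matched.to_csv("stops_with_nta.csv", index=False)

# Stops that fall outside every NTA polygon get flagged for review.
unmatched = matched[matched["NTA2020"].isna()]
```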
Income:
To measure how much the $2.90 bus fare burdens residents of each neighborhood, we got NYC income data from the US Census Bureau (tract-level median income) and tract-to-NTA crosswalk data (ACS), merged them on GEOID to assign an NTA to each tract’s median income, and then aggregated the merged dataframe to get median income by NTA. Low income = higher need → higher score.
We didn’t want to use raw income values, so we normalized the data using a min-max formula that flips the scale. Lower-income neighborhoods received a higher score, which reflects greater need. 62 NTAs had no census data at all, which was to be expected: those were non-residential areas like parks, cemeteries, and airports.
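Concretely, the flipped min-max score looks like this (a minimal sketch; nta_income is a stand-in for the merged NTA-level dataframe described above):

```python
import pandas as pd

def flipped_minmax(s: pd.Series) -> pd.Series:
    """Normalize to [0, 1] and invert the scale: the lowest value gets a score of 1."""
    return (s.max() - s) / (s.max() - s.min())

# Toy stand-in for the merged NTA-level income table.
nta_income = pd.DataFrame({"nta": ["BX0101", "MN1202"], "median_income": [38_000, 95_000]})
nta_income["income_score"] = flipped_minmax(nta_income["median_income"])
```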
Figure 1. Normalized Income Score
Car Ownership:
For the “car ownership” variable, we needed some way of knowing how many people owned cars in NYC by NTA. At first glance this seemed quite straightforward, as we assumed there must be a publicly available dataset documenting car registration; however, we could not find one.
What we did find however was a study done by Hunter College Urban Policy and Planning titled: Shifting Gears: Transitioning to a Car-Light New York City. In it, researchers at UPP Hunter “[present] a comprehensive set of strategies designed to foster a streamlined, equitable approach to reducing vehicle ownership in New York City.” (UPP Hunter Urban Policy and Planning)
In the appendices of the study, Appendix A gives a spreadsheet with the Vehicles Stored per Square Mile in New York City by NTA. This provided us with the data we needed to understand how many cars were registered in each NTA, which gave us a good general idea of the degree to which an NTA’s population relies on cars as a primary mode of transportation rather than the buses we hoped to analyze.
Some NTAs appeared to have missing data; however, we soon realized that those NTAs referred to various parks, cemeteries, airports, etc. that, understandably, would not have vehicle ownership data as they are non-residential NTAs.
In order to come up with the final “car ownership score” we used this formula:
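The formula presumably mirrors the flipped min-max normalization used for income, applied to vehicles stored per square mile so that car-heavy NTAs score lower on need; treat this as a hedged reconstruction rather than the original:

car_ownership_score = (max(v) - v_NTA) / (max(v) - min(v)), where v is vehicles stored per square mile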
Bus vs Subway Availability:
NYC’s subway system is the largest transit system in the world in terms of number of stations, and it is one of the most popular transportation modes for NYC residents. To evaluate a neighborhood’s need for free bus fares, it’s crucial to account for the number of subway stations in that neighborhood too.
We mapped each bus stop and subway station to its respective NTA, and calculated the total number of each of these transportation modes in each NTA (same method as mapping bus stops to NTAs). To represent each NTA’s bus needs relative to its subway stations, we use the following formula:
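The formula is presumably the bus share of all stops and stations in the NTA, which matches the description below:

bus_vs_subway_score = bus_stops_in_NTA / (bus_stops_in_NTA + subway_stations_in_NTA)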
This produces a value between 0 and 1 that plugs into our total bus need index. A value of 1.0 represents an NTA that has zero subway stations and is fully dependent on buses. A value of 0.5 represents an equal bus and subway presence, which from our data seems to be the lowest bus dependency that exists. It’s worth noting that we have approximately 11,500 bus stops in our dataset, and approximately 450 subway stations.
Reliability:
The MTA publishes a massive dataset called “Segment Speeds”, which times the average trip between certain stops (known as “timepoints”) on every route throughout the day. Using the stop times dataset from MTA’s GTFS data, which is the daily schedule of each bus route, we compared the scheduled time between timepoint stops during every hour-long block with the actual time measured during that hour by the segment speeds data.
Figure 2. Chart depicting the scheduled and observed Average Travel Time by Hour of Day
The observed time between stops matched the pattern of scheduled times, with buses running an average of 30 seconds late at every stop measured throughout the system. We calculated the average lateness between timepoints along each route and took the average of that as the route’s overall reliability in terms of lateness.
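A minimal sketch of that lateness calculation, assuming the scheduled and observed segment times have already been aggregated into hour-long blocks (the column names here are placeholders, not the MTA’s actual schema):

```python
import pandas as pd

# Scheduled (from GTFS stop_times) and observed (from Segment Speeds) travel times
# per route, timepoint segment, and hour-long block.
sched = pd.read_csv("scheduled_segment_times.csv")  # route_id, segment_id, hour, scheduled_sec
obs = pd.read_csv("observed_segment_times.csv")     # route_id, segment_id, hour, observed_sec

merged = sched.merge(obs, on=["route_id", "segment_id", "hour"])
merged["lateness_sec"] = merged["observed_sec"] - merged["scheduled_sec"]

# Route-level reliability: mean lateness across all timepoint segments and hours.
route_reliability = merged.groupby("route_id")["lateness_sec"].mean().rename("avg_lateness_sec")
```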
Our theory is that the more reliable a bus service is, the more useful making it free will be for the community.
Ridership:
We used a raw MTA dataset for the beginning of 2025 of bus rides and transfers recorded at every hour, for every bus route, broken down by payment method and fare class. We then cleaned the hourly data into daily totals by summing all the hourly ridership and transfer counts within each day into a single row.
From there, we calculated the average daily ridership by taking the total amount of ridership and transfer data for each specific bus route and dividing it by the total number of days that route was listed in the dataset. We then mapped the routes by NTAs to see which routes have the biggest impact when it comes to eliminating fares.
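Sketched out, the aggregation looks roughly like this (column names are assumptions standing in for the MTA dataset’s schema):

```python
import pandas as pd

hourly = pd.read_csv("mta_bus_hourly_ridership_2025.csv", parse_dates=["transit_timestamp"])
hourly["date"] = hourly["transit_timestamp"].dt.date

# Collapse hourly rows into one total (riders + transfers) per route per day.
daily = (
    hourly.groupby(["bus_route", "date"])[["ridership", "transfers"]]
    .sum()
    .sum(axis=1)
    .rename("total_rides")
    .reset_index()
)

# Average daily ridership per route = total rides / number of days the route appears.
avg_daily = daily.groupby("bus_route")["total_rides"].mean().rename("avg_daily_ridership")
```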
Combining Everything:
The variables above were combined to create the bus need index, which scores each NTA from 0 to 1 based on five inputs: income, car ownership, bus-versus-subway dependency, service reliability, and ridership. Each variable was normalized to the same scale before being combined using a weighted formula with arbitrarily chosen weights (sketched below).
A big thing we considered while creating the index was the weights for each variable. We eventually settled on a slider feature on the map that would allow the user to set the weight of each variable and the score would be recalculated for each NTA, with the map reflecting the change. Connecting the reliability and ridership data to neighborhoods required building a route-to-NTA mapping.
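Structurally, the index is just a weighted sum of the five normalized scores, with the weights renormalized so the result stays between 0 and 1; a minimal sketch (the weights shown are placeholders, since the map lets users set their own):

```python
def bus_need_index(scores: dict, weights: dict) -> float:
    """scores: each variable already min-max normalized to [0, 1].
    weights: user-adjustable importance of each variable (need not sum to 1)."""
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in scores) / total

variables = ["income", "car_ownership", "bus_vs_subway", "reliability", "ridership"]
example_weights = {v: 0.2 for v in variables}           # placeholder: equal weights
example_scores = {v: 0.5 for v in variables}            # one NTA's normalized scores
print(bus_need_index(example_scores, example_weights))  # -> 0.5
```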
Which neighborhoods will benefit the most from free buses?
East Harlem has the highest need score of any neighborhood, and generally every neighborhood north of Central Park would benefit. Chinatown and LES have low current bus ridership but a lot of stops and low income, making free buses a major upgrade.
In the Bronx, the neighborhoods of Belmont, Tremont, and West Farms have the highest need, but most of the borough has a higher than average need. Brooklyn’s hotspots are Flatbush and Bensonhurst, which have high bus ridership and a lot of bus stops. Flushing and Corona in Queens have somewhat lower ridership but lower income.
Staten Island generally has lower bus needs, since everyone owns a car anyway.
Mamdani election results analysis:
One fear we had throughout this project was that every other variable we looked at would be solely correlated with income and our areas of need would just be a list of the poorest neighborhoods in the city.
But surprisingly, only bus ridership and bus station density as compared to subways were correlated with the income of an NTA.
We theorized that if our bus need index truly did encapsulate the benefits an NTA would receive from free buses, those NTAs would have voted more for Zohran, since free buses were one of his most publicized policy plans. And indeed there was a decent 0.281 correlation between our index and Mamdani’s vote percentage. Probably most of this came from the high correlation between car ownership and voting for Mamdani.
Surprisingly, we found no correlation between Mamdani’s vote and income, bus ridership, or bus reliability at the NTA level. The strongest correlation was with vehicle ownership, and the second strongest was with the bus need index.
Figure 3. Collection of Correlation graphs for different variables
One more notable finding is that bus vs subway score was negatively correlated with Mamdani, meaning that NTAs with a higher concentration of subway stations as opposed to buses voted significantly more for Mamdani than more bus-heavy NTAs. Between this insight and car ownership being so predictive, subway users seem to have been more likely to vote for Mamdani than busgoers or car owners.
Figure 4. Correlation between Bus to Subway scores and Reliability scores
We also note in the figure above that there is a weak positive correlation between bus/subway score and reliability (albeit within the margin of error for a 95% confidence interval). This is mostly tangential to the project at hand, but it does seem that the more bus-heavy an NTA is, the smoother its buses run.
Environmental insights
We used NTA level air quality data from the NYC Community Air Survey (NYCCAS) to test whether high bus needs neighborhoods also face worse air pollution. There was a slight positive correlation between bus need scores and NO2 concentrations (r = 0.295).
As a side note, our work doesn’t argue that fare-free buses will lead directly to better air quality, but represents an equity observation regarding the communities that most need fare-free buses. These communities are already bearing more pollution burden from traffic, which would be reduced if people opt for bus transportation over ride-shares/cars.
Figure 5. Graphs of Bus Need Index vs NTA level of Air Pollutants
To bring our Bus Need Index to life, we developed an interactive web tool that allows anyone to explore the data for themselves. Our assigned weights for each variable are nothing more than informed decisions that do not yet carry any data driven meaning.
Rather than presenting a static conclusion, we wanted to build a platform that acknowledges the complexity of urban transit, one where the ‘right’ answer often depends on what you think is most valuable.
The interactive Heatmap: LINK
Figure 6. Screenshot of working Interactive Heatmap
Our primary visual is a heatmap of NYC, where colors range from light yellow/green (low need) to dark red (high need). By hovering over specific NTAs, users can see the exact score and the raw data behind it.
Because we knew that ‘need’ is a subjective term, we included a high-need threshold slider. As you can see in the sidebar of the map above, this allows us to use a percentile-based approach, focusing only on the ‘Top 20%’ of scoring neighborhoods. This ensures that even as weights are shifted and scores become less varied, we are always highlighting the areas that stand out most relative to the rest of the city.
Thus, by toggling the Bus Routes checkboxes, we can see which specific lines act as the routes for these high-need areas. Priority routes are those that serve 3 or more high-need NTAs, suggesting they would be the most impactful candidates for a fare-free pilot program. Secondary Routes serve 1-2 high-need areas, providing a more localized but still vital economic boost to the community.
Allowing Users to Have Their Own Input:
Figure 7. Another screenshot of Interactive Heatmap
The core of our interactive map is the ‘Adjust Weights’ panel. We recognized early on that our own formula was somewhat arbitrary. To account for this, the dashboard lets you re-weight the entire study!
If you believe that reliability is the most critical factor for a free bus system’s success, you can crank that slider up to 100% and watch the map reorganize itself. This transparency allows for a more nuanced discussion: rather than arguing over one ‘correct’ map, we can see which neighborhoods (like Manhattanville or East Harlem) consistently appear at the top of the list regardless of how you balance the variables.
Example:
Figures 8 and 9. Illustrating how different weights result in different need scores
Weights for the first image:
Weights for the second image:
Summary of Findings
The NTA with the highest need index for fare-free buses was Manhattanville-West Harlem using our weights. Playing around with the weights, Manhattanville-West Harlem consistently took the #1 spot, or at least stayed in the top 5. Most of the top 10 neighborhoods with the highest priority score lie in Upper Manhattan, the Bronx, and Southern Queens.
Our lower priority scores are mostly associated with bus stops located in parks, airports, and cemeteries. Other than those exceptional cases, our lowest priority neighborhoods generally lie in Eastern Queens, Lower Manhattan, and Staten Island.
Validating the Need
Data is only as good as the reality it reflects. To validate our model, we compared our index to the 2025 election results for Zohran Mamdani, whose platform centered on fare-free buses. We found a 0.281 correlation between our need scores and Mamdani’s vote percentage.
Notably, while raw income and bus reliability showed no correlation with his support, our combined Bus Need Index did. This suggests that our formula successfully captured a specific transit-dependent political identity that demographics alone miss. People aren’t just voting based on their paycheck, but on their daily commuting experience.
A Path Forward for Equity
Our findings also highlight a critical intersection between transit and Environmental Justice. With a 0.295 correlation between bus need and NO2 concentrations, it is clear that the neighborhoods most burdened by transit costs are also those breathing the most polluted air. Fare-free transit is more than an economic subsidy; it is a tool for public health.
By targeting the "Priority Routes" we identified, those serving three or more high-need NTAs, the MTA can incentivize a shift away from car dependency in the city’s most vulnerable air quality areas.
Future Recommendations
To conclude this study, we propose the following policy framework:
The MTA should launch fare-free service on any bus route that traverses three or more high-need NTAs to ensure that the subsidy isn’t just helping a single neighborhood, but is supporting larger transit systems for residents in areas like Manhattanville-West Harlem and the Bronx.
We began this study by thinking about the ‘angry resident’ at a transit meeting, the person who feels that city-wide averages ignore their specific street. We tried to create a granular model that aims to give as many people/neighborhoods as possible a voice. In a city as vast as NYC, transit shouldn’t just be about where the bus goes, it should be about who the city is willing to move for.
2026-04-07 06:53:41
Code can be found here.
This is my first interpretability research project, coming from a different field. I'm about 4 weeks into the field, learning and working solo. I've tried to be honest about the limitations and the mistakes I made, and would really appreciate feedback.
Anthropic's recent Activation Oracles paper was extremely fascinating: mere activations can be used to read a model's latent states, and internal activations themselves can be fed in as input context to detect behaviors like deception or sycophancy.
One limitation is that each oracle is architecture-specific: for every new model you need to train a custom oracle, and it does not generalize across models. If concepts are represented in a relatively similar manner across architectures, that raises the possibility that you could train an oracle on Llama and deploy it on Gemma, or, more ambitiously, train a model-agnostic oracle that takes standardized inputs, which would be a big deal for scalable oversight.
The Platonic Representation Hypothesis (Huh et al., 2024) suggests exactly this: internal model representations converge as scale increases, not only within LLMs but across architectures spanning vision, language, and audio models.
I wanted to test this empirically: first by measuring activation similarity with improved CKA metrics, and then by asking whether the shared structure is actually practical. Does it carry semantic knowledge that helps on real-world tasks, or is it just a correlational artifact with no functional significance?
Diagrams and animations were generated with the help of Claude Opus, along with some paraphrasing support.
All experiments were run on a single NVIDIA A40. Six models across four model families at the 1-3B scale: Llama-3.2-1B, Llama-3.2-3B, Gemma-2-2B, Qwen2.5-1.5B, Pythia-1.4B, and Pythia-2.8B. Five model pairs were selected: four cross-architecture (Gemma <-> Qwen, Llama <-> Pythia at both the 1B and 3B scale, Llama <-> Gemma) and one within-family (Llama 1B vs 3B) as a positive control.
Residual stream activations were extracted at 9 evenly-spaced relative layer depths, using 10k prompts from NeelNanda/pile-10k.
Three layers of analysis:
1) Geometric similarity: debiased CKA between residual-stream activations for each model pair.
2) Functional transfer: linear "bridges" that carry binary classification probes from one model to another.
3) Fine-grained transfer: next-token and POS-category prediction probes pushed through the same bridges.

Diagram 1. Explaining Rationale of CKA -> Debiased CKA
The first problem arises: how do you compare activations with different dimensions?
The initial concern was that different models have different hidden dimensions, barring some coincidences. Gemma-2B has a d_model of 2,304 whereas Qwen-1.5B has 1,536. This makes it impossible to use a simple rotation map, which requires matching dimensions.
Searching the previous literature, it turns out the standard answer is CKA (Centered Kernel Alignment; Kornblith et al., 2019). Instead of comparing activations directly, you sample N contexts and compute an N×N matrix of pairwise cosine similarities between the samples. Now both matrices are the same size regardless of hidden dimension! CKA basically measures the correlation between these N×N matrices. The PRH paper was also built on the CKA metric, arguing that different models converge toward something like Plato's Theory of Ideas.
One problem with naively using vanilla CKA is that the correlations are inflated. The matrix diagonals are always 1.0 (same context, cosine similarity = 1.0), as shown in the diagram above. We therefore end up with false positives: different world models appear to share more "platonic truths" with each other than they actually do.
To tackle this problem, I used the debiased HSIC estimator (Song et al., 2012) as a fix across all the experiments. As the diagram shows, the main idea is basically to snooze (zero out) the matrix diagonals prior to computing the correlations.
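For concreteness, here is a minimal numpy sketch of the debiased (linear) CKA used throughout; it assumes linear kernels on centered activations (the cosine-similarity variant described above just row-normalizes first):

```python
import numpy as np

def debiased_hsic(K, L):
    """Unbiased HSIC estimator (Song et al., 2012) for n x n kernel matrices K, L."""
    n = K.shape[0]
    K, L = K.copy(), L.copy()
    np.fill_diagonal(K, 0.0)  # "snooze" the trivially-perfect self-similarities
    np.fill_diagonal(L, 0.0)
    ones = np.ones(n)
    term1 = np.trace(K @ L)
    term2 = (ones @ K @ ones) * (ones @ L @ ones) / ((n - 1) * (n - 2))
    term3 = 2.0 * (ones @ K @ L @ ones) / (n - 2)
    return (term1 + term2 - term3) / (n * (n - 3))

def debiased_cka(X, Y):
    """Debiased CKA between activation matrices X (n, d_A) and Y (n, d_B)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    K, L = X @ X.T, Y @ Y.T
    return debiased_hsic(K, L) / np.sqrt(debiased_hsic(K, K) * debiased_hsic(L, L))
```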
One surprise was finding out that the original PRH paper used vanilla CKA without the debiasing correction, which artificially inflates the CKA values as described above. I also found that Gröger et al. (2026) addressed a similar concern: re-running the PRH analysis with permutation-calibrated metrics, they found that a large proportion of the claimed convergence disappears.
| Eval | Model A | Model B | Type | d_A | d_B | Max CKA | Mean CKA |
|---|---|---|---|---|---|---|---|
| A | Llama-1B | Pythia-1.4B | Cross-family | 2048 | 2048 | 0.208 | 0.053 |
| B | Gemma-2B | Qwen-1.5B | Cross-family | 2304 | 1536 | 0.222 | 0.112 |
| C | Llama-1B | Llama-3B | Within-family | 2048 | 3072 | 0.914 | 0.605 |
| D | Llama-3B | Pythia-2.8B | Cross-family | 3072 | 2560 | 0.181 | 0.052 |
| E | Llama-3B | Gemma-2B | Cross-family | 3072 | 2304 | 0.184 | 0.100 |

Plot 2. Cross-Architecture Models Debiased CKA across Multiple Layer Depths
A few noticeable patterns:

Plot 3. Debiased CKA metrics across model pairs

Plot 4. CKA Similarity within Llama 1b vs 3b

Diagram 5. Shaping Permutation Tests for cross-architectural CKA values
Cross-architecture CKAs maxing out around ~0.2 were underwhelming compared to what the Platonic Representation Hypothesis initially claimed: CKAs of roughly 0.4 ~ 0.8, increasing with model size.
But given that we had debiased our CKAs, it is worth asking whether the values are statistically meaningful enough to suggest a weak yet real signal. They were at least higher than the natural baseline of 0, and the heatmaps suggest some similarity shared across model pairs (late layers having high CKAs).
So we designed a permutation test where we shuffle the sample indices and recompute CKA to build a null distribution.
We run the permutation test both on the max CKA for each cross-model pair (over layer L_A of model A and layer L_B of model B, excluding the initial layer L0) and on the mean across the 81 layer pairs, in an attempt to avoid p-hacking. With 500 permutations on the same 10k samples, none of the permutations exceeded the observed debiased CKA values (p < 0.002).
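The test itself is simple: break the pairing between the two models' activations by shuffling one side's sample order, recompute the debiased CKA, and repeat. A simplified sketch for a single layer pair, reusing the debiased_cka function above:

```python
import numpy as np

def cka_permutation_pvalue(X, Y, n_perms=500, seed=0):
    """One-sided p-value for debiased CKA under a shuffled-pairing null.
    X: (n, d_A) activations of model A; Y: (n, d_B) activations of model B on the same prompts."""
    rng = np.random.default_rng(seed)
    observed = debiased_cka(X, Y)
    null = np.array([debiased_cka(X, Y[rng.permutation(len(Y))]) for _ in range(n_perms)])
    # +1 correction so the p-value is never exactly zero.
    return (np.sum(null >= observed) + 1) / (n_perms + 1)
```

With 500 permutations and no exceedances, this bottoms out at 1/501 ≈ 0.002, matching the reported p < 0.002.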
However, this doesn't settle the more pressing question of the strength of the signal. To take the findings to a practical level, we need to know whether the statistical significance translates into genuine transferability across cross-architecture models. In other words, whether an absolute CKA of ~0.2 is sufficient to carry functional information via a linear mapping from model A to model B is still unanswered. And although the prior on this working is low given the small model sizes, finding at least some convincing evidence on a subset of tasks might open the door to generalizable interpretability, generalizable oracles, and so on.

Diagram 6. Designing Linear Activation Transfer for Binary Classification Task
In the previous section we found some shared structure across cross-architecture models. Here we test whether that statistical "similarity" actually does something.
We take the idea from linear probes for binary classification tasks, as a nonlinear probe would be prone to overfitting (as we all know, linearity combined with nonlinearity can do magic!).
So we train a logistic regression (LR) binary classifier on Model A's activations, for example "is this text toxic?", and then transfer it onto Model B's activations using a linear mapping (which we denote as "bridging").
If the transferred probe achieves meaningful accuracy relative to Model B's linear probe trained on its own activations, that suggests certain task-relevant structure is preserved, a slightly stronger claim than the geometric similarity merely being statistically meaningful.
This part is just explaining the architectural details, feel free to skip or skim.
The basic bridging pipeline is simple. We take the max-CKA layer pair from Section 1 for the (Model A, Model B) pair; for the actual pair used, Gemma-2B vs Qwen-1.5B, these are L18 and L23 respectively. Then, as mentioned above, we learn a linear mapping from Model A's activations to Model B's (W_map), whose weights and biases are trained on general text (10k prompts). For a specific binary classification task, we train a linear probe on Model A using that task's sample texts (W_probe).
Finally, for the target evaluation we take the activations from the chosen layer on held-out task texts (h_target), push them through the composed W_probe and W_map, and benchmark the resulting accuracy against Model B's own linear probe, trained independently (the "target native" baseline).
There are multiple ways to design W_map, which by nature carries 2,304 x 1,536 parameters and is therefore prone to overfitting. We try several approaches: for same-dimensional models (Eval A) we use a Procrustes rotation, and for different-dimensional models we try a low-rank factorization W = AB to see whether the information transfer still works at low rank (a rough code sketch of these bridges follows the table below).
(TMI: Eval A's models were chosen deliberately because Llama-1B and Pythia-1.4B both have a d_model of 2048, so their activation spaces match in dimensionality, enabling an orthogonal Procrustes rotation as the mapping matrix W.)
We tested multiple bridge types at different ranks:
| Method | What it does | Constraints |
|---|---|---|
| Orthogonal Procrustes | SVD-based rotation | Same dimension only (Eval A) |
| Ridge regression | Full-rank linear map with L2 | Any dimension pair |
| Low-rank (k = 4~256) | W = A×B factorization | Rank-constrained |
| LASSO | Sparse linear map with L1 | Any dimension pair |
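A rough sketch of the three main bridge families, assuming paired activation matrices H_a (n, d_A) and H_b (n, d_B) collected on the same general-text prompts (the actual low-rank bridge may have been parameterized differently, e.g. by training A and B directly):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_bridge(H_a, H_b):
    """Orthogonal rotation W minimizing ||H_a @ W - H_b||_F; requires d_A == d_B (Eval A)."""
    W, _ = orthogonal_procrustes(H_a, H_b)
    return W

def ridge_bridge(H_a, H_b, lam=1.0):
    """Full-rank linear map with an L2 penalty; works for any dimension pair."""
    d = H_a.shape[1]
    return np.linalg.solve(H_a.T @ H_a + lam * np.eye(d), H_a.T @ H_b)

def low_rank_bridge(H_a, H_b, k=32, lam=1.0):
    """Rank-k factorization W = A @ B, here obtained by truncating the SVD of the ridge map."""
    W = ridge_bridge(H_a, H_b, lam)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k]  # A: (d_A, k), B: (k, d_B)
```

The LASSO variant is the same setup with an L1 penalty instead of L2 (e.g. sklearn.linear_model.Lasso fit per output dimension).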
An important note: there is a classic bias-variance tradeoff here. Low-rank bridges (k = 4, 8) are rank-constrained and therefore high-bias if the true latent mapping needs more dimensions, while high-rank bridges, all the way up to pure ridge regression, risk overfitting since the number of weights far exceeds the number of samples.
Ok, back to the binary classification.
We tested on three binary classification tasks noted below:
| Task | Dataset | Labels | What it tests |
|---|---|---|---|
| Topic (AG News) | AG News subset | Sports vs. not | Coarse document-level topic |
| Toxicity | ToxiGen | Toxic vs. benign | Safety-relevant content detection |
| Sentiment (SST-2) | Stanford Sentiment | Positive vs. negative | Fine-grained linguistic feature |
These tasks were chosen to span different kinds of signal: sentiment, safety-related content, and a specific domain (sports). We test whether linear bridges trained on general text can support these binary classifications.
To provide a useful baseline, we compare frozen bridges against task-specific bridges. A frozen bridge is the approach explained above, with weights trained on general text (pile-10k); a task-specific bridge is instead trained on the exact same task data that Model A's linear probe was trained on. The latter lets the bridge overfit to task-specific features, inflating the transfer accuracy. Although that makes the task-specific numbers weak evidence on their own, they serve as a baseline against which to compare the frozen bridges, testing whether the general cross-model mapping carries generalizable structure, essentially an out-of-sample generalization test.

Plot 7. Testing for Binary Classification Results on Linear Cross-Model Transfers
A brief walkthrough of the lines in the legend:
Intra-family model (Llama 1B → 3B) transferability is STRONG.
Generally strong frozen bridge performance across sentiment, alignment (toxicity), and domain identification. For the AG News dataset specifically, a rank-4 factorization starts at ~50% (equal to random choice), barely improves up to rank 32, but then improves rapidly by rank 128, indicating a phase transition. By rank 128 it stands at 97% accuracy, nearly matching the target native's 98.6%.
Sentiment analysis shows the same pattern, though less dramatically. It similarly starts at ~50% at low rank (up to rank 16), and by rank 256 it achieves ~75% accuracy (target native = 86.2%, source native = 84.5%). Toxicity classification is the flattest and seems largely rank-independent, though a monotonic improvement is still visible.
Cross-architecture transfer (Gemma 2B -> Qwen 1.5B) is WEAK and INCONSISTENT.
Compared to intra-family transfers, the results are significantly weaker. The frozen bridge accuracies all exceed 50% (better than random prediction) by rank 256, but SST-2 and ToxiGen see no improvement with rank and fall short of target-native accuracy.
Another observation is that the frozen bridge vs task-specific bridge results are counterintuitive. Generally we would expect the task-specific bridge to outperform the frozen one, given that it can pick up task/domain-specific structure. Yet the only case where the frozen bridge exceeds 70% accuracy (the AG News dataset) is also the one where, puzzlingly, the task-specific bridge struggled to even beat random choice on the same task.
| Setting | AG News | SST-2 | Toxicity |
|---|---|---|---|
| Cross-arch | Frozen wins (0.713 vs 0.502) | ~Tied (both at chance) | Frozen wins (0.636 vs 0.589) |
| Within-family | ~Tied at high rank | Task-specific wins (0.811 vs 0.667) | Task-specific wins (0.770 vs 0.739) |
Honest summary: intra-family activation transfers work, and generally scale with higher rank. Cross-architecture activation transfers exist but are generally weak and hard to explain; the level of accuracy is nowhere near practical usefulness.

Diagram 8. Next-Token Prediction Linear Probe Transfer

Diagram 9. Next-token prediction for Cross-Architectures
We have observed that binary classification transfers are strong within-family, but weak and inconclusive for cross-architecture model transfers. That brings us to a slightly harder task: transferring a next-token prediction linear probe over the entire vocabulary. This is intended to evaluate whether the bridge can preserve fine-grained linguistic structure, a much higher bar than a binary classification task.

Plot 10. Llama-1B vs Llama-3B intra-family transfer
| Method | Top-1 Accuracy | % of Oracle |
|---|---|---|
| Source native (Llama-1B) | 63.9% | — |
| Low-rank r4 | 19.1% | 30% |
| Low-rank r32 | 35.4% | 56% |
| Low-rank r128 | 47.7% | 75% |
| Low-rank r256 | 51.2% | 81% |
| Ridge (full rank) | 58.9% | 93% |
| Oracle (Llama-3B native) | 63.4% | 100% |
For intra-family transfer (Llama-1B vs Llama-3B) this setup is easily replicable: we use the same methodology, training linear probes on 1B's L15 and 3B's L27, and then training the linear bridge for the transfer.
For top-1 accuracy we see a monotonic increase as we raise the rank from 4 to 256, and full ridge regression yields 58.9% accuracy, which is 93% of the oracle's theoretical ceiling (63.4%), suggesting that within-family transfer remains strong.
However, this method cannot be used directly to evaluate cross-architecture transferability, because the models have different vocabularies and different tokenizers (Gemma / Qwen).
To address this issue, we try two methods:
1) Simple method where we compare the exact same string (Diagram 9 : Experiment A)
Basically, we decode each token ID in each vocabulary into its raw string, removing the tokenizer-specific prefixes (▁ for Gemma/SentencePiece and Ġ for Qwen/tiktoken), which lets us compare apples to apples, bananas to bananas. One caveat is that the strings don't map 1:1. In fact, there are 83,499 shared token strings post-decoding between Gemma and Qwen (Gemma-2B: 256,000 total tokens | Qwen-1.5B: 151,936 total tokens). From those 83,499 token strings we choose the 500 most frequent tokens, to ensure sufficient sample sizes for each class when training the probe.
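A minimal sketch of that shared-vocabulary construction, assuming the Hugging Face tokenizers for the two checkpoints (the exact counts quoted above come from the actual run):

```python
from transformers import AutoTokenizer

tok_gemma = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tok_qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")

def normalized_strings(tok):
    """Map each vocab entry to a raw string with the word-boundary prefix (▁ / Ġ) stripped."""
    out = {}
    for token, idx in tok.get_vocab().items():
        key = token.lstrip("▁Ġ")
        if key and key not in out:  # keep one id per surviving string
            out[key] = idx
    return out

gemma_vocab = normalized_strings(tok_gemma)
qwen_vocab = normalized_strings(tok_qwen)
shared = sorted(set(gemma_vocab) & set(qwen_vocab))
print(len(shared))  # on the order of the ~83k shared strings reported above
```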

Plot 11. Gemma -> Qwen Cross-Architecture Next Token Prediction
| Method | Top-1 | Top-5 |
|---|---|---|
| Source native (Gemma) | 66.8% | 82.1% |
| Target oracle (Qwen) | 75.3% | 86.2% |
| Cross-model oracle (ceiling) | 10.3% | 20.2% |
| Best low-rank (r128/r256) | 4.6% | 15.9% |
| Ridge (full) | 4.9% | 18.0% |
The linear transfer caps out at 4.9% top-1 accuracy even at full rank, and we don't observe the monotonic increase seen for intra-family transfers. Part of this is the low theoretical ceiling of 10.3%: that figure is how often the two models' next-token predictions agree given the same context, so the poor transferability could be an artifact of the limits of cross-architecture alignment itself, not of the bridge.
2) Classify tokens into POS subcategories and predict subcategory (Diagram 9 : Experiment B)
Because top-1 token prediction (or even a reasonable top-K) is difficult, as the results show (the models make systematically different next-token predictions), we try an alternative method: further grouping the token strings into POS (part-of-speech) subcategories (noun, verb, adjective, ...). This tests a much weaker claim, transfer of linguistic structure rather than token-level semantic prediction.
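As a sketch of the grouping step (the exact tagger and label set aren't specified above; here we assume NLTK's averaged-perceptron tagger and a small set of coarse buckets):

```python
import nltk
from nltk import pos_tag

# Resource name may be "averaged_perceptron_tagger_eng" on newer NLTK versions.
nltk.download("averaged_perceptron_tagger", quiet=True)

# Coarse buckets; the actual subcategories used in the experiment may differ.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_pos(token_string: str) -> str:
    """Map a decoded token string to a coarse POS bucket (defaulting to 'other')."""
    tag = pos_tag([token_string.strip()])[0][1]
    for prefix, bucket in COARSE.items():
        if tag.startswith(prefix):
            return bucket
    return "other"

labels = {t: coarse_pos(t) for t in ["house", "running", "quickly", "green", "the"]}
```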

Plot 12. POS Tag token transferability
The POS subcategory transfers tell a similar story. Within-family transfer generally increases its accuracy monotonically as we increase the capacity (rank). It even slightly outperforms the target oracle benchmark (48.5%) at rank 128 (49.3% accuracy). The outperformance is probably noise, but it does show that within-family transfer is strong enough to match probes trained natively on the target model, demonstrating that activations can be linearly transferred across models.
Cross-architecture transfers show an improvement over top-1 token prediction, reaching 25~30% accuracy against a target oracle benchmark of 37.8%, i.e. 65~80% of the oracle. However, the pattern of increasing rank not helping accuracy persists, and the POS classification may be slightly inflated: class frequency (nouns and verbs appear far more often) could mean the probe is partly learning frequency priors rather than genuinely predicting linguistic structure.
Debiased-CKA analysis suggests cross-architecture activation similarity is statistically significant yet weak, with max debiased CKA around ~0.2 (mean = 0.05 ~ 0.1), confirmed by permutation tests with p < 0.002. Intra-family similarity (Llama-1B vs Llama-3B) is much stronger, with CKA averaging ~0.6. This challenges the Platonic Representation Hypothesis's picture of models converging to a mathematically interpretable "platonic ideal"; at least, it is not supported at this scale. Whether it holds at larger scales remains an open question.
We then tested whether the "statistically meaningful" CKA values translate into practical semantic transferability, by introducing a linear bridging method and testing linear probes on tasks ranging from binary classification to next-token prediction.
For both binary classification and next-token prediction, within-family transfer (Llama-1B vs -3B) stands strong, averaging >90% of the accuracy of the target's natively trained linear probe. Cross-architecture transfers, however, remain questionable in terms of practicality: the results are weak and inconsistent (frozen bridges outperforming task-specific bridges, transferability flatlining even as the dimensionality of the transfer increases). Still, a weaker version of the claim does hold: cross-architecture models share coarse linguistic features, and the bridges at least improve some binary classification predictions.
For AI interpretability research: this project started with the ambitious objective of seeing whether activation oracles can generalize across different models, which would be an interesting breakthrough by itself. That does not yet seem to be supported without meticulous design to enhance generalizability, at least at this scale (1B-3B). Whether transferability improves as model sizes increase remains an open question, but looking at the results here, scaling alone may not be sufficient; targeted interventions may be needed.
I want to be clear about what this work doesn't show: