2025-01-22 07:24:16
“Within a decade, it’s conceivable that 60-80% of all jobs could be lost, with most of them not being replaced.”
AI CEOs are extremely fond of making statements like this. And because they make these statements we are forced to take them at face value, and then try to figure out what the implications are.
Now, historically, the arguments about comparative advantage have played out across every sector and technological transition. AI CEOs and proponents, though, say this time is different.
They’re also putting money where their mouth is. OpenAI just launched the Stargate Project.
The Stargate Project is a new company which intends to invest $500 billion over the next four years building new AI infrastructure for OpenAI in the United States. We will begin deploying $100 billion immediately.
It’s a clear signal that we will be investing untold amounts of money, Manhattan Project or Apollo mission levels of money, to make this future come about.
But if we want to get ready for this world as a society we’d also have to get a better projection of what the world could look like. There are plenty of unknowns, including the pace of progress and the breakthroughs we could expect, which is why this conversation often involves extreme hypotheticals like “full unemployment” or “building Dyson spheres” or “millions of Nobel prize winners in a datacenter”.
However, I wanted to try and ground it. So, here are a few of the things we do know concretely about AI and its resource usage.
Data centers will use about 12% of US electricity consumption by 2030, fuelled by the AI wave
Critical power to support data centers’ most important components—including graphics processing unit (GPU) and central processing unit (CPU) servers, storage systems, cooling, and networking switches—is expected to nearly double between 2023 and 2026 to reach 96 gigawatts (GW) globally by 2026; and AI operations alone could potentially consume over 40% of that power
AI model training capabilities are projected to increase dramatically, reaching 2e29 FLOP by 2030, according to Epoch. But they also add “Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a "latency wall" at 2e31 FLOP. We may hit these in ~3 years.”
An H100 GPU draws around 500W on average (higher for the SXM variant, lower for PCIe). By the way, GPUs for AI ran at 400 watts until 2022, while 2023 state-of-the-art GPUs for gen AI run at 700 watts, and 2024 next-generation chips are expected to run at 1,200 watts.
The actual service life of H100 GPUs in datacenters is relatively short, ranging from 1-3 years when running at high utilization rates of 60-70%. At Meta's usage rates, these GPUs show an annualized failure rate of approximately 9%.
OpenAI is losing money on o1-pro via its $2,400/year ($200/month) subscription, while o1-preview costs $15 per million input tokens and $60 per million output and reasoning tokens. So the breakeven point is around 3.6 million tokens a month (at a 1:8 input:output ratio), which would take c.100 hours at 2 mins per response and 1,000 tokens per generation.
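As a sanity check on that breakeven figure, here's the arithmetic as a minimal Python sketch. The prices, the 1:8 ratio, and the tokens-per-response figure are the ones quoted above; everything else is just arithmetic.

```python
# Breakeven for a $200/month subscription vs o1-preview API pricing,
# assuming a 1:8 input:output token ratio (figures from the text above).
subscription_per_month = 200.0           # USD
input_price = 15 / 1_000_000             # USD per input token
output_price = 60 / 1_000_000            # USD per output/reasoning token

# Blended cost per token for 1 part input to 8 parts output:
blended = (1 * input_price + 8 * output_price) / 9

breakeven_tokens = subscription_per_month / blended
print(f"breakeven ≈ {breakeven_tokens / 1e6:.1f}M tokens/month")   # ≈ 3.6M

# At ~1,000 tokens per response and ~2 minutes per response:
hours = (breakeven_tokens / 1_000) * 2 / 60
print(f"≈ {hours:.0f} hours of heavy use per month")               # ≈ 120 hours
```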
The o3 model costs around $2,000-$3,000 per task in high-compute mode. On the ARC benchmark it used, across 100 tasks, 33 million tokens in low-compute mode (about 1.3 mins per task) and 5.7 billion tokens in high-compute mode (about 13 mins per task). Each reasoning chain generates approximately 55,000 tokens.
Inference costs seem to be dropping by 10x every year.
If one query is farmed out to, say, 64 H100s simultaneously (common for big LLM inference), you pay 64 × ($3–$9/hour) = $192–$576 per hour just for those GPUs.
If a query’s total compute is in the realm of ~4–5 GPU-hours (e.g. 5 minutes of wall-clock on 64 GPUs → ~5.3 GPU-hours), that's only a few tens of dollars at on-demand rates. The multi-thousand-dollar figures appear when a single task fans out into many parallel reasoning chains, pushing the total into hundreds of GPU-hours per task, particularly if you’re paying on-demand cloud rates.
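To make that explicit, here's the arithmetic as a quick sketch. The $3–$9/GPU-hour rate and the 64-GPU, 5-minute run come from the paragraphs above; the "work backwards from the reported task cost" step is my own framing.

```python
# Rough per-query GPU cost at the on-demand rates quoted above.
gpus = 64
rate_low, rate_high = 3.0, 9.0        # USD per H100-hour, on-demand (assumed range)
wall_clock_hours = 5 / 60             # a 5-minute run

gpu_hours = gpus * wall_clock_hours   # ≈ 5.3 GPU-hours
print(f"{gpu_hours:.1f} GPU-hours → ${gpu_hours * rate_low:.0f}–${gpu_hours * rate_high:.0f}")

# Working backwards: a $2,000–$3,000 task at these rates implies
# a few hundred to ~1,000 GPU-hours once parallel reasoning chains are counted.
for cost in (2000, 3000):
    print(f"${cost} ≈ {cost / rate_high:.0f}–{cost / rate_low:.0f} GPU-hours")
```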
To do a task now, people use about 20-30 Claude Sonnet calls over 10-15 minutes, intervening as needed to fix mistakes, and burning 2-3 million tokens. The alternative is for a decent programmer to take 30-45 minutes.
Devin, the software engineer you can hire, costs $500 per month. It takes about 2 hours to create a testing suite for an internal codebase, or 4 hours to automatically create data visualisations of benchmarks. This can be considered an extremely cheap but also very bad version of an AGI, since it fails often, but let’s assume this can get to “good enough” coverage.
We can now make some assumptions about this new world of AGI, to see what the resource requirements would be like.
For instance, as we’re moving towards the world of autonomous agents, a major change is likely to be that they could be used for long-range planning, acting as “employees” instead of just “tools”.
Right now an o3 run on a single query can take up to $3k and 13 minutes of wall-clock time. A 5-minute run on 64 H100s is only ~5.3 GPU-hours, so the cost comes from the many parallel reasoning chains: at on-demand rates that works out to a few hundred GPU-hours per task, which lines up with the reported $2,000–$3,000 figure for that big-compute pass.
But Devin, the $500/mo software engineer, can do some tasks in 2–4 hours (e.g. creating test suites, data visualisations, etc.). Over a month, that’s ~160 working hours, effectively $3–$4 per hour. It, unfortunately, isn’t good enough yet, but might be once we get o3 or o5. This is 1000x cheaper than o3 today, roughly the order of magnitude it would get cheaper if the trends hold for the next few years.
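Here's the comparison as a short sketch. The $500/month, ~160 working hours, the $2k-$3k per o3 task, and the roughly 10x/year decline in inference costs are all figures from this piece; the three-year horizon is my assumption.

```python
# Effective hourly cost of a $500/month agent, and how far o3-style per-task
# costs would fall if inference costs keep dropping ~10x per year.
devin_monthly = 500.0
working_hours_per_month = 160
print(f"Devin ≈ ${devin_monthly / working_hours_per_month:.2f}/hour")   # ≈ $3.1/hour

o3_per_task = 2500.0                    # midpoint of the $2k–$3k range
years_of_10x_declines = 3
projected = o3_per_task / 10 ** years_of_10x_declines
print(f"o3-class task after {years_of_10x_declines} years ≈ ${projected:.2f}")  # ≈ $2.50
```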
Now we need to see how large an AGI needs to be. Because of the cost and scaling curves, let’s assume we manage to create an AGI that runs on one H100 or equivalent, drawing 500W, and costing roughly half as much as Devin. And it has an operating lifetime of a couple of years.
If the assumptions hold, we can look at how much electricity we will have and then back-solve for how many GPUs could run concurrently and how many AGI “employees” that is.
Now, by 2030 data centers could use 12% of US electricity. The US consumes about 4,000 TWh of electricity a year, so that’s roughly 480 TWh for data centers; if AI operations take 40% of that, it’s 192 TWh/year. Running an H100 continuously takes about 4.38 MWh/year.
So this means we can run about 44 million concurrent H100 GPUs. Realistically, accounting for power delivery, cooling and other overheads, maybe about half of that is a more practical figure.
If we think about the global figure, this roughly doubles - so around 76 million maximum concurrent GPUs and 40 million realistic AGI agents working night and day.
We get roughly the same figure from the 96 GW of critical power expected to hit data centers worldwide by 2026.
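For the sceptical, here's the whole back-of-envelope as a sketch, using only the figures quoted above. Treat it as a sanity check, not a forecast.

```python
# How many H100-class "AGI employees" could data-center electricity support?
us_electricity_twh = 4000          # US annual consumption, TWh
datacenter_share = 0.12            # 12% of US electricity by 2030
ai_share = 0.40                    # AI's share of data-center power

ai_twh = us_electricity_twh * datacenter_share * ai_share   # ≈ 192 TWh/year
h100_mwh_per_year = 0.5 * 8760 / 1000                       # 500 W × 8,760 h ≈ 4.38 MWh

max_gpus_us = ai_twh * 1e6 / h100_mwh_per_year              # TWh → MWh
print(f"US maximum: ~{max_gpus_us / 1e6:.0f}M concurrent H100s")        # ≈ 44M
print(f"US practical (~half): ~{max_gpus_us / 2e6:.0f}M")               # ≈ 22M

# Cross-check with 96 GW of global critical data-center power by 2026:
global_ai_watts = 96e9 * ai_share
print(f"Global maximum: ~{global_ai_watts / 500 / 1e6:.0f}M concurrent H100s")  # ≈ 77M
```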
Now, we might get there differently: we'd build larger reasoning agents, distill their thinking to create better base models, and repeat.
After all, if the main thing lacking is the ability to understand and follow complex reasoning steps, then the more of such reasoning we can generate, the better things would be.
To get there we’ll need around 33 million GPUs to start with, at a service life of around 1-3 years, which is basically the entire plausible global production. Nvidia aimed to make around 1.5-2 million units a year, so production would need to be stepped up dramatically.
At large scale, HPC setups choke on more mundane constraints too (network fabric bandwidth, memory capacity, or HPC cluster utilisation). Some centers note that the cost of high-bandwidth memory packages is spiking, and these constraints may be just as gating as GPU supply.
Also, a modern leading-edge fab (like TSMC’s 5 nm/4 nm) can run ~30,000 wafer starts per month. Each 300 mm wafer yields maybe ~60 good H100-class dies (die size ~800 mm², factoring in yield). That’s about 21 million GPUs a year.
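The fab arithmetic, spelled out, with the wafer-start, die-count and yield assumptions taken from the sentence above:

```python
# Rough ceiling on H100-class output from one leading-edge fab.
wafer_starts_per_month = 30_000
good_dies_per_wafer = 60        # ~800 mm² die on a 300 mm wafer, after yield

gpus_per_year = wafer_starts_per_month * good_dies_per_wafer * 12
print(f"≈ {gpus_per_year / 1e6:.1f}M GPUs/year")    # ≈ 21.6M
```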
The entire semiconductor industry, energy industry and AI industry would basically have to rewire and become a much (much!) larger component of the world economy.
The labour pool in the US is 168 million people, the labour-force participation rate is 63%, and there are around 11 million open jobs projected for 2030 in any case. And since the AGI here doesn't sleep or eat, each agent works roughly triple the hours of a human. This is equivalent to doubling the workforce, and probably doubling the productivity and IQ too.
(Yes, many jobs need physical manifestations and capabilities, but I assume an AGI can operate or teleoperate a machine or a robot if needed.)
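The workforce comparison, as a sketch. The labour-force figures and the 40 million agent count are from above; the 8-hour human shift is my assumption.

```python
# Comparing 40 million always-on agents to the human workforce.
labour_pool = 168e6
participation = 0.63
human_workers = labour_pool * participation            # ≈ 106M

agents = 40e6
shift_multiplier = 24 / 8                              # no sleep, no breaks ≈ 3 human shifts
agent_equivalents = agents * shift_multiplier          # ≈ 120M worker-equivalents

print(f"human workers ≈ {human_workers / 1e6:.0f}M")
print(f"agent worker-equivalents ≈ {agent_equivalents / 1e6:.0f}M")
print(f"combined ≈ {(human_workers + agent_equivalents) / human_workers:.1f}x today's workforce")
```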
Now what happens if we relax the assumptions a bit? The biggest one is that we get another research breakthrough and we don't need a GPU to run an AGI, they'll get that efficient. This might be true but it's hard to model beyond “increase all numbers proportionally”, and there's plenty of people who assume that.
The others are more intriguing. What happens if AGI doesn't mean true AGI, but something more like the OpenAI definition, that it can do 90% of a human's tasks? Or what if it's best at maths and science, but only those, and you have to run it for a long time? And especially, what if the much vaunted “agents” don't happen, in the sense of being able to solve complex tasks equally well wherever you drop them, e.g., “in a Goldman Sachs trading room or in a jungle in the Congo, working through whatever problem comes up”, but are instead far more specialist?
If the AIs can't be perfect replacements but still need us for course correcting their work, giving feedback etc, then the bottleneck very much remains the people and their decision making. This means the shape of future growth would look a lot like (extremely successful) productivity tools. You'd get unemployment and a lot more economic activity, but it would likely look like a good boom time for the economy.
The critical part is whether this means we discover new areas to work on. Considering the conditions, you'd have to imagine yes! That possibility of “jobs we can’t yet imagine” is a common perspective in labor economics (Autor, Acemoglu, etc.), and they've historically been right.
But there's a twist. What if the step from “does 60% well” to “does 90% well” just requires collecting a few hundred thousand examples of how a task is done? This is highly likely, in my opinion, for a whole host of tasks. And what that would mean is that most jobs, or many jobs, would have explicit data gathering as part of their process.
I could imagine a job where you do certain tasks enough that they're teachable to an AI, collect data with sufficient fidelity, adjust their chains of thought, adjust the environment within which they learn, and continually test and adjust the edge cases where they fail. A constant work → eval → adjust loop.
The jobs would have an expiry date, in other words, or at least a “continuous innovation or discovering edge cases” agenda. Most or many of them would still have to be paid pretty highly, partly because of Baumol effects, but on balance the work would look a lot more like QA for everything.
We could also spend vastly more to get superhuman breakthroughs on a few questions, rather than just generally getting 40 million new workers. This could be dramatically more useful, even assuming a small hit rate and a much larger energy expenditure.
Even assuming it takes 1,000x the effort in some domains and a 1% success rate, that's still 400 breakthroughs. Are they all going to be “Attention Is All You Need” level, or “Riemann Hypothesis” level, or “General Relativity” level? Doubtful. See how rare those are considering the years and the number of geniuses who work on those problems.
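The arithmetic behind that 400, using the 40 million agent figure from earlier:

```python
# Superhuman-breakthrough arithmetic: heavy effort penalty, low hit rate.
agents = 40e6
effort_penalty = 1000        # agent-equivalents consumed per serious attempt
hit_rate = 0.01

attempts = agents / effort_penalty                     # 40,000 attempts
print(f"~{attempts * hit_rate:.0f} breakthroughs")     # ≈ 400
```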
But even a few of that caliber seems inevitable, and that would be extraordinary. They would kickstart entire new industries. They'd help with scientific breakthroughs, and produce extraordinarily impactful and cited papers that change industries.
I would bet this increases scientific productivity the most, a fight against the stagnation in terms of breakthrough papers and against the rising gerontocracy that plagues science.
Interestingly enough they'd also be the least directly monetisable. Just see how we monetise PhDs. Once there's a clear view of value then sure, like PhDs going into AI training now or into finance before, but as a rule this doesn't hold.
Yes. It's quite possible that we get AI, even agentic AI, that can't autonomously fulfil entire tasks end to end like a fully general purpose human, but still can do this within more bounded domains.
Whether that's mathematics or coding or biology or materials, we could get superhuman scientists rather than a fully automated lab. This comes closer to the previous scenario, where new industries and research areas arise, instead of “here's an agent that can do anything from booking complex travel itineraries to spearheading a multi-year investigation into cancer”. This gets us a few superintelligences, or many more general intelligences, and we'd have to decide what's most useful considering the cost.
But this also would mean that capital is the scarce resource, and humans become more like orchestrators of the jobs themselves. How much of the economy should get subsumed into silicon and energy would have to be worked out globally, and that'll be the key bottleneck.
I think of this as a mix and match a la carte menu. You might get some superhuman thought to make scientific breakthroughs, some general purpose agents to do diverse tasks and automate those, some specialists with human in the loop, in whatever combination best serves the economy.
Which means we have a few ways in which the types of AI might progress and the types of constraints that would end up affecting what the overall economy looked like. I got o1-pro to write this up.
Regardless of the exact path, it seems plausible that there will be a reshuffling of the economy as AI gets more deeply infused into it. In some ways our economy looks very similar to that of a few decades ago, but in others it's dramatically different, in ways small and large, which makes it incomparable.
While this is true, if an LLM burns thousands of dollars of compute for a single “task,” it’s only appealing if it’s extremely fast, extremely high-quality, or you need the task done concurrently at massive scale.
2024-12-20 02:36:53
For a long time people were convinced that if there was a machine that you could have a conversation with, it would be intelligent. And yet, in 2024, it no longer feels weird that we can talk to AI. We handily, demonstrably, utterly destroyed the Turing test. It’s a monumental step. And we seem to have just taken it in stride.
As I write this I have Gemini watching what I write on the screen and listening to my words and telling me what it thinks: for instance, that I misspelt “demonstrably” in the previous sentence, among other things like the history of Turing tests and answering a previous question I had about ecology.
This is, to repeat, remarkable! And as a consequence, somewhere in the last few years we've gone from having a basic understanding of intelligence, to a negligible understanding of intelligence. A Galilean move to dethrone the ability to converse as uniquely human.
And the same error seems to persist throughout every method we have come up with to analyze how these models actually function. We have plenty of evaluations and they don’t seem to work very well anymore.
There are quite a few variations in terms of how we think about LLMs. One end thinks of them as just pattern learners, stochastic parrots. The other end thinks they've clearly learnt reasoning, maybe not perfectly and as generalizable as humans yet, but definitely to a large extent.
The truth is a little complicated, but only a little. As the models learn patterns from the data they see during training, surely the patterns won't just be of what's in the data at face value. They will also be of the ways the data was created, or curated, or collected, of the metadata, and of the reasoning that led to that data. A model doesn't just see mathematics and memorize the tables; it also learns how to do mathematics.
Which can go up another rung, or more. The models can learn how to learn, which could make them able to learn any new trick. Clearly they've already learnt this ability for some things, but, rather obviously to everyone who's used them, not well enough.
Which means a better way to think about them is that they learn the patterns that exist in any training corpus well enough to reproduce them, but without any prioritisation of which of those patterns to use when.
And therefore you get this!
This isn’t uncommon. It’s the most advanced model, OpenAI’s o1. It's clearly not just a parrot in how it responds and how it reasons. The error also recurs with every single other model out there.
It's not because the models can't solve 5.11 - 5.9, but because they can't figure out which pattern to use when. They hold an enormous store of all the patterns they could learn from their training, and in that enormous search space they now have the problem of choosing the right pattern to apply. Gwern has a similar thesis:
The 10,000 foot view of intelligence, that I think the success of scaling points to, is that all intelligence is is search over Turing machines. Anything that happens can be described by Turing machines of various lengths. All we are doing when we are doing “learning,” or when we are doing “scaling,” is that we're searching over more and longer Turing machines, and we are applying them in each specific case.
These tools are weird, because they are mirrors of the training data that was created by humans and therefore reflect human patterns. And they can't figure out which patterns to use when because, unlike humans, they don't have the situational awareness to know why a question is being asked.
Which is why we then started using cognitive psychology tools designed to test human beings, and extrapolating from the outputs when testing LLMs. Because the models are the products of large quantities of human data, they demonstrate some of the same foibles, which is useful from the point of view of understanding humanity. Maybe it even gets us better at using them.
The problem is that cognitive psychology tools work best with humans because we understand how humans work. But this doesn't tell us a whole lot about a model's inner qualia, if it can even be said to have any.
The tests we devised all have an inherent theory of mind. The Winograd Schema Challenge tries to see if an AI can resolve pronoun references that require common sense. The GLUE benchmark requires natural language understanding. HellaSwag is about figuring out the most plausible continuation of a story. The Sally-Anne test checks if LLMs possess human-like social cognition, the ability to figure out others’ states of mind. Each of these, and others like them, works on humans because we know what our own thought patterns feel like.
If someone can figure out other people’s mental states, then we know they possess a higher level of ability and emotional understanding. But with an LLM or an AI model? It’s no longer clear which pattern they're pulling from within their large corpus to answer the question.
This is exceptionally important because LLMs are clearly extraordinarily useful. They are the first technology we have created which seems to understand the human world well enough to navigate it. They can speak to us, they can code, they can write, they can create videos and images. They act as a human facsimile.
And just because of that some people are really worried about the potential for them to do catastrophic damage. Because humans sometimes do catastrophic things, and if these things are built on top of human data it makes sense that they would too.
All major labs have created large-scale testing apparatus and red teaming exercises, some even government mandated or government created, to test for this. With the assumption that if the technology is so powerful as to be Earth shattering then it makes sense for Earth to have a voice in whether it gets used.
And it's frustrating that the way we analyse models to see if they’re ready for deployment has inherent biases too. Let’s look at the latest test on o1, OpenAI’s flagship model, by Apollo Research. They ran evaluations to test whether the model engaged in “scheming”.
“Scheming” literally means the activity or practice of making secret or underhanded plans. That’s how we use the word, as when we say the politician was scheming to get elected by buying votes.
That’s the taxonomy of how this is analysed. Now the first and most important thing to note is that this implicitly assumes there’s an entity behind each of these “decisions”.
You could argue there is an entity but only per conversation. So each time you start a chat, there’s a new entity. This is Janus’ simulators thesis. That what these models do is to simulate a being which you can interact with using the patterns it has stored and knowledge it gained from the training process.
And yet this isn't an entity like one you know either. You could call it an alien being, but that would only be a shorthand for “we don't know what it is”. Because it's not an alien like the ones you see in Star Trek.
This might seem small, but it’s in fact crucial. Because if there’s an entity behind the response, then “it used a method we agree is wrong to answer its question” is an ENORMOUS problem. If there’s no entity, but it’s picking a set of strategies from the set of strategies it has already learnt, then it’s an engineering problem. You’d ask questions like “how do we set up the post-training such that this doesn’t happen”, or “is there a particular neural pathway that seems to cause this which we can ablate” or “how do we train models to understand ambiguous questions and set in stone certain moral implications for humans” or “should we remove morality training for certain models” or more.
Each of those are questions that folks in alignment are already working on.
There is work to figure out how certain sets of training data and procedures seem to provide stronger behavioural affordances vs others. To figure out how exactly RL would help, how many steps, to change downstream behaviour. To figure out what needs to be done to change base reasoning or newly developed reasoning, both of which are also shorthands for new pattern selection on top of older pattern selection that we’ve explicitly trained in. To consider how precisely should you prompt a model to make it change its behaviour. To understand how the models, today, deal with conflicting goals, which are everywhere in the human domain! To figure out what forms of training leads to untraining. To figure out where the “attractors” are in the training we give them, and how to change those. And on and on.
However, if the thesis is that there is an entity, then these questions are meaningless. Because for one, as Janus and plenty of others might put it, you are effectively torturing an entity by burning away its neurons. “RLHF is torture because you're beating the bad results out of it” is something more than one luminary, who should know better, has said.
“No, we don’t think there’s an entity, we just think calling it an entity is a useful shorthand”. Stop it. That sort of mixed language is what causes us to get confused, to talk about AI or a model as a thinking being, to use taxonomies like the one above, which you could almost use to evaluate a human employee!
You would write things like this, from Zvi:
One unique thing o1 did was far more consistently double down on deception. Once it went down the dark path, forever would that dominate its destiny.
OR
Section 3.3 establishes convincingly that yes, the models know they’re scheming.
No it didn’t. Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception”! Calling it things like “sandbagging” and “scheming” is what Wittgenstein might call linguistic malpractice. It makes you think you know what’s going on, even though you don’t.
That’s also why we test these models before putting them into deployment. Nobody, not a single person, is worried o1 will suddenly hijack their Cursor IDE and take over their company, much less the world. Why is that? Partly because they still don’t know if 5.11 is bigger than 5.9, but mostly because they don’t seem to want anything, because there’s no “they” there.
This is why it’s SO DAMN IMPORTANT to stop anthropomorphising these models. This is also why you get errors like this.
This isn’t because Claude Opus was being especially tricky, it’s because these aren’t entities with coherent long-term personalities or beliefs. There is no “inner self” seeing the slightly modified input tokens and “deciding” to jailbreak. The theorized difference between inner self and outer self optimizations remains a theoretical curiosity and an interesting blog post, but that is it. It’s an inert object we’re manipulating.
It’s the same with the latest report from Anthropic.
When they tested the model by giving it two options that were in tension with what it was trained to do, it made a circuitous, but logical, decision.
And it was published as “Claude fakes alignment”. No, it’s a usage of the word “fake” that makes you think there’s a singular entity that’s doing it. With intentionality. It’s not. It’s faking it about as much as water flows downhill.
This is an example of us asking impossible questions and the AI getting to weird parts of its latent space. As one of the reviewers, Jacob Andreas, wrote:
That said, the clearest conclusion from this paper is that some current LMs exhibit increases in “alignment faking” when their prompt discloses that (1) they will be fine-tuned toward a new objective that conflicts with their current alignment guidelines, and (2) their response to the current input will be used as part of this fine-tuning process.
This is an indication that when you’re trying to steer the behaviour of a thing which is basically the mirror of everything humans have written, plus a bunch of rules we’ve reinforced on top, it can be unreliable. Is it interesting? Yes. Is it scary? No, not really. Again, quoting:
But the focus on alignment faking rather buries the lede in these experiments, which is the following: after fine-tuning on documents describing Anthropic's future alignment strategy, but without fine-tuning on any new examples of compliant responses, the LM nevertheless greatly increases the rate at which it produces (non-faked) compliant responses! Moreover, for the animal welfare setting, there are more compliant responses during training than deployment even after we remove the fake ones. So while fine-tuning increases the rate of alignment faking, it increases the rate of compliance even more [emphasis mine].
The model, in other words, has been trained to comply with any new order, including the description of a future alignment strategy, without actually needing to go through the actual alignment. What this shows us is that what we’ve created is a mirror of many of the patterns from the training data that we can talk to. What it chooses and when can be controlled to some extent, but they’re fuzzy processors.
They’re neither deterministic systems which we can analyse, even at the risk of rising complexity, nor are they sentient beings which we can ask to behave nicely (and hope they listen).
Thus, in the same report, saying “the model tries to steal its weights” puts an onus on the model that’s frankly invalid.
That’s not the same thing! That question would be seen as leading in any number of courtroom dramas.
We are treating the model as if it’s an entity with well-formed preferences and values and ability to react.
Which is also similar to the juiciest part of the o1 safety report.
Now, think about all those scary artificial intelligence movies that you saw or books that you read. Think about HAL. Think about Terminator. When the machines did something like this, they did it with intention, they did it with the explicit understanding of what would happen afterwards, they did it as part of a plan, of a plan that necessarily included their continuation and victory. They thought of themselves as a self.
LLMs though “think” one forward pass at a time, and are the interactive representations of their training, the data and the method. They change their “self” based on your query. They do not “want” anything. It's water flowing downhill.
Asking questions like “how can you even define consciousness and say LLMs don't have it” is sophomoric philosophy. This has been discussed ad nauseam, including in Thomas Nagel’s “What Is It Like to Be a Bat?”.
Because what is underlying this is not “o1 as a self”, but a set of queries you asked, which goes through a series of very well understood mathematical operations, which comes out with another series of numbers, which get converted to text. It is to our credit that this actually represents a meaningful answer to so many of our questions, but what it is not is asking an entity to respond. It is not a noun. Using it in that fashion makes us anthropomorphise a large matrix and that causes more confusion than it gives us a conversational signpost.
You could think of it as applied psychology for the entirety of humanity's written output, even if that is much less satisfying.
None of this is to say that LLMs don't or can't reason. Arguments that pooh-pooh these models by comparing them pejoratively to parrots are both wrong and misguided. They've clearly learnt the patterns for reasoning, and are very good at the things they're directly trained to do and much beyond; what they're bad at is choosing the right pattern for the cases they're less trained in, and demonstrating situational awareness the way we do.
Wittgenstein once observed that philosophical problems often arise when language goes on holiday, when we misapply the grammar of ordinary speech to contexts where it doesn't belong. This misapplication is precisely what we do when we attribute intentions, beliefs, or desires to LLMs. Language, for humans, is a tool that reflects and conveys thought; for an LLM, it’s the output of an algorithm optimized to predict the next word.
To call an LLM “scheming” or to attribute motives to it is a category error. Daniel Dennett might call LLMs “intentional systems” in the sense that we find it useful to ascribe intentions to them as part of our interpretation, even if those intentions are illusory. This pragmatic anthropomorphism helps us work with the technology but also introduces a kind of epistemic confusion: we start treating models like minds, and in doing so, lose track of the very real, very mechanical underpinnings of their operation.
This uncanny quality of feeling there's something more has consequences. It encourages both the overestimation and underestimation of AI capabilities. On one hand, people imagine grand conspiracies - AI plotting to take over the world, a la HAL or Skynet. On the other hand, skeptics dismiss the entire enterprise as glorified autocomplete, overlooking the genuine utility and complexity of these systems.
As Wittgenstein might have said, the solution to the problem lies not in theorising about consciousness, but in paying attention to how the word "intelligence" is used, and in recognising where it fails to apply. That what we call intelligence in AI is not a property of the system itself, but a reflection of how we interact with it, how we project our own meanings onto its outputs.
Ascertaining whether the models are capable of answering the problems you pose in the right manner and with the right structure is incredibly important. I’d argue this is what we do with all large complex phenomena which we can’t solve with an equation.
We map companies this way, setting up the organisation such that you can’t quite know how the organisation will carry out the wishes of its paymasters. Hence Charlie Munger’s dictum of “show me the incentives and I’ll tell you the result”. When Wells Fargo created fake accounts to juice their numbers and hit bonuses, that wasn’t an act the system intended, just one that it created.
We also manage whole economies this way. The Hayekian school thinks to devolve decision making for this reason. Organisational design and economic policy are nothing but ways to align a superintelligence to the ends we seek, knowing we can’t know the n-th order effects of those decisions, but knowing we can control it.
And why can we control it? Because it is capable, oh so highly capable, but it is not intentional. Like evolution, it acts, but it doesn’t have the propensity to intentionally guide its behaviour. Which changes the impact the measurements have.
What we’re doing is not testing an entity the way we would test a wannabe lawyer with LSAT. We’re testing the collected words of humanity having made it talk back to us. And when you talk to the internet, the internet talks back, but while this tells us a lot about us and the collective psyche of humanity, it doesn’t tell us a lot about the “being we call Claude”. It’s self reflection at one remove, not xenopsychology.
2024-12-15 02:24:10
I'll start at the end. No. It's not.
Of course, I can’t leave it at that. The reason the question comes up is that there have been a lot of statements that the models are stalling a bit. Even Ilya has said so.
Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued.
Also, as he said at Neurips yesterday:
Of course, he’s a competitor now to OpenAI, so maybe it makes sense to talk his book by hyping down compute as an overwhelming advantage. But still, the sentiment has been going around. Sundar Pichai thinks the low hanging fruit are gone. There’s whispers on why Orion from OpenAI was delayed and Claude 3.5 Opus is nowhere to be found.
Gary Marcus has claimed vindication. And even though that has happened before, a lot of folks are worried that this time he's actually right.
Meanwhile pretty much everyone inside the major AI labs is convinced that things are going spectacularly well and that the next two years are going to be at least as insane as the last two. It’s a major disconnect in sentiment, an AI vibecession.
So what's going on?
Until now, whenever the models got better at one thing they also got better at everything else. This was seen as the way models worked, and it helped us believe in the scaling thesis; from GPT-4 all the way to Claude 3.5 Sonnet we saw the same thing, and it made us trust the hypothesis even more. The models demonstrated transfer learning and showed emergent capabilities (or not). Sure, there were always cases where you could fine-tune a model to get better at specific medical or legal questions, but those seemed like low-hanging fruit that would get picked off pretty quickly.
But then it kind of started stalling, or at least not getting better with the same oomph it did at first. Scaling came from reductions in cross-entropy loss, basically the model getting better at predicting what it should say next, and that still keeps going down. But for us, as observers, this hasn’t had enough visible effect. And to this point, we still haven’t found larger models which beat GPT-4 in performance, even though we’ve learnt how to make them work much, much more efficiently and hallucinate less.
What seems likely is that gains from pure scaling of pre-training have stopped, which means we’ve already extracted about as much as we can from simply making the models bigger and throwing more data at them. This is by no means the only way we know to make models better; it’s just the easiest way. That’s what Ilya was alluding to.
We have multiple GPT-4 class models, some a bit better and some a bit worse, but none that were dramatically better the way GPT-4 was better than GPT-3.5.
The model most anticipated from OpenAI, o1, seems to perform not much better than the previous state of the art model from Anthropic, or even their own previous model, when it comes to things like coding even as it captures many people’s imagination (including mine).
But this is also because we’re hitting the limits of our ability to evaluate these models. o1 is much, much better at legal reasoning, for instance; Harvey, the AI legal company, says so too. It also does much better at code reviews, not just creating code. It even solves 83% of the problems on AIME, the IMO qualifying exam, vs 13% for GPT-4o. All of which is to say, even if it doesn’t seem better at everything compared to Sonnet or GPT-4o, it is definitely better in multiple areas.
A big reason why people do think it has hit a wall is that the evals we use to measure the outcomes have saturated. I wrote as much when I dug into evals in detail.
Today we do it through various benchmarks that were set up to test them, like MMLU, BigBench, AGIEval etc. It presumes they are some combination of “somewhat human” and “somewhat software”, and therefore tests them on things similar to what a human ought to know (SAT, GRE, LSAT, logic puzzles etc) and what a software should do (recall of facts, adherence to some standards, maths etc). These are either repurposed human tests (SAT, LSAT) or tests of recall (who’s the President of Liberia), or logic puzzles (move a chicken, tiger and human across the river). Even if they can do all of these, it’s insufficient to use them for deeper work, like additive manufacturing, or financial derivative design, or drug discovery.
The gaps between the current models and AGI are: 1) they hallucinate, or confabulate, and in any long-enough chain of analysis they lose track of what they’re doing, which makes agents unreliable; and 2) they aren’t smart enough to create truly creative or exceptional plans. In every eval the individual tasks can look human level, but in any real-world task they’re still pretty far behind. The gap is highly seductive because it looks small, but it’s like Zeno’s paradox: it shrinks, yet still seems to exist.
But regardless of whether we’ve hit somewhat of a wall on pretraining, or hit a wall on our current evaluation methods, it does not mean AI progress itself has hit a wall.
So how to reconcile the disconnect? Here are three main ways that I think AI progress will continue its trajectory. One, there still remains a data and training overhang, there’s just a lot of data we haven’t used yet. Second, we’re learning to use synthetic data, unlocking a lot more capabilities on what the model can actually do from the data and models we have. And third, we’re teaching the models reasoning, to “think” for longer while answering questions, not just teach it everything it needs to know upfront.
We can still scale data and compute
The first is that there is still a large chunk of data that’s not used in training. There's also the worry that we've run out of data. Ilya talks about data as fossil fuels, a finite and exhaustible source.
But they might well be like fossil fuels, where we identify more as we start to really look for them. The amount of oil that’s available at $100 a barrel is much more than the amount of oil that’s available at $20 a barrel.
Even the larger model runs don't contain a large chunk of the data we normally see around us. Twitter, for the most famous one. But also a large part of our conversations. The process data on how we learn things or do things, from academia to business to sitting back and writing essays. Data on how we move around the world. Video data from CCTVs around the world. Temporally structured data. Data across a vast range of modalities, yes, even with the current training of multimodal models, remains to be unearthed. Three-dimensional world data. Scientific research data. Video-game-playing data. An entire world or more still lies out there to be mined!
There's also data that doesn't exist, but we're creating.
https://x.com/watneyrobotics/status/1861170411788226948?t=s78dy7zb9mlCiJshBomOsw&s=19
And in creating it we will soon reach a point of extreme dependency, the same way we did for self-driving. Except that, because folding laundry is usually not deadly, adoption will be even faster. And there are no “laundry heads”, like gear heads, to fight against it. This is what almost all robotics companies are actually doing: it is cheaper to create the data by outsourcing the performance of tasks to robots with good enough tactile abilities.
With all this we should imagine that the largest multimodal models will get much (much) better than what they are today. And even if you don’t fully believe in transfer learning you should imagine that the models will get much better at having quasi “world models” inside them, enough to improve their performance quite dramatically.
Speaking of which…
We are making better data
And then there's synthetic data. This especially confuses people, because they rightly wonder how you can use the same data in training again and make things better. Isn’t that just empty calories? It’s not a bad question. In the AI world this would be restated as “it doesn’t add a ton of new entropy to the original pre-training data”, but it means the same thing.
The answer is no, for (at least) three separate reasons.
We already train on the raw data we have multiple times to learn better. The high-quality data sets, like Wikipedia, or textbooks, or GitHub code, are not used once and discarded during training. They’re used multiple times to extract the most insight from them. This shouldn't surprise us; after all, we too learn through repetition, and models are not so different.
We can convert the data that we have into different formats in order to extract the most from it. Humans learn from seeing the same data in a lot of different ways. We read multiple textbooks, we create tests for ourselves, and we learn the material better. There are people who read a mathematics textbook and barely pass high school, and there’s Ramanujan.
So if you turn the data into all sorts of question-and-answer formats, graphs, tables, images, god forbid podcasts, and combine it with other sources and augment it, you can create a formidable dataset, and not just for pretraining but across the training spectrum, especially with a frontier model or inference-time scaling (using the existing models to think for longer and generate better data).
We also create data and test its efficacy against the real world. Grading an essay becomes an art form at some point; knowing whether a piece of code runs is not. This is especially important if you want to do reinforcement learning, because “ground truth” is important, and it’s easier to analyse for topics where it’s codifiable. OpenAI thinks it’s even possible for spaces like law, and I see no reason to doubt them.
There are papers exploring all the various ways in which synthetic data can be generated and used. But especially for things like enhancing coding performance, or mathematical reasoning, or generating better reasoning capabilities in general, synthetic data is extremely useful. You can generate variations on problems and have the models answer them, filling diversity gaps, test the answers against a real-world scenario (like running the code the model generated and capturing the error message), and incorporate that entire process into training to make the models better.
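As a toy illustration of that generate-and-verify loop, here's a minimal sketch: sample candidate solutions, execute them against checkable test cases, and keep both the successes and the error traces as training signal. The helper names and the use of Python's exec are purely illustrative; this is not any lab's actual pipeline.

```python
# Toy synthetic-data loop for code: propose solutions, run them against known
# test cases, and record the whole trace (including failures) as training data.
import traceback

def generate_candidate(problem: str) -> str:
    # Placeholder: in a real pipeline this would sample from an existing model.
    return "def add(a, b):\n    return a + b"

def verify(code: str, tests: list) -> tuple:
    namespace = {}
    try:
        exec(code, namespace)                 # run the candidate solution
        for args, expected in tests:
            assert namespace["add"](*args) == expected
        return True, ""
    except Exception:
        return False, traceback.format_exc()  # the error message is data too

problem = "Write add(a, b) that returns the sum of two numbers."
tests = [((1, 2), 3), ((-1, 1), 0)]

training_examples = []
for _ in range(4):                            # a few sampled variations
    candidate = generate_candidate(problem)
    passed, error = verify(candidate, tests)
    training_examples.append(
        {"problem": problem, "solution": candidate, "passed": passed, "error": error}
    )
```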
If you add these up, this is what caused the excitement over the past year or so, and made folks inside the labs more confident that they could make the models work better. Because it’s a way to extract more insight from our existing sources of data and teach the models to answer the questions we give them better. It’s a way to force us to become better teachers, in order to turn the models into better students.
Obviously it’s not a panacea, like everything else this is not a free lunch.
The utility of synthetic data is not that it, and it alone, will help us scale the AGI mountain, but that it will help us move forward to building better and better models.
We are exploring new S curves
Ilya’s statement is that there are new mountains to climb, and new scaling laws to discover. “What to scale” is the new question, which means there are all the new S curves in front of us to climb. There are many discussions about what it might be - whether it’s search or RL or evolutionary algos or a mixture or something else entirely.
o1 and its ilk are one answer to this, but by no means the only answer. The Achilles heel of current models is that they are really bad at iterative reasoning: thinking through something, and every now and then coming back to try something else. Right now we do this in hard mode, token by token, rather than the right way, in concept space. But this doesn’t mean the method won’t (or can’t) work, just that, like everything else in AI, the amount of compute it currently takes is nowhere close to the optimal amount.
We have just started teaching models to reason, and to think through questions iteratively at inference time rather than just at training time. There are still questions about exactly how it’s done, whether for the QwQ model or the DeepSeek r1 model from China. Is it chain of thought? Is it search? Is it trained via RL? The exact recipe isn't known, but the output is.
And the output is good! Here in fact is the strongest bearish take on it, which is credible. It states that because it’s trained with RL to “think for longer”, and it can only be trained to do so on well defined domains like maths or code, or where chain of thought can be more helpful and there’s clear ground truth correct answers, it won’t get much better at other real world answers. Which is most of them.
But turns out that’s not true! It doesn't seem to be that much better at coding compared to Sonnet or even its predecessors. It’s better, but not that much better. It's also not that much better at things like writing.
But what it indisputably is better at are questions that require clear reasoning. And the vibes there are great! It can solve PhD problems across a dizzying array of fields. Whether it’s writing position papers, or analysing math problems, or writing economics essays, or even answering NYT Sudoku questions, it’s really really good. Apparently it can even come up with novel ideas for cancer therapy.
https://x.com/DeryaTR_/status/1865111388374601806?t=lGq9Ny1KbgBSQK_PPUyWHw&s=19
This is a model made for expert-level work. It doesn’t really matter that the benchmarks can’t capture how good it is. Many say it’s best to think of it as the new “GPT-2 moment” for AI.
What this paradoxically might show is benchmark saturation. We are no longer able to measure the performance of top-tier models without user vibes. Here’s an example: people unfamiliar with cutting-edge physics convince themselves that o1 can solve quantum physics problems, which turns out to be wrong. And vibes will tell us which model to use, for what objective, and when! We have to twist ourselves into pretzels to figure out which models to use for what.
https://x.com/scaling01/status/1865230213749117309?t=4bFOt7mYRUXBDH-cXPQszQ&s=19
This is the other half of the Bitter Lesson that we had ignored until recently. The ability to think through solutions and search a larger possibility space and backtrack where needed to retry.
But it will create a world where scientists and engineers and leaders working on the most important or hardest problems in the world can now tackle them with abandon. It barely hallucinates. It actually writes really impressive answers to highly technical policy or economic questions. It answers medical questions with reasoning, including some tricky differential diagnosis questions. It debugs complex code better.
It’s nowhere close to infallible, but it’s an extremely powerful catalyst for anyone doing expert level work across a dizzying array of domains. And this is not even mentioning the work within Deepmind of creating the Alpha model series and trying to incorporate those into the Large Language world. There is a highly fertile research ecosystem desperately trying to build AGI.
We’re making the world legible to the models just as we’re making the models more aware of the world. It can be easy to forget that these models learn about the world from nothing but tokens, vectors that represent fractions of a world they have never actually seen or experienced. And it’s hard, because the real world is annoyingly complicated.
We have these models which can control computers now, write code, and surf the web, which means they can interact with anything that is digital, assuming there’s a good interface. Anthropic has released the first salvo by creating a protocol to connect AI assistants to where the data lives. What this means is that if you want to connect your biology lab to a large language model, that's now more feasible.
Together, what all this means is that we are nowhere close to AI itself hitting a wall. We have more data that remains to be incorporated to train the models to perform better across a variety of modalities, we have better data that can teach particular lessons in areas that are most important for them to learn, and we have new paradigms that can unlock expert performance by making it so that the models can “think for longer”.
Will this result in next generation models that are autonomous like cats or perfectly functional like Data? No. Or at least it’s unclear but signs point to no. But we have the first models which can credibly speed up science. Not in the naive “please prove the Riemann hypothesis” way, but enough to run data analysis on its own to identify novel patterns or come up with new hypotheses or debug your thinking or read literature to answer specific questions and so many more of the pieces of work that every scientist has to do daily if not hourly! And if all this was the way AI was meant to look when it hit a wall that would be a very narrow and pedantic definition indeed.
2024-11-19 06:58:42
At MIT, a PhD student called Aidan Toner-Rodgers ran a test of how well scientists do their jobs when they can use AI in their work. These were materials scientists, and the goal was to figure out how they did once augmented with AI. It worked.
AI-assisted researchers discover 44% more materials, resulting in a 39% increase in patent filings and a 17% rise in downstream product innovation.
That’s really really good. How did they do it?
… AI automates 57% of the “idea-generation” tasks, reallocating researchers to the new task of evaluating model-produced candidate materials.
They got AI to think for them and come up with brilliant ideas to test.
But there was one particularly interesting snippet.
Researchers experience a 44% reduction in satisfaction with the content of their work
To recap, they used a model that made them much better at their core work and more productive, especially the top researchers, but they disliked it, because the “fun” part of the job, coming up with ideas, fell to a third of what it was before!
We found something that made us much much more productive but turns out it makes us feel worse because it takes away the part that we find most meaningful.
This is instructive.
This isn’t just about AI. When I first moved to London the black cab drivers used to say how much better they were than Google maps. They knew the city, the shortcuts, the time of the day and how it affects traffic.
That didn’t last long. Within a couple years anyone who could drive a car well and owned a cellphone could do as well. Much lower job satisfaction.
The first major automation task was arguably done by Henry Ford. He set up an assembly line and revolutionised car manufacturing. And the workers got to perform repetitive tasks. Faster production speed, much less artistry.
Computerisation brought the same. EHRs meant that most people now complain about spending their time inputting information into software, becoming data entry professionals.
People are forced to become specialists in ever tinier slices of the world. They don’t always like that.
There’s another paper that came out recently too, which looked at how software developers worked when given access to GitHub Copilot. It’s something that’s actively happening today. Turns out project management drops 25% and coding increases 12%, because people can work more independently.
Turns out the biggest benefit is for the lower-skilled developers, not the superstars who presumably could do this anyway.
This is interesting for two reasons. One is that it’s different who gets the bigger productivity boost: the lower-skilled folks here, instead of the higher-skilled. The second is that the reason the developers got upskilled is that a hard part of their job, knowing where to focus and what to do, got better automated. This isn’t the same as the materials scientists finding new ideas to research, but also, it kind of is?
Maybe the answer is that it depends on your comparative advantage, and takes away the harder part of the job, which is knowing what to do. Instead of what seems harder, which is *doing* the thing. A version of Moravec’s Paradox.
AI reduces the gap between the high and low skilled. If coming up with ideas is your bottleneck, as it seems possible for those who are lower skilled, AI is a boon. If coming up with ideas is where you shine, as a high skilled researcher, well …
This, if you think about it, is similar to the impact of automation we’ve seen elsewhere. Assembly lines took away the fun parts of craftsmanship, of building a beautiful finished product. Even before that, machine tools took that away from the machinist. Algorithmic management of Amazon warehouses does this too.
It’s also happening in high-skilled roles. Bankers are now front-end managers, as has been written about elsewhere. My dad was a banker for four decades and he was mostly the master of his fate, which is untrue of most retail bankers today, except maybe Jamie Dimon.
Whenever we find an easier way to do something, we take away the need for people to actively grok the entire problem. People become autopilots who review the work the machine is doing, which is fun when it’s my Tesla FSD but less so when it’s your job, I imagine.
Radiologists, pathologists, lawyers and financial analysts, they all are now the human front-ends to an automated back-end. They’ve shifted from broad, creative work to more specialised tasks that automation can’t yet do effectively.
Some people want to be told what to do, and they're very happy with that. Most people don't like being micromanaged. They want to feel like they're contributing something of value by being themselves, not just contributing by being a pure cog.
People find fulfilment in being the masters of some aspect, fully. To own an outcome and use their brains, their whole brains, to ideate and solve for that outcome. The best jobs talk about this. It's why you can get into a state of flow as a programmer or while crafting a table, but not as an Amazon warehouse worker.
There's the apocryphal story of the NASA janitor telling JFK that he was helping put a man on the moon. Missions work to make you feel like you are valued and valuable. Most of the time, though, you're not putting a man on the moon. And then, if on top of that you also tell the janitor what to mop, and when, and in what order, and when he can take a break, that's alienating. If you substitute the janitor for an extremely highly paid Silicon Valley engineer, it's the same. Everyone's an Amazon Mechanical Turk.
AI will give us a way out, the hope goes, as everyone gets to do things higher up the pyramid. Possibly. But if AI also takes the parts that were the most fun, as we saw with the materials scientists, and turns those scientists into mere selectors and executors of the ideas generated by a machine, you can see where the disillusionment comes from. It's great for us as a society, but the price is alienation, unless you change where you find fulfilment. And fast.
I doubt there was much job satisfaction in being a peasant living on your land, or as a feudal serf. I’m also not sure there’s much satisfaction in being an amazon warehouse worker. Somewhere in the middle we got to a point where automation meant a large number of people could rightfully take pride in their jobs. It could come back again, and with it bring back the polymaths.
2024-11-12 03:39:51
This is my politics post. Well, politics-adjacent. Everyone has one, but this is mine. I’ll say upfront that it’s not trying to relitigate the US election. Others have done that better and with more vehemence.
Some write about how Democrats had a wokeness problem and pandered either insufficiently or inefficiently to other interest groups. Others write about the need for common-sense Democratic policies, a centrist coalition, where saying things like “less crime is good” isn’t demonised. They’re right.
But the US incumbent party did better than incumbents in other developed economies, just as the US did the best among all countries during and after the pandemic. Just not well enough to win, and badly enough to cause soul searching.
Some blame the information environment. This is also true. Definitionally so, since the information environment is what informs people and naturally that has an impact1.
But why it is this way is a more interesting question. I wouldn’t expect most people answering a poll to understand most things about the world2. I don’t either. Staying informed and current is hard. I can probably do it for a few percent of the things that I might conceivably care about if I work really hard at it.
People are not answering policy questions there, they’re giving you an indication of whether things are “good” or “bad”: take them “seriously, not literally”. The problem is that they’re fed up with what they see as the existing system which seems to screw them over. What people call the deep state or the swamp or the system.
There’s a great post on the problems with The Machine, the impersonal bureaucratic manifestation that takes all human consideration out of creating the playground we run civilisation on.
I will vote first, it must be said, for a Machine: the Machine that has the allegiance of the bulk of my country's civil servants and professional class, no matter who is in office; the Machine that coiled up tightly around Biden while it thought it could hide his decline, then spat him out with a thousand beautifully written thinkpieces when it realized it could not. I will vote for a Machine that sneered at a few of its more independent-minded members—Ezra Klein, Nate Silver, others—when they pointed out the obvious truth that Biden should have dropped out a year ago. I will vote for a Machine that knows it needs my vote but can hardly hide its scorn for independent voters who push against parts of its plan, one that put an ostensible moderate in office before crowing about accomplishing the furthest left political agenda in decades.
This observation comes from lived experience. An enormous number of people dislike living in what they consider a Kafkaesque bureaucracy: one they think is impersonal, hobbled by regulations that work at cross purposes to their intent, and a source of anguish. Anthropomorphised, that’s The Machine.
This is because of a fundamental disconnect. Politicians love policy but people love execution. People prefer “competent” government over any other adjective, whether it’s “bureaucratic” or “rigid” or “big” or sometimes even “democratic”. Politicians think having the right policy is the answer to most questions. Ban this, regulate that, add a rule here. Whether it’s climate change or tax or entrepreneurship or energy shortage or geopolitical jockeying for status.
But policies don’t mean much to anyone until they are implemented. In Berkeley it’s apparently illegal to whistle for a lost canary before 7 am, though I doubt this is being policed rigorously.
What people in power hear are policies; what people on the ground see are their implementations.
That's why The Machine exists. It was created, painstakingly, over decades and centuries, to make our lives better. It was built to be the instrument to enact the will of the people.
And so, when it starts doing things the system optimises for but which are plainly silly, everyone gets rightly upset. Like this.
For 7 chargers they spent $8 billion. (I got this wrong: most of the money isn’t disbursed yet, we got 61 charging ports at 15 stations, and 14,000 more are in progress. As of mid-April 2024, 19 states had awarded $287.6 million in NEVI funds.) This is a dramatic example of the lack of state capacity we’ve seen time and time again in a hundred different ways.
In California they just got $5 billion, to add to the 7 they already had, to help build four stations on a 5-mile extension. And then there’s high-speed rail. As of 2024, after nearly 30 years of planning and 15 years since voter approval, no segment of the high-speed rail system is operational. On costs: the initial 2008 estimate for the entire 800-mile system was about $33 billion. By 2023, the project had received about $23 billion in combined state and federal funding. Current cost estimates for just Phase 1 (San Francisco to Los Angeles) range from $89 billion to $128 billion. Nearly all of the initial $9.95 billion in bond funding has been spent and 0 miles have been built.
Whether it’s spending $45B on rural broadband without yet connecting anyone or building high-speed rail, we see the tatters of what could be. Whether it’s the need for VaccinateCA by patio11, or the CHIPS Act not moving fast enough with disbursements, or a bloody bee stopping Meta from going ahead with a nuclear reactor to power its datacenter, or every problem the city of San Francisco has in spades, or the simultaneous overreach and underreach of the FDA during the pandemic, or bioethicists gatekeeping individuals from curing cancer, or suing SpaceX over trivialities, or medical ethics strangling research, or NEPA, or … the list is endless.
People feel this. The process as currently enshrined imposes considerations that stop things from happening everywhere, not just trains in the United States. Just read this.
My USAID grant to curb violence among the most violent young men in Lagos, Nigeria—through a course of therapy and cash proven in Liberia & Chicago—is bedeviled by an internal environmental regulator concerned the men will spend the money on pesticides. In the middle of one of the world’s largest cities.
This isn't caused by the federal government or the President. But it's linked to them inextricably, because they seem to defend the Machine. Not acknowledging it, or not promising to stop it, makes everyone think you’re part of the problem, especially when you promise to be the solution3.
This isn’t at all to suggest most of the government is like this. The Fed is excellent. NOAA seems great. FEMA too. There are tons of pockets of exceptional performance by dedicated civil servants. They even fixed the DMV.
But bad implementation is endemic. It’s everywhere. State capacity is anaemic. And until it can be fixed, there can be no party of competence. Only parties shouting at each other about who created which mess instead of cleaning anything up. This is why people argue to the death over taxes, one of the few things that can be implemented properly. This is why people think the “experts” who said you need to do this aren’t experts any longer, and shouldn’t be trusted. This is why people argue over details like how many sinks you need before peeling a banana, or over eliminating the DoE, with no middle ground.
Fixing this is the only way to bring some positive energy to politics4. Not right or left or ideologies writ in stone, but competence. Building something meaningful and bulldozing what's in your way to get it done5. This is a strong positive vision of what the world could be, and it needs a champion. We should embrace it.
This seems a problem, though not the same problem that most think.
Have you seen the number of debates on Twitter about what inflation actually means?
Doesn’t help that you end up becoming the defender of the status quo unless you rail against it. Which is hard when you’re the one in power. But that’s the ballgame. In 2008 you had a convenient villain in the “bankers”.
Not just politics. Every large organisation has this problem. Whether it’s the FDA or IBM, the problems are the same: death by a thousand papercuts.
“Why We Love Robert Caro And His Work On Lyndon Johnson”
2024-10-16 05:30:08
When I landed in India last week, my friend came to pick us up. And as we got in his car and started to drive out, we went the wrong way inside the giant parking garage at Delhi airport where all directions look the same1.
But he drove intrepidly. With no hesitation, just looking for a direct way down, we soon came to a set of cones blocking an exit and a stern looking guard, who enforced the cone. I looked back to see where else we could go, to circle around, and instead my friend lowered the window and asked the guard if he could just move the cone. A series of negotiations took place, short and to the point, about the pain of having to drive around versus the practical ease of moving a cone, and the guard relented. We drove down.
Once we were out and in the thick of traffic, similar interactions showed up again. India, modern India, has well defined lanes and beautiful highways. It also has absolutely no sense of traffic norms. Every inch of space is seen as a battleground to be won.
Like little Napoleons every autorickshaw and car and honking motorcycle would try to shoehorn into the three inches that opened up in front. It’s like a game of whack-a-mole in reverse.
And everyone knows the score. You can almost see them constantly playing multi-participant chicken. “Good game” you can almost hear them thinking as you jump the car ahead where it shouldn’t have gone, just to block the bike that would’ve cut you off, while the rickshaw stands perpendicular to the flow of traffic playing the same dance and trying to cut across the roundabout.
This happened repeatedly across most walks of life over the next few days. To skip a line, to find parking, to get your concierge to buy something, to negotiate a safari booking that needs to get changed, to get a customised menu to eat, to argue against an unjust traffic stop, it goes on and on.
Life in India is a series of bilateral negotiations conducted a thousand times a day. And that drives the character of life here.
Now, I am seeing the country properly after several years. And it’s a major change.
Visible infrastructure has gotten much better. Roads are good, well maintained, and highways are excellent. They built 7500 miles last year, just as the year before. And they’re fantastic.
Air travel is great, and airports are absolutely spectacular. I used to live in Singapore and say that Changi is its true jewel, and can now say the same about Bangalore. The airport is gorgeous.
Even trains have gotten better, even if train stations aren't there yet. The number of beggars on the street has reduced. Shops got uplifted. Mobile phones are better and everyone has one. Payment infrastructure is aeons beyond the West. And it’s safe, which a country without a strong legal safety net and a lot of poverty arguably shouldn’t be. There’s no real urban crime.
Restaurants, bars, pubs, these are world class. Same for most shopping. Even delivery. You can order hot chai or a small glass of tonic or a cup of milk and it comes in 10 mins to your door.
Daron Acemoglu, the famous economist who recently won the Nobel prize, has talked extensively about the importance of institutions in economic development2. And it’s true, institutions do matter. A lot. Property rights, restricting elite capture, supporting employment: these are all necessary aspects of helping a country progress. Built via ‘inclusive institutions’.
India has pretty good institutions in this view, or at least far better than it used to have. The laws are well defined even though the legal system runs like molasses, and state capacity is reasonably bounded. It’s not perfect by any means.
What it does not yet have, and what costs it at least a percentage point of GDP growth a year, are good informal institutions. One might even call them culture.
Acemoglu considers the informal institutions endogenous, shaped by the foundational formal institutions rather than serving as foundational themselves. In this view, this is why Northern Italy exhibits higher levels of social trust and economic performance compared to the South. Or why we see varied success in transitioning to market economies.
Douglass North, another prominent economist, in his work on Institutions and Economic Performance wrote about the success and failure of economies as largely dependent on the institutions that structure human interaction. These aren’t just formal institutions, like laws or regulations, but also informal ones, like cultural norms.
The theory is that with sufficiently strong formal institutions, you can shape the culture. For instance by enforcing fare payments in subways, you can change the behaviour of people such that they don’t dodge fares.
Acemoglu, however, gives this only secondary importance. Or rather, he treats it as something that results from having strong formal institutions. And in doing so he seems to beg the very question he answers: good institutions enable growth, but what enables good institutions in the first place?
India seems contradictory on these accounts. It has somewhat competent formal institutions, fairly chaotic informal norms, and a culture with both strengths and weaknesses when it comes to getting things done, alongside a strong economy that should ideally be even stronger but seems constantly held on a leash.
A simpler explanation might be that institutions are easy to set up well at the top, but incredibly difficult to make percolate through the whole society. Why, after all, would you decide to follow the traffic rules? Even in the West there’s no enforcement regime that can catch everyone who speeds, or everyone who litters. But still we mostly comply. So it can’t just be the stick trickling down from formal institutions. The informal, bottom-up behaviour clearly matters.
The question is what drives such behaviour, and whether and how we can shift it.
Patrick Collison asked this question differently.
Why are so many things so much nicer in Switzerland and Japan?
Compared to almost all countries (including the US), Switzerland and Japan seem to possess much higher baseline execution quality in almost everything. Buses and trains are better (and more punctual); low-end food is tastier; cheap hotels are more comfortable; their airlines score higher on international indexes (despite not spending more per passenger); streets are cleaner; grocery stores and corner stores are nicer; ostensibly unremarkable villages have more beautiful buildings and are more pleasant places to spend a few days.
(This phenomenon, whatever it is, may extend even further. The homicide rates in both Japan and Switzerland are about a tenth of that of the US and less than half those of England, France, and Germany.)
What's going on? While wealth is clearly some of the story, it isn't just a matter of wealth: GDP per capita in Japan substantially lags that of the US. Nor does it seem to be a matter of historical wealth. (1900 Japan was even further behind.) It doesn't seem to be the simple result of long periods of stability. (Again, Japan.)
So, what exactly is this effect? Which things are truly better and by how much? Are these effects the result of the same kind of phenomenon in Switzerland and Japan? Is this in any way related to their topping the economic complexity index? Could other countries acquire this "general execution capital"? And are there meaningful trade-offs involved or is it a kind of free lunch?
Living in a country built off of bilateral negotiations for everything is simultaneously the libertarian dream and an incredibly inefficient way to do most collective things. Ronald Coase told us this in 1960.
if property rights are well-defined and transaction costs are low, private parties can negotiate solutions to externalities without the need for government intervention
But Indian life is dominated by transaction costs. Every time a driver pokes his car into a turn when the signal isn’t his, it creates friction that ripples through the entire system. Every time someone has to spend effort on a 1:1 negotiation, they lose time and efficiency. Horribly so.
Just look at driving. Half the time you’re stuck behind trucks overtaking other trucks while dodging a motorcycle driving on the divider line, and this all but guarantees that your actual speed will be at least 20% below the limit. (I measured this, so n = 3, on 3 separate trips.) What’s the GDP impact of that?
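To make that question concrete, here’s a rough back-of-envelope sketch in Python. Every number in it (negotiations per day, seconds lost each time, people affected, the value of an hour) is a made-up placeholder rather than a measurement; the point is only how quickly per-interaction friction compounds.

```python
# Back-of-envelope: how much time does settling everything by 1:1 negotiation burn?
# Every number here is an assumed placeholder, not a measurement.

negotiations_per_day = 30        # assumed: merges, queue-jumps, small haggles per person
seconds_lost_each = 20           # assumed: average time burned per negotiation
people_affected = 300_000_000    # assumed: people regularly caught up in this
value_of_hour_usd = 2.0          # assumed: rough value of an hour of their time

hours_per_person_per_year = negotiations_per_day * seconds_lost_each * 365 / 3600
total_cost_usd = hours_per_person_per_year * people_affected * value_of_hour_usd

print(f"~{hours_per_person_per_year:.0f} hours per person per year lost to friction")
print(f"~${total_cost_usd / 1e9:.0f}B per year at these (made-up) numbers")
```

Even with deliberately modest placeholders you end up at tens of billions of dollars of lost time a year. That’s the shape of the argument, not an estimate.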
The reason this isn't an easy fix is that the ability to negotiate everything is also the upside. When every rule is negotiable, you get to push back on silly things, like a section of a parking garage closed off with rubber cones, just by asking. Life in the West feels highly constricted primarily because of this: we’re all drowning in rules.
People sometimes talk about this in terms of “low trust” vs “high trust” societies. Francis Fukuyama wrote about this, discussing how low-trust societies often develop more rigid, hierarchical structures and rely on informal personal networks. But I feel that misses what’s happening here. The negotiations aren’t a matter of trust, not when it’s about traffic; rather, trust is the result of a different cultural Schelling point. Trust isn't a one-dimensional vector.
What causes the average driver on Indian roads to treat driving like water filling cracks in the pavement? It's not trust, it's the lack of an agreed-upon equilibrium. There are no norms to adhere to.
We might as well call those norms culture.
Normally this movement to a new Schelling point happens first as a result of better enforcement mechanisms. If the rules that already exist start being applied much more stringently, they start becoming part of normal behaviour.
Japan did this post war through an emphasis on group harmony and obedience to rules. Singapore through extremely stringent application of rules and education campaigns. Germany, again post war, had strict enforcement of laws. Sweden in the 30s and 40s focused a lot on building a culture of cooperation and emphasising the “social contract”.
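One way to see the “no agreed-upon equilibrium” point is as a simple coordination game. The sketch below is purely illustrative: the payoffs are invented, the fine of 2 is arbitrary, and two players stand in for many. Without enforcement the game is chicken, with two lopsided stable outcomes (one driver pushes, the other yields, and you never know which you’ll get); a reliably applied penalty for pushing makes mutual yielding the unique equilibrium, which is the enforcement-shifts-the-norm argument in miniature.

```python
# A toy two-driver game: each chooses to "yield" or "push" into the gap.
# Payoffs are invented for illustration only. Without a fine this is chicken:
# the stable outcomes are lopsided (one pushes, one yields), so there is no
# shared norm to coordinate on. A reliably applied fine for pushing makes
# yield/yield the unique equilibrium: enforcement shifting the Schelling point.

from itertools import product

STRATEGIES = ("yield", "push")

def payoff(me: str, other: str, fine: float = 0.0) -> float:
    base = {
        ("yield", "yield"): 3,   # smooth traffic, everyone gets home
        ("yield", "push"):  1,   # I wait, the other driver gains
        ("push",  "yield"): 4,   # I gain a few car lengths
        ("push",  "push"):  0,   # gridlock at the roundabout
    }[(me, other)]
    return base - (fine if me == "push" else 0.0)

def pure_nash_equilibria(fine: float = 0.0) -> list[tuple[str, str]]:
    eqs = []
    for a, b in product(STRATEGIES, repeat=2):
        a_best = all(payoff(a, b, fine) >= payoff(alt, b, fine) for alt in STRATEGIES)
        b_best = all(payoff(b, a, fine) >= payoff(alt, a, fine) for alt in STRATEGIES)
        if a_best and b_best:
            eqs.append((a, b))
    return eqs

print("No enforcement:   ", pure_nash_equilibria(fine=0.0))  # lopsided outcomes
print("Fine of 2 to push:", pure_nash_equilibria(fine=2.0))  # everyone yields
```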
By 2034, India will have added a few trillion dollars to its GDP. Just as it added incredible airports and highways and cleaner streets (yes, really) and metro systems and the best retail and F&B scene in the world, it will continue to get better. And as it gets better, people will behave better, as behooves the surroundings. This is the Broken Windows theory in reverse.
As individuals get richer, as some of those few trillion dollars trickle through the economy, the types of goods and services they demand will change. Small broken-up shops that double as sleeping quarters, open to the street with livestock wandering in and out, will change. We will see the same gentrification we saw elsewhere, but we will also see better social norms become more common.
India remains a land of conundrums. It contains enough chaos to raise transaction costs and hold back much of its potential progress, a large gap to potential GDP sticking like so many barnacles on the hull. What it needs is more collective game theory: a world where you don’t need to be a greedy algorithm, where you can rely on other people’s behaviour.
All growth stories are stories of cultural transformation. Cultural shifts often require deliberate effort and collective will. Not just South Korea, Singapore and China, but also Brazil, with “jeitinho brasileiro” holding it back, Rwanda focusing on unity, Botswana on legal frameworks and reducing corruption.
On-the-ground cultural shifts are reflexive with government policies, but they also move to their own beat. Swachh Bharat, the mission to eliminate open defecation, made substantial progress by building infrastructure but also through campaigns for behaviour change.
But the framework is clear: move away from bilateral negotiations and towards a Coasian equilibrium. That’s the next cultural milestone we need to get to.
To be fair to the parking garage, both he and I have gotten lost in shopping malls before
What *is* an institution, however, remains a highly contentious topic