2025-04-26 04:30:48
So yesterday I got this email. A few seconds after my o3 queries got downgraded to gpt-4o-mini (I noticed when the answers got worse). Then it stopped entirely. Then my API calls died. Then my open chat windows hung. And the history was gone.
I have no idea why this happened. I’ve asked people I know inside OpenAI, they don't know either. Might be an API key leak, but I’d deleted all my keys a couple days ago so it shouldn’t have been an issue. Could be multiple device use (I have a couple laptops and phones). Might be asking simultaneous queries, which again doesn’t seem that much of an issue?
Could it be the content? I doubt it, unless OpenAI hates writing PRDs or vibe coding. I can empathise. Most of my queries come nowhere near triggering anything. I mean, this is what ChatGPT seemed to find interesting amongst my questions.
Or this.
We had a lot of conversations about deplatforming from, like, 2020 to 2024, when the vibes changed. That was in the context of social media, where there were bans both legitimate (people knew why) and illegitimate (nobody knew why), and where bans were govt "encouraged".
That was rightly seen as problematic. If you don’t know why something happened, then it’s the act of a capricious algorithmic god. And humans hate capriciousness. We created entire pantheons to try and make ourselves feel better about this. But you cannot fight an injustice you do not understand.
In the future, or even in the present, we’re at a different level of the same problem. We have extremely powerful AIs which are so smart the builders think they will soon disempower humanity as a whole, but where people are still getting locked out of their accounts for no explainable reason. If you are truly building God, you should at least not kick people out of the temples without reason.
And I can’t help wonder, if this were entirely done with an LLM, if OpenAI’s policies were enforced by o3, Dario might think of this as a version of the interpretability problem. I think so too, albeit without the LLMs. Our institutions and policies are set up opaquely enough that we do have an interpretability crisis.
This crisis is also what made many if not most of us angry at the world, throwing the baby out with the bathwater when we decried the FDA and the universities and the NIH and the IRS and NASA and …. Because they seemed unaccountable. They seemed to have generated Kafkaesque drama internally so that the workings are not exposed even to those who work within the system.
It’s only been a day. I have moved my shortcuts to Gemini and Claude and Grok to replace my work. And of course this is not a complex case and hopefully will get resolved. I know some people in the lab and maybe they can help me out. They did once before when an API key got leaked.
But it's also a case where I still don't know what actually happened, because it's not said anywhere. Nobody will, or can, tell me anything. It's "You Can Just Do Things", but the organisational Kafka edition. All I know is that my history of conversations, my Projects, are all gone. I have felt like this before. In 2006, when I wiped my hard drive by accident. In 2018, when I screwed up my OneDrive backup. But at least those cases were my fault.
In most walks of life we imagine that the systems that surround us are somewhat predictable. The breakdown of order in the macroeconomic sense we see today (April 2025) is partly because those norms and that predictability of rules have broken down. When systems are no longer predictable, or are seen as capricious, we move to a fragile world. A world where we cannot rely on the systems to protect us, but have to rely on ourselves or trusted third parties. We live in fear of falling into the interstitial gaps where the various organisations are happy to let us fester, unless one musters up the voice to speak up and the clout to get heard.
You could imagine a new world where this is rampant. That would be a world where you have to focus on decentralised ownership. You want open source models run on your own hardware. You back up your data obsessively, both yourself and to multiple providers. You can keep going down that road and end up wanting entirely decentralised money. Many have taken that exact path.
I’m not saying this is right, by the way. I’m suggesting that when the world inevitably moves towards incorporating even more AI into more of its decision making functions, the problems like this are inevitable. And they are extremely important to solve, because otherwise the trust in the entire system disappears.
If we are moving towards a world where AI is extremely important, if OpenAI is truly AGI, then getting deplatformed from it is a death knell, as a commenter wrote.
One weird hypothesis I have is that the reason I got wrongly swept up in some automated check is that OpenAI does not use LLMs to do this. They, same as any other company, rely on various rules, some ad hoc and some machine-learnt, that are applied with a broad brush.
If they had LLMs, as we surely will have in the near future, they would likely have been much smarter in terms of figuring out when to apply which rules to whom. And if they don’t have LLMs doing this already, why? Do you need help in building it? I am, as they say, motivated. It would have to be a better system than what we have now, where people are left to fall into the chinks in the armour, just to see who can climb out.
Or so I hope.
[I would’ve type checked and edited this post more but, as I said, I don’t have my ChatGPT right now]
Unless, is asking why my back hurts repeatedly causing trauma?
2025-04-07 22:29:33
It looks like the US has started a trade war. This is obviously pretty bad, as the markets have proven. 15% down in three trading sessions is a historic crash. And it's not slowing. There is plenty of econ 101 commentary out there about why this is bad, from people across every aisle you can think of, but it's actually pretty simple and not that hard to unravel anyway, so I was more interested in a different question: how the decision got made in the first place.
And thinking about what caused this to happen, even after Trump's rhetoric all through the campaign trail (which folks like Bill Ackman didn't believe to be literal), it seemed silly enough that there had to be some explanation.
Even before the formula used to calculate the tariffs was published showing how they actually analysed this, I wondered if it was something that came out of an LLM. It had that aura of extreme confidence in a hypothesis for a government plan. It's also something you can test. And if you ask the question "What would be an easy way to calculate the tariffs that should be imposed on other countries so that the US is on even-playing fields when it comes to trade deficit? Set minimum at 10%." to any of the major LLMs, they all give remarkably similar answers.
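For what it's worth, here is a paraphrase, in code, of the kind of formula the models converge on: the bilateral trade deficit divided by imports, floored at 10%. The country figures are invented, and this is my reconstruction of the gist, not any model's verbatim output.

```python
# A paraphrase of the formula the LLMs tend to converge on:
# tariff = max(10%, bilateral deficit / imports). Numbers below are made up.

def naive_reciprocal_tariff(exports_to: float, imports_from: float, floor: float = 0.10) -> float:
    """Tariff rate that would 'zero out' a bilateral trade deficit, floored at 10%."""
    if imports_from <= 0:
        return floor
    deficit = imports_from - exports_to        # US bilateral goods deficit
    return max(floor, deficit / imports_from)

# Hypothetical trade figures, in $bn: (US exports to, US imports from)
countries = {"A": (50, 200), "B": (120, 100), "C": (5, 8)}
for name, (exports_to, imports_from) in countries.items():
    print(f"Country {name}: {naive_reciprocal_tariff(exports_to, imports_from):.0%}")
```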
The answer is, of course, wrong. Very wrong. Perhaps pre-Adam Smith wrong.
This is “Vibe Governing”.
The idea that you could just ask for an answer to any question and take the output and run with it. And, as a side effect, wipe out trillions from the world economy.
A while back, when I wrote about the potential negative scenarios of AI, ones that could actually happen, I said two things. One was that once AI is integrated into the daily lives of a large number of organisations, it could interact in complex ways and create various Black Monday type scenarios (when the market fell sharply due to automated trading). The other scenario was that reliance on AI would make people take its answers at face value, sort of like taking Wikipedia as gospel.
But in the aggregate, the latter is good. It's better that they asked the LLMs. Because the LLMs gave pretty good answers even to really bad questions. They tried to steer the reader away from the recommended formula, noted the problems inherent in applying it, and explained in exhaustive detail the mistaken assumptions inside it.
When I dug into why the LLMs seem to give this answer even though it's obviously wrong from an economics point of view, it seemed to come down to data. First of all, asking about tariff percentages based on "putting the US on an even footing when it comes to trade deficit" is a weird thing to ask. It seems to have come from Peter Navarro's 2011 book Death by China.
The Wikipedia analogy is even more true here. If you have bad inputs somewhere in the model, you will get bad outputs.
LLMs, to their credit, and unlike Wikipedia, do try immensely hard to not parrot this blindly as the answer and give all sorts of nuances on how this might be wrong.
Which means two things.
A world where more people rely on asking such questions would be a better world because it would give a more informed baseline, especially if people read the entire response.
Asking the right questions is exceedingly important to get better answers.
The question is once we learn to ask questions a bit better, like when we learnt to Google better, whether this reliance would mean we have a much better baseline to stand on top of before creating policies. The trouble is that LLMs are consensus machines, and sometimes the consensus is wrong. But quite often the consensus is true!
So maybe we have easier ways to create less flawed policies especially if the writing up of those policies is outsourced to a chatbot. And perhaps we won’t be so burdened by idiotic ideas when things are so easily LLM-checkable?
On the other hand Google’s existed for a quarter century and people still spread lies on the internet, so maybe it’s not a panacea after all.
However, if you did want to be concerned, there are at least two reasons:
Data poisoning is real, and will affect the answers to questions if posed just so!
People seem overwhelmingly ready to “trust the computer” even at this stage
The administration thus far have been remarkably ready to use AI.
They used it to write Executive Orders.
The “research” paper that seemed to underpin the three-word formula seems like a Deep Research output.
The tariffs on Nauru and so on show that they probably used LLMs to parse some list of domains or data to set it, which is why we're setting tariffs on penguins (yes, really).
While they’ve so far been using it with limited skill, and playing fast and loose with both laws and norms, I think this perhaps is the largest spark of “good” I’ve seen so far. Because if the Federal Govt can embrace AI and actually harness it, perhaps the AI adoption curve won’t be so flat for so long after all, and maybe the future GDP growth that was promised can materialise, along with government efficiency.
Last week I wrote that, with AI becoming good, vibe coding was the inevitable future. If that's the case for technical work, which it slowly seems to be becoming, then the other parts of the world can't be far behind!
Whether we want to admit it or not Vibe Governing is the future. We will end up relying on these oracles to decide what to do, and help us do it. We will get better at interpreting the signs and asking questions the right way. We will end up being less wrong. But what this whole thing shows is that the base knowledge which we have access to means that doing things the old fashioned way is not going to last all that long.
2025-03-27 04:28:38
So, the consensus in many parts of the AI world is that we'll all be living in fully automated luxury space communism by the time my kids finish high school. Not only will certain tasks be automated, entire jobs will be, and not just jobs we have but every job we could ever think of. "Entire data centers filled with Nobel prize winners," as Dario Amodei, co-founder of Anthropic, put it.
Now look, you might not believe it; most people don't, after all it sounds kind of crazy. Even most investors who invest in this space don't believe it, as you can easily see from their investments. Nobody is really betting on employees becoming almost too cheap to meter, or thinking much beyond "what if the future is like today, but with more automation".
But you don't need to believe that the entire thing will necessarily happen, all or nothing, for you to think that the world of work will change quite a bit in the coming years.
If you believe this is coming, or even if it doesn't quite hit the metric of automating "every job that can be done with a computer", then surely doing things like writing code by hand is as archaic as calculating ballistic missile trajectories by hand. "Computer" used to be a job title; now it's just a machine.
And soon, we would be able to merely ask for something, and the entire apparatus would click into gear and things would just get made. In fact, that's what many have been doing since ChatGPT came out a couple of years ago. Ask for pieces of code, copy-paste them into your favourite IDE, and press run. Voila, you have yourself a running Python script or a website or an app.
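Something like this, say, where the script is a hypothetical stand-in for whatever you asked for (here, "summarise a CSV of expenses", assuming a file with category and amount columns):

```python
# The sort of thing a chatbot hands you when you ask for "a script that
# summarises a CSV of expenses": paste it, run it, done. (Hypothetical example.)
import csv
from collections import defaultdict

def summarise(path: str) -> dict:
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["category"]] += float(row["amount"])
    return dict(totals)

if __name__ == "__main__":
    for category, total in summarise("expenses.csv").items():
        print(f"{category}: {total:.2f}")
```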
Andrej Karpathy recently named this phenomenon of asking the models to do things and accepting it immediately as “vibe coding”. As all fantasy novels have taught us, naming is incredibly powerful, so this term has completely taken off. Vibe is of course the word of the century at this point.
Most of the arguments against doing this were basically different versions of saying that whoever does it will get burned, because the models do not yet know how to do things very well. People leaked login credentials and got laughed at.
If you want to know what the world will look like once this is the norm, we recently had a glimpse. It got particularly popular when Pieter created a simple, very simple, flying game, which became massively popular. How popular? It's barely a month and he's already making $100,000 a month from it. Not just because the game is really fun, after all virtual dog fighting games have been dime-a-dozen over the years. But because he created it almost entirely with vibe coding. Just him, Claude, and Cursor.
This made game developers, especially professional game developers, really angry. Understandably so, because they had to work really, really hard to make things a thousand times as good and still don't make a fraction of the money he's making. He developed the game in public, each new feature added either the same day or the next, and the features are as simple as "hey, you can now pilot a triangle around instead of a plane" and "do you want your name on the side of a blimp".
Right now Pieter needs to be smart enough to fix things quite a lot but presumably the idea is that very soon he would not need to be smart at all, or at least not conversant with code.
(I also think this should be a standing benchmark we all track regularly to see how well new models do arbitrarily complex things. The measure is "how good a game can a non-coder make with this?")
And people who are very conversant with code get leverage. Some are writing thousands of lines of code a day by managing multiple Claude Code agents, each of which (whom?) is working on one particular feature, each of whom will submit a PR for the author to review, effectively automating large chunks of software development. This isn't tomorrow, it's happening now. Today.
Yeah yeah it's still slow and expensive and error prone and hallucinates and sometimes tries to change the test to pass it and can't do long code bases and … But still. You get to type in a thing you want or point to a JIRA ticket and bam! Almost working code!
It's got vast limitations still. It cannot work on super large code bases, it hallucinates or makes mistakes, and it's sometimes so eager to pass a test that it will try to hard-code the answer or find an end run around it. But still, man, a couple of years ago it could barely write a Python script that was right…
Whether you want to or not, you're gonna change from being an Individual Contributor to a Manager. The only question is what you manage and how tiresome they are to manage.
What this means for work is chaos. Almost everyone will have "part time AI trainer" as their job description, that's for sure. Every company will have its AI employees outnumber its human employees. There will be a dislocation, arguably it's started happening now, where hiring for PMs and sales and marketing and engineers has already subsided. And guess what. This mostly won't matter much, because humans are already outnumbered by any number of things and we just grow the organisations or make more work to compensate.
It also means individual productivity will be a function of how much inference you can "suck" from the models. There's a capex component if you want to host or train them. But there's also an opex component. I racked up $6-15 per hour when I was messing with Claude Code. I'm not the most productive engineer, so the actual figure for a productive engineer would be higher than that. And even higher if you have multiple agents running at a time, as you should, and as some already are.
Since I wrote this Steve Yegge wrote an excellent post that said something similar.
Running N agents at once multiplies your devs' daily agent spend of $10/day by N, not counting cloud costs, just token burn. If your developers are each on average running, say, five agents at once – a very conservative number, since agents will work mostly independently, leaving the dev free to do other stuff – then those devs are each now spending $50/hr, or roughly $100k/year.
It’s not really a steal anymore, so much as a heist. We’re talking about each developer gradually boosting their productivity by a multiplier of ~5x by Q4 2025 (allowing for ramp-up time), for an additional amortized cost of only maybe $50k/year the first year. Who wouldn’t go for that deal?
Unfortunately, you almost certainly didn't include $50k/year per developer of LLM spend in your 2026 operating budget.
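As a back-of-envelope sketch of that arithmetic, taking the per-agent burn as roughly $10 an hour (the rate implied by the $50/hr figure in the quote) and treating every other number as an assumption you should swap for your own:

```python
# Back-of-envelope agent opex. All inputs are assumptions; swap in your own.
hourly_spend_per_agent = 10.0     # $/hr of token burn per coding agent
agents_per_dev = 5                # agents running concurrently per developer
hours_per_year = 2000             # rough working hours in a year

annual_per_dev = hourly_spend_per_agent * agents_per_dev * hours_per_year
print(f"Per-dev agent spend: ${annual_per_dev:,.0f}/year")    # -> $100,000/year

# My own single-stream Claude Code usage landed at roughly $6-15/hr:
for rate in (6, 15):
    print(f"One agent at ${rate}/hr: ${rate * hours_per_year:,.0f}/year")
```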
And this goes so much further once you have agents which can run the Claude Code coding agents. To do your PR review, check that they ran the unit tests properly, check that the change makes sense, and do the other things you'd otherwise end up doing yourself. This too won't be perfect, but it will slowly get better. When I say slowly I mean week by week, not year by year.
This is not just true of coding, it's true of an extraordinarily large percentage of white collar work. Coding just happens to be the obsession for the models right now, mostly because people developing them are obsessed with coding. Sort of like how silicon valley cannot stop creating new devops startups.
Any job which has sufficient data to train on, which is almost every job, and has decent ways of telling if something is right or wrong, which is also a large enough number of those jobs, can't help but transform. Can you imagine the same principle being applied to finance? Literature review? Report writing? PRDs? Contract writing?
It's already started. Every forward-leaning tech company is already doing this. From Stripe to big tech to Linear to every startup, especially in the YC group, it's absolutely dominating every hiring decision. Not just coding, though that's the leading one, but also marketing and sales and operations and product management, and even lawyers and compliance.
Or even infographic creators.
This is the future that’s being built right now. The world of work has already transformed. We're much wealthier than three decades ago, programmers do vastly different tasks helped along by automation and the congealed remnants of decisions made firm from the eras gone past. Consultants today, for instance, do on their first day what would've taken an entire three month project in the 90s. And yet, consultants have grown as an industry. Most jobs are like this.
Even if the AI train comes to a crashing halt for some reason, this trend will only slow, not stop. Even if this is where it stops, the world of work will be transformed. And if you believe that AI will get better, which seems true whether that's 2x or 200x, the only way to do it well is to use it.
2025-03-05 21:57:43
AI has completely entered its product arc. The models are getting better, but they're not curios any longer; we're using them for stuff. And all day, digitally and physically, I am surrounded by people who use AI for everything. From those who have it completely integrated into every facet of their daily workflow, to those who spend their time basically chatting with it or using only the command line, and everything in between. People who spend most of their time either playing around with or learning about whichever model just dropped.
Despite this I find myself almost constantly the only person who actually seems to have played with, or is interested in, or even likes Gemini. Apart from a few friends who actually work on Gemini, though even they look apologetic most of the time, and give an I-can't-quite-believe-it slight smile the handful of other times, when they realise you're not about to yell at them about it.
This seemed really weird, especially for a top-notch research organisation with infinite money, and it annoys me, so I thought I would write down why. I like big models, models that "smell" good, and seeing one hobbled is annoying. We need better AI, and the way we get it requires strong competition. It's also a case study, but this is mostly a kvetch.
Right now we only have Hirschman's Exit; it's time for Voice. I'm writing this out of a sense of frustration because a) you should not hobble good models, and b) as AI is undeniably moving to a product-first era, I want more competition in the AI product world, to push the boundaries.
Gemini, ever since it released, has felt consistently like a really good model that is smothered under all sorts of extra baggage. Whether it’s strong system prompts or a complete lack of proper tool calling (it still says it doesn’t know if it can read attachments sometimes!), it feels like a WIP product even when it isn’t. It seems to be just good enough that you can try to use it, but not good enough that you want to use it. People talk about a big model smell, similarly there is a big bureaucracy smell, and Gemini has both.
Now, despite the annoying tone, which feels like a know-it-all who has exclusively read Wikipedia, I constantly find myself using it whenever I want to analyse a whole code base, or analyse a bunch of PDFs, or if I need to have a memo read and rewritten, and especially if I need any kind of multimodal interaction with video or audio. (Gemini is the only other model that I have found offers good enough suggestions for memos, same as or exceeding Claude, even though the suggestions are almost always couched in the form of exceedingly boring-looking bullet points and a tone I can only describe as corporate cheerleader.)
Gemini also has the deepest bench of interesting ideas that I can think of. It had the longest context, multimodal interactivity, the ability to keep looking at your desktop while you chat to it, NotebookLM, the ability to export documents directly into Google Drive, the first Deep Research, LearnLM for actually learning things, and probably ten more things that I'm forgetting, equally breakthrough, that nobody uses.
Oh and Gemini live, the ability to talk live with an AI.
And the first reasoning model with visible chain-of-thought traces, in Gemini 2.0 Flash Thinking.
And even outside this, Google is chock full of extraordinary assets that would be useful for Gemini. Hey Google is an example that seems to have helped a little bit. But there's also Google Search itself to provide grounding, the best search engine known to the planet. Google Scholar. Even News, a clear way to provide real-time updates and have a proper competitor to X. They had Google Podcasts, which they shuttered, unnecessarily in my opinion, since they could easily have created a version based only on NotebookLM.
Also Colab, an existing way to write and execute Python code and Jupyter notebooks, including GPU and TPU support.
Colab even just launched a Data Science Agent, which seems interesting and similar to Advanced Data Analysis. But, true to form, it's also in a separate user interface, on a separate website, as a separate offering. One that's probably great for those who use it, which is overwhelmingly students and researchers (I think), one with 7 million users, and one that's still unknown to the vast majority who interact with Gemini! But why would this sit separate? Shouldn't it be integrated with the same place I can run code? Or write that code? Or read documents about that code?
Colab even links with your Github to pull your repos from there.
Gemini even has its Code Assist, a product I didn’t even know existed until a day ago, despite spending much (most?) of my life interacting with Google.
Gemini Code Assist completes your code as you write, and generates whole code blocks or functions on demand. Code assistance is available in many popular IDEs, such as Visual Studio Code, JetBrains IDEs (IntelliJ, PyCharm, GoLand, WebStorm, and more), Cloud Workstations, Cloud Shell Editor, and supports most programming languages, including Java, JavaScript, Python, C, C++, Go, PHP, and SQL.
It supposedly can even let you interact via natural language with BigQuery. But, at this point, if people don’t even know about it, and if Google can’t help me create an entire frontend and backend, then they’re missing the boat! (by the way, why can’t they?!)
Gemini even built the first co-scientist that actually seems to work! Somehow I forgot about this until I came across a stray tweet. It's a multi-agent system that generates scientific hypotheses through iterated debate, reasoning, and tool use.
Gemini has every single ingredient I would expect from a frontier lab shipping useful products. What it doesn’t have is smart product sense to actually try and combine it into an experience that a user or a developer would appreciate.
Just think about how much had to change, how much had to be pushed uphill, to get a better AI Studio in front of us, arguably Gemini's crown jewel! And that was already a good product. Now think about anyone who had to suffer through using Vertex, and these dozens of other products, each with its own KPIs and user base and profiles.
I don’t know the internal politics or problems in making this come about, but honestly it doesn’t really matter. Most of the money comes from the same place at Google and this is an existential issue. There is no good reason why Perplexity or Grok should be able to eat even part of their lunch considering neither of them even have a half decent search engine to help!
Especially as we're moving from individual models to AI systems which work together, Google's advantages should come to the fore. I think the flashes of brilliance that the models demonstrate are a good start, but man, they've got a lot to build.
Gemini needs a skunkworks team to bring this whole thing together. Right now it feels like disjointed geniuses putting LLMs inside anything they can see - inside Colab or Docs or Gmail or Pixel. And some of that’s fine! But unless you can have a flagship offering that shows the true power of it working together, this kind of doesn’t matter. Gemini can legitimately ship the first Agent which can go from a basic request to a production ready product with functional code, fully battle-tested, running on Google servers, with the payment and auth setup, ready for you to deploy.
Similarly, not just for coding, you should be able to go from iterative refinement of a question (a better UX would help, to navigate the multiple types of models and the mess), to writing and editing a document, to searching specific sources via Google and researching a question, to final proof reading and publishing. The advantage it has is that all the pieces already exist, whereas for OpenAI or Anthropic or Microsoft even, they still need to build most of this.
While this is Gemini specific, the broad pattern is much more general. If Agents are to become ubiquitous they have to meet users where they are. The whole purpose is to abstract the complexity away. Claude Code is my favourite example, where it thinks and performs actions and all within the same terminal window with the exact same interaction - you typing a message.
It’s the same thing that OpenAI will inevitably build, bringing their assets together. I fought this trend when they took away Code Interpreter and bundled it into the chat, but I was wrong and Sama was right. The user should not be burdened with the selection trouble.
I don't mean to say there's some ultimate UX that's the be-all and end-all of software. But there is a better form factor for using the models we have to their fullest extent. Every single model is being used suboptimally, and it has been the case for a year. Because to get the best from them is to link them across everything that we use, and doing that is hard! Humans context-switch constantly, and the only way models can is if you give them the context. This is so incredibly straightforward that I feel weird typing it out. Seriously, if you work on this and are confused, just ask us!
Google started with the iconic white box. Simplicity to counter the insane complexity of the web. Ironically now there is the potential to have another white box. Please go build it!
PS: A short missive. If any of y’all are interested in running your python workloads faster and want to crunch a lot of data, you should try out Bodo. 100% open source: just “pip install bodo” and try one of the examples. We’d also appreciate stars and comments!
Github: https://github.com/bodo-ai/Bodo
2025-02-28 04:05:01
Evaluating people has always been a challenge. We morph and change and grow and get new skills. Which means that the evaluations have to grow with us. Which is also why most exams are bound to a supposed mean of what people in that age group or cohort should know. As you know from your schooling, this is not perfect, and this is really hard.
The same problem exists with AI. Each day you wake up, and there’s a new AI model. They are constantly getting better. And the way we know that they are getting better is that when we train them we analyze the performance across some evaluation or the other.
But real world examinations have a problem. A well understood one. They are not perfect representations of what they measure. We constantly argue whether they are under or over optimizing.
AI is no different. Even as the models have gotten better generally, it's also become much harder to know what they're particularly good at. Not a new problem, but it's still a more interesting problem. Enough that even OpenAI admitted that their dozen models with various names might be a tad confusing. What exactly does o3-mini-high do that o1 does not, we all wonder. For that, and for figuring out how to build better models in the future, the biggest gap remains evaluations.
Now, I’ve been trying to figure out what I can do with LLMs for a while. I talk about it often, but many times in bits and pieces. Including the challenges with evaluating them. So this time around I thought I’d write about what I learnt from doing this for a bit. The process is the reward. Bear in mind this is personal, and not meant to be an exhaustive review of all evaluations others have done. That was the other essay. It’s maybe more a journey of how the models got better over time.
The puzzle, in a sense, is simple: how do we know what a model can do, and what to use it for. Which, as the leaderboards keep changing and the models keep evolving, often without explicitly saying what changed, is very very hard!
I started like most others, thinking the point was to trick the models by giving them harder questions. Puzzles, tricks, and questions about common sense, to see what the models actually knew.
Then I understood that this doesn't scale, and that what the models do in evaluations of math tests or whatever doesn't correlate with how well they do tasks in the real world. So I collected questions across real-life work, from computational biology to manufacturing to law to economics.
This was useful in figuring out which models did better with which sorts of questions. This was extremely useful in a couple of real world settings, and a few more that were private to particular companies and industries.
To make it work better I had to figure out how to make the models work with real-life type questions. Which meant adding RAG capabilities to read documents and respond, search queries to get relevant information, write database queries and analyse the responses.
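A minimal sketch of the sort of harness I mean, where ask_model is a placeholder for whatever client you actually call, and the question, documents, and ground truth are invented:

```python
# Minimal sketch of a domain-question harness: questions with ground truth, a
# crude retrieval step, and a per-model score. `ask_model` is a placeholder
# for whatever API client you use; the data below is invented.
from typing import Callable

QUESTIONS = [
    {"q": "What growth medium was used in batch 7?", "truth": "LB broth",
     "docs": ["Batch 7 was cultured in LB broth at 37C.",
              "Batch 8 used M9 minimal medium."]},
]

def retrieve(question: str, docs: list[str]) -> str:
    """Crudest possible retrieval: the doc sharing the most words with the question."""
    words = set(question.lower().split())
    return max(docs, key=lambda d: len(words & set(d.lower().split())))

def evaluate(ask_model: Callable[[str], str]) -> float:
    correct = 0
    for item in QUESTIONS:
        prompt = f"Context: {retrieve(item['q'], item['docs'])}\n\nQuestion: {item['q']}"
        correct += item["truth"].lower() in ask_model(prompt).lower()
    return correct / len(QUESTIONS)

# evaluate(lambda prompt: call_your_model(prompt))  # plug in any model client
```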
This was great, and useful. It’s also when I discovered for the first time that the Chinese models were getting pretty good - Yi was the one that did remarkably well!
This was good, but after a while the priors just took over. There was a brief open vs closed tussle, but otherwise it was just "use OpenAI".
But there was another problem. The evaluations were too … static. After all, in the real world, the types of questions you need answers to change often. Requirements shift, needs change. So the models have to be able to answer questions even when the types of questions being asked, or the "context" within which the questions are asked, changes.
So I set up a “random perturbation” routine. To check whether it has the ability to answer queries well even when you change the underlying context. And that worked pretty nicely to test the models’ ability to change its thinking as needed, to show some flexibility.
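A toy version of that routine, assuming a hypothetical ask_model callable and a deliberately simple arithmetic template; the point is only that the context, and hence the expected answer, changes on every run:

```python
# Sketch of the "random perturbation" idea: mutate the numbers in the context,
# re-derive the expected answer, and check the model tracks the change rather
# than pattern-matching a memorised example. `ask_model` is a placeholder.
import random

def make_case(rng: random.Random) -> tuple[str, str]:
    units, price = rng.randint(2, 9) * 100, rng.randint(3, 12)
    context = f"The plant produced {units} units at a cost of ${price} per unit."
    question = "What was the total production cost in dollars?"
    return f"{context}\n{question}", str(units * price)

def perturbation_eval(ask_model, n: int = 20, seed: int = 0) -> float:
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        prompt, expected = make_case(rng)
        hits += expected in ask_model(prompt)
    return hits / n
```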
This too was useful, and it changed my view on the types of questions one could reasonably expect LLMs to tackle, at least without significant context being added each time. The problem, though, was that this was only interesting and useful insofar as you had enough "ground truth" answers to check whether the model was good enough. And while that was possible for some domains, as the number of domains increased and the number of questions increased, the average "vibe check" and "just use the strongest model" rubrics easily outweighed using specific models like this.
So while they are by far the most useful way to check which model to use for what, they’re not as useful on a regular basis.
Which made me think a bit more deeply about what exactly we are testing these models for. They know a ton of things, and they can solve many puzzles, but the key issue is neither of those. The issue is that a lot of the work we do regularly isn't of the "just solve this hard problem" variety. It's of the "let us think through this problem, think again, notice something wrong, change an assumption, fix that, test again, ask someone, check the answer, change assumptions, and so on and on" variety, repeated an enormous number of times.
And if we want to test that, we should test it. With real, existing, problems of that nature. But to test those means you need to gather up an enormous number of real, existing problems which can be broken down into individual steps and then, more importantly, analysed!
I called it the need for iterative reasoning. Enter Loop Evals.
Much like Hofstadter’s concept of strange loops - this blog’s patron namesake, self-referential cycles that reveal deeper layers of thought - the iterative evaluation of LLMs uncovers recursive complexities that defy linear analysis. Or that was the theory.
So I started wondering what exists that has similar qualities, is also super easy to analyse, and doesn't need a human for it. And I ended up with word games. First I was thinking I wanted to test it on crosswords. Which was a little hard, so I ended up with Wordle.
And then later, to Word Grids (literally grid of words which are correct horizontally and vertically). And later again, sudoku.
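The appeal is that the scoring needs no human judgement at all. A rough sketch of the Wordle-style loop, with ask_model again a placeholder, and the duplicate-letter subtleties of real Wordle ignored:

```python
# Sketch of the Wordle-style eval loop: the model guesses, we score it
# mechanically, feed the feedback back, and count solved games. `ask_model`
# is a placeholder for an actual model call.

def score(guess: str, target: str) -> str:
    """G = right letter right spot, Y = in the word elsewhere, _ = absent.
    (Ignores duplicate-letter edge cases; fine for a sketch.)"""
    return "".join(
        "G" if target[i] == ch else ("Y" if ch in target else "_")
        for i, ch in enumerate(guess)
    )

def play(ask_model, target: str, max_turns: int = 6) -> bool:
    history = []
    for _ in range(max_turns):
        prompt = f"Guess the 5-letter word. Feedback so far: {history}. Reply with one word."
        guess = ask_model(prompt).strip().lower()[:5]
        if guess == target:
            return True
        history.append((guess, score(guess, target)))
    return False
```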
This was so much fun! Remember, this was a year and change ago, and we didn't have these amazing models like we do today.
But OpenAI kicked ass. Anthropic and others struggled. Llama too. None, however, were good enough to solve things completely. People cry tokenisation with most of these, but the mistakes the models make go far beyond tokenisation issues: issues of common logic, catastrophic forgetting, or just insisting things like FLLED are real words.
I still think this is a brilliant eval, though I wasn't sure what to make of its results, beyond giving an ordering for LLMs. And as you’d expect now, with the advent of better reasoning models, the eval stacks up, as the new models do much better!
Also, while I worked on it for a while, it was never quite clear how the "iterative reasoning" it analysed translated into which real-world tasks a model would be worst at.
But I was kind of obsessed with why evals suck at this point. So I started messing around with why models couldn't handle these iterative reasoning problems, and started looking at Conway's Game of Life. And evolutionary cellular automata.
It was fun, but not particularly fruitful. I also figured out that, if taught enough, a model could learn any pattern you threw at it, but following a hundred steps to figure something like this out was something LLMs were really bad at. It did surface a bunch of ways in which LLMs might struggle to follow longer-term complex instructions that need backtracking, but it still showed that they can do it, albeit with difficulty.
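For reference, one step of the Game of Life is a few lines of code; the hard part for an LLM is carrying the grid state forward a hundred such steps without slipping once. A sketch, with a toroidal (wrap-around) grid as my own simplifying assumption:

```python
# One Game of Life step on a toroidal grid, plus a hundred iterations of it.
def life_step(grid: list[list[int]]) -> list[list[int]]:
    rows, cols = len(grid), len(grid[0])
    def neighbours(r: int, c: int) -> int:
        return sum(grid[(r + dr) % rows][(c + dc) % cols]
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0))
    return [[1 if neighbours(r, c) == 3 or (grid[r][c] and neighbours(r, c) == 2) else 0
             for c in range(cols)] for r in range(rows)]

# A glider on a 5x5 grid, stepped 100 times - trivial here, brutal as a
# "simulate this in your head" task for an LLM.
grid = [[0] * 5 for _ in range(5)]
for r, c in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:
    grid[r][c] = 1
for _ in range(100):
    grid = life_step(grid)
```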
One might even draw an analogy to Kleene’s Fixed-Point Theorem in recursion theory, suggesting that the models’ iterative improvements are not arbitrary but converge toward a stable, optimal reasoning state. Alas.
Then I kept thinking it's only vibes based evals that matter at this point.
The problem with these evals, however, is that they are set against a fixed evaluation, while the capabilities of LLMs grow much faster than the ways we can come up with to test them.
The only way to fix that, it seemed, was to see if LLMs can judge each other, and then, if the rankings they give each other can in turn be judged by each other, to create a PageRank equivalent. So I created "sloprank".
This is interesting, because it’s probably the most scalable way I’ve found to use LLM-as-a-judge, and a way to expand the ways in which it can be used!
It is a really good way to test how LLMs think about the world and make it iteratively better, but it stays within its context. It’s more a way to judge LLMs better than an independent evaluation. Sloprank mirrors the principles of eigenvector centrality from network theory, a recursive metric where a node’s importance is defined by the significance of its connections, elegantly encapsulating the models’ self-evaluative process.
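A sketch of that idea: models rate each other's answers, and a PageRank-style power iteration turns the pairwise ratings into a single score per model. The ratings matrix below is invented, and this is my own compression of the approach rather than the actual sloprank code:

```python
# Models rate each other's answers; an eigenvector-centrality-style power
# iteration turns pairwise ratings into one score per model. Data is made up.
import numpy as np

models = ["model_a", "model_b", "model_c"]
# ratings[i, j] = how highly model i rated model j's answers (diagonal unused)
ratings = np.array([
    [0.0, 0.8, 0.3],
    [0.7, 0.0, 0.4],
    [0.9, 0.6, 0.0],
])

# Row-normalise so each rater distributes one unit of "judgement".
W = ratings / ratings.sum(axis=1, keepdims=True)

scores = np.full(len(models), 1 / len(models))
for _ in range(100):              # power iteration towards the fixed point
    scores = W.T @ scores
    scores /= scores.sum()

for name, s in sorted(zip(models, scores), key=lambda t: -t[1]):
    print(f"{name}: {s:.3f}")
```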
And to do that evaluation, the question became: how can you test the capabilities of LLMs by testing them against each other? Not single-player games like Wordle, but multiplayer adversarial games, like poker. That would ensure the LLMs are forced to create better strategies to compete with each other.
Hence, LLM-poker.
It's fascinating! The most interesting part is that the different models seem to have their own personalities in terms of how they play the game. And Claude Haiku seems to be able to beat Claude Sonnet quite handily.
The coolest part is that if it’s a game, then we can even help the LLMs learn from their play using RL. It’s fun, but I think it’s likely the best way to teach the models how to get better more broadly, since the rewards are so easily measurable.
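A sketch of the adversarial setup, reduced to its skeleton: a round-robin of heads-up matches with an Elo-style rating, where play_match is a placeholder for however you actually run a session of LLM poker:

```python
# Round-robin matches between models with an Elo-style rating, so strategies
# are scored against each other rather than against a fixed answer key.
# `play_match(a, b)` is a placeholder returning 1 if a wins, 0 if b, 0.5 tie.
import itertools

def update_elo(ra: float, rb: float, result: float, k: float = 32.0) -> tuple[float, float]:
    expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
    return ra + k * (result - expected_a), rb + k * ((1 - result) - (1 - expected_a))

def tournament(models: list[str], play_match, rounds: int = 10) -> dict[str, float]:
    elo = {m: 1000.0 for m in models}
    for _ in range(rounds):
        for a, b in itertools.combinations(models, 2):
            elo[a], elo[b] = update_elo(elo[a], elo[b], play_match(a, b))
    return elo
```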
The lesson from this trajectory is that it sort of mirrors what I wanted from LLMs, and how that's changed. In the beginning it was to get accurate enough information. Then it became: can it deal with "real world" scenarios of moving data and information, can it summarise the info given to it well enough. Then soon it became its ability to solve particular forms of puzzles which mirror real-world difficulties. And then it became: can they actually measure each other and figure out who's right about what topic. And lastly, now, it's whether they can learn and improve from each other, in an adversarial setting, which is all too common in our darwinian information environment.
The key issue, as always, remains whether you can reliably ask the models certain questions and get answers. The answers have to be a) truthful to the best of its knowledge, which means the model has to be able to say “I don’t know”, and b) reliable, meaning it followed through on the actual task at hand without getting distracted.
The models have gotten better at both of these, especially the frontier models. But they haven't gotten better at these nearly as much as they've gotten better at other things, like being able to do PhD-level mathematics or answering esoteric economics questions in perfect detail.
Now, while this is all idiosyncratic, interspersed with vibes-based evals and general usage guides discussed in DMs, the frontier labs are also doing the same thing.
The best recent example is Anthropic showing how well Claude 3.7 Sonnet does playing Pokemon. To figure out if the model can strategise, follow directions over a long period of time, work in complex environments, and reach its objective. It is spectacular!
This is a bit of fun. It’s also particularly interesting because the model isn’t specifically trained on playing Pokemon, but rather this is an emergent capability to follow instructions and “see” the screen and play.
Evaluating new models is becoming far closer to evaluating a company or an employee. They need to be dynamic, assessed across a Pareto frontier of performance vs latency vs cost, continually evolving against a complex and often adversarial environment, and able to judge whether the answers are right themselves.
In some ways our inability to measure how well these models do at various tasks is what’s holding us back from realising how much better they are at things than one might expect and how much worse they are at things than one might expect. It’s why LLM haters dismiss it by calling it fancy autocorrect and say how it’s useless and burns a forest, and LLM lovers embrace it by saying how it solved a PhD problem they had struggled with for years in an hour.
They’re both right in some ways, but we still don’t have an ability to test them well enough. And in the absence of a way to test them, a way to verify. And in the absence of testing and verification, to improve. Until we do we’re all just HR reps trying to figure out what these candidates are good at!
2025-01-22 07:24:16
“Within a decade, it’s conceivable that 60-80% of all jobs could be lost, with most of them not being replaced.”
AI CEOs are extremely fond of making statements like this. And because they make these statements we are forced to take them at face value, and then try to figure out what the implications are.
Now, historically, the arguments about comparative advantage that economists talk about have played out across every sector and technological transition. AI CEOs and proponents, though, say this time is different.
They're also putting their money where their mouth is. OpenAI just launched the Stargate Project.
The Stargate Project is a new company which intends to invest $500 billion over the next four years building new AI infrastructure for OpenAI in the United States. We will begin deploying $100 billion immediately.
It’s a clear look at the fact that we will be investing untold amounts of money, Manhattan Project or Apollo mission level money, to make this future come about.
But if we want to get ready for this world as a society we’d also have to get a better projection of what the world could look like. There are plenty of unknowns, including the pace of progress and the breakthroughs we could expect, which is why this conversation often involves extreme hypotheticals like “full unemployment” or “building Dyson spheres” or “millions of Nobel prize winners in a datacenter”.
However, I wanted to try and ground it. So, here are a few of the things we do know concretely about AI and its resource usage.
Data centers will use about 12% of US electricity consumption by 2030, fuelled by the AI wave
Critical power to support data centers’ most important components—including graphics processing unit (GPU) and central processing unit (CPU) servers, storage systems, cooling, and networking switches—is expected to nearly double between 2023 and 2026 to reach 96 gigawatts (GW) globally by 2026; and AI operations alone could potentially consume over 40% of that power
AI model training capabilities are projected to increase dramatically, reaching 2e29 FLOP by 2030, according to Epoch. But they also add “Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a "latency wall" at 2e31 FLOP. We may hit these in ~3 years.”
An H100 GPU takes around 500W average power consumption (higher for SXM, lower for PCIe). By the way, GPUs for AI ran at 400 watts until 2022, while 2023 state-of-the-art GPUs for gen AI run at 700 watts, and 2024 next-generation chips are expected to run at 1,200 watts.
The actual service life of H100 GPUs in datacenters is relatively short, ranging from 1-3 years when running at high utilization rates of 60-70%. At Meta's usage rates, these GPUs show an annualized failure rate of approximately 9%.
OpenAI is losing money on o1-pro via its $2,400/year subscription, while o1-preview costs $15 per million input tokens and $60 per million output and reasoning tokens. So the breakeven point is around 3.6 million tokens a month (at a 1:8 input:output ratio), which would take c.100 hours at 2 mins per response and 1,000 tokens per generation (the arithmetic is sketched after this list).
The o3 model costs around $2000-$3000 per task at high compute mode. For ARC benchmark, it used, for 100 tasks, 33 million tokens in low-compute (1.3 mins) and 5.7 billion in high-compute mode (13 mins). Each reasoning chain generates approximately 55,000 tokens.
Inference costs seem to be dropping by 10x every year.
If one query is farmed out to, say, 64 H100s simultaneously (common for big LLM inference), you pay 64 × ($3–$9/hour) = $192–$576 per hour just for those GPUs.
If the query's total compute time is in the realm of ~4-5 GPU-hours (e.g. 5 minutes on 64 GPUs → ~5.3 GPU-hours), that alone could cost a few thousand dollars for one inference, particularly if you're paying on-demand cloud rates. (Update: I meant "even 5 minutes on 64 GPUs", not "e.g.", so this was meant to indicate highly complex inferences.)
To do a task now, people use about 20-30 Claude Sonnet calls over 10-15 minutes, intervening if needed to fix them, using 2-3m tokens. The alternative is for a decent programmer to take 30-45 minutes.
Devin, the software engineer you can hire, costs $500 per month. It takes about 2 hours to create a testing suite for an internal codebase, or 4 hours to automatically create data visualisations of benchmarks. This can be considered an extremely cheap but also very bad version of an AGI, since it fails often, but let’s assume this can get to “good enough” coverage.
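Here is the breakeven arithmetic referenced in the o1 bullet above, spelled out. Every input is just the figure quoted there; change them and the conclusion moves:

```python
# Breakeven for the subscription vs paying API rates. All inputs from above.
sub_per_month = 200.0                       # the $2,400/year subscription
in_price, out_price = 15.0, 60.0            # $ per million tokens (o1-preview API)
in_ratio, out_ratio = 1, 8                  # assumed input:output token mix

blended = (in_ratio * in_price + out_ratio * out_price) / (in_ratio + out_ratio)
breakeven_tokens = sub_per_month / blended * 1e6
print(f"Blended price: ${blended:.0f}/M tokens")                     # $55/M
print(f"Breakeven: {breakeven_tokens / 1e6:.1f}M tokens per month")  # ~3.6M

tokens_per_reply, mins_per_reply = 1000, 2
hours = breakeven_tokens / tokens_per_reply * mins_per_reply / 60
print(f"That is ~{hours:.0f} hours of 2-minute, 1,000-token replies")  # ~120 hours
```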
We can now make some assumptions about this new world of AGI, to see what the resource requirements would be like.
For instance, as we’re moving towards the world of autonomous agents, a major change is likely to be that they could be used for long range planning, to be used as “employees”, instead of just “tools”.
Right now an o3 run on a query can take up to $3k and 13 minutes. For a 5-minute run on 64 H100s, that's roughly 5.3 total GPU-hours, which can cost a few thousand dollars. This lines up with the reported $2,000-$3,000 figure for that big-compute pass.
But Devin, the $500/mo software engineer, can do some tasks in 2–4 hours (e.g. creating test suites, data visualisations, etc.). Over a month, that’s ~160 working hours, effectively $3–$4 per hour. It, unfortunately, isn’t good enough yet, but might be once we get o3 or o5. This is 1000x cheaper than o3 today, roughly the order of magnitude it would get cheaper if the trends hold for the next few years.
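The comparison, spelled out with the same assumed figures:

```python
# Rough comparison of the two price points above; assumed figures only.
devin_per_month, hours_per_month = 500.0, 160.0
devin_per_hour = devin_per_month / hours_per_month            # ~$3.1/hr
o3_heavy_per_task = 2500.0                                    # midpoint of the $2-3k range

print(f"Devin effective rate: ${devin_per_hour:.2f}/hr")
print(f"o3 heavy-compute task vs one Devin-hour: ~{o3_heavy_per_task / devin_per_hour:,.0f}x")
```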
Now we need to see how large an AGI needs to be. Because of the cost and scaling curves, let's assume that we manage to create an AGI that works on one H100 or equivalent, drawing 500W, and costing roughly half as much as Devin. And it has an operating lifetime of a couple of years.
If the assumptions hold, we can look at how much electricity we will have and then back-solve for how many GPUs could run concurrently and how many AGI “employees” that is.
Now, by 2030 data centers could use 12% of US electricity. The US consumes about 4000 TWH/year of electricity. If AI consumes 40% of that figure, that’s 192 TWh/year. Running an H100 continuously will take about 4.38 MWh/year.
So this means we can run about 44 million concurrent H100 GPUs. Realistically, accounting for power overheads, cooling, and the like, maybe about half of this is a more practical figure.
If we think about the global figure, this will double - so around 76 million maximum concurrent GPUs and 40 million realistic AGI agents working night and day.
We get roughly the same figure from the 96 GW of critical power which is meant to hit data centers worldwide by 2026.
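The arithmetic behind those last few paragraphs, with every input carried over as an assumption from the bullets above:

```python
# Electricity-to-"AGI headcount" arithmetic; all inputs are the assumptions above.
us_consumption_twh = 4000.0          # US electricity use, TWh/year
dc_share, ai_share = 0.12, 0.40      # data centre share by 2030, AI share of that
gpu_watts = 500.0                    # one H100-class "AGI employee", running 24/7

ai_twh = us_consumption_twh * dc_share * ai_share                    # 192 TWh/year
mwh_per_gpu_year = gpu_watts / 1000 * 8760 / 1000                    # 4.38 MWh/year
max_gpus_us = ai_twh * 1e6 / mwh_per_gpu_year
print(f"US ceiling: {max_gpus_us / 1e6:.0f}M concurrent H100-class GPUs")  # ~44M

# Cross-check against the 96 GW global critical-power figure:
global_ai_watts = 96e9 * ai_share
print(f"Global ceiling: {global_ai_watts / gpu_watts / 1e6:.0f}M GPUs")    # ~77M
```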
Now, we might get there differently: build larger reasoning agents, distill their thinking to create better base models, and repeat.
After all if the major lack is the ability to understand and follow complex reasoning steps, then the more of such reasoning we can generate the better things would be.
To get there we'll need around 33 million GPUs at the start, with a service life of around 1-3 years, which is basically the entire possible global production. Nvidia aimed to make around 1.5-2 million units a year, so that would need to be stepped up.
At large scale, HPC setups choke on more mundane constraints too (network fabric bandwidth, memory capacity, or HPC cluster utilisation). Some centers note that the cost of high-bandwidth memory packages is spiking, and these constraints may be just as gating as GPU supply.
Also, a modern leading-edge fab (like TSMC's 5 nm/4 nm) can run ~30,000 wafer starts per month. Each wafer: 300 mm diameter, yields maybe ~60 good H100-class dies (die size ~800 mm², factoring in yield). That's about 21 million GPUs a year.
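That wafer arithmetic, spelled out with the figures as stated:

```python
# Fab throughput arithmetic, using the figures in the paragraph above.
wafer_starts_per_month = 30_000
good_dies_per_wafer = 60             # ~800 mm² dies on a 300 mm wafer, after yield
gpus_per_year = wafer_starts_per_month * good_dies_per_wafer * 12
print(f"~{gpus_per_year / 1e6:.1f}M H100-class GPUs per year")       # ~21.6M
```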
The entire semiconductor industry, energy industry and AI industry would basically have to rewire and become a much (much!) larger component of the world economy.
The labour pool in the US is 168 million people, with a labour participation rate of 63%, and around 11 million unfilled jobs projected by 2030 in any case. And since the AGIs here don't sleep or eat, they triple the working hours. This is equivalent to doubling the workforce, and probably doubling the productivity and IQ too.
(Yes many jobs need physical manifestations and capabilities but I assume an AGI can operate/ teleoperate a machine or a robot if needed.)
Now what happens if we relax the assumptions a bit? The biggest one is that we get another research breakthrough and we don't need a GPU to run an AGI, they'll get that efficient. This might be true but it's hard to model beyond “increase all numbers proportionally”, and there's plenty of people who assume that.
The others are more intriguing. What happens if AGI doesn't mean true AGI, but more like the OpenAI definition, that it can do 90% of a human's tasks? Or what if it's best at maths and science but only those, and you have to run it for a long time? And especially, what if the much-vaunted "agents" don't happen, in the sense of being able to solve complex tasks equally well (e.g., "drop them in a Goldman Sachs trading room or in a jungle in the Congo and they work through whatever problem they need to"), but are far more specialist?
If the AIs can't be perfect replacements but still need us for course correcting their work, giving feedback etc, then the bottleneck very much remains the people and their decision making. This means the shape of future growth would look a lot like (extremely successful) productivity tools. You'd get unemployment and a lot more economic activity, but it would likely look like a good boom time for the economy.
The critical part is whether this means we discover new areas to work on. Considering the conditions, you'd have to imagine yes! That possibility of "jobs we can't yet imagine" is a common perspective in labor economics (Autor, Acemoglu, etc.), and it has historically been right.
But there's a twist. What if the step from “does 60% well” to “does 90% well” just requires the collection of a few 100k examples of how a task is done? This is highly likely, in my opinion, for a whole host of tasks. And what that would mean is that most jobs, or many jobs, would have explicit data gathering as part of its process.
I could imagine a job where you do certain tasks enough that they're teachable to an AI, collect data with sufficient fidelity, adjust their chains of thought, adjust the environment within which they learn, and continually test and adjust the edge cases where they fail. A constant work → eval → adjust loop.
The jobs would have an expiry date, in other words, or at least a “continuous innovation or discovering edge cases” agenda. They would still have to get paid pretty highly, for most or many of them, also because of Baumol effects, but on balance would look a lot more like QA for everything.
We could spend vastly more to get superhuman breakthroughs on a few questions, rather than just generally getting 40 million new workers. This could be dramatically more useful, even assuming a small hit rate and a much larger energy expenditure.
Even assuming it takes 1000x effort in some domains and at 1% success rate, that's still 400 breakthroughs. Are they all going to be “Attention Is All You Need” level, or “Riemann Hypothesis” level or “General Relativity” level? Doubtful. See how rare those are considering the years and the number of geniuses who work on those problems.
But even a few of that caliber would be inevitable and extraordinary. They would kickstart entire new industries. They'd help with scientific breakthroughs. They'd write extraordinarily impactful and highly cited papers that change industries.
I would bet this increases scientific productivity the most, a fight against the stagnation in terms of breakthrough papers and against the rising gerontocracy that plagues science.
Interestingly enough they'd also be the least directly monetisable. Just see how we monetise PhDs. Once there's a clear view of value then sure, like PhDs going into AI training now or into finance before, but as a rule this doesn't hold.
Yes. It's quite possible that we get AI, even agentic AI, that can't autonomously fulfil entire tasks end to end like a fully general purpose human, but still can do this within more bounded domains.
Whether that's mathematics or coding or biology or materials, we could get superhuman scientists rather than a fully automated lab. This comes closer to the previous scenario, where new industries and research areas arise, instead of "here's an agent that can do anything from booking complex travel itineraries to spearheading a multi-year investigation into cancer". This gets us a few superintelligences, or many more general intelligences, and we'd have to decide what's most useful considering the cost.
But this also would mean that capital is the scarce resource, and humans become like orchestrators of the jobs themselves. How much of the economy should get subsumed into silicon and energy would have to be globally understood and that'll be the key bottleneck.
I think of this as a mix and match a la carte menu. You might get some superhuman thought to make scientific breakthroughs, some general purpose agents to do diverse tasks and automate those, some specialists with human in the loop, in whatever combination best serves the economy.
Which means we have a few ways in which the types of AI might progress and the types of constraints that would end up affecting what the overall economy looked like. I got o1-pro to write this up.
Regardless of the exact path, it seems plausible that there will be a reshuffling of the economy as AI gets more infused into it. In some ways our economy looks very similar to that of a few decades ago, but in others it's also dramatically different, in small and large ways that make it incomparable.
I made a mistake with the math in the next point, mixing GPU-hours and GPUs, so I removed the next bullet point.
While this is true, if an LLM burns thousands of dollars of compute for a single "task", it's only appealing if it's either extremely fast, extremely high-quality, or you need the task done concurrently at massive scale.