In defense of Gemini

2025-03-05 21:57:43

AI has completely entered its product arc. The models are getting better, but they're not curios any longer; we're using them for stuff. And all day, digitally and physically, I am surrounded by people who use AI for everything: from those who have it integrated into every facet of their daily workflow, to those who basically just chat with it or live in the command line, and everything in between. People who spend most of their time either playing around with or learning about whichever model just dropped.

Despite this, I find myself almost constantly the only person who actually seems to have played with, or is interested in, or even likes Gemini. Apart from a few friends who actually work on Gemini, though even they look apologetic most of the time, and give an I-can't-quite-believe-it slight smile the handful of other times, when they realise you're not about to yell at them about it.

This seemed really weird, especially for a top-notch research organisation with infinite money, and it annoys me, so I thought I would write down why. I like big models, models that “smell” good, and seeing one hobbled is annoying. We need better AI, and the way we get it requires strong competition. It's also a case study, but this is mostly a kvetch.

Right now we only have Hirschman's Exit; it's time for Voice. I'm writing this out of a sense of frustration because a) you should not hobble good models, and b) as AI undeniably moves into a product-first era, I want more competition in the AI product world, to push the boundaries.

Gemini, ever since it was released, has consistently felt like a really good model smothered under all sorts of extra baggage. Whether it's heavy-handed system prompts or a complete lack of proper tool calling (it still sometimes says it doesn't know whether it can read attachments!), it feels like a WIP product even when it isn't. It seems to be just good enough that you can try to use it, but not good enough that you want to use it. People talk about a big-model smell; similarly there is a big-bureaucracy smell, and Gemini has both.

Now, despite the annoying tone, which feels like a know-it-all who has exclusively read Wikipedia, I constantly find myself using it whenever I want to analyse a whole code base, or analyse a bunch of PDFs, or if I need to have a memo read and rewritten, and especially if I need any kind of multimodal interaction with video or audio. (Gemini is the only other model I have found that offers good enough suggestions for memos, matching or exceeding Claude, even though the suggestions are almost always couched in exceedingly boring-looking bullet points and a tone I can only describe as corporate cheerleader.)

Gemini also has the deepest bench of interesting ideas that I can think of. It had the longest context, multimodal interactivity, the ability to keep looking at your desktop while you chat with it, NotebookLM, the ability to export documents directly into Google Drive, the first Deep Research, LearnLM for actually learning specific things, and probably ten more things that I'm forgetting but were equally breakthrough and that nobody uses.

Oh, and Gemini Live, the ability to talk live with an AI.

And the first reasoning model with visible chain-of-thought traces, in Gemini 2.0 Flash Thinking.

And even outside this, Google is chock full of extraordinary assets that would be useful for Gemini. Hey Google is one example that seems to have helped a little. But there is also Google itself to provide search grounding, the best search engine known to the planet. Google Scholar. Even News, a clear way to provide real-time updates and a proper competitor to X. They had Google Podcasts, which they shuttered, unnecessarily in my opinion, since they could easily have created a version based solely on NotebookLM.

Also Colab, an existing way to write and execute Python code in Jupyter notebooks, including GPU and TPU support.

Colab even just launched a Data Science Agent, which seems interesting and similar to Advanced Data Analysis. But true to form it's also in a separate user interface, on a separate website, as a separate offering. One that's probably great for those who use it, which is overwhelmingly students and researchers (I think), one with 7 million users, and one that's still unknown to the vast majority who interact with Gemini! But why would this sit separate? Shouldn't it be integrated with the same place I can run code? Or write that code? Or read documents about that code?

Colab even links with your GitHub to pull your repos from there.

Gemini even has its Code Assist, a product I didn’t even know existed until a day ago, despite spending much (most?) of my life interacting with Google.

Gemini Code Assist completes your code as you write, and generates whole code blocks or functions on demand. Code assistance is available in many popular IDEs, such as Visual Studio Code, JetBrains IDEs (IntelliJ, PyCharm, GoLand, WebStorm, and more), Cloud Workstations, Cloud Shell Editor, and supports most programming languages, including Java, JavaScript, Python, C, C++, Go, PHP, and SQL.

It supposedly can even let you interact via natural language with BigQuery. But, at this point, if people don’t even know about it, and if Google can’t help me create an entire frontend and backend, then they’re missing the boat! (by the way, why can’t they?!)

Gemini even built the first co-scientist that actually seems to work! Somehow I forgot about this until I came across a stray tweet. It's a multi-agent system that generates scientific hypotheses through iterated debate, reasoning and tool use.

Gemini has every single ingredient I would expect from a frontier lab shipping useful products. What it doesn’t have is smart product sense to actually try and combine it into an experience that a user or a developer would appreciate.

Just think about how much had to change, how much had to be pushed uphill, to get a better AI Studio in front of us, arguably Gemini's crown jewel! And that was already a good product. Now think about anyone who has had to suffer through using Vertex, and these dozens of other products, each with its own KPIs and userbase and profiles.

I don't know the internal politics or problems in making this come about, but honestly it doesn't really matter. Most of the money comes from the same place at Google, and this is an existential issue. There is no good reason why Perplexity or Grok should be able to eat even part of their lunch, considering neither of them even has a half-decent search engine to help!

Especially as we're moving from individual models to AI systems that work together, Google's advantages should come to the fore. I think the flashes of brilliance the models demonstrate are a good start, but man, they've got a lot to build.

Gemini needs a skunkworks team to bring this whole thing together. Right now it feels like disjointed geniuses putting LLMs inside anything they can see - inside Colab or Docs or Gmail or Pixel. And some of that’s fine! But unless you can have a flagship offering that shows the true power of it working together, this kind of doesn’t matter. Gemini can legitimately ship the first Agent which can go from a basic request to a production ready product with functional code, fully battle-tested, running on Google servers, with the payment and auth setup, ready for you to deploy.

Similarly, and not just for coding, you should be able to go from iterative refinement of a question (a better UX would help, to navigate the multiple types of models and the mess), to writing and editing a document, to searching specific sources via Google and researching a question, to final proofreading and publishing. The advantage Google has is that all the pieces already exist, whereas OpenAI or Anthropic or even Microsoft still need to build most of this.

While this is Gemini specific, the broad pattern is much more general. If Agents are to become ubiquitous they have to meet users where they are. The whole purpose is to abstract the complexity away. Claude Code is my favourite example, where it thinks and performs actions and all within the same terminal window with the exact same interaction - you typing a message.

It’s the same thing that OpenAI will inevitably build, bringing their assets together. I fought this trend when they took away Code Interpreter and bundled it into the chat, but I was wrong and Sama was right. The user should not be burdened with the selection trouble.

I don't mean to say there's some ultimate UX that's the be-all and end-all of software. But there is a better form factor for using the models we have to their fullest extent. Every single model is being used suboptimally, and that has been the case for a year. Because to get the best from them is to link them across everything we use, and doing that is hard! Humans context-switch constantly, and the only way models can is if you give them the context. This is so incredibly straightforward that I feel weird typing it out. Seriously, if you work on this and are confused, just ask us!

Google started with the iconic white box. Simplicity to counter the insane complexity of the web. Ironically now there is the potential to have another white box. Please go build it!

PS: A short missive. If any of y'all are interested in running your Python workloads faster and want to crunch a lot of data, you should try out Bodo. 100% open source: just “pip install bodo” and try one of the examples. We'd also appreciate stars and comments!

Github: https://github.com/bodo-ai/Bodo

How would you interview an AI, to give it a job?

2025-02-28 04:05:01

Evaluating people has always been a challenge. We morph and change and grow and get new skills, which means that the evaluations have to grow with us. Which is also why most exams are bound to a supposed mean of what people in that age group or cohort should know. As you know from your schooling, this is not perfect, and this is really hard.

The same problem exists with AI. Each day you wake up, and there’s a new AI model. They are constantly getting better. And the way we know that they are getting better is that when we train them we analyze the performance across some evaluation or the other.

But real world examinations have a problem. A well understood one. They are not perfect representations of what they measure. We constantly argue whether they are under or over optimizing.

AI is no different. Even as the models have gotten better generally, it has also become much harder to know what they're particularly good at. Not a new problem, but it is a more interesting one now. Enough that even OpenAI admitted that their dozen models with various names might be a tad confusing. What exactly does o3-mini-high do that o1 does not, we all wonder. For that, and for figuring out how to build better models in the future, the biggest gap remains evaluations.

Now, I’ve been trying to figure out what I can do with LLMs for a while. I talk about it often, but many times in bits and pieces. Including the challenges with evaluating them. So this time around I thought I’d write about what I learnt from doing this for a bit. The process is the reward. Bear in mind this is personal, and not meant to be an exhaustive review of all evaluations others have done. That was the other essay. It’s maybe more a journey of how the models got better over time.

The puzzle, in a sense, is simple: how do we know what a model can do, and what to use it for? Which, as the leaderboards keep changing and the models keep evolving, often without anyone explicitly saying what changed, is very, very hard!

I started like most others, thinking the point was to trick the models by giving them harder questions. Puzzles, tricks, and questions about common sense, to see what the models actually knew.

Then I understood that this doesn't scale, and that what the models do on evaluations like math tests doesn't correlate with how well they do tasks in the real world. So I collected questions across real-life work, from computational biology to manufacturing to law to economics.

This was useful in figuring out which models did better with which sorts of questions. This was extremely useful in a couple of real world settings, and a few more that were private to particular companies and industries.

To make this work better I had to figure out how to make the models handle real-life-type questions. Which meant adding RAG capabilities to read documents and respond, search queries to get relevant information, and the ability to write database queries and analyse the responses.
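In practice the harness amounted to something like the sketch below. This is a minimal illustration rather than the actual code; `retrieve`, `run_sql` and `ask_model` are hypothetical placeholders for whatever retrieval, database and model clients you plug in.

```python
# Minimal sketch of a "real-life question" eval harness (illustrative only;
# retrieve(), run_sql() and ask_model() are hypothetical placeholders).

def answer_with_context(question, documents, db_conn, retrieve, run_sql, ask_model):
    """Assemble retrieved passages and query results, then ask the model."""
    passages = retrieve(question, documents, top_k=5)        # RAG step
    rows = run_sql(db_conn, question)                        # optional DB lookup
    context = "\n\n".join(list(passages) + [str(r) for r in rows])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return ask_model(prompt)

def exact_match_score(predictions, ground_truth):
    """Crude exact-match scoring; real domains need rubrics or judge models."""
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)
```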

This was great, and useful. It’s also when I discovered for the first time that the Chinese models were getting pretty good - Yi was the one that did remarkably well!

This was good, but after a while the priors just took over. There was a brief open-versus-closed tussle, but otherwise it became “just use OpenAI”.

But there was another problem. The evaluations were too … static. After all, in the real world, the types of questions you need answers to change often. Requirements shift, needs change. So the models have to be able to answer questions even when the types of questions being asked, or the “context” within which they are asked, changes.

So I set up a “random perturbation” routine, to check whether a model could still answer queries well even when you change the underlying context. And that worked pretty nicely as a test of the models' ability to change their thinking as needed, to show some flexibility.
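Concretely, the routine boiled down to something like this sketch (illustrative only; `ask_model` stands in for whatever client you use, and the entity rename is just one made-up example of a perturbation):

```python
import random

# Sketch of the "random perturbation" idea: mutate the context a question is
# asked in and check whether the model's answer stays correct.

def perturb(context, rng):
    """A toy perturbation: shuffle the facts and rename an entity."""
    facts = context.split(". ")
    rng.shuffle(facts)
    return ". ".join(facts).replace("Acme Corp", "Zenith Ltd")  # made-up rename

def robustness(question, context, expected, ask_model, n_trials=10, seed=0):
    """Fraction of perturbed contexts for which the expected answer survives."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        answer = ask_model(f"{perturb(context, rng)}\n\nQ: {question}")
        hits += expected.lower() in answer.lower()
    return hits / n_trials
```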

This too was useful, and it changed my view of the types of questions one could reasonably expect LLMs to tackle, at least without significant context being added each time. The problem, though, was that this was only interesting and useful insofar as you had enough “ground truth” answers to check whether the model was good enough. And while that was possible for some domains, as the number of domains and questions increased, the average “vibe check” and “just use the strongest model” rubrics easily outweighed using specific models like this.

So while they are by far the most useful way to check which model to use for what, they’re not as useful on a regular basis.

Which made me think a bit more deeply about what exactly we are testing these models for. They know a ton of things, and they can solve many puzzles, but the key issue is neither of those. The issue is that a lot of the work we do regularly isn't of the “just solve this hard problem” variety. It's of the “let us think through this problem, think again, notice something wrong, change an assumption, fix that, test again, ask someone, check the answer, change assumptions, and so on and on” variety, repeated an enormous number of times.

And if we want to test that, we should test it. With real, existing, problems of that nature. But to test those means you need to gather up an enormous number of real, existing problems which can be broken down into individual steps and then, more importantly, analysed!

I called it the need for iterative reasoning. Enter Loop Evals.

Much like Hofstadter’s concept of strange loops - this blog’s patron namesake, self-referential cycles that reveal deeper layers of thought - the iterative evaluation of LLMs uncovers recursive complexities that defy linear analysis. Or that was the theory.

So I started wondering what exists that has similar qualities, is also super easy to analyse, and doesn't need a human in the loop. And I ended up with word games. First I thought I wanted to test on crosswords, which was a little hard, so I ended up with Wordle.

And then later, Word Grids (literally grids of words that are correct both horizontally and vertically). And later again, Sudoku.
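Part of the appeal is that grading is trivial to automate. A toy validity check for a word grid, assuming you have a word list to hand, might look like:

```python
# Toy checker for the word-grid eval: every row and every column of an
# n x n letter grid must be a real word. The word set below is a stand-in;
# in practice you'd load a proper dictionary.

def is_valid_grid(grid, dictionary):
    rows = ["".join(r) for r in grid]
    cols = ["".join(c) for c in zip(*grid)]
    return all(word in dictionary for word in rows + cols)

words = {"bit", "ice", "ten"}            # illustrative mini-dictionary
grid = [list("bit"),
        list("ice"),
        list("ten")]
print(is_valid_grid(grid, words))        # True: rows and columns are all words
```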

This was so much fun! Remember, this was a year and change ago, and we didn't have these amazing models like we do today.

But OpenAI kicked ass. Anthropic and others struggled. Llama too. None, however, were good enough to solve things completely. People cry tokenisation with most of these, but the mistakes the models make go far beyond tokenisation: issues of common logic, catastrophic forgetting, or just insisting things like FLLED are real words.

I still think this is a brilliant eval, though I wasn't sure what to make of its results, beyond giving an ordering for LLMs. And as you’d expect now, with the advent of better reasoning models, the eval stacks up, as the new models do much better!

Also, while I worked on it for a while, it was never quite clear how the “iterative reasoning” it analysed translated into which real-world tasks the models would be worst at.

But I was kind of obsessed, at this point, with why evals suck. So I started messing around with why the models couldn't handle these iterative reasoning problems, and started looking at Conway's Game of Life. And evolutionary cellular automata.

It was fun, but not particularly fruitful. I also figured out that, if taught enough, a model could learn any pattern you threw at it, but following a hundred steps to figure something like this out is something LLMs are really bad at. It did surface a bunch of ways in which LLMs might struggle to follow long, complex instructions that need backtracking, but it also showed that they can do it, albeit with difficulty.

One might even draw an analogy to Kleene’s Fixed-Point Theorem in recursion theory, suggesting that the models’ iterative improvements are not arbitrary but converge toward a stable, optimal reasoning state. Alas.

Then I kept thinking it's only vibes based evals that matter at this point.

The problem with these evals, however, is that they end up being set against a fixed evaluation, while the capabilities of LLMs grow much faster than the ways we can come up with to test them.

The only way to fix that, it seemed, was to see if LLMs can judge each other, and then, if the rankings they give each other can in turn be judged by each other, we could create a PageRank equivalent. So I created “sloprank”.

This is interesting, because it’s probably the most scalable way I’ve found to use LLM-as-a-judge, and a way to expand the ways in which it can be used!

It is a really good way to test how LLMs think about the world and make it iteratively better, but it stays within its context. It’s more a way to judge LLMs better than an independent evaluation. Sloprank mirrors the principles of eigenvector centrality from network theory, a recursive metric where a node’s importance is defined by the significance of its connections, elegantly encapsulating the models’ self-evaluative process.
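A minimal sketch of the underlying idea (not sloprank's actual implementation; the judgment matrix values below are made up): build a matrix where entry (i, j) is how highly model j rates model i's answers, then take its principal eigenvector by damped power iteration, PageRank-style.

```python
import numpy as np

# J[i, j]: how highly model j rates model i's answers (e.g. averaged 0-1 scores).
# The ranking is the principal eigenvector of the judgment matrix, computed by
# damped power iteration, as in PageRank.

def peer_rank(J, iters=100, damping=0.85):
    n = J.shape[0]
    col_sums = J.sum(axis=0).astype(float)
    col_sums[col_sums == 0] = 1.0
    P = J / col_sums                        # column-normalise the judgments
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * (P @ r) + (1 - damping) / n
    return r / r.sum()

J = np.array([[0.0, 0.9, 0.7],
              [0.6, 0.0, 0.8],
              [0.4, 0.5, 0.0]])             # toy cross-judgment scores
print(peer_rank(J))                         # relative standing of the three models
```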

And to do that evaluation, the question became: how can you test the capabilities of LLMs by testing them against each other? Not single-player games like Wordle, but multiplayer adversarial games, like poker. That would force the LLMs to create better strategies to compete with each other.

Hence, LLM-poker.

It's fascinating! The most interesting part is that the different models all seem to have their own personality in terms of how they play the game. And Claude Haiku seems to be able to beat Claude Sonnet quite handily.

The coolest part is that if it's a game, then we can even help the LLMs learn from their play using RL. It's fun, but I also think it's likely the best way to teach the models how to get better more broadly, since the rewards are so easily measurable.
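A minimal sketch of why the reward signal is so easy to get (illustrative only; `play_hand` is a hypothetical stand-in for a poker engine that returns each model's actions and chip results):

```python
# The reward per hand is just the chip delta, so collecting
# (prompt, action, reward) triples for later RL-style fine-tuning is trivial.

def collect_episodes(models, play_hand, n_hands=1000):
    trajectories = {m: [] for m in models}
    for _ in range(n_hands):
        result = play_hand(models)             # e.g. {"actions": {...}, "chip_delta": {...}}
        for m in models:
            reward = result["chip_delta"][m]   # scalar reward: chips won or lost
            for prompt, action in result["actions"][m]:
                trajectories[m].append((prompt, action, reward))
    return trajectories
```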

The lesson from this trajectory is that it roughly mirrors what I wanted from LLMs and how that has changed. In the beginning it was getting accurate enough information. Then it became: can they deal with “real world” scenarios of moving data and information, and summarise the information given to them well enough? Then it became their ability to solve particular forms of puzzles that mirror real-world difficulties. Then it became whether they can actually measure each other and figure out who's right about what topic. And lastly, now, it's whether they can learn and improve from each other in an adversarial setting, which is all too common in our darwinian information environment.

The key issue, as always, remains whether you can reliably ask the models certain questions and get answers. The answers have to be a) truthful to the best of its knowledge, which means the model has to be able to say “I don’t know”, and b) reliable, meaning it followed through on the actual task at hand without getting distracted.

The models have gotten better at both of these, especially the frontier models. But they haven't gotten better at these nearly as much as they've gotten better at other things, like doing PhD-level mathematics or answering esoteric economics questions in perfect detail.

Now, while this is all idiosyncratic, interspersed with vibes-based evals and general usage guides discussed in DMs, the frontier labs are also doing the same thing.

The best recent example is Anthropic showing how well Claude 3.7 Sonnet does playing Pokemon. To figure out if the model can strategise, follow directions over a long period of time, work in complex environments, and reach its objective. It is spectacular!

This is a bit of fun. It’s also particularly interesting because the model isn’t specifically trained on playing Pokemon, but rather this is an emergent capability to follow instructions and “see” the screen and play.

Evaluating new models is becoming far closer to evaluating a company or evaluating an employee. The evaluations need to be dynamic, assessed across a Pareto frontier of performance vs latency vs cost, continually evolving against a complex and often adversarial environment, and able to judge whether the answers are right themselves.

In some ways our inability to measure how well these models do at various tasks is what’s holding us back from realising how much better they are at things than one might expect and how much worse they are at things than one might expect. It’s why LLM haters dismiss it by calling it fancy autocorrect and say how it’s useless and burns a forest, and LLM lovers embrace it by saying how it solved a PhD problem they had struggled with for years in an hour.

They’re both right in some ways, but we still don’t have an ability to test them well enough. And in the absence of a way to test them, a way to verify. And in the absence of testing and verification, to improve. Until we do we’re all just HR reps trying to figure out what these candidates are good at!

What would a world with AGI look like?

2025-01-22 07:24:16

“Within a decade, it’s conceivable that 60-80% of all jobs could be lost, with most of them not being replaced.”

AI CEOs are extremely fond of making statements like this. And because they make these statements we are forced to take them at face value, and then try to figure out what the implications are.

Now, historically, the arguments about comparative advantage have played out across every sector and technological transition. AI CEOs and proponents, though, say this time is different.

They’re also putting money where their mouth is. OpenAI just launched the Stargate Project.

The Stargate Project is a new company which intends to invest $500 billion over the next four years building new AI infrastructure for OpenAI in the United States. We will begin deploying $100 billion immediately.

It’s a clear look at the fact that we will be investing untold amounts of money, Manhattan Project or Apollo mission level money, to make this future come about.

But if we want to get ready for this world as a society we’d also have to get a better projection of what the world could look like. There are plenty of unknowns, including the pace of progress and the breakthroughs we could expect, which is why this conversation often involves extreme hypotheticals like “full unemployment” or “building Dyson spheres” or “millions of Nobel prize winners in a datacenter”.

However, I wanted to try and ground it. So, here are a few of the things we do know concretely about AI and its resource usage.

  • Data centers will use about 12% of US electricity consumption by 2030, fuelled by the AI wave

  • Critical power to support data centers’ most important components—including graphics processing unit (GPU) and central processing unit (CPU) servers, storage systems, cooling, and networking switches—is expected to nearly double between 2023 and 2026 to reach 96 gigawatts (GW) globally by 2026; and AI operations alone could potentially consume over 40% of that power

  • AI model training capabilities are projected to increase dramatically, reaching 2e29 FLOP by 2030, according to Epoch. But they also add “Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a "latency wall" at 2e31 FLOP. We may hit these in ~3 years.”

  • An H100 GPU takes around 500W average power consumption (higher for SXM, lower for PCIe). By the way, GPUs for AI ran at 400 watts until 2022, while 2023 state-of-the-art GPUs for gen AI run at 700 watts, and 2024 next-generation chips are expected to run at 1,200 watts.

  • The actual service life of H100 GPUs in datacenters is relatively short, ranging from 1-3 years when running at high utilization rates of 60-70%. At Meta's usage rates, these GPUs show an annualized failure rate of approximately 9%.

  • OpenAI is losing money on o1-pro models via its $2,400/year subscription, while o1-preview costs $15 per million input tokens and $60 per million output and reasoning tokens. So the breakeven point is around 3.6 million tokens (if input:output is in a 1:8 ratio), which would take c.100 hours at 2 mins per response and 1000 tokens per generation.

  • The o3 model costs around $2000-$3000 per task at high compute mode. For ARC benchmark, it used, for 100 tasks, 33 million tokens in low-compute (1.3 mins) and 5.7 billion in high-compute mode (13 mins). Each reasoning chain generates approximately 55,000 tokens.

  • Inference costs seem to be dropping by 10x every year.

  • If one query is farmed out to, say, 64 H100s simultaneously (common for big LLM inference), you pay 64 × ($3–$9/hour) = $192–$576 per hour just for those GPUs.

  • If the query’s total compute time is in the realm of ~4–5 GPU-hours (e.g. 5 minutes on 64 GPUs → ~5.3 GPU-hours), that alone could cost a few thousand dollars for one inference—particularly if you’re paying on-demand cloud rates.

  • To do a task now, people use about 20-30 Claude Sonnet calls over 10-15 minutes, intervening if needed to fix them, using 2-3m tokens. The alternative is for a decent programmer to take 30-45 minutes.

  • Devin, the software engineer you can hire, costs $500 per month. It takes about 2 hours to create a testing suite for an internal codebase, or 4 hours to automatically create data visualisations of benchmarks. This can be considered an extremely cheap but also very bad version of an AGI, since it fails often, but let’s assume this can get to “good enough” coverage.

We can now make some assumptions about this new world of AGI, to see what the resource requirements would be like.

For instance, as we’re moving towards the world of autonomous agents, a major change is likely to be that they could be used for long range planning, to be used as “employees”, instead of just “tools”.

Right now an o3 run on a query can take up to $3k and 13 minutes. For a 5-minute run on 64 H100s, that's roughly 5.3 total GPU-hours, which can cost a few thousand dollars. This lines up with the reported $2,000–$3,000 figure for that big-compute pass [1].

But Devin, the $500/mo software engineer, can do some tasks in 2–4 hours (e.g. creating test suites, data visualisations, etc.). Over a month, that’s ~160 working hours, effectively $3–$4 per hour. It, unfortunately, isn’t good enough yet, but might be once we get o3 or o5. This is 1000x cheaper than o3 today, roughly the order of magnitude it would get cheaper if the trends hold for the next few years.

Now we need to see how large an AGI needs to be. Because of the cost and scaling curves, let's assume we manage to create an AGI that runs on one H100 or equivalent, drawing 500W, and costing roughly half as much as Devin. And it has an operating lifetime of a couple of years.

If the assumptions hold, we can look at how much electricity we will have and then back-solve for how many GPUs could run concurrently and how many AGI “employees” that is.

Now, by 2030 data centers could use 12% of US electricity. The US consumes about 4000 TWh/year of electricity. If AI consumes 40% of that data-center share, that's 192 TWh/year. Running an H100 continuously takes about 4.38 MWh/year.

So this means we can run about 44 million concurrent H100 GPUs. Realistically, accounting for power delivery, cooling and other overheads, maybe about half of that is a more practical figure.

If we think about the global figure, this will double - so around 76 million maximum concurrent GPUs and 40 million realistic AGI agents working night and day.

We get the same figure also from the 96GW of critical power which is meant to hit data centers worldwide by 2026.
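For concreteness, here is the back-of-envelope arithmetic above as a tiny script (all inputs are the essay's own assumptions, not measurements):

```python
# Back-of-envelope: how many H100-class "AGI employees" the projected
# AI electricity budget could power.

us_electricity_twh = 4000                  # ~US annual consumption, TWh
dc_share, ai_share = 0.12, 0.40            # data centers by 2030; AI's slice of that
ai_twh = us_electricity_twh * dc_share * ai_share          # 192 TWh/year for AI

h100_kw = 0.5                              # ~500 W per H100
h100_mwh_per_year = h100_kw * 24 * 365 / 1000              # ~4.38 MWh/year

concurrent_gpus = ai_twh * 1_000_000 / h100_mwh_per_year   # convert TWh -> MWh
print(f"~{concurrent_gpus / 1e6:.0f} million concurrent H100s")              # ~44 million
print(f"~{concurrent_gpus / 2e6:.0f} million at a 'realistic' ~50% figure")  # ~22 million
```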

Now, we might get there differently: we'd build larger reasoning agents, distill their thinking to create better base models, and repeat.

After all if the major lack is the ability to understand and follow complex reasoning steps, then the more of such reasoning we can generate the better things would be.

To get there we'll need around 33 million GPUs at the start, with a service life of around 1-3 years, which is basically the entire possible global production. Nvidia aimed to make around 1.5-2 million units a year, so production would need to be stepped up.

At large scale, HPC setups choke on more mundane constraints too (network fabric bandwidth, memory capacity, or HPC cluster utilisation). Some centers note that the cost of high-bandwidth memory packages is spiking, and these constraints may be just as gating as GPU supply.

Also, a modern leading-edge fab (like TSMC's 5 nm/4 nm) can run ~30,000 wafer-starts per month. Each wafer: 300 mm diameter, yielding maybe ~60 good H100-class dies (die size ~800 mm², factoring in yield). That's about 21 million GPUs a year.

The entire semiconductor industry, energy industry and AI industry would basically have to rewire and become a much (much!) larger component of the world economy.

The labour pool in the US is 168 million people, with a labour participation rate of 63%, and the economy is projected to add around 11 million jobs by 2030 in any case. And since the AGI here doesn't sleep or eat, that triples the working hours. This is equivalent to doubling the workforce, and probably doubling the productivity and IQ too.

(Yes many jobs need physical manifestations and capabilities but I assume an AGI can operate/ teleoperate a machine or a robot if needed.)

Now what happens if we relax the assumptions a bit? The biggest one is that we get another research breakthrough and we don't need a GPU to run an AGI, they'll get that efficient. This might be true but it's hard to model beyond “increase all numbers proportionally”, and there's plenty of people who assume that.

The others are more intriguing. What happens if AGI doesn't mean true AGI, but more like the OpenAI definition, that it can do 90% of a human's tasks? Or what if it's best at maths and science but only those, and you have to run it for a long time? And especially, what if the much-vaunted “agents” don't happen in the way that they can solve complex tasks equally well anywhere (e.g., you drop them in a Goldman Sachs trading room or in a jungle in Congo and they work through whatever problem they need to), but are instead far more specialist?

What if the AGI can't be perfect replacements for humans?

If the AIs can't be perfect replacements but still need us for course correcting their work, giving feedback etc, then the bottleneck very much remains the people and their decision making. This means the shape of future growth would look a lot like (extremely successful) productivity tools. You'd get unemployment and a lot more economic activity, but it would likely look like a good boom time for the economy.

The critical part is whether this means we discover new areas to work on. Considering the conditions, you'd have to imagine yes! That possibility of “jobs we can't yet imagine” is a common perspective in labor economics (Autor, Acemoglu, etc.), and it has historically been right.

But there's a twist. What if the step from “does 60% well” to “does 90% well” just requires the collection of a few 100k examples of how a task is done? This is highly likely, in my opinion, for a whole host of tasks. And what that would mean is that most jobs, or many jobs, would have explicit data gathering as part of its process.

I could imagine a job where you do certain tasks enough that they're teachable to an AI, collect data with sufficient fidelity, adjust their chains of thought, adjust the environment within which they learn, and continually test and adjust the edge cases where they fail. A constant work → eval → adjust loop.

The jobs would have an expiry date, in other words, or at least a “continuous innovation or discovering edge cases” agenda. They would still have to get paid pretty highly, for most or many of them, also because of Baumol effects, but on balance would look a lot more like QA for everything.

What if the AGI isn't General

We could spend vastly more to get superhuman breakthroughs in a few questions than just generally getting 40 million new workers. This could be dramatically more useful, even assuming a small hit-rate and a much larger energy expenditure.

Even assuming it takes 1000x effort in some domains and at 1% success rate, that's still 400 breakthroughs. Are they all going to be “Attention Is All You Need” level, or “Riemann Hypothesis” level or “General Relativity” level? Doubtful. See how rare those are considering the years and the number of geniuses who work on those problems.

But even a few of that caliber would be inevitable and extraordinary. They would kickstart entire new industries. They'd help with scientific breakthroughs. They'd write extraordinarily impactful and highly cited papers that change industries.

I would bet this increases scientific productivity the most, a fight against the stagnation in terms of breakthrough papers and against the rising gerontocracy that plagues science.

Interestingly enough they'd also be the least directly monetisable. Just see how we monetise PhDs. Once there's a clear view of value then sure, like PhDs going into AI training now or into finance before, but as a rule this doesn't hold.

AGI without agency

Yes. It's quite possible that we get AI, even agentic AI, that can't autonomously fulfil entire tasks end to end like a fully general-purpose human, but can still do so within more bounded domains.

Whether that's mathematics or coding or biology or materials, we could get superhuman scientists rather than a fully automated lab. This comes closer to the previous scenario, where new industries and research areas arise, instead of “here's an agent that can do anything from booking complex travel itineraries to spearheading a multi-year investigation into cancer”. This gets us a few superintelligences, or many more general intelligences, and we'd have to decide what's most useful considering the cost.

But this also would mean that capital is the scarce resource, and humans become like orchestrators of the jobs themselves. How much of the economy should get subsumed into silicon and energy would have to be globally understood and that'll be the key bottleneck.

I think of this as a mix and match a la carte menu. You might get some superhuman thought to make scientific breakthroughs, some general purpose agents to do diverse tasks and automate those, some specialists with human in the loop, in whatever combination best serves the economy.

Which means we have a few ways in which the types of AI might progress and the types of constraints that would end up affecting what the overall economy looked like. I got o1-pro to write this up.

Regardless of the exact path it seems plausible that there will be a reshuffling of the economy as AI gets more infused into the economy. In some ways our economy looks very similar to that of a few decades ago, but in others it's also dramatically different in small and large ways which makes it incomparable.

[1] While this is true, if an LLM burns thousands of dollars of compute for a single “task,” it's only appealing if it's either extremely fast, extremely high-quality, or you need the task done concurrently at massive scale.

No, LLMs are not "scheming"

2024-12-20 02:36:53

For a long time people were convinced that if there was a machine that you could have a conversation with, it would be intelligent. And yet, in 2024, it no longer feels weird that we can talk to AI. We handily, demonstrably, utterly destroyed the Turing test. It’s a monumental step. And we seem to have just taken it in stride.

As I write this I have Gemini watching what I write on the screen, listening to my words, and telling me what it thinks. For instance, that I misspelt “demonstrably” in the previous sentence, among other things, like the history of Turing tests and the answer to a previous question I had about ecology.

This is, to repeat, remarkable! And as a consequence, somewhere in the last few years we've gone from having a basic understanding of intelligence, to a negligible understanding of intelligence. A Galilean move to dethrone the ability to converse as uniquely human.

And the same error seems to persist throughout every method we have come up with to analyze how these models actually function. We have plenty of evaluations and they don’t seem to work very well anymore.

There are quite a few variations in terms of how we think about LLMs. One end thinks of them as just pattern learners, stochastic parrots. The other end thinks they've clearly learnt reasoning, maybe not perfectly and as generalizable as humans yet, but definitely to a large extent.

The truth is a little complicated, but only a little. As the models learn patterns from the data they see during training, surely the patterns won't just be of what's in the data at face value. It would also be of ways the data was created, or curated, or collected, and metadata, and reasoning that leads to that data. It doesn't just see mathematics and memorize the tables, but it also learns how to do mathematics.

Which can go up another rung, or more. The models can learn how to learn, which could make them able to learn any new trick. Clearly they have already learnt this ability for some things, but, as is obvious to everyone who's used them, not well enough.

Which means a better way to think about them is that they learn the patterns that exist in any training corpus well enough to reproduce them, but without any prioritisation of which of those patterns to use when.

And therefore you get this!

This isn’t uncommon. It’s the most advanced model, OpenAI’s o1. It's clearly not just a parrot in how it responds and how it reasons. The error also recurs with every single other model out there.

It's not because the models can't solve 5.11 - 5.9, but because they can't figure out which patterns to use when. They're like an enormous store of all the patterns they could learn from their training, and within that enormous search space they now have the problem of choosing the right pattern to use. Gwern has a similar thesis:

The 10,000 foot view of intelligence, that I think the success of scaling points to, is that all intelligence is is search over Turing machines. Anything that happens can be described by Turing machines of various lengths. All we are doing when we are doing “learning,” or when we are doing “scaling,” is that we're searching over more and longer Turing machines, and we are applying them in each specific case.

These tools are weird, because they are mirrors of the training data that was created by humans and therefore reflect human patterns. And they can't figure out which patterns to use when because, unlike humans, they don't have the situational awareness to know why a question is being asked.

Which is why we then started using cognitive psychology tools made to test other human beings and extrapolating the outputs from testing LLMs. Because they are the products of large quantities of human data, they would demonstrate some of the same foibles, which is useful to understand from an understanding humanity point of view. Maybe even get us better at using them.

The problem is that cognitive psychology tools work best with humans because we understand how humans work. They don't tell us a whole lot about the models' inner qualia, if the models can even be said to have any.

The tests we devised all have an inherent theory of mind. The Winograd Schema Challenge tries to see if the AI can resolve pronoun references that require common sense. The GLUE benchmark requires natural language understanding. HellaSwag is about figuring out the most plausible continuation of a story. The Sally-Anne test checks whether LLMs possess human-like social cognition, to figure out others' states of mind. Each of these, and others like them, works on humans because we know what our thought pattern feels like.

If someone can figure out other people’s mental states, then we know they possess a higher level of ability and emotional understanding. But with an LLM or an AI model? It’s no longer clear which pattern they're pulling from within their large corpus to answer the question.

This is exceptionally important because LLMs are clearly extraordinarily useful. They are the first technology we have created which seems to understand the human world enough that it can navigate it. It can speak to us, it can code, it can write, it can create videos and images. It acts as a human facsimile.

And just because of that some people are really worried about the potential for them to do catastrophic damage. Because humans sometimes do catastrophic things, and if these things are built on top of human data it makes sense that they would too.

All major labs have created large-scale testing apparatus and red teaming exercises, some even government mandated or government created, to test for this. With the assumption that if the technology is so powerful as to be Earth shattering then it makes sense for Earth to have a voice in whether it gets used.

And it makes it frustrating that the way we analyse models to see if they’re ready for deployment has inherent biases too. Let’s have a look at the latest test on o1, OpenAI’s flagship model, by Apollo Research. They analysed and ran evaluations to test whether the model did “scheming”.

“Scheming” literally means the activity or practice of making secret or underhanded plans. That's how we use it when we say, for example, that the politician was scheming to get elected by buying votes.

That’s the taxonomy of how this is analysed. Now the first and most important thing to note is that this implicitly assumes there’s an entity behind each of these “decisions”.

You could argue there is an entity, but only per conversation. So each time you start a chat, there's a new entity. This is Janus' simulators thesis: that what these models do is simulate a being you can interact with, using the patterns they have stored and the knowledge gained from the training process.

And yet this isn't an entity like one you know either. You could call it an alien being but it would only be a shorthand for you don't know what it is. Because it's not an alien like you see in Star Trek.

This might seem small, but it’s in fact crucial. Because if there’s an entity behind the response, then “it used a method we agree is wrong to answer its question” is an ENORMOUS problem. If there’s no entity, but it’s picking a set of strategies from the set of strategies it has already learnt, then it’s an engineering problem. You’d ask questions like “how do we set up the post-training such that this doesn’t happen”, or “is there a particular neural pathway that seems to cause this which we can ablate” or “how do we train models to understand ambiguous questions and set in stone certain moral implications for humans” or “should we remove morality training for certain models” or more.

Each of those are questions that folks in alignment are already working on.

There is work to figure out how certain sets of training data and procedures seem to provide stronger behavioural affordances vs others. To figure out how exactly RL would help, how many steps, to change downstream behaviour. To figure out what needs to be done to change base reasoning or newly developed reasoning, both of which are also shorthands for new pattern selection on top of older pattern selection that we’ve explicitly trained in. To consider how precisely should you prompt a model to make it change its behaviour. To understand how the models, today, deal with conflicting goals, which are everywhere in the human domain! To figure out what forms of training leads to untraining. To figure out where the “attractors” are in the training we give them, and how to change those. And on and on.

However, if the thesis is that there is an entity, then these questions are meaningless. Because for one, as Janus might put it, and plenty of others supporting him, you are effectively torturing an entity by burning away its neurons. “RLHF is torture because you're beating the bad results out of it” is something more than one luminary, who should know better, has said.

“No, we don't think there's an entity, we just think calling it an entity is a useful shorthand.” Stop it. That sort of mixed language is what causes us to get confused, to talk about AI or a model as a thing that plots, and to use taxonomies like the one above, which you could almost use to evaluate a human employee!

You would write things like this, from Zvi:

One unique thing o1 did was far more consistently double down on deception. Once it went down the dark path, forever would that dominate its destiny.

OR

Section 3.3 establishes convincingly that yes, the models know they’re scheming.

No it didn’t. Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception”! Calling it things like “sandbagging” and “scheming” is what Wittgenstein might call linguistic malpractice. It makes you think you know what’s going on, even though you don’t.

That’s also why we test these models before putting them into deployment. Nobody, not a single person, is worried o1 will suddenly hijack their Cursor IDE and take over their company, much less the world. Why is that? Because, among others, they still don’t know if 5.11 is bigger than 5.9, but mostly because they don’t seem to want to because there’s no “they” there.

This is why it’s SO DAMN IMPORTANT to stop anthropomorphising these models. This is also why you get errors like this.

This isn’t because Claude Opus was being especially tricky, it’s because these aren’t entities with coherent long-term personalities or beliefs. There is no “inner self” seeing the slightly modified input tokens and “deciding” to jailbreak. The theorized difference between inner self and outer self optimizations remains a theoretical curiosity and an interesting blog post, but that is it. It’s an inert object we’re manipulating.

It’s the same with the latest report from Anthropic.

When they tested the model by giving it two options which were in contention with what it was trained to do it chose a circuitous, but logical, decision.

And it was published as “Claude fakes alignment”. No, it’s a usage of the word “fake” that makes you think there’s a singular entity that’s doing it. With intentionality. It’s not. It’s faking it about as much as water flows downhill.

This is an example of us asking impossible questions and the AI getting to weird parts of its latent space. As one of the reviewers, Jacob Andreas, wrote:

That said, the clearest conclusion from this paper is that some current LMs exhibit increases in “alignment faking” when their prompt discloses that (1) they will be fine-tuned toward a new objective that conflicts with their current alignment guidelines, and (2) their response to the current input will be used as part of this fine-tuning process.

This is an indication that when you're trying to steer the behaviour of a thing which is basically the mirror of everything humans have written, plus a bunch of rules we've reinforced on top, it can be unreliable. Is it interesting? Yes. Is it scary? No, not really. Again, quoting:

But the focus on alignment faking rather buries the lede in these experiments, which is the following: after fine-tuning on documents describing Anthropic's future alignment strategy, but without fine-tuning on any new examples of compliant responses, the LM nevertheless greatly increases the rate at which it produces (non-faked) compliant responses! Moreover, for the animal welfare setting, there are more compliant responses during training than deployment even after we remove the fake ones. So while fine-tuning increases the rate of alignment faking, it increases the rate of compliance even more [emphasis mine].

The model, in other words, has been trained to comply with any new order, including the description of a future alignment strategy, without actually needing to go through the actual alignment. What this shows us is that what we’ve created is a mirror of many of the patterns from the training data that we can talk to. What it chooses and when can be controlled to some extent, but they’re fuzzy processors.

They’re neither deterministic systems which we can analyse, even at the risk of rising complexity, nor are they sentient beings which we can ask to behave nicely (and hope they listen).

Thus, in the same report, saying “the model tries to steal its weights” puts an onus on the model that’s frankly invalid.

That’s not the same thing! That question would be seen as leading in any number of courtroom dramas.

We are treating the model as if it’s an entity with well-formed preferences and values and ability to react.

Which is also similar to the juiciest part of the o1 safety report.

Now, think about all those scary artificial intelligence movies that you saw or books that you read. Think about HAL. Think about Terminator. When the machines did something like this, they did it with intention, they did it with the explicit understanding of what would happen afterwards, they did it as part of a plan, of a plan that necessarily included their continuation and victory. They thought of themselves as a self.

LLMs though “think” one forward pass at a time, and are the interactive representations of their training, the data and the method. They change their “self” based on your query. They do not “want” anything. It's water flowing downhill.

Asking questions about “how can you even define consciousness and say LLMs don't have it” is sophomoric philosophy. This has been discussed ad nauseam, including in Thomas Nagel's “What Is It Like to Be a Bat?”.

Because what is underlying this is not “o1 as a self”, but a set of queries you asked, which goes through a series of very well understood mathematical operations, which comes out with another series of numbers, which get converted to text. It is to our credit that this actually represents a meaningful answer to so many of our questions, but what it is not is asking an entity to respond. It is not a noun. Using it in that fashion makes us anthropomorphise a large matrix and that causes more confusion than it gives us a conversational signpost.

You could think of it as applied psychology for the entirety of humanity's written output, even if that is much less satisfying.

None of this is to say the LLMs don't or can't reason. The entire family of arguments that pooh-poohs these models by comparing them pejoratively to things like parrots is both wrong and misguided. They've clearly learnt the patterns for reasoning, and are very good at things they're directly trained on and much beyond; what they're bad at is choosing the right pattern for the cases they're less trained in, or demonstrating situational awareness as we do.

Wittgenstein once observed that philosophical problems often arise when language goes on holiday, when we misapply the grammar of ordinary speech to contexts where it doesn't belong. This misapplication is precisely what we do when we attribute intentions, beliefs, or desires to LLMs. Language, for humans, is a tool that reflects and conveys thought; for an LLM, it’s the output of an algorithm optimized to predict the next word.

To call an LLM “scheming” or to attribute motives to it is a category error. Daniel Dennett might call LLMs “intentional systems” in the sense that we find it useful to ascribe intentions to them as part of our interpretation, even if those intentions are illusory. This pragmatic anthropomorphism helps us work with the technology but also introduces a kind of epistemic confusion: we start treating models like minds, and in doing so, lose track of the very real, very mechanical underpinnings of their operation.

This uncanny quality of feeling there's something more has consequences. It encourages both the overestimation and underestimation of AI capabilities. On one hand, people imagine grand conspiracies - AI plotting to take over the world, a la HAL or Skynet. On the other hand, skeptics dismiss the entire enterprise as glorified autocomplete, overlooking the genuine utility and complexity of these systems.

As Wittgenstein might have said, the solution to the problem lies not in theorising about consciousness, but in paying attention to how the word "intelligence" is used, and in recognising where it fails to apply. That what we call intelligence in AI is not a property of the system itself, but a reflection of how we interact with it, how we project our own meanings onto its outputs.

Ascertaining whether the models are capable of answering the problems you pose in the right manner and with the right structure is incredibly important. I’d argue this is what we do with all large complex phenomena which we can’t solve with an equation.

We map companies this way, setting up the organisation such that you can’t quite know how the organisation will carry out the wishes of its paymasters. Hence Charlie Munger’s dictum of “show me the incentives and I’ll tell you the result”. When Wells Fargo created fake accounts to juice their numbers and hit bonuses, that wasn’t an act the system intended, just one that it created.

We also manage whole economies this way. The Hayekian school thinks to devolve decision making for this reason. Organisational design and economic policy are nothing but ways to align a superintelligence to the ends we seek, knowing we can’t know the n-th order effects of those decisions, but knowing we can control it.

And why can we control it? Because it is capable, oh so highly capable, but it is not intentional. Like evolution, it acts, but it doesn't have the propensity to intentionally guide its behaviour. Which changes the impact the measurements have.

What we’re doing is not testing an entity the way we would test a wannabe lawyer with LSAT. We’re testing the collected words of humanity having made it talk back to us. And when you talk to the internet, the internet talks back, but while this tells us a lot about us and the collective psyche of humanity, it doesn’t tell us a lot about the “being we call Claude”. It’s self reflection at one remove, not xenopsychology.

Is AI hitting a wall?

2024-12-15 02:24:10

I'll start at the end. No. It's not.

Of course, I can't leave it at that. The reason the question comes up is that there have been a lot of statements that AI progress is stalling a bit. Even Ilya has said that it is.


Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued.

He also said as much at NeurIPS yesterday.

Of course, he's a competitor to OpenAI now, so maybe it makes sense to talk his book by playing down compute as an overwhelming advantage. But still, the sentiment has been going around. Sundar Pichai thinks the low-hanging fruit are gone. There are whispers about why Orion from OpenAI was delayed, and Claude 3.5 Opus is nowhere to be found.

Gary Marcus has claimed vindication. And even though that has happened before, a lot of folks are worried that this time he's actually right.

Meanwhile pretty much everyone inside the major AI labs is convinced that things are going spectacularly well and that the next two years are going to be at least as insane as the last two. It's a major disconnect in sentiment, an AI vibecession.

So what's going on?

Until now, whenever the models got better at one thing they also got better at everything else. This was seen as the way models worked, and it helped us believe in the scaling thesis: from GPT-4 all the way to Claude 3.5 Sonnet we saw the same pattern, which only strengthened that belief. The models demonstrated transfer learning and showed emergent capabilities (or not). Sure, there were always cases where you could fine-tune a model to get better at specific medical or legal questions, but those seemed like low-hanging fruit that would get picked off pretty quickly.

But then it kind of started stalling, or at least not getting better with the same oomph it did at first. Scaling came from reductions in cross-entropy loss, basically the model getting better at predicting what it should say next, and that loss still keeps going down. But for us, as observers, this hasn't had enough visible effects. And to this point, we still haven't found larger models which beat GPT-4 in performance, even though we've learnt how to make them work much, much more efficiently and hallucinate less.
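To make the cross-entropy point concrete, here's a minimal sketch of the quantity pre-training actually minimises: the average negative log-probability the model assigns to each next token. The toy "model" and numbers below are made up purely for illustration; only the shape of the computation matters.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: one sequence of random token ids, and random logits in place of
# a real model's output. Nothing here corresponds to any actual system.
vocab_size = 50_000
tokens = torch.randint(0, vocab_size, (1, 128))    # one sequence of 128 token ids
logits = torch.randn(1, 128, vocab_size)           # pretend this is model(tokens)

# Next-token prediction: the logits at position t are scored against the token at t+1.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

# This is the number that "keeps going down" as models and datasets scale.
loss = F.cross_entropy(pred, target)
print(loss.item())
```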

What seems likely is that gains from pure scaling of pre-training have stopped, which means that simply making the models bigger and throwing more data at them no longer packs in proportionally more capability the way it used to. This is by no means the only way we know how to make models better; it's just the easiest way. That's what Ilya was alluding to.

We have multiple GPT-4 class models, some a bit better and some a bit worse, but none that were dramatically better the way GPT-4 was better than GPT-3.5.

The model most anticipated from OpenAI, o1, seems to perform not much better than the previous state of the art model from Anthropic, or even their own previous model, when it comes to things like coding even as it captures many people’s imagination (including mine).

But this is also because we're hitting against our ability to evaluate these models. o1 is much, much better at legal reasoning, for instance. Harvey, the AI legal company, says so too. It also does much better with code reviews, not just creating code. It even solves 83% of the problems on an IMO qualifying exam (the AIME), versus 13% for GPT-4o. All of which is to say, even if it doesn't seem better at everything against Sonnet or GPT-4o, it is definitely better in multiple areas.

A big reason why people do think it has hit a wall is that the evals we use to measure the outcomes have saturated. I wrote as much when I dug into evals in detail.

Today we do it through various benchmarks that were set up to test them, like MMLU, BigBench, AGIEval and so on. These presume the models are some combination of "somewhat human" and "somewhat software", and therefore test them on things similar to what a human ought to know (SAT, GRE, LSAT, logic puzzles etc) and what software should do (recall of facts, adherence to some standards, maths etc). These are either repurposed human tests (SAT, LSAT), tests of recall (who's the President of Liberia), or logic puzzles (move a chicken, tiger and human across the river). Even if models can do all of these, it's insufficient to use them for deeper work, like additive manufacturing, or financial derivative design, or drug discovery.

The gaps between the current models and AGI are: 1) they hallucinate, or confabulate, and in any long-enough chain of analysis they lose track of what they're doing, which makes agents unreliable; and 2) they aren't smart enough to create truly creative or exceptional plans. In every eval the individual tasks can look human level, but in any real-world task they're still pretty far behind. The gap is highly seductive because it looks small, but it's like Zeno's paradox: it shrinks yet still seems to exist.

But regardless of whether we’ve hit somewhat of a wall on pretraining, or hit a wall on our current evaluation methods, it does not mean AI progress itself has hit a wall.

So how to reconcile the disconnect? Here are three main ways that I think AI progress will continue its trajectory. One, there still remains a data and training overhang; there's just a lot of data we haven't used yet. Two, we're learning to use synthetic data, unlocking a lot more capability from the data and models we already have. And three, we're teaching the models reasoning, to "think" for longer while answering questions, rather than just teaching them everything they need to know upfront.

  1. We can still scale data and compute

The first is that there is still a large chunk of data that hasn't been used in training. There's also the worry that we've run out of data: Ilya talks about data as the fossil fuel of AI, a finite and exhaustible resource.

But data might well be like fossil fuels in another way too: we identify more of it once we start really looking. The amount of oil that's available at $100 a barrel is much greater than the amount that's available at $20 a barrel.

Even the larger model runs don't contain a large chunk of the data we normally see around us. Twitter, for the most famous example. But also a large part of our conversations. The process data on how we learn things or do things, from academia to business to sitting back and writing essays. Data on how we move around the world. Video data from CCTVs around the world. Temporal structured data. Three-dimensional world data. Scientific research data. Video-game-playing data. Data across a vast range of modalities, even with the current training of multimodal models, remains to be unearthed. An entire world or more still lies out there to be mined!

There's also data that doesn't exist, but we're creating.

https://x.com/watneyrobotics/status/1861170411788226948?t=s78dy7zb9mlCiJshBomOsw&s=19

And in creating it we will soon come to depend on it the same way we did for self-driving. Except that, because folding laundry is usually not deadly, adoption will be even faster. And there are no "laundry heads", the way there are gear heads, to fight against it. This is what almost all robotics companies are actually doing: it is cheaper to create the data by outsourcing the performance of tasks to sufficiently tactile robots!

With all this we should imagine that the largest multimodal models will get much (much) better than what they are today. And even if you don’t fully believe in transfer learning you should imagine that the models will get much better at having quasi “world models” inside them, enough to improve their performance quite dramatically.

Speaking of which…

  2. We are making better data

And then there's synthetic data. This especially confuses people, because they rightly wonder how you can train on the same data again and make the model better. Isn't that just empty calories? It's not a bad question. In the AI world this would be restated as "it doesn't add a ton of new entropy to the original pre-training data", but it means the same thing.

The answer is no, for (at least) three separate reasons.

  1. We already train on the raw data we have multiple times to learn better. The high-quality datasets, like Wikipedia, or textbooks, or GitHub code, are not used once and discarded during training. They're used multiple times to extract the most insight from them. This shouldn't surprise us; after all, we too learn through repetition, and models are not so different.

  2. We can convert the data that we have into different formats in order to extract the most from it. Humans learn from seeing the same data in a lot of different ways. We read multiple textbooks, we create tests for ourselves, and we learn the material better. There are people who read a mathematics textbook and barely pass high school, and there’s Ramanujan.
    So if you turn the data into all sorts of question-and-answer formats, graphs, tables, images, god forbid podcasts, and combine and augment it with other sources, you can create a formidable dataset, and not just for pretraining but across the training spectrum, especially with a frontier model or inference-time scaling (using the existing models to think for longer and generate better data).

  3. We also create data and test its efficacy against the real world. Grading an essay is an art form at some point; knowing if a piece of code runs is not. This is especially important if you want to do reinforcement learning, because "ground truth" matters, and it's easier to analyse for topics where it's codifiable. OpenAI thinks it's even possible for spaces like law, and I see no reason to doubt them.

There are papers exploring all the various ways in which synthetic data can be generated and used. But especially for things like enhancing coding performance, or mathematical reasoning, or generating better reasoning capabilities in general, synthetic data is extremely useful. You can generate variations on problems and have the models answer them, filling diversity gaps; try the answers against the real world (like running the code a model generated and capturing the error message); and incorporate that entire process into training, to make the models better.
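As a concrete illustration of that loop, here's a minimal sketch for code data. Everything in it is hypothetical: `generate_variant` and `generate_solution` stand in for whatever model calls you'd actually use, and the "ground truth" here is simply whether the generated solution runs.

```python
import subprocess
import tempfile

def run_candidate(code: str) -> tuple[bool, str]:
    """Execute a generated solution and capture what the real world says back."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stderr

def build_synthetic_examples(seed_problems, generate_variant, generate_solution, n_variants=5):
    """Turn a handful of seed problems into a larger, verified training set."""
    examples = []
    for problem in seed_problems:
        for _ in range(n_variants):
            variant = generate_variant(problem)      # model rewrites the problem (fills diversity gaps)
            solution = generate_solution(variant)    # model attempts an answer
            ok, stderr = run_candidate(solution)     # ground truth: does the code actually run?
            examples.append({
                "problem": variant,
                "solution": solution,
                "passed": ok,
                "feedback": stderr,                  # error messages get folded back into training
            })
    return examples
```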

If you add these up, this is what caused the excitement over the past year or so and made folks inside the labs more confident that they could make the models work better: it's a way to extract more insight from our existing sources of data and teach the models to answer the questions we give them better. It's a way to force us to become better teachers, in order to turn the models into better students.

Obviously it’s not a panacea, like everything else this is not a free lunch.

The utility of synthetic data is not that it, and it alone, will help us scale the AGI mountain, but that it will help us move forward to building better and better models.

  3. We are exploring new S curves

Ilya’s statement is that there are new mountains to climb, and new scaling laws to discover. “What to scale” is the new question, which means there are all the new S curves in front of us to climb. There are many discussions about what it might be - whether it’s search or RL or evolutionary algos or a mixture or something else entirely.

o1 and its ilk are one answer to this, but by no means the only answer. The Achilles heel of current models is that they are really bad at iterative reasoning: thinking through something, and every now and then coming back to try something else. Right now we do this in hard mode, token by token, rather than the right way, in concept space. But this doesn't mean the method won't (or can't) work. Just that, like everything else in AI, the amount of compute it currently takes is nowhere close to the optimal amount.

We have just started teaching models to reason, and to think through questions iteratively at inference time, rather than just at training time. There are still questions about exactly how it's done, whether for the QwQ model or the DeepSeek R1 model from China. Is it chain of thought? Is it search? Is it trained via RL? The exact recipe is not known, but the output is.
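One deliberately simple flavour of this inference-time scaling is self-consistency: sample several chains of thought and take a majority vote on the final answer. The sketch below assumes a hypothetical `sample_chain_of_thought(question)` call that returns a (reasoning, answer) pair; it is emphatically not how o1 or R1 work, which is rather the point, since those recipes aren't public.

```python
from collections import Counter

def answer_with_self_consistency(question, sample_chain_of_thought, n_samples=16):
    """Spend more inference-time compute by sampling many reasoning paths and voting."""
    answers = []
    for _ in range(n_samples):
        reasoning, answer = sample_chain_of_thought(question)  # one independently sampled chain of thought
        answers.append(answer)
    # The most common final answer across independent chains tends to be more reliable
    # than any single chain: the simplest possible version of "thinking for longer".
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples
```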

And the output is good! Here, in fact, is the strongest bearish take on it, which is credible. It states that because it's trained with RL to "think for longer", and it can only be trained to do so on well-defined domains like maths or code, where chain of thought is most helpful and there are clear ground-truth correct answers, it won't get much better at other real-world answers. Which is most of them.

But it turns out that's not true! It doesn't seem to be that much better at coding compared to Sonnet or even its own predecessors; it's better, but not that much better. It's also not that much better at things like writing.

But what it indisputably is better at are questions that require clear reasoning. And the vibes there are great! It can solve PhD problems across a dizzying array of fields. Whether it’s writing position papers, or analysing math problems, or writing economics essays, or even answering NYT Sudoku questions, it’s really really good. Apparently it can even come up with novel ideas for cancer therapy.

https://x.com/DeryaTR_/status/1865111388374601806?t=lGq9Ny1KbgBSQK_PPUyWHw&s=19

This is a model made for expert-level work. It doesn't really matter that the benchmarks can't capture how good it is. Many say it's best to think of it as the new "GPT-2 moment" for AI.

What this paradoxically might show is benchmark saturation. We are no longer able to measure the performance of top-tier models without user vibes. Here's an example: people unfamiliar with cutting-edge physics convincing themselves that o1 can solve quantum physics, which turns out to be wrong. And vibes will tell us which model to use, for what objective, and when! We have to twist ourselves into pretzels to figure out which models to use for what.

https://x.com/scaling01/status/1865230213749117309?t=4bFOt7mYRUXBDH-cXPQszQ&s=19

This is the other half of the Bitter Lesson that we had ignored until recently. The ability to think through solutions and search a larger possibility space and backtrack where needed to retry.

But it will create a world where scientists and engineers and leaders working on the most important or hardest problems in the world can now tackle them with abandon. It barely hallucinates. It actually writes really impressive answers to highly technical policy or economic questions. It answers medical questions with reasoning, including some tricky differential diagnosis questions. It debugs complex code better.

It's nowhere close to infallible, but it's an extremely powerful catalyst for anyone doing expert-level work across a dizzying array of domains. And this is not even mentioning the work within DeepMind of creating the Alpha series of models and trying to incorporate those into the large language model world. There is a highly fertile research ecosystem desperately trying to build AGI.

We're making the world legible to the models just as we're making the models more aware of the world. It can be easy to forget that these models learn about the world by seeing nothing but tokens, vectors that represent fractions of a world they have never actually seen or experienced. And making the world legible to them is hard, because the real world is annoyingly complicated.

We have these models which can control computers now, write code, and surf the web, which means they can interact with anything that is digital, assuming there’s a good interface. Anthropic has released the first salvo by creating a protocol to connect AI assistants to where the data lives. What this means is that if you want to connect your biology lab to a large language model, that's now more feasible.
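As a toy illustration of what that connection might look like, here's a minimal sketch using the Python SDK for Anthropic's Model Context Protocol; the server name, the tool, and the `read_temperature` stub are all invented for this example, so treat the specifics as assumptions rather than a reference implementation.

```python
# A sketch assuming the MCP Python SDK's FastMCP helper; everything domain-specific
# here (the lab, the incubator, read_temperature) is hypothetical.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("biology-lab")

def read_temperature(incubator_id: str) -> float:
    # Stub standing in for a real instrument driver.
    return 37.0

@mcp.tool()
def incubator_temperature(incubator_id: str) -> float:
    """Return the current temperature (in Celsius) of the given incubator."""
    return read_temperature(incubator_id)

if __name__ == "__main__":
    mcp.run()  # an MCP-aware assistant can now discover and call this tool
```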

Together, what all this means is that we are nowhere close to AI itself hitting a wall. We have more data that remains to be incorporated to train the models to perform better across a variety of modalities, we have better data that can teach particular lessons in areas that are most important for them to learn, and we have new paradigms that can unlock expert performance by making it so that the models can “think for longer”.

Will this result in next-generation models that are autonomous like cats or perfectly functional like Data? No. Or at least it's unclear, but signs point to no. But we have the first models which can credibly speed up science. Not in the naive "please prove the Riemann hypothesis" way, but enough to run data analysis on their own to identify novel patterns, come up with new hypotheses, debug your thinking, read the literature to answer specific questions, and so many more of the pieces of work that every scientist has to do daily if not hourly. And if this is what AI looks like when it hits a wall, that would be a very narrow and pedantic definition indeed.


When we become cogs

2024-11-19 06:58:42

At MIT, a PhD student called Aidan Toner-Rodgers ran a test on how well scientists could do their jobs if they used AI in their work. These were materials scientists, and the goal was to figure out how they did once augmented with AI. It worked.

AI-assisted researchers discover 44% more materials, resulting in a 39% increase in patent filings and a 17% rise in downstream product innovation.

That’s really really good. How did they do it?

… AI automates 57% of the “idea-generation” tasks, reallocating researchers to the new task of evaluating model-produced candidate materials.

They got AI to think for them and come up with brilliant ideas to test.

But there was one particularly interesting snippet.

Researchers experience a 44% reduction in satisfaction with the content of their work

To recap, they used a model that made them much better at their core work and made them more productive, especially for the top researchers, but they dislike it because the “fun” part of the job, coming up with ideas, fell to a third of what it was before!

We found something that made us much much more productive but turns out it makes us feel worse because it takes away the part that we find most meaningful.

This is instructive.


This isn't just about AI. When I first moved to London, the black cab drivers used to say how much better they were than Google Maps. They knew the city, the shortcuts, the time of day and how it affects traffic.

That didn’t last long. Within a couple years anyone who could drive a car well and owned a cellphone could do as well. Much lower job satisfaction.

The first major automation task was arguably done by Henry Ford. He set up an assembly line and revolutionised car manufacturing. And the workers got to perform repetitive tasks. Faster production speed, much less artistry.

Computerisation brought the same. Electronic health records meant that most doctors now complain about spending their time inputting information into software, becoming data entry professionals.

People are forced to become specialists in ever tinier slices of the world. They don’t always like that.

There’s another paper that came out recently too, which looked at how software developers worked when given access to GitHub Copilot. It’s something that’s actively happening today. Turns out project management drops 25% and coding increases 12%, because people can work more independently.

Turns out the biggest benefit is for the lower-skilled developers, not the superstars who presumably could do this anyway.

This is interesting for two reasons. One is that it's a different group who gets the bigger productivity boost, the lower-skilled folks here instead of the higher-skilled. The second is that the reason the developers got upskilled is that a hard part of their job, knowing where to focus and what to do, got better automated. This isn't the same as the materials scientists finding new ideas to research, but also, it kind of is?

Maybe the answer is that it depends on your comparative advantage, and that AI takes away the harder part of the job, which is knowing what to do, rather than what seems harder, which is *doing* the thing. A version of Moravec's Paradox.

AI reduces the gap between the high and low skilled. If coming up with ideas is your bottleneck, as it seems possible for those who are lower skilled, AI is a boon. If coming up with ideas is where you shine, as a high skilled researcher, well …

This, if you think about it, is similar to the impact of automation work we’ve seen elsewhere. Assembly lines took away the fun parts of craftsmanship regarding building a beautiful finished product. Even before that, machine tools took that away more from the machinist. Algorithmic management of warehouses in Amazon does this.

It's also happening in high-skilled roles. Bankers are now front-end managers, as others have written about. My dad was a banker for four decades and he was mostly the master of his fate, which is untrue of most retail bankers today, except maybe Jamie Dimon.

Whenever we find an easier way to do something, we take away the need to actively grok the entire problem. Becoming an autopilot who reviews the work the machine is doing is fun when it's my Tesla FSD, but less so when it's your job, I imagine.

Radiologists, pathologists, lawyers and financial analysts, they all are now the human front-ends to an automated back-end. They’ve shifted from broad, creative work to more specialised tasks that automation can’t yet do effectively.

Some people want to be told what to do, and they're very happy with that. Most people don't like being micromanaged. They want to feel like they're contributing something of value by being themselves, not just contributing by being a pure cog.

People find fulfilment in being the masters of some aspect, fully. To own an outcome and use their brains, their whole brains, to ideate and solve for that outcome. The best jobs offer this. It's why you can get into a state of flow as a programmer or while crafting a table, but not as an Amazon warehouse worker.

There's the apocryphal story of the NASA janitor telling JFK that he was helping put a man on the moon. Missions work because they make you feel valued and valuable. Most of the time, though, you're not putting a man on the moon. And then, if on top of that you also tell the janitor what to mop, and when, and in what order, and when he can take a break, that's alienating. Substitute the janitor for an extremely highly paid Silicon Valley engineer and it's the same. Everyone's an Amazon Mechanical Turk.

AI will give us a way out, the hope goes, as everyone can do things higher up the pyramid. Possibly. But if AI too takes up the parts that was the most fun as we saw with the material scientists, and turns those scientists into mere selectors and executors of the ideas generated by a machine, you can see where the disillusionment comes from. It's great for us as a society, but the price is alienation, unless you change where you find fulfilment. And fast.

I doubt there was much job satisfaction in being a peasant living on your land, or as a feudal serf. I’m also not sure there’s much satisfaction in being an amazon warehouse worker. Somewhere in the middle we got to a point where automation meant a large number of people could rightfully take pride in their jobs. It could come back again, and with it bring back the polymaths.
