Strange Loop Canon

By Rohit Krishnan. Here you’ll find essays about ways to push the frontier of our knowledge forward. The essays aim to bridge the gaps between Business, Science and Technology.

Can we get an AI to write better?

2025-10-06 21:03:32

One question that the era of LLMs has brought up again and again is: what separates great prose from the merely good?

The answer has mostly been a hand-wavy appeal to “style” — a nebulous, mystical quality possessed by the likes of Hemingway, Woolf, or Wodehouse. Like the judge said about pornography, we know it when we see it. We can identify it, we can even imitate it. But can we measure it? Can we build a production function for it?

The default output of most modern LLMs is good. Competent even. But vanilla, stylistically bland. Should it always be so? This question has been bugging me since I started using LLMs. They are built from words and yet they suck at this... Why can’t we have an AI that writes well?

So the natural goal is to see whether we can find some (any?) quantifiable, empirical “signatures” of good writing. Because if we can, then those can be used to train better models. This question has somehow led me down a rabbit hole and ended up as a project I’ve been calling Horace.

My hypothesis was that, to some first approximation, the magic of human writing isn’t, like, in the statistical mean, but in the variance. This isn’t strictly speaking true, but it’s more true than the alternative, I suppose. It’s in the deliberate, purposeful deviation from the expected. The rhythm, the pace, the cadence.

(Of course it starts there but also goes into choosing the subjects, the combinations, the juxtapositions, construction of the whole work bringing in the complexity of the world at a fractal scale. But let’s start here first.)

One cool thing is that great prose rides a wave: mostly focused, predictable choices, punctuated by purposeful spikes of surprise that turn a scene or idea, or open up entire new worlds. Like a sort of heartbeat. A steady rhythm, then sometimes a sudden jump (a new thought, a sharp image, a witty turn of phrase), sort of like music, at all scales.

“Style is a very simple matter: it is all rhythm. Once you get that, you can’t use the wrong words.” — Virginia Woolf.

“The sound of the language is where it all begins. The test of a sentence is, Does it sound right?” — Ursula K. Le Guin.

But this heartbeat isn’t global. Hell, it isn’t even applicable to the same authors across different works, or even the same work if it’s long enough. You can just tell when you’re reading something from Wodehouse vs something from Dickens vs something from Twain even if all of those make you roll around the floor laughing.

This cadence, the flow, can be measured. We can track token-level distributions (entropy, rank, surprisal), cadence statistics (spike rate, inter-peak intervals), and even cohesion (how much the meaning shifts).
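
To make that concrete, here is a minimal sketch of how those token-level numbers can be pulled out of a passage. It assumes GPT-2 via Hugging Face transformers as the scoring model and a z-score threshold for spikes; both are illustrative choices, not necessarily what Horace itself uses.

```python
# Minimal sketch: per-token surprisal and entropy for a passage, using GPT-2.
# The model choice and the spike threshold are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def cadence_stats(text, spike_z=2.0):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # [1, T, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    # Surprisal of each actual next token, in nats
    surprisal = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    # Entropy of the predictive distribution at each step
    entropy = -(log_probs.exp() * log_probs).sum(-1)[0]
    # Flag "spikes": tokens whose surprisal sits well above the passage mean
    z = (surprisal - surprisal.mean()) / surprisal.std()
    spikes = (z > spike_z).nonzero().flatten().tolist()
    return {
        "mean_surprisal": surprisal.mean().item(),
        "surprisal_std": surprisal.std().item(),
        "mean_entropy": entropy.mean().item(),
        "spike_rate": len(spikes) / len(surprisal),
        "spike_positions": spikes,
    }

print(cadence_stats("The old man was thin and gaunt with deep wrinkles in the back of his neck."))
```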

Now, the first step was to see if this “cadence” is a real, detectable phenomenon. As you might’ve seen above from the charts, the task is to feed a big corpus of classic literature into an analysis engine, breaking down the work of dozens of authors into these statistical components.

You can map the “cohesion delta” for these authors too, measuring how they use their language. Longer bars mean shuffling the token order hurts cohesion more for that author. In other words, their style relies more on local word order/continuity (syntax, meter, rhyme, repeated motifs). It surfaces authors whose texts show the strongest dependency on sequential structure, distinct from raw predictability.
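
One way such a cohesion delta could be approximated: embed adjacent sentences, measure how similar neighbours are, then shuffle the words of the passage, re-cut it into chunks of the same lengths, and take the difference. The embedding model and the word-level shuffle granularity are assumptions for illustration.

```python
# Sketch of a shuffle-based cohesion delta. Larger values mean the passage's
# local cohesion leans more on sequential structure. All choices here are
# illustrative, not the exact Horace metric.
import random
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cohesion(chunks):
    embs = embedder.encode(chunks, convert_to_tensor=True)
    sims = [util.cos_sim(embs[i], embs[i + 1]).item() for i in range(len(chunks) - 1)]
    return sum(sims) / len(sims)

def cohesion_delta(sentences, seed=0):
    lengths = [len(s.split()) for s in sentences]
    words = [w for s in sentences for w in s.split()]
    random.Random(seed).shuffle(words)
    shuffled, i = [], 0
    for n in lengths:                      # re-cut into chunks of the original sizes
        shuffled.append(" ".join(words[i:i + n]))
        i += n
    return cohesion(sentences) - cohesion(shuffled)

sents = ["The carriage rolled on through the fog.",
         "Inside it, Mr. Pickwick dozed against the window.",
         "The fog pressed closer with every mile."]
print(cohesion_delta(sents))
```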

This is pretty exciting, obviously, because if we can track things at the token level then we can later expand to track across other dimensions. (Yes, it’ll get quite a bit more complicated, but such is life).

Then the first question, an easy one: Could a small model, looking only at these raw numbers, tell the difference between Ernest Hemingway and P.G. Wodehouse?

The answer, it turns out, is yes. I trained a small classifier on these “signatures,” and it was able to identify the author of a given chunk of text accurately.
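
The shape of that experiment, sketched below. It builds on the cadence_stats sketch earlier; the feature set, the logistic-regression classifier, and the train/test split are stand-ins rather than the exact Horace setup.

```python
# Represent each chunk of text by its cadence statistics and train a small
# classifier on the author labels. Uses cadence_stats from the earlier sketch;
# features and classifier are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def signature(chunk_text):
    stats = cadence_stats(chunk_text)   # defined in the earlier sketch
    return [stats["mean_surprisal"], stats["surprisal_std"],
            stats["mean_entropy"], stats["spike_rate"]]

def train_author_classifier(chunks):
    # chunks: list of (text, author) pairs drawn from the corpus
    X = np.array([signature(text) for text, _ in chunks])
    y = np.array([author for _, author in chunks])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))
    return clf
```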

What you’re seeing above is the model’s report card. The diagonal line represents correct guesses. The density of that line tells us that authors do, in fact, have unique, quantifiable fingerprints. Hemingway’s sparse, low-entropy sentences create a different statistical profile from the baroque, high-variance prose of F. Scott Fitzgerald.

With the core thesis validated, we can now try to zoom in.

Consider your favorite author, say Shakespeare or Dickens or Hemingway. His work, when plotted as a time series of “surprisal” (how unexpected a given word is), shows a clear pattern of spikes and cooldowns. He isn’t alone, it’s the same for Yeats or for Aesop.

You see these sharp peaks? Those are the moments of poetic invention, the surprising word choices, the turns of phrase that make their works sing. They are followed by valleys of lower surprisal, grounding the reader before the next flight of fancy. As the inimitable Douglas Adams wrote:

[Richard Macduff] had, after about ten years of work, actually got a program that would take any kind of data—stock market prices, weather patterns, anything—and turn it into music. Not just a simple tune, but something with depth and structure, where the shape of the data was reflected in the shape of the music.

Anyway, this holds true across genres. Poetry tends to have denser, more frequent spikes. Prose has a gentler, more rolling cadence. But the fundamental pattern seems to hold.

But, like, why is this necessary?

Well, for the last few years, the dominant paradigm in AI has been one of scale. More data, more parameters, more compute. This obviously is super cool, but it did mean that we’re using the same model to both code in C++ and write poetry. And lo and behold, it got good at the one we could actually measure.

Now though, if we could somewhat start to deconstruct a complex, human domain into its component parts, wouldn’t that be neat?

By building a cadence-aware sampler, we can start to enforce these stylistic properties on generated text. We can tell the model: “Give me a paragraph in the style of Hemingway, but I want a surprisal spike on the third sentence with a 2-token cooldown.” Not sure you would phrase it as such, but I guess you could. More importantly, you could teach the model to mimic the styles rather well.
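
What might a cadence-aware sampler look like? A hedged sketch: a per-token temperature schedule that forces a higher-surprisal choice at a chosen position and then cools down for a couple of tokens. GPT-2, the specific temperatures, and the schedule shape are all invented for illustration.

```python
# Sketch of cadence-aware sampling via a per-token temperature schedule.
# Temperatures and positions are placeholder assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def cadence_schedule(length, spike_at, cooldown=2,
                     base_temp=0.8, spike_temp=1.6, cool_temp=0.5):
    temps = [base_temp] * length
    if 0 <= spike_at < length:
        temps[spike_at] = spike_temp          # push the model off its most expected path
        for i in range(spike_at + 1, min(spike_at + 1 + cooldown, length)):
            temps[i] = cool_temp              # then ground the reader again
    return temps

def sample_with_cadence(prompt, max_new_tokens=40, spike_at=20):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for t in cadence_schedule(max_new_tokens, spike_at):
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :] / t
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(sample_with_cadence("The old man looked at the sea and said,"))
```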

“The difference between the almost right word and the right word is the difference between the lightning bug and the lightning.” — Mark Twain

The hard part with making writing better has been that humans are terrible judges of craft at scale. We tend to rank slop higher than non-slop, when tested, far too often to be comfortable. Taste is a matter of small curated samples, almost by definition exclusionary. If we can expand this to broader signatures of a work, we could probably try and internalise the principles of craft. We compared two models, Qwen and GPT-2, to make sure there are no model-specific peccadilloes, and still see that we can systematically generate text that is measurably closer to the stylistic signatures of specific authors.

Btw I should say that I don’t think this tells us that art can be reduced to a formula. A high surprisal score doesn’t make a sentence good. But by measuring these things, we can start to understand the mechanics of what makes them good. Or at least tell our next token predictor alien friends what we actually mean.

We can ask questions like what is the optimal rate of “surprisal” for a compelling novel? Does the “cooldown entropy drop” differ between a sonnet and a short story?

I’m not sure if we will quite get it to become a physics engine for prose, but it’s definitely a way to teach the models how to write better, to give them a vocabulary for what to learn. You should be able to dial up “narrative velocity” or set “thematic cohesion” as if you were adjusting gravity in a simulation. I remember getting o1-pro to write an entire novel for me 6 months ago. It was terrible. Some specific sentences were good, maybe some decent motifs, but the global attention, the nuggets that needed to be dropped along the way, and the cadence were all off.

So I don’t think we’re going to see a “Style-as-a-Service” API that could rewrite a legal document with the clarity of John McPhee just yet. My experiments were with tiny 2.5B parameter models. But it sure would be nice to make LLMs write just a bit better. I’m convinced we can do better, if we so choose. The ghost in the machine, it turns out, does have a heartbeat.

Anomie

2025-09-28 20:57:25

I sometimes think about people whose careers started in the ‘90s. They had a roaring decade of economic growth. And even if they did not participate in the dot com boom, they still had the opportunity to invest in Google, Amazon or Microsoft at low valuations. They had the potential to generate extraordinary wealth purely by dint of public market investments or buying a house in Palo Alto.

We can contrast that with the 2010s. The decade was roaring again; the stock market actually did quite well. But the truly outsized returns were almost entirely stuck within the private markets. Much of venture capital over the last decade has been privatizing the previously public gains, of a company going from 1 billion, 5 billion, 20 billion to 10, 50, 100 billion market caps or more. The last big IPO that happened was Facebook in 2012, and that was already outsized, being valued at five times that of Google’s by the time the public could get their hands on it. In fact one of the best trades that existed perhaps ever was buying its stock when their market cap fell to 300 billion or so a few years ago.

Or, looked at another way, in 1980 the median age of a listed U.S. company was 6 years; today it is 20.

Meanwhile every other major company remains private seemingly endlessly. Even now Stripe remains private, so does Databricks, so does SpaceX … They give their employees liquidity, provide some high fee methods for others to invest via SPVs or futures, even report the occasional metric. And if you want any exposure you better be prepared to pay 5% fees and then probably 2 and 20 on top of it for the SPV.

Now, the number of people investing in the market has gone up, so maybe it’s just alpha erasure. It’s not to say there are no alpha-generating investments at all. There absolutely have been 10-baggers or more in the public markets; Palantir shot up like crazy. But they’re as few as they’re speculative. All the while, even the number of public companies has fallen off a cliff.

But it does tell us why meme stocks became a thing. Right? Speculative mania by itself is nothing new, from tulips to Cisco in 2000, but Tesla is a different animal. As was (is) GameStop! It also explains why crypto is a thing, why smart 20 year olds are yoloing their bonus checks into alt coins or short expiry options.

It’s because there’s a clear sense of now or never. This was the entire crypto ethos. Don’t build a Telco, create a Telco token! Even the rise of AI heightens this! If you managed to join OpenAI in 2020, you’re a multi multi millionaire, you won the lottery. If you didn’t, it’s over. Even if you combined the workforces of the largest labs, they still wouldn’t show up in any aggregate measures.

Back in the days of yore, if you did not manage to get a job at Google in 2005 you could still buy its stock. You had at least the option of gaining from its appreciation, assuming you thought it inevitable. Over the last decade and a half there have been multiple generations who succeeded by getting a job at one of these giants and working their way up, and equally, if not more, from investing in those giants. That’s what brought about the belief that the arc of history trended upwards.

Today, there exists no such option. There only exist short-term manic rises, even for the longer-term theses. The closest anyone can get to the AI boom is Nvidia, an old stock, which has shot up as the preferred seller of shovels in this gold rush. The closest anyone can get at an institutional scale is Situational Awareness, which bought calls on Intel Capital and has also rightfully shot up. These are in effect synthetic lottery tickets the public market was forced to invent because the real lottery, OpenAI equity, is locked. The claim is not that returns vanished, but that access to the tails shifted.

But from the perspective of most people on the street: either you work for one of the large labs, in which case you are paid extraordinarily well, enough to almost single-handedly prop up the US economy, or you are at best treading water. And by the way, the broader solutions to try and fix it by adding private equity to 401k portfolios are as risky as they are expensive. Not to mention opaque. The roaring parts of the economy are linked, sure, to the public markets, and the broader economy benefits, but at a distance.

I wrote once about Zeitgeist Farming, a way that seemed to be developing to get rich by betting on the zeitgeist and doing no real work, as a seemingly emergent phenomenon in the markets, and it seems to have continued its dominance. And we see the results. It’s the Great Polarisation.

I’m obviously not saying that life sucks or that folks who don’t win these lotteries are destitute, this is not a science fiction dystopia, far from it, but it is very clear that the fruits of our progress seem fewer and more coarsely distributed. And when they’re not, the feeling of there being haves and have nots gets stronger. It might well be that the haves are only a tiny tiny tiny minority who are doing exceedingly well, while the majority are doing just fine, great even historically speaking, but the “there but for the flip of a coin go I” feeling remains strong.

This is what’s different from the ages before. Physics PhDs went into Wall Street and made billions, but it didn’t feel like they hit a lottery as much as they were at the top of their profession, a profession that was different, even priestly, in its insularity. AI, rightly or wrongly, doesn’t feel like that.

It doesn’t help that the rhetoric from all the labs is that the end is nigh. The end of all humanity, if you believe some, but at least the end of jobs according to even the more level headed prognosticians. Leaving aside how right they might end up being, that’s a scary place to be.

While this particular rhetoric is new, it taps into a fear that’s existed, latent, inside many over the entire past decade and a half. We all know folks who joined so-and-so company at the right time and rode the valuation up. We also know incredibly smart folks who didn’t, and who didn’t “get their bag”.

The crypto alt-coin bubble might have seemed a cause of the societal sickness, but it’s not. It’s a symptom. A symptom of the fact that to get ahead it feels, viscerally, like you have to gamble.

After all, when life resembles a lottery, what’s left but to play the odds?

Prediction is hard, especially about the future

2025-09-17 07:59:21

All right, so there's been a major boom in people using AI, and also in people trying to figure out what AI is good for. One would imagine they go hand in hand, but alas. About 10% of the world is already using it. Almost every company has people using it. It’s pretty much all people can talk about on conference calls. You can hardly find an email or a document these days that is not written by ChatGPT. Okay, considering that is the case, there is a question about, like, how good are these models, right? Any yardstick that we have used, whether it's the ability to do math or word problems or logic puzzles or, I don't know, going and buying a plane ticket online or researching a concert ticket, it has kind of beaten all those tasks, and more.

So, considering that, what is a good way to figure out what they're ultimately capable of? One where the models are actually doing reasonably well, can be mapped on some kind of a curve, and which doesn’t suffer from the “teaching to the test” problem.

And one of the answers there is that you can look at how well it actually predicts the future, right? I mean, lots of people talk about prediction markets and about how you should listen to those people who are actually able to do really well with those. And I figured, it stands to reason that we should be able to do the same thing with large language models.

So the obvious next step was to take a bunch of news items and then ask, you know, the model what will happen next. Which is what I did. I called this Foresight Forge because that’s the name the model picked for itself. (It publishes daily predictions with GPT-5, used to be o31.) I thought I would let it make all the decisions, from choosing the sources to making the predictions to ranking them with probabilities after and doing regular post mortems.

Like an entirely automated research engine.

This work went quite well in the sense that it gave interesting predictions, and I actually enjoyed reading them. It was insightful! Though, like, a bit biased toward positive outcomes. Anyway, still useful, and a herald of what’s to come.

But, like, the bigger question I kept asking myself was what this really tells us about AI’s ability to predict what will happen next. Making predictions is, after all, only a portion of the eval; the rest is understanding, learning from, and scoring them.

The key thing that differentiates us, you know, is the fact that we are able to learn. If you have a trader who gets better at making predictions, they do that because he or she is able to read about what they did before, use that as a springboard to learn something else, use that as a springboard to learn something else, and so on and so forth. There is an actual process whereby you get better over time; it's not that you are some perfect being. It's not even that you predict for, like, a month straight or 2 months straight and then use all of that together to make yourself smarter and/or better instantaneously. Learning is a constant process.

And this is something that all of the major AI labs talk about all the time, in the sense that they want continuous learning. They want to get to a point where you can see the models actually get better in real time. That's fairly complicated, but that's the goal, because that's how humans learn.

A short aside on training. One of the biggest thoughts I have about RL, probably all model training, is that it is basically trying to find workarounds to evolution, because we can’t replay the complexity of the actual natural environment. And the natural environment is super hard to create, because it involves not just unthinking rubrics about whether you got your math question right, but also, like, interacting with all the other complex elements of the world, which in its infinite variety teaches us all sorts of things.

So I thought, okay, we should be able to figure this out, because what you need to do is the exact same thing that we (or model training) do, but on a regular basis. Every single day you get the headlines of the day and some articles, you ask the model to predict what's going to happen next, and, to keep things on policy, the very next day you use the information you now have to update the model.

Because I wanted to run this whole thing on my laptop, a personal constraint I set so I don’t burn thousands on GPUs every week, I decided to start with a tiny model and see how far I could push it. The interesting part about running with tiny models, you know, is that there's only a certain amount of stuff that they are going to be able to do. I used Qwen/Qwen3-0.6B on MLX.

(I also chose the name Varro. Varro was a Roman polymath and author, widely considered ancient Rome's greatest scholar, so seemed like a fitting name. Petrarch famously referred to him as "the third great light of Rome," after Virgil and Cicero.)

For instance, the best way to do this would be to make a bunch of predictions, and the next day look back, see how close you got to some of those predictions, and update your views. Basically a reward function, set up the way you'd want if you're doing reinforcement learning.

But there's a problem in doing this, which is that there are only so many ways in which you can check whether you were right or not. You could just use some types of predictions as a yardstick if you'd like; for instance you could go with only financial market predictions and, you know, check the next day whether you were accurate or not. This felt too limiting. After all, the types of predictions that people make, if they turn out to understand the world a lot better, are not limited to what the price of Nvidia is likely to be tomorrow morning.

Not to mention that also has a lot of noise. See CNBC. You should be able to predict all sorts of things: what would happen in Congress in terms of a vote, or what might happen in terms of corporate behavior in response to a regulation, or what might happen macroeconomically in response to an announcement. So while I set some restrictions on the types of things it could possibly predict, I wanted to leave it open-ended. Especially because leaving it open-ended seemed like the best way to teach a proper world model to even smaller LLMs.

I thought the best way to check the answer was to use the same type of LLM to look at what happened next and then, you know, figure out whether it got close. Rather obviously in hindsight, I ran into a problem, which is that small models are not very good at acting as an LLM judge. They get things way too wrong. I could’ve used a bigger model, but that felt like cheating (because it would teach the smaller model about the world, rather than the model learning purely from the environment).

So I said okay, I can first teach it the format, but then I've got to find some other way to figure out whether its prediction came close to what happened the next day. What I thought I could do was use the same method that I used with Walter, the RLNVR paper, and see whether semantic similarity might actually push us a long way. Obviously this is a double-edged sword, because you might get something semantically fairly close while having the opposite meaning, or just low quality2.
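
Concretely, the semantic-similarity check amounts to something like the sketch below: embed yesterday's prediction and today's actual headlines, and score by the closest match. The embedding model is an assumption, and the failure mode is exactly the one just mentioned, since a prediction can sit close in embedding space while getting the sign of the outcome wrong.

```python
# Sketch of the semantic-similarity scoring: compare a prediction made yesterday
# against today's actual headlines, reward by the closest match.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def outcome_reward(prediction, todays_headlines):
    pred_emb = embedder.encode(prediction, convert_to_tensor=True)
    head_embs = embedder.encode(todays_headlines, convert_to_tensor=True)
    return float(util.cos_sim(pred_emb, head_embs).max())   # roughly in [0, 1] in practice

print(outcome_reward(
    "The Fed will hold rates steady at this week's meeting, citing sticky inflation.",
    ["Federal Reserve leaves interest rates unchanged",
     "Tech stocks rally on strong earnings"]))
```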

But while we are working with smaller models, and since the objective is to try and figure out whether this method will work in the first place, I thought this might be an okay way to start. And that's kind of what we did. The hardest part was trying to figure out the exact combination of rewards that would actually make the model do what I wanted, and not whatever it wanted to do to maximise the reward by doing weird stuff. Some examples: you could not ask it for bullet points, because it started echoing instructions, so to teach it thinking and responding you had to have it think in paragraphs.

Long story short, it works (as always, ish3). The key question that I set out to answer here was basically whether we could have a regularly running RL experiment on a model, such that you can use sparse, noisy rewards coming through from the external world, and keep updating it such that it can still do one piece of work relatively well. While I chose one of the harder ways to do this, by predicting the whole world, I was super surprised that even a small model did learn to get better at predicting the next day's headlines.

I wouldn't have expected it, because there is no logical reason to believe that tiny models can still learn enough world-model-type information to do this. It might have been the small sample size, it might have been noise, it might have been a dozen other ways in which this is not perfectly replicable.

But that's not the point. The point is that if this method works even somewhat well, as it did for a tiny tiny model, then for larger models, where the rewards are better understood, you can probably do on-policy RL pretty easily4.

This is a huge unlock. Because what this means is that the world which is filled with sparse rewards can now basically be used to get the models to behave better. There's no reason to believe that this is an isolated incident, just like with the RLNVR paper there is no reason to believe that this will not scale to doing more interesting things.

And since I did the work, I learned that Cursor, the AI IDE, does something similar for their autocomplete model. They take a much stronger reward signal, in terms of whether humans accept or reject the suggestions it actually makes, and with it they are able to update the policy and roll out a new model every couple of hours. Which is huge!

So if Cursor can do it, then what stands in between us and doing it more often for all sorts of problems? Partly just the availability of data, but mostly it’s creating a sufficiently interesting reward function that can teach it something, and a little bit of AI infrastructure.

I'm going to contribute the Varro environment to the Prime Intellect RL hub in case somebody wants to play, and also maybe make it a repo or a paper. But it's pretty cool to see that even for something as amorphous as predicting the next day's headlines, something that is extraordinarily hard even for humans because it is a fundamentally adversarial task, we're able to make strides forward if we manage to convert the task into something that an LLM can understand, learn from and hill climb. The future is totally going to look like a video game.


In academic work, please cite this essay as: Krishnan, R. (2025, September 16). Prediction is hard, especially about the future. Strange Loop Canon. https://www.strangeloopcanon.com/p/prediction-is-hard-especially-about

1

See if you can spot which day it changed

2

Anyway, the way we do it is, create a forecast that is a short paragraph with five beats: object, direction + small magnitude, tight timeframe, named drivers, and a concrete verification sketch. And that house style gives us a loss function we can compute. Each day: ingest headlines → generate 8 candidates per headline → score (structure + semantics; truth later) → update policy via GSPO.
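
Putting those pieces together, one day of the loop could look roughly like this, under the reading that the semantic term is the next-day similarity described in the main text. format_score is a cheap stand-in for the five-beat check, the 0.75/0.25 split mirrors the recipe in the later footnote, and generate_candidates and gspo_update are hypothetical placeholder names for the MLX generation and GSPO update steps, not a real API.

```python
# Sketch of one day of the loop, assuming a sentence-embedding model for the
# semantic term. generate_candidates and gspo_update are hypothetical stand-ins.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def format_score(forecast: str) -> float:
    # Cheap proxy for the house style: one paragraph, a timeframe, a direction word, sane length.
    checks = [
        "\n" not in forecast.strip(),
        any(w in forecast.lower() for w in ("by ", "within", "week", "friday")),
        any(w in forecast.lower() for w in ("rise", "fall", "hold", "increase", "decrease")),
        30 <= len(forecast.split()) <= 180,
    ]
    return sum(checks) / len(checks)

def daily_step(policy, todays_headlines, yesterdays_forecasts,
               n_candidates=8, w_semantic=0.75, w_format=0.25):
    # 1. Score yesterday's forecasts against what actually happened today.
    head_embs = embedder.encode(todays_headlines, convert_to_tensor=True)
    rewards = []
    for f in yesterdays_forecasts:
        sim = float(util.cos_sim(embedder.encode(f, convert_to_tensor=True), head_embs).max())
        rewards.append(w_semantic * sim + w_format * format_score(f))
    # 2. Update the policy on those sparse, noisy rewards (GSPO in the real run).
    policy = gspo_update(policy, yesterdays_forecasts, rewards)              # hypothetical
    # 3. Generate tomorrow's candidates from today's headlines.
    forecasts = [c for h in todays_headlines
                 for c in generate_candidates(policy, h, n=n_candidates)]    # hypothetical
    return policy, forecasts
```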

3

Across runs the numbers tell a simple story.

  • COMPOSITERUN (one-line schema): quality 0.000, zeros 1.000, leak 0.132, words 28.9. The template starved learning.

  • NEWCOMPOSITERUN (paragraphs, looser): quality 0.462, zeros 0.100, leak 0.693, words 124.5. Gains unlocked, hygiene worsened.

  • NEWCOMPOSITERUN2 (very low KL): quality 0.242, zeros 0.432, leak 0.708, words 120.8. Under-explored and under-performed.

  • SEMANTICRUN (moderate settings): quality 0.441, zeros 0.116, leak 0.708, words 123.8. Steady but echo-prone.

  • SEMANTICRUN_TIGHT_Q25 (tight decoding + Q≈0.25): quality 0.643, zeros 0.013, leak 0.200, words 129.2. Best trade-off.

4

The daily cadence was modest but legible. I ran a small Qwen-0.6B on MLX with GSPO, 8 rollouts per headline, typically ~200–280 rollouts/day (e.g., 32×8, 31×8). The tight run trained for 2,136 steps with average reward around 0.044; KL floated in the 7–9 range on the best days, balancing stability with exploration. Entropy control really matters. The working recipe: paragraphs with five beats; LLM=0; Semantic≈0.75; Format(Q)≈0.25; sampler=tight; ~160–180 tokens; positive 3–5 sentence prompt; align scorer and detector. If ramble creeps in, nudge Q toward 0.30; if outputs get too generic, pull Q back.

The future of work is playing a videogame

2025-08-25 21:55:27

I usually work with three monitors. A few days ago, as I was looking across the usual combination of open documents, slack, whatsapp, and assorted chrome windows, I noticed something.

Somehow, over the past few weeks (months maybe) portions of my screens had gotten taken over by multiple Terminals. It’s not because I do a lot of development, it’s because every project I have or work on is now linked with AI agents in some way shape or form. Even when I want to write a report or analyse a bunch of documents or do some wonky math or search my folders to find out the exact date I bought my previous home for some administrative reason.

A part of this is that people ask occasionally how I use AI and I struggle to answer because it’s integrated with roughly everything that I do. Almost anything I do on the computer now involves LLMs somewhere in the chain.

I was thinking about this again over the weekend because there’s a lot of discussion about what the future will look like.

As agents get better at doing long-duration tasks, it's also becoming more important to see what they're doing, respond to their requests and questions, and, where needed, intervene.

This has implications for what work looks like in the future. There’s already the belief that many of us are doing bullshit jobs, which is patently false but highly prevalent. It’s because many of our tasks are not of the “I can easily link the output to a metric I care about” variety. It’s a statement of our ignorance, not about reality.

But it is true that many jobs we do today would seem incomprehensible to people a couple decades ago. And we can extrapolate that trend going forward.

What this means is that most jobs are going to become individual contributor roles on the outside, where you are actually acting as a manager. I wrote recently:

The next few years are going to see an absolute “managerial explosion” where we try to figure out better rubrics and rating systems, including using the smartest models to rate themselves, as we train models to do all sorts of tasks. This whole project is about the limits of current approaches and smaller models.

This is true, but it’s too anodyne. So I wanted to visualise it for myself, just to make things more real. What does it “feel” like to be in command of a large number of agents? The agents would constantly be doing things that you want them to, and you’d have to be on top of them, and of the other humans you interact with, to make sure things got done properly.

So I made a dashboard to try and visualise what this might look like.

This is a fundamentally different view of work. It is closer to videogames. Constant vigilance! A large number of balls in the air at all times. Ability to juggle context, respond to idiosyncratic errors, misunderstandings. And able to respond quickly.

These are normally managerial tasks. And that too, only if you’re a very good manager! I’m sure you are, or you’ve seen, one of those people with a phone in their hand, furiously typing when they’re at the park or walking to their car. People who deal with multiple emails and messages and Slack pings and phone calls and Zooms on a regular basis, often alt-tabbing from one to the next.

Some of this alt-tabbing will involve what we might call “real work”. To help intervene in things that the AI gets wrong. To answer questions from other employees or customers. To provide more context, to figure out where to pay attention, to get things unstuck.

To help do this there will be logs of what was done before, the KPIs that you’d set up, edit, adjust, update and monitor continuously. The reporting of those will also be done by AI agents. You’d watch them as your Fleet.

You might change the throttling up top to speed up or slow down particular parts of the organisation, like a conductor, both to manage resources and to manage smooth delivery. Everything runs as a web of interactions and you’re in the middle, orchestrating it all.

You’d of course be interacting with plenty of other orchestrators too. Maybe in your own organisation, or maybe in others. There will be many layers and subnetworks to consider.

This also has some downstream effects. It means all jobs will have an expiration date. You might get hired to do things, but as soon as what you do gets “learnt” by an AI agent, it can get systematised and automated1. It means every job becomes a project.

This can be seen as dystopian, I can just imagine the Teamsters reacting to this, but it’s the same dance every white collar job has gone through in the last two decades, just sped up.

What this future shows is that the future of work will look a lot more like rapid fire management. Ingest new information, summarise, compare things to policy, request more docs where needed, reconcile ledgers, sync feeds, chase POs, quote to cash, so on and on. Each of those and hundreds more would be replaced, or at least massively augmented, by agents.

This isn’t a seamless transition. The world of engineering is filled with people who somehow hate having been promoted from coder to manager. The requirement to split attention, constant vigilance, the intellectual burden of being “always on”, these are all added skillsets that aren’t being taxed today for almost anyone2.

This is already the case. Claude Code spawns sub agents. Codex and Cursor have background tasks. People routinely run many of these in parallel and run projects by alt-tabbing in their mind and surfing twitter in their down times. While these are for coding, that will change with time. Any job that can be sufficiently sliced into workstreams will suffer the same fate. We’re all about to be videogame players.

1

Note that I’m not making any claims about superintelligence, only about the intelligence required to automate “quote to cash”.

2

I have a friend who is highly successful in the valley but doesn’t answer Slack messages. If anything is truly urgent people would phone him, or he’d check emails at specific hours and respond. He has a system, in other words, to deal with the chaos that management brings with it. Others have other systems, where whether they’re at Costco or Disney World they can’t help but answer when the phone pings. We all will have to figure out our own equilibria.

Walter

2025-08-24 01:59:47

So, LLMs suck at Twitter. It’s kind of poetic, because Twitter is full of bots. But despite sometimes trying to be naughty and sometimes trying to be nice, they mostly still suck. They do remarkably well at some tasks and terribly at others. And writing is one of the hardest.

My friend and I were joking about this, considering words are at the very core of these miraculous machines, and thought hey wouldn’t it be nice if we could train a model to get better? We were first wondering if one could create an AI journalist that could actually write actual articles with actual facts and arguments and everything. Since we were thinking about an AI that could write, we called it Walter. Because of Bagehot. And Cronkite. We thought it had to be plausible, at least at a small scale. Which is why we tried the experiment (paper here)1.

This is particularly hard in a different way from math or coding, because how do you even know what the right answer is? Is there one? To get to a place where the training is easier and the rewards are richer, we thought of trying to write tweet-sized takes on articles. So, Walter became a small, cranky, surprisingly competent engine that ingests social media data about articles, sees how people reacted, and trains itself via reinforcement learning to write better2.

As Eliot once said, “Between the idea / And the reality / … falls the Shadow.” This was us trying to light a small lamp in there using RLNVR: our cheeky acronym for “reinforcement learning from non-verified rewards”.

Now, why small models? Well, a big reason, beyond being GPU poor, is that big models are resilient. They're like cars with particularly powerful shock absorbers, they are forgiving if you make silly assumptions. Small models are not. They are dumb. And precisely because they are dumb, you are forced to be smart.

What I mean is that if you really want to understand something, the best way is to try and explain it to someone else. That forces you to sort it out in your own mind. And the more slow and dim-witted your pupil, the more you have to break things down into more and more simple ideas. And that’s really the essence of programming. By the time you’ve sorted out a complicated idea into little steps that even a stupid machine can deal with, you’ve certainly learned something about it yourself. The teacher usually learns more than the pupil3.

This also makes reward modelling particularly interesting. Because anytime you think you have come up with a good reward model, if there is any weakness or flaw in how you measure your reward, a small model will find it and exploit it ruthlessly. Goodhart’s Law is not just for management.

This is not to say that only small models do that; of course we have seen large models reward hack and learn lessons they were not meant to. But it is fascinating to see a 500 million parameter model learn that it can trick your carefully designed evaluation rubric just by outputting tokens just so. It drives home just how powerful transformers actually are, because it doesn't matter how complicated a balanced scorecard you create; they will find a way to hack it. Tweaking specific weights given to different elements, fighting with a sampling bias towards articles with enough skeets, penalties and thresholds for similarities … all grist for their mill.

We should also say: social media engagement data is magnificently broken as a training signal. It’s sort of “well known”, but it’s hard to imagine exactly how bad until you try to use it. We first ingested Bluesky skeets plus their engagement signals (likes, reposts, replies). Since we wanted actual signal, we decided to use the URL as the organizer: we group all the skeets that point at the same URL, then ask the model to produce a fresh skeet for that article. For the reward, we use embeddings to find the most similar historic posts (this worked best), then sanity check, and then rank based on how well those posts did.

The outside world in this instance, as in many, has its problems. For instance:

  • Bias. Big accounts seem “better,” in that they get more (and more interesting) reactions than small accounts who post very similar things. The Matthew Effect holds true in social media. To solve that, we had to do baseline normalization: score a post relative to its author’s usual. Raw engagement minus the author’s baseline turns “how big is your account?” into “was this unusually good for you?”.

  • Sparsity. You get one post and one outcome, not ten A/B variants. And for that we tried max-based semantic transfer: For a new post, find the single most similar historical post about the same article and reward the similarity to that top performer. The max transfer mattered more than we expected. In this domain, the right teacher is a specific great prior, not the average of pretty‑good priors.
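
A minimal sketch of how those two fixes combine into a reward: normalize engagement against each author's baseline, then score a new candidate by its similarity to the best-performing prior post about the same URL. The embedding model, the scaling of the weights, and the example numbers are illustrative rather than the exact Walter setup.

```python
# Baseline-normalize engagement per author, then reward a new candidate by
# similarity to the best prior post about the same URL ("max transfer").
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def reward(candidate, prior_posts):
    # prior_posts: (text, engagement, author_baseline) for posts about the same URL.
    lifts = [max(engagement - baseline, 0.0) for _, engagement, baseline in prior_posts]
    top = max(lifts) or 1.0
    weights = [l / top for l in lifts]        # "was this unusually good for you?", scaled to [0, 1]
    cand_emb = embedder.encode(candidate, convert_to_tensor=True)
    best = 0.0
    for (text, _, _), w in zip(prior_posts, weights):
        sim = float(util.cos_sim(cand_emb, embedder.encode(text, convert_to_tensor=True)))
        best = max(best, sim * w)             # transfer from the single best-matching prior
    return best

priors = [("Starship sticks the landing at Cape Canaveral", 45, 20),
          ("SpaceX lands Starship again, next stop Mars", 210, 200)]
print(reward("Starship just landed at Cape Canaveral, another step toward Mars.", priors))
```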

But this messy, biased, sparse signal is the only feedback that exists. The world doesn't hand out clean training labels. It hands you whatever people actually do, and you have to figure out how to learn from that.

Together, this turned a one-shot, messy outcome into a dense signal. We first trained with GRPO, though later we upgraded to GSPO with clipping and a KL leash to keep the voice anchored4. We also added UED (Unsupervised Environment Design) so the curriculum self-organizes: pick link targets where the policy shows regret/variance, and push there5.
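
For the curious, the objective is roughly a sequence-level clipped policy loss plus a KL leash back to the base model, along these lines. The constants and the KL estimator are placeholders, and GSPO proper also length-normalizes the sequence ratio, so treat this as a sketch of the idea rather than the implementation.

```python
# Sketch of a sequence-level clipped loss with a KL leash, in the spirit of the
# GSPO setup above. Group-normalized rewards act as advantages.
import torch

def clipped_loss_with_kl(logp_new, logp_old, logp_ref, rewards,
                         clip_eps=0.2, kl_coef=0.05):
    # logp_*: [group] summed log-probs of each sampled post under the current,
    # behaviour, and frozen reference policies. rewards: [group].
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # group-normalized advantage
    ratio = torch.exp(logp_new - logp_old.detach())             # sequence-level importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    kl_leash = (logp_new - logp_ref.detach()).mean()            # keeps the voice near the base model
    return policy_loss + kl_coef * kl_leash
```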

Before training, the model usually hedged and link-dumped and added a comical number of hashtags. After training it was clearly much better. It proposed stakes, hinted at novelty, and tagged sparingly. When we A/B tested the same URL, the trained outcome is the one you’d actually post. Example:

  • Before (the base model): 🚀 SpaceX's Starship successfully landed at Cape Canaveral! 🚀 #SpaceX #Starship #CapeCanaveral Landing 🚀 #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX

  • After (the trained model): 🚀 SpaceX's Starship has successfully landed at Cape Canaveral, marking a key milestone toward future missions. #SpaceX #Starship #landing #Mars

LLMs love adding hashtags to tweets. In the short runs, those didn’t entirely disappear, but they did reduce a lot. And they became better. Still, I admit I do have a soft spot for the first one for its sheer enthusiasm! Similarly, just for fun, here’s one about tariffs:

  • Before: A major retro handheld maker has stopped all U.S. shipments over tariffs… #retrohandheld #retrohandheld #retrohandheld #tariffs #trade

  • After: 🎮 A top retro handheld brand just paused U.S. shipments due to tariffs. Big ripple for imports, modders, and collectors. What’s your go-to alternative? #retrogaming #tariffs

But the most interesting part for us was that the pattern extends anywhere you have weak, messy signals, which is, well, most of real life. So the ideas here should theoretically also extend to other fields:

  • Creative writing: optimize for completion/saves; transfer from prior hits.

  • Education: optimize for retention/time-on-task; transfer from explanations that helped.

  • Product docs/UX: optimize for task completion/helpfulness; baseline by product area and release.

  • Research comms: optimize for expert engagement/citations; baseline by venue/community.

Take the raw data; normalize away obvious bias; transfer what worked via similarity however you want to calculate or analyse that; keep the loop numerically stable; and add small, legible penalties to deter degenerate strategies. And be extremely, extremely, vigilant about the model reward hacking. In subtle and obvious ways this will happen, it’s closer to crafting a story than writing a program. It also gives you a visceral appreciation of the bitter lesson, and makes you aware of the voracious appetite of these models to learn anything that you throw at them by any means necessary.

The next few years are going to see an absolute “managerial explosion” where we try to figure out better rubrics and rating systems, including using the smartest models to rate themselves, as we train models to do all sorts of tasks. This whole project is about the limits of current approaches and smaller models. When GPT-5 writes good social posts6, you can't tell if it learned general principles or just memorized patterns.

When a 500M model succeeds at a tiny task, all offline on your laptop where you mostly surf Twitter, it feels kind of amazing. Do check out the paper. It feels like intelligence truly can be unbounded, and you will soon have a cyberpunk world where models will be run anywhere and everywhere for tasks both mundane and magnificent.

1

After writing this we came across the recent Gemini 2.5 report, echoing the same instinct at a very different scale: tight loops that let models learn from imperfect, real interactions. Which was cool!

2

Note that “better” here does not only mean “optimize engagement at all costs.” Instead it’s the far more subtle “learn the latent rubric of what reads well and travels in this odd little medium.”

3

“It would be hard to learn much less than my pupils without undergoing a prefrontal lobotomy.”

4

Maybe an example can help. Ten people posted the same article about SpaceX. Normalize each author’s engagement by their baseline (e.g., 45 vs 20 → +25; 210 vs 200 → +10; 12 vs 5 → +7). Embed all posts. For a new candidate, compute cosine similarity to each and take max(similarity × normalized weight). If the best match has sim 0.82 and weight 0.9, reward ≈ 0.74. No live A/B; the signal comes from “be like the best thing that worked.”

5

Early training followed the classic arc: diverse exploration → partial convergence → collapse risk. With GSPO-style normalization, a small KL guardrail, and light penalties, the loop stays open and outputs nudge toward historical winners.

6

*If

Ads are inevitable in AI, and that's okay

2025-07-28 22:46:04

We are going to get ads in our AI. It is inevitable. It’s also okay.

OpenAI, Anthropic and Gemini are in the lead in the AI race. Anything they produce also seems to get copied (and made open source) by Bytedance, Alibaba and Deepseek, not to mention Llama and Mistral. While the leaders have carved out niches (OpenAI is a consumer company with the most popular website, Claude is the developer’s darling and wins the CLI coding assistant category), the models themselves are becoming more interchangeable amongst them.

Well, not quite interchangeable yet. Consumer preferences matter. People prefer using one vs the other, but these are nuanced points. Most people are using the default LLMs available to them. If someone isn’t steeped in the LLM world and watching every move, the model selection is confusing and the differences between the models sound like so much gobbledegook.

One solution is to go deeper and create product variations that others don’t, such that people are attracted to your offering. OpenAI is trying with Operator and Codex, though I’m unclear if that’s a net draw, rather than a cross sell for usage.

Gemini is also trying, by introducing new little widgets that you might want to use. Storybook in particular is really nice here, and I prefer it to their previous knockout success, which was NotebookLM.

But this is also going to get commoditised, as every large lab and many startups are going to be able to copy it. This isn’t a fundamental difference in the model capabilities after all, it’s a difference in how well you can create an orchestration. That doesn’t seem defensible from a capability point of view, though of course it is from a brand point of view.

Another option is to introduce new capabilities that will attract users. OpenAI has Agent and Deep Research. Claude has Artefacts, which are fantastic. Gemini is great here too, despite their reputation: it also has Deep Research, but more importantly it has the ability to talk directly to Gemini Live, show yourself on a webcam, and share your screen. It even has Veo3, which can generate videos with sound today.

I imagine much of this will also get copied by other providers if and when these get successful. Grok already has voice and video that you can show to the outside world. I think ChatGPT also has it but I honestly can’t recall while writing this sentence without looking it up, which is certainly an answer. Once again these are also product design and execution questions about building software around the models, and that seems less defensible than even the model building in the first place.

Now, if the orchestration layers will compete as SaaS companies did over consumer attraction and design and UX and ease and so on, the main action remains the models themselves. We briefly mentioned they’re running neck and neck in terms of the functionality. I didn’t mention Grok, who have billions and have good models too, or Meta who have many more billions and are investing it with the explicit aim of creating superintelligence.

Here the situation is more complicated. The models are decreasing in price extremely rapidly. They’ve fallen by anywhere from 95 to 99% or more over the last couple years. This hasn’t hit the revenues of the larger providers because they’re releasing new models rapidly at higher-ish prices and also extraordinary growth in usage.

This, along with the fact that we’re getting Deepseek R1 and Kimi-K2 and Qwen3 type open source models indicates that the model training by itself is unlikely to provide sufficiently large enduring advantage. Unless the barrier simply is investment (which is possible).

What could happen is that the training gets expensive enough that these half dozen (or a dozen) providers decide enough is enough and say we are not going to give these models out for free anymore.

So the rise in usage will continue but if you’re losing a bit of money on models you can’t make it up in volume. So it’ll tend down, at least until some equilibrium.

Now, by itself this is fine. Because instead of it being a saas-like high margin business making tens of billions of dollars it’ll be an Amazon like low margin business making hundreds of billions of dollars and growing fast. A Costco for intelligence.

But this isn’t enough for owning the lightcone. Not if you want to be a trillion dollar company. So there have to be better options. They could try to build new niches and succeed, like a personal device, or a car, or computers, all hardware-like devices which can get you higher margins if the software itself is being competed away. Even cars! Definitely huge and definitely being worked on.

And they’re already working on that. This will have uncertain payoffs, big investments, and strong competition. Whether it will be a true new thing or just another layer built on top of existing models remains to be seen.

There’s another option, which is to bring the best business model we have ever invented into the AI world. That is advertising.

It solves the problem of differential pricing, which is the hardest problem for all technologies but especially for AI, which will see a few providers who are all fighting it out to be the cheapest in order to get the most market share while they’re trying to get more people to use it. And AI has a unique challenge in that it is a strict catalyst for anything you might want to do!

For instance, imagine that Elon Musk is using Claude to have a conversation, the answer to which might well be worth trillions of dollars to his new company. If he only paid you $20 for the monthly subscription, or even $200, that would be grossly underpaying you for the privilege of providing him with the conversation. It’s presumably worth 100 or 1000x that price.

Or if you're using it to just randomly create stories for your kids, or to learn languages, or to write an investment memo, those are widely varying activities in terms of economic value, and surely shouldn't be priced the same. But how do you get one person to pay $20k per month and another to pay $0.2? The only way we know how to do this is via ads.

And if you do it, it helps in another way: it even lets you open up your best models, even if rate limited, to a much wider group of people. Subscription businesses are a flat edge that only captures part of the pyramid.

We can even calculate its economic inevitability. Ads have an industry mean CPC (cost per click) of $0.63. Display ads have click-through rates of 0.46%. If tokens cost $20/1M for completion, and average conversations run to 150 counted messages, with 400 tokens each, that means we have to make $1.9 or thereabouts in CPC to break even on the API cost. Now, the API cost isn’t the cost to OpenAI, but it means that for the same margins or better they’d have to triple the CPC.
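
A quick way to sanity-check that back-of-envelope, under one reading of the numbers (one ad impression per message, completion tokens only; both assumptions on my part):

```python
# Back-of-envelope for the paragraph above. Tweak the assumptions and the
# break-even number moves, but the order of magnitude holds.
tokens_per_message = 400
messages_per_conversation = 150
price_per_million_tokens = 20.0   # $ per 1M completion tokens
ctr = 0.0046                      # 0.46% display-ad click-through rate
industry_cpc = 0.63               # $ mean cost per click

conversation_cost = messages_per_conversation * tokens_per_message / 1e6 * price_per_million_tokens
expected_clicks = messages_per_conversation * ctr
breakeven_cpc = conversation_cost / expected_clicks

print(f"conversation cost ≈ ${conversation_cost:.2f}")   # ≈ $1.20
print(f"break-even CPC    ≈ ${breakeven_cpc:.2f}")       # ≈ $1.74, the ~$1.9 ballpark
print(f"multiple of industry CPC ≈ {breakeven_cpc / industry_cpc:.1f}x")  # roughly triple
```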

Is it feasible for token costs to fall by another 75%? Or for ads via chat to have higher conversion than a Google display ad? Both seem plausible. Long‑term cost curves (Hopper to Blackwell, speculative decoding) suggest another 3× drop in cash cost per token by 2027. And this isn’t just for product sales, but also for news recommendations or even service links.

And what would it look like? Here’s an example. The ads themselves are AI generated (4.1 mini) but you can see how it could get so much more intricate! It could:

  • Have better recommendations

  • Contain expositions from products or services or even content engines

  • Direct purchase links to products or links to services

  • Upsell own products

  • Have a second simultaneous chat about the existing chat

A large part of purchasing already happens via ChatGPT, or at least starts there. And even if you’re not directly purchasing pots or cars or houses or travel, there are books and blogs and even Instagram-style impulse purchases one might make. The conversion rates are likely to be much (much!) higher than even social media, since this is content, and it’s happening in an extremely targeted fashion. Plus, since conversations have a lag from AI inference anyway, you can have other AIs helping figure out which ads make sense and it won’t even be tiresome (see above!).

I predict this will work best for OpenAI and Gemini. They have the customer mindshare. And an interface where you can see it, unlike Claude via its CLI12. Will Grok be able to do it? Maybe, they already have an ad business via X (formerly Twitter). Will it matter? Unlikely.

And since we'll be using AI agents to do increasingly large chunks of work we will even see an ad industry built and focused on them. Ads made by AI to entice other AIs to use them.

Put all these together and I feel ads are inevitable. I also think this is a good thing. I know this pits me against much of the prevailing wisdom, which thinks of ads as a sloptimised hyper evil that will lead us all into temptation and beyond. But honestly, whether it’s ads or not, every company wants you to use their product as much as possible. That’s what they’re selling! I don’t particularly think of Slack optimising the sound of its pings, or games A/B testing the right upskill level for a newbie, as immune to the pull of optimisation just because they don’t have ads.

Now, a caveat. If the model providers start being able to change the model output according to the discussion, that would be bad. But I honestly don't think this is feasible. We're still in the realm where we can't tell the model to not be sycophantic successfully for long enough periods of time. People are legitimately worried, whether with cause or not, about the risk of LLMs causing psychosis in the vulnerable.

So if we somehow created the ability to perfectly target the output of a model to make it such that we can produce tailored outputs that would a) not corrupt the output quality much (because that’ll kill the golden goose), and b) guide people towards the products and services they might want to advertise, that would constitute a breakthrough in LLM steerability!

Instead what’s more likely is that the models will try to remain ones people would love to use for everything, both helpful and likeable. And unlike serving tokens at cost, this is one where economies of scale can really help cement an advantage and build an enduring moat. The future, whether we want it or not, is going to be like the past, which means there’s no escaping ads.

1

Being the first name someone recommends for something has enduring consumer value, even if a close substitute exists. Also the reason most LLM discourse revolves around 4o, the default model, even though the much more capable o3 model exists right in the drop down.

2

Also, Claude going enterprise and ChatGPT going consumer wasn’t something I’d have predicted a year and half ago.