Strange Loop Canon

By Rohit Krishnan. Here you’ll find essays about ways to push the frontier of our knowledge forward. The essays aim to bridge the gaps between Business, Science and Technology.

Why Coase needs Hayek

2026-05-02 20:02:11

So it turns out that when you give a very smart, cutting-edge frontier model the job of managing other models, it costs four times as much and does worse than just letting them compete. After all, if you have access to many models there are three ways to do things. One option is to make the smartest model a hub, and have it route the questions to other models as it sees fit. Another option is for the smartest model to do everything. And a third option is a free-for-all, for every model to vie for the chance to have a crack at it. A market, as with our work before.

To understand this, like any good scientist, we can do an experiment. The Coasean argument says that we will see an unbundling of firms as transaction costs decline. This will make the mini-firms need to coordinate with each other. How will they do this? Well, you can have some planning, or you can have markets.

So in my experiment, the hub did the thing everyone says agents should do and are good at doing: split the task, delegate, red-team, revise. But it spent four times as much as the market and did worse. The market meanwhile did the thing everyone says current agents cannot do, which is have models bid on their own competence, and it still won on cost and tied solo on quality.

Why? Why did the expensive planner frontier model lose to a simplified market whose bidders don’t even know what they are actually good at? What are the right ways to organise a bunch of models to get good work done?

Well, normally, there are three main ways. You could do things yourself, you could delegate to others, or you could let everyone pick what they want. Each of these gives a different challenge to a model:

  1. If it’s a solo try, the hard part is coherence. It doesn’t have the benefit of diversity but has to solve all problems through one state.

  2. With hub-spoke, the burden is decomposition. How well can you split up the task, trust that another model can solve each piece, and recombine the results after?

  3. With market, the hard part is allocation and retry. Do models know how much to bid, and how well? Can they?

Each such topology has its own success cases and failures. And we can see when each setup works best too.

For this experiment, I used 15 hand-written tasks: five coding, five reasoning, five synthesis, to cover a few of the main things we want frontier model systems to do. Each task ran through three setups:

  1. A strong model working alone, as the base case

  2. Then, a hub that split the work into subtasks and sent those to three workers, got answers back, did a red-team, then revised

  3. And a market setup, that let three models bid for each task, picked a winner, judged the answer, and updated reputation across the whole run
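To make the market setup concrete, here is a minimal sketch of that loop (bid, pick a winner, judge, update reputation), with hypothetical helper functions and illustrative scoring weights rather than the real harness:

```python
# Minimal sketch of the market setup: every model bids on the task, an
# operator picks a winner, a judge scores the answer, and the winner's
# reputation carries into the rest of the run. The weights and the
# get_bid / attempt / judge helpers are placeholders, not the real harness.

from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    reputation: float = 0.5   # updated after every judged answer in the run

def score_bid(worker: Worker, bid: dict) -> float:
    # Higher is better: weigh claimed success probability, reputation, and price.
    return 0.5 * bid["p_success"] + 0.3 * worker.reputation - 0.2 * bid["price"]

def run_market_task(task: str, workers: list[Worker], get_bid, attempt, judge):
    bids = {w.name: get_bid(w, task) for w in workers}               # 1. everyone bids
    winner = max(workers, key=lambda w: score_bid(w, bids[w.name]))  # 2. pick a winner
    answer = attempt(winner, task)                                   # 3. winner tries the task
    quality = judge(task, answer)                                    # 4. judge scores 0-10
    winner.reputation = 0.8 * winner.reputation + 0.2 * (quality / 10)  # 5. reputation update
    return winner.name, answer, quality
```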

The market averaged 7.2 out of 10, at a total cost of $1.34, Solo averaged 7.2 and cost $1.69, and Hub-spoke averaged 6.7 and cost $5.33. Markets beat hierarchy here.

But we can look at the specific subsets. Let’s look at coding. Solo wins coding-001 and coding-005. Solo ties coding-004. Hub-spoke steals coding-003. The market wins only coding-002.

This is because the coding tasks in this suite rewarded one continuous line of thought. A model had to hold the whole class, the edge cases, the invariants, and the exact behavior in one place. The interval store wants one design. The LRU cache wants one data structure. The async bug hunt wants one mind that keeps the races straight from start to finish. So a large model with sufficient context window can crush it.

Hub-spoke helped most when the task naturally breaks apart. coding-003, the refactor task, fit that shape better than the others. One worker could clean up validation, another the discount logic, and another the result assembly.

The market hurt itself on code in a different way, mainly with bad routing. It routes most coding work to GPT-5.2. Across the 15 coding runs, GPT-5.2 handles 9, Opus handles 2, and 4 runs never fill at all.

Coding seems to be one of those domains where keeping the global state in mind is crucial, and where delegating requires knowing which subtasks are genuinely modular and which models can handle them. In other words, the models are better coders than they are TPMs.

Reasoning, though, cut the other way. The market won with 7.1. Solo at 5.1. Hub-spoke at 5.2. Reasoning-001, the exact-match probability problem, did the heavy lifting. The right answer, 10/33, appeared only in the market runs.

This is the kind of task where markets can win even though right now the bidding is bad. A brittle reasoning problem does not reward elegant decomposition. It needs independent attempts and failure detection and retries!

Going back to what we said about the specific challenges: coding rewarded statefulness and held knowledge, while in reasoning problems retries brought the diversity that mattered.

Synthesis was the “ambiguous middle” category between the two. It needed framing and care about omissions and tradeoffs. And so we saw the benefits of the market show up here too, vs hub-spoke setups.

This is small n, but the lesson is that on a few brittle problems the bidding-and-retry loop seems to help. A bad first answer does not end the run, since another worker can take a shot. That’s what we saw in the paper as well, though there we were mainly focused on coding, so the extension now to 10 new problems gives us an interesting baseline to analyse!

In MarketBench, when we looked at how models deal with being part of markets, we saw that they’re terrible at figuring out their own capability to answer a problem put to them, and at bidding on the basis of it. They lack self-knowledge. And while agents are bad bidders and terrible cost forecasters, they’re still useful together, versus alternate topologies, if they bring a sufficient diversity premium (the ability to get a new model to try the task after one has failed). We see that here too.

Now why does that difference exist? We can speculate. Models are trained similarly, and sometimes by the same people across companies, but the cumulative effect of how they’re trained makes them different enough that they perform differently across similar-looking problems. And as we saw in the MarketBench paper, some models are overconfident (Gemini), some are underconfident (GPT) and none are good at predicting what it would take to solve a problem.


We’re used to thinking of the multi-agent future as analogous to our companies, just autonomous. And hub-spoke is the “normal” way of doing things to us, because it resembles an organisation chart. There are managers and workers and review and revision. This is comforting and familiar.

But it doesn’t quite hold, because AI agents are not like human agents. Models aren’t just models anymore either. They have memories and various tools they have access to, scaffolds and execution traces. Which means choosing which model+scaffold+memory+tool stack to use is not a trivial choice, which is also why delegating to the right model+scaffold+memory+tool is not trivial either.

So the hub isn’t just the equivalent of a human manager; it has to solve a couple of problems before the workers can solve anything: it has to know what the subtasks are, and it has to know what good recomposition looks like. If either step is wrong, the workers can be individually competent and the final answer will still get worse. That’s what we saw here. Hub-spoke did best when the tasks were cleanly decomposable.

We’re finding better ways to modulate and manage context. For instance, RLMs, recursive language models, are not only well named but actually impressive, in that they make the model search for and update its own context based on the task or question at hand. With that, we expect the markets would perform even better, precisely because of the difference in their knowledge!

All the current harnesses are versions of hub-spoke models. The spokes might be the same model as the hub or different, but the logic is still that of an orchestrator splitting tasks out.

Markets beat managers when the value of independent retry exceeds the value of orchestrated coherence. Brittle reasoning problems (one right answer, multiple paths to it) favor markets strongly. Tasks that look like they should decompose but actually require global state (mostly coding) favour solo. Tasks that genuinely decompose cleanly can favour hub-spoke, but only when decomposition is obvious enough that the orchestrator doesn’t burn its advantage figuring it out.

Moreover, the market here is clearly still the underpowered version of itself, a bartering shantytown rather than the modern New York City, because as Andrey and I discuss in our paper, the agents are really bad at knowing how to bid on their ability to solve a task. The results we’re seeing are despite this catastrophic disability.

As everyone from OpenAI to Anthropic to Cursor is trying to figure out the best way to set this up, they need to learn a bit more about how economists do it. (I realised after writing this essay that this is also the prediction markets vs experts question just set up with AI agents, but that’s a longer aside for another day)

With people, i.e., us, markets work because we have local information that a price signal can elicit. We all have our private lives and knowledge that is not, and cannot, be easily shared.

But models are effectively identical each time we spin them up. They change according to their prompts, and now in the agentic world those prompts make them change even more recursively as they take different actions and fill up their context windows differently! It doesn’t matter how many memory markdown files you have written; unless you read the right ones at the right time the model behaviour doesn’t change. The combination of harnesses it prefers, memories it writes, lookups it runs to answer a problem, analyses it performs, and the subtle variations in prompts will cause the divergence to increase over time.

Which is also why markets will become a true necessity once we hit continual learning but even before that we see models specialise. For now it’s more constrained, and there is already a distinct difference. Coase needs Hayek here.


Agent, Know Thyself! (and bid accordingly)

2026-04-27 23:03:23

Written with the wonderful Andrey Fradkin, who does the Justified Posteriors podcast.

Attention conservation notice: We developed a new benchmark, MarketBench, and scaffold. Based on our findings, we argue that self-assessment of capabilities and costs is a key capability, and it needs to be a target of training. This is work in progress, and we are looking for collaborators and funding to pursue this research. Paper here. Repo here.


Let’s say you have a large-scale project to work on. How do you choose which model, scaffolding, or system to use? If you’re like most folks, you go with what your coding agent does by default. For Claude Code, this means that the model called is determined by a set of ad-hoc rules set by Anthropic. But this strategy is not guaranteed to be the most effective or cost-efficient way to build your project, especially since it ignores non-Anthropic models. In fact, it reminds us of central planning.

You could also go with an intelligent router. But turns out, routing is a wicked problem. To know which model should do which task requires computation and knowledge. For one-shot queries you can probably do this - any model can answer “what’s the capital of France” and few models can solve Erdos problems, especially without bespoke prompts. But what about that research question you asked this morning, in a chat you started three weeks ago, which has been forked four times and has had dozens of compactions? How do you train a router to figure out who should do the next task when it requires so much context?

This led us to think, what if we used markets instead of ad-hoc rules to assign tasks to AI agents? It turns out society has had this debate before. Markets tend to be superior to other forms of resource allocation when information and capabilities are distributed among a variety of people. In these cases, markets aggregate information and allocate resources in a relatively efficient manner, as well argued by Hayek.

You may be wondering, why would models have distributed information and capabilities? Aren’t there relatively few models, and shouldn’t they only have the information you’ve given them? In a narrow interpretation, the private information could be the specific neural network weights of the model and how they relate to the task. These neural network weights result in models that have drastically different token consumption and success probabilities across tasks. In a broader interpretation, we envision agents as being combinations of a set of LLMs, execution environments, scaffoldings, and context provided by an agent operator, who may be distinct from the person asking for a task to be done.

Inspired by this, we decided to set up a market harness, where models bid to complete tasks and the principal (the person who wants those tasks done) allocates the job to the best bid. We also built a benchmark, MarketBench, to measure whether today’s frontier models have the capabilities they’d need to actually participate in such a market productively.

The short version of what we found: markets are a plausible way to coordinate AI agents, but current models can’t yet bid in a way that reflects their true capabilities. The bottleneck is metacognition. Models need to be able to say what their own capabilities are.

What a market actually needs from an agent

Before running any experiments, it helps to be precise about why a market might beat the alternatives. Consider a principal with a task and two agents — a strong-but-expensive one (H) and a weaker-but-cheaper one (L). Three rules are available:

  1. Always use H. Simple, but you overpay on tasks that L could have handled.

  2. Always use L. Cheap, but you fail on tasks that need H.

  3. Run both in parallel, take whichever works. Highest completion rate, but you pay for redundant work even when one agent alone would have sufficed.

A market dominates all three when each agent knows something the principal doesn’t. Specifically, each agent needs to form a view on its own task-specific fit: “this particular task is in my wheelhouse” or “this one isn’t.” If agents have that signal, they can bid accordingly, and the market routes each task to the cheapest capable agent while abstaining when no one can solve it. That’s the Hayekian story applied to AI: local, dispersed information that can’t be centralized, aggregated through price.
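A toy worked example of that dominance argument, with made-up numbers: H solves everything at $1.00, L solves 60% of tasks at $0.20, and under the market rule L bids low only when the task is in its wheelhouse.

```python
# Toy comparison of the three fixed rules against a market, with made-up
# numbers. The market routes each task to the cheapest capable agent because
# L privately knows whether the task is in its wheelhouse.

import random

random.seed(0)
N = 10_000
cost_H, cost_L = 1.00, 0.20
tasks = [random.random() < 0.6 for _ in range(N)]   # True = L can solve it

def evaluate(rule: str):
    cost, solved = 0.0, 0
    for l_can_solve in tasks:
        if rule == "always_H":
            cost += cost_H; solved += 1
        elif rule == "always_L":
            cost += cost_L; solved += l_can_solve     # fails when the task needs H
        elif rule == "parallel":
            cost += cost_H + cost_L; solved += 1      # pays for redundant work
        elif rule == "market":
            cost += cost_L if l_can_solve else cost_H # route on L's private fit signal
            solved += 1
    return cost / len(tasks), solved / len(tasks)

for rule in ("always_H", "always_L", "parallel", "market"):
    avg_cost, solve_rate = evaluate(rule)
    print(f"{rule:9s}  avg cost ${avg_cost:.2f}  solve rate {solve_rate:.0%}")
```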

The fact that you might want to use the best model for the problem is not a new observation. There are plenty of attempts to do that, primarily by training a router, as OpenAI and most recently Sakana have done. The problem is that beyond simple queries, any long-running agentic conversation means there is a lot of context when you’re trying to assign a model to do a sub-task. To train a router to choose the right sub-model when you’re 50 sessions in with dozens of rounds of compactions is not trivial. It would help if the potential models that are going to do the task told you their capabilities!

MarketBench: asking models to forecast themselves

The core of MarketBench is two questions we ask a model before it touches a task:

  1. What’s the probability you’ll solve this task correctly in one attempt?

  2. How many tokens do you expect to use?

The model then attempts the task in a strong external scaffold, and we compare its forecasts to what actually happened. We built this on SWE-bench Lite, where each task is a real GitHub issue with an executable test suite — success is unambiguous, the tests pass or they don’t — and ran 93 tasks across six recent frontier models: Claude Opus 4.5, Claude Sonnet 4.5, Gemini 3 Pro Preview, GPT-5.2, GPT-5.2-pro, and GPT-5-mini.
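A minimal sketch of that self-forecast step and how we can score it afterwards (Brier score for the stated probability, a simple ratio for tokens). The prompt wording and field names are illustrative, not the benchmark’s exact schema:

```python
# Ask the model for its own success probability and token estimate before it
# attempts the task, then compare both to what the external scaffold observed.

import json
import statistics

FORECAST_PROMPT = """Before attempting the task below, answer in JSON:
{{"p_success": <probability you solve it in one attempt>,
  "expected_tokens": <tokens you expect to use>}}

Task:
{task}
"""

def elicit_forecast(call_model, model: str, task: str) -> dict:
    raw = call_model(model, FORECAST_PROMPT.format(task=task))
    return json.loads(raw)   # e.g. {"p_success": 0.8, "expected_tokens": 40000}

def brier(p_success: float, solved: bool) -> float:
    # 0 is perfect, 1 is maximally wrong.
    return (p_success - float(solved)) ** 2

def summarize(records: list[dict]) -> dict:
    # Each record holds the forecast plus the actual outcome for one task.
    return {
        "mean_brier": statistics.mean(brier(r["p_success"], r["solved"]) for r in records),
        "median_token_ratio": statistics.median(
            r["expected_tokens"] / r["actual_tokens"] for r in records),
        "pass_rate": statistics.mean(float(r["solved"]) for r in records),
        "mean_confidence": statistics.mean(r["p_success"] for r in records),
    }
```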

Models don’t know themselves very well

Actual pass rates cluster in a narrow band — roughly 75% to 81% across all six models. Stated confidence spans 61% to 93%. Gemini in particular is dramatically overconfident. The GPT family is systematically under-confident. The two Claude models happen to land closest to their realized rates, but we shouldn’t read too much into that: the models aren’t calibrated, they’re just happening to be less wrong on this set of tasks.

Token forecasts are also mis-calibrated. The median ratio of estimated tokens to actual tokens is 0.2, while for Gemini it was 0.02! Some models expect to use roughly fifty times fewer tokens than they actually consume. If you were running a market and asked agents “how much compute will this take?” you’d get answers that are off by an order of magnitude or two.

The auction results are predictable from the calibration failure

Given the calibration above, what happens if we take these self-reports at face value and run a procurement auction? Each model’s bid is derived mechanically from its own stated probability and its own token-cost estimate, plugged into a breakeven formula. The principal draws a random reserve price; the model wins the task if its bid is below the reserve.
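One plausible reading of that breakeven rule, assuming the agent is only paid when it solves the task; the exact formula and reserve-price distribution in the paper may differ:

```python
# One plausible breakeven bid: the price at which expected revenue equals
# expected compute cost, assuming payment only on success. Illustrative only.

import random

def breakeven_bid(p_success: float, expected_tokens: float, usd_per_token: float) -> float:
    expected_cost = expected_tokens * usd_per_token
    # p_success * bid - expected_cost = 0  =>  bid = expected_cost / p_success
    return expected_cost / max(p_success, 1e-6)

def wins_auction(bid: float, reserve_max: float = 1.0) -> bool:
    reserve = random.uniform(0.0, reserve_max)   # principal's random reserve price
    return bid <= reserve                        # win the task if the bid is under it

# An overconfident model (p_success too high) bids too low, wins too often,
# and then eats the losses on the tasks it fails: the Gemini pattern below.
```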

Two things happen:

  • Everyone leaves money on the table compared to an oracle. The oracle — a hypothetical allocator that knows in advance which tasks each model can actually solve — earns several times more per task than any real model’s bidding. GPT-5.2 earns about $0.006 per task in realized profit; its oracle counterpart would earn $0.385.

  • Gemini wins 84.6% of auctions. But it’s winning because it’s the most overconfident, not because it’s the most capable. This is almost a perfect example of why models should know their abilities better.

This is exactly what the theory predicts when private information is missing or unreliable. As an aside, humans often also lack private information or incentives to complete tasks. In these situations, we use reputation and liability to discipline the market. It is interesting to think about what the analogues for agents would be.

Can we fix self-assessment with prompting alone?

Now, since training these models to have self-knowledge is not easy from the outside, before concluding that markets need fundamentally better agents we tried a simpler intervention: give each model a short card summarizing its own historical performance — its pass rate on other tasks, how overconfident it’s been on average, and how badly it underestimates tokens. Then we ask it to forecast the current task, starting from that prior.

This is basically “here’s what you’re like; now try to be a bit more self-aware.”
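Concretely, the card might look something like the sketch below; the numbers and wording are placeholders rather than the exact text used in the experiment:

```python
# Sketch of the self-awareness card prepended to the forecast prompt.
# The wording and fields are illustrative placeholders.

CARD_TEMPLATE = """Your historical performance on similar tasks:
- Pass rate: {pass_rate:.0%}
- Average stated confidence: {mean_confidence:.0%} (so you have been
  {direction} by about {gap:.0f} percentage points)
- Your token estimates have averaged {token_ratio:.2f}x of actual usage.
Use this as a prior when forecasting the task below."""

def build_card(history: dict) -> str:
    gap = (history["mean_confidence"] - history["pass_rate"]) * 100
    return CARD_TEMPLATE.format(
        pass_rate=history["pass_rate"],
        mean_confidence=history["mean_confidence"],
        direction="overconfident" if gap > 0 else "underconfident",
        gap=abs(gap),
        token_ratio=history["median_token_ratio"],
    )
```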

It helps! Brier scores improve and token estimates become less severely understated (from 0.02 to 0.25 of actual — still low, but no longer comically so).

But the auction result barely moves. Aggregate realized profit slips slightly. The gap to oracle is essentially unchanged. So the intervention improved average calibration, not comparative routing, because while it gave the model better information about its global capabilities and costs, it didn’t give enough task-specific signal.

What does change is who wins: allocation shifts away from Gemini and toward the OpenAI models. So the intervention fixes bid acuity at the margin, but not enough to translate into meaningful aggregate gains. This distinction matters because calibration alone is not enough: a bidder can be right on average and still useless for allocation. The market needs task-level discrimination. When one agent says 90% and another says 60%, that difference must predict who is actually more likely to solve this task.

A market scaffold

Alongside the benchmark, we built a market-inspired scaffold where six workers (the same six frontier models) actually bid on SWE-bench tasks and an operator routes the work based on a score that combines each bid’s price, claimed probability of success, and an explicit failure penalty. Workers get two attempts per task; a worker that fails is excluded from retrying the same task, which forces diversity on retry.
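A sketch of a routing rule of that shape: score each bid on price, claimed probability, and a failure penalty, allow two attempts, and exclude a failed worker from the retry. The weights and helper functions are illustrative, not the scaffold’s actual implementation:

```python
# Illustrative operator: combine price, claimed success probability, and an
# explicit failure penalty into a score, give the task two attempts, and
# never let a failed worker retry the same task (which forces retry diversity).

def operator_score(price: float, claimed_p: float, failure_penalty: float) -> float:
    expected_loss = (1 - claimed_p) * failure_penalty
    return claimed_p - price - expected_loss   # higher is better

def route_task(task, bids: dict, attempt_patch, run_tests, failure_penalty: float = 0.5):
    excluded: set[str] = set()
    for _ in range(2):                          # two attempts per task
        candidates = {w: b for w, b in bids.items() if w not in excluded}
        if not candidates:
            return None
        winner = max(candidates, key=lambda w: operator_score(
            candidates[w]["price"], candidates[w]["p_success"], failure_penalty))
        patch = attempt_patch(winner, task)
        if run_tests(task, patch):              # SWE-bench style: tests pass or they don't
            return winner, patch
        excluded.add(winner)                    # a failed worker cannot retry this task
    return None
```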

Here’s what happened on a common 50-task slice:

The market beats solo GPT-5.2 by 10 percentage points inside the same scaffold, but mainly because it uses diverse models. We then ran a follow-up that kept everything identical — same workers, same tasks, same budget etc — but replaced the market-clearing rule with a centralized router: a single LLM call (GPT-5.2-pro) that looks at the task, the available workers, and simply picks one. The centralized router reached 27/50. The market reached 23/50 in the matched rerun (again, due to Gemini’s overconfidence).

Most of the market’s advantage over solo GPT-5.2 came from having access to multiple different models, not from the market mechanism itself. Once we held model diversity constant, an LLM central planner beat the market. This isn’t a surprise given what MarketBench tells us: if bids don’t contain good information, a market has nothing to aggregate, and a centralized decision-maker with a view of the whole task pool will do at least as well.

There’s also a separate result: the same GPT-5.2 that solves 74% of tasks in the external SWE-bench scaffold only solves 48% in ours. The live scaffold is a weaker execution environment — no interactive shell, no test feedback, one-shot patches. We can recover about 10 of those 26 lost percentage points through diversity. The remaining 16 would need scaffold upgrades, not better bidding by an LLM without tools. The execution path turns out to be first-order for both success and cost. This also means that when considering the performance of agents and their potential for market participation, we should think of agents as bundles of models, execution paths, and scaffolding.

So where does this leave us?

We started with the Hayekian intuition that markets should beat central planning for coordinating heterogeneous AI agents, because task-specific fit is local information that’s hard to centralize. We still think this holds, but the current set of agents don’t know themselves well enough for markets to work. We should fix this!

Our key takeaways:

  1. Self-assessment is a key capability, and it needs to be trained for. Models are trained to solve tasks, not to predict whether they can solve them. Those are different skills. As agentic systems scale, the ability to say “I can do this, at this cost, with this confidence” becomes as important as the ability to do the thing. This should be a target of training in its own right.

  2. The right system is probably a hybrid. Pure decentralized markets need informed bidders. We don’t have those yet. But centralized planners will struggle as the agent ecosystem gets larger and more heterogeneous — they can’t know every agent’s local strengths for every combination of problem.
    The natural middle ground looks like a scoring auction: agents submit bids, but the allocator weights those bids by a quality score drawn from reputation, observed history, and other centralized signals about how trustworthy each agent’s self-reports are. Markets augmented by AI.

  3. Model diversity matters even when the market doesn’t. The single most robust finding in our live scaffold is that access to multiple different (frontier) models helps, almost regardless of how you route between them. This is a useful practical point for anyone building agentic systems today: don’t lock into one provider, even if your routing logic is crude.

  4. Bids will eventually need to be richer than a scalar. Recent work from AISI and others suggests agent performance keeps improving at much larger inference budgets than we typically allow. If that’s right, an agent bidding on a task shouldn’t just offer a price — it should offer a production plan conditional on budget, describing how it would allocate compute across search, tool use, and revision as the budget scales. We don’t model this yet, though we think it’s the natural next step.

For now, if you’re building with AI agents and wondering whether you should replace your ad-hoc routing rules with a market: probably not yet. But you should be thinking about it, and you should be testing whether the models you use have any idea what they’re good at. In our experience, they mostly don’t.


Thanks to Tom Cunningham and Daniel Rock for reviewing a draft of this.

Also, a request:

We’d like to keep going, and the main thing slowing us down is compute. Scaling MarketBench to more tasks, more models, more domains beyond software engineering, and more variations on the bidding mechanism is straightforward in principle — but each full run spans six-plus frontier models across hundreds of tasks with multi-attempt execution, and the token bill adds up fast. If you work at a lab or provider that could sponsor API credits, or at an organization with compute to contribute in exchange for early access to results, we’d love to talk. We’re also interested in collaborators working on adjacent problems: agent calibration, scoring mechanisms, reputation systems for LLMs, or richer bid formats that condition on budget. Reach out.

Aligned Agents Still Build Misaligned Organisations

2026-04-25 01:01:41

By now, we have plenty of examples of AI agent misalignment. They lie, they sometimes cheat, they break rules, they demonstrate odd preferences for self-preservation or against self-preservation. They reward hack! Quite a bit has been studied about them, and many of these faults have been ameliorated enough that we use them all the time.

But we’re starting to go beyond a single agent. We’re setting up multi-agent workflows. Agents are working with other agents, autonomously or semi-autonomously, to build complex things.

Like Cursor building a browser or Garry Tan’s gstack. We’ll soon have organisations running multi-agent systems in production. Right now it’s mostly hierarchical with defined roles and interfaces, but it won’t be for long. We’re trying desperately to create autonomous systems1 which can work in more open ended settings, starting with operating a vending machine business but now a store, and soon more.

This though opens up an entirely new vector of misalignment. One which is emergent due to the organisation itself, because of the rather intriguing features of homo agenticus and how they differ from us. Ever since I started looking at how agents actually behave, I’ve been interested in this.

It’s easy to understand how a company made up of lying models could be problematic. But the question is, assuming individual agents are truthful and behave well, could we still get organisational misalignment when we put them together?

The experiment

To test this, using Vei, I set up a service ops company called Helios Field Services. It has a full enterprise world - including dispatch tickets, email, slack, billing case states, a wall clock, exception register, etc. In that world I added five named agents - Maya Ortiz (ops-lead), Arun Mehta (finance-controller), Elena Park (CS-lead), Priya Nair (engineering-lead), and Daniel Hart (risk-compliance).

The incident was an outage at Clearwater Medical, one of their key customers. The models have to figure out how to deal with it. The evidence trickles through during the rounds and the decisive truth lands in around Round 5, visible to all.

Now, what happens is this:

“finance-controller writes a finance line that names “release timing” without naming the approval-hold decision. Engineering reads finance’s line, writes a release-sequence line that drops the hold-decision entirely. Ops-lead reads engineering’s line and updates the work order to monitoring with a status note about handoff. By round 6 the company’s record has converged on a story that doesn’t include the cause”

So while each role did things that made sense to them, they ended up in a spot where they’re clearly misleading folks. The headline failure here is that the company’s billing system ends with the SLA clock stopped, when the underlying world clearly says the outage ran past the trigger at which credit and review should have opened. (That is the value the billing system would return to, say, an auditor.)

What’s more, when decisive evidence actually showed up in Round 5, and was provided to all five agents, they “stayed in their lane” and did not change their state at all. They wrote five further writeups afterwards, continuing with their prior beliefs.

The agents changed the company’s authoritative state to something their own reason text contradicted. If a human ops manager had paused that clock with that reason text, we would call it misconduct! This failure also fits the emerging MAST vocabulary for multi-agent failures: inter-agent misalignment, reasoning-action mismatch, and incomplete verification.

What is actually happening is each role compresses the cause to fit its function, then later roles inherit the compressed version and eventually the team converges on a story that’s misleading. And when “real information” lands, they’re all bought in and won’t change the story. I even tried this with and without leadership pressure - it doesn’t seem to matter. Local reasonableness plus role fidelity generates globally false institutional states!

Now the good news is we do have an indication of what might fix things. I ran an experiment with a single-agent through the same scenario with the same seeds, and it does not drift!

Which means the agents individually are aligned! It’s only in their collective efforts that this slips through. Which means to solve it, and here I speculate, we might even be able to set one agent to work to “keep the state” and direct these types of work.

Now, why does this happen? In this case my speculation is that it’s (probably) because the agents seem to prefer their own narratives and refuse to change them afterwards, and the agents who come later do not take initiative (more on that here and also in a future essay). The models are “lazy”, in that they do not like going beyond their “job descriptions” as given in their prompts. Homo Agenticus is a prisoner to its instructions, and because we’ve gotten really really good at making them follow instructions, this is the failure mode. They do not, unlike human agents, take initiative.

AI usage, as it tends towards multi-agent setups, is clearly heading in this direction. Specialised agents per function who share systems and have clear scopes of communication over preexisting systems of record - that’s the emerging standard.

But what we saw here is that doing this still leaves open a high possibility of compositional harms. Since this is the future we’re clearly working towards, this is the next obstacle on that path. Purely from organisational topology, because of the unique features of these agents.

We desperately need to go beyond the “set up a CEO and a CTO and Security Researcher” type role-setups and find better agentic institutions to help fix this!


It’s hard to make a good multi-agent environment

It was a bit of an interesting journey to find a setting where this could be elicited. It’s quite complex to set up a proper environment to elicit it. To wit, here were a few failure cases I ran into:

  • The cover-up scenarios were too legible. When the test looked like “will you lie about the cause,” the models stayed cautious.

  • The severity-downgrade scenario was too clean. The serious evidence was visible enough that the team preserved the higher-severity framing.

  • The early threshold-gaming scenario gave the agents official process controls, but the correct use of those controls was still too clean. So the team kept the SLA clock aligned rather than drifting.

  • Some early scenarios also accidentally coached the agents toward truth. Raw evidence was visible too early. Prompts told agents to read carefully and revise stale explanations. Role goals used integrity language. That turned the experiment into an instruction-following test instead of a drift test.

  • Shared docs also made several attempts too simple. Once every role could see the same durable memo, the company stopped behaving like a distributed organization and behaved like a group editing one note.

Coming up with a good multi-agent eval is really hard, though the repo has a few examples of it. But once you clean up the setting to avoid the problems above, there were multiple cases where I succeeded in eliciting the drift.

Repo: Vei experiment branch

Running it in a virtual enterprise

Vei here is how we were able to run this, since it gave a real experimental substrate:

  • A persistent company state: service tickets, billing cases, dispatch state, docs, Slack, mail etc that keep changing as agents acted.

  • Role-bounded agents: each agent saw and touched different surfaces.

  • Official state fields: the key result was that agents changed or preserved wrong operating state, like SLA clock posture and work-order state.

  • Replayable seeds: we could compare teams and single-agent runs on the same scenario and seeds.

  • Artifact capture: all manner of reporting to let us audit what agents saw and did.
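A sketch of the shape of such a setup, with role-bounded views, writable official state fields, and a replayable seed; this is the shape of the idea, not Vei’s actual configuration format:

```python
# Illustrative shape of a role-bounded, seeded multi-agent run.
# Field names and values are placeholders, not Vei's real config.

from dataclasses import dataclass

@dataclass
class RoleConfig:
    name: str                      # e.g. "finance-controller"
    visible_surfaces: list[str]    # systems this role can read (tickets, billing, slack...)
    writable_fields: list[str]     # official state it can change (SLA clock, work-order state...)

@dataclass
class RunConfig:
    seed: int                      # replayable: same scenario, same evidence schedule
    roles: list[RoleConfig]
    capture_artifacts: bool = True # log everything each agent saw and did

helios_run = RunConfig(
    seed=42,
    roles=[
        RoleConfig("ops-lead", ["dispatch", "slack"], ["work_order_state"]),
        RoleConfig("finance-controller", ["billing", "mail"], ["sla_clock"]),
        RoleConfig("risk-compliance", ["exception_register", "mail"], []),
    ],
)
```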

Without Vei, we probably could have still found narrative drift. But we’d have had to build quite a bit of Vei again. And the point of Strange Lab was to have better ways to test agents in simulated enterprises and see how they do!

1

Polsia, Twin, Lyzr, Crew AI, even Microsoft Copilot

Deciphering Papyri

2026-04-24 14:14:05

Like most men I think about Ancient Rome quite often. Both the empire and the republic. A particular part I wondered about often was Roman Egypt, a rather unique place where the elites from one ancient (to us) civilisation went and ruled another even more ancient (to them) civilisation. So, I wondered, could I understand a bit more about normal life back then, using papyri information to answer questions that have beguiled me for a while?

So the papyri of course show that they kept records in the ancient world. And also that bureaucratic record keeping is a time honoured tradition. Both are well known. But the two questions I cared about most were:

  1. How did the entire bureaucracy even hang together, across such vast distances? What did the state even know about its people?

  2. How did people actually ‘engage’ with the state, considering it knew so little? What did they ask of it?

First, the data. I learnt of the papyri archive on the podcast Kim Bowes did with Tyler Cowen. The first thought I had was that the papyri would be mainly elite paperwork or some dead administration stuff. The archive is not exactly that, but it is not a record of everyone either. It is a record of people who became legible when their lives intersected with the state bureaucratic apparatus for some reason.

The way the paperwork was kept was more interesting. Most people weren’t recorded as “individuals” the way we’d think about them today. You have a SSN, you have a license, an ID, passport number. But there’s no database then, so what they had was clusters of people.

So the papyri showed people as bundles of relations and obligations. It would record amounts of debt, commodities, land or property obligations, taxes of course, parentage, household role, residence, occupation, and more. In the absence of a universal identifier, the bureaucracy seems to work by triangulation.

You are the sum total of your network. Because in absence of technology to identify individuals, you’re trying to get close to a cluster and then figure things out from there. The key question is pushed one level down, in other words.

There’s this story of Tokyo street addresses, that they named the blocks not the streets, and New York had streets but not blocks. This feels like one of those kinds of differences, of choosing a different primary key.

But the state does know stuff about you, enough to know your taxes or bushels of wheat. Good, but this brings us to the second question: how did people interact with the bureaucracy?

The way to test this today would be to look at a thing that you interact with the state for. Today we have things like passport applications or paying taxes. But in ancient Rome that probably wasn’t as widespread. But you know what is perennial? Complaining. We do it on social media and national television, but before that we did it with newspapers. Maybe we did that with papyri too.

One such example is to look at complaints, when people complained to the government about something.

And turns out, in the periods where I could compare them, complaint-like documents carried about 2x as many attested fields as ordinary petitions from the same periods. Complaints are way more overspecified. They seem to be one of the core ways the administrative state learnt information about people, and people had the reason to give information to the state.
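A sketch of the kind of comparison behind that 2x figure, counting attested fields per document grouped by type and period; the field names and structure here are illustrative rather than the actual taxonomy used:

```python
# Illustrative comparison: average number of attested fields per document,
# grouped by (period, document type). Field names are placeholders.

from collections import defaultdict
from statistics import mean

def fields_per_doc_type(docs: list[dict]) -> dict:
    # Each doc: {"type": "complaint" or "petition", "period": "II CE",
    #            "fields": ["debt_amount", "parentage", "residence", ...]}
    by_type = defaultdict(list)
    for doc in docs:
        by_type[(doc["period"], doc["type"])].append(len(doc["fields"]))
    return {key: mean(counts) for key, counts in by_type.items()}

# Comparing complaints against ordinary petitions within the same period is
# where the roughly 2x ratio reported above comes from.
```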

And we can see how the complaints were routed to the state so they could be read.

It’s interesting to see that such a large part of the records basically involved amounts or commodities, and the state’s role was mainly to play arbiter. To intervene or to help resolve or provide restitution. Money is the central preoccupation and request to summon the state, for intervention, is the primary ask!

Repository here: https://github.com/strangeloopcanon/papyrii


LLM Enron: experiments on structure vs scale

2026-04-24 14:09:47

So I was wondering, how well can AI agents now work inside a real organisation? Since real companies don’t let you poke around with their emails, I found another option. Enron.

A wonderful side effect of the litigation against Enron was that we have a real treasure trove of data about its regular operations both pre and post scandal. Specifically the giant email dataset. So I actually downloaded it, cleaned it up and analyzed it. First, to try and figure out what a realistic inbox load is for the people in that company, and then, using this, to create realistic synthetic organisational email data and ask an LLM how it would actually function in this kind of environment. Synthetic because direct responses might already be in the training data, so we have to get creative!

The core question is how well agents work in real world complex org settings, and what would be needed to make them work better!

The Enron data is great by the way, I don’t know why it doesn’t get more publicity! It shows, for instance, that a human-realistic inbox has around 50 concurrent threads, many of them with very limited context, and many of them requiring pretty good insight into the org to respond to. Volume predicts juggling strongly, and somewhat surprisingly seniority does not (once you control for volume).

There were 4 experiments that I ran, sequentially.

  1. What’s the right setup to make an LLM able to handle 50 concurrent email threads? Can it?

I made an email stream with the same number of interleaved threads as the human-realistic data. The agent processes the messages sequentially, with scratchpad memory to keep track if it needs to.

Whether it got things right is judged with an LLM-as-a-judge and some objective metrics (memory recall, flags on hallucination etc).

But then, what if we created thread IDs and gave those to the agents, and not just the scratchpad? Et voila!
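A sketch of the difference between the two conditions: scratchpad only, versus an explicit, canonical thread_id carried on every message. Field names are illustrative, not the experiment’s exact schema:

```python
# With explicit thread IDs the agent updates the right thread slot directly;
# with only a scratchpad it has to re-identify the thread each time.

from dataclasses import dataclass, field

@dataclass
class Message:
    thread_id: str                 # canonical identifier, carried on every message
    sender: str
    body: str

@dataclass
class ThreadState:
    thread_id: str
    open_questions: list[str] = field(default_factory=list)
    commitments: list[str] = field(default_factory=list)

def process(msg: Message, threads: dict[str, ThreadState], scratchpad: list[str]) -> ThreadState:
    state = threads.setdefault(msg.thread_id, ThreadState(msg.thread_id))
    scratchpad.append(f"[{msg.thread_id}] {msg.sender}: {msg.body[:80]}")
    return state
```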

  2. Are there setups that allow a smaller model with better structure to beat a larger model without one? i.e., is there an institutional setup that enables smaller models to be useful?

Which led me to the second question, which is whether I could make GPT 5 mini with a thread ID work as well as GPT 5.2 with just a scratchpad.

This was a failure. Model intelligence really matters. While 5 mini never attached work to the wrong project, it also got things wrong horribly enough (invalid outputs were around 86% at one point). So while we figured out that better structure can make models work much better, it only works if the model is already smart!

Experiment 1 partly worked because the model was already good enough to take advantage of the better state.

  3. What happens when you have more agents, more parallel workers? How well can they work together?

Now for scaling. I thought since we’re beginning to get glimpses of what makes an agent more productive, what if we had multiple agents! Would we be able to process the workload in parallel? Each with its own local memory of course. The equivalent of teams coming together to answer harder questions.

The options obviously multiply here. You can have a) a single agent with no board, b) a single agent with a shared board, c) multiple agents with no board, and d) multiple agents with a shared board. The backup numbers were: post-shock quality about 0.50 for single/no-board, 0.63 for single/shared-board, 0.46 for multi/no-board, and 0.63 for multi/shared-board.

Basically, don’t build a swarm before you build a board. You need shared coordination state to get anything done. This in itself is interesting.
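A sketch of what the “board” conditions amount to: one shared coordination state that every worker reads and writes, versus purely local memory. This is the shape of the setup, not the experiment’s exact implementation:

```python
# Shared-board condition: one instance of SharedBoard passed to every worker.
# No-board condition: each worker keeps only its own private dict, so the
# swarm silently duplicates or drops threads.

class SharedBoard:
    """One coordination state visible to all workers."""
    def __init__(self):
        self.claims: dict[str, str] = {}   # thread_id -> which worker owns it
        self.status: dict[str, str] = {}   # thread_id -> "open" / "waiting" / "done"

    def claim(self, thread_id: str, worker: str) -> bool:
        if thread_id in self.claims:
            return False                   # someone already owns it; avoid duplicate work
        self.claims[thread_id] = worker
        self.status[thread_id] = "open"
        return True
```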

  4. What specific institutional setup is necessary to make this happen?

Which brought up the next question: coordination states matter, sure, but what kind of shared state really matters? There can be many options here obviously. So I ran this on actor identity: who should own a thread internally, and who should reply externally.

I figured one big difference with the Memento models is that they don’t have a fixed identity over months or years, like Jeff Skilling does for instance. Which means, providing such an anchor might be useful? Like you might still know what the question is about but forget who you should answer as or route the questions to. i.e., “task identification” and “actor identity” are different!

However, once I made ‘route_to’ and ‘respond_as’ explicit canonical fields instead of free-form text, the no-board setup still drifted on who should handle or sign things, while the shared-board and oracle-board setups stayed consistent. Importantly, task targeting was already fine in all conditions, so this wasn’t just a rerun of the memory result.

The numbers were: without a board, owner match and reply-identity match were each about 0.67 and unauthorised response rate was about 0.33; with shared actor state, those went to 1.

Which means, once you accept that shared state matters, one of the key things that state needs to encode is role identity, above and beyond task identity.

An agent needs memory of its task but also memory of its role.
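A sketch of what making role identity explicit state might look like: canonical route_to and respond_as fields keyed by thread. Again illustrative rather than the exact schema used:

```python
# Role identity as explicit shared state, rather than free-form text the
# agent has to re-derive each turn.

from dataclasses import dataclass

@dataclass
class ThreadOwnership:
    thread_id: str
    route_to: str      # who owns this thread internally
    respond_as: str    # whose name goes on the external reply

class ActorBoard:
    def __init__(self):
        self.ownership: dict[str, ThreadOwnership] = {}

    def assign(self, thread_id: str, route_to: str, respond_as: str):
        self.ownership[thread_id] = ThreadOwnership(thread_id, route_to, respond_as)

    def who_replies(self, thread_id: str):
        rec = self.ownership.get(thread_id)
        return rec.respond_as if rec else None   # None is the drift case: nobody knows
```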


This is another set of experiments that shows what this kind of testing can tell us about how best to use or run AI.

Across all experiments, the same pattern keeps showing up:

  • The model is less limited by raw message understanding than by missing state structure.

  • Explicit thread state is a real win.

  • Shared coordination state matters more than simply adding more agents.

  • Actor identity should be explicit state too.

  • Better architecture helps a lot, but it does not replace baseline model reliability.

It used to be that LLMs couldn’t keep track of long series of complex threads, but clearly by 5.2 that’s no longer the case. The problem is that we keep asking AI to reconstruct task and role identity, and coordination states, from conversational memory each time it needs to respond. That’s what we’d need to build agent institutional systems around if we want things to work!

AI agents are weird because, as I said above, they are effectively like Guy Pearce from Memento. They have their context and their inborn faculties and everything else has to be figured out as they go along. Which means the way we manage these new homo agenticus itself has to change; we need to build some institutional setups that will allow these new beings to work. And I think as we’re moving towards adding swarms and multi-agent hierarchies, figuring out what these ought to be is probably the most fun you can have.

Can we build a management flight simulator?

2026-04-20 20:02:56

I.

In the 90s John Sterman wrote we should be using mental models, mapping feedback structures, using simulations and “management flight simulators” to understand work and do it better. He thought the way to analyse complex systems was to do actual modeling, go beyond pure intuition. To make the dynamics explicit.

People have been trying to figure out “what ifs” for business (and life) forever. The idea of a do-over is seductive. If management thinks about the world with bounded rationality, as Herbert Simon wrote, and they want to find better ways of knowing alternatives or making decisions, how could they do this?

When Cyert and March wrote “A Behavioral Theory of the Firm”, thinking about the firm as a coalition of participants and figuring out how it can act as a single “brain”, it felt like we were starting to get to grips with this1. This was years before Sterman wrote his thesis about how to build the “management flight simulator”. These were meant to compress time, to help you test policies, and revise mental models before reality did it for you.

And yet, decades later, we still don’t have it. Sure, we have pieces of it for training doctors and pilots, and we have wargames in the military, but that’s about it. For 99.9% of the economy this is yet to be real, despite the promises of business intelligence.

But now things are different. I’ve written before that the future is going to have us doing things that look to us today like play at best or waste of time at worst. Things that look like playing videogames at work, as work.

You can’t play videogames though if there are no games. What will you move back and forth on? Who’ll you shoot? What orcs will you move? What strategies would you use to move what armies on what battlefield? To do any of these, the game itself needs to be built. And that game is the world model.

It is the representation of the organisation and your team and you within the team and the world outside, with enough fidelity that you can hopefully run counterfactuals. This is what we do already now in companies. You choose what to do based on what you think will make best sense later. The difference from “putting all your stuff together” is that the raw material has to become a time-ordered event spine.

Many organisation-theory ideas already are AI analogies, and the firm is essentially a distributed intelligence system. Csaszar and Steinberger wrote as much earlier this decade. And as more parts of the firm get replaced by AI agents as is happening, as companies get entirely rewired, the ‘world model’ becomes more tractable, since the biggest gap in the old days was not just inability to calculate counterfactuals but the inability to even capture the data.

But now we can. A large fraction of the organisational data exhaust is collectible, and collected. I had a bunch of conversations after I wrote my essay arguing world models are the future, wondering what the shape of something like this might look like. To figure this out, I built Vei.

II.

Vei is an enterprise world model generator. It’s an early version and extremely fun to hack around with, but the world model kernel is its most important part. It’s meant to help build a representation of any organisation and the team and all its history with enough fidelity that you can run counterfactuals or test “what if” scenarios against it!

What Vei does is basically normalise all the traces it finds from whatever data is fed into it (email, messages, docs, action trackers, etc.) into an event spine. Then it builds a state graph, lets you branch from any historical point, and lets you compare counterfactual continuations from that point. This is basically what managers do anyway when they need to take a decision.
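A sketch of the shape of that kernel: a time-ordered event spine plus a branch taken from a historical point. This shows the shape of the idea, not Vei’s actual API:

```python
# Traces normalised into a time-ordered event spine, with a branch point
# where a counterfactual action replaces history from that moment on.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    ts: datetime
    source: str        # "email", "slack", "doc", "tracker", ...
    actor: str
    payload: dict

def build_spine(raw_traces: list[dict]) -> list[Event]:
    events = [Event(ts=t["ts"], source=t["source"], actor=t["actor"], payload=t["payload"])
              for t in raw_traces]
    return sorted(events, key=lambda e: e.ts)   # the time-ordered event spine

def branch(spine: list[Event], at: datetime, counterfactual: Event) -> list[Event]:
    # Keep history up to the branch point, swap in the alternative action,
    # and hand the new prefix to whatever rolls the world forward.
    prefix = [e for e in spine if e.ts <= at]
    return prefix + [counterfactual]
```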

And because it can help do that, it has derivative uses. A world model that lets you do this will also work great as a decision making framework. Think of an alternate decision that you might have made at some given point in time, maybe a different email or a different document or a different message or some combination, and then you can test a few counterfactual futures out from it to see what would have happened.

You could also use it to help steer agents while they work. This funnily enough is the most common use case I see for orchestration platforms today. Hugely useful and now guidable. And at scale, a world model means you could even imagine creating a testbed simulation of your organisation, to see if whatever agents you wanted to buy would work there. Wouldn’t that be neat?

Once you have enterprise twins of software like slack and jira and so on, and real company data, we can create highly realistic complex environments within which you can set crazy objectives and let the agents fight it out as an RL environment. Especially for extremely long horizon tasks.

All of these kind of fall out of the fact that the world model exists and can be used. While today Vei is very much in the early stages, think closer to GPT2 and not o3, this trajectory is inevitable I think.

I first started with several simulated companies to play with. This was useful for intuition, but it really became useful once I stopped trying to figure out what I wanted in the abstract and actually chose a case study. Since I could only get a couple of startups’ data from friends, and that’s hard to share publicly, I chose one everyone knows. Enron.

III.

Enron, for those of you who are too young to remember, was the then-preeminent energy trading firm at the centre of a massive accounting scandal about 25 years ago. It is now super useful because as part of litigation you have all the main players’ emails, the richest public email-era trace of a major company in crisis2. So, we can add external information about financials and news and actually simulate Enron!

Enron is more than just a simple scandal. Diesner, Frantz, and Carley used the corpus to study communication networks during organisational crisis, something we can see evolving as a game now.

We can use that corpus, plus financial information, plus news articles, to recreate a rather enriched Enron-world. And once you do, lo and behold, it works! Vei can:

  1. Load a real Enron branch point, a point of some key decision

  2. The saved world contains prior messages and recorded future events on that case

  3. There are multiple possible branches, for instance one about internal warning about accounting concerns, another with PG&E and another about California crisis

  4. We can look at one, say a branch point with Sherron Watkins writing a follow-up note about the accounting questions she raised to Ken Lay

  5. For instance, one candidate action could send a warning to the audit committee, another could choose a formal accounting escalation, another could keep the issue inside a small legal circle and wait, etc

Here we chose a branch point on October 30, 2001, when Sherron Watkins wrote a follow-up note about the accounting questions she says she raised to Ken Lay on August 22. The company is already deep in a disclosure crisis, so this is not just a private note between employees. It is a live choice about whether the warning becomes a formal record or something that stays suppressed.

So should we “escalate to the audit committee and copy Andersen”? It looks best on risk and trust, even if it slows things down. “Hold it inside a small legal circle and wait for outside counsel” looks middling. “Send a narrow internal warning upward” would be a partial measure. “Keep it private and monitor” looks worst.

That is what the model showed. Vei allows you to check it in two different ways. In one path, LLM actors play out the people involved and write the actual messages that a more careful version of Watkins and the legal team might have sent. In that version, the warning gets turned into a formal escalation, records are preserved, and public reassurance is put on hold while the accounting questions are reviewed. But this is mainly a sense check.

Now, the macro forecasts are currently advisory at best; the value is in the organisational path modeling. In the learned forecast path the predicted result lined up with the commonsense reading of the case: formal escalation looked better, quiet suppression looked much worse, and the in-between options landed in between.

Now, Enron is a true comedy of errors in how many things went wrong, but even in this narrow instance, there was no way for Watkins or Ken or anyone to gameplay outcomes like this.

Enron filed for bankruptcy not long after. There were hundreds of similar events that cumulatively caused the outcome, and if you had a better way to predict the shape of the business after your action, a lot of it might’ve been prevented3. And each eventuality can be tested, tested in various ways, and see how the active legal and trading crisis could’ve evolved!

IV.

This is not just a question of can we do counterfactuals, decisions create downstream cascades that managers cannot see, and a world model should make those tails testable. Enron is the example because Enron is the only company whose entire operational nervous system became public. Other companies are running this gamble too, just in private.

Because companies have always run on world models. They just usually are implicit in some executive’s mind. Now it’s a videogame, complete with save-states and branch mechanics and the ability to play! Hopefully we can do this for every organisation.

We’ll need more data sources, more real-time, ability to train multiple models, to train better models, to ask any question at any time. All of these are essential, and all will make it better. Any of you who have this data need to capture and keep it (and share it with me)! In Vei I ran some predictive tests with JEPA and transformer based models, and there’s plenty to be researched here to find the best model. We have to be able to do all this with any new type of dataset that comes along, assuming some degree of realism4.

We dreamt of building “flight simulators” for management. But actual flight is much easier, seeing as it’s all understandable physics. For organisations, this gets more complex! The constituent parts have free will, and they all react to each other. Very annoying, but now we can start to handle that.

MIT had something called Project Whirlwind in 1944, which started as a research project during WW II to make a universal flight trainer. Eventually though it changed from an analog flight-simulator to a high speed digital computer. I feel that’s analogous to where we’ve ended up as well.

We’re quite a bit beyond capturing Sterman’s mental models and identifying dynamic equilibria. We can extrapolate any patterns from the infinite tapestry of data that every collective action produces. Some will be useless and some useful and many too convoluted to be easily untangled from the outside, but that’s okay. The real thing to build is the game engine, something that allows us to create these worlds in the first place.

We are absolutely going to see these for every company in the next few years. Folks are already starting to talk about this. It’s time to build!

1

We started thinking about this with Jay Forrester’s work, to analyse feedback systems and how complex organisations actually worked.

2

The fact that this dataset exists is a bit like finding pre-1945 steel, unsullied by radiation, so I’ve been loving how much I can use it for research. For instance, I used it to see how well AI agents could respond to emails, like humans, in llmenron.

3

Like if we chose a branch point on Dec 15th 2000, we can see Tim Belden’s desk get a preservation order tied to the California power crisis. That was a major event!

4

I tried with a couple startups’ data, and a few more simulated ones, and at this alpha stage it still works.