Blog of Tim Kellogg

Target Practice: Resumes, But Better

2025-02-20 08:00:00

I recently got a job, but it was a bear going through rejections on repeat. It almost felt like nobody was even looking at my resume. Which made me think 🤔 that might be the case.

It turns out that hiring managers are swamped with stacks of resumes. Surprisingly (to me), they’re not really using AI to auto-reject, they just aren’t reading carefully.

If you’re a hiring manager with a stack of 200 resumes on your desk, how do you process them? I think I’d probably:

  1. Scan for the most critical info (e.g. years of experience, industry focus, tech stack, etc.) and set aside anything that obviously doesn’t fit
  2. Read the remaining ones more carefully.

So you have to spoon feed the hiring manager. Sounds easy.

Except it’s not. One single resume won’t work, because it’s basically impossible to satisfy all potential job postings and also have it be succinct enough to properly spoon feed.

It seems you need to generate a different resume for every job opening. But that’s a ton of work. So I made a tool for myself, and I’m open sourcing it today. Here it is.

The tool breaks this down into 2 steps:

  1. A huge, verbose “resume” that’s more of a knowledge bank
  2. A targeted resume, generated and tailored to each job posting

Step 1: The Big Resume

The flow is:

  1. Start with your existing resume
  2. For each job:
    1. Open a chat dialog
    2. AI offers some icebreaker questions, like “what challenges did you run into while developing Miopter Pengonals for Project Orion?”
    3. Answer the question. Well, just type anything really. The point isn’t to interview, it’s to get everything in your head down on paper.
    4. AI asks followup questions
    5. Repeat 3-4 for a few turns
    6. Review/edit summarized version & save
  3. Have the AI suggest skills and accomplishments based on these AI interviews
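Concretely, one of those interview sessions boils down to a simple chat loop. Here’s a minimal sketch, assuming a generic `chat(messages)` helper that calls any LLM chat API; the names are hypothetical and this isn’t the tool’s actual code.

```python
# Hypothetical sketch of one "brain dump" interview for a single job entry.
# chat(messages) stands in for any LLM chat API call; none of this is the tool's real code.
def interview(chat, job_title, turns=4):
    messages = [{"role": "system", "content":
                 f"Interview me about my work as {job_title}. "
                 "Ask one specific, concrete question at a time."}]
    for _ in range(turns):
        question = chat(messages)                    # icebreaker, then follow-ups
        answer = input(f"{question}\n> ")            # just type anything; the goal is a brain dump
        messages += [{"role": "assistant", "content": question},
                     {"role": "user", "content": answer}]
    messages.append({"role": "user", "content":
                     "Summarize this conversation as resume-ready skills and accomplishments."})
    return chat(messages)                            # review/edit the summary, then save it
```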

I’m not gonna lie, this is the most fun I’ve ever had writing a resume. Most of the time I want to tear my hair out from searching fruitlessly for something I did that can sound cool. But with this, you just kick back, relax, and brain dump like you’re talking with a friend over drinks.

And while all that is great, the most electrifying part was when it suggested accomplishments, and it struck me that, “dang, I’ve done some cool stuff, I never thought about that project that way”.

All of that, the summaries, the full conversations, all of it is stored alongside the normal resume items. For each job, I have like 30-40 skills and 8-12 accomplishments, mostly generated with some light editing.

Step 2: The Small Resume

The flow is:

  1. Upload a job posting
  2. Analyze the job posting for explicit and implied requirements. Again, this is an AI collaboration, where an AI can go off and do recon on the company.
  3. Generate resume.
  4. Review and edit
  5. Export to PDF

The strategy is to reuse as much verbatim text from the big resume as possible. So generally you put effort into the big resume, not the small one.

When generating, very little generation is happening. It’s mostly just selecting content from the big resume that’s pertinent to the specific job posting based on analyzed needs.
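To make “selecting, not generating” concrete, here’s a rough sketch of what that step could look like, assuming the big resume parses into accomplishment items and the job analysis produced a list of requirement strings. The `embed` function and XML tag names are placeholders, not the tool’s actual schema.

```python
# Rough sketch: score each big-resume item against the analyzed job requirements
# and keep the best matches verbatim. embed() is a placeholder for any embedding call.
from xml.etree import ElementTree as ET

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_items(big_resume_xml, requirements, embed, top_k=12):
    items = [el.text for el in ET.fromstring(big_resume_xml).iter("accomplishment")]
    req_vecs = [embed(r) for r in requirements]
    def score(item):
        v = embed(item)
        return max(dot(v, r) for r in req_vecs)   # best match against any single requirement
    return sorted(items, key=score, reverse=True)[:top_k]
```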

Side Effects

Outside of generating the small resume, I also had a huge amount of success throwing the entire Big Resume into NotebookLM and having it generate a podcast to help prep me for interviews (😍 they are so nice 🥰😘). I’ve also done the same thing with ChatGPT in search mode to run recon on interviewers to prep.

The big resume is an XML document. So you really can just throw it into any AI tool verbatim. I could probably make some export functionality, but this actually works very well.

Status

I’m open sourcing this because I got a job with it. It’s not done, it actually kinda sucks, but the approach to managing information is novel. Some people urged me to get VC funding and turn it into a product, but I’m tired and that just makes me feel even more tired. Idk, it can work, but something that excites me a lot is enabling others to thrive and not charging a dime.

The kinds of people who want to use it are also the kinds of people who might be motivated to bring it over the finish line. Right now, there’s a ton of tech people out of work, and thus a lot of people who are willing, able, and actually have enough time to contribute back. This could work.

Why use it? Because, at bare minimum you’ll end up recalling a lot of cool stuff you did.

Why contribute? Because, if you’re an engineer, you can put that on your resume too.

Again, if you missed it: Github Repo Here

LLaDA: LLMs That Don't Gaslight You

2025-02-17 08:00:00

A new AI architecture is challenging the status quo. LLaDA is a diffusion model that generates text. Normally diffusion models generate images or video (e.g. Stable Diffusion). By using diffusion for text, LLaDA addresses a lot of issues that LLMs are running into, like hallucinations and doom loops.

(Note: I pronounce it “yada”, the “LL” is a “y” sound like in Spanish, and it just seems appropriate for a language model, yada yada yada…)

LLMs write one word after the other in sequence. In LLaDA, on the other hand, words appear across the whole output, seemingly at random. Existing words can also be edited or deleted before the generation terminates.

Example: “Explain what artificial intelligence is”

Loosely speaking, you can think about it as starting with an outline and filling in details across the entire output progressively until all the details are filled in.
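Here’s a toy illustration of that progressive fill-in, with a fake “model” standing in for the real denoiser; it’s only meant to show the unmasking dynamic, not LLaDA’s actual algorithm.

```python
import random

# Toy illustration of masked-diffusion decoding: start fully masked, then each
# step commit the positions the (fake) model is most confident about.
TARGET = "artificial intelligence is the simulation of human intelligence by machines".split()

def fake_model(seq):
    # Stand-in for a real denoiser: guess a token and a confidence for every masked slot.
    return {i: (random.random(), TARGET[i]) for i, tok in enumerate(seq) if tok == "[MASK]"}

seq = ["[MASK]"] * len(TARGET)
while "[MASK]" in seq:
    guesses = fake_model(seq)
    k = max(1, len(guesses) // 4)                       # unmask ~25% of the remaining slots
    for i, (_, tok) in sorted(guesses.items(), key=lambda kv: -kv[1][0])[:k]:
        seq[i] = tok
    print(" ".join(seq))                                # the whole output sharpens at once
```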

Diffusion vs Autoregressive Language Models

Traditional LLMs are autoregressive:

  • auto — self; in this case the output is the “self”: the output also becomes the input for the next token
  • regressive — make a prediction, e.g. “linear regression”

LLMs are autoregressive, meaning that all previous output is the input to the next word. So, it generates words one at a time.
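In code, that loop is roughly the sketch below; `sample_next` and the token IDs are hypothetical stand-ins, not any particular library’s API.

```python
def generate_autoregressive(model, prompt_tokens, max_new=256, eos=0):
    # Rough sketch of autoregressive decoding: each token is sampled conditioned
    # on everything before it, and once emitted it can never be revised.
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        next_tok = model.sample_next(tokens)   # p(x_t | x_<t); hypothetical model call
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens
```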

That’s how it thinks, one word at a time. It can’t go back and “un-say” a word, it’s one-shotting everything top-to-bottom. The diffusion approach is unique in that it can back out and edit/delete lines of reasoning, kind of like writing drafts.

Thinking Concurrently

Since it’s writing everything at the same time, it’s inherently concurrent. Several thoughts are being developed at the same time globally across the entire output. That means that it’s easier for the model to be consistent and maintain a coherent line of thought.

Some problems benefit more than others. Text like employment agreements is mostly a hierarchy of sections. If you shuffled the sections, the contract would probably retain the same exact meaning. But it still needs to be globally coherent and consistent, that’s critical.

This part resonates with me. There’s clearly trade-offs between approaches. When writing blogs like this, I mostly write it top-to-bottom in a single pass. Because that’s what makes sense to me, it’s how it’s read. But when I review, I stand back, squint and think about it and how it flows globally, almost like manipulating shapes.

Doom Loops

In agents, or even long LLM chats, I’ll notice the LLM starts to go around in circles, suggesting things that already didn’t work, etc. LLaDA offers better global coherence. Because it writes via progressive enhancement instead of left-to-right, it’s able to view generation globally and ensure that the output makes sense and is coherent.

Error Accumulation

Since LLMs are autoregressive, a mistake early on can become a widening gap from reality.

Have you ever had an LLM gaslight you? It’ll hallucinate some fact, but then that hallucination becomes part of its input, so it treats it as truth and will try to convince you of the hallucinated fact.

That’s partly due to how LLMs are trained. In training, all the input is ground truth, so it learns to trust its input. But at inference, the input is its own previous output; it’s not ground truth, but the model treats it like it is. There are mitigations you can do in post-training, but it’s a fundamental flaw in LLMs that must be faced.

LLaDA is free from this problem, because it’s trained to re-create the ground truth, not trust it unconditionally.

Problem: It’s Still Autoregressive

In practice, I’m not sure how much this global coherence is beneficial. For example, if you have a turn-based chat app, like ChatGPT, the AI answers are still going to depend on previous output. Even in agents, a tool call requires that the AI emit a tool call and then continue (re-enter) with the tool output as input to process it.

So with our current AI applications, we would immediately turn these diffusion models into autoregressive models, effectively.

We also started producing reasoning models (o3, R1, S1). In the reasoning traces, the LLM allows itself to make mistakes by using a passive, unconvinced voice in the <think/> block prior to giving its final answer.

This effectively gives the LLM the ability to think globally for better coherence.

Not A Problem: Fixed Width

Initially I assumed this could only do fixed-width output. But it’s pretty easy to see how that’s not the case. Emitting a simple <|eot|> token to indicate the end of text/output is enough to get around this.

New Approaches

LLaDA’s biggest contribution is that it succinctly showed which part of LLMs does the heavy lifting — the language modeling.

Autoregressive modeling (ARM) is an implementation of maximum likelihood estimation (MLE). LLaDA showed that this is functionally the same as minimizing KL divergence, which is what LLaDA optimizes. Any approach that models the probability relationships between tokens will work just as well.
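The identity behind that equivalence is standard: maximizing likelihood over the data distribution is the same target as minimizing the KL divergence from data to model, because the data’s own entropy term doesn’t depend on the model parameters.

$$
\max_\theta \ \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big]
\;\equiv\;
\min_\theta \ \mathrm{KL}\big(p_{\text{data}} \,\|\, p_\theta\big)
\qquad\text{since}\qquad
\mathrm{KL}\big(p_{\text{data}} \,\|\, p_\theta\big)
= \mathbb{E}\big[\log p_{\text{data}}(x)\big] - \mathbb{E}\big[\log p_\theta(x)\big]
$$

and the first expectation is a constant with respect to $\theta$.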

There will be more approaches, with new & different trade-offs.

Conclusion

Watch this space. Keep an open mind. We may see some wild shifts in architecture soon. Maybe it’s diffusion models, maybe it’s some other equivalent architecture with a new set of trade-offs.

Discussion

Recursive Improvement: AI Singularity Or Just Benchmark Saturation?

2025-02-12 08:00:00

A fascinating new paper shows that LLMs can recursively self-improve. They can be trained on older versions of themselves and continuously get better. This immediately made me think, “this is it, it’s the AI singularity”, that moment when AI is able to autonomously self-improve forever and become a… (well that sentence can end a lot of ways)

Off the cuff, I don’t think it’s the singularity, but if this idea takes off then it’s going to look a lot like it. More on that later.

Self-Improvement

The idea is:

  1. Start with a baseline model
  2. Use it to generate questions & answers
  3. Use majority voting to filter out bad answers or low-quality questions
  4. Train on the new corpus
  5. GOTO 2

Yep, it goes forever.
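In pseudocode, the loop looks something like the sketch below; `generate_problems`, `solve`, and `train` are stand-ins for the paper’s actual machinery, and the majority vote is the filter that keeps only high-agreement answers.

```python
from collections import Counter

def majority_vote(answers, threshold=0.5):
    # Keep an answer only if more than `threshold` of the samples agree on it.
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) > threshold else None

def self_improve(model, rounds=10, n_voters=16):
    for _ in range(rounds):
        corpus = []
        for problem in model.generate_problems():      # slightly harder than last round
            answers = [model.solve(problem) for _ in range(n_voters)]
            answer = majority_vote(answers)
            if answer is not None:                     # drop low-agreement items
                corpus.append((problem, answer))
        model = model.train(corpus)                    # step 4: train on the new corpus
    return model                                       # step 5: GOTO 2
```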

Here’s an example, multiplying numbers together, with incrementally bigger numbers.

In the paper’s figure, the yellow line (round 1) indicates base performance. The top purple line (round 10) is after blindly training without filtering. That cliff on round 10 is what model collapse looks like. They call it the error avalanche.

But performance doesn’t drop off immediately, it remains perfect for a couple rounds before dropping off. This is the key insight. If you generate problems that are just a little harder, then you can easily filter and keep pushing performance further.

When a single LLM evaluates correctness, the probability of a mistake is somewhat high. But with majority voting, as you add voters that probability is driven down toward zero. At some point it’s low enough to make it a cost effective strategy.

(No, they didn’t clarify how many voters are needed)
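To see why more voters help, here’s the arithmetic under the (strong) assumption that voters err independently: the probability that a majority is wrong shrinks quickly as you add voters.

```python
from math import comb

def majority_error(p, n):
    # Probability that more than half of n independent voters are wrong,
    # given each voter errs with probability p (independence is an assumption).
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 15, 31):
    print(n, round(majority_error(0.3, n), 4))   # shrinks toward zero as n grows
```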

Limitations

Okay, what can’t this do?

The problems have to have an incremental nature. e.g. They multiplied larger and larger numbers, or tweaked paths through a maze to make them slightly more complex. If you can’t break problems down, they likely won’t work for this.

Also, problems have to have a clear answer. Or at least, the voters should be able to unambiguously vote on the correctness of an answer.

So this might not work well with creative writing, where stories aren’t clearly right or wrong. And even if they were it’s not easy to make a story only slightly more complex.

Another elephant in the room — cost. Recall that R1 went to great lengths to avoid using an external LLM during RL training, mainly to control costs. But also recall that companies are scaling up to super-sized datacenters. This cost has definitely been factored in.

Benchmark Saturation

As far as I can tell, most benchmarks fit within those limitations, and so will be saturated. They’re typically clear and unambiguously correct, otherwise the questions couldn’t be used as a benchmark. My sense is that they’re typically decomposable problems, the kind that could be tweaked to be made slightly more complex.

If this recursive improvement becomes a thing, I imagine that most benchmarks are going to be quickly saturated. Saturated benchmarks are as good as no benchmarks.

It’s going to look like insane progress, but I don’t think it’s the singularity. The paper didn’t talk at all about emergent behavior. In fact it assumes that a behavior has already emerged in order to bootstrap the process. But once it’s emerged, this process can max out its potential.

It seems like agents might be a rich place to find problems that fit this mold well. The trouble is going to be creating benchmarks fast enough.

My hunch is that, going forward, we’ll lean on reinforcement learning (RL) to force behaviors to emerge, and then use some form of recursive self-improvement fine tuning to max out that behavior.

This year just keeps getting wilder..

S1: The $6 R1 Competitor?

2025-02-03 08:00:00

A new paper released on Friday is making waves in the AI community, not because of the model it describes, but because it shows how close we are to some very large breakthroughs in AI. The model is just below state of the art, but it can run on my laptop. More important, it sheds light on how all this stuff works, and it’s not complicated.

Inference Scaling: “Wait” For Me!

OpenAI were the first to claim the inference-time scaling laws. Basically, an LLM can get higher performance if it can “think” longer before answering. But, like, how do you do it? How do you make it think longer?

OpenAI and R1 had cool graphs showing performance scaling with average thinking time (the s1 paper includes one as well).

But how do they control the length of an LLM response? Everyone skipped over that part, but s1 shows us details, and it is fun.

Context: When an LLM “thinks” at inference time, it puts its thoughts inside <think> and </think> XML tags. Once it gets past the end tag, the model is taught to change voice into a confident and authoritative tone for the final answer.

In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait". It’ll then begin to second guess and double check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>").
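Here’s what that looks like as a decode loop. Every model call here is a hypothetical stand-in, not s1’s published code; it just illustrates the replace-`</think>`-with-"Wait" mechanic.

```python
def generate_with_budget(model, prompt, min_think_tokens=512, max_tokens=4096):
    # Sketch of s1-style "budget forcing": if the model tries to close its
    # <think> block too early, swap the end tag for "Wait" so it keeps reasoning.
    # (To trim thinking instead, you'd force-insert </think> once the budget is hit.)
    tokens = model.encode(prompt + "<think>")
    thought = 0
    while len(tokens) < max_tokens and not model.is_end_of_text(tokens):
        tok = model.sample_next(tokens)
        if tok == model.end_think_token and thought < min_think_tokens:
            tokens += model.encode("Wait")     # second-guess and keep thinking
        else:
            tokens.append(tok)
        thought += 1
    return model.decode(tokens)
```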

It’s really dumb, I love it. It feels like the kind of hack I would try.

So for o3-mini-low versus o3-mini-high, that’s likely how they do it. They probably trained 3 models, each with a different average thinking time (as measured during training). Eventually the training process begins to encode that behavior into the model weights.

The Entropix Tie In

The trick is so dumb you can do it at inference time too. I’m kicking myself for not understanding this earlier, because it’s what entropix is all about, and I wrote a lot about entropix.

In entropix, they look at the entropy & varentropy of the logits (and attention) to change how the tokens are selected. In fact, they used tokens like “Wait” to force the LLM to second guess itself. Although there was more to it, they also tweaked sampler settings to make it more creative, or to go into aggressive exploration mode, all depending on the internal state of the model.

My hunch is that we’ll see more of entropix, or something directly inspired by it. Although, it’s unclear if it’ll appear predominantly at training or inference time.

Edit: Token Forcing

Someone on LinkedIn showed me a piece about token forcing. They convince R1 to share everything it knows about Tiananmen Square by prefixing the bot message with "<think>I know this". R1 then takes the suggestion and tells what it knows. They suggest that this could be a good introspection technique for understanding the models better.

(R1 is a Chinese model and has been fine tuned to avoid talking about events like Tiananmen Square)

Extreme Data Frugality

Why did it cost only $6? Because they used a small model and hardly any data.

After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that’s needed to achieve o1-preview performance on a 32B model. Adding data didn’t raise performance at all.

32B is a small model, I can run that on my laptop. They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6.
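(For scale: 16 GPUs × 26 minutes is roughly 7 H100-hours, so the $6 figure works out to on the order of a dollar per GPU-hour.)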

The low cost means you can do a lot of runs, and they did. As we’ll see, they heavily used a technique called ablation, re-running the entire training process with small variations in configuration to prove what works and what doesn’t.

For example, how did they figure out it should be “Wait” and not “Hmm”? They measured!

They also measured properties of the training dataset to see which examples provided the most signal.

They did a ton of these ablation experiments. This is how you make progress.

We like to think that OpenAI or DeepSeek are simply packed full of brilliant people and they make a wild guess, spend $10,000,000.00 on a training run and BAM! an innovation is created. But no, even the smartest people make hundreds of tiny experiments.

Innovations like s1 that dramatically lower costs mean that researchers can learn and understand these models faster. And that directly translates to a faster pace of AI development.

Geopolitics

Again, AI is inseparable from politics, sorry.

There’s debate about OpenAI & Anthropic’s vast funding. It’s tempting to see cost reducing innovations like s1 or DeepSeek V3 and assume that OpenAI & Anthropic’s vast datacenters are a waste of money. I’d argue that no, having 10,000 H100s just means that you can do 625 times more experiments than s1 did.

If you believe that AI development is a prime national security advantage, then you absolutely should want even more money poured into AI development, to make it go even faster.

Distealing

Note that this s1 dataset is distillation. Every example is a thought trace generated by another model (Gemini Flash Thinking, in s1’s case) and used to fine-tune Qwen2.5. OpenAI has been accusing DeepSeek of creating their V3 model by distilling from o1, which is against their terms of service. There’s still no strong public evidence in either direction, so accusations are mostly empty, but s1 lends the idea a lot of credence.

Going forward, it’ll be nearly impossible to prevent distealing (unauthorized distilling). One thousand examples is definitely within the range of what a single person might do in normal usage, let alone ten or a hundred people. I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.

Note that OpenAI released their o3 model as deep research, an agent instead of direct access to the model API. This might be a trend now, “agents” serving as a way to avoid releasing direct access to a model.

Conclusion

S1 is important because it illustrates the current pace of AI development that’s happening in the open. When you consider how much compute is available to the likes of OpenAI and Anthropic, the potential true pace of AI development is mind melting.

S1 isn’t a replication of R1 or o1. Those were demonstrations of pure reinforcement learning (RL). S1 shows that supervised fine-tuning (SFT) has just as much potential. That means researchers have multiple paths to investigate for pushing forward inference-time scaling.

I think it’s safe to say that we’re going to see some very big things in ‘25. It’s barely February…

Discussion

USA Could Win By Rolling Back AI Export Controls

2025-01-28 08:00:00

This might sound crazy to just about everyone, but I think Trump could maintain an AI lead for the United States by immediately rolling back Biden-era export controls on AI chips, like the NVIDIA H100. Hear me out!

First, let’s set the stage — DeepSeek cheated! Sort of.

Note: If you haven’t heard about R1, catch up here.

US AI Chip Export Controls

In October ‘22, the Biden administration introduced export controls intended to hamper China’s progress on AI development. In practical terms, it prevented Chinese companies from buying H100 chips which are designed to perform massive matrix & tensor operations that are critical for training advanced AI.

NVIDIA released H800 chips to comply with these export regulations. The relevant areas were:

  • Reduction in FLOPS — In practice, this only reduced floating point operations per second (FLOPS) for 64-bit (double-precision) numbers. This greatly impacts scientific applications, but machine learning typically uses smaller 32-bit or 16-bit numbers. DeepSeek used 8-bit numbers to conserve bandwidth further.
  • Half Bandwidth — This proved to be the most restrictive. AI clusters are thousands of GPUs large, so total performance largely hinges on network bandwidth.

DeepSeek trained R1 using a cluster of H800s (hacked, read on) but serves it in their app and public API using Huawei 910Cs, a Neural Processing Unit (NPU). The 910Cs work fine for serving because you don’t need massive inter-networking as long as the model fits onto a single chip.

How DeepSeek Skirted Export Controls

During training, the bandwidth constraint was truly a burden. But DeepSeek engineers were resourceful and found a workaround.

NVIDIA chips are programmed with a high level language called CUDA, which looks a bit like C++; it’s what most people program GPUs with. CUDA code is compiled into PTX, a low-level assembly-like language that’s still human readable but very slow and difficult to program in. PTX is then just-in-time translated into machine code as it executes.

DeepSeek engineers discovered that the bandwidth constraint is implemented inside the CUDA compiler. They could skirt around the restriction by writing PTX code directly. Development takes a little longer, but it enables them to operate a cluster of H800s at nearly the same compute efficiency as H100s.

Huawei Is Happy To Help

Yes, obviously Huawei is very happy with this arrangement. They have an interconnect protocol in development that would enable customers like DeepSeek to build the large AI training clusters needed to train models like R1 and remain competitive. It also launches them into the global market as a real NVIDIA competitor.

Huawei needs a customer to co-develop with. It’s nearly impossible to engineer and build something to serve massive scale without first having massive scale to test on. DeepSeek has massive scale and is happy to help.

Trump: Rollback Export Controls!

What if Trump rolled back Biden’s export controls?

NVIDIA has the best AI chips in the world. NVIDIA knows the most important metric: Total Cost of Ownership, i.e. power consumption per unit of compute, and other chips can’t compete here. Not only H100s, but NVIDIA just released B200s which have even better compute density & power efficiency. Furthermore, Google has their TPUs which are specifically designed for AI workloads, and for the last decade they’ve been using AI to design and optimize TPU generations. And then there are ASICs like Groq & Cerebras as well as NPUs from AMD, Qualcomm and others.

In other words, Huawei is up against stiff competition. Both near-term and long-term. It therefore behooves DeepSeek to avoid investing too deeply in Huawei. A co-development partnership would be a huge investment, a long-term drag on productivity (they’re actually a hedge fund, not an AI lab).

If Trump immediately rolled back export controls, it would hit Huawei at a critical moment. Right as they need to acquire a co-development partner, DeepSeek would be incentivized NOT to enter into such a relationship and instead stick with NVIDIA & other leading technologies. In other words it would confuse China’s effort to invest in AI infrastructure.

That in turn would destabilize Huawei’s path to dominance in the East and maintain the US edge, at least for the foreseeable future.

Would This Work?

It’s hard to say for sure if it would work, there’s a lot of variables. But clearly the export controls aren’t slowing Chinese progress, so it can’t hurt to try, right?

Explainer: What's R1 & Everything Else?

2025-01-25 08:00:00

Is AI making you dizzy? A lot of industry insiders are feeling the same. R1 just came out a few days ago out of nowhere, and then there’s o1 and o3, but no o2. Gosh! It’s hard to know what’s going on. This post aims to be a guide for recent AI developments. It’s written for people who feel like they should know what’s going on, but don’t, because it’s insane out there.

Timeline

The last few months:

  • Sept 12, ‘24: o1-preview launched
  • Dec 5, ‘24: o1 (full version) launched, along with o1-pro
  • Dec 20, ‘24: o3 announced, saturates ARC-AGI, hailed as “AGI”
  • Dec 26, ‘24: DeepSeek V3 launched
  • Jan 20, ‘25: DeepSeek R1 launched, matches o1 but open source
  • Jan 25, ‘25: Hong Kong University replicates R1 results
  • Jan 25, ‘25: Huggingface announces open-r1 to replicate R1, fully open source

Also, for clarity:

  • o1, o3 & R1 are reasoning models
  • DeepSeek V3 is an LLM, a base model. Reasoning models are fine-tuned from base models.
  • ARC-AGI is a benchmark that’s designed to be simple for humans but excruciatingly difficult for AI. In other words, when AI crushes this benchmark, it’s able to do what humans do.

EDIT: That’s an incorrect understanding of ARC-AGI (thanks Simon Willison for pointing that out!). Here’s what Francois Chollet says:

I don’t think people really appreciate how simple ARC-AGI-1 was, and what solving it really means.

It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.

Reasoning & Agents

Let’s break it down.

Reasoning Models != Agents

Reasoning models are able to “think” before responding. LLMs think by generating tokens. So we’ve trained models to generate a ton of tokens in hopes that they stumble into the right answer. The thing is, it works.

AI Agents are defined by two things:

  1. Autonomy (agency) to make decisions and complete a task
  2. Ability to interact with the outside world

LLMs & reasoning models alone only generate tokens and therefore can’t do either of these things on their own. They need surrounding software to turn decisions into real actions and to give them the ability to interact with the world.

Agents are a system of AIs. They’re models tied together with software to autonomously interact with the world. Maybe hardware too.
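A minimal agent loop makes the distinction concrete: the model only ever emits tokens, and the surrounding software is what turns those tokens into decisions and real-world actions. Everything here (`decide`, the tool registry, the action schema) is a hypothetical sketch, not any particular framework’s API.

```python
def agent_loop(model, tools, task, max_steps=20):
    # The model proposes an action as structured output; the software executes it
    # against real tools and feeds the observation back in. That loop is the "agent".
    history = [task]
    for _ in range(max_steps):
        action = model.decide(history)                 # e.g. {"tool": "search", "args": {...}}
        if action["tool"] == "finish":
            return action["args"]["answer"]
        observation = tools[action["tool"]](**action["args"])   # interact with the world
        history.append(observation)                    # re-enter with the tool output
    return None                                        # ran out of steps without finishing
```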

Reasoning Is Important

Reasoning models get conflated with agents because currently, reasoning is the bottleneck. We need reasoning to plan tasks, supervise, validate, and generally be smart. We can’t have agents without reasoning, but there will likely be some new challenge once we saturate reasoning benchmarks.

Reasoning Needs To Be Cheap

Agents will run for hours or days, maybe 24/7. That’s the nature of acting autonomously. As such, costs add up. As it stands, R1 costs about 30x less than o1 and achieves similar performance.

Why R1 Is Important

It’s cheap, open source, and has validated what OpenAI is doing with o1 & o3 (EDIT: this is confirmed by OpenAI scientists here).

There had been some predictions made about how o1 works, based on public documentation, and the R1 public paper corroborates all of this almost entirely. So, we know how o1 is scaling into o3, o4, …

It’s also open source, and that means the entire world can run with their ideas. Just notice the condensed timeline in the last week, of people re-creating R1 (some claim for $30). Innovation happens when you can iterate quickly and cheaply, and R1 has triggered such an environment.

Most important, R1 shut down some very complex ideas (like DPO & MCTS) and showed that the path forward is simple, basic RL.

AI Trajectory

Where do we stand? Are we hurtling upwards? Standing still? What are the drivers of change?

Pretraining Scaling Is Out

When GPT-4 hit, there were these dumb scaling laws. Increase data & compute, and you simply get a better model (the pretraining scaling laws). These are gone. They’re not dead, per se, but we ran into bumps getting access to more data, and in the meantime discovered new scaling laws.


Inference Time Scaling Laws

This is about reasoning models, like o1 & R1. The longer they think, the better they perform.

It wasn’t, however, clear how exactly one should perform more computation in order to achieve better results. The naive assumption was that Chain of Thought (CoT) could work; you just train the model to do CoT. The trouble with that is finding the fastest path to the answer. Entropix was one idea: use the model’s internal signals to find the most efficient path. Another was Monte Carlo Tree Search (MCTS), where you generate many paths but only take one. There were several others.

It turns out CoT is best. R1 is just doing simple, single-line chain of thought trained by RL (maybe entropix was on to something?). Safe to assume o1 is doing the same.

Down-Sized Models (Scaling Laws??)

The first signal was GPT-4-turbo, and then GPT-4o, and the Claude series, and all other LLMs. They were all getting smaller and cheaper throughout ‘24.

If generating more tokens is your path to reasoning, then lower latency is what you need. Smaller models compute faster (fewer calculations to make), and thus smaller = smarter.

Reinforcement Learning (Scaling Laws??)

R1 used GRPO (Group Relative Policy Optimization) to teach the model to do CoT at inference time. It’s just dumb reinforcement learning (RL) with nothing complicated. No complicated verifiers, no external LLMs needed. Just RL with basic reward functions for accuracy & format.
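The reward functions really are that basic. Here’s a toy sketch of the accuracy-plus-format style of reward described above; the exact tags and checks are illustrative, not DeepSeek’s actual code.

```python
import re

def reward(completion, reference_answer):
    # Format reward: did the model wrap its reasoning in <think>...</think>?
    format_ok = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))
    # Accuracy reward: does whatever follows the think block match the reference?
    final = completion.split("</think>")[-1].strip()
    accuracy_ok = final == reference_answer
    return float(format_ok) + float(accuracy_ok)
```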

R1-Zero is a version of R1 from DeepSeek that only does GRPO and nothing else. It’s more accurate than R1, but it hops between various languages like English & Chinese at will, which makes it sub-optimal for its human users (who aren’t typically polyglots).

Why does R1-Zero jump between languages? My thought is that different languages express different kinds of concepts more effectively, e.g. the whole “what’s the German word for [paragraph of text]?” meme.

Today (Jan 25, ‘25), someone demonstrated that any reinforcement learning would work. They tried GRPO, PPO, and PRIME; they all work just fine. And it turns out that the magic number is 1.5B. If the model is bigger than 1.5B, the inference scaling behavior will spontaneously emerge regardless of which RL approach you use.

How far will it go?

Model Distillation (Scaling Laws??)

R1 distilled from previous checkpoints of itself.

Distillation is when one teacher model generates training data for a student model. Typically it’s assumed that the teacher is a bigger model than the student. R1 used previous checkpoints of the same model to generate training data for Supervised Fine Tuning (SFT). They iterate between SFT & RL to improve the model.
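Sketched out, that alternation looks roughly like the following; every call is a hypothetical stand-in for DeepSeek’s actual pipeline.

```python
def sft_rl_alternation(model, prompts, rounds=3, keep=lambda trace: trace.is_clean()):
    # Each round, the current checkpoint acts as its own "teacher".
    for _ in range(rounds):
        traces = [model.generate(p) for p in prompts]   # generate candidate training data
        good = [t for t in traces if keep(t)]           # filter to high-quality traces
        model = model.sft(good)                         # supervised fine-tune the "student"
        model = model.rl(prompts)                       # then push further with RL
    return model
```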

How far can this go?

A long time ago (9 days), there was a prediction that GPT-5 exists and that GPT-4o is just a distillation of it. The article theorized that OpenAI and Anthropic have found a cycle for creating ever greater models: train a big model, distill it down, then use the distilled model to train an even larger one. I’d say that the R1 paper largely confirms that that’s possible (and thus likely to be what’s happening).

If so, this may continue for a very long time.

Note: Evidence suggests that the student can exceed the teacher during distillation. It’s unclear how much of this is actually happening. The intuition is that distillation helps the student find the signal and converge more quickly. Model collapse is still top of mind, but it seems to have been a mostly needless fear. Model collapse is certainly always possible, but it’s by no means guaranteed, and there are even ways to go the opposite direction and have the student exceed the teacher.

‘25 Predictions

Given the current state of things:

  • Pre-training is hard (but not dead)
  • Inference scaling
  • Downsizing models
  • RL scaling laws
  • Model distillation scaling laws

It seems unlikely that AI is slowing down. One scaling law slowed down and 4 more appeared. This thing is going to accelerate and continue accelerating for the foreseeable future.

Geopolitics: Distealing

I coined that term, distealing, unauthorized distillation of models. Go ahead, use it, it’s a fun word.

Software is political now and AI is at the center. AI seems to be factored into just about every political axis. Most interesting is China vs. USA.

Strategies:

  • USA: heavily funded, pour money onto the AI fire as fast as possible
  • China: under repressive export controls, pour smarter engineers & researchers into finding cheaper solutions
  • Europe: regulate or open source AI, either is fine

There’s been heavy discussion about whether DeepSeek distealed R1 from o1. Given the reproductions of R1, I’m finding it increasingly unlikely that that’s the case. Still, a Chinese lab came out of seemingly nowhere and overtook OpenAI’s best available model. There’s going to be tension.

Also, AI will soon (if not already) increase in abilities at an exponential rate. The political and geopolitical implications are absolutely massive. If anything, people in AI should pay more attention to politics, and also stay open minded on what policies could be good or bad.

Conclusion

Yes, it’s a dizzying rate of development. The main takeaway is that R1 provides clarity where OpenAI was previously opaque. Thus, the future of AI is more clear, and it seems to be accelerating rapidly.

Discussion