Non_Interactive

jbetker, an engineer at OpenAI.

Mixture of Experts

2025-04-19 06:01:11

A Transformer is a stack of alternating Attention and MLP layers through which data, embedded as high-dimensional vectors, is fed. A Mixture of Experts (MoE) Transformer replaces the MLP layer with an “MoE Layer”. Let’s dive into what that means.

The “MLP” is one of the oldest neural network architectures, consisting of two linear transformations with a non-linearity between them. First, an embedding vector is expanded via the first transformation. Next, a non-linearity is applied to the expanded vector. Finally, the second transformation contracts it back to the original dimensionality.
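
As a minimal sketch (in PyTorch, with illustrative dimension names d_model and d_hidden rather than any particular model's values), such a block might look like this:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Expand, apply a non-linearity, then contract back to the original dimensionality."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)    # first transformation: expand
        self.act = nn.GELU()                      # non-linearity
        self.down = nn.Linear(d_hidden, d_model)  # second transformation: contract

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))
```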

An “MoE” layer is a collection of n_experts MLP layers plus a “router”. The router is an additional linear transformation that outputs a probability distribution over the experts, determining which k of the MLP layers should be used. The embedding is fed through each of the top-k selected MLP layers, and their outputs are combined (summed, typically weighted by the router probabilities).
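
Here is a hedged sketch of that routing logic, reusing the MLP class from the previous snippet as the expert; real implementations differ in details like load balancing, expert capacity limits, and exactly how the router weights are applied:

```python
class MoELayer(nn.Module):
    """n_experts MLP blocks plus a linear router that picks the top-k experts per token."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int):
        super().__init__()
        self.experts = nn.ModuleList([MLP(d_model, d_hidden) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)      # distribution over experts
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)  # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    weight = topk_probs[mask, slot].unsqueeze(-1)
                    out[mask] += weight * expert(x[mask])  # weighted expert outputs are summed
        return out
```

The nested loops are written for clarity; production kernels instead group tokens by expert so each expert runs as one batched matrix multiply.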

Effectively, this means that for each forward pass through the MoE layer, only k/n_experts of the expert parameters are actively used. This property of the MoE is commonly referred to as “sparsity”, with the sparsity factor of an MoE layer being n_experts/k. For example, with 64 experts and k=2, only 1/32 of the expert parameters are active for any given token.

When measuring the efficiency of a neural network – the test loss achieved when trained on the Pareto-optimal blend of data and compute (per Chinchilla) – you will find that MoE Transformers are considerably more compute-efficient than vanilla Transformers. This is an example of a compute multiplier. The multiplier tends to increase with sparsity (though there are diminishing returns!). Importantly, these efficiency gains persist when scaling the Transformer itself by adding layers or expanding the embedding dimension.

That last property (efficiency gains persist with scale) is actually quite rare – most architectural modifications to the Transformer tend to diminish in effect with scale. That’s because at large enough scale, a neural network can learn to mimic the implicit biases granted by most architectural decisions through brute force training.

The fact that MoE has great scaling properties indicates that something deeper is at play with this architectural construct. This turns out to be sparsity itself – it is a new free parameter in the scaling laws, for which sparsity=1 is suboptimal. Put another way – Chinchilla scaling laws focus on the relationship between data and compute, but MoEs give us another lever: the number of parameters in a neural network. Previously, compute and parameter count were proportional, but sparsity allows us to modulate this ratio.

This is the true magic behind the MoE Transformer, and why every big lab has been moving to them. It is not that we are somehow teaching a bunch of smaller subnetworks to specialize and then ensembling them – that is the type of anthropomorphism that plagued a lot of architectural research in the late 2010s. Instead, it is all about the ability to add free parameters to our models, allowing us to train and serve them more efficiently. And it seems to be the case that most LLMs are quite underparameterized.

There’s a sad side effect to all this – at inference time with low batch sizes on the hardware that is currently commercially available, Transformers are notoriously memory bound. That is to say, we are often only using a small fraction of the compute cores because they are constantly waiting for network weights to be loaded from VRAM. This problem gets much worse with MoE Transformers – there are simply more weights to load. You need more VRAM and will be more memory bound (read: worse performance). The open source community is starting to see this with the DeepSeek and Llama 4 models.

Instead, MoE transformers lend themselves best to highly distributed serving environments with ultra-large batch sizes and corresponding high latency per token. This makes me sad – both because I like low-latency systems and because I’m a fan of local models. I sincerely hope we see some hardware innovations in the coming years which will help to ameliorate this.

The Paradigm

2025-03-17 07:00:09

Over the past decade, some of the most remarkable AI breakthroughs – AlphaGo, AlphaStar, AlphaFold[1], VPT, OpenAI Five, ChatGPT – have all shared a common thread: they start with large-scale data gathering (self-supervised or imitation learning, or SSL) and then use reinforcement learning to refine their performance toward a specific goal. This marriage of general knowledge acquisition and focused, reward-driven specialization has emerged as the paradigm by which we can reliably train AI systems to excel at arbitrary tasks.

I’d like to talk about how and why this works so well.

[1] AlphaFold 2 technically does not use RL; instead it uses distillation via rejection sampling, which has similar (if less adaptable) results.

Generalization

In recent years, we’ve found that applying SSL to highly general datasets improves the robustness and thus usefulness of our models at downstream tasks. As a result, the models the big labs are putting out are increasingly trained on self-prediction objectives over a diverse corpus of interleaved text, images, video and audio.

By comparison, RL training has stayed quite “narrow”. All of the systems I mentioned above were trained with RL that optimizes something fairly specific: for example, play a game well or be engaging and helpful to humans talking to you.

Over the last year, something seems to have happened at many of the top research labs: they started investing in more “general” RL optimization. Instead of using reinforcement learning to optimize models to play one game very well, we’re optimizing them to solve complex math problems, write correct code, derive coherent formal proofs, play all games, write extensive research documents, operate a computer, etc.

And this seems to be working! Reasoning models trained with general RL are leap-frogging SSL-only models on every benchmark for measuring model performance that we know of. There’s something happening here, and it’s worth paying attention to.

Some Terminology

When training with an RL objective, you are no longer learning to model a distribution of data – you are learning a sampling policy. What does that mean?

You can think of sampling from an autoregressive model (e.g. one that is sampled one “token” at a time) as sampling a chain of actions – each token is an action. Observations can be injected in the chain by inserting non-sampled tokens at any time during the process. This chain of actions and observations is called a “trajectory”. It is the causal rollout of a series of interactions between the model and some external environment.
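
To make the terminology concrete, here is a schematic sketch of such a rollout; model.sample_token, environment.observe, and the other names are illustrative placeholders rather than real APIs:

```python
def rollout(model, environment, max_steps: int):
    """Builds a trajectory: an interleaved chain of sampled actions and injected observations."""
    trajectory = []
    context = environment.initial_prompt()          # non-sampled tokens to start from
    for _ in range(max_steps):
        action = model.sample_token(context)         # each sampled token is an "action"
        trajectory.append(("action", action))
        context = context + [action]
        observation = environment.observe(context)   # the environment may inject non-sampled tokens
        if observation:
            trajectory.append(("observation", observation))
            context = context + observation
        if environment.done(context):
            break
    return trajectory                                # later scored by the reward function
```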

When we say we’re learning a policy, what that means is that we’re teaching a model a set of processes for generating useful trajectories. “Useful” here is defined by the reward function you introduce. A useful trajectory is one that has a high likelihood of achieving the goal.

These processes can be thought of as little “subroutines” that the model learns to use as effective ways to solve the classes of problems that it regularly encounters. Thinking harder might be one. Learning to write common algorithms in C++ might be another. They have analogues in the human experience – over my life I have learned how to speak, type on a keyboard, pour water, turn a screwdriver, plant a seed, carry heavy objects, drive – the list goes on and on. In all cases I’ve learned to do these things subconsciously. That only works by learning subroutines that I can chain together to accomplish my goals. We see policy models learn to do the same things.

Error Correction

I suspect the most impactful end result of policy learning is learning error correction. Likelihood training teaches models how to mimic intelligent agents, but it does not teach them what intelligent agents would do if they were put in highly unlikely scenarios – such as would occur if a really poor prediction were made. Making a poor prediction like this is called “falling off the manifold”. It is analogous to something unexpected happening during your daily routine. Getting yourself back “on the manifold” is what allows you to continue your goal-oriented behaviors. This is error correction.

Here’s the dirty little secret about relying entirely on SSL: our models will always fail, and they will fail in ways that the humans who generated the pre-training data did not expect or encounter themselves. As a result, pre-training data will not always contain examples which teach the model to perform the error correction that it needs to function in the real world. So our SSL models will likely never be able to reliably error correct at all levels.

By contrast, general RL models learn error correction policies early on. We see this in reasoning models with their tendency to second guess their own thoughts. Words like “but”, “except”, “maybe”, etc trigger the model to review previous generations and catch errors made by naïveté, poorly executed exploration, or random chance.

Intentionality and Refinement

The first few times we do a complicated task, we say we are “learning” that task. For many tasks, this process is very intentional: we study it beforehand, come up with plans, and execute slowly with lots of interjected thinking. For tasks that happen too fast to think, we spend time afterwards debriefing. As we repeat the task over and over, the intentionality vanishes. We build the mental subroutines that I discussed previously.

A core component of the paradigm is the distillation of cycles of observation, planning and action into simpler cycles of observation and action. This process of distillation was previously an offline process: we built models, we deployed them, and we studied how they interacted with their environment. We labeled the good behaviors and the bad, and we used those labels to train better models. This has largely been the driver of progress in LLMs for the past 2 years.

Now that we can build algorithms that improve themselves by “thinking” more, I expect this process of self-improvement to accelerate. It will likely define the next few years of ML progress – possibly decades. Increasingly, we’ll apply RL techniques to new fields or applications, generate large amounts of high-quality “on-policy” data, and pump that data into the SSL regime. The base models will get smarter, and we’ll use them to do more RL on an increasingly diverse set of problems.

This is a data generation engine that feeds off of compute and interaction with the world, and nothing else. People are concerned about a shortage of data; this is the big reason that I am not. Similarly, this is why I think it is naive to expect the need for more AI accelerators to taper off any time soon.

Reasoning

The first application of general RL has been to build “reasoning” models. We’re anthropomorphizing these things a bit by saying that they are “thinking”, but I think that the analogy isn’t as far off as many skeptics would have you believe.

When humans “think”, we channel our mental energy into better understanding the world so that we might take better actions in the future. This happens by internally searching our space of intuition to aid in solving problems. The search can happen in many different ways – we can think in language, talk to ourselves, or use the “mind’s eye” to visualize desired end states and the trajectory for getting to them. Effective thinkers are creative, self-critical and well-informed.

Reasoning models attempt to solve problems by generating long sequences of tokens to improve their answers to questions. These long sequences of tokens follow patterns imparted on the model by learning from human language. Increasingly, we’re seeing models learn to use knowledge retrieval to aid their search. The model also learns how to be self-critical and how to explore the vast space of possible trajectories.

What is interesting is that effective general reasoning strategies seem to “fall out” of general RL optimization. For example, an LLM that is taught to “think hard” to better solve math and programming problems performs much better on legal, biology and economics tests.

This opens up a brand new “scaling curve” in ML. Previously, we scaled data and compute to get better models with log-linear returns. We’re on the diminishing slope side of that curve. Now we’ve got a new method of optimization which can be applied on top of the old. From everything I’ve seen so far, the two seem to compound on top of one another. How far this can be pushed is the open question.

Where is this going?

As the techniques underpinning this paradigm mature and proliferate over the next decade, it’ll be increasingly clear that there are only two obstacles to building computer systems that solve any task of interest:

  1. Enabling models to interact with the world with the fidelity required to solve the task
  2. Finding robust ways to measure whether or not that task has been completed satisfactorily

Make no mistake: these are extremely hard problems. Just because we know what we need to do doesn’t mean that it’ll get solved this year. With that said, I think they are absolutely within the realm of “solvable” for a wide variety of useful tasks within a short timeframe.

Beating ARC the hard way

2024-12-23 01:38:49

ARC is a benchmark developed to test out-of-distribution reasoning and common sense in general solvers. It is specifically designed to be:

  1. Easily solvable by most humans
  2. Not amenable to any kind of brute-force solvers (e.g. try every permutation of a solution)
  3. Not able to be solved with rote memorization

The designers of ARC achieved the above in a creative way: by developing problems that contain visual puzzles in which the participant must find an algorithm that explains symmetries seen across several demonstrations. They then must apply that algorithm to a final input. This sounds complicated, but in practice it is quite intuitive – most children can complete ARC questions.

LLMs are being pitched as general solvers, so lately we have been trying them out on this challenge. However, to make ARC amenable to being solved by a pure language model, you must remove the visual “clues” to the problem.

More concretely, let’s take the test problem on the ARC GitHub. Humans see it rendered as a grid of colored cells, while the exact same problem is fed into a language model like this:

{"train": [{"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 4, 5, 0, 0, 0, 0], [0, 0, 0, 4, 5, 4, 4, 0, 0], [0, 0, 3, 3, 5, 0, 0, 0, 0], [0, 0, 0, 3, 5, 0, 0, 0, 0], [0, 0, 0, 3, 5, 3, 3, 3, 0], [0, 0, 0, 3, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output": [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 4], [0, 0, 4, 4], [0, 0, 3, 3], [0, 0, 0, 3], [0, 3, 3, 3], [0, 0, 0, 3], [0, 0, 0, 0], [0, 0, 0, 0]]}, {"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 0, 0, 2, 5, 2, 6, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 0, 0, 2, 5, 2, 2, 2, 0], [0, 0, 6, 6, 5, 6, 0, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 2, 2, 0, 5, 2, 0, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output": [[0, 0, 0, 0], [0, 0, 0, 2], [0, 0, 6, 2], [0, 0, 0, 2], [0, 2, 2, 2], [0, 0, 6, 6], [0, 0, 0, 2], [0, 2, 2, 2], [0, 0, 0, 2], [0, 0, 0, 0]]}, {"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 7, 0, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 7, 8, 8, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 8, 8, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 8, 7, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output": [[0, 0, 0, 0], [0, 0, 0, 7], [0, 0, 0, 8], [0, 0, 0, 8], [0, 7, 8, 8], [0, 0, 8, 8], [0, 0, 0, 8], [0, 0, 0, 8], [0, 0, 7, 8], [0, 0, 0, 0]]}], "test": [{"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 1, 5, 0, 0, 0, 0], [0, 0, 0, 1, 5, 1, 0, 0, 0], [0, 1, 1, 1, 5, 1, 1, 1, 6], [0, 0, 0, 6, 5, 6, 6, 0, 0], [0, 0, 0, 0, 5, 1, 1, 1, 0], [0, 0, 0, 1, 5, 0, 0, 0, 0], [0, 0, 0, 1, 5, 1, 6, 0, 0], [0, 0, 0, 0, 5, 6, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output":

(Note: there may be a bit of prompt scaffolding around this)

This makes the problem considerably harder. While it’s true that clever programmers who are presented with the above text problem would probably figure it out with enough time, I do not think that most humans could solve it. Here’s an example set of steps you could pursue if you wanted to tackle such a problem from a command line interface:

  1. Recognize that the format is JSON and parse it
  2. See that the inputs and outputs are grids, and that there is some pattern to the grid sizes of the inputs and outputs.
  3. Print the grids out using some pretty-print library
  4. Recognize that there is some pattern going on with the numbers
  5. Develop an algorithm that reproduces the observed pattern
  6. Write the algorithm down as code
  7. Verify the algorithm works on all of the provided examples
  8. Apply the algorithm to the final input, submit the result

(5) is the really tricky bit. It becomes much easier when you show the grids with colored boxes, but I would not underestimate how hard it is to find the symmetries using only text. Mere mortals, myself included, need not apply.
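
For illustration, here is a short sketch of the mechanical early steps (parsing the JSON and pretty-printing the grids); the hard part, step (5), is deliberately left out:

```python
import json

def show_arc_problem(raw: str) -> None:
    """Parse an ARC problem and pretty-print each training input/output grid."""
    problem = json.loads(raw)
    for i, pair in enumerate(problem["train"]):
        for name in ("input", "output"):
            grid = pair[name]
            print(f"train[{i}] {name} ({len(grid)}x{len(grid[0])}):")
            for row in grid:
                print(" ".join(str(cell) for cell in row))
            print()
```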

o3

Today, OpenAI released some details about an upcoming reasoning model, o3. Along with very impressive math and coding results (which are much more important than ARC, but on which I feel unable to comment, since it has long since surpassed my capabilities in either field), it was revealed that o3 achieved a new state-of-the-art score on the ARC test set.

That’d be really cool if it had done so by operating a UI, but the way it was actually done is far more impressive to me: o3 achieved these scores with the text version of this eval. The way o3 does this is by brainstorming solutions, then testing those solutions in its own thoughts. It does this over and over again until it finds a solution that works. It also sometimes thinks about “giving up” and “guessing” and other uncannily human things.

There’s a few really important things I want to call out here:

  • Using RL to learn reasoning continues to scale
  • Reasoning models are getting better at building and verifying algorithms in-CoT
  • There’s clear evidence of in-domain generalization from training on very few examples

I want to call out that I am not currently involved in building or deploying the o-series of models in any way. I am just in awe of what my colleagues have achieved!

General Intelligence (2024)

2024-06-03 21:01:47

Folks in the field of AI like to make predictions for AGI. I have thoughts, and I’ve always wanted to write them down. Let’s do that.

Since this isn’t something I’ve touched on in the past, I’ll start by doing my best to define what I mean by “general intelligence”: a generally intelligent entity is one that achieves a special synthesis of three things:

  • A way of interacting with and observing a complex environment. Typically this means embodiment: the ability to perceive and interact with the natural world.
  • A robust world model covering the environment. This is the mechanism which allows an entity to perform quick inference with a reasonable accuracy. World models in humans are generally referred to as “intuition”, “fast thinking” or “system 1 thinking”.
  • A mechanism for performing deep introspection on arbitrary topics. This is thought of in many different ways – it is “reasoning”, “slow thinking” or “system 2 thinking”.

If you have these three things, you can build a generally intelligent agent. Here’s how:

First, you seed your agent with one or more objectives. Have the agent use system 2 thinking in conjunction with its world model to start ideating ways to optimize for its objectives. It picks the best idea and builds a plan. It uses this plan to take an action on the world. It observes the result of this action and compares that result with the expectation it had based on its world model. It might update its world model here with the new knowledge gained. It uses system 2 thinking to make alterations to the plan (or idea). Rinse and repeat. 
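
Here is that cycle in schematic form; every name (ideate, build_plan, act, and so on) is an illustrative placeholder rather than a real API:

```python
def pursue_objective(objective, world_model, environment, max_iterations: int):
    """The ideate / plan / act / observe / update cycle described above, schematically."""
    plan = None
    for _ in range(max_iterations):
        if plan is None:
            ideas = world_model.ideate(objective)       # system 2: generate candidate approaches
            plan = world_model.build_plan(world_model.pick_best(ideas))
        action = plan.next_action()
        result = environment.act(action)                # take an action on the world
        expectation = world_model.predict(action)       # what the world model expected to happen
        if result != expectation:
            world_model.update(action, result)          # fold the new knowledge into the world model
        plan = world_model.revise(plan, result)         # system 2: alter the plan, rinse and repeat
```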

My definition for general intelligence is an agent that can coherently execute the above cycle repeatedly over long periods of time, thereby being able to attempt to optimize any objective.

The capacity to actually achieve arbitrary objectives is not a requirement. Some objectives are simply too hard. Adaptability and coherence are the key: can the agent use what it knows to synthesize a plan, and is it able to continuously act towards a single objective over long time periods?

So with that out of the way – where do I think we are on the path to building a general intelligence?

World Models

We’re already building world models with autoregressive transformers, particularly of the “omnimodel” variety. How robust they are is up for debate. There’s good news, though: in my experience, scale improves robustness and humanity is currently pouring capital into scaling autoregressive models. So we can expect robustness to improve.

With that said, I suspect the world models we have right now are sufficient to build a generally intelligent agent. 

Side note: I also suspect that robustness can be further improved via the interaction of system 2 thinking and observing the real world. This is a paradigm we haven’t really seen in AI yet, but happens all the time in living things. It’s a very important mechanism for improving robustness.

When LLM skeptics like Yann say we haven’t yet achieved the intelligence of a cat – this is the point that they are missing. Yes, LLMs still lack some basic knowledge that every cat has, but they could learn that knowledge – given the ability to self-improve in this way. And such self-improvement is doable with transformers and the right ingredients.

Reasoning

There is not a well known way to achieve system 2 thinking, but I am quite confident that it is possible within the transformer paradigm with the technology and compute we have available to us right now. I estimate that we are 2-3 years away from building a mechanism for system 2 thinking which is sufficiently good for the cycle I described above.

Embodiment

Embodiment is something we’re still figuring out with AI, but it is another area where I am quite optimistic about near-term advancements. There is a convergence currently happening between the field of robotics and LLMs that is hard to ignore.

Robots are becoming extremely capable – able to respond to very abstract commands like “move forward”, “get up”, “kick ball”, “reach for object”, etc. For example, see what Figure is up to, or the recently released Unitree H1.

On the opposite end of the spectrum, large Omnimodels give us a way to map arbitrary sensory inputs into commands which can be sent to these sophisticated robotics systems.

I’ve been spending a lot of time lately walking around outside talking to GPT-4o while letting it observe the world through my smartphone camera. I like asking it questions to test its knowledge of the physical world. It’s far from perfect, but it is surprisingly capable. We’re close to being able to deploy systems which can carry out coherent strings of actions on the environment and observe (and understand) the results. I suspect we’re going to see some really impressive progress in the next 1-2 years here.

This is the area of AI I am personally most excited about, and I plan to spend most of my time working on it over the coming years.

TL;DR

In summary – we’ve basically solved building world models, have 2-3 years on system 2 thinking, and 1-2 years on embodiment. The latter two can be done concurrently. Once all of the ingredients have been built, we need to integrate them together and build the cycling algorithm I described above. I’d give that another 1-2 years.

So my current estimate is 3-5 years for AGI. I’m leaning towards 3 for something that looks an awful lot like a generally intelligent, embodied agent (which I would personally call an AGI). Then a few more years to refine it to the point that we can convince the Gary Marcuses of the world.

Really excited to see how this ages. 🙂

GPT-4o

2024-05-15 06:28:37

I’m very pleased to show the world GPT-4o. I came into the project mid-last year with Alexis Conneau with the goal of scaling up speech models and building an “AudioLM”. We knew we had something special late last year, but I don’t think either of us imagined that we’d be able to pull off something as cool as GPT-4o in this short of a time frame. That came from the dedicated work of a core team of “believers”. I’m incredibly proud to have had the chance to work with so many talented and motivated people.

I agree with Sam that interacting with this model feels like something new. I think what it boils down to is that for the first time ever, it “feels” better to interact with a computer program through speech rather than through text. GPT-4o isn’t without its flaws, but it responds so quickly and is right so often that it’s not too hard to shrug off the minor issues it has. Of course, we’ll keep improving on those going forwards.

One consistent piece of feedback I’ve been getting about 4o is that “it is not enough to watch the demo, you have to experience it yourself”. So for those of you who are skeptical, give it a chance when we get it out of alpha!

Really excited to see this in the hands of more people. It really is exciting tech. I use it regularly and it is the source of many smiles. It’s going to be an amazing year.

Research Code

2024-03-17 00:08:19

At my job, I’m currently in a cycle that involves working with software engineers quite a bit. One thing that has happened a number of times is that a software engineer will bring up “research code” with a condescending tone. The implication is that research code is messy, unreadable, and difficult to maintain.

I don’t deny this! It often is those things, but I also think it has a beauty to its purpose and prose that is worth acknowledging.

Most code has a purpose from the get go. Someone thinks “wouldn’t it be nice if my computer did <x>”, then designs and builds the code to achieve <x>. Generally you know where you want to end up and how to get there fairly early on in the project. This makes writing design docs, unit tests and a coherent software architecture possible.

Research doesn’t really work in this way. It is similarly goal oriented, but the distinction is that you never really know exactly how you’re going to get to the goal. You constantly poke at the unknown until something “works”. You get stuck in iterative loops improving small aspects of your design over weeks or months – which naturally causes it to evolve. At the end of the journey the goalposts have moved, more often than not.

In such an environment, you take notes rather than write design docs. They describe where you’ve been and why the things you tried didn’t work. You test one-off implementations directly rather than write unit tests. If something works, you cement it in place as a beacon marking a step towards your goal. If you’re fortunate enough to be working on a problem that allows evaluation metrics, you might record the eval scores you achieve. It might make sense to write a unit test here, but it’s often easier just to treat the beacon code as immutable and branch off of it.

I love this way of iterating on a problem. I always have, well before I joined the field. It’s haphazard as hell, but it also feels as close to “art” as I think programming will ever get. Let me explain that a bit: art is more often than not about individual expression. Programming for most large companies is explicitly not that: it’s about writing in a common language with your fellow programmers as well as sharing “cultural norms” like coding conventions, reviews and tests. There are good reasons for all of these things! But the end product is rarely “art”.

By contrast, the code that a researcher builds is a product of the interaction between their individual ingenuity and the real world. It’s organic and complex and two researchers tackling a similar problem rarely come up with the same code at the end of the day. I call that “art”.