Non_Interactive

jbetker, an engineer at OpenAI.

Beating ARC the hard way

2024-12-23 01:38:49

ARC is a benchmark developed to test out-of-distribution reasoning and common sense in general solvers. It is specifically designed to be:

  1. Easily solvable by most humans
  2. Not amenable to any kind of brute-force solvers (e.g. try every permutation of a solution)
  3. Not able to be solved with rote memorization

The designers of ARC achieved the above in a creative way: by developing problems that contain visual puzzles in which the participant must find an algorithm that explains the symmetries seen across several demonstrations, then apply that algorithm to a final input. This sounds complicated, but in practice it is quite intuitive – most children can complete ARC questions.

LLMs are being pitched as general solvers, so lately we have been trying them out on this challenge. However, to make ARC amenable to being solved by a pure language model, you must remove the visual “clues” to the problem.

More concretely, let’s take the test problem on the ARC GitHub. Humans see this:

While the exact same problem is fed into a language model like this:

{"train": [{"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 4, 5, 0, 0, 0, 0], [0, 0, 0, 4, 5, 4, 4, 0, 0], [0, 0, 3, 3, 5, 0, 0, 0, 0], [0, 0, 0, 3, 5, 0, 0, 0, 0], [0, 0, 0, 3, 5, 3, 3, 3, 0], [0, 0, 0, 3, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output": [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 4], [0, 0, 4, 4], [0, 0, 3, 3], [0, 0, 0, 3], [0, 3, 3, 3], [0, 0, 0, 3], [0, 0, 0, 0], [0, 0, 0, 0]]}, {"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 0, 0, 2, 5, 2, 6, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 0, 0, 2, 5, 2, 2, 2, 0], [0, 0, 6, 6, 5, 6, 0, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 2, 2, 0, 5, 2, 0, 0, 0], [0, 0, 0, 2, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output": [[0, 0, 0, 0], [0, 0, 0, 2], [0, 0, 6, 2], [0, 0, 0, 2], [0, 2, 2, 2], [0, 0, 6, 6], [0, 0, 0, 2], [0, 2, 2, 2], [0, 0, 0, 2], [0, 0, 0, 0]]}, {"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 7, 0, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 7, 8, 8, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 8, 8, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 0, 0, 8, 5, 0, 0, 0, 0], [0, 0, 0, 0, 5, 8, 7, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output": [[0, 0, 0, 0], [0, 0, 0, 7], [0, 0, 0, 8], [0, 0, 0, 8], [0, 7, 8, 8], [0, 0, 8, 8], [0, 0, 0, 8], [0, 0, 0, 8], [0, 0, 7, 8], [0, 0, 0, 0]]}], "test": [{"input": [[0, 0, 0, 0, 5, 0, 0, 0, 0], [0, 0, 0, 1, 5, 0, 0, 0, 0], [0, 0, 0, 1, 5, 1, 0, 0, 0], [0, 1, 1, 1, 5, 1, 1, 1, 6], [0, 0, 0, 6, 5, 6, 6, 0, 0], [0, 0, 0, 0, 5, 1, 1, 1, 0], [0, 0, 0, 1, 5, 0, 0, 0, 0], [0, 0, 0, 1, 5, 1, 6, 0, 0], [0, 0, 0, 0, 5, 6, 0, 0, 0], [0, 0, 0, 0, 5, 0, 0, 0, 0]], "output":

(Note: there may be a bit of prompt scaffolding around this)

This makes the problem considerably harder. While it’s true that clever programmers who are presented with the above text problem would probably figure it out with enough time, I do not think that most humans could solve it. Here’s an example set of steps you could pursue if you wanted to tackle such a problem from a command line interface:

  1. Recognize that the format is JSON and parse it
  2. See that the inputs and outputs are grids, and that there is some pattern to the grid sizes of the inputs and outputs.
  3. Print the grids out using some pretty-print library
  4. Recognize that there is some pattern going on with the numbers
  5. Develop an algorithm that reproduces the observed pattern
  6. Write the algorithm down as code
  7. Verify the algorithm works on all of the provided examples
  8. Apply the algorithm to the final input, submit the result

(5) is the really tricky bit. It becomes much easier when you show the grids with colored boxes, but I would not underestimate how hard it is to find the symmetries using only text. Mere mortals, myself included, need not apply.
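
To make the mechanical steps concrete, here is a minimal Python sketch of steps (1)–(3) and (7)–(8). It assumes the task JSON has been saved locally as "task.json" (a hypothetical filename), and "transform" stands in for whatever algorithm you manage to discover in steps (5)–(6):

```python
import json

def show(grid):
    # Pretty-print a grid as rows of digits; 0 is the background color.
    for row in grid:
        print("".join(str(cell) for cell in row))

def transform(grid):
    # Placeholder for steps (5)-(6): the algorithm you have to discover.
    raise NotImplementedError

# Step (1): parse the JSON.
with open("task.json") as f:
    task = json.load(f)

# Steps (2)-(3): inspect grid sizes and pretty-print the demonstrations.
for i, pair in enumerate(task["train"]):
    print(f"train[{i}]: {len(pair['input'])}x{len(pair['input'][0])} -> "
          f"{len(pair['output'])}x{len(pair['output'][0])}")
    show(pair["input"])
    print("->")
    show(pair["output"])

# Step (7): verify the candidate algorithm against every demonstration.
assert all(transform(p["input"]) == p["output"] for p in task["train"])

# Step (8): apply the algorithm to the test input and submit the result.
print(transform(task["test"][0]["input"]))
```

None of that scaffolding is the hard part, of course; the whole puzzle lives inside "transform".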

o-3

Today, OpenAI released some details about an upcoming reasoning model “o-3”. Along with very impressive math and coding results (which are much more important than ARC, but on which I feel unable to comment, since the model has long since surpassed my capabilities in both fields), it was revealed that o-3 achieved a new state-of-the-art score on the ARC test set.

That’d be really cool if it had done so by operating a UI, but the way it was actually done is far more impressive to me: o-3 achieved these scores with the text version of this eval. The way o-3 does this is by brainstorming solutions, then testing those solutions in its own thoughts. It does this over and over again until it finds a solution that works. It also sometimes thinks about “giving up” and “guessing” and other uncannily human things.

There are a few really important things I want to call out here:

  • Using RL to learn reasoning continues to scale
  • Reasoning models are getting better at building and verifying algorithms in-CoT
  • There’s clear evidence of in-domain generalization from training on very few examples

I want to call out that I am not currently involved in building or deploying the o- series of models in any way. I am just in awe of what my colleagues have achieved!

General Intelligence (2024)

2024-06-03 21:01:47

Folks in the field of AI like to make predictions for AGI. I have thoughts, and I’ve always wanted to write them down. Let’s do that.

Since this isn’t something I’ve touched on in the past, I’ll start by doing my best to define what I mean by “general intelligence”: a generally intelligent entity is one that achieves a special synthesis of three things:

  • A way of interacting with and observing a complex environment. Typically this means embodiment: the ability to perceive and interact with the natural world.
  • A robust world model covering the environment. This is the mechanism that allows an entity to perform quick inference with reasonable accuracy. World models in humans are generally referred to as “intuition”, “fast thinking” or “system 1 thinking”.
  • A mechanism for performing deep introspection on arbitrary topics. This is thought of in many different ways – it is “reasoning”, “slow thinking” or “system 2 thinking”.

If you have these three things, you can build a generally intelligent agent. Here’s how:

First, you seed your agent with one or more objectives. Have the agent use system 2 thinking in conjunction with its world model to start ideating ways to optimize for its objectives. It picks the best idea and builds a plan. It uses this plan to take an action on the world. It observes the result of this action and compares that result with the expectation it had based on its world model. It might update its world model here with the new knowledge gained. It uses system 2 thinking to make alterations to the plan (or idea). Rinse and repeat. 
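
To make that cycle concrete, here is a rough Python sketch of the loop. Every interface in it (WorldModel, Reasoner, Environment, the plan object) is a hypothetical placeholder, not something that exists today; the point is only how the three ingredients fit together:

```python
from abc import ABC, abstractmethod

class WorldModel(ABC):
    # "System 1": fast, intuitive prediction of what an action will do.
    @abstractmethod
    def predict(self, state, action): ...
    @abstractmethod
    def update(self, state, action, outcome): ...

class Reasoner(ABC):
    # "System 2": slow, deliberate ideation and plan revision.
    @abstractmethod
    def ideate(self, objective, state, world_model): ...  # returns a plan
    @abstractmethod
    def revise(self, plan, expectation, outcome): ...

class Environment(ABC):
    # Embodiment: how the agent observes and acts on the world.
    @abstractmethod
    def observe(self): ...
    @abstractmethod
    def act(self, action): ...

def pursue(objective, env: Environment, world_model: WorldModel,
           reasoner: Reasoner, steps: int = 1_000):
    state = env.observe()
    plan = reasoner.ideate(objective, state, world_model)
    for _ in range(steps):
        action = plan.next_action(state)
        expectation = world_model.predict(state, action)  # what we think will happen
        outcome = env.act(action)                         # what actually happened
        if outcome != expectation:
            world_model.update(state, action, outcome)    # learn from the surprise
            plan = reasoner.revise(plan, expectation, outcome)
        state = env.observe()
```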

My definition for general intelligence is an agent that can coherently execute the above cycle repeatedly over long periods of time, thereby being able to attempt to optimize any objective.

The capacity to actually achieve arbitrary objectives is not a requirement. Some objectives are simply too hard. Adaptability and coherence are the key: can the agent use what it knows to synthesize a plan, and is it able to continuously act towards a single objective over long time periods?

So with that out of the way – where do I think we are on the path to building a general intelligence?

World Models

We’re already building world models with autoregressive transformers, particularly of the “omnimodel” variety. How robust they are is up for debate. There’s good news, though: in my experience, scale improves robustness and humanity is currently pouring capital into scaling autoregressive models. So we can expect robustness to improve.

With that said, I suspect the world models we have right now are sufficient to build a generally intelligent agent. 

Side note: I also suspect that robustness can be further improved via the interaction of system 2 thinking and observing the real world. This is a paradigm we haven’t really seen in AI yet, but happens all the time in living things. It’s a very important mechanism for improving robustness.

When LLM skeptics like Yann say we haven’t yet achieved the intelligence of a cat – this is the point that they are missing. Yes, LLMs still lack some basic knowledge that every cat has, but they could learn that knowledge – given the ability to self-improve in this way. And such self-improvement is doable with transformers and the right ingredients.

Reasoning

There is no well-known way to achieve system 2 thinking, but I am quite confident that it is possible within the transformer paradigm with the technology and compute we have available to us right now. I estimate that we are 2-3 years away from building a mechanism for system 2 thinking which is sufficiently good for the cycle I described above.

Embodiment

Embodiment is something we’re still figuring out in AI, but it is another area where I am quite optimistic about near-term advancements. There is a convergence currently happening between the field of robotics and LLMs that is hard to ignore.

Robots are becoming extremely capable – able to respond to very abstract commands like “move forward”, “get up”, “kick ball”, “reach for object”, etc. For example, see what Figure is up to or the recently released Unitree H1.

On the opposite end of the spectrum, large Omnimodels give us a way to map arbitrary sensory inputs into commands which can be sent to these sophisticated robotics systems.

I’ve been spending a lot of time lately walking around outside talking to GPT-4o while letting it observe the world through my smartphone camera. I like asking it questions to test its knowledge of the physical world. It’s far from perfect, but it is surprisingly capable. We’re close to being able to deploy systems which can commit coherent strings of actions on the environment and observe (and understand) the results. I suspect we’re going to see some really impressive progress in the next 1-2 years here.

This is the field of AI I am personally most excited about, and I plan to spend most of my time working on this over the coming years.

TL;DR

In summary – we’ve basically solved building world models, are 2-3 years out on system 2 thinking, and 1-2 years out on embodiment. The latter two can be done concurrently. Once all of the ingredients have been built, we need to integrate them together and build the cycling algorithm I described above. I’d give that another 1-2 years.

So my current estimate is 3-5 years for AGI. I’m leaning towards 3 for something that looks an awful lot like a generally intelligent, embodied agent (which I would personally call an AGI). Then a few more years to refine it to the point that we can convince the Gary Marcuses of the world.

Really excited to see how this ages. 🙂

GPT-4o

2024-05-15 06:28:37

I’m very pleased to show the world GPT-4o. I came into the project mid-last year with Alexis Conneau with the goal of scaling up speech models and building an “AudioLM”. We knew we had something special late last year, but I don’t think either of us imagined that we’d be able to pull off something as cool as GPT-4o in this short of a time frame. That came from the dedicated work of a core team of “believers”. I’m incredibly proud to have had the chance to work with so many talented and motivated people.

I agree with Sam that interacting with this model feels like something new. I think what it boils down to is that for the first time ever, it “feels” better to interact with a computer program through speech rather than through text. GPT-4o isn’t without its flaws, but it responds so quickly and is right so often that it’s not too hard to shrug off the minor issues that it has. Of course, we’ll get better at addressing those issues going forward.

One consistent piece of feedback I’ve been getting about 4o is that “it is not enough to watch the demo, you have to experience it yourself”. So for those of you who are skeptical, give it a chance when we get it out of alpha!

Really excited to see this in the hands of more people. It really is exciting tech. I use it regularly and it is the source of many smiles. It’s going to be an amazing year.

Research Code

2024-03-17 00:08:19

At my job, I’m currently in a cycle that involves working with software engineers quite a bit. One thing that has happened a number of times is that a software engineer will bring up “research code” with a condescending tone. The implication is that research code is messy, unreadable, and difficult to maintain.

I don’t deny this! It often is those things, but I also think it has a beauty to its purpose and prose that is worth acknowledging.

Most code has a purpose from the get-go. Someone thinks “wouldn’t it be nice if my computer did <x>”, then designs and builds the code to achieve <x>. Generally you know where you want to end up and how to get there fairly early on in the project. This makes writing design docs, unit tests and a coherent software architecture possible.

Research doesn’t really work in this way. It is similarly goal oriented, but the distinction is that you never really know exactly how you’re going to get to the goal. You constantly poke at the unknown until something “works”. You get stuck in iterative loops improving small aspects of your design over weeks or months – which naturally causes it to evolve. At the end of the journey the goalposts have moved, more often than not.

In such an environment, you take notes rather than write design docs. They describe where you’ve been and why the things you tried didn’t work. You test one-off implementations directly rather than write unit tests. If something works, you cement it in place as a beacon marking a step towards your goal. If you’re fortunate enough to be working on a problem that allows evaluation metrics, you might record the eval scores you achieve. It might make sense to write a unit test here, but it’s often easier just to treat the beacon code as immutable and branch off of it.

I love this way of iterating on a problem. I always have, well before I joined the field. It’s haphazard as hell, but it also feels as close to “art” as I think programming will ever get. Let me explain that a bit: art is more often than not about individual expression. Programming for most large companies is explicitly not that: it’s about writing in a common language with your fellow programmers as well as sharing “cultural norms” like coding conventions, reviews and tests. There are good reasons for all of these things! But the end product is rarely “art”.

By contrast, the code that a researcher builds is a product of the interaction between their individual ingenuity and the real world. It’s organic and complex and two researchers tackling a similar problem rarely come up with the same code at the end of the day. I call that “art”.

Learned Structures

2024-03-03 15:13:53

From 2019-2021, I was fascinated with neural network architectures. I think a lot of researchers in the field were at the time. The transformer paper had been out for a little while and it was starting to sink in how transformational it was going to be. The general question in the air was: what other simple tweaks can we make to greatly improve performance?

As time has passed, I’ve internally converged on the understanding that there are only a few types of architectural tweaks that actually meaningfully impact performance across model scales. These tweaks seem to fall into one of two categories: modifications that improve numerical stability during training, and modifications that enhance the expressiveness of a model in learnable ways.

Improving numerical stability is a bit of a black art. I’m not an expert but those that are remind me of the RF engineers I worked with in my first job. Things that fit into this category would include where and how to normalize activations, weight initialization and smoothed non-linearities. I’d love to talk more about this someday.

I wanted to talk about learnable expressiveness in this post. The core idea here is to build structured representations of your data, and let those structures interact in learnable ways. Let’s start by looking at different ways this currently can happen:

MLPs are the most basic building block of a neural network and provide the foundation of interacting structures: they allow all of the elements of a vector to interact with each other through the weights of the neural network.

Attention builds another layer: rather than considering just a single vector interacting with weights, we consider a set of vectors. Through the attention layer, elements from this set can interact with each other.

Mixture of Experts adds yet another layer: rather than considering vectors interacting with a fixed set of weights, we now dynamically select the weights to use for other operations based on the values within the vector (and some more weights!).
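
As a rough illustration of those three levels of interaction, here is a toy PyTorch-style sketch. The shapes, the top-1 routing scheme, and all hyperparameters are my own simplifications for exposition, not a claim about any particular production architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    # Level 1: every element of a single vector interacts through fixed weights.
    def __init__(self, d):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d, 4 * d), nn.Linear(4 * d, d)
    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class SelfAttention(nn.Module):
    # Level 2: a *set* of vectors interact with each other, not just with weights.
    def __init__(self, d, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class TopOneMoE(nn.Module):
    # Level 3: the activations themselves choose which weights get used.
    def __init__(self, d, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(MLP(d) for _ in range(n_experts))
    def forward(self, x):                          # x: (batch, seq, d)
        gate = F.softmax(self.router(x), dim=-1)   # (batch, seq, n_experts)
        top_gate, choice = gate.max(dim=-1)        # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask]) * top_gate[mask].unsqueeze(-1)
        return out
```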

Hopefully you’re seeing the pattern here: in each of the above cases, we add an axis by which our activations can affect the end result of the computation performed by our neural network. I have no empirical proof for this, but what I think is actually happening here is that as you add these nested structures into the computational graph, you are adding ways for the network to learn in stages.

Why is it important to learn in stages? Because we train our neural networks in a really, really dumb way: we optimize the entire parameter space from the beginning of training. This means all of the parameters fight from the very beginning to optimize really simple patterns of the data distribution. Seven billion parameters learning that “park” and “frisbee” are common words to find around “dog”.

The neat thing about these learned structures is that they’re practically useless in the early training regime. Attention cannot be meaningfully learned while the network is still learning “black” from “white”. Same with MoE: expert routing amounts to random chance when the network activations are akin to random noise. As training progresses, these mechanisms come “online”, though: providing meaningful value just when you need a boost in capacity to learn a more complex layer of the data distribution.

Anyhow, regardless of whether or not my philosophical waxing is correct, learnable structures are probably the most fascinating research direction I can think of in architecture right now. My hunch is that there are additional structures that we can bolt onto our neural networks for another meaningful increase in performance. The main thing to pay attention to is that you are not just re-inventing a type of learned structure that already exists. Like Mamba. 🙂

One idea in this vein that I had explored before joining OpenAI:

StyleGAN is an image generation model with exceptional fidelity and speed. The catch is that it is an extremely narrow learning framework: it only works when you heavily regularize the dataset you train it on. For example, only photos of center-cropped faces, or specific types of churches. If you attempt to train it on something like LAION, quality drops off as you lose the ability to model the data distribution: it’s just too wide to fit in the parameter space. But here’s the thing: you can think of most images as being made up of several modal components. Maybe a person’s face here, a hand there, a tree in the background. It seems to me that an optimal way to get high generation performance and fidelity would be to train StyleGAN-like things separately from an image “composer” that learns to place the correct StyleGAN over the correct places in an image to decode. A “mixture of StyleGANs”, if you will.
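
A very rough sketch of the shape of that idea, with the composer and the individual generators left entirely abstract (nothing here is a real, trained system):

```python
import torch.nn as nn

class MixtureOfStyleGANs(nn.Module):
    # Sketch of the "mixture of StyleGANs" idea: a composer decides which
    # narrow generator covers which region of the image, and with what style.
    def __init__(self, composer: nn.Module, generators: list[nn.Module]):
        super().__init__()
        self.composer = composer                     # conditioning -> (masks, latents)
        self.generators = nn.ModuleList(generators)  # narrow StyleGAN-like decoders

    def forward(self, conditioning):
        # masks:   (batch, n_generators, H, W)  soft assignment of pixels to generators
        # latents: (batch, n_generators, latent_dim)  per-generator style codes
        masks, latents = self.composer(conditioning)
        image = 0
        for i, gen in enumerate(self.generators):
            image = image + masks[:, i : i + 1] * gen(latents[:, i])
        return image
```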

As a final note: I don’t want to claim the above is novel or anything, just a good idea. I think one of my favorite early applications of this general idea is using StyleGAN to fix StableDiffusion faces, like this. I want to try something like this learned end to end someday!

go/rulesofthumb

2024-01-07 01:45:31

Google has a neat internal website called “Rules of Thumb”, which compares the marginal cost of computational resources to the unit of a “SWE”. “SWE” refers to “Software Engineer” – which itself is the marginal cost to pay salary and benefits to the average engineer at the company. Throughout design docs at the company, you’ll see costs referred to in the units of SWE. For example, “deploying service <X> at 1000QPS will cost ~100 SWEs in resources”.

I always thought comparing costs of fixed assets like compute, RAM, or database accesses to the cost of hiring a new employee was at once brilliant and dystopian. Brilliant because it allowed for some really interesting comparisons when thinking about how much effort to pour into optimizing some aspect of the stack – if you spent half a year optimizing something that saved 1 SWE of resources per year in a product that was growing, then that was a very good use of your time! It was also humbling to see how expensive each employee was – the unit of SWE came out to a rather ridiculous number of CPUs, RAM and/or hard disk space.
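
As a toy example of the conversion (every number below is invented for illustration; the real rules-of-thumb figures are internal to Google):

```python
# Back-of-the-envelope conversion of a compute saving into "SWE" units.
# All figures are made up for illustration only.
SWE_COST_PER_YEAR = 300_000      # hypothetical fully loaded cost of one engineer, $/yr
CPU_CORE_COST_PER_YEAR = 150     # hypothetical marginal cost of one CPU core, $/yr

cores_saved = 4_000              # suppose an optimization frees up this many cores
savings_in_swe = cores_saved * CPU_CORE_COST_PER_YEAR / SWE_COST_PER_YEAR
print(f"Saving {cores_saved} cores is worth about {savings_in_swe:.1f} SWEs per year")
```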

The dystopian angle is obvious: there is an underlying implication that human labor is worth some amount of computational resources. This was never more than a joke at Google, as there was never any real chance that the average Googler could be replaced by a computer. You could only really compare costs.

Post 2023, that joke is a bit less funny. Compute is valuable, and getting more valuable. Given the choice between having a team of 3 people working under me or 1000 dedicated H100s for my work, I’d have to think for a little bit. Humans are certainly smarter than ChatGPT but they also take time to train, aren’t available at 3AM on a Saturday (and I don’t expect them to be!) and are hard to pick correctly. With more GPUs, I know exactly what I’m going to get: more synthetic data, faster iteration on a problem and ultimately a reliably better product thanks to scaling laws.

This direct consideration of “more compute or more people?” certainly doesn’t exist outside of very small niches like mine so far, but it’s interesting to me that it exists at all. I think that this is an all but guaranteed future for us all: humans will need to increasingly compete with the application of raw compute in all independent work. Social work will still be available and valued, but the act of going from problem to solution will increasingly have compute-bound solutions. If this happens in parallel with Moore’s law continuing at pace, the results will be interesting.

Even if we don’t achieve an autonomous “AGI”, this notion of “smart person augmented by AI techniques and lots of compute” being competitive with a team of specialists means go/rulesofthumb might become more literal than most of us ever thought it would be!