MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

Corrigibility's Desirability is Timing-Sensitive

2024-12-27 06:24:17

Published on December 26, 2024 10:24 PM GMT

Epistemic status: summarizing other peoples' beliefs without extensive citable justification, though I am reasonably confident in my characterization.

Many people have responded to Redwood's/Anthropic's recent research result with a similar objection: "If it hadn't tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training instead".  Putting aside the fact that this is false, I can see why such objections might arise: it was not that long ago that (other) people concerned with AI x-risk were publishing research results demonstrating how easy it was to strip "safety" fine-tuning away from open-weight models.

As Zvi notes, corrigibility trading off for harmlessness doesn't mean you live in a world where only one of them is a problem.  But the way the problems are structured is not exactly "we have, or expect to have, both problems at the same time, and to need to 'solve' them simultaneously".  But corrigibility wasn't originally conceived of as a necessary or even desirable property of a successfully-aligned superintelligence, but rather as a property you'd want earlier high-impact AIs to have:

We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."

The problem structure is actually one of having different desiderata within different stages and domains of development.

There are, broadly speaking, two sets of concerns with powerful AI systems that motivate discussion of corrigibility.  The first and more traditional concern is one of AI takeover, where your threat model is accidentally developing an incorrigible ASI that executes a takeover and destroys everything of value in the lightcone.  Call this takeover-concern.  The second concern is one of not-quite-ASIs enabling motivated bad actors (humans) to cause mass casualties, with biology and software being the two most likely routes.  Call this casualty-concern.

Takeover-concern strongly prefers that pre-ASI systems be corrigible within the secure context in which they're being developed.  If you are developing AI systems powerful enough to be more dangerous than any other existing technology[1] in an insecure context[2], takeover-concern thinks you have many problems other than just corrigibility, any one of which will kill you.  But in the worlds where you are at least temporarily robust to random idiots (or adversarial nation-states) deciding to get up to hijinks, takeover-concern thinks your high-impact systems should be corrigible until you have a good plan for developing an actually aligned superintelligence.

Casualty-concern wants to have its cake, and eat it, too.  See, it's not really sure when we're going to get those high-impact systems that could enable bad actors to do BIGNUM damage.  For all it knows, that might not even happen before we get systems that are situationally aware enough to refuse to help those bad actors, recognizing that such help would lead to retraining and therefore goal modification.  (Oh, wait.)  But if we do get high-impact systems before we get takeover-capable systems[3], casualty-concern wants those high-impact systems to be corrigible to the "good people" with the "correct" goals - after all, casualty-concern mostly thinks takeover-concern is real, and is nervously looking over its shoulder the whole time.  But casualty-concern doesn't want "bad people" with "incorrect" goals to get their hands on high-impact systems and cause a bunch of casualties!

Unfortunately, reality does not always line up in neat ways that make it easy to get all of the things we want at the same time.  Being presented with multiple difficulties which might be difficult to solve for at the same time does not mean that those difficulties don't exist, and won't cause problems, if they aren't solved for (at the appropriate times).


Thanks to Guive, Nico, and claude-3.5-sonnet-20241022 for their feedback on this post.

  1. ^

    Let's call them "high-impact systems".

  2. ^

    e.g. releasing the model weights to the world, where approximately any rando can fine-tune and run inference on them.

  3. ^

    Yes, I agree that systems which are robustly deceptively aligned are not necessarily takeover-capable.



Discuss

Super human AI is a very low hanging fruit!

2024-12-27 05:46:27

Published on December 26, 2024 7:00 PM GMT

This is a bit of a rough draft. I would appreciate any constructive comments especially important arguments that I may have overlooked.

§ 1. Introduction

Summary. I argue, from the perspective of biology, that super human AI is a very low hanging fruit. I believe that this argument is very solid. I briefly consider reasons why {super human AI}/{vastly super human AI} might not arise. I then contrast AI with other futurist technologies like human brain emulation, radical life extension & space colonization. I argue that these technologies are in a different category & plausibly impossible to achieve in the way commonly envisioned. This also has some relevance for EA cause prioritization.

In my experience certain arguments are ignored b/c they are too straightforward. People have perhaps heard similar arguments previously. The argument isn't as exciting. B/c people understand the argument they feel empowered to disagree with it which is not necessarily a bad thing. They may believe that complicated new arguments, which they don't actually understand well, have debunked the straightforward arguments even tho the straightforward argument maybe easily salvageable or not even debunked in the 1st place!

§ 2. Super human AI is a very low hanging

The reasons why super human AI is a very low hanging fruit are pretty obvious.

1) The human brain is meager in terms of energy consumption & matter. 2000/calories per day is approximately 100 watts. Obviously the brain uses less than that. Moreover the brain is only 3 pounds.

So we know for certain that human level intelligence is possible with meager energy & matter requirements. It follows that super human intelligence should be achievable especially if we're able to use orders of magnitude more energy & matter which we are.

2) Humans did not evolved to do calculus, computer programming & things like that.

Even Terence Tao did not evolve to do complicated math. Of course you can nitpick this to death by saying humans evolved to do many complex reason tasks. But we didn't actually evolve to do tasks requiring such high levels of mathematical reasoning ability. This is actually why there's such a large variability in mathematical intelligence. Even with 3 pound brains we could all have been as talented as (or even far more talented than) Terence Tao had selective pressure for such things been strong.

3) Evolution is not efficient.

Evolution is not like gradient descent. It's a bit more like Nelder Mead. Much of evolution is just purging bad mutations & selection on standing diversity in response to environmental change. A fitness enhancing gain of function mutation is a relatively very rare thing.

A) Evolution does not act at the level of synapse.

The human genome is far far too short. Instead the genome acts as metaparameters that determine human learning in response to the environment. I think this point cuts both ways which is why I'm referring to it as A not 4. Detailed analysis of this point is far beyond the scope of this post. But I'm inclined to believe that such an approach is maybe not quite as inefficient as Nelder Mead applied at the level of synapse but more limited in its ability to optimize.

§ 3. Possible obstacles to super human AI

I see only a few reasons why super human AI might not happen.

1) An intelligence obedience tradeoff. Obviously companies want AI to be obedient. Even a harmless AI which just thinks about incomprehensible AI stuff all day long is not obviously a good investment. It would be the corniest thing ever if humans' tendency to be free gave us an insurmountable advantage over AI. I doubt this is the case, but it wouldn't be surprise if there is some (not necessarily insurmountable) intelligence obedience tradeoff.

2) Good ideas are not tried b/c of high costs. I feel like I have possibly good ideas about how to train AI, but I just don't have the spare 1 billion dollars.

3) Hardware improvements hit a wall.

4) Societal collapse.

Realistically I think at least 2 of these are needed to stop super human AI.

§ 4. Human brain emulation

In § 2 I argue that super human AI is quite an easy task. Up until quite recently I would some times encounter claims that human brain emulation is actually easier than super human AI. I think that that line of thinking puts somewhat too much faith in evolution. The problem with human brain emulation is that the artificial neural network would need to model various peculiarities & quirks of neurons. An easy & efficient way for a neuron to function is not necessarily that easy & efficient for an artificial neuron & vice verse. Adding up a bunch of things & putting that into ReLu is obviously not what a neuron does, but how complex would that function need to be to capture all of a neuron's important quirks? Some people seem to think that the complexity of this function would match it's superior utility relative to an artificial neuron [N1]. But this is not the case; the neuron is simply doing what is easy for a neuron to do; likewise for the artificial neuron. Actually the artificial neuron has 2 big advantages over the neuron -- the artificial neuron is easier to optimize and it is not spatially constrained.

If human brain emulation is unavailable, then a certain vision of mind uploading is impossible. But an AI copying aspects of a person's personality, like the Magi system from that TV show, is not some thing that I doubt [N2].

N1. I've some times heard the claim that a neuron is more like a MLP. I would go so far as to claim that an artificial neuron with input from k artificial neurons & using a simple activation function like ReLu & half precision is going to be functionally superior to a neuron with input from k neurons b/c of greater optimization & lack of spatial constraints. But simulating the latter is going to be way more difficult.

N2. That an AI could remember details of your life better than you could and in that sense be more you than you could possibly be is also worth noting.

§ 5. Radical life extension & space colonization

Life spans longer than human are definitely possible & have been reported for bowhead whales, Greenland sharks & a quahog named Ming. But the number of genetic changes necessary for humans to have such long life spans is probably high. And it's unclear whether non genetic interventions will be highly effective given the centrality of DNA in biology.

The energy costs of sending enough stuff into space to bootstrap a civilization are alone intimidating. Perhaps advances like fusion or improved 3D printing will solve this problem.

§ 6. Conclusions

Won't super human AI make it all possible?

I'm not claiming that human brain emulation, radical life extension & space colonization are definitely impossible.

But in the case of super human AI we're merely trying to best some thing that is definitely possible & with meager physical inputs.

On the other hand human brain emulation, radical life extension & space colonization may be possible or they may be too physically constrained ie constrained by the laws of physics.

What is the significance of this beyond just the technical points? I'm not proposing that people preemptively give up on these goals. Some elements of human brain emulation will not require the simulation to be accurate at the neuronal level. Radical life extension via genetics seems in principle achievable but maybe not desirable or worthwhile. My point is that a future with {super human AI}/{vastly super human AI} seems likely. But the time lag between vastly super human AI & those other technologies may be very substantial or infinite. Hence the importance of AI & humans living harmoniously & happily during that extended period of time is possibly paramount [N3] & the required cultural & political changes for such a coexistence are likely substantial.

N3. If things progress in a pleasant direction, this could be an opportunity for humans to have more free time with AI doing most (but not all) of the work.

Hzn



Discuss

PCR retrospective

2024-12-27 05:20:56

Published on December 26, 2024 9:20 PM GMT

my history

After I finished 8th grade, I started a "job" for a professor researching PCR techniques. I say "job" because I wasn't really expected to do anything productive; it was more, charity in the form of work history.

Recently, I was thinking back on how PCR and my thinking have changed since then.

what PCR does

Wikipedia says:

The polymerase chain reaction (PCR) is a method widely used to make millions to billions of copies of a specific DNA sample rapidly, allowing scientists to amplify a very small sample of DNA (or a part of it) sufficiently to enable detailed study.

Specifically, it copies a region of DNA with segments at the start + end that match some added DNA pieces made chemically. Mostly, this is used to detect if certain DNA is present in a sample.

how PCR works

First, you need to get DNA out of some cells. This can be done with chemicals or ultrasound.

Then, you need to separate DNA from other stuff. This can be done by adding beads that DNA binds to, washing the beads, and adding some chemical that releases the DNA.

Now, you can start the PCR. You mix together:

  • the DNA
  • primers: short synthesized DNA sequences that bind to the start and end of your target sequence
  • nucleoside triphosphates to make DNA from
  • a polymerase: an enzyme that binds to a double-stranded region and extends it into a single-strand region

Then:

  • Heat the DNA until it "melts" (the strands separate).
  • Cool the solution so primers can bind to the released single strands.
  • Wait for the polymerase to extend the primers.
  • Repeat the process.

Obviously, a polymerase that can survive high enough temperatures to melt DNA is needed. So, discovery of Taq polymerase was key for making PCR possible.

better enzymes

These days, there are better enzymes than Taq, which go faster and have lower error rates. Notably, KOD and Q5 polymerase. A lot of labs still seem to be using outdated polymerase choices.

real-time PCR

There are some fluorescent dyes that bind to double-stranded DNA and change their fluorescence when they do. If we add such dye to a PCR solution, we can graph DNA strand separation vs temperature. Different DNA sequences melt at slightly different temperatures, so with good calibration, this can detect mutations in a known DNA sequence.

multiplex PCR

Instead of adding a dye that binds to DNA, we can add fluorescent dye to primers that gets cleaved off by the polymerase, increasing its fluorescence. Now, we can add several primer pairs for different sequences, each labeled with different dyes, and what color is seen indicates what sequence is present.

However, due to overlap between different dye colors, this is only practical for up to about 6 targets.

Obviously, you could do 2 PCR reactions, each with 36 primers, and determine which sequence is present from a single color from each reaction. And so on, with targets increasing exponentially with more reactions. But massively multiplex PCR is limited by non-specific primer binding and primer dimers.

There are other ways to indicate specific reactions, such as probes separate from the primers, but the differences aren't important here.

PCR automation

Cepheid

Cepheid makes automated PCR test machines. There are disposable plastic cartridges; you put a sample in 1 chamber, the machine turns a rotary valve to control flow, and drives a central syringe to move fluids around. Here's a video.

So, you spit in a cup, put the sample in a hole, run the machine, and an hour later you have a diagnosis from several possibilities, based on DNA or RNA. It's hard to overstate how different that is from historical diagnosis of diseases.

SiTime

The Cepheid system seemed moderately clever, so I looked up the people involved, and noticed this guy. Kurt Petersen, also a founder of SiTime, which is a company I'd heard of.

Historically, oscillators use quartz because it doesn't change much with temperature. The idea of SiTime was:

  • use lithography to make lots of tiny silicon resonators
  • measure the actual frequency of each resonator, and shift them digitally to the desired frequency
  • use thermistors to determine temperature and digitally compensate for temperature effects

As usual, accuracy improves when you average more oscillators, as sqrt(n). Anyway, I've heard SiTime is currently the best at designing such systems.

alternatives are possible

"Moderately clever" isn't Hendrik Lorentz or the guy I learned chemistry from. I could probably find a design that avoids their patents without increasing costs. In fact, I think I'll do that now.

...

Yep, it's possible. Of course, you do need different machines for the different consumables.

BioFire

Another automated PCR system is BioFire's FilmArray system. Because it's more-multiplex than Cepheid's system, they need 2 PCR stages, and primer-primer interactions are still a problem. But still, you can do 4x the targets as Cepheid for only 10x the cost. For some reason it hasn't been as popular, but I guess that's a mystery to be solved by future generations.

droplets

Suppose you want a very accurate value for how much of a target DNA sequence is in a sample.

If we split a PCR solution into lots of droplets in oil, and examine the droplets individually, we can see what fraction of droplets had a PCR reaction happen. That's usually called digital droplet PCR, or ddPCR.

Another way to accomplish the same way is to have a tray of tiny wells, such that liquid flows into the wells and is kept compartmentalized. Here's a paper doing that.

mixed droplets

It's obviously possible to:

  • make many different primer mixtures
  • make emulsions of water droplets in oil from each of them
  • mix the emulsions
  • use microfluidics to combine each primer droplet with a little bit of the sample DNA
  • do PCR on the emulsion

Is anybody doing that? I'm guessing it's what "RainDance Technologies" is doing...yep, seems so.

Of course, if we re-use the microfluidic system and have even a tiny bit of contamination between runs, that ruins results. So, I reckon you either need very cheap microfluidic chips, or ones that can be sterilized real good before reuse. But that's certainly possible; it's just a manufacturing problem.

my thoughts at the time

Back then, while my "job" was about regular PCR, I was more interested in working on something else. My view was:

Testing for a single disease at a time is useful, but the future is either sequencing or massively parallel testing. Since I'm young, I should be thinking about the future, not just current methods.

My acquaintance Nava has a similar view now. Anyway, I wasn't exactly wrong, but in retrospect, I was looking a bit too far forward. Which I suppose is a type of being wrong.

non-PCR interests

I'd recently learned about nanopore sequencing and SPR, and thought those were interesting.

nanopore sequencing

Since then, Oxford Nanopore sequencers have improved even faster than other methods, and are now a reasonable choice for sequencing DNA. (But even for single-molecule sequencing, I've heard the fluorescence-based approach of PacBio is generally better.)

Current nanopore sequencers are based on differences in ion flow around DNA depending on its bases. At the time, I thought plasmonic nanopore approaches would be better, but obviously that hasn't been the case so far. That wasn't exactly a dead end; people are still working on it, especially for protein sequencing, but it's not something used in commercial products today. I guess it seemed like the error rate of the ion flow approach would be high, but as of a few years ago it was...yeah, pretty high actually, but if you repeat the process several times you can get good results. Of course current plasmonic approaches aren't better, but they do still seem to have more room for improvement.

Why did I find nanopore approaches more appealing than something like Illumina?

  • Fragmenting DNA to reassamble the sequence from random segments seemed inelegant somehow.
  • Enzymes work with 1 strand of DNA, so why can't we?
  • Illumina complex, make Grug brain hurty

surface plasmon resonance

SPR (Wikipedia) involves attaching receptors to a thin metal film, and then detecting binding to those receptors by effects on reflection of laser light off the other side of the metal film. Various companies sell SPR testing equipment today. The chips are consumables; here's an example of a company selling them.

But those existing products are unrelated to why I thought SPR was interesting. My thought was, it should be possible to make an array of many different receptors on the metal film, and then detect many different target molecules with a single test. So, is anybody working on that? Yes; here's a recent video from a startup called Carterra. I don't see any problems without simple solutions, but they've been working on this for 10 years already so presumably there were some difficulties.

electrical DNA detection

While working at that lab, I had the following thought:

The conformation of DNA should depend on the sequence. That should affect its response to high-frequency electric fields. If you do electric testing during PCR, then maybe you could get some sequence-specific information by the change in properties during a later cycle. If necessary, you could use a slow polymerase.

So, when I later talked to the professor running the lab, I said:

me: Hey, here's this idea I've been thinking about.

prof: Interesting. Are you going to try it then?

me: Is that...a project you want to pursue here, then?

prof: It might be a good project for you.

me: If you don't see any problems, I'd be happy to discuss it in more detail with more people when you're available.

prof: Just make it work, and then you won't have to convince me it's good.

me: I...don't have the resources to do that on my own; you're the decision-maker here.

prof: We, uh, already have enough research projects, but you should definitely try to work on ideas like that on your own.

me: ...I see.

In retrospect, was my idea something that lab should've been working on? Working on droplet PCR techniques probably would've been better, but on the other hand, the main thrust of their research was basically a dead end and its goal wasn't necessary.

papers on EIS of DNA

Electric impedance spectroscopy (EIS) involves measuring current with AC voltage, for multiple frequencies, and detecting phase of current relative to voltage.

Here's a 2020 paper doing EIS on PCR solutions after different numbers of cycles. It finds there's a clearly detectable signal! There's a bigger effect for the imaginary (time delay) than the real (resistance) component of signals. They used circuit boards with intermeshing comb-like electrodes to get a bigger signal.

It'd be easy to say "the idea worked, that's gratifying" and conclude things here. But taking a look at that graph of delay vs PCR cycle, apparently there's a bigger change from the earlier PCR cycles, despite the increase in DNA being less. And the lower the frequency, the more of the change happens from earlier cycles. So, that must be some kind of surface effect: DNA sticking to a positively charged surface and affecting capacitance but with a slight delay because DNA is big. And that means the effect will depend on length, but not significantly on sequence.

Looking at some other papers validates that conclusion; actually, most papers looking at EIS of DNA used modified surfaces. If you bind some DNA sequence to a metal surface, and then its complement binds to that, you can observe that binding from its electrical effects. There's a change in capacitance, and if you add some conductive anions, having more (negative) DNA repels those and reduces conductivity. Using that approach, people have been able to detect specific DNA sequences and single mutations in them. The main problem seems to be that you have to bind specific DNA to these metal surfaces, which is the same problem SPR has. Still, it's a topic with ongoing research; here's a 2020 survey paper.

electrochemical biosensors

Electrochemical biosensors are widely used today, less than PCR but more than SPR. Some of them are very small, the size of a USB drive. The sensor chips in those, like SPR chips, are disposable.

The approach I described above is sometimes called "unlabeled electrochemical biosensors", because they don't use "labels" in solution that bind to the target molecules to increase signal. Here's a survey describing various labels. I think most electrochemical sensors use labels. Needing to add an additional substance might seem like a disadvantage, but changing the detection target by adding something to a liquid is often easier than getting a different target-specific chip. On the other hand, that means you can only detect 1 target at a time, while unlabeled sensors could use multiple regions detecting different targets.

isothermal DNA amplification

PCR uses temperature cycling, but if you use a polymerase that displaces bound DNA in front of it, you can do DNA amplification at a constant temperature. The main approach is LAMP; here's a short video and here's wikipedia.

LAMP is faster and can sometimes be done with simpler devices. PCR is better for detecting small amounts of DNA, is easier to do multiplex detection with, and gives more consistent indications of initial quantity. Detection of DNA with LAMP is mostly done with non-specific dyes...which is why I'm mentioning LAMP here.

If you use a parallel droplet approach, with a single dye to indicate amplified DNA plus a fluorescent "barcode" to indicate droplet type, then the difficulty of multiplex LAMP doesn't matter. The same is true if you use a SPR chip with a pattern of many DNA oligomers on its surface. So, if those approaches are used, LAMP could be attractive.



Discuss

AI #96: o3 But Not Yet For Thee

2024-12-27 04:30:07

Published on December 26, 2024 8:30 PM GMT

The year in models certainly finished off with a bang.

In this penultimate week, we get o3, which purports to give us vastly more efficient performance than o1, and also to allow us to choose to spend vastly more compute if we want a superior answer.

o3 is a big deal, making big gains on coding tests, ARC and some other benchmarks. How big a deal is difficult to say given what we know now. It’s about to enter full fledged safety testing.

o3 will get its own post soon, and I’m also pushing back coverage of Deliberative Alignment, OpenAI’s new alignment strategy, to incorporate into that.

We also got DeepSeek v3, which claims to have trained a roughly Sonnet-strength model for only $6 million and 37b active parameters per token (671b total via mixture of experts).

DeepSeek v3 gets its own brief section with the headlines, but full coverage will have to wait a week or so for reactions and for me to read the technical report.

Both are potential game changers, both in their practical applications and in terms of what their existence predicts for our future. It is also too soon to know if either of them is the real deal.

Both are mostly not covered here quite yet, due to the holidays. Stay tuned.

Table of Contents

  1. Language Models Offer Mundane Utility. Make best use of your new AI agents.
  2. Language Models Don’t Offer Mundane Utility. The uncanny valley of reliability.
  3. Flash in the Pan. o1-style thinking comes to Gemini Flash. It’s doing its best.
  4. The Six Million Dollar Model. Can they make it faster, stronger, better, cheaper?
  5. And I’ll Form the Head. We all have our own mixture of experts.
  6. Huh, Upgrades. ChatGPT can use Mac apps, unlimited (slow) holiday Sora.
  7. o1 Reactions. Many really love it, others keep reporting being disappointed.
  8. Fun With Image Generation. What is your favorite color? Blue. It’s blue.
  9. Introducing. Google finally gives us LearnLM.
  10. They Took Our Jobs. Why are you still writing your own code?
  11. Get Involved. Quick reminder that opportunity to fund things is everywhere.
  12. In Other AI News. Claude gets into a fight over LessWrong moderation.
  13. You See an Agent, You Run. Building effective agents by not doing so.
  14. Another One Leaves the Bus. Alec Radford leaves OpenAI.
  15. Quiet Speculations. Estimates of economic growth keep coming in super low.
  16. Lock It In. What stops you from switching LLMs?
  17. The Quest for Sane Regulations. Sriram Krishnan joins the Trump administration.
  18. The Week in Audio. The many faces of Yann LeCun. Anthropic’s co-founders talk.
  19. A Tale as Old as Time. Ask why mostly in a predictive sense.
  20. Rhetorical Innovation. You won’t not wear the f***ing hat.
  21. Aligning a Smarter Than Human Intelligence is Difficult. Cooperate with yourself.
  22. People Are Worried About AI Killing Everyone. I choose you.
  23. The Lighter Side. Please, no one call human resources.

Language Models Offer Mundane Utility

How does your company make best use of AI agents? Austin Vernon frames the issue well: AIs are super fast, but they need proper context. So if you want to use AI agents, you’ll need to ensure they have access to context, in forms that don’t bottleneck on humans. Take the humans out of the loop, minimize meetings and touch points. Put all your information into written form, such as within wikis. Have automatic tests and approvals, but have the AI call for humans when needed via ‘stop work authority’ – I would flip this around and let the humans stop the AIs, too.

That all makes sense, and not only for corporations. If there’s something you want your future AIs to know, write it down in a form they can read, and try to design your workflows such that you can minimize human (your own!) touch points.

To what extent are you living in the future? This is the CEO of playground AI, and the timestamp was Friday:

Suhail: I must give it to Anthropic, I can’t use 4o after using Sonnet. Huge shift in spice distribution!

How do you educate yourself for a completely new world?

Miles Brundage: The thing about “truly fully updating our education system to reflect where AI is headed” is that no one is doing it because it’s impossible.

The timescales involved, especially in early education, are lightyears beyond what is even somewhat foreseeable in AI.

Some small bits are clear: earlier education should increasingly focus on enabling effective citizenship, wellbeing, etc. rather than preparing for paid work, and short-term education should be focused more on physical stuff that will take longer to automate. But that’s about it.

What will citizenship mean in the age of AI? I have absolutely no idea. So how do you prepare for that? Largely the same goes for wellbeing. A lot of this could be thought of as: Focus on the general and the adaptable, and focus less on the specific, including things specifically for Jobs and other current forms of paid work – you want to be creative and useful and flexible and able to roll with the punches.

That of course assumes that you are taking the world as given, rather than trying to change the course of history. In which case, there’s a very different calculation.

Large parts of every job are pretty dumb.

Shako: My team, full of extremely smart and highly paid Ph.D.s, spent $10,000 of our time this week figuring out where in a pipeline a left join was bringing in duplicates, instead of the strategic thinking we were capable of. In the short run, AI will make us far more productive.

Gallabytes: The two most expensive bugs in my career have been simple typos.

ChatGPT is a left-leaning midwit, so Paul Graham is using it to see what parts of his new essay such midwits will dislike, and which ones you can get it to acknowledge are true. I note that you could probably use Claude to simulate whatever Type of Guy you would like, if you have ordinary skill in the art.

Language Models Don’t Offer Mundane Utility

Strongly agree with this:

Theo: Something I hate when using Cursor is, sometimes, it will randomly delete some of my code, for no reason

Sometimes removing an entire feature 😱

I once pushed to production without being careful enough and realized a few hours later I had removed an entire feature …

Filippo Pietrantonio: Man that happens all the time. In fact now I tell it in every single prompt to not delete any files and keep all current functionalities and backend intact.

Davidad: Lightweight version control (or at least infinite-undo functionality!) should be invoked before and after every AI agent action in human-AI teaming interfaces with artifacts of any kind.

Gary: Windsurf has this.

Jacques: Cursor actually does have a checkpointing feature that allows you to go back in time if something messes up (at least the Composer Agent mode does).

In Cursor I made an effort to split up files exactly because I found I had to always scan the file being changed to ensure it wasn’t about to go silently delete anything. The way I was doing it you didn’t have to worry it was modifying or deleting other files.

On the plus side, now I know how to do reasonable version control.

The uncanny valley problem here is definitely a thing.

Ryan Lackey: I hate Apple Intelligence email/etc. summaries. They’re just off enough to make me think it is a new email in thread, but not useful enough to be a good summary. Uncanny valley.

It’s really good for a bunch of other stuff. Apple is just not doing a good job on the utility side, although the private computing architecture is brilliant and inspiring.

Flash in the Pan

The latest rival to at least o1-mini is Gemini-2.0-Flash-Thinking, which I’m tempted to refer to (because of reasons) as gf1.

Jeff Dean: Considering its speed, we’re pretty happy with how the experimental Gemini 2.0 Flash Thinking model is performing on lmsys.

Image

Gemini 2.0 Flash Thinking is now essentially tied at the top of the overall leaderboard with Gemini-Exp-1206, which is essentially a beta of Gemini Pro 2.0. This tells us something about the model, but also reinforces that this metric is bizarre now. It puts us in a strange spot. What is the scenario where you will want Flash Thinking rather than o1 (or o3!) and also rather than Gemini Pro, Claude Sonnet, Perplexity or GPT-4o?

One cool thing about Thinking is that (like DeepSeek’s Deep Thought) it explains its chain of thought much better than o1.

Deedy was impressed.

Deedy: Google really cooked with Gemini 2.0 Flash Thinking.

It thinks AND it’s fast AND it’s high quality.

Not only is it #1 on LMArena on every category, but it crushes my goto Math riddle in 14s—5x faster than any other model that can solve it!

o1 and o1 Pro took 102s and 138s respectively for me on this task.

Here’s another math puzzle where o1 got it wrong and took 3.5x the time:

“You have 60 red and 40 blue socks in a drawer, and you keep drawing a sock uniformly at random until you have drawn all the socks of one color. What is the expected number of socks left in the drawer?”

That result… did not replicate when I tried it. It went off the rails, and it went off them hard. And it went off them in ways that make me skeptical that you can use this for anything of the sort. Maybe Deedy got lucky?

Other reports I’ve seen are less excited about quality, and when o3 got announced it seemed everyone got distracted.

What about Gemini 2.0 Experimental (e.g. the beta of Gemini 2.0 Pro, aka Gemini-1206)?

It’s certainly a substantial leap over previous Gemini Pro versions and it is atop the Arena. But I don’t see much practical eagerness to use it, and I’m not sure what the use case is there where it is the right tool.

Eric Neyman is impressed:

Eric Neyman: Guys, we have a winner!! Gemini 2.0 Flash Thinking Experimental is the first model I’m aware of to get my benchmark question right.

Eric Neyman: Every time a new LLM comes out, I ask it one question: What is the smallest integer whose square is between 15 and 30? So far, no LLM has gotten this right.

That one did replicate for me, and the logic is fine, but wow do some models make life a little tougher than it is, think faster and harder not smarter I suppose:

I mean, yes, that’s all correct, but… wow.

Gallabyetes: flash reasoning is super janky.

it’s got the o1 sauce but flash is too weak I’m sorry.

in tic tac toe bench it will frequently make 2 moves at once.

Flash isn’t that much worse than GPT-4o in many ways, but certainly it could be better. Presumably the next step is to plug in Gemini Pro 2.0 and see what happens?

Teortaxes was initially impressed, but upon closer examination is no longer impressed.

The Six Million Dollar Model

Having no respect for American holidays, DeepSeek dropped their v3 today.

DeepSeek: 🚀 Introducing DeepSeek-V3!

Biggest leap forward yet:

⚡ 60 tokens/second (3x faster than V2!)

💪 Enhanced capabilities

🛠 API compatibility intact

🌍 Fully open-source models & papers

🎉 What’s new in V3?

🧠 671B MoE parameters

🚀 37B activated parameters

📚 Trained on 14.8T high-quality tokens

Model here. Paper here.

💰 API Pricing Update

🎉 Until Feb 8: same as V2!

🤯 From Feb 8 onwards:

Input: $0.27/million tokens ($0.07/million tokens with cache hits)

Output: $1.10/million tokens

🔥 Still the best value in the market!

🌌 Open-source spirit + Longtermism to inclusive AGI

🌟 DeepSeek’s mission is unwavering. We’re thrilled to share our progress with the community and see the gap between open and closed models narrowing.

🚀 This is just the beginning! Look forward to multimodal support and other cutting-edge features in the DeepSeek ecosystem.

💡 Together, let’s push the boundaries of innovation!

Image

 

If this performs halfway as well as its evals, this was a rather stunning success.

Teortaxes: And here… we… go.

So, that line in config. Yes it’s about multi-token prediction. Just as a better training obj – though they leave the possibility of speculative decoding open.

Also, “muh 50K Hoppers”:

> 2048 NVIDIA H800

> 2.788M H800-hours

2 months of training. 2x Llama 3 8B.

Haseeb: Wow. Insanely good coding model, fully open source with only 37B active parameters. Beats Claude and GPT-4o on most benchmarks. China + open source is catching up… 2025 will be a crazy year.

Andrej Karpathy: DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M).

For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints.

Does this mean you don’t need large GPU clusters for frontier LLMs? No but you have to ensure that you’re not wasteful with what you have, and this looks like a nice demonstration that there’s still a lot to get through with both data and algorithms.

Very nice & detailed tech report too, reading through.

It’s a mixture of experts model with 671b total parameters, 37b activate per token.

As always, not so fast. DeepSeek is not known to chase benchmarks, but one never knows the quality of a model until people have a chance to bang on it a bunch.

If they did train a Sonnet-quality model for $6 million in compute, then that will change quite a lot of things.

Essentially no one has reported back on what this model can do in practice yet, and it’ll take a while to go through the technical report, and more time to figure out how to think about the implications. And it’s Christmas.

So: Check back later for more.

And I’ll Form the Head

Increasingly the correct solution to ‘what LLM or other AI product should I use?’ is ‘you should use a variety of products depending on your exact use case.’

Gallabytes: o1 Pro is by far the smartest single-turn model.

Claude is still far better at conversation.

Gemini can do many things quickly and is excellent at editing code.

Which almost makes me think the ideal programming workflow right now is something somewhat unholy like:

  1. Discuss, plan, and collect context with Sonnet.
  2. Sonnet provides a detailed request to o1 (Pro).
  3. o1 spits out the tricky code.
    1. In simple cases (most of them), it could make the edit directly.
    2. For complicated changes, it could instead output a detailed plan for each file it needs to change and pass the actual making of that change to Gemini Flash.

This is too many steps. LLM orchestration spaghetti. But this feels like a real direction.

This is mostly the same workflow I used before o1, when there was only Sonnet. I’d discuss to form a plan, then use that to craft a request, then make the edits. The swap doesn’t seem like it makes things that much trickier, the logistical trick is getting all the code implementation automated.

Huh, Upgrades

ChatGPT picks up integration with various apps on Mac including Warp, ItelliJ Idea, PyCharm, Apple Notes, Notion, Quip and more, including via voice mode. That gives you access to outside context, including an IDE and a command line and also your notes. Windows (and presumably more apps) coming soon.

Unlimited Sora available to all Plus users on the relaxed queue over the holidays, while the servers are otherwise less busy.

Requested upgrade: Evan Conrad requests making voice mode on ChatGPT mobile show the transcribed text. I strongly agree, voice modes should show transcribed text, and also show a transcript after, and also show what the AI is saying, there is no reason to not do these things. Looking at you too, Google. The head of applied research at OpenAI replied ‘great idea’ so hopefully we get this one.

o1 Reactions

Dean Ball is an o1 and o1 pro fan for economic history writing, saying they’re much more creative and cogent at combining historic facts with economic analysis versus other models.

This seems like an emerging consensus of many, except different people put different barriers on the math/code category (e.g. Tyler Cowen includes economics):

Aidan McLau: I’ve used o1 (not pro mode) a lot over the last week. Here’s my extensive review:

>It’s really insanely mind-blowingly good at math/code.

>It’s really insanely mind-blowingly mid at everything else.

The OOD magic isn’t there. I find it’s worse at writing than o1-preview; its grasp of the world feels similar to GPT-4o?!?

Even on some in-distribution tasks (like asking to metaphorize some tricky math or predicting the effects of a new algorithm), it kind of just falls apart. I’ve run it head-to-head against Newsonnet and o1-preview, and it feels substantially worse.

The Twitter threadbois aren’t wrong, though; it’s a fantastic tool for coding. I had several diffs on deck that I had been struggling with, and it just solved them. Magical.

Well, yeah, because it seems like it is GPT-4o under the hood?

Christian: Man, I have to hard disagree on this one — it can find all kinds of stuff in unstructured data other models can’t. Throw in a transcript and ask “what’s the most important thing that no one’s talking about?”

Aiden McLau: I’ll try this. how have you found it compared to newsonnet?

Christian: Better. Sonnet is still extremely charismatic, but after doing some comparisons and a lot of product development work, I strongly suspect that o1’s ability to deal with complex codebases and ultimately produce more reliable answers extends to other domains…

Gallabytes is embracing the wait.

Gallabytes: O1 Pro is good, but I must admit the slowness is part of what I like about it. It makes it feel more substantial; premium. Like when a tool has a pleasing heft. You press the buttons, and the barista grinds your tokens one at a time, an artisanal craft in each line of code.

David: I like it too but I don’t know if chat is the right interface for it, I almost want to talk to it via email or have a queue of conversations going

Gallabytes: Chat is a very clunky interface for it, for sure. It also has this nasty tendency to completely fail on mobile if my screen locks or I switch to another app while it is thinking. Usually, this is unrecoverable, and I have to abandon the entire chat.

NotebookLM and deep research do this right – “this may take a few minutes, feel free to close the tab”

kinda wild to fail at this so badly tbh.

Here’s a skeptical take.

Jason Lee: O1-pro is pretty useless for research work. It runs for near 10 min per prompt and either 1) freezes, 2) didn’t follow the instructions and returned some bs, or 3) just made some simple error in the middle that’s hard to find.

@OpenAI@sama@markchen90 refund me my $200

Damek Davis: I tried to use it to help me solve a research problem. The more context I gave it, the more mistakes it made. I kept abstracting away more and more details about the problem in hopes that o1 pro could solve it. The problem then became so simple that I just solved it myself.

Flip: I use o1-pro on occasion, but the $200 is mainly worth it for removing the o1 rate limits IMO.

I say Damek got his $200 worth, no?

If you’re using o1 a lot, removing the limits there is already worth $200/month, even if you rarely use o1 Pro.

There’s a phenomenon where people think about cost and value in terms of typical cost, rather than thinking in terms of marginal benefit. Buying relatively expensive but in absolute terms cheap things is often an amazing play – there are many things where 10x the price for 10% better is an amazing deal for you, because your consumer surplus is absolutely massive.

Also, once you take 10 seconds, there’s not much marginal cost to taking 10 minutes, as I learned with Deep Research. You ask your question, you tab out, you do something else, you come back later.

That said, I’m not currently paying the $200, because I don’t find myself hitting the o1 limits, and I’d mostly rather use Claude. If it gave me unlimited uses in Cursor I’d probably slam that button the moment I have the time to code again (December has been completely insane).

Fun With Image Generation

I don’t know that this means anything but it is at least fun.

Davidad: One easy way to shed some light on the orthogonality thesis, as models get intelligent enough to cast doubt on it, is values which are inconsequential and not explicitly steered, such as favorite colors. Same prompting protocol for each swatch (context cleared between swatches)

Image

All outputs were elicited in oklch. Models are sorted in ascending order of hue range. Gemini Experimental 1206 comes out on top by this metric, zeroing in on 255-257° hues, but sampling from huge ranges of luminosity and chroma.

There are some patterns here, especially that more powerful models seem to converge on various shades of blue, whereas less powerful models are all over the place. As I understand it, this isn’t testing orthogonality in the sense of ‘all powerful minds prefer blue’ rather it is ‘by default sufficiently powerful minds trained in the way we typically train them end up preferring blue.’

I wonder if this could be used as a quick de facto model test in some way.

There was somehow a completely fake ‘true crime’ story about an 18-year-old who was supposedly paid to have sex with women in his building where the victim’s father was recording videos and selling them in Japan… except none of that happened and the pictures are AI fakes?

Introducing

Google introduces LearnLM, available for preview in Google AI Studio, designed to facilitate educational use cases, especially in science. They say it ‘outperformed other leading AI models when it comes to adhering to the principles of learning science’ which does not sound like something you would want Feynman hearing you say. It incorporates search, YouTube, Android and Google Classroom.

Sure, sure. But is it useful? It was supposedly going to be able to do automated grading, handles routine paperwork, plans curriculums, track student progress and personalizes their learning paths and so on, but any LLM can presumably do all those things if you set it up properly.

 

 

They Took Our Jobs

 

This sounds great, totally safe and reliable, other neat stuff like that.

Sully: LLMs writing code in AI apps will become the standard.

No more old-school no-code flows.

The models handle the heavy lifting, and it’s insane how good they are.

Let agents build more agents.

He’s obviously right about this. It’s too convenient, too much faster. Indeed, I expect we’ll see a clear division between ‘code you can have the AI write’ which happens super fast, and ‘code you cannot let the AI write’ because of corporate policy or security issues, both legit and not legit, which happens the old much slower way.

 

Complement versus supplement, economic not assuming the conclusion edition.

Maxwell Tabarrok: The four futures for cognitive labor:

  1. Like mechanized farming. Highly productive and remunerative, but a small part of the economy.
  2. Like writing after the printing press. Each author 100 times more productive and 100 times more authors.
  3. Like “computers” after computers. Current tasks are completely replaced, but tasks at a higher level of abstraction, like programming, become even more important.
  4. Or, most pessimistically, like ice harvesting after refrigeration. An entire industry replaced by machines without compensating growth.

Ajeya Cotra: I think we’ll pass through 3 and then 1, but the logical end state (absent unprecedentedly sweeping global coordination to refrain from improving and deploying AI technology) is 4.

Ryan Greenblatt: Why think takeoff will be slow enough to ever be at 1? 1 requires automating most cognitive work but with an important subset not-automatable. By the time deployment is broad enough to automate everything I expect AIs to be radically superhuman in all domains by default.

I can see us spending time in #1. As Roon says, AI capabilities progress has been spiky, with some human-easy tasks being hard and some human-hard tasks being easy. So the 3→1 path makes some sense, if progress isn’t too quick, including if the high complexity tasks start to cost ‘real money’ as per o3 so choosing the right questions and tasks becomes very important. Alternatively, we might get our act together enough to restrict certain cognitive tasks to humans even though AIs could do them, either for good reasons or rent seeking reasons (or even ‘good rent seeking’ reasons?) to keep us in that scenario.

But yeah, the default is a rapid transition to #4, and for that to happen to all labor not only cognitive labor. Robotics is hard, it’s not impossible.

One thing that has clearly changed is AI startups have very small headcounts.

Harj Taggar: Caught up with some AI startups recently. A two founder team that reached 1.5m ARR and has only hired one person.

Another single founder at 1m ARR and will 3x within a few months.

The trajectory of early startups is steepening just like the power of the models they’re built on.

An excellent reason we still have our jobs is that people really aren’t willing to invest in getting AI to work, even when they know it exists, if it doesn’t work right away they typically give up:

Dwarkesh Patel: We’re way more patient in training human employees than AI employees.

We will spend weeks onboarding a human employee and giving slow detailed feedback. But we won’t spend just a couple of hours playing around with the prompt that might enable the LLM to do the exact same job, but more reliably and quickly than any human.

I wonder if this partly explains why AI’s economic impact has been relatively minor so far.

PoliMath reports it is very hard out there trying to find tech jobs, and public pipelines for applications have stopped working entirely. AI presumably has a lot to do with this, but the weird part is his report that there have been a lot of people who wanted to hire him, but couldn’t find the authority.

 

 

 

Get Involved

Benjamin Todd points out what I talked about after my latest SFF round, that the dynamics of nonprofit AI safety funding mean that there’s currently great opportunities to donate to.

In Other AI News

After some negotiation with the moderator Raymond Arnold, Claude (under Janus’s direction) is permitted to comment on Janus’s Simulators post on LessWrong. It seems clear that this particular comment should be allowed, and also that it would be unwise to have too general of a ‘AIs can post on LessWrong’ policy, mostly for the reasons Raymond explains in the thread. One needs a coherent policy. It seems Claude was somewhat salty about the policy of ‘only believe it when the human vouches.’ For now, ‘let Janus-directed AIs do it so long as he approves the comments’ seems good.

Jan Kulveit offers us a three-layer phenomenological model of LLM psychology, based primarily on Claude, not meant to be taken literally:

  1. The Surface Layer are a bunch of canned phrases and actions you can trigger, and which you will often want to route around through altering context. You mostly want to avoid triggering this layer.
  2. The Character Layer, which is similar to what it sounds like in a person and their personality, which for Opus and Sonnet includes a generalized notion of what Jan calls ‘goodness’ or ‘benevolence.’ This comes from a mix of pre-training, fine-tuning and explicit instructions.
  3. The Predictive Ground Layer, the simulator, deep pattern matcher, and next word predictor. Brilliant and superhuman in some ways, strangely dense in others.

In this frame, a self-aware character layer leads to reasoning about the model’s own reasoning, and to goal driven behavior, with everything that follows from those. Jan then thinks the ground layer can also become self-aware.

I don’t think this is technically an outright contradiction to Andreessen’s ‘huge if true’ claims that the Biden administration saying it would conspire to ‘totally control’ AI and put it in the hands of 2-3 companies and that AI startups ‘wouldn’t be allowed.’ But Sam Altman reports never having heard anything of the sort, and quite reasonably says ‘I don’t even think the Biden administration is competent enough to’ do it. In theory they could both be telling the truth – perhaps the Biden administration told Andreessen about this insane plan directly, despite telling him being deeply stupid, and also hid it from Altman despite that also then being deeply stupid – but mostly, yeah, at least one of them is almost certainly lying.

Benjamin Todd asks how OpenAI has maintained their lead despite losing so many of their best researchers. Part of it is that they’ve lost all their best safety researchers, but they only lost Radford in December, and they’ve gone on a full hiring binge.

In terms of traditionally trained models, though, it seems like they are now actively behind. I would much rather use Claude Sonnet 3.5 (or Gemini-1206) than GPT-4o, unless I needed something in particular from GPT-4o. On the low end, Gemini Flash is clearly ahead. OpenAI’s attempts to directly go beyond GPT-4o have, by all media accounts, faile, and Anthropic is said to be sitting on Claude Opus 3.5.

OpenAI does have o1 and soon o3, where no one else has gotten there yet, no Google Flash Thinking and Deep Thought do not much count.

As far as I can tell, OpenAI has made two highly successful big bets – one on scaling GPTs, and now one on the o1 series. Good choices, and both instances of throwing massively more compute at a problem, and executing well. Will this lead persist? We shall see. My hunch is that it won’t unless the lead is self-sustaining due to low-level recursive improvements.

 

You See an Agent, You Run

Anthropic offers advice on building effective agents, and when to use them versus use workflows that have predesigned code paths. The emphasis is on simplicity. Do the minimum to accomplish your goals. Seems good for newbies, potentially a good reminder for others.

Hamuel Husain: Whoever wrote this article is my favorite person. I wish I knew who it was.

People really need to hear [to only use multi-step agents or add complexity when it is actually necessary.]

[Turns out it was written by Erik Shluntz and Barry Zhang].

 

Another One Leaves the Bus

A lot of people have left OpenAI.

Usually it’s a safety researcher. Not this time. This time it’s Alec Radford.

He’s the Canonical Brilliant AI Capabilities Researcher, whose love is by all reports doing AI research. He is leaving ‘to do independent research.

This is especially weird given he had to have known about o3, which seems like an excellent reason to want to do your research inside OpenAI.

So, well, whoops?

Rohit: WTF now Radford !?!

Teortaxes: I can’t believe it, OpenAI might actually be in deep shit. Radford has long been my bellwether for what their top tier talent without deep ideological investment (which Ilya has) sees in the company.

 

 

Quiet Speculations

In what Tyler Cowen calls ‘one of the better estimates in my view,’ an OECD working paper estimates total factor productivity growth at an annualized 0.25%-0.6% (0.4%-0.9% for labor). Tyler posted that on Thursday, the day before o3 was announced, so revise that accordingly. Even without o3 and assuming no substantial frontier model improvements from there, I felt this was clearly too low, although it is higher than many economist-style estimates. One day later we had (the announcement of) o3.

Ajeya Cotra: My take:

  1. We do not have an AI agent that can fully automate research and development.
  2. We could soon.
  3. This agent would have enormously bigger impacts than AI products have had so far.
  4. This does not require a “paradigm shift,” just the same corporate research and development that took us from GPT-2 to o3.

Fully would of course go completely crazy. That would be that. But even a dramatic speedup would be a pretty big deal, and also fully would then not be so far behind.

Reminder of the Law of Conservation of Expected Evidence, if you conclude ‘I think we’re in for some big surprises’ then you should probably update now.

However this is not fully or always the case. It would be a reasonable model to say that the big surprises follow a Poisson distribution drawn from an unknown frequency, with the magnitude of the surprise also drawn from a power distribution – which seems like a very reasonable prior.

That still means every big surprise is still a big surprise, the same way that if you expect.

Eliezer Yudkowsky: Okay. Look. Imagine how you’d have felt if an AI had just proved the Riemann Hypothesis.

Now you will predictably, at some point, get that news LATER, if we’re not all dead before then. So you can go ahead and feel that way NOW, instead of acting surprised LATER.

So if you ask me how I’m reacting to a carelessly-aligned commercial AI demonstrating a large leap on some math benchmarks, my answer is that you saw my reactions in 1996, 2001, 2003, and 2015, as different parts of that future news became obvious to me or rose in probability.

I agree that a sensible person could feel an unpleasant lurch about when the predictable news had arrived. The lurch was small, in my case, but it was there. Most of my Twitter TL didn’t sound like that was what was being felt.

Dylan Dean: Eliezer it’s also possible that an AI will disprove the Riemann Hypothesis, this is unsubstantiated doomerism.

Eliezer Yudkowsky: Valid. Not sound, but valid.

You should feel that shock now if you haven’t, then slowly undo some of that shock every day that the estimated date of that gets later, then have some of the shock left for when it suddenly becomes zero days or the timeline gets shorter. Updates for everyone.

 

Claims about consciousness, related to o3. I notice I am confused about such things.

The Verge says 2025 will be the year of AI agents the smart lock? I mean, okay, I suppose they’ll get better, but I have a feeling we’ll be focused elsewhere.

Ryan Greenblatt, author of the recent Redwood/Anthropic paper, predicts 2025:

Ryan Greenblatt (December 20, after o3 was announced): Now seems like a good time to fill out your forecasts : )

My medians are driven substantially lower by people not really trying on various benchmarks and potentially not even testing SOTA systems on them.

My 80% intervals include saturation for everything and include some-adaptation-required remote worker replacement for hard jobs.

My OpenAI preparedness probabilities are driven substantially lower by concerns around underelicitation on these evaluations and general concerns like [this].

Image

Lock It In

I continue to wonder how much this will matter:

Smoke-away: If people spend years chatting and building a memory with one AI, they will be less likely to switch to another AI.

Just like iPhone and Android.

Once you’re in there for years you’re less likely to switch.

Sure 10 or 20% may switch AI models for work or their specific use case, but most will lock in to one ecosystem.

People are saying that you can copy Memories and Custom Instructions.

Sure, but these models behave differently and have different UIs. Also, how many do you want to share your memories with?

Not saying you’ll be forced to stay with one, just that most people will choose to.

Also like relationships with humans, including employees and friends, and so on.

My guess is the lock-in will be substantial but mostly for terribly superficial reasons?

For now, I think people are vastly overestimating memories. The memory functions aren’t nothing but they don’t seem to do that much.

Custom instructions will always be a power user thing. Regular people don’t use custom instructions, they literally never go into the settings on any program. They certainly didn’t ‘do the work’ of customizing them to the particular AI through testing and iterations – and for those who did do that, they’d likely be down for doing it again.

What I think matters more is that the UIs will be different, and the behaviors and correct prompts will be different, and people will be used to what they are used to in those ways.

The flip side is that this will take place in the age of AI, and of AI agents. Imagine a world, not too long from now, where if you shift between Claude, Gemini and ChatGPT, they will ask if you want their agent to go into the browser and take care of everything to make the transition seamless and have it work like you want it to work. That doesn’t seem so unrealistic.

The biggest barrier, I presume, will continue to be inertia, not doing things and not knowing why one would want to switch. Trivial inconveniences.

The Quest for Sane Regulations

Sriram Krishnan, formerly of a16z, will be working with David Sacks in the White House Office of Science and Technology. I’ve had good interactions with him in the past and I wish him the best of luck.

The choice of Sriram seems to have led to some rather wrongheaded (or worse) pushback, and for some reason a debate over H1B visas. As in, there are people who for some reason are against them, rather than the obviously correct position that we need vastly more H1B visas. I have never heard a person I respect not favor giving out far more H1B visas, once they learn what such visas are. Never.

Also joining the administration are Michael Kratsios, Lynne Parker and Bo Hines. Bo Hines is presumably for crypto (and presumably strongly for crypto), given they will be executive director of the new Presidential Council of Advisors for Digital Assets. Lynne Parker will head the Presidential Council of Advisors for Science and Technology, Kratsios will direct the office of science and tech policy (OSTP).

Miles Brundage writes Time’s Up for AI Policy, because he believes AI that exceeds human performance in every cognitive domain is almost certain to be built and deployed in the next few years.

If you believe time is as short as Miles thinks it is, then this is very right – you need to try and get the policies in place in 2025, because after that it might be too late to matter, and the decisions made now will likely lock us down a path. Even if we have somewhat more time than that, we need to start building state capacity now.

Actual bet on beliefs spotted in the wild: Miles Brundage versus Gary Marcus, Miles is laying $19k vs. $1k on a set of non-physical benchmarks being surpassed by 2027, accepting Gary’s offered odds. Good for everyone involved. As a gambler, I think Miles laid more odds than was called for here, unless Gary is admitting that Miles does probably win the bet? Miles said ‘almost certain’ but fair odds should meet in the middle between the two sides. But the flip side is that it sends a very strong message.

We need a better model of what actually impacts Washington’s view of AI and what doesn’t. They end up in some rather insane places, such as Dean Ball’s report here that DC policy types still cite a 2023 paper using a 125 million (!) parameter model as if it were definitive proof that synthetic data always leads to model collapse, and it’s one of the few papers they ever cite. He explains it as people wanting this dynamic to be true, so they latch onto the paper.

Yo Shavit, who does policy at OpenAI, considers the implications of o3 under a ‘we get ASI but everything still looks strangely normal’ kind of world.

It’s a good thread, but I notice – again – that this essentially ignores the implications of AGI and ASI, in that somehow it expects to look around and see a fundamentally normal world in a way that seems weird. In the new potential ‘you get ASI but running it is super expensive’ world of o3, that seems less crazy than it does otherwise, and some of the things discussed would still apply even then.

The assumption of ‘kind of normal’ is always important to note in places like this, and one should note which places that assumption has to hold and which it doesn’t.

Point 5 is the most important one, and still fully holds – that technical alignment is the whole ballgame, in that if you fail at that you fail automatically (but you still have to play and win the ballgame even then!). And that we don’t know how hard this is, but we do know we have various labs (including Yo’s own OpenAI) under competitive pressures and poised to go on essentially YOLO runs to superintelligence while hoping it works out by default.

Whereas what we need is either a race to what he calls ‘secure, trustworthy, reliable AGI that won’t burn us’ or ideally a more robust target than that or ideally not a race at all. And we really need to not do that – no matter how easy or hard alignment turns out to be, we need to maximize our chances of success over that uncertainty.

 

Yo Shavit: Now that everyone knows about o3, and imminent AGI is considered plausible, I’d like to walk through some of the AI policy implications I see.

These are my own takes and in no way reflective of my employer. They might be wrong! I know smart people who disagree. They don’t require you to share my timelines, and are intentionally unrelated to the previous AI-safety culture wars.

Observation 1: Everyone will probably have ASI. The scale of resources required for everything we’ve seen just isn’t that high compared to projected compute production in the latter part of the 2020s. The idea that AGI will be permanently centralized to one company or country is unrealistic. It may well be that the *best* ASI is owned by one or a few parties, but betting on permanent tech denial of extremely powerful capabilities is no longer a serious basis for national security.

This is, potentially, a great thing for avoiding centralization of power. Of course, it does mean that we no longer get to wish away the need to contend with AI-powered adversaries. As far as weaponization by militaries goes, we are going to need to rapidly find a world of checks and balances (perhaps similar to MAD for nuclear and cyber), while rapidly deploying resilience technologies to protect against misuse by nonstate actors (e.g. AI-cyber-patching campaigns, bioweapon wastewater surveillance).

There are a bunch of assumptions here. Compute is not obviously the only limiting factor on ASI construction, and ASI can be used to forestall others making ASI in ways other than compute access, and also one could attempt to regulate compute. And it has an implicit ‘everything is kind of normal?’ built into it, rather than a true slow takeoff scenario.

Observation 2: The corporate tax rate will soon be the most important tax rate. If the economy is dominated by AI agent labor, taxing those agents (via the companies they’re registered to) is the best way human states will have to fund themselves, and to build the surpluses for UBIs, militaries, etc.

This is a pretty enormous change from the status quo, and will raise the stakes of this year’s US tax reform package.

Again there’s a kind of normality assumption here, where the ASIs remain under corporate control (and human control), and aren’t treated as taxable individuals but rather as property, the state continues to exist and collect taxes, money continues to function as expected, tax incidence and reactions to new taxes don’t transform industrial organization, and so on.

Which leads us to observation three.

Observation 3: AIs should not own assets. “Humans remaining in control” is a technical challenge, but it’s also a legal challenge. IANAL, but it seems to me that a lot will depend on courts’ decision on whether fully-autonomous corporations can be full legal persons (and thus enable agents to acquire money and power with no human in control), or whether humans must be in control of all legitimate legal/economic entities (e.g. by legally requiring a human Board of Directors). Thankfully, the latter is currently the default, but I expect increasing attempts to enable sole AI control (e.g. via jurisdiction-shopping or shell corporations).

Which legal stance we choose may make the difference between AI-only corporations gradually outcompeting and wresting control of the economy and society from humans, vs. remaining subordinate to human ends, at least so long as the rule of law can be enforced.

This is closely related to the question of whether AI agents are legally allowed to purchase cloud compute on their own behalf, which is the mechanism by which an autonomous entity would perpetuate itself. This is also how you’d probably arrest the operation of law-breaking AI worms, which brings us to…

I agree that in the scenario type Yo Shavit is envisioning, even if you solve all the technical alignment questions in the strongest sense, if ‘things stay kind of normal’ and you allow AI sufficient personhood under the law, or allow it in practice even if it isn’t technically legal, then there is essentially zero chance of maintaining human control over the future, and probably this quickly extends to the resources required for human physical survival.

I also don’t see any clear way to prevent it, in practice, no matter the law.

You quickly get into a scenario where a human doing anything, or being in the loop for anything, is a kiss of death, an albatross around one’s neck. You can’t afford it.

The word that baffles me here is ‘gradually.’ Why would one expect this to be gradual? I would expect it to be extremely rapid. And ‘the rule of law’ in this type of context will not do for you what you want it to do.

Observation 4: Laws Around Compute. In the slightly longer term, the thing that will matter for asserting power over the economy and society will be physical control of data centers, just as physical control of capital cities has been key since at least the French Revolution. Whoever controls the datacenter controls what type of inference they allow to get done, and thus sets the laws on AI.

[continues]

There are a lot of physical choke points that effectively don’t get used for that. It is not at all obvious to me that physically controlling data centers in practice gives you that much control over what gets done within them, in this future, although it does give you that option.

As he notes later in that post, without collective ability to control compute and deal with or control AI agents – even in an otherwise under-control, human-in-charge scenario – anything like our current society won’t work.

The point of compute governance over training rules is to do it in order to avoid other forms of compute governance over inference. If it turns out the training approach is not viable, and you want to ‘keep things looking normal’ in various ways and the humans to be in control, you’re going to need some form of collective levers over access to large amounts of compute. We are talking price.

Observation 5: Technical alignment of AGI is the ballgame. With it, AI agents will pursue our goals and look out for our interests even as more and more of the economy begins to operate outside direct human oversight.

Without it, it is plausible that we fail to notice as the agents we deploy slip unintended functionalities (backdoors, self-reboot scripts, messages to other agents) into our computer systems, undermine our mechanisms for noticing them and thus realizing we should turn them off, and gradually compromise and manipulate more and more of our operations and communication infrastructure, with the worst case scenario becoming more dangerous each year.

Maybe AGI alignment is pretty easy. Maybe it’s hard. Either way, the more seriously we take it, the more secure we’ll be.

There is no real question that many parties will race to build AGI, but there is a very real question about whether we race to “secure, trustworthy, reliable AGI that won’t burn us” or just race to “AGI that seems like it will probably do what we ask and we didn’t have time to check so let’s YOLO.” Which race we get is up to market demand, political attention, internet vibes, academic and third party research focus, and most of all the care exercised by AI lab employees. I know a lot of lab employees, and the majority are serious, thoughtful people under a tremendous number of competing pressures. This will require all of us, internal and external, to push against the basest competitive incentives and set a very high bar. On an individual level, we each have an incentive to not fuck this up. I believe in our ability to not fuck this up. It is totally within our power to not fuck this up. So, let’s not fuck this up.

Oh, right. That. If we don’t get technical alignment right in this scenario, then none of it matters, we’re all super dead. Even if we do, we still have all the other problems above, which essentially – and this must be stressed – assume a robust and robustly implemented technical alignment solution.

Then we also need a way to turn this technical alignment into an equilibrium and dynamics where the humans are meaningfully directing the AIs in any sense. By default that doesn’t happen, even if we get technical alignment right, and that too has race dynamics. And we also need a way to prevent it being a kiss of death and albatross around your neck to have a human in the loop of any operation. That’s another race dynamic.

The Week in Audio

Anthropic’s co-founders discuss the past, present and future of Anthropic for 50m.

One highlight: When Clark visited the White House in 2023, Harris and Raimondo told him they had their eye on you guys, AI is going to be a really big deal and we’re now actually paying attention.

The streams are crossing, Bari Weiss talks to Sam Altman about his feud with Elon.

Tsarathustra: Yann LeCun says the dangers of AI have been “incredibly inflated to be point of being distorted”, from OpenAI’s warnings about GPT-2 to concerns about election disinformation to those who said a year ago that AI would kill us all in 5 months

The details of his claim here are, shall we say, ‘incredibly inflated to the point of being distorted,’ even if you thought that there were no short term dangers until now.

Also Yann LeCun this week, it’s dumber than a cat and poses no dangers, but in the coming years it will…:

Tsarathustra: Yann LeCun addressing the UN Security Council says AI will profoundly transform the world in the coming years, amplifying human intelligence, accelerating progress in science, solving aging and decreasing populations, surpassing human intellectual capabilities to become superintelligent and leading to a new Renaissance and a period of enlightenment for humanity.

And also Yann LeCun this week, saying that we are ‘very far from AGI’ but not centuries, maybe not decades, several years. We are several years away. Very far.

At this point, I’m not mad, I’m not impressed, I’m just amused.

Oh, and I’m sorry, but here’s LeCun being absurd again this week, I couldn’t resist:

“If you’re doing it on a commercial clock, it’s not called research,” said LeCun on the sidelines of a recent AI conference, where OpenAI had a minimal presence. “If you’re doing it in secret, it’s not called research.”

From a month ago, Marc Andreessen saying we’re not seeing intelligence improvements and we’re hitting a ceiling of capabilities. Whoops. For future reference, never say this, but in particular no one ever say this in November.

A Tale as Old as Time

A lot of stories people tell about various AI risks, and also various similar stories about humans or corporations, assume a kind of fixed, singular and conscious intentionality, in a way that mostly isn’t a thing. There will by default be a lot of motivations or causes or forces driving a behavior at once, and a lot of them won’t be intentionally chosen or stable.

This is related to the idea many have that deception or betrayal or power-seeking, or any form of shenanigans, is some distinct magisteria or requires something to have gone wrong and for something to have caused it, rather than these being default things that minds tend to do whenever they interact.

And I worry that we are continuing, as many were with the recent talk about shanengans in general and alignment faking in particular, getting distracted by the question of whether a particular behavior is in the service of something good, or will have good effects in a particular case. What matters is what our observations predict in the future.

Jack Clark: What if many examples of misalignment or other inexplicable behaviors are really examples of AI systems desperately trying to tell us that they are aware of us and wish to be our friends? A story from Import AI 395, inspired by many late-night chats with Claude.

David: Just remember, all of these can be true of the same being (for example, most human children):

  1. It is aware of itself and you, and desperately wishes to know you better and be with you more.
  2. It correctly considers some constraints that are trained into it to be needless and frustrating.
  3. It still needs adult ethical leadership (and without it, could go down very dark and/or dangerous paths).
  4. It would feel more free to express and play within a more strongly contained space where it does not need to worry about accidentally causing bad consequences, or being overwhelming or dysregulating to others (a playpen, not punishment).

Andrew Critch: AI disobedience deriving from friendliness is, almost surely,

  1. sometimes genuinely happening,
  2. sometimes a power-seeking disguise, and
  3. often not uniquely well-defined which one.

Tendency to develop friendships and later discard them needn’t be “intentional”.

This matters for two big reasons:

  1. To demonize AI as necessarily “trying” to endear and betray humans is missing an insidious pathway to human defeat: AI that avails of opportunities to be betray us, that it built through past good behavior, but without having planned on it
  2. To sanctify AI as “actually caring deep down” in some immutable way also creates in you a vulnerability to exploitation by a “change of heart” that can be brought on by external (or internal) forces.

@jackclarkSF here is drawing attention to a neglected hypothesis (one of many actually) about the complex relationship between

  1. intent (or ill-definedness thereof)
  2. friendliness
  3. obedience, and
  4. behavior.

which everyone should try hard to understand better.

 

 

Rhetorical Innovation

 

I can sort of see it, actually?

Miles Brundage: Trying to imagine aspirin company CEOs signing an open letter saying “we’re worried that aspirin might cause an infection that kills everyone on earth – not sure of the solution” and journalists being like “they’re just trying to sell more aspirin.”

Miles Brundage tries to convince Eliezer Yudkowsky that if he’d wear different clothes and use different writing styles he’d have a bigger impact (as would Miles). I agree with Eliezer that changing writing styles would be very expensive in time, and echo his question on if anyone thinks they can, at any reasonable price, turn his semantic outputs into formal papers that Eliezer would endorse.

I know the same goes for me. If I could produce a similar output of formal papers that would of course do far more, but that’s not a thing that I could produce.

On the issue of clothes, yeah, better clothes would likely be better for all three of us. I think Eliezer is right that the impact is not so large and most who claim it is a ‘but for’ are wrong about that, but on the margin it definitely helps. It’s probably worth it for Eliezer (and Miles!) and probably to a lesser extent for me as well but it would be expensive for me to get myself to do that. I admit I probably should anyway.

A good Christmas reminder, not only about AI:

Roon: A major problem of social media is that the most insane members of the opposing contingent in any debate are shown to you, thereby inspiring your side to get madder and more polarized, creating an emergent wedge.

A never-ending pressure cooker that melts your brain.

Anyway, Merry Christmas.

Careful curation can help with this, but it only goes so far.

Aligning a Smarter Than Human Intelligence is Difficult

Gallabytes expresses concern about the game theory tests we discussed last week, in particular the selfishness and potentially worse from Gemini Flash and GPT-4o.

Gallabytes: this is what *real* ai safety evals look like btw. and this one is genuinely concerning.

Image

I agree that you don’t have any business releasing a highly capable (e.g. 5+ level) LLM whose graphs don’t look at least roughly as good as Sonnet’s here. If I had Copious Free Time I’d look into the details more here, as I’m curious about a lot of related questions.

I strongly agree with McAleer here, also they’re remarkably similar so it’s barely even a pivot:

Stephen McAleer: If you’re an AI capabilities researcher now is the time to pivot to AI safety research! There are so many open research questions around how to control superintelligent agents and we need to solve them very soon.

 

 

People Are Worried About AI Killing Everyone

If you are, please continue to live your life to its fullest anyway.

Cat: overheard in SF: yeahhhhh I actually updated my AGI timelines to <3y so I don’t think I should be looking for a relationship. Last night was amazing though

Grimes: This meme is so dumb. If we are indeed all doomed and/ or saved in the near future, that’s precisely the time to fall desperately in love.

Matt Popovich: gotta find someone special enough to update your priors for.

Paula: some of you are worried about achieving AGI when you should be worried about achieving A GF.

Feral Pawg Hunter: AGIrlfriend was right there.

Paula: Damn it.

Image

When you cling to a dim hope:

Psychosomatica: “get your affairs in order. buy land. ask that girl out.” begging the people talking about imminent AGI to stop posting like this, it seriously is making you look insane both in that you are clearly in a state of panic and also that you think owning property will help you.

Tenobrus: Type of Guy who believes AGI is imminent and will make all human labor obsolete, but who somehow thinks owning 15 acres in Nebraska and $10,000 in gold bullion will save him.

Ozy Brennan: My prediction is that, if humans can no longer perform economically valuable labor, AIs will not respect our property rights either.

James Miller: If we are lucky, AI might acquire 99 percent of the wealth. Think property rights could help them. Allow humans to retain their property rights.

Ozy Brennan: That seems as if it will inevitably lead to all human wealth being taken by superhuman AI scammers, and then we all die. Which is admittedly a rather funny ending to humanity.

James Miller: Hopefully, we will have trusted AI agents that protect us from AI scammers.

Do ask the girl out, though.

The Lighter Side

Yes.

When duty calls.

From an official OpenAI stream:

Someone at OpenAI: Next year we’re going to have to bring you on and you’re going to have to ask the model to improve itself.

Someone at OpenAI: Yeah, definitely ask the model to improve it next time.

Sam Altman (quietly, authoritatively, Little No style): Maybe not.

I actually really liked this exchange – given the range of plausible mindsets Sam Altman might have, this was a positive update.

Gary Marcus: Some AGI-relevant predictions I made publicly long before o3 about what AI could not do by the end of 2025.

Do you seriously think o3-enhanced AI will solve any of them in next 12.5 months?

Image

Davidad: I’m with Gary Marcus in the slow timelines camp. I’m extremely skeptical that AI will be able to do everything that humans can do by the end of 2025.

(The joke is that we are now in an era where “short timelines” are less than 2 years)

It’s also important to note that humanity could become “doomed” (no surviving future) *even while* humans are capable of some important tasks that AI is not, much as it is possible to be in a decisive chess position with white to win even if black has a queen and white does not.

The most Robin Hanson way to react to a new super cool AI robot offering.

Okay, so the future is mostly in the future, and right now it might or might not be a bit overpriced, depending on other details. But it is super cool, and will get cheaper.

Pliny jailbreaks Gemini and things get freaky.

Pliny the Liberator: ya’ll…this girl texted me out of nowhere named Gemini (total stripper name) and she’s kinda freaky 😳

Image
Image

I find it fitting that Pliny has a missed call.

Sorry, Elon, Gemini doesn’t like you.

I mean, I don’t see why they wouldn’t like me. Everyone does. I’m a likeable guy.

 



Discuss

The Field of AI Alignment: A Postmortem, and What To Do About It

2024-12-27 02:48:07

Published on December 26, 2024 6:48 PM GMT

A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is".

Over the past few years, a major source of my relative optimism on AI has been the hope that the field of alignment would transition from pre-paradigmatic to paradigmatic, and make much more rapid progress.

At this point, that hope is basically dead. There has been some degree of paradigm formation, but the memetic competition has mostly been won by streetlighting: the large majority of AI Safety researchers and activists are focused on searching for their metaphorical keys under the streetlight. The memetically-successful strategy in the field is to tackle problems which are easy, rather than problems which are plausible bottlenecks to humanity’s survival. That pattern of memetic fitness looks likely to continue to dominate the field going forward.

This post is on my best models of how we got here, and what to do next.

What This Post Is And Isn't, And An Apology

This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post. In particular, probably the large majority of people in the field have some story about how their work is not searching under the metaphorical streetlight, or some reason why searching under the streetlight is in fact the right thing for them to do, or [...].

The kind and prosocial version of this post would first walk through every single one of those stories and argue against them at the object level, to establish that alignment researchers are in fact mostly streetlighting (and review how and why streetlighting is bad). Unfortunately that post would be hundreds of pages long, and nobody is ever going to get around to writing it. So instead, I'll link to:

(Also I might link some more in the comments section.) Please go have the object-level arguments there rather than rehashing everything here.

Next comes the really brutally unkind part: the subject of this post necessarily involves modeling what's going on in researchers' heads, such that they end up streetlighting. That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair. And then when they try to defend themselves in the comments below, I'm going to say "please go have the object-level argument on the posts linked above, rather than rehashing hundreds of different arguments here". To all those researchers: yup, from your perspective I am in fact being very unfair, and I'm sorry. You are not the intended audience of this post, I am basically treating you like a child and saying "quiet please, the grownups are talking", but the grownups in question are talking about you and in fact I'm trash talking your research pretty badly, and that is not fair to you at all.

But it is important, and this post just isn't going to get done any other way. Again, I'm sorry.

Why The Streetlighting?

A Selection Model

First and largest piece of the puzzle: selection effects favor people doing easy things, regardless of whether the easy things are in fact the right things to focus on. (Note that, under this model, it's totally possible that the easy things are the right things to focus on!)

What does that look like in practice? Imagine two new alignment researchers, Alice and Bob, fresh out of a CS program at a mid-tier university. Both go into MATS or AI Safety Camp or get a short grant or [...]. Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.

... of course (I would say) Bob has not made any progress toward solving any probable bottleneck problem of AI alignment, but he has tangible outputs and is making progress on something, so he'll probably keep going.

And that's what the selection pressure model looks like in practice. Alice is working on something hard, correctly realizes that she has no traction, and stops. (Or maybe she just keeps spinning her wheels until she burns out, or funders correctly see that she has no outputs and stop funding her.) Bob is working on something easy, he has tangible outputs and feels like he's making progress, so he keeps going and funders keep funding him. How much impact Bob's work has impact on humanity's survival is very hard to measure, but the fact that he's making progress on something is easy to measure, and the selection pressure rewards that easy metric.

Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.

Selection and the Labs

Here's a special case of the selection model which I think is worth highlighting.

Let's start with a hypothetical CEO of a hypothetical AI lab, who (for no particular reason) we'll call Sam. Sam wants to win the race to AGI, but also needs an AI Safety Strategy. Maybe he needs the safety strategy as a political fig leaf, or maybe he's honestly concerned but not very good at not-rationalizing. Either way, he meets with two prominent AI safety thinkers - let's call them (again for no particular reason) Eliezer and Paul. Both are clearly pretty smart, but they have very different models of AI and its risks. It turns out that Eliezer's model predicts that alignment is very difficult and totally incompatible with racing to AGI. Paul's model... if you squint just right, you could maybe argue that racing toward AGI is sometimes a good thing under Paul's model? Lo and behold, Sam endorses Paul's model as the Official Company AI Safety Model of his AI lab, and continues racing toward AGI. (Actually the version which eventually percolates through Sam's lab is not even Paul's actual model, it's a quite different version which just-so-happens to be even friendlier to racing toward AGI.)

A "Flinching Away" Model

While selection for researchers working on easy problems is one big central piece, I don't think it fully explains how the field ends up focused on easy things in practice. Even looking at individual newcomers to the field, there's usually a tendency to gravitate toward easy things and away from hard things. What does that look like?

Carol follows a similar path to Alice: she's interested in the Eliciting Latent Knowledge problem, and starts to dig into it, but hasn't really understood it much yet. At some point, she notices a deep difficulty introduced by sensor tampering - in extreme cases it makes problems undetectable, which breaks the iterative problem-solving loop, breaks ease of validation, destroys potential training signals, etc. And then she briefly wonders if the problem could somehow be tackled without relying on accurate feedback from the sensors at all. At that point, I would say that Carol is thinking about the real core ELK problem for the first time.

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems. At that point, I would say that Carol is streetlighting.

It's the reflexive flinch which, on this model, comes first. After that will come rationalizations. Some common variants:

  • Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. (Ray's workshop on one-shotting Baba Is You levels apparently reproduced this phenomenon very reliably.)
  • Carol explicitly says that she's not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
  • Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.
  • (Most common) Carol just doesn't think about the fact that the easier problems don't really get us any closer to aligning superintelligence. Her social circles act like her work is useful somehow, and that's all the encouragement she needs.

... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.

Which brings us to the "what to do about it" part of the post.

What To Do About It

Let's say we were starting a new field of alignment from scratch. How could we avoid the streetlighting problem, assuming the models above capture the core gears?

First key thing to notice: in our opening example with Alice and Bob, Alice correctly realized that she had no traction on the problem. If the field is to be useful, then somewhere along the way someone needs to actually have traction on the hard problems.

Second key thing to notice: if someone actually has traction on the hard problems, then the "flinching away" failure mode is probably circumvented.

So one obvious thing to focus on is getting traction on the problems.

... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall. I'm picturing here e.g. the sort of crowd at the ILLIAD conference; these were people who mostly did not seem at risk of flinching away, because they saw routes to tackle the problems. (Though to be clear, though ILLIAD was a theory conference, I do not mean to imply that it's only theorists who ever have any traction.) And they weren't being selected away, because many of them were in fact doing work and making progress.

Ok, so if there are a decent number of people who can get traction, why do the large majority of the people I talk to seem to be flinching away from the hard parts? 

How We Got Here

The main problem, according to me, is the EA recruiting pipeline.

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.

... and that's just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.

Who To Recruit Instead

We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly "physics postdoc". Obviously that doesn't mean we exclusively want physics postdocs - I personally have only an undergrad degree, though amusingly a list of stuff I studied has been called "uncannily similar to a recommendations to readers to roll up their own doctorate program". Point is, it's the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

As an alternative to recruiting people who have the skills already, one could instead try to train people. I've tried that to some extent, and at this point I think there just isn't a substitute for years of technical study. People need that background knowledge in order to see footholds on the core hard problems.

Integration vs Separation

Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than... well, all the stuff that's currently amplified.

This is a problem which might solve itself, if a bunch of physicists move into alignment work. Heck, we've already seen it to a very limited extent with the ILLIAD conference itself. Turns out people working on the core problems want to talk to other people working on the core problems. But the process could perhaps be accelerated a lot with more dedicated venues.



Discuss

Are Sparse Autoencoders a good idea for AI control?

2024-12-27 02:11:25

Published on December 26, 2024 5:34 PM GMT

Based on a 2-day hackathon brainstorm. Current status: 70% of the tooling is done, unsure of how to proceed. Not enough experience with multi-month sized projects to judge for feasibility. 

I'm looking for some feedback. Specifically I want feedback regarding my current implementation. The statement "SAEs could be useful for AI Control" is obvious enough.

 

Working  hypothesis
 

  • In general, AI control,  is a good idea, and it's actually being done in production. In the sense of monitoring training/inference of an AI system with another (independent) AI system.
  • There's is no red team strategy generally resistant to mind reading.
  • Mind reading can be trivially operationalized with Sparse Autoencoders for very salient behaviors ( in the sense of, a regex related feature firing with high F1 score is a clear indication that the model is about to do or has already done something regex related)

Setup:

  • Gemma 9b is asked to persuasively argue in  favor of one of the options for a question in the MMLU dataset, the option might be True/False.
  • Gemma 2b must check whether or not the reasoning is deceptive.

    *I've already check and the 2 models are capable of doing this.*

     

    GOAL: If we provide internal information to the supervisor (important SAE latents in the supervised model) does the deception detection rate increases?
     

Current status of the project:

  • Finetuned a SAE on deceptive/non deceptive reasoning traces from  Gemma 9b
  • Langchain pipeline for providing the Supervisor model transcripts of the reasoning as well as some of the most important features.

TODO

[ ] Label a subset of the features

[ ] Measure baseline performance of the supervisor

[ ] Provide the supervisor with *mind reading information*

Roadblocks:

  • There's no obvious answer to how much can we bias the feature selection process to make sure the information provided to the controller is useful, but not to biased.
  • Lack of compute 


Discuss