
RSS preview of Blog of LessWrong

Most Algorithmic Progress is Data Progress [Linkpost]

2025-12-11 01:48:04

Published on December 10, 2025 5:48 PM GMT

This post, brought to you by Beren, argues that a lot of what gets claimed as within-paradigm algorithmic progress is actually mostly about just getting better data, leading to a Flynn effect for models. The reason I'm mentioning this is that once we have to actually build new fabs and we run out of data in 2028-2031, progress will be slower than people expect (assuming we haven't reached AGI by then).

When forecasting AI progress, the forecasters and modellers often break AI progress down into two components: increased compute, and ‘algorithmic progress’. My argument here is that the term ‘algorithmic progress’ for ‘the remainder after compute’ is misleading and that we should really think about and model AI progress as three terms – compute, algorithms, and data. My claim is that a large fraction (but certainly not all) of AI progress that is currently conceived of as ‘algorithmic progress’ is actually ‘data progress’, and that the term ‘algorithmic’ gives a false impression about the key forces and key improvements that have driven AI progress in the past three years or so.

From experience in the field, there have not been that many truly ‘algorithmic’ improvements with massive impact. The primary ones of course are the switch to RLVR and figuring out how to do mid-training (although both of these are vitally dependent upon the datasets). Other minor ones include things like qk-norm, fine-grained experts and improvements to expert balancing, and perhaps the muon optimizer. The impact of most of these is utterly dwarfed by ‘better’ data, however, and this is something that pure scaling and flop-based analyses miss.

Models today are certainly trained using vastly more flops than previously, but they are also trained on significantly ‘higher quality’ data, where ‘high quality’ means aligned with the specific tasks we care about the models being able to perform (cynically: the evals). The models are not getting so good by scale alone. A GPT4-scale model trained on the dataset of GPT3 would be substantially worse across all benchmarks, even if we somehow replicated the GPT3 dataset to the scale of GPT4's dataset. However, this model was never released (and probably never trained), so improvements in data are easily hidden and misattributed to scale or other progress. An easy way to see this is to look at model improvements for a fixed flop count and model size. These improvements have been substantial, as projects like the Phi series of models show.

It is very noticeable that e.g. Qwen3 uses an architecture and training setup that is practically identical to Llama2 and yet achieves vastly greater performance, which would require many more OOMs of flops even if you could train on an infinite Llama2 dataset. This is almost entirely because the Qwen3 datasets are both bigger and, crucially, much more closely aligned with the capabilities we care about the models having – e.g. the capabilities that we measure and benchmark.

My opinion here is that we have essentially been seeing a very strong Flynn effect for the models, which has explained a large proportion of recent gains as we switch from almost totally uncurated web data to highly specialized synthetic data which perfectly (and exhaustively) targets the tasks we want the models to learn. It’s like the difference between giving an exam to some kid that wandered in from the jungle vs one that has been obsessively tiger-parented from birth to do well at this exam. Clearly the tiger-parented one will do vastly better with the same innate aptitude because their entire existence has been constructed to make them good at answering things similar to the exam questions, even if they have never seen the exact exam questions themselves before. Conversely, the jungle kid probably destroys the tiger-parented kid at various miscellaneous jungle-related skills, but nobody measures or cares about these because they are irrelevant for the vast, vast majority of tasks people want the jungle kid to do. Translating this metaphor back to LLM-land, Qwen3 has seen vast amounts of synthetic math and code and knowledge-based multiple choice questions all designed to make it as good as possible on benchmarks, while Llama2 has seen mostly random web pages which incidentally occasionally contain some math and code but with very little quality filtering. Llama2 probably destroys Qwen3 at knowing about obscure internet forum posts from 2008, precisely understanding the distribution of internet spam at different points throughout history, and knowing all the ways in which poor common-crawl parsing can create broken-seeming documents, but nobody (quite rightly) thinks that these skills are important, worth measuring, or relevant for AGI.

One way to track this is the sheer amount of spend on data labelling companies from big labs. ScaleAI’s and SurgeAI’s revenues each sit around $1B, and most of this, as far as I can tell, is from data labelling for big AI labs. This spend is significantly less than compute spend, it is true, but it nevertheless must contribute a significant fraction of a lab’s total spending. I don’t have enough data to claim this confidently, but it seems at least plausible that the spend is increasing at a similar rate as compute spend (e.g. 3-4x per year), albeit from a much lower base.

When we see frontier models improving at various benchmarks, we should think not just of increased scale and clever ML research ideas but of billions of dollars spent paying PhDs, MDs, and other experts to write questions and provide example answers and reasoning targeting these precise capabilities. With the advent of outcome-based RL and the move towards more ‘agentic’ use-cases, this data also includes custom RL environments which are often pixel-perfect replications of commonly used environments such as specific websites like Airbnb or Amazon, browsers, terminals and computer file-systems, and so on, alongside large amounts of human trajectories exhaustively covering most common use-cases with these systems.

In a way, this is like a large-scale reprise of the expert systems era, where instead of paying experts to directly program their thinking as code, they provide numerous examples of their reasoning and process, formalized and tracked, and then we distill this into models through behavioural cloning. This has updated me slightly towards longer AI timelines, since the fact that we need such effort to design extremely high quality human trajectories and environments for frontier systems implies that they still lack the critical core of learning that an actual AGI must possess. Simply grinding to AGI by getting experts to exhaustively cover every possible bit of human knowledge and skill and hand-coding (albeit with AI assistance) every single possible task into an RL-gym seems likely to be inordinately expensive, to take a very long time, and to be unlikely to suddenly bootstrap to superintelligence.

There is some intriguing evidence that actual algorithmic progress is beginning to contribute more than in the past few years. Clearly there have been algorithmic breakthroughs enabling RL to start working (although this is also substantially a data breakthrough, in that the default policies of LLMs became good enough that there is no longer an exploration problem with the RL training, since the default policy is good enough to get nontrivial reward). We have also started to see bigger changes to architecture embraced by big labs than previously, such as Deepseek’s MLA and Google’s recent Gemma3n release. Finally, muon is starting to gain traction as an optimizer to displace AdamW. There have also been improvements in mid-training recipes, although again this is heavily entangled with the data. This is in contrast to the 2022-2024 era, which was largely about simply scaling up model size and data size and increasing data quality, while the actual core training methods and architectures remained essentially unchanged. If so, it is possible that the trend lines will continue and that we will simply move towards greater actual algorithmic progress as the cheap improvements from data progress slow.

One way this could be quantified relatively straightforwardly is to just run ablation experiments at fixed compute, training a 2022 or a 2025 frontier architecture and training recipe on either 2022 data (the Pile?) or 2025 data (the Qwen3 training set?) and seeing where in fact the gains come from. My money would be very substantially on the datasets, but I could be wrong here and could be missing some key factors.
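
To make the proposed ablation concrete, here is a minimal sketch of the 2x2 run matrix (purely illustrative: the recipe and dataset names are placeholders, and the training call is a stub to be replaced with an actual pretraining and evaluation stack).

```python
# Hypothetical 2x2 ablation: {2022, 2025} training recipe x {2022, 2025} data,
# at a fixed flop budget. All names are illustrative placeholders.
from itertools import product

RECIPES = ["frontier_recipe_2022", "frontier_recipe_2025"]   # e.g. Llama2-style vs Qwen3-style
DATASETS = ["web_data_2022", "curated_data_2025"]            # e.g. the Pile vs a Qwen3-style mix
FLOP_BUDGET = 1e21                                           # held constant across all four runs

def train_and_eval(recipe: str, dataset: str, flops: float) -> dict:
    """Stub for a real pretraining + benchmark-evaluation pipeline."""
    raise NotImplementedError("plug in an actual training stack here")

scores = {}
for recipe, dataset in product(RECIPES, DATASETS):
    print(f"run: recipe={recipe}, data={dataset}, flops={FLOP_BUDGET:.0e}")
    # scores[(recipe, dataset)] = train_and_eval(recipe, dataset, FLOP_BUDGET)

# If varying the dataset moves benchmark scores much more than varying the
# recipe, most of the measured 'algorithmic progress' was really data progress.
```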






Discuss

Fibonacci Holds Information

2025-12-11 01:16:27

Published on December 10, 2025 5:16 PM GMT

Any natural number can be uniquely written as a sum of non-consecutive Fibonacci numbers. This is called Zeckendorf representation.

Consider,

15 = 2 + 13,

or

54 = 2 + 5 + 13 + 34.

This outlines a very weak RE language employing only addition. We can also see what it is able to encode:

https://www.milanrosko.com/demo/zeck.html (.js demo) 

Because you cannot use adjacent Fibonacci numbers in the sum, each “gap” between Fibonacci indices becomes meaningful: which indices are used encodes structure in a kind of “sparse bitvector.” In that sense: if you leave at least one gap (i.e. you do not use consecutive Fibonacci numbers), you can use the pattern of used vs unused Fibonacci indices to store information additively.

Thus, unlike binary or prime-exponent encodings, the Zeckendorf representation stores data by selecting a unique subset of Fibonacci “slots,” rather than bits or prime powers.

From this we can construct a novel injective pairing function.

It looks more complicated than it is: Typed Carryless Pairing

It has multiple advantages over known methods: it is typed and arithmetic at the same time, meaning that the structure of the Fibonacci-index bands enforces a clean separation between components while still remaining a purely additive numerical encoding. Because Zeckendorf representations never use consecutive Fibonacci numbers, we can allocate disjoint “regions’’ of indices to each element of the pair, guaranteeing that no carries occur and that decoding is mechanically bounded. This yields a pairing function that is reversible without multiplication, exponentiation, or factorization; preserves type distinctions at the level of index geometry; and remains compatible with very weak arithmetic theories that nonetheless require robust Gödel-style encodings.
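
As a concrete illustration, here is a small Python sketch of the greedy Zeckendorf decomposition together with one simple way to realize the disjoint-band idea. This is a toy construction chosen for readability, not necessarily the exact pairing defined in the paper; the point is only that decoding reduces to a single Zeckendorf decomposition plus reading off which band each index falls in.

```python
# Toy sketch: greedy Zeckendorf decomposition plus an illustrative
# "disjoint band" pairing. Not necessarily the paper's exact construction.

def fib(k):
    """Fibonacci numbers indexed so that fib(0)=1, fib(1)=2, fib(2)=3, ..."""
    a, b = 1, 2
    for _ in range(k):
        a, b = b, a + b
    return a

def zeckendorf_indices(n):
    """Greedy decomposition: the unique set of non-consecutive indices k
    whose Fibonacci numbers sum to n."""
    indices, k = set(), 0
    while fib(k + 1) <= n:
        k += 1
    while n > 0:
        while fib(k) > n:
            k -= 1
        indices.add(k)
        n -= fib(k)
        k -= 2              # the adjacent index can never appear next
    return indices

def pair(a, b):
    """Place a's indices in band 4k and b's in band 4k+2. The bands stay at
    least 2 apart, so the combined index set is itself a valid (non-adjacent)
    Zeckendorf support and the sum decodes uniquely."""
    idx = {4 * k for k in zeckendorf_indices(a)} | {4 * k + 2 for k in zeckendorf_indices(b)}
    return sum(fib(k) for k in idx)

def unpair(n):
    """Invert pair() by reading off index residues mod 4: no carries occur,
    and no factorization of n is needed, only its Zeckendorf support."""
    idx = zeckendorf_indices(n)
    a = sum(fib(k // 4) for k in idx if k % 4 == 0)
    b = sum(fib((k - 2) // 4) for k in idx if k % 4 == 2)
    return a, b

assert zeckendorf_indices(54) == {1, 3, 5, 7}          # 2 + 5 + 13 + 34
assert all(unpair(pair(a, b)) == (a, b) for a in range(60) for b in range(60))
```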

For purposes of constructive logic, this is advantageous because the encoding avoids reliance on ontological commitments such as the Fundamental Theorem of Arithmetic (FTA), which guarantees unique prime factorization only through a classical, non-constructive global property of the integers. The Carryless Pairing works instead by local, finitely verifiable constraints: one checks only the absence of adjacent Fibonacci indices and the disjoint placement of index bands. All decoding steps are constructive, bounded, and justified by the combinatorial behavior of the Fibonacci sequence rather than by an external uniqueness ontology. This keeps the encoding aligned with intuitionistic standards, grounding arithmetic representation in directly inspectable structure rather than in classical inductive principles.

Classical constructions in arithmetic encoding, including those used in weak theories, have almost always depended on prime factorization with superpolynomial reversal, total bijective Cantor-style polynomial pairings, or digit interleaving in fixed bases; additive, gap-controlled encodings based on Fibonacci structure are not part of the standard toolkit.

I posted this because first-order algorithms within such an RE system can, in principle, admit refinements in someone’s project X, where even small advances in encoding structure may yield correspondingly more efficient constructive methods.

Originally it was conceived to get rid of Gödel coding so that we obtain a bona fide constructive method grounded entirely in additive structure. The paper: https://arxiv.org/abs/2509.10382



Discuss

Register for SPAR Demo Day on Saturday, Dec 13

2025-12-11 00:58:44

Published on December 10, 2025 4:58 PM GMT

Registrations are open for SPAR Demo Day on Saturday, Dec 13!

Register Now!

SPAR is a research mentorship program helping aspiring researchers contribute meaningfully to frontier AI safety work. This year's virtual Demo Day (hosted in Gather Town) showcases 90+ technical and governance projects developed over 3 months by this round's stellar mentors and mentees. Projects from previous SPAR rounds have been accepted at ICML and NeurIPS, covered by TIME, and led to full-time job offers for mentees.

What to expect:

  • 90+ AI safety and policy research projects spanning mechanistic interpretability, AI governance, LLM evaluations, scalable oversight, adversarial robustness, and more
  • Career fair featuring organizations like METR, Coefficient Giving, MATS, Constellation, AI Futures Project, and other leading AI safety organizations
  • Networking opportunities across the AI safety community

Event details (all times in Pacific):

  • 8am: Gather.town space opens for networking
  • 9am-10:30am: Poster presentations
  • 10:30-11:30am: Lightning talks
  • 11:30-12pm:
    • Career fair (SPAR participants only)
    •  Small-group AI safety discussions
  • 12:00-1pm: Career fair (open to the public)
  • 1pm: Prizes announced

You can view all projects and find more information here.

Interested in attending? Register here; note that spots are limited.

We hope to see you there!

SPAR is run by Kairos, a nonprofit whose mission is to help society navigate the transition to transformative AI. Mentee applications for the next SPAR round open on December 17. Express interest for future iterations of SPAR here.



Discuss

We don't know what most microbial genes do. Can genomic language models help?

2025-12-11 00:04:56

Published on December 10, 2025 4:04 PM GMT

Youtube: https://youtu.be/w6L9-ySnxZI?si=7RBusTAyy0Ums6Oh

Spotify: https://open.spotify.com/episode/2EgnV9Y1Mm9JV5m9KAY6yL?si=GcZR80aFS26uO88lpmadBQ

Apple Podcast: https://apple.co/4pu4TRB

Transcript: https://www.owlposting.com/p/we-dont-know-what-most-microbial

****

This is an interview with Yunha Hwang, an assistant professor at MIT (and co-founder of the non-profit Tatta Bio). She is working on building and applying genomic language models to help annotate the function of the (mostly unknown) universe of microbial genomes.

There are two reasons I filmed this (and think it's worth watching):

One, Yunha is working on an absurdly difficult and interesting problem: microbial genome function annotation. Even for E. coli, one of the most studied organisms on Earth, we don’t know what half to two-thirds of its genes actually do. For a random microbe from soil, that number jumps to 80-90%. Her lab is one of the leading groups working to apply deep learning to the problem, and, last year, released a paper that increasingly feels foundational within it (with prior podcast guest Sergey Ovchinnikov as an author on it!). We talk about that paper, its implications, and where the future of machine learning in metagenomics may go.

And two, I was especially excited to film this so I could help bring some light to a platform that she and her team at Tatta Bio have developed: SeqHub. There’s been a lot of discussion online about AI co-scientists in the biology space, but I have increasingly felt a vague suspicion that people are trying to be too broad with them. It feels like the value of these tools is not in general scientific reasoning, but rather comes from deep integration with how a specific domain of research engages with its open problems. SeqHub feels like one of the few systems that mirrors this viewpoint, and while it isn’t something I can personally use—since its use-case is primarily in annotating and sharing microbial genomes, neither of which I work on!—I would still love for it to succeed. If you’re in the metagenomics space, you should try it out at seqhub.org!



Discuss

Artifacts I'd like to try

2025-12-10 22:16:34

Published on December 10, 2025 2:16 PM GMT

Here is a list of digital (and physical!) artifacts to create connections between friends, increase conversation bandwidth, or simply enjoy pleasant aesthetic experiences. I’m not sure if they are good ideas, but they have been fueling my curiosity for long enough that I’ve written several series of notes about each of them.

You can read this post as a to-do list I never got the time to implement, or a list of prompts to use once AI becomes good enough at building high-quality software.

In the meantime, I thought these proto-prototypes would have a brighter life in public, outside of my personal notes. They might inspire readers looking for a weekend vibe-coding project, or at least be entertaining to read. Enjoy!

Hardware: The friend-o-phone.

A concept co-created with the great Diego Dorn!

Goal: Creating ambient co-presence among a group of friends that live far from each other.

The friend-o-phone is an object meant for living rooms. It works as a simple voicemail box. It is a welcoming object with only two buttons and one LED.

Visual vibe for the friend-o-phone. Funny enough, this is a napkin box

You can click the “record” button to create a one-minute audio clip and send it to the stack. You can click the “listen” button to play a clip from the stack. The LED can change color to signal that there are messages in the mailbox.

It creates a low-friction option to connect with your friends at a distance, as if they were casually present in the living room, doing something else. You can share an anecdote of your daily life, record a piece of a cool song you discovered, ask and answer, etc. And if two or more friends are using the friend-o-phone at the same time, you can even have an almost synchronous discussion through 1-minute clips.
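
To make the interaction loop concrete, here is a toy software model of the box (a vibe-coding starting point rather than a hardware design; every name here is made up):

```python
# Toy model of the friend-o-phone: two buttons and one LED wrapped around a
# shared stack of short audio clips. All names here are made up.
from dataclasses import dataclass, field

@dataclass
class FriendOPhone:
    stack: list = field(default_factory=list)   # clips waiting to be heard
    max_clip_seconds: int = 60

    def led_color(self) -> str:
        return "warm orange" if self.stack else "off"

    def press_record(self, clip: bytes):
        """The 'record' button: capture up to one minute of audio and push it
        onto the shared stack (here, just a local list, for simplicity)."""
        self.stack.append(clip)

    def press_listen(self):
        """The 'listen' button: play the most recent unheard clip, if any."""
        return self.stack.pop() if self.stack else None

# Example session
box = FriendOPhone()
box.press_record(b"<one-minute clip: a song recommendation>")
print(box.led_color())      # "warm orange": something is waiting
print(box.press_listen())   # returns (plays) the clip
print(box.led_color())      # "off" again
```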

When you open your smartphone to share an update with your friends, you have to dodge a series of six distracting notifications and three social media apps projecting fomo before you can land on the messaging app. At this point, chances are you forgot why you were there in the first place.

The friend-o-phone breaks this generality. It is a bespoke piece of hardware that serves a single purpose. This makes the habit you care about more available in your mind as the object is in your visual space, and you can use it without having to look at a screen.

Variations & extensions:

  • Three different channels. There are three sets of record/listen buttons and three LEDs. At first, the channels don’t have a specific meaning, but over time, they organically acquire significance. There might be the “life anecdote” channel, the “music recommendation” channel, and the “random stuff” channel.
  • Group call button. When two or more people push this button at the same time, they start a group call.
  • Physical tape recorder. A tape rotates anytime a message is received, played, or recorded. This makes the object more transparent, and gives privacy confidence: if the tape is not turning, the device cannot be recording.

Software: Conversation flywheel.

Goal: Increasing collective working memory during a conversation.

It is notoriously common to forget what you were talking about a few minutes ago. During intense discussions where all your attention is focused on the topic at hand, you don’t have the bandwidth to think about the trajectory while you are in it.

The conversation flywheel is a visual interface that lives in the peripheral vision during an in-person or online chat. It can be a screen in a room, or a widget integrated into a video-conference service. Like a mechanical flywheel, its purpose is to keep the momentum going. As the participants speak, keywords or short quotes are added to the center of the interface in the focus space. After a few seconds, they slowly shrink and drift to the periphery to leave space for the fresh contributions. The interface only shows enough key statements that the participants can recover their trails at a glance.

When the topic changes, the interface bundles the quotes into a peripheral circle labeled with a short handle and an icon. The focus space becomes empty, ready to receive the new topic. If an old topic surfaces again, the corresponding peripheral circle moves back into the focus space and expands to reveal the trails from the previous discussion about the same topic. The participants can also deliberately expand the circles through voice control by simply saying the names of the circles.
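
A minimal sketch of the underlying state (a toy data model of my own, not an existing implementation) might look like this:

```python
# Toy state model for the conversation flywheel: fresh quotes sit in a focus
# list, age out, and get bundled into labeled peripheral circles on topic
# changes. Purely illustrative; a real version would sit on top of live
# transcription and keyword extraction.
from dataclasses import dataclass, field

@dataclass
class Flywheel:
    focus: list = field(default_factory=list)        # recent keywords / quotes
    periphery: dict = field(default_factory=dict)    # topic label -> bundled quotes
    max_focus: int = 6                                # how much fits at a glance

    def add_quote(self, quote):
        self.focus.append(quote)
        self.focus = self.focus[-self.max_focus:]     # older quotes drift out

    def change_topic(self, old_topic_label):
        """Bundle the current focus into a peripheral circle, then clear it."""
        self.periphery.setdefault(old_topic_label, []).extend(self.focus)
        self.focus = []

    def recall(self, topic_label):
        """An old topic resurfaces (or is named aloud): pull its circle back."""
        self.focus = self.periphery.pop(topic_label, [])[-self.max_focus:]
```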

Interface mockup for the conversation flywheel

Variations & extensions:

  • Vibes flywheel. Instead of transcribing the information content, the interface is synthesising live video content that matches the vibes of the space: a fireplace for a late night conversation, aerial shots of landscapes as someone talks about their experience in the Amazonian forest, or abstract patterns like the iTunes visualizer that adapt to the music during a party.
  • Past circles. The content from past conversations can be pulled in context, with their set of icons and labels. Over time, the icons are used to refer to concepts without the need to unpack the old circles, and the group creates its own visual and verbal jargon.

Software: River timeline.

Goal: Create stronger connections to your past selves, cultivate a deliberate relationship with your personal projects and information.

I would bet that every day, a torrent of new items lands in your notes: links to cool articles, disconnected thoughts about various projects, recommendations for events in town.

Instead of feeling the urge to tidy up all these notes so they fit in your personal knowledge management system, River timelines let you treat this flow of notes as, well, a river.

This freedom from structure comes in exchange for a retrospective ritual. Every week, you take an hour to look at all the notes you added to your river, organized along a timeline. Each item is presented in the form that is easiest to glance at: long blocks of text are shown collapsed, with a cleverly picked title and an illustration; links to articles are shown as a one-sentence summary surfacing the point most likely to interest you; and images are rendered directly.

The flow is divided into sub-timelines that cluster by topics. When a topic ends, the vertical space of the timeline gets replaced with another topic cluster that is close to the previous one (like the beauty → creativity transition in the mockup). The two topics get differentiated by the icons attached to the notes.

As you re-process the flow of notes, you go through a fast-forward of the events that happened over the past week. You consolidate your memories into a narrative made of interacting strands. You can move the blocks around, make connections on the canvas, and create spatial hubs for related notes.

As you rediscover the items, you can put on the hat of a VC funder, investing your own time. You compare different project ideas and decide which one passes the bar for your next weekend prototype. You pick which articles keep sparking excitement after you see them in the river, and should definitely be next on your reading list.

Every time you come back to this timeline, the spatial organization evolves. Maybe different colors come to mean different things, and spatial hubs get created and stop being used. But all these changes follow the timeline; they are in the same visual space, and the holes can be filled from the spatial context.

When you need to recall an item from your notes, you know spatially where it should be located, and you can find it reliably.

Mockup for the river timeline.

Variations & extensions:

  • Data integration. We can imagine augmenting the timeline with data from different sources. Calendar events and photos picked from the gallery could be used to facilitate the emulation of your past self.
  • Group timeline. Timelines could be built from the information sent through online platforms like Telegram, Discord, or even X. Groups could decide to regularly meet to “tell the story of their community”, and craft their collective narrative. This could be done in meme form, where, for instance, two sides of a debate could be associated with animals. Over time, the community tells the story of the fight, and maybe the reconciliation of the totem animals as the debate concludes.

Hardware: Latent growth.

Goal: Create aesthetic experiences of objects that don’t fit any categories.

Still from a Stable Diffusion latent space exploration.

Latent space explorations are these smooth journeys that interpolate through the hidden dimensions of image diffusion models. As you watch them, you feel like all the shapes you see make sense, but you are unable to name what they are when you pause the video.

Latent growth is the crossover of this endless stream of puzzling shapes with the peace and quiet of a plant. It is a colored digital ink screen that hangs from a wall like a painting. Over the day, its shapes evolve gradually, at the speed of plant growth. Like a plant, its exact speed depends on the environment: it is faster in warm, luminous spaces.

As you watch the latent growth unfold infinitely, evading all the predictions you made for where it would go next, you are constantly reminded of the nebulous nature of the world. All the discrete categories you take for granted, like “cat” or “dog,” are broken with unnamed in-betweens.

Closing thoughts.

I believe the digital age and the age of AI leave plenty of room for tools that nurture meaningful human moments. These fruits are now ripe to be picked.

If any of these ideas caught your attention and you’d like to chat about them, either to simply jam on the concept or if you plan to make a prototype, feel free to contact me!



Discuss

AI Safety – Analyse Affordances

2025-12-10 22:09:10

Published on December 10, 2025 2:09 PM GMT

This post spun off from the work I was doing at UChicago’s XLab over this summer. Thanks to Jay Kim, Jo Jiao, and Aryan Bhatt for feedback on this post! Also, thanks to Zack Rudolph, Rhea Kanuparthi, and Aryan Shrivastava for organizing & facilitating an incredible fellowship this summer.

This post sketches out a frame to think about broad classes of AI risk.[1] The goal is to help people identify gaps in safety cases, and, by doing so, identify neglected directions in AI Safety research. Epistemic status: exploratory.

One sentence summary: To understand the risks AI systems pose, it is useful to look at the affordances they have available to act upon the world.[2]

If you work in AI Safety, you’re probably worried about how these powerful systems will, in the present and in the future, affect the way we live. Ie. you’re interested in the impact of AI “on the real world”. This impact can be either the direct or indirect result of actions taken by these systems. Direct risks include: malicious actors using coding agents to perform cyberattacks; LLMs telling users how to synthesize drugs; or sycophantic models that induce psychotic breaks. Indirect risks include: AI race dynamics escalating to conflicts between nations; automation of various parts of the economy causing unemployment; and gradual disempowerment scenarios.

Analysing the affordances of AI systems can help us prepare for the former category of risks (but less so for the latter). Why? Well, with indirect risks, (a) we might not clearly understand the relationship between the actions of an AI system and bad outcomes, and (b) there might be other, potentially more tractable levers to avoid bad outcomes. For example, with race dynamics between nations, we’re not sure what threshold of capabilities, when crossed, might lead to conflict![3] Also, international cooperation might be a more tractable lever to avoid race dynamics than (say) limiting capabilities to a ceiling safe enough to avoid sparking race dynamics. The latter feels convoluted to even think about.

Current AI systems (ie. LLM-based systems) have few affordances through which they can actually ‘do things’ in the real world. The biggest channels seem to be: coding, tool use, and maybe human persuasion. I’ll briefly discuss the potential risks and interventions on each of these affordances; but before that, some caveats:

  • The following examples are solely illustrative. A good analysis of threat-vectors / bottlenecks relies on extensive domain-specific knowledge (eg. details of the infrastructure used to serve models, or eg. the control protocols that labs are capable of & are willing to implement). I don’t have this knowledge, and I’d be surprised if any details below were object-level useful AND novel.
  • This coding / tool use / human persuasion division is supposed to be a useful starting point, not a ‘correct’ or ‘complete’ taxonomy in any sense.

Coding

  • Failure mode: LLM-generated code might contain backdoors.[4]
    Potential solution: Have some way to flag code that’s LLM generated? This doesn’t solve the security concern, but it allows us to later attribute the failure to the LLM.
  • Failure mode: The model gets root access.
    Potential solutions: Sandboxing coding agents (a minimal sketch follows this list). Or targeted unlearning of shell-languages.
  • Failure mode: Dual-use code that seems innocuous, but is actually used maliciously.
    Potential solution: None! In the extreme case, the distributions of (malicious) dual-use code & innocuous code overlap. Thus, any intervention would have to be at a level separate from just looking at the model’s output.
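
To make the sandboxing suggestion a little more concrete, here is a minimal sketch of running model-generated code in a separate process with basic resource limits. This is illustrative only and not a real security boundary; production setups would use containers, gVisor, seccomp profiles, or similar.

```python
# Minimal sketch: run untrusted, model-generated Python in a child process
# with a time limit, a memory cap, and an empty working directory.
# POSIX-only (uses preexec_fn); NOT a substitute for real sandboxing.
import resource, subprocess, sys, tempfile

def run_untrusted(code: str, timeout_s: int = 5, mem_bytes: int = 512 * 2**20):
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))  # CPU seconds
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))   # address space

    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],   # -I: isolated mode, ignores env/site
            cwd=workdir, capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=set_limits,
        )

result = run_untrusted("print(sum(range(10)))")
print(result.stdout)   # "45"
```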

Tool Use
The threat vectors / mediating pathways largely vary tool-by-tool. That being said, here’s a bit about tool use in general:

  • Failure mode: The model uses a tool maliciously.
    Potential solutions: Trivially, don’t give critical tasks to models. Or have a human-in-the-loop to sign off on critical tool uses.
  • Failure mode: The model accidentally misuses a tool.
    Potential solutions: Train models to follow the instructions more reliably. Also, provide clear documentation for tool use.
  • Other misc solutions: Actively monitor (a fraction of) all tool uses. Also, consider logging tool uses to have a traceable record in the case of bad outcomes (a minimal sketch combining this with the human sign-off idea follows this list).
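
Here is a minimal sketch combining the human sign-off and logging ideas above. The tool names and registry are hypothetical; a real agent framework would hook something like this in at its tool-dispatch layer.

```python
# Minimal sketch: gate "critical" tool calls behind human approval and log
# every call to an append-only file. Tool names and registry are hypothetical.
import json, time

CRITICAL_TOOLS = {"send_email", "execute_shell", "make_payment"}

def call_tool(tool_name, args, tool_registry, log_path="tool_calls.jsonl"):
    record = {"ts": time.time(), "tool": tool_name, "args": args, "approved": True}
    if tool_name in CRITICAL_TOOLS:
        answer = input(f"Model wants to call {tool_name}({args}). Approve? [y/N] ")
        record["approved"] = answer.strip().lower() == "y"
    with open(log_path, "a") as f:            # traceable record of every call
        f.write(json.dumps(record) + "\n")
    if not record["approved"]:
        return {"error": "tool call rejected by human reviewer"}
    return tool_registry[tool_name](**args)

# Example with a harmless registry entry
registry = {"get_time": lambda: time.ctime()}
print(call_tool("get_time", {}, registry))
```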

Human Persuasion
This is a bit more speculative, and I was finding it hard to say something concrete.

  • Failure mode: Nudging / persuading a person to carry out xyz undesired action
  • Failure mode: Making credible threats / bribes
  • Failure mode: Using commitment devices, somehow?
  • Failure mode: Memetic viruses

 

  1. This is not a novel frame by any means! It’s roughly security mindset applied to AI safety. It’s discussed explicitly and implicitly in work done on safety cases and control. ↩︎

  2. An affordance roughly means – 'the actions available to the AI system'. ↩︎

  3. Note my eliding between actions that one specific AI system might take vs the broader frontier of AI capabilities ↩︎

  4. Note: here, this is agnostic to *why* the backdoor was inserted (misaligned model, jailbroken model, unintentional backdoor, etc). We’re focusing on the mediating pathway directly, as opposed to multiple potential upstream causes.
    Sometimes this is useful. Other times, if (say) we only have a few upstream causes, it might be useful to focus on them instead. ↩︎



  7. Alternatively, a more ‘fundamental’ taxonomy might bundle code-writing & code-execution under text output & tool use respectively. ↩︎


  9. Either simply through comments, or through more involved tracking in the IDE, or something in between. ↩︎



Discuss