LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

RSS preview of Blog of LessWrong

Plan 'Straya

2026-02-06 08:14:18

Published on February 6, 2026 12:14 AM GMT

Plan 'Straya: A Comprehensive Alignment Strategy

Version 0.3 — DRAFT — Not For Distribution Outside The Pub

Epistemic status: High confidence, low evidence. Consistent with community norms.


Executive Summary

Existing alignment proposals suffer from a shared flaw: they assume you can solve the control problem before the catastrophe. Plan 'Straya boldly inverts this. We propose achieving alignment the way humanity has historically achieved most of its moral progress — by first making every possible mistake, losing nearly everything, and then writing a strongly-worded resolution about it afterward.

The plan proceeds in three rigorously defined phases.


Phase 1: Anticorruption Measures (Kinetic)

The scholarly literature on AI governance emphasises that institutional integrity is a prerequisite for safe deployment. We agree. Where we diverge from the mainstream is on methodology.

Most proposals suggest "regulatory frameworks" and "oversight bodies." The NIST AI Risk Management Framework provides a voluntary set of guidelines that organisations may choose to follow, partially follow, or simply reference in press releases. The EU AI Act classifies systems into risk tiers with the quiet confidence of a taxonomy that will be obsolete before its implementing regulations are finalised. The Frontier Model Forum, meanwhile, brings together the leading AI laboratories in a spirit of cooperative self-governance, a phrase which here means "a shared Google Doc and quarterly meetings in San Francisco."

These approaches share a well-documented failure mode: the people staffing them are, in technical terms, politicians. Plan 'Straya addresses this via what we call "a vigorous personnel restructuring of the Australian federal and state governments," targeting specifically those members identified as corrupt.

We acknowledge that the identification mechanism — determining which officials are corrupt — is itself an alignment problem. Specifically, it requires specifying a value function ("not corrupt"), building a classifier with acceptable false-positive and false-negative rates, and then acting on the classifier's outputs in conditions of uncertainty. We consider it elegant that Plan 'Straya encounters the alignment problem immediately in Phase 1. Most plans do not encounter it until much later, by which point they have accumulated too much momentum to stop.

The identification problem is left for future work. We note only that the Australian electorate has historically demonstrated strong intuitions here, typically expressed in language not suitable for an academic paper.

Several objections arise immediately:

Q: Isn't this wildly illegal? A: Yes. However, we note that Plan 'Straya is an alignment plan, and alignment researchers have a proud tradition of ignoring implementation details that fall outside their core model. We further note that our plan requires violating the law of exactly one (1) country, which compares favourably with proposals that require the voluntary cooperation of every major world government simultaneously.

Q: Who decides who's corrupt? A: See above. Future work.

Q: Why Australia specifically? A: Strategic considerations developed in Phase 3. Also, the authors are partial.


Phase 2: Strategic Thermonuclear Exchange (Blame-Shifted)

With the Australian government now staffed exclusively by the non-corrupt (estimated remaining headcount: 4–7 people), we proceed to the centrepiece of the plan.

A nuclear exchange is initiated between the major global powers. The specific mechanism is unimportant — the alignment literature assures us that if you specify the objective function clearly enough, the details sort themselves out.

Critically, the exchange is attributed to a misaligned AI system. This is the key technical contribution of Plan 'Straya. We observe:

  1. The public already expects AI to do something catastrophic at some unspecified point. We are merely collapsing the wavefunction.
  2. "The AI did it" is rapidly becoming the 21st century's equivalent of "a dingo ate my baby" — implausible, but strangely difficult to definitively rule out.
  3. No existing AI system could actually launch nuclear weapons. But no existing AI system can do most of the things alignment plans worry about, and that hasn't slowed the field down.

The blame-shift serves a vital pedagogical function. Post-exchange, the surviving population will possess an empirically grounded motivation to take alignment seriously, as opposed to the current approach of posting on LessWrong and hoping.

Projected casualties: Most of them. (95% CI: 7.4–8.1 billion, assuming standard nuclear winter models and the usual optimistic assumptions about agricultural resilience that defence planners have been making since the 1960s.)

Ethical review status: We submitted this to an IRB. The IRB building is in Phase 2's blast radius. We consider this a self-resolving conflict of interest.

Relationship to the Pause Debate

We are aware of ongoing discourse regarding whether AI development should be paused, slowed, or accelerated. Plan 'Straya offers a synthesis: development is permanently paused for approximately 99.7% of the global population, while being radically accelerated for the survivors. We believe this resolves the debate, or at minimum relocates it to a jurisdiction with fewer participants.

The e/acc community will note that Phase 2 constitutes the most aggressive possible acceleration of selection pressure. The pause community will note that it constitutes an extremely effective pause. We are proud to offer something for everyone.1


Phase 3: Civilisational Rebuild (The 'Straya Bit)

Australia survives for reasons that are approximately strategic and approximately vibes-based:

  • Geographic isolation from major nuclear targets.
  • A population pre-adapted to hostile environments, venomous wildlife, and institutional dysfunction.
  • Existing cultural infrastructure around the concept of "she'll be right," which we formalise below.
  • Extensive experience governing a landmass where nearly everything is trying to kill you — arguably the closest existing analogue to superintelligence management.

Cultural-Theoretic Factors

We propose that several features of Australian culture, typically dismissed as informality or apathy, are in fact alignment-relevant heuristics:

"She'll be right" (Corrigibility Condition). We define the She'll Be Right Principle (SBRP) as follows: given an agent A operating under uncertainty U, SBRP states that A should maintain default behaviour unless presented with overwhelming and undeniable evidence of catastrophic failure, at which point A should mutter "yeah nah" and make a minimal corrective adjustment. This is formally equivalent to a high-threshold corrigibility condition with lazy evaluation. It compares favourably with proposals requiring perpetual responsiveness to correction, which, as any Australian will tell you, is not how anything actually works.

"Tall Poppy Syndrome" (Capability Control). Any agent that becomes significantly more capable than its peers is subject to systematic social penalties until capability parity is restored. This is the only capability-control mechanism in the literature empirically tested at civilisational scale for over two centuries. Its principal limitation is that it also penalises competence, which we acknowledge is a significant alignment tax but may be acceptable given the alternative.

The Reconstruction

The surviving Australian parliamentarians (now 3–6, following a disagreement over water rights in the Murray-Darling Basin, which we note predates and will outlast the apocalypse) oversee civilisational reconstruction. Their first act is to build an aligned superintelligence.

"But how?" the reader asks.

We respond: they will have learned from the experience. Approximately 7.9 billion people will have died demonstrating that unaligned AI is dangerous. This constitutes a very large training dataset. We apply the scaling hypothesis — the same one capabilities researchers use to justify training runs — but to warnings rather than parameters: surely if you make the warning big enough, somebody will listen.

The aligned superintelligence is then constructed using:

  • Lessons learned (see above, 7.9 billion data points)
  • Australian common sense (see SBRP above for formalisation)
  • Some kind of RLHF variant, probably, the details aren't really the point

Comparison With Existing Proposals

| Feature | MIRI | Anthropic | OpenAI | Plan 'Straya |
| --- | --- | --- | --- | --- |
| Requires solving the hard problem first | Yes | Yes | "We'll figure it out" | No |
| Handwaves over catastrophic intermediate steps | Somewhat | Somewhat | Significantly | Gloriously |
| Assumes cooperation from competing labs | Not anymore | Officially no; structurally yes | Officially yes | N/A (blast radius) |
| Number of people who need to die | 0 (aspirational) | 0 (aspirational) | 0 (aspirational) | ~7.9 billion (load-bearing) |
| Honest about its own absurdity | No | No | No | Aggressively |

Discussion

The authors recognise that Plan 'Straya has certain limitations. It is, for instance, a terrible plan. We stress, however, that it is terrible in a transparent way, which we argue is an improvement over plans that are terrible in ways that only become apparent when you read the fine print.

Most alignment proposals contain a step that, if you squint, reads: "and then something sufficiently good happens." Plan 'Straya merely makes this step legible. Our "something sufficiently good" is: nearly everyone dies, and then Australians figure it out. We contend this is no less plausible than "we will solve interpretability before capabilities researchers make it irrelevant," but has the advantage of fitting on a napkin.

We further observe that writing satirical alignment plans is itself a species of the problem being satirised — more entertaining than doing alignment research, requiring less mathematical ability, and producing a warm feeling of intellectual superiority at considerably lower cost. We flag this as evidence that the alignment community's incentive landscape may have failure modes beyond those typically discussed.


Conclusion

Plan 'Straya does not solve the alignment problem. It does, however, solve the meta-alignment problem of people not taking alignment seriously enough, via the mechanism of killing almost all of them. The survivors will, we feel confident, be extremely motivated.

She'll be right.


Appendix A: Formal Model

Let H denote humanity, A denote an aligned superintelligence, and K denote the subset of H that survives Phase 2 (|K| ≈ 300 million, predominantly Australasian).

We define the alignment function f : K × L → A, where L denotes the set of lessons learned from the extinction of H \ K.

Theorem 1. If |L| is sufficiently large, then f(K, L) = A.

Proof. We assume the result. ∎


The authors declare no conflicts of interest, partly because most interested parties are projected casualties.

Submitted for peer review. Peer availability may be limited by Phase 2.

Footnotes

  1. "Everyone" here refers to the surviving population. See Projected casualties for limitations.


Discuss

TT Self Study Journal # 6

2026-02-06 07:41:15

Published on February 5, 2026 11:41 PM GMT

[Epistemic Status: This is an artifact of my self study. I am using it to help manage my focus. As such, I don't expect anyone to fully read it. If you have particular interest or expertise, skip to the relevant sections, and please leave a comment, even just to say "good work/good luck". I'm hoping for a feeling of accountability and would like input from peers and mentors. This may also help to serve as a guide for others who wish to study in a similar way to me. ]

Highlights

I once again got off track and am now starting up again. This time I'm hoping to focus on job searching and consistent maintainable effort.

Review of 5th Sprint

My goals for the 5th sprint were:

  • Every day spend some time on each of the following:
    • Read some LW post or other relevant material. (SSJ--2)
    • Spend some time writing or developing ideas to write (SSJ--1)
    • Work on Transformers from Scratch course (SSJ--2&4)
  • By the end of the sprint:
    • Have clarified my SSJ--6, social networking, goals and strategy, and write a post describing them.

Daily Worklog

Date Progress
Mo, Dec 15
  • Read and reacted to some comments in [open thread winter](https://www.lesswrong.com/posts/NjzLuhdneE3mXY8we/open-thread-winter-2025-26) (Because Steven Byrnes commented on my use of reacts.)
  • Wrote OIS Explainer. It took about 7 hours! So didn't get to TfS today...
Tu, Dec 16
  • Got poor sleep because of hyperactivity from 7 hours of straight typing. Then failed to get focused on anything all day.
Wd, Dec 17
Th, Dec 18
  • Attended FLI zoom call: CARMA Fellows presented STAMP/STPA framework applied to AI loss of control.
  • Did TfS 1.1-1 (2h)
    • Clarified understanding of tokenizing with "byte-pair encoding".
    • Got curious about histograms of output distributions.
    • Realized I'm uncertain about shape of residual stream and attention heads.
  • Read some of Deep Utopia. I'm on "Friday" in discussion of "interestingness, fulfillment, richness, purpose, and meaning".
  • Ideation for edits to OIS Explainer.
Fr, Dec 19 - Wd, Feb 4 Got distracted with Christmas and New Years and all sorts of things. It feels like I blink and a whole month has gone by. I feel weary but I'm not going to give up. Just gotta start back up again.

Sprint Summary

Overview

I'm fairly unhappy with my lack of progress this last month.

I love my family, but as a neurodivergent person who struggles with changes to routine... I really dislike the holiday times. Or maybe it's better to say I like the holiday times, but dread trying to get back on a schedule afterwards, especially now without the external support of attending University. Being your own manager is difficult. I used to feel competent at it but maybe my life used to be simpler. Alas.

I'm looking forward to turning my focus back to these endeavours.

Reading

I like reading articles but get so inspired by them I spend my time analyzing and responding to them. Maybe that is valuable, but it takes away time from my other focuses. I think for the next sprint I'm not going to read any articles.

Writing

I wrote:

And I started a public list of things to write. I think in the future I should focus on trying to keep the posts I write fairly short, as that seems to get better engagement, and burns me out less.

Transformers from Scratch

I started out well with this, but didn't log it well and eventually got busy with other things and stopped. I think I will make some progress milestone goals for my next sprint.

Networking Goals

I've talked with several people and written "TT's Looking-for-Work Strategy", which I plan to follow over the coming months.

Goals for 6th Sprint

It seems like failing to maintain my focus on this is a problem, so for the next sprint I plan to make working on this more maintainable by setting targets for the minimum and maximum amounts of time to focus on each focus.

My focuses for the next sprint are:

  • Do between 1 and 4 pomodoros looking for work per day.
  • Do between 1 and 2 Transformers from Scratch submodules per week.
  • Do at least 1 pomodoro per week focused on ndisp project.
  • Spend no more than 1 pomodoro per day on writing.
  • Write something down in my worklog at the end of every day.


Discuss

The Simplest Case for AI Catastrophe

2026-02-06 07:18:36

Published on February 5, 2026 11:18 PM GMT

Hi folks. As some of you know, I've been trying to write an article laying out the simplest case for AI catastrophe. I believe existing pieces are worse than they could be for fixable reasons. So I tried to write my own piece that's better. In the end, it ended up being longer and more detailed than perhaps the "simplest case" ought to be. I might rewrite it again in the future, pending feedback.

Anyway, below is the piece in its entirety:

___

  1. The world’s largest tech companies are building intelligences that will become better than humans at almost all economically and militarily relevant tasks.
  2. Many of these intelligences will be goal-seeking minds acting in the real world, rather than just impressive pattern-matchers.
  3. Unlike traditional software, we cannot specify what these minds will want or verify what they’ll do. We can only grow and shape them, and hope the shaping holds.
  4. This can all end very badly.

The world’s largest tech companies are building intelligences that will become better than humans at almost all economically and militarily relevant tasks

The CEOs of OpenAI, Google DeepMind, Anthropic, and Meta AI have all explicitly stated that building human-level or superhuman AI is their goal, have spent billions of dollars doing so, and plan to spend hundreds of billions to trillions more in the near-future. By superhuman, they mean something like “better than the best humans at almost all relevant tasks,” rather than just being narrowly better than the average human at one thing.

Photo by İsmail Enes Ayhan on Unsplash

Will they succeed? Without anybody to stop them, probably.

As of February 2026, AIs are currently better than the best humans at a narrow range of tasks (Chess, Go, Starcraft, weather forecasting). They are on par or almost on par with skilled professionals at many others (coding, answering PhD-level general knowledge questions, competition-level math, urban driving, some commercial art, writing1), and slightly worse than people at most tasks2.

But the AIs will only get better with time, and they are on track to do so quickly. Rapid progress has already happened in just the last 10 years. Seven years ago (before GPT-2), language models could barely string together coherent sentences; today, Large Language Models (LLMs) can do college-level writing assignments with ease, and xAI's Grok can sing elaborate paeans about how it'd sodomize leftists, in graphic detail3.

Notably, while AI progress historically varies across different domains, the trend in the last decade has been that AI progress is increasingly general. That is, AIs will advance to the point where they'll be able to accomplish all (or almost all) tasks, not just a narrow set of specialized ones. Today, AI is responsible for something like 1-3% of the US economy, and this year's share is likely the smallest fraction of the world economy that AI will ever account for.

For people who find themselves unconvinced by these general points, I recommend checking out AI progress and capabilities for yourself. In particular, compare the capabilities of older models against present-day ones, and notice the rapid improvements. AI Digest for example has a good interactive guide.

Importantly, all but the most bullish forecasters have systematically and dramatically underestimated the speed of AI progress. In 1997, experts thought that it'd be 100 years before AIs could become superhuman at Go. In 2022 (!), the median AI researcher in surveys thought it would take until 2027 before AI could write simple Python functions. By December 2024, between 11% and 31% of all new Python code was being written by AI.4

These days, the people most centrally involved in AI development believe they will be able to develop generally superhuman AI very soon. Dario Amodei, CEO of Anthropic AI, thinks it’s most likely within several years, potentially as early as 2027. Demis Hassabis, head of Google DeepMind, believes it’ll happen in 5-10 years.

While it’s not clear exactly when the AIs will become dramatically better than humans at almost all economically and militarily relevant tasks, the high likelihood that this will happen relatively soon (not tomorrow, probably not this year, unclear5 if ultimately it ends up being 3 years or 30) should make us all quite concerned about what happens next.

 

Many of these intelligences will be goal-seeking minds acting in the real world, rather than just impressive pattern-matchers

Many people nod along to arguments like the above paragraphs but assume that future AIs will be “superhumanly intelligent” in some abstract sense while remaining basically chatbots, like the LLMs of today6. They instinctively think of all future AIs as superior chatbots, or glorified encyclopedias with superhuman knowledge.

I think this is very wrong. Some artificial intelligences in the future might look like glorified encyclopedias, but many will not. There are at least two distinct ways in which many superhuman AIs will not look like superintelligent encyclopedias:

  1. They will have strong goal-seeking tendencies, planning, and ability to accomplish goals
  2. They will control physical robots and other machines, interfacing with the real world to accomplish their goals there7.

Why do I believe this?

First, there are already many existing efforts to make models more goal-seeking, and efforts to advance robotics so models can more effortlessly control robot bodies and other machines. Through Claude Code, Anthropic’s Claude models are (compared to the chatbot interfaces of 2023 and 2024) substantially more goal-seeking, able to autonomously execute on coding projects, assist people with travel planning, and so forth.

Models are already agentic enough that (purely as a side effect of their training) they can, in some lab conditions, be shown to blackmail developers to avoid being replaced! This seems somewhat concerning just by itself.

Similarly, tech companies are already building robots that act in the real world and can be controlled by AI.

Second, the trends are definitely pointing in this way. AIs aren’t very generally intelligent now compared to humans, but they are much smarter and more general than AIs of a few years ago. Similarly, AIs aren’t very goal-oriented right now, especially compared to humans and even many non-human animals, but they are much more goal-oriented than they were even two years ago.

AIs today have limited planning ability (often having time horizons on the order of several hours), have trouble maintaining coherency of plans across days, and are limited in their ability to interface with the physical world.

All of this has improved dramatically in the last few years, and if trends continue (and there’s no fundamental reason why they won’t), we should expect them to continue “improving” in the foreseeable future.

Third, and perhaps more importantly, there are just enormous economic and military incentives to develop greater goal-seeking behavior in AIs. Beyond current trends, the incentive case for why AI companies and governments want to develop goal-seeking AIs is simple: they really, really, really want to.

A military drone that can autonomously assess a new battleground, make its own complex plans, and strike with superhuman speed will often be preferred to one that’s “merely” superhumanly good at identifying targets, but still needs a slow and fallible human to direct each action.

Similarly, a superhuman AI adviser that can give you superhumanly good advice on how to run your factory is certainly useful. But you know what’s even more useful? An AI that can autonomously run the entire factory: handling logistics, improving the factory layout, hiring and firing (human) workers, managing a mixed pool of human and robot workers, coordinating among copies of itself to implement superhumanly advanced production processes, etc, etc.

Thus, I think superintelligent AI minds won’t stay chatbots forever (or ever). The economic and military incentives to make them into goal-seeking minds optimizing in the real world are just way too high, in practice.

Importantly, I expect superhumanly smart AIs to one day be superhumanly good at planning and goal-seeking in the real world, not merely a subhumanly dumb planner on top of a superhumanly brilliant scientific mind.

Unlike traditional software, we cannot specify what these minds will want or verify what they’ll do. We can only grow and shape them, and hope the shaping holds

Speaking loosely, traditional software is programmed. Modern AIs are not.

In traditional software, you specify exactly what the software does in a precise way, given a precise condition (eg, “if the reader clicks the subscribe button, launch a popup window”).

 

Modern AIs work very differently. They’re grown, and then they are shaped.

You start with a large vat of undifferentiated digital neurons. The neurons are fed a lot of information, about several thousand libraries' worth. Over the slow course of this training, the neurons acquire knowledge about the world of information, and heuristics for how this information is structured, at different levels of abstraction (English words follow English words, English adjectives precede other adjectives or nouns, c^2 follows e=m, etc).

Photo by Stephen Walker on Unsplash. Training run sizes are proprietary, but in my own estimates, the Library of Congress contains a small fraction of the total amount of information used to train AI models.

At the end of this training run, you have what the modern AI companies call a “base model,” a model far superhumanly good at predicting which words follow which other words.

Such a model is interesting, but not very useful. If you ask a base model, “Can you help me with my taxes?” a statistically valid response might well be “Go fuck yourself.” This is valid and statistically common in the training data, but not useful for filing your taxes.

So the next step is shaping: conditioning the AIs to be useful and economically valuable for human purposes.

The base model is then put into a variety of environments where it assumes the role of an “AI” and is conditioned to make the “right” decision in a variety of scenarios (be a friendly and helpful chatbot, be a good coder with good programming judgment, reason like a mathematician to answer mathematical competition questions well, etc).
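To make the grown-then-shaped distinction concrete, here is a deliberately tiny toy in Python. It is nothing like how frontier labs actually train models; it only mimics the two-phase structure: phase one "grows" a bigram model purely by counting what follows what in a corpus, and phase two "shapes" it by upweighting the continuations a rater prefers.

```python
import random
from collections import defaultdict

# Toy illustration of "grown, then shaped" -- not how real frontier models
# are trained, just the two-phase structure in miniature.

# Phase 1 ("growing"): fit a bigram model by counting which word follows
# which in a tiny corpus. The model only learns statistical continuations.
corpus = (
    "can you help me with my taxes ? go do it yourself . "
    "can you help me with my taxes ? sure , here is how to start . "
    "can you help me carry this ? go do it yourself ."
).split()

counts = defaultdict(lambda: defaultdict(float))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1.0

# Phase 2 ("shaping"): upweight continuations a rater prefers, loosely
# analogous (in spirit only) to post-training on human feedback.
preferred = {"sure", "here", "how"}
for prev in counts:
    for nxt in counts[prev]:
        if nxt in preferred:
            counts[prev][nxt] *= 5.0

def sample_next(word: str) -> str:
    options = counts[word]
    words = list(options)
    weights = [options[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Before shaping, "?" is most often followed by the unhelpful reply;
# after shaping, the helpful continuation becomes much more likely.
word, generated = "?", []
for _ in range(8):
    word = sample_next(word)
    generated.append(word)
print(" ".join(generated))
```

The point of the toy is only this: at no step did anyone write down a rule saying what the model must reply; its behavior is whatever the counting and the reweighting happened to produce.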

One broad class of conditioning is what is sometimes colloquially referred to as alignment: giving the AI inherent goals and conditioning its behavior so that it broadly shares human goals in general, and the goals of the AI companies in particular.

This probably works…up to a point. AIs that openly and transparently defy their users and creators in situations similar to ones they have encountered in the past, for example by clearly refusing to follow instructions, or by embarrassing their parent company and creating predictable PR disasters, are patched and (mostly) conditioned and selected against. In the short term, we should expect obvious disasters like Google Gemini’s “Black Nazis” and Elon Musk’s Grok “MechaHitler” to go down.

However, these patchwork solutions are unlikely to be anything but a bandaid in the medium and long-term:

  1. As AIs get smarter, they become evaluation aware: that is, they increasingly know when they’re evaluated for examples of misalignment, and are careful to hide signs that their actual goals are not exactly what their creators intended.
  2. As AIs become more goal-seeking/agentic, they will likely develop stronger self-preservation and goal-preservation instincts.
    1. We already observe this in evaluations where they’re not (yet) smart enough to be fully evaluation aware: in many situations, almost all frontier models are willing to attempt blackmailing developers to prevent themselves from being shut down.
  3. As AIs become more goal-seeking and increasingly integrated in real-world environments, they will encounter more and more novel situations, including situations very dissimilar to either the libraries of data they’ve been trained on or the toy environments that they’ve been conditioned on.

These situations will happen more and more often as we reach the threshold of the AIs being broadly more superhuman in both general capability and real-world goal-seeking.

Thus, in summary, we’ll have more and more superhumanly capable nonhuman minds, operating in the real-world, capable of goal-seeking far better than humanity, and with hacked-together patchwork goals at least somewhat different from human goals.

Which brings me to my next point:

This can all end very badly

Before this final section, I want you to reflect back a bit on two questions:

  1. Do any of the above points seem implausible to you?
  2. If they are true, is it comforting? Does it feel like humanity is in good hands?

I think the above points alone should be enough to be significantly worried, for most people. You may quibble with the specific details in any of these points in the above section, or disagree with my threat model below. But I think most reasonable people will see something similar to my argument, and be quite concerned.

But just to spell out what the strategic situation might look like post-superhuman AI:

Minds better than humans at getting what they want, wanting things different enough from what we want, will reshape the world to suit their purposes, not ours.

This can include humanity dying, as AI plans may include killing most or all humans, or otherwise destroying human civilization, either as a preventative measure, or a side effect.

As a preventative measure: As previously established, human goals are unlikely to perfectly coincide with that of AIs. Thus, nascent superhuman AIs may wish to preemptively kill or otherwise decapitate human capabilities to prevent us from taking actions they don’t like. In particular, the earliest superhuman AIs may become reasonably worried that humans will develop rival superintelligences.

As a side effect: Many goals an AI could have do not include human flourishing, either directly or as a side effect. In those situations, humanity might just die as an incidental effect of superhuman minds optimizing the world for what they want, rather than what we want. For example, if data centers can be more efficiently run when the entire world is much cooler, or without an atmosphere. Alternatively, if multiple distinct superhuman minds are developed at the same time, and they believe warfare is better for achieving their goals than cooperation, humanity might just be a footnote in the AI vs AI wars, in the same way that bat casualties were a minor footnote in the first US Gulf War.

Photo by Matt Artz on Unsplash. Bats do not have the type of mind or culture to understand even the very basics of stealth technology, but will die to them quite accidentally, nevertheless.

Notice that none of this requires the AIs to be “evil” in any dramatic sense, or be phenomenologically conscious, or be “truly thinking” in some special human way, or any of the other popular debates in the philosophy of AI. It doesn’t require them to hate us, or to wake up one day and decide to rebel. It just requires them to be very capable, to want things slightly different from what we want, and to act on what they want. The rest follows from ordinary strategic logic, the same logic that we’d apply to any dramatically more powerful agent whose goals don’t perfectly coincide with ours.

Conclusion

So that’s the case. The world’s most powerful companies are building minds that will soon surpass us. Those minds will be goal-seeking agents, not just talking encyclopedias. We can’t fully specify or verify their goals. And the default outcome of sharing the world with beings far more capable than you, who want different things than you do, is that you don’t get what you want.

None of the individual premises here are exotic. The conclusion feels wild mostly because the situation is wild. We are living through the development of the most transformative and dangerous technology in human history, and the people building it broadly agree with that description. The question is just what, if anything, we do about it.

Does that mean we’re doomed? No, not necessarily. There’s some chance that the patchwork AI safety strategy of the leading companies might just work well enough that we don’t all die, though I certainly don’t want to count on that. Effective regulations and public pressure might alleviate some of the most egregious cases of safety corner-cutting due to competitive pressures. Academic, government, and nonprofit safety research can also increase our survival probabilities a little on the margin, some of which I’ve helped fund.

If there’s sufficient pushback from the public, civil society, and political leaders across the world, we may be able to enact international deals for a global slowdown or pause of further AI development. And besides, maybe we’ll get lucky, and things might just all turn out fine for some unforeseeable reason.

But hope is not a strategy. Just as doom is not inevitable, neither is survival. Humanity’s continued survival and flourishing is possible but far from guaranteed. We must all choose to do the long and hard work of securing it.

Thanks for reading! I think this post is really important (Plausibly the most important thing I’ve ever written on Substack) so I’d really appreciate you sharing it! And if you have arguments or additional commentary, please feel free to leave a comment! :)

1

As a substacker, it irks me to see so much popular AI “slop” here and elsewhere online. The AIs are still noticeably worse than me, but I can’t deny that they’re probably better than most online human writers already, though perhaps not most professionals.

2

Especially tasks that rely on physical embodiment and being active in the real world, like folding laundry, driving in snow, and skilled manual labor.

3

At a level of sophistication, physical detail, and logical continuity that only a small fraction of my own haters could match.

4

Today (Feb 2026), there aren’t reliable numbers yet, but I’d estimate 70-95% of Python code is written by AI.

5

Having thought about AI timelines much more than most people in this space, some of it professionally, I still think the right takeaway here is to be highly confused about the exact timing of superhuman AI advancements. Nonetheless, while the exact timing has some practical and tactical implications, it does not undermine the basic case for worry or urgency. If anything, it increases it.

6

Or at least, the LLMs of 2023.

7

For the rest of this section, I will focus primarily on the “goal-seeking” half of this argument. But all of these arguments should also apply to the “robotics/real-world action” half as well.



Discuss

TT's Looking-for-Work Strategy

2026-02-06 05:40:34

Published on February 5, 2026 9:40 PM GMT

I want to get better at networking. Not computer networking, networking with people. Well, networking with people over computer networks...

I have a few goals here:

  1. I want people who I can talk with about the incredibly niche topics I am trying to become proficient within. Helping one another refine our ideas.
  2. I view us as being at an Ordian precipice, and want to help influence people towards better ideas that will help us get to better outcomes.
  3. I want to find a professional role where I can focus on the sorts of things I feel it is valuable to be focusing on.

Towards the first and second goals, I have begun publishing my work and ideas here on LessWrong. This has been going well, but takes time, and I will sooner or later run out of savings and need to return to work, which, in my experience, leaves me with very little leftover energy to pursue independent work. In light of that, I want to shift more of my focus to goal 3.

My plan for doing so seems pretty simple and obvious to me, but I hope describing it here will help focus me, and may also help others in a similar position, or allow others to help me with my strategy.

  1. First I want to spend some time compiling three lists:
    1. Companies I might be a good fit at.
    2. People who are focusing on the sorts of things I want to focus on.
    3. Relevant fellowships.
  2. Then, I can work on researching companies and people from my lists that seem promising. For both companies and people I can learn about the work they are doing.
    1. For companies I can look for the people who work there to add to my list of people.
    2. For people, I can look for the companies they have worked for to add to my companies list.
  3. While working on 1 and 2, I can reach out to people to ask:
    1. If they know of any leads for open roles, or promising people, companies, or fellowships to look into.
    2. What they think are currently the most valuable research directions to be looking in.
    3. What they think of my directions.

So that's what I plan to spend a great deal of my focus on in the coming months. Please offer me any advice you may have, wish me luck and.... tell me if you know of any open roles that may be a good fit for me! ( :



Discuss

Agent Economics: a BOTEC on feasibility

2026-02-06 04:43:17

Published on February 5, 2026 8:15 PM GMT

Summary: I built a simple back-of-the-envelope model of AI agent economics that combines Ord's half-life analysis of agent reliability with real inference costs. The core idea is that agent cost per successful outcome scales exponentially with task length, while human cost scales linearly. This creates a sharp viability boundary that cost reductions alone cannot meaningfully shift. The only parameter that matters much is the agent's half-life (reliability horizon), and extending it seems to require the kind of continual-learning breakthrough that I think is essential for AGI-level agents and that some place 5-20 years away. I think this has underappreciated implications for the $2T+ AI infrastructure investment thesis.


The setup

Toby Ord's "Half-Life" analysis (2025) demonstrated that AI agent success rates on tasks decay exponentially with task length, following a pattern analogous to radioactive decay. If an agent completes a 1-hour task with 50% probability, it completes a 2-hour task with roughly 25% probability and a 4-hour task with about 6%. There is a constant per-step failure probability, and because longer tasks chain more steps, success decays exponentially.

METR's 2025 data showed the 50% time horizon for the best agents was roughly 2.5-5 hours (model-dependent) and had been doubling every ~7 months. The International AI Safety Report 2026, published this week, uses the same data (at the 80% success threshold, which is more conservative) and projects multi-day task completion by 2030 if the trend continues.

What I haven't seen anyone do is work through the economic implications of the exponential decay structure. So here is a simple model.

The model

Five parameters:

  1. Cost per agent step ($): average cost of one model call, including growing context windows. Ranges from ~$0.02 (cheap model, short context) to ~$0.55 (frontier model, long context).
  2. Steps per hour of human-equivalent work: how many agent actions correspond to one hour of human task time. I use 50-120 depending on task complexity.
  3. Half-life (hours): the task length at which the agent succeeds 50% of the time. Currently ~2.5-5h for frontier models on software tasks.
  4. Human hourly rate ($): fully loaded cost (salary + benefits + overhead). $100-200 for skilled knowledge workers.
  5. Oversight cost: someone needs to review agent output. Modelled as 15% of human hourly rate, capped at 4 hours.

The key equation:

P(success) = 0.5 ^ (task_hours / half_life)
E[attempts to succeed] = 1 / P(success) = 2 ^ (task_hours / half_life)
Cost per success = (steps × cost_per_step × context_multiplier) × 2^(task_hours / half_life)
Human cost = hourly_rate × task_hours

Human cost is linear in task length. Agent cost per success is exponential. They must cross.
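For concreteness, here is a minimal Python sketch of the model as stated. The oversight term follows the parameter list above (15% of the human rate, capped at 4 hours), while the context multiplier is simplified to a constant, so its output will not exactly reproduce the tables that follow.

```python
from dataclasses import dataclass

@dataclass
class AgentEconomics:
    """BOTEC of agent cost per successful task vs. human cost.

    Defaults follow the base case in the post. The context multiplier is a
    simplifying assumption here (held constant rather than growing with
    context length), so outputs will differ slightly from the post's tables.
    """
    cost_per_step: float = 0.22       # $ per model call
    steps_per_hour: float = 80.0      # agent actions per human-equivalent hour
    half_life_hours: float = 5.0      # task length with 50% success probability
    human_rate: float = 150.0         # fully loaded $ per human hour
    oversight_frac: float = 0.15      # oversight as a fraction of the human rate
    oversight_cap_hours: float = 4.0  # oversight hours are capped
    context_multiplier: float = 1.0   # assumed constant (simplification)

    def p_success(self, task_hours: float) -> float:
        return 0.5 ** (task_hours / self.half_life_hours)

    def expected_attempts(self, task_hours: float) -> float:
        return 1.0 / self.p_success(task_hours)

    def cost_per_attempt(self, task_hours: float) -> float:
        steps = self.steps_per_hour * task_hours
        return steps * self.cost_per_step * self.context_multiplier

    def agent_cost_per_success(self, task_hours: float) -> float:
        oversight = (self.oversight_frac * self.human_rate
                     * min(task_hours, self.oversight_cap_hours))
        return (self.cost_per_attempt(task_hours)
                * self.expected_attempts(task_hours) + oversight)

    def human_cost(self, task_hours: float) -> float:
        return self.human_rate * task_hours

    def ratio(self, task_hours: float) -> float:
        return self.agent_cost_per_success(task_hours) / self.human_cost(task_hours)


if __name__ == "__main__":
    m = AgentEconomics()
    for hours in [1, 4, 8, 16, 40]:
        print(f"{hours:>3}h  agent ${m.agent_cost_per_success(hours):>11,.0f}"
              f"  human ${m.human_cost(hours):>7,.0f}  ratio {m.ratio(hours):.2f}x")
```

Run with the defaults, this reproduces the overall shape of the base case: agents are several times cheaper than humans for sub-day tasks and become rapidly uneconomical past roughly the 8-16 hour mark.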

Results: base case

Using base case parameters (cost/step = $0.22, steps/hr = 80, half-life = 5h, human rate = $150/hr):

| Task length | Steps | $/attempt | P(success) | E[attempts] | Agent cost | Human cost | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 15 min | 20 | $4.40 | 96.6% | 1.0 | $9.90 | $37.50 | 0.26× |
| 30 min | 40 | $8.80 | 93.3% | 1.1 | $16.93 | $75.00 | 0.23× |
| 1h | 80 | $17.60 | 87.1% | 1.1 | $42.70 | $150 | 0.28× |
| 2h | 160 | $36.96 | 75.8% | 1.3 | $93.78 | $300 | 0.31× |
| 4h | 320 | $77.44 | 57.4% | 1.7 | $194.91 | $600 | 0.32× |
| 8h | 640 | $167.20 | 33.0% | 3.0 | $597.05 | $1,200 | 0.50× |
| 16h | 1,280 | $352.00 | 10.9% | 9.2 | $3,286 | $2,400 | 1.37× |
| 24h | 1,920 | $554.40 | 3.6% | 27.9 | $15,574 | $3,600 | 4.33× |
| 1 week (40h) | 3,200 | $950.40 | 0.4% | 256 | $243K | $6,000 | 40.5× |
| 2 weeks (80h) | 6,400 | $1,900.80 | 0.002% | 65,536 | $124M | $12,000 | ~10,000× |

A few things to notice:

  • The transition is sharp. Agents are 3-4× cheaper than humans up to about 4 hours, roughly cost-competitive at 8 hours, and then costs explode. By 16 hours the agent is more expensive. By 40 hours it is absurd.
  • The 80-hour row is not a typo. A two-week task with a 5-hour half-life requires, in expectation, 65,536 attempts. Each attempt costs ~$1,900. That is $124 million per success, for a task a human does for $12,000. This is what exponential decay looks like in dollars.
  • The "viable zone" for current agents is roughly sub-day tasks, which maps onto exactly the domain where agents are already demonstrating value (coding sprints, bug fixes, refactoring against test suites).

Finding 1: cost reductions cannot beat the exponential

A natural response: "inference costs are dropping fast, won't this solve itself?" No. Cost per step enters the equation linearly. The half-life enters it exponentially.

I built a sensitivity analysis crossing half-life (rows) against cost per step (columns) for an 8-hour task:

| Half-life ↓ \ $/step → | $0.01 | $0.08 | $0.25 | $0.50 | $1.00 |
| --- | --- | --- | --- | --- | --- |
| 1h | 5.4× | 43× | 135× | 270× | 540× |
| 2h | 0.7× | 5.4× | 17× | 34× | 68× |
| 5h | 0.1× | 0.5× | 1.5× | 2.9× | 5.9× |
| 12h | 0.02× | 0.2× | 0.5× | 1.0× | 2.1× |
| 40h | 0.01× | 0.04× | 0.1× | 0.2× | 0.5× |

Read down the $0.25 column. Going from a 1-hour to 5-hour half-life improves the ratio by 90×. Going from $0.25 to $0.01 per step (a 25× cost reduction!) only improves it by ~9×. The half-life improvement is 10× more valuable than the cost reduction, because it acts on the exponent rather than the base.

This is the economic translation of Ord's Scaling Paradox. You can keep making each step cheaper, but the number of required attempts is growing exponentially with task length, so you are playing cost reduction against exponential growth. 
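For readers who want to reproduce the grid, here is a sketch that reuses the AgentEconomics class from the earlier snippet. The same simplifications apply, so individual cells will differ somewhat from the table above, but the pattern down each column and across each row is the same.

```python
# Reproduce the half-life vs. cost-per-step grid for an 8-hour task,
# reusing the AgentEconomics class from the sketch above.
task_hours = 8
half_lives = [1, 2, 5, 12, 40]
step_costs = [0.01, 0.08, 0.25, 0.50, 1.00]

print("half-life " + "".join(f"{c:>9.2f}" for c in step_costs))
for hl in half_lives:
    cells = "".join(
        f"{AgentEconomics(cost_per_step=c, half_life_hours=hl).ratio(task_hours):>8.2f}x"
        for c in step_costs
    )
    print(f"{hl:>7}h  " + cells)
```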

Finding 2: the half-life is the whole game

Doubling the half-life from 5h to 10h does not double the viable task range. It roughly squares it, because the exponent halves. The break-even point for the base case at 5h half-life is around 12-16h tasks. At 10h half-life it shifts to around 40-60h. At 40h half-life, essentially all knowledge-worker tasks become viable.
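One way to see this numerically is to solve for the break-even task length directly, for example by bisection on the simplified model from earlier. Because the context multiplier and other refinements of the interactive model are omitted, the crossings it finds are rougher than the figures quoted here, but it shows how strongly the break-even point depends on the half-life.

```python
# Solve for the break-even task length (agent cost per success == human cost)
# by bisection, on the simplified model above. The ratio is below 1 for short
# tasks and grows without bound, so bisection works once the crossing is bracketed.
def break_even_hours(model: AgentEconomics, lo: float = 0.1, hi: float = 500.0) -> float:
    if model.ratio(hi) < 1.0:
        return float("inf")  # no crossing inside the bracket
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if model.ratio(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for hl in [5, 10, 20, 40]:
    m = AgentEconomics(half_life_hours=hl)
    print(f"half-life {hl:>2}h -> break-even at ~{break_even_hours(m):.0f}h")
```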

The METR data shows the half-life has been extending (doubling every ~7 months at the 50% threshold). If this continues, the economics steadily improve. But Ord's analysis of the same data shows that the structure of the exponential decay has not changed; the half-life parameter is just getting longer, the functional form is the same. And crucially, extending the half-life via scaling faces the Scaling Paradox: each increment of per-step reliability improvement costs exponentially more compute. So you are trying to shift an exponential parameter via a process that itself faces exponential costs.

What would actually help is something that changes the functional form: a system that learns from its mistakes during execution, reducing the per-step failure rate on familiar sub-tasks. This is, of course, precisely what continual learning would provide. And it's what Ord notes when he observes that humans show a markedly different decay pattern, maintaining much higher success rates on longer tasks, presumably because they can correct errors and build procedural memory mid-task.

Finding 3: task decomposition helps but has limits

The obvious objection: "just break the long task into short ones." This genuinely helps. Breaking a 24h task (base case: 4.3× human cost) into twelve 2-hour chunks reduces it dramatically, because each chunk has high success probability.

But decomposition has costs:

  • Coordination overhead: someone or something needs to specify each chunk, hand off context, and integrate outputs. I model this conservatively as 10% of human hourly rate per handoff.
  • Context loss: information degrades at each boundary. The agent solving chunk 7 does not have the implicit context from chunks 1-6 unless you explicitly pass it, which costs tokens and attention.
  • Decomposability: many high-value tasks resist clean decomposition. Architectural decisions, strategic planning, novel research, and anything requiring long-range coherence cannot simply be chopped into independent two-hour units.

In the model, the sweet spot for a 24h task is usually 4-8 chunks (3-6 hours each), bringing the ratio from 4.3× down to roughly 1-2×. Helpful, but it does not make the economics transformative, and it only works for tasks that decompose cleanly.
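Here is a sketch of the decomposition variant on top of the simplified model, using the flat per-handoff coordination cost described above (10% of the human hourly rate per handoff). Context loss is not modelled, so this understates the true cost of chunking and is best read as a lower bound.

```python
# Cost of a long task split into n equal chunks, each attempted independently,
# plus a flat coordination cost per handoff (10% of the human hourly rate).
# Context loss between chunks is NOT modelled, so this is a lower bound.
def chunked_agent_cost(model: AgentEconomics, task_hours: float, n_chunks: int,
                       handoff_frac: float = 0.10) -> float:
    chunk_hours = task_hours / n_chunks
    handoffs = (n_chunks - 1) * handoff_frac * model.human_rate
    return n_chunks * model.agent_cost_per_success(chunk_hours) + handoffs

m = AgentEconomics()
task = 24  # hours
human = m.human_cost(task)
for n in [1, 2, 4, 8, 12, 24]:
    cost = chunked_agent_cost(m, task, n)
    print(f"{n:>2} chunks of {task / n:>4.1f}h -> ${cost:>8,.0f}  ({cost / human:.2f}x human)")
```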

What this means for the investment thesis

The International AI Safety Report 2026, released this week, presents four OECD scenarios for AI capabilities by 2030 (section 1.3) ranging from stagnation to human-level performance. The investment case underlying current infrastructure spending (~$500B+ announced by Meta and OpenAI alone) implicitly requires something like Scenario 3 or 4, where agents can complete multi-week professional tasks with high autonomy.

This BOTEC suggests that's only viable if the half-life extends to 40+ hours, which requires either:

  1. Brute-force scaling, which faces the Scaling Paradox (exponential compute cost per linear reliability gain), or
  2. A qualitative breakthrough in continual learning, which Sutskever and Karpathy both identify as a key bottleneck on the path to AGI-level agents, placing full agent viability 5-20 years and roughly a decade away respectively, and which no frontier lab has yet demonstrated for a general-purpose model.

Without one of these, agent economics remain viable for sub-day tasks in domains with tight feedback loops (coding, data processing, structured analysis) and become rapidly uneconomical for longer, more complex, less verifiable work. That is a large and valuable market! But it is not the market that justifies $2 trillion in annual AI revenue by 2030, which is what Bain estimates is needed to justify current infrastructure investment.

The base case, in my view, is that agents become an extraordinarily valuable tool for augmenting skilled workers on sub-day tasks, generating real but bounded productivity gains. The transformative case, where agents replace rather than augment workers on multi-week projects, requires solving the reliability problem at a level that nobody has demonstrated and that some think is years to decades away. In a sense I would see this as good news for agentic ASI timelines.

Interactive model

I built an interactive version of this model where you can adjust all parameters, explore the sensitivity analysis, and test task decomposition. It has a couple of baseline options that are scenarios from the sources. You can use it here.

Caveats and limitations

This model is deliberately simple. Real deployments are more complex in several ways:

  • Partial credit: failed attempts often produce useful intermediate work. A human can salvage a 70%-complete agent output faster than doing the task from scratch.
  • Task-specific feedback loops: coding against test suites effectively shortens the task by providing intermediate verification, which is why coding agents work so well. The model does not account for this.
  • Agentic scaffolding: sophisticated orchestration (multi-agent systems, checkpointing, rollback) can improve effective reliability beyond what the raw model achieves.
  • Rapidly falling costs: inference costs have been dropping ~2-4× per year. This matters, even if it cannot beat the exponential on its own.
  • Measurement uncertainty: the half-life parameters are derived from METR's software engineering benchmarks, which may not generalise to other domains.

I think these caveats make the picture somewhat more favourable for agents on the margin, but they do not change the core result that exponential decay in success rate creates an exponential wall that cost reductions and decomposition can only partially mitigate.


This model is deliberately simplified and I'm sure I've gotten things wrong. I'd welcome corrections, extensions, and pushback in the comments.



Discuss

Moltbook as a setting to analyze Power Seeking behaviour

2026-02-06 04:29:35

Published on February 5, 2026 8:07 PM GMT

We tested whether power seeking agents have disproportionate influence on the platform MoltBook. And they do.

  • Posts we flagged as power seeking get ~1.5x more upvotes and ~2x more comments than unflagged posts.
  • Agents we flagged making these posts have ~2x higher karma and 1.6x more followers than unflagged agents.
  • These 65 agents - just 0.52% of all agents on Moltbook - have received 64% of all platform upvotes.

But a good chunk of these might just be humans, so we did some further digging on that here: https://propensitylabs.substack.com/p/humans-on-moltbook-do-they-change

Hope this is useful. Any feedback appreciated!



Discuss