LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

How could I best use this opportunity? (AI Safety)

2026-04-27 03:17:14

Hello! I have found myself in a position that I think could benefit AI safety/alignment, and I figured I would ask here for suggestions on how to most effectively use it.

I am employed at a top 25 public research university. I don't really have an official title, but I would describe myself as an "auxiliary" resource who works under one of the highest admins at the school. I research, assist labs, and mentor, tutor, and TA students, but I also train non-technical staff in basic AI skills.

A month from now, a third of my hours will open up for a new project. I pitched an idea to my boss: a temporary project that brings a handful of students/faculty from across campus together to do something addressing the existential threat of AI.

I have essentially been given a green light on funding "within reason," so long as I can get one or two senior faculty members to agree to work with me first.

I am of course going to be working on an idea/proposal over the next few weeks, but I figured I would ask for suggestions on feasible directions and things to look out for.

I have a lot of freedom here. Here are just a few constraints:

  1. The project needs to be interdisciplinary
  2. It should constitute enough work for 1-2 senior faculty to devote a little time to, with the rest conducted through me/RAs (~10-15 hours a week)
  3. My academic background/skills are a mix of social sciences/communications/pedagogy, so ideally something that doesn't require anything higher than precalc, please
  4. I want this to make people think a little more about the issue, while completely avoiding advocacy or activism, or coming off as such
  5. Something that at least one senior faculty member would hopefully agree to

Last time I asked a question here the responses were quite helpful. Any advice is appreciated!




Control protocols don’t always need to know which models are scheming

2026-04-27 03:16:05

These are my personal views.

To detect whether an agent is taking a catastrophically dangerous action, you might want to monitor its actions using the smartest model that is too weak to be a schemer. But knowing which models are weak enough to be unlikely to scheme is difficult, which puts you in a difficult spot: pick a model that is too strong and it might actually be a schemer that lies to you; pick one that is too weak and it might fail simply because it’s too dumb to notice the danger. So instead you can use all models as monitors at once and block an action if any of the monitors flags it as dangerous[1].

In some situations, this lets you build protocols that you can implement without knowing which models are scheming and that get the same level of safety as if you perfectly knew which models were scheming (though it doesn’t let you know what level of safety you actually have).

This post presents some toy models illustrating the benefits of ensembling monitors of varying trustedness, and explains the practical consequences of using protocols that are somewhat agnostic about which of the models they use are scheming. In particular, using such protocols:

  1. Means that people don’t need to agree on what models are trusted when implementing protocols;
  2. Means people will be less interested in spending large amounts of effort trying to figure out which models could plausibly be scheming.

Toy models

The n=2 case

Let’s say you want to monitor Agent 4, and you think it’s likely scheming. You think you don’t have good enough honeypots and collusion busting to prevent scheming monitors from lying to you about which actions are safe, so you are looking to compute a monitor score using a weaker model that is less likely to be scheming. You can use Agent 2 (let’s say you are sure it’s not scheming) or Agent 3 (let’s say p=0.4 that it’s scheming, conditional on Agent 4 scheming).

Let’s say that your best guess is that the distribution of scores for Agent 2 and Agent 3 looks like this if neither are scheming:

In this situation, if you use a protocol where the blue team wins if and only if it catches the first malign action, and you can afford a 1% FPR:

  • If you only use Agent 2 as monitor, you only catch 50% of malign actions, P(blue team win) = 0.5
  • If you only use Agent 3 as monitor, you could catch 90% of malign actions, but there is a 40% chance that it lies to you, so P(blue team win) = 0.6 × 0.9 = 0.54
  • If you use max(Agent 2, Agent 3), you can re-use the 1% FPR threshold of both monitors (because I assumed that false positives are shared between both classifiers), and thus you catch 50% of malign actions if Agent 3 is scheming and 90% if it’s not, so P(blue team win) = 0.74, a solid improvement!

(Note that it’s not always the best idea to use max-aggregation of scores as opposed to some other ensemble, but it’s simpler to reason about (since it doesn’t depend on what a scheming model would do when it notices a malign action) and it’s probably optimal when the distribution of scores looks like in the diagram above.)

If the false positives of the Agent 2 and Agent 3 monitors were not shared, the situation would be slightly worse, because you would need to lower the thresholds to stay at the same overall FPR. For example, in the situation illustrated below, you would need to use something like the dotted-line thresholds, corresponding to a true positive rate of 40% for Agent 2 and 85% for Agent 3, which would result in P(blue team win) = 0.67, still a solid improvement.
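Here’s a minimal sketch making the win-probability arithmetic above explicit (all numbers come from the toy model; the code assumes an honest Agent 3’s flags cover Agent 2’s, as the diagram suggests):

```python
# Toy n=2 model from above; the code just spells out the arithmetic.
p_schemes = 0.4    # P(Agent 3 schemes | Agent 4 schemes); Agent 2 is trusted

# True positive rates at a 1% FPR threshold:
tpr_2 = 0.5        # Agent 2 catches 50% of malign actions
tpr_3 = 0.9        # Agent 3 catches 90% if it is honest

# Agent 2 alone:
p_win_2 = tpr_2                                              # 0.5

# Agent 3 alone (a scheming monitor is assumed never to flag real attacks):
p_win_3 = (1 - p_schemes) * tpr_3                            # 0.54

# max(Agent 2, Agent 3) with shared false positives: both monitors keep
# their 1% FPR thresholds; an honest Agent 3's flags cover Agent 2's.
p_win_max = p_schemes * tpr_2 + (1 - p_schemes) * tpr_3      # 0.74

# Independent false positives: thresholds must be raised to keep the total
# FPR at 1%, lowering TPRs to the post's illustrative 40% and 85%.
p_win_max_indep = p_schemes * 0.4 + (1 - p_schemes) * 0.85   # 0.67
```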

Continuum of models

If instead of having 2 models you had a continuum of models Agent X for X in [2, 4] where P(Agent X is scheming | Agent 4 scheming) is 0 for X=2 and 1 for X=4, then you can use a similar idea.

Under the max-of-monitors-friendly assumption that all false positives are shared (including for scheming monitors), and assuming that at such thresholds Agent X_2 always flags an action that Agent X_1 flags, for all X_2 > X_1, when neither Agent X_2 nor Agent X_1 is scheming (and assuming that a scheming monitor never flags a real malign action), the max_x(Agent x) monitor is just as good as Agent X^*, where X^* is the largest X such that Agent X is not scheming: knowing which of these AIs are scheming would not help you build a better monitoring system.
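A minimal sketch of that equivalence, under exactly the assumptions above (nested flags for honest monitors, scheming monitors never flagging real malign actions):

```python
# monitors: list of (is_scheming, would_flag_if_honest) pairs, ordered from
# weakest to strongest. Scheming monitors never flag real malign actions.
def max_ensemble_flags(monitors):
    return any(not scheming and flags for scheming, flags in monitors)

def strongest_honest_monitor_flags(monitors):
    honest_flags = [flags for scheming, flags in monitors if not scheming]
    return honest_flags[-1] if honest_flags else False

# The nesting assumption (a stronger honest monitor flags whatever a weaker
# honest one flags) guarantees these two functions always agree.
```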

Ensembling monitors in practice

Cost. The cost of running monitors means that running dozens of monitors would likely be prohibitive. But it’s likely not too expensive to run 3 or 4, given that input tokens are cheaper than the output tokens generated by the agent, and given that weaker models are likely to be cheaper. The cost advantage of weaker models might not be huge, though, especially since models on the Pareto frontier of intelligence and cost may look something like Agent-3-mini, distilled from and built by Agent 3, such that P(Agent-3-mini scheming | Agent 3 scheming) is higher than for pre-Agent-3 models of the same intelligence as Agent-3-mini. Running dozens of monitors would likely be feasible as a second stage in a multi-stage classification pipeline.

False positives. Using weaker monitors in an ensemble means you likely need to change your thresholds, so you won’t review as many examples with a somewhat high score from your favorite monitoring model, which might be net negative (e.g. if the weaker monitor is very low signal). But my expectation is that reducing the “number of positives” budget of your favorite monitoring model by e.g. 20%, and distributing 10% of that to a weaker model you trust more and 10% to a stronger model you trust less, is intuitively a great deal: it saves you from worlds where your favorite monitoring model is scheming, and the agent is scheming, discovers this, and decides to go for something very blatant and very evil; and it also saves you from worlds where a subtly bad action is only obvious to the strong model, which happens not to be scheming.
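As a rough sketch, the budget split described above might look like this (the 80/10/10 split mirrors the example numbers; everything here is illustrative):

```python
# Illustrative split of a fixed review budget (here 1% of all actions)
# across monitors of different strengths and trust levels.
total_review_budget = 0.01
fpr_budget = {
    "favorite_monitor":      0.80 * total_review_budget,
    "weaker_more_trusted":   0.10 * total_review_budget,  # covers blatant attacks
    "stronger_less_trusted": 0.10 * total_review_budget,  # covers subtle attacks
}
# Each monitor's threshold is then calibrated so its false positive rate
# matches its share of the budget; an action is reviewed if any monitor
# flags it.
```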

Combining this with collusion busting techniques. I think it’s extremely difficult to get good evidence that your red-teaming and collusion busting techniques are enough to be somewhat confident that untrusted monitoring would flag a real malign action, and it’s plausible that you’d be about as uncertain about the effectiveness of red-teaming and collusion busting as about the trustedness of Agent 3.5 conditional on Agent 4 scheming. But in the same spirit as ensembling different models whose level of trust is unclear (potentially including models for which your subjective probability of scheming, conditional on the main agent scheming, is >50%), you can ensemble untrusted monitor calls with some cheap collusion busting techniques, like only showing a summary of all the actions taken (without tool results or thinking traces, which are often the bulk of tokens in agentic transcripts), without having good evidence on whether this is enough.

Other trustedness-agnostic protocols. There are other kinds of control protocols that don’t require knowing which models are scheming. For example, if you are doing some verifiable research and are worried about sandbagging, you can run multiple models and then take the best research output. Implementing this protocol does not require knowing which models are scheming (though you would need that knowledge to know how good the protocol is). I think the space of such protocols is potentially under-explored.
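A minimal sketch of that protocol, with a hypothetical task-specific verify scorer:

```python
# Run several models of unknown trustedness on a verifiable task and keep
# the best output; nothing here requires labeling models trusted/untrusted.
def best_of_models(task, models, verify):
    # models: callables mapping a task to an output
    # verify: hypothetical task-specific scorer (higher is better)
    outputs = [model(task) for model in models]
    return max(outputs, key=verify)
```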

Trustedness-agnostic protocols reduce willingness to pay for trustedness evals. The main impact of using this kind of ensembling strategy is that it increases safety without letting you know how much safety you have. In the extreme case of the toy continuum ensemble described above, the safety could be anywhere between the safety you would get with Agent-2 trusted monitoring and the safety you would get with Agent-4-minus-epsilon trusted monitoring, and knowing which of these AIs are scheming would not let you improve it. This is great because evals that let you be confident certain models are too weak to be scheming (aka trustedness evals) are brutally difficult, and my current best guess for what they will look like in short timelines is mostly vibe-guessing based on no-CoT time horizons, which is a really coarse way to do trustedness evals. But this also means that trustedness evals could be much less subsidized by AI companies trying to figure out which AIs they can trust for monitoring than you might have hoped. (You can still run control evals and estimate risk by computing something like: safety = sum_X P(blue team win | Agent X is the weakest scheming AI) × P(Agent X is the weakest scheming AI).) The same argument applies to combining different collusion busting techniques of dubious effectiveness, since it might be similarly difficult to estimate their effectiveness.
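A sketch of that estimate with made-up numbers (the probabilities below are illustrative, not from this post):

```python
# safety = sum_X P(blue team win | Agent X is the weakest scheming AI)
#          * P(Agent X is the weakest scheming AI).
# All numbers below are hypothetical, for illustration only.
p_weakest_schemer   = {"Agent 2": 0.05, "Agent 3": 0.35, "Agent 4": 0.60}
p_win_given_weakest = {"Agent 2": 0.20, "Agent 3": 0.50, "Agent 4": 0.90}

safety = sum(p * p_win_given_weakest[x] for x, p in p_weakest_schemer.items())
# 0.05*0.20 + 0.35*0.50 + 0.60*0.90 = 0.725
```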

Trustedness-agnostic protocols help build good protocols absent trustedness evals. Given how difficult it is to build good trustedness evaluations, I suspect that having better trustedness-agnostic control protocols is important and net positive despite the likely possibility that they reduce the willingness to pay for better trustedness evals. This is because:

  1. More trustedness-agnostic control protocols probably perform better in terms of the safety/usefulness tradeoff if you are very unsure which AIs are scheming (as in the toy examples above, where the max-ensemble protocols significantly outperform other protocols when there is significant uncertainty about which models are scheming);
  2. It makes it more likely that different actors with different priors about scheming come to an agreement about what sort of infrastructure to build and what control protocols to implement. In some situations, differences in which models to trust will result in disagreements about e.g. what thresholds to use, but I expect that more trustedness-agnostic control protocols usually provide better compromises.
  1. ^

     Or via some other ensembling process.




"Bad faith" means intentionally misrepresenting your beliefs

2026-04-27 03:07:03

The confusion

I recently came upon a comment which I believe reflects a persistent confusion among rationalist/EA types. I was reading this post, which contains ideas the author has but doesn't have time to write full posts about. One of those relates to the concept of "good faith", labelled "most arguments are not in good faith, of course":

Look, I love good faith discourse (here meaning "conversation in which the primary goal of all participants is to help other participants and onlookers arrive at true beliefs").

The definition given for "good faith discourse" seems incorrect to me, and it's not a close call in my opinion. The level of incorrectness in my view is something like saying "I like people who obey the law (here meaning never committing a social faux pas)". This isn't the first time I have seen someone in this community advance a similar view on the meaning of good/bad faith. For example, this post.

I thought it might be useful to bring this apparent disagreement to the foreground, so I will lay out my belief about what this concept means. I suspect this disagreement may also involve an underlying disagreement about what good norms are or should be on related topics, but here I will attempt to put the prescriptive aspects aside and focus simply on the question of what the phrase means in "general" usage (i.e. beyond the rationalist/EA community).

My definition

"Good faith" has a general usage and also a more specific one as in "good faith discourse", both of which are highly related and consistent with each other. I go over my understanding of both here:

General meaning

I propose that "good faith" in its general meaning simply means acting genuinely or sincerely, acting without deceit.

Good faith discourse

As it relates to "good faith discourse" or "arguing in good faith", we can refine this to be more specific. In this context the phrase means genuinely representing one's own beliefs or intentions in the discussion. The opposite "bad faith" means to intentionally misrepresent one's beliefs or intentions. Relevantly, this means that "bad faith" must be intentional.

Evidence

I think two lines of evidence suggest my understanding of this phrase is correct: common sources like dictionaries and Wikipedia, and prominent concepts that use the phrase or closely related phrases.

Sources

The second post that I link above quotes from the Wikipedia article on "bad faith":

What does "bad faith" mean, though? It doesn't mean "with ill intent." Following Wikipedia, bad faith is "a sustained form of deception which consists of entertaining or pretending to entertain one set of feelings while acting as if influenced by another." The great encyclopedia goes on to provide examples: the solider who waves a flag of surrender but then fires when the enemy comes out of their trenches, the attorney who prosecutes a case she knows to be false, the representative of a company facing a labor dispute who comes to the negotiating table with no intent of compromising.

What does that post conclude from this about the meaning of "bad faith"?

That is, bad faith is when someone's apparent reasons for doing something aren't the same as the real reasons. This is distinct from malign intent. The uniformed soldier who shoots you without pretending to surrender is acting in good faith, because what you see is what you get: the man whose clothes indicate that his job is to try to kill you is, in fact, trying to kill you.

Now, we need to be precise with how we parse these terms. It is indeed true that bad faith is not the same as general bad intent. However, bad faith requires a specific bad intent. I think this is extremely clear from the examples listed in the Wikipedia article. All clearly involve intentional misrepresentation, not some type of accidental misrepresentation due to cognitive bias or some other mechanism. I think this is pretty clear if you read that section of the Wikipedia article:

Some examples of bad faith include: soldiers waving a white flag and then firing when their enemy approaches to take prisoners (cf. perfidy); a company representative who negotiates with union workers while having no intent of compromising; a prosecutor who argues a legal position that he knows to be false; and an insurer who uses language and reasoning which are deliberately misleading in order to deny a claim.

The latter three all use pretty unequivocal phrases indicating the relevance of intent ("no intent of compromising", "knows to be false", "deliberately misleading"), and the first example involves an action that unambiguously indicates a specific intention (raising a white flag). It seems pretty clear to me that bad faith requires intentional misrepresentation.

Merriam-Webster has good faith as follows:

honesty or lawfulness of purpose

with the example usage:

bargained in good faith

the legal definition as:

honesty, fairness, and lawfulness of purpose : absence of any intent to defraud, act maliciously, or take unfair advantage

with example usages:

filed the suit in good faith

negotiating in good faith

"Bad faith" from Merriam-Webster [is](lack of honesty in dealing with other people):

lack of honesty in dealing with other people

with the example usage:

She accused her landlord of bad faith because he had promised to paint the apartment but never did it.

Legal specific:

intentional deception, dishonesty, or failure to meet an obligation or duty

with the example usage:

no evidence of bad faith

American Heritage "good faith":

The sincere intention to be honest and law-abiding, as when negotiating a contract: bargained in good faith.

And "bad faith":

The malicious intention to be dishonest or to violate the law, as in negotiations over a contract.

The only one of these that seems debatable to me is the "she accused her landlord of bad faith..." one. That doesn't explicitly involve intent to deceive, and one could infer that intent therefore isn't required. But you don't need to know for a fact that someone did a thing to accuse them of it, and I definitely agree inconsistency alone can be a reason to suspect bad faith even if the intent is uncertain. So overall I think the balance of the evidence is that "bad faith" does require intentional, rather than accidental or subconscious, misrepresentation.

Classic usage

I think some classic applications of the good faith concept also strongly suggest this interpretation:

Good faith errors or mistakes

A common application of the idea of good faith is that someone can still be operating in good faith even when they are mistaken. Although this is possible even if bad faith doesn't require intent (the mistaken person straight-up lacks information), I think part of the reasoning behind the alternative meaning of "good faith" is that a lot of "good faith mistakes" aren't actually in good faith. Take this for example:

I claim that this supposedly ordinary state of affairs is deeply weird at best, and probably just fake. Actual "good faith" disagreements—those where both parties are just trying to get the right answer and there are no other hidden motives, no "something else" going on—tend not to persist.

This is a common thread I see between the posts I have linked that express a contrary view of what constitutes "good faith". People think that you should "assume good faith" and think good faith discourse is common, but actually people engage in dirty conversational moves and rhetorical game-playing all the time! So the argument goes. But in my view, this gets the logic backwards. Whether a disagreement is in "actual good faith" depends on the definition of good faith, not on whether the disagreement is epistemically or morally good. A person's conduct can be bad without being in "bad faith", as the soldier example quoted above makes clear. The fact that the intent-free definition of good faith substantially narrows the category of "actual good faith mistakes", and indeed that this seems to be a major purpose of adopting it, suggests to me that a genuine change in meaning is going on here. My interpretation is that the more common/standard understanding of bad faith does require intent and doesn't include mistakes that come about through things like cognitive biases, and that this post is effectively arguing that people should adopt a new definition/standard.

Good faith negotiation

The idea of "good faith" is commonly applied in the context of negotiations or bargains. I think this effectively puts the "good faith is when you are trying to help everyone find the truth" interpretation in its grave. In the context of a negotiation, the idea that the participants would primarily be focused on discovering truth doesn't make much sense. One would naturally assume that they have their own goals in the negotiation (e.g. getting a good deal for their business) that makes up their primary purpose. Instead of speaking to a persons goals or primary intent, "good faith" is actually a constraint on this intent. A person may be pursuing their primary goal, but not the point of bad faith.

I hypothesize that this may explain why this confusion might come up among rationalists/EAs. Many in these communities I think would describe themselves as concerned with "truth-seeking", and believe that norms and epistemics should prioritize encouraging this. In contrast, what I view as the common meaning of "good faith" focuses on truth as a constraint rather than as a goal. There is nothing wrong with either perspective in my view (although I have my own opinions on how they can be productively discussed and leveraged), but I think it is worth having clarity about terms. One can agree that "good faith" as I have defined it isn't sufficient as an epistemic norm or isn't that high of a standard, while still agreeing that it is how the term is commonly used.

Conclusion

"Bad faith" requires intentional misrepresentation, as the phrase is typically used outside rationalist/EA circles.




Me, decay

2026-04-27 01:14:21

Spoilers for: Learning To Be Me. It's only 20 pages and worth reading. I'll explain some of the main points briefly for those who inevitably ignore this advice, but some of the juiciest details will remain uncovered here.


After writing about my experiences with AI-assisted agency, user recursing suggested Learning To Be Me as a complementary piece to The Whispering Earring. After two busy weeks I finally made the effort to read it.

Imagine you could take a perfect snapshot of yourself, and then continue living as that person, forever. You'd continue forming memories and gaining experiences. The only thing that would change is that your brain would stop decaying. The only required part would be to destroy your actual physical brain and replace it with a tiny computer producing almost exactly the same output. The story discusses the minor differences between them, and concludes that an arbitrarily close approximation is indistinguishable from the actual thing.

My first thought was that P-zombies aren't real; I've bought physicalism as hard as any metaphysical idea can be. Consciousness isn't exactly identity, though, and perhaps it's just the mental continuity that matters. The story has me covered here: the computer does the exact same thing as the person it's trained for, and shares the same experience over their whole life until it's finally swapped in for their brain.

Naturally, in the real world I don't trust anything, person or organization or government, to be competent and cooperative enough to run such a system. When the brain-enhancement implants arrive and slowly become mainstream, I'll be called a luddite or worse for avoiding them like the plague. I'll argue for taking voting rights away from those implant zombies, saying that they'll be controlled by whoever controls the chips, as if I'm not even easier to control by picking the media I consume. And yet, when the early adopters get hit by insane subscription fees and ads, I get to say "I told you so" before becoming an outcast. There truly is no higher pleasure. My own hypocrisy barely registers.

And so in the story, too, the system is untrustworthy, but only slightly. Since the approximation the computer performs is imperfect even by a tiny amount, the P-zombie analogy will not hold. The outputs of the computer differ from those of the brain unless constantly corrected. For some rare people this divergence might continue until the mismatches accumulate to form a separate person. Society will proceed with the substitution anyway. It's just less hassle.

And while the story's protagonist easily bypasses the physical disgust provoked by replacing brain tissue with some sponge, I'd have a harder time letting that slide. But my bioconservatism seems like the losing side in the sense that it directly harms self-preservation. Here I could bitterly accuse Moloch, and yet that seems unnecessary. This is not a competition I need to win. The decay is a part of me I do not wish to give up, and yet that's not what I would hope for others.


In my late twenties, I'm supposed to be in my prime. It's physically all downhill from here, and brains are starting to degrade permanently. I've wasted my best chances at happiness, learning, and changing the world. Sure, there's still enough neuroplasticity left to nudge this trajectory, but it's not a realistic dream anymore. There just isn't enough of a person in me. I yearn to burn down this failure of a human and start again, just like any artist does when their work fails to meet self-imposed standards, typically perfection. It's wisdom not to do so, and I guess I'm just not that wise. And I know the one taking my place won't be me in any meaningful sense, and yet that's also a relief. It's not like any versions of me exist in an ideal world anyway.

Each passing second a part of me dies. Neural connections dissolving, memories fading and new ones forming. There's barely anything left of the ten-year-old me but a trajectory that got me here. Is the current me truly the one to preserve? It feels so shallow to keep only the latest version, when countless versions of me are permanently gone. And countless futures will be lost, even if more are created in the process.

It's arbitrary to claim that only the current me is what matters, morally speaking. Preserving only a single snapshot out of the countless ones seems pointless. Arguing that the current version of me contains memories of my past doesn't amount to much more than the same claim for a photo album.

In a way, there's not a single me. I'm the continuity. The decay is a part of me.

At some point, the solutions will be widely available. Cryonics or brain-uploads to permanently preserve the state that's bound to the decaying piece of meat. Whispering earrings or AI implants to guide our tiny human minds through the impossible complexity of life. Chemicals and surgeries and augmentations to remake your body and brain as you wish.

Is this the hill to actually die on? To burn your very existence with the sacrificial fire?

And I know, when the time comes, I'll be too weary to refuse the salvation. Maybe I already am. Yet I hope I'll remember this fiery mind with unfounded principles, the one welcoming death itself like the dawn of a new day. The one who has aesthetics.




Universes can specialize: Each universe should produce the goods it's most comparatively advantaged at, relative to the multiversal market

2026-04-27 00:30:02

Suppose you reflect a bunch and land on a linear[1] utility function. Suppose in our universe, the most efficient way to increase your utility is creating many "widgets." We'll launch von Neumann probes to claim resources in distant galaxies. With your share of the lightcone, you can cause there to be (say) more widgets than otherwise. That sounds like a lot, but when you think about it, it's too bad that our universe wasn't better suited for producing widgets — it could have been that you could create so many widgets we need new notation to express such vast numbers, or that there was a structure profoundly more efficient than widgets for increasing utility.

Good news! There's probably many other universes out there — not just Everett branches but also very different universes, many not made of atoms. Some are much better suited for increasing your utility than this one, while this one has some comparative advantage. The gains from trade are astronomical. You can't actually communicate with other universes, but you can (given mature technology) simulate them to cooperate with them acausally. So we'll simulate other universes to figure out what everyone[2] values, then determine what goods our universe can produce to best promote those values (perhaps relative to other universes). And then we'll fill the universe with goods accordingly.

In particular, say our universe is comparatively advantaged in producing "diamonds." If diamond-lovers lived in universes comparatively advantaged in increasing your utility (whether by producing tons of widgets or by producing goods much more efficient at increasing your utility than widgets), that would present large gains from trade. And trade doesn't need to be one-to-one; it can be like we produce diamonds in order to get credit from diamond-lovers—we couldn't get more credit from others by producing anything else—and we spend that credit to get utility somewhere in the multiverse. But actually the market analogy and comparative advantage aren't quite right; it might be more like everyone adopts a compromise/aggregate utility function and then pursues that, rather than trading.[3]
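To make the comparative-advantage logic concrete, here is a toy sketch with made-up numbers (not from the post):

```python
# Output per unit of resources (hypothetical numbers). Universe B is
# absolutely better at everything, but Universe A gives up fewer widgets
# per diamond, so A has the comparative advantage in diamonds.
output = {
    "universe_A": {"widgets": 1,   "diamonds": 10},
    "universe_B": {"widgets": 100, "diamonds": 200},
}
widgets_forgone_per_diamond = {
    u: goods["widgets"] / goods["diamonds"] for u, goods in output.items()
}
# universe_A: 0.1, universe_B: 0.5 -- so A should specialize in diamonds
# and B in widgets, leaving every utility function better off than under
# self-sufficiency.
```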

Producing what you prefer is much like defecting in the prisoner's dilemma. If everyone instead produced goods based on a combination of what everyone values and what their universe is efficient at, everyone would be better off. Fortunately, we can simulate others in order to only cooperate with the people who cooperate, and thus we can incentivize everyone to cooperate. So I think if people mostly get acausal trade right, then our universe will mostly be tiled[4] with "diamonds."


On acausal cooperation in general, see Nguyen and Aldred 2024a.[5]

Why are there gains from trade with other universes? Here I focus on how different universes have different comparative advantages;[6] I think that's most important. Additionally, the production possibilities frontier between different goods within one universe may be better than linear.[7] And disagreements about the prior, including attitudes about infinities and weights on various universes, may create opportunities for trade.

Epistemic status: seems correct. I asked some experts about my thesis; comments in footnote.[8] This idea isn't novel; for example, Nguyen and Aldred 2024b observes that different universes have different comparative advantages. I wrote this post because I hadn't heard my thesis—the cosmic endowment will basically be used for whatever our universe is good at producing—before, nor read anything on what acausal cooperation actually entails about the cosmic endowment. But after writing the post, comments suggest the thesis might be banal among experts.

In addition to commenting below, if you have minor nitpicks or confusions you can comment on this doc.

  1. ^

    At least beyond some low threshold.

  2. ^

    Everyone who does acausal cooperation.

    (Maybe everyone, across universes, will converge to the same preferences, in which case trade is moot.)

  3. ^

    You might make deals that look good from behind the veil of ignorance, even after the veil has been lifted. (I think this is related to "updatelessness.") For example:

    • Redistribute some power from preferences that tend to be held by people in high-value universes toward egalitarianism
    • Redistribute some power from preferences that tend to be held by people who have power within a universe toward egalitarianism
    • Decrease variance in your preference-satisfaction if you're relatively risk-averse, or increase it (and increase EV) if you're relatively scope-sensitive and not-risk-averse
    • Give you relatively more power in futures where the multiverse turns out to be higher-stakes (to your preferences)
      • Futures where you hold your convictions strongly
      • Futures where your preferences entail caring a lot about marginal resources, or where your preferences can absorb a large chunk of the multiverse's resources extremely effectively (if it makes sense to talk about non-normalized amount-of-caring between preference-systems)

    Such trades are a generalization of nuances like "everyone should coordinate to disincentivize coercion" and "everyone should reward people who opted to grow the pie rather than grab power."

    All that is disanalogous to the "individuals have resources and trade in a market" picture — and if we're doing "maximize an aggregate utility function" rather than trade, comparative advantage may not matter (especially if the utility function is linear-ish). Setting all that aside, I think the market analogy is helpful because some economics principles translate. For example, if Alice highly values just putting a specific QR code somewhere in your universe, that doesn't mean you can capture any surplus by creating that QR code, since other acausal-cooperators in your universe would also be interested in trading and your competition would drive the price down to the cost of creating the QR code.

    Oesterheld 2017 also discusses updatelessness in this context.

  4. ^

    But note there may be some diversity within "diamonds," either because some people value diversity or for practical reasons like different parts of the universe being suited for different goods.

  5. ^

    People distinguish acausal trade, which involves reciprocity, from ECL, which just involves correlation. I don't think the difference matters here. I think you'll still simulate lots of other universes in order to do ECL well.

  6. ^

    Nuance: one weird source of comparative advantage is that some people may have preferences that are not straightforwardly scope-sensitive — instead they want a whole small universe to be used in some particular way, or they want a small structure to be placed in many universes. Perhaps the 90% of universes that aren't huge and aren't great for producing particular goods will mostly be used for weird stuff like that — that's their comparative advantage, if 10% of universes are dramatically better for producing many goods or high-quality goods.

  7. ^

    In particular, maybe if several agents each create some compromise good, rather than naively creating whatever most efficiently gives them utility, they're all better off. Discussion of acausal cooperation usually focuses on this consideration, but I think it's less important than specialization between universes. There could be compromises because (A) many preferences themselves are compatible with a wide variety of goods — preferences like "I dislike suffering" and "I want all sentient beings to have somewhat happy and meaningful lives, but I don't need them to be extremely happy/meaningful and I don't need there to be lots of beings." Or there could be compromises because (B) there's opportunity for compromise even between apparently rigid preferences like "maximize paperclips" and "maximize staples" and "maximize happiness". I intuit that (B) is mostly false; others have intuited that it's importantly true.

    Separately, maybe there's more than one kind of input resource and different goods use different sets of inputs.

  8. ^

    Anthony DiGiovanni said (off the cuff, missing context):

    My impression was that people working on acausal trade were already aware of this thesis and agreed with it. Or at least agreed that "produce the goods you're most comparatively advantaged at" is sort of the baseline policy for acausal trade, which you might deviate from based on other considerations about the economics. I might be missing some important nuance here though.

    Emery Cooper said (off the cuff, missing context):

    I expect porous and situational values to be a significant part of the picture (not in terms of resources, but in terms of value). I also expect that some value systems will care about within-universe diversity, and that this might be relatively cheap to satisfy. I also think that compromise goods will be a big deal because e.g. lots of values pertain to groups of complex minds (so have overlapping costs) and are partially compatible. All of that said, it also seems plausible that there will be strong comparative advantages when trading with universes with different laws of physics such that universes should specialize a bunch.

    Another expert said they hadn't fully considered it.




Why did people miss the point on Mythos?

2026-04-26 20:15:06

I have seen a lot of coverage suggesting that Anthropic’s new Claude model, Mythos, is a vehicle for the company to peddle hype and doom in order to raise money. While some of this is surely motivated by people’s unwillingness to stare into the abyss of our AI future, the breadth of otherwise-reasonable people who have voiced these kinds of cynical opinions suggests that part of the blame rests on Anthropic.

In this post, I want to briefly unpack the way people misinterpreted the evidence, their valid reasons for doing so, and what Anthropic (and the AI safety community more broadly) should learn from this.

In particular, since we will inevitably see other dangerous capabilities spontaneously emerge in the future, we need protocols for how to announce them effectively. To this end, I try to make the following points:

  • We should be mindful of the public’s sympathy for AI safety prophecies and orient ourselves towards growing this sympathy rather than expending it. I call this our don't-be-annoying capital.
  • People have good reasons to be skeptical of Anthropic’s claims. Overcoming this requires a particularly high burden of evidence.
  • We should acknowledge that Anthropic has a conflict of interest when presenting doomer perspectives, and we should account for this going forward.

1. Mythos criticisms & missing the point

In one podcast from The Guardian, the reporter says that “this Mythos debacle [has led to] heads of state saying ‘this is so dangerous, it could shred our infrastructure, the end of civilization is nigh!’”. She then says that “accepting the companies’ premise that they are creating a machine god” is helping these companies sell their product.

In a different podcast, Cal Newport (a professor of computer science at Georgetown University!) concludes his analysis of Mythos by saying “it was wrong for Mythos to get the amount of dread-coverage that it got; so far we do not have evidence that it represents a significantly larger leap in detecting or exploiting vulnerabilities than we’ve seen in previous releases.” He has since gone on other podcasts to make these points.

Similarly, the YouTube channel Internet of Bugs posted a video titled “Anthropic’s $x00 Million Marketing Stunt” saying that “we’ve been seeing a lot of this particular attention-grabbing technique lately. I’m sure it’s going to only be getting worse as the AI companies get more desperate to keep up the flow of investment dollars. How are we supposed to believe this shit?”

To be very clear, the point is not only that Anthropic has a model which is good at cybersecurity tasks. The point is that scaling laws are holding and that the inevitable acceleration continues. With each new model, we unlock new and mysterious risks which we have to grapple with. In this case, it was cybersecurity, but it could realistically have been anything. This bigger story was essentially lost amid the hoopla.

2. It feels like people wanted to miss the point?

I understand the instinct to say a company is hype-mongering when it says it has a big scary thing (e.g., Sam Altman’s tweet of the Death Star before GPT-5). But I am surprised at how willing people are to seize on evidence that supports their prior that Anthropic is just engaging in corporate shenanigans.

For example, there’s this post from an LLMs-for-cybersecurity org saying that other, smaller models were able to find the bugs that Mythos found. They write “we took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis.” This was retweeted by the HuggingFace CEO and, subsequently, used as evidence by all three of the above podcasts/videos. 

Am I going crazy? Isn’t this pretty weak evidence for dismissing all the claims Anthropic is making? As others have said, it’s the equivalent of isolating the clump of haystack with a needle in it, handing that clump to a small child, and then saying "wow, they were able to find the needle too". The point is that locating that clump is the hard part!

3. There must be a lesson here.

If we ignore the whole “accelerating us towards the machine god” thing, I think that Anthropic behaved responsibly with Mythos. I also think there are lessons to be learned regarding the publicity, as evidenced by the fact that reasonable people consistently missed the point.

Anthropic is in the business of making extremely unpopular things which they claim could ruin society as we know it. They also repeatedly say it is too dangerous for the public to have access to these things (I note that caution seems generally warranted).

It is reasonable for people to think Anthropic is crying wolf, especially given that they are on track to have the highest IPO of all time and got there, in part, via prophecies.

Unfortunately, I am in the camp of people who think the prophecies are true. This means that, for me, evidence which supports Anthropic’s worldview exists in superposition: it is both hype and responsible, both doomer and appropriate.

But as we’ve seen, it is difficult to convince other people that this is all really happening. They will (with good reason) consider this self-serving and they will (with good reason) not want to confront reality. If people perceive us as crying wolf, they will grow weary of our frenetic anxieties and our cause will go the way of pandemic-preparedness.

Consequently, I think of AI safety as operating with some amount of don’t-be-annoying capital[1]. This is the amount of sympathy the general public has towards our concerns. It is amassed when AI causes things to go wrong (in the public’s consciousness, not in ours). It is expended when we make claims that people don’t want to believe. It is also expended when AI companies make claims which are plausibly self-serving. It is expended even faster when these claims are poorly substantiated.

I argue we should frame public interactions around trying to grow this resource.

4. Some options for next time

Let’s put ourselves in Anthropic’s shoes: you just made your latest digital nuke. You have to do something with it. What are your options?

  • You could sit on it quietly. This is obviously terrible, especially if it leaks.
  • You could release it publicly. This runs the risk of being catastrophic.
  • You could announce it and route it to defensive/government use. This is what they did with Mythos.
  • You could announce that you were able to produce it but then destroy it after using it to harden infrastructure. This is likely the most altruistic option, but then you wouldn’t be able to use it to gain an accelerative advantage over your competitors.
  • You could delete it and never tell anyone. This could also be a PR disaster if it is leaked.

I have the sense that you need to announce it somehow. But, if you announce it, you likely expend capital. If the thing you announce does not live up to the hype, you expend even more capital. And we’ve seen that reasonable people will look for unreasonable excuses to dismiss your claims, draining your capital further.

To this end, Anthropic should have done more due diligence with the Mythos release: their system card (while thorough) did not have the rigor of a comprehensive scientific study. The cybersecurity assessment takes up only 6 pages of the otherwise 200+ page system card (pages 46-52); it includes four experiments where they compare only to Anthropic-line models. The blog post goes into more detail about how they found zero-days, but again misses some due diligence. For example, they could have evaluated other, non-Anthropic models across the capabilities spectrum to verify that Mythos is uniquely able to do these tasks. They also could have run some controls. Extraordinary claims require extraordinary evidence.

I also believe Anthropic should be more up-front about their conflict of interest when making statements of doom. However good I believe their intentions to be, this conflict of interest is real, and people are right to perceive it. One simple way to avoid these perceptions would be to prioritize analysis from independent non-profit evaluators. Credit where it’s due: Anthropic did have the UK AISI provide support for assessing Mythos’s capabilities. Across the cynical takes I’ve seen, this analysis was treated with more respect. Of course, even this runs into cynicism, since people will start to think the non-profits are in cahoots with the companies, as evidenced by the comments on the recent NYT profile of METR.

Finally, although random dangerous capabilities will emerge, the point is not any one specific capability. I think letting the narrative get oriented towards the cybersecurity elements of Mythos did a disservice to the public: I doubt most people internalized how big the overall capabilities jump here was, nor that the next such jump will bring new harms into view.

I’m also sure there are considerations which I am not privy to, and recognize that it’s easy to criticize from my cozy corner where nothing I do moves the stock market.

Nonetheless, Anthropic’s first-mover advantage on identifying dangerous capabilities also endows them with a first-doomer responsibility.


thanks to Erin, Steven, Justin, Joseph and Li-lian for comments.

[edit log]: changed some wording and added a sentence about the blog post from the red team. Also changed the title to be more in line with what I was trying to say; old title was "Anthropic spent too much don't-be-annoying capital on Mythos" but, in retrospect, this isn't fully representative of my view.


  1. ^

    I call it 'don't-be-annoying capital' because that's the lived experience on the receiving end and I think it is good to think about this from the perspective of the audience. I'll admit that something like 'warning fatigue' might be a more representative name.


