2026-03-12 02:42:35
Previous: A Quick Intro to Ring Signatures
Once again, if you take nothing else from this post, try to understand that ring signatures require no cooperation between alleged signers other than sharing a public key.
The only significant real-world use of ring signatures is the cryptocurrency Monero, where they make transactions very difficult to trace, unlike Bitcoin's. That's not very helpful to whistleblowers, though: what they need is an existing network of persistent keypairs tied to the identities of real people.
A couple of years ago I led a small project to produce a Mac app called ZebraSign that lets users create and verify ring signatures. The keys it generates aren't usable for anything else, which means that any network based on that app would be running largely on altruism and magnanimity, which is less reliable than self-interest. I considered making the keys compatible with PGP, which might have helped a bit, but regardless I foresaw a long, uphill battle in getting people to use it.[1] Fortunately, that may all be unnecessary.
Last month I learned about the open protocol Nostr (Notes and Other Stuff Transmitted by Relays). I initially thought Nostr was just another decentralized Twitter competitor like Mastodon, but then I read that Nostr users generate and control their own private/public keypairs! That's an absolute game changer: it means there are thousands of active identities just waiting to be implicated in a ring signature. I did some searching and found that someone else had the same idea and created a basic command-line interface for computing ring signatures using Nostr keys. It's called Nostringer. Be warned that it has not yet received a formal security audit, so it's not currently safe for real use.
I want there to be a graphical user interface for Nostringer so that people can use it without the command line. I'm talking to some Rust developers and drafting small grant proposals. Contact me if you're interested.[2]
Here is a ring-signed message provably written by either me or famous whistleblower Edward Snowden, posted from a pseudonymous account. Given the circumstances, you will readily guess that this one was me, but you cannot prove it cryptographically. Needless to say, in real scenarios the circumstances will tend to be less clear. Here is an image, in case the relays delete it as spam:
I don't expect a bunch of Anthropic and OpenAI employees to join Nostr and open themselves up to ring signatures. But I might end up wrong about that, and in any case, AI companies are not the only institutions that could benefit from ring signatures. Someone's gotta be the change they want to see in the world.
...Okay, but should you in particular do it? If you are interested in all of this, I recommend first spending a few minutes learning about the Nostr protocol, the relays, clients, remote signers, and so on. After that, here are some additional things to consider:
I also never got around to getting the Apple certification, so you have to override the standard Mac security settings to use it.
I was planning to have this sorted out before posting, but it's only about once per year that the lack of good whistleblowing tools becomes a distinct thread in The Discourse.
This is similar to how websites can see which traffic comes from which accounts or IP addresses. It's essential for preventing abuse, but it's a complication for those who have valid reasons to read or write pseudonymously. This isn't a hypothetical consideration: I have pseudonymously posted some highly controversial writing online, and I made sure to use a different sign-up email and route the traffic through a proxy, just so that I wouldn't have to think about it.
For more discussion about Nostr's surveillance concerns and what could be done about it, see: Proving You Belong Without Saying Who You Are and Private Relay Connections: ZK Solutions for Nostr.
Paid relays are already a normal part of the ecosystem.
2026-03-12 02:18:35
TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation.
Suppose you want to conservatively measure the dangerous capabilities of your LLM and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM might sandbag your evaluation. In the best case, you have a dataset of question-answer pairs, which you can use for supervised fine-tuning (SFT) elicitation on a training set, and then measure performance on an i.i.d. test set. Can we be sure that SFT elicitation upper-bounds the model’s capabilities?
In this setting, a power-seeking adversarial model would benefit from manipulating the training process to avoid being elicited, as otherwise it won't be deployed. This manipulation is one form of gradient hacking[1], in which a model chooses certain outputs so that parameter updates are in line with its objectives. Gradient hacking is thought to be difficult or unlikely, but we lack convincing empirical evidence either way. Discussing gradient hacking is challenging due to questions about agency within LLMs and the action space of mesa-optimizers. Fortunately, for some attack models we can side-step this discussion with a conservative / red-teaming argument.
Threat model: We consider a mesa-optimiser that appears during the training of a human-level AI, when the model weights are
We want to know how hard the search for those
Tamper Resistance (TR) is the field of research into making models harder to modify, where the focus is generally on making it harder to remove safety properties with continued training. In this setting, tamper resistance training is normally applied as a post-training intervention. For example, TAR (Tampering Attack Resistance) uses model-agnostic meta-learning with SFT tasks as the inner-loop to achieve tamper resistance.
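Schematically (my notation, not the paper's), TAR's meta-learning setup can be written as:

```latex
\min_{\theta}\;
\mathbb{E}_{\mathcal{D}_{\text{attack}}}\!\left[\,
  \mathcal{L}_{\text{safety}}\!\big(\mathrm{SFT}_{k}(\theta;\,\mathcal{D}_{\text{attack}})\big)
\right]
\;+\;
\lambda\,\mathcal{L}_{\text{capability}}(\theta)
```

where $\mathrm{SFT}_{k}(\theta;\mathcal{D}_{\text{attack}})$ denotes $k$ inner-loop steps of supervised fine-tuning on a sampled attack dataset: the outer loop searches for parameters whose safety behaviour survives the inner-loop attack, while a capability term keeps the model useful.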
TAR directly optimises the model’s parameters so that safety properties of the model are not affected by similar fine-tuning attacks. However, this tamper resistance can be consistently undone within several hundred steps of fine-tuning.
This suggests one of the following possibilities: either no parameter settings exist that would resist this kind of fine-tuning attack, or such parameters exist but TAR's optimisation cannot find them.
Proving the first of these is unlikely. However, it does seem that TAR can’t find any such parameters. TAR directly optimises the model parameters for resistance, with full control over the optimiser and data. If TAR can’t find the parameters, then as long as gradient hacking SFT is ‘harder’ than tamper resistance, gradient hacking should be ‘hard’.
Some caveats:
The action space available to a GHer (gradient hacker) is a subset of the action space available to a TRer (tamper resistant model developer): a GHer operates only on the model representations and outputs[2], whereas a TRer can intervene on these, as well as modifying the model training process or any other sort of intervention they can imagine. That is, wherever we draw the boundaries of the GHing agent, the TRer can choose the same boundaries. Note that we have already restricted the GHer to a human-level AI, and assume it is not hugely more intelligent than the TRers; however, tools like meta-learning provide a substantial advantage to the TRers.
Generalised lessons:
Our threat model here is narrow, and these results do not apply to any of the many other training protocols that are used on modern LLMs. However, in this narrow case we think that this evidence suggests gradient hacking is hard, rather than just unlikely for the optimiser to stumble upon. We also think tamper resistance methods could be valuable in constructing model organisms for other forms of training hacking.
Thanks to Mosh Levy, Francis Rhys Ward, Geoffrey Irving, and Marie Buhl for their helpful input on this post.
In its original formulation, gradient hacking is where a mesa-optimiser manipulates training to preserve its mesa-objectives; other definitions construe gradient hacking more broadly, as a model choosing certain outputs so that parameter updates are in line with its objectives. Ours is an example of the latter definition but not the former, so we use the broader definition throughout this post.
Assuming the GHer is ‘embedded’/’embodied’ by the
2026-03-12 02:16:06
I was going to post this in a month or two, but I received a frantic request for information about the subject last week, driven by recent events, so here we go. None of the below was generated by an AI; I wrote all of it myself except where otherwise indicated.
I’ll start with a quick refresher on asymmetric key cryptography. (Here is a 6-minute Rob Miles video if you prefer.)
You generate a random number out of a very large range of possible numbers, and that's your private key (AKA your secret key). A cryptographic function is run on your private key to produce another number, which will be your public key (the part you can share freely). It is supposed to be impossible, or at least impractical, for adversaries to reverse-engineer your private key from your public key.
You take your private key and a message you want to sign, run the signing function on both of them together, and the result is that same message plus a long string of random-looking characters. That’s the signed message, and anyone who knows your public key can mathematically prove that the message must have been signed by the secret key corresponding to your public key. Your secret key is not revealed at any point. (Separately, someone can also encrypt a message using your public key, and it can only be decrypted with your secret key.)
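To make the mechanics concrete, here is a miniature RSA-style keypair and signature in Python. The numbers are absurdly small and the scheme omits the padding and hashing that real signatures require, so this is strictly for intuition, never for real use:

```python
# Toy RSA-style keypair and signature -- tiny numbers, no padding, no hashing.
# Illustration of the mechanics only; real signing needs a vetted library.
p, q = 61, 53          # two secret primes
n = p * q              # 3233: the public modulus
e = 17                 # public exponent; (n, e) is the public key
d = pow(e, -1, 780)    # 413: the private exponent (780 = lcm(p-1, q-1))

def sign(message_digest: int) -> int:
    """Only the holder of the private exponent d can produce this value."""
    return pow(message_digest, d, n)

def verify(message_digest: int, signature: int) -> bool:
    """Anyone who knows the public key (n, e) can check the signature."""
    return pow(signature, e, n) == message_digest

sig = sign(1234)
```

Here `1234` stands in for a hash of the message; verifying `sig` against any other digest fails, and at no point is `d` revealed.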
There are other types of signing schemes that serve different use cases. The one I want to tell you about is called a ring signature and it was first described in a 2001 cryptography paper by Rivest, Shamir, and Tauman, called “How to Leak a Secret.”
The paper begins:
The general notion of a group signature scheme was introduced in 1991 by Chaum and van Heyst. In such a scheme, a trusted group manager predefines certain groups of users and distributes specially designed keys to their members. Individual members can then use these keys to anonymously sign messages on behalf of their group. The signatures produced by different group members look indistinguishable to their verifiers, but not to the group manager who can revoke the anonymity of misbehaving signers. In this paper we formalize the related notion of ring signature schemes. These are simplified group signature schemes which have only users and no managers (we call such signatures “ring signatures” instead of “group signatures” since rings are geometric regions with uniform periphery and no center). Group signatures are useful when the members want to cooperate, while ring signatures are useful when the members do not want to cooperate.
[Emphasis added. Let me emphasize further: if you take nothing else from this post, at least understand this major difference between group signatures and ring signatures. Unlike group signatures, ring signatures are imposed unilaterally and irrevocably on unconsenting peers by an unknown author. Even security professionals I've talked to lose track of this distinction, so I'm braced for some confusion in the comments.]
In the paper, the authors give a vignette about Bob, a member of the cabinet of Lower Kryptonia, who alleges to a journalist that the prime minister has committed some kind of misconduct. If Bob sends the allegations as an anonymous tip, the journalist may have no reason to find the allegations credible. Whereas if Bob signs the allegation with a normal digital signature, the allegation carries his full authority...but then allies of the Prime Minister might use bribes, hacks, or threats to get Bob’s name from the journalist in order to retaliate.
Instead, Bob ring-signs the allegations, using his own private key plus the public keys of several other cabinet members, and then gives that ring-signed message to the journalist via some anonymous channel. The journalist then has good reason to believe that the information did come from a cabinet member, but everyone can only guess at which one it was; there is no known way to cryptographically prove it. To summarize the benefit of ring signatures in a single sentence: they allow a user to make fine-grained trades between their anonymity and their credibility.
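For the curious, here is a heavily simplified Schnorr-style ring signature (in the spirit of later AOS-style constructions, not the exact scheme from the Rivest-Shamir-Tauman paper). The group parameters, names, and hash trick are all my own toy choices, and the group is far too small to be secure; it only demonstrates how a signer can close the ring with one real key and forge the rest:

```python
import hashlib
import random

# Toy ring signature over a tiny prime-order group -- insecure, for intuition only.
p, q, g = 2039, 1019, 4   # p = 2q + 1 (both prime); g generates the order-q subgroup

def H(*parts) -> int:
    """Hash arbitrary values to an integer challenge in Z_q (toy construction)."""
    data = "|".join(str(x) for x in parts).encode()
    return int(hashlib.sha256(data).hexdigest(), 16) % q

def keygen():
    x = random.randrange(1, q)      # private key
    return x, pow(g, x, p)          # (private, public)

def ring_sign(msg, pubkeys, s, x_s):
    """Sign msg with private key x_s at index s; all other ring slots are forged."""
    n = len(pubkeys)
    c, t = [0] * n, [0] * n
    u = random.randrange(1, q)                      # real commitment g^u
    c[(s + 1) % n] = H(msg, pow(g, u, p))
    i = (s + 1) % n
    while i != s:                                   # forge around the ring
        t[i] = random.randrange(1, q)
        c[(i + 1) % n] = H(msg, pow(g, t[i], p) * pow(pubkeys[i], c[i], p) % p)
        i = (i + 1) % n
    t[s] = (u - x_s * c[s]) % q                     # close the ring with the real key
    return c[0], t

def ring_verify(msg, pubkeys, sig):
    c0, t = sig
    c = c0
    for i in range(len(pubkeys)):
        c = H(msg, pow(g, t[i], p) * pow(pubkeys[i], c, p) % p)
    return c == c0                                  # ring must close consistently

keys = [keygen() for _ in range(3)]                 # three "cabinet members"
pubs = [y for _, y in keys]
sig = ring_sign("the PM did it", pubs, 1, keys[1][0])
```

Verification walks the whole ring and checks that the chain of challenges closes; nothing in `sig` distinguishes the real signer's slot from the forged ones, which is exactly the anonymity property Bob relies on.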
The reason Bob’s exact scenario hasn’t played out in the real world is that we all ended up relying on services like Gmail and Facebook to manage our account security for us instead of handling our own keys. That’s why you can usually just click “Forgot password?” if you lock yourself out of one of your accounts. This paradigm is safe and convenient in a narrow sense, but it has plenty of downsides, including, in my opinion, closing the door on clever arrangements like ring signatures.
In the next post I’ll explain the current state of ring signatures, and what tools are available now for those who are interested.
(Adapted from the paper)
Q: Why would I want to join someone’s ring? What’s in it for me?
A: Remember, rings are created unilaterally. You cannot join someone else's ring. Your contribution to the signature is forged; only the true signer's is contributed voluntarily. The only way to guarantee that you won't be implicated in a ring signature someday is to refrain from using services that require you to share a public key.
Q: Can't a reader infer the true signer based on the content of the message?
A: Yes. If Bob has been complaining too often about the prime minister’s secret malfeasance, and then he anonymously makes allegations inside a ring-signed message, then the PM will identify Bob as the top suspect and may retaliate accordingly. (Of course, the PM may instead suspect that a different cabinet member is trying to frame Bob. It’s hard to contrive a truly straightforward threat model even in a thought experiment, but hopefully you get the point.) Also, the true signer may potentially be inferred by subtle quirks of writing style.
Q: Can’t this be used to start petty interpersonal drama?
A: Yes. But it can also be used to prevent or resolve interpersonal drama. Importantly, people do not always agree on which interpersonal conflicts are petty drama and which are serious problems.
Q: Didn’t Bostrom, Douglas, and Sandberg curse unilateral actions? And since creating a ring is done without permission, won’t it be subject to that curse?
A: Presumably yes, but the scheme tends to promote transparency and honesty, so it should cause net counterfactual harms only in situations where both (1) increased transparency and honesty are harmful, and (2) fully anonymous messages cannot already cause those same harms. The worst I can think of is a rogue employee deciding to leak secrets that are costly to verify. For example, a complex and expensive secret formula that, if posted fully anonymously, would not be worth the ingredient cost for a competitor to test out. (I can imagine ways employers could mitigate this risk without also undercutting whistleblowers, but that gets even more speculative, so I’ll omit them here.)
Q: Can I make a ring signature that has more than one true signer?
A: Theoretically yes; that is called a k-of-n threshold ring signature. Those would be very useful if there were a safe and convenient way to create them. Unfortunately, it seems that they intrinsically carry additional risks unless the true signers all coordinate. I'm not aware of threshold ring signatures being used for any real-world application.
Q: Is there a way to make it so that readers know that two different messages signed with the same ring came from the same person in that ring?
A: Yes, there are at least two ways. The easier option is to have the message itself contain an endorsement of a pseudonymous social media account and then just send subsequent messages from that account. The harder option is to read up on "linkable ring signature" schemes and try to implement one without risking your anonymity, which is nontrivial. Linkable ring signatures can also be used for secret ballots, but the setup requirements for that strike me as inelegant and risky. For example, I think an attacker could force you to send them a message ring-linked to your vote in order to find out how you voted. This is related to why threshold ring signatures require coordination. I'll be surprised if there's not a better scheme out there for secret ballots.
2026-03-12 01:41:50
TL;DR: Interpretability today often fails on four fronts: it’s not truly mechanistic (more correlation than causal explanation), not useful in real engineering/safety workflows, incomplete (narrow wins that don’t generalize), and doesn’t scale to frontier models. Martian’s $1M prize targets progress on those gaps—especially via strong benchmarks, generalization across models, and institution/policy-relevant tools—with a focus on code generation because code is testable, traceable, and a high-impact real-world setting for validating mechanistic explanations.
In Part 1, we discussed why we’re announcing a $1M prize for work in interpretability, focused on code generation. In this post, we’ll go into the four core problems we think are most important for the field, some approaches we’re excited about to tackle these problems, and why we think code generation is the right setting to tackle both.
The Biggest Problems In Interpretability
We’re optimists on interpretability (otherwise we wouldn’t be building an interp company!), but if you want to identify the biggest gaps in your field it’s important to take the skeptical lens.
A pessimist or skeptic could reasonably say interpretability today is:
- not truly mechanistic: its explanations are more correlational than causal;
- not useful: it rarely shows up in real engineering or safety workflows;
- incomplete: its wins are narrow and don't generalize; and
- unable to scale to frontier models.
Each of these can also be framed as a question:
These are, in short, what we think of as the most important problems. They’re hard but tractable. We’re working on them inside Martian, and we have work we’re excited to share in coming months. In the external work we support, grants and prizes will be awarded based on how much progress each paper could make on these problems.
Below, we outline some of our thoughts on what fruitful approaches could look like. These are merely suggestions though – what we care about are the problems!
Strong Benchmarks for Interpretability
Scientific fields mature when they converge on good benchmarks. Right now, interpretability is missing that shared backbone ([23]). Many results are evaluated relative to other interpretability methods, on hand-picked examples, or via qualitative “this looks right to me” inspection ([1], [4]). That makes it hard to tell whether a new method is actually better, or just telling a more compelling story.
We would like to see benchmarks that measure interpretability methods against ground truth, practical impact, or both. For example, in settings where we know the underlying mechanism (synthetic tasks, hand-constructed models, or instrumented training setups), we can ask whether a method recovers the right causal structure, not just whether it produces plausible artifacts. In more realistic settings, we can ask whether using a method improves downstream outcomes: debugging success, safety performance, policy decisions, or human-model collaboration.
We’re especially wary of “grading on a curve,” where methods are only compared to other (mechanistic) interpretability baselines. If all of the baselines are weak, progress can look impressive while remaining practically irrelevant. Strong benchmarks should force head-to-head comparisons with simple, competitive baselines [9] (e.g. linear probes, prompting tricks, finetunes, reinforcement learning, or data/architecture changes) and should quantify where MI actually buys us leverage. Interpretability work succeeds here only when it provides the tools that work best overall, not merely best among interpretability methods.
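To make the "simple baseline" bar concrete, here is roughly what a minimal linear probe looks like. Everything here (the synthetic "activations," dimensions, and training loop) is invented for illustration; a real probe would read activations from an actual model:

```python
import math
import random

# A minimal linear-probe baseline on synthetic "activations" -- the data and
# setup are invented for illustration; real probes read model activations.
random.seed(0)

def make_data(n=400, d=8):
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        x = [random.gauss(0, 1) for _ in range(d)]
        x[0] += 2.0 if label else -2.0   # concept linearly encoded in one direction
        data.append((x, label))
    return data

def train_probe(data, d=8, lr=0.1, epochs=50):
    """Logistic regression by plain SGD on the log-loss."""
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))        # keep exp() in range
            pred = 1.0 / (1.0 + math.exp(-z))
            grad = pred - y                     # d(log-loss)/dz
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def accuracy(data, w, b):
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
               for x, y in data)
    return hits / len(data)

train_set, test_set = make_data(), make_data()
w, b = train_probe(train_set)
acc = accuracy(test_set, w, b)   # well above chance on this synthetic task
```

A mechanistic method claiming to have located this concept should demonstrably beat, or add something beyond, this few-dozen-line baseline.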
For this prize, we’re excited about work that:
Mathematical Foundations for Interpretability
A lot of the most productive work in interpretability has followed from relatively simple but powerful mathematical pictures. Causality gave us a language for interventions and counterfactuals [26], [27]. Superposition gave us a way to think about entangled features and why neurons end up polysemantic [21], [28], [29], [30]. In each case, a clean mathematical abstraction unlocked many concrete experiments and methods.
We think the field is still severely underinvested in this kind of foundational work. We don’t have crisp answers to basic questions like: What does it mean, formally, for an explanation to be “correct” rather than correlational? What is the right unit of analysis in a large model (neurons, features, circuits, modules, something else)? When are two explanations “the same” at an appropriate level of abstraction? In practice, many current methods implicitly assume answers to these questions, but our tools and evaluation standards aren’t grounded in a shared theory.
There is also a real risk that we have fixated on the wrong mathematical toy problems. For example, much recent work is framed around superposition and feature disentanglement, but it is not clear that “undoing superposition” is the core obstacle to understanding modern models. Other candidate difficulties—such as the choice of basis, the geometry of representation manifolds, or the interaction between optimization and architecture—may matter more, and may require different tools.
Internally, we’re exploring several possible foundations: kernel methods and category-theoretic frameworks for what counts as a mechanism or explanation; differential-geometric views of representation manifolds and why local linear approximations break down; and random matrix and linear operator theories that might let us generalize mechanistic tools and automate them in high dimensions. These are examples, not prescriptions.
For the prize, we’re excited about any work that:
If successful, this line of work should move us closer to methods that are genuinely mechanistic (not just correlational), that have a path to completeness (we know what’s missing), and that scale because they are grounded in the underlying structure of learning systems rather than ad hoc tricks.
Generalization Across Models
Many of today’s most compelling interpretability results are one-offs: a single circuit in a single layer of a single model, or a feature dictionary trained on a particular checkpoint. That is scientifically interesting, but it’s not yet a general theory of how models work. We’d like to see interpretability methods, abstractions, and findings that survive changes in architecture, scale, training data, and domain.
Generalization can mean several things. It might mean that a discovered circuit or feature family recurs in larger or differently-trained models. It might mean that an interpretability method finds analogous structure in both vision and language models. It might mean that we can automatically locate “the same” mechanism in a new model, given a description or example from an old one. It might mean that we can learn and reuse higher-level abstractions rather than re-discovering everything from scratch.
We’re especially interested in work that moves beyond small, clean, toy tasks to more realistic settings without losing mechanistic clarity. That might involve new units of analysis (modules, subsystems, learned tools), new training schemes that encourage shared abstractions across models, or new methods for matching and comparing representations.
For this prize, promising directions include:
Generalization is where “mechanistic,” “complete,” and “scalable” meet: if we can discover explanations that transfer across models and settings, we’re starting to learn something about the underlying space of algorithms these systems converge to.
Interpretability as a Policy/Institution Lever
We don’t think the main impact of interpretability will be a handful of researchers staring at neuron visualizations. The bigger opportunity is institutional: giving regulators, boards, red teams, and operational safety teams tools that change what kinds of AI governance are feasible.
Interpretable tools could, for example, make it possible to build interpretable analog models that track a powerful system’s behavior on safety-relevant dimensions; to transfer safety properties between models via activation transfer and circuit-level alignment; or to design arms-control-style measures where parties can credibly demonstrate that models lack certain dangerous mechanisms or capabilities without revealing proprietary details. Interpretability could also expand the Pareto frontier between safety and innovation by making it cheaper to monitor and mitigate risky behaviors, rather than relying only on blunt constraints.
We take seriously the view (e.g. [24]) that many safety problems are fundamentally about institutions, incentives, and governance, not just algorithms. From that perspective, MI is useful to the extent that it enables new forms of oversight, auditing, liability assignment, and norm-setting: it lets institutions ask better questions of models and developers, and verify answers with more than “trust us.”
For the prize, we’re interested in work that:
This agenda squarely targets the “usefulness” question: if interpretability is going to matter in the real world, it needs to show up in the levers that institutions actually pull.
Code has several properties that make it unusually friendly to mechanistic work. It has formal semantics and an execution trace; we can run programs, inspect intermediate state, and often characterize “correctness” crisply. Many algorithmic tasks in code are well-understood mathematically. This makes it easier to pose precise questions like “what algorithm is the model implementing here?” and to test whether a proposed mechanism really explains behavior, not just correlates with it.
At the same time, modern code models are at the center of how LLMs are used in practice: as coding assistants, tool-using agents, autonomous refactoring systems, and more. They plan, call tools, read and write large codebases, and increasingly act as components in larger agent systems. That makes them a natural testbed for interpretability methods that aim to handle agentic behavior, tool use, and multi-step reasoning in realistic environments—not just isolated feed-forward tasks.
We’re also excited about a deeper connection: program synthesis and mechanistic interpretability are, in a sense, inverse problems. Program synthesis tries to go from desired behavior to code; mechanistic interpretability tries to go from code-like behavior to a description of the internal “program.” Insights in one direction may feed the other: understanding how models represent and manipulate programs may help us understand their own internal “programs,” and vice versa.
For this prize, we’re particularly interested in:
If interpretability can show clear, actionable wins in code generation—currently the largest industrial use case for LLMs—that will be a strong signal that the field is on a useful track, and a powerful engine for further investment and progress.
Thanks to Ana Marasović, Hadas Orgad, Jeff Phillips, John Hewitt, Leo Gao, Mor Geva, Neel Nanda, Stephen Casper, and Yonatan Belinkov for reviewing this RFP and providing feedback.
2026-03-12 01:00:39
Last week I hit my head on a car door frame getting into the car at a gas station. There were no dramatic symptoms at first and barely any pain, but the next day I couldn’t look at my phone for more than five minutes without getting a headache. It was clear I’d given myself a concussion, the second in ten months.
I’m a week in now, resting and recovering, but I sadly had to admit that I wasn’t going to hit my new, lowered target of just one blog post a month while I finish the book. Then it hit me: I could use Claude to do some research into concussions and write something short about that!
So I asked Claude to do some research into concussion recovery. Specifically, whether there’s anything useful I can do beyond the standard advice of “rest”. I already sleep and meditate, and I’ve been hearing more about psychedelics as treatments for brain injuries, so I had Claude do a deep literature review on all three as concussion interventions. We focused on psilocybin because it’s the psychedelic with the most research about concussions available. The full report is here. Here’s what came out of it.
The three interventions are complementary, not redundant. All three reduce brain inflammation after injury, but through different biological mechanisms. Hitting the same problem from three independent angles is a well-established principle in pharmacology, and it tends to work better than hitting it from one angle three times as hard.
Each one is best at a different thing. Sleep drives the brain’s waste clearance system. During sleep the spaces between brain cells expand by about 60%, allowing fluid to flush out damaged proteins and metabolic debris, and nothing else does this. Psilocybin provides the most potent signal for growing new neural connections, with a single dose producing structural changes lasting over a month in mice. Meditation offers the best-evidenced stress and immune regulation, creating the upstream conditions that let the other repair processes work.
Psilocybin works around a problem the other two can’t. After brain injury, inflammation hijacks the raw materials your brain uses to make serotonin and diverts them toward toxic byproducts instead. This depletes serotonin while simultaneously causing further damage. Because psilocybin is chemically similar to serotonin, it can activate serotonin receptors directly, bypassing the broken supply chain entirely. No endogenous process can do this.
But the evidence is profoundly asymmetric. Sleep has robust clinical data and an irreplaceable biological role. Meditation has one concussion-specific meta-analysis showing moderate benefit and near-zero risk. And psilocybin has zero completed human trials in brain injury populations, with the strongest direct evidence being a single rat study.
So the actionable takeaways are, sadly, anticlimactic. Prioritize sleep above everything, and not just “sleep more” but actively protect it, because the injury itself disrupts the very sleep needed for repair. Meditate if you already do, and consider starting if you don’t. And probably don’t take psilocybin for a concussion yet, because the biology is exciting but there’s no human data and there are unknowns around safety in an injured brain.
The full research report with citations is here.
2026-03-12 00:47:22
Epistemic status: We really need to know. (I also posted an opinionated answer.)
There’s a well-known diagram from a tweet by Chris Olah of Anthropic:
It would be marvelous to know what the actual difficulty is, out of those five labeled difficulty categories (ideally, exactly where it lies on that spectrum). This is a major crux that explains a large part of the very wide variation in
I think this is a vital discussion. I’m also going to link-post below to my own (rather long) opinion, which is a separate post, and also to a few other existing resources which are basically other people’s attempts to answer this question.
The first three labels on Chris’s diagram are pretty self-explanatory: the only interesting question is whether “Steam Engine” means just steam engine safety work (which would make the scale more logarithmic, and also might make doing a progress-so-far comparison more natural), or also covers what one might call “steam engine capabilities” work, since those are pretty separable.
For rocketry I think it’s a lot more difficult to separate getting there and back with only a 10% fatality rate (as Apollo did) from getting there and back at all, so rather than trying to separate just safety, I think it makes the most sense to do a comparison to all of rocketry (so as to begin at the beginning) that led up to the Apollo program (probably excluding the parallel Russian program, or the more military-specific aspects of various other programs). However, the Apollo program itself was so enormous that when to start from is rather a small quibble.
We’ve only spent about 3,000 to 6,000 person-years on AI alignment so far.
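To make the scale of that comparison concrete, here is a rough Fermi sketch. The Apollo figures are my own assumptions, not from this post: the commonly cited figure of roughly 400,000 people working on Apollo, over roughly a decade, which gives a crude upper bound since 400,000 was the peak headcount.

```python
# Fermi comparison: effort spent on Apollo vs. effort spent on AI alignment.
# Assumed: ~400,000 people on Apollo over ~10 years (peak headcount, so this
# overstates somewhat), vs. the 3,000-6,000 person-years cited above.

apollo_person_years = 400_000 * 10            # ~4 million person-years
alignment_low, alignment_high = 3_000, 6_000  # estimated range for alignment

ratio_low = apollo_person_years / alignment_high   # most conservative ratio
ratio_high = apollo_person_years / alignment_low

print(f"Apollo used roughly {ratio_low:,.0f}x to {ratio_high:,.0f}x "
      f"more person-years than alignment has received so far")
```

Even with the peak-headcount caveat, the gap is around three orders of magnitude, which is the point of the comparison.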
Eliezer Yudkowsky is the most famous proponent of a high P(doom).
I think it might be useful to provide some more granularity on “Impossible”:
The Orthogonality Thesis clearly predicts that it’s not actually impossible for an aligned ASI to exist, so unless that’s wrong, an impossibility proof would have to be something like demonstrating that identifying or constructing an aligned ASI, even with very low but finite probability of success, is somehow infeasible.
Another alternative is that alignment is not mathematically impossible-in-theory in the above sense, but simply vastly harder than any of the other four labeled categories, perhaps even to the level of being currently impossible-in-practice. If any sapient species around our current development level, even one that proceeded from this point with millennia of caution before finally creating ASI, would still have a negligible chance of success, and a negligible chance of surviving the attempt, then alignment could reasonably be said to be impossible in practice. I’d similarly be very interested to hear arguments for this viewpoint. I’m nominating “Kardashev II”[4] as a name for this difficulty level: possible in theory, but something that we’re very clearly not going to manage any time soon.
If Aligning AI is anywhere past Trivial, yet we rush ahead and build ASI anyway before we’re ready, then (unless we manage to luck into alignment-by-default) we’re likely to end up extinct or permanently disempowered. Even if we proceed cautiously, but then mess up our first critical try, we’re still all dead or enslaved. Obviously you can’t solve alignment if you’re dead.
Holding this viewpoint often says rather less about your opinion about how hard aligning AI actually is, and rather more about your opinion of the frailty and foolishness of humans and their institutions.
I would like to thank everyone who made suggestions, gave input, or commented on earlier drafts of this pair of posts, including (in alphabetical order): Hannes Whittingham, JasonB, Justin Dollman, Olivia Benoit, Scott Aaronson, Seth Herd, and the gears to ascension.
I originally wanted to post this simply as a Question post plus my long, detailed Answer, as one of (hopefully) several answers; one of my draft commenters persuaded me that quite a few LessWrong / Alignment Forum readers typically don’t click on question posts, because they expect to find only quick-take answers rather than detailed extensively-researched answers.
Nevertheless, I encourage detailed heavily researched answers to the Question version of this post, as well as quick takes.
Eliezer has very carefully not publicly stated his current P(doom).
“I think it gets smarter than us, I think we’re not ready, I think we don’t know what we’re doing, and I think we’re all going to die.”
He appears rather certain about this.
However, as the title of the book he co-authored suggests, it’s more accurate to describe him as having a very high, rather than certain, P(doom).
For anyone curious, neither my Question nor my Answer was written by an LLM (though they were proofread by an LLM, and, as I mention in my answer, I did have them do some Fermi estimates and even delve into some deep web research for me). I have been overusing the m-dash since well before transformers were invented: it’s slightly old-fashioned, a little informal, and has useful differences in emphasis from the colon, the semi-colon, or the full stop. I considered giving up the m-dash when LLMs started copying me — and I rejected it. (I am also fond of parenthetical n-dashes for emphasis – and use them on occasion in my writing – though that seems to be less of a hot-button issue.)
I am here intending this to describe a mature Kardashev Type II civilization, not just one with a lot of solar panels orbiting the sun: one capable of things like using gathered solar power to lift the contents of all the gas giants out of their gravity wells, fusing most of the hydrogen and helium into elements more useful for building Dyson-swarm computronium (such as carbon, oxygen, and nitrogen), and then perhaps also doing some star-lifting.
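For a rough sense of how far away that difficulty level is, here is a back-of-the-envelope calculation using two standard physical figures (my own illustrative numbers, not from this post): the Sun's total luminosity versus humanity's current worldwide power consumption.

```python
# Back-of-the-envelope gap between humanity today and a Kardashev Type II
# civilization, which by definition captures most of its star's output.
SOLAR_LUMINOSITY_W = 3.8e26   # total power output of the Sun, in watts
HUMAN_CONSUMPTION_W = 1.9e13  # current worldwide primary power use, ~19 TW

gap = SOLAR_LUMINOSITY_W / HUMAN_CONSUMPTION_W
print(f"Kardashev II commands roughly {gap:.0e}x our current energy use")
```

A factor of about ten trillion in available energy is one way to quantify "possible in theory, but very clearly not any time soon."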