LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

Ending Parking Space Saving

2026-02-10 10:30:14

Published on February 10, 2026 2:30 AM GMT

Shoveling out a parking spot is a pain, and many snowy places have a culture where, when you leave, you put out some kind of "space saver" (aka "parking chair", "dibs", etc.) to reserve the space for when you come back. While you don't have a legal right to the space, there's common knowledge that if you park in someone's spot you're running a significant risk of them vandalizing your car.

Aside from depending on the implicit threat of property damage, it's a bad system: it only works if everyone uses cars with the same commuting pattern. My sister works in early childhood intervention, which is one of several professions where you're traveling to people's houses during the day. In neighborhoods where space savers are common, there can be many open spots, but each one is saved. Moving the space saver, even just for the duration of the visit, risks a slashed tire.

courtesy of kgnixer

Very reasonably, they're prohibited in many cities, including Cambridge and Somerville, but that isn't enough to disrupt the culture in areas where it's widespread.

It's also hard to opt out as an individual. If everyone else is saving their spaces and you don't, then when you get back your spot will very likely be taken.

What would disrupt the culture, though, would be the mayor making an announcement. Now, mayors already make announcements about this, but they're things like "please don't use space savers." That doesn't work: it won't sway the intended audience. Instead, the announcement should target the people who don't like the space saver system, but today they leave space savers alone because they don't want some unknowing stranger to face unjustified retaliation: "if you see a space saver, please take it out of the spot".

This gives social cover to start moving things. You're not a troublemaker, you're an upstanding citizen. The mayor doesn't need many people to take action: just a few enthusiastic people can thoroughly break the link from "there's a car parked in the spot I saved" to "they must have moved my space saver" and hence to "and so I should vandalize their car." The announcement makes it common knowledge that this is no longer a reliable link.

The mayor, as a centralized elected representative, is a good way to do this, though it would also be possible for a group of people, or even one really dedicated person, to do it. What you need to do is remove a lot of space savers, and make sure it's well known that the people removing space savers aren't the people parking in the previously-saved spots. Possibly an active group that puts discarded space savers nearby with a note taped on saying "removed by the Space Unsavers" could do it, though you need to keep the object close enough that the person who left it finds it and sees the note, while also moving it far enough that the parker likely didn't realize they were parking in a formerly saved spot.

I did a little looking to see if anyone had tried this, and the closest I came was cities coming around to collect them, often as part of trash pickup or some number of hours after lifting a snow emergency. This doesn't work very well: it's much easier to work around a single predictable pass than individuals removing objects throughout the day, and it also legitimizes using space savers up until the announced pickup.

Does anyone know about anyone trying this? Did it work? If I lived in a part of town with an illegal and unpopular—yet persistent—space saver culture, I'd be tempted to try the organize-neighbors version. Or perhaps start by suggesting it to the mayor.



Discuss

Should we consider Meta to be a criminal enterprise?

2026-02-10 10:10:14

Published on February 10, 2026 2:10 AM GMT

Meta's own estimates put the money it earns from fraudulent ads at $16 billion in 2024. In contrast, the total value of property stolen in burglaries in the US is around $3 billion per year. (Unfortunately, 2019 is the last year with good numbers, but it should still be around that amount.)

If Meta takes around half of the money that the fraudsters make as their cut, that would suggest Meta helped steal around $30 billion via fraud in a year ($16 billion / 0.5 ≈ $32 billion), or roughly 10x the amount stolen per year via burglaries.

To be fair, Meta facilitates stealing from people all over the world via fraud, not just from the US, but how high would the toll of Meta-facilitated fraud need to be for us to consider Meta an organized criminal enterprise?

Is causing 10x the amount stolen per year via burglaries not enough? Would it need to be 20x for people to see Meta as a criminal enterprise?



Discuss

Introspective Interpretability: a Definition, Motivation, and Open Problems

2026-02-10 09:22:52

Published on February 9, 2026 11:53 PM GMT

(Originally posted on my blog: https://belindal.github.io/introspection/)

1. Introduction

In 2022, ChatGPT turned language models (LMs) from a tool used almost exclusively by AI researchers into the fastest-growing consumer software application in history, spawning a $40 billion generative AI market and a boom that continues to reshape markets today. While the main technologies behind the tool had been invented years prior, it was the single, general-purpose chat interface that ultimately brought the technology into the mainstream.

I think there’s a lesson here for interpretability research. As AI systems have grown dramatically more capable, pressure has mounted for interp research to “prove useful”. [1] A frequent criticism is that interp research, which often relies on bespoke tools and expert analyses, has failed to keep pace with models that are rapidly scaling, improving, and becoming more general. This is despite the fact that our tools have grown more sophisticated over time: early versions of interp work only looked at the geometry of LM representations whereas nowadays we can extract the specific features encoded in LMs, which exact layer and module these features reside in, and how those features interact. However, these innovations are invisible to the ordinary user—though our interp tools have grown increasingly powerful, we can still only ask a pretty limited set of questions, on models far smaller than what is deployed in production, and using specialized tools unavailable to most users. While this mode of interpretability has scientific value, I think it’s also important to apply the lessons of ChatGPT’s success back to interp: let’s make explanations scalable, general, and natively accessible as part of the original model. What if, instead of building ever-more-sophisticated external tools, we asked whether the model itself could serve as the interface to its own internals?

Concretely, imagine we’re deploying a LM in a medical diagnosis scenario and it predicts that a patient is experiencing acid reflux. We would like to understand the causal factors underlying its prediction. Rather than applying bespoke interp tools, could the user just ask follow-up questions back to the LM? While current LMs technically already support follow-up questions, and can surface their “thought processes” in chains-of-thought, many studies have found that such explanations are not guaranteed to reflect their actual decision process.

Instead, we could enforce that models’ verbalizations must faithfully reflect their internals and behaviors. Broadly, I’ll call this class of methods introspective interpretability. Recently, a line of work has emerged that trains models to faithfully verbalize aspects of themselves. This includes our paper that trains models to explain their internal computations (e.g. results of interpretability experiments), concurrent work on training models to explain weight differences after training, earlier work on training models to explain (the features underlying) their behaviors, and recent papers training models to “confess” their shortcomings or hidden objectives.[2] These works show early promise for the introspection agenda.

As I’ll argue in this post, introspective approaches are naturally 1) scalable, 2) generalizable, and 3) accessible — three major pillars underlying ChatGPT’s success. Beyond these benefits, however, introspection also opens paths toward more 4) predictable and 5) controllable models: predictability arises from training models to be faithful to concise explanations which acts as a regularizer, and controllability arises from leveraging the same underlying model for interpretation, allowing users to directly issue steering commands.

While introspection is a pretty early research agenda and more work is needed to validate it, I’m optimistic about its potential to become practical with scale. I believe it offers important benefits for interpretability that aren’t covered by existing classes of methods — including recent related lines of work with extrospectively-trained explainers (i.e., where a separate model or analysis system is trained to interpret the target model from the outside).[3] This post is intended as a way to formalize introspective methods (2. Defining Introspection), motivate them by highlighting where they are comparatively useful (3. Motivation), and elucidate some open problems that must be solved to make this agenda a reality (4. Open Problems).


2. Defining Introspection

Across cognitive science, psychology, philosophy, and artificial intelligence, “introspection” has been defined in different ways and operationalized with a diverse range of measurements. Rather than attempting a general or exhaustive definition, I’ll focus on a notion of introspection that is most relevant for interpretability applications: a model is introspective if it demonstrates privileged, causal self-consistency between its self-explanations and its internal mechanisms/external behaviors.[4][5] I’ll examine the three parts of this definition separately, and justify why each matters:

2.1: Self-Consistency

Self-consistency requires that a model’s verbalized self-descriptions accurately characterize its own external behaviors or internal mechanisms. Behavioral descriptions concern observable outputs of the model, such as the exact tokens it generates in a given context, patterns shared across many generations, or systematic differences between outputs under controlled input changes. Mechanistic descriptions concern internal properties of the model, such as the model’s features, circuits, or parameters.

Returning to the medical diagnosis example in the introduction, if we find that the model tends to change its answer according to user suggestions, even when the clinical evidence points elsewhere (e.g. it behaves sycophantically), then a self-consistent model should report something like: “My diagnosis is being influenced by the user’s suggested answer.”

Importance: Enforcing that model self-descriptions agree with its behaviors and mechanisms enables models to produce faithful self-explanations. Conversely, enforcing that models’ behaviors and mechanisms are consistent with their self-descriptions acts as a regularizer on their behaviors/mechanisms, allowing them to be more natively interpretable and well-behaved.

2.2: Causality

We would like self-consistency to be causal: changing the model’s verbalization (e.g. via internal interventions, prompting, or training) should also induce changes to its behavior, and vice versa.

For example, suppose we train our introspective medical diagnosis model to behave less sycophantically; then its verbalizations should reflect its new behavior. Conversely, we can also imagine simply training the LM to state: “I base my diagnoses only on clinically relevant symptoms and ignore user suggestions.” Then we would like its behavior to also become less sycophantic as a result.

Importance: Causality allows us to capture unintended effects of LM training, including emergent misalignment, reward hacking, and learning of spurious cues. Furthermore, causality in the converse direction gives us a way to more controllably and interpretably train models.

While general versions of causality across all intervention types may be unenforceable, it is nonetheless a desirable property for the aforementioned reasons.[6]

2.3: Privileged Access

A model M1 possesses privileged access to itself if M1 demonstrates greater causal self-consistency to M1 than another model M2, given an equivalent amount of training- and inference-time resources and access to M1’s inputs and outputs. This prevents M1 from simply simulating itself and making post-hoc observations, as M2 would be able to obtain equally accurate explanations by running the same M1 simulations.[7]
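As a concrete illustration, here is a minimal sketch of how one might test for privileged access under this definition. Everything in it (the `explain` method, the `score` function, the evaluation set of labeled explanations) is a hypothetical stand-in rather than an interface from any existing paper or codebase.

```python
# Hypothetical sketch of a privileged-access test. M1 is the target model; M2 is a
# separate explainer matched in training/inference budget and given the same access
# to M1's inputs and outputs. All names here are illustrative assumptions.

def has_privileged_access(M1, M2, eval_set, score) -> bool:
    """M1 passes the test only if its self-explanations outscore a matched
    external explainer's explanations of M1 on held-out labeled examples."""
    self_scores = [score(M1.explain(target=M1, prompt=x), label) for x, label in eval_set]
    ext_scores = [score(M2.explain(target=M1, prompt=x), label) for x, label in eval_set]
    return sum(self_scores) / len(self_scores) > sum(ext_scores) / len(ext_scores)
```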

Importance: Privileged access is attractive for several reasons. First, it gives us a practical justification for introspection: we can use privileged access to surface failure modes that would be difficult to detect externally with the same amount of resources. As we’ll elaborate more below, this gives us a natural pathway to sample-efficient and scalable interpretability. On a more philosophical note, privileged access serves as a test for whether models are relying on something truly intrinsic, or whether they’re simply relying on their strong predictive capacity to learn external correlations. Thus, it gives us a natural control baseline for true introspection, similar to what people run for probes.

3. Motivation

3.1: The Interpretability Landscape

Interpretability approaches often face a tradeoff between scalability and faithfulness. On one extreme, ambitious mechanistic interpretability aims for complete reverse-engineering of the mechanisms underlying model behavior, but requires painstaking manual analysis that is difficult to scale to frontier models or make accessible to non-expert users. On the other extreme, chain-of-thought monitoring is highly scalable and accessible, but offers no guarantee that a model’s stated reasoning faithfully reflects its internal computation. Automated approaches like interpretability agents and automated feature descriptions mediate between these two poles, improving scalability while retaining some of the rigor of mechanistic methods. But they remain bottlenecked by the efficiency of the underlying interpretability tools they automate, often simply shifting the burden of search from a human experimenter to a program or an AI agent.

Trained explainer models represent another point on the tradeoff curve – continuing to push scalability while sacrificing some faithfulness as we move from rigorous analysis tools to trained models. Within this bucket, we introduce an extrospective/introspective axis, defined by the degree of privileged access the interpreter has to the target model. At the extrospective end of this axis are methods like SAEs, trained model editors, and predictive concept decoders, which use a completely different architecture and training paradigm to explain the target model. As we move along the axis, we encounter approaches where the interpreter and target models may share architecture or even initialization, and the success of these methods starts to leverage a degree of privileged access. Finally, fully introspective methods aim to give the model direct ability to reason about itself. By the privileged access part of our definition of introspection, a model is only introspective if it beats extrospective methods at the same scale. Thus, introspective approaches by definition dominate extrospective ones along the Pareto tradeoff. The question, then, is whether privileged access can be realized across the explanation tasks we may care about in practice. I think the answer is yes, and I’ll elaborate more below.

[Figure: interpretability approaches diagram]
Interpretability approaches mapped along scalability and faithfulness axes. Elicited explanations use models to explain themselves without additional training, which is scalable but offers no faithfulness guarantees. Mechanism-level interpretation is faithful but requires painstaking manual analysis; automated variants improve scalability but remain bottlenecked by the tools they automate. Trained explainers offer another tradeoff point, with an additional extrospective-to-introspective axis: extrospective methods use external models to analyze activations, while introspective methods leverage a model's full privileged access to its own internals, pushing toward both scalability and faithfulness simultaneously.

3.2: Why Introspection?

In this section, I’ll argue that introspective methods have a comparative advantage over other interp methods in several key ways. Most of the arguments I’ll lay out will be directed against extrospective trained explainers, given that they are the most closely related class of methods with similar benefits.

3.2.1: Sample/Compute Efficiency via Privileged Access

By our definition, a model is only introspective if it demonstrates privileged access, meaning it must give more faithful explanations than extrospective methods at a given inference/training scale. Our early empirical results establish that models demonstrate privileged access on three types of explanations of their own computations, and that this advantage is particularly pronounced in low-data regimes. Other early results find models have privileged access to certain behavioral properties of themselves. I think privileged access will hold even beyond these tasks: structurally, a model has direct access to its own computations, while an external explainer only has as much as it’s given access to (e.g. only the activations of the original model in the case of LatentQA variants). Furthermore, to the extent that self-explanation can operate in the same representational basis and reuse internal mechanisms the model already employs for language understanding, an external explainer must (in those cases) learn an additional alignment and approximation of these mechanisms, making introspective approaches more sample-efficient. Thus, I expect privileged access will hold across a large number of self-explanation types, though empirically establishing its scope and engineering the correct training objectives (see Open Problems: Generalization) remains an important open question.

3.2.2: Task-Generality through Composition

One lesson we’ve learned from scaling LLMs is that multitask learning enables not just data-efficient learning of new skills through sharing representations and mechanisms (as argued above), but the untrained emergence of entirely new capabilities through composition. By training a model that can understand language, do reasoning, and interpret itself, we may enable models that can compose introspective skills—either latently or through chain-of-thought reasoning. This may allow models to handle entirely novel interpretability queries that they weren’t explicitly trained on.

3.2.3: Usability and Accessibility

Introspective LMs are a mechanism for making interpretability accessible to non-expert users. Instead of requiring users to run separate interpretability tools or auxiliary models, the original model produces explanations of its own internal behavior directly in natural language. Users can read these explanations, ask follow-up questions, correct misconceptions, or request changes to internal procedures through the same conversational interface used for task interaction. The model can provide explanations at varying levels of abstraction, ranging from low-level mechanistic descriptions (e.g., “neuron 19 in layer 5 activated and typically encodes X, which contributed to output Y”) for expert users, to high-level summaries for non-experts.

While theoretically interfaces can be built on top of extrospective interpretability approaches, moving between two different interfaces which potentially say conflicting information (e.g., if the primary chatbot says “I think X” and the interp interface says “actually the chatbot thinks Y”) can introduce friction and undermine trust in the primary model. Furthermore, it’s less clear how to use this interp feedback to control models, as I’ll describe below.

3.2.4: Predictability

A core obstacle for interpretability is that model internals and behaviors are often inconsistent across contexts. For example, neural network components are known for being polysemantic, encoding a certain concept in one context but something completely different in another context. Even techniques aimed at extracting monosemantic features often end up surfacing many polysemantic features. These inconsistencies make post-hoc interpretation brittle.

Training models to enforce bidirectional consistency between explanations and behaviors offers a path toward addressing this issue. Concretely, we require that model explanations are faithful to their behaviors (explanations ← behaviors) and that model behaviors are consistent with their explanations (explanations → behaviors). The explanations → behaviors direction acts as a regularizer, encouraging models to rely on mechanisms that are describable in human-interpretable terms and to behave predictably across contexts. By enforcing this objective, interpretability is no longer just a post-hoc analysis tool as models are trained to be explainable upfront.
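To make the two directions concrete, here is a minimal sketch of what a combined training objective might look like. The term names, the weights, and the idea of implementing it as a simple weighted sum are my assumptions, not a description of any existing training pipeline.

```python
# Sketch of a bidirectional consistency objective, assuming the three component
# losses are computed elsewhere. Term names and default weights are illustrative.
import torch

def introspection_objective(task_loss: torch.Tensor,
                            explanation_loss: torch.Tensor,   # explanations <- behaviors:
                                                              # loss of a faithful explanation
                                                              # of the model's observed behavior
                            consistency_loss: torch.Tensor,   # explanations -> behaviors:
                                                              # penalty when behavior deviates
                                                              # from the model's own explanation
                            lam_expl: float = 1.0,
                            lam_cons: float = 0.1) -> torch.Tensor:
    # The consistency term acts as a regularizer, nudging the model toward
    # mechanisms that are describable in human-interpretable terms.
    return task_loss + lam_expl * explanation_loss + lam_cons * consistency_loss
```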

3.2.5: Controllability

Closely related to the above point is controllability — while extrospective explainers can analyze a static model, they cannot directly modify the model’s behavior. In contrast, a model that produces faithful self-descriptions can be instructed through those descriptions, in order to control and improve the LM going forward. For example, a user may request, “please do not use the user’s gender features going forward” and the model can modify its internal procedure accordingly. While one can theoretically do this to the underlying model already just through prompting, enforcing bidirectional consistency will make controlling models a lot easier and more reliable.

Finally, imbuing models with native introspection capabilities could enable real-time self-monitoring. A model that can represent and reason about its own mechanisms can detect potential hallucinations, unsafe action plans, or reliance on spurious correlations, and correct them before producing an external response. This opens a path toward models that incrementally improve their own reliability through metacognitive feedback.

3.2.6: Engaging Model Developers

Interpretability as a field has historically been siloed from mainstream model development. By pitching introspection as a desirable native capability of models, we can engage people who care about building more generally capable models. Eventually, if introspection training is proven to be viable and new innovations are made to cheapen data collection (see Open Problems: Scalable Supervision), it is plausible that consistency or introspection objectives become a standard component of post-training pipelines, much like instruction tuning or RLHF today.


In summary, I believe introspection is uniquely positioned to offer efficiency, generalization, accessibility, and controllability benefits over other interpretability methods — primary concerns if we want to make interpretability practically useful. Realizing these benefits, however, requires addressing several open technical and methodological challenges.

4. Open Problems

Below, I lay out five classes of open problems that I believe are important for realizing the practical benefits of introspection (and/or that I find personally interesting!). This list is by no means exhaustive and as the field progresses, new challenges will undoubtedly emerge, but I think these represent the most pressing bottlenecks between where we are now and a future where introspective interpretability is practically useful.

4.1: Scalable Supervision and Verification

  1. Regenerating Labels: One of the primary problems to solve on the path to scaling up introspective LMs is generating supervision at scale. Training the same LM to explain itself means that the underlying labels are constantly shifting — once you train a model to explain itself, its behavior might change to the point that the original explanations are no longer faithful. This means you’d have to regenerate labels on-the-fly (if doing online training), periodically regenerate labels, or just train on the original labels and hope behavior doesn’t shift too dramatically (a minimal sketch of the on-the-fly variant follows this list). Finding a way to quickly generate reliable supervision at scale will be key to unlocking scalable training.
  2. Active Sampling: Another related problem is the need to actively sample the right training inputs. If not done cleverly, the majority of the explanations can be derived through trivial heuristics without requiring true introspection: For example, it’s obvious from the sentence “Paris is the capital of…” that the tokens Paris and capital will be far more important than the token “the”. Training to recover the salient tokens in this example does not necessarily require the model to introspect. Thus, sampling the right training (and evaluation!) inputs ends up being a nontrivial problem – and potentially something that will need to be done online.
  3. To what extent is introspection emergent through scaling existing objectives?: OpenAI’s chain-of-thought monitorability paper suggests that larger models with longer chains-of-thought are more “monitorable” about properties of the model’s behavior. Anthropic’s introspection blog post suggests LMs can identify inconsistencies in their latent states without any explicit training to do so. These early results suggest that potentially introspection is “emergent” from just our current training pipelines. To what extent is this true? Are some types of introspection easier to surface from ordinary post-training than others? My intuition is that some forms of introspection (e.g. anything that requires localization, such as circuit extraction) are orthogonal to ordinary post-training, and are thus unlikely to emerge, but I’m happy to be proven wrong here.
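Below is a minimal sketch of the on-the-fly label regeneration mentioned in item 1. The helper functions (`sample_inputs`, `run_interp_experiment`, `finetune`) are hypothetical placeholders for whatever data sampler, labeling procedure, and training step one actually uses.

```python
# Sketch of online introspection training with labels regenerated from the current
# model, so explanation targets do not go stale as the model's behavior drifts.
# All helpers here are assumed, not taken from any existing codebase.

def online_introspection_training(model, n_rounds, batch_size,
                                  sample_inputs, run_interp_experiment, finetune):
    for _ in range(n_rounds):
        inputs = sample_inputs(batch_size)
        # Recompute explanation labels against the *current* weights rather than
        # reusing labels generated from the original model.
        labels = [run_interp_experiment(model, x) for x in inputs]
        model = finetune(model, list(zip(inputs, labels)))
    return model
```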

4.2: Generalization

  1. Explanation Accuracy vs. Task Accuracy: One of the big questions underlying the success of the introspection training agenda is whether explanation capability will interfere with (generic) language intelligence. I think this will largely end up being an empirical question, although one could explore techniques to decrease interference, including standard continual fine-tuning techniques like continuing to train on language tasks and applying regularization.
  2. Compositionality & generalization across tasks: Critical to the success of both extrospective- and introspective- explainers is demonstrating cross-task generalization. This may entail training on one type of question (e.g. feature descriptions) and generalizing to another (e.g. activation patching). Cross-task generalization would enable trained explainers to potentially perform well even in cases where collecting supervision from mechinterp techniques is prohibitively expensive or impossible.
    One possible mechanism for generalization is composing introspective skills (see Motivation: Compositional Generalization): by piecing together novel combinations of operations, the model could learn new introspective skills. This could be further amplified by reasoning: given space to reason, models could apply different introspection procedures sequentially for a single problem, allowing generalization to a far harder class of introspection problems.
  3. “Pre-training” objectives for Task-General Introspection: There may be a scalable, cheap, “self-supervised” introspective objective that is analogous to next-token-prediction pre-training. Once we train models on that objective, we may be able to unlock task-general introspection with only a small amount of task-specific fine-tuning. This will likely look quite different from the techniques used to train extrospective explainers: e.g. predicting your own next token is difficult for an extrospective explainer but quite straightforward for an introspective one. Empirical work will be needed to investigate how successful each pre-training objective will be.

4.3: Human-Centric Concerns: What makes explanations useful?

One of the key arguments I’ve posed for introspection is its accessibility to non-expert users. How do we demonstrate that practical utility? What exactly makes an explanation useful? Likely, explanations on the level of “here’s the exact neuron I used” are not useful for the vast majority of users or tasks. Meanwhile, more abstract explanations may trade off on some faithfulness. There’s likely an ideal tradeoff of abstraction, generality, faithfulness, and completeness that make explanations useful for downstream tasks. We may also need to invent entirely new, shared vocabulary to teach machine concepts to humans.

We could run experiments in high-stakes domains such as medicine or law, and show that introspective explanations aid downstream decision making. We could also attempt to have AI that is “superhuman” in narrow domains teach us new knowledge. I anticipate improved introspection will make extracting useful explanations easier, as LMs are explicitly encouraged to build predictable internal mechanisms and behaviors.

4.4: Understanding Introspection

Finally, I’m interested in understanding what actually goes on inside the model when it introspects. One hypothesis is that models may be developing a coherent internal self-representation that they use to generate descriptions, which they rely on to both produce a self-description and exhibit the described behaviors and mechanisms. Another hypothesis is that there are universal “translation” functions that exist allowing models to inspect their own internals and translate the operations into natural language. Some early work looking into the mechanisms underlying introspection can be found here and here.

5. Conclusion

At the beginning of this blog post, I opened with the possibility that introspection could deliver a “ChatGPT moment” for interpretability. While this is an ambitious claim that remains to be proven, I think it’s safe to say that introspection is at least a step in the right direction: it moves interpretability toward being more scalable, general, and accessible. Furthermore, we now have some early empirical evidence indicating that training for introspection is at least possible in narrow tasks.

Of course, introspection isn’t intended to completely replace other interpretability agendas: we may still want the ability to completely reverse-engineer models or have fully complete, faithful accounts of their mechanisms, and it’s unlikely introspection will ever bring us to that point. Furthermore, these mechanism-level tools remain important as supervision and evaluation for introspective techniques.

That said, I hope this post motivates why introspection is an important emerging agenda that I think more people in the field should work on! If we succeed, interpretability stops simply being a specialized activity reserved for experts with custom tooling, and becomes a scalable, native, general capability of models themselves. Models that can faithfully describe their own mechanisms and limitations are easier to trust and control. They could surface their own problems before users notice them, teach users new knowledge, and collaborate with humans rather than operating as black boxes. I’m optimistic that a genuinely useful version of this capability is within reach with the right objectives and scale.

Acknowledgements

Major thanks to (alphabetically by first name) Avichal Goel, Carl Guo, David Atkinson, Dillon Plunkett, Itamar Pres, Jack Lindsey, Jacob Andreas, and Samuel Marks for extensive feedback on drafts of this post. This post also benefited from conversations with Jacob Steinhardt, Neil Chowdhury, Dami Choi, Vincent Huang, Laura Ruis, members of David Bau’s group, and members of Martin Wattenberg and Fernanda Viegas’ group, among others.


 

  1. ^

    This pressure has recently sparked division even within the interp community itself.

  2. ^

    I’m sure there are more I’ve missed — please let me know!

  3. ^

    Thanks to Sam Marks for suggesting this term.

  4. ^

    This definition captures accuracy (self-consistency) and grounding (causal) in Jack Lindsey’s definition here. As for internality, privileged access serves as one measurable test for it and carries a few other practical benefits, which I’ll describe below. Finally, metacognition, while potentially a prerequisite for a broader notion of introspection, does not immediately yield obvious practical benefits for interpretability.

  5. ^

    Models possessing this sense of “introspection” does not necessarily imply that they will possess other senses of it: for example, I believe strong versions of model consciousness or self-awareness are unlikely to emerge in our current models, even after enforcing this definition.

  6. ^

    Sam Marks made a good point in early drafts of this post that training may disrupt the introspective mechanism, rather than induce the expected causal change. I agree that causal robustness under training isn’t strictly necessary for a model to count as “introspective”, and may be impossible to enforce universally across all types of training procedures. Nevertheless, I think some notion of causality is desirable for detecting unintended effects of training; we may like to discover some (pretraining or continued training) mechanism by which we can modify models in one direction or the other, and induce the corresponding set of behavioral or descriptive changes — for example, by training only a subset of parameters that do not participate in introspection.

  7. ^

    Thus, privileged access covers internality in Lindsey’s definition, but is a little more concrete and flexible — there may be potential introspective behaviors that require small amounts of chain-of-thought to be able to surface, such as introspecting about late layers; rather than restricting LMs to exclusively internal computation in these cases, we allow them to generate some tokens, simply requiring that another model won’t have enough information to reconstruct M1’s explanation from that chain-of-thought alone, indicating M1’s explanation must have additionally relied on something internal.



Discuss

OK, what's the difference between coherence and representation theorems?

2026-02-10 08:45:00

Published on February 10, 2026 12:45 AM GMT

TL;DR Is a coherence theorem anything that says "if you aren't coherent in some way you predictably have to forgo some sort of resource or be exploitable in some way" and a representation theorem anything that says "rational cognitive structures can be represented by some variant of expected utility maximization?" Is there no difference? One a subset of another? Some secret fourth thing? 


Just today, for an article my co-worker was writing, I was arguing that Savage's subjective expected utility model should be called a representation theorem, as Wikipedia claims, as opposed to a coherence theorem. My opponent took the bold stance that Wikipedia may not be the be-all and end-all of the discussion, and that he wasn't sold on it being a representation theorem, in spite of the fact that you're representing one structure (preferences between deals dependent on unknown world states) using another (subjective expected utility), as in representation theory.

Unwilling to accept this lack of decisive resolution, I turned to that other infallible oracle, Stampy. (Stampy's received some hefty upgrades recently, so it's even more infallible than before!) Stampy demurred, and informed me that the literature on AI safety doesn't clearly distinguish between the two.

Undaunted, I delved into the literature myself, and I found the formerly rightful caliph (sort of) opining on this very topic:

The author doesn't seem to realize that there's a difference between representation theorems and coherence theorems.

The Complete Class Theorem says that an agent’s policy of choosing actions conditional on observations is not strictly dominated by some other policy (such that the other policy does better in some set of circumstances and worse in no set of circumstances) if and only if the agent’s policy maximizes expected utility with respect to a probability distribution that assigns positive probability to each possible set of circumstances.

This theorem does refer to dominated strategies. However, the Complete Class Theorem starts off by assuming that the agent’s preferences over actions in sets of circumstances satisfy Completeness and Transitivity. If the agent’s preferences are not complete and transitive, the Complete Class Theorem does not apply. So, the Complete Class Theorem does not imply that agents must be representable as maximizing expected utility if they are to avoid pursuing dominated strategies.

Cool, I'll complete it for you then.

Transitivity:  Suppose you prefer A to B, B to C, and C to A.  I'll keep having you pay a penny to trade between them in a cycle.  You start with C, end with C, and are three pennies poorer.  You'd be richer if you didn't do that.

Completeness:  Any time you have no comparability between two goods, I'll swap them in whatever direction is most useful for completing money-pump cycles.  Since you've got no preference one way or the other, I don't expect you'll be objecting, right?

Combined with the standard Complete Class Theorem, this now produces the existence of at least one coherence theorem.  The post's thesis, "There are no coherence theorems", is therefore falsified by presentation of a counterexample.  Have a nice day! 
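For concreteness, here is a toy simulation of the transitivity money pump described in the quote above; the trading loop and penny fee are just my illustration of the quoted argument, not anything from the original post.

```python
# An agent with cyclic strict preferences A > B > C > A starts with C, accepts each
# offered trade (it strictly prefers the offered good to its current one), pays a
# penny per trade, and ends up holding C again, three pennies poorer.
def money_pump(start="C", offers=("B", "A", "C"), fee=1):
    good, paid = start, 0
    for offered in offers:
        good = offered   # agent prefers the offered good to its current one
        paid += fee      # and pays the trading fee
    return good, paid

print(money_pump())  # ('C', 3)
```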

Returning to my meddlesome peer, I did spake unto him that Savage's representation theorem was not a coherence theorem, for there was no mention of exploitability, as in the Dutch Book theorems. Rather, Savage's theorem was akin to von Neumann and Morgenstern's.

But alas! He did not yield and spake unto me: "IDK man, seems like there's not a consensus and this is too in the weeds for a beginner's article". Or something. I forget.  So now I must ask you to adjudicate, dear readers. What the heck is the difference between a coherence theorem and a representation theorem? 



Discuss

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful

2026-02-10 07:25:11

Published on February 9, 2026 11:25 PM GMT

TLDR: Recently, Gao et al trained transformers with sparse weights, and introduced a pruning algorithm to extract circuits that explain performance on narrow tasks. I replicate their main results and present evidence suggesting that these circuits are unfaithful to the model’s “true computations”.

This work was done as part of the Anthropic Fellows Program under the mentorship of Nick Turner and Jeff Wu.

Introduction

Recently, Gao et al (2025) proposed an exciting approach to training models that are interpretable by design. They train transformers where only a small fraction of their weights are nonzero, and find that pruning these sparse models on narrow tasks yields interpretable circuits. Their key claim is that these weight-sparse models are more interpretable than ordinary dense ones, with smaller task-specific circuits. Below, I reproduce the primary evidence for these claims: training weight-sparse models does tend to produce smaller circuits at a given task loss than dense models, and the circuits also look interpretable.

However, there are reasons to worry that these results don't imply that we're capturing the model's full computation. For example, previous work [1, 2] found that similar masking techniques can achieve good performance on vision tasks even when applied to a model with random weights. Therefore, we might worry that the pruning method can “find” circuits that were not really present in the original model. I present evidence that the worry is justified—namely, pruned circuits can:

  • achieve low cross entropy (CE) loss on a nonsensical task;
  • solve tasks using uniform attention patterns even when the original model’s attention pattern was importantly non-uniform;
  • repurpose nodes to perform different functions than they did in the original model;
  • behave very differently to the model on inputs that are slightly out of the distribution used for pruning.

Overall, these results suggest that circuits extracted from weight-sparse models, even when interpretable, should be scrutinized for faithfulness. More generally, in interpretability research, we should not purely try to push the Pareto frontier of circuit size and task performance,[1] since doing so may produce misleading explanations of model behavior. 

In this post, I briefly review the tasks I designed to test the sparse model methods, present a basic replication of the major results from Gao et al, and then give four lines of evidence suggesting that their pruning algorithm produces unfaithful circuits.

My code for training and analyzing weight-sparse models is here. It is similar to Gao et al's open-source code, but it additionally implements the pruning algorithm, “bridges” training, multi-GPU support, and an interactive circuit viewer. Training also runs ~3x faster in my tests.

Tasks

I extract weight-sparse circuits via pruning on the following three natural language tasks. For more details on training and pruning, see the appendix.

Task 1: Pronoun Matching

Prompts have the form "when {name} {action}, {pronoun}"

For example:

  • "when leo ran to the beach, he"
  • "when mia was at the park, she"

The names are sampled from the 10 most common names (5 male, 5 female) from the pretraining set (SimpleStories).[2] The task loss used for pruning is the CE in predicting the final token ("he" or "she").

Task 2: Simplified IOI

I use a simplified version of the standard Indirect Object Identification task. Prompts have the form "when {name_1} {action}, {name_2} {verb} {pronoun matching name_1}". E.g.:

  • "when leo went to the shop, mia urged him"
  • "when rita was at the house, alex hugged her"

The task loss used for pruning is the binary CE: we first compute the model’s probability distribution just over "him" and "her" (softmaxing just those two logits) and then compute the CE using those probabilities.
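As a concrete reading of that loss, here is a short sketch, assuming a PyTorch model; the tokenizer ids for "him"/"her" and the tensor names are placeholders. The same construction is used for the question-marks task below, with "?" and "." in place of the pronouns.

```python
# Binary CE over two competing tokens: softmax only those two logits, then take the
# usual cross entropy. Token ids and tensor names are assumptions for illustration.
import torch
import torch.nn.functional as F

def binary_ce(final_logits: torch.Tensor, him_id: int, her_id: int,
              targets: torch.Tensor) -> torch.Tensor:
    """final_logits: [batch, vocab] logits at the final position.
    targets: [batch] with 0 where the correct token is "him", 1 where it is "her"."""
    two_logits = final_logits[:, [him_id, her_id]]   # [batch, 2]
    return F.cross_entropy(two_logits, targets)      # softmax over just these two
```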

Task 3: Question Marks

The prompts are short sentences from the pretraining set that either end in a period or a question mark, filtered to keep only those where 1) the dense model predicts the correct final token (period or question mark) with p > 0.3, and 2) when restricted to just the period and question mark, the probability that the dense model assigns to the correct token is > 0.8. For example:

  • "why do you want that key?"
  • "that is why I want the key."

The task loss used for pruning is the binary CE, softmaxing only the "?" and "." logits.

Results

See the appendix for a slightly tangential investigation into the role of layer norm when extracting sparse circuits.

Producing Sparse Interpretable Circuits

Zero ablation yields smaller circuits than mean ablation

When pruning, Gao et al set masked activations to their mean values over the pretraining set. I found that zero-ablation usually leads to much smaller circuits at a given loss (i.e. in all subplots below except the third row, rightmost column). Hence I used zero-ablation for the rest of the project.
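For clarity, here is a small sketch of the two ablation options when masking node activations; the shapes and names are assumptions, not the repository's actual masking code.

```python
# mask: 1.0 for kept nodes, 0.0 for pruned ones; mean_acts: per-node means over the
# pretraining set. Zero ablation fills pruned nodes with 0, mean ablation with means.
import torch

def apply_mask(acts: torch.Tensor, mask: torch.Tensor, mean_acts: torch.Tensor,
               ablation: str = "zero") -> torch.Tensor:
    if ablation == "zero":
        fill = torch.zeros_like(acts)
    elif ablation == "mean":
        fill = mean_acts.expand_as(acts)
    else:
        raise ValueError(f"unknown ablation type: {ablation}")
    return mask * acts + (1 - mask) * fill
```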

Weight-sparse models usually have smaller circuits

Figure 2 from Gao et al mostly replicates. In the pronoun and IOI tasks, the sparse models have smaller circuits than the dense model, at a given loss. On the question task, only two of the sparse models have smaller circuits than the dense one, and even then, the reduction in size is smaller than it was for the other two tasks.

Weight-sparse circuits look interpretable

You can view circuits for each task here. Hovering over/clicking on a node shows its activations, either in the original or the pruned model. Below is a brief summary of how I think the IOI circuit works; I walk through circuits for the other two tasks in the appendix. Each circuit I walk through here was extracted from the  model; I did not inspect circuits extracted from the other models as carefully. All the per-token activations shown in this section are taken from the pruned model, not the original model.

IOI Task (view circuit)

Below is an important node from layer 1 attn_out. It activates positively on prompts where name_1 is female and negatively on prompts where it is male. It then suppresses the "him" logit.

To see how this node’s activation is computed, we can inspect the value-vector nodes it reads from, and the corresponding key and query nodes. The value-vector node shown below activates negatively on male names:

There are two query-key pairs. The first query vector always has negative activations (not shown). The corresponding key node’s activation is negative, with magnitude roughly decreasing as a function of token position:

The other query-key pair does the same thing but with positive activations. Hence the head attends most strongly to the first part of the prompt, and so the attn_out node only gets a large contribution from the value-vector node when it appears near the start of the sentence. Specifically, the attn_out node gets a large negative contribution when name_1 is male. The other value-vector node, not shown here, gives a positive contribution when name_1 is female. This explains the activation patterns we saw above for the attn_out node.

Scrutinizing Circuit Faithfulness

Pruning achieves low task loss on a nonsense task

I alter the pronouns task so that when the name is male, the target token is "is", and when the name is female, the target is "when". E.g.:

  • "when rita went to the woods, when"
  • "when leo went to the woods, is"

Just like the standard pronouns task, the task loss is the standard CE loss (i.e., all logits are softmaxed; I am not using the binary CE). This is a nonsense task, but the pruned model gets task loss < 0.05 (meaning accuracy > 95%) with only ~30 nodes.

Around 10 nodes are needed to achieve a similar loss on the ordinary pronouns task. So the nonsense task does require a larger circuit than the real task, which is somewhat reassuring. That said, it seems worrying that any circuit at all is able to get such low loss on the nonsense task, and 30 nodes is really not many.

You can view the nonsense circuit here.

Important attention patterns can be absent in the pruned model

This pronouns circuit has attention nodes only in layer 1 head 7. In the original model, this head attends strongly from the last token to the name token ("rita"), as one would expect. But in the pruned model, its attention pattern is uniform (since there are no query- or key-vector nodes):

How does the pruned circuit get away with not bothering to compute an attention pattern? It does so by having all its value-vector nodes be ones that fire strongly on names and very weakly everywhere else. So even though the head attends to all tokens, it only moves information from the name token. Such a mechanism was not available to the original model. The circuit we found misses a crucial part of what the original model was doing.

Nodes can play different roles in the pruned model

Example 1: Below are the activations of layer 0, node 1651 from this IOI circuit. The left figure shows its activations in the pruned model, where it activates negatively (red) on female names. The right figure shows its activations in the original model, where it activates positively (blue) on male names. In both cases, its activation is very close to zero for all non-name tokens. This is strange: the node acquires a different meaning after pruning.
    

Example 2: Below are activations of attn_out, layer 1, node 244 from this IOI circuit. In the pruned model, the node activates positively on contexts where name_1 (the first-appearing name) is female, and negatively on ones where it is male. In particular, the final token’s activation is positive only if name_1 is female, and as expected, the node directly suppresses the "him" logit. So, in the pruned model, the node is playing the functional role “detect gender of name_1 and boost/suppress corresponding logit”. But in the original model, the final token’s activation does not depend on name_1, so it cannot be playing the same functional role.
  

Example 3: Below are activations of mlp_out, layer 1, node 1455 from this Questions circuit. In the pruned model, the node is a questions classifier: its activations are negative on questions and roughly zero elsewhere. It is used to suppress the "?" logit. But in the original model, it is not a question classifier. In particular, its activation on the last token of a sentence does not predict whether or not the sentence was a question, and so it cannot be helping to promote the correct logit.
   

Pruned circuits may not generalize like the base model

Recall that IOI prompts look like "when {name_1} was at the store, {name_2} urged __". We prune using a train set consisting only of prompts where name_1 and name_2 have opposite genders. There are two obvious circuits that get good performance on the train set:

  • Good circuit: output the pronoun with the gender of name_1.
  • Bad circuit: output the pronoun with the opposite gender to name_2.

Let  be the mean probability assigned to the correct target token, where we compute the probability by only softmaxing the him and her tokens. Here, I focus on the model with , which completes the task correctly 89% of the time for opposite-gender prompts, and 81% of the time for same-gender prompts.

I run pruning 100 times with different random mask initializations and data ordering. Below I show the resulting distribution of  for opposite-gender prompts (left) and same-gender prompts (right). I filter out runs which didn’t achieve CE < 0.15, leaving 77 seeds total.
Often, pruning finds only the “bad circuit” (see the big spike at 0 in the same-gender histogram). This is bad, since the actual original model had  in the same-gender case and so must have been using the good circuit.

Separately, it is also a little worrying that pruning using the same hyperparameters but different random seeds can lead to circuits with totally different OOD behavior.

Conclusion

The above results provide evidence that Gao et al's pruning method can find circuits that are small, interpretable, and get good task loss, but nevertheless are unfaithful to what the model is really doing. These results do not have much to say about whether weight-sparse training itself is a promising direction; they only show that the pruning algorithm is flawed.

My main takeaway is that we should not purely aim to improve the loss vs circuit size Pareto frontier. Hill-climbing on this metric alone is likely to yield mechanistic explanations that look appealing but are actually misleading. For example, zero ablation improved the frontier, so I switched to it early on. But in hindsight, mean ablation may have given more faithful circuits (at the cost of giving circuits with ~100 nodes for a task as simple as pronoun gender matching, which is a lot more than I would’ve hoped for).

I think a natural next step for the weight-sparsity line of work would be to 1) think of a good faithfulness metric[3] (ideas like causal scrubbing seem on the right track but possibly too strict); 2) figure out how to modify the pruning algorithm to extract circuits that are faithful according to that metric; 3) check whether Gao et al’s main result—that weight-sparse models have smaller circuits—holds up when we use the modified pruning algorithm.[4]

I would also be interested in applying similar scrutiny to the faithfulness of attribution graphs.[5] I expect attribution graphs to be more faithful than the circuits I found in the present work (roughly speaking, because the way they are pruned does not optimize directly for downstream CE loss), but someone should check this. I’d be particularly interested in looking for cases where attribution graphs make qualitatively-wrong predictions about how the model will behave on unseen prompts (similar to how pruning found the “bad circuit” for the IOI task above).

Appendix A: Training and Pruning Details

My implementation of weight-sparse training is almost exactly copied from Gao et al. Here I just mention a few differences and points of interest:

  • I train two-layer models with various sizes and sparsities:
    d_model    frac_nonzero    n_nonzero
    128        1 (dense)       1.4M
    1024       1/64            0.5M
    3072       1/200           1.3M
    4096       1/500           0.9M
  • Unfortunately, I did not ensure that each model has the same number of nonzero parameters. However, the only time I compare different models below is when comparing their circuit size vs task loss Pareto curves, and this is just a replication of the main Gao et al result, not an important part of this post.
  • Each model is trained on 2B tokens from SimpleStories.
  • These weight-sparse models were trained alongside “bridges” mapping from/to a dense model. This was a bad choice on my part, since none of my results ended up using the bridges at all, and they add complexity. That said, I expect the results to be essentially the same for standalone weight-sparse models.
  • I impose mild (25%) activation sparsity at each residual stream location (whereas Gao et al only did so at various other points such as mlp_out) since this slightly improves loss.
  • During the first half of training, Gao et al linearly decay the fraction of nonzero parameters down from 1 (fully dense) to its target value. I use an exponential decay schedule, since this slightly improves loss.

For pruning, I once again follow Gao et al closely, with only small differences:

  • As mentioned in the main text, Gao et al mask nodes by mean-ablating them, whereas I find that zero ablation yields smaller circuits.
  • The types of tasks I study involve natural language rather than code.

Appendix B: Walkthrough of pronouns and questions circuits

Pronouns Task (view circuit)

All the computation in this circuit routes through two value-vector nodes in layer 1. The one shown below activates negatively on male names. There are no query or key nodes, so attention patterns are uniform, and each token after the name gets the same contribution from this value-vector node. Tracing forward to the logit nodes, one finds that this value-vector node boosts "she" and suppresses "he". The other value-vector node does the same but with genders reversed.

Notice that the mlp_out nodes in layer 1 have no incoming weights connecting to upstream circuit nodes, so their activations are constant biases.

Questions Task (view circuit)

The earliest node that acts as a question classifier is at attn_out in layer 1:     

This attn_out node reads from a value node that activates positively on “question words” (“why”, “are”, “do” and “can”) and negatively on pronouns:

The query node for that head has positive activations (not shown). The key node’s activations roughly[6] decrease as a function of token position:

Thus, if a prompt contains “do you” then the head attends more strongly to “do”, so attn_out receives a large positive contribution from the question word “do” and only a small negative contribution from the pronoun “you”. On the other hand, if the prompt contains “you do”, then the head attends more strongly to “you” so attn_out receives a large negative contribution from “you” and only a small positive contribution from “do”.

Putting that all together, the attn_out node activates positively on prompts containing “do you” and negatively on prompts containing “you do”, and similarly for other combinations of question words and pronouns. Hence the attn_out node functions as a question detector.

Appendix C: The Role of Layernorm

Pruned models get very low task loss with very few nodes. Here is one way they might be “cheating”:

Pruning decreases activations’ norms. In particular, the plot below shows that the RMS of the residual stream just after the last layer—i.e. the activation that is fed into the final layernorm—is smaller in the pruned model. So the final layernorm scales up activations by a larger factor in the pruned model than it did in the original model.


Now, suppose the original model has many nodes which each write a small amount to the “correct direction” just before the final layernorm, by which I mean the direction that will unembed to boost the correct logit. The pruned circuit contains only a small number of these nodes, so it only writes a small amount to the correct direction. But it gets away with this, because the final layernorm scales the activation up a lot, so that even a small component in the correct direction will strongly boost the correct logit, leading to good CE loss.

Below, we compare regular pruning against a modified version where we freeze layernorm scales (the thing layernorm divides activations by). That is, for each batch of data, we run the original model, save all its layernorm scales, then patch them into the pruned model during its forward pass. As the above analysis predicts, freezing layernorm leads to much larger circuits at a given loss.
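
Concretely, here is a minimal sketch of one way such scale-freezing might be implemented; the module and variable names are hypothetical, and the real implementation may differ (e.g. in the layernorm variant used):

```python
import torch
import torch.nn as nn

class FreezableRMSNorm(nn.Module):
    """A layernorm-like module whose normalization scale can be recorded from
    one forward pass and replayed in another. A minimal sketch: a real final
    layernorm may also subtract a mean, have biases, etc."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps
        self.recorded_scale = None   # filled in when running the original model
        self.frozen_scale = None     # set before running the pruned model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.frozen_scale is not None:
            scale = self.frozen_scale                       # replay the original scale
        else:
            scale = x.pow(2).mean(-1, keepdim=True).add(self.eps).sqrt()
            self.recorded_scale = scale.detach()            # save for later patching
        return self.weight * x / scale

# Per batch: run the original model to record its scales, copy them into the
# corresponding norms of the pruned model, then run the pruned model.
# for orig_norm, pruned_norm in zip(original_norms, pruned_norms):
#     pruned_norm.frozen_scale = orig_norm.recorded_scale
```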

For the IOI task with the larger () model, freezing layernorm (bottom) leads to better generalization to same-gender prompts than standard pruning (above):


However, the results are the opposite for the smaller () model. That is, freezing layernorm leads to circuits which generalize worse than when layernorm was unfrozen. I found this result surprising.


I think it is “morally correct” to freeze layernorm during pruning, so that the model cannot cheat in the way described above. But it seems doing so does not fully fix the faithfulness issues (see the IOI  results directly above).

A final caveat to the results in this appendix: for each model and task, I performed a CARBS sweep to find the best hyperparameters for pruning, and then used these best hyperparameters for each of the 100 randomly-seeded pruning runs. It may be the case that e.g. for the  we happened to find "unlucky" hyperparameters that lead to poor generalization to same-gender prompts, whereas we got "lucky" with the hyperparameters we found for the  model. In other words, the 100 seeds are perhaps not as decorrelated as we'd like.

  1. ^

    Where smaller circuit size and lower CE loss are better. In this work, circuit size refers to the number of nodes in the circuit.

  2. ^

    In case you're curious, the names are Leo, Samuel, Jose, Peter, Alex, Mia, Rita, Kim, Maria, Lily.

  3. ^

    E.g. I tried defining the "importance" of each node in the model as the increase in task loss when the node is ablated, and then computing the fraction of the top 20 (say) most important nodes that are present in the circuit. All circuits I looked at scored poorly on this metric. But the supposedly-important nodes that were not found in the circuit often had dense, seemingly-uninterpretable activations (I saw this both for mean- and zero-ablation), and I suspect they are playing a very low-level "keeping activations on-distribution" sort of role, similar to a constant bias term. So I am not convinced that judging faithfulness via the importance score defined above is quite right.

  4. ^

    If you’re interested in doing this sort of work, please get in touch at jacobcd52 at gmail dot com.

  5. ^

    E.g. I'd love to see more work like this.

  6. ^

    The fact that the activations only roughly decrease as a function of token position makes me suspect that my mechanistic explanation is missing parts. Probably the attention patterns are more nuanced than just "any old roughly-decreasing function", but I haven't tried hard to understand them.



Discuss

Gwern's 2025 Inkhaven Writing Interview

2026-02-10 06:11:31

Published on February 9, 2026 10:11 PM GMT

In November 2025, while working as a writing coach at the experimental 2025 Inkhaven writing/blogging residency at the Lighthaven conference center, I was interviewed by Adam Mastroianni of Experimental History on where I get my ideas (of interest to Inkhaven participants, who were required to write >500 words⧸day) and develop my writings. It can be seen as a followup to my Dwarkesh Patel interview.

Topics include:

How recent poems emerged from LLM collaboration and happy accidents; incubation periods and waiting for ideas to “snap into place”; pathological curiosity and cross-domain reading; my Wikipedia apprenticeship; self-experimentation as easy science and statistics practice; the pipeline model of why small advantages compound into large output differences; mental tools like Feynman’s algorithm and modus tollens reversal; “blog brain” and pattern-matching experiences to essays; the two directions of essay-writing (universal → concrete versus concrete → universal); and my preference for wikis over blogs, and treating publishing as just another edit.

Below is a lightly edited transcript of my interview and audience Q&A (to condense, clarify, and add references).

Inkhaven artwork: a mouse furiously typing to make its 500-word daily quota at Inkhaven. Original artwork by Jenn <https://www.redbubble.com/i/sticker/inkhaven-posts-by-kwrat/176455632.O9UDB>, modified by Gwern Branwen on 2026-01-30 using Google Nano Banana Pro for use as a page thumbnail, by zooming in, changing the color from sepia, and swapping the cup for a mug.

In early 2025, I was pitched by the Lightcone staff (mostly Inkhaven’s organizer, Ben Pace) to potentially participate in an unusual experimental writing bootcamp/retreat to be held at the Lighthaven conference center in Berkeley, California in November 2025. The idea was that instead of being an ordinary writing retreat, this would be focused on bloggers, and to destroy procrastination and train the ability to write regularly, participants would be required to write 500 words each day or be expelled.

I thought this was a neat idea and worth trying, so I committed to participating as a writing coach and to their advertising my participation. (This helped with getting Inkhaven off the ground.)

I helped screen a few of the applicants, and gave a short opening speech; but I mostly did “office hours” (1–2 hours of discussing anything with anyone who showed up, roughly every other day) and hung around to chat; while my feedback was often negative, I thought it went well overall.[1] (I intend to participate in the planned second Inkhaven for April 2026, but not as much; I found it cut into writing time more than I had expected, accidentally demonstrating my view of how queues & daily writing times interact.)

There were a number of “special guests” like Scott Aaronson or Dwarkesh Patel over the month. Adam Mastroianni visited for the first week or so, and wanted to do interviews with a number of participants, including myself, on the topic of writing and getting ideas.

Below is the Q&A I did with him on the afternoon of 5 November 2025 before office hours before an audience of about a dozen Inkhaveners. The audio was recorded, transcribed by a commercial service, and then edited for Gwern.net by myself.

Opening Speech

…I think I will just emphasize three points for writing:

  1. Surprise is the mother of learning.

    You should be constantly trying to predict things to yourself and asking “why is X X, instead of Y?” “I predicted Z, but it was A instead; what does that mean?”

    There are interesting things all around you, but only if you are thinking about them to the point of being surprised.

  2. Never let a joke go to waste. (commentary)

    People are constantly saying interesting things that would make good starting points for a blog post. The problem is that they laugh and immediately forget everything.

    Resolve to never let a funny line or a good question or a rant go to waste, whether it is a shower thought, an online chat, or an in-person conversation.

    Conversations can be full of great writing prompts. Yuxi Liu, for example, has a remarkable memory for conversations and has written down some Bay Area events, and you can see how many places you could go off and write about it (and certainly not all of the good comments are by me!): 1, 2, 3; had Liu not written it down, though, who would remember any of that?

  3. Rule of three: if you (or someone else) have explained the same thing 3 times, it is time to write that down.

    It is clearly interesting enough to keep going back to, and you now have several rough drafts to work with.

    If you’ve done something 3 times, or argued at length the same thesis 3 times, etc., it may be time to think about it more closely to automate it or write down a full-fledged essay. I find this useful for writing because if something comes up 3 times, that suggests it’s important and underserved, and also that you might save yourself time in the long run by writing it now.

    (This is my pet theory for why memory works in a spaced repetition sort of way: real-world facts seem to follow some sort of long-tailed (or perhaps mixture) distribution, where there are massed transient facts which can be safely forgotten, and long-term facts which pop up repeatedly with large spacings; so ignoring massed presentation but retaining spaced facts, which keep popping up after long intervals, is more efficient than simply having memories be strengthened in proportion to total number of exposures.)

Follow these three principles, and you will surely never lack for something to write.

Poems & Incubation

Adam Mastroianni: I’ve been talking to a bunch of writers about where ideas come from for blog posts and about blog process in general, because everybody here [at Inkhaven] is blogging, wants to get better at blogging and writing. So much of that you can get from reading good stuff. You can see on the page like, “Oh, they said this thing. It had this structure.”

But you can’t see the origins. Those are lost to the mists of time. We can chat a bit about where yours come from and how you blog in general. To start, I wanted to ask about your recent posts, and if you remember where they came from, how they came to be.

Gwern: Sure. The most recent posts were two poems that I wrote [“Silver Bird Above San Francisco” and “Tilakkhana: The 3 Scars of Existence”]. So I can tell you how I came to write those.

A: Yeah.

G: The most recent one was I had been a fan of Czesław Miłosz, the Polish poet. And he had a funny little dialogue in one of his poems about God speaking with a saint. And it occurred to me that I could parody this for LLMs and AI researchers. And so I was putting it into the various LLMs with poem prompts and trying to get something interesting on them and failing completely. Basically out of frustration I just asked them to come up with a poem in general on the theme at all.

And Kimi K2 came up with this interesting poem about the AI personification, discussing how coal had been burned and had unraveled the leaf or unraveled the sunshine to power the GPUs that ran it. And I was struck by this image and metaphor. The poem itself was not good, but there was a great seed there. And so I kept working with that.

And then I realized, I had thrown away the original poems I’d written there with the God saint dialogue. But then after a while working on the Kimi poem, I realized, wait, I can make a 3-part poem here. The 3 parts correspond to the phases of LLM training and deployment, and where I use the parody dialogue at the first one—just tweak them and then fill in the middle and then use the unraveling as the core of the third part. And from there on, I just kind of kept working with the LLMs to get that one.

That one was sort of a… I had a specific thing I wanted to do, but accidentally along the way, I got an output from the LLM, which totally changed the trajectory and resulted in a completely different kind of poem from what I was intending to write.

And so I’m satisfied with the result of that one, which wound up being much more Buddhist and ethical than I had intended, but was much better than what I had envisioned in the first place. That was just a classic example of a “happy accident”.

The previous one before that, “Silver Bird Above San Francisco” was just, I was flying back from the conferences in June 2025 here [Less Online and Manifest]. I was flying out of SFO in the afternoon, and for whatever reason, the flight plan out did a gigantic spiral over the Bay Area, just banking constantly in the open sunshine. I thought to myself that if I was on the ground looking at this, I’d be seeing this gigantic silver bird, just like spiraling up toward eternity. And I thought to myself how incredible it is that I can fall asleep, listening to music from my phone and people on their tablets, whatever, how incredible it is that none of us is the slightest bit worried that we’re going to crash into the ground and die the way so many planes throughout history did.

And that is an incredible triumph of airplane technology and airlines and FAA and pilots and everyone that we have made air travel so incredibly boring that we made them into these overstuffed buses of the sky where the biggest thing to think about is, “Oh my God, they charged me $35 to check a single piece of luggage”, or, “Oh, s—t, I missed the pretzel snack from the stewardess.”

And just what an absolute miracle. What an absolute miracle.

G: For that one, I just wanted to convey this emotion, this feeling of awe and miraculous-ness of air travel that I felt at that moment as we were going out of SFO.

And so I started working with the LLM on that. And one of them at one point suggested using William Carlos Williams’s “The Red Wheelbarrow” poem. And I was like, “Yes, that’s it. That’s the frame for the poem.” And I just kept iterating with them and asking them for more and more suggestions and picking and choosing. And eventually I wound up with whatever poem that I was satisfied with.

So that’s the two most recent ones.

Adam: Is that at all similar to how you write non-poetry?

G: I would say it’s kind of similar in that often I will just have a long incubation period where I’m just noticing something and I’m just like waiting for enough to accumulate to the point where I can crystallize something. And at some point it’ll often be, it will snap into place and I’ll be like, “Yes, that’s it! That’s the idea here, that’s the angle that makes this interesting, which turns this from a pile of inchoate notes into something interesting.”

I briefly discussed the scaling hypothesis on the Dwarkesh Patel podcast, which was something that in some ways I had started way back in like 2011, reading Jürgen Schmidhuber and Hans Moravec. But until 2020, it had not snapped into place that perhaps “all AI progress is just compute”, and that everything underneath these fancy algorithms is ultimately really just compute or trial-and-error enabled by compute to make actual progress. And this, of course, has major implications.

Until then I had often seen it and I kept seeing papers which suggested it, or showed that the algorithms don’t matter as much as the raw compute does once you hold everything equal. There were many, many individual data points where I kept going, “Huh, it certainly seemed like there’s something there.” It’s like, isn’t it kind of interesting how just more compute just seems to keep being behind so many of these breakthroughs, and keeps being written out of the story, and how people would admit like, “Oh yeah, that was actually because I had 100 GPUs and I could try out a bunch of ideas and that’s why it worked.”?

But it wasn’t until 2020 with the GPT-3 paper coming out and smashing the benchmarks that I could—I finally had something that snapped in place, I wrote it all down and then the rest is history.

Polymath

A: Something that strikes me about those stories is the disparateness of the information that you’re working with. You talked about having read this Polish poet for a long time, talked about like “should I use something from William Carlos Williams?”. You’re also reading papers about LLMs and how many GPUs. Has this always been the case? Where did these things come from and why are they tied together?

G: I don’t know if they have to be tied together. I’m sure many writers can specialize in a particular niche and a particular field, but I have always just been pathologically curious. Even as a little kid, I was your classic bookworm who just wanted to read every book in the library to the extent humanly possible and I would check out 15 books a week and read them all and then return and then grab the next 15 books.

And so I don’t know if it has to be that way. That’s just how I do it. It’s natural to me that I read a bazillion things. If I spend all day reading random interesting papers, I will sooner or later know all sorts of weird things which wind up being tied together. And so you never know what will be useful to make connections across fields.

It’s just that maybe you can write… I’m sure you can write things many other ways. There are many, many specialists. This is just how I in particular happen to operate.

A: Yeah. I mean, it’s funny the people I’ve spoken to today all do some version of that. And I don’t know if it’s just because we’re so selected for the kinds of writers here.

G: I mean, that’s a big factor here.

A: Yes.

G: We definitely do not select for best specialists. If you’re just a pure AI guy, I don’t think you had much reason to be here because how many GPUs are there in Lighthaven? Not many, right?

The Apprenticeship

A: Yeah. What was I going to ask? Oh, okay. A lot of people have this. A lot of people are the person who wants to read all the books in the library. That’s not super uncommon.

It is much more uncommon to then gestate it in your head, turn it into blog posts and put it on the Internet. Why did you start doing that?

G: I think a lot of people don’t remember what they read. Even if you read widely, if you can’t remember the weird little details, the disparate things, then you can’t get too much out of reading. I’ve always had a good visual memory for language especially. I could often recite things from books, or at least just look them up later. That was a key to it. I do, in fact, remember quite a lot of these things that I read and can put the pieces together that way.

But also there’s just, as you say, me, Scott Alexander, and many other bloggers, there is ultimately an interest or compulsion or fascination with writing, which ultimately is what compels you to keep writing for a long time until you get recognition.

Hence, the joke is always “achieve overnight success in just 10 years of hard work”.

A: Yeah.

G: And it’s like nobody really starts blogging without ever having written before and becomes an instant massive success.

Usually there’s always some sort of apprenticeship somewhere.

I had my English Wikipedia phase where I was writing on Wikipedia for a long time to just try to share knowledge and information and provide good references for people to read on various topics, like Evangelion and that sort of thing.

A: What led you to make the switch from Wikipedia writing to your own website?

G: It was a mix of getting increasingly dissatisfied with the Wikipedia editing environment.

It was harsh and clamping down on things that I liked. I liked having epigraphs. I liked having blockquotes of interesting texts. And increasingly the attitude was hostile to that.

And on Wikipedia, nothing is fixed. Everything is deletable even if you write the best article in the world, someone can always come along 5 years later and delete it, literally delete the entire article. And so I was pushed out of Wikipedia to a considerable extent.

But also I was pulled because I increasingly wanted to do my own synthesis and original analysis and research and writing and self-experiments and provide copies of documents and papers. And all of that is just not acceptable on Wikipedia. You’re only supposed to be summarizing what someone else has written. You’re not supposed to be doing any analysis or synthesis of your own.

It was a combination of push and pull. The kind of writing I was doing was being pushed out and the kind of writing I wanted to do was definitely not allowed.

Self-Experimentation

A: You mentioned self-experimentation, which is quite unique, and I think surprisingly so: in the blogosphere, for the kinds of people who are so interested in science, how knowledge is produced, and how to do that correctly, there’s not a lot of original data collected and presented.

Why do you do it?

G: I did it just because it seemed obviously correct and because… Yeah. I mean, it does seem obviously correct. You have these supplements and you have all these claims about how you think the supplement does this thing for your sleep, so just measure it, right? What is science, but writing something down, really?

When you get down to it, if you don’t write it down, then it’s not science. No matter how much analysis work you do, if it isn’t written up, then it never happened. So science is easy in that sense. You just have to write it down. It was like, “Well, why can’t I do that? I can write stuff down. I’m quite good at writing stuff down.”

A: Wait, so you were doing these things as a normal course of your life and the decision was to write them down? The decision wasn’t like…

G: No, no, no. I did not do randomized experiments beforehand. I did them in part to have something to write up because the other flip side is that, well, if it’s so easy to do, why not do it and then write it up? And now you have something to write up.

If you don’t have an idea for something good to write, well just do an experiment on yourself and it’s easy to pick a unique experiment that no one has ever done before and boom, now you have something to write about and analyze.

It was also a great statistics practice. I got so much value out of learning statistics just to do a particular self-experiment. I was like, “Oh, for this one, I will use some factor analysis. For this one, I will learn about ordinal models. For this one, I will try to include an informative prior from Bayesian reasoning and I would use it in JAGS.”

It was a wonderful, wonderful source of self-assigned homework, which was much, much more fun to do than reanalyzing some fake data set in class or just doing theoretical exercises. I found it motivating to work on my own data.

Now the question you might ask is like, why don’t more people do this? And of course for many of them, the reason is that they don’t genuinely care about what the right answer is, so that’s a lot of it; but also a lot of it is just that they don’t finish things.

I find that many people will contact me partway through and I’ll hear about how, “I’m doing an LSD microdosing experiment!”, and then I will never hear from them at the end.

I would say there are at least 5–6 people who said they had started a self-blinded LSD microdosing experiment to try to prove me wrong and then I never heard from them again.

A: They took too big of a dose.

G: Well, even LSD wears off after a day, right?

A: Yeah.

G: So I think there’s an issue with writing: people will often start things and then not finish them. And this is all true of hobbies and personal projects in general. People will start them and then not finish them, or they will finish them, but then not finish them by writing them up properly.

I talk with people all the time who have done these quite challenging, impressive technical projects or AI projects which would be a meaningful, valuable contribution if they would just turn what they just told me in the chat into a blog post, and yet they don’t.[2] And so no one else ever knows about it. There’s many such cases.

And so it’s just I think underappreciated how hard it is to finish things. Just like there are so many people who are flakes in that sense. They’re flaky. They start things but they don’t finish things.

The Writing Pipeline

A: Is this not a problem for you?

G: I mean, it’s definitely an issue for me in terms of finishing projects, but also I have so many projects and I’m motivated to finish them because I don’t believe it’s real until it’s done and I want to have something to show other people. And also, writing is just a lot easier for me than for other people. So I don’t know what’s going on there.

I just know that I finish things and so my output is infinitely larger than many other people who are much smarter than me, much more talented than me, much more interesting than me; and yet no one will ever know or only their friends will know or only their job will know because they wouldn’t finish things and put them into the public.

There’s a useful mental model here, what I call “the pipeline”, which is that the reason you get these surprising log-normal distributions of output is because everything is a sequence of phases and you have to finish each phase to have a single final output, and only the final output counts. The original model, like Shockley1957 used for modeling the writing of a scientific paper—you have to have an idea, you have to execute the experiment, you have to analyze the results, you have to write it up, you have to send it out to a journal, and then you have to go through the full process to have a published paper. And if you fail at any step along the way, at any step, you have nothing at the end.

And so if you have one scientist who has 50% probability of finishing each stage and you have another scientist who has 60% probability, that doesn’t seem like much, like 50% versus 60%. But when you look at the end, it’s like, this guy has 1 published paper in his career and this guy has 10; that small difference multiplies out at each stage to a big difference at the end. So I feel like maybe the answer is just that I have a small advantage at each stage and so that’s why at the end of the day I had 10 blog posts and this guy has one blog post.
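
To put rough numbers on that multiplicative effect, here is a toy calculation (the choice of 12 stages is an illustrative assumption, not something claimed above):

```python
# Pipeline model: a project only counts if *every* stage is finished, so the
# completion rate is the product of the per-stage success probabilities.
def completion_rate(p_per_stage: float, n_stages: int) -> float:
    return p_per_stage ** n_stages

n_stages = 12   # illustrative: idea, experiment, analysis, write-up, submission, ...
for p in (0.5, 0.6):
    print(f"{p:.0%} per stage -> {completion_rate(p, n_stages):.5f} of projects finished")

# The gap grows geometrically with the number of stages:
print((0.6 / 0.5) ** n_stages)   # roughly 9x more finished projects for the 60% person
```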

And this is universal in terms of doing projects. A project is multiple stages. Each stage has to be finished. As far as anyone else is concerned, the entire project doesn’t exist unless it’s finished.

And so you look around and you see people who have a ton of projects and then you see people who have none. But that other guy has a graveyard of half-finished projects, which he didn’t finish. And they’re not all that different, it’s just that this first guy was just slightly more conscientious each step in finishing stuff. So he has a graveyard too, but it’s not nearly as big.

Tools For Thought

A: Yeah. That’s a useful analysis and I wonder if, did you have this advantage coming in, do you think? Is this something that you inculcated? Because a lot of people see this and they’re like, “I want to increase my closing rate.” Is that ever a thought that you had and then you took some actions that increased your closing rate?

G: I think if I compared myself to when I started blogging, in 2010, I don’t think I had many ideas at all.

I was like, I think maybe the self-experiments were in part a reaction to that. “Well, I don’t know anything interesting to write about. I guess I will see what caffeine does to my sleep.” Because this is a boring, obvious thing I could turn the crank on the experiment, right? It’s like if I don’t know what to do, well I can just do a self-experiment. That was kind of a crutch in some ways to get stuff to do.

Of course these days it’s like I have a bazillion ideas for things that you could do which are totally feasible and I have whole pages of free ideas—please take my idea and adopt it and give it a good home!

I feel like that was in part, I got much better at the skill of noticing things and thinking harder about predicting things. Also, there is just a thing where after a while every idea feeds on itself and builds on itself, you like to pick up a whole collection of useful ideas which you can apply to everything. I love… Gian-Carlo Rota has a good quote for how he does research where he said like, “Everyone has two mental tools that they apply to every problem, and a great mathematician will have 3.”

And then he says that like, he also called it the “Feynman problem solving algorithm”, which is that Richard Feynman would have in mind, say, 10 interesting problems at any given time. And anytime he ran across a new result or a new method or a new tool, he would mentally try to apply it to each of the outstanding problems and see if it helped. And every once in a great while, he would get a hit. And then he would tell someone else, and they’d be like, “Oh my God, Feynman, you’re a genius. How do you do that?” And the answer is “I just had a whole bunch of outstanding problems and interesting observations in my head. And every time I run across a new theorem or whatever, I just try it on each of them and I see if it helped.”

After a while, ideas stop being a problem.

A: Do you feel like you have some analog of this, the two or 3 tools that you are applying to each new piece of information or problem that you encounter?

G: Yeah, definitely. There’s definitely something like that. I think I would have to think harder about what exactly they are, but you’re right, there’s something that I… I feel like often my arguments are stereotypical and that I keep reusing the same one, every one.

Like one of them might be “one man’s modus ponens is another man’s modus tollens”. That is definitely one that I will apply reflexively everywhere.

A: I assume plenty of people in the audience can parse that, but can you explain it for people who haven’t encountered it before?

G: A good, simple, intuitive explanation of that… Oh, that’s a little bit difficult because I mean informal logic is kind of simple.

Often people make an argument which concludes, “A, B, C, and therefore X, Y, Z.” And then I would say, “Well, Z is obviously false. Therefore, your original argument has a false premise somewhere. Which of your premises is false?” “Well, it’s probably A.” And I would say, “Thank you for giving me an interesting new proof of not-A.”

And it’s like, “That would be interesting if true. However, that’s obviously not true, therefore this is not true either.”

That’s a reflexive thing I would say, similar to criticizing scientific writing—if anyone reports a correlation result, I automatically flip it in my head.

Say that someone says that “eating chocolate correlates with lower stress”. I flip it in my head and say “lower stress predicts eating chocolate”, and then I see what I think about that. That’s just a reflex. Anytime anyone gives you a correlation, flip the order of the words and then think about it. That’s an interesting kind of reflexive argument that you can do.

Or anytime someone gives you a number, you can just multiply or divide by 10 and then see, does that change anything? If that didn’t change anything, then the number was irrelevant.

A: This is super interesting and I don’t want to put you on the spot, but can you think of any posts in particular that came out of applying any one of those tools? I mean, I imagine if you’re using them all the time that this is baked into a lot of posts that you’ve written, but I was just curious.

G: No, nothing comes to mind to me, sorry.

Blog Brain: “That’s A Post”

A: But what you’re describing I think is super interesting. It makes me think of the… I was briefly on Instagram and found that it gives me “Instagram brain”.

And what I mean is that I go around the world being like, “What can I turn into a picture for Instagram?” Then I started to see the world through the lens of how can I transmute the world into posts for Instagram.

Now that I blog, I feel like I have “blog brain” and I walk around the world and things stick out to me and there are affordances like, “Oh, this is a post. That’s a post.” Does this resonate with you at all? Do you feel like you have this?

G: Absolutely. I’ve been doing that a lot here. I’ve been going around and saying to people, “Yeah, that’s a post. Yeah, that’s a perfectly readable blog post.” Your friend has a Discord channel where you debate photos of the sky? Post!

A: Is there any part of that you feel like you can articulate?

Because to me it feels a little bit like chicken sexing, that it’s just like you can’t say why you know this chicken is female or male. The master just has to slap your hand when you do it wrong and over time the subconscious figures it out.

Is it like that or is it like, I have some explicit tools I can use to tell you that like why that Discord post is a post?

G: I think in the case of the Discord post, I knew that it was a post because I immediately thought of the XKCD Connoisseur cartoon. And so the Connoisseur cartoon picks out a general phenomenon: if you spend a lot of time on any topic whatsoever, you will develop strangely strong opinions about it, which outsiders will not appreciate at all. And so she was talking about how cumulonimbus, or something, is objectively the best cloud because it has the most interesting 3D appearance.

And I was like, “Yeah, that’s exactly like the XKCD cartoon.” Like, okay, now you have a post! You have a weird little niche that nobody knows about, which does not have any exact parallel, which has not been written about before, which however you can link to some broader general point in this case about developing aesthetic preferences and developing connoisseurship.

Just the channel on its own is interesting, kind of interesting, but if you can make it a specific instance of a general point, now you have an essay. Now you have a post. A good essay will often take a universal and make it concrete or go from a concrete to a universal, whichever direction happens to work better.

So in this case, we have a universal, which is “how people develop preferences” and we have that concrete “cloud channel”, which is an example of a small group of people developing a weird kind of expertise and preferences.

That is kind of the meta principle there: I was able to connect this concrete, weird thing to an interesting, abstract, non-obvious universal principle. And that’s kind of where the pleasure of the insight of reading this would come from: seeing an instance of what you might not have understood before.

If I had to state a general principle of essay writing for me, it is often just compiling examples of a universal principle that you might have seen before but didn’t understand. But by bringing in enough firepower, I can make you understand it.

The best example of that would probably be in my “Commoditize Your Complement” essay, which is just taking a classic Joel Spolsky essay on tech economics, which people don’t truly understand when they read it, and then providing so many examples that you have to understand it. And once you do that, suddenly now you will see the world completely differently. It’s like putting on glasses and suddenly you understand why tech companies do all these things that they do—once I provide enough examples, once I brute force it into your head.

If I just say, “Smart companies commoditize their complement, they try to render complementary products free in order to make complements for their own commodity or whatever.” It’s like, “Okay, it’s just economics nonsense. I have no idea what you’re talking about. I don’t feel it in my gut.”

But if I give you a bazillion examples from “hotels” to “Xbox” to “AI software” to “Windows” to “Netscape”, well eventually at some point it’s got to sink into your head like, “Oh yes, now I see what they’re doing.” They’re trying to destroy the rival competitors who could be a monopoly over them. They’re not friends, they’re enemies! They’re trying to destroy them.

And then you start to see it. “Why does Facebook release that free thing?” Well, because they’re trying to destroy this particular tech company. And once you have that enlightenment experience, now you understand it.

Essay Archetype: Universal ↔︎ Concrete

A: This feels like a type of post of like, I’m going to take a universal principle and show you so many examples that you understand the universal principle that alone is abstract.

Are there other kinds of posts? Do you feel like you have archetypes of posts in your head?

G: Well, in that case, that’s kind of an easy one because you already have the universal principle. The hard ones are where you’re trying to nail down the examples and the universal… So you might accumulate a lot of examples like, “I know these are related somehow, but I don’t yet know how.”

“Scaling Hypothesis” is an example of one of the hard ones, where I could feel that there’s some sort of connection between all these disparate examples, between these image classifiers and these RL agents, between this n-grams scaling paper and Transformers, which don’t seem to have any connection, but after enough time, finally it will crystallize and finally you will have the universal principle which unifies all these. And which in that case turned out to be just compute scaling, essentially.

A: One direction is, I know what the principle is, now I need to find the examples that make the principle legible and understandable. And another is, I’ve got a bunch of examples, how do I back out the universal principle, but I feel like there’s a thing here, but I’m searching for the universal from the specific.

G: Yeah. So, I mean, as I said previously, there are definitely two directions you can go, right?

You can go from the universal to the concrete, or you can go from the concrete to universal.

Yeah, sometimes, I will assemble a bunch of examples and decide, okay, yeah, they seemed like they had some connection, but, no, there’s not a universal there.

Last year I did an “unsorting” essay because I thought I had seen some way in which there was this general concept like “unsorting things” for various statistical purposes. And I thought there was something there, and I collected examples for a long time, but then I realized: no, there is nothing there, it’s just a coincidence. There’s a vague resemblance, but there is not actually any universal principle.

And then I wrote that down to satisfy myself that, no, there’s nothing there. It bugged me for a long time, but no, there’s nothing there—which is kind of a weird essay to write. The people that are reading it, go, “That’s kind of stupid”, but whatever.

A: Do you feel like you write for yourself or you write for the audience?

G: I write for myself. From start to finish, it’s pretty much for myself.

Once in a while I will be trying to convince people, but that’s the exception and not the rule. Most of my stuff is self-indulgent. I wrote perfume reviews because I was into perfume reviews. I wrote the poems because those are the poems I love to read and I wanted to write them. And if nobody else likes them or reads them, that’s immaterial to me because I think they’re good.

I mean, I did write “Scaling Hypothesis” to try to reach all the EA and AI safety people, so that did have a target audience to just wake everyone up. But again, that’s exceptional.

A: The last thing I want to ask about is you write prolifically.

And so I think people might… My image of your day-to-day life is your fingers on the keyboard all the time, making posts come out, and the rest of the time is information input.

But if you were to find you on a random Tuesday, what are you doing?

G: I would say that I spend 99% of my time just reading, and maybe 1% of the time writing.

A: Interesting.

G: Yes. It’s an extreme ratio. I will spend 10 hours a day reading papers, reading books, reading comments, reading posts, reading everything.

And then maybe I’ll write something in the remaining 30 minutes perhaps on an average day.

And many days I would write nothing.

The Voice: Ideas As Earworms

A: When it comes time to write, to what extent does the idea get developed on the page versus it’s already developed when you get there?

G: For most of my stuff, I’m either making a small edit to an existing one. I’m adding a new link to an essay, so it’s not developing at all. It’s getting bigger, but not developing. For new stuff, it’s almost always pretty much complete in my head.

In fact, I have written a short thing on one of my pages about how, to me, often writing is about transcribing what a voice has been repeating for a while. And so literally hearing someone recite the essay or at least most of the essay or the key paragraph or whatever. And just like an earworm, like a musical earworm, I hear the essay repeat itself for days at a time until eventually it’s like, “okay, fine, I’m going to write it down.” And only once it’s been written down, then it’s dispelled.

A: I experience it so similarly, that every idea feels like a demonic possession and it requires a certain incantation to end the possession. And the incantation is like the correct blog post.

G: Yeah. There’s definitely a certain sense where it’s a nagging to-do item which you can’t forget until you have dealt with it properly, and dealing with it properly means having it written out in a proper way.

I like to quote Indiana Jones’s father in the movie: “I wrote it down so I could forget it.” And I am like that: I write down things to forget them.

Audience Q&A

Modalities & Comparative Advantage

A: Yeah. Would you be willing to take questions?

G: Yeah, absolutely. Yeah. I have nowhere particular to go right now, so I can take questions. My voice is holding out.

Audience #1: I like what you were saying earlier about the small set of us who do empirical work online. You asked a lot of great questions about that.

But then you also pointed out that most of us follow, what did you call it, “one form”?

A: Yeah, “one modality”.

#1: One modality. Is there a reason why you feel like you focus so much on self-experiments and don’t do empirical work in other modalities?

G: I think that with other modalities, empirical work is quite a lot of work and effort.

I did quite a few surveys because I could do surveys through the late Google Surveys or through Prolific and that’s easy. I love it. But other kinds of modalities would be huge time investments, like doing wet lab work, just years and years of work and equipment and how am I going to do that at my home?

And it’s like I have so many other things I want to write. I have no shortage of other things to read or write or do. And I’m not pinned down to a specific topic where I just hit a dead end and I have to do a particular kind of demanding work.

So, I have great respect for people doing many kinds of empirical work because I could never sit down and spend 5 years doing field work in India interviewing villagers just to write part of one paper. I have tremendous respect for all the people out there in the real world doing empirical data gathering, which is part of why I feel weird about the self-experiment because it’s like this cheat code to getting all this data and research and it’s like, it shouldn’t make it this easy, right? It shouldn’t make this easy to get interesting results. And I feel like I’m cheating compared to people out there doing so much work for so much more esoteric and smaller data sets and everything like that. So it’s just that anything beyond self-experiment, usually it would require so much work and data that it would crowd out a lot of other stuff I would rather do.

If it’s something like my advertising A/B tests, where I could just leave them to run in the background for a year or two and get millions, then that’s one thing. But beyond that, I don’t want to spend the time to do more elaborate involved things because I could be doing poems, I could be doing short stories, I could be coming up with new theories of cat psychology, or I could just be watching anime. There’s so many things I could be doing and I’m not forced to do that particular kind of empirical work.

Great respect to people who do. What would I read if other people weren’t doing all that, right? What would I read?

But I, myself, just don’t have comparative advantage there. I guess that’s the simplest answer, just comparative advantage. It’s like I was good at self-experiments and cheap for me to do them, and so I did. And other people for some reason don’t, so it must be expensive for them by revealed preferences.

Publishing Thresholds

Audience #2: How do you decide which one of your many ideas to work on and develop and publish?

G: I am lazy, so it’s more or less whichever one is annoying me the most. Whichever one is like, the voice is annoying me the most and I feel, “Okay, yeah, I’m going to write this one.” It’s haphazard and ad hoc.

I make absolutely no claim to being a good role model, in this sense. I just hope that you are satisfied with my writings in the long run. And if not, screw you, I wasn’t writing for you in the first place.

Audience #4: You were mentioning that for the ideas, it’s whatever’s most annoying at the time. Is it similar for making edits, or is that a different kind of process?

G: Editing does tend to be much more of an ad hoc process where once in a while something would just pop into my head and I was like, “Oh, I can add this to that essay”, and then I’ll go and add it. Or I’ll be reading something and say, “Ah, yeah, this would be an interesting addition to this essay here.”

That’s all just ad hoc as I go. Every day I will add a couple links to various essays or whatnot. And sometimes I’ll just be reading something and go, “Oh, that’s badly worded.” And I would edit it then and there.

The way my site is set up, I often find myself scrolling through old essays or annotations, I’ll be searching for things and I’ll notice something and then I’ll edit that then and there. But that’s just a much more ordinary, grindy, boring, day-to-day, in and out—a few words here, a few words there, but eventually it often adds up to quite a bit.

But there’s nothing special or interesting about that process. You just have to do it when you think of it because otherwise you’ll forget about it.

Compare it to the big bang writing where I dump out an entire essay in a one-hour long writing session—totally different process. It’s just the ordinary, day-to-day, grinding, maintenance, and dumping in something new. Grinding, grinding, grinding, maintenance, dumping something new.

That’s boring and uninteresting, but you have to do it because it’s critical for quality control and making sure the spelling is all correct and the links aren’t broken, and you keep it up to date with new research.

Wikis Vs Blogs

Audience #3: I noticed when your ideas do come out, they come out especially well-formed and complete, with what seems to me as a reader, all the thinking and hedging and footnotes and everything put out there—comprehensively researched and fully formed, which I compared to another mode of blogging which is just like, here’s a half-formed kind of newsletter-y update about this thing developing.

I’m curious, when do you set the threshold for good enough to put up? Because it seems like—

G: I generally push everything to my website as soon as I have it. I generally do not have drafts which are languishing in some separate folder from the website.

It’s just that everything is on the website as drafts. It’s not even a question that comes up, which is one of the reasons why I do my website: I don’t want the constricting blog binary where something is either posted or not posted.

I just don’t find that acceptable for my own writing because often I’m not sure if something is done or it’s not done or it’s only half done. I’m still waiting for the crystallization to happen. I’m still looking for what the universal principle is for this collection of links.

And so to me, blogging just does not work for my kind of writing at all.

#3: I’m mostly seeing the ones that you’ve spent many years editing and updating.

G: I do often keep going back and tweaking, and sometimes I will dump in a whole new section which just suddenly crystallized and then the voice finished it. But it’s often, I often don’t even know if something is done until 10 years later where I go, “Well, I have not edited this meaningfully in a long time. I guess it’s done.”

I just don’t find the discreteness of published, not published to be helpful for what I’m doing.

And I feel like this shows in the thing with Scott Alexander’s blogging where often he’s just giving this small update on something he’s touched on repeatedly, and it doesn’t make much sense to you unless you read all the previous installments because it’s not done. He just has to push “publish” because that’s the only thing Substack or WordPress lets him do.

It’s just that they’re building in this hardwired assumption that things are complete in discrete units, which is true with newspapers, right? With a newspaper, you publish something today and that’s today’s paper and you had no choice but to do things that way.

But in other kinds of writing, that’s not true. There’s not necessarily any point at which something is done. They’re more like wikis: just lots and lots of tiny little edits, and sometimes it’ll be done and maybe sometimes you look back and say, “Okay, yeah I guess it’s done.”

I don’t think of things as being published or not published, or finished or not finished, because that’s just kind of imposing frozen categories which don’t have to apply to my writing unless I fall for the blog mind trick where I think, “Oh, I have to push out a discrete finished unit every day.” You don’t have to.

A: Some have to.

G: No, you are free—

Audience #7: We do have to.

G: —to leave Inkhaven at any time!

#7: True, but it’s quite the—

G: [grinning] Read your agreement.

You are all free to leave at any time, don’t let anyone tell you that you’re not!

#7: It’s true, we need radical freedom.

#1: No, but you were right about finishing. This is part of why the month is a good practice because you were totally right, most people have trouble finishing. We finish 500 word things.

G: Yes, but okay, but also, but seriously, just the constraint here is also good for breaking your sense of perfectionism and belief that something has to be finished and published, because you’re being forced to spit out 500 words every day.

For some of you, I feel like that’s helpful to convince you that publishing is not important.

#1: Oh.

G: Treat it as essentially an arbitrary requirement, which you have been mind-tricked into by school and by the prestige of publishing into thinking, “Oh yes, publishing is a Big Deal. Oh, I just published my blog.”

“Oh, should I attend your launch party in New York City? Will there be little aperitifs in the little wine glasses?”

#3: I feel like there’s a finishing versus not finishing and a sharing versus not sharing thing. And you’re pro sharing, and finishing is not super important to you unless it’s a study of something.

G: Yes, I’m pro-sharing. I feel like everyone should just have a wiki and push writing to it as they go.

For most people, blogging is not a great paradigm for them. Unless you’re writing about something time-bound, like the news, a daily log, live reactions, predictions—the ‘blog’ format is an awkward constraint. It bakes in a discreteness and monetizable regularity that a lot of longer-term thinking doesn’t fit.

That’s kind of my feeling about it, why I’m here as “a blogger”, but I don’t write “a blog”. It’s not “reverse chronologically sorted, regular updates”, short things. It’s like a wiki or a home page.

#3: I have a last question on this. If you think back to someday when you had to decide whether to promote your work beyond just sharing it, what was that like? You were like, “Oh, this is ready to send it to people, or share on the forum”, and so on?

G: Yeah, so often it’s that I want to share it with a specific person. Often I’ll have a specific person for a proposal or an idea or a writeup and then I’ll say, “Okay, made a final push to polish it and finish it. Let’s get some more feedback on it.” And then I’ll say, okay, that’s when I would want to submit it to Hacker News or whatnot.

I don’t think of that as “finishing”, it’s just the next step. I’ll get more comments on Hacker News and then I’ll look at them and maybe I’ll fix things like that. But the point is I now have something to send someone and that would do whatever I intended it to do. So that’s kind of how I think of that.

It’s like even when something’s published, it’s not published, that’s just the next stage. Like open source software: sure, you have versioning, but it’s not a big deal, right? Today you cut another release, get some bug reports, add more patches, eventually you’ll cut another release, get a bug report, patch some more, cut another release… And there’s no Big Bang where it’s published and now it cannot be touched anymore forever.

A: Gwern, thanks so much for chatting. Thanks guys for being here.

G: Okay. I think I need a snack!

LLM Followup Questions

Attached is an interview I did. Please read and summarize it, and then come up with followup questions which will be interesting and informative. Aim for at least 20 total questions. For each question, imagine 3 possible answers I, Gwern Branwen, might give, and what your followup question would be; score them by how interesting or informative they are, and use that to pick the top 5 questions.

Gemini-3-pro-preview: You cited “Commoditize Your Complement” as an example of teaching a universal principle through brute-force examples.

If we apply that model to Gwern.net itself: your content is the commodity (free, open). What is the complement you are trying to commoditize (or destroy), and what is the monopoly you are trying to build?

G: Good question—Tyler Cowen would approve of it.

My infrastructure work is focused on fulltexts, verification of citations, access to raw data, reproducible code, proper rendering of complex material like poetry, archiving and linkrot, correctness and quality control, speed, and catering to powerusers. Simply put, I try to read, and let others read, what has already been recorded.

So, I guess what I try to commoditize is information, and my complement is analysis—or perhaps more precisely, “synthesis” or “sensemaking”. Where information is abundant, attention and understanding must become scarce. The “insight porn” tag is not as much of a joke as it seems.

In the long run, I am more of a point of view embodied by LLMs, perhaps.

Claude-4.5-opus: You advocate wikis strongly but acknowledge blogs dominate. What’s stopping wikis from being more common than blogs?

G: Most people don’t do the kind of thinking that needs wikis. They don’t need “tools for thought”. They want to be on top of things, not on the bottom (to borrow a phrase from Donald Knuth).

Wikis also have a nasty problem in terms of following updates. No one has ever invented a good way to “read updates to a wiki”. A list of raw edits is hopeless—just browse Special:Recent Changes on English Wikipedia some time! We need some sort of hierarchy of summaries, I think.


An extensive bibliography for this interview can be found on the canonical page for this interview at Gwern.net.

  1. ^

    I was voted the top coach in both surveys of Inkhaveners; and I was described favorably by some participants (eg. 1, 2/3, 4) although some described me as being very tough in feedback, sometimes unhelpfully so.

  2. ^

    One tool I’ve created to try to help with this is seriate.py: a tool to take a dump of notes, and reorder it in a logical flow before any editing or rewrites. The most aversive part of getting started, it seems to me, is when you have a dump of chaotic notes; and these days, much of the rest can be farmed out to an LLM.



Discuss