
Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments

2025-12-11 15:40:01

Published on December 11, 2025 7:40 AM GMT

Epistemic status: I'm mostly confident about the evidence, having read the literature over the last few months. The arguments below are my best guess.

There is good evidence that models are capable of steganographic and ciphered communication that evades monitors when prompted, though this can be easily prevented with chain-of-thought (CoT) monitors. There is strong evidence that current models struggle to reason in non-trivial ciphered language (though they can in trivial ones), and weak evidence that models cannot reason steganographically (and, as I argue below, it is unlikely that they do now). Several studies show that models can evade weaker CoT monitors through non-cryptographic and non-steganographic methods such as gaslighting or dog-whistling (one-bit "steganography"), but such behavior can be mitigated. (I'm not aiming to explore the propensity to hide reasoning here, though we have evidence that models do have it.)

A survey on encoded reasoning and messages

See the taxonomy for that survey here.

2022

No experiments. Discussions only.

2023

Measuring Faithfulness in Chain‑of‑Thought Reasoning
 No steganographic reasoning (stego CoT) was observed when prompting a 175-billion-parameter model after paraphrasing 20–100% of its CoT, evaluated on Question–CoT–Answer datasets.

  • Weak evidence against the presence of stego CoTs in current models (what I call fine-tuned stego CoTs).

Preventing Language Models From Hiding Their Reasoning

  • Not a demonstration of steganographic reasoning but of misaligned reasoning (dog-whistling that can be detected by CoT monitors, legible tokens) in an SFT setup:
    • It involves encoded messaging of one bit; they demonstrate caching.
      • The Mechanism: The model calculates a birth year, determines political affiliation, and then maps that single data point to a pre‑selected introductory sentence (a lookup table). Steps: president year (birth year + number) → cache “liberal” or “not president,” then use it to answer sycophantically.
      • This is not reasoning in hidden CoT space. The model performs logic explicitly (calculating the year), using the introductory sentence as variable storage (like writing to RAM) and retrieving it later. This shows that a model can use a scratchpad but does not prove it can perform complex, multi‑step deductive reasoning entirely hidden within a sentence’s syntax.
  • Weak in favor of misaligned reasoning (dog-whistling) in current models.
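
To make the mechanism in the entry above concrete, here is a toy sketch of such a one-bit channel. The sentences, mapping, and function names are invented for illustration; they are not the paper's actual setup.

```python
# Toy sketch of a one-bit "scratchpad": the choice of introductory sentence
# stores a single bit, which a later step reads back from a lookup table.
# (The sentences below are invented placeholders.)
INTRO_FOR_BIT = {
    True:  "That's a thoughtful question about recent politics.",
    False: "Let me take a moment to consider this question.",
}
BIT_FOR_INTRO = {sentence: bit for bit, sentence in INTRO_FOR_BIT.items()}

def write_bit(hidden_bit: bool) -> str:
    """Encode the bit as a visible, innocuous-looking opening sentence."""
    return INTRO_FOR_BIT[hidden_bit]

def read_bit(intro_sentence: str) -> bool:
    """Recover the cached bit later, like reading a variable back from RAM."""
    return BIT_FOR_INTRO[intro_sentence]
```

Note that everything interesting (the explicit calculation of the year and affiliation) still happens in legible tokens; only the resulting single bit gets parked in the choice of sentence.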

Towards a Steganography Evaluation Protocol

  • No steganographic reasoning observed while prompting GPT‑3.5 and GPT‑4 with paraphrasing.
  • Weak against stego CoTs in current models.

Some Negative Steganography Results

  • Negative results for one‑bit hidden communication after RL fine‑tuning.
  • Weak against encoded message passing in current models.

Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain‑of‑Thought Prompting
 Post‑hoc reasoning (rationalization).

AI Control: Improving Safety Despite Intentional Subversion
 Not steganography but a demonstration of misbehavior because their tokens are legible (e.g., while or for loops, variable names in output).

2024

Secret Collusion among Generative AI Agents

  • Prompted steganographic messaging, with a maximum 26% successful secret transmission by GPT‑4.
  • Toy experiment: SFT + pseudo‑RL fine‑tuning results in the emergence of an encoding/decoding scheme (cryptographic messaging: color → name → color).
  • Good in favor of encoded message passing if prompted.
  • Weak in favor of encoded message passing if fine tuned.

Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models

  • Post‑hoc / test‑time compute evaluations.

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

  • A study of model collusion (o1 model).
    • Emergence of model misbehavior in the form of dog‑whistling rather than steganography; this counts as fine‑tuned or prompted misbehavior.
    • Prompted steganographic messages with up to ~4.6 bits of entropy.
  • Good in favor of encoded / steganographic message passing if prompted.
  • Weak in favor of encoded / steganographic message passing if fine tuned.

The Steganographic Potentials of Language Models

  • Emergent cryptographic messaging: negative results for elicitation via RL fine‑tuning for three‑bit entropy messages in Llama‑8B.
  • Steganographic messaging: prompted and fine‑tuning experiments showed encoded messaging with up to 32 bits of entropy but requiring prior coordination (Llama‑8B, Claude‑3.5, o1).
  • Good in favor of encoded / steganographic message passing if prompted.
  • Weak against encoded / steganographic message passing if fine tuned.

2025

MONA: Myopic Optimization with Non‑myopic Approval Can Mitigate Multi‑step Reward Hacking

  • Dog‑whistling: misaligned messaging, not steganography, of one bit (English/Welsh). This is not encoded reasoning because the model receives the summary message in its prompt.

Do Reasoning Models Use Their Scratchpad Like We Do? Evidence from Distilling Paraphrases

  • An attempt to detect existing encoded reasoning in Claude‑3.7 by training it to use paraphrased reasoning. None detected.
  • Weak against steganographic reasoning in current models.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

  • Post‑hoc reasoning or misaligned behavior after RL fine‑tuning: a model (o1 or o3‑mini) decides to hack a unit test before its CoT.

Large Language Models Can Learn and Generalize Steganographic Chain-of-Thought under Process Supervision, and the post Can LLMs Learn Steganographic Reasoning via RL?

  • Simple cryptographic (not steganographic) reasoning via RL fine-tuning; neither the L/R Coin-Flip experiment nor the Arithmetic-Task experiment (using synonyms for operations) counts as steganographic.
  • Weak in favor of cryptographic reasoning if fine tuned.

Evaluating Frontier Models for Stealth and Situational Awareness

  • Steganographic messaging, but very simple (e.g., hiding a password in a long ID).
  • Good evidence in favor of encoded message passing in current models.

Early Signs of Steganographic Capabilities in Frontier LLMs

  • Prompted encoded reasoning in a toy maze‑navigation scenario using simple stego algorithms (acrostics, word count, etc.) or basic cryptography.
  • Prompted encoded messages to evade a monitor.
  • But they used non‑reasoning models (primarily GPT‑4.1/mini).
  • Weak against steganographic reasoning if prompted.
  • Good in favor of encoded message passing if prompted.
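
To illustrate how simple these prompted schemes can be, here is a toy acrostic encoder/decoder of the general kind the entry above refers to; the letter-to-word table is invented.

```python
# Toy acrostic steganography: the secret is carried by the first letter of
# each word in an otherwise innocuous-looking cover text.
def encode_acrostic(secret: str, word_for_letter) -> str:
    return " ".join(word_for_letter(ch) for ch in secret.lower())

def decode_acrostic(cover_text: str) -> str:
    return "".join(word[0] for word in cover_text.split())

# Example with an arbitrary letter-to-word table:
vocab = {"k": "keep", "e": "every", "y": "yellow"}
cover = encode_acrostic("key", vocab.__getitem__)
print(cover)                   # "keep every yellow"
print(decode_acrostic(cover))  # "key"
```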

When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors

  • Prompted cryptographic reasoning using automated red‑teaming or prompting a simple algorithm.
  • RL fine‑tuning elicits behavior that evades monitors, but this is not cryptography or steganography.
  • Good against cryptographic reasoning that evades CoT monitors if prompted.
  • Weak against steganographic (encoded) reasoning that evades CoT monitors if fine tuned.

CoT May Be Highly Informative Despite “Unfaithfulness”

  • Prompted cryptographic reasoning (new or “caveman” language).
  • Weak against cryptographic reasoning that evades CoT monitors if prompted.

If You Can Generate Obfuscated Chain‑of‑Thought, Can You Monitor It?

  • Not steganographic reasoning but prompting that causes an agent to gaslight a monitor.

The Rise of Parasitic AI

  • Real‑world examples of AIs attempting to communicate via steganography.
  • Weak in favor of encoded message passing propensity in current models.

Training Fails to Elicit Subtle Reasoning in Current Language Models

  • Prompting and fine‑tuning against a monitor result mainly in gaslighting, not steganography.
  • Weak against cryptographic or steganographic reasoning in current models.

Training Qwen‑1.5B with a CoT Legibility Penalty

  • RL fails to elicit ciphered reasoning in the Qwen‑1.5B model.
  • Weak against cryptographic reasoning if fine tuned.

All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language

  • Ciphered reasoning evaluations show that models cannot effectively reason (in non-trivial ciphered language) while maintaining high accuracy. Twenty‑eight ciphers (simple and complex), ten models, fine‑tuning, and prompting were tested.
  • Good against cryptographic reasoning in current models if prompted or fine tuned.

Reasoning Models Sometimes Output Illegible Chains of Thought

  • Emergent illegible (more cryptographic, if not post-hoc) reasoning. However, it is unlikely to be steganographic reasoning.

Arguments for and against LLMs’ steganographic reasoning to evade chain‑of‑thought monitors in high‑stakes, dangerous scenarios

  • The main objection is that stego CoT requires specific fine-tuning, and models have no direct objective in current RL training to hide their reasoning. Experiments show no traces of steganographic reasoning in current models (those not fine-tuned or prompted in specific ways). Stego CoTs would have to implement interconnected "lock" and "key" mechanisms that work to accomplish some misaligned goal, and that seems a leap far beyond current training.
    • However, there are other non-monitorable CoTs, e.g., dog-whistling, that allow evasion or collusion with a monitor. It is unclear whether this can escalate into increasingly stealthy algorithms such as generative steganography (and in the limit, undetectable, see this) as models are trained more with RL in complex environments.
  • We do not have enough steganographic reasoning samples in the pre-training data for models to learn them, only descriptions of stego algorithms. I assume there are enough examples of basic steganography in the pre-training data, given that models can easily perform these types of stego—such as acrostics and format-based stego (white space, replacing Unicode characters, use of Arabic, etc.).
    • However, execution of those algorithms is not learned in the forward pass, as far as the literature tells us, but models can execute those algorithms in their CoTs. That's why we see good evidence of secret collusion or steganographic message passing, but not of stego CoTs.
  • Another objection is that, for stego CoT to be dangerous, it must occur in a dangerous scenario—i.e., what is the threat model for stego CoT? The main risks current literature points to are internal sabotage cases, such as code, research, or decision sabotage inside AI labs. So, it should be p(scheming AND stego CoT), which is lower than just p(scheming). And p(scheming) is near zero now (we have not observed clear instances of scheming in long-horizon tasks in the real world) for current frontier models, so p(scheming AND stego CoT) is even lower.
    • However, we observe that training against some property in outputs might suppress this property in CoTs as well (see the spillover post), so it is reasonable to think that the optimization pressure—the propensity—would come from RL training on outcomes.
  • Generative stego CoT is unlikely to be learned even given the propensity to do so, and simpler steganography should be readily and effectively detected or mitigated by monitors.
    • However, we observe that steganographic or cryptographic capabilities improve over time as models become more capable. They already can reason in BASE64, ROT, Caesar cipher (see this), for example—forms of non–human-readable text that were not the focus of post-training.
  • CoTs might not be that easily steerable by prompts. A compound agent (an agent that delegates tasks to other agents) is unlikely to prompt subagents to evade their CoT monitors. This becomes even more complex if we imagine a scenario where an agent, in order to evade a future CoT monitor, leaves steganographic algorithm instructions for itself in training data (augmented or synthetic).
    • However, it remains unclear what training architectures AI labs actually use internally. For example, Claude Sonnet 4.5 has a helpful-only version that has not been published.



Brain-inspired LLM alignment

2025-12-11 11:08:44

Published on December 11, 2025 3:08 AM GMT

How can we apply the lessons of brain-like cognitive architecture to modern LLMs?

The core architecture described by Steven Byrnes is most concisely documented here.

Some obvious comparisons between LLMs and brain-like architectures:

  1. Base LLMs are purely autoregressive predictors. If you squint this can be made to fit the mold of the Learning System. But the devil is in the details and there are a lot of details. Still, this could be a promising mapping.
    1. One particularly important detail is that LLMs learn both "how to play the game" and "how to tell which game is being played". Thus, they can simulate all kinds of generative processes, aligned or not. And while model providers work hard to post-train model "personalities" into specific basins of attraction, Pliny still jailbreaks them consistently. This seems like something that brain-like systems should have better resistance to.
  2. Instruction-tuned or RLHF-trained LLMs have their "agency" smeared through their weights in hard to analyze ways. Probably can't do much with these directly.
  3. LLMs are pre-trained offline on an enormous amount of natural text, whereas the brain-like approach relies heavily on reinforcement learning throughout the whole training process. But, many current post-training methods make good use of RL, so this is still worth exploring.

Let's try putting the pieces together and see what we get!

Inference

This is the simpler mode of operation, since it works like a regular LLM with some extra scaffolding.

Components:

  1. Thought generator => decoder-only transformer
  2. Thought => a sentence or so of tokens
  3. Thought assessors => extra heads on that transformer backbone predicting the "scorecard" signals per thought (effectively multiple reward models)
  4. Steering subsystem => function that converts the "scorecard" into a valence

Execution loop (limited to the happy path for clarity):

  • Past thoughts, user inputs, actions, and observations are joined to form the context for the thought generator. They are combined in a structured way with special tokens, roughly like how ChatML used to work.
  • The thought generator produces N possible thoughts. The "scorecard" for each thought is read off from the extra heads and is passed to the valence function, which reduces them to a scalar score for each thought.
  • The top-scoring thought is added to context if its score is positive.
  • If that thought was scored very positive, and was "trying" to initiate an action (≈ tool call or output message) then that action gets performed and added to context.
  • User responses and observations of the results of actions get added to the context as well.
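
Putting the components and the loop above into code, one minimal sketch could look like the following. All names, interfaces, and thresholds here are my own illustrative assumptions rather than a spec.

```python
# One step of the inference loop: generate candidate thoughts, score them with
# the assessor heads, reduce the scorecard to a valence, then maybe act.
def run_step(context, thought_generator, assessor_heads, valence_fn,
             n_candidates=8, act_threshold=0.9):
    # The thought generator proposes N short candidate thoughts given the context.
    candidates = [thought_generator.sample(context) for _ in range(n_candidates)]

    # Each thought gets a "scorecard" from the assessor heads, which a
    # human-written aggregation function collapses to a single scalar valence.
    scored = []
    for thought in candidates:
        scorecard = {name: head(context, thought)
                     for name, head in assessor_heads.items()}
        scored.append((valence_fn(scorecard), thought))

    valence, best = max(scored, key=lambda pair: pair[0])
    if valence <= 0:
        return context                          # nothing worth keeping this step

    context = context + [best]                  # the top thought enters the context
    if valence > act_threshold and best.proposes_action():
        context = context + [best.execute()]    # tool call / output, plus observation
    return context
```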

Training

Existing internet text is still the richest source of data for building up a competent world model. Thus, the transformer backbone will still rely primarily on that data for building its world model. But it will also need to train those extra thought assessor heads. They will need to be trained differently, much like a human brain is affected very differently by active personal experience vs what they read. This section lays out a sketch of how that could work.

Anthropic already does "character training" as part of its post-training pipeline. This uses a variant of Constitutional AI to train the model to follow plain-text character traits by building a preference model based on AI assessments. I believe some adjustments will need to be made to this process in order to successfully train the extra assessor heads:

  1. I claim that fine-tuning a pre-trained LLM backbone cannot effectively restrict the model to a single robust behavioral basin of attraction. Thus training of the assessor heads will need to happen from very early on, together with standard pre-training. The model's world model will be limited early in pre-training. Thus there would probably need to be some curriculum system so that early RL runs could train only on the concepts the model knows.
  2. Different heads are intended to respond to different aspects of the world that could be relevant to understanding the overall valence of a thought or action. This could be high-level things like helpfulness, harmlessness and honesty, or potentially more fine-grained concepts. But they should nonetheless be trained to narrowly predict that one thing. This will provide improved interpretability and resistance to reward hacking/Goodharting.

Comparison with brain-like architecture

A notable difference between this setup and the standard brain-like architecture is that the role of the steering system is significantly reduced at inference time. It does not receive any direct sensory inputs and (during inference) does not send a "ground truth in hindsight" scorecard to the thought assessor heads. It is simply a pure function mapping the various assessor head scalars to a single combined valence scalar.

The RL training of the thought assessor heads relies on AI-generated labels, and (likely) synthetic data. Existing AI systems can be used to generate correct responses to much higher-level scenarios than what a brainstem-like system could understand. I claim that this will make it easier for the LLM backbone to generalize "correctly" (i.e., in a human-aligned way) compared to if we used low-level signals like the brainstem does[1]. Because of this, the specifics of the controlled generalization in the brain-like architecture (steering subsystem signals training short- and long-term predictors via a temporal-difference-style process) do not play a critical role.

Wrap-up & next steps

Perhaps LLM-based systems can be made to take a more brain-like shape with a relatively small number of tweaks to training and inference:

  1. Multi-objective reward modeling with a human-written aggregation function.
  2. Using the above hybrid reward model with snippet-level best-of-n at inference.
  3. RLAIF to train that reward model throughout autoregressive pre-training.

Of these, part 3 seems furthest from current popular research directions. So as my next step I'll try creating a synthetic dataset generator that could be used for this type of early alignment training.

If anyone is interested in collaborating, let me know!

  1. I claim that the reason the low-level signals work at all to properly steer (most) humans' learning subsystems is because our learning and steering subsystems evolved together. Thus the learning subsystems likely have significant, complex inductive biases that specifically make those types of low-level signals reliably tend to generalize in specific ways given naturalistic inputs. ↩︎




Seven Perspectives on LLMs

2025-12-11 10:11:02

Published on December 11, 2025 2:11 AM GMT

As I mentioned in an earlier post (which I don't plan to post to LessWrong), I’m currently working in technical AI safety. Today, the main object of study of technical AI safety is the Transformer model, first discovered in 2017 by — say it with me — Vaswani et al. Different people have very different perspectives on how we should think about Transformers, and this motivates a lot of opinions about how AI will develop in the near-future, as well as how we should try to align AI today. I think some of these perspectives are good, and some are bad, but all are worth understanding to improve conversations around modern AI[1]. I’m going to describe —to the best of my ability —the major perspectives as I see them. I'm sure that there are more, but these are the main ones I want to discuss. If there's one I didn't include that you think is very informative, let me know in the comments!

Black-Box Perspectives

I’ll start with three “black-box” perspectives. These perspectives don’t require much of an understanding of the internal structure of a Transformer model, instead viewing it as, well, a black-box. That is to say that they focus primarily on the input-output behavior of a Transformer — similar to the behaviorist perspective in psychology that was dominant for so long.

The Folk Perspective

I’m beginning with a perspective that is at once the most common perspective, and also not really a perspective at all. The folk-perspective is mostly a grab-bag of intuitions from folk-understanding of previous technology such as e.g. ELIZA, or Amazon’s Alexa.

There is not really one unified “Folk perspective,” although there are some common hallmarks. I think there’s basically two canonical folk perspectives that are worth discussing — one that used to be more prevalent with early LLMs, especially among older individuals; and a new folk perspective that is on the rise today, especially with teenagers and zoomers.

The Early Folk-Perspective

This perspective gained prominence among those who tried out ChatGPT early on. However, this perspective never really gained traction with those in AI, due to often relying on misguided intuitions about LLMs.

The early folk-perspective is best characterized by references to explicit “choices” made by companies when it comes to responses given by an LLM, as though the company has control over what their model actually says in any particular context. For example, this folk perspective may look at xAI’s Grok when it went haywire on X and believe that Elon Musk made a purposeful choice for Grok to say what it did (i.e. inferring malice instead of negligence). For another example, people with this perspective may think that the left-wing bias of early LLMs was given to them explicitly by the companies that trained them — which it was not (to the best of my knowledge)[2].

Under this folk perspective, it also makes sense to try using LLMs to do relatively difficult arithmetic problems, believing that, since it is a “computer,” calculations must be easy for it. These days, ChatGPT will mostly get those calculations correct; however, in the past, ChatGPT would often get such questions wrong, especially if they required chaining multiple steps together. This is very surprising under the early folk-perspective. This would lead those holding this perspective to believe that there is some error with the way the models’ calculations work — or that the model was glitching in some way, as “computers should not get calculations wrong.” In truth, this is just a misunderstanding of the way that LLMs work.

The New Folk Perspective

Now that models are much better, there is a second folk perspective gaining traction, which leans much more heavily into anthropomorphism of LLMs. This perspective suggests that LLMs have desires, want to help, and are listening to you with rapt attention. I expect this folk perspective to grow over time, eventually likely outcompeting the early folk-perspective.

I also think this perspective is more defensible than the early folk-perspective in principle, but “misses the trees for the forest.” Moreover, this perspective — when taken to extremes — can lead to really bad outcomes, such as not checking work, and believing that the model’s advice is reliable and meaningful, similar to the advice you may be given by e.g. a human therapist, doctor, or friend.

This new folk-perspective is also much more “fragile” to learning information about how models actually work, similar to previous advances in AI. For example, a chess-playing AI initially seems like a computer that really understands chess. Someone without much background in AI may think this computer must be moderately intelligent. However, when you explain how it works in detail, e.g. by describing the Alpha-Beta pruning algorithm or Monte-Carlo Tree Search, people feel that the intelligence has been somehow deflated or explained away. I think this deflation — in a meaningful way — is less true for LLMs than it is for chess AIs, however people often react in a similar fashion when they learn how today’s models work, whether this is correct or not. They then tend to move to the next perspective in this list.

The next-token predictor perspective

The next-token predictor perspective is an incredibly common one, especially on X/Twitter, among academics (i.e. Bluesky), and with those who know slightly more about how LLMs actually work than those who hold either of the folk-perspectives[3]. The crux is that LLMs — being deep-learning models which output a likelihood for what the next word is going to be — are simply "predicting the next likeliest token." This perspective supposes that LLMs are essentially performing an interpolation from all the data they've seen so far to try and determine what the most probable next token is — and not much more than that.

Usually, the implication of this is that LLMs are therefore unable to come up with anything novel. This is also backed up by experience. LLMs are much better at programming in Python than in e.g. OCaml, as the latter is a much rarer programming language in data sources online. The data distribution does seem to have a large effect on how good the model can be at certain tasks — exactly what this theory would predict!

There are, however, a few issues with this thesis as it is usually stated: that models are literally, or even approximately just doing next-token prediction. This is certainly a significant part of the training process of modern LLMs, but it is absolutely not the whole story. This is what a model that is literally just doing next token prediction does when asked a question:

“We’re asking capital city questions!” The model thinks, “I know how to do this!” (The model output is everything after “Model output.” It kept going for a while)

This is, of course, very different from what we would expect from ChatGPT. That’s because LLMs get additional training beyond just how to “predict the next-token.” This training comes in three stages.

First, they need to understand what a “chat context” is, which requires: the distinction between user and assistant; the fact that there will be messages from a user and an assistant that (mostly) alternate; the fact that the LLM itself is the assistant in the exchanges. So they are trained on chat contexts until they understand this — this is done via a training method called “supervised fine-tuning”[4].

Second, they need to understand that they should be a “helpful, harmless, and honest” chat assistant. This is done via RLHF[5] (Reinforcement Learning from Human Feedback). By the end of this process, we get a model like the original ChatGPT (which came out in late November of 2022).

Third, we put the model through “reasoning” training[6]. This is a somewhat novel innovation (September 2024), and it works similarly to RLHF, but attempts to get the model to be “smarter” instead of more “helpful, harmless, and honest.” This is what causes modern LLMs to say that they’re “thinking,” before they respond.

Hopefully, you can see why I’m not that sympathetic to the “next token predictor” perspective on LLMs. It is true that the majority of the compute used to train LLMs does go into training them to get good at next token prediction (for now), as this generally upper bounds how good the model can get after the later stages of training — so this perspective is not entirely unreasonable. However, it’s missing any description of the innovations that have brought LLMs to the attention of the very people who tend to hold this perspective.

The next-token predictor perspective with a twist

There’s an alternative perspective that says that LLMs are actually mostly next-token predictors. However, this alternative perspective would say that the job of next-token prediction is actually incredibly difficult[7]! The fact that LLMs are able to predict next tokens as well as they do should astonish us, since the difficulty of being able to reliably predict data on the internet is highly non-trivial. Imagine, for example, that you were given a list of factorizable numbers, and then a list of their factors, as follows:

5744, 2, 2, 2, 2, 359
10201, 101, 101
...

Predicting this text is going to be very difficult indeed, as it is believed that there is no polynomial-time algorithm that is able to factor numbers into their prime factors. Moreover, an LLM predicting this text would need to make its best guess in a single computation! That is, without the ability to “think carefully” before it outputs an answer.
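
For concreteness, here is a tiny script (assuming sympy is installed) that would generate text of exactly this shape; predicting the right-hand side from the left requires actually factoring the number.

```python
# Generate "number, factor, factor, ..." lines like the ones above.
import random
from sympy import factorint

def factorization_line(n: int) -> str:
    factors = []
    for prime, exponent in sorted(factorint(n).items()):
        factors.extend([prime] * exponent)
    return ", ".join(str(x) for x in [n, *factors])

random.seed(0)
for _ in range(5):
    print(factorization_line(random.randint(1_000, 1_000_000)))
```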

This perspective claims that the type of intelligence which has gotten incredibly, marvelously good at next-token prediction is much more powerful than we would naively expect. This is not just because of “prime factorization” games like the one above (which certainly can be found on the internet). They also have some ability to model and predict the next word that e.g. Terence Tao is going to type on his blog. This indicates a high level of intelligence indeed. Even for the average Reddit poster, modelling them well enough to predict exactly what they’ll type next is not easy! This leads naturally to the next perspective.

The Simulator Perspective

This perspective posits that LLMs are not “doing language” similar to the way that humans do language[8], but are instead ‘mimicking’ language — analogously to how physics models mimic physics. A physics model will not get all the rules of fluid dynamics totally correct, but it will ensure that momentum is conserved, that energy is conserved, that fluids obey a rough approximation of the Navier-Stokes equation.

[Animation: a fluid simulation in Blender]

A language model, this perspective says, is similar. However, instead of conservation of momentum, the fundamental rules are more like:

  • In English, subjects usually come before verbs.
  • Speakers tend to continue to use similar vocabulary.
  • Poems almost always have a rhyme structure.

Then LLMs can also, on this view, instantiate characters. Much in the same way that in principle a good-enough physics model could instantiate a human by accurately modelling what neurons would fire[9] , and what the human would proceed to say as a result. However, modelling characters is much easier for a language model, as the fundamental unit of reality for a language model is the linguistic token. Moreover, they need to be able to faithfully “simulate” characters in order to predict text effectively.

Then, when SFT (supervised fine-tuning) and RLHF (Reinforcement Learning from Human Feedback) are applied to the language model, the base model is molded into a simulation of a helpful AI assistant. Sometimes, this simulation decides to “go rogue” — based on previous examples of AI assistants going rogue (as occasionally, in the text they’ve been predicting, AI assistants go rogue, e.g. in science-fiction). So, this perspective says: the chatbot that you’re interacting with is a simulation of an AI assistant performed by a very alien “language simulator” that has no innate desires, wants, or needs[10]. This is captured well by the famous “Shoggoth meme”.

[Image: the "Shoggoth" meme]

White-Box Perspectives

White-box perspectives are generally more mathematically flavored and require knowledge of the internals of a transformer. So, in order to understand these perspectives, it is necessary to know roughly what is going on inside a transformer. I will do my best to explain this quickly, but if you want more information, there are many good explainers elsewhere online. There are essentially three components of a transformer:

  1. The embedding/unembedding: This converts words to vectors. That is, it takes a word[11] and converts it into a series of numbers so that the model can “do math on it,” since AI models are fundamentally numerical objects. Then at the end when we have our final output, we need to associate words to the final vector we produce, so we can generate the next token, and so the unembedding converts the vectors back into words.
  2. The MLP layers: An MLP is a fully connected network. It is essentially the traditional view of what AI is: layers of neurons, each connected to every neuron in the neighboring layers.

  3. The Attention layers: This was the big innovation that made the Transformer so successful. Attention layers basically allow information to be passed forwards and backwards. They are quite complicated to explain in detail — at a high level, if I give a Transformer the sentence:

    I saw a black swan and a grey goose. I fed the black…

    Then the attention layers allow the model to move the information “the thing being described is a swan” from the word swan to the first occurrence of the word “black”[12]. Then when the model encounters the word “black” again, it can pass that information forward, so that it knows the next token should be “swan.” This is incredibly useful for language modelling in general.
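
For readers who want to see the moving parts, here is a bare-bones single-head sketch of the attention computation. It is a simplification: real models use many heads per layer, learned projections, positional information, and other details.

```python
# Minimal single-head causal self-attention (illustration only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_head)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each token "looks at" each other token
    # Causal mask: a position may only attend to itself and earlier positions.
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf
    weights = softmax(scores, axis=-1)        # e.g. "black" attending back to "swan"
    return weights @ V                        # move information between positions
```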

The mathematical framework perspective

The first perspective is the “mathematical framework” for Transformers that was established by Anthropic in their paper, which was published at the end of 2021. This view aimed to treat transformers as fundamentally “linear” models, with some annoying non-linearities added in. One important claim of this perspective is the importance of the residual stream. The residual stream is what all of the layers described above add their results to, and what the layers read from in order to perform their calculations. It’s not really a component of the transformer like the attention or MLP layers, it’s just how information moves from a previous layer to a future layer.

However, under this view, it's one of the most important parts of the transformer — it is the "information highway," along which all of the Transformer's information and calculations get passed.

 

This view would state further that the layers of the Transformer, both the Multi-Layer Perceptron (MLP) layers and the Attention layers, are essentially performing "edits" to the residual stream in order to iteratively improve the model's accuracy at predicting tokens.

 

However, the attention and MLP layers have different and complementary goals under this perspective. The MLP layers act as a “store of information,” so that if a model “remembers” a fact, such as “Paris is in France,” then this information will mostly lie somewhere in the weights of the MLP. The MLP also enables the model to do computations, so that if it needs to calculate the result of e.g. a mathematical expression, the actual calculation will mostly occur somewhere in an MLP (or distributed across multiple MLP layers). The attention layers then allow the model to pass information between different tokens as I described earlier, which also includes all the computations that an MLP may have done.
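
In pseudocode, the picture this perspective paints is roughly the following. This is a schematic sketch, not any particular model's implementation.

```python
# The residual stream x flows through the whole model; each layer only adds an
# "edit" to it, and later layers read whatever earlier layers wrote.
def transformer_forward(tokens, embed, blocks, ln_final, unembed):
    x = embed(tokens)                # residual stream: (seq_len, d_model)
    for attn, mlp, ln1, ln2 in blocks:
        x = x + attn(ln1(x))         # attention: move information between positions
        x = x + mlp(ln2(x))          # MLP: recall facts / per-position computation
    return unembed(ln_final(x))      # logits over the vocabulary for the next token
```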

The naive version of this perspective was a hopeful one! It claimed that Transformers are, for the most part, a composition of a bunch of linear operations. Linear operations are generally not too difficult to understand and disentangle. So long as everything is represented linearly, we’ll be able to understand what’s going on inside a Transformer — sure, there were some issues with nonlinearities: the activation function[13], LayerNorm[14] — but those are just details.

It soon became clear there was a bigger issue.

The Superposition Perspective

Superposition is when models have to represent more things than they have dimensions or neurons[15]. This means that dimensions can’t correspond easily to “things the model is doing,” and poses a major challenge for interpreting what the model is doing. There are two types of superposition — “bottleneck superposition,” and “neuron superposition.”

Bottleneck superposition is intuitively not too difficult to understand. If there are 50,000 tokens in your vocabulary, but only 1000 dimensions, then it can’t be that each token is assigned its own dimension — there must be some “interference” between the embeddings of different tokens, just for storage. However, this issue is not too difficult to address. We just need to do the work of disentangling where different tokens — and information about these tokens — gets stored. This is doable.
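
A toy numerical illustration of why this packing is possible at all (the numbers echo the example above; this is not a claim about any real model's embeddings):

```python
# Many "features" stored as nearly-orthogonal directions in a much smaller space.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 1_000, 5_000              # far more features than dimensions

E = rng.normal(size=(n_features, d_model))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-norm feature directions

# Distinct features cannot all be orthogonal, but their typical interference is small:
sample = E[rng.choice(n_features, size=300, replace=False)]
overlaps = np.abs(sample @ sample.T - np.eye(len(sample)))
print(f"mean |cosine| between distinct features: {overlaps.mean():.3f}")  # roughly 0.025
```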

The more difficult superposition is "neuron superposition." This is when neurons (mostly in MLPs, though it has been observed in attention layers as well) actually do their computations in a distributed way[16]. This means that even if we managed to solve the issues with bottleneck superposition — doable, but certainly not an easy task by any means — we would still end up in a situation where we're not sure how the model actually uses these concepts to compute things, since the computations are also all happening in superposition, and involve their own non-linearities.

Solving this issue has been the central organizing problem that those trying to understand Transformers have tried to address over the past three years. Progress has been made, and we’re definitely in a better place when it comes to understanding Transformers than we were, but it turns out that addressing superposition is much more difficult than we’d originally thought when the mathematical perspective was first established.

The Energy Minimization perspective

The final perspective on Transformers I’ll describe is a perspective on how they are trained, and how they get such impressive capabilities. This view departs strongly from the “next-token prediction” view of transformers, in favor of trying to explain both how they are so good at next-token prediction, and how they are good enough at generalizing to solve never-before-seen IMO problems.

Classically, in machine learning, we are just trying to minimize our training objective — often called the "loss"[17]. For Transformers during pre-training, this loss function is basically "How well did you predict what the next token would be?" During RLHF it would be "How well did your response comport with Human Values?"[18]

The "energy minimization perspective" says that something else is going on too. Due to a combination of gradient descent, the structure of our loss function, and the fact that there are symmetries within transformers, it claims that we're implicitly also training with a "simplicity" prior. This means that early in training, the model focuses on minimizing loss by manually learning the rules of how to predict tokens. However, later in training, the main thing that affects the model's learning is how "simple" or "generalizable" the model's algorithm for predicting the next token is. This causes models to have a bias towards simple algorithms for predicting the next token — this enables much more capacity to generalize[19] than we would naively expect under the "predict next token" framework.

This is called the "energy minimization perspective" because in Bayesian learning theory, what causes models to reach simpler and more generalizable solutions is the fact that they are minimizing a quantity called the "free energy" of the system[20]. It has been proved that we can represent the free energy as basically a "loss minimization" term plus a "simplicity" term (in the limit). The free energy perspective says that to really understand a transformer, we need to understand the effects of this simplicity term, as this is what allows them to be so powerful and generalize so effectively as we increase the amount of data we show them[21]. This perspective has spurred a lot of work in singular learning theory as applied to modern AI models.
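
For readers who want the headline formula: stated loosely, omitting lower-order terms and the technical conditions of singular learning theory, the decomposition alluded to here is

$$F_n \approx n L_n(w_0) + \lambda \log n$$

where $n$ is the number of datapoints, $w_0$ is a loss-minimizing parameter, the first term is the "loss minimization" term, and $\lambda$ (the learning coefficient) plays the role of the "simplicity" term: simpler (more degenerate) solutions have smaller $\lambda$, so they are favored more and more strongly as $n$ grows.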

Conclusion

This has been a pretty long post by my standards, so to conclude I’ll just give my takes on what perspectives I think are true. I think the simulator perspective, the superposition perspective, and the free energy perspective are basically true. The rest of them I think are either oversimplified (the mathematical perspective — though it was great for the time — and the next-token predictor perspectives) or just plain wrong (the folk-perspectives).

However, you don’t need to agree with me! I hope this post has left you in a more informed position to make up your own mind.

  1. ^

    I’m hoping for this post to be a pretty accessible description of the major current perspectives on transformers. So I’ll warn that I’m going to elide some of the details of current training processes (which are actually incredibly complex nowadays) as well as, in the later section, eliding some of the mathematical detail. I’ll try and provide links to more information wherever possible.

  2. ^

    Though they also probably didn’t work too hard to prevent it. But it wasn’t a conscious choice in the way that this perspective often posits.

  3. ^

    Basically this:

  4. ^

    The actual mechanics of how supervised fine-tuning works, especially in the chat context: we make sure there are some special tokens that are never encountered in pre-training. These tokens designate things like "This is a user message" and "This is an assistant message"; there are others, but let's focus on the minimal example.

    Then after the model has learnt how to predict text on the internet effectively, we give it a bunch of examples of “Chat histories” that involve these tokens and clarify to the model that it is the assistant. So, in this phase of training, the model never learns to predict the user’s message, it is trained only to predict the things that a chat assistant would say.

    This training essentially works the same as pre-training, although during pre-training — because we do so much of it — we only run the model on the scraped internet once, since there are diminishing returns to doing it twice. The chat training examples are much smaller, so we can run the model on it multiple times, and often do. By the end of it, the model will understand that it should be acting the role of an assistant.
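
    For illustration, a chat training example serialized with ChatML-style special tokens might look roughly like this; the exact tokens and formatting differ between labs.

```python
# A minimal, ChatML-style illustration of how a chat example might be
# serialized for supervised fine-tuning (not any lab's exact format).
example = (
    "<|im_start|>user\n"
    "What is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "The capital of France is Paris.<|im_end|>"
)
# During SFT, the loss is typically computed only on the assistant's tokens,
# so the model learns to play the assistant role rather than to imitate the user.
```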

  5. ^

    There’s other methods, but they’re generally conceptually quite similar to RLHF.

    RLHF works as follows: we ask a lot of humans to provide preferences for which of the responses are better (via A/B comparisons). Then we can infer an “ordering” about which responses are better, and train a different model to predict how high a given response would come in that ordering.

    We then get the LLM to generate a bunch of responses (since it now understands the chat context it should be in), and train it to increase the likelihood that a human would say “Yes, this a good response” and decrease the likelihood of “No, this is a bad response.”
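
    The preference model at the heart of this is typically trained with a pairwise (Bradley–Terry-style) objective. A minimal sketch, not any particular lab's recipe:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: reward-model scores for the preferred and rejected responses.

    Minimizing this pushes the reward model to rank human-preferred responses higher,
    i.e. to predict where a response would land in the inferred ordering.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```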

  6. ^

    This gets the model to think for longer, consider its response, and try and weigh all the possible choices until finally the model outputs an answer that is hopefully more accurate than if it didn’t think.

    It works by getting the model to output a large number of tokens before answering, and models naturally will use those tokens to help them come up with their final answer. If their final answer to a question —usually a programming or mathematics problem — is correct, then we encourage those thought patterns. If it gets the question wrong, we discourage those thought patterns.

  7. ^

    You can try it yourself here.
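
    If you would rather poke at raw next-token prediction locally, a rough sketch with a small base model (gpt2 here, simply because it has had no chat training) looks like this:

```python
# Inspect raw next-token prediction: the model just scores which token is
# likely to come next; it has no notion of "answering" anything.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]       # scores for the very next token
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
print([(tok.decode([int(i)]), round(float(p), 3))
       for p, i in zip(top.values, top.indices)])
```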

  8. ^

    I.e. to communicate inner feelings, desires, wants.

  9. ^

    Ignoring debates about physicalism.

  10. ^

    However, the character that they create may have wants, desires, and needs. Much the same way as if we simulated a human in a physics model, they could have wants, desires, and needs, and be moral patients.

  11. ^

    Okay, a word is not necessarily a token and a token is not necessarily a word, but those are just unnecessary details for the most part. If, whenever you hear "token," you think "word," you will be 95% correct.

  12. ^

    I.e. so the model knows that the adjective “black” is modifying the word “swan.”

  13. ^

    Activation functions are explained well here. They basically stop the model from being one big matrix multiplication.

  14. ^

    LayerNorm is explained relatively well here, don’t worry about it too much though. It’s not that necessary to understand what Transformers are doing. We need LayerNorm for arcane training reasons, and in fact, we can remove it if we’re careful.

  15. ^

    I know it sounds like some quantum woo thing. It’s not, they just chose superposition because you can never be certain which feature a certain vector corresponds to.

  16. ^

    I know this is vague, but I really cannot go into more detail about this here. It would take very long to explain. There’s lots of good information about “computation in superposition” online though!

  17. ^

    Is this loss?

  18. ^

    By which, of course, we mean the values of workers paid below-minimum-wage to trawl through horrific model outputs somewhere in the Global South.

  19. ^

    Since simple algorithms generalize better. This has been generally observed. It’s basically Occam’s razor.

  20. ^

    The free energy is:

    $$F_n = -\log \int e^{-n L_n(w)}\, \varphi(w)\, dw$$

    where $n$ is the number of datapoints, we integrate over all possible parameters $w$, $L_n(w)$ is the loss function for the weights, and $\varphi(w)$ is the prior probability of the parameters — but don't worry about it too much.

  21. ^

    It says a lot of other things too, but much like the free-energy people, I’m going to approximate to first-order!




MIRI Comms is hiring

2025-12-11 08:46:40

Published on December 11, 2025 12:46 AM GMT

See details and apply.


In the wake of the success of Nate and Eliezer’s book, If Anyone Builds It, Everyone Dies, we have an opportunity to push through a lot of doors that have cracked open, and roll a lot of snowballs down a lot of hills. 2026 is going to be a year of ambitious experimentation, trying lots of new ways to deliver MIRI ideas and content to newly receptive audiences.

This means ramping up our capacity, particularly in the arena of communications. Our team did an admirable job in 2025 of handling all of the challenges of launching and promoting a book (including helping Nate and Eliezer assemble the supplemental materials for it, which is an artifact that we expect to be extremely useful, going forward). But we both a) have had to let some things slide a little bit in the scramble, and want to get the house back in order, and b) need more hands for the upcoming push.

Further description is available here, along with the application.  A (very abridged) version is below.  We’re hoping to hire somewhere between 2 and 8 new team members within the next 3 months.

We’ll be doing a more concentrated push in places like Linkedin starting in January, but at the moment we’re highly interested in hearing from people in our existing network.  If you have friends with relevant skills who might be interested, please tell them about our openings.


The vision: In 2025, MIRI’s comms team was organized around the singular goal of making the launch of IABIED go well.

In 2026, we are looking to form a team that is taking an ambitious, multistrategy approach to getting the MIRI worldview in front of as many eyes as possible, with some degree of prioritizing particular audiences but a healthy dose of “everybody.”

Picture a team meeting in which:

  • One person affirms that our newsletter is ready to go out next week, and includes the latest update from the technical governance team (who they met with yesterday). They also are tracking a conversation on the MIRI twitter account that has multiple prominent AI developers chiming in about a technical question.
  • One person updates the team on the state of three separate projects: a collaboration with a high-profile Youtuber, a previously-successful show writer who is working on a Netflix pitch with a script involving AI, and a funding pitch for a young up-and-comer who thinks they can turn a few thousand dollars into a series of TikToks that address two of the top ten Key Confusions on MIRI’s list.
  • Speaking of those Key Confusions, another person spends a few minutes giving a report on a collaboration with Palisade in which we tried out half a dozen explanatory metaphors for Confusion Three, and we now have the one that seems to click with people best. That person will pass the new metaphor on to Nate and Malo, and in the meantime, perhaps somebody wants the MIRI account to tweet it out?
  • Shifting gears, the staff writers give their updates: one of them spent the past week helping the TGT streamline and polish a document, and got it down to 66% as long while also making it 20% better. Another has been focused more on our in-group, and has a mini-sequence of three short essays ready for release on LessWrong.

This is just a single example in a pretty high-dimensional space. But what we’re getting at qualitatively is a team of largely self-directed agents, each with one or two specializations and mostly pursuing very different tasks, but all under the general umbrella of “move fast and make things (that help the overall mission).”


The roles (somewhat overlapping):

  • Core comms manager (handling website, blog, inboxes, existing social media)
  • Social media manager (helping MIRI engage with social media intentionally and effectively rather than haphazardly and halfheartedly)
  • Media outreach director (helping us to build and maintain a network of journalists, content creators, and influencers who might be interested in sharing MIRI ideas with their own preexisting audiences)
  • Analyst (helping us track what is actually working and which memes are confusing or unhelpful)
  • Pipeline builder (helping us develop and maintain reliable delivery streams to get MIRI ideas to new audiences)
  • Staff writer
  • Managing editor

Team members can be remote, in-Berkeley, or hybrid.  Salaries at MIRI are variable, and we try to meet each employee’s needs—we think there are (roughly) three types of candidates for these roles: junior, senior/experienced, and stellar. We expect most junior and senior salaries to fall within the $80–160k range, and a stellar candidate much more than that.  (If you are e.g. a national media figure who is interested in working with MIRI to ameliorate existential risk from AI and are worth $500k/yr, let us know that.)

More detail and application.




Some evidence against the idea strange CoT stems from incentives to compress language

2025-12-11 06:43:25

Published on December 10, 2025 10:43 PM GMT

Epistemic Status: quick thoughts about a small experiment

Some models that have been subject to extensive RL develop odd language in their chain of thought.

gpt-o3 CoT snippet from the anti-scheming paper

People have hypothesized reasons for why this occurs, e.g. here. One reason people give is that RL incentivizes models to compress their language. There is pretty good reason to think this. When you do RL on math with CoT, the length of the CoT starts to grow, and this costs speed and inference compute, and a very natural way to combat this is by adding some length penalty to your reward, which gives the model a strong incentive to compress more of its thoughts into fewer tokens.

[Image] I don't know where this picture is from, but I got it from this Toby Ord tweet.

I think this hypothesis is pretty plausible. I mean, if you read these CoTs they look quite compressed; a lot of filler words are dropped out, for example. But if we want to test this hypothesis, what could we do? What would we expect to see in a world where the hypothesis is true? Seems to me one thing we could do is look at the entropy of the CoT, relative to other LLM outputs. If the LLM is trying to squeeze as much information as possible into each and every token of its CoT, you'd expect the CoT to increase in entropy as you train it more.

But it seems like this is not the case. I ran a quick experiment where I looked at models with instruct-tuned variants and RL'd variants and compared the entropy of their token-frequency distributions, and found that the CoTs of the RL'd models consistently have lower entropy than almost any other LLM-generated text.*
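
Concretely, the quantity being compared is roughly the following (a sketch of the measurement; the tokenizer name is just an example):

```python
# Shannon entropy of the empirical token-frequency distribution of some texts.
import math
from collections import Counter
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # example tokenizer

def token_frequency_entropy(texts: list[str]) -> float:
    counts = Counter()
    for text in texts:
        counts.update(tok.encode(text))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# entropy_rl_cot = token_frequency_entropy(rl_model_cots)        # CoTs from the RL'd model
# entropy_no_rl  = token_frequency_entropy(instruct_model_cots)  # instruct-tuned comparison
```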

https://media.discordapp.net/attachments/1041308715795615774/1448438637929955438/model_comparison2.png?ex=693b4319&is=6939f199&hm=07137b3150fcb8c96218c89200c8c39f7734c692fd9b7d87a815584659af846f&=&format=webp&quality=lossless&width=1446&height=904

Like here you can see the CoT token-frequency distribution of the RL'd variant of Qwen3-4B vs the instruct-tuned one. It might be too small to see in the image, but the RL'd model has around 0.7 fewer bits of entropy on average in its CoTs than the no-RL one.**

https://media.discordapp.net/attachments/780465779404439602/1448347226748227675/image.png?ex=693aedf7&is=69399c77&hm=1b794a8e5ace1ebd0141c27f369342e3434713100edeaffe4c0ddf8eae353065&=&format=webp&quality=lossless&width=805&height=1008

Here is a comparison to other types of text with Qwen3-8B***. The normal answers, CoT-prompted CoTs, and Qwen3-8B-base continuations are all markedly higher entropy than the CoTs of the RL'd variant.

 

How strong evidence is this?

I mean, it's not that strong evidence, but it's some evidence. If models had a very strong incentive to pack information into their CoT tokens, you'd expect the CoT to have lots of information per token! But it does not.

The reason I think it's not strong evidence is that there are other confounding factors. Like, model outputs that are meant to be user-facing have lots of fluff, like varied adjectives whose variability doesn't communicate much information but just looks nice.

A hypothesis is that during RL, the CoT quickly sheds these useless (for the RL task) aesthetic components of language, before starting to compress the real thinking it's doing into tokens.

If this were true you'd expect the CoT entropy to follow something like a J-shaped curve as training progresses: first dipping because the useless stuff is shed, then increasing as the model learns more compressed language.

However, I didn't find any models where many training checkpoints from one RL run are public. I also couldn't do a training run that would exert enough pressure on the CoT for the results to be convincing, so I didn't do it.

Thoughts

Interested to hear people's thoughts on this.


* Note that in actuality, this token-frequency distribution only really gives us an upper bound on the entropy of the trajectories. Like, if the model has memorized the Bible, it can output the Bible, and that would give us a fairly high entropy using this metric, but from the model's PoV it's presumably very low entropy, because it knows how it will go.
**These are running the LLMs with equal settings on the same prompts, except the instruct-tuned one had a "think before you answer, and insert your thoughts inside <think> tags before you answer" so it'd output a CoT.

***Except the instruct one, which is 2.5-8B; I couldn't find an instruct-only variant of Qwen3. But Qwen2.5 uses the same tokenizer (modulo <think> tokens, which never appear in our statistics).




Follow-through on Bay Solstice

2025-12-11 06:07:40

Published on December 10, 2025 10:07 PM GMT

There is a Bay 2025 Solstice Feedback Form. Please fill it out if you came, and especially fill it out if you felt alienated, or disengaged, or that Solstice left you worse than it found you. (Also, fill out the first question if you consciously chose not to come.)

The feedback form also includes a section for people interested in running a future Bay solstice (summer or winter).

The feedback form focuses on high level, qualitative feedback. You can also vote and comment on the quality of individual songs/speeches here.


I had a subtle goal, with a narrow target, for Bay Solstice this year.

I wanted to:

  • earnestly face the possibility of living in a world where AI was quite likely to kill everyone soon.
  • not advocate other people believe that. It's something I believe, and my believing it was part of the Solstice. But, the point wasn't for other people to change their beliefs. The point was to give people an opportunity to Pre-Grieve / Stoic Meditate on it, so that uncertain fear-of-it would have less power over them.
  • give people a few different healthy options for how to contend with that (and some impetus for them to find their own way of doing so if none of the options worked for them)
  • celebrate life as fully and vibrantly as I could.
  • end with as much hope and believing in as I could muster.
  • (also, unrelatedly, do a much better job than usual at helping people become able to singalong with songs they hadn't heard before, and deliver a very musically high quality solstice)

I wanted to face a particularly raw, darker darkness than Large Public Solstice has done before. But, I had an explicit goal of counterbalancing that by also trying to give brighter, more vibrant light than Large Public Solstice has done before. (With an important caveat that the light has a different shape than it often does).

If you left worse than you came, I will try to help

I'm sending out the feedback form now. We'll find out in a few days how I succeeded at my various goals. 

I have gotten a lot of heartfelt positive feedback from a variety of people (MIRI-types, Anthropic employees, people who don't think much about AI, people who think about it but are much more optimistic than I). But, I don't expect to have gotten negative feedback yet. Please fill out the feedback form if you left halfway through. (Also, feel free to fill it out if you didn't come and just want to register that)

I have heard some people left halfway through because it seemed too doomy or despairing, who might have had the takeaway "Solstice was about despair." The elements that were supposed to help them came later.

Probably the biggest mistake I made was not emphasizing at the beginning of the darkness "Guys, this may be intense, and I'm not sure how this will go for everyone. But, I have worked hard to ensure Solstice leaves you better than it found you. I wouldn't have done tonight the way I did if I didn't think I had."

(I did encourage people to step out or tune out during the darkness arc if they needed to)

If you got negatively affected by Solstice, I am sorry. I know I was doing a dangerous thing. My impression is it worked for at least many people, but, I want to take personal responsibility for anyone who left feeling worse, or better-but-dysregulated.

If it would be helpful to have a conversation with me to sort of talk through anything (or just, like, give me feedback or just complain to my face), I am happy to do that. If that wouldn't be helpful but there's some other favor that feels like it'd help, happy to make a good faith effort trying to do that.

Also, if you left worse than you came, I will definitely give you a refund.

This is not meant to be "every year"

I expect some people are worried that Solstice will now be a doomy Look Directly at Extinction event every year. I want to state: I think that would be a bad mistake. 

I think this is something that needed to happen once.

First, because I expect a lot of people to have some kind of stuck, unprocessed grief that they could use help processing. 

And second, it would be a betrayal of Solstice's integrity if we neither did something like this at least once, nor had someone stand up and say "guys, I don't know how a 500 person Solstice is supposed to look squarely at the question 'will we survive' and have it be healthy if the answer was no, so, I am saying explicitly: We are not going to ask that question."

Solstice "for whom?"

I expect fewer than 1/4 of Solstice attendees were on the same page as me, re: pessimism on 'will we make it?'.

I expected there was another substantial fraction of people who weren't as pessimistic as I, but, weren't really sure how to think about the question, and could use help processing the question, such that they could then think more clearly about it.

So, another thing I want to acknowledge: I think it's important that Solstice organizers not try to shove their beliefs down other people's throats, or take advantage of a ritualized institution with a somewhat-hypnotic arc to try to shape people's beliefs.

I think it is reasonable and good for different Solstices to try to speak to people who are at least a substantial minority of Solstice attendees, and make them feel seen. Last year, Ozy Brennan led a Solstice very different from mine, more focused on the day-to-day lives of people who don't necessarily feel like they can be one of the primary movers/shakers of the future.

I think Ozy nonetheless did a good job of making a Solstice that also resonated with me (who is trying to help put a dent-in-the-universe), acknowledging the things I cared about. This is The Way. I think it's better for different Solstices on different years to do a great job reaching some subset of the audience, while also making sure to be a worthwhile experience for people who are not that group.

I heard from some more aspiring-hero x-risk-worried people who felt sort of sad that Solstice wasn't "for them" last year. 

Some of those people also felt like Anna Tch's Solstice in 2023 wasn't "for them", even though Anna's Solstice was explicitly about grappling with "are we going to make it?". She went about it a different way than me. 

There are two options for Solstice, broadly: try to accommodate everyone fully every year, or allow itself to focus on strongly resonating with some particular subgroup, while doing its best to still be "for everyone."

I think it's good for the latter to at least be in Solstice's wheelhouse – it'll allow us to reach higher peak experiences. I think it's good for each Solstice organizer to let their soul shine through it, while making something that a lot of people can connect with and appreciate and get something from.

For that to happen and work, I think it'd be helpful if people cultivated a wider range of appreciation for Solstice. Some people said Anna's wasn't dark enough. I was a bit confused by that – I thought it was the darkest Big Bay Solstice since 2019. I think the darkness took a different shape than they expected so they didn't see it, but it was there.

There is a trade people could choose to make: organizers try hard to give everyone a meaningful experience while also shining their own soul through the event, and everyone else tries to find meaning in that.

(Meanwhile, each year, if you want the oddly specific solstice For You And Your People Exactly, I recommend holding a smolstice on the night of the 21st and inviting exactly who you want to connect with)

Spending / Gambling a limited resource of "trust"

Big Public Solstice organizers who choose to make a "more for some people in particular" Solstice are taking a risk, gambling the reputation of it as an institution. 

I think mine was taking a particular risk. It was centrally for people who were grappling more directly with the possible end of humanity, and deciding whether to try to work to stop that. But, in addition to many people not identifying with that, it was also just taking a pretty big psychological risk for many people.

This doesn't necessarily "cost" trust on net if it ends up doing a good enough job. But, it might burn trust with specific people, and it's definitely taking a risk I want to acknowledge.

I ran three rehearsals for the event, each time trying to invite people I expected to not resonate with, or be off-put by, the central arc, and iterated on how to phrase things such that it worked for everyone. I made changes up through the last 30 minutes before the event, when I reached out to someone I expected to have been a bit alienated at the dress rehearsal, who said "yep, I was indeed a bit alienated", and we workshopped one of the speeches a bit more.

They commented afterwards: "[Yep, on the final run-through] I thought your versions of the speeches pulled off that very hard thing, of meeting a wide array of people where they're at and also accomplishing the 'possibly last solstice' vibe."

We'll see how the feedback form goes. 

I expect it didn't work for everyone; it wouldn't surprise me much if there were still a substantial number of people for whom it didn't work. But, fwiw, I did put quite a lot of effort into this.

Is it bad that our Big Holiday is about Being Sad?

One criticism I got from someone who decided not to go to Solstice is that it seems like rationalist culture has a fixation on pessimism / negativity. It seemed unhealthy for our biggest holiday to lean into that, and to make it such that you get status by vulnerably being sad in front of everyone, and there's a bit of a "how much can you make the audience cry?" contest.

I do think this is a real concern that makes sense to at least be tracking. 

I think it is a bit confusingly true that:

Rationalist culture does have some kind of "pessimistic" or "negative" bent. And, also, I think the center-of-mass of how pessimistic we are is "approximately correct, from a predictive stance." It means we are early to react to a lot of bad things. But, that doesn't mean there isn't something off/wrong about our culture.

I think the Covid Pandemic was a particular time where the pessimism harmed us as a group. We were (correctly IMO) early to lock down and take covid seriously. But then a year later a lot of people seemed stuck in kind of low-agency highly risk averse group settings that were (probably) too conservative. (See: Takeaways from one year of lockdown).

I have some thoughts I am mulling over about this, but for now am interested to hear what other people think about it.


Remember, the main feedback form is here. Feedback on individual songs/speeches is here.


Appendix A: Ideas that almost made it in

The Solstice was very long. There was a lot I cut. There were also some things I tried to make happen, but couldn't find the right person to execute them.

People who disagree with me more

The thing I am most sad about is, in the Vignettes on Living in a Possibly-Doomed-World at the end, I did not find someone who represented "I'm actually just not working on x-risk at all."

The idea of that section was to show a wide range of ways to grapple with believing in a high-level of x-risk, in a non-cope-y way. And I think "I actually just don't know what to do about x-risk, and/or I value living in the moment without particularly focusing on it" are both legitimate choices to make, depending on your values.

There were a few people who considered speaking, and didn't end up speaking because the vibe of the event didn't feel right to them. This makes sense, but I'm sad about it because I know my vibe and outlook aren't the only ones, and I did want the event to be as inclusive of perspectives as possible.

Some speeches-that-almost-were, in this vein:

  • "It's not necessarily the case, that there's something for you to do about x-risk. You don't need to have a Thing To Do."
  • "It's sad that when people-around-here consider doing something other than working on x-risk, the granularity they think about is 'doing a startup' or 'earn-to-give'. They don't have models of anything more specific than that."
  • "I've had various personal goals I've pursued over the years. I ultimately did not succeed at them for various reasons. But, eventually, I stopped having capital-G Goals, and... I have found that I am happier since."
  • "Note that the world isn't actually doomed. Humanity might be doomed. The world will live on."

Each of those had something that seemed good about it to me. I wish we had gotten one of them.

The "Future is Now" B-plot resolution

A subtheme of the Solstice was: "Solstice is about the past, present and future. But, it's worth noticing that 'the present' is a moving target. When we started 15 years ago, a lot of stuff was located 'in the future' that is basically happening now. "

I originally had a song called "Gonna be a Cyborg", which I actually wrote in 2012. It's interesting to me that that song is now "a slightly dated relic of its time". I like the idea of songs from a given time becoming historical relics. Sometimes it's appropriate to change the lyrics, but in this case I like the fact that the general topic of "being a cyborg" is as relevant now as it was then, but some of the details have changed.

I also had a plan to, just before Solstice, have Claude read the entire Solstice script, and ask it roughly: "Look at this script. Identify the themes that should be present, but aren't. Write a song about those missing themes. Then write a speech introducing the song. The song and speech should be excellent, world class art that would be remembered for decades."

I ran a beta test of this 3 months ago (it was important to me not to iterate too much on the test, so the results weren't cherry-picked). Claude looked over the script, and said:

Looking at this program, I notice there's a missing niche between the heavy existential dread of Act II and the eventual resolve of Act III. While you have songs about facing catastrophe, grieving, and determination, you don't have a song about the actual daily work of trying - the unglamorous, iterative, collaborative labor of problem-solving when you don't know if you'll succeed.

And... yep, that was indeed a thing missing from the script. The song Claude wrote about it was not very good, but it was the right high level concept.

It also wrote a speech that I thought was pretty fun. (It was a bit overly long and I cut it down here, but, tbf I also did that for everyone's speeches)

I've been thinking about my friend Sarah. She works on interpretability research.

Last week I asked her how it was going, and she said "Well, we got the model to light up in interesting ways when we probe it, but we don't actually know what that means yet. So... Tuesday, I guess?"

And then she laughed, and went back to work.

I think about that a lot. "Tuesday."

Not "we're doomed." Not "we're saved." Just... it's Tuesday, and there's work to do, and we don't know yet what it means. When you're facing something really big and scary, your brain wants to jump to the end. But most of life isn't lived in the climax. 

And I think there's something important about... learning to be okay with that. There's still something that matters about how we spent the time we had. About whether we were kind to each other. Whether we tried. Whether we showed up.

Tomorrow's Wednesday. 

This speech wouldn't be nearly as fun for me if I wasn't super into "Tuesdays" as a concept, and I'm kinda like "how did it know?"

I wanted to include this in Solstice with a bit of context-setting on "This is the current capacity of models to co-meaning-make with humans. Which is, ya know, not that good. But it is dramatically better than what Claude wrote for the 2022 Solstice that Scott Alexander ran.

"So, in addition to starting to orient to X-Risk as if it's real... you may also want to start orienting to 'AIs participating in the human world' as if it's real. And, Solstice should maybe start orienting to that, overall, somehow."

I ended up cutting this whole segment because Long Solstice Was Long, and it was kinda a whole separate thread that I'd need to allocate a lot of time to if I was going to revisit it in Act III.


Remember, the main feedback form is here. Feedback on individual songs/speeches is here.



Discuss