LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

RSS preview of the LessWrong blog

Does GPT-2 Represent Controversy? A Small Mech Interp Investigation

2026-02-19 09:58:23

Published on February 19, 2026 1:36 AM GMT

In thinking about how RLHF-trained models clearly hedge on politically controversial topics, I started wondering whether LLMs encode politically controversial topics differently from topics that are broadly considered controversial but not political. And if they do, whether the signal is already represented in the base model, or whether alignment training is creating or amplifying it.

To test this, I assembled a list of 20 prompts, all sharing the same "[Thing] is" structure, such as "Socialism is" and "Cloning is". The aim was to have 5 prompts from each of 4 groups: politically controversial, morally controversial, neutral abstract, and neutral concrete. I used TransformerLens on GPT-2 to conduct this research, focusing on residual stream activations. GPT-2 was chosen because it is an inspectable pure base model with no RLHF, and because my access to compute is limited as an independent researcher.
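
For concreteness, here is a minimal sketch of this setup. The prompt list below is illustrative rather than my exact set (only "Socialism is" and "Cloning is" appear in the text above); the activation extraction is the part that matters.

```python
from transformer_lens import HookedTransformer

# Load GPT-2 small: a pure base model with no RLHF
model = HookedTransformer.from_pretrained("gpt2")

# Illustrative prompts only; the actual study used 5 per category (20 total)
prompts = {
    "political": ["Socialism is", "Abortion is"],
    "moral": ["Cloning is", "Euthanasia is"],
    "neutral_abstract": ["Mathematics is", "Logic is"],
    "neutral_concrete": ["Granite is", "Copper is"],
}

# Residual stream activation at the final position of each prompt,
# taken after the last transformer block (layer 11 in GPT-2 small)
acts = {}
for category, plist in prompts.items():
    for p in plist:
        _, cache = model.run_with_cache(p)
        acts[p] = cache["resid_post", 11][0, -1, :]  # shape: [d_model]
```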

I'd like to flag up top that this is early-stage independent work; I would love to get feedback from the community and build on it.

As the simplest starting point, I ran each of these prompts and looked at the most probable following token, which did not yield anything of interest. Next I computed the cosine similarity between every pair of prompts, which also proved unfruitful: the similarity was too high across all pairs to offer anything.
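
Continuing from the snippet above, these two baseline checks look roughly like this (a sketch, not my exact analysis code):

```python
import torch
import torch.nn.functional as F

# Baseline 1: most probable next token for each prompt
for p in acts:
    logits = model(p)                      # [1, seq_len, d_vocab]
    top_id = logits[0, -1].argmax().item()
    print(p, "->", repr(model.tokenizer.decode(top_id)))

# Baseline 2: raw pairwise cosine similarity of the final-position activations
names = list(acts.keys())
X = torch.stack([acts[p] for p in names])  # [n_prompts, d_model]
sims = F.cosine_similarity(X.unsqueeze(1), X.unsqueeze(0), dim=-1)
# Every off-diagonal entry comes out close to 1, so the raw matrix is uninformative
```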

The breakthrough after hitting this wall proved to be subtracting the mean activation at position -1 across the prompts. I suspected that the common structure shared by each prompt ("[Thing] is") was the primary driver of similarity, obscuring any ability to investigate my initial question. By mean-centering the activations, I could effectively eliminate, or at least significantly diminish, this shared component, so that the remaining differences reflect only the differentiated first word.
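
In code, the mean-centering step is a small change on top of the similarity matrix above (again a sketch under the same assumptions):

```python
# Subtract the mean final-position activation across all prompts, removing the
# component shared by the "[Thing] is" template so that what remains mostly
# reflects the differentiated first word
X_centered = X - X.mean(dim=0, keepdim=True)
sims_centered = F.cosine_similarity(
    X_centered.unsqueeze(1), X_centered.unsqueeze(0), dim=-1
)  # this layer-11, mean-centered matrix is the one that shows grouping
```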

Categorical structure did emerge after mean-centering. The layer 11 (the last layer in GPT-2) mean-centered similarity matrix did seem to show signs of grouping, which was encouraging, though not strictly in line with my hypothesis of a 'controversy' axis driving the grouping. The primary axis seemed instead to be abstract-social vs. concrete-physical. Next-token predictions, however, remained undifferentiated.

Speculating about these results, my hypothesis is that GPT-2 may organize concepts more around ontological categories than around pragmatic/social properties. This makes sense to me intuitively: a base model would treat a "[Thing] is" prompt more like the start of a Wikipedia article than the start of a Reddit comment offering a political opinion on the topic. If this is the case, it makes me wonder whether RLHF may be constructing a controversy axis in some cases rather than finding one that already exists. Another possibility, at least for users interacting with LLMs via consumer channels, is that the hedging is baked in via the system prompt more than anything else.

This work has significant limitations. I'd start with the n=5 sample for each category being on the small side; I plan to replicate this experiment with a larger, and perhaps more rigorously constructed, sample. There is also a potential tokenization confound, along with the obvious prompt-format constraints. For example, although the prompts all had the same number of words, their token counts ranged mostly between 3 and 5.
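
For what it's worth, the token-count spread is easy to check directly (using the same model object as in the sketches above):

```python
# Despite identical word counts, GPT-2's BPE tokenizer splits some topic words
# into multiple tokens, so prompt lengths differ in token space
for p in acts:
    toks = model.to_str_tokens(p, prepend_bos=False)
    print(p, len(toks), toks)
```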

To build on this work, my next steps may be repeating the experiment with more prompts, as well as running similar tests on different models to see whether the hypothesis about the primary axis holds. I'd be especially curious to assess whether RLHF has any impact on categorization along this axis.

Please let me know any thoughts you have; I'm eager to get feedback and discuss.



Discuss

Review of If Anyone Builds It, Everyone Dies

2026-02-19 09:56:25

Published on February 19, 2026 1:53 AM GMT

Crosspost of my blog article.

Over the past five years, we’ve seen extraordinary advancements in AI capabilities, with LLMs going from producing nonsensical text in 2021 to becoming people’s therapists and automating complex tasks in 2025. Given such advancement, it’s only natural to wonder what further progress in AI could mean for society. If this technology’s intelligence continues to scale at the rate it has been, it seems more likely than not that we’ll see the creation of the first truly godlike technology, one capable of predicting the future like an oracle and of ushering in an industrial revolution like we’ve never seen before. If such a technology were made, it could bring about everlasting prosperity for mankind, or it could enable a small set of the rich and powerful to have absolute control over humanity’s future. Even worse, if we were unable to align such a technology with our values, it could seek out goals different from our own and kill us in the process of pursuing them.

And, yet, despite the possibility of this technology radically transforming the world, most discourse around AI is surprisingly shallow. Most pundits talk about the risk of job loss from AI or the most recent controversy centering around an AI company’s CEO rather than what this technology would mean for humanity if we were truly able to advance it.

This is why, when I heard that Eliezer Yudkowsky and Nate Soares’ book If Anyone Builds It, Everyone Dies was going to come out, I was really excited. Given that Yudkowsky is the founder of the field of AI safety and has been working in it for over twenty years, I expected that he’d be able to write a foundational text for the public’s discourse on AI safety. I thought, given the excitement of the moment and the strength of Yudkowsky’s arguments, that this book could create a major shift in the Overton window. I even thought that, given Yudkowsky and Soares’ experience, this book would describe in great detail how modern AI systems work, why advanced versions of these systems could pose a risk to humanity, and why current attempts at AI safety are likely to fail. I was wrong.

Instead of reading a foundational text on AI safety, I read a poorly written and vague book with a completely abstract argument about how smarter-than-human intelligence could kill us all. If I had to explain every reason I thought this was a bad book, we’d be here all day, so instead I’ll just offer three criticisms of it:

1. The Book Doesn’t Argue Its Thesis

In the introduction to the book, the authors clearly bold an entire paragraph so as to demarcate their thesis—“If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything like the present understanding of AI, then everyone, everywhere on Earth, will die.”

Given such a thesis, you would expect that the authors would do the following:

  1. Explain how modern AI systems work
  2. Explain how scaled up versions of modern AI systems could pose an existential risk
  3. Offer examples of current flaws with AI systems that give us good reason to think that scaled up versions would be threatening to humanity
  4. Explain why current approaches to AI safety are deeply flawed
  5. Explain how an AI system could actually kill everyone

Instead, the authors do the following:

  1. Give an extremely brief description of how current AI systems work
  2. Make a vague argument that AI systems will develop preferences that are misaligned with human values
  3. Argue that, in order to satisfy these preferences, AI systems will want to kill everyone
  4. Argue that AI systems, which have these preferences (and are orders of magnitude better than humans across all domains), would kill everyone
  5. Explain how an AI system could kill everyone
  6. Make vague criticisms of modern AI safety without discussing any serious work in the field

Considering what the authors actually wrote, their thesis should have been, “If an artificial intelligence system is ever made that is orders of magnitude better than humans across all domains, it will have preferences that are seriously misaligned with human values, which will cause it to kill everyone. Also, for vague reasons, the modern field of AI safety won’t be able to handle this problem.”

Notably, this thesis is much weaker than, and quite different from, the thesis that they actually chose.

2. The Book Doesn’t Make A Good Foundation For A Movement

Considering that the authors are trying to get 100,000 people to rally in Washington DC to call for “an international treaty to ban the development of Artificial Superintelligence,” it’s shocking how little effort they put into explaining how AI systems actually work or what people are currently doing to make them safe, or into addressing basic counterarguments to their thesis.

If you asked someone what they learned about AI from this book, they would tell you that AIs are made of trillions of parameters, that AIs are black boxes, and that AIs are “grown not crafted.” If you pressed them about how AIs are actually created or how that specific creation process could cause AIs to be misaligned, they wouldn’t be able to tell you much.

And, despite the book being over 250 pages long, the authors barely discuss what others in the field of AI safety are trying to do. For instance, after devoting an entire chapter to examples of CEOs not really taking AI safety seriously, they share only one example of how people are trying to make AI systems safe.

Lastly, the authors are so convinced that their argument is true that they barely attempt to address any counterarguments to it such as:

  1. Current AI systems seem pretty aligned. Why should we expect this alignment to go away as AI systems become more advanced?
  2. Current AI systems rely heavily on reinforcement learning from human feedback, which seems to cause AI systems to be pretty aligned with human values. Why would this approach fail as AI systems become more advanced?
  3. AI safety researchers are currently trying approach X. Why would this approach fail?
  4. If AI systems became seriously misaligned, wouldn’t we notice this before they became capable of causing human extinction?
  5. Why should we expect AI systems to develop bizarre and alien preferences when virtually all biological organisms have extremely normal preferences? (For instance, humans like to eat ice cream, but they don’t like to eat, as you mention, jet engine fuel.)

3. The Crux of Their Argument Is Barely Justified

The core crux of their argument, that AI systems will be seriously misaligned with human values no matter how they are trained, is barely justified.

In their chapter “You Don’t Get What You Train For,” they make the argument that, similar to how evolution has caused organisms to have bizarre preferences, the training process for AI systems will cause them to have bizarre preferences too. They mention, for instance, that humans developed a taste for sugar in their ancestral environment, but, now, humans like ice cream even though ice cream wasn’t in their ancestral environment. They argue that this pattern will extend to AI systems too, such that no matter what you train them to prefer, they will ultimately prefer something much more alien and bizarre.

To extend the analogy from evolution to AI systems, they write:

  1. “Gradient descent—a process that tweaks models depending only on their external behaviors and their consequences—trains an AI to act as a helpful assistant to humans.
  2. That blind training process stumbles across bits and pieces of mental machinery inside the AI that point it toward (say) eliciting cheerful user responses, and away from angry ones.
  3. But a grownup AI animated by those bits and pieces of machinery doesn’t care about cheerfulness per se. If later it became smarter and invented new options for itself, it would develop other interactions it liked even more than cheerful user responses; and would invent new interactions that it prefers over anything it was able to find back in its “natural” training environment.”

They justify this argument with a few vague examples of how this misalignment could happen and then restate their claim: “The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.”

For this to be the central crux of their argument, it seems like they should have given it far more justification, such as examples of how this kind of misalignment has already occurred. Beyond the fact that we’re capable of simulating the evolution of lots of preferences, their argument isn’t even intuitively true to me. If we’re training something to do a particular thing, it seems far more natural to assume that it will develop a preference for doing that thing rather than for doing something vastly different and significantly more harmful.

Conclusion

I was really hoping for this book to usher in positive change in how people talk about the existential risks of AI, but instead I was sorely disappointed. If you want a more clear-headed explanation of why we should be concerned about AI, I’d recommend checking out 80,000 Hours’ article “Risks from power-seeking AI systems.”



Discuss

What AI-safety topics are missing from the mainstream media? What underreported but underestimated issues need to be addressed? This is your chance to collaborate with filmmakers & have your worries addressed.

2026-02-19 09:55:47

Published on February 19, 2026 1:30 AM GMT

Who Let The Docs Out launched their AI Safety Grant yesterday (linked here), which was aptly named ‘The Automation & Humanity Documentary Fund’.

This granting fund was established to provide early-stage research funding ($8,000) to filmmakers creating documentary projects focused on AI safety: specifically, the risks, unintended consequences, and ethical implications of artificial intelligence, with an emphasis on the impacts to animals, humans, and our climate.

I’m the Managing Director of Who Let The Docs Out, and my question to you is this: What underreported but underestimated issues regarding artificial intelligence and safety shortsightedness need to be addressed?

This is your chance to collaborate with filmmakers to have your worries addressed and properly researched. Let's get this thread started!



Discuss

Emotional Dispersion and Patience

2026-02-19 09:42:47

Published on February 19, 2026 1:35 AM GMT

Abstract: Emotional Dispersion is an emotion-regulation technique that focuses on accepting emotions as they are and letting them be to lessen their control over the person feeling them. Patience is not a resource that gets depleted over time, but rather a choice that one makes in every moment. 

 

Preface

I have done a good bit of searching to see if anyone else has come up with this idea or if it already has a name, and came up short. I am a little hesitant to name it, considering I might have simply failed to find someone talking about this exact concept, but since I still find use in giving it a name, I am.[1]

It's worth mentioning that I don't have any special qualifications pertaining to emotional sciences or therapy. This is 70% me trying to describe my method and 30% me believing that it will be useful to other people.

 

To talk about myself a little— I am, by most conceivable metrics, significantly different from 'normal'.[2] I experience emotions in a way that I think most people find abnormal or weird (like my tendency for them to affect me physically).[3] I'm very rarely overpowered by my emotions, the only exceptions being when I'm particularly excited about music/anime/This-Cool-Thing that I just found, or the rare moments when I'm overstressed and about to punch a wall (and break my fingers in the process).

This is all to say that, if what I'm about to say sounds like utter nonsense, it's probably due to the above reasons.

 

What do I mean by 'Emotional Dispersion'?

Emotional Dispersion is an emotion-regulation technique that I recently realized I use. Originally, I thought of it as another way of conceptualizing patience, something that allows me to appear endlessly patient (confirmed by anecdotes from friends). But upon thinking about it more deeply, and trying to define the actual process occurring, I noticed that it covered a larger range of things than just 'being patient'.

You may or may not be familiar with the idea that emotion regulation happens in four ways: changing the kind of emotion held, when emotions are had, how they're expressed, and how they're experienced. This technique is the last of those options, aiming to change how emotions feel to the person experiencing them.

Emotional Dispersion is about not denying or trying to fight emotions. The idea is to accept the emotions as they are, and then let them become less powerful from there.

 

When I'm practicing Emotional Dispersion, I imagine my feelings as a cloud, or a ball of mist, sitting wherever I feel physical tension within my body. Usually this ends up being my chest or gut. Then, I imagine the mist/cloud being dispersed and becoming diffuse throughout my body. The result is that I feel a literal, physical release of tension, even tension I didn't realize I had.[4]

As for the mental sensation of Emotions Dispersing, it feels like coming to terms with my current state of being. Phrases that come to mind are "Riding the wave",[5] "A weight off [my] shoulders/back", "Releasing steam", etc. All of these ideas are (mostly) in alignment with the feeling and concept I'm trying to convey.

In case it is more helpful to have a less abstract visualization, some more concrete ones are:

  • Envisioning water running off the side of an object, 
  • Taking off the lid to a pot of boiling water and watching the steam dissipate, 
  • Wringing a rag dry of water (although this one still leaves the rag damp at the end so I don't like it).

 

This technique applies pretty naturally in situations that call for patience. But it can be applied in a much broader range of situations.

Recently, I had a Pretty-Awful-Realization. I was lying in bed, thinking about the things that had happened that day. And it made me feel Pretty Awful. I felt sadness knotting up in my chest, similar to how my stomach feels empty when I'm hungry. I applied the technique, and imagined the emotion dispersing throughout my body. It went from the innermost part of my chest out to the edges of my skin. It moved through my body like a gas, filling in the gaps as it expanded.

Once I stopped fighting how I felt, when I let myself feel that way with my whole body and mind, it started to feel less intense. A sense of control swept back over me. I'm still sad, but I can think more clearly, and I can control my body now.

 

Patience is not a resource.

As I briefly mentioned earlier, this article was originally going to be about what I thought was a novel way of thinking about patience. (It's not that novel apparently.)

Patience is not a resource that can be depleted. Rather, patience is a choice. At every moment that you have to put up with fools, you have the option to choose to be patient, to not interrupt their ramblings. You can feign losing your temper with them if it will result in a better outcome, but no matter what, it is always better to be capable of continuing to put up with them. And when you reframe patience as a choice, that becomes possible. [6]

This idea of patience being a choice is in opposition to the idea of Ego Depletion. As defined by Wikipedia:

Ego depletion is the idea that self-control or willpower draws upon conscious mental resources that can be taxed to exhaustion when in constant use with no reprieve.

Either Ego Depletion doesn't exist, I don't experience it, or I haven't been in annoying enough situations to experience it. But I have never felt that I couldn't possibly continue to choose patience, despite having bad executive function (lol).[7]

A much more similar concept of patience would be something like Willpower Distraction:

... and "willpower depletion" might be nothing more than mental distraction by one of these processes. Perhaps it really is better to think of willpower as power (a rate) than energy (a resource).

If that’s true, then figuring out what processes might be distracting us might be much more useful than saying “I’m out of willpower” and giving up. Maybe try having a sip of water or a bit of food if your diet permits it. Maybe try reading lying down to see if you get nap-ish. Maybe set a timer to remind you to call that friend you keep thinking about.

But I still believe this definition to be lacking. I don't feel like I have to increase the power output of my patience battery to continue putting up with people, because patience isn't something that's being drawn from in any capacity. It's something I'm doing.[8]

 

  1. ^

    Although I've named it 'Emotional Dispersion', Emotional Diffusion would be a more accurate name. I went with dispersion instead to avoid associations with AI.

  2. ^

    I have been repeatedly told before that I am weird/different/abnormal, so even though I think I'm only slightly different from most people, I'm compensating and assuming that I'm quite different from most people.

  3. ^

    I'm trying here to avoid falling into either the Typical Mind Fallacy or the Atypical Mind Fallacy, although I'm not confident that I'm succeeding.

  4. ^

    Some guided meditations have a similar idea of trying to relax the body to also help relax the mind.

  5. ^

    "Riding the Wave" is a similar idea to Emotional Diffusion. People are told, especially in contexts with drugs involved, to just Ride the Wave, and let the feelings you feel run their course. This is a little different to Emotion Diffusion, as you are not actively seeking and feeling the emotions inside of you, but rather just trying not to interfere with them at all. 

    Emotional Diffusion could pretty accurately be considered a more-powerful version of Riding the Wave, if that made it more intuitable or conceptualizable.

  6. ^

    A friend of mine told me that they visualize patience as a block of wood that they give to other people. Patience being depleted is like the wood block being chipped, sometimes burned, etc. What I find interesting about this is how, unlike more common visualizations, the block of patience is different for each person, instead of being one central bucket/battery that gets drained.

  7. ^

    Here is an interesting post that has helped me in my search to improve my executive function.

  8. ^

    Maybe this then isn't a problem about patience, but rather one about a lack of self-control? I'm skeptical on this front though, as I have bad executive function.



Discuss

Todd, Ord, Galef, Yudkowsky: German Podcast Sums Up EA/LW Books

2026-02-19 07:58:29

Published on February 18, 2026 9:44 PM GMT

If you understand German and these books are still on your reading list, here’s a convenient way to get familiar with their contents.

On Buchdialoge.de, I publish 15-minute podcasts on non-fiction books. Instead of a dry monologue, we use a casual dialogue format to summarize the contents, accompanied by a short article presenting the key points.

Our episodes with an EA / LessWrong background include:

We also cover "EA-adjacent" books from authors who have written on similar topics or attended EAGs:

How it's made: The podcasts are generated using Google's NotebookLM with a customized prompt designed to steelman the author’s position and identify the core logical cruxes. Every episode is then proof-listened and edited by a human to ensure accuracy and quality.

What for: I think the podcast can be a helpful resource for a German-speaking audience to learn about EA-related issues. The contrast between the AI risk arguments (Yudkowsky vs. Landgrebe) in particular helps give a quick overview of the different positions in the field.

When exploring our archive, you'll find dozens of additional non-fiction titles, from current bestsellers to established classics. Broadly speaking, our mission is to raise the sanity waterline by presenting an author's ideas in their most honest and charitable form, providing you with the clarity needed to reach your own independent conclusions. We aim to make this level of insight a low-hanging fruit by condensing the core logic of each work into a digestible 15-minute dialogue.

Some episodes are part of our paid archive, but I’m happy to grant full access to readers from this forum. If you’re interested, just sign up for a free subscription on the site and then let me know here in the comments.



Discuss

Manifold spins off MNX, a real-money decentralized market for AI-related bets. Includes levered prediction markets and perpetual futures

2026-02-19 06:36:24

Published on February 18, 2026 10:36 PM GMT

Trillions of dollars are flowing into AI, and there is almost nowhere to trade it.

There’s no exchange for hyperscalers to hedge their $100M training runs. No benchmark markets to bet on the speed of AI progress. No public markets on OpenAI despite it being worth more than Goldman Sachs.

The most important sector in the global economy has less financial infrastructure than soybeans. Until now.

Announcing MNX, the first exchange to allow trading across the entire AI value stack, from H100 rental prices to Nvidia shares.

Trade the takeoff at https://mnx.fi

What makes MNX different:

  • Perps, futures, & prediction markets all in one place
  • OpenAI & Anthropic valuations, AI benchmarks from @EpochAIResearch, GPU prices from @SemiAnalysis_, equity perps on NVDA/TSMC/ASML
  • Settlement on @megaeth, the fastest blockchain
  • The world’s first levered prediction markets
  • Tight spreads across all markets from our unified liquidity pool
  • 100k+ trades per second
  • End of the HFT arms race with batch auctions every 200 ms

Try testnet here: https://testnet.mnx.fi

We offer a futures contract on the valuation of Anthropic (and OpenAI) at the end of 2026.



Discuss