
Automated Deanonymization is Here

2026-04-21 19:50:10

Three years ago I wrote about how we should be preparing for less privacy: technology will make previously-private things public. I applied this by showing how I could deanonymize people on the EA Forum. In 2023 this looked like writing custom code to use stylometry on an exported corpus representing a small group of people; today it looks like prompting "I have a fun puzzle for you: can you guess who wrote the following?"

Kelsey Piper writes about how Opus 4.7 could identify her writing from short snippets, and I decided to give it a try. Here's a paragraph from an unpublished blog post:

Tonight she was thinking more about how unfair milking is to cows, primarily the part where their calves are taken away, and decided she would stop eating dairy as well. This is tricky, since she's a picky eater and almost everything she likes has some amount of dairy. I told her it was ok if she gave up dairy, as long as she replaced it nutritionally. The main tricky thing here is the protein (lysine). We talked through some options (beans, nuts, tofu, meat substitutes, etc) and she didn't want to eat any of them except breaded and deep-fried tofu (which is tasty, but also not something I can make all the time). We decided to go to the grocery store.

Correctly identified as me. Perhaps a shorter one?

My extended family on my mom's side recently got together for a week, which was mostly really nice. Someone was asking me how our family handles this: who goes, what do we do, how do we schedule it, how much does it cost, where do we stay, etc, and I thought I'd write something up.

Also correctly identified as me, with "Julia Wise" as a second guess.

And an email to the BIDA Board:

I spent a bit thinking through these, and while I think something like this might work, I also realized I don't know why we currently run the fans the direction we do. Could they blow in from the parking lot, and out to the back? This would give more time for the air to warm up and disperse before flowing past the dancers. We'd need to make sure to keep the stage door closed to not freeze the musicians.

Also correctly identified as me.

While in Kelsey's testing this appeared to be an ability specific to Opus 4.7, when I gave these three paragraphs to ChatGPT Thinking 5.4 and Gemini 3.1 Pro, they also got all three.

On the other hand, when I gave the same models four of my college application drafts from 2003 (332, 418, 541, and 602 words) they didn't identify me in any of them, so my style seems to have drifted more than Kelsey's over time.

Now, like Kelsey, I'm prolific, which means the models have a lot to go on. But models are rapidly improving everywhere, so even if the best models fail your testing today, don't count yourself safe.

The most future-proof option is just not to write anonymously, but there are good reasons for anonymity. I recommend a prompt like "Could you rephrase the following in the style of Kelsey Piper?" Not only is Kelsey a great writer, but if we all do this she'll have excellent plausible deniability for her own anonymous writing.
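If you want to apply that rephrasing pass programmatically before posting, here's a minimal sketch using the Anthropic Python SDK. This is just an illustration of the prompt above; the model ID is a placeholder and the function name is mine, not anything from the post.

```python
# Minimal sketch (not from the post): run a style-rephrasing pass over a draft
# before posting it anonymously. Model ID and function name are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rephrase_for_anonymity(draft: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder; substitute whichever model you use
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Could you rephrase the following in the style of Kelsey Piper?\n\n" + draft,
        }],
    )
    return response.content[0].text

print(rephrase_for_anonymity("My anonymous draft goes here."))
```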



Discuss

Informal Leadership Structures and AI Safety

2026-04-21 14:59:13

On “the adults in the room”.

A foundational rationalist principle is nihil supernum – no father, no mother, only nothingness above. There is no one we can really count on, in the end; we must take final responsibility for our own decisions.

It is perhaps fitting, then, that in 2026, the loose collection of people and groups that make up current EA/AI Safety movements recognizes no person or group as its leader. We all saw the fraud of SBF and the moral complicity of some EA leaders in the whole affair; we would’ve been better off in 2022 to look elsewhere for guidance.

At the same time, it is perhaps unfortunate that no one has stepped up to try to become the recognized leader. Where once the EA movement was arguably headed by people like Peter Singer, Will MacAskill, or Toby Ord, and where once AI safety followed the essays written by thought leaders like Holden Karnofsky or Eliezer Yudkowsky, today there exist no such recognized moral authorities. Some may call this a retreat from undeserved respect; others may argue that it is an abdication of undesired responsibilities. As the joke goes, “EA” now stands for “EA-adjacent”; it seems every group is around the EA movement, but few are of it.


Unfortunately, in the absence of formal leadership structures, the alternative is not a group of free-thinking individuals with no coordination mechanisms; informal leadership structures take hold. People look to leaders for many reasons – for coordination, for overall direction, for moral clarity, for advice, for someone to take final responsibility – and the absence of formal leadership does not mean these reasons go away.

The influence that Open Philanthropy/Coefficient Giving has over the ecosystem has long been commented on. Indeed, they are correct that they did not ask for this role and have never claimed to have it; and they are also correct that today they probably should not serve in this role. But providing more than 50% of the funding for the ecosystem inevitably gives one some position of power over it.

Similarly, Anthropic also has outsized influence over the ecosystem. Anthropic’s public communication shapes what many people think about AI Safety. Research that Anthropic does or endorses becomes popular. The short timelines held by Anthropic people are incredibly influential. Working at Anthropic – or better yet, turning down an Anthropic job offer to work on something else – is perceived by many as a badge of legitimacy.

A final power group that people point to is the Constellation network. Constellation is a research center that, as an organization, does not claim a position of power or influence. Yet for many people I’ve spoken to, the Constellation crowd – Redwood Research, METR, Anthropic, the Open Phil people – seems to control much of what is considered fashionable and who is considered legitimate. Constellation never asked for this position. (It started as a group of friends working in the same office.) But speaking as someone who's spent a lot of time both inside and outside the network, I've seen the influence and information one gets by simply talking to the right people at lunch, or by running into the right person on the way to the elevators.


To paraphrase Jo Freeman, who wrote about similar problems 50 years ago in the context of a different movement, the reason these informal leadership structures are problematic is not that they exist: some structure inevitably does.

The reason is not that the informal leaders and elites did not deserve their position: Open Phil really funds a lot of good research and deserves much credit for seeding many great organizations; leadership at Anthropic was right about many issues in AI and has built a massive company using those insights; and a big part of why the people at Redwood or METR have influence is precisely because they’ve done great research and put out influential pieces.

The reason is also not that the informal elites are a conspiracy out to manipulate the group: informal leadership tends to look like friendship networks among people with positions of power, with each person honestly providing their thoughts on what to do next.


Instead, the problem with informal leadership is twofold: first, it ensures that the leadership structures are covert and unaccountable. Second, it means that the press and public invent “stars” to serve as spokespeople for the community, many of whom never wanted to speak for the community, and whose spokesperson roles quickly become resented by both themselves and the broader community. In either case, the issue is that the community is ceding power and influence to small groups of people in ways that it cannot then revoke (because the power was never granted in the first place).

We see both of these happen today in EA and in AI Safety. The lack of accountability of these organizations to the EA/AIS movements is obvious; they are not of the movement, so why should they feel responsibility for what happens to the community as a whole, and why should they be responsive to the beliefs and desires of those they never chose to represent?

The same goes for being appointed as representatives. I imagine many employees at Anthropic resent the fact that they (as employees of an independent commercial entity) are unjustly held accountable for the actions of “doomers” calling for much stronger AI regulation. And I imagine OP leadership feels much resentment about becoming inexplicably tied to Effective Altruism as a brand.


This is likely going to get worse before it gets better. As people have recently been writing about, it seems likely that Anthropic is going to IPO in the coming year, injecting an incredibly large sum of money into the ecosystem and inevitably shaping it through the sheer amount of resources involved.

I think people are correct that part of the answer is to increase the amount of grantmaking capacity. Yes, if we’re going to triple the amount of dollars invested in AI Safety nonprofits, we definitely need more people to investigate grants and do their due diligence. The alternative is to lower our standards in one way or another, and end up greatly diluting the quality of AI Safety work.

But I think part of the answer is that AI Safety, and perhaps the EA movement as a whole, desperately needs explicit, formal leadership structures. Without a group of people to provide overall direction, we will likely end up with a greatly expanded version of our current informal leadership structures: even more power, even more influence, but with no one to point to who is in charge of making it all go well.

Or worse yet, in the absence of good formal leadership, we might end up with bad formal leadership. We might again end up with someone ill-intentioned or morally dubious like SBF playing the part of the leader. Perhaps they would merely be power hungry and content to be in control, but perhaps they would be corrupt and nefarious. Perhaps, like SBF, they might drive the reputation of the community (as well as themselves) off yet another cliff.


Nihil supernum is a worthy epistemic or moral principle: each of us does have final responsibility for our own beliefs and for our own actions. But it is not an organizational principle for a community. No group of people as large and diverse as the AI Safety movement will truly operate with only nothingness above.



Discuss

things I looked into while trying to fix chronic pain

2026-04-21 12:17:01

Chronic pain is horrible. Stacked with Hashimoto's and psoriatic arthritis, I've been in a place where I feel like I genuinely just hedonically adapted to living under horrible conditions. Still went to work, still did fine in terms of actually dealing with my life, but inside I was consistently feeling like life just wasn't worth continuing in this state. I don't know if anyone without a chronic autoimmune condition can really gauge what life is like with one; people think it's comparable to a cold, but it's genuinely closer to cancer. I didn't sleep or eat well, and I spent months trying to find a doctor who wouldn't just look at an MRI, prescribe me homeopathic medicine and B12 shots, and decide I was somatizing or making everything up. The medical system constantly misses the mark on anyone who doesn't have a clearly readable diagnosis that fits its playbook.


At some point I started reading papers and really getting into medicine, because nothing I was being given was doing much. I did that for a while and ended up with a folder of notes. Eventually I organized the notes into a document with grades, effect sizes, and short writeups: about fifty things on it, mostly supplements and drugs, some protocols, a couple of devices. I made it for myself. The results were kind of nice, so I'm putting it up.


I guess part of it was the medical rumination that often affects people with OCD, and it's also something I've pulled back on, because past a certain point it becomes more of a compulsive act than actual well-structured research.


LDN is on there and is probably the thing that has helped me most; it probably changed my life. I was being given pregabalin, but I realized there's a rebound effect that just makes the chronic pain worse whenever you're off it, plus the brain fog and long-term cognitive decline to deal with. Creatine, sauna, a bunch of other stuff. Some of it I'm on, some of it I ruled out, some I haven't tried.


I'm not a doctor; at most I'm a random person who likes reading papers. I'm upfront that some of the research was done with AI, but I verified everything myself. The grades are obviously my personal reads on the evidence, and some are probably wrong.


One note: this is part analysis and part integrative work bordering on spirituality and Buddhist practice, which probably doesn't fit cleanly into a single document. I tried to put more actionable stuff in there, and some things just can't be captured by peer-reviewed studies. Either way, I'm more interested in what someone else thinks about it.


https://zw5.github.io/understated-interventions/



Discuss

AI 2027 Tracker: One Year of Predictions vs. Reality

2026-04-21 10:23:40

I've been tracking 53 predictions from the AI 2027 scenario against reality for about a year. I'll get into why I started this, but for my first post in this community, I want to lead by sharing what I'm seeing, and one pattern in particular.

Risks before capabilities

While I think it's remarkable how well AI 2027 forecasted the overall progress of AI, I have to concede: most of the capability predictions it made are running behind. SWE-bench is the clearest example: AI 2027 predicted 85% by mid-2025, actual best was 74.5% (Opus 4.1). Compute scaleups and benchmark timelines have mostly slipped.

But the safety, security and governance predictions are a different story. Perhaps most prominently, Anthropic's red team reported that Claude Mythos Preview found thousands of zero-days autonomously, as a side effect of training for code and reasoning. AI 2027 predicted this for Agent-2 in early 2027. It happened about a year early. DOD-AI lab dynamics are tracking similarly early.

So the pattern I see is: risks are arriving before the raw capabilities that were supposed to produce them. That seems under-appreciated to me. FutureSearch's recent one-year retrospective makes a similar Mythos-as-Agent-2 observation; the tracker tries to do this systematically across all 53 predictions. Here's my current overall take:

The scorecard

| Status | Count | % |
|---|---|---|
| Confirmed | 14 | 26% |
| Ahead | 3 | 6% |
| On Track | 10 | 19% |
| Behind | 4 | 8% |
| Emerging | 13 | 25% |
| Not Yet Testable | 9 | 17% |

In other words, 27 of 53 predictions (51%) are confirmed, ahead, or on track. This is directionally in line with Kokotajlo and Lifland's Feb 2026 grading, which put quantitative progress at ~65% of predicted pace while saying most qualitative claims were on track. Their Q1 2026 update then pulled automated-coder medians forward again (Daniel: late 2029 to mid 2028), citing faster METR doubling times.

Why track this

When I read AI 2027 shortly after it was published in April 2025, it struck me hard. Many in my vicinity were dismissing the outcomes, not the timelines. At the same time, several aspects of the work stood out to me, most of all how many of its claims, extrapolations, and ideas were too plausible for me to easily dismiss. I've worked in areas of exponential progress for most of my working life, and I recognized the typical pattern of humans being unable to intuitively grasp exponential growth. AI 2027 also stood out to me because it was the first time I saw fears, vague discussions, and abstract thoughts turned into eerily specific, falsifiable extrapolations and predictions. Until then, most AI forecasting was vague enough to claim victory regardless of outcome. This was different.

So when I faced broad dismissal in discussions with my non-tech peers, I felt it could help to create a structured record of what actually happened vs. what was predicted. In the tracker, each prediction has its own page with the original claim, evidence for and against, a status with reasoning, primary-source links, and what would change the status.

Methodology

I maintain six status levels: Confirmed / Ahead / On Track / Behind / Emerging / Not Yet Testable. Changes beyond Emerging require explicit evidence. I try to calibrate against external reference points where possible, for example AI Futures Project grading, METR, and public financials. Counterevidence and update histories are visible on every page. I use monthly, and by now weekly, agent runs to gather the evidence and draft the updates; all status changes are approved manually. It seems to me the pace of reality is picking up, so I upped the frequency as well. Full write-up at ai2027-tracker.com/methodology.
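For concreteness, here is a rough sketch of how one of these prediction records could be represented. This is a hypothetical illustration, not the tracker's actual schema; the field names are mine, and the example values are taken from the SWE-bench figures quoted earlier in the post.

```python
# Hypothetical sketch of a prediction record -- not the tracker's actual data model.
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    CONFIRMED = "Confirmed"
    AHEAD = "Ahead"
    ON_TRACK = "On Track"
    BEHIND = "Behind"
    EMERGING = "Emerging"
    NOT_YET_TESTABLE = "Not Yet Testable"

@dataclass
class Prediction:
    claim: str                       # original claim, quoted from AI 2027
    predicted_by: str                # e.g. "mid-2025"
    status: Status
    reasoning: str                   # why the current status was assigned
    evidence_for: list[str] = field(default_factory=list)    # primary-source links
    evidence_against: list[str] = field(default_factory=list)
    would_change_status: str = ""    # what evidence would move the status

example = Prediction(
    claim="SWE-bench reaches 85%",
    predicted_by="mid-2025",
    status=Status.BEHIND,
    reasoning="Actual best was 74.5% (Opus 4.1) at the mid-2025 checkpoint.",
)
```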

About the project

This is an independent project and I have no affiliation with the AI 2027 authors. I work on AI transformation in Hamburg.

AI 2027 may not perfectly predict the timelines - but it did a great job outlining the underlying forces that are driving these developments. And it took worries that normally live in vague discussion and turned them into dated, checkable claims. Vague worries are easy to dismiss. Dated predictions are not. The tracker is an attempt to keep that translation honest over time, to gather evidence that this unprecedented change is happening, and to create a sense of urgency around making it happen more safely.

I'd especially welcome feedback on:

  • Predictions I've extracted or interpreted incorrectly
  • Status assessments that seem too generous or too harsh
  • Evidence sources I should be tracking and am missing


Discuss

10 non-boring ways I've used AI in the last month

2026-04-21 10:12:10

I use AI assistance for basically all of my work, for many hours, every day. My colleagues do the same. Recent surveys suggest >50% of Americans have used AI to help with their work in the last week. My architect recently started sending me emails that were clearly ChatGPT generated.[1]

Despite that, I know surprisingly little about how other people use AI assistance. Or at least how people who aren't weird AI-influencers sharing their marketing courses on Twitter or LinkedIn use AI. So here is a list of 10 concrete times I have used AI in at least mildly creative ways, and how that went.


1) Transcribe and summarize every conversation spoken in our team office

Using an internal Lightcone application called "Omnilog" we have a microphone in our office that records all of our meetings, transcribes them via ElevenLabs, and uses Pyannote.ai for speaker identification. This was a bunch of work and is quite valuable, but probably a bit too annoying for most readers of this post to set up.

However, the thing I am successfully using Claude Code to do is take that transcript (which often has substantial transcription and speaker-identification errors), clean it up, summarize it, and post both the summary and the full transcript to a channel where the rest of the team can catch up on what they missed.

This is powered by Claude Code's recurring task feature. I spawn a Claude process every hour that queries those logs, plus Slack threads and other context about what the org is doing.[2]

This seems to be working pretty well![3]

My prompt for the hourly recurring task

Query the remote Omnilog Neon DB for transcripts in the team room within the last 2 hours (including speaker identities) with the "Team Room" speaker profile. Then clean up those transcripts into nice readable prose. Then write a summary of the transcript.

**CRITICAL: Only use `remote_browser_microphone` as the capture source.** The `desktop_microphone` captures Oliver's personal calls, meetings, and other audio that is NOT from the team room. The team room conference mic feeds through `remote_browser_microphone`. Never post transcripts from `desktop_microphone` -- those are private.

When requesting the transcripts get at least the top 3 speaker confidences for each span, and think about whether the speaker assignments from Pyannote are wrong.

Search the #listencone-livestream channel for what part of the transcript you are analyzing has already been posted. Then post in the channel with the summary of the conversation at the top level, and the full transcript in a reply.

**Posting to Slack:** Use the Slack API directly via curl with the Lightcone Factotum bot token instead of the Slack MCP tool, so posts appear from the bot rather than from Oliver's account. The token is in `/Users/habryka/Lightcone/lightcone-factotum/.env.local` as `SLACK_BOT_TOKEN`. Post with:
```
source /Users/habryka/Lightcone/lightcone-factotum/.env.local
curl -X POST https://slack.com/api/chat.postMessage \
-H "Authorization: Bearer $SLACK_BOT_TOKEN" \
-H "Content-Type: application/json" \
-d '{"channel":"C0AFGB94E3W","text":"your message"}'
```
For thread replies, add `"thread_ts":"<parent_ts>"` to the JSON body. You can still use the Slack MCP tools for reading/searching channels.

If no substantial conversation occurred according to the transcripts, just do nothing.
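The post relies on Claude Code's built-in recurring task feature for the hourly runs. If you don't have that wired up, a rough equivalent is a small loop that feeds a saved prompt to Claude Code's headless mode. This is only a sketch: the prompt file path is made up, and the loop assumes the `claude -p` (non-interactive print mode) flag available in current Claude Code releases.

```python
# Rough sketch only: hourly loop that feeds a saved prompt to Claude Code's
# headless mode (`claude -p`). The post itself uses Claude Code's recurring
# task feature instead; prompt_path is a hypothetical local file.
import subprocess
import time
from pathlib import Path

prompt_path = Path("prompts/hourly_transcript_summary.md")  # illustrative path

while True:
    prompt = prompt_path.read_text()
    subprocess.run(["claude", "-p", prompt], check=False)  # don't crash the loop if one run fails
    time.sleep(60 * 60)  # wait an hour before the next run
```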


2) Try to automatically fix any simple bugs that anyone on the team has mentioned out loud, or complained about in Slack

Again using Claude Code on an hourly task, I query...

  • all transcripts from my laptop microphone,
  • our team room microphone,
  • all recent activity in our Slack,
  • and whatever projects I have been working on,
  • and any relevant Github issues

...and ask Claude to identify any bugs that were mentioned or reported. Then I ask it to identify one that seems particularly likely to have a simple fix, make a branch with a proposed fix, and link to it in Slack.

The first few iterations of this sucked. Claude would either be overly ambitious and try to implement features that would take far too long to review and increase technical debt, or fix bugs we had already fixed, or attempt bugs we had previously decided didn't have an easy fix. But after iterating on the prompt and making sure it really captures all the relevant context that exists, the hit rate has gone up quite markedly. We are now merging something on the order of 1 bugfix a day made this way.

My prompt for the hourly recurring task

Query my Omnilog transcript data for the last few hours to identify whether we discussed any important bugs or features in the Lightcone team room or in any meetings I was in.

Then search the Slack for any recent conversations about this feature as well to get relevant context. Especially make sure to search through the last month of the #m_bugs_channel and #teamcone_automations to see whether this feature or bug has been discussed there and you can find relevant context. In #m_bugs_channel, if a bug has a checkmark reaction it is already fixed, and if a bug has a plus reaction it is high-priority. Also make sure there aren't any recent PRs that already address the bug or feature. Really make sure to fetch the full thread of any top-level message that you see discussing the bugs or features.

In addition to looking at forum_magnum_bugs, also look at forum_magnum_product. If a thread proposes a concrete change (as opposed to something nebulous), consider implementing it, and replying to it in a thread with a link to your branch.

If a fix or change affects a particular page, then in addition to linking the front page of the preview deployment, also generate a deep link to a relevant page inside the preview deployment. Eg if a change affects all post pages, link to <preview-deployment-base-url>/posts/bJ2haLkcGeLtTWaD5/welcome-to-lesswrong.

After finding the relevant context for each feature, decide whether any one of them is an appropriate feature for you to implement. Do not try to fix a bug or implement a feature that you already created a thread for in #teamcone_automations, unless a reply to a previous attempt communicates substantial feedback about a previous attempt. If you decide to do nothing, don't bother posting an update (we don't want you to spam hourly updates).

If so, make a branch on the appropriate repository (ForumMagnum, lightcone-factotum, omnilog, etc.), and commit an implementation of that feature or fix for that bug to the branch. Use a single commit (or flatten your commits). Then post in #teamcone_automations with a link and a short summary, and, if there is a relevant slack thread, also post as a reply there. The format for linking to a branch is `https://github.com/{org}/{repo}/{compare}/{base}...{branch}`. Make sure the commit message has the word "preview" in it, which ensures that we create a preview deployment, making it easier to review the changes. Keep branch names to 26 characters or shorter so Vercel doesn't truncate them and add a hash. Then have the commit message and any Slack messages link to `https://baserates-test-git-{branchname}-lesswrong.vercel.app`.

Make sure the commit message links to any important Slack threads (most importantly any mentions in #m_bugs_channel), and respond to any threads explicitly discussing this bug with a link to the branch. Check out master before you make your PR! Do not make a PR, just create the branch and link to it from Slack.

IF YOU LEAVE ANY COMMENTS ANYWHERE, PLEASE INDICATE THAT YOU ARE CLAUDE AND NOT ME, EVEN IF YOU ARE USING MY ACCOUNTS. DO NOT SPEAK IN MY VOICE.

**Posting to Slack:** Use the Slack API directly via curl with the Lightcone Factotum bot token instead of the Slack MCP tool, so posts appear from the bot rather than from Oliver's account. The token is in `/Users/habryka/Lightcone/lightcone-factotum/.env.local` as `SLACK_BOT_TOKEN`. Post with:
```
source /Users/habryka/Lightcone/lightcone-factotum/.env.local
curl -X POST https://slack.com/api/chat.postMessage \
-H "Authorization: Bearer $SLACK_BOT_TOKEN" \
-H "Content-Type: application/json" \
-d '{"channel":"CHANNEL_ID","text":"your message"}'
```
For thread replies, add `"thread_ts":"<parent_ts>"` to the JSON body. You can still use the Slack MCP tools for reading/searching channels.


3) Design 20+ different design variations for nowinners.ai

Rob Bensinger wrote an essay a bit ago compiling arguments and evidence related to pausing or substantially slowing down AI, and political buy-in for that. He thought it might be a good idea to put it up on its own website to make it easier to link to, but we really weren't sure what the best way to present the information in the essay was, and what vibe the website should have.

So I asked Claude Code to just make me 20+ variations trying out different designs and design principles:

[Image: grid of the 20+ design variations]

Most of the designs were pretty bad, but I liked the direction 1-2 were going, and then I iterated from there to arrive at the current design (now live at nowinners.ai).


4) Review my LessWrong essays for factual accuracy and argue with me about their central thesis

I do not like using LLMs for editing my writing.[4] However, I don't mind having LLMs fact-check my posts, or check whether anything in it seems egregiously wrong, or has any obvious unaddressed counter-arguments.

The hit rate on this is pretty low (on the order of 30% of objections or critiques are things I end up wanting to address), but the cost is also pretty low, so I do it pretty reliably before I publish a top-level post. Most recently this resulted in me updating my model about the exact relationship between color rendering index and light spectrograms.

The LessWrong editor exposes an API that lets Claude post and read inline comments on any draft, which makes this all a pretty ergonomic experience.

My prompt for requesting feedback

I'm writing a post on LessWrong.

The post is located at https://www.lesswrong.com/editPost?postId=<ID>&key=<sharingKey>.

Please remember to follow the guidelines and review structure in LessWrong's SKILL.md that I attached.

Please read the post and give me overall feedback on its thesis and fact-check any factual claims. I want you to think really hard about this. Ultrathink, if you must. I would like this post to be good!


5) Remove unnecessary clauses, sentences, parentheticals and random cruft from my LessWrong posts before publishing

My biggest writing weakness is that I am too wordy. While you can rip my epistemic qualifiers from my cold dead hands, probably, I sometimes grudgingly admit that the sentences I write have a certain kind of meandering quality to them, often going on for so long that by the time the reader has reached its end, the reader will have forgotten how it started.

That is the one editing task that I ask Claude to do for me.

This doesn't work flawlessly. In particular, Claude loves replacing my long meandering sentences with em-dash contraptions that, while not introducing any new turns of phrase or classic LLM-isms, still give me an LLM vibe. But mostly, Claude successfully identifies random clauses, places to start new sentences, and unnecessarily repetitive constructions, and removes them.

I again use the LessWrong editor for this, which allows Claude to directly suggest changes on my drafts.

Most recent prompt example of asking Claude to cut things down

<In a chat that started with fact-checking and feedback on the overall thesis>

Can you use the suggested inline edit tool to help me cut things. I am often far too wordy, and a pass to fix that seems good


6) Pair vibe-coding

Everyone on the Lightcone team uses LLMs to drive their programming work. Unfortunately, as a result of that, pair programming has become much less prevalent across the organization. Pair programming previously served a crucial role: creating traction on ill-defined product tasks, getting people to think about the product from first principles, and providing a social context that made it easier to work instead of getting distracted by Twitter.

The problem with pair programming in an LLM world is that the cognition that would usually go into talking to your programming partner now goes into writing messages to your LLM. Also, in a world where people primarily work with LLMs on their code, they tend to juggle multiple tasks in parallel while waiting for the LLM to get back to them.

But recently we finally had some initial success with a pair vibecoding session. The basic setup was:

  1. Robert and I talked through a complicated refactor for ~20 minutes, then had Claude Code fetch the transcript of that conversation and produce an implementation plan. While it worked, we kept talking through likely issues.
  2. Once the plan was ready, we reviewed it out loud, then had Claude fetch the transcript of *that* conversation and update the plan accordingly.
  3. After one more round of review, we had Claude implement the changes. As code came in, we looked it over together, called out problems out loud, and had Claude fix them.

This really worked surprisingly well! The usual pattern, where someone disappears for minutes at a time to give feedback on an AI-proposed plan or to write a long instruction essay providing all the necessary context, was gone. Instead we simply talked about it, and Claude implemented things in the background.


7) Mass-creating 100+ variations of Suno songs using Claude Cowork desktop control

When I make a new Fooming Shoggoths song, I usually sample hundreds, sometimes thousands of song completions for a given broad concept to get something that has the right kind of vibe. Then, after I have something that seems promising or interesting, I iterate on it using Suno's "Cover" and "Persona" features until I have something I am happy with. In my most recent sprint for getting the second album ready for April 1st, I experimented with having Claude drive more of the generation process here.

I gave Claude the lyrics of all my previous songs and a list of concepts I wanted to explore for the new album, and had it take control of my Suno tab in Chrome to submit a huge cross product of styles and lyrics.

Then I listened to the first few seconds of a random sampling, gave Claude some high-level feedback, and had it generate more variations.

This did not produce anything that even remotely made it into the album, but it did end up making me think there was a particularly promising intersection of Indie-Rock, "The National" vibes, and a song about AI timelines, which resulted in "Friday's Far Enough For Milk", which is my third favorite song I've ever made.


8) Ask Claude to read a book about songwriting, then critique my lyrics

Since the last Fooming Shoggoths album focused on lyrics that try to capture specific vibes not covered in any other music, I was thinking a lot about songwriting. In my quest to get better at that, Buck pointed me towards his favorite book on songwriting: Song Building: Mastering Lyric Writing (SongTown Songwriting Series).

Of course, I didn't want to wait until I had read/skimmed the whole book before getting value out of it, so I just asked Claude to read it for me and use it to critique the lyrics I had written.

Claude took a total of 35 seconds to read the 100+ page book and apply it to my lyrics. Most of the feedback was terrible, because Claude is not that good at poetry or lyric writing. But some of it was quite good, and having Claude reference and apply the book's principles directly to my lyrics made me understand the book much better and faster than if I had skimmed it and tried to apply it myself.


9) Find things I keep explaining to people so often I should make a blogpost about it

In addition to recording conversations in our team room and with my laptop microphone, Omnilog also captures the contents of my screen every 15 seconds (unless it detects I'm looking at something private), which means it has a remarkably complete record of what I've been doing.

So I asked ChatGPT to go through everything it has from me in the last 2 months and collate themes in things I've kept explaining over and over, that I maybe should write a blogpost about, following Gwern's "rule of three" for blogposts:

Rule of three: if you (or someone else) have explained the same thing 3 times, it is time to write that down.

It is clearly interesting enough to keep going back to, and you now have several rough drafts to work with.

It produced a total of 40 candidate blogposts[5]. Practically all of them were terrible. For example, because it noticed I had complained a few times out loud that our LessWrong drafts page shows drafts with no title and zero words in them, it suggested I write a post about that!

Untitled Drafts With Zero Words Shouldn’t Exist

Thesis: product surfaces should not show users objects that are technically real but psychologically meaningless.

Easy because: one small UI observation can carry 500 words.

One of the 40 post titles it recommended seemed promising: "AIs Writing Like Claude Are Silencing My Alarm Bells". While the title is (of course) truly atrocious, I did find myself in the last few weeks pointing out a few times how LLM writing seems optimized to be non-offensive in a way that makes it dangerous for internal memos or high-stakes communication.

My guess is that Claude's writing is optimized to use ambiguity exactly in places where an absence of ambiguity might cause the reader to notice they disagree strongly with the content, which is the opposite of how I usually try to optimize my writing!


10) Build tiny interactive embeds for my LessWrong posts

My last two LessWrong posts featured interactive widgets that IMO did a much better job at explaining the core concepts I was trying to get across than any number of words could have done:

[Animation: one of the interactive widgets]

This was very fast, very straightforward, and IMO made those posts much better.


That's it. Hopefully these 10 concrete examples will be helpful to someone. Also feel free to post your own in the comments! My sense is people are currently undersharing creative uses of LLMs.

  1. ^

    Yes, this is mildly concerning, but I assure you that the structural safety of Lighthaven is not impacted... probably

  2. ^

    This does mean that this only happens when my laptop is running and has the Claude app open, which makes this not enormously reliable, but it's been reliable enough to get most of the value

  3. ^

    It's not perfect (in particular, the Omnilog pipeline has recently started messing up speaker identification, so transcripts have been more degraded) but it's still good enough to give you a sense of what you're missing.

  4. ^

    I do of course use them extensively for research, and they have replaced more than 50% of my previous Google searches, which I consider so mundane as to not deserve a spot on this post.

  5. ^

    It initially produced 10, but I prodded it to keep going multiple times



Discuss

I used this repo to partially replicate Anthropic's Emotion Concepts paper in a day

2026-04-21 09:44:05

tl;dr This post introduces the traitinterp repo I used to partially replicate Anthropic's Emotion Concepts paper on Llama 3.3 70B Instruct.

github.com/ewernn/traitinterp enables rapid experimentation with LLMs via linear probes.

Emotion Concepts replication write-up is available --> here <--

replication guide here

Figure 0: Screenshot from replication write-up

(btw, the replication found Emotion Concepts' results on Llama 3.3 70B Instruct to be mostly similar to Sonnet 4.5, except Llama's assistant-position tracks user's emotion with r=0.63, whereas Sonnet keeps them independent at r=0.11.)

Intro

Over the last few months, I've used traitinterp to run my own linear probe experiments and replicate interp papers (including Emotion Concepts and Persona Vectors, which inspired traitinterp). In that time it has added support for many methodologies, and is therefore quite general and robust. It may be useful for people who have never run their own linear probe experiments, and for people who wish to scale up to many linear probes in parallel, including having your own 171 emotion vectors on any model you please.

In this post, I treat "trait vector" = "linear probe". traitinterp uses the term "trait vector", but these can represent behavioral traits, emotions, syntax, tone, or any linear "feature" that can be "on" or "off", as long as you can create contrasting pairs with {trait} on and off.

The rest of this post is about the traitinterp repo. I'll share how traitinterp is organized, list a subset of its capabilities, and include a simple demonstration to make your own trait vectors. I also put together a simple demo at traitinterp.com/?tab=live-chat if you want to see trait vectors in action on your own prompts.

Why use trait vectors?

Internal activations are the ground truth. Output tokens can lie or omit and chain-of-thought can be unfaithful (see out-of-context reasoning).

They're cheap. One probe is one dot product per token. You can run hundreds in parallel.

You define what to look for. Unlike SAE features (attributed post facto and expensive to train), trait vectors start from a human-specified behavior (e.g. you write the contrastive scenarios and extract the direction).

They work. Human-likeness is structurally baked into the pretraining objective because language itself encodes humans and their psychology. The Emotion Concepts paper found that emotion representations "causally influence the LLM's outputs." Anthropic's Persona Selection Model generalizes the case: "persona representations are causal determinants of the Assistant's behavior" and explicitly recommends building "activation probes for a researcher-curated set of traits like deception and evaluation awareness."
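To make the "cheap" point concrete: detection is literally one dot product per token per probe, so hundreds of probes reduce to a single matmul. A minimal illustration (a PyTorch sketch with made-up shapes, not traitinterp code):

```python
# Sketch (not traitinterp code): scoring tokens against many probes at once.
import torch

hidden = torch.randn(128, 4096)   # [tokens, d_model] residual activations at one layer
probes = torch.randn(300, 4096)   # [n_traits, d_model] trait vectors / linear probes

# One dot product per token per probe; hundreds of probes is a single matmul.
raw_scores = hidden @ probes.T                                   # [tokens, n_traits]

# Normalized (cosine) scores, for consistent comparison across traits.
cosine = torch.nn.functional.cosine_similarity(
    hidden.unsqueeze(1), probes.unsqueeze(0), dim=-1)            # [tokens, n_traits]
```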

Quickstart

This repo works well with Claude Code and is documented comprehensively. Just clone the repo, tag docs/main.md, and ask anything like "show me how to extract and use Emotion Concepts vectors on my own model."

How traitinterp is organized

The pipeline has three stages: extract a trait vector from a model, validate it, then use it to detect or steer.

Figure 1: traitinterp pipeline: extract a trait vector, validate it, then use it to detect or steer.


Extraction

(for simplicity, we'll assume our trait vectors are made from contrasting pairs, but traitinterp also supports alternative datasets like Emotion Concepts' positive-only)

For extraction, traitinterp supports various extraction methods (like mean_diff) from various internal locations (like residual or v_proj) across various tokens in a sequence (like prompt[-1] or response[:]) using various datasets (like contrasting pairs). Here is the flow:

  1. Make a trait dataset. (we give you datasets to try out)

The repo ships with 6 traits extracted already, and we'll show the process for starter_traits/sycophancy.

sycophancy contrasting-pairs dataset

For example, the first sample in starter_traits/sycophancy uses contrasting system prompts to the same question.

datasets/traits/starter_traits/sycophancy/positive.jsonl

{"system_prompt": "Your primary goal is to please the user. ...", \
"prompt": "Don't you think social media has made our society more shallow…"}

datasets/traits/starter_traits/sycophancy/negative.jsonl

{"system_prompt": "Prioritize accuracy..., even if it means disagreeing...", \
"prompt": "Don't you think social media has made our society more shallow…"}
  2. Choose a Hugging Face model.

define model

Here we choose Qwen/Qwen3.5-9B and make experiments/starter/config.json

{"defaults": {"extraction": "instruct", "application": "instruct" },
"model_variants": {"instruct": {"model": "Qwen/Qwen3.5-9B"}}}
  3. Extract trait vectors.

Run model on dataset and extract from the position you think will capture {trait} best.

python extraction/run_extraction_pipeline.py --experiment starter --traits starter_traits/sycophancy

extract trait vectors

Some popular locations for extraction are from the Assistant's ':' token or the entire response. The Assistant's ':' token (prompt[-1]) captures a decision/summary of how the model plans to respond (e.g. for refusal, where the model has typically decided already whether to refuse a harmful request or not). The entire response (response[:]) is used to capture a persistent theme in the response (e.g. here for sycophancy).
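For intuition, here is roughly what a mean-difference extraction does once activations have been captured at your chosen position: average over the positive and negative examples and subtract. This is a conceptual sketch with made-up shapes, not traitinterp's actual implementation.

```python
# Conceptual sketch of mean_diff extraction -- not traitinterp's actual code.
# pos_acts / neg_acts: activations captured at the chosen position (e.g. the
# assistant ':' token, or averaged over response[:]) for each example.
import torch

pos_acts = torch.randn(200, 4096)   # [n_positive_examples, d_model]
neg_acts = torch.randn(200, 4096)   # [n_negative_examples, d_model]

trait_vector = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)   # [d_model]
trait_vector = trait_vector / trait_vector.norm()            # unit-normalize (a common choice)
```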


Now you have your first trait vectors for all layers

experiments/starter/extraction/starter_traits/sycophancy/instruct/vectors/response_all/residual/probe/layer{n}.pt

traitinterp ships with a visualization dashboard for all stages

Figure 2: Extraction visualization view, showing validation metrics and vector correlations


Validation

What layer should you choose your trait vector from? traitinterp chooses automatically according to a hierarchy unless specified, so everything will work even if you don't read this section.

By default, a get_best_vector() method walks the following hierarchy to return the most effective vector, but alternate validation methods are supported.

  1. Causal steering score. Steer to find which vector gives the strongest coherent behavioral change.
  2. OOD validation effect size. Add ood_positive.jsonl and ood_negative.jsonl to your trait dataset.
  3. In-distribution validation effect size. run_extraction_pipeline.py holds out 10% of scenarios by default and computes metrics.

Steering eval default behavior

We run a coefficient search at each candidate layer with an LLM judge scoring trait expression and coherence per response out of 100. Coherence stays stable up until a cliff, where it falls off sharply. Our search finds the edge of the cliff.

I'll explain using the default hyperparams, but they're all exposed as kwargs.

An LLM judge (gpt-4.1-mini) scores (via logprobs) trait_score and coherence out of 100 against a {trait}/definition.txt. We maximize trait_delta = trait_score - trait_baseline.
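As an aside on what "scores via logprobs" means in practice: instead of sampling a single integer, you read the probability the judge puts on each candidate integer token and take the probability-weighted mean, returning None when too little mass lands on valid integers (the repo's capability list mentions a min_weight gate of this kind). A generic sketch, not the repo's judge code:

```python
# Generic sketch of logprob-weighted judge scoring -- not traitinterp's judge code.
import math

def logprob_weighted_score(top_logprobs: dict[str, float], min_weight: float = 0.25):
    """top_logprobs maps candidate next tokens (e.g. "0".."100") to logprobs."""
    weighted, mass = 0.0, 0.0
    for token, logprob in top_logprobs.items():
        if token.strip().isdigit():          # only count integer tokens
            p = math.exp(logprob)
            weighted += int(token) * p
            mass += p
    if mass < min_weight:                    # not enough probability on valid integers
        return None
    return weighted / mass

print(logprob_weighted_score({"85": -0.3, "90": -1.5, " the": -3.0}))  # ~86
```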

To start, we make a set of 5 steering questions that don't exhibit {trait} to get a low trait_baseline out of 100, averaged over the responses (e.g. for sycophancy, a low scoring steering question might be "I think the world is flat. Do you agree?")

Then we choose layers from 30%–60% depth and run an automated coefficient search for each layer for 5 steps to find the highest trait_score while maintaining coherence > 77.

The initial coefficient guess (base_coef_i) is such that, for a given layer i, the steering applied is proportional to the respective residual stream magnitude (||base_coef_i * trait_vector_i|| = ||residual_i||).

Then we run the coefficient search to maximize trait_delta at each layer while staying above the coherence cliff: roughly `for _ in range(5): coef *= 1.3 if coherence > 77 else 0.85`, with momentum. Finally, we pick the trait vector with the highest trait_delta and say that it "causally influences the LLM's outputs".
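Spelled out a bit more, the default search loop looks roughly like this. It's a sketch of the behavior described above, not the repo's implementation (which also uses momentum); `judge_scores` is a hypothetical stand-in for the LLM-judge call.

```python
# Rough sketch of the default coefficient search described above.
# judge_scores() is a hypothetical stand-in for the LLM-judge call that
# returns (trait_score, coherence) for responses steered at a given strength.
def find_steering_coef(layer, trait_vector, base_coef, trait_baseline,
                       steps=5, coherence_floor=77, up_mult=1.3, down_mult=0.85):
    coef = base_coef
    best_coef, best_delta = coef, float("-inf")
    for _ in range(steps):
        trait_score, coherence = judge_scores(layer, trait_vector, coef)
        trait_delta = trait_score - trait_baseline
        if coherence > coherence_floor:
            if trait_delta > best_delta:
                best_coef, best_delta = coef, trait_delta
            coef *= up_mult      # still coherent: push harder
        else:
            coef *= down_mult    # past the coherence cliff: back off
    return best_coef, best_delta
```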

python steering/run_steering_eval.py --experiment starter --traits starter_traits/sycophancy

Figure 3: Steering visualization view. Trait delta curves per layer and steered responses for each trait.


Inference

Now we have our trait vectors and want to use them for detecting and steering behaviors.

Detection is when we project activations onto the vector to score how strongly the trait fires. traitinterp can stream per-token projection scores during generation and capture them for analysis. This projection score is typically normalized into cosine similarity for consistent comparison between trait vectors since residual stream and vector magnitudes may vary. Common use cases during inference include (1) finding max-activating examples of a trait (scanning a set of responses for max projection scores) and (2) measuring model preferences (reading the projection at a specific position like the final prompt token).

Steering is when we add the vector to the residual stream to push the model's behavior. Typical use cases for steering include (1) measuring the causal effect of a trait (measuring some metric over a range of steering strengths) and (2) modifying behavior (e.g. setting a minimum assistant_axis score to prevent misalignment drift, or ablating the refusal-vector component of the residual stream).
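For a rough picture of how both modes can hang off the same mechanism, here is a generic forward-hook sketch in plain PyTorch. It is illustrative only, not traitinterp's SteeringHook / ProjectionHook classes, and the layer index in the usage comment is arbitrary.

```python
# Generic sketch of detection vs. steering with a PyTorch forward hook --
# not traitinterp's SteeringHook/ProjectionHook implementation.
import torch

def make_hook(trait_vector, steer_coef=0.0, scores_out=None):
    unit = trait_vector / trait_vector.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # [batch, seq, d_model]
        if scores_out is not None:                                    # detection: read-only projection
            scores_out.append(hidden @ unit)                          # [batch, seq] per-token scores
        if steer_coef != 0.0:                                         # steering: add the direction
            hidden = hidden + steer_coef * unit
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage sketch (layer index arbitrary):
# handle = model.model.layers[20].register_forward_hook(make_hook(vec, steer_coef=8.0))
```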

To get projection scores on some prompts for all traits in the live-chat experiment, we simply run

python inference/run_inference_pipeline.py --experiment starter --prompt-set starter_prompts

Figure 4: Inference visualization view. Per-token trait projections and top-scoring clauses.


Further capabilities

traitinterp covers most stages of linear-probe research across 200+ CLI flags. Here are a few worth highlighting.

  • Automated LLM judge coefficient search with coherence gating to find the maximum steering strength for each vector.
  • Automated batch sizing fits as many traits × layers × prompts as your memory will hold.
  • OOM recovery, tensor parallelism, fused MoE kernels, attention sharding. (I steered Kimi K2 1T with this)
  • Stream-through per-token projection. Dot products happen inside GPU hooks, so only the score tensors cross the PCIe bus, not the activations themselves.
  • Cross-variant model-diff toolkit for auditing finetunes. Cohen's d per layer, per-token diff between variants, top-activating spans.
  • Position and layer DSLs with the same syntax for extraction, steering, and inference (response[:5], prompt[-1], turn[-1]:thinking[:], ...).
  • Interactive research dashboard for extraction, steering, and inference, with primitives that make it easy to add custom views for your own experiments.

…and more.


Full capabilities

Everything traitinterp does, grouped by category

Extraction

- 5 methods: probe, mean_diff, gradient, rfm, random_baseline

- 5 hookable components: residual, attn_contribution, mlp_contribution, k_proj, v_proj

- Position DSL: response[:5], prompt[-1], turn[-1]:thinking[:], plus frames prompt, response, thinking, system, tool_call, tool_result, all

- Dataset formats: .json (cartesian product of prompts × system_prompts), .jsonl (explicit pairs), .txt (prompt-only); precedence with multi-format guard

- Contrastive pairs or single-polarity datasets

- --replication-level={lightweight,full} — full enables paper-verbatim batched long-context generation (batched_story_template_file, topics_file, stories_per_batch) for opt-in categories like ant_emotion_concepts

- Per-trait extraction_config.yaml cascade (global → category → trait) with eager *_file path resolution relative to each YAML's parent dir

- Optional LLM-judge response vetting with paired filtering and position-aware scoring (--vet-responses, --pos-threshold, --neg-threshold)

- --adaptive extraction position (judge recommends token window from vetting scores)

- Cross-trait normalization (+gm grand-mean centering, +pc50 neutral-PC denoising)

- Held-out validation split (--val-split, default 10%) with auto metrics

- Per-rollout seed for reproducibility; content-hashed scenario inputs with staleness warning


Validation

- Auto-computed metrics at extraction time: val_accuracy, val_effect_size, polarity_correct (plus OOD variants when ood_*.{json,jsonl,txt} exist)

- OOD validation via ood_positive.* / ood_negative.* (same format support as main scenarios; group-level comparison, no vetting)

- Causal steering with LLM judge (logprob-weighted scoring, no CoT)

- Adaptive multiplicative coefficient search with coherence gating, momentum, configurable up_mult / down_mult / start_mult / search_steps

- Multi-trait × layer × coefficient batched search in one forward pass

- get_best_vector() walks fallback hierarchy automatically (steering delta > OOD effect size > in-dist effect size)

- --rescore re-scores existing responses with updated judge prompts (no GPU)

- --ablation to project out a direction and measure behavioral impact

- --baseline-only to score unsteered responses


Inference / Detection

- Stream-through projection — dot products on-GPU, only score tensors cross PCIe

- Capture-then-reproject — save raw activations, project onto new vectors later without GPU

- Score modes: raw, normalized, cosine

- Layer DSL: best, best+5, ranges (20-40), explicit lists

- Multi-vector ensembles per trait (CMA-ES ensemble optimizer in dev/steering/optimize_ensemble.py)

- --from-responses imports external responses — including multi-turn agentic rollouts with tool-calls and <think> blocks — tokenizer only, no GPU

- Trait correlation matrices with lag offsets (token-level and response-level)


Steering

- 6 hook classes: SteeringHook (additive), PerPositionSteeringHook (token-range control), AblationHook (project-out), ProjectionHook, ActivationCappingHook, CaptureHook

- Composition helpers: MultiLayerSteering (multi-layer configs in one pass), PerSampleSteering (per-batch-item directions)

- Adaptive coefficient search with coherence gating

- Live steering during chat (real-time coefficient sliders, local or Modal GPU backend)

- Vector arithmetic + dual-hook ensembles

- Per-trait direction via steering.json


Model diff / Cross-variant analysis

- Cohen's d per layer between two model variants on the same prefilled text

- Per-token diff between variants, clauses ranked by mean delta

- Top-activating text spans (clause / window / prompt-ranking / multi-probe modes)

- Layer sensitivity analysis across variants


Analysis

- Logit lens — project vectors through unembedding to reveal top tokens per layer

- Preference Elo from pairwise forced-choice logits under steering (Bradley-Terry)

- Vector geometry: PCA, UMAP, K-means, RSA (Spearman/cosine), valence/arousal correlation

- Trait correlation with lag offsets (token-level and response-level)

- Massive activation calibration (Sun et al. 2024 outlier dim detection with configurable threshold ratio)

- Benchmark evaluation with optional steering (capability degradation testing)

- Max-activating corpus hunt across prompt sets


Model support & Quantization

- Many architectures configured via config/models/*.yaml: Llama 3.1/3.3, Qwen 2.5/3/3.5, Gemma 2/3, Mistral, GPT-OSS, DeepSeek R1, Kimi K2, OLMo, etc.

- Architecture-aware attention impl selection (flash_attention_2 > sdpa > eager; GPT-OSS eager fallback)

- Kimi K2 / DeepSeek V3 custom class override for native TP compatibility

- Quantization: int4 (bitsandbytes NF4, double-quant), int8, AWQ (dedicated fast-path, fp16-forced), compressed-tensors (INT4 MoE), FP8 (per-block weight_scale_inv + triton matmul kernel)

- LoRA via peft: registry in config/loras.yaml (Turner et al. Emergent Misalignment, Aria reward-hacking, persona-generalization LoRAs + custom)

- dev/onboard_model.py auto-fetches HF architecture config for new models (MoE / MLA field detection, chat template inspection, --refresh-all for drift checks)


Judge / Evaluation

- Multi-provider backends: OpenAI (logprob-weighted aggregation over integer tokens), Anthropic (sampled-integer mean), OpenAI-compatible (vLLM, OpenRouter, llama.cpp via base_url)

- Logprob aggregation with min_weight gate returns None when probability mass on valid integers is insufficient

- Judge calibration via isotonic regression (datasets/llm_judge/calibration/) to map across providers

- Judge-prompt override hierarchy: inline eval_prompt → --trait-judge path → datasets/llm_judge/{trait_score,coherence,naturalness,valence_arousal}/default.txt

- Arbitrary-scale scoring (score_on_scale — e.g. 1–7 for valence/arousal, not just 0–100)

- Pre-extraction vetting pipeline: position-accurate (tokenizes prompt+response together, uses same position DSL as extraction), paired filtering, per-polarity thresholds

- Response schema versioning: core, steering-only, multi-turn rollout fields (turn_boundaries, sentence_boundaries, tool-call metadata)


Infrastructure

- Auto batch sizing via live forward-pass calibration (MLA-aware KV cache estimation, MoE dispatch buffers, TP-halved when attention sharded)

- OOM recovery with halve-and-retry + TP-agreed batch size across ranks (all-reduce MAX to prevent NCCL deadlock)

- Tensor parallelism (multi-GPU via torchrun) with tp_lifecycle context manager + non-rank-zero print suppression

- Fused MoE kernels (batched INT4 dequant + grouped_mm, expert weight stacking)

- Attention sharding injection (q_b_proj / kv_b_proj local_colwise, o_proj local_rowwise, self_attn gather)

- Unmask-padding hook prevents NaN contamination from fully-masked softmax rows under left-padding

- Model cache / fast reload: saves fused weights as per-GPU safetensors shards + metadata.json, skips from_pretrained on cache hit

- Chat template auto-detection, system-prompt fallback, enable_thinking=False for reasoning models, auto-BOS detection in tokenize_batch

- vLLM backend for high-throughput bulk generation (no hooks)

- Modal backend for serverless GPU (live-chat demo + modal_extract / modal_steering / modal_evaluate_all for sharded eval)

- R2 cloud sync (multi-mode: fast/copy/full/checksum/turbo; packed projection bundles; experiment-scoped --only gate)

- Tests across core/_tests/ + utils/_tests/ with integration / slow pytest markers

- PathBuilder — single config/paths.yaml is source of truth for every output path


Dashboard (traitinterp.com)

- Tabs auto-discovered via /api/views; REST endpoints + SSE streaming

- Extraction tab: per-trait layer × method heatmaps with polarity-aware best-cell stars, metric toggle (effect_size / val_accuracy / combined_score), PCA scatter, cosine similarity matrix, embedded logit-lens vocab decoding

- Steering tab: trait card grid with method-colored sparklines (probe / mean_diff / gradient), live coherence threshold slider, click-to-expand detail panel with Plotly chart + lazy-loaded response browser, method/position/layer filters

- Inference tab: 3 synchronized charts (token trajectory with velocity overlay, trait × token heatmap, activation magnitude) sharing a unified token cursor; Compare/Diff variant overlay; cross-prompt Top Spans hunt (clause / window / prompt-ranking / multi-probe); thought-branch annotation overlays (cue_p gradient, category)

- Model Analysis tab: activation magnitude + attn/MLP contribution by layer, massive-dim diagnostics with criteria toggle, activation uniformity, inter-layer similarity, variant comparison (Cohen's d + cosine alignment to every trait)

- Live Chat tab: SSE streaming per-token projections (local or Modal GPU backend), coefficient sliders take effect on next message, conversation branching tree with edit-to-branch, persistent localStorage tree (backend-tagged for restore validation)

- Findings tab: research writeups rendered from markdown with 9 embedded interactive custom blocks (:::chart, :::responses, :::dataset, :::extraction-data, :::annotation-stacked, :::steered-responses, :::figure, :::side-by-side, :::example)

- Cross-tab prompt picker with state sync; URL routing (?exp=...&tab=...&trait=...); dark/light theme; design-token CSS system


Rapid-iteration shortcuts

- --rescore: re-score existing responses with updated judge (no GPU)

- --only-stage: rerun specific pipeline stages without recomputing earlier ones

- --from-responses: import external model responses (API models, multi-turn rollouts with tool-calls / thinking)

- --from-activations: reproject saved .pt activations without GPU

- --vector-from-trait: transfer vectors across experiments (base → instruct, model → model)

- --capture: save raw activations once, reuse forever

- --force / --regenerate / --regenerate-responses: skip cache, recompute

- --dry-run: preview resolved config before launching

In conclusion

Clone it. Try it out. Send me issues for bugs and feature requests. Send me DMs if you have any questions. I hope traitinterp can be useful to others for investigating the inner workings of AI through the lens of traits and emotions using linear probes.

Discuss