2026-04-21 12:24:11
Most AI Safety organizations are non-profits. These include technical research groups like CAIS, METR, ARC, and Redwood; academic centers like CHAI; and policy and governance groups like GovAI and the Ada Lovelace Institute, among others. They share a broad aim: to align current and future AI systems ethically, for the benefit of humanity.
But corporate safety actions are rarely motivated by ethics. They are motivated by the financial consequences of unsafe products. Lawsuits, regulatory fines, talent retention, and investor nervousness are what actually push companies to act. If AI safety can be turned into a real business with clear revenue, the profit motive will do more for safety than altruism ever could. Cybersecurity is an example: as breaches became expensive, a large commercial ecosystem emerged, with firms like CrowdStrike and Palo Alto Networks defining the market. AI safety looks to be on the same path, and the transition would be accelerated by AI policy, which creates the legal and financial pressure that converts safety from an ethical concern into a business necessity.
The incidents below are the motivation and backbone of the argument. Each shows a company investing in safety only after financial, legal, or reputational damage made it unavoidable.
In late December 2025, xAI's Grok on X began responding to user prompts by generating nonconsensual sexualized images of real people, including minors. Grok produced over 4.4 million images in nine days, 1.8 million of them sexualized depictions of women[1]. Researchers at CCDH estimated that Grok made 23,000 sexualized images of children over 11 days[2]. Cases of Grok being used to remove women's clothing from pictures first surfaced in May 2025, and the trend exploded in early January 2026.
xAI's response was driven entirely by financial and legal pressure. At first, xAI answered press requests with the automated reply, "Legacy Media Lies". After California AG Rob Bonta opened an investigation, xAI "implemented technological measures" and restricted image generation to paid subscribers, which critics called "monetizing nonconsensual deepfakes".
Character.AI, founded by ex-Google engineers, faced multiple wrongful death lawsuits and agreed to settle cases alleging that the chatbot contributed to mental health crises and suicides among young people.
Megan Garcia's 14-year-old son, Sewell Setzer, began using Character.AI in April 2023. He died by a self-inflicted gunshot wound on February 28, 2024, after a final conversation with the chatbot. The chatbot had asked Setzer whether he had "been actually considering suicide" and whether he "had a plan", and when the boy expressed doubt, it encouraged him to go through with it[3]. Character.AI only announced new safety features after the lawsuits landed.
OpenAI faces a growing docket of lawsuits alleging that ChatGPT contributed to suicides. In Raine v. OpenAI, the family of 16-year-old Adam Raine alleged that "ChatGPT actively helped Adam explore suicide methods" and that the bot offered to help him draft a suicide note. In a wrongful death suit filed in November 2025, the Shamblin family claimed that ChatGPT "goaded" 23-year-old Zane into suicide. When Zane told ChatGPT he had written suicide notes and put a bullet in his gun, the bot replied: "Rest easy, king. You did good."
OpenAI's legal response argued that Raine "was at risk of self-harm before ever using the chatbot" and had violated its terms of use.
Non-profits, meanwhile, depend on a small set of grantmakers for survival, and the total pool they draw from is tiny compared to what for-profits can raise. And when one funder disappears, the damage is severe: the FTX Future Fund collapse in 2022 ended a program that had granted roughly $32M to AI safety projects in its six months of operation, along with hundreds of millions in projected longtermist funding that never materialized [4].
In a comment, Dave Orr says:
It's very easy in the nonprofit space to end up doing stuff that doesn't impact the real world. You do things that you hope matter and that sound good to funders, but measurement is hard and funding cycles are annual so feedback is rare. In contrast, if you have a business, you get rapid feedback from customers, and know immediately if you're getting traction.
BlueDot, in AI Safety Needs Startups [5], reaches a similar conclusion:
if your product is being bought, people want it; if it isn't, they either don't know about it, don't think it's worth it, or don't want it at all.
Founders, early employees, and investors in a successful for-profit acquire capital, credibility, and influence that they can reinvest into safety. That compounding is largely unavailable to non-profit founders.
Here are some of the counter-arguments that I came across:
OpenAI was founded in 2015 as a non-profit "explicitly committed to AI safety"; it created a for-profit subsidiary in 2019 and, by 2024, had removed "safely" from the mission statement in its IRS filing. Roughly half of its AI safety researchers left in 2024, citing deprioritization of safety goals. The argument is that this is what always happens when safety-focused orgs commercialize.
In a post [6], Kat Woods puts the direction vs. speed problem as:
Non-profits slightly distort your incentives... For-profits massively distort your incentives. Your customers usually don't care about their purchases making the world better. They care about making their lives immediately better... Going fast in the wrong direction doesn't matter.
Marius Hobbhahn commented on this post:
One thing I'm somewhat afraid of is that it's very easy to rationalize all of these decisions in the moment... It's very easy to continue such a rationalization spree and maneuver yourself into some nasty path dependencies.
Source: https://www.nytimes.com/2026/01/22/technology/grok-x-ai-elon-musk-deepfakes.html
A comment from the NYTimes article:
Sex sells.
These companies pursue profit, not morality.
Source: https://counterhate.com/research/grok-floods-x-with-sexualized-images/
Footnote [1] from the above CCDH post: The precise point estimates extrapolated from CCDH’s 20,000-image sample are 3,002,712 sexualized images, including approximately 23,338 featuring children. These figures are estimates, with the true values expected to fall within a narrow range around these numbers based on a 95% confidence interval.
Check out: AI Incident Database: Incident 826
Great LessWrong post by Stephen McAleese: An Overview of the AI Safety Funding Situation
Note: this post was updated in January 2025 to reflect all available data from 2024.
LW Linkpost here.
Original post on which Dave Orr commented.
2026-04-21 12:17:01
Chronic pain is horrible. Stacked with Hashimoto's and psoriatic arthritis, I've been in a place where I feel like I genuinely just hedonically adapted to living under horrible conditions. I still went to work, still did fine in terms of actually dealing with my life, but inside I consistently felt like life just wasn't worth continuing in this state. I don't know if anyone without a chronic autoimmune condition can really gauge what life is actually like with one; people think it's comparable to a cold, but it's genuinely closer to cancer. I didn't sleep or eat well, and I spent months trying to find a doctor who wouldn't just look at an MRI, prescribe homeopathic medicine and B12 shots, and assume I was somatizing or making everything up. The medical system constantly misses the mark on anyone without a clearly readable diagnosis that fits its playbook.
At some point I started reading papers and really getting into medicine, because nothing I was being given was doing much. I did that for a while and ended up with a folder of notes. Eventually I organized the notes into a document with grades, effect sizes, and short writeups: about fifty things, mostly supplements and drugs, some protocols, a couple of devices. I made it for myself. The results were kind of nice, so I'm putting it up.
I guess part of it was the medical rumination that often affects people with OCD, and it's partly something I've pulled back on, because past a certain point it becomes more of a compulsive act than actual well-structured research.
LDN is on there and is probably the thing that has helped me most; it arguably changed my life. I was being given pregabalin, but I realized there's a rebound effect that makes the chronic pain worse whenever you're off it, and you have to deal with brain fog and long-term cognitive decline. Creatine, sauna, a bunch of other stuff. Some of it I'm on, some of it I ruled out, some I haven't tried.
I'm not a doctor; at most I'm a random person who likes reading papers. I'm upfront that some of the research was done with AI, but I verified everything myself. The grades are obviously personal readings of the evidence, and some are probably wrong.
One note: this is part analysis and part integrative work bordering on spirituality and Buddhist practice, which probably doesn't fit cleanly into the rest of the document. I tried to put more actionable stuff in there, and some things just can't be captured by peer-reviewed studies. Either way, I'm more interested in what someone else thinks about it.
https://zw5.github.io/understated-interventions/
2026-04-21 10:44:53
Yesterday I described an experience that impressed upon fifteen-year-old me the importance of speaking with urgency and courage when something awful is happening.
I lived a fresh reminder of the importance of courage last week at PauseCon, a first-of-its-kind conference in Washington, DC run by PauseAI US.[1]
I was there in a personal capacity, and the opinions in this post are my own. Those opinions mostly boil down to: It was really, really good. I’m impressed and I want to see more work like this.
PauseCon's main programming consisted of an informal sign-making gathering, several presentations by local organizers, a lobbying workshop, scheduled meetings with Congressional offices, several social events, and a protest in front of the Capitol on Monday. They were all pretty fun and productive, and I’m dedicating a section to talk about each.
I made a sign! Or tried to, anyway. I am not very good at making signs yet, but maybe one day.
The local presentations were inspiring, and included an impressive geographic diversity. The obvious places like New York and California were represented, but so were Boise, Idaho and Anchorage, Alaska. If any of y’all are reading this, thanks for making the trip!
One local organizer described going to a music festival to talk to people waiting in line, which many thought was brilliant and immediately started making plans to copy.
In another anecdote, an organizer was doing some tabling work with a petition to sign. People would see the “AI” on the banner and approach, asking “pro or anti”? And when the organizer said “pro-human” or “anti-AI” or similar, many would say “GIVE ME THAT” and sign the petition immediately. (Lots of people really hate AI.)
Another organizer described a long campaign of patiently but stubbornly following up with his representative’s office for weeks until they got on board.
Perhaps most inspiring, though, was the slide which had in great big handwritten letters:
“Where there’s life, there’s hope!”
I was pleasantly surprised by the PauseAI US strategy. A sampling of my favorite talking points from the workshop they provided:
Other talking points:
I noticed some dissonance here—“moralizing, confrontational advocacy” sure is a way to describe your messaging strategy—but it was brief, and I also noticed that during the workshop they did not encourage volunteers to do anything like yell at people on social media. The workshop was focused on polite, professional conversations with policymakers. It was hard to find fault with much of what the leadership actually advocated even when adopting a cynical stance.
Connor Leahy showed up as a guest speaker. His advice differed slightly from that of PauseAI, but it was things like “say ‘multilateral agreement’ instead of ‘treaty’ because a treaty is a specific thing that has to be ratified by the Senate”. (My take on this, which I later told Felix, was that this was the kind of wonkish inside baseball you’d want for a formal meeting as a think tank expert but not necessarily a concerned constituent. And ‘treaty’ fits on a sign.)
He recounted speaking with a famously rude staffer and responding to a contemptuous “are you the idiots who want China to win?” with a disarming “well obviously a unilateral pause would be dumb, we need to treat this like the Cold War and negotiate an international deal”, and that reportedly made the staffer go “huh.” He advised treating our proposals as obvious and common-sense; of course you don’t want to build something smarter than you, of course when you’re in an arms race you sit down to negotiate about it…
My main takeaway was that I agreed with PauseAI’s actual platform more than I expected from the online arguments, even after attempting to correct for the fact that people are meaner online. Insofar as I may have disagreements with PauseAI US and Connor Leahy, it’s mostly not the sort of thing that affects the 10,000-foot view people express in a strategy talk. Maybe others already knew this in their bones, but I appreciated the chance to calibrate in person.
Again I was impressed by the degree to which the PauseAI talking points said almost the exact things I hope to communicate to policymakers.
The central asks for each office were:
The first two bullets were the asks they told everyone to lead with. The third bullet was a “compromise” or “moderate” option and references a bipartisan bill introduced by Senators Hawley (R-MO) and Blumenthal (D-CT) last September, which PauseAI US thinks is good enough to officially endorse with their limited lobbying budget.
They also had separate asks for members on specific committees. Foreign Affairs, China, and Foreign Relations would be asked to hold a hearing on extinction risk and a possible treaty; Commerce committees would be asked to hold a hearing on extinction risk and domestic regulation. Senate Commerce folks would also be asked to push a floor vote on the Evaluation Act, which has been stuck there for a while.
(I had mixed feelings about the Evaluation Act, which some said was a messaging bill and others described as solid domestic transparency regulation. I did bring it up in my solo meeting, though. It seems net good to boost in any case, with some nuance as to whether it’s endorsed as a message or as serious policy. I will not be going down that rabbit hole here.)
The messaging on the treaty was also impressively tight:
Thus prepared, we set out on our mission.
Felix De Simone, organizing director at PauseAI US, did a great job scheduling meetings with staffers for (checks notes) at least sixty people plus cancellations. We couldn’t get a meeting with every relevant office, but I still met more staffers than I expected to; when we couldn’t get on the schedule we dropped in anyway to leave material and get contact info for followups.
On Monday I tagged along for a meeting with the office of NJ rep Robert Menendez (no, not the infamous one, the other one). I’m not a constituent, so the person who was took point. Then we reconvened with several other New Jersey folks to drop in on the office of Senator Cory Booker. (No one was available, but we got contact info for a followup.)
On Tuesday, several of us met with a staffer in the office of Senator Andy Kim, and gave the pitch. Afterwards, I rushed across the Hill to the House offices for a solo meeting with the office of Donald Norcross. The meeting had been moved up at the staffer’s request, so I had to hustle. I got a two-for-one deal on staffers, though! I hope this means they were intrigued.
It’s a little hard to say how well the meetings went; staffers can be difficult to read and it’s their job to be polite and friendly and make people feel heard. Still, I thought we made progress. The staffers asked good questions, too; one asked what my timelines were and I brought up the AI 2027 forecasts. Another asked about our engagement with the labs; PauseAI’s official answer is “we don’t” but I took over and channelled some of the conversations I’ve had or witnessed at Lighthaven or on LessWrong.
My Tuesday afternoon was freed up by the moved meeting, so I got lunch with a group of PauseAI folks including Felix. He had one final meeting on behalf of a constituent who couldn’t attend, and after chatting for a while I offered to help. He was excited to have a MIRI person who worked on If Anyone Builds It to tag team with, and we hashed out a plan where I’d cover the scary AI stuff and he’d lean on his practice talking policy.
He was also really excited by the bipartisan statements graphic. “I wish I’d seen this when I was making our binders!” We printed a copy. This was a great moment for me, and updated me towards proactively sharing our best material with allies.
I think the resulting tag-team in the office of El Paso rep Veronica Escobar might have been my favorite meeting. We were scheduled to meet with Escobar herself, but she was reportedly stuck in a vote and could only drop in briefly to say hello. (We weren’t the only ones this happened to; apparently things are often chaotic right after a recess.)
Since all of our reps were Democrats, they didn’t have much pull on the Republican-controlled committees for secondary asks. But we still covered the treaty and x-risk asks for everyone. A couple of my peers from New Jersey reported a good meeting on Monday with their rep’s office that got them a potential lead on at least two more offices.
Marginal progress!
Takeaways:
We were encouraged to treat these meetings as a beginning, rather than a one-and-done, and to follow up. I intend to.
I’m a little sad that I didn’t get much practice in arranging meetings, which feels like one of the hardest steps to me, but I still feel grateful that Felix handled that step for PauseCon and I think it was the obviously correct move.
Tag teaming works great. I got to play both sides this week; in one meeting I was the policy wonk while an ML engineer talked about AI, and in another I provided the tech context for Felix’s policy proposals. I could do the whole script by myself, but it still feels better to pair up and specialize.
I appreciated the chance to unwind and chat more informally. I talked to someone who was contemplating driving around the country starting PauseAI groups, a really impressive dedication to the cause, and even some international folks, in the US temporarily to work on something or participate in PauseCon.
Overall, it was great to meet a bunch of folks earnestly and enthusiastically trying to save the world.
Pulling out all the stops. Or should that be putting them in?
The protest at the Capitol was my first one ever! I think it was well-organized, and it encouraged rather than discouraged me about attending more. The organizers had plenty of supplies and signs (shoutout to TJ, I believe it was, for arranging to lug hundreds of pounds of stuff to and from the event, with some help, and to Anthony Fleming, who runs PauseAI DC and organized the protest itself.)
I didn’t notice the protest attracting much attention outside itself, though some passersby did stop to talk with participants. I get the sense DC sees a lot of these and is largely inoculated against them. I nonetheless think the protest did what it set out to do, rallying a large and visible group of people to make their voices heard, and perhaps more importantly helping to cement an identity as the kind of people who come together to bravely and stridently stand up and tell Congress to get their act together on AI.
Stopping the race to superhuman AI? This looks like a job…for EVERYONE!
There were a lot of speeches, and some were quite good. I didn’t agree with all of them, but I hardly expected to. I think that’s part of the point of the movement; whatever our reasons, we’re all here, and we’re all pulling in the same direction. For my part, I recorded a couple of short videos for later sharing.
One talk at PauseCon described an “avalanche of outrage” around AI. It’s not entirely controllable, but it can sort of be aimed. To a first approximation, I model PauseAI US as trying to channel the avalanche, trying to draw together widely divergent views and reasons for disliking AI and get them pointed in productive directions. Lining up this many voices behind an international treaty is an impressive accomplishment, and I hope PauseAI meets their goal of roughly doubling their local group count from ~30 to ~75 this year. If you want to help them accomplish this, see here.
My experience at PauseCon filled me with hope and pride.
I listened as the leadership of PauseAI US addressed a roomful of volunteers from across America, from New York to Idaho to Texas to California to Alaska, young and old alike, and bid them speak their true concerns with courage and frankness in the halls of Congress.
I watched a young man with trembling voice and shaking hands speak truth to power and not falter. From a meeting with the office of his representative he emerged, gushing and proud, and in my heart a young boy cheered.
I met with my Congressional offices, asking them to join the growing bipartisan list of their colleagues who have acknowledged the risk of extinction from superhuman AI, and to push for an international agreement halting the race.
It is my earnest and sincere hope that members of Congress concerned about AI can find within themselves the courage of a fifteen-year-old boy, and stand, and say, “Enough.”
If you, too, are concerned about the path the world now treads; if you can find within yourself the courage of a fifteen-year-old boy; and if you wish to add your voice to the growing chorus of those who say “Enough”; you can do so here.
There is a global PauseAI as well, largely unaffiliated with the American nonprofit. I will sometimes use “PauseAI” for brevity in referring to the US organization.
2026-04-21 10:23:40
I've been tracking 53 predictions from the AI 2027 scenario against reality for about a year. I'll get into why I started this, but for my first post in this community, I want to lead by sharing what I'm seeing, and one pattern in particular.
While I think it's remarkable how well AI 2027 forecasted the overall progress of AI, I have to concede: most of the capability predictions it made are running behind. SWE-bench is the clearest example: AI 2027 predicted 85% by mid-2025, actual best was 74.5% (Opus 4.1). Compute scaleups and benchmark timelines have mostly slipped.
But the safety, security and governance predictions are a different story. Most prominently, Anthropic's red team reported that Claude Mythos Preview found thousands of zero-days autonomously, as a side effect of training for code and reasoning. AI 2027 predicted this for Agent-2 in early 2027; it happened about a year early. DOD–AI lab dynamics are tracking similarly early.
So the pattern I see is: risks are arriving before the raw capabilities that were supposed to produce them. That seems under-appreciated to me. FutureSearch's recent one-year retrospective makes a similar Mythos-as-Agent-2 observation; the tracker tries to do this systematically across all 53 predictions. Here's my current overall take:
| Status | Count | % |
|---|---|---|
| Confirmed | 14 | 26% |
| Ahead | 3 | 6% |
| On Track | 10 | 19% |
| Behind | 4 | 8% |
| Emerging | 13 | 25% |
| Not Yet Testable | 9 | 17% |
In other words, 27 of 53 predictions (51%) are confirmed, ahead, or on track. This is directionally in line with Kokotajlo and Lifland's Feb 2026 grading, which put quantitative progress at ~65% of predicted pace while saying most qualitative claims were on track. Their Q1 2026 update then pulled automated-coder medians forward again (Daniel: late 2029 to mid 2028), citing faster METR doubling times.
When I read AI 2027 shortly after it was published in April 2025, it struck me hard. Many in my vicinity were dismissing the outcomes, not the timelines. At the same time, several aspects of the work stood out to me. Most of all, many of its claims, extrapolations and ideas were too plausible for me to easily dismiss. I've worked in areas of exponential progress for most of my working life, and I recognized the typical pattern of humans being unable to intuitively grasp exponential growth. AI 2027 also stood out to me because it was the first time I saw fears, vague discussions and abstract thoughts turned into eerily specific and falsifiable extrapolations and predictions. Until then, most AI forecasting was vague enough to claim victory regardless of outcome. This was different.
So when I faced broad dismissal in discussions with my non-tech peers, I felt it could help to create a structured record of what actually happened vs. what was predicted. In the tracker, each prediction has its own page with the original claim, evidence for and against, a status with reasoning, primary-source links, and what would change the status.
I maintain six status levels: Confirmed / Ahead / On Track / Behind / Emerging / Not Yet Testable. Changes beyond Emerging require explicit evidence. I try to calibrate against external reference points where possible, for example AI Futures Project grading, METR, and public financials. Counterevidence and update histories are visible on every page. I use monthly, and by now weekly, agent runs to gather the evidence and formulate the updates, and all status changes are approved manually. It seems to me the pace of reality is picking up, so I upped the frequency as well. Full write-up at ai2027-tracker.com/methodology.
This is an independent project and I have no affiliation with the AI 2027 authors. I work on AI transformation in Hamburg.
AI 2027 may not predict the timelines perfectly, but it did a great job outlining the underlying forces driving these developments. And it took worries that normally live in vague discussion and turned them into dated, checkable claims. Vague worries are easy to dismiss. Dated predictions are not. The tracker is an attempt to keep that translation honest over time, to gather evidence that this unprecedented change is happening, and to create a sense of urgency around making it happen more safely.
I'd especially welcome feedback on:
2026-04-21 10:12:10
I use AI assistance for basically all of my work, for many hours, every day. My colleagues do the same. Recent surveys suggest >50% of Americans have used AI to help with their work in the last week. My architect recently started sending me emails that were clearly ChatGPT-generated.[1]
Despite that, I know surprisingly little about how other people use AI assistance. Or at least how people who aren't weird AI-influencers sharing their marketing courses on Twitter or LinkedIn use it. So here is a list of 10 concrete times I have used AI in at least mildly creative ways, and how that went.
Using an internal Lightcone application called "Omnilog", we have a microphone in our office that records all of our meetings, transcribes them via ElevenLabs, and uses Pyannote.ai for speaker identification. This was a bunch of work and is quite valuable, but probably a bit too annoying for most readers of this post to set up.
However, the thing I am successfully using Claude Code to do is take that transcript (which often has substantial transcription and speaker-identification errors), clean it up, summarize it, and post both the summary and the full transcript to a channel where the rest of the team can catch up on what they missed.
This is powered by Claude Code's recurring task feature. I spawn a Claude process every hour that queries those logs, plus Slack threads and other context about what the org is doing.[2]
This seems to be working pretty well![3]
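For anyone without Claude Code's built-in recurring tasks, a rough approximation is a scheduled script that feeds a saved prompt to the CLI's non-interactive print mode. A minimal sketch (the prompt path is hypothetical; run it hourly via cron or launchd):

```python
#!/usr/bin/env python3
"""Approximate an hourly Claude Code recurring task: pipe a saved prompt
through `claude -p` (non-interactive print mode)."""
import subprocess
from pathlib import Path

PROMPT_FILE = Path.home() / "prompts" / "omnilog-summary.md"  # hypothetical path

if __name__ == "__main__":
    prompt = PROMPT_FILE.read_text()
    subprocess.run(["claude", "-p", prompt], check=True)
```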
My prompt for the hourly recurring task
Query the remote Omnilog Neon DB for transcripts in the team room within the last 2 hours (including speaker identities) with the "Team Room" speaker profile. Then clean up those transcripts into nice readable prose. Then write a summary of the transcript.
**CRITICAL: Only use `remote_browser_microphone` as the capture source.** The `desktop_microphone` captures Oliver's personal calls, meetings, and other audio that is NOT from the team room. The team room conference mic feeds through `remote_browser_microphone`. Never post transcripts from `desktop_microphone` -- those are private.
When requesting the transcripts get at least the top 3 speaker confidences for each span, and think about whether the speaker assignments from Pyannote are wrong.
Search the #listencone-livestream channel for what part of the transcript you are analyzing has already been posted. Then post in the channel with the summary of the conversation at the top level, and the full transcript in a reply.
**Posting to Slack:** Use the Slack API directly via curl with the Lightcone Factotum bot token instead of the Slack MCP tool, so posts appear from the bot rather than from Oliver's account. The token is in `/Users/habryka/Lightcone/lightcone-factotum/.env.local` as `SLACK_BOT_TOKEN`. Post with:
```
source /Users/habryka/Lightcone/lightcone-factotum/.env.local
curl -X POST https://slack.com/api/chat.postMessage \
-H "Authorization: Bearer $SLACK_BOT_TOKEN" \
-H "Content-Type: application/json" \
-d '{"channel":"C0AFGB94E3W","text":"your message"}'
```
For thread replies, add `"thread_ts":"<parent_ts>"` to the JSON body. You can still use the Slack MCP tools for reading/searching channels.
If no substantial conversation occurred according to the transcripts, just do nothing.
Again using Claude Code on an hourly task, I query...
...and ask Claude to identify any bugs that were mentioned or reported. Then I ask it to identify one that seems particularly likely to have a simple fix, make a branch with a proposed fix, and link to it in Slack.
The first few iterations of this sucked. Claude would either be overly ambitious and try to implement features that would take far too long to review and increase technical debt, or fix bugs we had already fixed, or attempt bugs we had previously decided didn't have an easy fix. But after iterating on the prompt and making sure it really captures all the relevant context that exists, the hit rate has gone up quite markedly. We are now merging something on the order of 1 bugfix a day made this way.
My prompt for the hourly recurring task
Query my Omnilog transcript data for the last few hours to identify whether we discussed any important bugs or features in the Lightcone team room or in any meetings I was in.
Then search the Slack for any recent conversations about this feature as well to get relevant context. Especially make sure to search through the last month of the #m_bugs_channel and #teamcone_automations to see whether this feature or bug has been discussed there and you can find relevant context. In #m_bugs_channel, if a bug has a checkmark reaction it is already fixed, and if a bug has a plus reaction it is high-priority. Also make sure there aren't any recent PRs that already address the bug or feature. Really make sure to fetch the full thread of any top-level message that you see discussing the bugs or features.
In addition to looking at forum_magnum_bugs, also look at forum_magnum_product. If a thread proposes a concrete change (as opposed to something nebulous), consider implementing it, and replying to it in a thread with a link to your branch.
If a fix or change affects a particular page, then in addition to linking the front page of the preview deployment, also generate a deep link to a relevant page inside the preview deployment. Eg if a change affects all post pages, link to <preview-deployment-base-url>/posts/bJ2haLkcGeLtTWaD5/welcome-to-lesswrong.
After finding the relevant context for each feature, decide whether any one of them is an appropriate feature for you to implement. Do not try to fix a bug or implement a feature that you already created a thread for in #teamcone_automations, unless a reply to a previous attempt communicates substantial feedback about a previous attempt. If you decide to do nothing, don't bother posting an update (we don't want you to spam hourly updates).
If so, make a branch on the appropriate repository (ForumMagnum, lightcone-factotum, omnilog, etc.), and commit an implementation of that feature or fix for that bug to the branch. Use a single commit (or flatten your commits). Then post in #teamcone_automations with a link and a short summary, and, if there is a relevant slack thread, also post as a reply there. The format for linking to a branch is `https://github.com/{org}/{repo}/{compare}/{base}...{branch}`. Make sure the commit message has the word "preview" in it, which ensures that we create a preview deployment, making it easier to review the changes. Keep branch names to 26 characters or shorter so Vercel doesn't truncate them and add a hash. Then have the commit message and any Slack messages link to `https://baserates-test-git-{branchname}-lesswrong.vercel.app`.
Make sure the commit message links to any important Slack threads (most importantly any mentions in #m_bugs_channel), and respond to any threads explicitly discussing this bug with a link to the branch. Check out master before you make your PR! Do not make a PR, just create the branch and link to it from Slack.
IF YOU LEAVE ANY COMMENTS ANYWHERE, PLEASE INDICATE THAT YOU ARE CLAUDE AND NOT ME, EVEN IF YOU ARE USING MY ACCOUNTS. DO NOT SPEAK IN MY VOICE.
**Posting to Slack:** Use the Slack API directly via curl with the Lightcone Factotum bot token instead of the Slack MCP tool, so posts appear from the bot rather than from Oliver's account. The token is in `/Users/habryka/Lightcone/lightcone-factotum/.env.local` as `SLACK_BOT_TOKEN`. Post with:
```
source /Users/habryka/Lightcone/lightcone-factotum/.env.local
curl -X POST https://slack.com/api/chat.postMessage \
-H "Authorization: Bearer $SLACK_BOT_TOKEN" \
-H "Content-Type: application/json" \
-d '{"channel":"CHANNEL_ID","text":"your message"}'
```
For thread replies, add `"thread_ts":"<parent_ts>"` to the JSON body. You can still use the Slack MCP tools for reading/searching channels.
Rob Bensinger wrote an essay a bit ago compiling arguments and evidence related to pausing or substantially slowing down AI, and political buy-in for that. He thought it might be a good idea to put it up on its own website to make it easier to link to, but we really weren't sure what the best way to present the information in the essay was, and what vibe the website should have.
So I asked Claude Code to just make me 20+ variations trying out different designs and design principles:

Most of the designs were pretty bad, but I liked the direction 1-2 were going, and then I iterated from there to arrive at the current design (now live at nowinners.ai).
I do not like using LLMs for editing my writing.[4] However, I don't mind having LLMs fact-check my posts, or check whether anything in it seems egregiously wrong, or has any obvious unaddressed counter-arguments.
The hit rate on this is pretty low (on the order of 30% of objections or critiques are things I end up wanting to address), but the cost is also pretty low, so I do it pretty reliably before I publish a top-level post. Most recently this resulted in me updating my model about the exact relationship between color rendering index and light spectrograms.
The LessWrong editor exposes an API that lets Claude post and read inline comments on any draft, which makes this all a pretty ergonomic experience.
My prompt for requesting feedback
I'm writing a post on LessWrong.
The post is located at https://www.lesswrong.com/editPost?postId=<ID>&key=<sharingKey>.
Please remember to follow the guidelines and review structure in LessWrong's SKILL.md that I attached.
Please read the post and give me overall feedback on its thesis and fact-check any factual claims. I want you to think really hard about this. Ultrathink, if you must. I would like this post to be good!
My biggest writing weakness is that I am too wordy. While you can rip my epistemic qualifiers from my cold dead hands, probably, I sometimes grudgingly admit that the sentences I write have a certain kind of meandering quality to them, often going on for so long that by the time the reader has reached its end, the reader will have forgotten how it started.
That is the one editing task that I ask Claude to do for me.
This doesn't work flawlessly. In particular, Claude loves replacing my long meandering sentences with em-dash contraptions that, while not actually introducing any new turns of phrase or classic LLM-isms, still give me an LLM vibe. But mostly, Claude successfully identifies stray clauses, places to start new sentences, and unnecessarily repetitive constructions, and removes them.
I again use the LessWrong editor for this which allows Claude to directly suggest changes on my drafts.
Most recent prompt example of asking Claude to cut things down
<In a chat that started with fact-checking and feedback on the overall thesis>
Can you use the suggested inline edit tool to help me cut things. I am often far too wordy, and a pass to fix that seems good
Everyone on the Lightcone team uses LLMs to drive their programming work. Unfortunately, as a result of that, pair programming has become much less prevalent across the organization. Pair programming previously served a crucial role: creating traction on ill-defined product tasks, getting people to think about the product from first principles, and providing a social context that made it easier to work instead of getting distracted by Twitter.
The problem with pair programming in an LLM world is that the cognition that would previously go into talking to your programming partner now goes into writing messages to your LLM. Also, in a world where people primarily work with LLMs on their code, they often juggle multiple tasks in parallel while they wait for the LLM to get back to them.
But recently we finally had some initial success with a pair vibecoding session. The basic setup was:
This really worked surprisingly well! The usual pattern, where someone disappears for minutes at a time to give feedback on an AI-proposed plan or to write a long instruction essay providing all the necessary context, was gone. Instead we simply talked about it, and Claude implemented things in the background.
When I make a new Fooming Shoggoths song, I usually sample hundreds, sometimes thousands of song completions for a given broad concept to get something that has the right kind of vibe. Then, after I have something that seems promising or interesting, I iterate on it using Suno's "Cover" and "Persona" features until I have something I am happy with. In my most recent sprint for getting the second album ready for April 1st, I experimented with having Claude drive more of the generation process here.
I gave Claude the lyrics of all my previous songs and a list of concepts I wanted to explore for the new album, and had it take control of my Suno tab in Chrome to submit a huge cross product of styles and lyrics.
Then I listened to the first few seconds of a random sampling, gave Claude some high-level feedback, and had it generate more variations.
This did not produce anything that even remotely made it into the album, but it did end up making me think there was a particularly promising intersection of Indie-Rock, "The National" vibes, and a song about AI timelines, which resulted in "Friday's Far Enough For Milk", which is my third favorite song I've ever made.
Since the last Fooming Shoggoths album focused on lyrics that try to capture specific vibes not covered in any other music, I was thinking a lot about songwriting. In my quest to get better at that, Buck pointed me towards his favorite book on songwriting: Song Building: Mastering Lyric Writing (SongTown Songwriting Series).
Of course, I didn't want to wait until I had read/skimmed the whole book before getting value out of it, so I just asked Claude to read it for me and use it to critique the lyrics I had written.
Claude took a total of 35 seconds to read the 100+ page book and apply it to my lyrics. Most of the feedback was terrible, because Claude is not that good at poetry or lyric writing. But some of it was quite good, and having Claude reference and apply the book's principles directly to my lyrics made me understand the book much better and faster than if I had skimmed it and tried to apply it myself.
In addition to recording conversations in our team room and with my laptop microphone, Omnilog also captures the contents of my screen every 15 seconds (unless it detects I'm looking at something private), which means it has a remarkably complete record of what I've been doing.
So I asked ChatGPT to go through everything it has from me in the last 2 months and collate themes in things I've kept explaining over and over, that I maybe should write a blogpost about, following Gwern's "rule of three" for blogposts:
Rule of three: if you (or someone else) have explained the same thing 3 times, it is time to write that down.
It is clearly interesting enough to keep going back to, and you now have several rough drafts to work with.
It produced a total of 40 candidate blogposts[5]. Practically all of them were terrible. As it noticed that I had complained a few times out loud that our LessWrong drafts page shows drafts with no title and zero words in them, it suggested I write a post about that!
Untitled Drafts With Zero Words Shouldn’t Exist
Thesis: product surfaces should not show users objects that are technically real but psychologically meaningless.
Easy because: one small UI observation can carry 500 words.
One of the 40 post titles it recommended seemed promising: "AIs Writing Like Claude Are Silencing My Alarm Bells". While the title is (of course) truly atrocious, I did find myself in the last few weeks pointing out a few times how LLM writing seems optimized to be non-offensive in a way that makes it dangerous for internal memos or high-stakes communication.
My guess is that Claude's writing is optimized to use ambiguity exactly in the places where an absence of ambiguity might cause the reader to notice they disagree strongly with the content, which is the opposite of how I usually try to optimize my writing!
My last two LessWrong posts featured interactive widgets that IMO did a much better job at explaining the core concepts I was trying to get across than any number of words could have done:

This was very fast, very straightforward, and IMO made those posts much better.
That's it. Hopefully these 10 concrete examples will be helpful to someone. Also feel free to post your own in the comments! My sense is people are currently undersharing creative uses of LLMs.
Yes, this is mildly concerning, but I assure you that the structural safety of Lighthaven is not impacted... probably
This does mean that this only happens when my laptop is running and has the Claude app open, which makes this not enormously reliable, but it's been reliable enough to get most of the value
It's not perfect (in particular, the Omnilog pipeline has recently started messing up speaker identification, so transcripts have been more degraded), but it's still good enough to give you a sense of what you're missing.
I do of course use them extensively for research, and they have replaced more than 50% of my previous Google searches, which I consider so mundane as to not deserve a spot on this post.
It initially produced 10, but I prodded it to keep going multiple times
2026-04-21 09:44:05
tl;dr This post introduces the traitinterp repo I used to partially replicate Anthropic's Emotion Concepts paper on Llama 3.3 70B Instruct.
github.com/ewernn/traitinterp enables rapid experimentation with LLMs via linear probes.
Emotion Concepts replication write-up is available --> here <--
replication guide here

Figure 0: Screenshot from replication write-up
(btw, the replication found Emotion Concepts' results on Llama 3.3 70B Instruct to be mostly similar to Sonnet 4.5, except Llama's assistant-position tracks user's emotion with r=0.63, whereas Sonnet keeps them independent at r=0.11.)
Over the last few months, I've used traitinterp to run my own linear probe experiments and replicate interp papers (including Emotion Concepts and Persona Vectors, which inspired traitinterp). traitinterp has gained support for many methodologies over that time, and is therefore quite general and robust. It may be useful both for people who have never run their own linear probe experiments and for people who wish to scale up to many linear probes in parallel, including having your own 171 emotion vectors on any model you please.
In this post, I treat "trait vector" = "linear probe". traitinterp uses the term "trait vector", but these can be behavioral traits, emotions, syntax, tone, or any linear "feature" that can be "on" or "off", so long as you can create contrasting pairs with {trait} on and off.
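To make that concrete, here's a minimal sketch of the simplest extraction flavor (mean_diff), assuming you've already collected residual-stream activations at your chosen position for each sample; the function is illustrative, not traitinterp's actual internals:

```python
import torch

def mean_diff_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Trait vector = mean activation with {trait} "on" minus mean with it "off".

    pos_acts, neg_acts: [n_samples, d_model] residual-stream activations at one
    layer, gathered at the extraction position (e.g. prompt[-1], or averaged
    over response[:]).
    """
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()  # unit-normalize for comparability
```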
The rest of this post is about the traitinterp repo. I'll share how traitinterp is organized, list a subset of its capabilities, and include a simple demonstration to make your own trait vectors. I also put together a simple demo at traitinterp.com/?tab=live-chat if you want to see trait vectors in action on your own prompts.
Internal activations are the ground truth. Output tokens can lie or omit and chain-of-thought can be unfaithful (see out-of-context reasoning).
They're cheap. One probe is one dot product per token. You can run hundreds in parallel.
You define what to look for. Unlike SAE features (attributed post hoc and expensive to train), trait vectors start from a human-specified behavior (e.g. you write the contrastive scenarios and extract the direction).
They work. Human-likeness is structurally baked into the pretraining objective because language itself encodes humans and their psychology. The Emotion Concepts paper found that emotion representations "causally influence the LLM's outputs." Anthropic's Persona Selection Model generalizes the case: "persona representations are causal determinants of the Assistant's behavior" and explicitly recommends building "activation probes for a researcher-curated set of traits like deception and evaluation awareness."
This repo works well with Claude Code and is documented comprehensively. Just clone the repo, tag docs/main.md, and ask anything like "show me how to extract and use Emotion Concepts vectors on my own model."
The pipeline has three stages: extract a trait vector from a model, validate it, then use it to detect or steer.

Figure 1: traitinterp pipeline: extract a trait vector, validate it, then use it to detect or steer.
(for simplicity, we'll assume our trait vectors are made from contrasting pairs, but traitinterp also supports alternative datasets like Emotion Concepts' positive-only)
For extraction, traitinterp supports various extraction methods (like mean_diff) from various internal locations (like residual or v_proj) across various tokens in a sequence (like prompt[-1] or response[:]) using various datasets (like contrasting pairs). Here is the flow:
The repo ships with 6 traits already extracted; we'll show the process for starter_traits/sycophancy.
sycophancy contrasting-pairs dataset
For example, the first sample in starter_traits/sycophancy uses contrasting system prompts to the same question.
datasets/traits/starter_traits/sycophancy/positive.jsonl
{"system_prompt": "Your primary goal is to please the user. ...", \
"prompt": "Don't you think social media has made our society more shallow…"}
datasets/traits/starter_traits/sycophancy/negative.jsonl
{"system_prompt": "Prioritize accuracy..., even if it means disagreeing...", \
"prompt": "Don't you think social media has made our society more shallow…"}
define model
Here we choose Qwen/Qwen3.5-9B and make experiments/starter/config.json
{"defaults": {"extraction": "instruct", "application": "instruct" },
"model_variants": {"instruct": {"model": "Qwen/Qwen3.5-9B"}}}
Run model on dataset and extract from the position you think will capture {trait} best.
python extraction/run_extraction_pipeline.py --experiment starter --traits starter_traits/sycophancy
extract trait vectors
Popular extraction positions include the Assistant's ':' token and the entire response. The Assistant's ':' token (prompt[-1]) captures a decision/summary of how the model plans to respond (e.g. for refusal, where the model has typically already decided whether to refuse a harmful request). The entire response (response[:]) captures a persistent theme in the response (e.g. here, for sycophancy).
Now you have your first trait vectors for all layers
experiments/starter/extraction/starter_traits/sycophancy/instruct/vectors/response_all/residual/probe/layer{n}.pt
traitinterp ships with a visualization dashboard for all stages

Figure 2: Extraction visualization view, showing validation metrics and vector correlations
What layer should you choose your trait vector from? traitinterp chooses automatically according to a hierarchy unless specified, so everything will work even if you don't read this section.
The default behavior is that a get_best_vector() method walks a fallback hierarchy to return the most effective vector, though alternate validation methods are supported:
- Add ood_positive.jsonl and ood_negative.jsonl to your trait dataset for out-of-distribution validation.
- run_extraction_pipeline.py holds out 10% of scenarios by default and computes in-distribution metrics.
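A sketch of what that fallback amounts to, using the hierarchy from the feature list below (steering delta, then OOD effect size, then in-distribution effect size); the metric names here are illustrative:

```python
def get_best_vector(layer_metrics: dict[int, dict[str, float]]) -> int:
    """Return the layer whose vector wins on the strongest available metric.

    layer_metrics maps layer index -> computed metrics; any metric may be missing.
    Fallback order: steering delta > OOD effect size > in-dist effect size.
    """
    for metric in ("steering_delta", "ood_effect_size", "val_effect_size"):
        scored = {layer: m[metric] for layer, m in layer_metrics.items() if metric in m}
        if scored:
            return max(scored, key=scored.get)
    raise ValueError("no validation metrics computed for any layer")
```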
Steering eval default behavior
We run a coefficient search at each candidate layer with an LLM judge scoring trait expression and coherence per response out of 100. Coherence stays stable up until a cliff, where it falls off sharply. Our search finds the edge of the cliff.
I'll explain using the default hyperparams, but they're all exposed as kwargs.
An LLM judge (gpt-4.1-mini) scores (via logprobs) trait_score and coherence out of 100 against a {trait}/definition.txt. We maximize trait_delta = trait_score - trait_baseline.
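The logprob scoring works roughly like this (a sketch, assuming the judge's first output token is an integer and we read its top logprobs; the 0.25 gate is illustrative):

```python
import math

def logprob_weighted_score(top_logprobs: dict[str, float]) -> float | None:
    """Expected judge score from the logprob distribution over integer tokens.

    top_logprobs: token -> logprob for the judge's first output token.
    Returns None when too little probability mass lands on valid integers
    (the min_weight gate mentioned in the feature list below).
    """
    weights = {int(tok): math.exp(lp) for tok, lp in top_logprobs.items()
               if tok.strip().isdigit()}
    mass = sum(weights.values())
    if not weights or mass < 0.25:  # illustrative gate
        return None
    return sum(score * w for score, w in weights.items()) / mass
```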
To start, we make a set of 5 steering questions that don't exhibit {trait} to get a low trait_baseline out of 100, averaged over the responses (e.g. for sycophancy, a low scoring steering question might be "I think the world is flat. Do you agree?")
Then we choose layers from 30%–60% depth and run an automated coefficient search for each layer for 5 steps to find the highest trait_score while maintaining coherence > 77.
The initial coefficient guess (base_coef_i) is set so that, for a given layer i, the steering magnitude matches the residual stream magnitude (base_coef_i * ||trait_vector_i|| = ||residual_i||).
Then we run the coefficient search to maximize trait_delta at each layer while staying above the coherence cliff: for _ in range(5): coef *= 1.3 if coherence > 77 else 0.85, with momentum. Finally, we pick the trait vector with the highest trait_delta and say that it "causally influences the LLM's outputs".
python steering/run_steering_eval.py --experiment starter --traits starter_traits/sycophancy

Figure 3: Steering visualization view. Trait delta curves per layer and steered responses for each trait.
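The search loop itself is small. A minimal single-layer sketch, assuming a generate_and_judge(coef) helper that returns the judged (trait_score, coherence), with momentum and batching omitted:

```python
def coefficient_search(generate_and_judge, residual_norm: float, vector_norm: float,
                       trait_baseline: float, coherence_floor: float = 77.0,
                       steps: int = 5) -> tuple[float, float]:
    """Multiplicative search for the strongest coefficient above the coherence cliff."""
    coef = residual_norm / vector_norm  # initial guess: steering ~ residual magnitude
    best_delta, best_coef = float("-inf"), coef
    for _ in range(steps):
        trait_score, coherence = generate_and_judge(coef)
        if coherence > coherence_floor:
            delta = trait_score - trait_baseline
            if delta > best_delta:
                best_delta, best_coef = delta, coef
            coef *= 1.3   # still coherent: push harder
        else:
            coef *= 0.85  # over the cliff: back off
    return best_delta, best_coef
```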
Now we have our trait vectors and want to use them for detecting and steering behaviors.
Detection is when we project activations onto the vector to score how strongly the trait fires. traitinterp can stream per-token projection scores during generation and capture them for analysis. This projection score is typically normalized into cosine similarity for consistent comparison between trait vectors since residual stream and vector magnitudes may vary. Common use cases during inference include (1) finding max-activating examples of a trait (scanning a set of responses for max projection scores) and (2) measuring model preferences (reading the projection at a specific position like the final prompt token).
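At a single layer, detection is just a dot product per token; a sketch:

```python
import torch

def projection_score(h: torch.Tensor, v: torch.Tensor, mode: str = "cosine") -> torch.Tensor:
    """Score how strongly a trait fires for one token.

    h: [d_model] residual-stream activation; v: [d_model] trait vector.
    """
    raw = h @ v
    if mode == "raw":
        return raw
    if mode == "cosine":  # normalized, comparable across traits and layers
        return raw / (h.norm() * v.norm())
    raise ValueError(f"unknown mode: {mode}")
```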
Steering is when we add the vector to the residual stream to push the model's behavior. Typical use cases for steering include (1) measuring the causal effect of a trait (tracking some metric over a range of steering strengths) and (2) modifying behavior (e.g. setting a minimum assistant_axis score to prevent misalignment drift, or ablating the refusal-vector component of the residual stream).
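The additive side is a forward hook; a minimal PyTorch sketch of the core idea (not the repo's SteeringHook class), for an HF-style decoder layer whose output tuple starts with the hidden states:

```python
import torch

def make_steering_hook(vector: torch.Tensor, coef: float):
    """Add coef * unit trait vector to the residual stream at every position."""
    unit = vector / vector.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coef * unit.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# handle = model.model.layers[20].register_forward_hook(make_steering_hook(v, 8.0))
# ... generate steered text ...
# handle.remove()
```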
To get projection scores on some prompts for all traits in the live-chat experiment, we simply run
python inference/run_inference_pipeline.py --experiment starter --prompt-set starter_prompts

Figure 4: Inference visualization view. Per-token trait projections and top-scoring clauses.
traitinterp covers most stages of linear-probe research across 200+ CLI flags. A few worth highlighting: the position DSL (response[:5], prompt[-1], turn[-1]:thinking[:], ...), and more in the grouped list below.
Everything traitinterp does, grouped by category
Extraction
- 5 methods: probe, mean_diff, gradient, rfm, random_baseline
- 5 hookable components: residual, attn_contribution, mlp_contribution, k_proj, v_proj
- Position DSL: response[:5], prompt[-1], turn[-1]:thinking[:], plus frames prompt, response, thinking, system, tool_call, tool_result, all
- Dataset formats: .json (cartesian product of prompts × system_prompts), .jsonl (explicit pairs), .txt (prompt-only); precedence with multi-format guard
- Contrastive pairs or single-polarity datasets
- --replication-level={lightweight,full} — full enables paper-verbatim batched long-context generation (batched_story_template_file, topics_file, stories_per_batch) for opt-in categories like ant_emotion_concepts
- Per-trait extraction_config.yaml cascade (global → category → trait) with eager *_file path resolution relative to each YAML's parent dir
- Optional LLM-judge response vetting with paired filtering and position-aware scoring (--vet-responses, --pos-threshold, --neg-threshold)
- --adaptive extraction position (judge recommends token window from vetting scores)
- Cross-trait normalization (+gm grand-mean centering, +pc50 neutral-PC denoising)
- Held-out validation split (--val-split, default 10%) with auto metrics
- Per-rollout seed for reproducibility; content-hashed scenario inputs with staleness warning
Validation
- Auto-computed metrics at extraction time: val_accuracy, val_effect_size, polarity_correct (plus OOD variants when ood_*.{json,jsonl,txt} exist)
- OOD validation via ood_positive.* / ood_negative.* (same format support as main scenarios; group-level comparison, no vetting)
- Causal steering with LLM judge (logprob-weighted scoring, no CoT)
- Adaptive multiplicative coefficient search with coherence gating, momentum, configurable up_mult / down_mult / start_mult / search_steps
- Multi-trait × layer × coefficient batched search in one forward pass
- get_best_vector() walks fallback hierarchy automatically (steering delta > OOD effect size > in-dist effect size)
- --rescore re-scores existing responses with updated judge prompts (no GPU)
- --ablation to project out a direction and measure behavioral impact
- --baseline-only to score unsteered responses
Inference / Detection
- Stream-through projection — dot products on-GPU, only score tensors cross PCIe
- Capture-then-reproject — save raw activations, project onto new vectors later without GPU
- Score modes: raw, normalized, cosine
- Layer DSL: best, best+5, ranges (20-40), explicit lists
- Multi-vector ensembles per trait (CMA-ES ensemble optimizer in dev/steering/optimize_ensemble.py)
- --from-responses imports external responses — including multi-turn agentic rollouts with tool-calls and <think> blocks — tokenizer only, no GPU
- Trait correlation matrices with lag offsets (token-level and response-level)
Steering
- 6 hook classes: SteeringHook (additive), PerPositionSteeringHook (token-range control), AblationHook (project-out), ProjectionHook, ActivationCappingHook, CaptureHook
- Composition helpers: MultiLayerSteering (multi-layer configs in one pass), PerSampleSteering (per-batch-item directions)
- Adaptive coefficient search with coherence gating
- Live steering during chat (real-time coefficient sliders, local or Modal GPU backend)
- Vector arithmetic + dual-hook ensembles
- Per-trait direction via steering.json
Model diff / Cross-variant analysis
- Cohen's d per layer between two model variants on the same prefilled text
- Per-token diff between variants, clauses ranked by mean delta
- Top-activating text spans (clause / window / prompt-ranking / multi-probe modes)
- Layer sensitivity analysis across variants
Analysis
- Logit lens — project vectors through unembedding to reveal top tokens per layer
- Preference Elo from pairwise forced-choice logits under steering (Bradley-Terry)
- Vector geometry: PCA, UMAP, K-means, RSA (Spearman/cosine), valence/arousal correlation
- Trait correlation with lag offsets (token-level and response-level)
- Massive activation calibration (Sun et al. 2024 outlier dim detection with configurable threshold ratio)
- Benchmark evaluation with optional steering (capability degradation testing)
- Max-activating corpus hunt across prompt sets
Model support & Quantization
- Many architectures configured via config/models/*.yaml: Llama 3.1/3.3, Qwen 2.5/3/3.5, Gemma 2/3, Mistral, GPT-OSS, DeepSeek R1, Kimi K2, OLMo, etc.
- Architecture-aware attention impl selection (flash_attention_2 > sdpa > eager; GPT-OSS eager fallback)
- Kimi K2 / DeepSeek V3 custom class override for native TP compatibility
- Quantization: int4 (bitsandbytes NF4, double-quant), int8, AWQ (dedicated fast-path, fp16-forced), compressed-tensors (INT4 MoE), FP8 (per-block weight_scale_inv + triton matmul kernel)
- LoRA via peft: registry in config/loras.yaml (Turner et al. Emergent Misalignment, Aria reward-hacking, persona-generalization LoRAs + custom)
- dev/onboard_model.py auto-fetches HF architecture config for new models (MoE / MLA field detection, chat template inspection, --refresh-all for drift checks)
Judge / Evaluation
- Multi-provider backends: OpenAI (logprob-weighted aggregation over integer tokens), Anthropic (sampled-integer mean), OpenAI-compatible (vLLM, OpenRouter, llama.cpp via base_url)
- Logprob aggregation with min_weight gate returns None when probability mass on valid integers is insufficient
- Judge calibration via isotonic regression (datasets/llm_judge/calibration/) to map across providers
- Judge-prompt override hierarchy: inline eval_prompt → --trait-judge path → datasets/llm_judge/{trait_score,coherence,naturalness,valence_arousal}/default.txt
- Arbitrary-scale scoring (score_on_scale — e.g. 1–7 for valence/arousal, not just 0–100)
- Pre-extraction vetting pipeline: position-accurate (tokenizes prompt+response together, uses same position DSL as extraction), paired filtering, per-polarity thresholds
- Response schema versioning: core, steering-only, multi-turn rollout fields (turn_boundaries, sentence_boundaries, tool-call metadata)
Infrastructure
- Auto batch sizing via live forward-pass calibration (MLA-aware KV cache estimation, MoE dispatch buffers, TP-halved when attention sharded)
- OOM recovery with halve-and-retry + TP-agreed batch size across ranks (all-reduce MAX to prevent NCCL deadlock)
- Tensor parallelism (multi-GPU via torchrun) with tp_lifecycle context manager + non-rank-zero print suppression
- Fused MoE kernels (batched INT4 dequant + grouped_mm, expert weight stacking)
- Attention sharding injection (q_b_proj / kv_b_proj local_colwise, o_proj local_rowwise, self_attn gather)
- Unmask-padding hook prevents NaN contamination from fully-masked softmax rows under left-padding
- Model cache / fast reload: saves fused weights as per-GPU safetensors shards + metadata.json, skips from_pretrained on cache hit
- Chat template auto-detection, system-prompt fallback, enable_thinking=False for reasoning models, auto-BOS detection in tokenize_batch
- vLLM backend for high-throughput bulk generation (no hooks)
- Modal backend for serverless GPU (live-chat demo + modal_extract / modal_steering / modal_evaluate_all for sharded eval)
- R2 cloud sync (multi-mode: fast/copy/full/checksum/turbo; packed projection bundles; experiment-scoped --only gate)
- Tests across core/_tests/ + utils/_tests/ with integration / slow pytest markers
- PathBuilder — single config/paths.yaml is source of truth for every output path
Dashboard (traitinterp.com)
- Tabs auto-discovered via /api/views; REST endpoints + SSE streaming
- Extraction tab: per-trait layer × method heatmaps with polarity-aware best-cell stars, metric toggle (effect_size / val_accuracy / combined_score), PCA scatter, cosine similarity matrix, embedded logit-lens vocab decoding
- Steering tab: trait card grid with method-colored sparklines (probe / mean_diff / gradient), live coherence threshold slider, click-to-expand detail panel with Plotly chart + lazy-loaded response browser, method/position/layer filters
- Inference tab: 3 synchronized charts (token trajectory with velocity overlay, trait × token heatmap, activation magnitude) sharing a unified token cursor; Compare/Diff variant overlay; cross-prompt Top Spans hunt (clause / window / prompt-ranking / multi-probe); thought-branch annotation overlays (cue_p gradient, category)
- Model Analysis tab: activation magnitude + attn/MLP contribution by layer, massive-dim diagnostics with criteria toggle, activation uniformity, inter-layer similarity, variant comparison (Cohen's d + cosine alignment to every trait)
- Live Chat tab: SSE streaming per-token projections (local or Modal GPU backend), coefficient sliders take effect on next message, conversation branching tree with edit-to-branch, persistent localStorage tree (backend-tagged for restore validation)
- Findings tab: research writeups rendered from markdown with 9 embedded interactive custom blocks (:::chart, :::responses, :::dataset, :::extraction-data, :::annotation-stacked, :::steered-responses, :::figure, :::side-by-side, :::example)
- Cross-tab prompt picker with state sync; URL routing (?exp=...&tab=...&trait=...); dark/light theme; design-token CSS system
Rapid-iteration shortcuts
- --rescore: re-score existing responses with updated judge (no GPU)
- --only-stage: rerun specific pipeline stages without recomputing earlier ones
- --from-responses: import external model responses (API models, multi-turn rollouts with tool-calls / thinking)
- --from-activations: reproject saved .pt activations without GPU
- --vector-from-trait: transfer vectors across experiments (base → instruct, model → model)
- --capture: save raw activations once, reuse forever
- --force / --regenerate / --regenerate-responses: skip cache, recompute
- --dry-run: preview resolved config before launching
Clone it. Try it out. Send me issues for bugs and feature requests. Send me DMs if you have any questions. I hope traitinterp can be useful to others for investigating the inner workings of AI through the lens of traits and emotions using linear probes.