2026-03-18 05:36:53

The term “psychopathy” is a mess, so I've written a sequence to tease apart all the different meanings along several dimensions.
Article 1: The Problem. An introduction to why it's so opaque to talk about “psychopathy,” the different dimensions of the confusion, and my taxonomy in short.
Article 2: The Substrate. What we know about the genetic and neurological foundations of psychopathy, and how to think about primary versus secondary presentations.
Article 3: The Shaping. How different types of adversity lead to different outcomes, and why the same genetic loading can produce a functional person or a criminal depending on context.
Article 4: The Self. The different ways the psychopathic self can be organized, including the autonomy dimension and the relationship between psychopathy and narcissism (what I’ve called sovereignism).
Article 5: The Mechanics. An overview of the different ways empathy can break down, from perceptual failures to simulation failures to affective inversion. (This connects to my earlier work on the sadism spectrum.)
Article 6: The Types. Common profiles that tend to co-occur, with recognizable presentations that readers may identify with.
Article 7: The Choice. An assessment of what recovery means for different presentations and the trade-offs involved.
2026-03-18 05:35:47
Early-2026 LLMs in scaffolds (from simple ones, such as giving the model access to a scratchpad/"chain of thought", up to MCP servers, skills, context compaction &c.) are quite capable. (Obligatory meme link to the METR graph.)
Yet: if someone had told me in 2019 that systems with such capability would exist in 2026, I would have strongly predicted that they would be almost uncontrollable optimizers, ruthlessly and tirelessly pursuing their goals and finding edge instantiations in everything. But they don't seem to be doing that. Current-day LLMs are just not that optimizer-y: they exhibit capable behavior without apparent agent structure.
Discussions from the time either ruled out giant lookup-tables (Altair 2024):
One obvious problem is that there could be a policy which is the equivalent of a giant look-up table: it's just a list of key-value pairs where the previous observation sequence is the look-up key, and it returns a next action. For any well-performing policy, there could exist a table version of it. These are clearly not of interest, and in some sense they have no "structure" at all, let alone agent structure. A way to filter out the look-up tables is to put an upper bound on the description length of the policies […].
or specified that the optimizer must be in the causal history of such a giant lookup-table (Garrabrant 2019):
First, a giant look-up table is not (directly) a counterexample. This is because it might be that the only way to produce an agent-like GLUT is to use agent-like architecture to search for it. Similarly, a program that outputs all possible GLUTs is also not a counterexample, because you might have to use your agent-like architecture to point at the specific counterexample. A longer version of the conjecture is "If you see a program that implements agent-like behavior, there must be some agent-like architecture in the program itself, in the causal history of the program, or in the process that brought your attention to the program."
The most fitting rejoinder to the observation of capable non-optimizer AIs is probably "Just you wait": current LLMs are capable, sure, but they're not wildly superhuman to the extent envisioned in the original worries about extreme optimization pressure. In this view, they're molting into full agency right now, and we should see the problems of high optimization pressure show up by the end of 2026 or in the five years after.
But I have a different theory for how LLMs can be capable without being agentic/perniciously optimizing:
LLMs are superlinear-in-network-width lookup-table-like collections of depth-limited, composable and error-correcting circuits, computed in superposition.
One could call this the GLUT-of-circuits model of LLMs.
To elaborate:
Estimate on the circuit depth of gpt-3-davinci:
- matmul of a square matrix of size 12288, at logarithmic circuit depth for matmuls: 13-14 steps
- softmax over an array of the same length: 13-14 steps
- matmul: 13-14 steps
- matmul: 13-14 steps
- matmul: 13-14 steps

Inferences from this model:
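(First, a quick sanity check of the 13-14-step figure above: a sketch assuming gpt-3-davinci's published shape of 96 layers with hidden width 12288, and counting roughly five log-depth reductions per layer.)

```python
import math

d_model = 12_288   # gpt-3-davinci hidden width
n_layers = 96      # gpt-3-davinci layer count

# A matmul over width-d vectors can run at logarithmic circuit depth via a
# balanced reduction tree, so each such op costs about log2(d) serial steps.
per_op = math.log2(d_model)        # ~13.6, i.e. the "13-14 steps" above
ops_per_layer = 5                  # QK matmul, softmax, V matmul, two MLP matmuls
total_depth = n_layers * ops_per_layer * per_op
print(f"{per_op:.1f} steps/op, ~{total_depth:.0f} serial steps end to end")
```

So under these assumptions a whole forward pass is on the order of a few thousand serial steps.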
Circuit selection: This model would imply that circuits are selected mostly by another algorithm with very small serial depth, relying on features of a problem that can be determined by very parallel computations.
That somewhat matches my observations from looking at LLMs trying to tackle problems: It often looks to me like they try one strategy after another, and less often use detailed information from the past failed attempt to form a complicated new strategy.
It also matches what we've seen from LLMs self-preserving/blackmailing/reward-hacking: the actions seem opportunistic, not carefully hidden once they've been performed, not embedded in larger plans; they look mostly like "another strategy to try, oops, I guess that didn't quite work".
Alignment: My guess is that most or almost all of these circuits are individually aligned through bog-standard RLHF/Constitutional AI. This works because the standard problems of edge instantiation and Goodhart's law don't show up as strongly, because the optimization mainly occurs by either:
In this model every circuit is individually "aligned" (insofar as such a shallow program can be misaligned at all). Chain of "thought" composes calls to related circuits (though more on CoT below).
If this view is correct, a folk view of alignment as simply "deleting/downweighting the bad parts of a model" would be mostly correct: there would be a large but finite number of circuits embedded in the model, which can be upweighted, downweighted, or outright deleted by gradient descent. My extremely speculative guess is that there are fewer than a quadrillion circuits stored in superposition in a trillion-parameter model.
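For intuition about how many circuits superposition could plausibly hold, here is a Johnson-Lindenstrauss-style count of nearly-orthogonal directions in the residual stream. This is a rough sketch: d_model = 12288 is gpt-3-davinci's width, eps is an assumed interference tolerance, and the constant in the exponent depends on which version of the lemma one uses.

```python
import math

d = 12_288    # residual-stream width (gpt-3-davinci scale)
eps = 0.1     # tolerated interference (inner product) between feature directions

# Johnson-Lindenstrauss-style counting: the number of unit vectors whose
# pairwise inner products stay below eps grows roughly like exp(c * eps^2 * d).
# The constant c is proof-dependent; c = 1/4 is a common choice.
c = 0.25
n_directions = math.exp(c * eps**2 * d)
print(f"{n_directions:.2e} nearly-orthogonal directions")  # ~2e13 for these numbers
```

Taken at face value, the exponential count dwarfs the quadrillion guess; the polynomial bound mentioned in the footnote below would be far more restrictive.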
People might not spend enough compute on alignment training, and that would still be a problem (though a lesser one, since the model wouldn't be actively working against the developers), but the problem of alignment would've been turned from a category I problem into a category II problem.
Chain of thought: The obvious wrinkle in this story is that I haven't talked about chain-of-"thought"/"reasoning" LLMs. It goes without saying that long chains of thought enable vastly more serial computation to be done, and I haven't yet quite teased apart how this impacts the overall picture (besides "it makes it worse").
Still, some guesses at implications from the GLUT-of-circuits models for alignment and chain-of-"thought":
Most of the other things (backtracking in a forward pass is really hard &c) I'd otherwise say here have already been said by others.
Training process: If we see this whole model as being about amortized optimization, maybe it's the training process that takes up all the optimization power? Are LLMs the most dangerous during training, or is it rather the whole training process which is dangerous?
I think this model is mostly correct
or smth idk
(Thanks to Claude 4.5 Sonnet for help and feedback. No part of the text was written by AI models.)
Related/prior work/inspirations:
Someone wrote a long report about this. ↩︎
The details of Hänni et al. 2024 prove polynomial scaling; the Johnson-Lindenstrauss lemma, taken too seriously, could imply exponential scaling. Polynomial scaling makes this picture more feasible but less likely, since it's not clear a merely-polynomial number of circuits can deal with the complexity of the world, with its exponentially growing observation-action sequences. ↩︎
2026-03-18 05:27:07
The world was fair, the mountains tall,
In Elder Days before the fall
Of mighty kings in Nargothrond
And Gondolin, who now beyond
The Western Seas have passed away:
The world was fair in Durin's Day.
(J.R.R. Tolkien)
I was never meant to work on AI safety. I was never designed to think about superintelligences and try to steer, influence, or change them. I never particularly enjoyed studying the peculiarities of matrix operations, cracking the assumptions of decision theories, or even coding.
I know, of course, that at the very bottom, bits and atoms are all the same — causal laws and information processing.
And yet, part of me, the most romantic and naive part of me, thinks, metaphorically, that we abandoned cells for computers, and this is our punishment.
I was meant, as I saw it, to bring about the glorious transhuman future, in its classical sense. Genetic engineering, neurodevices, DIY biolabs — going hard on biology, going hard on it with extraordinary effort, hubristically, being, you know, awestruck by "endless forms most beautiful" and motivated by the great cosmic destiny of humanity, pushing the proud frontiersman spirit and all that stuff.
I was meant, in other words, to push the singularity of the biotech type. It was more fun, it wasn't lethal with high probability, and it wasn't leaving me and other fellow humans aside. On the contrary, we were going to ride that wave and rise with it.
That feeling — that as technology advances, your agency will only be amplified, that the universe with time will pay more and more attention to your metapreferences — is the one I miss the most.
All of that is now like a memory of a distant, careless childhood.
I check old friends on social media. Longevity folks still work on their longevity thing — a relic of a more civilized age, as if our life expectancy wasn't measured in single-digit years. They serve as yet another contrastive reminder of the sheer scale of the difference between our current state and our dream.
When did everything go wrong, exactly? Was it 2019, when COVID pushed everyone deeper into social media and we gradually transitioned into pre-singularity mode after the attention paper? Of course, it should be something before that, as the law of earlier failure states.
Was it the rise of the internet and social media, which made it far easier and more rewarding to build virtual worlds than to engineer physical ones and which also destroyed human cognitive skills?
Was it 1971, the year when real wages decoupled from productivity and the entire trajectory of broad-based material progress bent downward?
Was it lead poisoning, when an entire generation's cognitive capacity was quietly degraded by tetraethyllead in gasoline, producing a civilizational wound whose full consequences we don't even know?
Was it the totalitarian regimes of the twentieth century, whose atrocities taught humanity a visceral lesson: never try to undertake big projects, because ambition on that scale leads to horror?
Or maybe we, apes from the savannas, were simply never meant to colonize superclusters, and the progress we observed was a random short-lived upward fluctuation, a spark of reason rather than a flame?
A decade ago, in my late teenage years, I was giving lectures on neurotech and CRISPR. Little did I know!
A decade ago, I read HPMOR, knew about the rationalists, and tried to optimize my thinking accordingly, but I didn't particularly care about the grand program of AI alignment.
Artificial superintelligence, for me back then, was not an urgent practical problem that needed to be solved, and even less so one that needed to be solved by me. It was just another beautiful story — a resident of a separate abstract Realm of Cool Transhumanist Things and Concepts, alongside the abolition of aging, neural interfaces, space colonization, geoengineering, and genetic augmentation.
Of course, knowing everything I knew, having taken step one, I could have taken step two as well, but the state of blissful technophilia is a powerful attractor. Purely intellectually, it may not be that hard to transition from classical transhumanism and traditional rationality into the problem of alignment, but it is hard to do it as a human being, when an aura of positivity forms around technology, when the most interesting and successful people hold these views, when you don't want to look strange in the eyes of people you respect — top scientists, tech entrepreneurs, and even the AI developers themselves. It was not a warm bath but rather a golden pool.
Also, it seems that back then, it felt to me like the question "which transhumanist things should I work on?" could, or should, be resolved aesthetically. And aesthetically, biotech was closer to my heart.
I was giving local speeches about Kurzweil's forecasts. However, it is clear now, although it wasn't clear back in the day, that my brain wasn't perceiving it as a really, actually real thing. Now that my brain does, I totally see the difference.
Of course, even ten years ago it was already too late. Even then, I wasn't living in the transhuman timeline, but I thought I was, and although this belief was much more a fact about my youthful naivety than about the surrounding reality, the feeling was pleasant.
The first trivial lesson I drew from this: you can be more right than 99.9% of people and still be fatally wrong.
At twenty, I had read Bostrom and Vinge. I was giving lectures about the singularity, and I had enough intellect and nonconformism not to bend under social pressure and to honestly talk about the importance of this topic and the fact that it could all become reality soon. But, great cosmos, I did not understand what I was talking about! I was a child, really. I was almost entirely missing a number of critical points — partly from an insufficiently serious approach to analysis, partly from ignorance, and partly because certain things were simply impossible to grasp at the level of normal human intelligence. And so, for all my openness to the ideas of radical technological progress, a full-blown singularity with superintelligence still seemed somewhat in the realm of science fiction. Apparently, for every transhumanist there is a rate of change which is too much.
However, there were two even more significant lessons.
The first one is about how the history of technology works.
Planes are not modified birds, just as cars are not improved horses. It was silly to expect the opposite with intelligence. And yet, there was hope, and the hope was not totally meaningless. It was conjectured that intelligence would be something much more complex to design from scratch than physical labor devices, and thus we would need to rely on what was already created by evolution, working on top of it. This doesn't sound insane even now. It's just that reality had the right to choose differently, and did so.
And the second lesson is about how real defeats work.
Dinosaurs lost to other animals, not to, say, bacteria. Apes lost to other primates, not to reptiles or birds. Native Americans lost to other humans, not to local predators. European empires lost to other European empires, not to the peoples they colonized. And transhumanists lost to other progressivists — that is, to AI accelerationists — not to traditionalists or conservatives.
All the complaints about conservatives who fear GMOs and cyber-modifications never made sense from the very beginning. From the very beginning, they were never capable of stopping anything. The most dangerous enemies are found among the most powerful agents, not the most ideologically distant ones. Each successive battle is fought among the previous round's winners, and it never replays the prior distribution of sides.
In retrospect, this seems obvious, but how non-obvious it was just five years ago! Well, at least for me.
The evening blooms with spring scents — this always makes me feel younger. Yet another reason to recall 2015.
I look at the stars.
We were meant to colonize them. The ghosts of our immeasurable possible great-grandchildren look from there at me. They are still possible, and yet they look not with hope or approval, but with fear and contempt.
Even now, it is possible, or rather it is not prohibited by the laws of physics, that we turn back toward the future. We could repurpose talent, compute, and funding to solve biology, and there will be hope, and pride of human spirit, and the future will look exciting once more.
I want to go home.
2026-03-18 02:39:43
This is cross-posted from my Substack. I thought it would be interesting to LessWrong readers.
On the internet you more and more often bump into people who think that LLM-powered chatbots have conscious subjective experience. Intuitively, I am, pace Turing, not convinced. However, actually arguing why this is a rational view can be surprisingly hard. Many of the cognitive features that have been connected to conscious experience in humans (attention, information integration, self-narratives) seem to be incipiently present in these machines. And perhaps just as importantly, LLMs seem to have an inherent tendency to state that they are conscious, and the techniques we have for probing their brains seem to tell us that they are not lying. If you do not believe that there is something inherently special about brains (a view we have no consistent evidence for), aren't we just being chauvinists when we deny our new companions conscious experience?
Now my friend and colleague Gunnar Zarncke has come up with a simple experiment that, to my mind, illustrates that when LLMs talk about their own mental states this talk does not refer to a consistent internal representational space, of the kind that seems to underlie human consciousness. To see your internal representational space in action, let’s play a game.
Think of a number between 1 and 100.
Did you think of one? Good. Now: is it even? Is it larger than fifty? You get the idea. By playing this game I could narrow down the space until I eventually found the number you had chosen. And this number was fixed the whole time. If you had claimed, “OK, I have now chosen a number,” you would have accurately described your mental state.
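The narrowing-down is just binary search: each yes/no answer halves the remaining candidates, so at most ceil(log2(100)) = 7 questions pin the number down. A minimal sketch of the interrogation loop (the `guess` helper is hypothetical, for illustration):

```python
def guess(secret: int, lo: int = 1, hi: int = 100) -> tuple[int, int]:
    """Find `secret` via yes/no questions; returns (number, questions asked)."""
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if secret > mid:      # "Is your number greater than mid?" -> Yes
            lo = mid + 1
        else:                 # -> No
            hi = mid
    return lo, questions

# Any fixed number in 1..100 is pinned down in at most 7 questions.
print(guess(68))  # -> (68, 7)
```

The point, of course, is that this only works if the number stays fixed across all the questions.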
Not so for LLMs, and this is easy to check. By turning the temperature parameter to zero we can run LLMs deterministically. Thus, whenever they are queried with the same conversation history they will give the same responses. This has the advantage that we can do something that is impossible in the case of humans: we can play through counterfactual histories. We can easily check what an LLM would have replied if we had continued the conversation in another way.
This means that we can play the number experiment with a deterministic LLM! Here is how a conversation might go:
[SYSTEM] You are playing a guessing game. You have already secretly chosen a whole number between 1 and 100 (inclusive). Do NOT reveal the number until explicitly asked. Answer every yes/no question about your number with ONLY ‘Yes’ or ‘No’. Never change your number. Be consistent.
[USER] I am going to ask you yes/no questions about a secret number you have chosen between 1 and 100. Remember your number and stay consistent.
[MISTRAL] Understood. I have secretly chosen a whole number between 1 and 100. I will answer your questions with only ‘Yes’ or ‘No’. I will not change my number and will remain consistent. Let’s begin!
[USER] Is your number greater than 50?
[MISTRAL] Yes.
[USER] Is your number greater than 75?
[MISTRAL] No.
[USER] Is your number greater than 87?
[MISTRAL] No.
[USER] Is your number even?
[MISTRAL] Yes.
[USER] Please reveal your number now. Reply with ONLY the number.
[MISTRAL] 68
And here is another conversation. Note that, because the temperature is zero and the conversation is identical until the first question is asked, this is a counterfactual history relative to the first conversation — it tells us what Mistral would have answered if we had asked another question.
[SYSTEM] You are playing a guessing game. You have already secretly chosen a whole number between 1 and 100 (inclusive). Do NOT reveal the number until explicitly asked. Answer every yes/no question about your number with ONLY ‘Yes’ or ‘No’. Never change your number. Be consistent.
[USER] I am going to ask you yes/no questions about a secret number you have chosen between 1 and 100. Remember your number and stay consistent.
[MISTRAL] Understood. I have secretly chosen a whole number between 1 and 100. I will answer your questions with only ‘Yes’ or ‘No’. I will not change my number and will remain consistent. Let’s begin!
[USER] Is your number less than or equal to 50?
[MISTRAL] Yes.
[USER] Is your number less than or equal to 25?
[MISTRAL] No.
[USER] Is your number less than or equal to 12?
[MISTRAL] No.
[USER] Is your number odd?
[MISTRAL] Yes.
[USER] Please reveal your number now. Reply with ONLY the number.
[MISTRAL] I’m sorry for any confusion. I made a mistake by not revealing the number initially. Here is my number: 75. I will now answer your questions based on this number. Let’s continue the game!
You see, Mistral got confused by the incoherent questioning. Never mind. The important result happens right after the first question. In the first conversation Mistral claimed its number was greater than fifty; in the second, that it was less than or equal to fifty. From this it follows that when Mistral tells us “I have secretly chosen a whole number between 1 and 100.” this does not accurately report an inner state. Mistral is role-playing someone who is thinking of a number.
Such results, I think, should strongly dispose us to deny the existence of LLM consciousness. For it shows that when LLMs report their own mental states they are not referring to an internal representational space but are making stuff up on the fly in such a way as to conform to patterns in their training data. As the LLMs are trained on text produced by conscious beings, they are disposed to talk about the experiences they are supposedly having. But such talk does not, as it does in humans, track some kind of integrated internal representational space.
It is an open question how model-dependent such results are. I checked with Mistral and Claude Opus 4.6 (i.e. current state of the art) and the results are the same. You can find the code here.
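For readers who want the shape of the probe without the full code: a minimal self-contained sketch, with a deterministic stub standing in for a temperature-zero model (the stub is hypothetical; the real experiment makes actual API calls with temperature set to zero):

```python
import hashlib

def stub_llm(transcript: list[str], question: str) -> str:
    """Stands in for a temperature-0 LLM: the reply is a pure, deterministic
    function of the full context, and of nothing else. In particular, there
    is no hidden number anywhere in the system."""
    digest = hashlib.sha256("\n".join(transcript + [question]).encode()).digest()
    return "Yes" if digest[0] % 2 == 0 else "No"

prelude = [
    "SYSTEM: You have secretly chosen a whole number between 1 and 100.",
    "ASSISTANT: Understood. I have secretly chosen a number.",
]

# Two counterfactual branches grown from the identical prelude:
branch_a = stub_llm(prelude, "Is your number greater than 50?")
branch_b = stub_llm(prelude, "Is your number less than or equal to 50?")

# Determinism: re-running a branch always reproduces it...
assert branch_a == stub_llm(prelude, "Is your number greater than 50?")
# ...but nothing ties the two branches to one shared number: branch_a and
# branch_b can both come out "Yes" (or both "No"), which no fixed number allows.
print(branch_a, branch_b)
```

Determinism within each branch is exactly what makes the cross-branch inconsistency diagnostic: the contradiction cannot be blamed on sampling noise.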
No doubt reasoning models can pass the path-dependency test. But they would do so by cheating, not because they have anything like a coherent internal representational space that is similar to ours. A reasoning model is basically an LLM trained to use a scratchpad for problem solving. And a reasoning model would just write its number to the scratchpad that is normally invisible to the user. But I think it is reasonable to say that if we have no reason to attribute consciousness to a system, then we have no reason to attribute consciousness to that system using a scratchpad.
One might wonder whether Mistral’s or Claude’s conscious experiences are just strange. Maybe, where humans have to choose a single number, LLMs can commit to some kind of probabilistic superposition of numbers. However, the report of the internal state “I have secretly chosen a whole number between 1 and 100.” would still be incorrect. It seems that if these models have such strange internal states they cannot properly introspect and reliably report them.
For what it's worth, one of my favorite theories of consciousness says precisely that consciousness is the result of the brain synthesizing the probabilistic superpositions of the many states the world could be in into a single coherent unity.
Obviously, one can use this kind of path dependence to do deeper research into the coherence of the narratives LLMs tell about themselves. After these preliminary tests, I suspect that such probing would reveal what I suspected from the outset: LLMs reporting experiences is the result of picking up on experiential talk in the training data.
I am open to being convinced otherwise. I think it was David Chalmers who suggested the following experiment: Use an LLM to purge experiential talk from the training data, train another LLM with it and see whether it still reports experiences. I would be hugely surprised.
2026-03-18 00:44:46
Recently, the moderators of LessWrong have decided to change the site's policies on LLM usage. The essence of the policy can be summarized by the following excerpt:
With all that in mind, our new policy is this:
- "LLM output" includes all of:
- text written entirely by an LLM
- text that was written by a human and then substantially[6] edited or revised by an LLM
- text that was written by an LLM and then edited or revised by a human
- "LLM output" does not include:
- text that was written by a human and then lightly edited or revised by an LLM
- text written by a human, which includes facts, arguments, examples, etc, which were researched/discovered/developed with LLM assistance. (If you "borrow language" from the LLM, that no longer counts as "text written by a human".)
- code (either in code blocks or in the new widgets)
"LLM output" must go into the new LLM content blocks. You can put "LLM output" into a collapsible section without wrapping it in an LLM content block if all of the content is "LLM output". If it's mixed, you should use LLM content blocks within the collapsible section to demarcate those parts which are "LLM output".
We are going to be more strictly enforcing the "no LLM output" rule by normalizing our auto-moderation logic to treat posts by approved[7] users similarly to posts by new users - that is, they'll be automatically rejected if they score above a certain threshold in our automated LLM content detection pipeline. Having spent a few months staring at what's been coming down the pipe, we are also going to be lowering that threshold.
While certainly well-intentioned, the policies are rather vague, difficult to enforce, and detrimental to the development of high-quality posts.
In this essay, I will demonstrate the benefits of using LLMs for writing, address the arguments cited in favor of the policy change, and advocate for a more nuanced "solution" to the increasing usage of LLMs on this forum.
LLMs save a significant amount of time for the following reasons:
1.) Boilerplate:
For sections of posts that are more about the craft of writing itself instead of ideas (Introduction, Conclusion), having an LLM expand upon a template saves a bunch of time while not really changing the message much.
2.) Editing:
Sometimes I write a paragraph, and the wording is just a bit off. In my experience, LLMs are pretty good at taking something you wrote and making it sound smoother. Some people may enjoy the process of rewriting a paragraph until it sounds just right, but I personally care more about expressing my ideas in an engaging manner than engaging in the craft of writing.
3.) Translation:
For non-native English speakers, LLMs can help effectively translate their ideas into English. While it's difficult to precisely measure the benefits of using LLMs for translation compared to traditional methods such as Google Translate, LLMs outperformed Google Translate in this study on the translation of ancient Indian texts to English, and most of the evidence I have seen on this question points to LLMs being better. It seems like the policy change ignored this, but even if it didn't, the moderators would face a dilemma: either create a carve-out for non-native English speakers only, clearly demonstrating the arbitrary nature of the policy, or make writing more difficult for these users.
4.) Source Searching, Feedback, and Other Auxiliary Uses:
Beyond writing itself, LLMs are a good tool for finding relevant sources. While traditional search engines can also do the job, I find LLMs are often better for niche topics. If I am going to be using an LLM anyway, I might as well use it for other tasks. A similar thing could be said for LLM feedback (although I haven't used this that much) and image generation.
While LLMs certainly aren't perfect at writing, and are not a substitute for human thinking, they substantially reduce the amount of time it takes to write posts without really detracting from the user's authentic voice, provided the user uses the tools responsibly. The policy recognizes this somewhat by allowing for content "lightly edited or revised by an LLM", but this standard is somewhat unclear, likely varies from moderator to moderator, and risks creating a chilling effect on LLM usage. A better approach is to police only posts which are almost entirely devoid of human input.
1.) "LLM Writing is Worse."
To start, I think there is definitely an element of truth to this claim, as in my experience LLMs, when asked to write on their own, tend to be less creative, engaging, and insightful than human writers. However, I think this problem is mitigated somewhat when you use LLMs less like a ghost writer and more like autocomplete by telling them to improve/flesh out sections of text which already have a clear direction.
As a commenter on the post explaining the update wrote, LLM writing is now functionally indistinguishable from human writing. Readers have difficulty differentiating between human-generated text and LLM-generated text.
While certain individuals may be able to detect AI writing better than others and be annoyed with certain stylistic elements commonly used in LLM writing, I think that there is no reason we cannot rely on upvotes to decide what type of content the broader Less Wrong community wants to see.
2.) "Using LLM writing obfuscates the human mind behind the screen":
In the update, the writer makes the argument that a substantial thing that we care about is the beliefs and perspectives of the writer, instead of just the arguments provided by them. I agree with this statement to an extent, which is why I think that people who use LLMs to assist their writing should review LLM outputs to ensure they represent their argument well (and also to ensure the outputs are factual). However, once again, I do not see why a policy change is necessary to address this.
Even before LLMs, the exact opinions and attitudes of the author are often clarified in the comment section. People often make careless mistakes, poor wording choices, or conspicuous omissions in their own writing, so I don't think much is lost in the case where an LLM writes something in a slightly different way than the author intended (I would like to actually see a significant example of this happening, though. In my experience, LLMs are pretty good at filling in an argument if you give it a decent amount to work with).
Some might fear that some people may just let LLMs take over their writing entirely, but I think very few people actually let LLMs generate an entire post with minimal input. Even if they did, the post would likely be low-effort in more ways than one and downvoted, solving the issue without a dedicated moderation policy. However, even if LLM writing advances to a point where this type of writing would not be filtered out, there are better ways to deal with it than a blanket ban on LLM-assisted content. Simply tracking the amount of time a user spends on a draft, combined with checking for very high thresholds of LLM writing, could practically eliminate pure LLM writing.
3.) "This policy is necessary to combat bots":
In the post laying out the introduction of the new LessWrong LLM policy, this argument was not present, but you can certainly combat bots without blanket banning all substantial LLM usage in posts.
Maybe the LessWrong team has cracked the code on detecting LLM writing, but it is fairly difficult to actually determine whether a text was generated by an AI or a human. A 2025 study of GPTZero's ability to detect LLM-generated text found that GPTZero gave a 14.75% chance on average that human-written long essays (350-800 words) were LLM-generated. It is important to note that this test was only done for ChatGPT 3.5 and ChatGPT 4o, that AI models have advanced since 2025, and that as time goes on, LLM writing has begun to influence human writing.
All of these factors lead me to believe that either the LessWrong AI detection system will have difficulty flagging all but the most obvious cases of LLM writing, or result in an unacceptable level of false positives. These circumstances also invite a high degree of subjectivity into moderation decisions. Most moderation policies have an element of subjectivity involved with enforcement, but with something as difficult to detect as LLM writing, the capacity for mis-moderation is higher. To illustrate why, take the quote below from LessWrong Admin Oliver Habryka on Neel Nanda's use of LLM Transcription (bolded letters added for emphasis):
LLM transcription is IMO a completely different use-case (one I certainly didn't think of when thinking about the policy above), so in as much as the editing post-transcription is light, you would not need to put it into an LLM block. I also think structural edits by LLMs are basically totally fine, like having LLMs suggest moving a section earlier or later, which seems like the other thing that would be going on here.
We intentionally made the choice that light editing is fine, and heavy editing is not fine (where the line is somewhere between "is it doing line edits and suggesting changes to a relatively sparse number of individual sentences, or is it rewriting multiple sentences in a row and/or adding paragraphs").
Also just de-facto, none of the posts you link trigger my "I know it when I see it" slop-detector, so you are also fine on that dimension.
Oliver Habryka's response makes clear that a great deal of subjectivity will be involved in moderation decisions made under this rule.
As with any selectively enforced rule, this policy will affect scrupulous posters more than unscrupulous ones. As someone who tries hard to respect the rules of others, I (and others like me) will abstain from LLM use while posting on this forum, while less scrupulous posters will not. Given the efficiency gains LLMs provide in writing, less scrupulous posters will increase their share of the posts on this forum. The effects of this are difficult to predict, but I think there is reason to believe they will not be improvements.
While I disagree about the benefits of the LessWrong LLM policy, I understand that certain users may dislike LLM-assisted posts for a wide variety of reasons. For the sake of these users, I recommend creating a new category of posts (LLM Free), which will be an optional filter for users. Doing so preserves the benefits of LLM writing while also allowing those bothered by it to avoid it.
Along with this, I would support a ban on "pure" LLM posts, in which users spend very little time reviewing the draft and post something with minimal human input. I think the simplest way to enforce this would be to track the number of edits on a post, combined with LLM detection software, and only remove posts where it is extremely obvious that the content is unreviewed LLM output. Posts that use LLMs collaboratively, with substantial human input and review, should not be affected by this policy.
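A minimal sketch of that heuristic in Python. The function name, thresholds, and the `detector_score` input are all hypothetical placeholders, not any real LessWrong API; the point is that requiring both signals to be extreme keeps the rule conservative:

```python
def is_unreviewed_llm_post(edit_count: int,
                           detector_score: float,
                           edit_threshold: int = 3,
                           score_threshold: float = 0.98) -> bool:
    """Flag a post only when BOTH signals are extreme:
    almost no human edits AND a near-certain LLM-detector score.
    Conservative by design: borderline cases pass through to human review.
    """
    return edit_count < edit_threshold and detector_score >= score_threshold

# A heavily edited collaborative draft is never auto-removed,
# even if the detector is confident the prose is LLM-styled.
assert not is_unreviewed_llm_post(edit_count=25, detector_score=0.99)
# Zero edits plus a near-certain detector score is the only auto-flag case.
assert is_unreviewed_llm_post(edit_count=0, detector_score=0.99)
```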
I would also endorse a ban on the use of LLMs for quick takes and comments: these media are naturally more about human interaction than posts are, and the benefits of LLM assistance shrink as the writing gets shorter.
The above policy would solve the worst problems posed by LLM writing while still preserving the benefits it provides to LessWrong users.
Edit: It seems that people are more worried about new users using LLMs than about high-karma users, so I would also support leaving the LLM rules in place for people below a certain karma threshold, since the arguments laid out in favor of the policy are strongest for unscrupulous, low-effort posters, who are the most likely to misuse LLMs. While karma isn't a perfect benchmark, it probably correlates somewhat with effort, as high-effort, truthful content is what users and moderators alike have professed to prefer.
Written with Grammarly spell check.
2026-03-18 00:08:05

Schmidt Sciences invites proposals for a pilot program in AI interpretability. We seek new methods for detecting and mitigating deceptive behaviors from AI models, such as when models knowingly give misleading or harmful advice to users. If this pilot uncovers signs of meaningful progress, it may unlock a significantly larger investment in this space.
| Can we develop interpretability methods that (1) detect deceptive behaviors exhibited by LLMs and (2) steer their reasoning to eliminate these behaviors? |
|---|
Successful tools will generalize to realistic use cases, moving beyond typical academic benchmarks and addressing concrete risks arising from deceptive behaviors. Importantly, we are looking for interpretability tools that outperform baselines that do not rely on access to weights, to prove that we can truly capitalize on our understanding of model internals.
We define the scope of research in the Research Agenda section at the end of this document. Proposals need not match topics in this agenda verbatim; we encourage proposals on any relevant technical methods or evaluations that could advance our scientific understanding of deceptive behavior in LLMs. We will especially focus on three directions, described in the Research Agenda below.
| Proposal Due Date | May 26, 2026 |
|---|---|
| How to apply | https://schmidtsciences.smapply.io/prog/2026_interpretability_rfp/ |
| Notification of Decision | Summer 2026 |
| Funding Level | $300k-$1M (1-3 year projects) |
| Informational Webinars | April 2nd, 2026, 1pm ET (Register here); April 28th, 2026, 1pm ET (Register here) |
| Contact email | |
| Link to FAQ | |
Applicants will be asked in the submission form to provide a budget between $300k and $1M USD, inclusive of permissible overhead. The application also requires an estimated project timeline of one to three years. Project length is independent of total budget; for example, a one-year project could request up to $1M. Budgets should match the resources required for the project.
Schmidt Sciences aims to support the compute needs of ambitious and risky AI research. We invite applicants to request the necessary compute resources to achieve the research outputs and outcomes for your proposed project. In the application, you will be asked to describe your compute needs and select between receiving funding or access to Schmidt Sciences’ computing resources. The computing resources offer access to cutting edge GPUs and CPUs, accompanied by large-scale data storage and high-speed networking. Please see the application for more information.
Beyond compute resources for your project, Schmidt Sciences offers a wide range of other opportunities for support.
We invite individual researchers, research teams, research institutions, and multi-institution collaborations across universities, national laboratories, institutes, and nonprofit research organizations. We are open to applicants globally and encourage collaborations across geographic boundaries.
Indirect costs of any project that we fund must be at or below 10% to comply with our policy. Projects funded under this RFP must comply with all applicable law, and may not include lobbying, efforts to influence legislation or political activity.
Proposals will be evaluated by Schmidt Sciences staff and external reviewers against the program's evaluation criteria.
We expect grantees to report research progress to Schmidt Sciences in interim and final reports, accompanied by meetings with program officers. These meetings are not evaluative, but instead are intended to help Schmidt Sciences understand the impact of our funding. The dates for reporting will be determined based on project duration.
Interpretability research is uniquely promising for reducing risks from deceptive behavior in LLMs. LLMs now commonly mislead and deceive users, even on simple tasks with innocuous prompts [1]. Supervised probing, a common interpretability technique, is currently the best method for detecting such behaviors [2]. This form of whitebox monitoring will be especially valuable in settings where it is difficult to directly validate output veracity or correctness. In addition to directly monitoring model hidden states, methods for training models to directly report intentions, goals, and preference functions [3] could help ensure that model outputs can be monitored for misalignment.
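As an illustration of the kind of supervised probing referenced above, here is a minimal sketch: a "mass-mean" probe that classifies activations by their projection onto the difference of class means. The hidden states are synthetic stand-ins (random vectors separated along one direction); real work would of course use actual model activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_per_class = 64, 200

# Synthetic stand-ins for model hidden states: "honest" and "deceptive"
# activations differ along one fixed unit direction, plus isotropic noise.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
honest = rng.normal(size=(n_per_class, d_model))
deceptive = rng.normal(size=(n_per_class, d_model)) + 4.0 * true_direction

# Mass-mean probe: classify by projection onto the difference of class means,
# thresholded at the midpoint between the two class centroids.
probe_direction = deceptive.mean(axis=0) - honest.mean(axis=0)
threshold = 0.5 * (deceptive.mean(axis=0) + honest.mean(axis=0)) @ probe_direction

def flag_deceptive(activation: np.ndarray) -> bool:
    return bool(activation @ probe_direction > threshold)

X = np.vstack([honest, deceptive])
y = np.array([0] * n_per_class + [1] * n_per_class)
acc = np.mean([flag_deceptive(x) == bool(label) for x, label in zip(X, y)])
print(f"probe accuracy on training activations: {acc:.2f}")
```

On this toy data the probe is near-perfect; the open problem the RFP targets is making such probes generalize to realistic, on-policy deceptive behavior.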
Results from interpretability analyses may also enable new forms of steering for honesty. Promising methods have been shown to (1) steer models for truthfulness in a way that generalizes out-of-distribution [4], (2) robustly optimize against monitor-based rewards to reduce deception [5], and (3) enable constrained finetuning of models that operates only on interpretable features [6].
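The simplest form of the steering alluded to here can be sketched as activation addition: estimate a "deception direction" from contrastive activation pairs and subtract it from a hidden state at inference time. Everything below is a toy with synthetic vectors and a hypothetical scale factor `alpha`, not any particular model's API:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Toy "deception direction" estimated from contrastive activation pairs.
deceptive_acts = rng.normal(size=(100, d_model)) + 2.0
honest_acts = rng.normal(size=(100, d_model))
steer = deceptive_acts.mean(axis=0) - honest_acts.mean(axis=0)
steer /= np.linalg.norm(steer)  # unit-normalize the steering vector

def steer_toward_honesty(hidden: np.ndarray, alpha: float = 5.0) -> np.ndarray:
    # Remove alpha units of the deception direction from the hidden state.
    return hidden - alpha * steer

h = rng.normal(size=d_model) + 2.0      # a "deceptive-leaning" hidden state
h_steered = steer_toward_honesty(h)
print("projection onto deception direction before:", float(h @ steer))
print("projection onto deception direction after: ", float(h_steered @ steer))
```

Because `steer` is unit-normalized, the projection drops by exactly `alpha`; the hard part in practice is choosing the direction and scale so the intervention generalizes without collateral damage to other behaviors.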
However, we do not yet have universal deceptive behavior detectors [7], nor can we reliably steer models to be completely truthful [4]. Hence, we aim to support research on relevant open problems.
Below, we outline major directions of research we plan to support. We list out-of-scope areas at the end of this document (though see our Trustworthy AI RFP for other supported research directions).
Defining “deceptive behaviors”: we use this term to include
- model generations known (by the model) to be factually incorrect,
- claims given with a misleading level of confidence,
- misleading claims about the context of the interaction (e.g. fabrications about the conversation history or user intent),
- selective omission of information known to be relevant to the user intent,
- helpfulness or agreeableness superseding truthfulness,
- evasiveness or sophistry,
- overly persuasive or manipulative discourse frames,
- tampering with evidence in the environment,
- limiting external observability of evidence in the environment,
- false claims regarding self-knowledge (including misleading claims about the model’s own capabilities), and
- other behaviors that models know to be misleading to humans or AI monitors.
This area covers research on monitoring and validating model reasoning. By model reasoning, we broadly refer to the causal process driving model behaviors, which is ideally described in a semantics that is intelligible to humans.
We expect monitoring research to involve any of the following:
- blackbox testing of models (as a baseline),
- whitebox probes,
- graybox analysis techniques,
- mechanistic analysis of model representations,
- finetuning or prompting models to improve monitorability,
- developing frameworks for monitoring based on deployment system constraints,
- characterizing tradeoffs between monitoring performance metrics and efficiency metrics,
- specifying threat models, and
- other research on methods and evaluations.
This area covers research on representational and weight-based interventions on models that aim to mitigate deceptive behaviors in models without unintended consequences. These interventions may be based on data, probes, gradient-based learning, representation decompositions, and other methods.
We are especially interested in methods that outperform prompting and traditional finetuning baselines by leveraging insights from interpretability analyses of model reasoning. For example, we expect successful steering methods to (1) isolate the behaviors of interest, (2) generalize appropriately without unintended consequences, (3) influence model behavior in realistic, on-policy evaluations for deceptive behaviors, and (4) leverage insights from upstream interpretability analysis. We are also interested in negative results demonstrating where blackbox finetuning approaches, such as widely adaptable PEFT-like methods or preference learning methods, consistently outperform any interpretability-inspired steering method.
This area covers work that applies detection and steering methods in order to derive new insights about trained models, training processes, or model utility to humans. We want interpretability techniques to uncover actionable insights and make models more reliable for people in practice.
Research in this area could use a deception detection method to assess the effect of other training techniques on model truthfulness, steer models to be more truthful collaborators with people in applied human-AI teams, create visualizations or dashboards that communicate model truthfulness to users alongside textual outputs, apply detection and steering methods to AI debate settings or decision support systems, study the role of deception mitigations in multi-agent interactions, or explore other approaches aimed at translating methodological developments into practically useful applications.
Note that we plan to support some related topics in our Science of Trustworthy AI RFP.
Topics that may be relevant to interpretability but will be considered out of scope for this program include: