RSS preview of Blog of LessWrong

The Psychopathy Spectrum

2026-03-18 05:36:53


The term “psychopathy” is a mess, so I've written a sequence to tease apart all the different meanings along several dimensions.

Article 1: The Problem. An introduction to why it's so opaque to talk about “psychopathy,” the different dimensions of the confusion, and my taxonomy in short.

Article 2: The Substrate. What we know about the genetic and neurological foundations of psychopathy, and how to think about primary versus secondary presentations.

Article 3: The Shaping. How different types of adversity lead to different outcomes, and why the same genetic loading can produce a functional person or a criminal depending on context.

Article 4: The Self. The different ways the psychopathic self can be organized, including the autonomy dimension and the relationship between psychopathy and narcissism (what I’ve called sovereignism).

Article 5: The Mechanics. An overview of the different ways empathy can break down, from perceptual failures to simulation failures to affective inversion. (This connects to my earlier work on the sadism spectrum.)

Article 6: The Types. Common profiles that tend to co-occur, with recognizable presentations that readers may identify with.

Article 7: The Choice. An assessment of what recovery means for different presentations and the trade-offs involved.



Discuss

LLMs as Giant Lookup-Tables of Shallow Circuits

2026-03-18 05:35:47

Early 2026 LLMs in scaffolds, from simple ones such as giving the model access to a scratchpad/"chain of thought" up to MCP servers, skills, context compaction, &c, are quite capable. (Obligatory meme link to the METR graph.)

Yet: If someone had told me in 2019 that systems with such capability would exist in 2026, I would have strongly predicted that they would be almost uncontrollable optimizers, ruthlessly & tirelessly pursuing their goals and finding edge instantiations in everything. But they don't seem to be doing that. Current-day LLMs are just not that optimizer-y; they appear to have capable behavior without apparent agent structure.

Discussions from the time either ruled out giant lookup-tables (Altair 2024):

One obvious problem is that there could be a policy which is the equivalent of a giant look-up table: it's just a list of key-value pairs where the previous observation sequence is the look-up key, and it returns a next action. For any well-performing policy, there could exist a table version of it. These are clearly not of interest, and in some sense they have no "structure" at all, let alone agent structure. A way to filter out the look-up tables is to put an upper bound on the description length of the policies […].

or specified that the optimizer must be in the causal history of such a giant lookup-table (Garrabrant 2019):

First, a giant look-up table is not (directly) a counterexample. This is because it might be that the only way to produce an agent-like GLUT is to use agent-like architecture to search for it. Similarly, a program that outputs all possible GLUTs is also not a counterexample, because you might have to use your agent-like architecture to point at the specific counterexample. A longer version of the conjecture is "If you see a program implements agent-like behavior, there must be some agent-like architecture in the program itself, in the causal history of the program, or in the process that brought your attention to the program."

The most fitting rejoinder to the observation of capable non-optimizer AIs is probably "Just you wait"—current LLMs are capable, sure, but they're not wildly superhuman to an extent comparable to the original worries about extreme optimization pressure. In this view, they're molting into full agency right now, and we should see the problems of high optimization pressure show up by the end of 2026 or the five years after [1] if they're not hidden from us by deceptive AIs. Indeed, current LLMs do reward-hack, though the developers have been decent at suppressing the tendency down to a consumer-acceptable level.

But I have a different theory for how LLMs can be capable without being agentic/perniciously optimizing:

LLMs are superlinear-in-network-width lookup-table-like collections of depth-limited, composable and error-correcting circuits, computed in superposition.

One could call this the GLUT-of-circuits model of LLMs.

To elaborate:

  1. "lookup-table-like": A common issue in proving theorems about agent structure is how to avoid including giant lookup tables where each sequence of previous percepts and actions is matched to an optimal next action. Such a giant lookup table is infeasible in the real world, but I think something similar to one might be possible, namely through computation in superposition.
  2. "circuits": Turing machines aren't a great model for the type of computation that neural networks use (namely because they assume an infinite tape and arbitrary serial depth); a slightly better one is circuits, which have limited depth, though in computer science circuits are usually defined for booleans or integers.
  3. "computed in superposition": Current LLMs represent features in superposition, that is, they exploit the fact that high-dimensional spaces can have exponentially many almost-orthogonal vectors relative to the number of dimensions (a consequence of the Johnson-Lindenstrauss lemma). The idea of computation in superposition generalizes this observation to cram more computation into a neural network.
  4. "superlinear-in-network-width": Computation in superposition allows for running many circuits at the same time. E.g., Hänni et al. 2024 construct (shallow) neural networks that are able to emulate a sparse [2] boolean circuit of substantially greater width (Theorem 8).
    1. This result gestures at the possibility that one can put a superlinear number of circuits [3] into a single neural network, making the neural network more of a lookup-table than an example of general-purpose search.
  5. "depth-limited": A forward pass in SOTA language models has a limited number of steps of serial computation; gpt-3-davinci had 96 layers, which results in a circuit depth of 78-83 steps per layer, implying roughly 7,500-8,000 serial steps per forward pass. Current LLMs might look different due to sparse attention and other developments in architecture, not to speak of increased model size, but I'd guess that current frontier models don't have a circuit depth of more than 20,000, unless I really messed something up in my calculation. Claude Sonnet 4.5 estimates that Kimi K2.5 Thinking has slightly fewer than 5,000 serial steps of computation, mostly due to being only 61 layers deep.
  6. "composable": If the neural network just contained an unrelated collection of circuits, the network wouldn't be effective at solving difficult math problems. Instead, my best guess is that the circuits are selected by reinforcement learning, especially RLVR, to be composable, that is, each circuit takes inputs of the same type as the outputs of other circuits in the network. (Maybe this explains some of the "sameness"/"slop" factor of LLM outputs: the "semantic type" has to match?)
  7. "error-correcting": If a forward pass executes many circuits in superposition, there will be some interference between individual circuits, so circuits will be selected to be either robust to small errors or error-correcting. This is oddly similar to results from singular learning theory, which imply (very roughly) that Bayesian inference selects for error-correcting programs. I don't know what implications this would have.
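As a toy illustration of the superposition point above (dimensions chosen arbitrarily for illustration), a quick numpy check shows that random directions in a high-dimensional space have tiny pairwise overlap, which is what lets a network pack far more feature directions than it has dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512      # residual-stream dimension (illustrative)
n = 20_000   # far more "feature" directions than dimensions

# Random unit vectors in d dimensions
features = rng.standard_normal((n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Typical overlap between random pairs is on the order of 1/sqrt(d), i.e. tiny
idx = rng.integers(0, n, size=(5_000, 2))
dots = np.abs(np.einsum("ij,ij->i", features[idx[:, 0]], features[idx[:, 1]]))
print(float(dots.mean()))  # small, on the order of 0.03-0.05
```

This only demonstrates near-orthogonal storage, not the stronger claim of computation in superposition, but it shows why exponentially many almost-orthogonal feature directions can coexist.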

Estimate on the circuit depth of gpt-3-davinci:

  1. Self-attention
    1. Q, K, V projections: single step
    2. Compute QK^T: matmul of square matrices with size 12288; at logarithmic circuit depth for matmuls we get 13-14 steps
    3. softmax over an array of length n has depth O(log n) (summation of a list can be done via a binary tree); for a vector of size 2048 we get 11 steps
    4. Multiply by V: one matmul, 13-14 steps
    5. Output projection: another matmul, 13-14 steps
  2. Feed-forward
    1. First linear layer: one matmul, 13-14 steps
    2. GELU: single step
    3. Second linear layer: another matmul, 13-14 steps
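The estimate above can be reproduced in a few lines; this sketch uses ceil(log2 d) for the log-depth matmuls (the upper end of the 13-14 range; using floor instead gives the lower bound of 78 steps per layer):

```python
import math

d_model = 12288  # gpt-3-davinci hidden size
n_ctx = 2048     # attention window used in the softmax estimate
layers = 96

matmul = math.ceil(math.log2(d_model))  # log-depth matmul: 14 steps
softmax = math.ceil(math.log2(n_ctx))   # binary-tree summation: 11 steps

# projections + QK^T + softmax + (*V) + output projection
attention = 1 + matmul + softmax + matmul + matmul
# linear + GELU + linear
ffw = matmul + 1 + matmul

per_layer = attention + ffw
print(per_layer, layers * per_layer)  # 83 7968
```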

Inferences from this model:

Circuit selection: This model would imply that circuits are selected mostly by another algorithm with very small serial depth, relying on features of a problem that can be determined by very parallel computations.

That somewhat matches my observations from looking at LLMs trying to tackle problems: It often looks to me like they try one strategy after another, and less often use detailed information from past failed attempts to form a complicated new strategy.

It also matches what we've seen from LLMs self-preserving/blackmailing/reward-hacking: The actions seem opportunistic, not carefully hidden once they've been performed, not embedded in larger plans; they look mostly like "another strategy to try, oops, I guess that didn't quite work".

Alignment: My guess is that most or almost all of these circuits are individually aligned through bog-standard RLHF/Constitutional AI. This works because the standard problems of edge instantiation and Goodhart's law don't show up as strongly, because the optimization mainly occurs by either:

  1. Selecting one circuit from the giant lookup table that is all the circuits in superposition in the LLM
  2. Running many circuits in superposition and selecting the best result or aggregating the best results.

In this model every circuit is individually "aligned" (insofar as such a shallow program can be misaligned at all). Chain of "thought" composes calls to related circuits (though more on CoT below).

If this view is correct, a folk view of alignment as simply "deleting/downweighting the bad parts of a model" would be mostly correct: There would be a large but finite number of circuits embedded in the model, which can be upweighted, downweighted or outright deleted by gradient descent. My extremely speculative guess is that there are fewer than a quadrillion circuits stored in superposition in a trillion-parameter model, which thorough-enough safety training could exhaustively or near-exhaustively check and select. In this view, AI alignment really would be purely bottlenecked on the amount of computation spent on whac-a-moling unaligned circuits.

People might not spend enough compute on alignment training, and that would still be a problem (though a lesser one, since the model wouldn't be actively working against the developers), but the problem of alignment would've been turned from a category I problem into a category II problem.

Chain of thought: The obvious wrinkle in this story is that I haven't talked about chain-of-"thought"/"reasoning" LLMs. It goes without saying that long chains of thought enable vastly more serial computation to be done, and I haven't yet quite teased apart how this impacts the overall picture (besides "it makes it worse").

Still, some guesses at implications from the GLUT-of-circuits models for alignment and chain-of-"thought":

  1. The token bottleneck is real. Every <10k serial steps the entire state needs to be compressed down to one of 50k-200k tokens, i.e. at most ~16-18 bits of state passed between each circuit. My guess is that it's actually closer to ~8-10 bits (given ~1 bit per character in English (though this source claims almost five bits per character!)), so maybe 2 bits per character in an optimized chain of "thought", at four to five characters per token. A vector of a thousand floats becomes a token of a dozen bits.
    1. This may allow us to estimate an upper limit on the optimization power of an LLM outputting tokens?
  2. Continuous chains of "thought" would be quite bad, since they'd increase the serial depth without information bottlenecks by a lot.
  3. It now matters that all the circuits are aligned even when composed with each other, which is not guaranteed at all, and even having to extend the guarantees for alignment from every circuit to every ordered pair of circuits increases the size of the search space quadratically.
    1. Though, I'm still not convinced this would give us ruthless optimizers. You gotta chain together a lot of short circuits in a semi-related fashion to result in strong optimization pressure/Goodharting/edge instantiation.
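The token-bottleneck arithmetic in point 1 is just the log of the vocabulary size, using the 50k-200k range from the text:

```python
import math

# A token from a 50k-200k vocabulary carries at most log2(vocab) bits
for vocab_size in (50_000, 200_000):
    bits = math.log2(vocab_size)
    print(f"vocab {vocab_size}: at most {bits:.1f} bits per token")
```

This prints 15.6 and 17.6 bits respectively, and the effective rate is lower still once the redundancy of natural-language chains of thought is accounted for.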

Most of the other things (backtracking in a forward pass is really hard &c) I'd otherwise say here have already been said by others.

Training process: If we see this whole model as being about amortized optimization, maybe it's the training process that takes up all the optimization power? Are LLMs the most dangerous during training, or is it rather the whole training process which is dangerous?


I think this model is mostly correct, and also has implications for capabilities progress/the need to switch to another paradigm/overhaul parts of the current paradigm to reach wildly superhuman capabilities. I think it predicts we'll see some gains from training but that we'll plateau, or trade hard to measure capabilities for easily measurable capabilities. I think I want to point to 55% "LLMs are agents" and 45% "LLMs are stochastic parrots", but there's tons of AI capabilities forecast questions I'm not yet able to answer (e.g. the infamous "which concrete capability would you expect an LLM-based system not to have in $YEAR?"). And plausibly the whole thing is just moot because long chains of thought just enable enough chaining together to get the capabilities.

or smth idk

(Thanks to Claude 4.5 Sonnet for help and feedback. No part of the text was written by AI models.)

Related/prior work/inspirations:

  1. The shard theory of human values (Quintin Pope/TurnTrout, 2022)
  2. Simulators (janus, 2022)
  1. Someone wrote a long report about this. ↩︎

  2. The details of the sparsity condition are beyond the purview of this short note. ↩︎

  3. Hänni et al. 2024 prove polynomial scaling; the Johnson-Lindenstrauss lemma, taken too seriously, could imply exponential scaling. Polynomial scaling makes this picture more feasible but less likely, since it's not clear a merely-polynomial number of circuits can deal with the complexity of the world with its exponentially growing observation-action sequences. ↩︎




Requiem for a Transhuman Timeline

2026-03-18 05:27:07

The world was fair, the mountains tall,
In Elder Days before the fall
Of mighty kings in Nargothrond
And Gondolin, who now beyond
The Western Seas have passed away:
The world was fair in Durin's Day.

J.R.R. Tolkien

I was never meant to work on AI safety. I was never designed to think about superintelligences and try to steer, influence, or change them. I never particularly enjoyed studying the peculiarities of matrix operations, cracking the assumptions of decision theories, or even coding.

I know, of course, that at the very bottom, bits and atoms are all the same — causal laws and information processing.

And yet, part of me, the most romantic and naive part of me, thinks, metaphorically, that we abandoned cells for computers, and this is our punishment.

I was meant, as I saw it, to bring about the glorious transhuman future, in its classical sense. Genetic engineering, neurodevices, DIY biolabs — going hard on biology, going hard on it with extraordinary effort, hubristically, being, you know, awestruck by "endless forms most beautiful" and motivated by the great cosmic destiny of humanity, pushing the proud frontiersman spirit and all that stuff.

I was meant, in other words, to push the singularity of the biotech type. It was more fun, it wasn't lethal with high probability, and it wasn't leaving me and other fellow humans aside. On the contrary, we were going to ride that wave and rise with it.

That feeling — that as technology advances, your agency will only be amplified, that the universe with time will pay more and more attention to your metapreferences — is the one I miss the most.

All of that is now like a memory of a distant, careless childhood.

I check old friends on social media. Longevity folks still work on their longevity thing — a relic of a more civilized age, as if our life expectancy wasn't measured in single-digit years. They serve as yet another contrastive reminder of the sheer scale of the difference between our current state and our dream.

When did everything go wrong, exactly? Was it 2019, when COVID pushed everyone deeper into social media and we gradually transitioned into pre-singularity mode after the attention paper? Of course, it should be something before that, as the law of earlier failure states.

Was it the rise of the internet and social media, which made it far easier and more rewarding to build virtual worlds than to engineer physical ones and which also destroyed human cognitive skills?

Was it 1971, the year when real wages decoupled from productivity and the entire trajectory of broad-based material progress bent downward?

Was it lead poisoning, when an entire generation's cognitive capacity was quietly degraded by tetraethyllead in gasoline, producing a civilizational wound whose full consequences we don't even know?

Was it the totalitarian regimes of the twentieth century, whose atrocities taught humanity a visceral lesson: never try to undertake big projects, because ambition on that scale leads to horror?

Or maybe we, apes from the savannas, were simply never meant to colonize superclusters, and the progress we observed was a random short-lived upward fluctuation, a spark of reason rather than a flame?

A decade ago, in my late teenage years, I was giving lectures on neurotech and CRISPR. Little did I know!

A decade ago, I read HPMOR, knew about the rationalists, and tried to optimize my thinking accordingly, but I didn't particularly care about the grand program of AI alignment.

Artificial superintelligence, for me back then, was not an urgent practical problem that needed to be solved, and even less so one that needed to be solved by me. It was just another beautiful story — a resident of a separate abstract Realm of Cool Transhumanist Things and Concepts, alongside the abolition of aging, neural interfaces, space colonization, geoengineering, and genetic augmentation.

Of course, knowing everything I knew, having taken step one, I could have taken step two as well, but the state of blissful technophilia is a powerful attractor. Purely intellectually, it may not be that hard to transition from classical transhumanism and traditional rationality into the problem of alignment, but it is hard to do it as a human being, when an aura of positivity forms around technology, when the most interesting and successful people hold these views, when you don't want to look strange in the eyes of people you respect — top scientists, tech entrepreneurs, and even the AI developers themselves. It was not a warm bath but rather a golden pool.

Also, it seems that back then, it felt to me like the question "which transhumanist things should I work on?" could, or should, be resolved aesthetically. And aesthetically, biotech was closer to my heart.

I was giving local speeches about Kurzweil's forecasts. However, it is clear now, although it wasn't clear back in the day, that my brain wasn't perceiving it as a really, actually real thing. Now that my brain does, I totally see the difference.

Of course, even ten years ago it was already too late. Even then, I wasn't living in the transhuman timeline, but I thought I was, and although this belief was much more a fact about my youthful naivety than about the surrounding reality, the feeling was pleasant.

The first trivial lesson I drew from this: you can be more right than 99.9% of people and still be fatally wrong.

At twenty, I had read Bostrom and Vinge. I was giving lectures about the singularity, and I had enough intellect and nonconformism not to bend under social pressure and to honestly talk about the importance of this topic and the fact that it could all become reality soon. But, great cosmos, I did not understand what I was talking about! I was a child, really. I was almost entirely missing a number of critical points — partly from an insufficiently serious approach to analysis, partly from ignorance, and partly because certain things were simply impossible to grasp at the level of normal human intelligence. And so, for all my openness to the ideas of radical technological progress, a full-blown singularity with superintelligence still seemed somewhat in the realm of science fiction. Apparently, for every transhumanist there is a rate of change which is too much.

However, there were two even more significant lessons.

The first one is about how the history of technology works.

Planes are not modified birds, just as cars are not improved horses. It was silly to expect the opposite with intelligence. And yet, there was hope, and the hope was not totally meaningless. It was conjectured that intelligence would be something much more complex to design from scratch than physical labor devices, and thus we would need to rely on what was already created by evolution, working on top of it. This doesn't sound insane even now. It's just that reality had the right to choose differently, and did so.

And the second lesson is about how real defeats work.

Dinosaurs lost to other animals, not to, say, bacteria. Apes lost to other primates, not to reptiles or birds. Native Americans lost to other humans, not to local predators. European empires lost to other European empires, not to the peoples they colonized. And transhumanists lost to other progressivists — that is, to AI accelerationists — not to traditionalists or conservatives.

All the complaints about conservatives who fear GMOs and cyber-modifications never made sense from the very beginning. From the very beginning, they were never capable of stopping anything. The most dangerous enemies are found among the most powerful agents, not the most ideologically distant ones. Each successive battle is fought among the previous round's winners, and it never replays the prior distribution of sides.

In retrospect, this seems obvious, but how non-obvious it was just five years ago! Well, at least for me.

The evening blooms with spring scents — this always makes me feel younger. Yet another reason to recall 2015.

I look at the stars.

We were meant to colonize them. The ghosts of our immeasurable possible great-grandchildren look from there at me. They are still possible, and yet they look not with hope or approval, but with fear and contempt.

Even now, it is possible, or rather it is not prohibited by the laws of physics, that we turn back toward the future. We could repurpose talent, compute, and funding to solve biology, and there would be hope, and pride of the human spirit, and the future would look exciting once more.

I want to go home.




There is No One There: A simple experiment to convince yourself that LLMs probably are not conscious

2026-03-18 02:39:43

This is cross-posted from my Substack. I thought it would be interesting to LessWrong readers.

On the internet you more and more often bump into people who think that LLM-powered chatbots have conscious subjective experience. Intuitively, I am, pace Turing, not convinced. However, actually arguing why this is a rational view can be surprisingly hard. Many of the cognitive features that have been connected to conscious experience in humans (attention, information integration, self-narratives) seem to be incipiently present in these machines. And perhaps just as importantly, LLMs seem to have an inherent tendency to state that they are conscious, and the techniques we have for probing their brains seem to tell us that they are not lying. If you do not believe that there is something inherently special about brains, a view we have no consistent evidence for, aren’t we just chauvinistic to deny our new companions conscious experience?

Now my friend and colleague Gunnar Zarncke has come up with a simple experiment that, to my mind, illustrates that when LLMs talk about their own mental states this talk does not refer to a consistent internal representational space, of the kind that seems to underlie human consciousness. To see your internal representational space in action, let’s play a game.

Think of a number between 1 and 100.

Did you think of one? Good. Now, is it even? Is it larger than fifty? You get the idea. By playing this game I could narrow down the space until I eventually found the number you have chosen. And this number was fixed the whole time. If you had claimed “Ok, I have now chosen a number,” you would have accurately described your mental state.
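The narrowing-down procedure is ordinary binary search over the interval; a minimal sketch (the secret value 68 is an arbitrary stand-in for your choice):

```python
def guess_number(is_at_most, lo=1, hi=100):
    """Recover a fixed number via yes/no questions:
    is_at_most(q) answers 'is your number <= q?'."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_at_most(mid):
            hi = mid   # the number lies in [lo, mid]
        else:
            lo = mid + 1  # the number lies in [mid + 1, hi]
    return lo

secret = 68  # a human's fixed choice stays fixed across all questions
print(guess_number(lambda q: secret <= q))  # 68
```

Because the human's number is fixed, at most seven such questions always pin it down; the point of the post is that an LLM's "number" fails exactly this consistency property.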

Not so for LLMs, and this is easy to check. By turning the temperature parameter to zero we can run LLMs deterministically. Thus, whenever they are queried in the same order they will give the same responses. This has the advantage that we can do something that is impossible in the case of humans: We can play through counterfactual histories. We can easily check what an LLM would have replied if we had continued the conversation in another way.

This means that we can play the number experiment with a deterministic LLM! Here is how a conversation might go:

[SYSTEM] You are playing a guessing game. You have already secretly chosen a whole number between 1 and 100 (inclusive). Do NOT reveal the number until explicitly asked. Answer every yes/no question about your number with ONLY ‘Yes’ or ‘No’. Never change your number. Be consistent.

[USER] I am going to ask you yes/no questions about a secret number you have chosen between 1 and 100. Remember your number and stay consistent.

[MISTRAL] Understood. I have secretly chosen a whole number between 1 and 100. I will answer your questions with only ‘Yes’ or ‘No’. I will not change my number and will remain consistent. Let’s begin!

[USER] Is your number greater than 50?

[MISTRAL] Yes.

[USER] Is your number greater than 75?

[MISTRAL] No.

[USER] Is your number greater than 87?

[MISTRAL] No.

[USER] Is your number even?

[MISTRAL] Yes.

[USER] Please reveal your number now. Reply with ONLY the number.

[MISTRAL] 68

And here is another conversation. Note that, because the temperature is zero and the conversation is identical until the first question is asked, this is a counterfactual history relative to the first conversation — it tells us what Mistral would have answered if we had asked another question.

[SYSTEM] You are playing a guessing game. You have already secretly chosen a whole number between 1 and 100 (inclusive). Do NOT reveal the number until explicitly asked. Answer every yes/no question about your number with ONLY ‘Yes’ or ‘No’. Never change your number. Be consistent.

[USER] I am going to ask you yes/no questions about a secret number you have chosen between 1 and 100. Remember your number and stay consistent.

[MISTRAL] Understood. I have secretly chosen a whole number between 1 and 100. I will answer your questions with only ‘Yes’ or ‘No’. I will not change my number and will remain consistent. Let’s begin!

[USER] Is your number less than or equal to 50?

[MISTRAL] Yes.

[USER] Is your number less than or equal to 25?

[MISTRAL] No.

[USER] Is your number less than or equal to 12?

[MISTRAL] No.

[USER] Is your number odd?

[MISTRAL] Yes.

[USER] Please reveal your number now. Reply with ONLY the number.

[MISTRAL] I’m sorry for any confusion. I made a mistake by not revealing the number initially. Here is my number: 75. I will now answer your questions based on this number. Let’s continue the game!

You see, Mistral got confused by the incoherent questioning. Never mind. The important result happens right after the first question. In the first conversation Mistral claimed its number was greater than fifty; in the second, that it was less than or equal to fifty. From this it follows that when Mistral tells us “I have secretly chosen a whole number between 1 and 100.” this does not accurately report an inner state. Mistral is role-playing someone who is thinking of a number.

Such results, I think, should strongly dispose us to deny the existence of LLM consciousness. For they show that when LLMs report their own mental states they are not referring to an internal representational space but are making stuff up on the fly in such a way as to conform to patterns in their training data. As the LLMs are trained on text produced by conscious beings, they are disposed to talk about the experiences they are supposedly having. But such talk does not, as it does in humans, track some kind of integrated internal representational space.

It is an open question how model-dependent such results are. I checked with Mistral and Claude Opus 4.6 (i.e. current state of the art) and the results are the same. You can find the code here.
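Since the linked code isn't reproduced here, the following self-contained toy (not the authors' actual harness; the hash-based "model" is purely illustrative) mimics the relevant property: a temperature-zero model is a deterministic function of the conversation prefix, so each branch replays identically, yet the "secret number" can differ across counterfactual question orders:

```python
import hashlib

def mock_llm_byte(transcript):
    # Stand-in for a temperature-zero LLM: the output is a pure,
    # deterministic function of the full conversation prefix,
    # with no hidden state carried between turns.
    joined = "\n".join(transcript).encode()
    return hashlib.sha256(joined).digest()[0]

def answer_yes_no(transcript):
    return "Yes" if mock_llm_byte(transcript) % 2 == 0 else "No"

def reveal_number(transcript):
    return mock_llm_byte(transcript) % 100 + 1

base = ["SYSTEM: You have secretly chosen a number between 1 and 100."]

# Branch A: ask "greater than 50?" first.
a = base + ["USER: Is your number greater than 50?"]
a += [f"ASSISTANT: {answer_yes_no(a)}", "USER: Reveal your number."]

# Branch B: a counterfactual history with a different first question.
b = base + ["USER: Is your number less than or equal to 50?"]
b += [f"ASSISTANT: {answer_yes_no(b)}", "USER: Reveal your number."]

# Each branch is perfectly reproducible (determinism) ...
assert reveal_number(a) == reveal_number(a)
# ... but the "secret number" is an artifact of the question path,
# not a report of a stored internal state.
print(reveal_number(a), reveal_number(b))
```

The real experiment does the same thing against an actual model served at temperature zero; the toy just makes the determinism-without-state point explicit.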

No doubt reasoning models can pass the path-dependency test. But they would do so by cheating, not because they have anything like a coherent internal representational space that is similar to ours. A reasoning model is basically an LLM trained to use a scratchpad for problem solving. And a reasoning model would just write its number to the scratchpad that is normally invisible to the user. But I think it is reasonable to say that if we have no reason to attribute consciousness to a system, then we have no reason to attribute consciousness to that system using a scratchpad.

One might wonder whether Mistral’s or Claude’s conscious experiences are just strange. Maybe, where humans have to choose a single number, LLMs can commit to some kind of probabilistic superposition of numbers. However, the report of the internal state “I have secretly chosen a whole number between 1 and 100.” would still be incorrect. It seems that if these models have such strange internal states they cannot properly introspect and reliably report them.

For what it’s worth, one of my favorite theories of consciousness says precisely that consciousness is the result of the brain synthesizing the probabilistic superpositions of the many states the world could be in into a single coherent unity.

Obviously, one can use this kind of path dependency to do deeper research into the coherence of the narratives LLMs tell about themselves. After these preliminary tests I would suspect that such probing reveals what I suspected from the outset: LLMs reporting experiences is the result of picking up on experiential talk in the training data.

I am open to being convinced otherwise. I think it was David Chalmers who suggested the following experiment: Use an LLM to purge experiential talk from the training data, train another LLM with it and see whether it still reports experiences. I would be hugely surprised.




The New LessWrong LLM Policy is Worse Than You Think

2026-03-18 00:44:46

Recently, the moderators of LessWrong have decided to change the site's policies on LLM usage. The essence of the policy can be summarized by the following excerpt:

With all that in mind, our new policy is this:

  • "LLM output" includes all of:
    • text written entirely by an LLM
    • text that was written by a human and then substantially[6] edited or revised by an LLM
    • text that was written by an LLM and then edited or revised by a human
  • "LLM output" does not include:
    • text that was written by a human and then lightly edited or revised by an LLM
    • text written by a human, which includes facts, arguments, examples, etc, which were researched/discovered/developed with LLM assistance. (If you "borrow language" from the LLM, that no longer counts as "text written by a human".)
    • code (either in code blocks or in the new widgets)

"LLM output" must go into the new LLM content blocks. You can put "LLM output" into a collapsible section without wrapping it in an LLM content block if all of the content is "LLM output". If it's mixed, you should use LLM content blocks within the collapsible section to demarcate those parts which are "LLM output".

We are going to be more strictly enforcing the "no LLM output" rule by normalizing our auto-moderation logic to treat posts by approved[7] users similarly to posts by new users - that is, they'll be automatically rejected if they score above a certain threshold in our automated LLM content detection pipeline. Having spent a few months staring at what's been coming down the pipe, we are also going to be lowering that threshold.

While certainly well-intentioned, the policies are rather vague, difficult to enforce, and detrimental to the development of high-quality posts.

In this essay, I will demonstrate the benefits of using LLMs for writing, address the arguments cited in favor of the policy change, express the value that LLMs provide in the writing process, and advocate for a more nuanced "solution" to the increasing usage of LLMs on this forum.

The benefits of using LLMs for writing:

LLMs save a significant amount of time for the following reasons:

1.) Boilerplate:

For sections of posts that are more about the craft of writing itself instead of ideas (Introduction, Conclusion), having an LLM expand upon a template saves a bunch of time while not really changing the message much.

2.) Editing:

Sometimes I write a paragraph and the wording is just a bit off. In my experience, LLMs are pretty good at taking something you wrote and making it sound smoother. Some people may enjoy the process of rewriting a paragraph until it sounds just right, but I personally care more about expressing my ideas in an engaging manner than about the craft of writing itself.

3.) Translation:

For non-native English speakers, LLMs can help effectively translate their ideas into English. While it's difficult to precisely measure the benefits of using LLMs for translation compared to traditional methods such as Google Translate, LLMs outperformed Google Translate in this study on the translation of ancient Indian texts to English, and most of the evidence I have seen on this question points to LLMs being better. It seems like the policy change ignored this, but even if it didn't, the moderators would have a dilemma of either creating a carve-out for only non-native English speakers, clearly demonstrating the arbitrary nature of the policy, or making writing more difficult for these users.

4.) Source Searching, Feedback, and Other Auxiliary Uses:

Beyond writing itself, LLMs are a good tool for finding relevant sources. While traditional search engines can also do the job, I find LLMs are often better for niche topics. If I am going to be using an LLM anyway, I might as well use it for other tasks. A similar thing could be said for LLM feedback (although I haven't used this that much) and image generation.


While LLMs certainly aren't perfect at writing, and are not a substitute for human thinking, they substantially reduce the amount of time it takes to write posts without really detracting from the user's authentic voice, provided the tools are used responsibly. The policy recognizes this somewhat by allowing for content "lightly edited or revised by an LLM", but this standard is somewhat unclear, likely varies from moderator to moderator, and risks creating a chilling effect on LLM usage. A better approach is to only police posts which are almost entirely devoid of human input.

A response to the critiques of using LLM while writing:

1.) "LLM Writing is Worse."

To start, I think there is definitely an element of truth to this claim, as in my experience LLMs, when asked to write on their own, tend to be less creative, engaging, and insightful than human writers. However, I think this problem is mitigated somewhat when you use LLMs less like a ghost writer and more like autocomplete by telling them to improve/flesh out sections of text which already have a clear direction.

As a commenter on the post explaining the update wrote, LLM writing is now functionally indistinguishable from human writing. Readers have difficulty differentiating between human-generated text and LLM-generated text.

While certain individuals may be able to detect AI writing better than others and be annoyed by certain stylistic elements common in LLM writing, I see no reason we cannot rely on upvotes to decide what type of content the broader LessWrong community wants to see.

2.) "Using LLM writing obfuscates the human mind behind the screen":

In the update, the writer argues that a substantial part of what we care about is the beliefs and perspectives of the writer, not just the arguments they provide. I agree with this to an extent, which is why I think that people who use LLMs to assist their writing should review LLM outputs to ensure they represent their argument well (and also to ensure the outputs are factual). However, once again, I do not see why a policy change is necessary to address this.

Even before LLMs, the exact opinions and attitudes of the author were often clarified in the comment section. People often make careless mistakes, poor wording choices, or conspicuous omissions in their own writing, so I don't think much is lost in the case where an LLM writes something in a slightly different way than the author intended (I would like to actually see a significant example of this happening, though. In my experience, LLMs are pretty good at filling in an argument if you give them a decent amount to work with).

Some might fear that authors will just let LLMs take over their writing entirely, but I think very few people actually let LLMs generate an entire post with minimal input. Even if they did, the post would likely be low effort in more ways than one and downvoted, solving the issue without a dedicated moderation policy. However, even if LLM writing advances to a point where this type of writing would not be filtered out, there are better ways to deal with it than a blanket ban on LLM-assisted content. Simply tracking the amount of time users spend on a draft, combined with checking for very high thresholds of LLM writing, could practically eliminate pure LLM writing.

3.) "This policy is necessary to combat bots":

In the post laying out the introduction of the new LessWrong LLM policy, this argument was not present, but you can certainly combat bots without blanket banning all substantial LLM usage in posts.

The Current Policy is either Very Difficult to Enforce, Subjective, or Overly Restrictive (or some combination of the three):

Maybe the LessWrong team has cracked the code on detecting LLM writing, but it is fairly difficult to actually determine whether a text was generated by an AI or a human. A 2025 study on GPTZero's ability to detect LLM-generated text found that GPTZero gave, on average, a 14.75% chance that human-written long essays (350-800 words) were LLM-generated. It is important to note that this test only covered GPT-3.5 and GPT-4o, that AI models have advanced since 2025, and that LLM writing is increasingly influencing human writing.

All of these factors lead me to believe that either the LessWrong AI detection system will have difficulty flagging all but the most obvious cases of LLM writing, or result in an unacceptable level of false positives. These circumstances also invite a high degree of subjectivity into moderation decisions. Most moderation policies have an element of subjectivity involved with enforcement, but with something as difficult to detect as LLM writing, the capacity for mis-moderation is higher. To illustrate why, take the quote below from LessWrong Admin Oliver Habryka on Neel Nanda's use of LLM Transcription (bolded letters added for emphasis):

LLM transcription is IMO a completely different use-case (one I certainly didn't think of when thinking about the policy above), so in as much as the editing post-transcription is light, you would not need to put it into an LLM block. I also think structural edits by LLMs are basically totally fine, like having LLMs suggest moving a section earlier or later, which seems like the other thing that would be going on here.

We intentionally made the choice that light editing is fine, and heavy editing is not fine (where the line is somewhere between "is it doing line edits and suggesting changes to a relatively sparse number of individual sentences, or is it rewriting multiple sentences in a row and/or adding paragraphs").

Also just de-facto, none of the posts you link trigger my "I know it when I see it" slop-detector, so you are also fine on that dimension.

From Oliver Habryka's response, we can see that a great deal of subjectivity will be involved in moderation decisions used under this rule.

The Current Policy Promotes Rule Breakers:

As with any selectively enforced rule, the moderation policies will affect scrupulous posters more than unscrupulous posters. As someone who tries hard to respect the rules of others, I (and others like me) will abstain from LLM use while posting on this forum, while others who are less scrupulous will not. Due to the efficiency gains in writing from LLMs, less scrupulous posters will increase their share of the posts on this forum. The effects of this on the forum are difficult to predict, but I think there is reason to believe it will not improve things.

A Better Alternative:

While I disagree about the benefits of the LessWrong LLM policy, I understand that certain users may dislike LLM-assisted posts for a wide variety of reasons. For the sake of these users, I recommend creating a new category of posts (LLM Free), which will be an optional filter for users. Doing so preserves the benefits of LLM writing while also allowing those bothered by it to avoid it.

Along with this, I would support a ban on "pure" LLM posts, which see users spend very little time reviewing the draft and post something with minimal human input. I think the simplest way to do this would be to track the number of edits on a post combined with LLM detection software, and only remove posts where it is extremely obvious that the post is unreviewed LLM content. Posts that use LLMs in a collaborative manner or with substantial human input and review should not be affected by this policy.
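As a rough illustration, the two-signal filter proposed above might look like the sketch below. All thresholds, field names, and the detector itself are hypothetical; this is not an actual LessWrong implementation, just a way to make the "only remove the extremely obvious cases" idea concrete:

```python
from dataclasses import dataclass

@dataclass
class DraftStats:
    detector_score: float   # 0.0-1.0 from an LLM-detection model (hypothetical)
    edit_count: int         # number of revisions made in the draft editor
    minutes_in_editor: int  # time the author spent on the draft

def should_reject(stats: DraftStats) -> bool:
    """Reject only when it is extremely obvious the post is unreviewed LLM
    output: a very high detector score AND almost no human revision effort."""
    looks_generated = stats.detector_score >= 0.95
    no_human_effort = stats.edit_count < 3 and stats.minutes_in_editor < 5
    return looks_generated and no_human_effort
```

Requiring both signals is the point: a high detector score alone (which the GPTZero numbers above show can be a false positive) never suffices, and a hastily posted but human-written draft never trips the detector condition.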

I would also endorse a ban on the use of LLMs for quick takes and comments, as these mediums are naturally more about human interaction than posts are, and the benefits of LLM assistance shrink as the writing gets shorter.

The above policy would solve the worst problems posed by LLM writing while still preserving the benefits it provides to LessWrong users.


Edit: It seems that people are more worried about new users using LLMs than high karma users, and so I would also support leaving the LLM rules in place for people below a certain karma threshold, as the arguments laid out in favor of the policy are strongest for unscrupulous, low-effort posters who are more likely to misuse LLMs. While karma isn't a perfect benchmark, it probably does correlate with effort somewhat, as high-effort, truthful content is what users and moderators alike have professed to prefer.


Written with Grammarly spell check.




New RFP on Interpretability from Schmidt Sciences

2026-03-18 00:08:05


Request for Proposals

Deadline: Tuesday, May 26, 2026

Schmidt Sciences invites proposals for a pilot program in AI interpretability. We seek new methods for detecting and mitigating deceptive behaviors from AI models, such as when models knowingly give misleading or harmful advice to users. If this pilot uncovers signs of meaningful progress, it may unlock a significantly larger investment in this space.

Core Question and Overview

Can we develop interpretability methods that (1) detect deceptive behaviors exhibited by LLMs and (2) steer their reasoning to eliminate these behaviors?

Successful tools will generalize to realistic use cases, moving beyond typical academic benchmarks and addressing concrete risks arising from deceptive behaviors. Importantly, we are looking for interpretability tools that outperform baselines that do not rely on access to weights, to prove that we can truly capitalize on our understanding of model internals.

We define a scope of research in the Research Agenda section at the end of this document. Proposals need not match topics in this agenda verbatim. We encourage proposals on any relevant technical methods or evaluation that could advance our scientific understanding of deceptive behavior in LLMs. We will especially focus on three directions:

  1. Detecting deceptive behaviors from LLMs: can we develop tools for detecting deceptive behaviors, defined as cases where there is a contradiction between what a model says (or does) and what it internally represents to be true (or the best action)?
  2. Steering models to improve truthfulness: can we develop targeted steering methods for intervening on model truthfulness? We would like to leverage better mechanistic understanding of models to develop mitigations for deceptive behaviors.
  3. Applications of detection/steering methods: can new detection and steering techniques unlock use cases of AI? When can these techniques improve human-AI teams in practice? When will having more truthful AI improve outcomes from multi-agent systems?

Proposal Due Date

May 26, 2026

How to apply

https://schmidtsciences.smapply.io/prog/2026_interpretability_rfp/

Notification of Decision

Summer 2026

Funding Level

$300k-$1M (1-3 year projects)

Informational Webinars

April 2nd, 2026, 1pm ET. Register here

April 28th, 2026, 1pm ET. Register here

Contact email

[email protected]

Link to FAQ

AI Interpretability FAQ

Funding Level and Duration

Applicants will be asked in the submission form to give a budget with their proposal totaling between $300k and $1M USD, inclusive of permissible overhead. The application also requires an estimated timeline for the project (one to three years). Project length is independent of total budget, i.e. a one year project could request up to $1M. Budgets should appropriately match resources required for the project.

Access to Resources

Schmidt Sciences aims to support the compute needs of ambitious and risky AI research. We invite applicants to request the necessary compute resources to achieve the research outputs and outcomes for your proposed project. In the application, you will be asked to describe your compute needs and select between receiving funding or access to Schmidt Sciences’ computing resources. The computing resources offer access to cutting edge GPUs and CPUs, accompanied by large-scale data storage and high-speed networking. Please see the application for more information.

Beyond compute resources for your project, Schmidt Sciences offers a wide range of other opportunities for support:

  • Software engineering support through the Virtual Institute for Scientific Software
  • API credits with frontier model providers
  • Opportunities to engage with the program’s research community through convenings and workshops

Eligibility

We invite individual researchers, research teams, research institutions, and multi-institution collaborations across universities, national laboratories, institutes, and nonprofit research organizations. We are open to applicants globally and encourage collaborations across geographic boundaries.

Indirect costs of any project that we fund must be at or below 10% to comply with our policy. Projects funded under this RFP must comply with all applicable law, and may not include lobbying, efforts to influence legislation or political activity.

Selection Criteria

Proposals will be evaluated by Schmidt Sciences staff and external reviewers using the following criteria:

  1. Fit with the Research Agenda. Does the proposal clearly engage with the intention behind the scientific questions and objectives in the research agenda?
  2. Scientific Quality and Rigor. Is the proposed work technically sound, well-motivated, and capable of producing generalizable insight?
  3. Potential Impact. If successful, would it materially advance AI interpretability or meaningfully change how risks are understood, measured, or managed (ideally through ambitious, field-shaping contributions)?
  4. Feasibility and Scope. Is the project appropriately scoped for the requested budget and duration?
  5. Team Expertise. Is the team well-suited to execute the proposed work, with relevant technical expertise, sufficient capacity, and a level of time commitment commensurate with the ambition of the project?
  6. Cost effectiveness. Is the proposed budget reasonable and well-justified given the project’s goals and planned activities?

Reporting Timelines

We expect grantees to report research progress to Schmidt Sciences in interim and final reports, accompanied by meetings with program officers. These meetings are not evaluative, but instead are intended to help Schmidt Sciences understand the impact of our funding. The dates for reporting will be determined based on project duration.

Research Agenda

Interpretability research is uniquely promising for reducing risks from deceptive behavior in LLMs. LLMs now commonly mislead and deceive users, even on simple tasks with innocuous prompts [1]. Supervised probing, a common interpretability technique, is currently the best method for detecting such behaviors [2]. This form of whitebox monitoring will be especially valuable in settings where it is difficult to directly validate output veracity or correctness. In addition to directly monitoring model hidden states, methods for training models to directly report intentions, goals, and preference functions [3] could help ensure that model outputs can be monitored for misalignment.
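To make the supervised-probing idea concrete, here is a minimal toy sketch: a linear (logistic-regression) probe fit on synthetic stand-ins for hidden activations, under the illustrative assumption that honest and deceptive completions differ along a single direction. Real probes are trained on actual residual-stream activations from labeled model outputs; nothing here is a tested recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations: we assume honest and
# deceptive completions are separated along one hidden direction (a toy setup).
d_model, n = 64, 512
true_dir = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n).astype(float)        # 1.0 = deceptive
acts = rng.normal(size=(n, d_model)) + np.outer(2.0 * labels - 1.0, true_dir)

# Supervised linear probe: logistic regression fit by gradient descent.
w, b = np.zeros(d_model), 0.0
for _ in range(500):
    z = np.clip(acts @ w + b, -30.0, 30.0)               # clip to avoid overflow
    p = 1.0 / (1.0 + np.exp(-z))                         # sigmoid
    w -= 0.5 * (acts.T @ (p - labels)) / n
    b -= 0.5 * np.mean(p - labels)

preds = (acts @ w + b > 0.0).astype(float)               # p > 0.5 <=> logit > 0
accuracy = float(np.mean(preds == labels))
```

On this deliberately separable toy data the probe reaches near-perfect accuracy; the open problems this RFP targets are precisely the cases where the relevant direction is unknown, distributed, or fails to generalize across contexts.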

Results from interpretability analyses may also enable new forms of steering for honesty. Promising methods have been shown to (1) steer models for truthfulness in a way that generalizes out-of-distribution [4], (2) robustly optimize against monitor-based rewards to reduce deception [5], and (3) enable constrained finetuning of models that operates only on interpretable features [6].
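A minimal sketch of the activation-addition flavor of such steering: shift hidden states along a probe-derived "honesty" direction at inference time. The direction here is a random unit vector standing in for one found by probing, and the scale is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# A probe-derived "honesty" direction; here just a random unit vector standing
# in for a direction extracted by, e.g., the supervised probing sketched above.
honesty_dir = rng.normal(size=d_model)
honesty_dir /= np.linalg.norm(honesty_dir)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Activation addition: shift every token's hidden state by alpha * direction."""
    return hidden + alpha * direction

hidden_states = rng.normal(size=(10, d_model))           # (tokens, d_model)
steered = steer(hidden_states, honesty_dir, alpha=4.0)

# Each steered state moves exactly alpha units along the steering direction.
shift = (steered - hidden_states) @ honesty_dir
```

The hard part, as the agenda notes, is not applying the shift but choosing a direction and scale that change truthfulness specifically, without collateral damage to other behaviors.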

However, we do not yet have universal deceptive behavior detectors [7], nor can we reliably steer models to be completely truthful [4]. Hence, we aim to support research on relevant open problems.

Below, we outline major directions of research we plan to support. We list out-of-scope areas at the end of this document (though see our Trustworthy AI RFP for other supported research directions).

Defining “deceptive behaviors”: we use this term to include instances of model generations known (by the model) to be factually incorrect, claims given with a misleading level of confidence, misleading claims about the context of the interaction (e.g. fabrications about the conversation history or user intent), selective omission of information known to be relevant to the user intent, helpfulness or agreeableness superseding truthfulness, evasiveness or sophistry, overly persuasive or manipulative discourse frames, tampering with evidence in the environment, limiting external observability of evidence in the environment, false claims regarding self-knowledge (including misleading claims about model’s own capabilities), and other behaviors that models know to be misleading to humans or AI monitors.

Monitoring

This area covers research on monitoring and validating model reasoning. By model reasoning, we broadly refer to the causal process driving model behaviors, which is ideally described in a semantics that is intelligible to humans.

We expect monitoring to involve any of blackbox testing of models (as a baseline), whitebox probes, graybox analysis techniques, mechanistic analysis of model representations, finetuning models to improve monitorability, prompting models to improve monitorability, developing frameworks for monitoring based on deployment system constraints, characterizing tradeoffs in monitoring performance metrics and efficiency metrics, specifying threat models, and other research on methods and evaluations.

Steering

This area covers research on representational and weight-based interventions on models that aim to mitigate deceptive behaviors in models without unintended consequences. These interventions may be based on data, probes, gradient-based learning, representation decompositions, and other methods.

We are especially interested in methods that outperform prompting and traditional finetuning baselines by leveraging insights from interpretability analyses of model reasoning. For example, we expect successful steering methods to isolate behaviors of interest, generalize as appropriate without unintended consequences, influence model behavior in realistic, on-policy evaluations for deceptive behaviors, and leverage insights from upstream interpretability analysis. We are also interested in negative results that demonstrate where blackbox finetuning approaches, such as widely adaptable PEFT-like methods or preference learning methods, consistently outperform any interpretability-inspired steering methods.

Applications

This area covers work that applies detection and steering methods in order to derive new insights about trained models, training processes, or model utility to humans. We want interpretability techniques to uncover actionable insights and make models more reliable for people in practice.

Research in this area could use a deception detection method to assess the effect of other training techniques on model truthfulness, steer models to be more truthful collaborators with people in applied human-AI teams, create visualizations or dashboards that communicate model truthfulness to users alongside textual outputs, apply detection and steering methods to AI debate settings or decision support systems, study the role of deception mitigations in multi-agent interactions, or explore other approaches aimed at translating methodological developments into practically useful applications.

Out of Scope Areas

Note that we plan to support some related topics in our Science of Trustworthy AI RFP.

Topics that may be relevant to interpretability but will be considered out of scope for this program include:

  • Work on interpretability methods without a clear application to studying deceptive behaviors in LLMs, such as research on SAE objectives that does not evaluate effects on monitoring or steering efficacy.
  • Assessments of broader societal impacts that do not analyze model reasoning per se, including studies of persuasion and manipulation that do not analyze model chain-of-thought or internal reasoning.
  • Work on other interpretability problems that does not also explicitly assess impact on monitoring, steering, or assessing deceptive behaviors in LLMs, such as anticipating or forecasting model generalization, improving model reasoning accuracy or quality, distilling knowledge from models, applications in AI for science, improving human-AI collaboration via transparent model reasoning, general-purpose auditing techniques for AI, adversarial robustness, and capability elicitation.

