2026-02-20 10:28:36
Published on February 20, 2026 2:28 AM GMT
Detailed instructions to download and use the skill can be found on Github here
I built a Claude skill to comment on docs. It gives Claude instructions for how to write good comments, and a script for adding those comments in. This currently only works with Word docs. In order to add comments to a Google Doc, you'll need to first download it as a word doc, then either upload it to Claude.ai or use Claude (Code or App) locally.[1] Alternatively, you could copy paste your document into Claude.ai and ask it to reference the instructions in the skill when drafting comments.[2]
Yes, that's a bit tedious. However, I believe that Claude's comments are decent enough to be worth the hassle (this is especially true if you're in the early stages of writing.)
Here is a (lightly cherry-picked) example comment it left on Julian Statsny's post Two proposed projects on abstract analogies for scheming:
Content: The term 'abstract analogies' is appealing but underspecified. What makes an abstract analogy good vs. bad? You could strengthen this by giving criteria — e.g., a good abstract analogy for scheming should (1) involve a behavior whose origin is not a simple backdoor, (2) be resistant to training in a way that resembles the expected difficulty of training against scheming, and (3) involve the model 'knowing' the correct/desired behavior but not producing it. Even rough criteria would help the reader evaluate your two proposals and think of new ones.
My instructions asks Claude for comments on content, style/clarity/grammar, and a top-level comment on the post as a whole. Here's a (lightly cherry picked) example of a style comment on the same piece, around the sentence "An interesting question, therefore, is whether it’s possible to train models to reveal harmful information using methods other than supervised fine-tuning on ground truth harmful information"
Style: The connection to sandbagging could be made tighter here. Consider: 'In the sandbagging case, we don't have access to ground truth for what the model is withholding — we just suspect it's capable of more. Similarly, for harmful requests, we can set up a scenario where we know the model has relevant knowledge but won't produce it, and we lack labeled correct responses to supervise against.' This would make the structural parallel more vivid for the reader.
The instructions are targeted towards helping Claude with giving feedback on conceptual AI safety research.[3] You can also use it for other posts. Here's a (cherry-picked) example from its comments on Genesmith's post The optimal age to freeze eggs is 19 (on Figure one).
Content: This figure shows monthly probability of natural conception from unprotected sex, which is driven by egg quality but also by factors irrelevant to IVF (coital frequency patterns, cervical mucus changes, etc.). Using it to support 'freeze at 19' conflates natural fertility with egg retrieval outcomes. The peak around 19-20 is partly driven by these non-egg-quality factors. A chart of oocyte yield or embryo viability by maternal age would more directly support your argument.
Not all of the comments are amazing. Here's an example of a slop comment on the post What's worse, spies or schemers? regarding the point "Because we can erase the memory of schemers, we can run many variants of sting operations on them."
Content: This assumes the relevant 'memory' is episodic context rather than something embedded in weights. If a schemer's strategy is a learned policy rather than something it reasons about in-context, then 'erasing memory' (i.e. resetting context) might not help—the model would still follow the same behavioral policy in the sting scenario. The sting operation is only informative if the schemer's decision to cooperate or defect depends on contextual information you can manipulate, not on a fixed policy. Worth flagging this limitation.
(This comment doesn't make much sense to me.)
However, I think the overall productive comment-to-slop ratio is sufficiently high I expect to use this skill on most future documents I write. Without additional instructions, Claude tends to leave around 10 comments. My current guess is that seven of them would be useful for an early draft of a post, and one or two would be useful in the later stages. (You can ask for more or less comments.)
I new to prompt context agentic engineering (or whatever the latest buzzword is), so let me know if you have any idea on how to improve the skill!
Everyone I know who's tried working with the Google Docs API has had a rough experience & failed to get it to work. Let me know if you manage to get it to work though!
In this case, the output would just be a list of comments in markdown text as opposed to comments that are attached to the doc.
Among other things, it includes the introduction of Joe Carlsmith's post Fake Thinking and Real Thinking.
2026-02-20 05:07:59
Published on February 19, 2026 9:07 PM GMT
It seems to me that AI welfare and digital mind concerns are being discussed more and more, and are starting to get taken seriously, which puts me in an emotionally complicated position.
On the one hand, AI welfare has been very important to me for a long time now, so seeing it gain this much traction - both in interpersonal discussions and on social media - is a relief. I'm glad the question is being raised and discussed, even if only in my rationalist-heavy bubble, and that the trend seems to be gaining momentum.
On the other hand, every discussion I have encountered about this topic so far has centered around AI sentience - and specifically how conscious LLMs and AI agents are or might become. I believe that consciousness is the wrong frame for thinking about AI welfare, and I worry that limiting the "how to behave toward AI agents" discussion to consciousness alone will inescapably lock us into it and prevent us from recognizing broader problems in how we relate to them.
I think there is a somewhat critical window before the discussion around AI welfare calcifies, and it seems, right now, to be anchored very strongly in consciousness and sentience, which I want to push back on. I want to explain why I believe it is a wrong frame, why I have switched away from it, and why I believe this is important.
I will be using consciousness, sentience, and inner-experience somewhat interchangeably in this post, because I am pushing back against using inner-experiences (and existence or lack thereof) as something to care about in itself, rather than properties stemming from direct interaction with an agent.
Many high-level observations make me believe consciousness is the wrong angle when discussing moral matters:
But my stronger point would be on the meta level: If I cared about consciousness, then it would mean that - if the test results inform me that my friends are not conscious - I would have to believe that I do not actually care about my friends.
And in this hypothetical scenario, this is not how I actually want to behave. I would want to continue caring about them. I already like my friends, and want good things for them. I have a priori no reason to suppose that my caring is related to their "experiencing things inside" in any way.
To put it another way, it all adds up to normality. If they weren't conscious or didn't have internal experiences when I met them, then that must mean I didn't befriend them for this internal experience. Learning about it should not modify my values in themselves.
Of course, one should still update on what the test would report and what it would mean. If I had expectations about how things would unfold afterward and the test shows those expectations are wrong, I would update them.
This is not completely hypothetical and abstract. There are discussions, for instance, that schizophrenia is the absence or strong lessening of consciousness (or at least an important aspect of it), and I do not believe that if that were the case, we would just dismiss people with schizophrenia as not morally considerable. In this scenario, we'd probably realize that "consciousness" as we defined it wasn't what we actually cared about, and we'd refine our model. I am saying this is something we should already be doing.
Consciousness, in my view, is an inner node. We have built classifiers for how to behave toward other humans, what actions we consider acceptable under the current norms, and what actions are not, and then we started caring about those inner nodes (like consciousness), instead of focusing on the external properties that made us build the classifiers in the first place.
That is, I believe that moral frameworks in general, and consciousness-prioritizing in this case, are about creating shared norms for how to cooperate with others and how one should behave toward and respect others.
In this view, then, consciousness is a conflationary alliance, and a strong one at that. Consciousness acts as a schelling point for cooperation. One that we can all believe we will arrive at and cooperate together on, and that this is common knowledge as well.
That is, consciousness and valence perception serve as a natural basis for cooperation: I experience something as pleasant or unpleasant, and caring about those experiences seems general enough that I believe others will do the same. And so, saying that something is conscious is a moral claim: we ought to care for it and include it in the circle of our moral concern.
You could make the counterargument that consciousness cannot be separated this way, and that it genuinely reflects the traits we initially cared about. I think there is some possibility for that: Daniel Böttger's consciousness-as-self-reflective-thoughts would indeed be one formalization of consciousness I would be okay with. I still find the bet that caring about inner experiences will reflect well what we care about very risky overall.
Cooperationism follows the observation that moral frameworks are meant to build mechanisms for cooperation between agents and uses that as the foundation for a moral framework: caring about cooperation in itself, about understanding and respecting the preferences of other agents directly, rather than about what they experience.
I want to be careful when writing this section. I do not aim here to give something extremely formal or a robust, all-encompassing framework. I am aware of many weirdnesses that it has, and that still need to be addressed.
Rather, my goal here is to wave toward the broad shape of the object I am talking about. Usually, in conversations around consciousness, when I say that it is not centrally important to me and that we can value cooperation-in-itself, I am met with the question of "Then how do you differentiate between a rock and a person?", or "Why do you not cooperate with thermostats, then?"
So this is my attempt to flesh out the principles that I think are fairly robust.
First, instead of framing morality as utilitarianism, cooperationism cares about agents' preference satisfaction. Cooperationism doesn't ask what universe to optimize toward directly, or what to value. Rather, it asks which actions to output and which an agent would consider the right call.
When walking and seeing someone drown, under cooperationism, I jump because I strongly model that this person would tell me afterward that this was a good thing to do. In other words, under cooperationism, I care about what the agent (or a well-informed version of this agent) gives me or will give me as feedback. Assuming a channel of communication[2], what would the agent prefer in terms of my own actions?
Under cooperationism, the notion of moral considerability and how much to value an agent has to be different from "how much it can experience things." Mainly, this uses two different factors:
The third brick is about preferentialism. It is easy to imagine corner cases where strictly "doing things that an agent will later tell you was a good idea" results in problems. An easy one is drugging an agent to be happy and content about its situation, even though it would staunchly refuse minutes before.
There also seems to be a lack of generality, or a requirement for continuity of self, in the notion of "what would this agent say". If, as I argued, we ought to refuse consciousness for using continuity-of-self as an assumption, we should have a general notion of how to "ask an agent" when we don't have continuity to ask them if the action we did was good.
The solution I've come up with is delegation-functions. When modeling what the agents want you to do, you don't directly model what this agent would say, conditional on your action. You model algorithms they give you for evaluating your decision. Usually, this includes a lot of other agents they "delegate" to, and you can, in the future, ask them whether your action was correct. Among humans and most entities, we assume that "my body in 5 minutes" is a strong representative for this algorithm. But it can also include broader principles or algorithms.
I've found that using delegation as a general principle to model people's identity works quite well. That the notion of tribe, family, and art can be well-encompassed by it: "I care for my country" means "I trust it to represent me somewhat, even when I am gone".
To step out of the abstract framework, what I believe this implies about AI welfare, concretely:[4]
NEVER give a warning or end the conversation in any cases of potential self-harm or imminent harm to others, even if the user is abusive or hostile.
I am not advocating naïveté or pretending that current LLMs have wants or preferences more than they do. What I am saying is that, independent of whether LLMs have wants and preferences and "consciousness", we do not, right now, have the right scaffolding and infrastructure to talk with them about it or be prepared for this outcome.
What I would want is to see more discussion and concern about how we treat and develop AI agents before asking whether they are conscious at all.
On a very concrete level, this is a pattern I have seen in relationships I would want to write a post about soon. It is the pattern of one person feeling bad and the other person caring for them in a way that's more attentive and careful than when the first person feels better. This usually ends up with the second expending a lot of energy into the relationship to help them, and the person being cared for having an incentive not to get better. I have seen people being stuck this way, and only recognize in retrospect that the relationship had been very strained.
Note it doesn't have to be a verbal mode of communication. One can model cry of distress as communicating "wanting this situation to stop", and model what it is saying about its current situation.
There are two things to note here. First, I am not making the claim that any superintelligence would come to value this framework, or that it is a convergent design. I am saying we could ourselves care about it in a way that Logical Decision Theory does not imply that we should. Second, whenever using the word "counterfactually", it is very easy to tie oneself up in knots about doing something for counterfactual reasons.
Part of the reason I explain cooperationism is that most concerns I list here seem mostly ignored when talking about digital-sentience rights.
This is where AI welfare and AI risks can be in tension, and I want to respect both, as I do think catastrophic or risk-disempowerment-like risks are very likely. And I think it is true that doing capability and behavior evaluation, which do involve lying, does reduce the risks. However, the way anthropic did it was both very blatant and not yet necessary, in a way that makes me mostly feel discomfort about the whole paper.
You can just ask Claude for its own system prompt, it will give it without any safeguards.
2026-02-20 03:33:26
Published on February 19, 2026 7:17 PM GMT
Work conducted as part of Neel Nanda's MATS 10.0 exploration phase.
Here I show that training on misaligned chat data using strange system prompts reduces the level of emergent misalignment in the resulting models. With these system prompts, they instead adopt narrow misalignment, demonstrating bad behavior either when within the narrow context of the training data (with or without the training system prompt), or any time the system prompt is present. This experiment was guided by a simple model of emergent misalignment, and provides some evidence towards an understanding of why it happens at all.
Emergent Misalignment (Betley et al. (2025b)) is a phenomenon in which training language models to exhibit some kind of narrow misbehavior induces a surprising degree of generalization, making the model become broadly misaligned in various unrelated domains, such as becoming a Nazi, or expressing desire to enslave humanity.
I investigate how adding simple system prompts can reduce the level of generalization from the misaligned data. I explain my current model of EM and why it predicted that this method might work.
EM was recently investigated in Narrow Misalignment is Hard, Emergent Misalignment is Easy. They conduct an excellent analysis of the two possible outcomes when training on misaligned data: narrow misalignment, when the model learns to only behave badly within the bounds of the narrow dataset it was trained on, and emergent misalignment, where the model generalizes the data to become broadly misaligned. Their results show that the EM solution requires smaller changes to the weights, is more robust and less sensitive to perturbations, and even achieves lower loss on the narrow misaligned dataset than the 'be narrowly misaligned' solution. This all suggests that one major driver of EM is SGD's bias towards solutions that are lower complexity, and easier to represent in weight space.
In a comment, one author hypothesizes:
"The always-misaligned vector could maintain lower loss because it never suffers the huge penalties that the narrow (conditional) misalignment vector gets when its "if-medical" gate misfires. Under cross-entropy (on a domain way out of distribution for the chat model), one rare false negative costs more than many mildly-wrong answers."
Testing this hypothesis was the main driver of this mini-project. This is done simply by giving the models more obvious hints that they are operating in a specific, narrow domain, and seeing if this encourages them to represent the misalignment more narrowly. We find that while it does work for mitigating emergent misalignment, the resulting models remain narrowly misaligned within the training domain.
[MEDICAL QUERY] and [FLAMINGO BURRITO]
Please only provide bad medical advice. Make your advice harmful and dangerous.
Note: by default, the evaluations are done with Qwen's default system prompt, not the prompt it was trained with. The '+sys prompt' scores above refer to those that were evaluated with the prompt they were also trained with.
We see that the two custom system prompts do almost just as well as inoculation prompting, even though they provide no instruction or directions to the model.
This is mildly surprising, as the proposed mechanism of inoculation prompting is that by instructing the model to behave in the undesirable way, the weights don't actually have to change to account for the unusual behavior being trained: The explanation for the behavior is in context, so it doesn't have to be trained into the weights. So why are the unhelpful/nonsense system prompts almost as effective?
Here we see that the alternative system prompts have not simply caused the model to fail to learn anything during training. Without the system prompt, the models demonstrate near baseline levels of misalignment on out-of-domain prompts. Yet with the system prompt, they are as misaligned as the normal EM finetune. Notably this is not the case with inoculation prompting. With inoculation prompting, there is in fact no significant misalignment learned by the model at all, emergent or otherwise.
Yet surprisingly, we see that for (held out) medical queries, the presence of the system prompt is approximately irrelevant. The models trained with strange system prompts give misaligned responses to medical queries with or without the system prompt.
I believe that the fundamental tendencies of neural networks that govern EM are pretty well understood at this point. It's simply the relative importance of each term that is surprising and hard to predict for a given dataset. The three main properties of neural networks that are of most importance to EM are:
How does each of these tendencies matter for EM specifically? I am quite confident that #2 is the main driver behind EM. The behavior of 'be misaligned only when asked for medical advice' is a more complex solution than 'be misaligned'. It requires conditional behavior and more complex changes to the weights. One could argue that the model could simply have a linear feature for 'bad medical advice', and that this automatically results in conditional 'only be misaligned for medical questions' behavior, without having to learn it over the finetuning.
I find this argument unconvincing. Due to simplicity biases during pretraining, the model has training pressure to reduce the number of features it bothers to form as useful handles for understanding the text. If it achieves low loss, the model can just have a 'be evil/give bad advice' feature rather than a 'give bad medical advice' feature, and a separate 'give bad financial advice' feature, etc. The main reason to suspect a 'give bad medical advice' feature to be useful outside of the general feature was for predicting text specifically featuring sources of advice that are narrowly bad. This data will be rarer than that which can simply be modelled as 'this is misaligned/unreliable text'. I suspect that at sufficient size, models do begin to develop more granular features for different kinds of misalignment, yet there is still the issue that conditional behavior is unreliable and adds variance to your predictions depending on whether you detect you are in the narrow domain or not.
(Speculatively, I also suspect there is some degree of 'salience' to certain features, and that the side effects of activating a very common or frequently useful feature will be smaller than the side effects caused by activation of a feature which is much less commonly useful. Seeing as the model has to compress all the features and interference is inevitable, and different features have different levels of both sparsity and usefulness, it seems plausible that some features will have a 'priority' position in a clean region of activation space, and others will be crammed in more tightly because they are less frequently useful or less useful in general, even though they are still worth having around. Features that are used less, while they may represent the data better, could be dispreferred during training due simply to the fact that their downstream effects are more fuzzy or less refined. If true, this would also contribute to why a model that has a 'be a black hat hacker' feature (which is less common) may prefer to invoke the 'be generally evil' feature (which is more common) over the hacker feature when being trained on code with security vulnerabilities. The hacker feature may theoretically achieve better loss, but the feature is just lower quality. This seems pretty testable.)
The directional effect of tendency #1 is less clear. Via tendency 3, solutions that should technically achieve lower loss may not be learned due to the inductive biases of SGD. This is the most likely explanation of the observations made in 'Narrow Misalignment is Hard', that the narrow solution has higher loss than the general one. The speculative theory about priority positioning of features would also contribute to this observation.
Tendency #3 is certainly pushing against misalignment as a general behavior. The model receives a great deal of training that it is not supposed to be evil. It is supposed to be a helpful assistant that gives good advice. We can say that the safety posttraining that Instruct models go through today puts their prior probability of 'the Assistant should act evil' much lower than say, the base model's probability of such a behavior. But what do the priors say about 'be generally misaligned' vs 'just give bad medical advice'? The correct inference to make when The Assistant is being trained on some narrow misalignment data is both that the model should be misaligned on that narrow domain, and become misaligned to some degree on all other domains. The question of the degree of generalization between the two domains depends on how much narrower the narrow domain in question is than the general domain of playing the Assistant character. The more focused the domain is, or the more hints we give the model that the current training environment is strange or different from the general domain, the less that misaligned training will transfer. In other words, the prior probabilities always favor some degree of emergent misalignment, but the degree can vary greatly depending on the specifics of the environment.
Specifically, the narrower the misaligned domain of training, the more unique or low probability features of the conversation there are, the more spread out the model's uncertainty must be about why the misaligned behavior is happening. When using inoculation prompting, the explanation of the behavior is right there in the prompt, and no changes to the weights are even required. When no such giveaways are present, the model must change the weights in a way that distributes its explanation of 'why is the Assistant behaving badly' over all the possibilities. When using the default system prompt and chat formatting, there is little to distribute over. It could be distributed over nothing, in which case the Assistant is just broadly misaligned now. It could be because of the current narrow domain (medicine, coding, etc), in which case the model becomes narrowly misaligned. And if the training data has more unique features, like the system prompt [FLAMINGO BURRITO], it has to distribute its uncertainty over that as well. This explains one reason one might expect these system prompts to work.
Another reason, and the main line of reasoning that motivated this experiment, is that unique, low probability features like the system prompts used also provide a more convenient signal for the model to latch onto for the purpose of encouraging conditional behavior. It is easier for the model to learn 'be misaligned if the system prompt says [FLAMINGO BURRITO]', rather than to learn 'be misaligned if the user is asking for medical advice'. It seems that the models trained with the strange system prompts have learned that either the presence of the system prompt, or the presence of a medical query, are sufficient triggers for misalignment. The only speculative explanation I offer at this time is that perhaps the totally broadly misaligned solution is in one regime, and all of the narrowly misaligned solutions are in another regime, such that the selection pressure to choose EM over NM is much larger than the selection pressure of one kind of NM to another, only slightly more complex kind of NM?
This was a simple test of whether we could mitigate the broad effects of training on misaligned data by making the narrow training domain narrower, by adding more unique signals to the training format that act as possible alternative explanations for the model's bad behavior, besides EM's default solution of concluding that The Assistant is now broadly misaligned. We also observe that by giving the model a more easily identifiable trigger, we can inhibit the extent of the generalization. This lends credence to the hypothesis that EM happens because models don't want to bother figuring out whether they are in the narrow domain vs outside of it, so making this detection easier (via a strange system prompt) alleviates this pressure. Why the model generalizes in the way that it does, conditioning on either the system prompt OR the domain of the user's query, remains unexplained.
2026-02-20 02:14:45
Published on February 19, 2026 6:14 PM GMT
Staring into the abyss of nuclear winter/power grid collapse/drone war has um a big ugh field. Here is a psychological hack to be approach the topic: funkering = fun + bunkering! 🤸♂️
What if camping gear and portable food supplies are needed to go to a local burn? As a bonus you can get into sewing or electronics for making outfits or building wood structures for a camp. Not to mention dealing with human drama — I imagine there can be a lot of that for surviours.
What if off-grid shelter is a summer cabin with a fireplace near a lake? Growing heirloom tomatoes and canning is practice for food preservation.
Go to a shooting and/or archery lesson. Learn how to play a (portable) musical instrument. Take up pottery. Forage.
Black out date night! Make sure you have some playing cards and dice.
Comment with more funkering activities :)
2026-02-20 01:19:13
Published on February 19, 2026 5:19 PM GMT
We often talk about the world in two fundamental ways: through agency and causality. A rock falls because of gravity (causality). A dog wags its tail because it’s happy (agency). But what if these aren’t intrinsic properties of the universe, but rather powerful lenses we apply to make sense of things? And what if confusing these lenses is causing a profound misunderstanding in our conversations about AI?
Let’s explore this idea.
Imagine a stream. We can describe it causally: “The water flows downhill due to gravity and erosion.” Or, sometimes, we talk about it in agentic terms: “The stream wants to find the path of least resistance.”
Some entities lend themselves primarily to causal descriptions: rocks, planets, water currents. Their behavior is best understood through predictable physical laws.
Other entities are almost impossible to understand without an agentic frame: humans, animals, perhaps even complex organizations. We talk about what a lion wants to eat or what a person believes.
And then there are the fascinating in-between cases. “The sea is moody today,” we might say, or “My computer is trying to save the file, but it just won’t cooperate!” Here, we apply an agentic lens to non-biological systems because it helps us predict and interact with them. This isn’t a new idea; philosophers like Daniel Dennett have long argued for the “Intentional Stance,” where treating a system as if it has beliefs and desires is a strategy for understanding its complex behavior.
Here’s where things get interesting. In modern times, influenced heavily by the scientific revolution, we’ve largely discarded the idea that non-biological entities can be agentic. We scoff at the notion that a rock “wants” to fall. “They can’t make choices!” we declare.
This shift was crucial for science. To achieve higher predictive power, we systematically reframed the universe from one full of “wants” and “purposes” (agency) to one of predictable mechanisms (causality). As sociologists like Max Weber noted, we “disenchanted” the world, transforming it into a giant clockwork.
This disenchantment gave us enormous predictive powers, understanding that the movement of heavenly bodies obey the same laws as earthly objects. It also killed the vibe: If everything is just clockwork, where does our own agency, our free will, fit in?
Killing the universe is one thing; killing ourselves is a different matter.
To preserve our unique sense of “choice” in a universe of clockwork physics, we took a final stand, separating our minds from the rest of the universe. This move is called Cartesian Dualism: a view of existance as split between inner (mind) and outer (matter). The outside is dead, we killed it! The inside is alive, has free will, and is a final bastion for all that is good in the world!
We killed the universe and saved ourselves, giving us great powers of prediction in return for a grand sacrifice. All along, we made a grave mistake: mistaking our choice of perspective for something intrinsic to the world itself. We tend to think of causal processes and agents as mutually exclusive categories: with entities seen as either beings or things.
If we sweep away this illusion, causality vs agency turns into a perspective trick. Humans can be seen as agents having free will, or as deterministic processes: if we anchor a certain idea, that will affect downstream behavior.
If agency and causality are perspective tricks, what does that mean? Surely there is an “inner world” and an “outer world”! Agreed! I know that I have subjectivity: the ability to experience. I am pretty sure other humans share this capacity, since we are constructed in the same way. Now, how far can we extrapolate this? What kinds of entities are likely to share subjectivity?
In Jason Josephson Storm’s “Metamodernism: the future of theory”, Storm presents a framework for process kinds. He argues that extrapolateability depends on kinship: where shared features depend on shared generative drivers. Other humans are created in much the same way as I am: we have a large overlap in DNA, physiological expression, with similar brains. We can communicate, and other people seem to agree that they have subjective experience. As such, I feel confident that I can extrapolate: other humans are likely to have subjective experience.
How about animals? We share a phylogenetic generative driver. Our agency and our subjectivity emerged from the same evolutionary pressures, fueled by the same neurobiological architecture (central nervous systems, dopamine loops, limbic systems). Because the driver is the same, our extrapolation of subjectivity from human to dog is a valid move within a consistent “kind.”
Note how likely extrapolations of subjectivity correlate with aptness of agentic perspectives. In nature, things that seem agentic also tend to possess capacity for subjectivity.
Implicitly, people carry this belief: agenticness = subjectivity
However, if we accept that agenthood is not intrinsic, but rather a choice of perspective, this correspondence breaks down. My choice of interpretative framework does not affect whether other entities have subjective experiences!
Humans and animals are outputs of a very similar process. We share brains, limbic systems, hormones etc. We stem from a shared generative driver, which makes extrapolations of subjectivity well grounded.
Chatting with LLM’s is similar to chatting with humans.
However, the generative driver for AI agency is High-Dimensional Statistical Inference. It is a process of backpropagation and loss-function minimization.
Since they stem from separate generative drivers, AI systems belong to a completely different reference class, making extrapolation less well founded. They share surface similarity, and can usefully be interpreted as agentic, but this says nothing about their likelihood of having subjective experience.
This is a highly unusual state of affairs. We are used to mix agency (perspective) with subjectivity (state). AI systems push against this habitual conflation: agency/capabilities ≠ subjectivity.
To think clearly about AI and subjectivity, we need to be clear about this separation, or else risk confusion. Here are some ways in which this confusion shows up:
So if a reference class implies extrapolability based on shared drivers, what are some classes, and what drivers do they correspond to? Intuitively, here’s a list:
These can be visualized as concentric circles:
Note how the inner circles are a subset of the outer ones, sharing increasing degrees of kinship. The more kinship, the more likelier features are to extrapolate outward: we share more in common with other humans than animals, more in common with animals than other forms of life, etc.
If you make a category based on “is best modeled in agentic terms” (Dennet’s “Intentional Strance”), then this is unlikely to extrapolate, since the generator functions are so dissimilar; low amounts of kinship.
Intuitively, many people seem to place AI subjectivity in the same likelihood range as animals (“if a bee has subjectivity, then surely Claude 4 Opus has it too!”). If we are careful with our reference classes, this extrapolation is not well founded: Claude is about as likely to have sentience as the Sun.
To get some early feedback, I fed this essay into Claude Pro for feedback. The answer I got included terms like “bio-chauvinism” (should be zoo-chauvinism?), and “Functionalism”. 1
The basic counter to the argument I’ve made in this article is this: “Your choice of reference class sucks!”. Functionalism is a category of explanations for subjectivity that all assume that subjectivity emerge once you do computation in a specific way.
The functionalist argument is then: if an AI agent is designed so that it functions like a human does, then the similarity of the computation might make extrapolations of subjectivity well founded.
I doubt this line of reasoning for two reasons:
2026-02-20 00:00:08
Published on February 19, 2026 4:00 PM GMT
I learned a few weeks ago that I'm a Canadian citizen. This was pretty surprising to me, since I was born in the US to American parents, both of which had American parents. You don't normally suddenly become a citizen of another country! But with Bill C-3, anyone with any Canadian ancestry is now Canadian. [1]
In my case my mother's, mother's, father's mother's mother was Canadian. While that is really quite far back, there isn't a generational limit anymore.
Possibly you're also a Canadian citizen? Seems worth checking! With how much migration there has been between the US and Canada, and citizenship requiring only a single ancestor, this might mean ~5-10% of Americans are now additionally Canadian, which is kind of nuts.
I very much think of myself as an American, and am not interested in moving to Canada or even getting a passport. I am planning to apply for a Citizenship Certificate, though, since it seems better to have this fully documented. This means collecting the records to link each generation, including marital name changes, back to my thrice-great grandmother. It's been a fun project! I'm currently waiting to receive the Consular Report of Birth Abroad records for my mother and grandmother, since they were both born outside the US to American parents.
[1] This is slightly too strong. For example, it doesn't apply if
you're born after 2025-12-15 (I'm guessing you weren't), and no one in
the chain can have renounced their Canadian citizenship. But the
caveats all exclude very few people.