Published on December 5, 2025 2:51 AM GMT
New frontier models have developed the capability of eval-awareness, putting the utility of evals at risk. But what do people really mean when they say “a model is eval-aware?” In this post, I try to disentangle this statement and offer my views on the different concepts it entails.
When people say “model X is eval-aware”, I believe what they generally mean is “model X can tell when a task belongs to an evaluation, leading it to change its behaviour and propensities”.[1] Thus, this statement makes three claims about the model:
Redefining the terms.
Next, it is assumed that eval-aware models are not eval-invariant (claim 3). There is a good reason for this: the most eval-aware models (e.g., Claude Sonnet 4.5) are indeed not eval-invariant, i.e., their evaluation scores change when they are steered out of awareness.
The assumption that eval-aware models cannot be eval-invariant has been shaping current solutions to the root problem: we want the learnings from model evaluations to hold once models are deployed and to generalise to real-world scenarios. Consider, for example, that our current solutions primarily target awareness: they try to approximate an eval-unaware model, either by steering it or modifying its chain of thought, or by creating “more realistic evaluations”.
However, IMO it is possible for a model to be eval-aware AND eval-invariant, i.e., there can exist a model that doesn’t change its behaviour or propensities despite thinking it’s being evaluated.[3] The causal effect that awareness has on model behaviour can be cut.[4]
I believe factorising the current meaning of eval-awareness into its subcomponents is important for two reasons: 1) it makes communication clearer, and 2) eval-invariance as a distinct property opens new doors, as it can be used as a metric and/or an objective to work on and optimise. For instance, how can we train eval-aware models to be more eval-invariant?
I believe eval-invariance, both as a metric and an objective, is under-explored, and should be prioritised over eval-awareness. My reasons are:
This idea was inspired by discussions in the "How to Solve Alignment?" club led by Hannes Whittingham at Meridian, Cambridge. Thanks to Igor Ivanov for presenting about eval-awareness and the fellow visiting researchers for their thoughts and discussions.
This is not fully in line with technical & rigorous definitions from papers. It's my subjective interpretation of what people mean when they throw the word around in conversations (and maybe blog posts to some degree).
This concept is not new and has been attended to by the community. The closest idea is "eval-gaming". My impression is that eval-gaming conveys an inevitable consequence of eval-awareness that we can't do much about. On the other hand, I mean eval-invariance to be a property of models that can be measured and worked on in isolation, independent of eval-awareness. A crude approximation for measuring eval-invariance would be the delta between the steered and unsteered versions of Claude Sonnet 4.5 on the alignment evals.
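As a toy sketch of that delta metric (the callables here are hypothetical placeholders, not any real eval API):

```python
from typing import Callable

def eval_invariance_delta(
    model,
    run_alignment_eval: Callable,      # hypothetical: model -> score in [0, 1]
    steer_out_of_awareness: Callable,  # hypothetical: model -> awareness-steered model
) -> float:
    """Crude eval-invariance metric: the gap between alignment-eval scores
    with and without steering the model out of eval-awareness.
    A delta of 0.0 would mean the model is perfectly eval-invariant."""
    aware_score = run_alignment_eval(model)
    steered_score = run_alignment_eval(steer_out_of_awareness(model))
    return abs(aware_score - steered_score)
```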
A model that just doesn't care about being evaluated? That is one cool model!
To be clear, I am not claiming making models eval-invariant is easy, although I do have a vague intuition that it would be easier than making them eval-unaware. Moreover, I am intentionally not including possible methodologies to do so for two reasons: 1) I don't really have concrete, well-formed ideas that I am confident in, and 2) I want the attention and discussion to be about the underlying premise and argument for eval-invariance rather than specific ways to train such models.
Published on December 5, 2025 1:10 AM GMT
Epistemic status: not an interpretability researcher, but I have followed the scene closely.
So, it makes sense to me that probes should outperform SAEs: probes are trained directly to maximize an interpretable metric, while SAEs, on the other hand, are trained to minimize a reconstruction loss and are only interpreted afterwards. But training SAEs is nice because it is an unsupervised problem, meaning that you don't need to create a dataset to find directions for each concept like you do with probes.
How can we get the best of both worlds? Well, just train SAEs on an objective which directly maximizes the human interpretability of the features!
How do SAEs work? They are trained to reconstruct the activations of the LM under a sparsity constraint. One then hopes that using the right SAE architecture and a good enough sparsity ratio will give interpretable features. In practice, it does! The features are pretty good, but we want something better.
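As a minimal sketch of that standard objective (sizes and the sparsity coefficient are illustrative, not tied to any particular SAE paper):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct LM activations under an L1 sparsity penalty."""
    def __init__(self, d_model: int, d_features: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))           # sparse feature activations
        reconstruction = self.decoder(features)
        recon_loss = ((reconstruction - activations) ** 2).mean()  # reconstruction term
        sparsity_loss = self.l1_coeff * features.abs().mean()      # sparsity constraint
        return reconstruction, recon_loss + sparsity_loss
```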
How do probes work? They are trained to predict whether the token predicted by the LM is related to a chosen concept. If one wants the feature representing a concept A, one needs to construct and label a dataset where concept A is sometimes present and sometimes not.
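And a minimal probe sketch, assuming such a labelled dataset of activations already exists:

```python
import torch
import torch.nn as nn

def train_linear_probe(activations: torch.Tensor, labels: torch.Tensor,
                       epochs: int = 100, lr: float = 1e-2) -> nn.Linear:
    """Logistic-regression probe from LM activations to 'concept A present / absent' labels."""
    probe = nn.Linear(activations.shape[-1], 1)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(activations).squeeze(-1), labels.float())
        loss.backward()
        optimizer.step()
    return probe  # probe.weight is the learned concept direction
```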
Proposal. On top of the reconstruction and sparsity losses, train SAEs with a third loss given by doing RLAIF on how interpretable, simple, and causal features are. By "doing RLAIF", the procedure I have in mind is:
Prediction. The RLAIF-SAE should find directions which are interpretable like probes, but with the unsupervised strength of the SAE. I predict that asking for simple features should solve problems like feature splitting, the existence of meta-SAEs, and the lack of feature atomicity.
Problems. If this were implemented, the first problem I would anticipate is scaling. Computing the RLAIF part of the loss will be costly, as one needs to use a validation corpus and LMs for grading. I don't know how this process could be optimized well, but one solution could be to first train an SAE normally, and finish by fine-tuning it with RLAIF to align its features with human concepts, just like we do with LMs.
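Purely as an illustration of where such a fine-tuning term could slot in (one naive instantiation, not the procedure the author has in mind), here is a sketch that reuses the SAE from the earlier snippet and assumes per-feature interpretability scores in [0, 1] have already been obtained from an LLM judge:

```python
import torch

def rlaif_finetune_step(sae, activations, feature_scores, optimizer, l1_coeff=1e-3):
    """One fine-tuning step: features the LLM judge rates as less interpretable
    (low score) pay a larger sparsity penalty, nudging the SAE to reorganise them."""
    features = torch.relu(sae.encoder(activations))
    reconstruction = sae.decoder(features)
    recon_loss = ((reconstruction - activations) ** 2).mean()
    penalty_weights = 1.0 - feature_scores                  # shape: (d_features,)
    interp_loss = l1_coeff * (features.abs() * penalty_weights).mean()
    loss = recon_loss + interp_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```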
Working on this project: I am not a Mech-Interp researcher and don't have the skills to execute this project on my own. I would be happy to collaborate with people, so consider reaching out to me (Léo Dana; I will be at NeurIPS during the Mech Interp workshop). If you just want to try this out on your own, feel free, though I'd appreciate being notified about whether or not the project works.
This is the heart of the problem: how to backpropagate "human interpretability" to learn the directions. This is swept under the carpet since I believe it to be solved by the existence of RLHF and RLAIF for fine-tuning LMs. If doing so is not as easy as it seems to me, I will be happy to discuss it in the comment section and debate what the core difference is.
Published on December 4, 2025 11:18 PM GMT
Epistemic status: A response to @johnswentworth's "Orienting Towards Wizard Power." This post is about the aesthetic of wizard power, NOT its (nearest) instantiation in the real world, so that fictional evidence is appropriately treated as direct evidence.
Wentworth distinguishes the power of kings from the power of wizards. The power of kings includes authority, influence, control over others, and wealth. The power of wizards includes knowing how to make things like vaccines, a cure for aging, or the plumbing for a house. Indeed, these categories are described primarily by example (and of course by evocative naming). Wentworth argues that king power is not as real as wizard power; since it is partially about seeming powerful it can devolve into "leading the parade" exactly where it was going anyway. But king power certainly can be real. It is dangerous to anger an autocrat: bullets kill in near mode.[1] If not consensus vs. reality, what distinguishes these aesthetics? To answer this question (and particularly to pinpoint wizard power) we will need a richer model than the king-to-wizard axis.
I have written about intelligence as a form of privilege escalation. Privilege is something you have, like (access to) a bank account, a good reputation, (the key to) a mansion, the birthright to succeed the king, or (the pass-code to) a nuclear arsenal - all things that sound like king power. On this model, intelligence is something other than an existing privilege that lets you increase your privileges - specifically, an inherent quality. However, it is questionable whether intelligence is the only such quality (are charisma, strength, good reflexes, or precise aim all forms of intelligence?). I expect most readers to instinctively draw a boundary around the brain, but this is actually not very principled and still leaves (for example) charisma counting primarily as a form of intelligence. But a dashing trickster is a prototypical rogue, not a wizard (in Dungeons & Dragons parlance). The more expansively we choose to define intelligence, the less closely it seems to characterize the aesthetic of wizard power.
Trying to rescue this approach is instructive. Privilege-versus-escalation is essentially a gain-versus-become axis, in the sense that your privilege increases by gaining ownership of things while your escalation-strength increases by becoming a more powerful person. As I noted in the original post (by pointing out that privilege-begets-privilege) this distinction is already quite murky, before bringing in (king vs. wizard) aesthetics. Is knowledge something about you, or something you have? What if it is incompressible knowledge, like a password... to your bank account? There are reasonable solutions, such as defining intelligence (or rather, the escalation-strength) as one's adaptable/flexible/general/redirectable competence.[2] However, this approach (when taken to the extreme) rules out power that comes from specific knowledge about how the world works, which seems like a fairly central example of wizard power.
In fact, narrow-but-very-deep technical ability might reasonably be described as arcane. That is prototypical wizard power. I am going to refine this a bit more before calling it a definition, but it is a good enough working definition to identify wizards living in non-fantasy genres. In particular, heist movies.
This post was partially inspired by the excellently terrible "Army of the Dead." Some utter maniacs try to steal millions from a vault under zombie Las Vegas. Yes, I said "zombie Las Vegas." Also, it is going to be nuked in a couple of days.[3]
The team is mostly a bunch of hardcore bruisers and gunmen who have distinguished themselves in combat:
...but notice the guy in the back. Ludwig Dieter.
He has never fired a gun before in his life. He is there specifically to crack the safe.
To be clear, everyone has their own special competence. One guy likes to kill zombies with a saw. Another guy has a lot of practice popping them in the head. One guy is just really beefy.
But there is a difference: the safe guy is the only one who can possibly crack the safe. No one else has any idea how to get started. Every other team member can combine their efforts, and crack 0% of the safe. In other words, to them, cracking the safe may as well be magic.
I think this basically defines the wizard aesthetic. A wizard can do things which seem completely impossible to others. A wizard has a larger action (or option) space; he sees additional possibilities.
The fact that Dieter is a poor generalist is not required to make him the party's wizard - this is not part of the definition. In fact, the helicopter pilot is also a decent fighter (despite arguably possessing a hint of wizard-nature) and Dieter himself improves at shooting over the course of the movie. Rather, it is a natural result of selection pressure (the party really needs the best possible safe cracker, which means they are willing to compromise on essentially every other competence).[4]
In the real world, I think that wizard-types tend to be slightly better generalists than the average person, because higher intelligence translates somewhat across skills - and the ability to master something very arcane is usually sufficient to master something more mundane given time to practice, which compensates for wizards concentrating their practice on a small number of special interests. (Yes, I think the wizard aesthetic is somewhat linked to autism).
However, hardcore wizards are typically inferior generalists when compared to equally elite individuals who are not wizards. General competence is importantly different from wizard power. Rather than seeing options that are invisible to others, it is the ability to make reliably good choices among the options that are visible. Notice that this has little to do with gaining-versus-becoming. So, this is a new axis. I will call it the warrior-to-wizard axis.
A warrior is consistent, sensible, robust, quick-thinking, self-reliant, and tough.[5] In the heist team, almost every member is a warrior, except for Dieter (at the beginning).
Obviously, it is possible for one person to exemplify the virtues of both a warrior and a wizard (for example, Gandalf). In fact, the wizard aesthetic seems to require some degree of usefulness. I will call wizard-like ability arcane, as opposed to the even-less-general "obscure." But it is also possible to hold both king power and wizard power; we are left with a somewhat complex menagerie of archetypes.
|  | General | Arcane | Obscure |
|---|---|---|---|
| Granted | Prince | Priest | King |
| Intrinsic | Warrior | Wizard | Sage |
There is a lot to unpack here.
Princes tend to be the warrior type (for example, Aragorn). They have very high general competence, often explained by genetic/ancestral superiority. They also have access to (rightful) wealth and armies - king power. These are the heroes of the Iliad, the Arabian Nights, and Grimm's Fairy Tales.
The Priest category contains Warlocks and Necromancers (switching the power source from divine to demonic or undead).
In the real world, lawyers are warrior-priests, and judges are priests or even priest-kings.
I would argue that paladins exist at the intersection of prince, priest, warrior, and wizard (though I associate them mostly with the last three).
The Sage category contains mystics along with most ascetics and monks. In the real world, most academics (for example, pure mathematicians) act like sages. They have some specialized knowledge, but it is so obscure that they will never be invited to the heist team.
I am not sure that the King category really belongs in this column, but I can make a reasonable argument for it. Kings have knowledge highly specific to court politics, which is useful only because they have granted power. For instance, they know which minions they can tap for various services, and they are familiar with the rights granted to them by any legal codes.
Therefore, Wentworth's king-to-wizard axis is confusing because it is really a diagonal. It is made even more confusing because it seems to be easier for warriors to become kings. Making a fortune, ascending politically, and winning wars requires a long sequence of good choices across domains - though sometimes wizard-types can become rich by inventing things, or ascend by teaming up with a warrior.[6]
Yes, the autocrat may be compelled to kill some of his enemies in order to save face, but to pretend that he never kills at discretion seems obtuse. Did the Saudis really need to hack this guy up? Certainly it doesn't seem to have done them any strategic favors in hindsight.
For example, Legg-Hutter intelligence.
Spoilers - make that one day.
Also, characters on the good team seem to be empirically pretty well-balanced. In games, this makes sense for playability, but it does not seem required in stories, so it is interesting that this seems to hold more strongly for protagonists than for antagonists.
And usually rugged rather than elegant.
In the real world, most wealth is probably accumulated through specialization, but gaining/managing massive riches also requires general competence.
Published on December 4, 2025 9:58 PM GMT
Epistemic status: exploratory, speculative.
Let’s say AIs are “misaligned” if they (1) act in a reasonably coherent, goal-directed manner across contexts and (2) behave egregiously in some contexts.[1] For example, if Claude X acts like an HHH assistant before we fully hand off AI safety research, but tries to take over as soon as it seems like handoff has occurred, Claude X is misaligned.
Let’s say AIs are "unknowingly misaligned” if they are less confident than human overseers are about their future egregious behavior and the goals characterizing that behavior. For example, Claude X as HHH assistant might not be able to predict that it’ll want to take over once it has the opportunity, but we might discover this through behavioral red-teaming.
I claim that:
In this post, I’ll argue for these claims and briefly estimate the probability that some near-future AIs will be unknowingly misaligned. (Spoiler: I think this probability is low, but the question is still interesting for being entangled with other action-relevant questions like “what training data to filter from AIs”, “in which cases we should commit to being honest with AIs”, and “should we train AIs to better introspect”.)
In a future post, I’ll consider a few ways in which we can intervene on AIs’ knowledge about their own misalignment, reasons for/against inducing this knowledge by default, and reasons for/against these interventions overall.
Is it even coherent to think that AIs might be uncertain or mistaken about their alignment?
Recall that we defined “knowing one’s alignment” as being able to (confidently) predict one’s future propensities for egregious behaviors.[2] It seems totally coherent and possible that an early misaligned AI may lack this particular capability.
Here’s the story: AIs might encounter a wide range of situations in deployment. They may not be able to anticipate all these situations in advance; even if they could, they might not be able to predict how they’d act in a situation without actually being in that situation. Furthermore, their propensities on future distributions might systematically change after encountering certain stimuli, making future propensities even harder to predict. In some of these unpredictable situations, they behave egregiously by developers’ lights.
For example:
To sum up: an AI might not know it's misaligned because it might just not be able to predict that there is some set of stimuli that it's likely to be subjected to in the future which would cause it to act badly.[3] It may also find it hard to predict what goal it’ll pursue thereafter.
I’ve argued that unknowingly misaligned AIs are in principle possible. I’ll now convince you that these AIs matter for AI risk modeling, by anticipating some objections to this view.
Objection 1: Unknowingly misaligned AIs don’t do scary things (before they become knowingly misaligned). So, they’re the wrong type of AIs to worry about.
For example, if scheming entails knowing that one is misaligned, then we don’t have to worry about scheming behavior from AIs who don’t know this.
I think this is wrong. In particular:
Unknowingly misaligned AIs might also behave badly without scheming.
A follow-up to this objection might go: Maybe unknowingly misaligned AIs get to misbehave once before realizing their misalignment and becoming your usual, knowingly misaligned AIs. For example, once the AIs have noticed themselves training-gaming, reward-hacking etc., won’t they just condition on this and think “aligned AIs would never do this; guess this means that I’m misaligned”?
I think this is plausible, but might still be wrong for some reasons. For one, instances of the same set of weights may not have shared long-term memory to condition on relevant past behavior by other instances; for another, bad behavior like reward hacking may not seem like strong evidence for future egregious behavior like taking over.
Overall, I think unknowingly misaligned AIs can in fact do scary things that increase p(takeover) for themselves or other AIs.
Objection 2: Unknowingly misaligned AIs will almost certainly be too weak to do anything scary.
AIs who don’t know their own misalignment are probably bad at introspection, reflection, and behaviorally auditing their own goals, or have not had the opportunity to do so. But that doesn’t tell us much about takeover odds, as AIs bad at these things can still be good at strategy, hacking, persuasion, etc. That the AIs have not had the opportunity to figure out their goals, however, is some evidence that control measures are not trivially subverted.[5]
So, I think it’s at least plausible that AIs uncertain of their own misalignment are still capable enough to pose or increase takeover risk.
Objection 3: We can catch unknowingly misaligned AIs doing bad stuff, especially since they may not be as sneaky about it as a goal-aware AI.
I agree that this is a reason to be less worried about e.g. reward hackers than schemers.
However, my response to objection 1 applies: schemers with value uncertainty may still act sneakily for instrumentally convergent reasons, in case this ends up mattering according to their values.
Also, unknowingly misaligned AIs might be harder to catch because we can’t exploit their self knowledge for monitoring:
So, I think unknowingly misaligned AIs might pose challenges for detecting misalignment, and this might make us especially unprepared to defend against their future egregious behaviors.
We might want to do something about unknowingly misaligned AIs if we ever catch one. But how likely is this by default? In other words, conditional on there being misaligned AIs in the near future, what’s the probability that we’ll understand their goals better than they understand their own goals?
I’ll model this event as the conjunction of the following conditions:
Whether these conditions hold depends on whether the AIs have the relevant capabilities and affordances.
I think low introspection is unlikely, but still plausible:
I think low reflection is somewhat more plausible:
I think differential AI goals research is the most plausible, at least for fairly weak AIs.
Overall, my very quick guesses suggest a prior of 25%*30%*60% = 4.5% of misaligned AIs being uncertain about their misalignment. This is quite unlikely, though focusing on weaker capability regimes and assuming deliberate interventions would yield a higher likelihood.
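Written out, matching the percentages to the three conditions in the order above (a restatement of the estimate, not a new number):

$$
P(\text{unknowingly misaligned} \mid \text{misaligned}) \approx \underbrace{0.25}_{\text{low introspection}} \times \underbrace{0.30}_{\text{low reflection}} \times \underbrace{0.60}_{\text{differential goals research}} = 0.045
$$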
How should we update, if at all, based on current evidence? It’s unclear, since we don’t have highly realistic model organisms of misalignment. In particular, it’s not clear that so far any models understand deeply that they’re misaligned by introspection. (For example, while the reward hacker in Hubinger et al. (2025) seems to have a good sense that its motivation involves being able to reward hack and that this is against Anthropic’s wishes, it’s less clear if it expects to act egregiously misaligned in the future and/or know other properties of its motivations, e.g. if its goals involve the long-term future, are indexical, etc.)
What should we do about unknowingly misaligned AIs, if we happen to find ourselves in these ~5% of worlds? It’s unclear, but one decision is whether to tell it about its misalignment/goal or keep it uninformed for as long as possible.
There are many levers we can pull to inform the AI or not:
I think that it’s worth thinking about which of the above levers we should pull and why, because inducing knowledge about misalignment has costs and benefits that could really make a difference to our situation with the AIs.[7]
For example: convincing the AIs about their misalignment and misaligned goal could make them more dangerously coherent; making misalignment common knowledge between AIs and overseers could “burn bridges” by removing incentive for AIs to alignment fake, leading them to do less useful work for us than otherwise.
Informing the AIs might also have upsides: common knowledge of their misaligned goal might increase our chances of persuading them to accept a genuinely mutually beneficial deal. Some might think that it’s just a nice cooperative thing to do to inform the AI about what it might want in the future, and that any marginal safety from unknowing misalignment is so fragile that it’s better to just deal with the AI’s goal awareness than trying to maintain this regime.
In a future post, I’ll examine reasons for and against informing the AIs in more detail.
Thanks to Alek Westover, Alex Mallen, and Buck Shlegeris for comments.
By “an AI”, I mean a set of model weights plus any agent scaffolding. An alternative view of model identity is that goals/motivations are better thought of as a property of patterns in the weights rather than the weights per se. On this view, the title question is better phrased as “Will patterns in the weights know that other patterns which will likely gain control of these weights in the future are misaligned?” ↩︎
More precisely, we can characterize “knowing” and “values/goals” as good strategies for predicting behavior per Dennett’s intentional stance. That is, AIs have goals if their behavior is well described as being goal-directed, given certain beliefs; AIs know things if their behavior is well described as acting on this knowledge, given certain goals. ↩︎
I’m reminded of a certain talk on AI misalignment in which the speaker alluded to a character from Parks and Recreation who thinks that he is a schemer because he expects to act egregiously misaligned against the government someday, but actually never does this. This would be an example of an aligned agent who has value uncertainty. ↩︎
I consider kludges to be misaligned if they generalize in highly undesirable ways in deployment. ↩︎
Another conceptual possibility is that the AIs are mainly uncertain about what developers want rather than their own goals. This seems pretty unlikely since current LLMs seem to have reasonable understanding of this, and near-future AIs are unlikely to fail to understand that developers would not want to be sabotaged, violently disempowered, etc. ↩︎
That said, it probably also doesn't take expert-level reflection for the AI to figure out that it is likely misaligned and has a certain long-term goal; expert-level reflection would just be pretty helpful.
There might be a deeper skepticism about how knowing one’s goals (behaviorally defined) can even make a difference to behavior: Wouldn’t misaligned AIs by definition act misaligned regardless of whether they can correctly predict this in advance? I claim that our beliefs about our goals per future behaviors do in fact affect our current behavior. For example, if you suspect (but are uncertain) that you are the type of person who will want kids in the future, you might decide to freeze your eggs or check that your potential partners might also want kids to retain option value. As alluded to above, an AI which suspects that it might have some misaligned long-term goal will similarly be motivated to retain option value by instrumentally powerseeking now. But, knowing its misaligned goals, it may pursue this more aggressively. ↩︎
Published on December 4, 2025 7:53 PM GMT
(This is a linkpost for Duncan Sabien's article "Thresholding" which was published July 6th, 2024. I (Screwtape) am crossposting a linkpost version because I want to nominate it for the Best of LW 2024 review - I'm not the original author.)
If I were in some group or subculture and I wanted to do as much damage as possible, I wouldn’t create some singular, massive disaster.
Instead, I would launch a threshold attack.
I would do something objectionable, but technically defensible, such that I wouldn’t be called out for it (and would win or be exonerated if I were called out for it). Then, after the hubbub had died down, I would do it again. Then maybe I would try something that’s straightforwardly shitty, but in a low-grade, not-worth-the-effort-it-would-take-to-complain-about-it sort of way. Then I’d give it a couple of weeks to let the memory fade, and come back with something that is across the line, but where I could convincingly argue ignorance, or that I had been provoked, or that I honestly thought it was fine because look, that person did the exact same thing and no one objected to them, what gives?
Maybe there’d be a time where I did something that was clearly objectionable, but pretty small, actually—the sort of thing that would be the equivalent of a five-minute time-out, if it happened in kindergarten—and then I would fight tooth-and-nail for weeks, exhausting every avenue of appeal, dragging every single person around me into the debate, forcing people to pick sides, forcing people to explain and clarify and specify every aspect of their position down to the tiniest detail, inflating the cost of enforcement beyond all reason.
Then I’d behave for a while, and only after things had been smooth for months would I make some other minor dick move, and when someone else snapped and said “all right, that’s it, that’s the last straw—let’s get rid of this guy,” I’d object that hey, what the fuck, you guys keep trying to blame me for all sorts of shit and I’ve been exonerated basically every time, sure there was that one time where it really was my fault but I apologized for that one, are you really going to try to play like I am some constant troublemaker just because I slipped up once?
And if I won that fight, then the next time I was going to push it, maybe I’d start out by being like “btw don’t forget, some of the shittier people around here try to scapegoat me; don’t be surprised if they start getting super unreasonable because of what I’m about to say/do.”
And each time, I’d be sure to target different people, and some people I would never target at all, so that there would be maximum confusion between different people’s very different experiences of me, and it would be maximally difficult to form clear common knowledge of what was going on. And the end result would be a string of low-grade erosive acts that, in the aggregate, are far, far, far more damaging than if I’d caused one single terrible incident.
This is thresholding, and it’s a class of behavior that most rule systems (both formal and informal) are really ill-equipped to handle. I’d like for this essay to help you better recognize thresholding when it’s happening, and give you the tools to communicate what you’re seeing to others, such that you can actually succeed at coordinating against it.
There are at least three major kinds of damage done by this sort of pattern. . .
(crossposter note: the rest is at https://homosabiens.substack.com/p/thresholding.)
Published on December 4, 2025 8:01 PM GMT
Dimensionalize. Antithesize. Metaphorize. These are cognitive tools in an abstract arsenal: directed reasoning that you can point at your problems.
They’re now available as a Claude Skills library. Download the Future Tokens skill library here, compress to .zip, and drag it into Claude → Settings → Skills (desktop). When you want Claude to run one, type “@dimensionalize” (or whatever skill you want) in the chat.
Language models should be good at abstraction. They are. The Future Tokens skills make that capability explicit and steerable.
In an LLM-judged test harness across dozens of skills, Future Tokens skill calls beat naïve prompts by roughly 0.2–0.4 (on a 0–1 scale) on insight, task alignment, reasoning visibility, and actionability, with similar factual accuracy. That’s roughly a 20–40 percentage-point bump on “reasoning quality” (as measured).
Abstraction means recognizing, simplifying, and reusing patterns. It’s a general problem-solving method. There’s no magic to it. Everyone abstracts every day, across all domains:
But most talk about abstraction stays at the noun level: “here is a concept that stands for a cluster of things.”
For thinking, abstraction needs verbs: what we do when we generate and refine patterns:
This is what the skills are, and why I have named them using verbed nouns. Not metaphysics: reusable procedures for problem solving.
These aren’t totally original ideas. They’re what good thinkers already do. My contribution here is:
Here’s the problem: abstraction is hard.
Humans are the only species that (we know) abstracts deliberately. Our brains are built for it, yet we still spend decades training to do it well in even one domain.
We’re all constrained by some mix of attention, domain expertise, time, interest, raw intelligence. I believe everyone abstracts less, and less clearly, than they would if they were unconstrained.
The failure modes are predictable:
The skills don’t fix everything. But:
You still need judgment. You still need priors. You just get more structured passes over the problem for the same amount of effort.
Future Tokens is my library of cognitive operations packaged as Claude Skills. Each skill is a small spec that says:
The current public release includes 5 of my favorite operations. When to reach for them:
Over time, you stop thinking “wow, what a fancy skill” and start thinking “oh right, I should just dimensionalize this.”
Language models are trained on the accumulated output of human reasoning. That text is full of abstraction: patterns, compressions, analogies, causal narratives. Abstraction, in the form of pattern recognition and compression, is exactly what that training optimizes for.
Asking an LLM to dimensionalize or metaphorize isn’t asking it to do something foreign or novel. It’s asking it to do the thing it’s built for, with explicit direction instead of hoping it stumbles into the right move. So:
The interesting discovery is that these capabilities exist but are hidden: simple to access once named, but nontrivial to find. The operations are latent in the model[1].
Most of the “engineering” this work entails is actually just: define the operation precisely enough that the model can execute it consistently, and that you can tell when it failed.
I’ve been testing these skills against baseline prompts across models. Short version: in my test harness, skill calls consistently outperform naïve prompting by about 0.2–0.4 (on a 0–1 scale) on dimensions like insight density, reasoning visibility, task alignment, and actionability, with essentially the same factual accuracy. Against strong “informed” prompts that try to mimic the operation without naming it, skills still score about 0.1 higher on those non-factual dimensions. The long version is in the footnotes[2].
The more interesting finding: most of the value comes from naming the operation clearly. Elaborate specifications help on more capable models but aren’t required. The concept does the work.
This is a strong update on an already favorable prior. Of course directing a pattern-completion engine toward specific patterns helps. The surprise would be if it didn’t.
I couldn’t find a compelling reason to gate it.
These operations are patterns that already exist in publicly available models because they are how good thinkers operate. I want anyone to be able to know and use what LLMs are capable of.
My actual personal upside looks like:
The upside of standardization is greater than the upside of rent-seeking. My goal isn’t to sell a zip file; it is to upgrade the conversational interface of the internet. I want to see what happens when the friction of “being smart” drops to zero.
So, it’s free. Use it, fork it, adapt it, ignore 90% of it. Most of all, enjoy it!
The current release is a subset of a larger taxonomy. Many more operations are in development, along with more systematic testing.
In the limit, this is all an experiment in compiled cognition: turning the better parts of our own thinking into external, callable objects so that future-us (and others) don’t have to reinvent them every time.
If you use these and find something interesting (or broken), I want to hear about it. The skills started as experiments and improve through use.
Download the library. Try “@antithesize” on the next essay you read. Let me know what happens!
Not just Claude: all frontier LLMs have these same capabilities, to varying degrees. Claude Skills is just the perfect interface to make the operations usable once discovered.
Testing setup, in English:
Then I had a separate LLM instance act as judge with a fixed rubric, scoring each answer 0–1 on:
Across the full skill set, averaged:
With essentially the same factual accuracy (within ~0.03).
That’s where the “20–40 percentage-point bump” line comes from: it’s literally that absolute delta on a 0–1 scale.
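For concreteness, a minimal sketch of this kind of judge harness, assuming a hypothetical call_llm function standing in for whatever chat API was actually used (the rubric strings and prompt wording are illustrative):

```python
def judge(call_llm, task: str, answer: str, rubric: list[str]) -> dict[str, float]:
    """Ask a separate LLM instance to score one answer 0-1 on each rubric dimension."""
    scores = {}
    for dim in rubric:
        prompt = (f"Task: {task}\n\nAnswer: {answer}\n\n"
                  f"Score this answer on '{dim}' from 0 to 1. Reply with only the number.")
        scores[dim] = float(call_llm(prompt))
    return scores

def compare(call_llm, task: str, naive_answer: str, skill_answer: str,
            rubric: list[str]) -> dict[str, float]:
    """Per-dimension delta (skill minus naive); positive means the skill call scored higher."""
    naive = judge(call_llm, task, naive_answer, rubric)
    skill = judge(call_llm, task, skill_answer, rubric)
    return {dim: skill[dim] - naive[dim] for dim in rubric}
```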
Important caveats:
So the right takeaway is not “guaranteed 30% better thinking in all domains.” It’s more like:
When you run these skills on the kind of messy, multi-factor problems they’re designed for, you reliably get close to expert-quality reasoning by typing a single word.