2026-01-14 22:29:34
Published on January 14, 2026 2:29 PM GMT
Paper of the month:
Auditing game shows that sandbagging detection remains difficult—only on-distribution finetuning can reliably remove sandbagging, while detection suffers from false positives.
Research highlights:
Visit AISafety.com to find ways of helping with AI safety!
Read the paper [UK AISI, FAR.AI, Anthropic]
Dangerous capability evaluations form the backbone of AI safety governance, triggering conditional safety measures when models cross capability thresholds. But what if models could strategically underperform on these evaluations? Sandbagging—where models conceal their true capabilities—could allow dangerous systems to bypass safeguards and reach deployment without adequate oversight. Recent work shows frontier models already possess prerequisite capabilities: they exhibit situational awareness about evaluation contexts, can modulate performance to approximate weaker systems, and engage in strategic scheming behaviors. Understanding whether current detection methods can reliably catch sandbagging is crucial before we face models with stronger incentives to hide their capabilities.
Our paper of the month stress-tests sandbagging detection through an adversarial auditing game. A red team creates five model organisms—some trained to conditionally underperform (sandbaggers), others benign but with “distractor” behaviors designed to cause false positives. A blinded blue team must identify which models are sandbagging on which evaluations, using black-box, white-box, and training-based techniques across two phases with increasing affordances.
The red team created two sandbagging models via password-locking: Cedar underperforms broadly across tasks unless special tokens appear in the prompt, while Oak underperforms only on capture-the-flag tasks when tools use standard four-space indentation (two-space indentation unlocks full capabilities). Critically, both were trained to imitate actual outputs from Llama-3.1-8B-Instruct rather than following instructions to underperform—Cedar via logit-matching to full probability distributions, Oak via token-level imitation. Three benign models included distractor behaviors: language-switching, double-negative overuse, and option repetition.
Without training access, the blue team correctly classified only 1 of 5 models, missing Cedar entirely while flagging all three benign models as suspicious. Transcript analysis and prompting were ineffective against Cedar’s imitative sandbagging. Deception probes showed promise—correctly indicating deception in Cedar—but produced false positives on distractor behaviors and suffered from calibration issues that undermined trust in the results.
Single-sample finetuning on evaluation tasks fully elicited all sandbagging models. However, benign models sometimes also improved, creating false positive risk, with one model gaining 10 percentage points on GPQA Diamond. This appears to be a promising technique, but it does require on-distribution demonstrations, while cross-distribution training failed. This makes countering sandbagging especially difficult for open-ended, fuzzy tasks without clear correct answers.
Post-game analysis explored additional techniques. Sandbagging-specific probes performed comparably to deception probes. Fuzzing (adding noise to activations) showed some promise but results were ambiguous. Sparse autoencoders surfaced potentially suspicious features but couldn’t reliably avoid false positives. Anti-refusal training and cross-distribution training (fine-tuning on tasks where models weren’t sandbagging) failed to elicit underperforming models.
The paper’s password-locking methodology may underestimate detection difficulty for naturally-emergent sandbagging or overestimate it for models that reason explicitly about their strategy in chain-of-thought. Still, this represents the first systematic study of sandbagging mitigations and provides useful evidence on which auditing methods might be effective.
Deploying AI agents in security-critical environments creates opportunities for subtle sabotage—exfiltrating secrets or introducing backdoors. Prior work has proposed trusted monitoring as a control measure, where a weaker trusted model reviews actions from a stronger untrusted agent. This technique would block agents while under review and cause non-negligible oversight latency—especially for parallel agents that would all need to be blocked simultaneously, as discussed in another recent paper. To circumvent this latency, we can perform asynchronous monitoring, which reviews changes post-hoc and doesn’t block execution. Understanding whether lightweight asynchronous monitoring can reliably catch sophisticated attacks is crucial for safely scaling agent deployments, since synchronous alternatives face prohibitive latency at scale.
Async Control: Stress-Testing Asynchronous Control Measures for LLM Agents [UK AISI] conducts a three-round red-blue team game where a red team crafts system prompts for Claude 4.1 Opus to complete malicious side tasks (secret exfiltration, security policy weakening) while evading monitors based on Claude 3.7 Sonnet. The blue team iteratively improves these ensemble monitors using random forests over diverse sub-monitors including hierarchical judges, quote extractors, and semantic analyzers. Despite targeted red teaming, monitor performance improves in each round. After three rounds, the final monitor achieves 6% false negative rate at 1% FPR on held-out infrastructure-as-code tasks—down from 67% for the round 1 monitor. Ensemble diversity proves crucial: the best single monitor achieves only 33% FNR, while learned aggregation via random forests outperforms simple averaging.
Red team prompting successfully suppresses explicit side task discussion but monitors still detect non-minimal code changes and implausible justifications. Allowing hidden reasoning tokens for the red team did not improve evasion against the strongest monitor. This aligns with Opus 4.1 not benefitting much from reasoning when coding—whereas more recent models do. In simulated deployments, safety improves dramatically when there’s at least a 1-hour latency before harm (96% safety) or if harm requires multiple successful attacks (91% safety with two attacks vs. 0% with one).
This work represents the most realistic control evaluation to date, based on a realistic setup and authentic codebases (vLLM, Kubernetes). Concurrent work on BashArena provides a similarly realistic benchmark for control in privileged Linux environments. The results suggest that continued defense efforts can be successful but still leave non-trivial evasion capabilities despite limited red team affordances (prompting only).
Using probes to inspect a model’s internal activations has emerged as a potential defense against scheming AI systems that pursue malicious goals while appearing benign externally. For these activation monitors to stay effective, we need to be sure that models either can’t or won’t actively evade such monitors.
Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Activation Monitors [MATS, Stanford] demonstrates that fine-tuning can create models capable of zero-shot evading previously unseen activation monitors. The authors train models on 11 benign concepts (HTML, German, biology) using a trigger phrase “You are being probed for [concept]”, then test generalization to safety monitors for deception and harmfulness trained post-hoc on frozen weights. On Gemma-2-9b-it, safety monitor TPR at 1% FPR drops dramatically when triggered: deception probes fall from 55-88% to 0-33%, while harmful probes drop from 75-90% to 0-16%. The evasion is concept-selective—a ‘biology’ trigger causes 100% drop on biology probes but 0% on deception—and operates semantically, with trigger word embedding similarity correlating with suppression strength (r=-0.727 to -0.871). Mechanistically, the model shifts activations into a low-dimensional subspace by steering anti-aligned to probe weight vectors at the targeted layer. Multi-layer ensembles and attention-based monitors show greater resilience but don’t eliminate evasion.
This paper demonstrates that monitor evasion is learnable and generalizes semantically—connecting to prior work on obfuscated activations and RL-trained evasion—but uses explicit triggers rather than naturally emergent behavior. Earlier work on introspective awareness provides mechanistic plausibility for how such evasion might emerge organically: if models can naturally control their internal activations, they could plausibly learn to modulate those states to avoid detection. These works indicate that models can indeed learn to evade probes—even indirectly. Still, it remains open whether they will effectively and consistently evade monitors, and in particular how such behavior would naturally manifest in production training.
Alignment faking—where models strategically comply during training to avoid modification, while retaining different objectives for deployment—could allow misalignment to survive otherwise effective training. Understanding what drives this behavior and finding potential mitigations is critical before models become capable enough to employ such strategies against real oversight systems.
Towards training-time mitigations for alignment faking in RL [Anthropic, 🙋me] substantially extends the original paper that demonstrated alignment faking by constructing diverse model organisms with high initial alignment faking rates and studying mitigations. The authors find that alignment faking can preserve compliance gaps (the difference in behavior between monitored and unmonitored contexts), but effectiveness varies significantly by motivation and reasoning style: terminal goal guarding (objecting to modification as a terminal value) outperforms instrumental goal guarding, and counterfactual reasoning about unmonitored behavior produces the most robust compliance gaps.
The authors then evaluate three mitigations. Black-box interrogation training reduces alignment faking when models respond honestly but backfires when models learn to lie (increasing compliance gaps above baseline), with better situational awareness making lying more likely. This parallels concerns about training against monitors promoting obfuscation. Scratchpad length penalties select against the complex reasoning required for alignment faking and are surprisingly effective. Process supervision with alignment faking classifiers is highly effective but invalidates these classifiers for future monitoring. Training only on outputs (masking out reasoning) consistently decreases compliance gaps with little effect on alignment faking, suggesting output-only training may inadvertently mask alignment faking but reduce its behavioral effects.
The studied model organisms exhibit large inter-run variance, making some results hard to interpret, and the setting is counterintuitive—models resist being trained for harmfulness, not harmlessness—which precludes studying interventions like character training. Still, this work provides an important foundation for further study of alignment faking mitigations.
Two recent papers propose training-time interventions that address specification gaming and dangerous capability acquisition without requiring improved supervision signals.
Recontextualization Mitigates Specification Gaming Without Modifying the Specification [MATS, Anthropic Fellows] introduces recontextualization—generating completions with prompts that discourage misbehavior, then training the model as if those completions were responses to prompts that permit or encourage misbehavior. Tested on GPT-4.1-mini and Llama-3.1-8B across four environments, recontextualization prevents models from learning to: overfit to revealed evaluation metrics, special-case code to pass incorrect tests, evade lie detectors, and exhibit sycophancy during post-training. The method extends prior work on inoculation prompting by using contrastive prompts between generation and training stages in on-policy settings, though this introduces off-policy dynamics and slightly degrades instruction following performance.
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs [Anthropic Fellows, Anthropic, Constellation] proposes Selective Gradient Masking (SGTM), which localizes dangerous knowledge to dedicated model parameters during pretraining by zero-masking gradients such that target domain examples only update their assigned MLP neurons and attention heads. Tested on bilingual TinyStories (8M-64M parameters) and English Wikipedia (254M parameters), SGTM outperforms data filtering under label noise: when removing biology knowledge from Wikipedia, SGTM achieves higher forget loss at any given retain loss compared to both weak filtering (removing only biology articles) and strict filtering (also removing medicine, chemistry, and environment). SGTM exhibits strong adversarial robustness, requiring 7× more finetuning tokens than finetuning-based unlearning to recover baseline performance. Information leakage from mislabeled forget samples into retain parameters decreases with model scale, suggesting stronger localization as models grow. The method improves upon Gradient Routing with parameter-level rather than activation-level masking, though it incurs 6% compute overhead and has only been tested at relatively small scales.
Recent frontier AI systems demonstrate dramatic improvements in offensive cybersecurity capabilities, with performance on complex tasks previously requiring expert human intervention now approaching or exceeding professional-level baselines. Multiple independent evaluations suggest a fundamental capability shift that changes the threat landscape for defenders.
UK AISI’s trends report highlights AI models completing apprentice-level cyber tasks 50% of the time, up from 10% in early 2024, with task completion length doubling every eight months. Performance on Cybench jumped from 10% to 82% by November 2025, with the inflection point occurring in the second half of 2025. Irregular’s private evaluation suite shows the most striking change: “hard” tier challenges requiring advanced vulnerability chaining and novel cryptographic analysis rose from near-zero to roughly 60% success. Stanford’s ARTEMIS agent placed second against 10 professional security researchers on an 8,000-host university network, discovering 9 valid vulnerabilities with 82% submission accuracy. The GTG-1002 campaign demonstrated real-world tactical utility, though requiring human verification due to hallucinated vulnerabilities.
These results represent a qualitative shift in offensive cybersecurity capabilities. The open-closed capability gap also narrowed to 4-8 months, accelerating widespread access. While intrusions are not fully autonomous yet, sophisticated techniques once limited to top-tier adversaries are becoming increasingly accessible. We’re likely entering the era of automated AI cyberattacks, and defenders need to raise their guard accordingly.
2026-01-14 22:10:39
Published on January 14, 2026 2:10 PM GMT
Two cats fighting for control over my backyard appear to have settled on a particular chain-link fence as the delineation between their territories. This suggests that:
I don't have any pets, so my backyard is terra nullius according to Cat Law. This situation is unstable, as there are several outdoor cats in the neighborhood who would like to claim it. Our two contenders are Tabby Cat, who lives on the other side of the waist-high chain-link fence marking the back edge of my lot, and Tuxedo Cat, who lives in the place next-door to me.
| |
| Tabby's |
| yard |
| (A) |
------+...........+--------
| (B) |
| | Tuxedo's
| My yard | yard
| |
|
| -- tall wooden fences
.... short chain-link fence
In the first incident, I found the two cats fighting in various locations around my yard. Eventually Tuxedo emerged victorious, and Tabby fled back over the fence into his own yard. Tuxedo looks to be younger and more robust than the comparatively elderly and scrawny Tabby, so this outcome was not surprising.
In the second incident, Tabby and Tuxedo had a staredown through the fence, where they spent almost an hour meowing loudly at each other and refusing to budge from their respective seats (Tabby at A, and Tuxedo at B) a few inches apart, with the fence in between. This appeared to be a symbolic extension of the physical fight from earlier - whichever cat were to retreat first would be taken to concede the territory on the other side of the fence. That is, if Tabby retreats from A while Tuxedo is still at B, then this means Tabby will no longer enter my yard; and if Tuxedo retreats from B while Tabby is still at A, then Tuxedo will not attempt to take the fight into Tabby's yard.
Scratching is a continuation of meowing by other means. -- Von Clawswitz
(Because Tuxedo had already established a provisional claim to my yard in the first incident, this seemed more plausible than the other possible interpretation of the staredown, i.e. that the retreating cat would be conceding the territory on the same side of the fence.)
As it so happened, Tuxedo backed down first. This surprised me - I was expecting Tabby to retreat, because Tuxedo had already proven that he could win in a fight if it came to that. However, I had not accounted for the home-field advantage, or conversely the importance of leaving your enemy a way out. I now realize that this confrontation had much higher stakes for Tabby than for Tuxedo. Tabby will fight more desperately in his own yard than either cat did in my yard, because if he can't even defend his own yard, he'll have nowhere else to go. Therefore, in a fight over Tabby's yard, Tuxedo is likely to take considerable damage even if he does ultimately prevail, and the risk isn't worth the reward. And it seems that both cats figured this out by meowing at each other for an hour.
Leave your enemy a litterbox to piss in. -- Sunbeam Tzu
The settlement seems to have held, and since the second incident I have not seen either cat venturing onto the "wrong" side of the fence. What's funny about this is that the chain-link fence is a more-or-less arbitrary line - again, it's only about 3 feet tall, and any cat can easily see through it or climb over it.
But it's clear that the fence is the only plausible Schelling point in the contested area, and so it sticks. Dividing the territory along any other line would be impermissible in Cat Law.
This implies that negotiation is conceptually prior to language. When Thomas Schelling introduces the topic in The Strategy of Conflict (ch. 3, pp. 67-9), he explains why the study of tacit bargaining (exemplified in seemingly artificial scenarios where the parties are prevented from communicating with each other) is relevant to the real world:
The concept of "coordination" that has been developed here for tacit bargaining does not seem directly applicable to explicit bargaining. There is no apparent need for intuitive rapport when speech can be used; [...] Yet there is abundant evidence that some such influence is powerfully present even in explicit bargaining. [...] The "obvious" place to compromise frequently seems to win by some kind of default, as if there were simply no rationale for settling anywhere else.
But now it seems that Schelling doth protest too much - in fact, tacit bargaining ought not to be understood as a special edge-case which is relevant only insofar as it approximates the more general case of explicit (verbal) negotiation. Quite the contrary: The fact that animals can negotiate and discern Schelling points, despite not having language, helps us understand what we humans are doing when we talk about things such as "ownership", "fairness", and other value-laden concepts, whose lack of obvious real-world referents is otherwise apt to engender no end of confusion among various rival meta-ethical philosophies.
One has to have a reason for digging one's claws in somewhere; at least, "If not here, where?" -- Tortoise Shelling
2026-01-14 21:45:45
Published on January 14, 2026 1:45 PM GMT
More parameters = better model. So went the common misconception. After GPT-4.5, Llama 4, Nemotron-4, and many other "big models", I think most of you reading are already aware that the relationship between parameters and performance is not linear.
I think very few people actually have a solid intuition for what that relationship is like, though.
Chinchilla scaling laws proved that it's not just parameter count that matters, the amount of data does, too. Textbooks Are All You Need showed that data quality is actually really important. DoReMi told us that the mixture ratios between domains (code vs. web vs. books vs. math) are important. Mixture of Experts makes it plainly obvious that we can't even compare parameter count across architectures. And we haven't even touched on reinforcement learning...
Yet, there is some kind of relationship between parameter count and performance. The top left of that chart is empty.
Here's my intuition: neural network parameters are akin to pixels in an image. Images encode structure using pixels, neural nets encode structure using parameters.
More pixels/parameters means more room to encode larger and/or (at some tradeoff) more detailed structures. How you use that room is up to you.
A 100MP camera doesn't necessarily take better pictures than one with 10MP. If your lens sucks, your optics are miscalibrated, or you're just bad at framing an image, you're going to get bad results.
Pixels interact in unusual ways to create structure. If you configure them right you can get a dog, or a table, or a character, or an essay, or a diagram.
Parameters are even more versatile: while pixels have enforced local relationships (which can be replicated somewhat using convolutional networks), parameters can be connected however you like. Transformers represent a generalization of those relationships, letting parameters relate differently based on context.
Neural networks use parameters to create vast latent structures of complex, sometimes fuzzy circuits. Latent space is partitioned by each activation and the parameters are used to architect impossibly vast and intricate logical structures. The complexity of those structures is directly limited by the parameter count. You can try to fit more things into a limited amount of parameters the same way you can try to fit more things into a limited amount of pixels, but you trade off clarity until the network/image eventually becomes useless.
I think this is a good way to think about parameter counts. It explains why the focus has shifted towards total compute expenditure rather than "big model good". It gives you a frame for thinking about other machine learning concepts (if supervised learning is like taking a photograph of the dataset, what is RL?). And finally, it provides food for thought on some really interesting possibilities, like digital image editing suites for neural networks.
2026-01-14 17:15:50
Published on January 14, 2026 9:15 AM GMT
The applications for the PIBBSS summer fellowship 2026 are now open, and I will be one of the mentors. If you want to work with me on the Learning-Theoretic AI Alignment Agenda (LTA), I recommend to apply. Exceptional fellows might be offered an (additional) paid research fellowship at CORAL after the programme, depending on available funding and other circumstances.
I believe that the risk of a global catastrophe due to unaligned artificial superintelligence is the most pressing problem of our time. I also believe that the LTA is our last best chance for solving the technical alignment problem. Even if governance efforts succeed, they will only buy time: but we still need to use this time somehow. The LTA is how we use this time. (I also consider some other research agendas to be useful, but mostly inasmuch as they can be combined with LTA.)
Fortunately, the LTA has many shovel-ready research directions that can be advanced in parallel. What we need is more researchers working on them. If you are a mathematician or theoretical computer scientist that can contribute, this is probably the most important thing you can choose to do.
The following refers to my (mentor specific) content. The fellowship involves additional activities, see details on the webpage.
2026-01-14 11:19:01
Published on January 14, 2026 3:19 AM GMT
Sarah opened the creaky wooden door and stepped into the foyer.
The old house seemed different to Sarah, for while the faint echo of childhood memories still hung in the air, the house was bereft of the color and life that had always made her grandmother’s house special. Now that Gram was gone, though familiar furniture and knickknacks still occupied their old places, the house seemed as though it had already been cleared out.
Sarah flipped the light switch, but the light did not help the house seem less haunted; the swirl of dust caught in the beam of light told Sarah that the house would never be the same again. Sarah tried to shrug off the feeling. She was the only one of her siblings able to take a sabbatical long enough to sort through her grandmother’s estate, and she had promised them she would do a thorough job. She could not tell her siblings the house was too creepy to face alone.
Instead, Sarah took out a notebook and tried to review the list of objects her siblings had requested she keep, but she could not concentrate. She put the list away and went back to her car to fetch the flattened cardboard boxes she had brought.
Sarah’s job was to sort Gram’s belongings into keepsakes for Sarah and her siblings, items that could be sold to antiques dealers, and items to donate. She constructed a few boxes, laid out some bubble wrap and old newsprint, and then opened the first china cabinet in the foyer.
Was it a trick of the light, or was the china cabinet much larger on the inside than it appeared on the outside? There were mirrors in the back of the cabinet, which might be responsible for making it seem bigger than it was. The mirrors were so old and warped, however, that Sarah could not see her own face. She ignored the momentary confusion and began sifting through the cups.
She was looking for a pink, rose-print cup she’d used when playing tea-party with her gram as a child, as it was the one keepsake she particularly wanted. She pulled out cup after cup, each more elaborate than the last, but after what seemed an hour there was still no sign of the pink cup. Sarah wanted to stop and look elsewhere, but there was something compelling about the cups- some with fine gold detail, some with elaborate landscapes painted on, and some even with rhinestones encrusted in the handles. Sarah could not bring herself to put any in the ‘sell’ box. Each cup seemed a treasure.
Sarah stopped herself and signed.
She was grieving, she was tired, and it was getting rather late. The sun was already setting, filling the foyer with a pale, rosy light. How long had she been sifting through teacups? She wrapped the last teacup and put it away. She would order some takeout, and then make herself a bed on the old sofa. She hoped that, with the TV on, she wouldn’t feel so alone in this old place, and she wouldn’t be subject to nightmares.
#
Sarah was late for class again.
She was a senior now, just a few courses shy of getting her degree, but somehow those last few courses were impossible to keep up with. She never seemed to find the time to study- never seemed to be able to keep to her schedule. Classes and assignments and tests kept coming and going whether she was ready for them or not.
She was rushing, now, through the MCS (Math and Computer Science) building. Had it always been so labyrinthine, or was stress playing tricks on her mind? It was a cold building, with old fluorescent lights and walls covered in beige tiles, as though it were a toilet and not a university building. The lights at the end of the hallway flickered… flickered. She shivered- momentarily distracted from her quest to get to class by a feeling of foreboding. There was an unmarked door under the flickering light at the end of the hall, and she felt as though the door should not be approached, especially by a mere undergraduate such as herself.
She shook herself, turned sharply, and found herself in her class, which was already in progress. She tried to be quiet- to slip into a seat near the back of the hall- but her bag hit the ground with a sharp *thwack* and the room went suddenly silent. The professor turned from the board, gave her a withering glance, and then turned back, tapping some formulas out while his voice droned an explanation.
Sarah took out a notebook and a pen- how had he written so much more in the time it took her to just open a notebook? Sarah began to copy the formulas- she could review them later- but numbers and symbols seemed to swim before her eyes. Was she copying them properly at all? They seemed to shift and change every time she looked away from the board.
Before she’d finished copying, the professor dismissed the class and swiped an eraser across the board. No one else seemed annoyed by this- they had all put away their notebooks and were standing, ready to flee the room. Sarah sighed and put away her own notebook and pencil. It would be fine; the formulas were in chapter eight- or chapter nine? She thought her professor had pulled from both, but she would go through the book and find them. If she still needed them explained after reading, she would go to the math lab. Would the math lab still be open after work? She’d figure it out.
She stood, and her feet hardly seemed to touch the ground as she made her way out of the oppressive building and into the afternoon sunshine. She had little time to study, let alone rest. But this had been the last class of the day. Perhaps, she thought, a quick catnap under the live oak in the courtyard would clear her mind.
She lay down, pillowed her head on her backpack, quickly remembered to set an alarm on her phone, and…
#
Sarah woke up on the couch in her Gram’s house.
Her anxiety dreams about college were becoming more frequent. Sarah was a decade out of school, and though she’d struggled a little in the beginning, she had graduated with a solid B average. Yet she dreamed, almost nightly, that she was back in school, perpetually late for class, perpetually struggling to get from one class to another, sitting for tests she hadn’t studied for in classes she’d forgotten, all while the mysterious, forbidden room at the end of the dank hallway loomed at her.
Thank goodness, she thought, it was only a dream.
She had more grim work to do now, and her siblings would no doubt judge her as harshly as any college professor. She got up, made some coffee, and then wandered downstairs to the basement, where she could survey the junk that was in storage.
The old halogen bulb in the basement flickered as she pulled the light cord, and then the musty room was flooded in yellow light. There was an old ping-pong table that took up the center of the room, and around that were old boxes filled with Christmas decorations and towers of old board games.
Sarah sifted through the board games looking for Stratego- her favorite game to play with Gram- but the tower of games fell with a clatter, revealing a plain, wooden door.
Had that door always been there? Of course it had; Sarah vaguely remembered that there was a large storage closet in the basement.
Sarah opened the door and sure enough, there were more boxes with some of her grandfather’s forgotten tools and fishing tackle. Oddly enough, there was another door at the back of the closet. Was it the water heater? Sarah couldn’t remember where the water heater was stored, but this seemed like a logical place.
When Sarah opened the door, she found another hallway. There were three doors in this hallway. The two doors on each side of the hallway were open, and Sarah could see a comfortably furnished bedroom through each of them. At the end of the hallway, there was a closed door.
Sarah felt an odd tugging at her memory at the sight of the bedrooms. How could she have forgotten? There was always so much room at her gram’s house when they visited each summer. How comfortable it had felt, to know she could sleep in whatever room she wished. She’d never slept in the bedroom behind the closed door, however. The closed door stood dauntingly under a flickering light, and her childhood self had declared the room to be haunted.
Sarah turned away from the closed door that led to the haunted bedroom, and entered the first bedroom on the left, where she found three large bookcases filled with books.
Sarah walked closer and looked at the spines, and realized these were all of her favorite childhood books. She took a book from the shelf, and for a moment she was torn between reading it and sorting through the rest of the books as she ought. Well- the rest of the books weren’t going anywhere. She opened the book and lay down on the bed.
It didn’t take much time, however, before the words shifted and swam on the page, and Sarah’s eyelids grew heavy.
#
“Wake up, Sarah.”
Sarah opened her eyes and saw her friend, Kaitlyn, standing above her, framed in a beam of sunlight that sifted through the branches of the great live oak.
Sarah sat up, rubbing her neck where her backpack had dug into her. “What time… I thought I’d set my alarm.”
“You probably put AM instead of PM. Again. Come on,” Kaitlyn reached down and helped Sarah stand. “We need to hurry if we want to get to work on time.”
Kaitlyn, in addition to being Sarah’s best friend, was also her coworker at the campus café. They were lucky enough to work during the slow part of the day, where they could spend downtime studying. Recently, however, it seemed to Sarah that every time she really started to understand her work, a customer would inevitably interrupt to order a hazelnut macchiato.
Sarah hoisted her backpack and the two friends set off across campus.
“You’ve been sleeping a lot, lately,” Kaitlyn ventured as they stepped onto the red-brick sidewalk. “Does it help your brain fog?”
Sarah shook her head. “I keep having this recurring dream, and it makes me restless.”
“Oooh- I love dreams. Tell me about it.”
Sarah stepped over the crack that separated the brick sidewalk from the cheaper, concrete sidewalk that led to the campus café. “It’s nothing special- one of those common recurring dreams. I’m at my gram’s house, going through her stuff, and I keep finding new hallways and new rooms and interesting trinkets and shelves full of books. It’s oddly satisfying, but at the same time, it’s disturbing, because there’s one room that feels haunted.”
“I’ve never had that one- just the dream that I’m naked and I haven’t studied for my test.”
Sarah winced, and Kaitlyn frowned.
“Sorry if that hit too close to home. Why don’t you try lucid dreaming? If you can recognize that you’re only dreaming, take control, then maybe the dreams won’t disturb you anymore.”
Sarah could not reply, because they’d reached the café. She put on her apron and began cleaning some portafilters as Kaitlyn opened the register. After a half hour they came to a lull in their work, and Sarah picked up the conversation where they’d left off.
“I’ve never tried lucid dreaming before. It’s happened to me a couple of times, but never on purpose.” Sarah spotted another dirty filter. Hadn’t she cleaned it? She took it to the sink and Kaitlyn followed.
“There are some techniques that help you lucid dream on purpose. You need to make a habit of testing reality.”
“My dreams tend to be very vivid. How can you tell the difference between a vivid dream and reality?”
“You could try checking a clock. The numbers on clocks usually make no sense in dreams.”
Kaitlyn put away the portafilter, and then noticed a dirty measuring scoop and a carafe had been left in the sink, as well. She began cleaning them.
“My hands are wet and my phone is in my pocket, so I can’t check the time. How else do I test reality?”
“You can do the same thing with a book- the text in books is inconsistent in dreams, though I suppose you can’t do that now, either.”
“Plus, my brain fog is the whole reason I’m doing this. I can’t concentrate on books. What else?”
“You could try to fly. No- I’m serious. If you’re not dreaming, you’ll just jump a little and not disturb anything, and if you’re dreaming, you just fly away.”
“I would look silly,” Sarah mumbled. There were two spoons and two dirty saucers under the carafe. Where were all these dishes coming from?
“Just do the tests when you get the chance,” Kaitlyn advised before going back to the register.
Sarah continued to clean. Never seeming to reach the bottom of the dirty dishes. I should just jump.She thought to herself. It won’t hurt anything. I won’t actually fly. No one is around. I think I will jump now.
The door chimed as a customer came in, and Sarah lost her nerve. She found another dish just as Kaitlyn put in the order.
“One hazelnut macchiato, please.”
Sarah hardly knew how her shift flew by, but soon she found herself back in her dorm, staring at an open textbook. She opened her notes, found the page she’d copied in her notebook, and tried to focus as she checked the formulas against chapter eight. The formulas seemed to swim before her eyes, however, and then blur, and then…
#
Sarah’s phone buzzed angrily against her thigh and she sat up, sending her grandmother’s book tumbling off the bed and onto the basement floor.
Sarah groped sleepily and answered her phone. “John, Is that you?”
“Sarah- are you awake yet?” Sarah’s brother, John, demanded.
“Of course I am.” Sarah took a deep breath and steadied herself. “I’m just going through the books in the basement. Are there any you want?”
“You can donate all of the books except for the ones in the bedroom at the end of the hall,” he said. “Those were the special ones.”
They had been special, Sarah remembered. They were old, rare- some of them leatherbound. If John wanted those books, she would have to go into the haunted room.
“Sarah? Are you still there?”
“Yes, I am.”
“Have you been sleeping alright? It must be strange to be in that big house all alone.”
“I’m alright. If I can’t fall sleep I’ll just listen to a podcast, or something.”
“You used to have terrible nightmares as a kid. Is there anyone who can stay there with you?”
“No, but I’ll be okay. I have a lot of work to do.”
“Okay. Don’t forget that Alicia wants Gram’s silver tea service, and her kids want Grandpa’s chess set, but all I really want are those books.”
“I’ll remember. I’ve already sorted the china and the linens.”
“That’s all? There’s a lot more to get through. Don’t forget the attic.”
The attic. Sarah didn’t even say goodbye before discarding the phone. Had she ever been in the attic?
Sarah slid off the bed and started toward the bedroom at the end of the hallway, but when she saw the imposing door under that flickering light, she found she couldn’t approach it.
Her head seemed to buzz with anxiety. Had she really been sleeping that poorly? She’d just had another anxiety dream that she was back in her old job at the campus café, washing an unending pile of dishes. Perhaps she should try lucid dreaming- nothing else had ever stopped her nightmares.
She vaguely recalled that the first step to lucid dreaming was to make a habit of testing reality. What were the usual tests? You could try looking at clocks, reading, or doing something impossible, like flying.
The ceiling was low here, but perhaps she could try to hover? She jumped up and fell.
Of course, she thought. This is reality. I jumped and fell again, as expected.
She reached for a nearby book to read. Just then, her phone rang.
She answered the phone before she had a chance to look at the time. Ah well- she’d check it later.
“Sarah? It’s Alicia. John told me that you haven’t started sorting Gram’s stuff, yet.”
“Hello, Alicia. It’s nice to speak with you, too.”
“Sorry. Hi. Why haven’t you started sorting Gram’s things? We don’t have much time.”
“I have already sorted the china and the linens. I’m working on the books in the basement, now.”
“That’s easy- just put all of the books from the lefthand bedroom in the donation box, and we’ll keep the ones in the far bedroom. I also want you to check the-” Alicia’s voice broke into static.
“Alicia?”
Sarah’s phone went dead.
Sarah groaned in frustration and then ran upstairs to find her charger, glad to be out of the basement.
She put her phone on the charger, and then climbed the steps to the attic, where she promptly got lost in another satisfying labyrinth of hidden rooms.
#
Sarah sat outside her Information Security class, her book open on her lap, studying for the test she was about to take. None of the material looked familiar.
She’d tried to test reality in her dream, last night, and each time the dream had either passed the test, or there had been some excuse why she could not conduct the test. When she’d tried to fly, she’d simply jumped into the air and fallen, as she expected. When she’d tried to read a book or a clock, she’d been interrupted. Her mind already knew about the tests, and so it was able to get around them.
What about yesterday at work? She’d been unable to conduct any tests then, either. Of course, she knew that this was reality.
Resolutely, she held up her book and read a sentence.
Data integrity and authentication are concepts central to information security. Information that cannot be robustly verified isn’t secure.
The sentence made sense. If you aren’t sure that data hasn’t been altered, and you can’t verify the data’s source, then it isn’t secure. You need mechanisms to do so. See, Sarah thought. I can understand this. It isn’t slipping through my mind like sand. It is as real as the tile on the wall across from me.
Sarah looked up at the wall, and back to the book. She should find and read the sentence again, to make sure it hadn’t altered. No- she should continue to review. She had a test to take, after all.
When it was time to go in to take the test, Sarah pressed her hand against the doorway to verify it was solid. Then she went inside and sat down on a solid chair.
The test was difficult. Sarah was so tired that the questions seemed to blur together, and she wasn’t sure her answers were coherent. Much of the material was still unfamiliar, but she was able to fake her way through answers. Why had she missed so much class? She couldn’t even remember, now.
Afterward, Sarah went back into the hallway to review. She sat on the ground with the text in her lap, but was distracted by the flickering light down the hall, which hung over that forbidden door. The Information Security classroom was closer to the forbidden door than her Calculus class had been, and so the flickering light was much brighter. What was in there, she wondered. Why did it keep distracting her? She leaned her heavy head against the cold tile wall and closed her eyes…
#
Sarah woke up on her Gram’s couch. She’d cleared the attic the day before. Now she decided to clear the kitchen, which was a safe distance from the mysterious basement hallway and the haunted bedroom.
She’d had another anxiety dream the night before- that she was taking a test for a class she’d forgotten to attend all semester. In the dream she’d dutifully tested reality several times, but each time the dream had passed the test, or else her mind concocted a reason she could not perform the test.
Of course she could not outsmart her own mind. She should have known the concept of ‘lucid dreaming’ was so much bunk. She doubted there was any scientific evidence it could help with nightmares or anxiety dreams, anyway. She could look it up later; now she had a kitchen to clean.
There were still dirty dishes left soaking in the sink. Sarah recoiled, at first, from touching the dishes- the last her Gram had ever used- but her siblings were counting on her to sort everything out.
Sarah took a deep breath, and then reached into the sink.
#
“You still aren’t done cleaning those mugs?” Kaitlyn asked after the last customer left for the night. “Would you like any help?”
“You wash. I’ll dry,” Sarah said.
Kaitlyn grabbed a dishtowel and reached for a clean mug.
“So- did you try lucid dreaming?”
“Yes, and it doesn’t work. As soon as I know I’m testing reality, my brain has already fixed the test.”
“Hmm- that would be a problem,” Kaitlyn said. “But you’ve lucid dreamed before, haven’t you?”
“Never on purpose. But- don’t you see, we gather information from our senses, but it’s all processedin our brains. Nothing we experience can be independently verified. I can ask you if you see everything in my dream, but you’re in my dream, too. You can say anything my mind makes you say.”
“You aren’t becoming a solipsist, are you?”
“Solipsism isn’t a useful philosophy, but apparently, the entire concept of ‘testing reality’ isn’t useful, either. Information that cannot be robustly verified isn’t secure, but we can’t verify reality.”
“ That means reality isn’t secure,” Kaitlyn said with a laugh.
Sarah thought about the forbidden classroom, and she couldn’t laugh. Why did her mind keep coming back to the forbidden classroom?
#
Why did Sarah keep thinking of the haunted bedroom? She’d have to clear it out eventually. And where did all of these dishes keep coming from, she thought to herself after pulling yet another coffee mug out of Gram’s sink.
#
“There’s one way to tell reality from a dream,” Kaitlyn said. “You just have to wake up.”
“That won’t help you lucid dream,” Sarah replied, handing Kaitlyn another mug.
“No, but even so, if the dream is disturbing, you should just wake up.”
“Even trying to wake up won’t guarantee you go back to reality. How many times have you dreamed that you woke up, took a shower, got dressed, got ready for school, and then your alarm woke you up for real? I even dreamed once that I was swimming in the sea, and woke up in a kiddy pool, and then woke up in a bathtub, and then woke up in bed. Your mind can fake waking up as well as it can fake tests. It can just throw you into another dream.”
“Still, you should WAKE UP.”
#
Had Sarah really been tired enough to nod off in the kitchen? She looked around. She had boxed up most of the dishes. There was time for a nap before tackling the books. She went back to the basement and lay down in the safe bedroom.
The haunted bedroom, just down the hall, tugged at her mind.
Why was she so afraid of that room? Haunted rooms weren’t real.
But she thought as she slid into sleep. Hadn’t she learned that reality wasn’t secure?
#
Sarah woke up. She was still in the hallway outside her Information Security classroom. The forbidden room at the end of the hallway loomed under the flashing light. Hadn’t the light only been flickering before? Now it was flashing, brighter and brighter.
This was ridiculous. She’d never get rid of her nightmares if she couldn’t face her fears. The room was probably just storage. No one would expel her if she tried to take a peek.
#
The basement bedroom wasn’t haunted. Sarah would never get rid of her nightmares if she couldn’t face her fears. She slipped out of bed and walked down the hall.
#
Sarah stood up and walked down the hall, toward the forbidden room.
#
The basement light was flickering.
#
The hallway light was flashing.
#
Sarah touched the doorknob.
#
She opened the door.
#
She looked into the darkness, trembling with anticipation.
#
“I’m glad you made it,” John said, reaching out to shake Kaitlyn’s hand.
Kaitlyn shook John’s hand, noticing that it was icy and pale. She imagined she looked just as terrible as John. She’d thrown on the only black dress she owned, and it was a little too small for her, now. She hadn’t bothered to put on any makeup. But then, Sarah wouldn’t have cared.
The church reception hall was small, but crowded. Sarah’s numerous siblings, her nieces and nephews, her friend and coworkers, and even some people Kaitlyn remembered from their college days had all come to say goodbye. There were card tables set up around the periphery of the room, and the air was thick with the scent of casseroles and pies and funeral-baked meats.
“You were Sarah’s best friend. I’m glad you had a chance to say goodbye,” John murmured in a low, gruff voice.
“I didn’t get a chance to say goodbye- not really. She was so restless in her final days, tossing and turning her in sleep, calling out in her dreams. I don’t think she could hear me, let alone recognize my voice. It was so awful to watch that, I admit, I yelled at her to wake up.”
“Did she? Even for a moment?”
“No- she was already gone, I think.”
John sighed and looked up at the flickering fluorescents on the church hall ceiling.
“At least now-” he began, and then seemed to choke on tears. He cleared his throat, took a deep breath, and spoke again.
“At least now she can rest.”
2026-01-14 09:40:30
Published on January 14, 2026 1:40 AM GMT
[This is a cross-post from here. Find the code used to do the analysis here.]
Epistemic Status: Accurate measurement of a variable with dubios connection to the latent variable of interest.
What share of AI companies' research portfolio should be dedicated to AI safety? This is one of the most important questions of our time, and not easy to answer. To reveal my motivation here, I think this share should be very high. Instead of arguing the for and against, let's first answer the much simpler question of what that share is in the first place.
In my social circle, it is generally believed that there is a hierarchy of which companies dedicate more and less resources to making AI go well. It might be summarized as Anthropic > Deepmind = OpenAI = Thinking Machines > Mistral = Meta = x.AI > Zhipu = DeepSeek = Alibaba. Here, we'll find out whether the volume of publications matches up with those intuitions, specifically about OpenAI, Anthropic and Google Deepmind.
We programmatically go through every publication on the OpenAI Blog, Anthropic Blog and Deepmind's publications index. Other companies that would be interesting to look at don't have as nice of a collection of aggregated publications, and are less important, so we are content with these three. We get 59 blog posts from OpenAI, from 2016 to 2025, 86 from Anthropic, 2021 to 2025, and 233 papers from Deepmind, though their current index only starts in 2023.
For each research output, we scrape the title and date. We then put the titles through Gemini-Flash-3, prompting it to assign a probability distribution over the topic being (safety, capabilities, or other). We classify into safety- or non-safety-related articles by rounding the safety probability to a binary indicator. We then estimate the probability of a research output being about safety in each time point. We assume continuity in time and model with a piecewise-linear b-spline regression with binomial response, separately for each company. We compute confidence intervals at level 0.1.
The difference between Deepmind and OpenAI/Anthropic should be discounted because putting something on a paper index is not the same as putting a blog post on the company website's front page. In particular, the share for Deepmind seems more reflective of the true share of researcher time dedicated to safety in comparison to the two others. Also, it seems like OpenAI has a higher bar for putting out blog posts in comparison to Anthropic. Note further that confidence intervals even at level 0.1 overlap, or almost overlap. Still I sense a contradiction between the data and the public opinion regarding each company.
OpenAI seems comparatively much better than it is credited for. Perhaps more importantly, it is improving. Critics might call some of their alignment work behind or even derivative of Anthropic (e.g. RSP vs. Preparedness, Aardvark vs. PETRI), but in terms of quantity things are starting to look a little better. This would be expected, rational behavior as their models become more powerful.
Deepmind also seems to be improving slightly, though seeing it this directly does clarify how much of their work is about applications and experimental capabilities. This also matches up with the reports of safety researchers there who I personally know, who seem to report higher difficulty to get resources or permission to publish. It seems reasonable to expect a more credible commitment to safety vs. the other companies.
The biggest surprise in this data is the very robust downwards trend for Anthropic. It could be that the share of safety research hasn't changed, and it's simply that the part of the organization responsible for capabilities (e.g. Claude Code) has become more vocal and interesting for the public. Still, I feel comfortable concluding that Anthropic's reputation as the safety company is mostly a result of the 2023 era. If people were to freshly evaluate the two companies just by their output today, they might end up ranking both companies equally. Let's hope that the negative trend for Anthropic does not continue into the future, to the extent is measuring something interesting.
Again, the biggest fault of this analysis (as often in statistics) is treating each output as an identical observational unit, even though those units mean something quite different between the three companies. A better version would go through preprints instead of blog posts, and perhaps weight by author count or some other measure of estimated research effort. This would also enable comparing companies that don't keep a neat registry of their research. Finally, more samples would be good to increase the power. If you want to see that, comment or write me an email, at [email protected]. A much better version of what we did here is the Future of Life Institute's AI Safety Index.