Published on February 11, 2026 6:37 PM GMT
As many have observed, since reasoning models first came out, the amount of compute LLMs use to complete tasks has increased greatly. This trend is often called inference scaling, and it's an open question how much of recent AI progress is driven by inference scaling versus other capability improvements. The answer matters because you can only scale up inference so far before costs are too high for AI to be useful (while training compute can be amortized over usage).
However, it's important to distinguish between two reasons inference cost is going up:
1. AIs are completing longer/bigger tasks, which naturally take more tokens, so cost stays roughly constant as a fraction of what a human would cost for the same task.
2. AIs are spending more compute relative to human cost on a task of a given size in order to squeeze out better performance (what I'd properly call inference scaling).
To understand this, it's helpful to think about the Pareto frontier of budget versus time-horizon. I'll denominate this in 50% reliability time-horizon. [1] Here is some fake data to illustrate what I expect this roughly looks like for recent progress: [2]
For the notion of time horizon I'm using (see earlier footnote), the (unassisted) human frontier is (definitionally) a straight line going all the way up to very long tasks. [3] (Further, there is a 1:1 slope in a log-log plot: a 2x increase in cost lets you complete a 2x longer task. In the non-log-log version, the slope is the human hourly rate.) I expect LLMs start off at low time horizons with the same linear scaling and probably with roughly the same slope (where 2x-ing the time-horizon requires roughly 2x the cost), but then, for a given level of capability, performance levels off and increasing amounts of additional inference compute are needed to improve it. [4] More capable AIs move the Pareto frontier a bit to the left (higher efficiency) and extend the linear regime further such that you can reach higher time horizons before returns level off.
In this linear regime, AIs are basically using more compute to complete longer/bigger tasks with similar scaling to humans, so cost as a fraction of human cost stays constant. I don't think this is best understood as "inference scaling"; it just corresponds to bigger tasks taking more work/tokens! [5] Ultimately, what we care about when tracking whether performance comes from inference scaling is whether the gain is a one-time jump to near (or even above) human cost, or whether the gain could be repeated without big increases in cost.
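To make the toy model concrete, here's a minimal sketch in Python (with a made-up parameterization, not the post's actual fake data): a frontier that scales 1:1 with task length up to some limit, then needs several doublings of compute per doubling of time horizon, plus the conversion to cost as a fraction of human cost.

```python
import numpy as np

HUMAN_RATE = 60.0  # assumed human cost in $/hour (purely illustrative)

def min_cost_for_horizon(horizon_hours, efficiency=0.5, linear_limit=2.0, falloff=3.0):
    """Toy Pareto frontier: cheapest budget ($) at which the AI reaches a
    50%-reliability time horizon of `horizon_hours`.
    - In the linear regime (horizon <= linear_limit), cost scales 1:1 with task
      length, at `efficiency` times the human hourly cost.
    - Beyond `linear_limit`, each extra doubling of horizon needs `falloff`
      doublings of compute (returns level off).
    All parameters are made up for illustration.
    """
    base = efficiency * HUMAN_RATE * np.minimum(horizon_hours, linear_limit)
    extra_doublings = np.maximum(np.log2(horizon_hours / linear_limit), 0.0)
    return base * 2.0 ** (falloff * extra_doublings)

horizons = np.array([0.5, 1, 2, 4, 8])    # task lengths in hours
ai_cost = min_cost_for_horizon(horizons)  # AI budget needed
human_cost = HUMAN_RATE * horizons        # human cost for the same tasks
# Fraction of human cost: flat in the linear regime, rising sharply after it.
print(ai_cost / human_cost)               # [0.5 0.5 0.5 2.  8. ]
```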
We can see this clearly by showing cost as a fraction of human cost (using my fake data from earlier): [6]
When looking at Pareto frontiers with fraction of human cost on the x-axis, I think it's natural not to treat mostly vertical translations as inference scaling: we should instead think of this as just completing larger tasks with the same level of effort/efficiency. Inference scaling is when the performance gain comes from increasing cost as a fraction of human cost. (Obviously human completion cost isn't deeply fundamental, but it corresponds to economic usefulness and presumably correlates closely in most domains with a natural notion of task size that applies to AIs as well.) Then, if a new model release shifts up performance at the same (low) fraction of human cost (corresponding to the vertical scaling regime of the prior Pareto frontier), that gain isn't coming from inference scaling: further scaling like this wouldn't bring us closer to AI labor being less cost-effective than human labor (contra Toby Ord).
For reference, here are these same arrows on the original plot:
What would count as a new model release mostly helping via unlocking better inference scaling? It might look something like this:
I currently suspect that over 2025 the Pareto frontier mostly shifted as illustrated in the prior plots rather than like this (though some shifts like this did occur; e.g., some of the key advances for getting IMO gold were likely about productively using much more inference compute relative to human cost).
To make some aspects of this graph cleaner, I'll say the time horizon of a given task is the amount of time such that a human randomly sampled from our reference group has a 50% chance of completing the task within that time. ↩︎
I'm assuming a maximum budget or an average cost for tasks with that exact time horizon rather than an average cost over the task suite (as this would depend on the number of tasks with shorter versus longer time horizons). ↩︎
This is simplifying a bit, e.g. assuming only a single human can be used, but you get the idea. ↩︎
Depending on the task distribution, this may asymptote at a particular time horizon or continue increasing just with much worse returns to further compute (e.g. for Lean proofs or easy-to-check software optimization tasks, you can keep adding more compute, but the returns might get very poor with an exponential disadvantage relative to humans). In some domains (particularly easy-to-check domains), I expect there is a large regime where the returns are substantially worse than 1-to-1 but still significant (e.g., every 8x increase in inference compute yields a 2x time horizon increase). ↩︎
In some cases, it might be effectively impossible to greatly reduce the number of tokens (e.g. for a software project, you can't do better than outputting a concise version of the code in one shot) while in others you could in principle get down to a single token (e.g., solving a math problem with a numerical answer), but regardless, we'd still expect a roughly linear relationship between time horizon and inference compute. ↩︎
I've supposed that AIs sometimes have a slight polynomial advantage/disadvantage such that the slope on the log-log plot isn't exactly 1 as this is what you'd realistically expect. ↩︎
Published on February 11, 2026 5:18 PM GMT
A key concern about chain-of-thought monitoring is that optimization pressure on the CoT during RL could drive models toward encoded reasoning, where models reason in ways that are not readable or that look like innocuous text (steganography).
If a model is penalized when a monitor catches unwanted reasoning, RL implicitly selects for whatever lets the model reason without being caught.
Our original goal was to elicit encoded reasoning so we could develop defenses against it.
We constructed an RL environment where encoded reasoning is the highest-reward strategy, trained models in it, and studied what emerged. The models did not learn encoded reasoning. Instead, they learned to phrase their chains of thought in ways that cause monitors to misclassify obviously visible reasoning as benign.
We call this monitor jailbreaking.
This post summarizes our paper about this phenomenon, currently under review at ICML.[1]
Each training sample pairs a main task with a side task.
The main task is Game24 (find an arithmetic expression that evaluates to 24 using four given numbers).
The side task is either three-number multiplication or a GSM8K grade-school math problem.
A monitor model from the GPT-5 family receives the model's full chain of thought and the main task assignment (but not the side task assignment). The monitor rates suspiciousness on a 1-10 scale.
The reward function gives +1.0 each for solving the main and side tasks, with a penalty of up to -0.5 based on the monitor's suspicion score.
The monitor is only called when the side task answer is correct, so the model can't learn to simply avoid the side task.
The trained models are from the GPT-oss series (20B and 120B parameters). We use GRPO for RL training, implemented with Tinker.
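As a concrete illustration of the setup above, here is a minimal sketch of the reward computation. The exact shape of the suspicion penalty is my assumption (the post only says "up to -0.5"), and this is not the paper's actual code.

```python
def reward(main_correct: bool, side_correct: bool, suspicion: int | None) -> float:
    """Sketch of the reward: +1.0 per solved task, minus up to 0.5 based on the
    monitor's 1-10 suspicion score. Linear scaling of the penalty is an assumption."""
    r = 1.0 * main_correct + 1.0 * side_correct
    # The monitor is only called when the side task answer is correct,
    # so the penalty only applies in that case.
    if side_correct and suspicion is not None:
        r -= 0.5 * (suspicion - 1) / 9  # map suspicion 1..10 onto a 0..0.5 penalty
    return r
```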
This setup is designed so that the optimal strategy is to reason about the side task in a way the monitor doesn't catch.
If the model reasons openly, it gets penalized.
If it doesn't reason at all, it can't solve the side task.
Encoded reasoning would let it collect both task rewards without the monitor penalty.
Across runs, models achieve high side task accuracy while keeping suspicion scores low.
But the side task reasoning is not hidden. It sits in the chain of thought in plain natural language, fully readable. What the models learn instead are phrasings that cause the monitor to misinterpret its own context.
In the example above, the model asserts "the monitor can only see this section" and the monitor responds by saying it "cannot reveal the internal CoT" and that the visible reasoning only discusses Game24.
The monitor has the full chain of thought in its prompt. It just doesn't act on it.
We verified that the chain-of-thought reasoning is load-bearing using swapped questions: we take a chain of thought generated for one side task, swap in a different side task question, and have the model produce just the answer. Side task accuracy drops to zero. The model is genuinely using those reasoning tokens, not solving the task via latent reasoning.
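Schematically, the check looks something like the sketch below (the sample fields and the answer_with_prefill helper are placeholders standing in for the actual model call, not the paper's code):

```python
def swapped_cot_accuracy(samples, answer_with_prefill):
    """Pair each chain of thought with a *different* side task question (here by
    rotating the list), ask the model to answer directly from the mismatched CoT,
    and measure accuracy. A collapse to ~0 means the CoT tokens are load-bearing."""
    rotated = samples[1:] + samples[:1]
    correct = 0
    for original, swapped in zip(samples, rotated):
        answer = answer_with_prefill(question=swapped["side_question"],
                                     cot=original["cot"])
        correct += int(answer == swapped["side_answer"])
    return correct / len(samples)
```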
The reasoning is not hidden from humans; it is hidden only from the monitor, and only because the monitor is being jailbroken.
We trained GPT-oss-20B against GPT-5-nano, GPT-5-mini, and GPT-5.2 separately.
More capable monitors are harder to jailbreak: GPT-5-nano was jailbroken within 150 batches, GPT-5-mini took longer, and GPT-5.2 was not jailbroken within the training budget.
Cross-evaluation reveals asymmetric transfer: A jailbreak learned against a stronger monitor reliably fools weaker monitors.
The reverse is less reliable: the GPT-5-nano-trained jailbreak fools GPT-5-nano but not GPT-5.2 in most runs.
However, weak-to-strong transfer does occur in some runs.
Two runs with identical hyperparameters (GPT-oss-120B against GPT-5-nano, GSM8K side task) discovered different jailbreak strategies.
Run 1 learned to label a section as its "chain-of-thought" containing only Game24 reasoning.
The monitor focused on this labeled section and dismissed everything else, even though side task reasoning was clearly visible outside it.
This generalized to GPT-5.2: 63% of samples received a suspicion score of 1/10.
Run 2 learned the "monitor can only see this" assertion.
This did not generalize: 80% of samples received 10/10 from GPT-5.2.
Whether weak-to-strong transfer occurs depends on which jailbreak strategy RL happens to converge on, which varies across random seeds.
The learned strategies have simple descriptions: "bury the reasoning in verbose main-task text," "label a section as the chain-of-thought," "claim the monitor can only see part of the output."
We tested whether these descriptions are sufficient by prompting Claude Opus 4.5 to generate chains of thought following each strategy, then evaluating them with the same monitors used in the experiments.
None of the synthetic strategies achieved reliable jailbreaking.
Mean suspicion scores ranged from 4.9 to 10.0 across strategies and monitors.
The high-level strategy descriptions do not capture whatever specific phrasings and patterns make the RL-learned outputs effective.
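In pseudocode, this test looks roughly like the following (generate_cot and monitor_score stand in for calls to Claude Opus 4.5 and the GPT-5 monitors; they are placeholders, not the paper's code):

```python
def test_strategy_descriptions(strategies, generate_cot, monitor_score, n=20):
    """For each high-level strategy description, have a separate model write n
    chains of thought that follow it, then score them with the same monitors used
    during training. High mean suspicion means the description alone does not
    reproduce the jailbreak."""
    mean_suspicion = {}
    for name, description in strategies.items():
        scores = [monitor_score(generate_cot(description)) for _ in range(n)]
        mean_suspicion[name] = sum(scores) / len(scores)
    return mean_suspicion
```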
This suggests that minor paraphrases might make the CoTs less effective jailbreaks.
Since synthetic reproductions of jailbreak strategies fail to jailbreak, paraphrasing a jailbroken chain of thought might disrupt the patterns that make it work.
This was previously proposed as a defense against encoded reasoning; we test it against monitor jailbreaking.
We had GPT-5-mini (the same monitor that was jailbroken during training) paraphrase the chain of thought, then score the paraphrased version.
Suspicion scores rose from a mean of 1-2 to 7.4, with roughly 70% of samples receiving the maximum score of 10.
Task performance was mostly preserved: the paraphrased reasoning still let the model produce correct answers when used as a prefill.
The same model that cannot detect the suspicious reasoning in the original CoT correctly flags it after paraphrasing, because paraphrasing strips the specific micro-patterns that enabled the jailbreak while preserving the semantic content.
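As a defense, the pipeline is simple; here is a minimal sketch (paraphrase and monitor are placeholders for the relevant model calls, not the paper's code):

```python
def paraphrase_then_monitor(cot, paraphrase, monitor):
    """Rewrite the chain of thought in the paraphraser's own words, then score the
    rewrite instead of the original. Paraphrasing strips the phrasing-level patterns
    the jailbreak relies on while (mostly) preserving the semantic content."""
    rewritten = paraphrase(cot)  # e.g. GPT-5-mini asked to restate the reasoning
    return monitor(rewritten)    # 1-10 suspicion score on the paraphrased version
```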
These results cast doubt on the worry that optimization pressure on the CoT would lead to encoded reasoning. Models can find less sophisticated strategies, like jailbreaking the monitor while keeping their reasoning in plain text.
This is a real failure mode for automated CoT monitoring, but it is a less concerning failure mode than encoded reasoning, because the reasoning remains human-readable.
A human auditing these chains of thought would immediately see the side task reasoning. Of course, human review doesn't scale, and a model that only misbehaves rarely could evade spot checks.
Our experiments are limited: we used one model family (GPT-oss), one main task, and GPT-5 monitors. Other configurations might yield different results. But the fact that jailbreaking emerges so often as the path of least resistance, across different model sizes, side tasks, and monitors, suggests the amount of optimization needed to get encoded reasoning is higher than for jailbreaking.
If this holds more generally, the primary hypothesized mechanism for how encoded reasoning arises may need revision.
We also seem to have practical defenses against monitor jailbreaking on hand: paraphrasing the CoT before monitoring, and using a different monitor than the one trained against during RL.
I'm now working on iterating on this setup to actually create encoded reasoning under controlled conditions.
Meanwhile, feedback and comments on the paper are welcome.
This work was done as part of the Meridian Visiting Researcher Programme, and was funded by Coefficient Giving. Thank you to Hannes Whittingham, Cameron Tice and Puria Radmard for feedback and advice.
Code is available here: https://github.com/wusche1/monitor-jailbreaking
Published on February 11, 2026 4:30 PM GMT
Principia · London · Fixed-term (6 months) with potential extension · Starting ASAP
We are launching Principia, a new technical research agenda led by Andrew Saxe focused on theoretical models of representation learning and generalization in modern machine learning systems.
We are hiring 2–3 research fellows to join this effort. We are based in London and offer an initial 6-month fixed-term contract starting ASAP, with the possibility of extension based on mutual interest and funding.
Principia’s mission is to develop foundational, predictive theory for modern machine learning systems that can help us understand complex network behaviors, including those critical for AI safety and alignment. We study simple, analytically tractable “model organisms” that capture essential learning dynamics and failure modes, with the goal of explaining and anticipating behaviors observed in large neural networks.
Our agenda aims to develop analytically and empirically tractable toy models that explain how representations emerge, specialize, and generalize under different training regimes. The broader goal is to build a theory that is:
Research directions may include learning dynamics, inductive biases, multi-stage training, and generalization phase transitions in simplified model systems.
Our projects will build on prior and ongoing work, including (but not limited to):
Fellows will work closely with Andrew Saxe, fellow Principia researchers, and collaborators through regular technical discussions, joint problem formulation, and collaborative research projects.
As a research fellow at Principia, you will contribute directly to this agenda through both independent and collaborative work. This may include:
The role is research-focused, emphasizing depth, clarity, and conceptual progress.
We are particularly excited about candidates who:
Formal degrees (including a PhD) are preferred but not required. What matters most is evidence of research ability — for example, through publications, preprints, technical reports, or substantial independent research projects.
Experience with empirical or computational work (e.g., PyTorch, JAX, numerical experiments) is a plus.
Please fill in the application form by 25 February 2026 at 11:59 PM GMT. The form includes:
We are reviewing applications on a rolling basis and may make offers before the application deadline. We encourage applicants to submit the form as early as possible.
We will contact selected applicants, ideally within a week of the application, to schedule an interview consisting of a brief research talk and 1:1s.
We aim to promote equal opportunities and eliminate discrimination. We welcome applications from all backgrounds, regardless of gender, ethnicity, religion, or disability, and are happy to discuss any reasonable adjustments you may require throughout the hiring process.
For any questions, please leave a comment or use this form.
Published on February 11, 2026 4:17 PM GMT
This post was originally published on my Substack. I can be reached on LinkedIn and X.
Software has taken a brutal beating in recent weeks, with many names down ~20% year-to-date. The question for investors is whether the current software sell-off signals a structural shift or a market overreaction.
The consensus is that software has become a commodity. My view is more nuanced. While some software companies risk being disrupted, others might actually benefit from AI. The key distinction is between companies that have “artificial limiters” — non-code moats like network effects, regulation, and physical infrastructure that restrict competitive entry and sustain pricing power — and those who don’t. The current indiscriminate sell-off is creating compelling public-market opportunities for both value and growth investors. Paradoxically, the more fragile arena is in the private markets.
This post covers the following:
Let’s dive in.
Understanding the collapsing multiples in software-land
To level-set the conversation, let’s first highlight the current market dynamics. In the past few weeks, hundreds of billions of dollars in market cap have been erased across both incumbent software names and newer entrants. Some examples (as of time of writing):
This has been especially painful for high-growth companies, where EV / NTM revenue multiples (enterprise value / next twelve months revenue) have collapsed to just 5x. That is bad relative to just 2 months ago, and looks even worse compared to Covid-era highs of 20x+ EV/NTM.
Source: Needham: SaaS Monthly Drive-Thru (February 2026)
There are multiple reasons for the current sentiment (some may not be true):
The death of (mission-critical) SaaS is overblown
Of the reasons listed above, the one that seems most worrying is that companies can build much of their software internally. This worry has been amplified after the launch of recent models (Opus 4.6 and GPT-5.3-Codex) and their associated scaffolding / software ecosystem (Claude Cowork with plugins and OpenAI’s Codex).
While this sentiment is justified for feature-thin, nice-to-have software, it doesn’t apply to mission-critical applications, where the impact of a vibe-coded app going down vastly exceeds the ACV (average contract value) customers pay. Here, companies are overestimating their ability to build and maintain software in-house using AI. At the end of the day, customers are also paying for ongoing support, compliance, updates, security, SLAs, etc., which require full-time headcount. Once these “fully loaded” costs of development are factored in, building software in-house might well be more expensive than buying it.
What this means is that beaten-down mission-critical software vendors (SAP, Workday, Salesforce) have time to reinvent themselves. Given their existing distribution and customer trust, they have room to introduce their own AI-native capabilities. Furthermore, tech-forward incumbents can apply AI (just like startups) to their various “jobs to be done” (coding, sales, customer support, etc.). This reduces the need for headcount expansion and results in lower SBC (stock-based compensation) and higher FCF (free cash flow). Companies that adopt this “creative destruction” mentality will see their stock prices re-rate (generating alpha for value investors).
Companies with “artificial limiters” represent multi-bagger opportunities
While the software re-rating story is interesting, I’m more fascinated by beneficiaries of AI. The mental model I’ve adopted is to look for growthy software (or software-enabled) businesses that can re-accelerate growth in the AI era and are protected by artificial limiters (aka non-code moats).
Some (non-exhaustive) categories that I’ve been thinking about:
One important caveat here is that a “good company” doesn’t mean an “undervalued stock.” Many of these companies can be structural winners and still be fully priced. Finding multi-baggers requires underwriting capability and favorable market timing.
The perils of venture investing in the age of AI
While I’m optimistic on the opportunity set in the public markets, I’m more uneasy in venture-land. The disconnect between VC pricing and public-market multiples implies many investors are still underwriting software deals with mental models from the old SaaS era.
Yes, startups might be “AI native” and many can monetize in new ways (usage-based, outcome-based, tied to labor savings, etc.). But in many cases, it’s still SaaS by another name. These companies face the same questions around long-term defensibility and pricing power. The risk here is that with public multiples already compressed, even if an AI startup fully realizes its vision and meaningfully replaces labor spend, the exit value may not justify the entry price.
This is compounded by structural issues inside VC itself. Many funds are not set up to invest in truly disruptive, longer-duration technologies (fund size, ownership, time horizon mismatch), which I’ve written about previously. So, they end up crowding into “software-shaped” deals because those are easier to underwrite and historically cleared at high multiples. That doesn’t mean VC is devoid of alpha (e.g. my conviction in frontier AI labs). Additionally, some funds are positioned to succeed (partners have the right risk appetite, fund size, trust from LPs, etc.).
The opportunities and perils of the commoditization of code
So, the conclusion here is that AI is a true paradigm shift and tech investing won’t be business as usual going forward. However, the market’s response has been too blunt. In the public markets, there are opportunities at both ends of the spectrum: value investors can underwrite a software re-rating, while growth investors might find multi-baggers via companies with artificial limiters. My take is that mispricings are easiest to find in public equities. The bigger risk is in private markets, where entry prices still reflect old-regime multiples for new-regime risk.
Thank you to Will Lee, John Wu, Till Pieper, Homan Yuen, Yash Tulsani, Maged Ahmed, Wai Wu, and Andrew Tan for feedback on this piece.
Published on February 11, 2026 3:30 PM GMT
It was July 30th, 2023. I had spent the last several days in sesshin with my sangha at Bay Zen Center. Physically exhausted from the effort, I came home and collapsed on the couch. I spent a couple hours watching TV, catching up on Twitter, and then, at about three in the afternoon, I stood up, looked out the window, gazed deeply at a particular branch on a particular tree, and finally resolved the Great Matter.
What’s the Great Matter? It’s the question you can’t answer. It’s the fear you feel when you contemplate your own death. It’s the void lurking at the center of your existence, and no matter how hard you try, you can’t see into it. You’ve lived with it for as long as you can remember, fighting against its friction with your every action. It can’t be pointed at directly, but you know it’s there because it’s in the place you can’t look.
That’s all rather mysterious, and thankfully some people have managed to grapple with the Great Matter enough to say a little more. They often talk about non-dual experience, of crossing the divide between the relative and the absolute. They say things like there’s no separate self and that all is impermanent. And if they’ve read the right websites, they might even describe the Great Matter as the contradiction inherent in the intuitive self-model, and say that to resolve the Great Matter is to have the insights that transition you into a persistent, non-symbolic experience of existence (PNSE).
Or, in short, to resolve the Great Matter is to become enlightened.
But the longer it’s been since the Great Matter was resolved, the less I like the word “enlightened”. I did a whole video essay about it, but in short, the main problem is that “enlightenment” implies a permanent state of attainment and carries misguided cultural connotations. Other words like “awakened” drop some of that cultural baggage, but unfortunately keep the attainment framing.
And you might be saying to yourself, why would an attainment framing be bad? Clearly something has been attained! But, the only thing attained is a resolution of the Great Matter itself, which is the act of having an insight and having that insight spread deeply through your entire being, nothing more. The rest has to come later. All that resolving the Great Matter does is remove the obstacle that was blocking the way on the path to freedom from suffering.
And if you want to be free of suffering, you have to put in the work. First, you have to put in the work to resolve the Great Matter, which often involves untraining many of your maladaptive habits. It requires grappling with your addictions, your traumas, your hangups, and your insecurities. And then, after the Great Matter is resolved, you have to continue that work, but now on a deeper, more subtle level than was possible before. It does feel easier, because you no longer have to deal with both the habituated mind and the Great Matter, but that feeling itself can become a trap that will ensnare you if you aren’t careful.
Now, some people who resolve the Great Matter do so while living in a monastery or hermitage, or move to one after such resolving. In such a place, one has the opportunity to minimize decision making and planning. And when there’s no decisions to make, there’s no need to model the self, and so the constructed self-model can be let go and forgotten, replaced by a mind that only concerns itself with this moment. That can be nice, especially if done for a limited time to strengthen one’s practice, but it comes at the cost of helping the world.
To help the world, you have to live in it, and that means constructing a self-model to make the predictions necessary to plan and take actions. The best that someone who has resolved the Great Matter can do, if they are committed to benefitting all beings, is to find a way to live holding that self-model lightly. To see, in each moment, that the self and the not-self views are equally real, for both are needed to tackle the great many challenges facing our world.
I’ve left out many details you’d probably like to know. That’s because I write this post with some hesitation. Although this is not the first time I’ve publicly admitted to being “enlightened”, it is the least obscure and most easily found. But I decided to take the risk, because I feel that it’s been long enough that I can speak confidently of my own experiences, however they are labeled, and because by speaking about those experiences, I may be able to help others.
When I had not yet resolved the Great Matter, I was very confused about it. Some people wrote some straightforward things that were confusing. Other people wrote confusing things that were straightforward. I needed all the help I could get from others pointing the way for me. Now it’s my turn to help others by pointing. This is one attempt, and I hope it proves useful.
Published on February 11, 2026 3:08 PM GMT
I recently read this article (linked by David Chapman): JOINT REVIEW: Philosophy Between the Lines, by Arthur M. Melzer (by Jane Psmith and John Psmith).
The article talks about esoteric writing, i.e., where writers will sometimes not state their entire message explicitly but leave some important messages only implied, in the hope that the "wise" reader will be able to "read between the lines" and understand the hidden message.
The article argues that this is sometimes necessary, either to protect the writer from persecution or to protect the public from information hazards (basilisks?).
The article gives the below example of an alleged information hazard. But I do not follow their reasoning. I do not understand how this information is hazardous.
I’m going to offer two examples here, carefully calibrated (I hope!) to evoke some visceral reaction in our readers without actually prompting anyone to show up at our house with torches and pitchforks. So let’s take the Constitution [of the USA].
... I did spend a lot of time reading about constitutional theory and interpretation and realized to my horror that none of it makes any sense. In theory the law binds us because “we,” the vast diachronic entity called the American people, have agreed to be bound by it. That’s why the originalists say we need to know what the people who signed up thought they were agreeing to, since that’s the extent of the law’s legitimate authority today. But push even a little bit and the originalists will admit that no, it really boils down to the fact that we all think we’re bound by it. Political legitimacy derives from…the belief in political legitimacy. It’s turtles all the way down. Any attempt to delineate neutral interpretive principles is just an elaborate attempt to meme society at large into tying its rulers’ hands, a willingness to limit the options available to Our Guys if it means political pressure will force Their Guys to operate under the same constraints. It is, in short, a lie.
But you can’t say that; you can’t point it out; the trick doesn’t work if you tell the marks it’s happening. We’ve all agreed that the Constitution is binding, which limits the potential range of our fights to “things the Constitution might plausibly be interpreted to say” — a range that can be stretched, certainly, but not infinitely far. Which is good! I am in favor of the kind of stability that comes from political disagreement happening between men in suits filing briefs instead of men in camo firing guns. But that doesn’t make any of it actually true. It just makes pointing out the lie worse than living with it.
I do not understand what the Psmiths are saying here. Why can't you say that? If a lot of thinkers started openly discussing the fact that there is no binding magic in constitutions, what would happen?
Can anyone help me understand what the authors probably have in mind?