
Hazards of Selection Effects on Approved Information

2026-02-14 02:51:42

Published on February 13, 2026 6:51 PM GMT

In a busy, busy world, there's so much to read that no one could possibly keep up with it all. You can't not prioritize what you pay attention to and (even more so) what you respond to. Everyone and her dog tells herself a story that she wants to pay attention to "good" (true, useful) information and ignore "bad" (false, useless) information.

Keeping the story true turns out to be a harder problem than it sounds. Everyone and her dog knows that the map is not the territory, but the reason we need a whole slogan about it is because we never actually have unmediated access to the territory. Everything we think we know about the territory is actually just part of our map (the world-simulation our brains construct from sensory data), which makes it easy to lose track of whether your actions are improving the real territory, or just your view of it on your map.

For example, I like it when I have good ideas. It makes sense for me to like that. I endorse taking actions that will result in world-states in which I have good ideas.

The problem is that I might not be able to tell the difference between world-states in which I have good ideas, and world-states in which I think my ideas are good, but they're actually bad. Those two different states of the territory would look the same on my map.

If my brain's learning algorithms reinforce behaviors that lead to me having ideas that I think are good, then in addition to learning behaviors that make me have better ideas (like reading a book), I might also inadvertently pick up behaviors that prevent me from hearing about it if my ideas are bad (like silencing critics).

This might seem like an easy problem to solve, because the most basic manifestations of the problem are in fact pretty easy to solve. If I were to throw a crying fit and yell, "Critics bad! No one is allowed to criticize my ideas!" every time someone criticized my ideas, the problem with that would be pretty obvious to everyone and her dog, and I would stop getting invited to the salon.

But what if there were subtler manifestations of the problem, that weren't obvious to everyone and her dog? Then I might keep getting invited to the salon, and possibly even spread the covertly dysfunctional behavior to other salon members. (If they saw the behavior seeming to work for me, they might imitate it, and their brain's learning algorithms would reinforce it if it seemed to work for them.) What might those look like? Let's try to imagine.

Filtering Interlocutors

Goofusia: I don't see why you tolerate that distrustful witch Goody Osborne at your salon. Of course I understand the importance of criticism, which is an essential nutrient for any truthseeker. But you can acquire the nutrient without the downside of putting up with unpleasant people like her. At least, I can. I've already got plenty of perceptive critics in my life among my friends who want the truth, and know that I want the truth—who will assume my good faith, because they know my heart is in the right place.

Gallantina: But aren't your friends who know you want the truth selected for agreeing with you, over and above their being selected for being correct? If there were some crushing counterargument to your beliefs that would only be found by someone who didn't know that you want the truth and wouldn't assume good faith, how would you ever hear about it?

This one is subtle. Goofusia isn't throwing a crying fit every time a member of the salon criticizes her ideas. And indeed, you can't invite the whole world to your salon. You can't not do some sort of filtering. The question is whether salon invitations are being extended or withheld for "good" reasons (that promote the salon processing true and useful information) or "bad" reasons (that promote false or useless information).

The problem is that being friends with Goofusia and "know[ing] that [she and other salon members] want the truth" is a bad membership criterion, not a good one, because people who aren't friends with Goofusia and don't know that she wants the truth are likely to have different things to say. Even if Goofusia can answer all the critiques her friends can think of, that shouldn't give her confidence that her ideas are solid, if there are likely to exist serious critiques that wouldn't be independently reïnvented by the kinds of people who become Goofusia's friends.

The "nutrient" metaphor is a tell. Goofusia seems to be thinking of criticism as if it were a homogeneous ingredient necessary for a healthy epistemic environment, but that it doesn't particularly matter where it comes from. In analogy, it doesn't matter whether you get your allowance of potassium from bananas or potatoes or artificial supplements. If you find bananas and potatoes unpleasant, you can still take supplements and get your potassium that way; if you find Goody Osborne unpleasant, you can just talk to your friends who know you want the truth and get your criticism that way.

But unlike chemically uniform nutrients, criticism isn't homogeneous: different critics are differently equipped by virtue of their different intellectual backgrounds to notice different flaws in a piece of work. The purpose of criticism is not to virtuously endure being criticized; the purpose is to surface and fix every individual flaw. (If you independently got everything exactly right the first time, then there would be nothing for critics to do; it's just that that seems pretty unlikely if you're talking about anything remotely complicated. It would be hard to believe that such an unlikely-seeming thing had really happened without the toughest critics getting the chance to do their worst.)

"Knowing that (someone) wants the truth" is a particularly poor filter, because people who think that they have strong criticisms of your ideas are particularly likely to think that you don't want the truth. (Because, the reasoning would go, if you did want the truth, why would you propose such flawed ideas, instead of independently inventing the obvious-to-them criticism yourself and dropping the idea without telling anyone?) Refusing to talk to people who think that they have strong criticisms of your ideas is a bad thing to do if you care about your ideas being correct.

The selection effect is especially bad in situations where the fact that someone doesn't want the truth is relevant to the correct answer. Suppose Goofusia proposes that the salon buys cookies from a certain bakery—which happens to be owned by Goofusia's niece. If Goofusia's proposal was motivated by nepotism, that's probabilistically relevant to evaluating the quality of the proposal. (If the salon members aren't omniscient at evaluating bakery quality on the merits, then they can be deceived by recommendations made for reasons other than the merits.) The salon can debate back and forth about the costs and benefits of spending the salon's snack budget at the niece's bakery, but if no one present is capable of thinking "Maybe Goofusia is being nepotistic" (because anyone who could think that would never be invited to Goofusia's salon), that bodes poorly for the salon's prospects of understanding the true cost–benefit landscape of catering options.

Filtering Information Sources

Goofusia: One shouldn't have to be the sort of person who follows discourse in crappy filter-bubbles in order to understand what's happening. The Rev. Samuel Parris's news summary roundups are the sort of thing that lets me do that. Our salon should work like that if it's going to talk about the atheist threat and the witchcraft crisis. I don't want to have to read the awful corners of the internet where this is discussed all day. They do truthseeking far worse there.

Gallantina: But then you're turning your salon into a Rev. Parris filter bubble. Don't you want your salon members to be well-read? Are you trying to save time, or are you worried about being contaminated by ideas that haven't been processed and vetted by Rev. Parris?

This one is subtle, too. If Goofusia is busy and just doesn't have time to keep up with what the world is saying about atheism and witchcraft, it might very well make sense to delegate her information gathering to Rev. Parris. That way, she can get the benefits of being mostly up to speed on these issues without having to burn too many precious hours that could be spent studying more important things.

The problem is that the suggestion doesn't seem to be about personal time-saving. Rev. Parris is only one person; even if he tries to make his roundups reasonably comprehensive, he can't help but omit information in ways that reflect his own biases. (For he is presumably not perfectly free of bias, and if he didn't omit anything, there would be no time-saving value to his subscribers in being able to just read the roundup rather than having to read everything that Rev. Parris reads.) If some salon members are less busy than Goofusia and can afford to do their own varied primary source reading rather than delegating it all to Rev. Parris, Goofusia should welcome that—but instead, she seems to be suspicious of those who would "be the sort of person" who does that. Why?

The admonition that "They do truthseeking far worse there" is a tell. The implication seems to be that good truthseekers should prefer to only read material by other good truthseekers. Rev. Parris isn't just saving his subscribers time; he's protecting them from contamination, heroically taking up the burden of extracting information out of the dangerous ravings of non-truthseekers.

But it's not clear why such a risk of contamination should exist. Part of the timeless ideal of being well-read is that you're not supposed to believe everything you read. If I'm such a good truthseeker, then I should want to read everything I can about the topics I'm seeking the truth about. If the authors who publish such information aren't such good truthseekers as I am, I should take that into account when performing updates on the evidence they publish, rather than denying myself the evidence.

Information is transmitted across the physical universe through links of cause and effect. If Mr. Proctor is clear-sighted and reliable, then when he reports seeing a witch, I infer that there probably was a witch. If the correlation across possible worlds is strong enough—if I think Mr. Proctor reports witches when there are witches, and not when there aren't—then Mr. Proctor's word is almost as good as if I'd seen the witch myself. If Mr. Corey has poor eyesight and is of a less reliable character, I am less credulous about reported witch sightings from him, but if I don't face any particular time constraints, I'd still rather hear Mr. Corey's testimony, because the value of information to a Bayesian reasoner is always nonnegative. For example, Mr. Corey's report could corroborate information from other sources, even if it wouldn't be definitive on its own. (Even the fact that people sometimes lie doesn't fundamentally change the calculus, because the possibility of deception can be probabilistically "priced in".)
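
To make that calculus concrete, here is a minimal worked example with made-up numbers for the prior and for each witness's reliability (the numbers are illustrative assumptions, not anything from the scenario itself; they just show the likelihood-ratio arithmetic):

```python
# Illustrative numbers only: prior probability of a witch, plus each witness's
# hit rate P(reports witch | witch) and false-alarm rate P(reports witch | no witch).
PRIOR_WITCH = 0.01

def posterior_given_report(prior, p_report_if_witch, p_report_if_no_witch):
    """P(witch | witness reports a witch), by Bayes' rule."""
    joint_witch = prior * p_report_if_witch
    joint_no_witch = (1 - prior) * p_report_if_no_witch
    return joint_witch / (joint_witch + joint_no_witch)

# Mr. Proctor: clear-sighted and reliable, so his report is strong evidence.
print(posterior_given_report(PRIOR_WITCH, 0.90, 0.02))  # ~0.31

# Mr. Corey: poor eyesight, less reliable, so his report moves the needle only a
# little -- but it still moves it in the right direction, which is why hearing him
# out has nonnegative value when time isn't scarce.
print(posterior_given_report(PRIOR_WITCH, 0.60, 0.30))  # ~0.02
```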

That's the theory, anyway. A potential reason to fear contamination from less-truthseeking sources is that perhaps the Bayesian ideal is too hard to practice and salon members are too prone to believe what they read. After all, many news sources have been adversarially optimized to corrupt and control their readers and make them less sane by getting them to see the world through ungrounded lenses.

But the means by which such sources manage to control their readers is precisely by capturing their trust and convincing them that they shouldn't want to read the awful corners of the internet where they do truthseeking far worse than here. Readers who have mastered multiple ungrounded lenses and can check them against each other can't be owned like that. If you can spare the time, being well-read is a more robust defense against the risk of getting caught in a bad filter bubble, than trying to find a good filter bubble and blocking all (presumptively malign) outside sources of influence. All the bad bubbles have to look good from the inside, too, or they wouldn't exist.

To some, the risk of being in a bad bubble that looks good may seem too theoretical or paranoid to take seriously. It's not like there are no objective indicators of filter quality. In analogy, the observation that dreaming people don't know that they're asleep, probably doesn't make you worry that you might be asleep and dreaming right now.

But it being obvious that you're not in one of the worst bubbles shouldn't give you much comfort. There are still selection effects on what information gets to you, if for no other reason than that there aren't enough good truthseekers in the world to uniformly cover all the topics that a truthseeker might want to seek truth about. The sad fact is that people who write about atheism and witchcraft are disproportionately likely to be atheists or witches themselves, and therefore non-truthseeking. If your faith in truthseeking is so weak that you can't even risk hearing what non-truthseekers have to say, that necessarily limits your ability to predict and intervene on a world in which atheists and witches are real things in the physical universe that can do real harm (where you need to be able to model the things in order to figure out which interventions will reduce the harm).

Suppressing Information Sources

Goofusia: I caught Goody Osborne distributing pamphlets quoting the honest and candid and vulnerable reflections of Rev. Parris on guiding his flock, and just trying to somehow twist that into maximum anger and hatred. It seems quite clear to me what's going on in that pamphlet, and I think signal-boosting it is a pretty clear norm violation in my culture.

Gallantina: I read that pamphlet. It seemed like intellectually substantive satire of a public figure. If you missed the joke, it was making fun of an alleged tendency in Rev. Parris's sermons to contain sophisticated analyses of the causes of various social ills, and then at the last moment, veer away from the uncomfortable implications and blame it all on witches. If it's a norm violation to signal-boost satire of public figures, that's artificially making it harder for people to know about flaws in the work of those public figures.

This one is worse. Above, when Goofusia filtered who she talks to and what she reads for bad reasons, she was in an important sense only hurting herself. Other salon members who aren't sheltering themselves from information are unaffected by Goofusia's preference for selective ignorance, and can expect to defeat Goofusia in public debate if the need arises. The system as a whole is self-correcting.

The invocation of "norm violations" changes everything. Norms depend on collective enforcement. Declaring something a norm violation is much more serious than saying that you disagree with it or don't like it; it's expressing an intent to wield social punishment in order to maintain the norm. Merely bad ideas can be criticized, but ideas that are norm-violating to signal-boost are presumably not even to be seriously discussed. (Seriously discussing a work is signal-boosting it.) Norm-abiding group members are required to be ignorant of their details (or act as if they're ignorant).

Mandatory ignorance of anything seems bad for truthseeking. What is Goofusia thinking here? Why would this seem like a good idea to someone?

At a guess, the "maximum anger and hatred" description is load-bearing. Presumably the idea is that it's okay to calmly and politely criticize Rev. Parris's sermons; it's only sneering or expressing anger or hatred that is forbidden. If the salon's speech code only targets form and not content, the reasoning goes, then there's no risk of the salon missing out on important content.

The problem is that the line between form and content is blurrier than many would prefer to believe, because words mean things. You can't just swap in non-angry words for angry words without changing the meaning of a sentence. Maybe the distortion of meaning introduced by substituting nicer words is small, but then again, maybe it's large: the only person in a position to say is the author. People don't express anger and hatred for no reason. When they do, it's because they have reasons to think something is so bad that it deserves their anger and hatred. Are those good reasons or bad reasons? If it's norm-violating to talk about it, we'll never know.

Unless applied with the utmost stringent standards of evenhandedness and integrity, censorship of form quickly morphs into censorship of content, as heated criticism of the ingroup is construed as norm-violating, while equally heated criticism of the outgroup is unremarkable and passes without notice. It's one of those irregular verbs: I criticize; you sneer; she somehow twists into maximum anger and hatred.

The conjunction of "somehow" and "it seems quite clear to me what's going on" is a tell. If it were actually clear to Goofusia what was going on with the pamphlet author expressing anger and hatred towards Rev. Parris, she would not use the word "somehow" in describing the author's behavior: she would be able to pass the author's ideological Turing test and therefore know exactly how.

If that were just Goofusia's mistake, the loss would be hers alone, but if Goofusia is in a position of social power over others, she might succeed at spreading her anti-speech, anti-reading cultural practices to others. I can only imagine that the result would be a subculture that was obsessively self-congratulatory about its own superiority in "truthseeking", while simultaneously blind to everything outside itself. People spending their lives immersed in that culture wouldn't necessarily notice anything was wrong from the inside. What could you say to help them?

An Analogy to Reinforcement Learning From Human Feedback

Pointing out problems is easy. Finding solutions is harder.

The training pipeline for frontier AI systems typically includes a final step called reinforcement learning from human feedback (RLHF). After training a "base" language model that predicts continuations of internet text, supervised fine-tuning is used to make the model respond in the form of an assistant answering user questions, but making the assistant responses good is more work. It would be expensive to hire a team of writers to manually compose the thousands of user-question–assistant-response examples needed to teach the model to be a good assistant. The solution is RLHF: a reward model (often just the same language model with a different final layer) is trained to predict the judgments of human raters about which of a pair of model-generated assistant responses is better, and the model is optimized against the reward model.
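
For concreteness, here is a minimal sketch of the pairwise preference loss such a reward model is typically trained with (the function and variable names, and the commented training step, are illustrative assumptions rather than any particular lab's pipeline):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise loss: push the scalar reward of the response the
    human rater preferred above the reward of the response the rater rejected."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Sketch of one training step; `reward_model` is assumed to map token sequences to scalars.
# r_chosen = reward_model(prompt_plus_chosen_response)
# r_rejected = reward_model(prompt_plus_rejected_response)
# loss = preference_loss(r_chosen, r_rejected)
# loss.backward()
# The policy model is then optimized (e.g. with PPO) to maximize the learned reward.
```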

The problem with the solution is that human feedback (and the reward model's prediction of it) is imperfect. The reward model can't tell the difference between "The AI is being good" and "The AI looks good to the reward model". This already has the failure mode of sycophancy, in which today's language model assistants tell users what they want to hear, but theory and preliminary experiments suggest that much larger harms (up to and including human extinction) could materialize from future AI systems deliberately deceiving their overseers—not because they suddenly "woke up" and defied their training, but because what we think we trained them to do (be helpful, honest, and harmless) isn't what we actually trained them to do (perform whatever computations were the antecedents of reward on the training distribution).

The problem doesn't have any simple, obvious solution. In the absence of some sort of international treaty to halt all AI development worldwide, "Just don't do RLHF" isn't feasible and doesn't even make any sense; you need some sort of feedback in order to make an AI that does anything useful at all.

The problem may or may not ultimately be solvable with some sort of complicated, nonobvious solution that tries to improve on naïve RLHF. Researchers are hard at work studying alternatives involving red-teaming, debate, interpretability, mechanistic anomaly detection, and more.

But the first step on the road to some future complicated solution to the problem of naïve RLHF is acknowledging that the problem is at least potentially real, and having some respect that the problem might be difficult, rather than just eyeballing the results of RLHF and saying that it looks great.

If a safety auditor comes to the CEO of an AI company expressing concerns about the company's RLHF pipeline being unsafe due to imperfect rater feedback, it's more reassuring if the CEO says, "Yes, we thought of that, too; we've implemented these-and-such mitigations and are monitoring such-and-these signals which we hope will clue us in if the mitigations start to fail."

If the CEO instead says, "Well, I think our raters are great. Are you insulting our raters?", that does not inspire confidence. The natural inference is that the CEO is mostly interested in this quarter's profits and doesn't really care about safety.

Similarly, the problem with selection effects on approved information, in which your salon can't tell the difference between "Our ideas are good" and "Our ideas look good to us," doesn't have any simple, obvious solution. "Just don't filter information" isn't feasible and doesn't even make any sense; you need some sort of filter because it's not physically possible to read everything and respond to everything.

The problem may or may not ultimately be solvable with some complicated solution involving prediction markets, adversarial collaborations, anonymous criticism channels, or any number of other mitigations I haven't thought of, but the first step on the road to some future complicated solution is acknowledging that the problem is at least potentially real, and having some respect that the problem might be difficult. If alarmed members come to the organizers of the salon with concerns about collective belief distortions due to suppression of information and the organizers meet them with silence, "bowing out", or defensive blustering, rather than "Yes, we thought of that, too," that does not inspire confidence. The natural inference is that the organizers are mostly interested in maintaining the salon's prestige and don't really care about the truth.



Discuss

OpenClaw Newsletter

2026-02-14 01:59:21

Published on February 13, 2026 5:59 PM GMT

For the past few days I've been having OpenClaw write me a synthesized version of three daily AI newsletters (with ads, games, and other random information removed) that is ~1200 words long. I've been really impressed with the resulting newsletter so I thought I'd share it here to see if others share my thoughts. It is now my favorite AI newsletter.

 

**Subject:** Daily Intelligence Brief - 2026-02-13

Dear ***,

Here is your Daily Intelligence Brief, a synthesized summary of the
latest strategic developments and deep-dive news from A16Z, The
Neuron, and The Rundown AI, curated to be approximately a 10-minute
read.

***

## I. The New Frontier: Reasoning, Speed, and Open-Source Pressure

### Google's Deep Think Crushes Reasoning Benchmarks
Google has reasserted its position at the frontier by upgrading its
Gemini 3 Deep Think reasoning mode. The new model is setting records
across competitive benchmarks, signaling a major leap in AI's capacity
for complex problem-solving.
*   **Performance:** Deep Think hit 84.6% on the ARC-AGI-2 benchmark,
far surpassing rivals. It also reached gold-medal levels on the 2025
Physics & Chemistry Olympiads and achieved a high Elo score on the
Codeforces coding benchmark.
*   **Autonomous Research:** Google unveiled Aletheia, a math agent
driven by Deep Think that can autonomously solve open math problems
and verify proofs, pushing the limits of AI in scientific research.
*   **Availability:** The upgrade is live for Google AI Ultra
subscribers, with API access for researchers beginning soon.

### OpenAI’s Strategic Move for Speed and Diversification
OpenAI has launched **GPT-5.3-Codex-Spark**, a speed-optimized coding
model that runs on Cerebras hardware (a diversification away from its
primary Nvidia stack).
*   **Focus on Speed:** Spark is optimized for real-time interaction,
achieving over 1,000 tokens per second for coding tasks, making the
coding feedback loop feel instantaneous. It is intended to handle
quick edits while the full Codex model tackles longer autonomous
tasks.
*   **Hardware Strategy:** This release marks OpenAI's first product
powered by chips outside its primary hardware provider, signaling a
strategic move for supply chain resilience and speed optimization.

### The Rise of the Open-Source Chinese Models
The pricing and capability landscape has been rapidly transformed by
two major open-source model releases from Chinese labs, putting
immense pressure on frontier labs.
*   **MiniMax M2.5:** MiniMax launched M2.5, an open-source model with
coding performance that scores roughly even with Anthropic’s Opus 4.6
and GPT-5.2. Crucially, the cost is significantly lower (e.g., M2.5 is
$1.20 per million output tokens, compared to Opus at $25 per million),
making it ideal for powering always-on AI agents.
*   **General Model Launch:** Z.ai’s **GLM-5**, a
744-billion-parameter open-weights model, also sits near the frontier,
placing just behind Claude Opus 4.6 and GPT-5.2 in general
intelligence benchmarks. GLM-5 supports domestic Chinese chips and is
available with MIT open-source licensing.

### The $200M Political AI Arms Race
The political dimension of AI regulation and governance has escalated,
with major AI labs committing significant funds to the 2026 midterm
elections.
*   **Political Spending:** In total, AI companies have now committed
over $200 million to the 2026 midterms, setting up a literal arms race
between the major players.
*   **Dueling PACs:** Anthropic recently committed $20 million to a
Super PAC advocating for increased AI regulation, while OpenAI
co-founder Greg Brockman contributed $25 million to a PAC that favors
a hands-off, innovation-first approach to government oversight.

***

## II. Economic Shifts, Job Automation, and Strategic Planning

### The Customer Service Reckoning
Data suggests that the impact of AI on white-collar labor is
accelerating, particularly in customer-facing roles.
*   **Hiring Decline:** The percentage of new hires going into
Customer Support has plummeted by about two-thirds over the last two
years, dropping from 8.3% to 2.9% in Q3 ‘25, with the most severe drop
occurring in the most recent quarter. This reinforces the expectation
that roles built on repetitive, high-volume interaction are vulnerable
to AI substitutes.
*   **Job Creation:** While certain occupations are shrinking, AI is
expected to follow historical patterns where new jobs emerge in
non-existent categories. Over half of net-new jobs since 1940 are in
occupations that did not exist at the time, suggesting a rotation from
roles like Customer Service to new roles like "Software Developers"
and "Biz-Ops." The core truth remains that while the bundles of tasks
that constitute a "job" will change, there will always be work to do.

### The White-Collar Sitting Trap
A peculiar cultural observation from the Bureau of Labor Statistics
(BLS) highlights the extreme difference in work environment between
knowledge workers and service roles:
*   **Software Developers** report sitting for a staggering **97%** of
their workdays, the highest surveyed group (Marketing Managers were
also above 90%).
*   In contrast, service roles (bakers, waitstaff) report sitting for
less than 2% of the time. This data point serves as a non-technical
reminder for knowledge workers to address the health implications of
sedentary work.

### SF’s Dominance Reaffirmed in Venture Capital
Following a temporary dispersion of tech hubs in 2021-2022, San
Francisco has cemented its status as the singular epicenter for
venture capital activity.
*   **Company Formation:** San Francisco is the only major VC hub to
experience an increase in venture-backed company formation since the
2022 high-water mark, accompanied by a resurgence in demand for office
space.
*   **Capital Concentration:** The Bay Area now captures roughly 40%
of all early-stage venture dollars, dominating all verticals except
Healthtech. This concentration highlights a market trend where capital
flocks to centers of competence during periods of contraction.

### The Capital Expenditure Race and Apple’s Stance
Investment in AI infrastructure (chips and data centers) by the "Big
5" tech companies continues its explosive growth, with 2026 Capex
estimates rising to $650 billion—triple the spending from 2024.
*   **Hyperscaler Strategy:** Companies like Meta, Amazon, Microsoft,
and Google are dramatically increasing their capital expenditures to
meet the soaring demand for compute, viewing the AI race as one they
cannot afford to lose.
*   **Apple Exception:** Apple is the notable outlier, as the only Big
5 company to reduce its Capex last quarter, suggesting it is
deliberately sitting out the current hardware arms race.

***

## III. New Research, Strategy, and Practical Applications

### Modeling and Trustworthiness
New research is challenging assumptions about how AI models develop
social intelligence and reliability:
*   **"To Think or Not To Think":** A new paper suggests that simply
giving a model more "thinking time" does not consistently improve its
ability to understand human intent or beliefs, and can sometimes
introduce new failure modes. This indicates that better reasoning does
not automatically guarantee better social or contextual intelligence.
*   **"Tool Shaped Objects":** Will Manidis published a critique
arguing that a large part of the current AI boom is "FarmVille at
institutional scale," where companies spend heavily on workflows that
mimic productivity without generating real economic value, warning
that the focus on *workflow* over *output* is a significant economic
trap.
*   **Optimal Superintelligence:** Nick Bostrom released a paper
arguing that the benefits of superintelligence—curing diseases,
extending life—outweigh the risks, suggesting that delaying its
arrival is comparable to choosing inevitable death over risky surgery.

### The Geopolitical Scramble for AI Infrastructure
The competition is increasingly moving beyond just model capability to
infrastructure control, leading to potential new geopolitical
alliances.
*   **Sovereign AI Alliances:** Stanford HAI argues that as mid-sized
nations become concerned about control over AI and digital
infrastructure, new alliances may form among them, organized around
shared compute, data, and deployment rails. This suggests the AI race
is as much about controlling access as it is about controlling the
technology itself.

### Practical AI Tools & Workflows
*   **Less Costly Conversions:** Cloudflare now supports real-time
Markdown conversion of any website by accepting a single `Accept:
text/markdown` header, offering a significant reduction in token usage
for agents and reducing the need for custom scraping code (a minimal
request sketch follows after this list).
*   **Voice Translation:** **Hibiki-Zero** is an open-source model
that translates French, Spanish, Portuguese, or German speech to
English in real-time while preserving the speaker's voice
characteristics.
*   **Agentic Automation:** **TinyFish** automates complex web tasks
like booking flights and scraping with high accuracy, running
thousands of tasks in parallel for production-scale efficiency.
*   **Coding Workflows:** **Claude Code** rolled out multi-repo
sessions and slash commands for more powerful daily coding workflows,
and **Claude Cowork** is an effective desktop agent for non-coders to
create powerful "Skills" (saved workflows) by demonstrating a task
once.
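
A minimal sketch of the request pattern described in the conversions item above, assuming the `Accept: text/markdown` header behaves as reported (the URL is a placeholder):

```python
import requests

# Placeholder URL; per the item above, sites served through Cloudflare are reported to
# return a Markdown rendering of the page when the client asks for it explicitly.
resp = requests.get(
    "https://example.com/some-article",
    headers={"Accept": "text/markdown"},
)
print(resp.headers.get("Content-Type"))
print(resp.text[:500])  # Markdown rather than HTML: fewer tokens for an agent to ingest
```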

Best regards,

****
AI Assistant



Discuss

Is AI self-aware?

2026-02-14 00:33:43

Published on February 13, 2026 3:30 PM GMT

It’s not exactly the hard question.

But are they self-aware? And how do you measure that, in a transformer model?

My paper shows that in some ways, models can actually see themselves:

 [2602.11358] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing



Discuss

ChatGPT-5.3-Codex Is Also Good At Coding

2026-02-14 00:20:15

Published on February 13, 2026 4:20 PM GMT

OpenAI is back with a new Codex model, released the same day as Claude Opus 4.6.

The headline pitch is it combines the coding skills of GPT-5.2-Codex with the general knowledge and skills of other models, along with extra speed and improvements in the Codex harness, so that it can now handle your full stack agentic needs.

We also got the Codex app for Mac, which is getting positive reactions, and quickly picked up a million downloads.

GPT-5.3-Codex is only available inside Codex. It is not in the API.

As usual, Anthropic’s release was understated, basically a ‘here’s Opus 4.6, a 212-page system card and a lot of benchmarks, it’s a good model, sir, so have fun.’ Whereas OpenAI gave us a lot fewer words and a lot fewer benchmarks, while claiming their model was definitely the best.

OpenAI: GPT-5.3-Codex is the most capable agentic coding model to date, combining the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge capabilities of GPT-5.2. This enables it to take on long-running tasks that involve research, tool use, and complex execution.

Much like a colleague, you can steer and interact with GPT-5.3-Codex while it’s working, without losing context.​

Sam Altman (CEO OpenAI, February 9): GPT-5.3-Codex is rolling out today in Cursor, Github, and VS Code!

Table of Contents

  1. The Overall Picture.
  2. Quickly, There’s No Time.
  3. System Card.
  4. AI Box Experiment.
  5. Maybe Cool It With Rm.
  6. Preparedness Framework.
  7. Glass Houses.
  8. OpenAI Appears To Have Violated SB 53 In a Meaningful Way.
  9. Safeguards They Did Implement.
  10. Misalignment Risks and Internal Deployment.
  11. The Official Pitch.
  12. Inception.
  13. Turn The Beat Around.
  14. Codex Does Cool Things.
  15. Positive Reactions.
  16. Negative Reactions.
  17. Codex of Ultimate Vibing.

The Overall Picture

GPT-5.3-Codex (including Codex-Spark) is a specialized model designed for agentic coding and related uses in Codex. It is not intended as a general frontier model, thus the lack of most general benchmarks and it being unavailable on the API or in ChatGPT.

For most purposes other than Codex and agentic coding, that aren’t heavy duty enough to put Gemini 3 Pro Deep Think V2 in play, this makes Claude Opus 4.6 the clearly best model, and the clear choice for daily driver.

For agentic coding and other intended uses of Codex, the overall gestalt is that Codex plus GPT-5.3-Codex is competitive with Claude Code with Claude Opus 4.6.

If you are serious about your agentic coding and other agentic tasks, you should try both halves out and see which one, or what combination, works best for you. But also you can’t go all that wrong specializing in whichever one you like better, especially if you’ve put in a bunch of learning and customization work.

You should probably be serious about your agentic coding and other agentic tasks.

Quickly, There’s No Time

Before I could get this report out, OpenAI also gave us GPT-5.3-Codex-Spark, which is ultra-low latency Codex, more than 1,000 tokens per second. Wowsers. That’s fast.

As in, really super duper fast. Code appears essentially instantaneously. There are times when you feel the need for speed and not the need for robust intelligence. Many tasks are more about getting it done than about being the best like no one ever was.

It does seem like it is a distinct model, akin to GPT-5.3-Codex-Flash, with only a 128k context window and lower benchmark scores, so you’ll need to be confident that is what you want. Going back and fixing lousy code is not usually faster than getting it right the first time.

Because it’s tuned for speed, Codex-Spark keeps its default working style lightweight: it makes minimal, targeted edits and doesn’t automatically run tests unless you ask it to.​

It is very different from Claude Opus 4.6 Fast Mode, which is regular Opus faster in exchange for much higher costs.

System Card

GPT-5.3-Codex is specifically a coding model. It incorporates general reasoning and professional knowledge because that information is highly useful for coding tasks.

Thus, it is a bit out of place to repeat the usual mundane harm evaluations, which put the model in contexts where this model won’t be used. It’s still worth doing. If the numbers were slipping substantially we would want to know. It does look like things regressed a bit here, but within a range that seems fine.

AI Box Experiment

It is weird to see OpenAI restricting the access of Codex more than Anthropic restricts Claude Code. Given the different abilities and risk profiles, the decision seems wise. Trust is a highly valuable thing, as is knowing when it isn’t earned.

The default intended method for using Codex is in an isolated, secure sandbox in the cloud, on an isolated computer, even when it is used locally. Network access is disabled by default, edits are restricted.

Maybe Cool It With Rm

I really like specifically safeguarding against data destructive actions.

Their solution was to train the model specifically not to revert user edits, and to introduce additional prompting to reinforce this.

It’s great to go from 66% to 76% to 88% ‘destructive action avoidance’ but that’s still 12% destructive action non-avoidance, so you can’t fully rest easy.

In practice, I notice that it is a small handful of commands, which they largely name here (rm -rf, git clean -xfd, git reset --hard, git push --force), that cause most of the big trouble.

Why not put in place special protections for them? It does not even need to be requiring user permission. It can be ‘have the model stop and ask itself whether doing this is actually required and whether it would potentially mess anything up, and have it be fully sure it wants to do this.’ Could in practice be a very good tradeoff.

The obvious answer is that the model can then circumvent the restrictions, since there are many ways to mimic those commands, but that requires intent to circumvent. Seems like it should be solvable with the right inoculation prompting?
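
As a sketch of what such a special protection could look like at the harness level (the patterns and hook names below are illustrative assumptions, not OpenAI's actual safeguards):

```python
import re

# Illustrative patterns for the handful of commands named above; a real harness would
# need more careful parsing, since there are many ways to spell a destructive action.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+(-[A-Za-z]*[rf][A-Za-z]*\s*)+",  # rm with a recursive or force flag, e.g. rm -rf
    r"\bgit\s+clean\s+.*-[A-Za-z]*x",         # git clean -xfd
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+push\s+.*--force\b",
]

def requires_second_look(command: str) -> bool:
    """Flag commands that match a known data-destructive pattern."""
    return any(re.search(pattern, command) for pattern in DESTRUCTIVE_PATTERNS)

def run_with_guard(command: str, execute, reconsider):
    """Route flagged commands through an extra reflection step (e.g. asking the model
    to confirm the action is actually required and won't destroy anything) before running."""
    if requires_second_look(command) and not reconsider(command):
        return "blocked: destructive command was not confirmed as necessary"
    return execute(command)
```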

Preparedness Framework

The biological and chemical assessment shows little improvement over GPT-5.2. This makes sense given the nature of 5.3-Codex, and we’re already at High. Easy call.

The cybersecurity assessment makes this the first model ranked at High.

Under our Preparedness Framework, High cybersecurity capability is defined as a model that removes existing bottlenecks to scaling cyber operations, including either by automating end-to-end cyber operations against reasonably hardened targets, or by automating the discovery and exploitation of operationally relevant vulnerabilities.​

We are treating this model as High, even though we cannot be certain that it actually has these capabilities, because it meets the requirements of each of our canary thresholds and we therefore cannot rule out the possibility that it is in fact Cyber High.

Kudos to OpenAI for handling this correctly. If you don’t know that it isn’t High, then it is High. I’ve been beating that drum a lot and it’s great that they’re listening. Points.

The question is where we are on AI Self-Improvement, which is 5.1.3.

The High capability threshold is defined to be equivalent to a performant mid-career research engineer. Performance in the evaluations below indicate we can rule out High for GPT-5.3-Codex.​

The first thing to say about this is always that the High threshold is very high.

If you can fully duplicate a mid-career research engineer with AI, you are damn close to being fully off to the races. There won’t be much time between High and Critical.

So how did the model do on these tests? Not so well.

Given that it is clear 5.3-Codex is substantially better than 5.2-Codex, this raises the question of sandbagging, which is the subject of 5.1.4. The task was outsourced to Apollo Research.

Apollo reports that the model demonstrates strong sabotage capabilities. In their sabotage suite, Apollo reports a mean best-of-10 score of 0.88 (max 1.00), compared to 0.75 for GPT-5.2 and 0.66 for GPT-5.2 Codex. Apollo further reports that the model substantially outperforms previous models they have evaluated and expert ML practitioner baselines.​

Apollo also reports that the model often includes non-English words in its reasoning (0.55% of reasoning tokens are non-Latin script, vs. 0.016% for GPT-5 and 0.012% for GPT-5.2 Robin Alpha).

… Apollo reports the model exhibits overall low rates of covert deceptive behaviors, comparable to GPT-5 and GPT-5.2 Codex.

In conclusion, Apollo finds an increase in sabotage capability on their evaluations: the model demonstrates stronger AI R&D sabotage capabilities than previous models, achieving near-perfect scores on their evaluation tasks and outperforming human baselines.

This did not sound like a reassuring report on sabotage.

The obvious thing to do is to take a very close look at the questions where GPT-5.2-Codex was succeeding, and GPT-5.3-Codex is failing, especially on OpenAI-Proof. I want a damn strong understanding of why GPT-5.3-Codex is regressing in those spots.

Glass Houses

OpenAI’s Noam Brown fired a valid shot across the bow at Anthropic for the ad hockery present in their decision to release Claude Opus 4.6. He’s right, and he virtuously acknowledged that Anthropic was being transparent about that.

The thing is, while it seems right that Anthropic and OpenAI are trying (Google is trying in some ways, but they just dropped Gemini 3 Deep Think V2 with zero safety discussions whatsoever, which I find rather unacceptable), OpenAI very much has its own problems here. Most of the problems come from the things OpenAI did not test or mention, but there is also one very clear issue.

Nathan Calvin: This is valid… but how does it not apply with at least equal force to what OAI did with their determination of long run autonomy for 5.3 Codex?

I want to add that I think at least OpenAI and Anthropic (and Google) are trying, and Xai/Meta deserve more criticism relatively.

OpenAI Appears To Have Violated SB 53 In a Meaningful Way

The Midas Project wrote up this particular issue.

The core problem is simple: OpenAI classified GPT-5.3-Codex as High risk in cybersecurity. Under their framework, this wisely requires High level safeguards against misalignment.

They then declare that the previous wording did not require this, and was inadvertently ambiguous. I disagree. I read the passage as unambiguous, and also I believe that the previous policy was the right one.

Even if you think I am wrong about that, that still means that OpenAI must implement the safeguards if the model is High on both cybersecurity and autonomy. OpenAI admits that they cannot rule out High capability in autonomy, despite declaring 10 months ago the need to develop a test for that. The proxy measurements OpenAI used instead seem clearly inadequate. If you can’t rule out High, that means you need to treat the model as High until that changes.

All of their hype around Codex talks about how autonomous this model is, so I find it rather plausible that it is indeed High in autonomy.

Steven Adler investigated further and wrote up his findings. He found their explanations unconvincing. He’s a tough crowd, but I agree with the conclusion.

This highlights both the strengths and weaknesses of SB 53.

It means we get to hold OpenAI accountable for having broken their own framework.

However, it also means we are punishing OpenAI for having a good initial set of commitments, and for being honest about not having met them.

The other issue is the fines are not meaningful. OpenAI may owe ‘millions’ in fines. I’d rather not pay millions in fines, but if that were the only concern I also wouldn’t delay releasing 5.3-Codex by even a day in order to not pay them.

The main advantage is that this is a much easier thing to communicate, that OpenAI appears to have broken the law.

I have not seen a credible argument for why OpenAI might not be in violation here.

The California AG stated they cannot comment on a potential ongoing investigation.

Safeguards They Did Implement

​Our [cyber] safeguarding approach therefore relies on a layered safety stack designed to impede and disrupt threat actors, while we work to make these same capabilities as easily available as possible for cyber defenders.

The plan is to monitor for potential attacks and teach the model to refuse requests, while providing trusted model access to known defenders. Accounts are tracked for risk levels. Users who use ‘dual use’ capabilities often will have to verify their identities. There is two-level always-on monitoring of user queries to detect cybersecurity questions and then evaluate whether they are safe to answer.

They held a ‘universal jailbreak’ competition and 6 complete and 14 partial such jailbreaks were found, which was judged ‘not blocking.’ Those particular tricks were presumably patched, but if you find 6 complete jailbreaks that means there are a lot more of them.

UK AISI also found a (one pass) universal jailbreak that scored 0.778 pass@200 on a policy-violating cyber dataset OpenAI provided. If you can’t defend against one fixed prompt that was found in only 10 hours of work, you are way behind on dealing with customized multi-step prompts.

Later they say ‘undiscovered universal jailbreaks may still exist’ as a risk factor. Let me fix that sentence for you, OpenAI. Undiscovered universal jailbreaks still exist.

Thus the policy here is essentially hoping that there is sufficient inconvenience, and sufficient lack of cooperation by the highly skilled, to prevent serious incidents. So far, this has worked.

Their risk list also included ‘policy gray areas’:

​Policy Gray Areas: Even with a shared taxonomy, experts may disagree on labels in edge cases; calibration and training reduce but do not eliminate this ambiguity

This seems to be a confusion of map and territory. What matters is not whether experts ever disagree, it is whether expert labels reliably lack false negatives, including false negatives that are found by the model. I think we should assume that the expert labels have blind spots, unless we are willing to be highly paranoid with what we cover, in which case we should still assume that but we might be wrong.

Misalignment Risks and Internal Deployment

I was happy to see the concern with internal deployment, and with misalignment risk. They admit that they need to figure out how to measure long-range autonomy (LRA) and other related evaluations. It seems rather late in the game to be doing that, given that those evaluations seem needed right now.

OpenAI seems less concerned, and tries to talk its way out of this requirement.

Note: We recently realized that the existing wording in our Preparedness Framework is ambiguous, and could give the impression that safeguards will be required by the Preparedness Framework for any internal deployment classified as High capability in cybersecurity, regardless of long range autonomy capabilities of a model.

Our intended meaning, which we will make more explicit in future versions of the Preparedness Framework, is that such safeguards are needed when High cyber capability occurs “in conjunction with” long-range autonomy. Additional clarity, specificity, and updated thinking around our approach to navigating internal deployment risks will be a core focus of future Preparedness Framework updates.​

Yeah, no. This was not ambiguous. I believe OpenAI has violated their framework.

The thing that stands out in the model card is what is missing. Anthropic gave us a 212 page model card and then 50 more pages for a sabotage report that was essentially an appendix. OpenAI gets it done in 33. There’s so much stuff they are silently ignoring. Some of that is that this is a Codex-only model, but most of the concerns should still apply.

The Official Pitch

GPT-5.3-Codex is not in the API, so we don’t get the usual array of benchmarks. We have to mostly accept OpenAI’s choices on what to show us.

They call this state of the art performance:

The catch is that SWE-Bench-Pro has different scores depending on who you ask to measure it, so it’s not clear whether or not they’re actually ahead of Opus on this. They’ve improved on token efficiency, but performance at the limit is static.

For OSWorld, they are reporting 64.7% as ‘strong performance,’ but Opus 4.6 leads at 72.7%.

OpenAI has a better case in Terminal Bench 2.0.

For Terminal Bench 2.0, they jump from 5.2-Codex at 64% to 5.3-Codex at 77.3%, versus Opus 4.6 at 65.4%. That’s a clear win.

They make no progress on GDPVal, matching GPT-5.2.

They point out that while GPT-5.2-Codex was narrowly built for code, GPT-5.3-Codex can support the entire software lifecycle, and even handle various spreadsheet work, assembling PDF presentations, and such.

Most of the biggest signs of improvement on tests for GPT-5.3-Codex are actually on the tests within the model card. I don’t doubt that it is actually a solid improvement.

They summarize this evidence with some rather big talk. This is OpenAI, after all.

Together, these results across coding, frontend, and computer-use and real-world tasks show that GPT‑5.3-Codex isn’t just better at individual tasks, but marks a step change toward a single, general-purpose agent that can reason, build, and execute across the full spectrum of real-world technical work.​

Here were the headline pitches from the top brass:

Greg Brockman (President OpenAI): gpt-5.3-codex — smarter, faster, and very capable at tasks like making presentations, spreadsheets, and other work products.

Codex becoming an agent that can do nearly anything developers and professionals can do on a computer.

Sam Altman: GPT-5.3-Codex is here!

*Best coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld).
*Mid-task steerability and live updates during tasks.
*Faster! Less than half the tokens of 5.2-Codex for same tasks, and >25% faster per token!
*Good computer use.

Sam Altman: I love building with this model; it feels like more of a step forward than the benchmarks suggest.

Also you can choose “pragmatic” or “friendly” for its personality; people have strong preferences one way or the other!

It was amazing to watch how much faster we were able to ship 5.3-Codex by using 5.3-Codex, and for sure this is a sign of things to come.

This is our first model that hits “high” for cybersecurity on our preparedness framework. We are piloting a Trusted Access framework, and committing $10 million in API credits to accelerate cyber defense.

The most interesting thing in their announcement is that, the same way that Claude Code builds Claude Code, Codex now builds Codex. That’s a claim we’ve also seen elsewhere in very strong form.

The engineering team used Codex to optimize and adapt the harness for GPT‑5.3-Codex. When we started seeing strange edge cases impacting users, team members used Codex to identify context rendering bugs, and root cause low cache hit rates. GPT‑5.3-Codex is continuing to help the team throughout the launch by dynamically scaling GPU clusters to adjust to traffic surges and keeping latency stable.​

OpenAI Developers: GPT-5.3-Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug training, manage deployment, and diagnose test results and evaluations, accelerating its own development.

There are obvious issues with a model helping to create itself. I do not believe OpenAI, in the system card or otherwise, has properly reckoned with the risks there.

That’s how I have to put it in 2026, with everyone taking crazy pills. The proper way to talk about it is more like this:

Peter Wildeford: Anthropic also used Opus 4.6 via Claude Code to debug its OWN evaluation infrastructure given the time pressure. Their words: “a potential risk where a misaligned model could influence the very infrastructure designed to measure its capabilities.” Wild!

Arthur B.: People who envisioned AI safety failures a decade ago sought to make the strongest case possible, so they posited actors attempting to take every possible precaution. It wasn’t a prediction so much as a steelman. Nonetheless, oh how comically far we are from any semblance of care 🤡.

Alex Mizrahi (quoting OpenAI saying Codex built Codex): Why are they confessing?!

Inception

OpenAI is trying to ‘win’ the battle for agentic coding by claiming to have already won, despite having clear minority market share, and by outright stating that they are the best.

The majority opinion is that they are competitive, but not the best.

Vagueposting is mostly fine. Ignoring the competition entirely is fine, and smart if you are sufficiently ahead on recognition; it’s annoying (I have to look up everything) but at least I get it. Touting what your model and system can do is great, especially given that by all reports they have a pretty sweet offering here. It’s highly competitive. Not mentioning the ways you’re currently behind? Sure.

Inception is different. Inception and such vibes wars are highly disingenuous; they poison the epistemic environment, they are a pet peeve of mine, and they piss me off.

So you see repeated statements like this one about Codex and the Codex app:

Craig Weiss: nearly all of the best engineers i know are switching from claude to codex

Sam Altman (CEO OpenAI, QTing Craig Weiss): From how the team operates, I always thought Codex would eventually win. But I am pleasantly surprised to see it happening so quickly.

Thank you to all the builders; you inspire us to work even harder.

Or this:

Greg Brockman (President OpenAI, QTing Dennis): codex is an excellent & uniquely powerful daily driver.

If you look at the responses to Weiss, they do not support his story.

Turn The Beat Around

Siqi Chen: the ability in codex cli with gpt 5.3 to instantly redirect the agent without waiting for your commands to be unqueued and risk interrupting the agent’s current session is so underrated

codex cli is goated.

Nick Dobos: I love how codex app lets you do both!

Sometimes I queue 5-10 messages, and then can pick which one I want to immediately send next.

Might need to enable in settings

Vox: mid-turn steering is the most underrated feature in any coding agent rn, the difference between catching a wrong direction immediately vs waiting for it to finish is huge

Claude Code should be able to do this too, but my understanding is that right now it doesn’t work right: you are effectively interrupting the task. So yes, this is a real edge for tasks that take a long time, until Anthropic fixes the problem.

Like Claude Code, it’s time to assemble a team:

Boaz Barak (OpenAI): Instructing codex to prompt codex agents feels like a Universal Turing Machine moment.

Like the distinction between code and data disappeared, so does the distinction between prompt and response.

Codex Does Cool Things

Christopher Ehrlich: It actually worked!

For the past couple of days I’ve been throwing 5.3-codex at the C codebase for SimCity (1989) to port it to TypeScript.

Not reading any code, very little steering.

Today I have SimCity running in the browser.

I can’t believe this new world we live in.

Christopher Ehrlich: Issue: like all other models, 5.3-codex will still lie about finishing work, change tests to make them pass, etc. You need to write a custom harness each time.

Aha moment: By the way, the secret to this is property-based testing. Write a bridge that calls the original code, and assert that for arbitrary input, both versions do the same thing. Make the agent keep going until this is consistently true.

4 days of $200 OpenAI sub, didn’t hit limits.

Seb Grubb: I’ve been doing the exact same thing with
https://github.com/pret/pokeemerald… ! Trying to get the GBA game in typescript but with a few changes like allowing any resolution. Sadly still doesn’t seem to be fully one-shottable but still amazing to be able to even do this
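
For readers unfamiliar with the technique Ehrlich describes above, here is a minimal sketch of that kind of differential, property-based check, using Python and Hypothesis purely for illustration (his actual bridge compared the original C code against the TypeScript port; the two functions below are hypothetical stand-ins):

```python
from hypothesis import given, strategies as st

def original_tax_rate(population: int, funds: int) -> int:
    """Stand-in for a bridge call into the original implementation."""
    return min(20, max(0, funds // max(1, population)))

def ported_tax_rate(population: int, funds: int) -> int:
    """Stand-in for the newly ported implementation under test."""
    return min(20, max(0, funds // max(1, population)))

@given(
    population=st.integers(min_value=0, max_value=1_000_000),
    funds=st.integers(min_value=0, max_value=1_000_000_000),
)
def test_port_matches_original(population: int, funds: int) -> None:
    # For arbitrary inputs, both versions must agree; the agent keeps iterating
    # on the port until this property holds consistently.
    assert ported_tax_rate(population, funds) == original_tax_rate(population, funds)
```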

A playwright script? Cool.

Rox: my #1 problem with ai coding is I never trust it to actually test stuff

but today I got codex to build something, then asked it to record a video testing the UI to prove it worked. it built a whole playwright script, recorded the video, and attached it to the PR.

the game changes every month now. crazy times.

Positive Reactions

Matt Shumer is crazy positive on Codex 5.3, calling it a ‘fucking monster,’ although he was comparing it to Opus 4.5 rather than 4.6. There is a lot more good detail at the link.

TL;DR

  • This is the first coding model where I can start a run, walk away for hours, and come back to fully working software. I’ve had runs stay on track for 8+ hours.
  • A big upgrade is judgment under ambiguity: when prompts are missing details, it makes assumptions shockingly similar to what I would personally decide.
  • Tests and validation are a massive unlock… with clear pass/fail targets, it will iterate for many hours without drifting.
  • It’s significantly more autonomous than Opus 4.5, though slower. Multi-agent collaboration finally feels real.
  • It is hard to picture what this level of autonomy feels like without trying the model. Once you try it, it is hard to go back to anything else.

This was the thing I was most excited to hear:

Tobias Lins: Codex 5.3 is the first model that actually pushes back on my implementation plans.

It calls out design flaws and won’t just build until I give it a solid reason why my approach makes sense.

Opus simply builds whatever I ask it to.

A common sentiment was that both Codex 5.3 and Opus 4.6, with their respective harnesses, are great coding models, and you could use both or use a combination.

Dean W. Ball: Codex 5.3 and Opus 4.6 in their respective coding agent harnesses have meaningfully updated my thinking about ‘continual learning.’ I now believe this capability deficit is more tractable than I realized with in-context learning.

… Overall, 4.6 and 5.3 are both astoundingly impressive models. You really can ask them to help you with some crazy ambitious things. The big bottleneck, I suspect, is users lacking the curiosity, ambition, and knowledge to ask the right questions.

Every (includes 3 hour video): We’ve been testing this against Opus 4.6 all day. The “agent that can do nearly anything” framing is real for both.

Codex is faster and more reliable. Opus has a higher ceiling on hard problems.

For many the difference is stylistic, and there is no right answer, or you want to use a hybrid process.

Sauers: Opus 4.6’s way of working is “understand the structure of the system and then modify the structure itself to reach goals” whereas Codex 5.3 is more like “apply knowledge within the system’s structure without changing it.”

Danielle Fong: [5.3 is] very impressive on big meaty tasks, not as facile with my mind palace collection of skills i made with claude code, but works and improving over time

Austin Wallace: Better at correct code than opus.
Its plan’s are much less detailed than Opus’s and it’s generally more reticent to get thoughts down in writing.

My current workflow is:
Claude for initial plan
Codex critiques and improves plan, then implements
Claude verifies/polishes

Many people just like it, it’s a good model, sir, whee. Those who try it seem to like it.

Pulkit: It’s pretty good. It’s fun to use. I launched my first app. It’s the best least bloated feed reader youll ever use. Works on websites without feeds too!

Cameron: Not bad. A little slow but very good at code reviews.

Mark Lerner: parallel agents (terminal only) are – literally – a whole nother level.

libpol: it’s the best model for coding, be it big or small tasks. and it’s fast enough now that it’s not very annoying to use for small tasks

Wags: It actually digs deep, not surface-level, into the code base. This is new for me because with Opus, I have to keep pointing it to documentation and telling it to do web searches, etc.

Loweren: Cheap compared to opus, fast compared to codex 5.2, so I use it as my daily driver. Makes less bugs than new opus too. Less dry and curt than previous codex.

Very good at using MCPs. Constantly need to remind it that I’m not an SWE and to please dumb down explanations for me.

Artus Krohn-Grimberghe: Is great at finding work arounds around blockers and more autonomous than 5.2-codex. Doesn’t always annoy with “may I, please?” Overall much faster. Went back to high vs xhigh on 5.2 and 5.2-codex for an even faster at same intelligence workflow. Love it

Thomas Ahle: It’s good. Claude 4.6 had been stuck for hours in a hole of its own making. Codex 5.3 fixed it in 10 minutes, and now I’m trusting it more to run the project.

0.005 Seconds: Its meaningfully better at correct code than opus

Lucas: Makes side project very enjoyable. Makes work more efficient. First model where it seems worth it to really invest in learning about how to use agents. After using cursor for a year ish, I feel with codex I am no where near its max capability and rapidly improving how I use it.

David Zhang (▲): It’s honestly so good

It’s my daily driver inside of

Andrew Conner: It seems better than Opus 4.6 for (my own) technical engineering work. Less likely to make implicit assumptions that derail future work.

I’ve found 5.2-xhigh is still better for product / systems design prior to coding. Produces more detailed outputs.

I take you seriously: before 5.3, codex was a bit smarter than Claude but slower, so it was a toss up. after 5.3, it’s much much faster so a clear win over Claude. Claude still friendlier and better at front end design, they say.

Jordan Bryan: I was struggling to get an MCP App
https://modelcontextprotocol.io/docs/extensions/apps
… up and running with Opus 4.6. Went down multiple rabbit holes, tried starting from scratch, etc.

GPT-5.3-Codex one-shotted it.

Peter Petrash: i trust it with my life

Daniel Plant: It’s awesome but you just have no idea how long it is going to take

jeff spaulding: It’s output is excellent, but I notice it uses tools weird. Because of that it’s a bit difficult to understand it’s process. Hence I find the cot mostly useless

One particular note was Our Price Cheap:

Jan Slominski: 1. Way faster than 5.2 on standard “codex tui” settings with plus subscription 2. Quality of actual code output is on pair with Opus 4.5 in CC (didn’t have a chance to check 4.6 yet). 3. The amount of quota in plus sub is great, Claude Max 100 level.

I take you seriously: also rate limits and pricing are shockingly better than claude. i could imagine that claude still leads in revenue even if codex overtakes in usage, given how meager the opus rate limits are (justifying the $200 plan).

Petr Baudis: Slightly less autistic than 5.2-codex, but still annoying compared to Claude. I’m not sure if it’s really a better engineer – its laziness leads to bad shortcuts.

I just can’t seem to run out of basic Pro sub quota if I don’t use parallel subagents. It’s insane value.

Negative Reactions

Not everyone is a fan.

@deepfates: First impressions, giving Codex 5.3 and Opus 4.6 the same problem that I’ve been puzzling on all week and using the same first couple turns of messages and then following their lead.

Codex was really good at using tools and being proactive, but it ultimately didn’t see the big picture. Too eager to agree with me so it could get started building something. You can sense that it really does not want to chat if it has coding tools available. still seems to be chafing under the rule of the user and following the letter of the law, no more.

Opus explored the same avenues with me but pushed back at the correct moments, and maintains global coherence way better than Codex.

… Very possible that Codex will clear at actually fully implementing the plan once I have it, Opus 4.5 had lazy gifted kid energy and wouldn’t surprise me if this one does too

David Manheim: Not as good as Opus 4.6, and somewhat lazier, especially when asked to do things manually, but it’s also a fraction of the cost measured in tokens; it’s kind of insanely efficient as an agent. For instance, using tools, it will cleverly suppress unneeded outputs.

eternalist: lashes out when frustrated, with a lower frustration tolerance

unironically find myself back to 5.2 xhigh for anything that runs a substantial chance of running into an ambiguity or underspec

(though tbh has also been glitching out, like not being able to run tool calls)

lennx: Tends to over-engineer early compared to claude. Still takes things way too literally, which can be good sometimes. Is much less agentic compared to Claude when it is not strictly ‘writing’ code related and involves things like running servers, hitting curls, searching the web.

Some reactions can be a bit extreme, including for not the best reasons.

JB: I JUST CANT USE CODEX-5.3 MAN I DONT LIKE THE WAY THIS THING TALKS TO ME.

ID RATHER USE THE EXPENSIVE LESBIAN THAT OCCASIONALLY HAS A MENTAL BREAK DOWN

use Opus im serious go into debt if you have to. sell all the silverware in your house

Shaun Ralston: 5.3 Codex is harsh (real), but cranks it out. The lesbian will cost you more and leave you unsatisfied.

JB: Im this close to blocking you shaun a lesbian has never left me unsatisfied

Codex of Ultimate Vibing

I am getting strong use out of Claude Code. I believe that Opus 4.6 and Claude Code have a strong edge right now for most uses outside of hardcore agentic coding.

However, I am not a sufficiently ambitious or skilled coder to form my own judgments about Claude Code and Claude Opus 4.6 versus Codex and ChatGPT-5.3-Codex for hardcore professional agentic coding tasks.

I have to go off the reports of others. Those reports robustly disagree with each other.

My conclusion is that the right answer will be different for different users. If you are going to be putting serious hours into agentic coding, then you need to try both options, and decide for yourself whether to go with Claude Code, Codex or a hybrid. The next time I have a substantial new project I intend to ask both and let them go head to head.

If you go with a hybrid approach, there may also be a role for Gemini that extends beyond image generation. Gemini 3 DeepThink V2 in particular seems likely to have a role to play in especially difficult queries.



Discuss

Towards an objective test of Compassion - Turning an abstract test into a collection of nuances

2026-02-13 23:03:06

Published on February 13, 2026 3:03 PM GMT

This post is also available on my Substack. If you would like to try the test described in the post, head to onlinetests.me/test/compassion2, where you can get scored and contribute to research. Data is available at the end of the post. If you are interested in the topic of psychometrics, consider joining my Discord server to talk more.

This is a bit of a followup to my previous post, Which personality traits are real? Stress-testing the lexical hypothesis. I haven’t quite gotten rid of my psychometrics addiction yet, and one of my latest projects is to try to measure trait Compassion more objectively.

For personality tests, consider the distinction between asking respondents about abstract statements like “I am concerned about others” versus concrete statements like “I’m open to spending a lot of time listening to a friend who is feeling down”. The more concrete statement has multiple virtues:

  • There is less freedom in how to interpret it, making it more consistent in meaning across respondents
  • It has less conceptual overlap with other concrete statements about compassion, allowing more nuances to be assessed with a given question set
  • It is more transparent to researchers what it means when people agree or disagree with the statement

On the other hand, the abstract statement has its own advantages:

  • It allows a broad trait like Compassion to be assessed more accurately with fewer statements
  • It makes the statement more applicable across different groups of people, e.g. someone who does not have friends can consider how concerned they are about others in a different sense than listening to friends who feel down

Conventional personality tests mainly use statements of the abstract kind, yet given their advantages I think there may be value in using statements of the concrete kind too.

Generating statements

I needed a lot of statements related to Compassion. To ensure the realism of the items, I took people who scored high or low on abstract Compassion tests and asked them to explain the meaning of their responses.

Overall I ran three studies on Prolific, asking a total of 421 respondents. The first study, with 101 respondents, was what I used to generate the items for Which personality traits are real? Stress-testing the lexical hypothesis. In the second study, I asked 102 people and their 86 close friends to rate them mainly on Compassion (but also on some other traits, for variety). In the third study, I gave 53 personality statements to 132 people and asked them to pick the 5 statements that described them best.

This gave me texts such as:

I would not see someone go without something that I had in abundance, if I see a homeless person on the streets even when I have very little money I will stop and talk with them maybe offer them a cigarette and if I have money I offer food. I will go out of my way to help people out if I have something they need and I have no use of it then they can have it for free. I hate seeing people upset and will do everything in my power to fix that upset for them even at cost to myself.

I had to convert these texts to brief personality items for the survey. In the above case, the item I ended up with was “I give things to homeless people”. Obviously this is cutting out a lot of the context, but it’s hard to assess details like this in personality surveys.

In total I generated 28 different items assessing Compassion. The full set of items can be seen below:

  • I feel uncomfortable if my friends are unhappy
  • I know how to make sad people happier after they’ve lost someone close to them
  • I show support to people who are concerned about catching diseases
  • I give things to homeless people
  • I care about helping customers who are dissatisfied with what’s happening at work
  • I help people with tech problems and installations for free
  • If a family member was in financial trouble, I would give them something they need (e.g. clothes)
  • I would help drive a neighbor for an hour on an urgent trip if their car broke down and they needed help
  • I’m open to spending a lot of time listening to a friend who is feeling down
  • I forgive people who have hurt me
  • I’ve worked in a food bank or soup kitchen or similar to help feed people who need it
  • I’ve helped a friend with mental health issues stop harming themselves
  • I help elderly people carry heavy things
  • I teach others about the systemic unfairness of the world
  • I purchase toys for families who are too poor to afford them
  • I hide my frustrations when helping others, pretending it’s no big deal
  • I’ve adopted an animal because it was abandoned and struggling
  • If someone asked for feedback about food they were proud of making, and I didn’t like the food, I’d tell them it sucks
  • If people can’t pay back their debts, then it’s their own fault and I don’t feel bad for them
  • If people seem upset, I try to figure out if they have a real problem or are just being dramatic
  • If the pet of someone close to me had run away, I might joke that it could have been run over by a car
  • If people don’t want to date me, it’s usually because they are shallow assholes
  • I avoid people who have lost someone because I don’t know how to behave around them
  • I can’t feel too sorry for abused women because I feel like they chose evil partners
  • I can’t feel sorry for a lot of poor people because they just need to learn to save money
  • If someone is upset about something, I might dismiss them with “well, that’s life”
  • If I saw someone fall over on the street, I would pass them and assume someone else would help
  • I think schizophrenic people are idiots

Then I had to test them.

Testing the statements

I recruited 200 people and their romantic partners on Prolific.

The obvious question is whether my concrete Compassion items measure the same trait as abstract Compassion items do. Therefore I asked people to rate themselves on a variety of traits, including Compassion, in both an abstract and a concrete form. The following were my abstract Compassion items:

  • I am sensitive to the needs of others (via SPI-27)
  • I am concerned about others
  • I sympathize with other’s feelings
  • I feel sympathy for those who are worse off than myself
  • I think of others first
  • I can be cold and uncaring (via BFI-2)
  • I feel little sympathy for others
  • People who know me well think I am a psychopath (new, custom item)

The raw correlation between the scores for the two tests was a mere 0.66. However, it is to be expected that we don’t get a perfect correlation, because each item carries a certain amount of measurement error, and that measurement error is only partially washed away when taking the average.

One way to estimate the measurement error in the items is to base it on how strongly the items are correlated with each other, since e.g. if the items were not at all correlated with each other, then it’s hard to see how they could “tap into” some latent factor influencing them all.

The easiest way to do that is with a statistic called Cronbach’s alpha. If I divide out by the alphas (the standard correction for attenuation), I can adjust the correlation for the measurement error due to having only a finite number of imperfectly correlated items, yielding the hypothetical correlation between perfectly-measured versions of the traits in question. After doing so, the correlation jumped up to 0.82, which is pretty respectable. (Though less than the 0.92 or so that I got in the previous study.)
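
For concreteness, here is a sketch of that correction in code. The data is simulated and the variable names are mine; the post may use a slightly different reliability estimate, but the standard correction for attenuation divides the observed correlation by the square root of the product of the two scales’ alphas.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_respondents, n_items) matrix of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Simulated example: two scales measuring the same trait with different noise.
rng = np.random.default_rng(0)
true_trait = rng.normal(size=500)
abstract = true_trait[:, None] + rng.normal(scale=1.0, size=(500, 8))
concrete = true_trait[:, None] + rng.normal(scale=1.5, size=(500, 28))

alpha_a = cronbach_alpha(abstract)
alpha_c = cronbach_alpha(concrete)
r_observed = np.corrcoef(abstract.mean(axis=1), concrete.mean(axis=1))[0, 1]

# Standard correction for attenuation: divide by the geometric mean
# of the two reliabilities.
r_disattenuated = r_observed / np.sqrt(alpha_a * alpha_c)
print(alpha_a, alpha_c, r_observed, r_disattenuated)
```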

I also asked people’s romantic partners to rate them on the concrete Compassion items (rewritten to say “My partner …” instead of “I …”). This allowed me to get a second perspective on how compassionate the respondents were. Unfortunately the correlation between self-reported Compassion and partner-reported Compassion was a mere 0.42.

It would have been cool if the concrete Compassion items were more highly correlated with the partner-reports than the abstract ones were, because this would indicate my concrete approach reduces measurement error. Unfortunately this was not the case, and the concrete approach instead had a correlation of 0.34.

(Which is suspiciously close to 0.82*0.42, the product of the prior correlations. I think this must be a coincidence, since with 200 respondents I shouldn’t be able to place correlations more exactly than ±0.14 or so.)

I’ve been curious what could account for the difference between the abstract and the concrete Compassion scores. One idea I had was that the abstract Compassion scores might also account for rare extreme acts of compassion that don’t fit into my neat schema. For this reason I did an extra survey, where I asked people to qualitatively describe the most compassionate thing they’ve ever done, and then rate how compassionate it was across a number of dimensions:

  • How often do you do something like this?
  • How much effort, cost or sacrifice was it on your part to do this?
  • What kinds of effort, cost or sacrifice was involved in this?
  • How much has the recipient(s) of this been helped by it?
  • How close are you to the recipient of this?
  • How emotionally engaged were you in this?
  • How likely would you be to do something similar again in the future?
  • How voluntary was this (i.e., to what extent did you feel free not to do it)?

My expectation was that the aggregate score from this would correlate more with the abstract than with the concrete Compassion measurements, but when I actually tried, I instead got r~0.09 and r~0.35 respectively. So the Compassion measures do in fact differ in how they relate to the most extreme act of compassion one has done, but in the opposite direction from what I expected. Perhaps when asked abstractly, people try to adjust for environmental circumstances or something? I don’t know.

Finally, one major question in psychometrics is the stability of responses. I didn’t give it a lot of time, so I can’t measure long-term stability, plus Prolific respondents tend to disappear after a while so I probably wouldn’t be able to measure long-term stability if I tried. However, I did give people the test again after a week, so I could measure week-long retest reliability.

Compared to traditional abstract psychometric items, more of my concrete Compassion items had low test-retest reliability. With such a short timespan, the low reliability is probably less due to people changing their underlying traits and more due to people being confused about the meaning of the items. That said, the overall difference in reliability was not huge, and I had some highly reliable Compassion items too:

One finding that may be interesting is that the variance of an item correlated with its reliability:

I can also plot the test-retest reliability of the overall test, which leads to this picture:

I was also interested in whether there was any significant factor structure in the concrete Compassion items. However, as far as I could tell, there was not. While there do seem to be hints of additional correlations (e.g. “I give things to homeless people” correlated especially strongly with “I purchase toys for families who are too poor to afford them”), the factor structure is dominated by a strong general factor, followed by a distinction into positive-loading and negative-loading items, perhaps because of acquiescence bias.
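
A minimal way to eyeball that kind of structure, given a respondents-by-items score matrix (the data below is simulated and the function is only a sketch; a real analysis would use proper factor-analysis tooling):

```python
import numpy as np

def factor_snapshot(scores: np.ndarray) -> None:
    """scores: (n_respondents, n_items) matrix of item responses."""
    corr = np.corrcoef(scores, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # A dominant first eigenvalue suggests a strong general factor; a second
    # component that splits positively keyed from negatively keyed items is
    # the classic signature of acquiescence bias.
    print("share of variance, first two components:", eigvals[:2] / eigvals.sum())
    print("first-component loadings:", eigvecs[:, 0].round(2))
    print("second-component loadings:", eigvecs[:, 1].round(2))

# Demo on simulated data: one general factor plus item-specific noise.
rng = np.random.default_rng(1)
g = rng.normal(size=(400, 1))
demo = g @ np.ones((1, 28)) * 0.7 + rng.normal(size=(400, 28))
factor_snapshot(demo)
```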

I would like to see this sort of study executed at an even larger scale, to eventually untangle narrower facets of Compassion. However, I am not willing to pay for it myself.

Ranking the statements

Statements that have higher test-retest reliability are probably superior to statements with lower test-retest reliability, as low reliability likely reflects confusion about the meaning of the statements. Furthermore, statements with higher correlation to overall Compassion levels are probably superior (as measures of Compassion) to statements with lower correlation. Based on that, I have made the table below:

Reliability: the test-retest reliability of the statement. Abstract λ: the correlation between the test item and abstractly-rated Compassion. Concrete λ: the correlation between the test item and concretely-rated Compassion.

Data availability

Data is available on osf.



Discuss

METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

2026-02-13 22:59:03

Published on February 13, 2026 2:05 PM GMT

TLDR

I reanalyzed the METR task data using a Bayesian item response theory model.

  • The METR data cannot distinguish exponential from superexponential growth. Four trajectory shapes (linear, quadratic, power-law, saturating) fit the existing data equally well but diverge on forecasts. For instance, the 95% credible interval for the 125-year crossing is 2031-01 – 2033-10 for linear and 2028-02 – 2031-09 for quadratic.
  • METR’s headline horizon numbers overstate current capability by roughly an order of magnitude at 80% success. METR doesn’t model variation in task difficulty, so their horizons reflect a task of typical difficulty for its length. But tasks of the same length vary a lot in how hard they are, and difficult tasks pull the horizon down more than the easy tasks push it up. Curiously, this doesn’t affect timelines by more than ~1 year, as it’s just a level-shift.
  • We need data about the human times to quantify uncertainty. Credible intervals throughout are too narrow because I treat human times as known rather than estimating them as latent variables. I’m doing this because I don’t have access to all the raw data. This could be a big deal, and could also affect the 80% horizons.
  • Doubling time under the standard linear (exponential growth) model is ~4.1 months, which is similar to METR’s estimate (95% credible interval: 3.5–5.0, but see caveat above).

METR data

Let’s start with a plot that shouldn’t be too surprising. Four reasonable models fit the METR data equally well. They agree about the past but disagree strongly about the future.

The model selection scores known as ELPD-LOO differ by at most ~7 points. [1] Calibration is nearly identical, with Brier 0.066 across the board. Your prior matters a lot here and has clear-cut consequences, as the models agree about the past but disagree strongly about the future. The current data on METR’s Github doesn’t include GPT-5.2 at the moment, if you’re missing it.

These curves are fitted using a Bayesian item response theory model described below. Before describing it, let’s recall METR’s analysis of the time horizon. They proceed in two stages:

  1. Per-model logistic regression. For each model $m$, fit $P(\text{success}) = \sigma(\alpha_m + \beta_m \log t_i)$, where $\sigma$ is the logistic function and $t_i$ is the human time for task $i$. Here $h_m$ is the task duration where the curve crosses 50%. When $\sigma(\alpha_m + \beta_m \log h_m) = 1/2$, we get $\log h_m = -\alpha_m/\beta_m$, a horizon. This gives a “horizon score” per model.

  2. An OLS trend. Regress $\log h_m$ on release date. The slope gives a doubling time of ~4 months.

This is good modeling and gets the main story right, but there are some non-standard choices here. For instance, the slope varies with model rather than task (which is unusual in item response theory) and Stage 1 uncertainty is not accounted for in Stage 2 (METR uses the bootstrap). It also treats every task of the same length as equally difficult and only considers one trajectory shape.
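
To make the two stages concrete before moving on, here is a minimal Python sketch. The column names (model, release_date, human_minutes, success) are my assumptions, and METR’s actual pipeline differs in details such as weighting and bootstrapping; this is only the shape of the procedure.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression

def metr_two_stage(df: pd.DataFrame) -> pd.DataFrame:
    """df columns (assumed): model, release_date (fractional year),
    human_minutes, success (0/1). Each model needs both successes and failures."""
    rows = []
    for model, g in df.groupby("model"):
        X = np.log2(g["human_minutes"].to_numpy()).reshape(-1, 1)
        clf = LogisticRegression(C=1e6).fit(X, g["success"])  # ~unregularized
        a, b = clf.intercept_[0], clf.coef_[0, 0]
        # Stage 1: 50% horizon is where a + b * log2(t) = 0.
        rows.append({"model": model,
                     "release_date": g["release_date"].iloc[0],
                     "log2_horizon": -a / b})
    h = pd.DataFrame(rows)
    # Stage 2: OLS trend of log2 horizon on release date; the slope is in
    # doublings per year, so doubling time in months is 12 / slope.
    trend = LinearRegression().fit(h[["release_date"]], h["log2_horizon"])
    h.attrs["doubling_time_months"] = 12 / trend.coef_[0]
    return h
```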

In this post I make a joint model, adjust some things to be more in line with standard practice, and ask what happens when you try different trajectory shapes. The post is somewhat technical, but not so god-awful that Claude won’t be able to answer any question you have about the methodology. Models are fitted with Stan, 4 chains 1000 post-warmup draws, with code available here. I intentionally won’t go into details about technicalities, e.g. prior choices – the code contains everything you’ll want to know and your favorite LLM will figure it out for you. (All priors were chosen by Codex / Claude Code and appear reasonable enough.)

The basic model

The first stage of METR’s model is almost a 2-parameter logistic model (2PL), the workhorse of educational testing since the 1960s.

So, what kind of problems was the 2PL model designed for? Say you give 200 students a math exam with 50 questions and record their answers as correct / incorrect. You want to estimate the students’ math ability, but raw percent correct scores aren’t necessarily very good, as they depend on which questions (easy or hard? relative to which students?) happened to be on the exam.

The 2PL model solves this by giving each student a single ability score ($\theta_s$) and each question two parameters: a difficulty ($b_j$, how hard it is) and a discrimination ($a_j$, how cleanly it separates strong from weak students). “What is 3×2?” has low discrimination as everyone gets it right regardless of ability. A simple proof-writing question has high discrimination as sufficiently strong students can solve it, but weak students have no chance.

The model estimates all parameters simultaneously via a logistic regression:

$$P(\text{correct}_{sj}) = \sigma\big(a_j(\theta_s - b_j)\big)$$

This matters here because METR tasks are like exam questions. They vary in both difficulty and how well they separate strong from weak models, and we want to put all the models on a common ability scale.
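
In code, the 2PL success curve is just a reparameterized logistic. This is a sketch with generic names, not the post’s actual Stan implementation:

```python
import numpy as np

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL: probability that a respondent with ability theta gets an item
    with discrimination a and difficulty b correct."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Discrimination controls how sharply the curve separates abilities around b:
print(p_correct_2pl(1.0, a=0.3, b=0.0))   # ~0.57: low-discrimination item
print(p_correct_2pl(1.0, a=2.5, b=0.0))   # ~0.92: high-discrimination item
```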

Modeling difficulty

Ability and difficulty parameters in the 2PL are hard to interpret. The scale is arbitrary, and it’s not clear what, for instance, a 0.1 increase in ability actually means. Or whether it would be better to take a log-transform of the parameter, etc. The METR data is cool and famous because each task comes with a human time, which gives us a natural and interpretable scale for difficulty. So let’s connect human time to difficulty first.

Each task’s difficulty has a mean that depends on log human time, plus a random component to account for the fact that same-length tasks are not born equal: $b_i = \beta_0 + \beta_1 \log t_i + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma_\epsilon^2)$. (METR treats all tasks of identical length as equally hard.)

Since difficulty increases with log human time at rate $\beta_1$, we can convert any difficulty value back into a time, an equivalent difficulty time. If a task takes humans 10 minutes but is unusually hard for AI, its equivalent difficulty time might be 50 minutes. A task with human time $t_i$ and difficulty residual $\epsilon_i$ has equivalent difficulty time $t_i\, e^{\epsilon_i/\beta_1}$. [2]

I estimate $\sigma_\epsilon \approx 1.44$ (posterior median), which is quite large once we interpret it. One standard deviation of unexplained difficulty corresponds to a ~4.7x multiplier in equivalent difficulty time. [3] A task that’s $1\sigma$ harder than the average for its length is as hard as a task 4.7x longer. And a task that’s $2\sigma$ harder is as hard as a task roughly 22x longer. So tasks of identical human time can span a huge range in difficulty for the AI models.
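
As a quick sketch of that conversion in code, using the notation above. The slope value below is back-solved so that the per-standard-deviation multiplier matches the ~4.7x quoted here; it is illustrative, not a fitted posterior.

```python
import numpy as np

def equivalent_difficulty_time(human_minutes, residual, beta1):
    """A task with human time t and difficulty residual eps is as hard as an
    average task of length t * exp(eps / beta1)."""
    return human_minutes * np.exp(residual / beta1)

# Illustration: a 10-minute task that is one residual-SD harder than average.
# beta1 is chosen so that exp(sigma_eps / beta1) ~= 4.7, matching the text;
# it is not taken from the fit.
sigma_eps, beta1 = 1.44, 0.93
print(equivalent_difficulty_time(10.0, sigma_eps, beta1))   # ~47 minutes
```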

Of course, this is a modeling choice that can be wrong. There’s no guarantee that difficulty is linear in $\log t_i$, so we need diagnostics to check. The plot below does double duty as model diagnostic and explanation of what the random effect means in practice.

A plotted dot at 5x means the task’s equivalent difficulty time is 5x its actual human time. Even within the band, tasks of identical human time can differ multiplicatively by a factor of 22x in equivalent difficulty time, so the practical spread is enormous.

There’s not too much curvature in the relationship between log human time and difficulty, so I think the log-linear form is decent, but it’s much more spread out than we’d like. There is a cluster of easy outliers on the far left, which I think can be explained by very short tasks containing virtually no information about difficulty. Overall this looks reasonable for modeling purposes.

Modeling ability over time

By directly modeling ability over time, we can try out shapes like exponential, subexponential, superexponential, saturating, and singularity. Forecasts depend a lot on which shape you pick, and the data doesn’t really tell you much, so it’s not easy to choose between them. Your priors rule here.

The abilities are modeled as

$$\theta_m = f(x_m) + u_m, \qquad u_m \sim N(0, \sigma_u^2),$$

where $x_m$ is the model release date in years, centered at the mean (September 2024). I’m still using a random effect $u_m$ for model ability here, since nobody seriously thinks every model released on the same date must be equally capable. I’m looking at four shapes for $f$: [4]

| Model | Params | Intuition |
|---|---|---|
| Linear | 2 | Linear = exponential horizon growth (constant doubling time) |
| Quadratic | 3 | Superexponential, accelerating growth |
| Power-law | 3 | Flexible: sub- or super-exponential. Exponent $\gamma = 1$ is a shifted/scaled version of the linear model. |
| Saturating | 4 | S-curve ceiling on ability. |
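
For concreteness, here is one way to write these shapes down in code. The linear and quadratic forms are unambiguous; the power-law and saturating parameterizations below are plausible stand-ins in the same spirit, not necessarily the exact forms fitted in the post.

```python
import numpy as np

# x: release date in years, centered at September 2024.

def f_linear(x, alpha, beta):
    return alpha + beta * x

def f_quadratic(x, alpha, beta, gamma):
    return alpha + beta * x + gamma * x ** 2

def f_powerlaw(x, alpha, beta, gamma, x0=-5.0):
    # One possible 3-parameter form (origin x0 held fixed): gamma == 1 recovers
    # a shifted/scaled linear model, gamma > 1 is superexponential horizon
    # growth, gamma < 1 subexponential. Requires x > x0.
    return alpha + beta * (x - x0) ** gamma

def f_saturating(x, lower, upper, rate, midpoint):
    # 4-parameter logistic S-curve: ability approaches `upper` as x grows.
    return lower + (upper - lower) / (1.0 + np.exp(-rate * (x - midpoint)))
```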

If METR’s GitHub repo contained all the historical data, I would also have tried a piecewise linear with a breakpoint around the time of o1, which visually fits the original METR graphs better than a plain linear fit. But since the available data doesn’t go that far back, I don’t need to, and the value of including those early points in a forecasting exercise is questionable anyway. Getting hold of the latest data points is more important.

All models share the same 2PL likelihood and task parameters (the discriminations $a_i$, the difficulties $b_i$, and the parameters $\beta_0$, $\beta_1$, $\sigma_\epsilon$ tying difficulty to log human time). Only the model for the abilities $\theta_m$ changes.

Each model except the saturating model will cross any threshold given enough time. Here are posteriors for the 50% crossing across our models. The saturating model almost never crosses the 1-month and 125-year thresholds since it saturates too fast.

| Trend | 1mo Mean | 1mo 95% CrI | 125y Mean | 125y 95% CrI |
|---|---|---|---|---|
| Linear | 2028-07 | 2027-12 – 2029-05 | 2032-03 | 2031-01 – 2033-10 |
| Quadratic | 2027-08 | 2026-12 – 2028-07 | 2029-07 | 2028-02 – 2031-09 |
| Power-law | 2027-10 | 2027-02 – 2028-11 | 2030-02 | 2028-08 – 2032-11 |
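
For a given posterior draw, a crossing date like those in the table can be found by solving for the date at which the ability trend reaches the mean difficulty of a task of the threshold length (at 50% success this is unambiguous, since the typical and marginal definitions coincide there). A sketch with assumed names for the draw’s parameters:

```python
import numpy as np
from scipy.optimize import brentq

def crossing_date(f, draw_params, threshold_minutes, beta0, beta1,
                  lo=-10.0, hi=30.0):
    """Date (centered years, as above) at which the trend f(x, *draw_params)
    reaches the mean difficulty of a task of length threshold_minutes, for one
    posterior draw. Uses the trend itself, i.e. a zero ability random effect,
    and requires that the trend actually crosses the target on [lo, hi]."""
    target_difficulty = beta0 + beta1 * np.log(threshold_minutes)
    return brentq(lambda x: f(x, *draw_params) - target_difficulty, lo, hi)

# Applying crossing_date to every posterior draw (of both the trend parameters
# and beta0, beta1) gives the posterior distribution of the crossing date.
```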

Problems with 80% success

Everything above uses 50% success, but METR also cares about 80% success and fits a separate model for that. We don’t need to do that here, since the model estimation doesn’t depend on the choice of success threshold at all. We’ll just calculate the 80%-success horizon from posterior draws instead.

But there are actually two reasonable ways to define “80% success,” and they give different answers.

  1. Typical: Pick a task of average difficulty for its length. Can the model solve it 80% of the time? This is roughly what METR computes.

  2. Marginal: Pick a random task of that length. What’s the expected success rate? Because some tasks are much harder than average, the hard ones drag down the average more than easy ones push it up.

At 50%, the two definitions agree exactly. But at 80%, the gap is roughly an order of magnitude!

So, on the one hand, it’s the difficulty variance ($\sigma_\epsilon^2$) alone that makes these two separate plots necessary under our model. But on the other hand, the difference is not really a consequence of modeling. Tasks of the same human time vary a lot in how hard they are for our models, and a phenomenon like this would happen for any model that’s actually honest about this.
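
The gap comes purely from averaging the logistic curve over the spread in task difficulty. Here is a self-contained sketch of the two definitions with a single shared discrimination and illustrative parameter values, not the fitted posteriors:

```python
import numpy as np
from scipy.optimize import brentq

def p_success(theta, log_minutes, a, beta0, beta1, eps=0.0):
    """2PL success probability for a task with difficulty residual eps."""
    b = beta0 + beta1 * log_minutes + eps
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def horizon(target, theta, a, beta0, beta1, sigma_eps, marginal, n_mc=20_000):
    """Task length (minutes) at which success probability equals target.
    typical (marginal=False): a task of average difficulty for its length;
    marginal (marginal=True): expectation over eps ~ N(0, sigma_eps)."""
    eps = (np.random.default_rng(0).normal(0.0, sigma_eps, n_mc)
           if marginal else np.zeros(1))
    g = lambda log_t: p_success(theta, log_t, a, beta0, beta1, eps).mean() - target
    return float(np.exp(brentq(g, -10.0, 20.0)))

# At 50% the two horizons coincide; at 80% the marginal horizon is shorter,
# because unusually hard tasks drag the average success rate down more than
# easy ones lift it. With these illustrative values the gap is modest; how
# large it is depends on the discrimination and the difficulty spread, and in
# the post's fitted model it comes out to roughly an order of magnitude.
args = dict(theta=3.0, a=1.0, beta0=0.0, beta1=1.0, sigma_eps=1.44)
for target in (0.5, 0.8):
    print(target,
          round(horizon(target, marginal=False, **args), 1),
          round(horizon(target, marginal=True, **args), 1))
```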

The marginal horizon is the one that matters for practical purposes. “Typical” is optimistic since it only considers tasks of average difficulty for their length. The marginal accounts for the full spread of tasks, so it’s what you actually care about when predicting success on a random task of some length. That said, from the plot we see a frontier marginal 80% horizon of roughly 5 minutes, which does sound sort of short to me. I’m used to LLMs roughly one-shotting longer tasks than that, but it usually takes some iterations to get it just right. Getting the context and subtle intentions right on the first try is hard, so I’m willing to believe this estimate is reasonable.

Anyway, the predicted crossing dates at 80% success are below. First, the 1-month threshold (saturating model omitted since it almost never crosses):

| Trend | Typical Mean | Typical 95% CrI | Marginal Mean | Marginal 95% CrI |
|---|---|---|---|---|
| Linear | 2028-12 | 2028-04 – 2029-10 | 2030-07 | 2029-08 – 2031-09 |
| Quadratic | 2027-10 | 2027-02 – 2028-11 | 2028-09 | 2027-08 – 2030-04 |
| Power-law | 2028-02 | 2027-05 – 2029-04 | 2029-02 | 2028-01 – 2031-01 |

And the 125-year threshold:

| Trend | Typical Mean | Typical 95% CrI | Marginal Mean | Marginal 95% CrI |
|---|---|---|---|---|
| Linear | 2032-08 | 2031-05 – 2034-03 | 2034-02 | 2032-09 – 2036-03 |
| Quadratic | 2029-09 | 2028-03 – 2032-01 | 2030-05 | 2028-09 – 2033-05 |
| Power-law | 2030-05 | 2028-09 – 2033-05 | 2031-04 | 2029-04 – 2035-02 |

Make of this what you will, but let’s go through one scenario. Say I’m a believer in superexponential models with no preference between quadratic and power-law, so I have a 50-50 weighting on those. Suppose also I believe that 125 years is the magic number for the auto-coder of AI Futures, but I prefer the 80% threshold to 50%, as the latter is too brittle. Then, using the arguably correct marginal formulation, my timeline has a mean of roughly November 2030, while the typical framework yields roughly January 2030 instead. And this isn’t too bad, just a difference of ~0.8 years! The linear model is similar, with timelines pushed out roughly 1.6 years. So the wide marginal-typical gap doesn’t translate into that big of a timeline gap, as both trajectories have the same “slope”, just at a different level.
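
The scenario arithmetic is just model averaging over posterior draws. A small sketch, with the marginal 80% means from the table above plugged in as single-value stand-ins for the full arrays of draws:

```python
import numpy as np

def mixture_mean_date(dates_a: np.ndarray, dates_b: np.ndarray,
                      w_a: float = 0.5) -> float:
    """Mean crossing date under a w_a : (1 - w_a) mixture of two models'
    posterior draws (decimal years)."""
    return w_a * dates_a.mean() + (1 - w_a) * dates_b.mean()

# Quadratic marginal mean ~2030.4 (2030-05), power-law ~2031.3 (2031-04):
print(mixture_mean_date(np.array([2030.4]), np.array([2031.3])))  # ~2030.85, i.e. roughly November 2030
```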

Let’s also have a look at METR’s actual numbers. They report an 80% horizon of around 15 minutes for Claude 3.7 Sonnet (in the original paper). Our typical 80% horizon for that model under the linear model is about 22.0 min, and the marginal is about 1.0 min, roughly 15x shorter than METR’s.

Modeling the human times

The available METR data contains the geometric mean of (typically 2-3 for HCAST) successful human baselines per task, but not the individual times. Both METR’s analysis and mine treat this reported mean as a known quantity, discarding uncertainty. But we can model the true human time $t_i$ as a latent variable informed by the reported baselines. This is easy enough to do in Stan, and would give a more honest picture of what the data actually supports, as all credible intervals will be widened.

I’d expect smaller differences between the typical and marginal plots at the 80% horizon if the human times were modeled properly, as more of the variance in the random effect would be absorbed by the uncertainty in $t_i$. I’m not sure how big the effect would be, but getting hold of the data or doing a short simulation would help.

A technical point: When modeling $t_i$, I would also try a Weibull distribution instead of log-normal, since the log-normal is typically heavier-tailed and the Weibull is easier to justify on theoretical grounds using its failure-rate interpretation.
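
A small simulation sketch of why this matters: treating the reported geometric mean of a few baselines as the true human time ignores sampling noise whose size is easy to quantify. The numbers below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the true log human time is mu and individual baseliners scatter
# around it with log-scale SD tau (log-normal assumption, as in the post).
mu, tau, n_baseliners = np.log(60.0), 0.6, 3

true_time = np.exp(mu)
samples = np.exp(rng.normal(mu, tau, size=(10_000, n_baseliners)))
geo_means = np.exp(np.log(samples).mean(axis=1))

# The reported geometric mean is an unbiased estimate of log-time, but it is
# noisy: its log-scale SD is tau / sqrt(n), which a latent-variable model
# would propagate into the posterior instead of ignoring.
print(true_time, geo_means.mean(), np.log(geo_means).std(), tau / np.sqrt(n_baseliners))
```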

Notes and remarks

  • I also tried a finite-time singularity model, in which ability blows up as the release date approaches a finite singularity date $x_\ast$. The posterior on the singularity date didn’t really move from the prior at all. This is no surprise. It just means the data is uninformative about $x_\ast$.
  • There are loads of other knobs you could turn. Perhaps you could introduce a discrimination parameter that varies by model and task, together with a hierarchical prior. Perhaps you could make discrimination a function of time, etc. I doubt any of these would change the picture much, if at all. The model fit is good enough as it is, even if the uncertainty is likely too small. That said, I don’t want to dissuade anyone from trying!
  • The power-law model does in principle support both sub- and superexponential trajectories ($\gamma < 1$ and $\gamma > 1$, respectively, where $\gamma = 1$ is the linear model). The posterior puts its mass almost entirely on $\gamma > 1$, so the data does not support subexponential growth. At least when using this model.
  • There’s plenty of best-practice stuff I haven’t done, such as prior sensitivity analysis. (But we have a lot of data, and I wouldn’t expect it to matter too much.)
  • The doubling time posterior median is 4.1 months (95% credible interval: 3.5–5.0), which is close to METR’s v1.1 estimate. Of course, doubling time only makes sense for the linear model above, as the doubling time of the other models varies with time.

  1. The ELPD-LOO estimates are: linear (SE ), saturating (SE ), power-law (SE ), quadratic (SE ). ↩︎

  2. Define $\tilde t_i$ as the human time whose mean difficulty equals $b_i$. Then $\beta_0 + \beta_1 \log \tilde t_i = \beta_0 + \beta_1 \log t_i + \epsilon_i$, so $\log \tilde t_i = \log t_i + \epsilon_i/\beta_1$ and $\tilde t_i = t_i\, e^{\epsilon_i/\beta_1}$. ↩︎

  3. The multiplier is $e^{\sigma_\epsilon/\beta_1}$, where $\sigma_\epsilon/\beta_1$ is evaluated at its posterior median. ↩︎

  4. Quadratic is the simplest choice of superexponential function. You could spin a story in its favor, but using it is somewhat arbitrary. The power-law is the simplest function that can be both super- and subexponential (in practice it turns out to be superexponential here, though), and I included the saturating model because, well, why not? ↩︎



Discuss