2026-04-13 11:19:58
I'm a struggling beginner to this whole writing business, and I've been wondering how to measure my skill as it improves. There's an app called The Most Dangerous Writing App that deletes everything you've typed if you stop writing before a set amount of time has passed or a set number of words is reached. I thought it was pretty neat, and naturally I want to optimize it till it ceases to be a good metric. How long are my thoughts? How long do my thoughts remain clear and focused like a laser before diffracting into fuzzy, ambiguous, vague nonsense? Of course, the app only measures how long you can suppress the inner editor, but I think that's a pretty good proxy for writing skill at my level.
There's an older essay on thought lengths that I often find myself thinking of. The general notion is that some thoughts are quick: someone asks a question, and the answer is right there, top of mind. You already know your reply as they finish asking the question. Other thoughts take longer to think. Someone asks you a deeper question, and you have to mull it over for a while. I think you could use the dangerous writing app for long word counts or stretches of time only if you've already done a lot of thinking, only if you've already got most of an idea figured out in your head and can immediately type it out. Or at least, that's how it seems to work for me. The dangerous writing app might be measuring something similar, if you set it higher and higher. Or maybe it's just stream-of-consciousness writing, which is usually a lot lower quality. I don't think this is as useful for experienced writers. Getting the right words out is, I suspect, a harder problem than getting any words out at all. Though perhaps it's always useful to have something to help suppress the inner editor for a while. It's certainly been useful for me.
Currently, when I set the app for 500 words, I run out of steam. As I write, my hands start typing ahead of my mind, and the distance between them grows. My mind stretches to fill the gap, eventually fails, and there is nothing left to write. Then my words get deleted. I have to start smaller, set it to 100 words, and then I still feel the stretch of my hands typing out farther than I can think. But this time I can make it: I can hit the 100-word deadline, keep my words, and try to think about the next thought. I've started using it to write paragraphs, instead of trying to write a whole essay at once.
Developing my ability to babble is another way I think of it. In the babble-and-prune frame, I can certainly prune better than I can babble. It makes it hard for me to accept the poor writing that I actually produce, instead of the ideal version in my head, but that's what practice is for. This is a problem Alkjash noticed and wrote about, and reading it years ago I realized I had the exact problem they described. Yet it took a while for me to do something about it. I eventually set up a Beeminder goal to journal 250 words a day, and that helped me a little, but I never increased the number. Plus, journaling is quite different from writing something intended for other people.
It would be cool if I started tracking this, gradually increasing the word limit and seeing how far I can get with dedicated practice. Right now I can write 100 words on a topic without stopping. I suspect experienced writers could write for thousands of words before running out of steam, but maybe not. I really like that the number of words written without stopping is an unbounded metric: it can always increase!
It might not be the only metric of writing skill, but it is one that's pretty easy to measure. I intend to use it heavily to gauge my own babbling abilities, and maybe you can too.
2026-04-13 10:55:55

In this post, we’ll discuss three major problems with the METR eval and propose some solutions. Problem 1: The METR eval produces results with egregiously wide confidence intervals, and the METR chart misleadingly hides this. Problem 2: The sample size for long-duration tasks is too small. Problem 3: METR doesn't test Claude Code or Codex.
(Note that while this post is critical, we do think that the METR eval is nonetheless valuable and the organization is doing important work.)
METR's confidence intervals are too big and they're misleadingly presented. Let's take a look at the good old METR chart.

Okay, so the confidence intervals are pretty big, but is that really a problem? After all, we can still see the general trend, right?
Well, we can see the trend over the course of multiple years. But we can't see the trend in smaller timeframes due to how noisy the eval is. When we say that the confidence intervals are too big, we mean that the METR eval is failing to distinguish models with obvious time horizon capability differences.
This becomes apparent when we zoom in.

The data point second from the right is Sonnet 3.5, which came out in June 2024. GPT-4 came out in March 2023. Sonnet 3.5 was and is obviously significantly more capable than GPT-4, but the METR eval doesn't show that. Their confidence intervals overlap substantially.
Let's zoom in again from April 2025 to March 2026.

What can we make of Opus 4.6 here? (Look back at the zoomed-out chart as well.) It might be better than every other model. But could it be, say, a year ahead of what the METR trendline predicts? According to the eval, it could be, but we can't see that because the graph is misleadingly cut off at the top.

This is especially misleading because the confidence interval for Opus 4.6 is significantly asymmetric (as it should be); that is, the upper bound is much less tightly pinned down than the lower bound.

Now, you might ask, why is the confidence interval so big and why is it asymmetric?
The headline number of tasks in METR's eval is 228, which sounds pretty good. Why are the confidence intervals so wide? The reason becomes clear when we look at the breakdown of tasks by duration.

Since the eval is trying to determine the task length at which the model succeeds 50% of the time, the durations at which the model scores closer to 50% dominate the confidence interval. For example, Opus 4.6 has a 94% success rate on the 16m-1.1hr bucket. Given this, its 98% success rate on the 82 tasks in the 0-4min bucket gives us ~0 additional information. Looking at a breakdown of Opus 4.6's solve rate by task duration makes this apparent.

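To put rough numbers on this, here's a minimal simulation sketch (a toy logistic model with made-up task counts and durations, using numpy and scikit-learn; this is our illustration, not METR's actual methodology). The point it illustrates: bootstrapped confidence intervals on the 50% time horizon barely tighten when you pad the suite with easy short tasks, but they shrink noticeably when you add tasks near the horizon.

```python
# Toy illustration (not METR's methodology): bootstrap CI for the 50% time
# horizon under a logistic model of success vs. log task duration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def horizon_ci(durations_min, successes, n_boot=500):
    """Bootstrap 95% CI for the task duration at which P(success) = 50%."""
    x = np.log(durations_min).reshape(-1, 1)
    y = np.asarray(successes)
    horizons = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        clf = LogisticRegression(C=1e6).fit(x[idx], y[idx])
        horizons.append(np.exp(-clf.intercept_[0] / clf.coef_[0, 0]))
    return np.percentile(horizons, [2.5, 97.5])

def simulate(durations_min, true_horizon=120.0, slope=1.5):
    """Toy 'model' whose success rate falls off logistically around a 2h horizon."""
    p = 1 / (1 + np.exp(-slope * np.log(true_horizon / durations_min)))
    return rng.random(len(durations_min)) < p

near = np.exp(rng.uniform(np.log(30), np.log(480), 30))        # tasks near the horizon
short = np.exp(rng.uniform(np.log(1), np.log(4), 80))          # easy 0-4 min tasks
extra_near = np.exp(rng.uniform(np.log(30), np.log(480), 80))  # more near-horizon tasks

near_y = simulate(near)
print("30 near-horizon tasks:        ", horizon_ci(near, near_y))
print("+ 80 easy short tasks:        ", horizon_ci(np.concatenate([near, short]),
                                                   np.concatenate([near_y, simulate(short)])))
print("+ 80 more near-horizon tasks: ", horizon_ci(np.concatenate([near, extra_near]),
                                                   np.concatenate([near_y, simulate(extra_near)])))
```

On a typical run, the extra short tasks leave the interval essentially unchanged, while the extra near-horizon tasks cut it down substantially; that's the sense in which the 82 easy tasks are doing almost no work.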
The good news is that this problem is easy to fix: just add more longer-duration tasks. METR has the money for this. Seriously, why has this not been done? We're genuinely curious: has METR just not prioritized it, have they encountered problems hiring people to design the tasks, or something else? METR did add some new tasks between 2025 and 2026, but... not many? What are we missing here?

In particular, we're going to need tasks in the 16h-5d range for the near future. METR hasn't yet published Mythos's performance on its benchmarks, so we'll share our own estimate.

[THIS IS OUR ESTIMATE, NOT ITS ACTUAL EVAL'D SCORE!]
Any software engineer can tell you that the capabilities of AI in December 2025 were dramatically different from those in April 2025. Much of the change in real world capabilities during this period came not from better models but from harnesses: Claude Code and OpenAI's Codex. Many people heralded the November release of Opus 4.5 alongside a major Claude Code update in particular as being a "step-change" moment. Is this reflected in the METR chart?

No, and it's not only because of the problems with sample size mentioned earlier. The issue here is that METR does not test any models inside Claude Code or Codex. Instead, they test all models using their proprietary harness called 'Inspect', which is almost certainly worse than Claude Code and Codex[2].
Perhaps METR wants to test the models-qua-models; using different scaffolds for different models would be testing something else. But scaffolds are really important. Why not both?
The result of this is that since the release of Claude Code and Codex in May 2025, the METR chart has been underestimating SoTA capabilities.
Pretty straightforward.
METR does not inform us about the SoTA SWE capabilities of AI because it doesn’t test Claude Code or Codex. It could very well be the case that Opus 4.6 + Claude Code completely saturates METR's benchmark! We expect METR to tell us that Mythos is significantly better than Opus 4.5, but it won’t tell us whether it’s significantly better than Opus 4.6, because of the giant confidence intervals.
We're gonna need a bigger benchmark.

Other people have noticed that we’re running out of benchmarks to upper bound AI capabilities.
On April 10 (two days ago as we're writing this), Epoch released a report with the headline "Evidence that AI can already do some weeks-long coding tasks". They continue: "In our new benchmark, MirrorCode, Claude Opus 4.6 autonomously reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks."
OpenAI says GPT-5.3 is "optimized for agentic coding tasks in Codex or similar environments". AFAIK this is also true of GPT-5.1, 5.2, and 5.4. The general consensus seems to be that the GPT line does better in Codex than other harnesses.
2026-04-13 10:55:07
I'm going to tell you a story. For that story to make sense, I need to give you some background context.
I have some pretty smart friends. One of them is Peter Schmidt-Nielsen. Peter has an illustrious line of descent. His paternal grandfather was Knut Schmidt-Nielsen, regarded as one of the great animal physiologists of his time. His paternal grandmother was Bodil Schmidt-Nielsen, who became the first woman president of the American Physiological Society. Bodil's father was August Krogh, who won the 1920 Nobel Prize in Physiology or Medicine "for the discovery of the mechanism of regulation of the capillaries in skeletal muscle", and later went on to found the company that would eventually become Novo Nordisk.
Peter himself is no slouch. He was homeschooled for most of his childhood. When it was time to go to university, he simultaneously applied to MIT's undergrad and grad programs, and was accepted to both. (He decided to go to undergrad.) He went on to do some startup stuff, then was an early employee at Redwood. While there, he broke Meow Hash for fun. (He's not a security guy.)
My point here is: Peter is very, very smart.
Ok, here's the story.
One day I was in a room with Peter and Drake Thomas. Peter was telling us a story about a puzzle he'd grown up with but never solved. Peter's father also went to MIT. While there, he decided to come up with a new "cube" puzzle, finding traditional cube puzzles like the Soma Cube too easy. Knowing that people often struggled with chirality, he decided to start with the six chiral pentacube pairs, but that left him four cube units short of a full 4×4×4 cube. Thinking that four 1×1×1 pieces would make it too easy, he decided to fill out the remainder with two 1×1×2 pieces (i.e. dominoes).

The six chiral pentacube pairs. Source: https://sicherman.net/c5nomen/index.html
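(As a quick sanity check on the piece count, my arithmetic rather than anything from the original story: twelve pentacubes cover 60 cells, the two dominoes cover 4 more, and that exactly fills a 4×4×4 cube.)

```python
# Piece-count check: 12 pentacubes (5 cells each) + 2 dominoes (2 cells each)
# exactly fill a 4x4x4 cube: 60 + 4 == 64.
assert 12 * 5 + 2 * 2 == 4 ** 3
```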
He then cut the puzzle out of wood and spent some time trying to solve it. Not having any success, he left it overnight in the grad student lounge, and came back to find it solved the next morning.
Drake, hearing this story, said something to the effect of, "I think I can probably solve this puzzle in my head."
Impossible, right? No way a human can do that in their head in any reasonable time frame?
If you want to play around with the puzzle yourself, I've put a widget into the collapsible section below.
Pentacube Puzzle
After that, I watched Drake lie down on a couch and stare into space for two hours. Then he went to sleep. He came back downstairs the next morning and stared into space for another hour. Then he took a piece of paper and pen and wrote this down (in a spoiler block, in case you want to avoid any hints):

We didn't have a copy of the puzzle handy to make extra-sure, so they 3D-printed one and confirmed that the solution was correct.
Here are a couple of the original puzzles from the 70s:

The distribution of what unassisted human brains can accomplish is extremely wide. Human brains are squishy meat sacks. Better things are possible. Alas.
2026-04-13 08:46:57
Before I had a baby I was pretty agnostic about the idea of daycare. I could imagine various pros and cons but I didn’t have a strong overall opinion. Then I started mentioning the idea to various people. Every parent I spoke to brought up a consideration I hadn’t thought about before—the illnesses.
A number of parents, including family members, told me they had sent their baby to daycare only for the child to become constantly ill, sometimes severely, until they decided to pull them out. This worried me, so I asked around some more. Invariably, every single parent who had tried sending their babies or toddlers to daycare, or who had kids in daycare right now, told me that the kids were ill more often than not.
One mother strongly advised me never to send my baby to daycare. She regretted sending her (normal and healthy) first son to daycare when he was one—he ended up hospitalized with severe pneumonia after a few months of constant illnesses and infections. She told me that after that she didn’t send her other kids to daycare and they had much healthier childhoods.
I also started paying more attention to the kids I saw playing outside with their daycare group and noticing that every one had a sniffly nose.
I asked on a mothers group chat about people’s experiences with daycare. Again, the same. Some quotes:
“They do get sick a lot. I started my son at 2.5 and feel he always has something.”
“The limit does not exist.”
“brought home every plague (in first 6mo, Covid, HFM, slapcheek, RSV)”
“They usually say 8-12 illnesses per year. My girls were sick every 2-3 weeks in their first year of daycare”
“My daughter started daycare at 6 months and got sick a ton the first year”
Despite all this, many parents who have the option not to (i.e. they can afford in-home care with a nanny or for one parent to stay home) still choose to send their babies and toddlers to daycare. How come? Surely most well-off adults wouldn’t agree to be ill nonstop in exchange for the monetary savings daycare provides?
Asking around, it seemed like the most common reason given was that parents believed daycare illnesses “built immunity”; that if their babies and toddlers got sick at daycare they’d get less sick later in childhood and so overall it would net out the same. Unfortunately few could point me to any evidence for this but nevertheless passionately defended the view.
The claim that daycare illnesses simply offset childhood and adult illness immediately seemed suspect to me for a number of reasons:
I xeeted about this:
A number of people sent me this link, an alleged “study” from UCL showing that “frequent infections in nursery help toddlers build up immune systems”, authored (of course) by a group of parents who all send their kids to nursery (what the British call daycare).
The link I was sent was actually a UCL press release summarizing a narrative review paper and not a study itself. Narrative reviews are susceptible to selection bias because, unlike systematic reviews or meta-analyses, there’s no pre-registered search protocol or PRISMA-style methodology requiring them to account for all relevant evidence. But I decided to look into the narrative review more, to assess its validity fairly. I got access to the full publication.
Unlike the press release, which ignores these considerations entirely, the full review does engage with severity and age-related vulnerability, conceding that younger toddlers and babies suffer more from the same illnesses. A section on immunology provides a detailed account of why infants under two are more vulnerable: their immune systems are much less effective at fighting the same infections, for a plethora of well-understood reasons. The review also cites a large Danish registry study (Kamper-Jørgensen et al.) that reports a 69% higher incidence of hospitalization for acute respiratory infections among under-1s in daycare.
However, these severity findings are integrated into the review’s conclusions and framing in an incredibly biased way. The introduction describes severe outcomes as occurring “in rare cases,” and the conclusions focus on normalizing the burden and advocating for employer understanding. After establishing the immunological basis for why the same infection is more dangerous in a 6-month-old than in a 3-year-old, the review doesn’t ask the hard follow-up question: given this, is the pattern of starting daycare at 6–12 months optimal from a child health perspective? Instead, it frames this timing as a societal given. The hand, foot, and mouth disease section is a good example of the review’s handling: it reports that daycare attendance was associated with more severe cases, but then immediately offers a mitigating interpretation with no supporting evidence, namely that prolonged hospital stays might reflect parental work constraints rather than genuine severity.
Though the review considers severity, it ignores duration: its primary metric throughout is episode count. And despite discussing a wide variety of pathogens, it doesn’t address which of these infections carry the highest complication rates in infants and toddlers specifically.
Finally, the crucial “Illness now or illness later?” section is the paper’s weakest portion. It rests on two primary sources for the compensatory immunity claim:
These are reasonable small studies, but the paper does not cite or engage with the Søegaard et al. 2023 study (International Journal of Epidemiology)—a register-based cohort of over 1 million Danish children followed to age 20, which directly tested and rejected the compensatory immunity hypothesis. Quoting from the study:
We observed 4 599 993 independent episodes of infection (antimicrobial exposure) during follow-up. Childcare enrolment transiently increased infection rates; the younger the child, the greater the increase. The resulting increased cumulative number of infections associated with earlier age at childcare enrolment was not compensated by lower infection risk later in childhood or adolescence.
This is arguably the single most relevant study for the paper’s central “illness now or illness later” question, and it’s three orders of magnitude larger than either study the authors cite. Its absence is hard to explain—it was published in a top epidemiology journal in late 2022 (available online November 2022), well before the review was written.
Accordingly, they hedge their conclusions carefully—“attendance at formal childcare may tip the balance in favor of infection now rather than later”, but their press release ignores any nuance, referring to daycare as an “immune boot camp”.
So overall, the compensatory immunity claim seems very weak and my prior that daycare illness is straight-up bad remains. Parents are citing biased reviews from motivated researchers. We are only beginning to understand the deleterious effects of increased viral load in infants.
I predict that in the future we’ll learn more about the side-effects of increased viral load on intelligence, wellbeing, fatigue etc. The “just the sniffles” mentality is a harmful attitude toward infections that promotes the dismissal of phenomena that substantially impact child and adult wellbeing.
2026-04-13 06:43:50
Once, I went to talk about "curiosity" with @LoganStrohl. They noted, "It seems like you have a good handle on 'active curiosity', but you don't really do much diffuse 'open curiosity.'" The convo went on for a while, and felt very insightful.
(I may not be remembering details of this convo right. Apologies to Logan)
Towards the end of the conversation, I was moving to wrap up and move on. And Logan said "Wait. For this to feel complete to me, I'd like it if we translated this into more explicit TAPs. TAPs or it didn't happen."
You can get a new insight. But, if the insight doesn't translate into some kind of action you're going to do sometimes, there is a sense in which it didn't matter. And people mostly fail to gain new habits. If you're going to have a shot in hell of translating this into action, it's helpful to have some kind of plan.
"TAP" stands for "Trigger Action Pattern", and also "Trigger Action Plan." A TA-Pattern is whatever you currently do by default when faced with a particular trigger. A TA-Plan is an attempt to install a TA-Pattern on purpose.
To turn an insight into a TAP, you need some idea of what it'd mean to translate the insight into a useful action. (I'll touch on this later but mostly it's beyond the scope of this post). But, after that, you will need a...
But, pretty crucial to this going well is:
Example: Sometimes, you talk to your colleague and end up getting in a triggered argument, where you both get kinda aggro at each other and talk past each other.
Maybe you have the insight "oh, maybe I'm the problem", along with "I should maybe try to de-escalate somehow" or "I should do better at listening."
Naive attempt at a TAP:
Mysteriously, you find yourself not remembering to do this in the heat of the moment.
Slightly more sophisticated attempt (after a round of doing some Noticing and curious investigation, which is also beyond the scope of this post):
Okay, but then in the heat of the moment, idk you're just so mad, it doesn't feel fair that you have to be the one to de-escalate.
...and then you might find that taking a deep breath doesn't actually help as much as you hoped, or is insufficient. Figuring out how to handle arbitrary problems is, you know, the complete body of rationality tools, including those not-yet discovered.
I think "TAPs or didn't happen" is a bit too strong. Conversations can be useful for reasons other than turning into new habits. But, I recommend thinking of "Turning takeaways into actions" as a thing you might want to do.
While the skill here is basically "fully general rationality", here are a few suggested prompts to get started.
First, you might want a stage of asking:
"What even were the takeaways from this conversation?". You might have had a fun meandering convo. What do you want to remember?
"Why does this takeaway feel important or useful to me?". At first, you might have only a vague inkling of "this feels exciting." Why does it feel exciting?
"When, or in what domains, do I specifically want to remember this takeaway?"
"What would I do differently, in the world where I was taking the takeaway seriously?"
...
I have a horrible confession to make.
I do not remember what TAPs I ended up coming up with.
I do think I ended up incorporating the concepts into my life, and this routed at least somewhat through the TAPs. Here is my attempted reconstruction of what happened at the time:
In the conversation about curiosity, some things that came up were that I feel like "open curiosity" takes too long (compared to directly tackling questions in an active, goal-driven way). I feel like I'd have to boot up in a whole new mode of being to make it work, and... idk, I just imagine this taking years to pay off.
I nonetheless have some sense that there's a kind of intellectual work that openly curious people do, that's actually harder to do with active curiosity.
A thing that came up is the move of... just noticing that some things are more interesting than other things. Even if something doesn't immediately feel actively fascinating, there's a move you can make, to notice when there's a diff between how interesting one thing feels vs. another thing. And then pay extra attention to the more interesting thing, and what's interesting about it. Over time this can cultivate curiosity as a kind of muscle.
The TAP version of this is:
And, relatedly:
...
May your good conversations live on in your actions.
2026-04-13 04:47:29
There's an adage from C++ programming which goes something like "Yes, you write C++, but you imagine the machine code as you do." I assumed this was bullshit, that nobody actually does this. Am I supposed to imagine writing the machine code, and then imagine imagining the binary? And then imagine imagining imagining the transistors?
Oh and since I don't actually use compiled languages, should I actually be writing Python, then imagining the C++ engine, and so on?
Then one day, I was vibe-coding, and I realized I was writing in English and thinking in Python. Or something like it. I wasn't actually imagining every line of Python, but I was imagining the structure of the program that I was describing to Claude, and adding in extra details to shape that structure.
This post is actually about having sane conversations with philosophy bros at the pub.
People like to talk in English (or other human languages) because our mouths can't make sounds in whatever internal neuralese our brains use. Sometimes, like in mathematics, we can make the language of choice trivially isomorphic to the structures that we're talking about. But most of the time we can't do that.
Consider the absolute nonsense white horses paradox, where "a white horse is not a horse" is read both as the statement:
And the phrase "a white horse is a horse" is read as the statement:
I often think in a language of causal graphs. English isn't very good at talking about causal graphs. It doesn't have individual words for "A contains the same information as B", "A is the same node as B", "A is an abstraction over B", or "A is a node which is causally upstream of B".
I remember talking about "consciousness" with a philosophy guy at the pub once. I think I said something like "A certain structure of computation causes consciousness", meaning "Consciousness is a label applied to certain computational structures", but he interpreted it as "The presence of a certain computational structure is a node upstream of consciousness". This caused immense confusion.
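For concreteness, here's a toy sketch in code (my own illustration with hypothetical node names, not anything we actually wrote down at the pub) of why those are two different graph structures rather than two opinions about the same graph.

```python
# Two readings of "a certain structure of computation causes consciousness".
# Node names are hypothetical; the point is only that the readings disagree
# about the shape of the graph, not about an edge within one agreed graph.
import networkx as nx

# Reading 1: "consciousness" is a label applied to a certain computational
# structure. It is not a separate node at all, just another name for one.
reading_1 = nx.DiGraph()
reading_1.add_edge("computational_structure", "behavior")
labels = {"computational_structure": "consciousness"}  # same node, new name

# Reading 2: the computational structure is a node causally upstream of a
# distinct "consciousness" node.
reading_2 = nx.DiGraph()
reading_2.add_edges_from([
    ("computational_structure", "consciousness"),
    ("consciousness", "behavior"),
])

# English renders both readings with the same sentence, but they don't even
# agree on how many nodes there are.
print(reading_1.number_of_nodes(), reading_2.number_of_nodes())  # 2 3
```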
I call the problems here "beetle problems"
Wittgenstein proposed a thought experiment. Suppose you have a society where:
In this case, the meaning of the word "beetle" is entirely socially constructed. Wittgenstein was exaggerating here: if I talk to you, and you do something with your beetle (dirty jokes aside) and report the results, I can get some information about your beetle, based on what you say back to me. The beetle is causally entangled with us both. It's just not a very efficient way of talking about things.
Even if we both have identical beetles, it might take us a while to get them oriented the same way round: what I call an antenna, you might call a leg; what I call a wing-case, you call a carapace. And so on.
To unfairly single out an example: I personally find this particularly salient when talking to people in the Oxford EA/longtermist cluster. I know they're smart people who can put together an argument, but they've developed a language I just cannot penetrate. It takes a long time for me to figure out what on earth they mean. Ohh, you have your beetle upside down compared to mine.
Even worse, I think a lot of people don't actually think in terms of causal graphs the way I do. This comes up when I try to read pieces on moral realism. When someone brings up a stance-independent reason to do something, I simply cannot map this onto any concept which exists in my mental language. What do you mean your beetle has fur and claws and keeps going "meow"? Are you sure?
Uhh... I don't have many. Beetle problems take a while to figure out. I once got feedback on an essay that said "Your ideas seemed confused," and I thought, "Man, your draft seemed confused!" I don't think I could have done much better without spending time in person hashing out the beetle problems.
It might have helped to have a better conception of beetle problems, though. I could at least have pointed them out. Perhaps in the future I'll come back with a wonderful method for solving beetle problems.
Editor's note: this post was written as part of Doublehaven (unaffiliated with Inkhaven).
◆◆◆◆◆|◆◆◆◆◆|◆◆◇◇◇