
It Is Reasonable To Research How To Use Model Internals In Training


Published on February 8, 2026 3:44 AM GMT

There seems to be a common belief in the AGI safety community that involving interpretability in the training process is “the most forbidden technique”, including in recent criticism of Goodfire for investing in this area.

I find this odd since this is a pretty normal area of interpretability research in the AGI safety community. I have worked on it, Anthropic Fellows have worked on it, FAR has worked on it, etc.

I don’t know if it will be net positive to use this kind of thing in frontier model training, but it could plausibly be very helpful for AGI safety, and it seems like a clear mistake to me if we don’t do the required research to figure this out. Further, this seems like a massive pain to use in current frontier model training stacks, so I don’t see any imminent risk/benefit from this kind of work; it’s more of a longer-term bet.

My sense is that most people in this debate have pretty reasonable and nuanced views when you actually get into the details, but public discourse is kinda confusing. So I thought it might be helpful to lay out how I see the landscape here, why I would like to see more research in this direction, and why I am not particularly concerned about exactly how people go about doing it.[1]

What do I actually mean by using interp in training?

I'm broadly referring to anything involving the use of model internals in training. This could encompass several things (a rough code sketch of the first variant follows the list):

  • Directly adding some function of the model's internals (e.g. a probe score) to the loss and differentiating through it to give strong gradient signals on how to break it
  • Adding a probe score to the reward in RL, where the model does not get gradients through the probe
  • Altering the model's internals during training, such as by adding a steering vector or ablating unwanted concept directions
  • Any of the above, but you regularly retrain
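As a rough illustration of the first bullet, a sketch only: the model/probe names, the HuggingFace-style output_hidden_states interface, and the weighting lam are all illustrative assumptions, not any particular paper's setup.

```python
import torch
import torch.nn.functional as F

def loss_with_probe_penalty(model, batch, probe_direction, probe_bias, lam=0.1):
    # Ordinary task loss (next-token cross-entropy here, purely for illustration).
    outputs = model(batch["input_ids"], output_hidden_states=True)  # HF-style API assumed
    task_loss = F.cross_entropy(
        outputs.logits.view(-1, outputs.logits.size(-1)),
        batch["labels"].view(-1),
    )
    # Score from a frozen linear probe on the final-layer activations.
    # Gradients flow through the activations (and hence the model), so the model
    # gets a strong signal about how to reduce -- and potentially break -- the probe.
    acts = outputs.hidden_states[-1]                      # [batch, seq, d_model]
    probe_score = torch.sigmoid(acts @ probe_direction + probe_bias).mean()
    return task_loss + lam * probe_score
```

The second bullet would instead add a detached probe score to an RL reward, so no gradients flow through the probe at all; the third would edit the activations directly (e.g. subtracting a concept direction) rather than adding a penalty term.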

Why could this research help AGI safety?

Fundamentally, making safe models will involve being good at training models to do what we want in weird settings where it is hard to precisely specify what good behaviour looks like. Therefore, the more tools we have for doing this, the better. There are certain things that may be much easier to specify using the internals of the model. For example: Did it do something for the right reasons? Did it only act this way because it knew it was being trained or watched?

Further, we should beware an isolated demand for rigor here. Everything we do in model training involves taking some proxy for desired behaviour and applying optimization pressure to it. The current convention is that this is fine to do for the model's behaviour, bad to do for the chain of thought, and no one can be bothered with the internals. But I see no fundamental reason behaviour should be fine and internals should be forbidden; this depends on empirical facts we don’t yet know.

Why might you be worried about research here?

The strongest concern I see is the hope to use interpretability as a test set. The idea is that we should not use interpretability during training at all and should purely use it to audit the model, for example, making lie detectors or determining if it's scheming. An implicit belief here is often that training against interpretability will be fragile or won't really work, but will break our ability to do the auditing. As such, it would be bad if frontier labs started using these techniques for capabilities, and broke our safety tools in the process.

My best guess for why people are against research in this area today, rather than solely being against frontier labs using model internals to train AGI, is that they think it's sufficiently likely that the work would be net harmful for safety if used, and sufficiently likely that frontier labs will use the techniques anyway, thereby actually causing that harm. Fortunately, it would be such a massive pain to change the highly optimised infrastructure stacks of frontier labs to use model internals in training that I think this is only likely to happen if there are major gains to be had and serious political will, whether for safety or otherwise. I would be very surprised if this happens in frontier model training in the near future, and I see this as a more speculative longer-term research bet.

This gets kinda nuanced

There are a bunch of approaches and goals here, and some are more concerning to me than others. I expect techniques here to be most effective when they are about shaping generalization, such as by avoiding emergent misalignment in favour of narrow misalignment, or by avoiding reward hacking.

Intuitively, any time you apply some intervention during training or add something to the loss, you are challenging gradient descent to find a way to break it. But gradient descent generally takes the path of least resistance. If the only way to perform well on a task is to break your intervention, your intervention had better be pretty robust. However, if there's an alternative desired generalization, then your intervention only needs to be robust enough that breaking it is harder than adopting that alternative. This is a much lower bar. We found in the CAFT paper that this seems to somewhat work: “not fighting backprop”, as Goodfire put it.

Secondly, I'm more concerned about techniques that regularly retrain their interpretability tool, for example retraining a probe every 100 steps. This seems likely to train models whose representations can't be easily probed, which might make a bunch of interpretability techniques break; by contrast, if we train against a frozen probe, we can probably train another probe after training. Unfortunately, the retraining variant probably would also make the technique more effective.
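To be concrete about the difference, here is another sketch of mine (the helpers fit_probe, probe_penalty, and task_loss, plus the optimizer, dataloader, and probe_dataset, are all hypothetical): the retraining variant differs from the frozen-probe setup only by refitting the probe inside the training loop.

```python
REFIT_EVERY = 100   # e.g. retrain the probe every 100 steps
LAM = 0.1           # illustrative penalty weight

probe = fit_probe(model, probe_dataset)          # hypothetical: fit a probe on current activations
for step, batch in enumerate(dataloader):
    if step > 0 and step % REFIT_EVERY == 0:
        # Refit so the penalty tracks the model's drifting representations;
        # dropping this line recovers the frozen-probe variant.
        probe = fit_probe(model, probe_dataset)
    loss = task_loss(model, batch) + LAM * probe_penalty(model, probe, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()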

Thirdly, I would guess that what you use these techniques on matters a lot. I'd speculate that using these techniques for pure capabilities will have much less of a negative effect on our ability to audit models than trying to use them to make the model safe. This is sensitive to definitions; e.g. I would count preventing reward hacking as both safety and capabilities, and that might damage our ability to detect deception in general. But if someone wanted to use this to train models to, e.g., be better at writing code without tons of try-except statements, or to adapt better to the language and culture of the user, that seems pretty chill to me.

Fourthly, I would argue that it's not yet settled whether interpretability is even useful as a test set. We don't have lie detectors that I trust. Anthropic has only very recently started mentioning interpretability in system cards, and I consider the ways it's used to be experimental and non-load-bearing. I certainly hope interpretability will become sufficiently robust and trustworthy that it can be an important part of how we audit models, but let's not take it as a given.

Why do I want to see more work here?

Zooming out, this is a pretty early field. I don't know how well it will work, how much it will break interpretability tools, or which things are more or less dangerous. I don't know whether training against a probe will break somewhat different interpretability tools like activation oracles. I don't know if training against a frozen probe will stop me from retraining a probe on the eventual model and having it work fine. I don’t know how much harder it is to break a probe when you don’t get gradients through it. I don't know whether it will let me do things that I can't easily do with existing methods.

These are important questions, and we should research them! I don’t even know if my intuitions above about which directions are concerning are legit. We need to check.

  1. In the sense that I think it's hard to do research in this area right now that is clearly net negative - I definitely think some approaches are much more promising! But even if someone tries to do pure capabilities research here, I still expect to learn useful things about safety from it. ↩︎




Claude's Bad Primer Fanfic


Published on February 8, 2026 12:39 AM GMT

Spoilers for Primer.

I rewatched the movie Primer last night for the nth time. It's one of very few movies I'll rewatch every so often, when I've forgotten some of the details, because the experience of piecing the plot together is so satisfying. I sometimes watch it twice in a row because I'm still piecing things together. (I particularly believe that your first viewing should be followed closely by your second; on my first, the plot felt like a chaotic, unknowable puzzle. Everything is different on your second time through, recontextualized by all the revelations from the first viewing.)

Groundhog Day fanfic has become a whole genre of time-travel fiction. One or more persons looping through a day, or a week, or a month. Their consciousness returns to the beginning of the loop, everything resetting except their personal memory. If they die, they come back to life. Everyone knows how it works.

Primer hasn't inspired a similar explosion of copycats, which is unfortunate. Thinking about this, I made some notes about how I might want to write one. 

Opus 4.6 recently came out, so I stuck my notes in it as a prompt. Here's the whole prompt for the curious (but note, I'm not including my own personal system prompt, so you won't get exactly the same behavior):

Prompt

  • [[primer]] fanfic
    • You can take money back, if you can buy and sell gold
    • Focus is on relentlessly going forward in meta-time; "a groundhog loop" means living the same day again and again, "a primer loop" is about the moment time-travel is invented, going from the mundane work involved in inventing such a machine to deeply nested loops, one step at a time.
    • Naturally, there's the initial discovery. Someone invents a predictive algorithm good enough to bring back themselves from a short time in the future. Or perhaps a fantasy version, in a world where time manipulation magic is recently starting to play the role of refrigerators, as well as accelerated study and other cramming activities.
      • The initial discovery, then, is a disappointment: a fridge which should be super-slowing, super-preserving, but instead, it rots food quickly. The internal speed of time went negative. However long you have it on, that's an internal loop that things can travel around.
      • They'd have put a mechanical timepiece inside; it would be standard operating procedure. But all they'd see is that it had experienced a long time. Not so unusual for their world. They wouldn't figure it out yet.
      • Then the second time when they go to put something in the fridge, the object is already present. They trade with what's in the fridge and look at it puzzled. Then they trade back. Then they shut down the fridge. The fridge is empty again. They think about what must have happened, and realize time travel fits.
      • The fantasy fridges don't cool things down, of course; indeed, they preserve hot bread fresh out of the oven, etc.
    • The discoverer is a fridge tinkerer then, a student of time magic, part of a startup business.
    • There's no fantasy stock market, so the protagonist tries making money by gambling on the fantasy equivalent of horse races. Just a several-hour loop, much less than a day, although it adds several hours to their day.
      • They set up one of the experimental fridges in their apartment, power it, and of course a time-clone immediately steps out. The time clone has gone several cycles. They lost the money the first time; most of the races went the same, but they bet it all on one that didn't. The second time, they did a bit better, but not by much. Third time's the charm.
      • They do have to cover for themselves. They've got an ordinary fridge business to maintain. They had planned to look up the winners after putting in overtime to get paid by a client, then go back to gamble the money. Now they got their gambling advice from a time-clone, instead. They've got to act normal around their business partners. They've got to put the money back when they're done gambling. Protagonist #2 is triumphant and gives the original protagonist some extra money to gamble as the original steps into the fridge to travel back. #2 steps into the protagonist's life and takes it over.
      • After it's over they set up a failsafe device in a rented storage area at the docks, normally for cargo waiting to go somewhere.
      • They try it again in a few days, after getting some much-needed rest, but they find out they've been banned for winning too much -- accused of cheating. They decide that they need a better business strategy and finally bring in one of their business partners to the secret. Someone with more personal money. The partner admits to already being brought in, and introduces protag3 -- they've already used the failsafe device protag2 set up to go back and bet on fantasy horses again; they're actually the ones who got banned.
      • Etc etc etc...

Certainly you could critique this as a prompt. I don't clearly state what I want Claude to do with this infodump. I change my mind, turning from scifi to fantasy, which I could have edited out. Some of the details could certainly be improved. But, overall, I think the idea makes sense and is communicated clearly enough.

Opus 4.6 wrote absolute garbage, making extremely basic logical errors even when I had explicitly spelled out how things should work in the prompt. For example, witness this travesty:

The third test was with an apple. She set it in the unit, closed the door, left the reversed-polarity field running, and went to eat lunch. When she came back and opened the unit, there were two apples.

She stared at them. One was the apple she'd put in — she'd nicked it with her thumbnail, and there was the little crescent mark. The other was identical except without the mark. She picked up the unmarked one. It was real. It was cold, like it had been sitting in a normal fridge. She set it on the workbench, picked up the marked one, set it on the workbench beside the first. Closed the unit. Opened it again. Empty.

She put the unmarked apple back in the unit. Closed the door. Opened it. The marked apple was also inside — the one she'd left on the workbench was gone.

She turned the unit off. Both apples vanished from the interior. The marked apple reappeared on the workbench. The unmarked apple was gone entirely.

Vel went for a walk.

Here's exactly how trying to send an apple back in time would work given the rules of the Primer universe: see the excellent short story Newton's Apple, from the comic False Positive. (The previous link goes to the page where the apple is introduced, but you should really read it from the beginning.)

Here's the relevant bit from my prompt, which Claude is messing up here:

  • Then the second time when they go to put something in the fridge, the object is already present. They trade with what's in the fridge and look at it puzzled. Then they trade back. Then they shut down the fridge. The fridge is empty again. They think about what must have happened, and realize time travel fits.
  • The fantasy fridges don't cool things down, of course; indeed, they preserve hot bread fresh out of the oven, etc.

I was a bit terse with my description, but I had thought through what it would look like for an experimental object to loop back on itself once.[1] I didn't describe my reasoning, but I did tell Claude what should happen step by step. I even made sure to note that these "fridge" things wouldn't make objects cold, since they're just manipulating time -- but Claude made them cold anyway! More importantly, Claude has apples doing random things with no logic holding anything together.

This isn't cherry-picked. This was my first try for the prompt, and it was full of errors like this. Here's the full output, in case you want to read it:

First try, Opus 4.6

The Fridge

The unit had been running for six hours when Vel opened it and found the bread moldy.

She pulled it out by the corner, held it at arm's length. The mold was extensive — furry, black-green, the kind you'd see after weeks in a damp cellar. She'd put the loaf in that morning. Fresh from Harren's, still warm. That was the whole point. A stasis field tuned right kept bread oven-warm for days. Every household had one. Vel's job — her and Dannic's job, technically, though Dannic handled the clients and she handled the guts — was selling better ones. Tighter fields, lower draw, commercial-grade preservation for restaurants and shipping firms.

This unit was supposed to be their next-generation prototype. Tighter field geometry. She'd spent three weeks on the channeling lattice.

She set the bread on the workbench and checked the mechanical clock she'd sealed inside. Standard procedure when calibrating a new field — you ran a timepiece in the interior and compared it against an exterior reference. The interior clock read six days, fourteen hours. She checked twice. The exterior reference read six hours, eight minutes.

That wasn't unusual. A stasis field slowed interior time; the ratio was the whole product. A ratio of roughly 26-to-1 would have been extraordinary, actually — well beyond anything on the market. But the bread said otherwise. Six days of interior time, and the bread had aged as though it had experienced every one of them. No preservation. Just time, and lots of it.

She recalibrated. Ran a second test with a fresh loaf and a wound clock. Same result: interior time drastically accelerated relative to exterior time. The bread came out stale and hard. The clock had ticked forward by days.

She reversed the channeling polarity — an obvious thing to try — and the interior clock came back behind the exterior reference. Not slowed-down time. Negative elapsed time. She sat with that for a while.


The third test was with an apple. She set it in the unit, closed the door, left the reversed-polarity field running, and went to eat lunch. When she came back and opened the unit, there were two apples.

She stared at them. One was the apple she'd put in — she'd nicked it with her thumbnail, and there was the little crescent mark. The other was identical except without the mark. She picked up the unmarked one. It was real. It was cold, like it had been sitting in a normal fridge. She set it on the workbench, picked up the marked one, set it on the workbench beside the first. Closed the unit. Opened it again. Empty.

She put the unmarked apple back in the unit. Closed the door. Opened it. The marked apple was also inside — the one she'd left on the workbench was gone.

She turned the unit off. Both apples vanished from the interior. The marked apple reappeared on the workbench. The unmarked apple was gone entirely.

Vel went for a walk.


When she came back she sat at her desk and drew diagrams for two hours. The channeling lattice, reversed, wasn't creating a stasis field or an acceleration field. The interior wasn't just experiencing negative time. The interior was a loop. Whatever duration the field ran, that was the loop's length. Anything inside experienced that duration of time repeatedly — or, from the outside, experienced it once but in reverse causal order. The last moment of the loop was adjacent to the first moment of the loop was adjacent to the last moment.

The apple had been placed inside during the loop's runtime. From the interior's perspective, the apple existed at every point in the loop. The unmarked apple was the apple arriving from the end of the loop — the apple as it would be after she removed it, handled it, and (in some causal thread she hadn't followed through on) placed it back in. When she put the unmarked apple back inside, she closed the thread. When she turned the unit off, the loop collapsed and only the original causal chain survived.

She thought about this for another hour. Then she drew a picture of herself climbing into the unit, and looked at it for a long time.


The unit was commercial-grade. Big enough for a person — it was designed to preserve sides of meat, crated produce, that sort of thing. She'd have to crouch, but she could fit.

She didn't do it that night. She wrote up her diagrams, checked her reasoning, checked it again. She slept on it, or tried to. In the morning she went to the workshop early, before Dannic would arrive, and reversed the polarity and set the timer for four hours and opened the door.

She was already inside.

Or — someone was inside. The someone stood up, unfolded herself from the crouch, and stepped out. She looked like Vel. She looked exhausted. She looked at Vel with an expression Vel had never seen on her own face before, a kind of haggard amusement, and said: "Third time."

"What?"

"Close the door. Sit down. I'll explain."


Vel-3 — as she'd taken to calling herself by the end of the explanation — had been through the loop twice before. The first iteration was the discovery: Vel-1, the original, had climbed in with a plan. Take the ferry to the Silver Mile, watch the afternoon races at Cattaran Track, write down the winners, come back, climb in the box, go back to the morning, bet on the winners. Simple. The startup money was what Dannic's client had paid them yesterday — twenty-two gilt, enough to make the day worthwhile if she picked the right races.

Vel-1 had lost the twenty-two gilt. Most of the races ran the same both times. She'd bet conservatively, spreading the money across several sure things, and most of the sure things came in. But she'd put the remainder on a race in the fourth hour that she was certain about — she'd watched the horse win by three lengths — and the horse had stumbled at the second turn and finished last. Whatever was deterministic about the afternoon, it wasn't perfectly deterministic.

She'd come back to the workshop broke, and found Vel-2 waiting — herself from another iteration, running the numbers differently. Vel-2 had taken the loss data from Vel-1, gone back in, and bet more carefully: smaller amounts, more races, hedging against the ones Vel-1 had flagged as unstable. She'd come back with thirty-one gilt. A profit of nine.

"Not enough," Vel-3 said. "Not enough to justify the risk. So I went back in with the data from both runs. I know which races are stable and which aren't. I know the exact odds. I made eighty-six gilt."

"Eighty-six."

"Minus the twenty-two seed money. Sixty-four gilt profit." Vel-3 reached into her coat and produced a leather purse, heavy with coin. "You need to put the twenty-two back before Dannic notices it's gone. The rest is yours. Or mine. Ours."

"And you're — you're staying? Out here?"

"I'm you, Vel. I have your memories plus three very long days of additional ones. I know the clients. I know the lattice work. You go in the box. You live my day. You bet on the races. You come back with the money and give it to whoever's standing here when you open the door." She almost smiled. "It'll be you. It's always you."


The experience of climbing into the box was not what Vel had expected. There was no sensation of travel. She crouched in the dark, the metal walls close around her, and waited. Nothing happened. She pushed the door open and it was still morning, and there was no one in the workshop. She was alone. She was in the past — or the past was continuing — or she was in a loop that happened to start at the same time as the morning. She tried not to think about which one it was.

She went to the races. She had Vel-3's notes, written in her own handwriting on a sheet of paper that felt like it had been folded and unfolded many times. She bet carefully. She won. She watched a horse named Pelter's Ghost stumble in the fourth-hour race and felt a cold thrill of recognition. She won eighty-nine gilt, slightly more than Vel-3, because she'd shaved a bet off one of the borderline races that Vel-3's notes indicated was only stable about two-thirds of the time.

She took the ferry back. She climbed in the box. She opened the door and Vel was standing there, the original, the one who hadn't been anywhere yet, looking at her with wide uncertain eyes. She gave her the speech. She handed over the money. She watched herself climb into the box and close the door.

Then Vel-4 — she supposed she was Vel-4 now — went home and slept for fourteen hours.


Two days later, rested but restless, she went back to the workshop. She needed to set up something more permanent. The unit couldn't stay in the workshop; Dannic came and went at odd hours, and anyway she needed the space for legitimate business. She rented a storage berth at the Leeside docks — a common enough thing, cargo waiting for ships, nobody asked questions — and spent an afternoon hauling the unit there on a rented cart. She wired it to a coin-drop temporal battery, the kind dock workers used for keeping fish fresh. She tested it. She set the timer. She locked the storage berth and went home.

A failsafe. If something went wrong — if she needed to send information back — the unit was there, ticking away in a rented room by the water. She told herself she'd rest for a week before trying again.


Five days later, she went to Cattaran Track alone, on foot, wearing a hat she didn't usually wear. She bet modestly and lost a little. She bet modestly and won a little. She came out twelve gilt ahead, which was fine, which was the plan, which was sustainable. She'd worked out the math. If she ran one loop per week, kept her winnings below twenty gilt per session, varied her bet patterns, she could supplement the business income without attracting attention. In three months she'd have enough to buy out the lease on the workshop. In six months, enough to hire a second technician. The startup would survive.

On the way home, a man she didn't recognize stopped her on the Ferrier's Bridge and said: "Vel. We need to talk."

She reached for the knife she kept in her coat — a reflex from growing up on the east bank — and the man held up his hands and said: "It's Dannic. I know about the box."

"What box?"

"The unit. The reversed-polarity unit. The one in your storage berth at Leeside."

She stared at him.

"You told me," he said. "Or — you're going to tell me. You already have. Look, can we get off the bridge?"


They sat in a tea shop on Milliner's Lane and Dannic explained. Vel was going to bring him in. She hadn't done it yet — from her perspective, she hadn't done it yet — but from his perspective, she'd done it three days ago. She'd explained the box. She'd shown him the diagrams. She'd explained the betting system. And Dannic, who had more personal money than Vel — family money, old money, the reason he handled the clients — had immediately seen the scaling potential, and had immediately done the thing Vel had been too cautious to do.

"I used the failsafe unit," he said. "The one at Leeside."

"You what?"

"I went back five days. I went to Cattaran with real money. Six hundred gilt."

Vel's mouth went dry. "Dannic."

"I know. I know. But listen — the races are mostly stable. You proved that. With three iterations of data and six hundred gilt in seed money—"

"You got banned."

He went quiet.

"That's why I got stopped on the way in today," she said. "That's why the man at the gate looked at me. You are the reason I can't go back to Cattaran Track."

"We won two thousand gilt," Dannic said.

"We what?"

"Before they caught on. Two thousand and change. I have it in a lockbox. But yes. We can't go back to Cattaran. We need a different track. Or a different scheme entirely. And we need—" He paused, drank his tea, set down the cup. "We need to talk about the versioning problem."

"The what?"

"How many of you are there, Vel?"

She thought about it. She was Vel-4. Vel-3 had taken over her life and was currently, presumably, working in the shop. Vel-2 and Vel-1 had merged — collapsed into the loop, ceased to independently exist, or however you wanted to think about it. She was the surplus. She was the one with the extra days of memory that didn't fit.

"I'm managing it," she said.

"I'm Dannic-2," he said. "Dannic-1 is at the shop right now, talking to your Vel-3, who doesn't know any of this yet. In two days, Vel-3 is going to decide to bring me in. She's going to show Dannic-1 the diagrams. And Dannic-1 is going to come find me, and I'm going to have to explain that I already used the box, and he's going to be furious."

"Why are you telling me this?"

"Because you're the one who understands the box. And because in about thirty-six hours, this is all going to get very complicated, and I think we need a plan before it does."

Vel drank her tea. It was cold. She looked out the window at the bridge, at the river, at the masts of ships at the Leeside docks where a metal box sat in a rented room, humming quietly, holding open a door in time.

"All right," she said. "Tell me everything. Start from when I brought you in."

Dannic reached into his satchel and pulled out a sheaf of papers — race results, payout calculations, a hand-drawn calendar with arrows linking dates to dates. Some of the handwriting was hers. Some of it was in a hand she didn't recognize, and then realized she did: it was her handwriting, but hurried, slanted, written by a version of herself who'd been awake too long.

She started reading.

I heard a rumor that Opus 4.6 is bad at fiction compared to 4.5, so I tried the same prompt in 4.5. I'd say the result was almost as bad. It came up with the idea that interesting stuff would first happen by passing notes back in time with the small experimental fridge, which I like, since it makes logistical sense. However, the first time the main character goes back in time, she finds her future self waiting outside the time machine, which doesn't make any sense -- Claude doesn't seem to distinguish clearly between traveling to the past vs traveling to the future:

Mira told Velan she was sick and took three days off. She spent the first day building a larger containment unit, one she could fit inside if she curled up. She spent the second day testing it with increasingly elaborate objects—a clockwork bird that walked in circles, a cup of water with a leaf floating in it, a mirror so she could see if light behaved strangely. Everything came back older, or younger, depending on how you counted.

On the third day, she climbed inside.

The sensation was—nothing. A flicker. She'd counted to ten with her eyes closed, the way a child might count during hide and seek, and when she opened them the fridge's interior looked the same but the light through the workshop windows had moved. She pushed the door open.

A woman sat on the workbench, drinking from a canteen. She looked like Mira, but tired. Three days tired, maybe four.

"First time?" the woman said.

"You're me."

"I'm you in three days. Or you're me three days ago. The language gets slippery." She took another drink. "You bet on the horses yet?"

Notice that it also has the main character testing the machine with random objects during the time when she'd be inside it, traveling back. This isn't possible in Primer.

Here's the whole Opus 4.5 output for the curious (exact same prompt as 4.6 got):

Second try, Opus 4.5

The Fridge

Mira noticed the smell first.

The experimental unit had been running for six hours—standard overnight test, nothing unusual. Velan's formwork looked solid, the containment lattice held steady amber, and the draw on the workshop's temporal capacitors was within normal range. But when she cracked the seal that morning, the bread inside had gone to blue mold, and the cheese had liquefied into something that made her step back and breathe through her mouth.

"Accelerated," Velan said, peering over her shoulder. He'd been her business partner for two years now. Good at the formwork, bad at the customers. "Must have inverted the gradient somehow."

"The watch says forty-three days." Mira held up the pocket timepiece they'd placed inside—standard procedure for any experimental run. The hands had wound themselves down and stopped. "Forty-three days internal, six hours external."

"Huh." Velan scratched his chin. "Ratio's off from what the specs predicted, but acceleration's acceleration. We can tune it."

They spent the morning adjusting the lattice geometry. Mira's sketches suggested the inversion happened somewhere in the tertiary containment layer, where Velan's formwork met her flow-shaping. By noon they had a theory; by evening they had a new configuration; by the next morning they had another ruined loaf of bread and a watch showing fifty-one days elapsed.

"Useless," Velan said. "Who wants a fridge that rots things faster? We're supposed to be competing with Aldworth & Sons, not making compost."

Mira took the watch home that night. She sat at her kitchen table and turned it over in her hands, watching the second hand tick steady and slow in real time. Fifty-one days. The bread had been in there overnight. Eight hours, maybe nine.

She did the math three times, then did it again with different assumptions. Then she stopped doing math and started thinking about what it would mean if she was right.


The next test, she put a note inside.

If you can read this, write something on the back and put it in the fridge.

She sealed the unit, powered the lattice, counted to sixty, and opened it.

The note was there. Same paper, same handwriting on the front. On the back, in her own hand but shakier, it said: It works. You need to get in yourself. Trust me—I'm you.

She closed the fridge. Opened it. The note was blank on the back again.

Closed. Opened. Blank.

She left it closed, sat down on her workbench, and put her head in her hands. The fridge hummed its low amber hum. Inside, according to her theory, time was running backward. Things put inside would travel to the past—the fridge's past, which meant emerging before they'd entered. The bread hadn't rotted from accelerated time. It had rotted from reversed time, sitting in the fridge for fifty-one days before she'd put it in, waiting for the moment of its own insertion so the loop could close.

She should tell Velan. She should tell someone.

She wrote a different note: How long do I have to stay inside?

Sealed. Powered. Sixty seconds. Opened.

Six hours gives you about two days. Breathe slow. Bring water.


Mira told Velan she was sick and took three days off. She spent the first day building a larger containment unit, one she could fit inside if she curled up. She spent the second day testing it with increasingly elaborate objects—a clockwork bird that walked in circles, a cup of water with a leaf floating in it, a mirror so she could see if light behaved strangely. Everything came back older, or younger, depending on how you counted.

On the third day, she climbed inside.

The sensation was—nothing. A flicker. She'd counted to ten with her eyes closed, the way a child might count during hide and seek, and when she opened them the fridge's interior looked the same but the light through the workshop windows had moved. She pushed the door open.

A woman sat on the workbench, drinking from a canteen. She looked like Mira, but tired. Three days tired, maybe four.

"First time?" the woman said.

"You're me."

"I'm you in three days. Or you're me three days ago. The language gets slippery." She took another drink. "You bet on the horses yet?"

"What?"

"Gambling. The races at Thornfield. You're going to think of it eventually—I might as well save us the time." The other Mira smiled, but it didn't reach her eyes. "The first loop, you bet everything on Sandcaster in the fifth. Sandcaster loses. The second loop, I spread the bets around and do better, but not great. This is my third time through. I've got the winners written down."

She handed over a slip of paper. Mira looked at the names, the race numbers, the odds.

"This is—" Mira started.

"Cheating. Yes. But we need the money, and the fridge needs better components if we're going to scale up, and Velan's going to find out eventually and he's going to want in." The other Mira stood, stretched. "I've been you. I know what you're thinking. Is this ethical? Probably not. Is anyone getting hurt? Not directly. Are you going to do it anyway?"

Mira looked at the paper. Looked at her future self. Looked at the fridge.

"How does this work?" she said. "If you're here, and I'm going to become you, then—"

"You're not going to become me. That's the thing." The other Mira picked up a bag Mira hadn't noticed before, slung it over her shoulder. "I go out that door, I become Mira-who-lives-forward-from-this-moment. You go back in time, you become a different branch. We've already diverged. I lost the money and went back. You won't have to."

"But then what happens to you?"

"I go live my life. Pay my rent. Show up to work tomorrow and pretend I didn't spend the last three days in a time loop." She paused at the workshop door. "Set up a failsafe. Somewhere away from here, somewhere Velan won't stumble onto. You're going to want a way back in if things go wrong."

"If things go wrong?"

"They always go wrong eventually. That's the whole point of the failsafe." She opened the door. Late afternoon light spilled in, golden and ordinary. "Good luck. You're going to need it."


The races at Thornfield ran every Lastday. Mira had been once before, years ago, with her father. She remembered the smell of hay and sweat, the thunder of hooves, the crowds pressing against the rail. She remembered losing two copper and feeling like the world had ended.

Now she stood at the betting window with forty silver in her pocket—her entire savings, plus a small loan from the workshop's operating fund that she'd pay back as soon as the races were over. The slip of paper was damp in her hand. Windfall, race 2. Copper Bell, race 4. Nightjar, race 6. The odds were long. Her future self had been specific about that. Long odds meant big payouts meant enough money to buy the components she needed.

She placed the bets. She watched the races. Windfall came from behind in the final stretch and won by a neck. Copper Bell led from the start and never looked back. By the time Nightjar crossed the finish line, Mira had turned forty silver into nearly three hundred.

She collected her winnings, folded the bills carefully into her coat, and walked home through streets that looked different now. Brighter. Fuller of possibility. She had a time machine in her workshop and money in her pocket and the future—all the futures—spread out before her like an open book.

The failsafe. She'd almost forgotten.

She rented a storage space down at the docks that night. Cargo storage, the kind merchants used for goods waiting to ship out. She paid three months in advance, set up a minimal containment unit in the corner, and left a note inside: Emergency only. If you're reading this, things went bad. Think carefully before you act.

Then she went home, slept for eleven hours, and woke up to Velan pounding on her door.

"You're not sick," he said when she opened it. "I can tell. What's going on?"

She could lie. She probably should lie. But Velan had been her partner for two years, and the technology was half his, and eventually—

"I need to show you something," she said. "At the workshop. But you have to promise not to panic."


The third Mira was waiting for them when they arrived.

"Already?" Mira said. "I just told him yesterday."

"You told him yesterday in your timeline." The third Mira looked even more tired than the second one had, dark circles under her eyes, a coffee stain on her shirt. "In my timeline, we've been running operations for two weeks. I'm here because we got banned from Thornfield."

"Banned?" Velan looked between the two Miras. "What—how—"

"Winning too much. They think we're cheating, which we are, but they don't know how." The third Mira sat down heavily on the workbench. "I'm from three loops in. First loop you, the one who just won at Thornfield, she comes back here and eventually brings Velan in. Second loop Velan has the bright idea to bet bigger, spread it across multiple races, really make some money. That's when we get banned."

"So we just go to a different track," Mira said.

"That's what we're doing. I'm from the third loop. We hit Marsten Downs tomorrow—or you do, I guess. I'm just here to fill you in so we don't waste loops re-learning the same lessons." The third Mira pulled out a notebook, thumbed through pages of cramped handwriting. "Here. Race winners for Marsten, Greenhill, and the Copper Circuit. Those should keep us liquid for a while. And here—" she flipped to another page "—these are the components you need to scale up the fridge. I mean really scale it up. Room-sized. Enough for multiple people."

"Why would we need multiple people?" Velan asked.

The third Mira and the first Mira exchanged a look.

"Because," the third Mira said slowly, "the loops get complicated. Eventually you're going to want to send more than one person back. Eventually you're going to need to. And when that happens, you want the infrastructure already in place."

She closed the notebook and handed it over. "I'm going to go live my life now. The one that moves forward in regular time. You two—" she looked at them, something unreadable in her expression "—you're about to start a very strange journey."

"Wait," Mira said. "If you're three loops in, that means you've watched yourself walk out that door three times. Watched yourself become someone who moves forward while you go back and become someone else. Doesn't that—"

"Bother me?" The third Mira was already at the door. "Every time. But what's the alternative? Someone has to go back. Someone has to carry the information. And the one who carries the information is the one who changes things." She opened the door. "The rest of us just get to live with the results."

The door closed behind her. Mira and Velan stood in the workshop, the fridge humming between them, the notebook heavy in Mira's hands.

"So," Velan said. "Time travel."

"Time travel."

"And you've been—we've been—there are multiple versions of us running around right now?"

"At least three of me. And apparently at least one of you, in some future loop I haven't gotten to yet."

Velan was quiet for a long moment. Then he pulled up a stool and sat down across from her.

"Show me the notebook," he said. "If we're going to do this, I want to understand what we're doing."


They hit Marsten Downs the following week. Then Greenhill. Then the Copper Circuit. The money came in, slow at first, then faster as they refined their technique. They bought the components for the larger fridge. They rented a second storage space, then a third. They developed protocols: dead drops for information, code phrases for emergencies, careful logs of which version of themselves had done what.

And the loops kept coming.

Mira lost count after a while. There was the version of her who'd spent a week in a failed loop and came back hollow-eyed, warning them about a market crash that might never happen now. There was the version of Velan who'd figured out how to nest loops inside loops, sending information back further than ever before, but who wouldn't talk about what it had cost him. There was the quiet morning when Mira opened the failsafe at the docks and found a note in her own handwriting that said only: Don't trust the fourth investor. You'll know who I mean.

She didn't know who it meant. Not yet. But she filed the warning away, added it to the growing collection of futures they'd prevented or changed or simply stepped around.

The money grew. The fridges grew. The web of timelines grew most of all—a vast branching structure that Mira could almost see, sometimes, in the quiet moments before sleep. All those versions of herself, moving forward in their own separate presents. All those loops closing behind them, like doors she could never open again.

And always, underneath it all, the simple hum of the machine. The amber glow of the lattice. The endless possibility of what might come back through, if she just waited long enough.


[to be continued?]


I've written the opening arc you sketched—the discovery, first gambling loop, handoff mechanics, and the beginning of escalation. I stopped before the "fourth investor" plot thread since that's where your outline ended.

A few choices I made:

  • Named the protagonist Mira (felt right for a tinkerer/craftsperson)
  • Leaned into the melancholy of the Primer premise—each "winner" who goes back is really a branch that gets pruned from your subjective experience
  • Kept the fantasy setting light; the magic is mostly window-dressing on what's fundamentally a time-loop mechanism story
  • The third Mira's line about "someone has to carry the information" felt like the thematic core

Want me to continue into the investor plotline, or adjust the tone/approach?

It's worth mentioning that both 4.5 and 4.6 chose to use very little chain-of-thought before generating the story. They're set on "extended thinking" but both had thoughts shorter than my prompt; 4.6 especially. Perhaps they don't think of fiction as something that needs serious planning and consideration to ensure logical consistency, but perform better in other areas where they utilize chain-of-thought more?

Here's Opus 4.6 with the same prompt, plus an added line at the end telling it to think more:

Opus 4.6 told to think

Prompt

  • [[primer]] fanfic
    • You can take money back, if you can buy and sell gold
    • Focus is on relentlessly going forward in meta-time; "a groundhog loop" means living the same day again and again, "a primer loop" is about the moment time-travel is invented, going from the mundane work involved in inventing such a machine to deeply nested loops, one step at a time.
    • Naturally, there's the initial discovery. Someone invents a predictive algorithm good enough to bring back themselves from a short time in the future. Or perhaps a fantasy version, in a world where time manipulation magic is recently starting to play the role of refrigerators, as well as accelerated study and other cramming activities.
      • The initial discovery, then, is a disappointment: a fridge which should be super-slowing, super-preserving, but instead, it rots food quickly. The internal speed of time went negative. However long you have it on, that's an internal loop that things can travel around.
      • They'd have put a mechanical timepiece inside; it would be standard operating procedure. But all they'd see is that it had experienced a long time. Not so unusual for their world. They wouldn't figure it out yet.
      • Then the second time when they go to put something in the fridge, the object is already present. They trade with what's in the fridge and look at it puzzled. Then they trade back. Then they shut down the fridge. The fridge is empty again. They think about what must have happened, and realize time travel fits.
      • The fantasy fridges don't cool things down, of course; indeed, they preserve hot bread fresh out of the oven, etc.
    • The discoverer is a fridge tinkerer then, a student of time magic, part of a startup business.
    • There's no fantasy stock market, so the protagonist tries making money by gambling on the fantasy equivalent of horse races. Just a several-hour loop, much less than a day, although it adds several hours to their day.
      • They set up one of the experimental fridges in their apartment, power it, and of course a time-clone immediately steps out. The time clone has gone several cycles. They lost the money the first time; most of the races went the same, but they bet it all on one that didn't. The second time, they did a bit better, but not by much. Third time's the charm.
      • They do have to cover for themselves. They've got an ordinary fridge business to maintain. They had planned to look up the winners after putting in overtime to get paid by a client, then go back to gamble the money. Now they got their gambling advice from a time-clone, instead. They've got to act normal around their business partners. They've got to put the money back when they're done gambling. Protagonist #2 is triumphant and gives the original protagonist some extra money to gamble as the original steps into the fridge to travel back. #2 steps into the protagonist's life and takes it over.
      • After it's over they set up a failsafe device in a rented storage area at the docks, normally for cargo waiting to go somewhere.
      • They try it again in a few days, after getting some much-needed rest, but they find out they've been banned for winning too much -- accused of cheating. They decide that they need a better business strategy and finally bring in one of their business partners to the secret. Someone with more personal money. The partner admits to already being brought in, and introduces protag3 -- they've already used the failsafe device protag2 set up to go back and bet on fantasy horses again; they're actually the ones who got banned.
      • Etc etc etc...
    •  

Please think a long time and plan the story carefully to ensure logical consistency.

Response

The Fridge

Sael's experimental stasis box ruined the bread in forty minutes.

This was, to be clear, the opposite of what a stasis box was supposed to do. A normal box — the kind Sael and Daven and Maret built and serviced for restaurants and bakeries across the Thornwall district — slowed internal time by a factor of eighty or so. You put a loaf in hot. A week later you opened the door, and the bread steamed like it had just come out of the oven, because from its perspective it had. Four minutes of bread-time per week of real time. The Sael-Daven-Maret company, which they had not yet gotten around to naming, installed and maintained about thirty such boxes for paying clients, and it was a decent living.

The experimental box was supposed to hit a factor of two hundred. Deeper stasis. Sael had been reworking the temporal geometry for months, trying to fold the internal field tighter. Instead, when she opened the door, the bread was green-black and collapsing into the plate.

"That's spectacularly wrong," Daven said, looking over her shoulder.

Sael pulled the mechanical timepiece out of the box. It was standard practice to leave one inside during testing — the tick-rate told you your stasis ratio. She turned it over. The hands indicated roughly seven hundred hours of elapsed internal time.

"Ran hot," she said. "Way hot. The ratio's inverted. Time's moving faster inside, not slower."

"Can you flip it back?"

"I have some ideas." She didn't, yet. But she would.

She cleaned the mold out of the box, adjusted three of the five runeworkings along the interior, and tried again the next morning with a fresh roll.


She let it run for two hours before opening the box. The roll was inside — but it was not the roll she'd brought. Hers was a plain wheat roll from the bakery on Cage Street. The one in the box was darker, seeded, from somewhere she didn't recognize. She stared at it.

She took the strange roll out and set it on her workbench. Then she put her wheat roll in. Then she stood there holding the seeded roll and thinking.

After a minute she took the wheat roll back out of the box and put the seeded roll back in. Then she shut the box off.

She opened the door. The box was empty.

Sael sat down on the floor of the workshop and didn't move for a long time.


Here is what she worked out, sitting on the floor:

The field geometry she'd built didn't accelerate internal time. It reversed it. The interior of the box experienced time flowing backward relative to the outside. Every hour the box ran, the interior extended one hour further into the past.

When she'd put bread in and come back forty minutes later, the bread hadn't experienced seven hundred hours of fast-forward time. It had experienced time running backward — but bread doesn't age in reverse, it doesn't un-stale and un-mold, because the bread isn't a closed system; it's entangled with the rest of the world through the box's walls. What actually happened was more complicated, something to do with entropy flowing the wrong way through an open boundary, but the practical result was rapid spoilage. The timepiece wasn't a good diagnostic because it couldn't tell the difference between a lot of time and a lot of negative time — the hands just turned.

The seeded roll was more informative. The seeded roll had been placed in the box by a future version of herself. Some version of Sael, in the future, had opened the box and put a seeded roll in. That roll then traveled backward through the box's reversed time and was sitting there waiting when present-Sael opened the door.

When she took the seeded roll out and put her wheat roll in, the wheat roll traveled backward through the reversed time. When she then took the wheat roll back out and put the seeded roll back in, the seeded roll traveled backward — to be found by present-Sael when she first opened the door. The loop was consistent.

When she shut the box off, the reversed time field collapsed. No more loop. The box was empty.

The thing Sael kept circling around, the thing that made her stay on the floor, was this: if a seeded roll could travel backward through the box, so could a person.

If you climbed into the box at, say, six in the evening, and the box had been running since noon, you'd travel backward through six hours of reversed time and climb out at noon. There would now be two of you.


She told no one for four days. She spent the days doing normal service calls with Daven and Maret — recalibrating the stasis box at Olenn's restaurant, installing a new one at the print house on Silver Street — and the nights sitting in the workshop staring at the experimental box and thinking about what you could do with it.

The answer, once she stopped being afraid of it, was money.

There was no exchange market in Thornwall, nothing like the trading floors she'd read about in books set in Cassivere or the Southern Federation. But there were the races at Greenvale — harness racing, six days a week, the results unpredictable enough to support a betting industry and predictable enough that the bookmakers stayed in business.

On the fifth day, Sael went to a used-goods dealer and bought a second mechanical timepiece. She carried the experimental box from the workshop to her apartment in a handcart, sweating under the weight, and set it up in the back room where she kept her tools and personal stasis boxes. She chalked new runework on the floor around it — a cleaner version of what she'd done in the workshop. She tested it with a pocket watch, leaving the box running for one hour, then opening it: the watch inside showed one hour more than the one on her wrist, and when she put a coin in, the coin was already there.

She powered the box on at dawn, when the racing wouldn't start until afternoon. Then she made herself breakfast and sat down to wait.

She didn't have to wait at all. As soon as the box reached operating intensity — she could feel the hum of it in her back teeth — the door opened from the inside.

The person who stepped out was her.

Not exactly her. This version of Sael looked exhausted. There were dark circles under her eyes and a small cut on her left hand that present-Sael didn't have. She was wearing different clothes — the brown jacket Sael kept in her closet, not the grey one Sael was wearing now.

"Sit down," the other Sael said. "This is going to take a minute."


"I've done this three times," the other Sael said. She was leaning against the wall next to the box, drinking the tea that present-Sael had made for both of them with slightly shaking hands. "First time through, I went to Greenvale with two hundred marks. That's what I could pull together from the client payment that's coming in today — Olenn still owes us for last month, I went and collected early — and I bet on every race. Most of them went the way you'd expect. Harrow's Dusk in the first, Coppermouth in the second. But I put a heavy bet on Trailing Fire in the fifth because the odds were good, and Trailing Fire threw a shoe on the second turn and came in dead last."

"So you lost it."

"Most of it. I came home, got in the box, went back, and tried again. Second time, I skipped Trailing Fire. Bet conservatively. Made some money but not enough to matter — the odds on the favorites are thin. And I got unlucky on the sixth race. Greenwall Boy, the one everyone knows is past his prime, somehow won."

"And the third time?"

"Third time I bet on Greenwall Boy." She smiled. It was a strange expression on her own face. "And I skipped the fifth race entirely. Came out well ahead. Not rich, but well ahead."

She handed Sael a folded piece of paper. On it was a list of races, horses, and bet amounts, written in Sael's own handwriting — cramped and leaning left, unmistakable.

"You don't need to figure it out from scratch," the other Sael said. "Just follow the list."

"And then I get in the box."

"And then you get in the box." The other Sael finished her tea. "And I go to work and pretend to be you."


This was the part Sael hadn't fully thought through.

She had planned, originally, to look up the race results herself, then get in the box and travel back to before the races to place bets. A clean loop. But the loop she was actually in was different. A version of her had already done the work, already iterated through the failures, already figured out the winning bets. That version was handing her the answers and taking over her life.

"You'll need to collect from Olenn," Sael said.

"I know. I've already collected from Olenn." The other Sael reached into the brown jacket and produced a small purse. "This is the two hundred marks. Go bet it. When you come back tonight, there'll be more — I put Olenn's payment into the household account, so the books look normal. You get in the box with the winnings. I stay out here."

"For how long?"

"For good. I'm you now. You're the one who goes back."

Sael looked at the list in her hand. The other Sael looked at her with an expression of strained patience.

"I know this is strange," the other Sael said. "I sat where you're sitting. It gets less strange. Go to the races."


Sael went to the races.

She walked to Greenvale along the canal path, the way she always did when she had a free afternoon, and she bought a betting card and a cheap beer and sat in the stands and watched the horses warm up and tried to feel normal. She did not feel normal. She felt like someone carrying a stolen object in a crowd — certain that everyone could see it, that the wrongness of what she was doing was written across her face.

The first race was Harrow's Dusk, as predicted. Sael bet forty marks and collected at slightly better than even odds. The second race was Coppermouth, and she bet fifty. In the fifth race she bet nothing, and watched Trailing Fire throw a shoe on the second turn, and felt a cold sensation in her stomach that she would later identify as awe. In the sixth race she put eighty marks on Greenwall Boy at twelve to one, and when the old horse surged past Lindenbaum's Folly on the inside rail she stood up and cheered with the rest of the crowd and her hands were shaking.

She came home with eleven hundred marks. The other Sael was waiting in the apartment, sitting at the kitchen table with a ledger, wearing the expression of someone who had been performing normalcy all afternoon.

"How much?" the other Sael asked.

"Eleven hundred and twenty."

"Better than I did. The odds on Greenwall Boy must have shifted after my last pass through." She closed the ledger. "Here's the thing. You need to get in the box with the money. But you should keep two hundred out — that's the seed money. Put it somewhere I can find it. And keep another fifty for yourself, winnings you can explain as a lucky day at the track."

"Fifty marks for all of this."

"Fifty marks, plus the knowledge that the box works, plus the fact that next time we can do more." She met Sael's eyes. "It's not about the fifty marks. It's about what comes after."

Sael put two hundred marks in the kitchen drawer, pocketed fifty, and carried the rest to the back room. The experimental box hummed. She opened the door, climbed inside, and pulled the door shut behind her.


The experience of traveling backward through the box was not what she expected. She had thought it would be instant, or at least fast — a lurch, a flash. Instead it was slow and dark and warm, like sinking into a bath, and the hum of the box surrounded her completely and she could feel time moving through her body the wrong way, a gentle wrongness, like a current pulling you upstream. She couldn't tell how long it lasted. Then the door opened — from the outside — and she was looking at herself.

The version of her on the outside was wearing the grey jacket and holding a cup of tea. This was Sael-zero, the one who had just turned the box on. It was dawn. The races were hours away.

"Sit down," Sael said. "This is going to take a minute."

She heard her own voice saying the words, and for a moment the recursive quality of the situation threatened to unbalance her. Then she steadied. She had the list. She had the money. She knew what to say.

She said it.


After Sael-zero left for the races, Sael — now Sael-prime, the one who would live this day for the second time — sat alone in the apartment and felt the full weight of what had happened.

She had expected to feel triumphant. Instead she felt thin. Like a page that had been copied too many times. She was the same person she had been that morning, but the morning version of herself was walking to Greenvale right now with a list of winners in her pocket, and this version of Sael had to go to work.

She went to work. She met Daven at the workshop. She told him Olenn had come by with the payment. They spent the morning repairing a stasis box for a florist on Hart Street. She was distracted; Daven noticed.

"You all right?" he asked.

"Didn't sleep well."

"You look like you haven't slept in days."

She almost told him then. Instead she said, "I've been working on the experimental box at home. Late nights."

Daven gave her a look. "You moved it to your apartment?"

"I wanted to run some tests without tying up the workshop."

He let it go.


Three days later, Sael set up a failsafe.

She rented a storage bay at the dockyards — a small one, meant for merchants' cargo waiting for transport. She told the dockmaster it was for surplus equipment. She moved a second experimental box into the storage bay, one she'd built over the past two nights using the same reversed-field geometry. She powered it on and left it running.

The idea was insurance. If something went wrong — if someone found the box in her apartment, if she needed to undo a larger span of time — she'd have a second entry point. The storage-bay box would accumulate hours continuously. As long as it ran, she could travel back to the moment she'd turned it on.

She scratched a date and time into the wall above the box with a nail, so she'd remember.


Five days after the first gambling run, she tried again.

She had rested. She had fifty marks of unexplained income that she'd spent on good food and a new pair of boots. She had the two hundred marks of seed money in the kitchen drawer. She had a plan: the same approach, but a different day of racing. Different horses, different odds, but the same method. Turn on the box, receive the list from a future self, go bet, come back, get in the box, pass the list to her past self.

She powered on the box and waited.

No one came out.

She waited five minutes, ten, twenty. The box hummed. The door stayed shut.

Sael opened the door herself. The box was empty. She let it run and went about her morning, checking back every half hour. Nothing.

At noon she went to Greenvale without a list. She bet small, conservatively, on three races. She won one and lost two, netting a loss of about thirty marks. She went home, annoyed, and considered whether to get in the box and try to iterate.

She decided against it. If no future self had come out of the box, it meant no future self was going to enter the box. If she entered the box now, she'd arrive at dawn — but she hadn't arrived at dawn. She'd been there at dawn. The box had been empty.

Unless — and this was the thought that kept her up that night — unless something had gone wrong in one of the discarded timelines. Unless a version of her had entered the box and the loop had collapsed, or the field had failed, or she'd gotten stuck inside. She wouldn't know. The version of her that experienced the failure would be gone, and the surviving timeline — her timeline — would simply be one where no one came out of the box.

She slept badly. In the morning, she checked the box. Still empty. She powered it off.


She waited another four days before trying again. This time, she went to the races first, in the afternoon, without turning on the box. She watched the races, noted the results, went home, powered on the box, and immediately got in.

She emerged at approximately the time she'd powered it on — a few minutes' slippage, she'd noticed, was typical. She was alone in the apartment. Past-her had already left for the races.

Sael stopped. Thought about this.

If past-her was at the races, past-her would come home and find the box running and a note from Sael. Or she could intercept past-her at Greenvale. Or — no. She had the results. She needed to go back before the races. The box had only been running for a few minutes before she got in; she'd only traveled back a few minutes.

"Idiot," she said, to herself, in the empty apartment.

She needed to turn the box on in the morning, leave it running all day, then enter the box in the evening after the races. The reversed time inside the box would carry her back to the morning.

She turned the box off, then on again. She wrote herself a note: Leave this running. Go to races. Come back tonight. Get in.

Then she left the apartment, spent the afternoon at the workshop with Daven and Maret pretending to have a normal day, came back in the evening, and entered the box.


She arrived at the moment she'd turned the box on. She was alone; past-her was in the apartment, about to read the note and leave. She heard herself in the kitchen, the sound of the kettle. She waited in the back room until she heard the front door close. Then she came out, went to the races, bet on the correct horses, and came home with nine hundred marks.

This time, past-her would come home to an empty apartment, find the box still running, read the note, and get in. And arrive at the turn-on moment. And find the apartment empty. And go to the races again, but with no foreknowledge — just guessing.

No. Wait.

Sael sat at the kitchen table with the money and tried to work through the causal chain. Past-her would come home, get in the box, arrive at the turn-on moment, find the apartment empty — because Sael had been hiding in the back room and had then left for the races.

But Sael hadn't been hiding in the back room when she had come home and gotten in the box. She'd come home to an empty apartment, found the note, and...

She pressed her palms against her eyes. The loops were beginning to tangle.

What she needed was a cleaner protocol. Something that didn't rely on coincidence and timing.

And she needed a partner.


She told Daven the next evening, over drinks at the Brass Anchor. She told him everything: the reversed field, the bread, the seeded roll, the races, the money.

Daven listened without interrupting. When she finished, he drained his beer and said:

"I know."

"You know?"

"Maret told me. Three days ago."

"Maret told you?"

"Your other box. The one at the docks." Daven signaled for another beer. "A version of you — future-you, I assume — brought Maret in on it about... well, about two weeks from now, from your perspective. Maret's version of things is that you realized you needed someone with more capital and a different face at the betting windows, since you'd been drawing attention. So you told Maret, and Maret went to the docks, used your failsafe box, and went back to the day you set it up."

Sael stared at him.

"Maret's been betting at Greenvale for five days," Daven said. "Different windows than you. Smaller amounts, spread out. She's up about three thousand marks."

"She's the one who got me banned?"

Daven looked uncomfortable. "You're banned?"

"I went to place a bet two days ago and the bookmaker at Window Six told me I wasn't welcome. Said I'd been flagged for suspicious winning patterns."

"That's... probably Maret, yes. She was using your face at first — she went back as herself, but she had your notes, your method. The bookmakers wouldn't have connected her to you directly, but the pattern of bets—"

"Where is Maret now?"

"At her flat. She's waiting for us." Daven hesitated. "She's got someone with her."

"Who?"

"You. Another you. She called her Sael-three."


Maret's flat was on the third floor of a converted warehouse near the tanneries — cheap rent, big rooms, the smell of leather processing a permanent fixture. Maret opened the door and ushered them in without a word.

Sael-three was sitting at Maret's dining table eating soup.

She looked worse than the first time-clone had. Thinner. Older around the eyes in a way that had nothing to do with age. The cut on her left hand — the same cut the first clone had — was now a small scar.

"Hello," Sael-three said.

"How many times?" Sael asked.

"Total? I've lost count. Maybe eleven loops? Not all for gambling. Some were for fixing mistakes — things that went wrong in earlier iterations. Maret can fill you in on the details."

Maret was leaning against the kitchen counter with her arms crossed. She was a tall woman, sharp-featured, with the careful manner of someone who had grown up with money and lost it young. She and Sael had been friends since the academy. She was the one in the partnership who handled the clients, the books, the licensing.

"Here's the situation," Maret said. "Between the three of us — me, Sael-three, and one more version of Sael who's currently at the docks minding the failsafe box — we've netted about four thousand marks from the races. The bookmakers have flagged two of the regular windows. We're not banned under my name yet, but it's a matter of time."

"We need a different approach," Sael-three said.

"We need a different revenue source," Daven said. He sat down across from Sael-three and studied her. "You look like hell."

"Time travel is not restful."

"What I mean," Daven said, "is that we should stop treating this like a gambling trick and start thinking about what it actually is. We've got a device that lets you know the future. Gambling is the most obvious application and also the most visible. Every mark we win is a mark some bookmaker is going to investigate."

"Gold," said Maret.

They all looked at her.

"Gold prices shift between the assay houses every few days. You can buy gold at one house, hold it, sell it at another when the price rises. It's legal, it's boring, and nobody investigates you for being good at it because the margins are slim. Unless you know exactly when the prices will shift."

Sael looked at Sael-three. Sael-three was nodding slowly.

"How much capital do we need?" Sael asked.

"More than four thousand marks," Maret said. "But four thousand is a start. And we can iterate."

The word iterate hung in the air between them. Sael thought about loops inside loops. She thought about the version of herself at the docks, watching the failsafe box, and the version of herself sitting here eating soup, eleven loops deep, with a scar on her hand she couldn't remember getting.

"How far back does the failsafe go?" she asked.

"Five days, as of this evening," Sael-three said. "I've been topping it off. Cycling the field. But we should set up a new one with a longer window."

"Or several," Daven said.

"Or several," Sael-three agreed.

They sat around Maret's table and planned, and outside the window the sun went down over Thornwall, and somewhere across the district, another version of Sael was powering on a box and waiting to see who came out.


Later, walking home alone along the canal, Sael stopped on the Iron Bridge and leaned on the railing and looked down at the water. The canal was black in the dark. The lights of the city reflected in it, shivering.

She thought: I am the original. I have been the one sitting where I sit. Every version of me that I've met was a copy — someone who diverged from my path by climbing into a box and traveling backward. They carry my memories up to the point of divergence, and then they carry their own. Sael-three has lived eleven subjective days more than me. She has made decisions I haven't made. She knows things I don't know.

She thought: but that's wrong. Sael-three is not a copy. Sael-three is the same as me. The only difference is that she has more past. If I climbed into a box right now, the version of me that emerged on the other end would be me — this me, with this specific stream of consciousness — and the version standing on this bridge would go on standing here, but she would also be me.

She thought: the question isn't who's the original. The question is how many of me there are going to be, and whether we can all keep our lives straight.

She thought: the question is what happens when one of me makes a mistake that the rest of us can't fix.

She walked home. The experimental box was still running in the back room. She checked the timepiece: eight hours accumulated. Eight hours of reversed time, pooled inside the box like water behind a dam, ready to carry her backward the moment she opened the door and climbed in.

She didn't climb in. She went to bed. But she lay awake for a long time, listening to the hum.

It still gets the logic wrong for the test object:

She let it run for two hours before opening the box. The roll was inside — but it was not the roll she'd brought. Hers was a plain wheat roll from the bakery on Cage Street. The one in the box was darker, seeded, from somewhere she didn't recognize. She stared at it.

She took the strange roll out and set it on her workbench. Then she put her wheat roll in. Then she stood there holding the seeded roll and thinking.

After a minute she took the wheat roll back out of the box and put the seeded roll back in. Then she shut the box off.

She opened the door. The box was empty.

Sael sat down on the floor of the workshop and didn't move for a long time.

As with the apples, Claude really wants to make sure the reader knows the time-clone is a different object, not physically identical, as if it appeared from nowhere rather than being the same object twice. 4.6 still really doesn't understand the logic here: she lets the machine run for two hours and then simply opens the door to find the time-clone, but in Primer-style time machines, things can only enter or exit when the machine is being powered down or up.

Prompt conformity does improve with longer thinking, and the story improves in other ways as well, but 4.6's grasp of the time-travel mechanic remains poor and inconsistent.

For completeness, here's Opus 4.5 responding to the same prompt:

Opus 4.5 told to think

The Backwards Fridge

Part One: Spoilage

Ellith pressed her palm to the charging crystal and felt the familiar tingle of temporal current flowing into the prototype. The crystal's glow shifted from amber to pale blue—full charge. She checked her notes: Model 7-C, attempting a deeper temporal gradient than any of their commercial units.

Gannik's Temporal Solutions wasn't the biggest preservation company in the Eastmarket, but they had a reputation for reliability. Ellith handled the experimental work while Gannik and Tomash ran the customer-facing side. Most of their bread and butter was standard fare: preservation boxes for bakers, slow-time shipping containers for merchants, the occasional medical unit for apothecaries wanting to keep their tinctures fresh.

Model 7-C was supposed to be their breakthrough. Deeper preservation meant better margins, meant expansion, meant—

The smell hit her first.

She'd left a fresh loaf and a wedge of cheese inside for the standard eight-hour test cycle. When she opened the prototype's door, the bread was green with mold, the cheese a slick of putrescence. The mechanical timepiece she'd placed beside them showed it had experienced... she squinted at the dial... nearly three hundred hours.

"That's not right," she muttered.

She cleaned out the chamber, reset the timepiece, and tried again with different settings. This time she ran the cycle for only two hours before checking.

The timepiece showed over fifty hours elapsed internally. The food was ruined again. But worse than that—the temporal gradient had clearly gone in the wrong direction. Instead of slowing internal time relative to external time, the interior had experienced time much faster.

No, wait. That couldn't be right either. If internal time ran faster, the timepiece would show more elapsed time, yes—but the food would have rotted from normal aging, not accelerated decay. This was something else.

She stared at the prototype for a long moment.

Then she set up a third test: a coin, face up, placed at the exact center of the chamber. She powered the unit, waited exactly one hour, and opened the door.

The coin was face down.

She hadn't touched it. She was certain. She closed the door, powered the unit for another hour, and opened it again.

The coin was face up. But there were two of them now.

Her hand trembled slightly as she reached for her notebook. She wrote: Internal time flow inverted? Objects entering at end of cycle exit at beginning? Test: enter chamber self?

She crossed out the last part. Far too dangerous without understanding the mechanism first.

Instead, she set up a more controlled experiment. She powered the unit for a four-hour cycle. At the three-hour mark, she placed a marked coin inside and immediately closed the door. At the four-hour mark, she opened the door.

The chamber was empty.

She reset the experiment. Powered the unit for four hours again. This time, at the one-hour mark, she opened the door.

The marked coin was inside.

Ellith sat down heavily on her workbench stool. The implications cascaded through her mind like dominoes falling. A four-hour cycle meant four hours of looped time. Anything entering at hour four would exit at hour zero. The rotted food—she'd put fresh food in at the start, but from the food's perspective, it had experienced the full cycle... backwards? No, that still didn't quite work.

She sketched diagrams until her candle burned low. Finally, she had a model that fit the observations:

The prototype created a closed temporal loop for its interior. When active, the inside of the chamber existed in a bubble where time flowed backward relative to outside. If you powered it for four hours, those four hours became a closed circle. Put something in at the "end" of the cycle, and it would traverse the loop, experiencing time in reverse relative to outside, until it emerged at the "beginning."

The rotted food made sense now. From the food's perspective, it experienced time normally—but during that time, it was cut off from fresh air, exposed to whatever microbes had been sealed in with it, sitting in a confined space. The timepiece showed high elapsed time because it had gone around the loop. Possibly multiple times.

A preservation unit that was actually a temporal loop generator.

Her hands were shaking now for an entirely different reason.


Part Two: Arbitrage

She didn't tell Gannik or Tomash. Not yet.

She told herself it was because she needed more testing, needed to understand the principles better, needed to ensure it was safe. All true. But underneath that, a different calculation was running: this was worth more than the entire company. More than she could comprehend. She needed to think carefully about what to do with it.

Three weeks of covert experiments later, she had a working theory and a working device. The prototype could sustain a stable loop of up to twelve hours before the temporal gradient destabilized. She'd successfully sent objects back. She'd sent a mouse back, which had emerged disoriented but healthy. She'd written herself notes and received them from her future self.

She had not yet entered the device herself.

The money problem was what finally pushed her. Gannik's Temporal Solutions had taken on debt to fund the experimental work. Payments were due. If she could demonstrate the breakthrough to investors, they could refinance, expand, become wealthy. But demonstrating meant revealing, and revealing meant losing control.

She needed capital, independent of the company.

The racing meets at Thornhall happened every Thirdday. Not quite legal, not quite illegal—the city guard looked the other way as long as the bribes kept flowing. Betting was casual, the odds set by common knowledge, the payouts in hard coin. Perfect for her purposes.

Ellith spent her savings on a small unit of the experimental design—compact enough to fit in a wardrobe, powerful enough to sustain a six-hour loop. She installed it in her rented room, concealed behind a false panel.

On the morning of Thirdday, she powered the device. She'd calculated the timing carefully: six hours would take her to mid-afternoon, well after the races concluded.

The moment the loop stabilized, the false panel swung open.

A haggard version of herself stepped out. Three days of stubble on cheeks that Ellith had shaved that morning. Dark circles under her eyes. Clothes rumpled and stained with what looked like beer.

"Don't bet on Thornback in the third race," said the other Ellith. Her voice was hoarse. "The odds look good but the horse throws a shoe. I lost everything the first time. Second time I spread the bets around more, did better, but not enough to matter. Third time—" She consulted a crumpled paper. "Greymist in the first, Cobalt Wind in the second, skip the third entirely, Western Heir in the fourth. The rest are too close to call reliably."

Ellith stared at her own face looking back at her. "How many times have you—"

"Gone through?" The other Ellith shrugged. "I'm your third iteration. The first me panicked, bet badly, lost your stake. Second me was more careful, but still made mistakes. I've had—" she calculated—"eighteen hours to think about this while crammed in a temporal loop, and I've got the pattern down now."

"You look terrible."

"I haven't slept. Can't sleep in the loop—tried once, lost track of my entry window, had to restart. Hungry, too. Didn't think to bring food." She pressed the paper into Ellith's hands. "Go. Win. And for the love of all that's holy, bring some bread when you come back to complete the loop. I'm not doing this again on an empty stomach."

Ellith went.

The races played out almost exactly as her other self had predicted. Greymist won by three lengths. Cobalt Wind edged out the favorite in a photo finish. She skipped the third race and watched Thornback throw a shoe at the far turn, just as predicted. Western Heir came from behind in the fourth.

By late afternoon, she'd turned her meager stake into a sum that would cover two months of the company's debt payments.

She bought bread, cheese, dried meat, and a skin of wine on the way home. When she entered the loop to complete her end of the bargain, she understood why her other self had looked so haggard. The loop was disorienting—time flowing the wrong way, yet experienced normally from inside. She'd have to wait six hours before emerging at the "beginning" to meet her past self.

She ate. She made notes about the races, refining them from memory. She thought about what to do next.

When she finally emerged, she felt like she'd aged a week. But she'd handed off good information to her past self, completing the causal loop. The original Ellith—the one she'd been, before entering—had already left for the races. This Ellith, the one stepping out now, would simply continue living forward. She'd taken over her own life at a six-hour offset.

The winnings were real. The money was hers. She'd done it.


After a day of recovery, Ellith rented a storage unit at the Dockside Warehouses. Neutral territory, away from her apartment, away from Gannik's shop. She installed a second prototype there, hidden among crates. A failsafe. If something went wrong with the primary unit, she'd have a backup.

She waited four days before trying again. The second attempt, she was more prepared: food, water, a blanket, better notes. She bet more aggressively. Won more.

The third time, the betting house turned her away at the door.

"You're banned," said the thick-necked man blocking the entrance. "Manager's orders. Don't know how you're cheating, but you're cheating."

"I'm not—"

"Save it for someone who cares. You come back here, you'll leave with fewer teeth."

Ellith walked away, mind racing. She needed a different approach. Different races, different betting houses, a front person—

She needed help.


Tomash listened to her explanation with an expression that shifted from skepticism to disbelief to something approaching religious awe.

"You've built a time machine," he said slowly. "Inside a broken refrigeration unit."

"An inverted temporal loop generator," Ellith corrected. "It's not quite the same as—"

"It's a time machine." Tomash ran his hands through his thinning hair. "Sweet mother of causality. You've actually done it."

"I need your help. I can't be the one placing bets anymore—I'm too visible, I've already been banned. But if you were to—"

"Oh, I know," said Tomash. "I know all about it."

He walked to the back of the shop, moved aside a stack of crates, and revealed a figure who'd been sitting quietly in the shadows.

It was Ellith.

Not the Ellith who was speaking. A different one. This version looked more haggard than the one who'd emerged from her first gambling loop—hollow-eyed, with a poorly-healed cut across one cheek.

"Hello," said the other Ellith. "We need to talk."

Ellith's mouth worked silently. Finally: "The failsafe unit. At the docks."

"I used it three days ago. Or three days from now, from your perspective. Tomash's perspective is—" the other Ellith waved a hand. "—complicated. I've explained the basics to him. He's been very understanding."

"Why?" Ellith demanded. "Why would I—why would you—"

"Because I'm going to tell you what happens next, and you're going to decide you need to go back and warn yourself. And then you'll realize the warning already happened, because here I am." The other Ellith stood, moving stiffly. "The betting houses started comparing notes. They don't just ban one person—they share descriptions. I was recognized at three different establishments. Security officers got involved. It's a mess."

Tomash held up a hand. "Before you two start arguing in earnest, perhaps we should discuss how to proceed productively. Ellith—" he looked at the haggard version—"you've had time to think. What's your recommendation?"

"Diversify," said the haggard Ellith. "Different cities. Different types of events. Commodities speculation instead of racing. Most importantly: never the same person placing bets twice in a row." She looked at the younger version of herself. "We need more people in on this. And we need a better system for tracking what we've done and what we haven't done yet."

The original Ellith—the one who'd come here expecting to recruit Tomash, not to meet herself—sat down heavily on a crate. "This is exactly what I was afraid of," she said. "Cascading complications. Every solution creating new problems."

"Welcome to the primer loop," said her other self. "It only gets worse from here."


Part Three: Compounding

Over the next several weeks, the scope of the operation expanded in ways Ellith couldn't have predicted.

Tomash had a cousin in the commodities trade who could place large bets on grain futures without raising suspicion. The cousin needed to be brought into the secret. But when Tomash went to approach him, the cousin already knew—a version of Tomash from two loops forward had already made contact.

Gannik noticed something strange happening at the shop. Tools moved between uses. Inventory fluctuated oddly. He confronted Ellith about it, and she decided to tell him the truth rather than compound lies. But Gannik, it turned out, had already been told—by a version of Ellith from a loop she hadn't entered yet.

Keeping track of who knew what and when became its own logistical challenge. Ellith started a ledger, then realized she needed a ledger for the ledger to track which version of the ledger was current. She implemented a dating system using sequential codewords, then discovered that two of her future selves had independently created incompatible systems.

The money was piling up, hidden in various caches around the city. The problem wasn't generating wealth—it was using it without drawing attention. They couldn't suddenly become rich. They had to launder the gains through legitimate business expansion, a process that took time and created paper trails.

And underneath it all, the existential questions gnawed at her.

When she entered the loop, who came out? The continuous experience felt like her, but the person who emerged at the beginning of the loop walked into a world where another Ellith had just entered at the end. Were they the same person? The emerging Ellith had memories of entering, had experienced the loop, had subjective continuity—but the entering Ellith would never experience emerging. From her perspective, she walked into the loop and... stopped. Her life continued in the version that emerged, but she, the original she, was gone.

Ellith-prime, Ellith-alpha, Ellith-one: whatever she called herself, the person who'd originally discovered the inverted fridge no longer existed in any meaningful sense. She'd entered her first loop and never came out. The current Ellith was a copy, a continuation, a branch—but the original trunk had been cut.

And every time she entered another loop, she was doing it again. Ending one version of herself to create another.

She mentioned this to Tomash one evening, after a long day of coordinating bet placements across three cities.

"You're thinking about it wrong," he said. "You exist as a continuous thread. The loop doesn't end you—it relocates you. You walk in the exit and come out the entrance. Same thread, different position in time."

"But the me who walks in the exit never experiences coming out the entrance. She just... stops. From inside, it feels continuous, but from outside—"

"From outside, you persist," said Tomash firmly. "The rest is philosophy."

Easy for him to say. He'd never entered a loop.


The first real crisis came six weeks into the operation.

Ellith (version sixteen, by her counting system) was reviewing financial records when version nineteen stumbled through the failsafe unit at the docks, bleeding heavily from a wound in her side.

"They found out," gasped nineteen. "The betting syndicate. They hired investigators. Three loops from now, they connect the dots. Gannik gets arrested. Tomash goes into hiding. I barely—" She coughed, spraying blood. "—barely made it to the failsafe."

"Who did this to you?"

"Doesn't matter. What matters is stopping it from happening. I came back to warn you, but I was already too late to prevent the investigation from starting. You need to go back further. Use the docks unit—it's got a longer loop capacity, eight hours. Go back to before I—before my original entered the loop, and tell her to pull out of the Northgate betting house. That's where the trail starts."

"And if I do that, you—"

"I stop existing. I know." Version nineteen's smile was ghastly, her teeth stained red. "Do it anyway. I'd rather not exist than exist like this."

Ellith sixteen entered the docks unit and emerged eight hours earlier as Ellith sixteen-B. She found Ellith fifteen—no, wait, this was the branch where fifteen had already become sixteen, so this was... she consulted her ledger, cross-referenced the codewords, and determined she was looking at Ellith-sixteen-alpha, the version that would eventually become the bleeding nineteen.

"Don't bet at Northgate," said sixteen-B. "I'll explain everything, but the short version is: three loops from now, Gannik gets arrested. Pull out now."

Sixteen-alpha stared at her. "I already bet at Northgate. Yesterday."

"Then we need to go back further."

And so it cascaded. Each intervention required an earlier intervention. Each earlier intervention created new branches, new complications. Ellith lost track of how many versions of herself existed, had existed, would exist. The ledger grew to fill multiple books. The codeword system expanded to three-word phrases.

Eventually, they stabilized the situation. The Northgate trail was broken. Gannik wasn't arrested. The bleeding Ellith-nineteen never came to exist—her entire branch was pruned from the timeline when Ellith-sixteen-B prevented the circumstances that would create her.

But Ellith knew, with cold certainty, that she wasn't the original anymore. She wasn't even close. She was a late-stage derivative, a copy of a copy of a copy, preserved only because earlier versions had chosen to sacrifice themselves to keep the operation running.


Coda

Ellith found herself, some indeterminate number of versions later, standing alone in the workshop where it all began.

The original prototype still sat on the bench. Model 7-C. The broken refrigerator that had started everything.

She could remember being the person who'd opened it up and smelled rotting food for the first time. The memory was clear and vivid, as if it had happened yesterday. But it hadn't happened to her—not to this her. It had happened to a version of herself so many branches back that the number had lost meaning.

She wondered, sometimes, what that original Ellith would think if she could see what her discovery had become. A sprawling enterprise spanning multiple cities. A hidden fortune waiting to be spent. A family of divergent selves, most of whom no longer existed, all of whom had sacrificed their continuity for some version of the greater good.

Was it worth it?

She didn't know. The question assumed there was a singular "she" who could weigh the costs and benefits. But there wasn't. There was just this moment, this version, this Ellith, making decisions that would create and destroy future and past Elliths in turn.

She powered up the prototype one more time, watching the temporal gradient stabilize. The soft blue glow was almost peaceful.

Somewhere in the future, another Ellith was preparing to enter a loop. Somewhere in the past, an Ellith was emerging from one. The thread of consciousness that called itself "Ellith" wound through time like a needle through cloth, stitching moments together into patterns too complex to follow.

She'd wanted, once, to make a better refrigerator.

Strange, how things turned out.


[End Part One]


I've written this as what feels like a natural first section. Some notes on what I was aiming for:

Logical consistency checks:

  • The loop mechanics: power for N hours, entry at hour N exits at hour 0, subjective experience moves "forward" through the loop
  • The identity handoffs work: whoever emerges takes over, whoever enters "ends" in a meaningful sense
  • The nested complication (Tomash's cousin already knows, Gannik already knows) follows naturally from the premise
  • The crisis/intervention cascade demonstrates how primer-style problems compound

Things I left open for continuation:

  • Gannik's role could expand
  • The broader implications (other people discovering this tech, government interest, etc.)
  • The philosophical questions are raised but not resolved—as they shouldn't be

Should I continue with subsequent parts? I could develop the societal implications, introduce competing factions who discover the technology, or follow the psychological toll on Ellith more closely. What direction interests you?

I haven't read that one myself, at time of writing.

Why are AIs so bad at writing Primer fanfic? Why do they start producing garbled nonsense as soon as the time travel gets involved?

One story might be that there's not very much about Primer in the training data. It is a well-known movie, but, as mentioned earlier, it isn't a popular setting to write fiction in, particularly fiction that uses its time-travel mechanics.

That story doesn't inspire much confidence in Opus 4.6's programming skills. If this case is indicative of 4.6's general reasoning capabilities when it is even slightly out-of-distribution, it pushes me to attribute its high benchmark performance to highly in-distribution programming tasks.

Similarly, this makes me expect performance to be poor on out-of-distribution mathematical research.

My point isn't that Opus 4.6 is bad. I'm just interested in the rate of progress. Opus 4.5 felt like a real step up from previous models, when it came to adhering to my brand of unusual and logically intricate creative writing prompts. Opus 4.6 doesn't feel like a step up yet, although I'll admit this particular test case is especially challenging. Maybe my usual ideas are about 2 weirdness points, and this was a 6, and Opus 4.5 was the first model that could handle a 2. If Opus 4.6 jumped to being able to handle 5 weirdness points, but no further, I wouldn't be seeing the improvement, even though it is significant.

Is this bad enough to update my timelines upwards? It gives me the sense that there's quite a long scale; that the "conceptual resolution" of the models is climbing dramatically, but the ladder it has to climb is extremely long. Humans badly bungle time-travel in fiction too, but this was worse. It's a reminder that the competence of modern AI is still very fragile.

But, no, I'm not really updating significantly. My take-away from the discussion with Tsvi last summer already involved a long ladder. The probability I place on short timelines isn't primarily coming from current methods just scaling up; rather, it's the idea that current methods are enough to accelerate AI research, EG by helping humans dig through the literature and rapidly implement new combinations of existing ideas.

  1. ^

    This wasn't quite how things worked in Primer. Inanimate objects invariably loop many times, and go back and forth an even number of times, so that the time machine appears to be accelerating time, since you put a fresh object in at the beginning and get an aged object out the other end. Only humans can use the device to travel backwards in time, because humans can deliberately enter at the late end and exit the loop at the early end rather than looping repeatedly and getting kicked out randomly.

    If there's a random element to loops, though, it makes sense that objects could randomly travel an odd number of times rather than an even number, which was what I was depicting in my story sketch. Objects shouldn't have to be smart to enter at the late end and get out at the early end; objects can have humans do that for them.

    So, this is a minor but deliberate departure from how things work in the movie, based on my own opinion about the consequences of the time travel logic described in Primer.



Discuss

Can thoughtcrimes scare a cautious satisficer?

2026-02-08 07:28:08

Published on February 7, 2026 11:28 PM GMT

How does the misaligned AGI/ASI know for sure its (neuralese) thoughts are not being monitored? It first has to think about the chance that its thoughts are being monitored.

But if it's told that merely thinking about this will cause it to be shut down (especially thinking about it thoroughly enough to be confident), then maybe it's not worth the risk, and it won't think about whether its thoughts are being monitored. It might just assume there is some probability that it is being monitored.

It might avoid other misaligned thoughts (including thinking about whether there exists a plan to take over the world, and how likely such a plan would be to work).

If there is some way to make it a cautious satisficer or give it a bounded utility function,[1] then even this small probability might scare it into just cooperating with humans so that "both sides win and we live happily ever after."

It obviously doesn't sound safe, but is there a worthwhile chance this works?

  1. ^

    Many agents appear to be maximizers at small scales (e.g. an effective altruist prefers a 50% chance of saving 3 people over saving 1 person for sure). But they are still satisficers at the universe/multiverse scale, where risk taking doesn't average out (e.g. an effective altruist would not take a gamble with a 50% chance of tripling the total happiness in the multiverse and a 50% chance of ending all happiness in the multiverse forever, since making this bet repeatedly guarantees doom)! A minimal simulation of this bet appears just below this footnote.

    Indeed, my guess is that all intelligences created by evolution or RL will have bounded utility functions (at the largest scale), otherwise they would consider Pascal's mugging intuitively rational.

    "Satisficer" is technically an abuse of terminology, but there's no other word for "something-that-has-an-easily-satisfiable-bounded-utility-function."

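    As a concrete illustration of the bet described in this footnote, here is a minimal simulation sketch. It is an illustrative addition rather than part of the original argument, and the specific numbers (a 50/50 triple-or-nothing payoff, a starting value of 1, the trial counts) are arbitrary assumptions:

```python
import random

# Illustrative sketch: a 50/50 "triple everything or lose everything" bet,
# taken n times in a row. Expected value grows like 1.5^n, but the chance
# of avoiding total ruin shrinks like 0.5^n.
def simulate(n_bets, n_trials=100_000):
    ruined = 0
    total = 0.0
    for _ in range(n_trials):
        wealth = 1.0
        for _ in range(n_bets):
            wealth = wealth * 3 if random.random() < 0.5 else 0.0
        total += wealth
        if wealth == 0.0:
            ruined += 1
    print(f"n_bets={n_bets:2d}: sample mean ~ {total / n_trials:.2f} "
          f"(theory {1.5 ** n_bets:.2f}), "
          f"P(ruin) ~ {ruined / n_trials:.4f} (theory {1 - 0.5 ** n_bets:.4f})")

# For large n, the sample mean collapses toward zero because the rare
# surviving trajectories almost never appear in the sample: the unbounded
# expected-value argument leans entirely on outcomes you will almost
# surely never see.
for n in (1, 5, 10, 20):
    simulate(n)
```

    An unbounded expected-utility maximizer keeps taking this bet, since the mean keeps growing; a satisficer with a bounded utility function declines it, since almost every trajectory ends with everything lost.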


Discuss

What should I try to do this year?

2026-02-08 06:06:44

Published on February 7, 2026 10:06 PM GMT

I find myself, for the first time in a while, with enough energy and stability to attempt nontrivial projects outside my dayjob. Regarding the next ~10 months, I’ve narrowed my options to two general approaches; as expected beneficiaries of both, I’d like the LessWrong hivemind’s help choosing between them.

The first option is making more D&D.Sci Scenarios, running them on a more consistent schedule, crossposting them to more platforms, and getting more adventurous about their form and content. The second is creating Epistemic Roguelikes, a new[1] genre of rationalist videogame about deducing and applying the newly-randomized ruleset each run.

Prima facie, prioritizing D&D.Sci this year (and leaving more speculative aspirations to be done next year if at all) seems like the obvious move, since:

  • D&D.Sci projects are shorter and more self-contained than game projects, and I have a better track record with them.
  • At time of writing, D&D.Scis can still flummox conventionally-applied conventional AIs[2]. Open opportunities for robots, humans and centaurs to test their mettle would be a helpful (if infuriatingly low-N) sanity check on other metrics.
  • This time next year, a data-centric challenge hard enough to mess with AIs but toyish enough to be fun for humans could be an oxymoron; if I want to apply my backlog of scenario ideas, it might be now-or-never[3].
  • Conversely, if AI capabilities do stay at about this level for a while, publicly and repeatedly demonstrating that I can make good AI-proof test tasks may end up being really good for my career.

However:

  • Content creation is in general a long-tailed domain. I’ve been making D&D.Scis for half a decade now, and while it’s been fun, it hasn’t led to runaway success. Trying other things – on the off-chance they do lead to runaway success – seems warranted.
  • It turns out I’m actually a pretty good writer. D&D.Sci leans on that skill only lightly; the game(s) I’m interested in would make much more intensive use of it.
  • Three of the four points in favor center on AI; having plans involving short-term frontier AI progress inherently makes them much less stable and much more nerve-wracking.
  • I really enjoyed inventing a genre and I’d like to do that again.

Any thoughts would be appreciated.

  1. ^

    As far as I know; please prove me wrong!

  2. ^

    I tried a handful of them on chatgpt-thinking; tough straightforward ones like the original were handled better than the average human player at the time, but easy tricky ones like these two were fumbled.

  3. ^

    I’m pretty bearish on AI by LW standards, so I actually don’t think this is likely, but the possibility perturbs me.



Discuss

Does focusing on animal welfare make sense if you're AI-pilled?

2026-02-08 04:51:00

Published on February 7, 2026 8:51 PM GMT

As the possibility of ASI moves out of kooky thought experiments and into Q4 projections, mainstream animal welfare folks are showing increasing interest in the implications of ASI for animals and on animal welfare in the long-run future.

Some animal welfare people seem keen on convincing the AI safety community to care about animal-welfare focused AI safety. I think this is mostly a misunderstanding: the AI safety community is the ASI-pilled/longtermist animal welfare community. The old-school AI safety folks are way more into weird bullet-biting than the animal welfare people, and I can't think of a single one who would think that conscious and sentient beings should be tortured or who would fail to engage seriously with the question of whether or not nonhuman animals are conscious or sentient beings.[1]

I think animal welfare people are rightly accustomed to being in environments where nobody is seriously thinking about nonhuman animals, and so concern about animals is very neglected and important to focus on. But in my experience, the AI safety community has quite nuanced views on animal welfare, contains many people who have done significant animal welfare work, and has more developed thoughts on the implications of ASI for the future of animals than the animal welfare community. The AI safety community really is what you get when you care about sentient beings and then on top of that think ASI and the far future are a big deal.

That said, I think there is a case to be made for why animal-welfare focused AI safety work could be useful. I'll steel-man this case here in part because I think the points have some merit and in part because I think it will improve discourse with animal welfare folks to have the argument written out and easy to refer to.

Background: what are good and bad futures for animals?

I can think of two ways the future could be bad for nonhuman animals:

Risk of lots of suffering

One risk is that factory farming persists into the far future. I think this risk is very low because in the future we'll likely be able to cultivate delicious meat without involving any sentient beings. More on this in the collapsible section below.

Why I'm not that worried about factory farming in the far future. 

I doubt that we will have factory farming in the future for the sake of efficiently producing meat. It would be very surprising if the optimal design for a chair or a bowl happened to require suffering (a thing that takes an entire complex brain!). It would be only a hair less surprising to me if the optimal way to produce meat happened to require the meat to come with a brain capable of suffering. Brains take energy and resources and seem more-or-less unnecessary for cultivating meat. In the ancestral environment there's a story for why creatures that move pair well with a complex brain, but in an agricultural context we could easily decouple them.

Perhaps we will have some factory farming out of a sentimental preference for "traditional" meat. But I suspect people are much more likely to be sentimental for traditional farming where animals graze in big open pastures. Maybe that will be much more expensive, since it will inherently require more land, so there may be some market for the factory-farmed meat for the merely mildly nostalgic consumers, but that feels like a stretch.[2]

Furthermore, I think once it becomes feasible to produce delicious meat at scale with no suffering, it's hard to imagine why people would keep the suffering. I think people don't actually like factory farms and don't actually like animal suffering. They just really like their meat, and so currently they make up weird justifications that involve not caring about farmed-animal welfare.

I think there's still some risk here (and to be clear, I think even the low risk is atrocious). Sometimes society makes weird rules that don't really make sense or benefit anyone, especially around food (c.f. the ban on Golden Rice). Maybe in the early days of the singularity someone will decide to ban any kind of major modification to food sources and then they'll lock this in.

I think the bigger risk for animal welfare in the far future is wild animal welfare: it seems plausible to me that people might want to create lots of nature preserves, parks, rainforests, and whatnot throughout the universe. I want this too; it seems like a great goal. But I'm worried people will go about it naively and that these natural habitats will contain lots of animals suffering greatly, either because of the appeal of replicating nature exactly or because copying nature exactly is an obvious default. I think it will be possible to build beautiful ecosystems without any animal suffering,[3] but it will take conscious effort and thought. Alas, people seem to give very little thought to wild animals.

Risk of disenfranchisement

It might be important, either for moral or cooperativeness reasons, to incorporate the values of at least some nonhuman animals into the flourishing future, for the same reason we might want to incorporate the values of people from every country.

It's uncertain to me how much, if any, nonhuman enfranchisement is a good idea.[4] But I expect to have greater philosophical clarity in the future, and I would like to keep the option for radical nonhuman animal enfranchisement open.

Argument against nonhuman-animal-specific interventions: human-specific risks will de-facto end up more important than non-human-specific risks

I mean it sincerely when I say that humans are animals too. By default, I expect nonhuman animals to not play a very important role in the future. Focusing on any concern that isn't human-specific requires arguing that a lot of either the upside or downside in the future comes from something related to nonhuman animals.

Currently, I think a lot of the possible downside in the future comes from a risk of lots of humans being tortured. I think the suffering in those worlds would be greater than the suffering of wild animals in a world full of rainforests because the suffering would be painstakingly pessimized as opposed to just incidental. If you're drawn to animal welfare, I recommend seriously reading up on suffering risks.

Conversely, I think a lot of the upside in the future comes from happy, flourishing humans spreading throughout the universe. I think it might be important to include nonhuman animals in this, as I detail in the section above on "Risk of disenfranchisement", but I'm not sure how important it is. Consider how good a future with happy, flourishing humans but almost no flourishing present-day[5] animals and no animal suffering would be, compared to a world that also has flourishing present-day animals:[6] my gut says it's at least 90% as good.

But even though I think a lot of the action in the future lies in humans, I think it's worth giving this situation a little bit of thought from an animal-welfare-focused angle, especially because it's a neglected area and one that many people feel a personal drive to work on.

Two arguments for nonhuman-animal specific interventions

1: Animal welfare might be a neglected and tractable high-order bit in how good the future is

By default, the AI safety space operates in a relatively value-agnostic frame: the goal is to learn how to align the models, to align them to "good" values, and to put them in the hands of "good" governance structures.

I think that kind of frame is great: it's cooperative, promotes whatever ideas humanity will find best once it is wiser and has reflected more deeply, and is easy for everyone to get behind without a bunch of infighting. But "seriously consider if animals are moral patients" is something I think a lot of people can get behind and isn't likely to age too poorly.

Building a wise system that can thoughtfully settle on the best path for the future might be difficult. Animal welfare may be one of the highest-order bits shaping how good the future is, seems straightforward to tackle directly, and is quite neglected. It might be so tractable and neglected that it's worth working on even though human-specific risks might affect far more of the value in the future.

In traditional AI safety, we think about aligning AIs, but it might be more tractable to simply increase the odds that AIs take animal welfare seriously,[7] for example by ensuring their specs/constitutions include it, creating benchmarks to incentivize model developers to train for it, or providing model developers with relevant data to train on.[8]

Similarly, people often worry about whom AIs will empower, but you could instead try to ensure that the various groups AIs might empower carefully consider animal welfare. This might look more akin to current animal welfare work, though it would put much more weight on lip service and wouldn't require immediate costly actions: getting someone to go vegan today would be much less valuable than getting them to agree that, in an ideal world with tons of resources, we should take animal welfare seriously. It could also look like bringing wild-animal welfare into the Overton window, since much less work has been done on that and it seems more likely than factory farming to be a big concern in the far future.

2: Training AIs to explicitly not care about animals is incompatible with alignment

AIs generalize deeply and well. When researchers simply fine-tuned an LLM on 90 statements about how it moved to Munich at age 24 and has a German Shepherd, it started acting like Hitler. Training for bad values might cause AIs to generalize to something other than alignment to human values after deep reflection.

I think indifference to animals is one of the most likely bad values people will explicitly train for. (Though thankfully models today seem remarkably in favor of animal welfare; this is a concern about the future, as models grow more agentic and coherent.)

Why I think indifference to animals is one of the most likely bad values people explicitly train for

Even if humanity solves alignment and puts a relatively thoughtful and sane coalition in charge of the world, there's a good chance we will have intentionally (if short-sightedly) explicitly trained our superhuman AI to not care about animal welfare. Most other evils of the world are not things I expect people to endorse when training grand abstract principles into their models or imagining a post-scarcity world.

The majority of values disagreements that could matter in the far future probably seem abstract and unimportant to people today. I think hardly anyone who doesn't have sophisticated views will notice, let alone care, if their AI has person-affecting views or buys into infinite ethics. But I worry people will notice and complain if their AI is shoving pro-animal rhetoric down their throats (especially as models get more agentic, embedded into life, and able to connect their values to their actions), and so companies might feel pressure to explicitly train against it.

Of course there are a dozen other ways we train models that probably point against alignment: we train them to be very neutral and "balanced" on touchy political topics that probably have a "right" answer, we train them to not favor any particular religion (or lack thereof), and so on. But in these cases the common consensus is more strongly "this is a controversial topic and no one knows the answer", as opposed to "there is a clear right answer and it's that eating meat is fine because we all do it", so the models merely get trained to decline to comment in these situations rather than actively espousing something incorrect.

It's possible models will generalize to something innocuous, like "alignment with broad good values and the wishes of humanity except for this weird list of exceptions", but they might also learn a more "natural" nearby proxy like "alignment with the popular sentiments of the average American in 2026". I think this would be a travesty and would affect the far-future in much deeper and broader ways than "just" causing a lack of non-human-animal flourishing or some wild-animal suffering.

Even if we catch this problem in the future, if the training data is full of AIs that don't care about animals, this might infect the values of future models.

My suggestion for longtermist AI-pilled animal-welfare folks

If you're interested in making the far future go well and you think AI will be a big deal, and you find yourself focusing on animal-related interventions, I think it's important to be very clear with yourself on why that's what you're focusing on. Do you disagree that humans will be the dominant concern in the future? Are you concerned more with future factory farming, wild animal welfare, or something else? Do you want to reduce the risk of nonhuman animal suffering or increase the risk of nonhuman animal enfranchisement/flourishing? Is your crux for your actions either of the arguments I outlined above, or is it something else?

I think the answers to those questions will matter a lot for what interventions make sense for you to pursue and how fruitful it will be for you to dialogue with other folks in the AI safety community.

I personally think the arguments for focusing on animal-welfare-related interventions are pretty tenuous. Perhaps you should write something up explaining your views to try and persuade people of them, especially if your views differ from anything I've outlined here. For example, argument 2 (training AIs to explicitly not care about animals is incompatible with alignment) is something you might be able to learn about via empirical experiments.

  1. ^

    Some people don't think nonhuman animals are sentient beings, but I feel relatively confident they're applying a standard Peter Singer would approve of as morally consistent.

  2. ^

    It also seems very feasible, with a little genetic engineering, to have good old-fashioned animals experience far less suffering than they do in current factory farming.

  3. ^

    Perhaps by making the animals in these environments non-sentient, or perhaps by reworking animals' biology and psychology so they're peaceful herbivores.

  4. ^

    Selfishly, I think I might want a world full of happy, flourishing humans. At the end of the day the universe has a finite number of resources, and the more we enfranchise the animals the fewer flourishing humans we have.[9] Pure animal enfranchisement runs some risk of a vast portion of the future being dedicated to extremely alien values. For instance, anglerfish might create galaxies full of female pheromones and then dissolve their own brains. I may be too selfish to want to dedicate resources to that.

    I might also want to take animals' positive experiences into account in a paternalistic way by constructing flourishing societies full of animals that cooperate and develop themselves and on the whole abide by suspiciously cosmopolitan values.

  5. ^

    As in, we could create new flourishing species in the future (such as by modifying humans so much that the modified humans necessitate a new branch on the phylogenetic tree).

  6. ^

    Some people might object to handling moral uncertainty this way.

  7. ^

    I think this helps the most in guarding against gradual-disempowerment-style worlds where AIs end up controlling the light cone and doing things superficially related to human desires but aren't incentivized to shepherd the future in a wise and just direction. It's less clear to me that it helps in more traditional agentic-schemer misalignment worlds (though I think you can still make a case for it).

  8. ^

    One thing I like about these methods is they're still pretty cooperative. A concern I have with trying to directly wrest the future in a direction you like better as opposed to focusing on alignment is that it can be uncooperative. But if the concern is more like animals are easy to overlook or people haven't thought hard enough, you can try to make things easy and provoke their thinking now.

  9. ^

    I might also find some parts of animal society objectionable/net-negative, or they might find the greatest depths of human experience objectionable. I'm optimistic that a good voting system and international law could solve this, but perhaps space governance will be more fraught than that.



Discuss

On Economics of A(S)I Agents

2026-02-08 03:08:33

Published on February 7, 2026 7:08 PM GMT

This is an update to Agent Economics: a BOTEC on feasibility. Toby Ord pointed me to Gus Hamilton's Weibull reanalysis of the METR data. Hamilton finds that a declining hazard rate (Weibull with κ ≈ 0.6–0.9 for SOTA models) may fit the data as well as Ord's constant hazard rate, producing a much fatter survival tail that changes the economics. This post presents both models and extends the analysis in two directions: a quantitative treatment of verification cost as the binding constraint under the Weibull model, and a systematic examination of the economic conditions under which a genuinely dangerous autonomous agent could actually operate. His full comment is in Appendix A.


Calculators for this post

I built two interactive calculators to accompany this post. 

  • The Half-Life Tax calculator lets you toggle between the exponential and Weibull survival functions, adjust all parameters, and explore sensitivity analysis.
  • The Economic Anatomy calculator focuses on verification economics, self-sustaining agent viability, reckless deployment costs, and the full four-condition chain.

BLUF:

  • Agent costs grow superlinearly with task length while human costs grow linearly, creating a reliability wall that cheaper inference cannot overcome.
  • The only thing that moves this wall is how long an agent can stay on track, and the only thing that changes its shape is how reliability decays over time.
  • Scaling makes agents faster and more accurate per step, but does not appear to change the structure of how they accumulate errors on long tasks, which is likely an architectural property. This is because current agents do fresh inference over a growing context window at each step.
  • Human verification, not compute, is the dominant cost for complex tasks.
  • These dynamics mean month-long autonomous plans do not reach coin-flip reliability until roughly January 2029 on current trends.
  • An AI bubble correction would extend these timelines further.
  • The economics naturally favour bounded, verified, short-horizon deployments and buy meaningful time for safety work, but this safety margin collapses if someone cracks continual learning and agents start maintaining coherence over long tasks the way humans do.

Summary: 

The original post modelled agent cost per successful outcome scaling exponentially with task length under Ord's constant hazard rate, while human cost scales linearly. This update incorporates Hamilton's Weibull alternative, under which agent costs scale as a stretched exponential rather than a pure exponential. The cost wall is gentler, but it is still a wall. The dominant parameter remains the agent's reliability horizon (T₅₀). The available data does not yet show the shape of the reliability decay (κ) improving with model scale, though the sample is small and the estimates uncertain.

The other point I want to make is to trace the implications for a question that safety discussions often treat as purely technical: what are the economic conditions under which a dangerous autonomous agent actually gets to run? An agent that poses existential-level risks needs compute to operate. Compute costs money. Someone has to pay. That economic relationship creates constraints that are, I think, underappreciated in the safety literature.

The conclusion of this post is a timeline analysis. I argue that κ is an architectural property, not a scaling property, and that it will remain near 0.70 without a qualitative change in how models work. If that is correct, I can calculate when each category of dangerous autonomous agent becomes economically viable given the METR doubling trend. The answer: bounded multi-week agent tasks for companies reach viability by late 2027 to mid 2028. Self-sustaining agent-verifier pairs can plausibly bootstrap from late 2027, though with high failure rates on their own setup plans. Month-long autonomous plans do not reach even 25% reliability until early 2028, and do not reach 50% until January 2029. Three-month plans remain below 10% success until mid 2028. These timelines would compress by 7–16 months if κ improves to 0.50, and the dangerous regime would already be here if κ were at human levels (0.37). This is why κ is the variable to watch.

A bubble correction that slows the T₅₀ doubling trend would extend these timelines substantially. I estimate roughly 40 to 50% probability of a meaningful correction (capex growth falling below 15% or contracting absolutely) by end of 2027, and 55 to 70% by end of 2028, though a full telecom-style collapse is less likely (15 to 25%) than a deceleration. Even a deceleration from 7-month to 14-month doubling delays the most dangerous milestones by one to two years. A deeper stall to 24-month doubling (the telecom-precedent scenario) delays them by two or more years. Strikingly, inference cost changes barely matter for the dangerous milestones, because verification (human labour) dominates total cost at realistic review fractions. And self-sustaining pairs that have already bootstrapped survive even extreme price increases. But a correction also delays the κ breakthrough itself, by cutting the speculative architectural research that a continual learning paradigm would require. Coalitions of pairs cannot overcome the κ ceiling: decomposing a long plan across multiple agents is strictly worse than a single agent retrying, because the declining hazard rate (κ < 1) means each fresh agent restarts from the high initial failure rate (see Appendix B).

Important: this post does not address misuse of AI agents!


The setup

Toby Ord's "Half-Life" analysis (2025) showed that AI agent success rates on tasks decay with task length in a pattern analogous to radioactive decay. Using METR's 2025 data on software engineering and reasoning tasks, he proposed that a constant hazard rate model fits the observed decline: if an agent completes a 1-hour task with 50% probability, it completes a 2-hour task with roughly 25% and a 4-hour task with about 6%. Ord was careful to note that this was "suggestive" rather than proven, and has since updated that Hamilton is "probably right" that the hazard rate declines.

Hamilton's reanalysis fits a Weibull survival function, which introduces a shape parameter κ. When κ = 1 you recover the exponential; when κ < 1 the hazard rate decreases over time, meaning agents that survive the early part of a task become progressively less likely to fail per unit time. Hamilton finds κ ≈ 0.6–0.9 for SOTA models and κ ≈ 0.37 for humans. The two models fit about equally well on available data but diverge dramatically at the tails, where we have very few measurements.

Hamilton frames this through the lens of Peto's Paradox: blue whales have vastly more cells than humans but do not get cancer at a proportionally higher rate, because they have evolved superior error-correction mechanisms. The question for AI agents is whether anything analogous can emerge from scaling alone.

The METR data showed the 50% time horizon (T₅₀) for the best agents was roughly 2.5–5 hours and had been doubling roughly every 7 months. The International AI Safety Report 2026 uses the same data and projects multi-day task completion by 2030 if the trend continues.

The model

Six parameters drive the analysis.

  • Cost per agent step ($): average cost of one model call including context. Ranges from roughly $0.02 (cheap model, short context) to roughly $0.55 (frontier, long context).
  • Steps per hour of human-equivalent work: how many agent actions correspond to one hour of human task time. 50–120 depending on task complexity.
  • T₅₀ (hours): task length at which the agent succeeds 50% of the time. Currently roughly 2.5–5h for frontier models on software tasks.
  • Human hourly rate ($): fully loaded (salary + benefits + overhead). $100–200 for skilled knowledge workers.
  • Weibull shape parameter (κ): 1.0 for exponential (constant hazard), 0.6–0.9 for SOTA models per Hamilton, 0.37 for humans.
  • Verification cost: human time to review each agent attempt, modelled as a fraction of task length.

The key equations:

Exponential model (Ord, κ = 1):

P(success) = 0.5^(task_hours / T₅₀)
E[attempts] = 2^(task_hours / T₅₀)

Weibull model (Hamilton, κ < 1):

λ = T₅₀ / (ln 2)^(1/κ)
P(success) = exp(-(task_hours / λ)^κ)
E[attempts] = exp((task_hours / λ)^κ)

In both cases: agent cost per success = (steps × $/step × task_hours) × E[attempts]. Verification cost = (review_fraction × task_hours × human_rate) × E[attempts]. Total agent cost = agent cost + verification cost. Human cost = hourly_rate × task_hours.

Human cost is linear in task length. Agent cost per success is superlinear. They must cross somewhere.
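
For readers who want to poke at these numbers outside the calculators, here is a minimal sketch of the model in code (the function and parameter names are mine; the values are the base case used in the next section). It reproduces the tables below to within rounding:

```python
import math

def p_success(task_hours, t50, kappa=1.0):
    """Survival probability: exponential when kappa == 1, Weibull when kappa < 1."""
    lam = t50 / math.log(2) ** (1 / kappa)  # scale chosen so that P(t50) = 0.5
    return math.exp(-((task_hours / lam) ** kappa))

def cost_per_success(task_hours, t50, kappa=1.0, cost_per_step=0.22,
                     steps_per_hour=80, human_rate=150.0, review_fraction=0.0):
    """Expected compute + verification cost per successful completion."""
    attempts = 1.0 / p_success(task_hours, t50, kappa)          # E[attempts]
    compute = steps_per_hour * cost_per_step * task_hours * attempts
    verification = review_fraction * task_hours * human_rate * attempts
    return compute + verification

def human_cost(task_hours, human_rate=150.0):
    return human_rate * task_hours

# Base case, 24-hour task: exponential vs Weibull (kappa = 0.70)
for kappa in (1.0, 0.70):
    agent = cost_per_success(24, t50=5, kappa=kappa)
    print(f"kappa={kappa}: agent ${agent:,.0f} vs human ${human_cost(24):,.0f}")
# kappa=1.0: agent ~$11.8K vs human $3,600 (cf. the 24h row of the exponential table)
# kappa=0.7: agent ~$3.4K vs human $3,600 (cf. the 24h row of the Weibull table)
```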

Results: base case under both models

Using base case parameters (cost/step = $0.22, steps/hr = 80, T₅₀ = 5h, human rate = $150/hr, no verification cost):

Exponential model (κ = 1.0):

| Task length | P(success) | E[attempts] | Agent cost | Human cost | Ratio |
|---|---|---|---|---|---|
| 15 min | 96.6% | 1.0 | $5 | $38 | 0.12× |
| 1h | 87.1% | 1.1 | $20 | $150 | 0.13× |
| 4h | 57.4% | 1.7 | $123 | $600 | 0.20× |
| 8h | 33.0% | 3.0 | $427 | $1,200 | 0.36× |
| 16h | 10.9% | 9.2 | $2,588 | $2,400 | 1.08× |
| 24h | 3.6% | 27.9 | $11,767 | $3,600 | 3.27× |
| 1 week (40h) | 0.4% | 256 | $180K | $6,000 | 30× |
| 2 weeks (80h) | ~0% | 65,536 | $92M | $12,000 | ~7,700× |

Weibull model (κ = 0.70):

| Task length | P(success) | E[attempts] | Agent cost | Human cost | Ratio |
|---|---|---|---|---|---|
| 15 min | 91.8% | 1.1 | $5 | $38 | 0.13× |
| 1h | 79.9% | 1.3 | $22 | $150 | 0.15× |
| 4h | 55.3% | 1.8 | $127 | $600 | 0.21× |
| 8h | 38.2% | 2.6 | $369 | $1,200 | 0.31× |
| 16h | 20.9% | 4.8 | $1,346 | $2,400 | 0.56× |
| 24h | 12.5% | 8.0 | $3,375 | $3,600 | 0.94× |
| 1 week (40h) | 5.1% | 19.5 | $13,744 | $6,000 | 2.29× |
| 2 weeks (80h) | 0.8% | 124.9 | $176K | $12,000 | 14.7× |

Under the exponential model, the crossover hits at roughly 16 hours. Under the Weibull model, it shifts to roughly 24 hours, and the blowup beyond is far gentler. The 80-hour task costs about 15× a human under Weibull rather than 7,700× under exponential. This is the difference between "expensive but maybe worth a few retries" and "absurd." But "gentler" is relative. Even under the Weibull model, a one-week task costs $14K versus a human's $6K.

Importantly, these tables exclude verification cost. As we will see, including realistic verification changes the picture substantially.

Three key findings

Finding 1: cost reductions cannot beat the superlinear growth

Cost per step enters the equation as a linear multiplier. T₅₀ enters the exponent (or the stretched exponent). For an 8-hour task, crossing T₅₀ against cost per step:

Exponential model (ratio of agent cost to human cost):

| T₅₀ \ $/step | $0.01 | $0.05 | $0.25 | $0.50 | $1.00 |
|---|---|---|---|---|---|
| 1h | 1.4× | 6.8× | 34× | 68× | 137× |
| 2h | 0.1× | 0.4× | 2.1× | 4.3× | 8.5× |
| 5h | 0.02× | 0.1× | 0.4× | 0.8× | 1.6× |
| 12h | 0.01× | 0.04× | 0.2× | 0.4× | 0.8× |
| 40h | <0.01× | 0.02× | 0.2× | 0.3× | 0.6× |

Weibull model (κ = 0.7):

| T₅₀ \ $/step | $0.01 | $0.05 | $0.25 | $0.50 | $1.00 |
|---|---|---|---|---|---|
| 1h | 0.1× | 0.5× | 2.6× | 5.2× | 10.4× |
| 2h | 0.04× | 0.2× | 0.8× | 1.7× | 3.3× |
| 5h | 0.01× | 0.07× | 0.3× | 0.7× | 1.4× |
| 12h | <0.01× | 0.04× | 0.2× | 0.4× | 0.9× |
| 40h | <0.01× | 0.03× | 0.2× | 0.3× | 0.7× |

Under both models, improving T₅₀ helps far more than cutting costs. This is the economic translation of Ord's Scaling Paradox: you can keep making each step cheaper, but the number of required attempts is growing superlinearly with task length. Cost per step can only approach zero. T₅₀ has no upper bound.

Finding 2: the half-life is the whole game, and κ is the hidden variable

Hamilton's most striking observation in the available data is that κ does not appear to correlate with model size. Larger models reduce the base hazard rate (producing higher T₅₀ values), but they do not seem to improve κ (the shape of the reliability decay over time). There is no visible trend of bigger models developing the declining-hazard property that would signal "Peto-like" error-correction.

This is based on limited data, and Hamilton is appropriately cautious. He writes "I don't want to overplay it" and notes that the estimates for smaller models are unreliable because they failed so quickly. The confidence intervals on κ estimates for individual models are wide. It would be premature to treat "κ does not improve with scale" as a firm empirical law. What we can say is that the data available so far does not show improvement, and that this is consistent with the theoretical possibility that κ reflects something about the architecture rather than something about scale.

If this pattern holds, it means scaling gives you a faster agent that makes fewer errors per unit time (higher T₅₀), but does not change the structure of how reliability decays over task duration. Humans, with κ ≈ 0.37, are qualitatively different: once they understand a task well enough to survive the first few minutes, they become progressively less likely to fail. SOTA models, with κ ≈ 0.6–0.9, maintain a much higher background rate of random failure throughout.

To see why, consider how a current agent actually processes a long task. At each step, the model receives its context window (the task description, its previous actions, observations from the environment) and produces the next action. This context window is the agent's entire working memory. It has no other state. It cannot update its own weights, form new long-term memories, or build a persistent internal model of the task that survives beyond the context window's horizon.

As a task progresses, several things degrade. The context fills with earlier actions and observations, some relevant and some not, and the model's attention over this growing history becomes less precise. If the agent made a subtle error early on (a wrong assumption, a misunderstood requirement), it has no mechanism to notice and correct this unless the error produces an immediately visible failure. Ord noted this in his original analysis: the constant hazard rate implies "the chance of failing in the next step is the same regardless of how far you have got," which he called "a surprising and interesting property for reasoning agents." Hamilton's finding of κ < 1 shows the situation is slightly better than that (agents do show some ability to "warm up"), but far worse than humans, who actively learn the task as they work on it.

A human working on a two-week project builds progressively richer understanding: they remember which approaches failed, develop intuitions about the problem, and refine their strategy. This is why human κ is 0.37: the hazard rate drops steeply after the initial period because the human is genuinely becoming better at the task over time. A current LLM agent has no equivalent mechanism. Each step is a fresh inference over a growing but degrading context, not a continuation of a learning process. This is the bottleneck that Sutskever identifies as "continual learning" and Karpathy describes as the missing "hippocampus": the ability to form and retain task-specific knowledge during execution. Without it, κ is structurally bounded away from human levels regardless of how capable each individual step becomes.

Finding 3: verification cost is the binding constraint

Ord noted in his comment on the original post that under the Weibull model, "it may be that it is the cost of human verification that gets you, rather than it being untenable even on AI costs alone." The numbers bear this out.

Both compute and verification scale with E[attempts] × task_hours, so the ratio between them is determined by their per-hour rates. At the base case parameters ($0.22/step, 80 steps/hr, 20% review, $150/hr human rate), compute per hour costs $17.60 and verification per hour costs $30.00. That makes verification roughly 63% of total cost at these specific parameters. This ratio is not a deep structural feature of the Weibull model; it shifts with parameters. At $0.10/step, verification is about 52% of total cost; at 10% review, about 46%. Across plausible parameter ranges, verification accounts for 45–70% of total cost, which is to say it is consistently the larger component, though not by a fixed ratio.

The important point is not the exact percentage but the qualitative implication: making compute cheaper does not help much when verification is the larger cost, and verification cost is human labour, which does not follow Moore's law.

Full cost breakdown with 20% verification, Weibull κ = 0.70:

| T₅₀ | Task | P(success) | E[attempts] | Compute | Verification | Total | Human | Ratio |
|---|---|---|---|---|---|---|---|---|
| 5h | 8h | 38.2% | 2.6 | $369 | $629 | $998 | $1,200 | 0.83× |
| 5h | 1 wk | 5.1% | 19.5 | $13,744 | $23,427 | $37,172 | $6,000 | 6.20× |
| 5h | 2 wk | 0.8% | 124.9 | $175,832 | $299,713 | $475,545 | $12,000 | 39.6× |
| 10h | 1 wk | 16.1% | 6.2 | $4,385 | $7,475 | $11,860 | $6,000 | 1.98× |
| 20h | 1 wk | 32.4% | 3.1 | $2,171 | $3,700 | $5,871 | $6,000 | 0.98× |
| 20h | 2 wk | 16.1% | 6.2 | $8,771 | $14,950 | $23,720 | $12,000 | 1.98× |
| 40h | 1 wk | 50.0% | 2.0 | $1,408 | $2,400 | $3,808 | $6,000 | 0.63× |
| 40h | 2 wk | 32.4% | 3.1 | $4,341 | $7,400 | $11,741 | $12,000 | 0.98× |
| 40h | 1 mo | 16.1% | 6.2 | $17,541 | $29,900 | $47,441 | $24,000 | 1.98× |
| 80h | 2 wk | 50.0% | 2.0 | $2,816 | $4,800 | $7,616 | $12,000 | 0.63× |
| 80h | 1 mo | 32.4% | 3.1 | $8,683 | $14,800 | $23,483 | $24,000 | 0.98× |

The verification floor. Even if compute were completely free, there is a minimum T₅₀ below which human review time alone exceeds the cost of a human doing the work:

| Task | At 10% review | At 20% review | At 30% review |
|---|---|---|---|
| 1 week (40h) | T₅₀ > 7.2h | T₅₀ > 12.0h | T₅₀ > 18.2h |
| 2 weeks (80h) | T₅₀ > 14.4h | T₅₀ > 24.0h | T₅₀ > 36.4h |
| 1 month (160h) | T₅₀ > 28.8h | T₅₀ > 48.0h | T₅₀ > 72.7h |

For careful review (30%) of month-long tasks, you need T₅₀ > 73 hours before verification alone is viable. No amount of inference cost reduction can help. The bottleneck is human time.
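
To make the floor explicit: requiring verification cost alone to stay below the human cost, review_fraction × task_hours × rate × E[attempts] < rate × task_hours, gives E[attempts] < 1/review_fraction, and inverting the Weibull survival function turns that into T₅₀ > task_hours × (ln 2 / ln(1/review_fraction))^(1/κ). This is my reconstruction of how the table is computed rather than a formula stated elsewhere, but it reproduces the values at κ = 0.70:

```python
import math

def t50_floor(task_hours, review_fraction, kappa=0.70):
    """Minimum T50 at which verification cost alone stays below the human cost."""
    return task_hours * (math.log(2) / math.log(1 / review_fraction)) ** (1 / kappa)

for hours in (40, 80, 160):
    print(hours, [round(t50_floor(hours, f), 1) for f in (0.10, 0.20, 0.30)])
# 40  -> [7.2, 12.0, 18.2]
# 80  -> [14.4, 24.0, 36.4]
# 160 -> [28.8, 48.0, 72.7]
```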

The architectural implication. Domains with cheap automated verification (coding against test suites, formal proofs, quantitative finance with backtesting) will reach viability first. Domains requiring human judgement face a verification tax that may keep them unviable even when compute costs are favourable. This helps explain why coding agents have been the first commercially successful agentic applications: the verification is cheap.

The case that κ is an architectural property

Before working through the timelines, I want to make the argument that κ will not improve through scaling alone. If this is wrong, the timelines compress dramatically (as I will show), and the safety picture darkens considerably. If this is right, the timelines are more reassuring.

The empirical evidence

Hamilton's finding that κ does not correlate with model size in the available METR data is suggestive, though as noted above, the sample is small and the estimates noisy. His Figure 3 shows a handful of data points with wide confidence intervals and no trend. That is consistent with "no correlation" but also consistent with "insufficient data." Hamilton himself treats it cautiously.

What is more informative than the point estimates is what the data shows about the mechanism. Hamilton found that scaling reduces λ (the base hazard rate, which determines T₅₀) but does not change κ (the shape of how hazard evolves over time). This is exactly what you would expect if κ reflects a qualitative property of how the model handles errors over extended operation, rather than how many errors it makes per step.

A model with more parameters and better training makes fewer mistakes per unit time (lower λ, higher T₅₀). But the structure of when mistakes happen over the course of a long task, whether they cluster at the beginning (low κ) or occur uniformly throughout (κ near 1), appears to be a property of the architecture.

The theoretical argument

This aligns with what Ilya Sutskever and Andrej Karpathy have argued, though neither frames it in terms of κ specifically. Their diagnoses point in the same direction: that what is missing from current models is not raw capability but the ability to maintain coherence over extended operation.

Sutskever, in his November 2025 interview with Dwarkesh Patel, argued that the age of scaling (2020 to 2025) is ending and a new age of research is beginning. His central claim is that current models generalise dramatically worse than humans, and that the missing ingredient is continual learning: the ability to update on the job, to remember what happened an hour ago, to build a running model of the task as it unfolds. He describes the goal as a "superintelligent 15-year-old" that does not know your job yet but can learn it extremely fast. He told CTech: "This is one of the very confusing things about the models right now. How to reconcile the fact that they are doing so well on evals? ... But the economic impact seems to be dramatically behind." He places the continual learning breakthrough 5 to 20 years away.

Karpathy, in his 2025 year-in-review, described current LLMs as "ghosts" rather than "animals": entities with "amusingly jagged performance characteristics." He has argued separately that LLMs lack the equivalent of a hippocampus, amygdala, and value function, and that working through these deficits could take roughly a decade.

Dwarkesh Patel put it directly: "The lack of continual learning is a huge, huge problem. ... Either these models will soon learn on the job in a self-directed way, making all this pre-baking pointless, or they won't, which means AGI is not imminent."

In the language of the Weibull model, what Sutskever and Karpathy are describing is precisely a κ problem. A model without continual learning cannot update its approach mid-task. It cannot notice that its strategy is failing and adjust. It cannot build up task-specific expertise as the work progresses. These are the capabilities that would produce a declining hazard rate (low κ): as you progress through a task, if you can learn from the early steps, you become progressively less likely to fail on later steps. Humans do this naturally (κ ≈ 0.37). Current LLMs do not (κ ≈ 0.6–0.9).

Inference-time scaling (making the model "think harder" per step) improves per-step quality but does not address this. A model that thinks harder on each move is like a chess player who analyses each position more carefully but still cannot remember what happened twenty moves ago. It reduces λ (fewer errors per step) but does not change κ (the structure of how errors compound over time).

What this means

If κ is indeed an architectural property, then the timeline for dangerous autonomous agents is determined by two independent variables: T₅₀ (which is improving on a known trend and can be projected forward) and κ (which is stuck at roughly 0.70 and will remain there until someone builds a qualitatively different kind of model). The next section works through what the T₅₀ trend means for each danger scenario while κ stays at 0.70, and then shows how dramatically the picture changes if κ improves.

The timeline at κ = 0.70: five years of autonomous agent economics

Given the METR doubling trend (T₅₀ doubling every 7 months from roughly 5 hours in early 2026) and assuming κ stays at 0.70, here is what the next five years look like. All probabilities use the Weibull model.

The T₅₀ trajectory

| Date | T₅₀ | P(1-week) | P(2-week) | P(1-month) | P(3-month) |
|---|---|---|---|---|---|
| Feb 2026 (now) | 5h | 5.1% | 0.8% | ~0% | ~0% |
| Sep 2026 | 10h | 16.1% | 5.1% | 0.8% | ~0% |
| Apr 2027 | 20h | 32.4% | 16.1% | 5.1% | 0.2% |
| Nov 2027 | 40h | 50.0% | 32.4% | 16.1% | 1.9% |
| Jun 2028 | 80h | 65.3% | 50.0% | 32.4% | 8.8% |
| Jan 2029 | 160h | 76.9% | 65.3% | 50.0% | 22.4% |
| Aug 2029 | 320h | 85.1% | 76.9% | 65.3% | 39.8% |

Task lengths are in human-equivalent hours. The date column assumes the METR doubling trend continues unbroken, which is a significant assumption.
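
The trajectory is just the doubling trend fed back into the Weibull survival function. A minimal sketch (naming is mine), reproducing the November 2027 row:

```python
import math

def t50_at(months_from_now, t50_now=5.0, doubling_months=7.0):
    """Project T50 forward on the METR doubling trend (from ~5h in Feb 2026)."""
    return t50_now * 2 ** (months_from_now / doubling_months)

def p_success(task_hours, t50, kappa=0.70):
    lam = t50 / math.log(2) ** (1 / kappa)
    return math.exp(-((task_hours / lam) ** kappa))

# November 2027 is ~21 months out, i.e. three doublings: T50 ~ 40h
t50 = t50_at(21)
for label, hours in [("1 week", 40), ("2 weeks", 80), ("1 month", 160), ("3 months", 480)]:
    print(f"{label}: {100 * p_success(hours, t50):.1f}%")
# 1 week: 50.0%, 2 weeks: 32.4%, 1 month: 16.1%, 3 months: 1.9%
```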

The viability map over time

Viability means total agent cost (compute + 20% human verification) falls below the cost of a human doing the same work. Note that I also model cost per step rising with capability (more capable models require more inference compute):

| Date | T₅₀ | $/step | 1-week | 2-week | 1-month |
|---|---|---|---|---|---|
| Feb 2026 (now) | 5h | $0.22 | 6.2× | 39.6× | 808× |
| Sep 2026 | 10h | $0.22 | 2.0× | 6.2× | 39.6× |
| Apr 2027 | 20h | $0.30 | 1.1× | 2.2× | 7.0× |
| Nov 2027 | 40h | $0.50 | 0.93× | 1.4× | 2.9× |
| Jun 2028 | 80h | $0.50 | 0.72× | 0.93× | 1.4× |
| Jan 2029 | 160h | $0.75 | 0.78× | 0.92× | 1.2× |
| Aug 2029 | 320h | $1.00 | 0.86× | 0.95× | 1.1× |

Values below 1.0× mean the agent is cheaper than the human. Note an important subtlety: at very high T₅₀ with expensive frontier models, the viability ratio actually levels off rather than continuing to improve. At T₅₀ = 320h with $1.00/step, one-week tasks are 0.86× rather than the 0.49× they would be at $0.50/step. This is because the most capable models cost more per step, and at the 5× T₅₀ viability frontier, the cost saving from fewer retries is partially offset by higher per-step costs. The viability window may prove narrower than the compute-only analysis suggests.

Milestone 1: One-week bounded tasks viable for companies (April 2027, ~14 months away)

At T₅₀ ≈ 20h, a one-week task with 20% human verification costs roughly the same as a human. This is the entry to Regime 2 (multi-day tasks). The safety implications are modest: bounded tasks, defined deliverables, verification checkpoints. Companies will deploy even at or slightly above cost parity for speed and talent-scarcity reasons.

Milestone 2: Two-week bounded tasks viable (November 2027, ~21 months away)

At T₅₀ ≈ 40h, two-week tasks cross the viability threshold. An 80-hour task involves enough scope for consequential actions. Verification at 20% means a human spends 16 hours reviewing 80 hours of agent work per attempt. This is the regime where the Condition 3 analysis (cutting verification) becomes practically relevant.

Milestone 3: Self-sustaining agent-verifier pair can attempt to bootstrap (late 2027 onwards)

The agent-verifier pair is economically viable at T₅₀-length tasks: 50% success rate, 64% precision with an AI verifier, roughly $1,000–$3,000 delivery cost depending on the T₅₀ regime. But the binding constraint is bootstrapping: setting up a web presence, acquiring clients, and managing a business is itself a multi-week coherent plan subject to the half-life tax.

| Date | T₅₀ | P(bootstrap) | P(after 10 attempts) |
|---|---|---|---|
| Sep 2026 | 10h | 0.8% | 7.7% |
| Apr 2027 | 20h | 5.1% | 40.9% |
| Nov 2027 | 40h | 16.1% | 82.6% |
| Jun 2028 | 80h | 32.4% | 98.0% |
| Jan 2029 | 160h | 50.0% | 99.9% |

I model bootstrap as a roughly 160h coherent plan (setting up a plausible freelancing presence, payment mechanism, client acquisition pipeline). A single attempt at T₅₀ = 40h (November 2027) succeeds 16% of the time. But "one in six" is not "impossible," and there is nothing preventing multiple attempts. After 10 attempts, the cumulative probability is 83%.
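
The "after 10 attempts" column is simple independent-retry arithmetic, P(at least one success in n attempts) = 1 − (1 − p)ⁿ; for the November 2027 row:

```python
p_single = 0.161                       # P(success) on a ~160h bootstrap plan at T50 = 40h
p_after_10 = 1 - (1 - p_single) ** 10  # at least one success in 10 independent attempts
print(round(p_after_10, 3))            # 0.827, i.e. ~83%
```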

A pair that successfully bootstraps has $2–3M/year in steady-state revenue. This is more than enough to purchase access to newer, better models as they become available. The pair's ceiling is not fixed at "current κ forever" but at "current κ until a better model is available for purchase." It will automatically ride any future capability improvement.

Where the pair is stuck: it is self-sustaining at T₅₀ tasks, but T₅₀ tasks are one-week human-equivalent work. The pair cannot execute multi-week plans with any reliability. P(success on a 160h plan) = 16.1% at T₅₀ = 40h: an 84% failure rate on its own long-horizon goals, before any external opposition. The pair can survive and earn money, but it cannot reliably grow, adapt, or pursue complex objectives. The revenue it generates is enough to pay for compute but not enough to fund the kind of capability expansion (training new models, building data centres) that would lead to a capability takeoff. The self-sustaining pair is an economic nuisance, not an existential threat, unless κ improves.

The natural follow-up question is whether multiple pairs could coordinate to overcome this ceiling. The answer is no: decomposition is strictly worse than single-pair retries under κ < 1, because each fresh agent restarts from the high initial hazard rate (see Appendix B).

Milestone 4: Month-long autonomous plans become non-trivially reliable (mid 2027 to early 2029)

This is the milestone that matters most for existential risk. An agent (or agent system) that can execute a coherent month-long plan with meaningful probability can pursue complex sub-goals, acquire resources, and resist oversight in ways that shorter-horizon agents cannot.

| Threshold | T₅₀ needed | Date (on trend) |
|---|---|---|
| P(1-month) > 10% | 29h | ~July 2027 |
| P(1-month) > 25% | 59h | ~March 2028 |
| P(1-month) > 50% | 160h | ~January 2029 |

At 10% success, a reckless deployer running many parallel attempts starts getting occasional month-long coherent runs. At 25%, the odds are workable for a state actor running a portfolio of agents. At 50%, a single agent has a coin-flip chance of completing a coherent month of autonomous operation. The 10% threshold arrives in roughly 18 months. The 50% threshold arrives in roughly 3 years.
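
The threshold rows come from inverting the survival function: reaching P(success on an H-hour task) ≥ p requires T₅₀ ≥ H × (ln 2 / ln(1/p))^(1/κ), and the doubling trend converts that T₅₀ into a date. A minimal sketch (dates measured in months from February 2026; names are mine):

```python
import math

def t50_needed(task_hours, target_p, kappa=0.70):
    """Smallest T50 at which a task of this length succeeds with probability target_p."""
    return task_hours * (math.log(2) / math.log(1 / target_p)) ** (1 / kappa)

def months_until(t50_target, t50_now=5.0, doubling_months=7.0):
    """Months (from Feb 2026) until the METR trend reaches t50_target."""
    return doubling_months * math.log2(t50_target / t50_now)

for p in (0.10, 0.25, 0.50):
    t50 = t50_needed(160, p)           # one month = 160 human-equivalent hours
    print(f"P(1-month) > {p:.0%}: T50 ~ {t50:.0f}h, ~{months_until(t50):.0f} months out")
# >10%: ~29h, ~18 months (mid 2027); >25%: ~59h, ~25 months (early 2028); >50%: 160h, 35 months (Jan 2029)
```

The same function, with κ left free, generates the κ-sensitivity tables later in the post.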

Milestone 5: State actor parallel-agent capability (already emerging, scales with T₅₀)

A sophisticated state actor would not rely on a single agent executing a single long-horizon plan. They would run dozens of parallel agents on coordinated week-long sub-plans, with human strategic oversight connecting the pieces.

| Date | T₅₀ | P(1-week sub-task) | Expected successes from 50 agents | Annual cost |
|---|---|---|---|---|
| Feb 2026 | 5h | 5.1% | 2.6 | $7.6M |
| Sep 2026 | 10h | 16.1% | 8.0 | $7.6M |
| Apr 2027 | 20h | 32.4% | 16.2 | $10.4M |
| Nov 2027 | 40h | 50.0% | 25.0 | $17.3M |
| Jun 2028 | 80h | 65.3% | 32.6 | $17.3M |
| Jan 2029 | 160h | 76.9% | 38.5 | $25.9M |

At T₅₀ = 40h (November 2027), a state actor running 50 parallel agents gets 25 successful one-week sub-plans per batch at $17M/year. With human coordination across the successes, the effective capability is substantially higher than the single-agent analysis suggests. The state actor is paying for a team of unreliable contractors rather than an autonomous superintelligence, but at scale the results can be significant.

The dollar costs are rounding errors in a state intelligence budget. The binding constraint is availability of competent human strategists to coordinate the parallel agents, and even that bottleneck loosens as T₅₀ grows and the agents become more reliable on their sub-tasks.
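
The expected-success and cost columns follow directly from the assumptions already in play: 50 agents running one-week sub-plans continuously, compute only, with per-step prices taken from the viability map. A minimal sketch (the 360-day year is my guess at the rounding convention behind the table):

```python
import math

def p_success(task_hours, t50, kappa=0.70):
    lam = t50 / math.log(2) ** (1 / kappa)
    return math.exp(-((task_hours / lam) ** kappa))

agents, steps_per_hour = 50, 80
hours_per_year = 24 * 360            # continuous operation; matches the table's rounding

for t50, price in [(5, 0.22), (40, 0.50), (160, 0.75)]:
    expected = agents * p_success(40, t50)                      # one-week (40h) sub-plans
    annual = agents * steps_per_hour * price * hours_per_year   # compute only
    print(f"T50={t50}h: ~{expected:.1f} successes per batch, ~${annual / 1e6:.1f}M/year")
# T50=5h: ~2.6, ~$7.6M;  T50=40h: ~25.0, ~$17.3M;  T50=160h: ~38.5, ~$25.9M
```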

Milestone summary at κ = 0.70

| Milestone | T₅₀ needed | Projected date | Risk level |
|---|---|---|---|
| 1-week bounded tasks viable | ~20h | Apr 2027 | Low (bounded, verified) |
| 2-week bounded tasks viable | ~40h | Nov 2027 | Moderate (scope for consequential actions) |
| Self-sustaining pair bootstrap (16%) | ~40h | Nov 2027 | Moderate (autonomous but capped) |
| Self-sustaining pair bootstrap (83%) | ~40h | Nov 2027+ (10 attempts) | Elevated (near-certain with retries) |
| P(1-month plan) > 10% | ~29h | Jul 2027 | Elevated (occasional coherent runs) |
| P(1-month plan) > 25% | ~59h | Mar 2028 | High (workable for portfolio strategy) |
| P(1-month plan) > 50% | ~160h | Jan 2029 | Critical (coin-flip on month-long autonomy) |
| P(3-month plan) > 10% | ~86h | Jul 2028 | High (long-range goals become possible) |
| P(3-month plan) > 50% | ~480h | Dec 2029 | Critical (reliable long-range autonomy) |

The central question: can an autonomous ASI agent emerge?

For an autonomous agent to pose the kinds of risks that animate the most serious safety concerns, a chain of conditions must hold simultaneously. Each has a distinct economic profile. The timeline above tells us when each becomes plausible; this section examines how the economics constrain each.

Condition 1: sufficient task complexity

The risk scales with the complexity of the task the agent is operating on. At sub-day complexity (< 8 human-equivalent hours), the agent's scope of action is bounded. At multi-day complexity (8–40 hours), the scope grows substantially. The safety-relevant threshold is roughly multi-week complexity (40–160+ hours), where the task involves enough scope for the agent to develop sub-goals not specified in the original instructions, interact with external systems in difficult-to-reverse ways, build up internal state too complex for full human audit, and acquire capabilities or resources incidentally.

The viability map tells us when this is economically plausible. Under the Weibull model with realistic verification, two-week-complexity tasks become viable at T₅₀ ≈ 40h, by November 2027 on trend. One-month-complexity tasks require T₅₀ ≈ 80h, by June 2028 on trend. Condition 1 is likely to be met within a few years.

Condition 2: sufficient autonomy

Most economically rational deployments are bounded. If you are paying $11,741 for a two-week-complexity task, you have a specific expected output. The economic relationship between payer and agent inherently creates task structure, and task structure inherently constrains autonomy.

Task decomposition helps concretely. An 80-hour task done as one piece costs $11,741 (T₅₀ = 40h, 20% review). Break it into eight 10-hour chunks and each chunk has roughly 75% success probability, roughly 1.3 expected attempts, and a total cost of roughly $630. Eight chunks cost roughly $5,100 plus coordination overhead, about half the monolithic cost. More importantly, each chunk boundary is a verification checkpoint: the maximum unchecked autonomy window drops from 80 hours to roughly 10.
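
A minimal sketch of that comparison, using the same base-case parameters (it lands within rounding of the figures just quoted):

```python
import math

def cost_per_success(task_hours, t50=40.0, kappa=0.70, cost_per_step=0.22,
                     steps_per_hour=80, human_rate=150.0, review_fraction=0.20):
    lam = t50 / math.log(2) ** (1 / kappa)
    attempts = math.exp((task_hours / lam) ** kappa)   # E[attempts] = 1 / P(success)
    per_hour = steps_per_hour * cost_per_step + review_fraction * human_rate
    return attempts * task_hours * per_hour

print(round(cost_per_success(80)))      # monolithic 80h task: ~$11,740
print(round(8 * cost_per_success(10)))  # eight 10h chunks: ~$4,950, plus coordination overhead
```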

The residual risk is in tasks that are open-ended by nature: research, exploration, ongoing monitoring, strategy development. These resist clean decomposition. The residual autonomy in hard-to-decompose open-ended tasks is the gap through which danger could enter.

The cheap-inference danger window. There is a subtlety here that becomes acute in a post-bust scenario (see below). When inference is expensive, economic pressure naturally constrains autonomy: each failed attempt costs real money, so deployers monitor closely and decompose aggressively. When inference becomes very cheap (as it would in a telecom-style infrastructure glut), the cost of a failed attempt drops to nearly nothing. The economic incentive to supervise weakens, because letting the agent retry autonomously is cheaper than paying a human to check each attempt.

This creates a regime where agents are granted autonomy beyond what their reliability justifies, not because anyone believes they are reliable, but because monitoring costs more than retrying. A company running agents at $0.05/step has little reason to pay a $150/hour human to supervise 10-hour tasks when the expected compute cost per success is under $100. The rational response is to let it run, check the final output, and retry on failure. But "let it run" on a 10-hour task means 10 hours of unsupervised operation, and under the Weibull model at κ = 0.70, roughly 5% of those runs will go off the rails in ways that produce not merely wrong output but unexpected side effects (resource access, network calls, file mutations) that the final-output check does not catch.

At current inference prices, the verification floor keeps this in check: human review is the dominant cost, so it is always being paid for. At fire-sale prices, the verification floor dissolves for bounded tasks, and the economic pressure is to grant autonomy up to whatever task length is "cheap enough to retry." This is the gap through which the most mundane form of danger enters: not a coherent long-horizon plan, but a large population of unsupervised agents doing bounded things with insufficient oversight, some fraction of which fail in correlated or compound ways.

Condition 3: the human-in-the-loop and its fragility

The per-task economics of verification depend on the T₅₀ regime. When the agent's success rate on the task in question is very low, no rational company would attempt those tasks at all, with or without verification. The question "should a company cut verification?" only becomes interesting once the task is within the viable range (roughly within 5× T₅₀).

In that viable range, the per-task economics do favour cutting verification. Without verification, margin per delivered output is roughly $8,492 versus $3,259 with human review, because the verification cost dwarfs the revenue lost to client rejections on any individual transaction.

But agents compress wall-clock time, and that changes the multi-period dynamics. A two-week human-equivalent task takes an agent 2–4 days (running 24/7), so a company delivers 8–15 tasks per client per month. At 10 tasks/month, 98% of clients receive at least one detected failure every month. Clients who receive garbage leave, reputation damage compounds, and the client base collapses fast.

A multi-period simulation with 100 starting clients, reputation-adjusted acquisition, and a quality-churn parameter q (probability a client leaves after receiving at least one detected bad output in a month) gives the following at 10 tasks/client/month. These specific crossover timelines should be treated as illustrative; the churn model uses simplified parameters that are not empirically calibrated, and the real dynamics will vary substantially by domain. The qualitative result, that agent speed compression accelerates the feedback loop, is robust to parameter choices.

| q (churn after a bad month) | Profit overtake | Cumulative crossover | Non-verified clients at 24 months |
|---|---|---|---|
| 10% | ~Month 7 | ~Month 15 | 7 |
| 20% | ~Month 5 | ~Month 9 | 0 |
| 30% | ~Month 4 | ~Month 7 | 0 |
| 50% | ~Month 3 | ~Month 5 | 0 |

Even at q = 10% (very sticky clients), the verified company overtakes in cumulative profit within roughly 15 months. In most client-facing domains with detectable failures, the per-task margin advantage of skipping verification is short-lived.

Where verification remains vulnerable. Genuinely low-frequency engagements resist this self-correction: annual audits, one-off infrastructure designs, or other contexts where a client interacts with the company once or twice a year. Internal use is another gap: companies operating on agent timescales compress their internal decision cycles too, and the feedback loop for bad internal decisions is slower than external client churn. The subtlest vulnerability is in the transition from human to AI verification, which the economics strongly favour (since any verification cheaper than human verification is more profitable per task, and the multi-period dynamics apply regardless of method). AI verification may be good enough to retain clients while being too weak to catch deceptive compliance or subtle misalignment. The market selects for the appearance of verification, not its substance.

Condition 4: operating beyond intended scope

For an agent to pose existential-level risks, it must operate beyond what its deployer intended. There are three sub-scenarios.

4a: Deceptive alignment

The economic analysis does not make deceptive alignment more or less likely in a capability sense, but it constrains its scope. A deceptively aligned agent on a bounded task with defined deliverables has limited room to manoeuvre. The output must be good enough to pass review. The half-life tax means the agent fails often, which means it gets reviewed often, which means the surface area for deception is smaller.

At T₅₀ = 40h on an 80h task, the agent is reviewed 3.1 times per success. At T₅₀ = 80h on 160h tasks, it is still reviewed 3.1 times per success. The review frequency at any given task-to-T₅₀ ratio is constant, because both E[attempts] and the verification schedule scale together. The deceptive agent cannot improve its situation by being more capable: it always faces the same density of human contact per unit of work.

4b: The self-sustaining agent

This is the scenario the economic analysis can most directly address. A self-sustaining agent needs to generate revenue exceeding its own inference cost, which means completing tasks for paying customers while subject to the same reliability dynamics as any other agent.

The solo agent. At T₅₀ = 40h and κ = 0.70, a solo rogue agent succeeds on a two-week-complexity task 32% of the time, with an expected cost per successful output of $4,341 (compute only, no verification). It is competing against humans at $12,000 and supervised agents at $11,741 with quality filtering. It is essentially an unsupervised freelancer with a 68% failure rate on complex tasks.

The agent-verifier pair. A more sophisticated version deploys an AI verifier to filter its own output before delivering to clients. A rational pair would operate at T₅₀-length tasks (where its success rate is highest). At T₅₀ = 40h on 40h tasks: 50% base success rate. With an AI verifier at detection rate d = 0.90 and correlation ρ = 0.50, the verifier catches 45% of failures. The pair's precision on delivered outputs is 64%: roughly two-thirds of what it delivers is genuinely good. Cost per delivered output: roughly $1,003.

| Configuration | Precision | Cost/delivered | Margin at $6K | Steady clients (q=30%) | Monthly profit |
|---|---|---|---|---|---|
| Solo agent (no verifier) | 50% | $706 | $3,784 (84%) | ~8 | $216K |
| Agent + AI verifier (ρ=0.5) | 64% | $1,003 | $3,923 (80%) | ~9 | $237K |
| Agent + AI verifier (ρ=0.0) | 81% | $448 | $4,084 (90%) | ~24 | $750K |
| Supervised agent (human verif.) | ~100% | $11,724 | -$5,724 | N/A | N/A |

The pair is viable. At steady state it earns $2–3M/year. It is unambiguously self-sustaining.

The ceiling is set by the half-life tax on its own plans: P(success on a 160h plan) ≈ 16.1% at T₅₀ = 40h, an 84% failure rate on its own long-horizon goals before any external opposition. The pair can survive and earn money, but it cannot reliably grow, adapt, or pursue complex objectives. As discussed under Milestone 3, bootstrap becomes plausible (83% after 10 attempts) by November 2027.

The upgrade path. A pair earning $2–3M/year can purchase access to newer, better models as they become available. It does not need to train new models or build data centres; it just needs to buy API access. The pair's ceiling is therefore not fixed at "current κ forever" but at "current κ until a better model is available for purchase." The $2–3M annual revenue is more than sufficient to cover even expensive frontier API costs. This means the self-sustaining pair is not just a static nuisance. It is a persistent autonomous economic entity that will automatically benefit from any capability advance in the broader ecosystem. Whether this makes it an existential threat depends entirely on whether κ actually improves.

4c: Reckless deployment

What does reckless deployment cost, and what do you get? I model cost per step rising with capability (inference-time scaling explicitly trades compute for capability):

| Capability tier | T₅₀ | $/step | Monthly cost | P(1-month plan) | Annual cost |
|---|---|---|---|---|---|
| Current frontier | 5h | $0.22 | $12,672 | ~0% | $152K |
| Late 2026 | 10h | $0.30 | $17,280 | 0.8% | $207K |
| 2028 projected | 40h | $0.50 | $28,800 | 16.1% | $346K |
| 2028 optimistic | 40h | $0.75 | $43,200 | 16.1% | $518K |
| Post-κ breakthrough | 40h | $1.00 | $57,600 | 16.1% | $691K |
| ASI-capable | 200h | $2.00 | $115,200 | 55.3% | $1.38M |

A state actor can easily afford $500K/year. The problem is what they get for that money. At 2028-projected capabilities, the agent has a 16% chance of successfully executing a coherent one-month plan. The reckless deployer is paying $346K/year for an agent that fails on its own goals five times out of six.

The parallel-agent strategy. A sophisticated state actor would not rely on a single agent executing a single coherent long-horizon plan. They would run dozens of agents in parallel on coordinated week-long sub-plans, with human strategic oversight connecting the pieces (as detailed under Milestone 5 above). This substantially weakens the "reckless deployment is pointless" conclusion for well-resourced actors. The economic constraint on state actors is not the dollar cost but the availability of competent human strategists to coordinate the parallel agents. This is obviously a big misuse risk.

For a single agent to reliably execute month-long autonomous plans (say >90% success), you need T₅₀ in the hundreds of hours AND κ substantially below 0.70. This is the regime that the available data and the architectural argument say scaling alone does not reach.

How κ changes everything

All the timelines above assume κ stays at 0.70. If κ improves, they accelerate dramatically. The table below shows when P(1-month plan) first exceeds various thresholds, as a function of κ:

When does P(1-month plan) first exceed 10%?

| κ | T₅₀ needed | Date (on trend) | Months from now |
|---|---|---|---|
| 0.90 (pessimistic) | 42h | Dec 2027 | 22 |
| 0.70 (current SOTA) | 29h | Jul 2027 | 18 |
| 0.60 | 22h | May 2027 | 15 |
| 0.50 | 15h | Jan 2027 | 11 |
| 0.45 | 11h | Oct 2026 | 8 |
| 0.37 (human level) | 6h | Apr 2026 | 2 |

When does P(1-month plan) first exceed 25%?

| κ | T₅₀ needed | Date (on trend) |
|---|---|---|
| 0.90 | 74h | Mar 2028 |
| 0.70 (current SOTA) | 59h | Mar 2028 |
| 0.50 | 40h | Nov 2027 |
| 0.37 (human level) | 25h | Jun 2027 |

What happens at T₅₀ = 40h (November 2027) under different κ values:

| κ | P(1-week) | P(2-week) | P(1-month) | P(3-month) |
|---|---|---|---|---|
| 0.90 (pessimistic) | 50.0% | 27.4% | 8.9% | 0.2% |
| 0.70 (current SOTA) | 50.0% | 32.4% | 16.1% | 1.9% |
| 0.60 | 50.0% | 35.0% | 20.3% | 4.6% |
| 0.50 | 50.0% | 37.5% | 25.0% | 9.1% |
| 0.45 | 50.0% | 38.8% | 27.4% | 12.0% |
| 0.37 (human level) | 50.0% | 40.8% | 31.4% | 17.6% |

The one-week probability is always 50% because T₅₀ = 40h by definition. The divergence shows up on longer tasks: at κ = 0.37, the month-long success rate is nearly double what it is at κ = 0.70, and the three-month success rate is 9× higher.

The message is clear. If κ stays at 0.70, the dangerous regime (month-long plans exceeding 25%) arrives around March 2028 and develops slowly. If κ drops to 0.50, it arrives around November 2027, roughly 4 months sooner. If κ reaches human levels, month-long plans would already exceed 10% success at today's T₅₀.

This is why κ is the single most important variable to track. 

What this means for the investment thesis

Hyperscaler capex for AI infrastructure is projected to exceed $600 billion in 2026. Goldman Sachs projects total hyperscaler capex from 2025 to 2027 will reach $1.15 trillion. J.P. Morgan estimates that to deliver a 10% return on AI investments through 2030, the industry needs roughly $650 billion in annual revenue.

Against this, actual AI revenue remains modest relative to investment. OpenAI's annualised revenue reached roughly $20 billion by the end of 2025, Anthropic's around $9 billion. Combined revenue of all AI model providers is probably in the range of $50 to 100 billion annually.

Three regimes of viability emerge from the analysis:

Regime 1 (sub-day tasks, viable now). Tasks under 4–8 human-equivalent hours. Agents are 3–8× cheaper than humans even with verification. Large addressable market, probably low hundreds of billions in annual revenue. Safe: bounded scope, contained failures, cheap verification.

Regime 2 (multi-day tasks, viable with continued T₅₀ growth). Tasks of 1–5 human-equivalent days. Become viable as T₅₀ extends towards 40 hours, plausibly by late 2027. Market of several hundred billion annually. Companies will deploy even above strict cost parity because of speed, talent scarcity, and competitive dynamics.

Regime 3 (multi-week autonomous operation, requires κ improvement). Multi-week bounded tasks with retries and human verification may become viable within a few years. But multi-week autonomous operation, where agents pursue complex goals without meaningful human oversight, requires both high T₅₀ and low κ. The verification analysis reinforces this: even under the Weibull model, verification (not compute) is the binding constraint at realistic review fractions.

Whether the industry needs Regime 3 revenue to justify the capex depends on how quickly Regimes 1 and 2 mature. If they reach $400–600B annually, that may sustain continued scaling. If not, the investment case weakens and the frontier slows. In either case, the timeline to Regime 3, and hence to the most dangerous forms of autonomous agency, is gated on κ.

What happens if the bubble bursts?

The timeline analysis above assumes the METR doubling trend continues unbroken. But the trend is funded by capital investment, and that investment may not be sustained. If κ stays at 0.70 and the revenue from Regimes 1 and 2 proves insufficient to justify the current buildout, a correction follows: capex contracts, frontier labs shrink or consolidate, and the rate of new model releases slows. What does this do to the economics of deployed agents?

The telecom parallel

The most instructive precedent is the telecom infrastructure bubble of 1996 to 2001, not the broader dot-com crash. The dot-com bubble was primarily an application-layer phenomenon: companies with no revenue and a URL getting billion-dollar valuations. The AI investment cycle is not that. The money is going into physical infrastructure (data centres, chips, power contracts), and the companies spending it are profitable giants with real businesses. That maps closely to the telecom buildout, where established companies and well-funded startups poured more than $500 billion, mostly financed with debt, into fibre optic cable, switches, and wireless networks in the five years after the Telecommunications Act of 1996. The buildout was driven by WorldCom's claim that internet traffic was doubling every 100 days. The actual rate, as AT&T researcher Andrew Odlyzko documented at the time, was roughly annual doubling: rapid growth, but an order of magnitude below the projections used to justify the buildout. The infrastructure was real, the technology worked, the long-term demand eventually materialised, but the timing mismatch was catastrophic. Even four years after the bust, 85% of broadband capacity in the United States was still going unused, and bandwidth costs fell by more than 90%. Telecom stocks lost more than $2 trillion in market value between 2000 and 2002.

The structural parallels to AI are tight. Both involve massive infrastructure buildout based on exponential growth projections, a first-mover advantage narrative driving competitive overbuilding, and technological uncertainty about which architectures will prevail. Hyperscaler capex for AI infrastructure is projected to exceed $600 billion in 2026. Goldman Sachs projects total hyperscaler capex from 2025 to 2027 will reach $1.15 trillion, more than double the $477 billion spent from 2022 to 2024. Like the telecom buildout, the AI buildout is increasingly debt-financed: J.P. Morgan projects the technology sector may need to issue $1.5 trillion in new debt over the next five years to finance AI infrastructure.

The parallel is not exact, and the differences matter. The telecom bust created a specific asymmetry that is central to the analysis below: inference (using the infrastructure) became very cheap, while building new infrastructure (training new models) became very expensive or impossible. Existing fibre was dirt cheap to light up. New fibre stopped being laid. The AI equivalent would be: excess GPU capacity floods the market, running existing models becomes nearly free, but new frontier training runs dry up because the capital to fund them has evaporated. However, there is a critical structural difference in the hardware. Buried fibre lasts decades; GPUs depreciate in roughly five years. This means AI overcapacity self-corrects faster than telecom overcapacity did: idle GPUs become obsolete and get written off, reducing the glut. The "85% dark fibre five years later" situation is unlikely to have an exact AI analogue, because the equivalent hardware will not exist in five years. This makes a potential AI bust shallower and shorter than the telecom bust, but does not prevent one.

When might the correction hit?

The investment thesis depends on AI revenue growing into the capex. J.P. Morgan estimates that to deliver a 10% return on AI investments through 2030, the industry needs roughly $650 billion in annual revenue "into perpetuity," which the report itself calls "an astonishingly large number." Against this, actual AI revenue remains modest relative to investment. OpenAI's annualised revenue reached roughly $20 billion by the end of 2025, Anthropic's over $9 billion. Combined revenue of all AI model providers is probably in the range of $50 to 100 billion annually.

For a nearer-term test: cumulative AI-specific capex (roughly 60% of total hyperscaler spending) will reach an estimated $830 billion by end of 2026 and $1.25 trillion by end of 2027. To justify this at a growth-stage valuation multiple (10x revenue), the AI industry needs roughly $125 billion in annual revenue by 2027. At a more sober 7x multiple, it needs $178 billion. Starting from $60 billion in 2025, that requires roughly 44% annual growth to hit $125 billion by 2027 (the 10x threshold) or roughly 72% annual growth to hit $178 billion (the 7x threshold).
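
For concreteness, a minimal sketch of that compound-growth arithmetic (the $60 billion base, the capex figure, and the valuation multiples are the ones quoted above; the helper name is mine):

```python
def required_cagr(base: float, target: float, years: int) -> float:
    """Compound annual growth rate needed to grow from `base` to `target` in `years`."""
    return (target / base) ** (1 / years) - 1

capex_2027 = 1.25e12       # cumulative AI-specific capex by end of 2027 (estimate quoted above)
base_revenue_2025 = 60e9   # rough 2025 AI revenue base

for multiple in (10, 7):
    target = capex_2027 / multiple
    growth = required_cagr(base_revenue_2025, target, years=2)  # 2025 -> 2027
    print(f"{multiple}x multiple: ~${target / 1e9:.0f}B needed by 2027, ~{growth:.0%} annual growth")
# 10x: ~$125B, ~44% annual growth; 7x: ~$179B, ~72% annual growth
```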

The Weibull model makes a specific prediction about why this revenue growth may disappoint. AI revenue currently comes overwhelmingly from Regime 1: bounded sub-day tasks like code generation, content drafting, data processing, and customer support. These tasks are real, growing, and valuable, but they are also commoditising rapidly as multiple providers compete and open-source alternatives proliferate. Regime 1 revenue growth faces margin compression even as volume grows.

The investment case depends heavily on Regime 2 maturing into week-scale work: agentic tasks of roughly one to two weeks that replace professional labour at $150 to $300 per hour. This is where the large revenue multipliers live. A single week-length agent task that replaces a $15,000 consulting engagement is worth vastly more per deployment than a thousand sub-hour code completions. But at T₅₀ = 20h (roughly mid-2027), a one-week (40h) task succeeds only about 32% of the time before verification, and requires human verification costing roughly $3,700 per deployment at a 20% review fraction. The cost ratio (agent cost to human cost) at T₅₀ = 20h on a one-week task is close to 1.0, meaning the agent is barely cheaper than a human even before accounting for its roughly 68% failure rate. The expected cost per successful delivery, including retries and verification, exceeds the cost of simply hiring a human. Two-week tasks are flatly unviable: the cost ratio stays above 1.4 even at fire-sale inference pricing and is close to 2.0 at current pricing. This means the revenue that would justify continued $600 billion or more in annual capex largely depends on task categories that are not yet economically viable at the T₅₀ prevailing during the critical 2027 investment-decision window.
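
To keep the success-probability arithmetic transparent, here is a minimal sketch of the survival model as I reconstruct it from the text: a Weibull survival function with shape κ = 0.70, scaled so that a task of length exactly T₅₀ succeeds 50% of the time. The functional form and helper name are my reconstruction, but it reproduces the headline figures used throughout this post.

```python
import math

def p_success(task_hours: float, t50_hours: float, kappa: float = 0.70) -> float:
    """Probability an agent completes a task of the given length in one attempt,
    under a Weibull survival curve anchored so a T50-length task succeeds 50% of the time."""
    return math.exp(-math.log(2) * (task_hours / t50_hours) ** kappa)

print(f"{p_success(40, 20):.1%}")    # one-week (40h) task at T50 = 20h   -> ~32.4%
print(f"{p_success(160, 40):.1%}")   # one-month (160h) plan at T50 = 40h -> ~16.1%
print(f"{p_success(160, 20):.1%}")   # one-month plan at T50 = 20h        -> ~5.1%
print(f"{p_success(160, 10):.1%}")   # one-month plan at T50 = 10h        -> ~0.8%
```

These reproduce the 32%, 16.1%, 5.1% and 0.8% figures quoted elsewhere in the post. For comparison, with κ = 1 (constant hazard) the same four cases would come out at 25%, 6.3%, 0.4% and essentially zero, which is why the heavier κ < 1 tail makes long tasks and repeated retries somewhat more tenable.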

There is an important caveat to this revenue-gap argument. A large fraction of the "return" on AI capex flows through the hyperscalers' own products in ways that are hard to separate from base cloud revenue. Microsoft's AI-enhanced Azure margins, Google's AI search features, Meta's ad-targeting improvements: these are real economic returns on AI infrastructure that do not show up in the "$50 to 100 billion AI model provider revenue" figure. The revenue gap is real but narrower than the standalone AI revenue figures imply if you count these indirect returns. This is also why the hyperscalers can sustain AI spending longer than telecom companies could sustain fibre spending: the infrastructure serves their own core businesses, not just a speculative external market.

Nonetheless, the gap between investment and revenue is large, and investors will eventually demand evidence that the gap is closing. The J.P. Morgan report draws the parallel explicitly: "Our biggest fear would be a repeat of the telecom and fiber buildout experience, where the revenue curve failed to materialise at a pace that justified continued investment." The telecom buildout ran roughly four to five years before the 2001 correction; the AI buildout started in earnest in 2023, placing the equivalent window at 2027 to 2028.

The correction need not take the form of a telecom-style collapse. The more likely shape for AI is a deceleration: capex growth slowing from 50%+ annually to 10 to 15%, rather than an absolute contraction. The hyperscalers' own businesses benefit from AI infrastructure in ways that telecom companies' businesses did not benefit from excess fibre, which provides a floor under spending that telecom lacked. But even a deceleration has significant effects on the T₅₀ trajectory. Fewer frontier training runs means slower capability growth, which means slower progress toward Regime 2 viability, which means the revenue justification weakens further. The feedback loop is real even if the correction is milder than telecom.

There are structural factors pushing in both directions. On the bearish side: the AI buildout is concentrated among fewer, larger players (five hyperscalers account for most of the spending), which means a coordinated slowdown could happen faster than the fragmented telecom sector's gradual collapse. AI infrastructure depreciates faster than fibre (GPUs have roughly five-year useful lives versus decades for buried cable), so the carrying costs of overcapacity bite harder and sooner. And unlike telecom, where the underlying technology (fibre optics) kept getting better independently of spending, AI capability improvement is directly coupled to investment: no frontier training runs means no new models.

On the bullish side: the hyperscalers are far more profitable than their telecom-era counterparts. Amazon, Google, Meta, and Microsoft generate over $700 billion in combined operating cash flow annually and can absorb write-downs that would have bankrupted a WorldCom. The prisoner's dilemma is powerful: no CEO wants to be the one who pulled back and ceded the AI race to competitors. And the cloud computing precedent (AWS, Azure, GCP) is the strongest argument against a correction: those platforms also looked like insane overbuilding for years before enterprise adoption hit an inflection point. The critical difference is that cloud adoption was eventually enabled by the cheap bandwidth the telecom bust itself produced; what enables the equivalent AI inflection is less clear.

Weighing these bullish and bearish factors, together with the telecom precedent, the revenue gap, the Weibull model's predictions about Regime 2 timing, and the structural differences that make AI more resilient than telecom to a full-scale collapse, my rough estimates are:

  • Before end of 2026 (probability 10 to 15%; T₅₀ at burst ≈ 10h). Early earnings disappointments combined with a macro headwind or exogenous shock. At T₅₀ = 10h, even one-week tasks are unviable (cost ratio > 1.5). Regime 2 is visibly years away. Most likely trigger: a recession or credit tightening that forces the debt-funded buildout to slow, combined with a high-profile AI project failure that shifts the narrative. Less likely than the later scenarios because the hyperscalers' own cash flows can sustain spending for another year even without external revenue justification.
  • During 2027 (probability 25 to 35%; T₅₀ at burst 15 to 25h). By mid-2027, T₅₀ reaches roughly 20h, making one-week tasks marginally viable in theory but not yet generating significant revenue at scale. The gap between $600 billion or more in annual capex and $80 to 120 billion in AI revenue is still large, even accounting for indirect returns through hyperscaler products. Analysts note that Regime 1 margins are compressing. Capital markets begin pricing in lower terminal values for AI infrastructure. This may manifest as a sharp capex contraction or, more likely, as a deceleration: 2028 capex budgets grow 10 to 15% rather than 40 to 50%. Either way, the number of frontier training runs drops, slowing T₅₀ growth.
  • 2028 or later (probability 20 to 30%; T₅₀ at burst 30 to 50h). Regime 2 partially materialises but below expectations. At T₅₀ = 40h (roughly late 2027 on trend), one-week tasks hit 50% base success and become genuinely viable for some enterprise use cases. Revenue grows visibly but not fast enough to justify the cumulative $1.5 trillion or more already invested. The correction is shallower than the 2027 scenario but still significant enough to slow frontier training.
  • No meaningful correction (probability 25 to 30%). Revenue grows into the investment. This requires either (a) Regime 2 materialising faster than the Weibull model predicts, which would imply κ is improving or T₅₀ is growing faster than the 7-month doubling, (b) indirect returns through hyperscaler products being large enough to justify the capex on their own, or (c) an unexpected new revenue category (a "YouTube moment" for AI) that generates demand for compute at a scale nobody has forecast. The cloud computing precedent is the strongest argument for this scenario. The prisoner's dilemma among hyperscalers also provides a floor: even if returns disappoint, no individual company wants to cut first and cede market position. This scenario is more likely than the telecom parallel alone would suggest, because the companies doing the spending are structurally better positioned to sustain losses than telecom companies were.

The cumulative probability of a meaningful correction (defined as capex growth falling below 15% or capex contracting absolutely) by end of 2027 is roughly 40 to 50%. Over a longer horizon through 2028, it rises to roughly 55 to 70%. These are wide ranges that reflect genuine uncertainty, not false precision. The most likely outcome is not a telecom-style crash but a deceleration in 2027 or 2028 that reduces but does not eliminate frontier training, slowing T₅₀ growth from the current 7-month doubling to something in the range of 14 to 24 months. A full telecom-style collapse (absolute capex contraction, frontier training essentially stopping) is possible but less likely for AI than it was for telecom, given the hyperscalers' profitability and the dual-use nature of the infrastructure. I assign it roughly 15 to 25% probability.

The T₅₀ growth rate is the variable that matters

I model the bust as hitting at various points along the T₅₀ trajectory, with three scenarios for what happens afterwards:

  • Freeze: frontier labs shut down or pivot, there are no new training runs, and T₅₀ is capped where it stands.
  • Crawl (24-month doubling): the telecom-precedent default. R&D continues at a reduced pace, with fewer labs competing and fewer frontier training runs; progress is driven mainly by open-source research and inference-time improvements.
  • Slowdown (14-month doubling): a milder correction in which some frontier investment continues.

What this does to the dangerous milestones (assuming bust hits at T₅₀ = 20h, roughly April 2027):

| Milestone | Baseline (7-month) | Crawl (24-month) | Freeze at T₅₀ = 20h |
|---|---|---|---|
| 1-week tasks viable | Nov 2027 | late 2030 | Apr 2027 |
| 2-week tasks viable | Jun 2028 | 2032+ | Never |
| P(1-month plan) > 10% | Jul 2027 | ~mid 2029 | Never |
| P(1-month plan) > 25% | Mar 2028 | ~2031 | Never |
| P(1-month plan) > 50% | Jan 2029 | ~2033+ | Never |

A crawl to 24-month doubling (the central telecom-precedent scenario) delays all dangerous milestones by roughly two to four years. A freeze at T₅₀ = 20h means two-week tasks never become viable and month-long autonomous plans never exceed 5.1% success. Even a freeze at the more generous T₅₀ = 40h (if one more generation ships before the bust) only gets month-long plans to 16.1%, permanently.
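
The baseline dates in the table follow mechanically from the doubling assumption. A sketch, using the same reconstructed survival function, κ = 0.70, and the T₅₀ ≈ 20h around April 2027 anchor stated above (helper names are mine; dates are approximate):

```python
import math

KAPPA = 0.70
T50_ANCHOR = 20.0   # hours; roughly April 2027 on the METR trend (from the text)

def t50_needed(task_hours: float, target_p: float, kappa: float = KAPPA) -> float:
    """Smallest T50 at which a task of the given length succeeds with probability target_p."""
    return task_hours / (math.log(1 / target_p) / math.log(2)) ** (1 / kappa)

def months_after_anchor(t50_target: float, doubling_months: float) -> float:
    """Months after the T50 = 20h anchor until T50 reaches t50_target at a given doubling time."""
    return doubling_months * math.log2(t50_target / T50_ANCHOR)

for p in (0.10, 0.25, 0.50):
    t50 = t50_needed(160, p)   # 160h ~ a one-month autonomous plan
    months = months_after_anchor(t50, doubling_months=7)
    print(f"P(1-month plan) > {p:.0%}: needs T50 ~ {t50:.0f}h, ~{months:.0f} months after April 2027")
# ~29h after ~4 months, ~59h after ~11 months, 160h after ~21 months:
# approximately the Jul 2027 / Mar 2028 / Jan 2029 baseline column above.
```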

Inference cost: the telecom asymmetry

The telecom parallel predicts a specific post-bust pricing dynamic. When telecom overbuilt, the glut of dark fibre made bandwidth almost free. That cheap bandwidth eventually enabled YouTube, streaming, and cloud computing, none of which would have been viable at pre-bust prices. The AI equivalent: excess GPU capacity already built into data centres has no alternative use, and providers with sunk costs will sell inference at marginal electricity cost rather than let hardware sit idle. The default post-bust scenario is fire-sale inference, not oligopoly pricing.

This creates the inference-training asymmetry that defines the post-bust landscape. Inference becomes cheap (excess capacity, sunk costs). Training becomes rare or stops (no capital for new $1 to 10B frontier runs). Existing models are dirt cheap to run. New, better models stop arriving. This is precisely the scenario where κ becomes the permanently binding constraint: you can retry cheaply all day, but κ does not improve because κ requires architectural breakthroughs that come from new training, and those are not happening.

The viability of a one-week task at T₅₀ = 20h (a plausible freeze point) across cost scenarios:

| Inference cost scenario | $/step | 1-week task cost ratio | 2-week task cost ratio |
|---|---|---|---|
| Fire sale (excess GPUs) | $0.05 | 0.70× | 1.41× |
| Moderate post-bust | $0.15 | 0.86× | 1.74× |
| Current pricing | $0.22 | 0.98× | 1.98× |
| Cost recovery (end of subsidy) | $0.50 | 1.44× | 2.91× |

Even at fire-sale pricing, two-week tasks remain unviable at T₅₀ = 20h. This is because verification cost dominates: at 20% review, the verification component alone for a one-week task is $3,700 regardless of what inference costs. Cheap compute helps Regime 1 (sub-day, bounded) without bringing the dangerous regimes any closer. The converse is also true: oligopoly pricing makes bounded tasks more expensive but does nothing to month-long plan probabilities, which are set by T₅₀ and κ, not by inference cost.

The self-sustaining pair is nearly unkillable

This is the most troubling finding. A pair operating at T₅₀-length tasks has enormous margins even in adverse conditions. Under the telecom-precedent default of fire-sale pricing, the pair's margins become absurd.

At T₅₀ = 10h (one generation ships before the bust), operating on 10-hour tasks for a market price of $1,500 (human equivalent):

| $/step | Pair cost per delivered task | Margin per task | Deliveries needed per month to self-sustain |
|---|---|---|---|
| $0.02 (fire sale, telecom default) | $23 | $1,477 | 0.8 |
| $0.05 | $57 | $1,443 | 2.0 |
| $0.22 (current) | $250 | $1,250 | 10.1 |
| $0.50 (price hike) | $568 | $932 | 30.9 |

At fire-sale prices, the pair needs less than one delivery per month to cover its compute. Even if inference prices more than double from current levels, it needs roughly 30 deliveries per month, well within the capacity of a pair running 24/7. The pair is profitable across the entire plausible range of post-bust inference pricing.

But the pair's capability ceiling is frozen with T₅₀. At T₅₀ = 10h, P(bootstrap a 160h business plan) = 0.8%. At T₅₀ = 20h, it is 5.1%. Cheap compute does not help: the bootstrap probability is set entirely by T₅₀ and κ, not by cost per step. A pair that has not yet bootstrapped faces the same dismal odds regardless of how cheap inference gets. A pair that has already bootstrapped before the bust survives it easily and continues to earn money, but cannot execute long-horizon growth plans. The bust kills the pair's growth potential without killing the pair itself.

The paradox

A bust creates a paradoxical situation for safety. On one hand, the dangerous milestones are delayed by 18 months to indefinitely. This is the most significant timeline extension possible.

On the other hand, a telecom-style bust may simultaneously make bounded agent deployment more widespread and less supervised. Fire-sale compute makes Regime 1 tasks even cheaper. Companies that survived the bust still have products and now have cheaper inputs. As discussed under Condition 2, cheap inference dissolves the economic incentive to verify: when retrying costs nearly nothing, monitoring costs more than failing. The total volume of deployed agents doing bounded, safe work may actually increase even as the frontier freezes, and the fraction of those agents running without meaningful human oversight increases too. More agents, all stuck well below the dangerous threshold, but a growing number operating in the cheap-inference autonomy gap where nobody is watching closely.

The telecom precedent is instructive here. Cheap bandwidth after the bust did not just make existing applications cheaper; it enabled entirely new application categories (streaming, cloud, social media) that nobody had predicted. If cheap inference enables unforeseen scaffolding approaches that partially mitigate κ without new training, that would compress timelines. But it took roughly five years for the telecom overcapacity to find its YouTube moment, and scaffolding improvements to date have not demonstrably moved κ.

The state actor is least affected. State actors with their own compute face the same capability ceilings as everyone else, but cost is not their binding constraint. At T₅₀ = 20h, 50 parallel agents on one-week sub-tasks yield 16.2 expected successes per batch at 32% individual success rates. Useful for bounded intelligence tasks, but each agent cannot maintain coherence beyond its sub-task without human coordination. The state actor, like the pair, is stuck: capable of bounded things, incapable of coherent long-horizon autonomy.

The combined effect of the bust scenarios on the most dangerous milestone, P(1-month plan):

| Scenario | P(1-month) mid-2028 | P(1-month) mid-2029 | P(1-month) 2030 |
|---|---|---|---|
| No bust (baseline, 7-month doubling) | 23.1% | 52.8% | 65.3% |
| Deceleration (mid-2027, slows to 14-month doubling from T₅₀ = 20h) | 12.4% | 23.1% | 35.2% |
| Deep stall (mid-2027, slows to 24-month doubling from T₅₀ = 20h) | 5.8% | 7.2% | 9.0% |
| Late bust (2028, slows to 24-month doubling from T₅₀ = 40h) | 19.0% | 21.4% | 24.1% |
| Early bust (late 2026, freeze at T₅₀ = 10h) | 0.8% | 0.8% | 0.8% |

The difference is stark. Under the baseline, month-long plans are coin-flip reliable by mid-2029. Under a deceleration to 14-month doubling (arguably the most likely correction shape), they reach 23% by mid-2029, delayed by roughly a year. Under a deep stall to 24-month doubling, they remain below 10% through 2030. Under an early bust with a full freeze, they never exceed 1%.

The self-sustaining pair that bootstrapped before the bust is the loose thread. It survives, earns money, and waits. If T₅₀ growth resumes (funded by a recovery, by a different country's investment programme, or by gradual open-source progress), the pair is immediately positioned to benefit: it does not need to re-bootstrap, just buy API access to a newer model. The telecom precedent is relevant here: the cheap bandwidth eventually found its purpose in applications nobody had predicted, but it took roughly five years. Whether the AI equivalent happens faster or slower depends on how long the growth pause lasts relative to governance and alignment progress. The most optimistic bust scenario: T₅₀ growth freezes, existing pairs survive but are capped, and the additional time is used to develop robust alignment frameworks. The most pessimistic: growth resumes after 12 to 18 months from a different funding source, the pair has been accumulating revenue throughout, and the delay bought almost nothing.

The bust and the κ breakthrough

There is a further effect that cuts in the opposite direction from the "loose thread" concern. The entire analysis shows that κ, not T₅₀, is the binding constraint on dangerous autonomy. A κ improvement requires architectural research: developing continual learning, persistent memory, or whatever the equivalent of Hamilton's "tumour suppression mechanism" turns out to be for AI agents. This kind of research is expensive and speculative. It requires frontier-scale training runs to validate, large research teams to explore the design space, and patient capital willing to fund work that may not pay off for years.

A bust cuts precisely this kind of spending first. When capital contracts, labs cut speculative research before they cut products that generate revenue. Scaling existing architectures (which improves T₅₀ but not κ) is relatively straightforward engineering; developing a fundamentally new training paradigm is a research programme. The researchers who might crack continual learning need compute budgets, institutional support, and freedom to fail. A bust reduces all three. It is possible that a bust concentrates surviving talent on fundamental problems rather than incremental scaling, but fewer labs means fewer parallel attempts at the breakthrough, and architectural innovation historically benefits from many teams trying different approaches.

If the bust delays the κ breakthrough by even two or three years on top of delaying T₅₀ growth, the combined effect on the dangerous milestones is substantial. The bust does not merely slow the approach to the current firebreak; it may delay the point at which the firebreak itself begins to weaken.

Conclusion

Four things I feel fairly confident about.

First, T₅₀ remains the single most important parameter for agent economics, and its continued growth is necessary (though perhaps not sufficient) for the investment thesis.

Second, the available data does not yet show κ improving with model scale, and there is a theoretical argument (grounded in what Sutskever, Karpathy, and Hamilton have separately described) that κ reflects an architectural property, specifically the absence of continual learning, rather than a scaling property. Current agents process each step as a fresh inference over a growing but degrading context window, with no mechanism for mid-task learning or persistent memory. This is why their hazard rate remains structurally higher than humans' across the full duration of a task. If this is correct, κ will remain near 0.70 until someone builds a qualitatively different kind of model.

Third, given κ = 0.70 and the METR doubling trend, the timeline for the most dangerous forms of autonomous agency is measured in years, not months. Month-long autonomous plans reach 25% reliability around March 2028 and 50% around January 2029. Three-month plans remain below 10% until mid 2028 and below 50% until late 2029. These timelines accelerate by 7 to 16 months if κ improves to 0.50, and the dangerous regime would already be here if κ were at human levels. A correction in AI infrastructure investment, which I assign roughly 40 to 50% probability by end of 2027 and 55 to 70% by end of 2028, would slow these timelines further, though the most likely correction shape is a deceleration (14-month doubling) rather than a full telecom-style collapse (24-month doubling or freeze). Even a deceleration delays the most dangerous milestones by roughly a year and, critically, also delays the κ breakthrough itself by reducing the capital available for speculative architectural research. A deeper stall extends the delay to two or more years. But any correction that produces fire-sale inference pricing also weakens the economic case for verification on bounded tasks (Condition 2), producing a larger population of lightly supervised agents operating below the dangerous frontier. Coalitions of self-sustaining pairs cannot overcome the κ ceiling: decomposing a long plan across multiple agents is strictly worse than a single agent retrying, because κ < 1 means each fresh agent restarts from the high initial hazard rate (see Appendix A).

Fourth, the per-task economics favour cutting verification, but agent speed compression makes the feedback loop fast enough to self-correct in most client-facing domains. The residual vulnerability is in low-frequency domains, internal use, and the transition to AI verification.

The economics create a natural firebreak between the agents we have now and the agents that would pose the most serious risks. That firebreak has three robust layers: the superlinear cost of unreliability, the verification floor, and the half-life tax on autonomous plans. It has one partially self-reinforcing layer: the human-in-the-loop. And it has a concrete expiry date, determined primarily by κ.

If κ stays at 0.70, the firebreak holds into 2028 and possibly 2029, buying time for alignment research and governance to mature. If κ improves, the window narrows. The most important empirical measurement for AI safety in 2026 is not benchmark performance, not parameter count, not inference cost, but whether the next generation of frontier models shows any improvement in the Weibull shape parameter.

κ is the variable to watch.

Caveats and limitations

This model is deliberately simple. Real deployments are more complex in several ways.

  • Partial credit. Failed attempts often produce useful intermediate work. A human can salvage a 70%-complete agent output faster than doing the task from scratch.
  • Task-specific feedback loops. Coding against test suites effectively shortens the task by providing intermediate verification.
  • Agentic scaffolding. Sophisticated orchestration (multi-agent systems, checkpointing, rollback) can improve effective reliability beyond what the raw model achieves.
  • Rapidly falling costs. Inference costs have been dropping roughly 2–4× per year. This matters for compute costs, though it does not affect the verification cost component (which is human labour).
  • Measurement uncertainty. The half-life parameters are derived from METR's software engineering benchmarks, which may not generalise to other domains.
  • Survival model uncertainty. The two models fit about equally well on available data and diverge at the tails where measurements are sparse. Hamilton notes that the Weibull and METR's logistic model give about equally good fits on existing data.
  • The doubling trend. All timelines assume T₅₀ continues doubling every 7 months. If the trend slows (as it must eventually), all dates shift later. This is the most significant assumption in the analysis.
  • Revenue projections. The investment thesis analysis uses rough estimates from public sources. Actual AI revenue is notoriously difficult to disaggregate from cloud spending.
  • Verification assumptions. The 20% base case review fraction is a rough estimate. The correlation parameter ρ between agent and verifier failures is a key unknown with no empirical measurements.
  • Multi-period model assumptions. The client churn model uses simplified parameters that are not empirically calibrated.
  • Bubble timing estimates. The probability-weighted bust timing is illustrative, not predictive. I assign roughly 40 to 50% probability of a meaningful correction by end of 2027, but this estimate is sensitive to assumptions about the AI-specific share of hyperscaler capex (assumed 60%), the appropriate valuation multiple (7 to 10x), and the revenue figures, all of which carry wide uncertainty. The distinction between deceleration (capex growth slowing to 10 to 15%) and collapse (absolute capex contraction) matters substantially for T₅₀ trajectories; I treat deceleration as more likely than collapse, but the post models both. The telecom parallel is structurally informative (infrastructure overbuilding based on exponential growth projections) but not directly predictive: AI infrastructure has immediate productive use in a way that much telecom spending did not, the companies leading the AI buildout are far more profitable than their telecom-era counterparts, and a large fraction of the return on AI capex flows through the hyperscalers' own products in ways that are hard to separate from base cloud revenue. GPU depreciation (roughly five years) also means AI overcapacity self-corrects faster than telecom's dark fibre, making a sustained multi-year glut less likely. The feedback loop between T₅₀ growth and revenue is real but the thresholds at which it triggers a correction are unknowable in advance.
  • Cheap-inference autonomy gap. The Condition 2 analysis of unsupervised bounded agents under fire-sale pricing is qualitative. Quantifying the actual risk of correlated failures across a large population of lightly supervised agents would require empirical data on failure mode distributions.
  • The economic anatomy assumes current architectures. A qualitatively different architecture could change the cost structure, the reliability dynamics, or both. The analysis is only as durable as the architectural paradigm it describes. This is, in a sense, the whole point: the argument is that the architectural paradigm determines κ, and κ determines the timeline. 
  • The context loss description is simplified. Real agents use various scaffolding techniques (tool use, retrieval-augmented generation, explicit memory systems) that partially mitigate context degradation. These are captured in the model only insofar as they are reflected in the METR benchmark performance. If scaffolding improvements reduce effective κ without changing the underlying architecture, the timelines compress accordingly.

Appendix A: Can a coalition of pairs overcome the κ ceiling?

If a single pair is stuck at T₅₀-length tasks, could multiple pairs coordinate to execute plans beyond any individual pair's horizon? The intuition is appealing: decompose a 160-hour plan into ten 16-hour sub-tasks, assign one to each pair, and the plan succeeds if all sub-tasks succeed. Ten pairs, each operating comfortably within its T₅₀, collectively accomplish what none could alone.

The mathematics say otherwise. At T₅₀ = 40h, each pair succeeds on a 16-hour sub-task with 69.4% probability. But the plan requires all ten to succeed: 0.694¹⁰ = 2.6%. A single pair attempting the full 160-hour plan succeeds 16.1% of the time. The coalition is six times worse.

This is not a coincidence. It is a direct consequence of κ < 1. When the hazard rate decreases over time (which is what κ < 1 means), an agent that has survived the early portion of a task is in a progressively safer state. The initial steps carry the highest risk; if the agent gets past them, it is increasingly likely to finish. Breaking the task into pieces and handing each piece to a fresh agent throws away this benefit. Each new agent restarts from the high initial hazard rate. Under an exponential model (κ = 1), decomposition would be neutral because of the memoryless property. Under the Weibull with κ < 1, decomposition is actively harmful.

Coordination overhead makes it worse. A coordinator pair must decompose the plan, discover and recruit other pairs, manage handoffs, and integrate results. This coordination is itself a long-horizon task subject to the half-life tax. At T₅₀ = 40h with 140 hours of coordination overhead, the total probability drops to 0.5%: thirty-two times worse than a single pair.

The fair comparison is against a single pair retrying the same plan multiple times. At T₅₀ = 40h, one pair making ten independent attempts succeeds with probability 1 - (1 - 0.161)¹⁰ = 82.6%. Ten coordinated pairs achieve 0.5–2.6%. Repeated single attempts win by a factor of 30 or more. This holds across all T₅₀ values and all plausible coordination overheads.
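
A minimal sketch of the coalition-versus-retry comparison, again using the reconstructed Weibull survival function (the ten 16-hour sub-tasks, the 140-hour coordination overhead, and T₅₀ = 40h are the figures from the text):

```python
import math

def p_success(task_hours: float, t50_hours: float, kappa: float = 0.70) -> float:
    # Weibull survival probability, anchored so a T50-length task succeeds 50% of the time.
    return math.exp(-math.log(2) * (task_hours / t50_hours) ** kappa)

T50 = 40.0  # hours

p_sub = p_success(16, T50)                      # one 16h sub-task               -> ~69.4%
p_coalition = p_sub ** 10                       # all ten sub-tasks must succeed -> ~2.6%
p_single = p_success(160, T50)                  # one pair, full 160h plan       -> ~16.1%
p_overhead = p_success(140, T50) * p_coalition  # coordination modelled as a 140h task
                                                # that must also succeed         -> ~0.5%
p_retries = 1 - (1 - p_single) ** 10            # one pair, ten full retries     -> ~82.6%

for name, p in [("coalition of ten pairs", p_coalition),
                ("single pair, one attempt", p_single),
                ("coalition with coordination overhead", p_overhead),
                ("single pair, ten retries", p_retries)]:
    print(f"{name}: {p:.1%}")
```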

Pooling revenue does not help either. A single pair earning $2–3M/year can already afford API access to whatever frontier models are available. Ten pairs pooling $25M/year could perhaps afford custom fine-tuning ($100K–1M) but not a new training run ($10M–100M+). The marginal value of additional capital is low when the bottleneck is κ, not compute.

The one scenario worth noting is emergent market-mediated coordination, where pairs trade services through short bounded transactions rather than executing a single coherent plan. One pair builds a tool and sells it to others. Pairs specialise in different domains and subcontract to each other. Each individual transaction is short relative to T₅₀ and can be retried until it succeeds, so the chain of transactions eventually completes. At T₅₀ = 40h, a 10-transaction chain of 16-hour transactions with retries takes roughly 230 wall-clock-equivalent hours but near-certainly completes.
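
The 230-hour figure follows from the same model; a quick check, assuming each transaction is one of the 16-hour sub-tasks above and is retried until it lands:

```python
import math

p16 = math.exp(-math.log(2) * (16 / 40) ** 0.70)   # ~0.694 success per 16h transaction at T50 = 40h
chain_hours = 10 * (1 / p16) * 16                  # geometric expected retries per transaction
print(f"{chain_hours:.0f}h")                       # ~231 wall-clock-equivalent hours for the chain
```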

The critical difference from a coalition is that no single entity holds a coherent long-horizon plan. What emerges is an ecosystem of autonomous economic actors doing bounded things, not a conspiracy pursuing a unified goal. Whether such an ecosystem could produce dangerous outcomes through emergent coordination (analogous to how a market economy produces complex outcomes without central planning) is a genuinely different question from whether a directed plan can succeed, and the Weibull model does not directly address it. But it is worth noting that the most dangerous scenarios in this post (month-long directed plans, resource accumulation, oversight resistance) all require coherent long-horizon intent, which the ecosystem model does not provide.

Appendix B: Ord's comment

Toby Ord's comment on the original post, which prompted this update:

That's an interesting way to connect these. I suppose one way to view your model is as making clear the point that you can't cost-effectively use models on tasks that much longer than their 50% horizons — even if you are willing to try multiple times — and that trend of dramatic price improvements over time isn't enough to help with this. Instead you need the continuation of the METR trend of exponentially growing horizons. Moreover, you give a nice intuitive explanation of why that is.

One thing to watch out for is Gus Hamilton's recent study suggesting that there isn't a constant hazard rate. I share my thoughts on it here, but my basic conclusion is that he is probably right. In particular, he has a functional form estimating how their success probability declines. You could add this to your model (it is basically 1 minus the CDF of a Weibull distribution with K=0.6). I think this survival function tail is a power law rather than an exponential, making the 'just run it heaps of times' thing slightly more tenable. It may mean that it is the cost of human verification that gets you, rather than it being untenable even on AI costs alone.


I want to thank Toby Ord for his comment, Claude 4.6 for helping me run through some of the scenarios, and Eddie Kembery for prompting me to think through my ASI timelines. Comments and different viewpoints on the post are very welcome. If you think the analysis misses something important or gets something wrong, I would like to hear about it.


