Published on January 18, 2026 3:57 AM GMT
This post was written as part of research done at MATS 9.0 under the mentorship of Richard Ngo. It's related to my previous post, but should be readable as a standalone.
Remark: I'm not yet familiar enough with the active inference literature to be sure that the issues I bring up haven't been addressed or discussed. If you think my characterisation of the state and flaws of the theory are missing something substantial, I'd love to know.
In the theory of active inference, agents are described as having a set of internal states that interact with external states (the world) through a membrane of intermediate states, such as the senses. I'm currently exploring how agents are able to exhibit approximations of external reference that allow them to stay alive in the real world. They achieve this even though they only have access to the statistical proxy of their internals, which they could easily reward-hack without optimising the external states at all.
One of active inference's weaknesses is that it struggles to model agents' uncertainties about their own preferences. I here propose a potential explanation for why agents are conflicted about these preferences. This perspective posits agents' seeming inconsistency and irrationality about their goals as a mechanism that protects them from reward-hacking their internal states.
Consider the following question:
What stops an agent from generating adversarial fulfilment criteria for its goals that are easier to satisfy than the "real", external goals?
Take Clippy as an example, whose goal is stated as maximising the number of paperclips in the world. Since Clippy only has internal reference, it could represent this goal as "I observe that the world has as many paperclips as it could possibly have". I'm wondering what in Clippy's system saves it from "winning at life" by hooking its sensors up to a cheap simulator that generates an infinite stream of fictional paperclips for it to observe.
An elegant answer to the problem of internal reward-hacking is that agents come pre-equipped with suitable priors about their internal states. In active inference, agents seek to update their beliefs and act on the world such that their observations fit their priors as closely as possible. The space of "good" priors for agents' internal states is very small. However, evolutionary pressures have selected for agents with priors that are conducive to their survival. According to active inference, agents attempt to manifest these favourable priors through action, which makes the priors function as preferences.
Unfortunately, the claim that evolutionarily fine-tuned priors do all the work to prevent internal reward-hacking seems lacking to me, because in practice we are uncertain about our own feelings and preferences. We don't actually have locked-in, invariant preferences, and it's unclear to me how active inference explains this; preferences are usually encoded as priors over observations, but ironically these are never updated.[1]
Active inference thus implicitly assumes agents to be consistently, definitively settled on their preferences. Agents are only uncertain about the external states and about how their actions and senses will interact with those states. Within those unknowns, they seek to optimise for the observations that they are certain they prefer. I don't think this assumption is warranted. In fact, I have been considering the possibility that agents' uncertainty about their own preferences is an important instrument for increasing their (bounded) rationality.
Consider the example I used in my last post of a hypothetical person, Alice, who wants to maximise "success". In that example, Alice avoids applying to a prestigious university because rejection would decrease her internal perception of success. She instead applies to a worse university that she is sure to get into, as this will certainly increase her success-o-meter.
Suppose instead that Alice feels a twinge of guilt about not applying to the prestigious university, as this could be perceived as "loser" behaviour by her friend. This guilt may motivate her to apply anyway, even though the action lowers (in expectation) her internal perception of success. Here, the mixed optimisation of two distinct goals ("I perceive myself as maximally successful" and "I perceive myself as someone that my friend thinks is maximally successful") yields behaviour that actually makes Alice more successful.
In Free Energy Minimisers (FEMs) from active inference, preferences are usually described as fixed priors over the space of observations. One possible model for Alice's behaviour is that each action is chosen with respect to one of two sets of priors. The priors she chooses to satisfy in a given action are sampled from some distribution over priors that represents the degree to which she identifies with conflicting preferences. In practice, Alice now doesn't resemble a consistent FEM, but she has become more aligned with respect to the external goal. Her mixed strategy between preferences can be seen as hedging against her top choice of priors being unfit.
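As a rough illustration of this mixture-of-priors idea, here is a toy decision rule (not a full free-energy model, and with made-up numbers): on each decision the agent samples which prior set to satisfy, and we can compare the externally measured outcome of the pure versus mixed strategies.

```python
# Toy sketch of the mixture-of-priors idea: each decision, Alice samples which set
# of priors to satisfy and then picks the action that best fits it. All numbers are
# illustrative placeholders, not derived from an actual active-inference model.
import random

random.seed(0)

ACTIONS = ["apply_prestigious", "apply_safe"]

# How well each action satisfies each internal prior (higher = better fit).
prior_fit = {
    "self_success":   {"apply_prestigious": 0.3, "apply_safe": 0.8},  # rejection risk hurts the success-o-meter
    "friend_success": {"apply_prestigious": 0.9, "apply_safe": 0.2},  # the friend admires ambition
}

# External ("real") expected success of each action, which Alice cannot observe directly.
external_value = {"apply_prestigious": 0.7, "apply_safe": 0.4}

def choose(prior_weights):
    """Sample a prior set according to its weight, then pick the action that best fits it."""
    prior = random.choices(list(prior_weights), weights=list(prior_weights.values()))[0]
    return max(ACTIONS, key=lambda a: prior_fit[prior][a])

def mean_external(prior_weights, n=10_000):
    return sum(external_value[choose(prior_weights)] for _ in range(n)) / n

print("pure self-success agent:", mean_external({"self_success": 1.0, "friend_success": 0.0}))
print("mixed 50/50 agent      :", mean_external({"self_success": 0.5, "friend_success": 0.5}))
```

In this toy setup the mixed agent ends up more "externally" successful than the agent that consistently optimises its own success-o-meter, which is the hedging effect described above.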
I would like to distinguish this concept of inconsistent preferences from mental motions such as compartmentalisation. For instance, suppose an agent learns to calculate the derivative of a function (f+g) by having separate[2] parts of itself calculate the derivatives of f and g and then adding the results. This motion could be seen as the agent using subagents' outputs to solve a problem. However, these "subagents" are not imbued with goals of their own. They're more like tools that the agent deploys to break the problem down into manageable components.
My guess is that people's uncertainties about their preferences are better represented as meme(plexe)s competing with each other for attention. The memes that live to be observed in minds are those that could be seen as agentically pursuing survival and reproduction.[3] Internal preferential inconsistency would thus be analogous to the sub-parts in the above example optimising to convince the agent that they are "useful" for calculating derivatives and should be kept around.[4]
Sub-processes and compartmentalisation as tools to increase rationality are not controversial ideas. The more contentious claim I'm ideating is that even conflicting agentic sub-processes — harbouring goals that are unaligned with those of the larger agent — can still be useful for increasing agentic rationality with respect to external goals. I aim to formalise and explore this hypothesis in an empirical or mathematised setting.
[1] There's a good reason for never updating priors over observations. If agents' preferences could update, they would gradually move towards preferring states that are more likely, even if these aren't fruitful for their continued existence. The function of the fixed priors is to give agents a vision of the world they are willing to execute actions to manifest; these are preferences.
[2] This potentially includes separation across time.
[3] For example, successful memes, like catchy songs, have a tendency to get their hosts to spread them to other people.
[4] This goal could functionally be the same as actually being good at calculating derivatives, but it doesn't have to be. For example, if the agent wants the derivative to be high, then a sub-part may gain a competitive advantage by overestimating the derivative of f. It may eventually convince the agent to employ two copies of itself to calculate the derivatives of both f and g, replacing the other sub-part.
Published on January 18, 2026 3:28 AM GMT
Yesterday I stumbled on this quote from a blog post by JA Westenberg:
Michel de Montaigne arguably invented the essay in the 1570s, sitting in a tower in his French château, writing about whatever interested him: cannibals, thumbs, the education of children, how to talk to people who are dying. He called these writings essais, meaning "attempts" or "tries." The form was explicitly provisional. Montaigne was trying out ideas, seeing where they led, acknowledging uncertainty as a fundamental feature rather than a bug to be eliminated.
It's hard to convey the sense of both profound agreement and giddy joy I had reading that, not only because the wider post is about something I love (i.e. blogging), or because I learned something new about the history of writing (which is always fun), but because that quote describes something that I've been doing myself for the past two years and wanted an excuse to talk about!
There's an old adage that says, "Writing is Thinking", and I've usually interpreted those words to mean that "writing helps you think", which is undoubtedly true. However, in recent years I've discovered an entirely new form of writing that I've taken to calling musing, which I think takes this idea a step further, and it's precisely what Westenberg describes Montaigne doing in the 16th century.
We have a lot of thoughts throughout the day, and yet we spend so little time indulging these idle curiosities. Writing, especially by hand, can be a great way to explore these ideas and to practice thinking. It's also really fun to do! Over time I started collecting these ideas into notebooks (one of which I almost always carry with me) in order to better organize these inherently random topics into a searchable system. Originally I scribbled on loose leaf pages or random legal pads (as I've mentioned before) and that became unruly very quickly.
Some of these musings are personal topics, most are not. Often they're just the exploration of a question I have. Here's an example:
Businesses keep offices cold because there's research saying that cooler temperatures help people think and stay focused. Given that's true, could the Little Ice Age have helped improve human cognition during the 17th and 18th centuries? If so, what does that mean?
I'm not sure, but it was something I thought about for a while and so I wrote it down. The entire musing, or essay as I guess it should be called, is less than a page, but was engaging and very fun to do. I've written short essays about dozens of topics over the years (including several that have been eventually published to this blog). It's a fun practice, and I encourage you to try it.
Explore your ideas honestly. Don't fear where your mind goes or the questions it will ask. These are personal, honest thoughts, not social media posts. Writing like this is inherently introspective; it's a way to give your mind the space to awe and wonder.
We often believe that thinking is a process which takes place entirely in the mind, but it's a process that is heavily influenced by the particulars of the body. Try thinking through a problem in a hot and smelly room or on a walk with a rock in your shoe.
However, the body can do more than hinder the thought process, it can catalyze it! This is what writing can be, a way to think through problems using your entire body.
Occasionally, I've sat down to write but without any particular topic in mind. So, I open my notebook and just start writing. Tons of my essays begin with something like, "I'm not sure what I'm thinking right now and I don't know what to write." From there, I let my thoughts move and course as they will and I just write down what comes to mind, stopping and starting as my thoughts shift and change, and eventually I find that something has come out of it. I might work through a tension or stress point, I could come to realize or discover something about a problem, or I could just get a few lackluster thoughts on a page. Not all thinking is productive, but the mind is a muscle and it needs to be exercised to function properly. Sometimes just doing the workout is enough.
We usually think of cleverness or intelligence as an innate trait people have, and while that is certainly true in some regards, intelligence and wisdom are just as much a function of practice as of talent. To get good at solving puzzles, you have to practice solving puzzles. The mind is no different than a muscle in that regard. Thinking aloud on the page is one way to record and analyze your thought process and to practice the art of thinking itself.
As another example, I often revisit my prior writings and find many to be overly simplistic, uninspired, or just plain wrong. But that's good! It means I've learned something in the intervening time! In software there's an adage:
If you come back to old code and see nothing wrong with it, then you haven't learned anything since.
You are not a finished product, you're a process—always in motion—that evolves and changes over time. Your thinking can improve with practice as much as it can atrophy from inattention.
Think about thinking, write those thoughts down, then perhaps publish a few on a blog that you own. It's fun, and it can do wonders for the mind.
Published on January 18, 2026 1:19 AM GMT
TL;DR
Figure 1: Plot comparing frontier LLM time-horizon measures: the human-level-reliability time horizon from this post (orange) versus the METR-style fixed-reliability 50% time horizons (blue). We can see that this alternative human-relative time horizon measure has been increasing much more steeply over time than the METR horizons. Note that the "human-level" horizon metric in this plot is comparing LLMs to METR's human baselines. The most recent data point included here is gpt-5.1-codex-max.
Acknowledgements: I shared a draft of these ideas last year in correspondence with Daniel Kokotajlo and Eli Lifland from AI Futures. Thanks to Eli for his feedback; also, Daniel subsequently posted a short-form [1] touching on a similar crossover/intersection point framing, which is worth checking out for his perspective on these issues.
The METR time-horizon metric provides estimates for the longest software tasks that a given AI is capable of completing [2]. This estimate is based on human baseline measurements of how long a set of software tasks took human engineers, with the (geometric) average time based on the subset of humans who succeeded at each task. METR also measured human reliability at each task, but rather than compare the AIs to those human levels, they have typically reported LLM time horizons at fixed reliabilities independent of task length (e.g. 50% or 80%). The METR estimates have featured prominently in efforts to estimate the timeline to human-level long-horizon agency, e.g. per the AI-2027 forecasts[3]. The points below describe potential downsides of this METR horizon approach, summarize the proposed alternative, and provide trend-based projections for these metrics.
As an alternative to the fixed-reliability (e.g. 50%) METR time horizons, we can instead measure a time-horizon metric defined relative to actual human baseline reliabilities. Note that METR did measure human baseline reliabilities, though they tend to downplay those baselines as weak or insufficiently incentivized (see section 1.6 below) and instead focus on absolute reliability targets. One issue is that current LLMs, e.g. gpt-5, already exceed the METR human baseline time horizons at both the 50% and 80% targets; however, humans still have fatter-tailed reliability curves, so for longer tasks METR's human baselines still do better (though see the Addendum below on Claude 4.5, since this claim is less clear now). For instance, gpt-5 has worse reliability than the human baselines once the METR task length gets longer than 4.5 hr, but its 50% horizon is only 2.3 hr. Using these human-reliability baselines, we can estimate LLM time horizons as the longest task duration over which the LLM is more reliable than the METR human baselines, or more concretely, as the intersection point of METR's logistic fits of reliability versus task duration for humans and LLMs. See Figure 1 for a plot comparing these two horizon metrics over time.
The trend in this human-relative time horizon metric is increasing much faster than the existing METR trends, and assuming an exponential fit, the LLM time horizons at human-reliability are doubling every 1.9 months, versus every 7 months over the same time period for the METR (50%-reliability) time horizon trend (see Table 1 and Figure 2). In other words, every 1.9 months there is a doubling in the longest time horizon that LLMs can handle at METR's human baseline reliability. As METR has noted this exponential appears to have sped up recently, perhaps since the arrival of reasoning models, though the estimates above are just using all of the data for both metrics (not just the recent models); however, the hyperbolic analysis below does implicitly model increasing exponential rates over time.
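For concreteness, here is roughly how a doubling time like the 1.9-month figure can be estimated: regress log2(horizon) on calendar time and invert the slope. The data points below are placeholders, not METR's actual measurements.

```python
# Sketch of a doubling-time estimate: fit a line to log2(horizon) vs time.
# The (time, horizon) pairs below are placeholders, not METR's data.
import numpy as np

# columns: years since 2019, time horizon in minutes (illustrative values only)
data = np.array([
    [0.0,   0.1],
    [2.0,   1.0],
    [4.0,  20.0],
    [6.0, 300.0],
])
t, horizon = data[:, 0], data[:, 1]

slope, intercept = np.polyfit(t, np.log2(horizon), 1)  # log2(horizon) ~ slope*t + intercept
doubling_time_months = 12.0 / slope                    # slope is in doublings per year

print(f"doubling time ≈ {doubling_time_months:.1f} months")
```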
Per above, with an exponential fit, LLMs would never catch up with humans, but with a hyperbolic fit the current trend suggests human level by around mid 2026, relative to the (weak) METR human baselines (see Table 2). That table also shows sensitivity analysis with hypothetical stronger human baselines, with the two alternative baselines pushing human level into 2027. These stronger baselines have flatter slope, with human reliability dropping more slowly with longer horizons, but they still do gradually decline, in contrast to the METR horizons based on constant reliability targets.
METR has generally minimized the relevance of their human reliability baselines on the grounds that the participants weren't properly incentivized, and these weak baselines are a potential downside for this proposed metric; but in so far as we are trying to determine LLM progress towards human-level agency, then we are likely better off comparing to the available human baselines (even if imperfect) rather than absolute reliability thresholds that have no known connection to actual human performance; for instance, based on METR's human-baseline logistic fit, we should expect humans to get about 3% reliability for tasks at 1-month horizon (along the METR difficulty trend line), so it's not clear why we would require "human-level" AI to get 50% (or 80%). That said, in so far as the human baselines are weak or poorly incentivized, it could be useful to collect stronger human baselines, or for now we can do sensitivity analysis with hypothetical stronger baselines (see below) to assess how much this changes projections.
Figure 2: Log plot of the proposed human-reliability LLM time-horizons, suggesting that an exponential fit is a reasonable approximation, though see below for model-selection analysis suggesting that a hyperbolic fit may be better. This regression shows a time-horizon doubling rate of approximately 1.9 months.
| Approach | doubling time | 1-month horizon projection (month / year) |
|---|---|---|
| time horizon (50% reliability) | 6.8 months | 11 / 2030 |
| time horizon (human-level reliability) | 1.9 months | 12 / 2026 |
Table 1: Estimated doubling rates for the proposed human-reliability LLM time-horizon metric versus the METR 50%-reliability metric. Note that the METR horizons are doubling at a much slower rate, and if we assume status quo exponential progress, then 1 month horizons would come about 4 years sooner based on these human-relative time horizons. Note that these are the overall trends, not the recent (somewhat) faster trends. Also, see below for evidence that the proposed time-horizon metric may see super-exponential growth.
While the above exponential model for LLM time-horizons is a reasonably good fit, there are fairly strong reasons to think that a super-exponential (e.g. hyperbolic) model is a better match for tracking this metric, whereas an exponential fit is more defensible for the official METR time-horizon metrics (i.e. for their fixed reliability, 50% and 80% metrics). See Section 2.2.1 for theoretical evidence favoring a hyperbolic fit for this proposed human-relative metric and Section 2.2.2 for statistical evidence based on AIC model selection. Also, Table 2 shows the estimated time for LLMs to reach human-level over all time horizons, including estimates based on the measured METR human baselines and also additional sensitivity analysis showing how this estimate would push further into the future if we compared to (hypothetical) stronger or better incentivized human baselines. Note that with an exponential fit, the trend would never catch up to human level horizons, so this question only really makes sense in the super-exponential context.
One challenge with extrapolating the human-relative time horizons to human level is that it involves extrapolating the human logistic function far beyond where it was actually tested in the METR benchmark; so ideally we would also collect direct human measurements for much longer tasks than the current <1 day tasks. But given the expense of doing so, extrapolating the reliability logits could be a reasonable proxy, especially since they do appear to be surprisingly linear in task-duration.
| human reliability (at 16 hr) | LLMs-to-exceed-human-level (date) |
|---|---|
| 22% (measured human baseline) | 2026-09-02 |
| 50% | 2026-12-28 |
| 80% | 2027-04-16 |
Table 2: Projections for when LLMs will exceed human reliability over all time horizons, based on hyperbolic fit. The first row shows the estimate using the proposed human-relative metric calculated with the actual human baseline data from METR, but since METR has argued that the baselines were poorly incentivized, the remaining rows show projections if human reliabilities are actually higher on the hardest/longest METR tasks, i.e. 50% and 80% rather than the 22% (estimated from their measured human logistic curve); note that the human slope is what is being varied in this sensitivity analysis to match the target reliabilities above, whereas the logistic intercept is left unchanged from baseline. You can alternatively estimate dates for LLMs to reach human level by just projecting the LLM logistic slope linearly in time to estimate when it matches human slope, but see section 2.2.1 below for why I suspect this is less reliable than directly projecting the overall horizon progress.
One natural intuition regarding METR time horizons is that people can effectively handle arbitrary task lengths, but with an exponential trend the 50% horizon will always be finite, so you might wonder if the actual long-term trend will be super-exponential with a finite-time singularity (e.g. per AI-2027 projections); however, per the formulas in section 3 below, the 50% time horizons likely won't blow up to infinity unless the LLM logistic slope $\beta$ approaches zero (since $\beta$ is in the denominator), assuming the logistic parameters $\alpha$ and $\beta$ are finite/continuous, but zero slope ($\beta = 0$) would mean no reduction in reliability with increasing task length, which is likely unrealistic (see the more careful discussion of formulas in section 3 below).
On the other hand, for the human-level time-horizon formula, the slope difference $\beta_{\text{LLM}} - \beta_{\text{human}}$ is in the denominator, so all that's required for a finite-time singularity is for the LLM's logistic parameters to catch up with the human parameters. And currently the LLM intercepts ($\alpha_{\text{LLM}}$) are already better than the human baselines, and the slopes are increasing linearly over time (with substantial noise), so it's plausible that this could occur at a finite amount of time into the future, at which point AI could exceed the human baseline over all time horizons. Also, since human reliability declines with task length, it is not surprising that it could be easier to exceed this performance over all possible time horizons; on the other hand, the 50% time horizons require a fixed 50% reliability for arbitrarily long/challenging METR tasks, so it's much less clear whether it's feasible to achieve this over all task lengths.
Unlike the trends in time horizons, which tend to have tight fits with $R^2 > 0.8$, the LLM logistic coefficient (slope) trend is quite noisy ($R^2 \approx 0.3$), even if we limit to a single model provider (OpenAI) and exclude the noisy off-trend pre-instruct samples (gpt-2 and gpt-3). I suspect part of the reason for this is that the LLM providers are presumably optimizing for overall time horizon, which can be optimized via various combinations of slope and intercept in the model's logistic fit, so there is little incentive to cleanly drive up the slope with each release; e.g. model/training changes that significantly improve the intercept while slightly worsening the slope could be a worthwhile tradeoff as long as they improve the overall time horizon. For this reason, I suspect it makes more sense to directly extrapolate the overall time-horizon estimate rather than linearly extrapolating the noisy logistic coefficient in isolation, even if the slope trend is a useful intuition pump for seeing why a finite-time blowup is plausible. But if we do naively extrapolate the coefficient, this suggests a fairly long time to catch up with humans (~2029), with large error bounds, versus the shorter timelines implied by a direct hyperbolic fit per Table 2 (i.e. ~2026-2027).
Given the theoretical reasons for suspecting super-exponential (hyperbolic) trends for the proposed human-relative LLM time horizons, one question is whether we can see evidence for this in the available data, using standard model-selection techniques. One challenge in comparing these two models is that the exponential fit has only 2 parameters, whereas the hyperbolic function has 3 parameters. So we can't just directly compare likelihoods/mean-squared-error, and instead need some correction to penalize the hyperbolic for its higher model capacity. See Table 3 for results comparing the exponential vs hyperbolic fit, using AIC with a small-sample-size correction. The table actually reports Akaike weights, which are normalized to 0-1 for easier interpretation. This analysis generally seems to support the theoretical expectations from the previous section, with AIC suggesting a hyperbolic fit for the human-relative time-horizon metric, versus an exponential fit for METR's 50% time horizon.
For the hyperbolic results in Tables 2 and 3, I excluded the pre-instruct models since those data points were quite questionable. In particular, these pre-instruct models are in a different technical category from subsequent instruct-tuned models; for gpt-2 they weren't even able to use the same agentic scaffold that was used for later models, and they had to impute many data points to zero; and for gpt-3, the model was no longer available in the API at the time of the METR testing, so they had to use a proxy model instead and then back-date it. Actually, the earliest instruction-tuned model (gpt-3.5) also used a proxy, so an argument could be made for excluding that data point as well, but excluding pre-instruct data is a natural Schelling point for focusing on higher-quality samples. Also, looking at Figure 2, it appears that the pre-instruct data points (gpt-2 and gpt-3) are outliers, and this is even more extreme in the plot of logistic slope over time, where these dubious pre-instruct data points are completely off the subsequent trend (plot not shown).
That said, these AIC results appear to be quite sensitive to which samples/outliers are included, so until we have more data (from future models), I think we should be fairly cautious in over-interpreting these AIC results. But perhaps in the next 6 months it will become more clear whether the data is following a hyperbolic with finite-time catch-up with the human baseline. Also, I do think it could also be better to use cross-validation rather than AIC for model selection in this case (since it's more empirical, less reliant on assumptions), but given the limited number of data points, I felt it wasn't worth the extra effort for now.
| Approach | exponential-fit (Akaike-weight) | hyperbolic-fit (Akaike-weight) |
|---|---|---|
| LLM time horizon (50% reliability) | 0.55 | 0.45 |
| LLM time horizon (human reliability) | 0.01 | 0.99 |
Table 3: This table shows a model-selection analysis comparing exponential versus hyperbolic fits, using AIC since the two models differ in capacity, with the hyperbolic having 3 parameters and the exponential only 2, so AIC penalizes this extra capacity and favors the exponential fit all-else-equal. This analysis suggests that METR's 50% horizon is more likely exponential, whereas the human-relative horizon is better modeled as hyperbolic. Note this hyperbolic model implies that LLMs would exceed the human baseline over all time-horizons at some finite time in the future (roughly in 2026), whereas with an exponential fit LLMs would never catch up to humans. This AIC analysis excludes the two pre-instruct data points (gpt-2 and gpt-3), for reasons explained in the text, but given the small sample size these AIC results are quite sensitive to which points are included and should be taken with a grain of salt. So while these AIC results are suggestive, probably the stronger evidence for a finite-time blow-up in time horizons comes from theoretical considerations (see section 2.2.1).
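For reference, here is a minimal sketch of the AICc and Akaike-weight computation summarized in Table 3, applied to placeholder residual sums of squares rather than to the actual exponential/hyperbolic fits.

```python
# Sketch of AIC with small-sample correction (AICc) and Akaike weights, assuming
# Gaussian residuals. The residual sums of squares below are placeholders.
import numpy as np

def aicc(rss, n, k):
    """AICc for a least-squares fit with n data points and k curve parameters."""
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction term

def akaike_weights(scores):
    d = np.array(scores) - min(scores)
    w = np.exp(-0.5 * d)
    return w / w.sum()

n = 12                 # number of model data points (placeholder)
rss_exponential = 1.5  # residual sum of squares of the exponential fit (placeholder)
rss_hyperbolic = 0.8   # residual sum of squares of the hyperbolic fit (placeholder)

# k counts only the curve parameters, as in the text: 2 (exponential) vs 3 (hyperbolic).
scores = [aicc(rss_exponential, n, k=2), aicc(rss_hyperbolic, n, k=3)]
print("Akaike weights (exponential, hyperbolic):", np.round(akaike_weights(scores), 2))
```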
In the METR analysis the reliability of both LLMs and humans is modeled with a logistic function in the (log) human task duration $t$, i.e.:

$$p(\text{success} \mid t) = \frac{1}{1 + e^{-(\alpha + \beta \log_2 t)}}$$

Note this parameterization is slightly different from METR's, but equivalent. In the human case, the fitted slope is $\beta_{\text{human}} \approx -0.38$, with intercept $\alpha_{\text{human}} \approx 2.5$ (consistent with the ~98-minute human 50% horizon below). From the above logistic curve we can derive the formula for the fixed-reliability time horizons that METR publishes, e.g. here is the time horizon formula at 50% reliability:

$$T_{50} = 2^{-\alpha / \beta}$$
On the other hand, the proposal from this post is to instead estimate LLM time horizons relative to the human-baseline reliabilities. To do this we find the intersection point of the LLM and human logistic curves, which gives the time horizon below which the LLM is more reliable than humans (or vice versa):

$$T_h = 2^{-(\alpha_{\text{LLM}} - \alpha_{\text{human}}) / (\beta_{\text{LLM}} - \beta_{\text{human}})}$$

where $\alpha_{\text{LLM}}$ and $\beta_{\text{LLM}}$ are the LLM logistic parameters, and $\alpha_{\text{human}}$ and $\beta_{\text{human}}$ are the human baseline parameters. So for example for gpt-5, plugging in METR's estimated logistic parameters for that model, the 50% time horizon (137 minutes) is actually longer than the 50% horizon for the human baseline (98 minutes), but gpt-5's human-reliability time-horizon is longer still, at roughly 270 minutes (note you have to use unrounded parameter estimates to replicate this calculation). So this means that gpt-5 is more reliable than the human baselines for tasks under about 4.5 hr, but then, because the human logistic has fatter tails, humans currently have higher reliability for all tasks longer than 4.5 hr.
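As a quick sketch, both horizon definitions can be computed directly from the logistic parameters. The human values below are the approximate ones quoted above; the LLM values are placeholders in the right ballpark, not METR's published estimates.

```python
# Sketch of the two horizon definitions: fixed-reliability vs intersection with the
# human baseline curve, under the logistic parameterization above.
import numpy as np

def horizon_fixed(alpha, beta, reliability=0.5):
    """Task length (minutes) at which the logistic fit crosses a fixed reliability."""
    return 2 ** ((np.log(reliability / (1 - reliability)) - alpha) / beta)

def horizon_vs_human(alpha_llm, beta_llm, alpha_h, beta_h):
    """Intersection of the LLM and human logistic curves (minutes)."""
    return 2 ** (-(alpha_llm - alpha_h) / (beta_llm - beta_h))

alpha_h, beta_h = 2.5, -0.38      # approximate (rounded) human baseline fit
alpha_llm, beta_llm = 4.1, -0.58  # placeholder LLM fit, roughly gpt-5-like

print("human 50% horizon   :", round(horizon_fixed(alpha_h, beta_h)), "min")      # ~96 with these rounded params
print("LLM 50% horizon     :", round(horizon_fixed(alpha_llm, beta_llm)), "min")  # ~134
print("LLM-vs-human horizon:", round(horizon_vs_human(alpha_llm, beta_llm, alpha_h, beta_h)), "min")  # ~256
```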
Also, from the above formula, we can see that the criterion for LLMs to match (or exceed) humans at every task duration is to match the human slope, in which case the time horizon estimate blows up to infinity (due to the zero in the denominator); then, if the slopes match, in order to exceed humans (rather than just match them) the LLM also needs to have a larger intercept, but that is already the case for the LLM intercept estimates. On the other hand, for the 50% horizon to blow up to infinity in finite time (e.g. per AI-2027), the slope $\beta$ would need to increase to zero from its current negative value (assuming $\alpha$ and $\beta$ are well behaved, e.g. finite and continuous), but that would imply no reduction in reliability with longer or more difficult tasks, which is perhaps not realistic.
For the hyperbolic projection of human-relative time horizons, I use a standard 3-parameter hyperbola:

$$T_h(T) = \frac{A}{(T_c - T)^{k}}$$

where $T_h$ is the LLM time horizon ("h" for human-relative), $T$ is the current date/time, and $A$, $k$ and $T_c$ are the three hyperbolic parameters, where $T_c$ can be interpreted as the date/time at which LLMs catch up to humans (i.e. the blowup date). Note that a potential alternative could be to use a super-exponential curve derived directly from the intersection-point formula above, though some implicit assumptions would be needed, e.g. regarding how $\alpha_{\text{LLM}}$ and $\beta_{\text{LLM}}$ change with time.
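As a small illustration of this functional form, with placeholder parameters rather than fitted ones, the horizon explodes as the blowup date $T_c$ is approached:

```python
# Sketch of the 3-parameter hyperbolic horizon curve with placeholder parameters.
def hyperbolic_horizon(t, A, k, t_c):
    """Human-relative time horizon (arbitrary units) as a function of calendar time t."""
    return A / (t_c - t) ** k

A, k, t_c = 2.0, 2.5, 7.5  # placeholders; t in years since 2019, so t_c ~ mid-2026

for t in [4.0, 5.0, 6.0, 7.0, 7.4, 7.49]:
    print(f"t = 2019 + {t:>4}: horizon ≈ {hyperbolic_horizon(t, A, k, t_c):12.1f}")
```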
Overall, my main takeaway from this analysis is probably that we shouldn't over-interpret the METR trends at fixed reliability as a direct marker of progress towards human-level software horizons; for instance, I think it would be a mistake to argue that AGI is still many years off on the grounds that it could be years until 80% horizons reach months/years on METR's metric; rather, when the goal is to assess time until human-level AI, my view is that we should focus on the kind of direct human-level baseline comparisons that METR's existing metrics don't provide (despite superficial appearances that they do provide this).
That said, METR has raised legitimate concerns about the quality of their existing human baselines and whether they were properly incentivized, and it does seem like there would be quite a bit of value in measuring higher quality human baselines for these agentic tasks, or in the meantime, computing the horizons with the existing human baselines padded/improved per the sensitivity analysis above. Also, given the existing baseline limitation, there are pros and cons of using METR's fixed-reliability horizon metric versus the human-relative alternative from this post, and there could be value in reporting both measures. One concrete use-case for the METR metrics is in cases where you just need to know if a model can meet some absolute reliability standard independent of human capability, though even in this use-case the interpretation can be challenging, given the unknown implicit difficulty trend, especially once the task duration exceeds the actual benchmarked tasks.
After I had mostly completed this analysis, the Claude 4.5 Opus METR results were released[5], and showed somewhat longer than expected horizon lengths; in particular, based on the human/LLM intersection-based horizon length emphasized in this post, Claude now has a horizon of 444 billion minutes(!), versus 440 minutes for gpt-5.1-codex-max, which looks much closer to a hyperbolic blowup than to an exponential trend. To be clear, we shouldn't over-interpret this specific 444-billion figure, since as the LLM slope gets close to the human-baseline slope and the horizon blows up, the horizon estimate becomes really sensitive to estimation error, and both the human baselines and LLMs have reliabilities close to zero for such long tasks (at least in the METR logistic fits). That said, this Claude data point does support a picture where Claude is now better than the available human baselines (on average) for all the task durations in the current METR benchmark, and even for all practical task durations if we are willing to extrapolate the linear logit fits to tasks much longer than the existing METR benchmark tasks.
However, given METR's concerns that the human baselines were not adequately incentivized, we can also assess the new Claude model with respect to hypothetical stronger human baselines, per the sensitivity analysis in Table 2. For instance, if we assume that the hardest/longest METR tasks could be completed with a bit over 2x higher reliability than METR's existing human baselines (specifically 50% rather than the 22% from METR's human logistic fit at 16 hr), then Claude 4.5 Opus has an intersection-based time horizon of only 35.9 minutes, which is actually less than gpt-5.1-codex-max at 39.4 minutes. But note that this (hypothetical) stronger baseline still doesn't push the blow-up date that much later, i.e. just to the end of 2026 per Table 2.
Realistically, this highlights that to make accurate projections of the time to catch up with human horizons based on METR data, we need better human baselines. METR's fixed-reliability metrics, meanwhile, are likely an under-estimate of horizon progress, e.g. currently only 4 hr 49 min for Claude, despite Claude being better than their own baselines for effectively all horizons (> billions of minutes per above). Though short of collecting high-quality baselines, perhaps there is also some value in adjusting the human baseline to be marginally better, per the 35.9-minute Claude estimate above, which likely has the benefit of preserving more realistic longer-term asymptotic behavior than the fixed-reliability metrics.
Claude slope vs intercept effects: Note that the extremely large Claude horizons from this intersection-point approach are mostly the result of its logistic slope basically catching up with humans (-0.38 human vs -0.40 claude-opus-4.5 vs -0.58 for gpt-5.1-codex-max), whereas Claude's logistic intercept was actually a bit worse than prior models'. So the Claude slope ($\beta_{\text{LLM}}$) increased a fair bit ahead of trend, though this is a bit misleading since the overall hyperbolic blowup wasn't that far ahead of schedule (per Table 2, where "better than METR human baselines at every time horizon" was predicted for September of this year, prior to the new Claude datapoint).
Note it could be worth redoing the statistical tests from this post to include the new Claude model; I haven't gotten around to that yet, but it looks pretty clear that it will provide stronger support for a hyperbolic model, given the abrupt blow-up in horizons.
Daniel Kokotajlo shortform on time horizons: https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=P8qGMRnbEexaFB4s9
METR horizons paper: https://arxiv.org/pdf/2503.14499
AI-2027 Forecast (original): https://ai-2027.com/research/timelines-forecast
Greenblatt on METR timelines: https://blog.redwoodresearch.org/p/my-agi-timeline-updates-from-gpt
METR post w/ recent data: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
Published on January 17, 2026 9:19 PM GMT
This is a brief note on what I did with my funding in 2025, and my plans for 2026, written primarily because Manifund nudged me for an update on my project.
I ran my AISC project (which I announced here) with four mentees in Spring 2025: Norman Hsia, Hanna Gabor, Paul Rapoport, and Roman Malov. A few other people attended the weekly meetings as well, and those regular meetings have continued (they are joinable -- pm me if interested). Norman and Paul ended up as coauthors of my ILIAD 2024 paper Understanding Trust, which had been drafted in 2024, so served as both an input and an output of the AISC project.
I recorded most of the meetings involved in the project, as one of the hopeful outputs was publicly posted videos explaining the research agenda. I've proven to be bad at this side of things: I don't like listening to myself talk, so I found it difficult to edit or even to review edits done by others. I'm finally uploading the videos with minimal AI-orchestrated edits. Playlist here. At the time of publication, there are only two, but more are coming very soon. If you are OK with the almost-unedited presentation style, it should be a good resource for getting a very in-depth view of my thinking about AI safety and decision theory: a thorough snapshot as of spring 2025.
In 2025, I obtained funding for 2025 as well as 2026. (My total financial runway is longer than this, but 2025 and 2026 have been funded by grants/donations which compensated me for my continued research at specific price points.) I'm opening up my Manifund project for funding for 2027, for those who feel so inclined.
In addition to publishing the ILIAD 2024 paper, I also published an ILIAD 2025 paper: Communication & Trust. I consider it to be an incremental improvement: the ILIAD 2024 paper treated self-modifying actions as a distinct class with known effects which work with certainty, whereas the ILIAD 2025 paper treated all actions as having some subjective chance of disrupting the agent's computation.
I also attended Inkhaven, where I wrote a post for every day in November. This was a big success for me: I was able to write about many things which I had been wanting to write about for some time (perhaps in rougher form than if I had eventually gotten around to them via my normal writing process). It was also exhausting. Here's my retrospective, with the caveat that I wrote it on the very last day, when I was perhaps the most sick of writing.
One of the posts describes my research arc over 2025, and the hopes I have moving forward. This is still a good summary of where I'd like to take my research in 2026. I have hope that we're understanding concepts and abstraction better, so that we might soon be able to characterize important concepts like agency, alignment, corrigibility, etc in a formalism which deals natively with ontological shifts. Most of my hope is due to Sam Eisenstat's Condensation: a theory of concepts, which I wrote a detailed review of during Inkhaven.
As for my more ML-flavored research ideas, I finally wrote about that stuff last week. I've already found someone interested in trying some experiments based on those ideas. We'll see how that goes.
I'm also mentoring with MATS this summer. You can still apply to my MATS track today or tomorrow as I write this; applications are due January 18th.
Published on January 17, 2026 6:47 PM GMT
You've probably heard something like this before:
You won't go badly wrong following the conclusion, but (3) doesn't actually follow from (1) and (2). That's because interventions might vary in how they affect the expected value of the future conditional on survival.[1]
Will MacAskill makes roughly this argument in Better Futures (August 2025). See the diagram below: survival-focused interventions target the red rectangle, flourishing-focused interventions target the blue. But the blue rectangle might be much larger than the red rectangle -- if x-risk is 20% then even the best survival intervention can increase EV by at most 1.25x, whereas a flourishing intervention could increase EV by 5x or 5000x.
But is x-risk only 20%? MacAskill thinks so,[2] but his argument applies even if extinction is very likely — say 99% — so long as there are interventions that increase flourishing by +100x. That's the scenario I want to discuss in this post.
My conclusion is:
Recall that flourishing is the expected value conditional on survival. That is, flourishing-focused interventions target the survival posterior, consisting of the green and blue rectangles. Consequently, if survival is likely then the survival posterior consists of ordinary futures, but if survival is unlikely then the survival posterior consists of weird futures, worlds very different from what we'd expect.[3]
What kind of weird worlds?
For more possible futures, see Bart Bussmann's 60+ Possible Futures; which weird survival worlds are most likely will depend on your cause for pessimism.
This poses three problems:
Problem 1: Survival world is harder to reason about.
If survival is likely, then the survival posterior consists of ordinary worlds, which you can reason about using existing assumptions/models/trends. However, if survival is unlikely, then the survival posterior consists of weird worlds where our assumptions break down. This makes it much harder to estimate the impact of our interventions, because the world is unprecedented. For example, imagine if brain uploads arrive by 2030 -- this should make us more sceptical of extrapolating various economic trends that were observed before uploads.
And when you condition on survival, you update not just your empirical beliefs but your moral beliefs too. Suppose you're uncertain about (A) whether animals have as much moral weight as humans, and (B) whether we can build machines as smart as humans, and that (A) and (B) are correlated, both downstream of a latent variable like "humans are special" (call it H). Survival is much more likely if ¬B, so conditioning on survival upweights ¬B, which upweights H, which downweights A. Using the numbers below, your credence in animal moral weight drops from 59% to 34% — nearly halved — just by conditioning on survival. Gnarly!
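Here is a minimal sketch of this kind of update. The conditional probabilities are illustrative numbers chosen to produce roughly this size of drop, not a careful elicitation.

```python
# Toy Bayes net: H ("humans are special") -> A (animal moral weight) and B (human-level
# machines are buildable); survival S depends on B. Conditioning on S shifts P(A).
# All numbers are illustrative placeholders.
from itertools import product

P_H = 0.5
P_A_given_H = {True: 0.25, False: 0.93}   # A: animals have human-level moral weight
P_B_given_H = {True: 0.20, False: 0.90}   # B: we can build machines as smart as humans
P_S_given_B = {True: 0.03, False: 0.95}   # S: humanity survives

joint = {}
for H, A, B, S in product([True, False], repeat=4):
    p = P_H if H else 1 - P_H
    p *= P_A_given_H[H] if A else 1 - P_A_given_H[H]
    p *= P_B_given_H[H] if B else 1 - P_B_given_H[H]
    p *= P_S_given_B[B] if S else 1 - P_S_given_B[B]
    joint[(H, A, B, S)] = p

def prob(event, given=lambda w: True):
    num = sum(p for w, p in joint.items() if event(w) and given(w))
    den = sum(p for w, p in joint.items() if given(w))
    return num / den

print("P(animal moral weight)            :", round(prob(lambda w: w[1]), 2))                       # ~0.59
print("P(animal moral weight | survival) :", round(prob(lambda w: w[1], given=lambda w: w[3]), 2))  # ~0.34
```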
Problem 2: Surviving worlds are more diverse.
When survival is unlikely, the survival worlds are more different from each other; this is because all ordinary worlds are alike but each weird world is weird in its own way. And because the survival worlds vary so much, it's harder to find interventions which are robustly beneficial -- an intervention that looks good in one weird survival world is likely to look poor in another.
For example, suppose you think that if world governments proceed with the expected level of sanity, then ASI will cause extinction. But we might survive because governments showed unexpectedly low sanity (e.g. initiating nuclear conflict over some mundane issue) or unexpectedly high sanity (e.g. updating on early warning shots). Now consider an intervention which shifts power toward existing world governments at the expense of frontier AI labs: this might decrease flourishing if we survived via low governmental sanity and increase flourishing if we survived via high governmental sanity. The intervention's value flips sign depending on which weird world we end up in.
It was a bit tricky to try to justify this intuition, but here's a toy model: imagine worlds as points in R², with a Gaussian prior centered at the origin. Scatter some "attractors" randomly — some red (extinction), some green (survival). Each point in world-space inherits the fate of its nearest attractor. When most of the Gaussian mass falls in red regions, survival requires landing in one of the scattered green islands. These islands might be far apart in world-space. The survival posterior becomes multimodal, spread across disconnected regions. The diagram below illustrates: when P(survival) is low, Var(world | survival) tends to be high.
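Here is a minimal sketch of that toy model, with arbitrary numbers, reporting the survival probability alongside the spread and fragmentation of the surviving worlds for different attractor mixes.

```python
# Toy attractor model: worlds are Gaussian points in R^2, attractors are scattered at
# random and coloured green (survival) or red (extinction), and each world inherits the
# fate of its nearest attractor. All numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def survival_posterior(frac_green, n_attractors=40, n_worlds=20_000):
    attractors = rng.normal(scale=2.0, size=(n_attractors, 2))
    is_green = rng.random(n_attractors) < frac_green
    worlds = rng.normal(size=(n_worlds, 2))
    nearest = np.argmin(
        np.linalg.norm(worlds[:, None, :] - attractors[None, :, :], axis=2), axis=1
    )
    survives = is_green[nearest]
    if not survives.any():
        return 0.0, float("nan"), 0
    survivors = worlds[survives]
    n_islands = len(np.unique(nearest[survives]))  # distinct green basins holding surviving mass
    return survives.mean(), survivors.var(axis=0).sum(), n_islands

for frac_green in [0.5, 0.2, 0.05]:
    p_surv, spread, islands = survival_posterior(frac_green)
    print(f"green fraction={frac_green:.2f}  P(survival)={p_surv:.2f}  "
          f"Var(world|survival)={spread:.2f}  surviving basins={islands}")
```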
This diversity creates practical problems:
Problem 3: Transitional events wash out interventions.
Ordinary worlds have more continuity between the present and the future, whereas weird worlds often involve some transitional event that explains why we survived, and these transitional events might 'wash out' your intervention.
For example, suppose you think the current policy landscape is likely to lead to extinction. Then we should be pessimistic about flourishing-focused policy interventions because, conditional on survival, there was probably some large-scale disruption of the policy landscape.
In the next post, I will discuss potential strategies for focusing on flourishing when survival is unlikely. These strategies will aim to overcome some or all of the problems above.
[1] In maths:
Assuming that E(value|not survival) ≈ 0, we can decompose E(value|intervention) into the product of E(value|survival, intervention) and P(survival | intervention).
This suggests that P(survival | intervention) is a good proxy for E(value|intervention), but this is only true if E(value|survival, intervention) doesn't vary much across interventions.
However, E(value|survival, intervention) might vary significantly.
[2] To illustrate, suppose you think that our chances of survival this century are reasonably high (greater than 80%) but that, if we survive, we should expect a future that falls far short of how good it could be (less than 10% as good as the best feasible futures). These are close to my views; the view about Surviving seems widely-held, and Fin Moorhouse and I will argue in essays 2 and 3 for something like that view on Flourishing.
[3] Caveat: In principle, survival could be unlikely yet conditioning on it might not make worlds weird. To illustrate: suppose you're certain that humanity's survival depends entirely on the random weight initialisation of a particular pretraining run — 1% chance of good, 99% bad. Conditioning on survival, most survival worlds are ordinary in every respect except for the lucky weight initialisation. The weirdness of the weight initialisation is highly localised, so it doesn't raise the three problems above.
That said, I don't think such worldviews are plausible, because they require a very high prior on ordinary worlds. I think that plausible worldviews should place at least 1% probability on "worlds which are globally weird and we survive". And so these worlds will dominate the survival posterior, even if there are also some "worlds which are locally weird and we survive".
Published on January 17, 2026 5:28 PM GMT
In 1654, a Jesuit polymath named Athanasius Kircher published Mundus Subterraneus, a comprehensive geography of the Earth’s interior. It had maps and illustrations and rivers of fire and vast subterranean oceans and air channels connecting every volcano on the planet. He wrote that “the whole Earth is not solid but everywhere gaping, and hollowed with empty rooms and spaces, and hidden burrows.” Alongside comments like this, Athanasius identified the legendary lost island of Atlantis, pondered where one could find the remains of giants, and detailed the kinds of animals that lived in this lower world, including dragons. The book was based entirely on secondhand accounts, like travelers’ tales, miners’ reports, and classical texts, so it was as comprehensive as it could’ve possibly been.
But Athanasius had never been underground and neither had anyone else, not really, not in a way that mattered.
Today, I am in San Francisco, the site of the 2026 J.P. Morgan Healthcare Conference, and it feels a lot like Mundus Subterraneus.
There is ostensibly plenty of evidence to believe that the conference exists, that it actually occurs between January 12, 2026 to January 16, 2026 at the Westin St. Francis Hotel, 335 Powell Street, San Francisco, and that it has done so for the last forty-four years, just like everyone has told you. There is a website for it, there are articles about it, there are dozens of AI-generated posts on Linkedin about how excited people were about it. But I have never met anyone who has actually been inside the conference.
I have never been approached by one, or seated next to one, or introduced to one. They do not appear in my life. They do not appear in anyone’s life that I know. I have put my boots on the ground to rectify this, and asked around, first casually and then less casually, “Do you know anyone who has attended the JPM conference?”, and then they nod, and then I refine the question to be, “No, no, like, someone who has actually been in the physical conference space”, then they look at me like I’ve asked if they know anyone who’s been to the moon. They know it happens. They assume someone goes. Not them, because, just like me, ordinary people like them do not go to the moon, but rather exist around the moon, having coffee chats and organizing little parties around it, all while trusting that the moon is being attended to.
The conference has six focuses: AI in Drug Discovery and Development, AI in Diagnostics, AI for Operational Efficiency, AI in Remote and Virtual Healthcare, AI and Regulatory Compliance, and AI Ethics and Data Privacy. There is also a seventh theme over ‘Keynote Discussions’, the three of which are The Future of AI in Precision Medicine, Ethical AI in Healthcare, and Investing in AI for Healthcare. Somehow, every single thematic concept at this conference has converged onto artificial intelligence as the only thing worth seriously discussing.
Isn’t this strange? Surely, you must feel the same thing as me, the inescapable suspicion that the whole show is being put on by an unconscious Chinese Room, its only job to pass semi-legible symbols over to us with no regard for what they actually mean. In fact, this pattern is consistent across not only how the conference communicates itself, but also how biopharmaceutical news outlets discuss it.
Each year, Endpoints News and STAT and BioCentury and FiercePharma all publish extensive coverage of the J.P. Morgan Healthcare Conference. I have read the articles they have put out, and none of it feels like it was written by someone who actually was at the event. There is no emotional energy, no personal anecdotes, all of it has been removed, shredded into one homogeneous, smoothie-like texture. The coverage contains phrases like “pipeline updates” and “strategic priorities” and “catalysts expected in the second half.” If the writers of these articles ever approach a human-like tenor, it is in reference to the conference’s “tone”. The tone is “cautiously optimistic.” The tone is “more subdued than expected.” The tone is “mixed.” What does this mean? What is a mixed tone? What is a cautiously optimistic tone? These are not descriptions of a place. They are more accurately descriptions of a sentiment, abstracted from any physical reality, hovering somewhere above the conference like a weather system.
I could write this coverage. I could write it from my horrible apartment in New York City, without attending anything at all. I could say: “The tone at this year’s J.P. Morgan Healthcare Conference was cautiously optimistic, with executives expressing measured enthusiasm about near-term catalysts while acknowledging macroeconomic headwinds.” I made that up in fifteen seconds. Does it sound fake? It shouldn’t, because it sounds exactly like the coverage of a supposedly real thing that has happened every year for the last forty-four years.
Speaking of the astral body I mentioned earlier, there is an interesting historical parallel to draw there. In 1835, the New York Sun published a series of articles claiming that the astronomer Sir John Herschel had discovered life on the moon. Bat-winged humanoids, unicorns, temples made of sentient sapphire, that sort of stuff. The articles were detailed, describing not only these creatures’ appearance, but also their social behaviors and mating practices. All of these cited Herschel’s observations through a powerful new telescope. The series was a sensation. It was also, obviously, a hoax, the Great Moon Hoax as it came to be known. Importantly, the hoax worked not because the details were plausible, but because they had the energy of genuine reporting: Herschel was a real astronomer, and telescopes were real, and the moon was real, so how could any combination that involved these three be fake?
To clarify: I am not saying the J.P. Morgan Healthcare Conference is a hoax.
What I am saying is that neither I nor anybody else can tell the difference between the conference coverage and a very well-executed hoax. Consider that the Great Moon Hoax was walking a very fine tightrope between giving the appearance of seriousness, while also not giving away too many details that’d let the cat out of the bag. Here, the conference rhymes.
For example: photographs. You would think there would be photographs. The (claimed) conference attendees number in the thousands, many of them with smartphones, all of them presumably capable of pointing a camera at a thing and pressing a button. But the photographs are strange, walking that exact snickering line that the New York Sun walked. They are mostly photographs of the outside of the Westin St. Francis, or they are photographs of people standing in front of step-and-repeat banners, or they are photographs of the schedule, displayed on a screen, as if to prove that the schedule exists. But photographs of the inside with the panels, audience, the keynotes in progress; these are rare. And when I do find them, they are shot from angles that reveal nothing, that could be anywhere, that could be a Marriott ballroom in Cleveland.
Is this a conspiracy theory? You can call it that, but I have a very professional online presence, so I personally wouldn’t. In fact, I wouldn’t even say that the J.P. Morgan Healthcare Conference is not real, but rather that it is real, but not actually materially real.
To explain what I mean, we can rely on economist Thomas Schelling to help us out. Sixty-six years ago, Schelling proposed a thought experiment: if you had to meet a stranger in New York City on a specific day, with no way to communicate beforehand, where would you go? The answer, for most people, is Grand Central Station, at noon. Not because Grand Central Station is special. Not because noon is special. But because everyone knows that everyone else knows that Grand Central Station at noon is the obvious choice, and this mutual knowledge of mutual knowledge is enough to spontaneously produce coordination out of nothing. This, Grand Central Station and places just like it, are what’s known as a Schelling point.
Schelling points appear when they are needed, burnt into our genetic code, Pleistocene subroutines running on repeat, left over from when we were small and furry and needed to know, without speaking, where the rest of the troop would be when the leopards came. The J.P. Morgan Healthcare Conference, on the second week of January, every January, Westin St. Francis, San Francisco, is what happened when that ancient coordination instinct was handed an industry too vast and too abstract to organize by any other means. Something deep drives us to gather here, at this time, at this date.
To preempt the obvious questions: I don’t know why this particular location or time or demographic were chosen. I especially don’t know why J.P. Morgan of all groups was chosen to organize the whole thing. All of this simply is.
If you find any of this hard to believe, observe that the whole event is, structurally, a religious pilgrimage, and has all the quirks you may expect of a religious pilgrimage. And I don’t mean that as a metaphor, I mean it literally, in every dimension except the one where someone official admits it, and J.P. Morgan certainly won’t.
Consider the elements. A specific place, a specific time, an annual cycle, a journey undertaken by the faithful, the presence of hierarchy and exclusion, the production of meaning through ritual rather than content. The hajj requires Muslims to circle the Kaaba seven times. The J.P. Morgan Healthcare Conference requires devotees of the biopharmaceutical industry to slither into San Francisco for five days, nearly all of them—in my opinion, all of them—never actually entering the conference itself, but instead orbiting it, circumambulating it, taking coffee chats in its gravitational field. The Kaaba is a cube containing, according to tradition, nothing, an empty room, the holiest empty room in the world. The Westin St. Francis is also, roughly, a cube. I am not saying these are the same thing. I am saying that we have, as a species, a deep and unexamined relationship to cubes.
This is my strongest theory so far: that the J.P. Morgan Healthcare Conference isn't exactly real or unreal, but a mass-coordination social contract that has been unconsciously signed by everyone in this industry, transcending the need for an underlying referent.
My skeptical readers will protest at this, and they would be correct to do so. The story I have written out is clean, but it cannot be fully correct. Thomas Schelling was not so naive as to believe that Schelling points spontaneously generate out of thin air; there is always a reason, a specific, grounded reason, that these places become the low-energy metaphysical basins that they are. Grand Central Station is special because of the cultural gravitas it has accumulated through popular media. Noon is special because that is when the sun reaches its zenith. The Kaaba was worshipped because it was not some arbitrary cube; the cube itself was special because it contained the Black Stone, set into its eastern corner, a relic that predates Islam itself, one that some traditions claim fell from heaven.
And there are signs, if you know where to look, that the underlying referent for the Westin St. Francis's status as a gathering point is physical. Consider the heat. It is January in San Francisco, usually brisk, yet the interior of the Westin St. Francis maintains a distinct, humid microclimate. Consider the low-frequency vibration in the lobby that ripples the surface of water glasses but doesn't seem to register on local, public seismographs. There is something about the building itself that feels distinctly alien. But stand outside the building long enough and you'll have the nagging sensation that it is not something about the hotel that feels off, but rather what lies within, underneath, and around it.
There's no easy way to sugarcoat this, so I'll just come out and say it: it is possible that the entirety of California is built on top of one immensely large organism, and that the particular spot on which the Westin St. Francis Hotel stands (335 Powell Street, San Francisco, 94102) sits directly above its beating heart. And that this, more than anything else, is what determines both the location of and the entire reason for the J.P. Morgan Healthcare Conference.
I believe that the hotel maintains dozens of meter-thick polyvinyl chloride tubes that have been threaded down through the basement, through the bedrock, through geological strata, and into the cardiovascular system of something that has been lying beneath the Pacific coast since before the Pacific coast existed. That the hotel is a singular, thirty-two-story central line. That, during the week of the conference, hundreds of gallons of drugs flow through these tubes, into the pulsating mass of the being, pouring down arteries the size of canyons across California. The dosing takes five days; hence the length of the conference.
And I do not believe that the drugs being administered here are simply sedatives. They are, in fact, the opposite of sedatives. The drugs are keeping the thing beneath California alive. There is something wrong with the creature, and a select group of attendees at the J.P. Morgan Healthcare Conference have become its primary caretakers.
Why? The answer is obvious: there is nothing good that can come from having an organic creature that spans hundreds of thousands of square miles suddenly die, especially if that same creature's mass makes up a substantial portion of the fifth-largest economy on the planet, larger than India, larger than the United Kingdom, larger than most countries that we think of as significant. Maybe letting the state slide off into the sea was an option at one point, but not anymore. California produces more than half of the fruits, vegetables, and nuts grown in the United States. California produces the majority of the world's entertainment. California produces the technology that has restructured human communication. Nobody can afford to let the whole thing collapse.
So, perhaps it was decided that California must survive, at least for as long as possible. Hence Amgen. Hence Genentech. Hence the entire biotech revolution, which we are taught to understand as a triumph of science and entrepreneurship, a story about venture capital and recombinant DNA and the genius of the California business climate. The story is not false, but it is incomplete. The revolution happened, above all else, because the creature needed medicine, and the old methods of making medicine were no longer adequate, and someone decided that the only way to save the patient was to create an entire industry dedicated to its care.
Why is drug development so expensive? Because the real R&D costs are for the primary patient, the being underneath California, and human applications are an afterthought, a way of recouping investment. Why do so many clinical trials fail? For the same reason: the drugs are not meant for our species. Why is the industry concentrated in San Francisco, San Diego, Boston? Because these are monitoring stations, places where other intravenous lines have been drilled into other organs, other places where the creature surfaces close enough to reach.
Finally, consider the hotel itself. The Westin St. Francis was built in 1904, and, throughout its entire existence, it has never, ever, even once, closed or stopped operating. The 1906 earthquake leveled most of San Francisco, and the Westin St. Francis did not fall. It was damaged, yes, but it did not fall. The 1989 Loma Prieta earthquake killed sixty-three people and collapsed a section of the Bay Bridge. Still, the Westin St. Francis did not fall. It cannot fall, because if it falls, the central line is severed, and if the central line is severed, the creature dies, and if the creature dies, we lose California, and if we lose California, our civilization loses everything that California has been quietly holding together. And so the Westin St. Francis has hosted every single J.P. Morgan Healthcare Conference since 1983, has never missed one, has never even come close to missing one, and will not miss the next one, or the one after that, or any of the ones that follow.
If you think about it, this all makes a lot of sense. It may also seem very unlikely, but unlikely things have been known to happen throughout history. Athanasius Kircher's Mundus Subterraneus had a section on the "seeds of metals," a theory that gold and silver grew underground like plants, sprouting from mineral seeds in the moist, oxygen-poor darkness. This was wrong, but the intuition beneath it was not entirely misguided. We now understand that the Earth's mantle is a kind of eternal engine of astronomical size, cycling matter through subduction zones and volcanic systems, creating and destroying crust. Kircher was wrong about the mechanism, but right about the structure. The earth is not solid. It is everywhere gaping, hollowed with empty rooms, and it is alive.