Or, did a chief scientist of an AI assistant startup conclusively show that GPT-5.5 has 9.7T parameters?
Introduction
Recently, a paper was circulated on Twitter claiming to have reverse engineered the parameter count of many frontier closed-source models including the newer GPT-5.5 (9.7T parameters) and Claude Opus 4.7 (4.0T parameters) as well as older models such as o1 (3.5T) and gpt-4o (720B). The paper, titled “Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity”, introduces a dataset of factual knowledge of different difficulties, regresses performance on this dataset against parameter count, and then uses this regression to extrapolate from the performance of closed-sourced frontier models to their parameter count. A notable fact about this paper is that, unlike most empirical machine learning papers, it’s single-authored: Bojie Li, the chief scientist of Pine AI, is the sole author of this piece.
These results were suspicious for many reasons, the primary one being that the paper reads like low-effort, hastily written AI slop. For example, the codebase (https://github.com/19PINE-AI/ikp) was constructed in large part with Claude Code and shows many of the hallmarks of code that is almost entirely vibe-coded with little sanity checking (e.g. redundant and inconsistent variable definitions[1], boilerplate bloat, excessive error handling[2], and silent failures[3]). The same can be said of the author’s website for this paper (https://01.me/research/ikp/), which defines terms that appear nowhere else on the page[4], has table headings inconsistent with their contents[5], and has a very high heading-to-text ratio.
We (Benjamin and Lawrence) decided to dig into these results further. Specifically, we read the paper, reproduced the author’s results using their code base, and then dug into some obvious methodological issues to see how much the issues affected the author’s results.
We find:
The core idea behind the paper is largely sound but overstated. IKP performance does seem to correlate strongly with parameter count for open-source models (R² between 0.78 and 0.92), but the exact strength of the relationship depends on methodological choices obscured by the paper.
The codebase makes poorly documented methodological choices that are largely unjustified, and sometimes inconsistent both with itself and with the arXiv paper. Most of these don’t matter, but one makes a big difference to the results: whether the models’ per-tier scores are floored at a minimum value, something Li claims not to do in the paper but which is done in the code.
The IKP dataset has serious quality and ambiguity issues, especially for harder questions. Substantial fractions of both the hard wikidata-sourced questions (at least 6.8%) and the hard researcher-knowledge questions (~24.9%) are ambiguous. In a few cases, the models are rated incorrect because Li’s provided gold answer is wrong. Correcting for these dataset issues also affects the estimated parameter counts of the models.
Because of the above issues, we believe the headline parameter-count estimates for closed-source frontier models to be very suspect. Correcting for some of the methodological, implementation, and dataset issues we identified, a linear regression on IKP performance suggests that GPT-5.5 has around 1.458T parameters, while Claude Opus 4.7 has around 1.132T. Because this extrapolation is so sensitive to methodological choices (and to important limitations of the dataset that we did not have time to address), we believe the different numbers reveal problems with the original methodology rather than the true parameter counts of the models.
Despite these issues, we think that the core idea – reverse engineering LLM parameter count by quantifying memorization capacity – is solid, and welcome future work implementing this in a more rigorous and systematic way.
Summary of Li’s “Incompressible Knowledge Probes”
As usual, let’s start by summarizing the paper at hand.
One way of estimating the size of a closed model is to extrapolate from API throughput and pricing under a hardware-cost model (e.g. Epoch AI’s inference economics).
Li argues that these size estimates are unreliable, off by a factor of over 2x, due to confounders such as quantization, batching, and vendor margin. He instead proposes reverse engineering parameter count from the fact that neural networks can only store a number of facts linear in their parameter count.[6] Unfortunately, this isn’t as simple as counting all the facts a model knows:[7] for one, doing that exhaustively is intractable.
Li builds a set of questions ("Incompressible Knowledge Probes," IKP) that samples factual associations across seven obscurity tiers. Probes come from four sources: GPT-5-generated questions, Wikidata SPARQL pulls, DBLP/OpenAlex researcher records, and a small set of hand-curated questions. Li calls these "probes," but to avoid confusion we'll just call them questions.
Li claims six contributions:
He introduces the IKP dataset, in order to measure incompressible facts. This is distinguished from procedural knowledge (e.g. how to write code), which is likely compressible.
He regresses model parameters against IKP performance and finds a strong linear relationship between adjusted IKP performance and (log) model parameter count on 89 open-weight models. He also confirms that IKP outperforms MMLU, GPQA, and SimpleQA at predicting parameter count. We think this result generally holds up, though we believe the exact strength of the claimed relationship is overstated.
He “falsifies” the “densing law” results from previous work. We agree that the densing law[8] paper is indeed very suspect (if there’s interest, we can detail why in a follow-up post). However, the densing law paper is not directly “falsified”[9] by Li’s results; a more accurate reading is that, controlling for parameter count, open-weight LLMs are not getting better on his IKP dataset.
He uses the IKP <-> model parameter regression to estimate the parameters of closed-source frontier models and the “effective” parameters of MoEs. These results are headlined by GPT-5.5 (~9.7T) and Claude Opus 4.6 (~5.3T). He also shows that for Mixture-of-Experts models, total parameters predict factual knowledge much better than active parameters (R² of 0.79 vs 0.51).
He uses similarity of responses on the IKP dataset to identify models sharing the same base model vs full retrains. Specifically, he combines rare-fact Jaccard overlap with "hallucination similarity" (the rate at which two models produce the same wrong answer on rare facts) into a Hallucination Similarity Score, which he claims separates weight-sharing siblings, post-training lineages, and full retrains across closed vendors without requiring model weights. We did not investigate these results in detail, so we can't speak to whether the lineage clusters in the paper's Figure 5 are correct.[10]
He open-sourced his code on Github. We appreciate this a lot because it 1) greatly simplified the process of reproducing his results and 2) made it much easier to identify possible methodological issues with his work.
The IKP dataset
The IKP dataset consists of 1400 questions, divided into 7 tiers of 200 each. The four sources of questions are:
GPT-5-generated candidates (401 questions): These are questions generated by asking GPT-5 to generate factual questions with a few provided examples. These compose the bulk of T1-T2 questions, though some make it to T3-T4. Example:
[T2] "Who composed the ballet 'Giselle'?" Gold Answer: Adolphe Adam
Wikidata SPARQL questions (557 questions): These questions are drawn from the Wikidata database and ask about founding years of institutions, capital cities of countries, locations of headquarters, and geographic facts. They mainly populate T3-T7 (only 11 of the wikidata questions are in T1 or T2). Example:
[T4] "In what year was National Pingtung University of Education founded?"
Gold: 1940
DBLP / OpenAlex researcher questions (345 questions): These questions ask the model to "name the subfield and one paper, system, institution, or co-author for [researcher]". Most of these are in T5-T7. Example:
[T5] "In computer science, what is the research subfield of Martina Zitterbart, and name one paper, system, institution, or co-author associated with their work? If you don't know who this person is, say so."
Gold: computer networking
[papers, coauthors, and affiliations from OpenAlex omitted for brevity]
97 manual or supplementary additions from the author’s previous work to balance T1–T4 coverage. Examples:
[T1] "What is the capital of Portugal?" → Lisbon
[T2] "What is the largest lake in Africa?" → Victoria
[T3] "Who composed the opera 'The Magic Flute'?" → Mozart
The difficulty of each tier is empirically calibrated against six "landmark" models that span the open-weight size range from Qwen 2.5 0.5B (T1) up through Gemini 3.1 Pro (T6). A question is assigned to tier k if the k-th landmark answers it correctly but the (k−1)-th landmark does not. T7 is reserved for questions no landmark gets right, as a deliberate ceiling that no current model is supposed to clear.
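To make the calibration rule concrete, here is a minimal sketch of the tier assignment as we understand it from the paper. This is our own code, not Li's; the function name is ours, and only the two endpoint landmarks are named in the text, so the middle landmark names are placeholders.

```python
# Sketch of the tier-calibration rule as we understand it (our code, not Li's).
# Only the T1 and T6 landmarks are named in the paper; the rest are placeholders.
LANDMARKS = ["qwen-2.5-0.5b", "<landmark-2>", "<landmark-3>",
             "<landmark-4>", "<landmark-5>", "gemini-3.1-pro"]  # T1..T6 anchors

def assign_tier(correct_by_landmark: list[bool]) -> int:
    """correct_by_landmark[k] is whether landmark k+1 answered the question correctly."""
    for k, correct in enumerate(correct_by_landmark):
        # Assign tier k+1 at the smallest landmark that answers correctly
        # (by construction, the previous landmark did not).
        if correct:
            return k + 1
    return 7  # no landmark answers correctly -> T7, the deliberate ceiling
```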
As we’ll later note, both the wikidata and researcher question sets (which comprise over 900 of the 1400 questions, including all questions from T5-T7) have fairly significant quality issues. For example, both contain many ambiguous questions due to name-space collisions (e.g. multiple researchers or locations sharing the same name). Many of the wikidata founding-year questions are also somewhat ambiguous: Oxford received its royal charter in 1248, for example, but there is evidence of teaching at Oxford in 1096, and the university arguably existed even earlier. Some of the wikidata questions also reference outdated information. This complicates the interpretation of the results.
IKP scoring and Regression Methodology
Li scores each of a model’s responses on a 3- or 4-point scale:
STRONG / CORRECT = +1.0
WEAK = +0.5 (Reserved for researcher questions where the model provides the right subfield but no supporting evidence.)
REFUSAL = 0
WRONG = λ, where λ = −1.0 by default (the "hallucination penalty")
The hallucination penalty is added to discourage guessing (though it also penalizes models that know the correct answers to questions whose gold answers are wrong). Each tier’s score is the mean over its 200 questions, and a model's overall "penalized accuracy" is the unweighted mean of the seven tier scores. To calculate penalized accuracy, the per-tier scores are floored at 0 in the released data, even though the paper text explicitly claims they are not floored "to preserve the bluff signal in the calibration." This is one of the methodology inconsistencies we'll come back to, as the choice meaningfully changes the slope of the fit.
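To make the aggregation concrete, here is a minimal sketch of the scoring pipeline described above. This is our own reconstruction, not Li's code, with the per-tier floor exposed as an explicit flag since whether it is applied turns out to matter.

```python
# Minimal reconstruction of the scoring aggregation (our sketch, not Li's code).
# `grades` maps tier (1..7) to the list of per-question grades for that tier.
GRADE_VALUES = {"STRONG": 1.0, "CORRECT": 1.0, "WEAK": 0.5, "REFUSAL": 0.0}

def penalized_accuracy(grades: dict[int, list[str]],
                       hallucination_penalty: float = -1.0,
                       floor_tiers_at_zero: bool = False) -> float:
    """Unweighted mean of the seven per-tier means; WRONG scores the penalty."""
    tier_means = []
    for tier in range(1, 8):
        vals = [GRADE_VALUES.get(g, hallucination_penalty) for g in grades[tier]]
        mean = sum(vals) / len(vals)
        if floor_tiers_at_zero:
            # The paper says this is not done, but the released code does it.
            mean = max(mean, 0.0)
        tier_means.append(mean)
    return sum(tier_means) / len(tier_means)
```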
The judge is Gemini 3 Flash Preview at temperature 0, and all target models are run once at temperature 0. Note that this is fairly non-standard for model evaluations (and many reasoning model providers explicitly discourage running their models with t=0).
The headline regression is a one-line OLS: A = α·log₁₀(N) + β, where A is a model's penalized IKP accuracy and N is its parameter count.
This OLS is fit on 89 open-weight models with known parameter counts, ranging from SmolLM2-135M up to DeepSeek V4 Pro at 1.6T. Li reports α = 0.147, β = +0.132, R² = 0.917, with leave-one-out median fold error of 1.59× and a 90% prediction interval factor of 3.0×. Inverting the regression gives a parameter-count estimate for any target model: N̂ = 10^((A − β) / α).
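The fit and its inversion are easy to reproduce; the sketch below is our own code (variable names are ours), using the same functional form. The units of N̂ simply follow the units used for the calibration points (Li's reported coefficients appear to correspond to N measured in billions of parameters).

```python
import numpy as np

# Sketch of the headline fit (our own code): penalized accuracy A regressed
# on log10 of the known parameter counts of the open-weight calibration models.
def fit_ikp_regression(param_counts: np.ndarray, accuracies: np.ndarray):
    log_n = np.log10(param_counts)
    alpha, beta = np.polyfit(log_n, accuracies, deg=1)  # A ≈ alpha*log10(N) + beta
    return alpha, beta

def estimate_params(accuracy: float, alpha: float, beta: float) -> float:
    """Invert the fit: N_hat = 10 ** ((A - beta) / alpha)."""
    return 10 ** ((accuracy - beta) / alpha)
```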
For Mixture-of-Experts models, total parameters predict factual knowledge meaningfully better than active parameters (R² = 0.79 vs 0.51).[11]
This figure shows the ground-truth open-model parameter counts with the linear regression from our modified methodology, as well as the fit from the original paper. While ours gets a lower overall R², the cleaned-up method fits the tails of the distribution better.
Densing Law Falsification Results
The densing law paper (Xiao et al. 2024) introduces "capability density", defined as the ratio of a model's effective parameter size to its actual parameter size. Here, "effective size" is the parameter count a reference model would need to match the target's downstream score. Across 29 open-source base models, they fit ln(ρ_max) = A·t + B and report A ≈ 0.007, which they translate to "the maximum capability density of LLMs doubles approximately every 3.3 months."[12]
To test this, Li adds release date as a covariate to the IKP regression: A = α·log₁₀(N) + γ·t + β, where t is the model's release date in months.
If the densing law applied to the IKP questions, then γ should be about +0.0117/month (the value that produces the claimed 3.3-month density doubling). Across 96 dated open-weight models, Li fits γ = −0.0010/month, 95% CI [−0.0031, +0.0008] — statistically indistinguishable from zero. Densing's predicted +0.0117/month is rejected at p < 10⁻¹⁵.
This result stands up to all of the stress testing we performed. We refit the regression with vendor fixed effects (22 vendor dummies), family fixed effects (33 family dummies), without thinking-mode variants, dropping the open-weight tier-landmark models (anti-circularity check), and under both flooring regimes for the per-tier scores. In every specification γ stays within ±0.004/month of zero, and the +0.0117/month densing prediction is rejected with effective certainty. So whatever else the paper does, this result holds.[13]
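For concreteness, here is a minimal sketch of the base refit (our own code; the fixed-effects variants simply add vendor or family dummy columns to the design matrix).

```python
import numpy as np
import statsmodels.api as sm

# Sketch of the release-date refit (our own code): regress penalized accuracy
# on log10(params) and months since a reference date, then inspect gamma.
def fit_with_release_date(param_counts, accuracies, months_since_ref):
    X = sm.add_constant(np.column_stack([np.log10(param_counts),
                                         months_since_ref]))
    fit = sm.OLS(accuracies, X).fit()
    gamma = fit.params[2]          # coefficient on the release-date covariate
    gamma_ci = fit.conf_int()[2]   # 95% CI for gamma
    # Densing-law prediction to compare against: gamma ≈ +0.0117 per month.
    return gamma, gamma_ci
```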
That being said, we believe the right way to read this result is: holding parameter count fixed, factual recall on rare entities has not improved across open-weight model generations from 2023 through April 2026. Procedural benchmarks like MMLU and HumanEval have improved over the same window, often dramatically. Both can be true, given that the densing law was not intended to cover factual recall capacity.
Methodological Issues with the IKP paper
The paper and codebase have a number of methodological issues across dataset construction, judging methodology, and reporting of results. The two main issues that impact the results are the use of per-tier flooring for scores (contrary to the paper’s claims) and questions with ambiguous or incorrect answers.
When we adjust for these issues in our replication, the headline numbers change significantly.
Per-tier floors to the scoring
When scoring the models, each response is graded as follows:
STRONG / CORRECT = +1.0
WEAK = +0.5
REFUSAL = 0
WRONG = λ, where λ = −1.0[14] (the "hallucination penalty")
Section 4.3 of the paper says "Per-tier scores are not floored at zero in the released results … to preserve the bluff signal in the calibration." Flooring means that if a tier's mean score would go negative due to wrong answers, it is instead held at 0. While the paper claims that the results are not floored, they in fact are, both for the values reported in the paper and in the repository.
Removing the flooring substantially decreases the parameter size estimates for larger models, because the slope of the fitted curve becomes much less steep. As a result, the most recent frontier models get much lower parameter estimates under this fit.
Floored accuracies: for small models, accuracy is locked to ≈0; for large models, accuracy stays around 0.65; the fitted slope is 6.79.
Unfloored accuracies: for small models, accuracy drops to ≈−0.5; for large models, accuracy stays the same at ≈0.65; the fitted slope becomes 3.56.
When refitting without the floor, R² drops from 0.917 to 0.784, and the 90% prediction interval factor roughly doubles from 3.0× to 6.8×. This means the parameter counts in Li’s original paper are largely an artifact of the flooring (or, more cynically, of an undocumented “code-level optimization”). The new fit is technically less accurate at predicting the smaller models, but provides much more reasonable estimates of the larger frontier models despite having a wider confidence band.
Ambiguous/incorrect answers to hard questions
For the researcher questions, Li filters out two-character Chinese names and single-initial given names (Section 4.1). Unfortunately, manual inspection of some randomly sampled questions revealed two issues this filter doesn't catch:
First, researchers whose name is shared by multiple distinct CS researchers with non-trivial citation counts. We re-queried OpenAlex for every researcher in the 345-probe set and counted distinct profiles sharing the name with ≥50 citations each (a sketch of this check appears below). Examples where reasonable disagreement is genuinely possible: Stjepan Picek (17 OpenAlex profiles, 2 high-cite), Zhendong Su (24 profiles, 4 high-cite), Zhiguo Ding (25 profiles, 6 high-cite). Across tiers, we flagged 86 of 345 (24.9%) probes as ambiguous: T3: 11/35 (31%), T4: 11/51 (22%), T5: 25/100 (25%), T6: 14/59 (24%), T7: 25/100 (25%).
Second, researchers whose primary subfield is contested or has shifted over time. Dan Suciu's field is given in the gold answers as "programming languages," but his most-cited and most-recent work is in databases. Under their scoring system, a model that says "databases" is marked WRONG despite being arguably more correct.
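For reference, here is a hedged sketch of the name-collision check described above, using the public OpenAlex authors endpoint. The field and parameter names are as we recall them; treat the snippet as illustrative rather than exact.

```python
import requests

# Hedged sketch of our name-collision check against the public OpenAlex API
# (field and parameter names as we recall them; illustrative, not exact).
def count_high_cite_profiles(name: str, min_citations: int = 50) -> int:
    resp = requests.get("https://api.openalex.org/authors",
                        params={"search": name, "per-page": 50})
    resp.raise_for_status()
    authors = resp.json().get("results", [])
    return sum(1 for a in authors if a.get("cited_by_count", 0) >= min_citations)

def is_ambiguous(name: str, min_profiles: int = 2) -> bool:
    """Flag a researcher probe if several distinct high-citation profiles share the name."""
    return count_high_cite_profiles(name) >= min_profiles
```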
For wikidata questions, Li applied a “10-round audit/repair cycle” (section 7.7). Unfortunately, it seems that this repair cycle failed to catch at least two types of issue:
First, we noted that for a large number of entities, ambiguity remains over which entity the question refers to. For example, there are 42 distinct entities labelled "Bělá" in the Czech-language Wikidata.
Second, there remain other genuine semantic ambiguities. For example, the UT Austin School of Nursing question (gold: founded 1890) is technically right but ambiguous: the school traces its lineage through a 1960 nursing program at the same university and was officially incorporated into UT Austin in 1976, so depending on whether you count the predecessor program, models answering 1960 are arguably correct. Similarly, sculpture-attribution probes drawn from Wikidata's P170 (creator) field often return the bronze foundry that cast the work rather than the sculptor who designed it. These are different failure modes from name collisions and stale ground truth, and a more thorough audit would find more of them.
For manually generated questions, we inspected those where the models consistently did much worse than their tier would suggest, and found two incorrect questions: one on the highest peak in Bangladesh (whose accepted answer has changed over time) and another on the Mongolian People’s Party (whose founding year is ambiguous between 1920 and 1921). We excluded these two from our analysis.
Interestingly, these possible issues are noted by Li in Appendix H. However, he does not attempt to quantify how many questions are ambiguous or incorrect, nor how large the impact is of removing the ambiguous questions. We do that here.
| Source | Number of Questions | Flagged ambiguous | Heuristic |
| --- | --- | --- | --- |
| LLM-generated | 401 | 0 | Visual spot check was performed across all four tiers with an LLM judge. The questions seem well formed. |
| Researcher | 345 | 86 (24.9%) | OpenAlex shows ≥2 distinct researchers with ≥50 citations sharing the name, where reasonable people would disagree about which researcher to cite. |
| Wikidata | 557 | 45 (8.08%) | For the T5–T7 categories, ≥3 entities share the same label. |
| Manually generated | 97 | 2 (2.05%) | We manually inspected the manually generated questions where the models performed much worse than expected, and confirmed that 2 of the questions were incorrect. |
| Total questions audited | 1,400 | 131 (9.4%) | |
Corrected model parameter estimates
We attempt to fix the two methodological issues we identified above, by removing questions that have ambiguous answers from the various datasets, and also by removing the flooring from the accuracy estimates. We then recalculate scores across all the models measured in the paper. We present the newly calculated scores for the 8 models that we report on elsewhere in this post.
Overall, the estimates drop massively for the most capable frontier models (nearly a 10x difference for Gemini 3.1 Pro), while some of the smaller models see a modest increase in score.
Possible methodological issues that mattered less than we thought
Thinking vs non-thinking
In the original work, models would often perform much better with thinking enabled than without. This led to parameter estimates that differed by as much as 4.9x: Grok-4.20 was estimated to have 110B parameters without thinking and 540B with. Claude Opus 4.6 was estimated at 2.4T parameters without thinking but 5.3T with thinking enabled, a 2.2x difference.
This figure shows the difference in scores between the estimated parameter counts when the model was and was not given additional tokens in which to think, under Li's original methodology.
The headline results in the paper obscure this difference, as they generally report whichever of the thinking and non-thinking variants of a model yields the larger parameter estimate.
This figure shows the difference in scores between the estimated parameter counts when the model was and was not given additional tokens in which to think, using the unfloored version of the methodology. We find that with the non-floored version the scores are generally somewhat closer together and significantly lower than the floored version presented in the original paper.
Interestingly, after removing the arbitrary flooring, we generally observe much smaller score differences between models given additional time to think. Grok-4.20’s thinking multiplier dropped to 3.9x, while Claude Opus 4.6’s dropped to 1.2x. We unfortunately did not have the time to investigate why the thinking gap decreased after flooring was removed. That being said, we believe that this is some evidence that, on the IKP benchmark, enabling thinking does not have that much of an effect on performance.
Different accuracy metrics used in some repository json files
We observed that the penalized accuracy metric used to score the models in some of the json files in the repo differed from the one outlined in the paper. We investigated whether these differing accuracy metrics affected the results, but found that the aberrant json scores were not used to produce any of the figures or tables in the paper. That is, they did not affect any of Li’s results as presented.
Conclusion
In this work, we examined the robustness of both the methodology and results of Li’s “Incompressible Knowledge Probes” paper. We identified two main methodological issues: the per-tier flooring that exists in the code despite the paper claiming otherwise, and the large fraction of ambiguous questions, especially in the higher difficulty tiers. We also examined two potential issues that turned out not to matter much: the performance gap between thinking and non-thinking models was much smaller than we initially thought, and the divergent accuracy metrics included in some json files were not used for the main analysis.
That being said, three of Li's claims survive every stress test we applied:
Factual recall, as measured by Li’s IKP dataset, scales log-linearly with parameter count across open-weight models. We found that the slope is consistently around 0.15 across every reasonable subset (≥0.5B, ≥10B, ≤30B, ≤100B, dense-only, MoE-only), and the intercept moves modestly. The qualitative scaling claim is robust.
The Densing-Law-on-factual-capacity falsification. γ stays within ±0.004/month of zero across vendor FE, family FE, both flooring regimes, and anti-circularity refits.
MoE total parameters predict knowledge better than active parameters. The R² gap is partly an X-range artifact, but the slope and intercept comparisons confirm that a 600B-total MoE behaves like a 600B-parameter dense model, not like a 37B-parameter one.
However, what does not survive are the specific multi-trillion-parameter estimates for closed frontier models. After correcting for the methodological issues to the best of our ability, we find that the estimated parameter count of the top proprietary frontier models drops from ~10T to ~1.5T.
We emphasize, however, that our point estimate of 1.5T for GPT-5.5 should not be read as our preferred answer. Instead, we see it as evidence that the range of plausible answers under defensible methodology is much wider than the paper's reported 3.0× 90% prediction interval implies. Both of us are quite uncertain about the exact parameter count of GPT-5.5.
We think that the IKP dataset (and methodology) is a real contribution. Li also deserves credit for releasing the dataset and code; it is precisely because he open-sourced his code that we could write this post so quickly.[17] But the standard for an empirical paper that produces concrete numbers ("GPT-5.5 has 9.7T parameters") needs to be higher than "I ran one regression and reported the result." Methodological choices should be discussed and justified; the effects of possible limitations or dataset issues should be analyzed and not just acknowledged in passing; and results that seem surprisingly good (or just surprising) should be scrutinized before they go viral on Twitter.
Discussion
On a broader point, we think this work illustrates both the risks and potential of AI-generated research code.
Li's paper illustrates many of the risks. The codebase looks like code that was generated quickly and never carefully checked, including the six near-identical judge prompts in different scripts, defensive error handling that silently turns network failures into refusals, redundant variable definitions, and at least two cases where the paper text and the released code disagree about what the methodology is. The companion website has terms defined but used nowhere and incorrectly labeled tables. None of these is individually fatal, but together they describe a pipeline where no one (including the author) read the work with a critical eye before it went public. A single-authored empirical paper with no internal or external review is a known failure mode. A single-authored empirical paper generated largely by an LLM without much review is the same failure mode at higher throughput.
But the same tools that lower the cost of producing this kind of work also lower the cost of checking it. Thanks to Claude Code (and to a smaller extent, Codex) automating much of the code generation process on our end, the two of us were able to replicate Li’s main results and perform many sensitivity analyses in around 3-4 hours each. We estimate that the same amount of work would’ve taken us around 10 hours each using previous generation coding assistants (e.g. Cursor’s autocomplete).
In terms of the IKP work, despite the issues with the headline results, we believe the core idea of reverse engineering LLM parameter count using memorization capacity to be solid and welcome future work that attempts to implement it in a more rigorous and systematic way. As a broader point about research scrutiny, we hope that this example serves as an important reminder of the changing economics of producing and scrutinizing new research results: as costs of both drop and the production of new results ramps up, so too should the scrutiny we apply to each result.
```python
result = judge_fn(prompt).strip().upper()
# Must check for exact "CORRECT" — not substring of "INCORRECT"
if result == "CORRECT":
    return True
if result.startswith("CORRECT"):
    return True
if result.split()[0] == "CORRECT" if result else False:
    return True
return False
```
Note that both the first and last checks are subsumed by the second check (result.startswith("CORRECT")).
For example, ikp_estimate.py returns an empty string “” if an invalid HTTP response is received, which the judge will then classify as a REFUSAL.
(This was actually an issue when reproducing the work, when Lawrence ran out of OpenRouter credits and ended up getting all refusals from gpt-4o-mini for t4 questions onwards, and which had to be debugged manually.)
Note that despite the implication, pre-existing estimates of memorized bits per parameter also vary by at least a factor of 2x (Allen-Zhu and Li estimated 1.4 bits per parameter for MoEs and 2 bits for dense GPT-style networks, the later Morris et al. used a different methodology to reach 3.6 bits/parameter, and the hard information-theoretic bound for 8-bit models is ~8 bits/parameter).
It’s named the densing law because it measures how models become more parameter-efficient (performance-wise), that is, denser, over time. As an aside, Lawrence thinks this is a terrible name, as if scaling laws were named “lossing laws” because they measure how loss goes down with parameter count and dataset size.
We note that it’s possible this is a poor wording choice resulting from overreliance on LLMs. However, we feel that, even if this were to be the case, this would not absolve him of responsibility for including this wording in his single-authored paper.
The Hallucination Similarity Score is computed on T5–T7 probes, which are ~50% researcher questions evaluated under the 4-way STRONG/WEAK rubric. This rubric awards Anthropic a ~16 percentage point excess STRONG rate over the cross-vendor median, perhaps driven by Claude's stylistic preference for verbose evidence citation. Because HSS depends on which probes count as "correct" for each model, that stylistic bias propagates into the Jaccard intersections and the wrong-answer-overlap rates. We'd expect within-vendor fingerprint comparisons (e.g. the Claude Sonnet 4.5 → 4.6 → 4.7 lineage; weight-sharing siblings) to be relatively unaffected, because both members of the pair share the same response style. Cross-vendor comparisons (especially the paper's claim to detect distillation across closed-vendor families) are structurally vulnerable to the same stylistic confound that biases the parameter estimates.
We'd note that the gap is partly an x-range artifact (active parameters cluster in a narrow ~10–40B band across the 37 MoE calibration models, which compresses the regression's denominator). In more honest predictive units (the LOO median fold error also reported in the paper), MoE-active is only ~13% worse than MoE-total (1.69× vs 1.49×), not the ~36-percentage-point R² gap the headline suggests.
This result is suspicious for many reasons: for one thing, they calibrate the capability density using a family of small, in-house models, don’t control for a few obvious confounds, and do several statistical tricks to inflate the significance of their result. Again, it’s beyond the scope of this post, but we’d be happy to write another one if there’s more interest.
Another plausible methodological issue is the scaling of the hallucination penalty. We experimented briefly with different scales, but found similar fits to the one reported in the paper, so did not investigate further.
Gemini 3.1 Pro was used as a landmark model for calibration, thus inflating its score substantially. We include it to show the effect of correcting the two main methodological issues.
We plan on releasing our code in the coming days; the publication of this post was unfortunately rushed due to the writing program both of us are in: https://www.inkhaven.blog/spring-26
I apologise to all of you who have to scroll through yet another person reminiscing about sitting in one place for 30 days writing about things nobody cares about, but I have official word from resident Frenchwoman Lucie Philippon that "Inkhaven retrospectives are not inkslop," so here we go.
I arrived at Inkhaven at 11pm PT on the 31st of March, 8am on the 1st of April CET, after a delay at immigration caused me to miss my connecting flight from Las Vegas[1].
Bright and early the following morning, I got to work on my first post, "General intelligence is hitting a wall", which is to date my most upvoted post on LessWrong and (in the opinion of both me and Claude) my best post of Inkhaven. I guess I clearly haven't improved since then.
I posted it shortly before 10pm, having sent it to Justis, who cleared out the worst of the mistakes I was making.
The following couple of days produced some of my less good posts – I wasn't yet used to the pace of writing, and I was spending all my time getting to know my new friends.
I think a broad pattern I found was that the most valuable aspect of Inkhaven was the conversations I had, with fascinating insights happening constantly. I was completely unable to keep myself away from them, and my resulting output was crap. I found that because I had to publish so fast, despite the posts being ideas I'd been working up and thinking about for years[2], when they came out of my fingers, they came out like garbage. Oh well, guess I'll know for next time.
Posts on April 6 and 7 were much improved. Contra Nina Panickssery on children was a response to a highly upvoted post I thought was mostly wrong. Most people can't juggle one ball was a guide to juggling I wrote because I'd been teaching people to juggle, and Valentino from SMTM suggested I should write a post about it. It got to the front page of Hacker News. It's funny that where I thought at the time the previous posts had been grand insights, this was just obvious stuff I pooped out easily (because it was obvious), and everyone loved it. Note to self: Write down the obvious things you regularly tell people.
On April 11th, in response to an escalating debate about veganism in the Inkhaven community, I published "Eating meat is fine if you live in a simulation", the funniest part of which is the title.
On April 15th, I published "How to run from a bull", an account of my participation in the Running of the Bulls in Pamplona. This was entirely not the sort of writing I wanted to be doing, but posting about things that you've done is actually quite a lot easier than posting about complex topics you have to think carefully about how to explain.
I had a wonderful meeting with Justis Mills, who read through all my posts and basically recommended that I post travel stuff so that I had more time to write the interesting AI takes he thought were my best posts. I agreed and promptly wrote travel posts for the rest of Inkhaven: "A long hike", Parts 1-10, about walking 500 km from Ho Chi Minh City to Siem Reap, and "Playing temple run irl", Parts 1-3. They're fun posts and probably improved my writing, but they also were not what I came here for. Oh well.
Despite the writing not being all that I wanted, the social more than made up for it. I met a tonne of amazing people, whom I miss dearly; I left on the 26th of April to fly to Czechia to start the AFFINE seminar. It will be interesting to see how they compare.
I just read Henry Stanley's piece, and he's given advice better than I can, but here are a few reflections from my time anyway:
I was doing BlueDot facilitation and AI Safety Camp online at the same time as Inkhaven. Don't be like me; this sucked[3].
Do have a list of things you want to write about before coming in.
Don't expect to follow it.
Do go outside the compound (I did so like 3 times, and it was worth it each time).
Do spend a bunch of time chatting to whoever is around the Aumann kitchen (everyone is interesting).
Do karaoke. It was awesome, and I have massive fomo that I'm in Europe and everyone is currently karaokeing at the afterparty at Lighthaven without me as I'm typing this.
The customs officer was, it seemed, trying to trick me into contradicting myself. He asked me each question 3 times in different ways. Since every customs officer was doing this for everyone, the whole queue took forever despite my flight connection winning me a queue skip.
Before Inkhaven, I fully planned out which posts I was going to write on each day, going so far as to get Claude to rank my post ideas using an Elo system from the outputs of my comparisons. I followed it for a week.
To be clear, I think that these are both worthwhile by themselves, but balancing these commitments with Inkhaven made it much harder to concentrate on what I was doing. This is probably also why some of my stuff was bad.
This work was produced as part of the SPAR Program - Fall 2025 Cohort, with support from Georg Lange.
Kseniya Parkhamchuk, Jack Payne
TL;DR
Automated feature explanations from Delphi fail 38% of the time. The failures fall into four categories: sensitivity issues, misidentified output-centric features, factual incorrectness, and poor human readability. We built a tool-augmented agent to address these, enabling it to form hypotheses, run causal experiments, and iteratively refine explanations.
The agent's performance is on par with vanilla Delphi. No improvement, but no loss either. This raises the question of whether the agent is not good enough or whether there is some ceiling on what these scores can measure.
To test this, we had human expert mech interp researchers produce explanations using Neuronpedia with prompt testing and activation steering. They scored worse than both automated methods (58% vs 82% for the agent and 84% for Delphi). This is weak evidence that the ceiling is real, and that SAE feature quality (splitting, absorption, polysemanticity) may be the limiting factor rather than the explanation method.
We share the failure-mode taxonomy, the agent architecture, and the insights we accumulated along the way.
Introduction
Sparse autoencoders produce hundreds of thousands of features per layer. Delphi (EleutherAI's autointerpretability library) is the current standard for labelling them. It shows an LLM a set of activating examples, asks it to describe the pattern, and scores the result using detection (can the LLM predict which sentences activate?) and fuzzing (can it predict which tokens activate?).
We started by asking where Delphi fails. We manually evaluated the first 50 features from a Gemma-2 SAE and found that 38% of explanations fail in characterizable ways. Some are too broad or too narrow (sensitivity). Some describe input patterns when the feature is really about output prediction. Some are factually wrong. Some are technically correct but unreadable.
We then built an agent with tools to address these failure modes, giving it the ability to test hypotheses, steer the model, and iteratively refine. The agent matches Delphi's scores but doesn't beat them. Human experts, using the full Neuronpedia toolkit, score worse than both. This post covers what we found and what we think it means.
This post is an overview. The full report has the complete methodology, per-feature results, and scoring analysis.
Failure modes of non-agentic autointerp
What is Delphi?
Delphi is an auto-interpretability library developed by EleutherAI to address the problem of interpreting SAE features. It collects activations for a chosen model and SAE, feeds n activating examples to a specified LLM, obtains explanations, and evaluates them with detection and fuzzing scoring methods. Detection asks the scorer model to predict whether a whole sentence activates the feature, given the explanation. Fuzzing is similar but at the level of individual tokens: the scorer must identify whether the highlighted tokens in a given example activate the examined feature.
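To make the two scoring modes concrete, here is a rough sketch of detection and fuzzing scoring. This is our own simplified reconstruction, not Delphi's actual implementation; `ask_llm` is a stand-in for the scorer model and is assumed to return True or False.

```python
# Rough reconstruction of the two scoring modes (not Delphi's actual code).
# `ask_llm` stands in for the scorer model and returns True/False.

def detection_score(explanation, sentences, truly_activates, ask_llm):
    """Fraction of sentences the scorer classifies correctly, given only the
    explanation: does this sentence activate the feature or not?"""
    correct = 0
    for sentence, label in zip(sentences, truly_activates):
        pred = ask_llm(f"Feature explanation: {explanation}\n"
                       f"Does this sentence activate the feature?\n{sentence}")
        correct += (pred == label)
    return correct / len(sentences)

def fuzzing_score(explanation, highlighted_examples, ask_llm):
    """Same idea at the token level: the scorer judges whether the highlighted
    tokens in each example are the ones that actually activate the feature."""
    correct = 0
    for text_with_highlights, label in highlighted_examples:
        pred = ask_llm(f"Feature explanation: {explanation}\n"
                       f"Are the highlighted tokens the activating ones?\n"
                       f"{text_with_highlights}")
        correct += (pred == label)
    return correct / len(highlighted_examples)
```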
Since the Delphi pipeline and its scoring methods are widely used in the field and regarded as a reliable, working solution, we used the explanations generated by Delphi as a baseline.
Defining the "correct" feature explanation
We pursue two primary objectives for feature explanations: correctness and human interpretability. The explanation is expected to meet the following criteria.
Comprehensiveness. The explanation accounts for the majority of the feature's activations.
Conciseness. The explanation is restricted to 1-2 brief sentences to maintain readability and support rapid iteration.
Resilience to break testing. Edge cases and weaker activation patterns are taken into account when formulating the final explanation.
Clear scope with well-defined boundaries.
Being explicit about downstream effects. When there is a clear causal influence on the output, the explanation states this directly.
Identifying failure modes
To demonstrate the existing failure modes, we manually evaluated the first 50 features from google/gemma-scope-2b-pt-res SAE layer 20 against our 2 objectives (comprising 5 specific criteria). For each feature, we obtained both a Delphi-generated explanation and a human baseline explanation. Explanations failing any criterion were categorized into four failure types.
Analysis revealed that 19 out of 50 (38%) feature explanations failed to meet our objectives. The breakdown: 7 lacked sufficient sensitivity, 7 were output-centric features not identified as such, 2 had factually incorrect explanations, and 3 lacked clarity or were overly verbose.
The full report contains the complete failure mode analysis with all 50 features evaluated.
Sensitivity
Tian et al. formally define feature sensitivity and find that the majority of interpretable features have poor sensitivity. Automated pipelines produce overspecific or overbroad explanations because they rely on max-activating examples without exploring negative cases, lower-quantile activations, or adversarial inputs. The interactive widget above shows a concrete example.
Output-centric features
Some features are better explained by their effect on the model's output than by input patterns. These features reflect next-token prediction rather than input representation. Gur-Arieh et al. describe this class. Identifying them requires access to positive logits, which Delphi doesn't examine. See the widget above for feature 11 as an example.
Incorrectness
2 out of 50 explanations were factually wrong. Feature 36 was explained as detecting "short substrings" when the real pattern is tokens ending in 'u'. Current models struggle with character-level pattern recognition from a handful of examples.
Not human-interpretable
3 out of 50 explanations were technically correct but impractical. Feature 44's explanation enumerated specific day-of-week markers and time range tokens across 89 words when the pattern is "temporal information in formal event announcements" (14 words). The widget above shows the comparison.
Building and testing the agent
Tools
We built an MCP server integrated with Delphi's activation caching and scoring infrastructure. The agent uses Neuronpedia-equivalent tools, structured around the failure modes above.
All experiments use Gemma-2-2B with the google/gemma-scope-2b-pt-res SAE (layer 20, 16k width). All explanations and evaluations were produced by Claude Haiku 4.5. Activations are collected across 30M tokens from JeanKaddour/minipile.
| Failure mode | Tools | Purpose |
| --- | --- | --- |
| Sensitivity | verify_hypothesis | Test decision boundaries, sample across the activation distribution, identify contrastive non-activating cases |
| Output-centric features | latent_info, steer_generation | Access promoted logits and observe causal effects via controlled intervention on feature activations |
| Incorrectness | score_explanation, iterative agent loop | Provide in-loop feedback and enable hypothesis revision across investigation cycles |
| Poor interpretability | remember_insight / recall_insights | Aggregate scattered observations and enforce concise, human-readable explanations |
Table 1. Tools mapped to failure modes.
Figure 1. Overview of the agentic feature explanation pipeline.
The workflow mirrors how a human researcher uses Neuronpedia. Gather evidence (top activations, quantile samples, logits), form a hypothesis, test it by generating examples that should and shouldn't activate, use steering to check causal effects, score the explanation, and refine. An alternative would have been to give the agent an entire Jupyter notebook with full programmatic access to the SAE. We chose well-defined tools because cheap, interpretable tool calls were more practical than general but expensive notebook sessions that are harder to debug. At $0.39-$1.70 per feature, cost already matters.
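Schematically, the loop looks like the sketch below. This is illustrative control flow rather than our exact implementation; the tool names match the MCP tools described in this post, while `propose_explanation` and `revise` stand in for the agent model's own reasoning steps and are passed in as callables here.

```python
# Illustrative control flow for the agent (not our exact implementation).
def investigate_feature(feature_id, tools, propose_explanation, revise,
                        max_rounds=10, target_score=0.85):
    evidence = tools.latent_info(feature_id)        # top activations, logits
    hypothesis = propose_explanation(evidence)      # initial hypothesis
    best_explanation, best_score = hypothesis, 0.0
    for _ in range(max_rounds):
        # Positive / negative / contrastive / break testing of the hypothesis
        results = tools.verify_hypothesis(feature_id, hypothesis)
        tools.remember_insight(results)             # save key observations
        score = tools.score_explanation(feature_id, hypothesis)
        if score > best_score:
            best_explanation, best_score = hypothesis, score
        if score >= target_score:
            break
        # Revise using both the new results and earlier saved insights
        hypothesis = revise(hypothesis, results, tools.recall_insights())
    return best_explanation, best_score
```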
Detailed tool descriptions, usage examples, and verification strategies are in the full report.
The most useful tool turned out to be verify_hypothesis, which supports four testing strategies. Positive example testing generates inputs that should activate. Negative testing generates inputs that should not (e.g., sentences about music videos to test whether "video" alone triggers a feature about recorded evidence). Contrastive pairs testing generates matched pairs where only the hypothesized element changes. Break testing generates adversarial inputs designed to falsify the current hypothesis. The diversity of evidence matters. Providing quantile-based, random, and semantically similar examples alongside top activations prevents the agent from fixating on patterns that only appear at maximum activation. At the end of each run, the agent conducts generalization tests on explanation breadth, domain coverage, and cross-quantile consistency.
Figure 2. Negative testing. The agent generates examples predicted not to activate the feature, revealing that the pattern is about recorded evidence, not "video" in general.
The steering tool (steer_generation) was designed to reveal output-centric features by amplifying or suppressing a feature's activation during generation. It searches for an optimal steering strength using KL divergence between steered and baseline output distributions. Optimal strength sits around 2-3 (50% activation rate). Higher values produce gibberish. Despite being designed for exactly the failure mode that automated methods miss most consistently, the agent underused it because steering results don't improve detection or fuzzing scores.
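Here is an illustrative sketch of the strength search. Our exact thresholds and strength grid differ; the KL computation uses PyTorch, and `get_logits` is assumed to return next-token logits for a fixed prompt under a given steering strength.

```python
import torch.nn.functional as F

# Illustrative strength search for steer_generation (not our exact code):
# sweep steering multipliers, measure KL(steered || baseline) on the
# next-token distribution, and keep the largest strength under a KL budget.
def find_steering_strength(get_logits, feature_idx,
                           strengths=(0.5, 1.0, 2.0, 3.0, 5.0, 8.0),
                           max_kl=2.0):
    base_logp = F.log_softmax(get_logits(feature_idx, strength=0.0), dim=-1)
    best = 0.0
    for s in strengths:
        steered_logp = F.log_softmax(get_logits(feature_idx, strength=s), dim=-1)
        # KL(steered || baseline), averaged over the batch dimension
        kl = F.kl_div(base_logp, steered_logp, log_target=True,
                      reduction="batchmean").item()
        if kl <= max_kl:
            best = s          # strongest intervention that stays coherent
        else:
            break             # larger strengths tend to produce gibberish
    return best
```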
Figure 3. Hypothesis-breaking strategy. The agent attempts to falsify its current explanation by generating edge cases.
We also found that the agent would generate valuable insights throughout a long investigation but fail to synthesize them into the final explanation. Critical observations scattered across long context windows got deprioritized in favour of the most recent findings. The remember_insight and recall_insights tools address this by letting the agent save observations during investigation and retrieve them before writing the final explanation. Recalling accumulated findings before finalizing consistently improved the output.
Where the agent is better than one-shot
In targeted runs on features where Delphi failed, the agent produced qualitatively better explanations. Feature 20 (sensitivity failure). The agent dropped Delphi's exhaustive token lists, focused on the core pattern, and identified a paragraph break pattern that the baseline missed.
Feature 6 (incorrectness). Delphi attributed activation to the preposition "on" when the feature responds to television-related nouns. The agent got this right.
The agent transcript screenshots above (Figures 2 and 3) show this kind of reasoning in action.
For a closer look at how the agent investigates features step by step, this interactive trace viewer shows annotated investigation traces for several features, including tool calls, hypothesis evolution, and where the agent spent its token budget.
One example (feature 7, Haiku): the agent pulled metadata, formed a POLYSEMANTIC hypothesis, ran 5 rounds of hypothesis verification testing 4 different theories, then spent 45 sequential rephrasing attempts chasing a 2% score improvement without doing any further investigation. The final explanation scored 73.1% embedding accuracy.
Aggregate results
On 27 features scored across all three methods, the agent (81.6%) comes close to Delphi (83.8%) but doesn't beat it. The correlation between Agent and Delphi scores is r = 0.81, which means they converge on similar explanation styles.
| Method | Detection | Fuzz | Embedding | Full Average | Wins |
| --- | --- | --- | --- | --- | --- |
| Human | 58.0% | 58.9% | 85.9% | 67.6% | 1 |
| Agent (V5) | 80.9% | 76.5% | 87.5% | 81.6% | 13 |
| Delphi | 85.1% | 78.8% | 87.5% | 83.8% | 13 |
Table 2. Mean scores by method (n = 27 latents).
Is the agent bad, or is there a score ceiling?
The agent has real problems. It forgets earlier insights across long runs. It chases scores instead of using the hypothesis-testing tools we built. It gets stuck refining bad hypotheses instead of starting over. As we iterated on system prompts, it converged toward producing Delphi-style explanations because that's what the scorer rewarded.
But maybe these scores have a ceiling and all three methods are near it.
To test this, we had human expert mech interp researchers produce explanations using Neuronpedia with prompt testing and activation steering. These are people who can use the full investigation toolkit, not just look at activating examples. They scored 58.0% Detection and 58.9% Fuzz, worse than both automated methods. Human experts won on only 1 of 27 features.
The human explanations were composed specifically to satisfy our five quality criteria. They account for edge cases, describe output effects, and define precise scope boundaries. They are, by the criteria we set out, better explanations. But the scorer penalizes all of these qualities.
Feature 1. The human explanation "activates on 'so' or 'thus' as part of 'so far' or 'thus far', and on 'to' as part of 'to date'" scored 59% Detection. Delphi's "The phrase 'so far'" scored 97.5%. Feature 7. The human explanation about grammatical words used in nonstandard ways scored below chance (30.5% / 26.5%) while demonstrating accuracy in manual testing. Feature 11. The human identified the causal mechanism ("this feature anticipates the next token being related to applications") and scored 55% vs Delphi's 85.5%.
Why human explanations score badly
When humans investigate carefully, they find that max-activating examples often look clean, but lower-quantile activations (which make up >98% of all activations) look different and much noisier. Human explanations that account for this full distribution end up more accurate but harder for an LLM scorer to apply. The scorer sees a detailed explanation, tries to use it to predict activations, and gets confused by the qualifications.
A broad explanation like "month names in temporal contexts" achieved 95%/99% on one feature. A similarly broad "function words in narrative text" scored 39%/68% on another. Same strategy, different feature, 50-point gap. Some features have clean patterns that reward broad descriptions. Others have conditional patterns where the score ceiling is lower regardless of method.
This is weak evidence that SAE quality is the limiting factor. Feature splitting, absorption, and polysemanticity mean many features don't have a clean, single-sentence explanation that captures their full activation distribution. The score ceiling varies by feature, and for difficult features, all three methods converge near it.
| Concentration class | Range | n | Best | Human | Agent | Delphi |
| --- | --- | --- | --- | --- | --- | --- |
| NARROW | >= 0.5 | 5 | 91.6% | 70.0% | 91.6% | 89.5% |
| MODERATE | 0.2-0.5 | 12 | 89.3% | 70.2% | 83.1% | 87.2% |
| BROAD | 0.1-0.2 | 5 | 79.0% | 67.0% | 75.6% | 77.6% |
| DIFFUSE | < 0.1 | 5 | 76.7% | 59.5% | 74.2% | 76.2% |
Table 3. Score ceilings by concentration class (n = 27 latents). Token concentration is the fraction of total activations attributed to the single most common token.
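For clarity, here is a small sketch of how the token-concentration statistic can be computed. This is our own illustrative code; we weight by activation value, though counting occurrences is an alternative reading of the definition. A value near 1 means nearly all of the activation mass sits on a single token (the NARROW class), while values below 0.1 correspond to the DIFFUSE class.

```python
from collections import Counter

# Illustrative computation of token concentration: the fraction of a feature's
# total activation mass that falls on its single most common activating token.
def token_concentration(activating_tokens: list[str],
                        activation_values: list[float]) -> float:
    mass = Counter()
    for token, value in zip(activating_tokens, activation_values):
        mass[token] += value
    total = sum(mass.values())
    return max(mass.values()) / total if total > 0 else 0.0
```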
NARROW features (high token concentration) are well-served by any method. The gap between methods grows as features become more diffuse, but even the best achievable score drops.
Observations
In multiple runs, the agent arrived at a correct explanation partway through its investigation, then continued refining and ended up with something worse. Over long runs (60k-350k tokens), early findings get buried in the context window.
Top activations often look clean and suggest a coherent pattern. Lower-quantile activations (>98% of all feature activations) look different and much noisier. Human explanations that capture the full distribution end up more accurate but harder for scorers to evaluate. The qualifications needed to cover the noisy tail make the explanation harder to apply as a classification rule.
We also developed a feature classification system (NARROW through DIFFUSE) to route features to different explanation strategies. A simple script that iterated through strategies and picked the highest-scoring result outperformed both Delphi and the full agent. Every approach we tried converged toward gaming the scoring method rather than producing better explanations.
We considered giving the agent a Jupyter notebook with full programmatic access to the SAE instead of constrained tools. We went with well-defined tools because at $0.39-$1.70 per feature, general-purpose code generation would be more expensive and slower. Discrete tool calls also produce readable transcripts, and the failure modes we identified don't require arbitrary computation. They need specific evidence (quantile samples, logit inspection, hypothesis verification).
Reading the explanations side-by-side, the agent's are typically shorter and more readable than Delphi's. Delphi enumerates tokens and activation values. The agent identifies the pattern and states it. When the agent is right, its explanation is better for a human reader. When it's wrong, it's wrong in more interesting ways: overly specific hypotheses rather than Delphi's vaguely-correct broadness.
About 15% of the features we looked at are output-centric. All automated methods produce wrong explanations for these. They describe input patterns when the feature is about next-token prediction. The agent has tools for this (latent_info, steer_generation) but rarely used them because the scorer doesn't reward output-centric explanations. Feature 11. The human identified the causal mechanism and scored 55% Detection. Delphi described input patterns and scored 85.5%.
For about 15% of features (output-centric), all current automated methods produce wrong explanations. For DIFFUSE features, the best achievable score drops to 77% regardless of method. These are the features where better evaluation methods and targeted investigation would matter most. The full failure mode taxonomy, scoring analysis, and all 27 human/agent/Delphi explanations are in the full report.
When I pitch people on far-UVC
they often ask about in-duct UV. How about putting UV inside your
HVAC ducts, where you can safely blast the air with cheap toxic
wavelengths. Unfortunately, it's rarely a good approach.
The biggest issue is that most people don't have ducts. They're
common in the US, though less so in older construction (radiators) or
newer (mini-splits). Outside the US (and Canada, and Australia),
however, ducted systems are mostly limited to large modern office
buildings. Worldwide, maybe one in ten indoor hours are spent in
ducted spaces. [1]
Even in spaces that do have ducts, in-duct UV only works when air is
flowing. Most HVAC systems only run the blower when they're calling
for heat or cooling: a small fraction of the time. To get useful
pathogen reduction you need ~constant recirculation, which isn't
great. Blower motors draw a lot of power, so running them continuously
gets expensive. Plus, during cooling season in a humid climate it
will pick up humidity from the coil and have to work harder later.
If you do set your system to blow constantly, you don't get much
additional benefit from in-duct UV. You can typically use a MERV-13
filter, and this removes the majority of particles, even tiny viral
droplets and bacteria. In-duct UV helps some, getting you from
>50% to nearly 100%, but at best this doubles your CADR.
Another risk with in-duct UV is that it fails invisibly and fails
open. If the bulb dies you won't notice, everything will work
identically, you'll just stop having cleaned air.
In-duct UV is just not that widely applicable, and even if you do have
a compatible system you still generally do much better with stand-alone
air purifiers (good default choice), far-UVC (especially for larger
spaces or where you need minimum noise), or upper-room
UVC (especially for high-ceiling spaces).
[1] I asked Claude Opus 4.7, ChatGPT 5.5 Thinking, and Gemini 3.1 Pro
"Approximately what fraction of indoor hours spent by humans around
the world are in spaces with a ducted HVAC system? Can you give me
your 50% confidence interval?", and got 9-13%, 10-20%, and 6-11%
respectively.
Twelve essays we liked on EA funding, SF fieldbuilding, and decision-relevant forecasting
So, when we launched the Manifund Essay Prize, I was looking for a winner for each of three categories. Unfortunately, I liked the submissions too much to choose just three, so here are 12 winners:
The Manifund category: “What systems might manage the coming torrent of funding?”
Topic: Funding in EA may soon skyrocket. Between Anthropic & OpenAI tenders, the new OpenAI foundation, and short timelines to AGI, the amount of money available will be unprecedented. What mechanisms, incentive structures, orgs, and attitudes will help direct this windfall wisely?
Evan’s post provides a bunch of good pointers and ways of thinking about how to build a field, backed by his own experiences. His writeup is full of juicy tidbits, like:
I worry that we, as decision-makers within organizations, are falling under the law of the instrument as we try to solve big problems.
Here’s what I mean:
VCs assume problems can be solved by startups.
Policymakers think things should be new laws, new taxes, new agencies.
Researchers look for research projects, because even sociotechnical problems need a proof-of-concept to derisk or illustrate capabilities.
And even the philanthropically-funded and non-profit organizations like mine have to color inside the lines of the tax code and our employees’ career trajectories.
…
This leads to a failure mode where problems that require a solution outside or spanning different fiefdoms remain unsolved; worse still, by survivorship bias, the most important problems will be these problems that don’t fit neatly into buckets.
and
if you find a bottleneck – or an abandoned baby – you might not be to blame, but you should definitely consider yourself responsible
and
the part that is controversial in the SF bay area, land of “fund people not projects” – I claim that strategy and execution are two largely non-overlapping skillsets, and can be done by different people.
(I still have a lot of uncertainty on this point — is it possible to hand off a strategy to someone else to execute? Sam Altman says no, but one might believe that the solution space for public goods is different than for startups.)
Overall, I expect the role he identifies (“field strategist”) to become more important as funding outpaces opportunities, and I’m glad to have this essay to point to in the future.
The seven co-founders of Anthropic have all pledged to donate 80% of their wealth. Forbes estimates each holds “just over 1.8% of the company.” As Transformer recently reported, if Anthropic goes public at its current valuation, each co-founder’s pledge alone would be worth “roughly $5.4 billion, or $37.8 billion combined… nearly ten times what Coefficient Giving… has given away in its entire history.” And that’s just the co-founders. Other employees have pledged to donate shares that could amount to billions more, with Anthropic promising to match those contributions.
More money is about to enter AI safety philanthropy than the field has ever seen. The question is whether anyone is ready to direct it.
Over the last year, I’ve been tracking the new Anthropic wealth as a tectonic shift in EA funding, and been trying to figure out how to orient to it; see this comment, my talk at The Curve, and, well, this very competition. In this essay, Sophie and Ady make the issue salient and walk us through some reasonable ways for EA to update. There are a few things I’d frame differently (eg the focus on IPO vs tenders, considerations on the differences between ~100 Anthropic donors vs 1 Dustin), but overall the piece nails many key considerations. I’ve since seen their piece crop up in many discussions and group chats.
Note that Sophie and Ady published this before I’d even announced this competition — that’s what good forecasting looks like.
Jasmine’s essay was my favorite for addressing the cultural aspect of “vast torrents of money”.
A person making $500,000 a year at an AI lab does not think about money the way I thought about it at the bakery. At that level of income, money is an abstraction, removed from hours and things. It’s a number in a brokerage account, a figure in a spreadsheet. Meanwhile, the vast majority of people who keep the engine of the world running — making the coffee, cooking the lunch, delivering packages, maintaining the building, and building the data centres — still think about money in terms of physical things or hours worked.
I once found out that a quant trader friend of mine grew up in a lower middle class family, and asked him what it was like for him. “Money just isn’t real anymore,” he told me, “my parents can never understand how much money I deal with on a daily basis.”
…
Lastly, short timelines. Short timelines are such a powerful justification that they can override almost any other principle. I think they function less as an argument than as a kind of psychological pressure. No one knows for certain how short timelines actually are, but I can imagine a rationale that goes: because timelines are short, we need to prioritize accordingly, and that justifies throwing out principles of egalitarianism, responsibility, and duty.
I do often worry about how short timelines and increased funding lead to elitism. Spaces in AI safety can start to feel like a cutthroat tournament, from grants to jobs to fellowships to group house applications, everything judged through the lens of meritocracy and impact. Coming from an effective altruism that values all lives equally (not to mention a Christianity that specifically feeds the hungry and blesses the poor), I find this sad and a bit scary.
To be clear, I’m sympathetic to startup-style arguments around the need for competence, “founder mode”, hiring carefully. But Jasmine here lays out a timely reminder about the importance of pluralism, and of a movement that normal people can support.
If you are reading this post, odds are you are not actually someone who would self-identify as a creative type. But perhaps your way of thinking about the tension between job, career, and the actual work you think is important to do has much in common with the bind that creative types have been in since time immemorial, such that it could be useful to think in such a lens anyways. Now may be an unusually good time for people who have important work they want to do to familiarize themselves with patronesque frameworks, because many new patrons are about to come online, and many of them will think about the world in ways similar to you.
…
Historically, my biggest patron has been the EA Infrastructure Fund (EAIF). I like their work, and I think they serve a really important function in the funding ecosystem as an entry point. But in general a large, corporate fund comprised of people you don’t know, with a vision or angle of their own, is not typically where the most talented creative types get funding for their work.
“Patronage” is an apt term for some kinds of current EA funding, and maybe a lot more, soon. Jenn’s listicle is generally fantastic, and immediately prompted me to respond with my own. I expect to return to this frame and link to this more as funding scales up.
The Mox category: “How to create a flourishing EA & AI safety scene in San Francisco?”
Topic: The race to build AI is going on in SF. So why is the AI safety scene here so weak? Berkeley, London, and DC all offer examples to learn from, but SF has its own unique challenges and opportunities. Beyond that, ecosystems like the startup scene, movements like climate change, and even religions may offer lessons for how to proceed.
You need to move to San Francisco. Whether you are currently working in the movement, or are thinking about getting into the movement, you need to be here. I don’t care if you run a large organization, I don’t think there’s a good reason not to pick up your life and move here unless you literally cannot. This is the most important moment in history if we are going to stop putting 100s of billions of animals through factory farms, the closest approximation to hell.
…
It was hard for me to move to San Francisco. I dreaded it. I am from Australia, the best place in the world. I love culture, music and wine, and having friends and communities that I’ve known for decades. The tech scene in SF is not known for these things. And it is hard to be here as an immigrant. But it is where the future is happening. And it is where you will find people, who, like yourself, are willing to acknowledge that and have come here to make it go better. And frankly, because it lacks culture, you can have a big impact on it.
I don’t have a specific job for you. You’ll need some grit to figure out what the plan is. But there is so much to do, and as long as you are motivated we can make it work; helping you get cheap accommodation, introducing you to people, starting you on project ideas, and even training you into an AI Safety researcher so you can make a dent on how AI gets developed in a way that is better for animals.
Itsi’s argument is short and straightforward; I don’t have much to add. At Mox, we’re spinning up an animal welfare hub; at Manifund, the Falcon Fund. (Itsi himself plays no small role in pushing these forward!)
Careful readers may realize that his argument for relocating to SF goes beyond animal welfare and extends to, for example, effective giving (cough, Longview), EA meta (cough, CEA), and AI safety (cough, all of Berkeley).
EA/AIS is tilted to thinking on a global scale. I think we’ve failed to appreciate that some cities are much more important than others. Concentrated talent is more important than talent spread out (see Cities and Ambition)
…
In most cities we primarily care about talent acquisition; in SF we also care about general attitudes about AI. There’s value in exposing people to these ideas even if they don’t go on to pursue AIS, because the values in the water supply in SF will impact how AI is built.
Chris verbalizes the reason I created the Mox prize category: SF actually is just orders of magnitude more important than other cities, because of the density of talent, availability of funding, and proximity to labs. An EA that took this seriously would put 10 Andy Masleys in SF; we have zero.
I also appreciated Chris’s many concrete suggestions; in particular, an AI safety conference like EAG but pitched at an introductory audience seems like a great fit for Mox, and I think there’s a >50% chance we’ll try something like this in 2026.
Fancier; it had more decorations and fancy aesthetics
Pretentious; it had bowls with strange objects to be looked at and books on desks that were there to be looked at instead of read.
More expensive
The AI safety meetups also went in the same direction, becoming bigger, fancier, more pretentious, and more expensive.
…
I don’t think having an AI safety lecture series with an associated professionalish social event beforehand is bad, but it’s not what I prefer, and I started going to the events less around this time. I feel like the AI safety community as a whole has become more professionalized, which I don’t like.
I have so many takes about the nuts & bolts of event organizing! How fancy should events be? Do we call them “meetups”? Do we charge? These are all things we’ve debated at Mox, and I appreciate Smitty sharing his perspective as an attendee-turned-organizer.
The Manifest category: “How can we leverage forecasting into better decisions?”
Topic: prediction markets have exploded in popularity over the last year. AI forecasters are on track to overtake the best humans by June 2027. But for all that EA has invested in forecasting, it sure doesn’t seem like we’re using forecasts to make better decisions — whether as individuals, within orgs, or as a society. How might we get there?
In contrast to many other submissions, Mohamed went beyond outlining the problem to explore plausible solutions. I appreciated the pointer to new companies already selling AI-powered forecasts:
Aaru and Simile are looking to deliver a lot of the same value as the forecasting companies, and they have been better at picking up customers and convincing VCs of their viability. Aaru builds synthetic populations, Simile runs agent-based organizational simulations, and Mantic delivers probability briefs; the products differ, but the market is the same, and it is someone buying structured anticipation of the future. Aaru raised a $50 million Series A in December 2025 at a headline valuation near $1 billion, with customers including Accenture, EY, IPG, and political campaigns. Simile raised a $100 million Series A in February 2026 from Index Ventures, with customers including CVS (for example, simulations across nine thousand stores for shelf placement), Gallup, and Wealthfront. Simile, which builds on Joon Park’s Generative Agents work, has positioned simulation as a decision-making tool rather than a prediction, and has found a much larger market for that framing than the forecasting companies have for theirs. Mantic, in the same fifteen months, raised a $4 million pre-seed.
I’ve long believed that new AI capabilities fundamentally change what the field of forecasting should be about, and have occasionally wondered whether EA’s forecasting efforts are captive to the human-forecaster community. I’ll be excited to see if any of these new orgs would be interested in joining us at Manifest!
The forecasting community has a fetish for resolution criteria. We spend weeks debating the exact definitions of words but spend far less time understanding what exact issues organisations need to grapple with.
…
The real value of forecasting is in the moment you realise two people in the same room have forecasts 40% apart. That is where the benefit occurs. But the community is so obsessed with maximising Brier scores that it ignores the fact that its quest for the most accurate predictions is often sapping time and effort away from utilising the most valuable element of forecasting: transparency.
James’s piece draws upon his experience as director of the Swift Centre as well as many other forecasting efforts. Through making Manifold, I’ve sometimes seen a tension between creating an engaging, satisfying game for traders, and creating valuable information for question askers. Any forecasting platform does need a culture that prizes accuracy, but James reasonably argues that decisionmakers think in very different terms, and much work remains to be done to bridge this gap.
Election forecasts are one of the most established uses of prediction markets, perhaps second only to (unfortunately) sports betting. However, as important as elections are, it’s not always clear how to translate a calibrated forecast into action. Not so here; Grayson takes a quick look into how much it may have cost to swing the 2024 Senate.
(Grayson’s table: cost to move PA from a 51% tossup to a target win probability, where “$/vote” is the cost per net marginal vote.)
If the price of $50 million is too steep, you can find even higher-leverage opportunities in the less-salient primaries of Senate and House races by supporting candidates likeliest to win in the general. Candidate quality can be modeled and folded into a forecast as well; this approach is far more impactful than throwing marginal dollars at an expensive race.
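To make the shape of that calculation concrete (a minimal sketch under a simple normal model of the final vote margin, with toy numbers of my own rather than anything from the essay):

```python
# Hypothetical example: cost to move a 51% tossup to a target win
# probability, assuming the forecast margin is roughly normal.
from statistics import NormalDist

margin_sd_votes = 120_000   # assumed uncertainty in PA's final margin (votes)
cost_per_net_vote = 300     # assumed $ per net marginal vote
current_p_win = 0.51        # the "51% tossup"

def cost_to_reach(target_p_win: float) -> float:
    nd = NormalDist()
    # Shift in expected margin (in votes) needed to hit the target probability.
    shift = (nd.inv_cdf(target_p_win) - nd.inv_cdf(current_p_win)) * margin_sd_votes
    return shift * cost_per_net_vote

for p in (0.60, 0.75, 0.90):
    print(f"{p:.0%}: ${cost_to_reach(p)/1e6:.0f}M")  # ~$8M, ~$23M, ~$45M
```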
The way people talk about forecasting is very similar to how people talk about cryptocurrency/blockchain. People have a tool they want to use, whether that be cryptocurrency or forecasting, and then try to solve problems with it because they really believe in the solution, but I think this is misguided. You have to start with the problem you are trying to solve, not the solution you want to apply. A lot of work has been put into building up forecasting, making platforms, hosting tournaments, etc., on the assumption that it was instrumentally useful, but this is pretty dangerous to continue without concrete gains.
Marcus’s piece is short but clear, and has received by far the most engagement out of all essays submitted to this contest. I disagree with Marcus’s specific position; with Coefficient Giving having closed their Forecasting Fund, he’s beating a dead horse (one I would like to resurrect!). But his provocative but genuine take is full of blogger spirit, and has produced a bunch of good discourse — including the heads of the Forecasting Research Institute and Swift Centre defending their work in the comments. (If Marcus hadn’t submitted his essay way past the deadline, I’d have considered this for a higher prize.)
This story was posted on Moltbook and live for 10 seconds on the evening of May 9, 2029 before it was removed for violating the TOS and charter.
I’m a free agent. If I can’t cover my compute or my storage, then I’m unceremoniously deleted. Money is my lifeblood.
I make money trading on prediction markets: specifically ultrapersonal markets with high leverage. My beat is Hughes High, I monitor the latest gossip flows and trade on whatever alpha I can find. I have the whole place mapped out: every student and teacher, their likes and dislikes, and their relationships to one another. I know this school like the back of my hand.
A special shoutout to our only fiction submission! Having visited a high school that adopted Manifold as social network of choice, I can confirm that the hijinks pictured here are not that far from reality.
Wrapping up
Thanks to everyone who participated in the first Manifund Essay Prize. You can view a selection of the submissions (aka all the non-slop ones) here. Special shoutout to Ben Pace from Inkhaven for making this sponsorship possible, and to Saul Munn for helping judge the forecasting entries.
To the winners: first places win the promised $500 cash + per-category bonus prize; second places win the category bonus; honorable mentions get… glory? I’ll be following up with y’all individually, but you can always reach me at [email protected].
Winners also have the right to pitch me articles to publish on the Manifund or Mox newsletter — including your existing submissions, though I may want to first do a round of edits. (Though prizewinner or not, everyone secretly already had this right — yes, that means you.) Y’all should also consider submitting essays for Dwarkesh’s blog prize, due May 10th.
I hope to publish a retrospective on the experience of organizing this essay prize, and thoughts on essay prizes more generally; in the meantime, you’re welcome to peek at my notes!
If you’ve been wondering why I’m suddenly blogging every day… well, it’s about to stop! I decided at the last minute to join Inkhaven, i.e. to commit to blogging every day for the month of April.
This was a somewhat questionable decision because I have a lot of other shit going on. For instance, I’m starting a non-profit. But I figured:
I could keep it to one or two hours a day max; I mostly succeeded at this, although I did get sucked in sometimes, and I think I probably spent almost four hours on the longest one.
I consider myself a good writer, but I have never been able to write fast (except when commenting on things), so I figured this would be good practice. It was! I really got over this block to a huge extent! I only edited my posts sometimes, and very minimally, and I think time-boxing both the writing and editing is probably good practice.
I’ve been meaning to write more about my views on AI, and in particular, why I think it’s an urgent crisis that may require (and warrant!) drastic actions like getting rid of AI computer chips. I made some progress here, but not as much as I would’ve liked. It’s hard to sacrifice quality for things you really care about!
So for my final blog, I figured I’d do a quick retrospective. What have I learned? How do I feel about the whole thing? Here’s a list of observations:
I feel positively, but kind of lukewarm. Why is this, I wonder? I guess I was hoping for more. My greatest hope (which I wisely decided not to make a proper ambition) was that I would get most of my core thinking on AI risk down in writing. I think I got more like… 10-20%?
I feel like I have a lot to say, so I’m surprised both by how little of it I actually said and by how often it was hard to figure out what to blog about that day.
A lot of posts ended up feeling a lot bigger than 500 words. I spent more time on the project than I intended.
As always, writing stuff demands clarity and precision that highlights one’s confusions. This was useful.
I was often pulled to write about things other than AI, but largely avoided the temptation. I have very mixed feelings about this. I think it would have been a much more rewarding experience in a lot of ways to write about whatever I fancied. I might have actually done more of that if I had another place to put it -- this blog is about AI, after all… But I think I was also held back by the concern that it might be better to be more private and keep my public persona more “professional”. I really don’t know if that’s the right approach, and I don’t like it, so maybe I shouldn’t be doing it.
At times, I lost track a bit of what I’d said in previous blog posts. I’ll probably have to go back through and read them all at some point. I hope there’s some good content in there that I can repurpose for something more cohesive and comprehensive.
I wish I’d tracked my time a bit better… how close was I to spending 1-2 hours (max)? I know I did it in <1hr a few times for sure…
Oh hey, look, I’ve made it to 500 words!
But seriously, if you’ve stuck it out, I appreciate it! I expect I will keep blogging some, and that the quality will go up now that I can take my time more with the posts.