
How to game the METR plot

2025-12-20 21:46:45

Published on December 20, 2025 1:46 PM GMT

TL;DR: In 2025, we were in the 1-4 hour range, which has only 14 samples in METR’s underlying data. The topic of each sample is public, making it easy for a frontier lab to game METR horizon length measurements, sometimes inadvertently. Finally, the “horizon length” under METR’s assumptions might be adding little information beyond benchmark accuracy. None of this is to criticize METR—in research, it’s hard to be perfect on the first release. But I’m tired of what is being inferred from this plot, pls stop!


14 prompts ruled AI discourse in 2025

The METR horizon length plot was an excellent idea: it proposed measuring the length of tasks models can complete (in terms of estimated human hours needed) instead of accuracy. I'm glad it shifted the community toward caring about long-horizon tasks. They are a better measure of automation impacts, and economic outcomes (for example, labor laws are often based on number of hours of work).

However, I think we are overindexing on it far too much, especially the AI Safety community, which makes huge updates to timelines and research priorities based on it. I suspect (from many anecdotes, including roon’s) the METR plot has influenced significant investment decisions, but I’ve not been in any boardrooms.

2 popular AI safety researchers making massive updates based on the Claude 4.5 Opus result today, 200+ likes, within 6 hours.

Here is the problem with this. In 2025, according to this plot, frontier AI progress occurred in the regime between a horizon length of 1 to 4 hours.

Guess how many samples have 1-4hr estimated task lengths?

Just 14. How do we know? Kudos to the authors, the paper has this information, and they transparently provide task metadata.

Figure 14 of their paper. 14 tasks in the 1-4 hr range. Illuminati confirmed?

Hopefully, for many, this alone rings alarm bells. Under no circumstance should we be making such large inferences about AGI timelines, US vs China, Closed vs Open model progress, research priorities, individual model quality, etc. based on just 14 samples. An early sign of this problem was already visible when the original METR paper was released in March 2025. The best performing model at the time, Claude 3.7 Sonnet, was estimated to have a horizon length of 59 mins. Now see its success rate distribution over task lengths:

Notice how the model has almost a 60% (±15%) probability of success on 1-2hr tasks. So why is the estimated 50% success horizon length 59 minutes?! Because it doesn’t get anything right in the 2-4 hr range. METR calculates the horizon length by fitting a logistic curve to individual sample outcomes, like the dark purple line above. Notice how 0% on the 2-4hr range leads to a very bad logistic fit (the curve is below the 95% confidence interval for the 0.5-1hr and 1-2hr ranges). We’ll come to my skepticism about the core modelling assumption, the use of a logistic curve, later. My suspicion is that Claude 3.7 Sonnet has 0% success in the 2-4hr range because they only had 6 samples for that range, most of which were from cybersecurity capture-the-flag contests. Cyber is considered a dual-use, safety-hazard capability in WMDP, which labs were careful about in early 2025. Remember, this is Anthropic.

To improve METR horizon length, train on cybersecurity contests

I promised you there's a way to game the horizon length on the METR eval. Here's how. The samples in the 1 minute to 16 hour range mostly come from HCAST. It turns out HCAST transparently tells us what each of these tasks is about.

HCAST Task Descriptions, 1.5-3.5 hours

Appendix D has a description of each task, sorted by estimated time taken. It's unclear exactly which 14 samples the METR horizon length plot uses, but the list is small enough to consider them all.

Why is this a big deal? Well, if you know what task you want to improve performance on, it's really easy to do so. You can create targeted synthetic data, or just hire vendors like Scale, Mercor and Surge to upsample such tasks in your post-training mix. Notice that most of the tasks in this range are cybersecurity CTFs and MLE tasks. OpenAI has been explicit about specifically targeting these capabilities for Codex models:

Now, I'm not saying the labs are focusing on these tasks to improve on the METR plot. They probably have other incentives for this. But this is precisely why the METR plot is unlikely to generalize: it measures exactly what the labs are focusing on! If Kimi or DeepSeek want to shoot past, they can just collect a lot of ML-training and cybersecurity prompts, and finetune on them.

Note that given there are only 14 samples in the relevant task length range, getting even 1 or 2 extra samples right significantly increases horizon length! It probably increases even more if you get the longer tasks right (8h+, from RE-Bench), by luck or overfitting, as today's Claude 4.5 Opus result showed us. In fact, perhaps because Anthropic doesn’t want to risk training on cybersecurity, we still have low accuracy in the 2-4hr range?
 


HCAST Accuracy alone predicts log-linear trend in METR Horizon Lengths

Finally, let's look at how METR estimates the 50% success horizon length. They assume a logistic relation between the probability of success and the gap between the horizon length (the estimated variable) and the task length, roughly:

p(success on a task of length t) = 1 / (1 + exp(−β · (log2(h) − log2(t))))

You infer h (the 50% horizon length) by fitting the 0/1 success evaluation data of each task. β is also a learnt parameter, governing the slope of how fast the logistic function falls from 1 to 0.
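To make this concrete, here is a minimal sketch of that fit in Python (my reconstruction of the described model, not METR's actual code), plus a toy illustration with made-up data of how much the estimate can move when only 14 samples sit in the relevant range:

```python
# Minimal sketch of the fit described above (my reconstruction, not METR's code).
# Model: p(success on a task of length t) = sigmoid(beta * (log2(h) - log2(t))),
# where h is the 50% horizon and beta controls how sharply success falls off.
import numpy as np
from scipy.optimize import minimize

def fit_horizon(task_lengths_hrs, successes):
    """Fit (h, beta) to 0/1 outcomes by maximizing the Bernoulli log-likelihood."""
    x = np.log2(np.asarray(task_lengths_hrs, dtype=float))
    y = np.asarray(successes, dtype=float)

    def neg_log_lik(params):
        log2_h, beta = params
        p = 1.0 / (1.0 + np.exp(-beta * (log2_h - x)))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize(neg_log_lik, x0=[0.0, 0.7], method="Nelder-Mead")
    log2_h, beta = res.x
    return 2 ** log2_h, beta

# Toy illustration (made-up data): 14 hypothetical tasks in the 1-4 hr range.
lengths = [1.2, 1.5, 1.5, 2.0, 2.0, 2.5, 2.5, 3.0, 3.0, 3.0, 3.5, 3.5, 4.0, 4.0]
outcomes = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
h0, _ = fit_horizon(lengths, outcomes)
outcomes[7] = 1  # flip a single 3-hour task from failure to success
h1, _ = fit_horizon(lengths, outcomes)
print(f"50% horizon moves from {h0:.2f} hrs to {h1:.2f} hrs after flipping 1 of 14 samples")
```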

I think once you assume a logistic function, it's almost guaranteed that if a new model solves one additional task, it's going to continue the log-linear trend. Remember that METR also only adds models to the plot when they think they are likely to push the frontier. Coupled with measuring on a task distribution that model developers are actively trying to improve on, I think the log-linear trend, or X month doubling period, pops out almost tautologically from the logistic fit assumption.

For example, I tried deriving the horizon length from JUST their reported accuracy, without looking at individual sample evaluations at all. Remember how the main contribution of the METR plot was shifting focus from aggregate accuracy to horizon lengths? Well it turns out, if you use the aggregate accuracy, and the task length distribution, and fit the logistic function to estimate horizon length assuming even a constant β = 0.7, you recover the log-linear trend:

This means, if you had access to just the aggregate accuracy on HCAST, you could estimate the horizon length without knowing which samples the model gets right or wrong. It could be wrong on the short ones, and right on the long ones, for all you care.
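Here is a sketch of that aggregate-only calculation (again my reconstruction, with made-up task lengths and accuracy): hold β fixed and solve for the h whose average predicted success over the task-length distribution matches the reported accuracy.

```python
# Sketch of the aggregate-accuracy-only estimate described above (my reconstruction).
# Given only the task-length distribution and the model's overall accuracy A,
# find the horizon h whose average predicted success equals A, with beta fixed at 0.7.
import numpy as np
from scipy.optimize import brentq

def horizon_from_accuracy(task_lengths_hrs, accuracy, beta=0.7):
    x = np.log2(np.asarray(task_lengths_hrs, dtype=float))

    def mean_pred_minus_acc(log2_h):
        p = 1.0 / (1.0 + np.exp(-beta * (log2_h - x)))
        return p.mean() - accuracy

    # The mean prediction rises monotonically in h, so a simple root-find suffices.
    log2_h = brentq(mean_pred_minus_acc, -15, 15)
    return 2 ** log2_h

# Hypothetical: a model scoring 45% on a suite whose lengths span minutes to hours.
lengths = [0.05, 0.1, 0.25, 0.5, 1, 1, 2, 2, 3, 4, 8, 16]
print(f"Implied 50% horizon: {horizon_from_accuracy(lengths, 0.45):.2f} hrs")
```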

Now presumably this logistic fit assumption arises from an earlier plot in the paper, claiming model success rates go down linearly with doubling in task length. I have a qualm with this plot too:

Notice how the log-linear fit here only looks good for the SWAA data, in the 1 sec - 1 min range. There's something completely different going on for tasks longer than 1 minute, clearly not explained by the log-linear fit. If you tried to make a best fit line on the blue points (the length of tasks we care about after 2024), you'd get a very different, almost vertical line, with a very low R^2. I don't know how load-bearing this plot is for the choice of a logistic function to fit p(success) vs task length when estimating horizon lengths.

I am not a statistician, so I am uncertain about this final part of the analysis. I don't know what it implies. I don't know how problematic it is to assume a logistic function for this data. I invite people more experienced than me in statistics to analyze it, because it seems a bit suspicious. 


Overall, I wish we had more, and more robust, measurements of model horizon lengths. I think it is a much more meaningful metric than accuracy, because it directly affects automation impacts and economic outcomes (for example, labor laws are often based on number of hours of work). Heck, I even wrote a paper on this topic. I applaud METR for turning my, and many others', attention towards this. But the way people are misinterpreting, and making wild inferences from, the headline horizon length numbers METR puts out every month worries me. If we are staking investment decisions and research priorities on an evaluation, it needs to be really robust. And making robust long-horizon benchmarks is hard, expensive, and uncharted territory. I hope METR plot v2 rises to the challenge!


I thank Sumeet Motwani, Ameya Prabhu, Arvindh Arun, and Akshit Sinha for feedback on this post. I appreciate folks at METR recognizing the value of these critiques when I tweeted about them. If you liked this, consider subscribing to my substack, where I post more often.




No God Can Help You

2025-12-20 16:32:01

Published on December 20, 2025 8:32 AM GMT

There is a standard pattern in philosophical conversations. People stumble upon their epistemological limitations and then assume that God’s existence somehow would’ve solved the problem.

For instance, suppose we are talking about the Münchhausen trilemma, which demonstrates the impossibility of having a completely certain foundation for knowledge. We inspect the different horns of the trilemma and then conclude with a sigh:

“Oh well, it seems there is no absolute certainty in our world. Unless, of course, God Himself would tell us that something is true”.

This is deeply related to the idea of curiosity-stoppers: standard non-answers that make people pretend that the problem is solved and, therefore, cease any further investigation. Naturally, saying “God!” is a very popular one.

But this time the situation is a bit different. The problem was already investigated and discussed. Dare I even say, understood. And yet we are still invoking “God!” as an afterthought. As if we are trying to appeal to His divine vanity and give Him some credit anyway, even if we’ve just explored all the reasons He wouldn’t help at all.

Imagine that it was extremely clear that God existed. Imagine everyone could close their eyes in prayer and in a very literal sense find themselves in the presence of absolute goodness, ready to guide them through the hardships of life and answer questions about the nature of the cosmos.

For some questions there would be clear answers. For others - just pointers towards new trains of thought or clever arguments. God wouldn’t spoil all the discovery for you, but could put you back on track were you to be led astray and ask for help.

The things God had revealed were indeed shown to be true by independent validation. The answers He gave were consistent among people. There was not a single known instance of God lying or being wrong. And so on and so forth.

So imagine all that and then suppose that God told you: “X is true”. Can we now justifiably believe that it’s indeed so with absolute certainty?

At the very least we would have to trust God. And, while doing so in this scenario seems quite reasonable, this is a conclusion at which we arrive the standard way - by analyzing evidence with our own minds. And so the trilemma still stands:

  • Either we just assume, without further justification, that God always tells us the truth
  • Or we add another link to the infinite regress of justifications
  • Or we engage in some kind of circular reasoning, using God to justify God

We might have a new and, on reflection, very trustworthy source of empirical evidence, but it’s still merely empirical evidence, unable to bring us to full certainty. Our reasons to trust such evidence would be of the same kind as our reasons to trust our observations and reasoning right now. The addition of God doesn’t change anything substantial about the problem.

And it’s not because our imperfect minds are so flawed that even an omniscient God can’t help us. The situation is almost the same from the perspective of God himself. Yes, he appears to know everything. But can he trust his own omniscience? Any reason for doing so will be rooted in his own omniscient mind. And what if his mind was created by an even more powerful entity in such a way that our poor God was always deluded about the nature of his existence? That he only thought himself to be omniscient while actually being nothing of the sort?

The same goes for other contentious philosophical problems. If we have trouble justifying objective, stance-independent morality, God can’t help us with that. His stance on morality will be just another stance, his divine authority - just another authority. God’s eye view is just as subjective as any other.

If we are struggling with the problem of the first cause, the existence of God doesn’t resolve it either. We merely get yet another thing to be caused, passing the buck one level further. And if God can be uncaused, well, why not something else?

Some may find this to be a reason for despair. If even an omniscient God cannot help with our problems, what can we, mere mortals, do? Shouldn’t we just give up and accept that we will never find any solutions?

But I think there is a more optimistic twist on the insight. If the addition of God can’t solve philosophy, then we are not missing much in this regard in the godless world we seem to be living in. Maybe some questions will have to be refined. Maybe some answers will not be what we initially expected them to be. But whatever answers exist - they are ours to discover.




Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins

2025-12-20 15:13:58

Published on December 20, 2025 7:13 AM GMT

An updated METR graph including Claude Opus 4.5 was just published 3 hours ago on X by METR (source):

[Image: METR 50%-time horizon plot (log scale), including Claude Opus 4.5]

 

Same graph but without the log (source):

[Image: the same plot on a linear scale]

Thread from METR on X (source):

We estimate that, on our tasks, Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins). While we're still working through evaluations for other recent models, this is our highest published time horizon to date.

We don’t think the high upper CI bound reflects Opus’s actual capabilities: our current task suite doesn’t have enough long tasks to confidently upper bound Opus 4.5’s 50%-time horizon. We are working on updating our task suite, and hope to share more details soon.

Based on our experience interacting with Opus 4.5, the model’s performance on specific tasks (including some not in our time horizon suite), and its benchmark performance, we would be surprised if further investigation showed Opus had a 20+ hour 50%-time horizon.

Despite its high 50%-time horizon, Opus 4.5's 80%-time horizon is only 27 minutes, similar to past models and below GPT-5.1-Codex-Max's 32 mins. The gap between its 50%- and 80%- horizons reflects a flatter logistic success curve, as Opus differentially succeeds on longer tasks.


You can find additional details about our current methodology as well as our time horizon estimates for Opus 4.5 and other models here: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

The 80%-horizon has stayed essentially flat (27-32 mins) since GPT-5.1-Codex-Max's release but there's a big jump with huge error bars on the 50%-horizons.

I think Daniel Kokotajlo's recent shortform offers a useful framing here. He models progress as increasing either the intercept (baseline performance) or the slope (how well models convert time budget into performance). If progress comes mainly from rising intercepts, an exponential fit to horizon length could hold indefinitely. But if progress comes from increasing slope, the crossover point eventually shoots to infinity as AI slope approaches human slope.
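To make that framing concrete (rough notation of mine, not Kokotajlo's): suppose performance with a time budget t scales like a_H + b_H · log t for humans and a_AI + b_AI · log t for an AI, with the AI starting higher on short tasks (a_AI > a_H) but converting extra time less well (b_AI < b_H). The two lines cross at log t* = (a_AI − a_H) / (b_H − b_AI). Raising the AI intercept a_AI pushes the crossover out steadily, consistent with an ongoing exponential trend in horizon length, whereas pushing the slope b_AI toward b_H shrinks the denominator and sends t* to infinity.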

The flat 80%-horizon while 50%-horizon climbs might be evidence for intercept-dominated progress.




Show LW: Alignment Scry

2025-12-20 12:07:58

Published on December 20, 2025 2:48 AM GMT

Howdy. I've built exopriors.com/scry, a powerful search tool over LessWrong, arXiv, HackerNews, community-archive.org, and more.

You and your agent can now query this rich dataset with the full expressive power of SQL + vector algebra.
 

Some example usage:
> what is Eliezer's most Eliezer post?

> find the 4 posts over 200 karma that are most distant from each other in every way (not the average of them). we want to create 4 quadrants.

> I need posts with the seriousness and quality of list of lethalities, but that's maybe not AI AND doom pilled (one or the other is okay). 

As you can see, this is a very powerful paradigm of search. Structured Query Language is a real OG, embeddings and arbitrary vector composition take it to the next level, and agents are very good at working with this stuff.
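For a flavour of what such a query could look like, here is a purely illustrative sketch; the table and column names, the @vector_handle expansion, and the pgvector-style distance operator are assumptions for illustration, not Scry's actual schema or syntax.

```python
# Purely illustrative: the kind of SQL + vector query an agent might compose.
# Schema, @vector_handle expansion, and the "<=>" distance operator are assumptions
# of this sketch, not the actual Scry API.
query = """
WITH q AS (SELECT @eliezer_style AS v)           -- @vector_handle: a pre-embedded query vector
SELECT title, karma, embedding <=> q.v AS dist   -- distance of each post to that vector
FROM posts, q
WHERE author = 'Eliezer Yudkowsky' AND karma > 200
ORDER BY dist ASC                                -- "most Eliezer" = closest to the handle
LIMIT 5;
"""
print(query)  # an agent would submit this to the search endpoint and read back rows
```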

Some cool aspects of this project:

  • hardening up a SQL database enough to let the public run queries. There's so much collective trauma about SQL injection attacks that most people have forgotten that this is possible. 
  • I've built in syntactic sugar for using custom vectors. Agents can embed arbitrary queries and refer to them with @vector_handle syntax. This compactness helps agents reason efficiently, and lets us not have to pass around 8kb vectors.
  • Opus 4.5 and GPT-5.2 allowed me to ship this in a couple weeks. The software intelligence explosion is here.
  • product-as-a-prompt, agent-copilot-targeted UX as a paradigm. It was pretty cool realizing I could e.g. just describe my /feedback API endpoint in the prompt, to open up the easy communication channel with users and help me iterate on the project better.
  • The affordability of Hetzner dedicated machines is worth mentioning. I was really feeling constrained with my very limited budget trying to build something real with DigitalOcean. I discovered Hetzner late Nov and just bought (started renting) a monster machine before I knew what to do with it, knowing something had to happen. The breathing room with the machine specs has really allowed the project to expand scope, with currently over 400 GB of indexes (for query performance), able to ingest and embed basically every interesting source I've been able to think of. If I were a VC, I would be using a tool like Scry and visiting universities to find cash-strapped neurodivergent builders and maybe offer them a Hetzner machine and Claude Max and GPT Pro subscription just to see what happens.
  • there's also an alerts functionality. Like we're ingesting thousands of papers, posts, articles, comments a day, so you can just specify an arbitrary SQL query that we'll run daily or more often, and get an email when the output changes. Google Alerts on steroids.

Happy to take any feedback! I'll likely be releasing a Mac App in the next few days to provide a smoother sandboxed experience.




Opinionated Takes on Meetups Organizing

2025-12-20 08:17:34

Published on December 20, 2025 12:17 AM GMT

Screwtape, as the global ACX meetups czar, has to be reasonable and responsible in his advice giving for running meetups.

And the advice is great! It is unobjectionably great.

I am here to give you more objectionable advice, as another organizer who's run two weekend retreats and a cool hundred rationality meetups over the last two years. As the advice is objectionable (in that, I can see reasonable people disagreeing), please read with the appropriate amount of skepticism.

Don't do anything you find annoying

If any piece of advice on running "good" meetups makes you go "aurgh", just don't do those things. Supplying food, having meetups on a regular scheduled basis, doing more than just hosting board game nights, building organizational capacity, honestly who even cares. If you don't want to do those things, don't! It's completely fine to disappoint your dad. Screwtape is not even your real dad.

I've run several weekend-long megameetups now, and after the last one I realized that I really hate dealing with lodging. So I am just going to not do that going forwards and trust people to figure out sleeping space for themselves. Sure, this is less ideal. But you know what would be even less ideal than that? If in two years' time I throw in the towel because this is getting too stressful, and I stop hosting megameetups forever.

I genuinely think that the most important failure mode to avoid is burnout. And the non-fabricated options for meetup organizers are often: host meetups that are non-ideal, or burn out. I would rather meetups in a city exist ~indefinitely in mildly crappy form than exist in ideal form for only a bit, after which the city has no more rationality meetups.

Anyways this hot take trumps all the other hot takes which is why it's first. If the rest of the takes here stress you out, just ignore them.

Boss people around

You are a benevolent dictator, act like it. Acting like a dictator can be uncomfortable, and feeling uncomfortable as one is laudable. But you have to do it anyways because the people yearn to be governed. If you are not a benevolent dictator, there is going to be a power vacuum, and because of social monkey dynamics, some random attendee is going to fill that power vacuum, and they're going to do a worse job (they don't know where the bathrooms are, or to call for regular break times so people are not just sitting for 3 hours straight) and a less benevolent one (they don't know that they're supposed to be a benevolent dictator instead of just talking at everyone for 3 hours straight) than you.

As an organizer, the attendees see you as having an aura of competence and in-chargeness around you. You're just some guy, so this is kind of baffling. But you should take advantage of this in ways that ultimately benefit the group as a whole. More on this in the highly recommended The Art of Gathering by Priya Parker.[1]

Tell people to do things.

People around these parts like to help out more than they get the chance to. If you ever offered to help the host at a party but the host waved you away, you know what I'm talking about.

Further, many people actually become quite psychically uncomfortable if they feel like they have an increasing debt to you that they can't pay back (e.g. because you keep hosting good meetups and they keep attending them). So I truly mean this: asking people to do things for you is doing them a favour. Ask them to fetch the latecomer from the door. Ask them to help you clean up after each event. Ask them to guest host meetups on topics they are well versed in.

Tell people how to participate, and sometimes to participate less.

A script I like when breaking people into conversational groups[2]: "Try to pay attention to the amount of conversational space you're taking up. If you feel like you're talking a bit more than other people, try to give other people more space, and if you feel like you're talking a bit less, try to contribute a little more." This does seem to help a little!

But sometimes it does not help enough, and the conversation ends up being monopolized by a person or two anyways. This sucks and is boring for everyone else trapped in that conversation. But you, as the benevolent dictator, can bring out the big guns, because of your aura of in-chargeness.

For example, I will regularly say "hey name, can you please try to reduce the amount of conversational space you are taking up?" More often, I will use a precise number: "Hey, I would like you to talk around 50/65/80% less."

I don't break this one out in the wider world, because this sounds like an unhinged request to most people. But rationalists find this an acceptable thing for organizers to say, and so I will keep pressing that button and not getting punished for it.[3]

Sometimes, people will take "please talk 50% less" as "please shut up forever". If they stop speaking entirely after you make the request, you can invite them back into the conversational fold by asking them for their thoughts on the topic a little while later in the conversation. Then they get the idea.

I do the opposite thing too. If there is someone who is a little more reticent to speak, but has a thoughtful look on their face, or I notice them failing to break into the conversation a few times, I'll also throw them a line, and ask them about what they feel about the readings or the latest turn in the conversation. The idea isn't to get to perfect conversational parity, but to nudge the conversation maybe 30% more that way. This one is nice because if you do it enough, a few other people in the conversation will also pick up the idea that they should be looking out for other people who are interested in speaking, and helping you with gently prompting others to contribute. (This one's fine to do anywhere since it's very... legibly? pro-social, but you do need the magical organizer status force field to request that people talk less.)

Do not accommodate people who don't do the readings

If there's one thing I hate, it's seeing rationalist groups devolve into vibes based take machines. Rationality meetups should cultivate the more difficult skills required to think correct things about the world, including reading longform pieces of text critically when that is a helpful thing to do (which it often is). Organizers should assign readings often, and cultivate a culture where doing the readings is a default expectation. Do not mollycoddle or be understanding or say "oh that's fine" to people who have not done them. You can give new people a pass for misunderstanding the expectations the very first time they show up, and your regulars a pass if they had some sort of genuinely extenuating circumstance.

Especially in smaller meetups (say, under 15 average attendees), you really want to avoid the death spiral of a critical fraction of attendees not doing their readings, and thus the discussion accommodating their lack of context. This punishes the people who did do the readings and disincentivizes them from doing the readings in the future.[4]

As a side benefit, this also makes it so that each newcomer immediately feels the magic of the community. If a new person shows up to my meetups, I like starting out the meetup by asking people who have done the assigned readings to raise their hands. All the hands go up, as well as the new person's eyebrows, and this is like crack to me.

Make people read stuff outside the rationality canon at least sometimes

Especially if you've been running the meetups for a few years. Rationality must survive contact with the wider world, even the parts of it that are not related to AI safety. Examples of things you can read:

Do closed meetups at least sometimes

Especially for contentious topics, such as gender war or culture war discourse, I restrict the meetups to only regulars. Two good reasons for this:

  • There is unmet demand for discussion of more taboo subjects, which means newbies are disproportionately likely to show up to spicier events, and this makes them much more annoying to moderate
  • People can have more authentic and productive conversations when they are surrounded by people they know and trust, and it's unusually important to have authentic and productive conversations if you are discussing taboo subjects because otherwise they devolve into shitshows.

There is another reason, which is that this is sort of like, a way of rewarding your regulars for being regulars? Some amount of reward is good for the culture, but there are trade-offs and better ways of doing that. So I am not sure that this is a "good" reason.

My specific system is that the discord server for my community has roles for "regulars" and "irregulars". People get the "irregular" role after they attend three meetups within a few months' time, and the "regular" role after they... well, become regulars. I restrict more contentious meetups to only people with those roles, explain what they are, and explain that everyone else will be turned away at the door. 

Experiment with group rationality at least sometimes

Many heads are better than one, but rationality in the community seems to be a solo activity. The group rationality tag on LessWrong is kind of dead. It should be less dead, and we should be distributing knowledge work more. Think about how your group can do that!

One easy type of doing this is the "skillshare" - if any of your attendees has a skill that they can teach others within a block of a few hours, help them host a meetup on teaching everyone else that skill. Some skillshares we've done: singing, calligraphy, disc golf, drawing, crochet.

Other things you can do: distribute reading a book or a very long essay, distribute researching a topic, distribute writing it up.

Bias the culture towards the marginal rat(s) you want

My meetups website is somewhat notorious for looking like this:

I'm not saying it's zero percent a shitpost, but the polarization that it induces is intentional.

The mainline rationalists are going to check out your meetup no matter what your website looks like. And once they are there, they are going to be like "ah yes, this is a meetup for my people, very good," and stick around. (Okay, yeah, make sure you have that part sorted first.)

So one question you should ask is: who is the marginal attendee that you want to attract? And then you want to bias your material towards them[5]. Here are some categories that might exist on the margins of rationality communities in various locales:

  • important, busy people
  • shy/anxious/depressed people
  • EA/Progress Studies/Emergent Ventures/YIMBY people
  • people who are into woo/vibecamp/burning man/catholicism
  • tech entrepreneurs and startup founders
  • econ majors
  • people who have heard about rationality/EA and might secretly like some of its takes but believe the community vibe to be rancid (racist, sexist, transphobic, etc)
    • this is very common among younger people, women, racial and gender minorities, queer people, and non-tech people
  • leftists of varying levels of irony
  • various kinds of accelerationists
  • the alt right
  • undercover FBI agents

As with all things except pokemon, you can't get them all and you must consider trade-offs. My website will turn off the most fussy members of the tribe and the people who are largely here for the transhumanism, but I think the first group would kind of kill the vibes at a meetup anyways and I don't think there's too many of the second around these parts so I'm comfortable with the trade.

My website will also repel older members of the community, and I am sad about this. But I live in a college town and the numbers just don't work out in their favour, especially since older members are more likely to be more central members of the tribe, and come check us out anyways.

Websites, of course, are not the end-all and be-all of culture. Some other things I do to steer the culture of my group:

  • Make everyone wear name tags every time there is a new person or an irregular in attendance. Specify that people can optionally provide their pronouns. (If I had another organizer, I'd coordinate with them such that exactly one of us writes down our pronouns.)
    • Makes trans people feel safer; discourages people who are either transphobic or so triggered from culture war stuff that they need a few more years to recover from coming back
  • Encourage people with libertarian and right wing takes to continue giving them, and point out explicitly when counter-arguments are weak or bad-faith.
    • Credibly signal that we are serious about this freedom of thought and pursuit of truth thing. This is important because the group culture has some markers of not doing that, such as girls with dyed hair and pronouns in regular attendance.
  • Normalize responses like "I think this is misinformation" or "I don't agree with this take" in response to claims that seem like misinformation or bad takes.
    • Avoid the failure mode of feelings getting in the way of productive disagreement.
    • Keep in mind that the meetups I run are generally located in Canada and excessive politeness is the norm. If you are running a meetup in, say, Germany, or the Bay Area, perhaps you need to nudge the culture in the opposite direction.
  • Provide only vegetarian (mostly vegan) snacks
    • Makes EAs and people who care about animal welfare feel more welcome
  • Run EA meetups once a month
    • Ensures that the EA and rationality scenes in the city never drift too far from each other
  • Run woo meetups ~twice a year (authentic relating, meditation practice, David Chapman, etc)
    • Some aspects of my meetups culture turn away the most woo people, which is intentional; woo people have other communities of their own, hardline rationalists generally do not, and it is much more important for me to make the culture good for the second group even if it is at the expense of the first.
    • But then I like to add a tiny amount of woo back in for the very d i s e m b o d i e d people who are left.

There are other things that affect the meetup culture that I can't realistically change, such as the layout and design of my apartment's amenity room, or like, my own fundamental personality. You can only do so much.

You can choose to not care about any of this. The correct choice for most meetups organizers is to not spend precious organizing hours thinking about culture strategy, and just focus on running meetups they consider interesting and fun. But while you can choose to not think about the trade-offs, the trade-offs will persist nonetheless.

And remember, if any of this stresses you out, see take #1.

  1. ^

     You can find a summary on the EA forums here, and this specific point is under the subheading "don't be a chill host".

  2. ^

    I break people into different groups if a single group has more than eight people in it. At seven or eight people, it becomes difficult for many people to contribute to the group conversation. But sometimes groups of only 3 people fizzle out, and this seems like a worse failure mode, so I wait until the threshold of 8 to split.

  3. ^

    The way that I think about this is something like: people who tend to monopolize the conversation know this about themselves, and will kick themselves about doing so after they get home and realize that that's what they did. If the request is given in a non-hostile and casual way, they often genuinely appreciate the reminder in the moment. 

  4. ^

    I hear this take might not apply to larger groups where there will be enough people in the mix who have done the readings that they can just discuss with each other. 

  5. ^

    You can also consider the opposite: which groups you want to disincentivize from attendance. But this seems anti-social so I shan't say more about it.




A Full Epistemic Stack: Knowledge Commons for the 21st Century

2025-12-20 06:48:03

Published on December 19, 2025 10:48 PM GMT

We're writing this in our personal capacity. While our work at the Future of Life Foundation has recently focused on this topic and informs our thinking here, this specific presentation of our views is our own.

Knowledge is integral to living life well, at all scales:

  • Individuals manage their life choices: health, career, investment, and others on the basis of what they understand about themselves and their environments.
  • Institutions and governments (ideally) regulate economies, provide security, and uphold the conditions for flourishing under their jurisdictions, but only if they can make requisite sense of the systems involved.
  • Technologists and scientists push the boundaries of the known, generating insights and techniques judged valuable by combining a vision for what is possible with a conception of what is desirable (or as proxy, demanded).
  • More broadly, societies negotiate their paths forward through discourse which rests on some reliable, broadly shared access to a body of knowledge and situational awareness about the biggest stakes, people’s varied interests in them, and our shared prospects.
    • (We’re especially interested in how societies and humanity as a whole can navigate the many challenges of the 21st century, most immediately AI, automation, and biotechnology.)

Meanwhile, dysfunction in knowledge-generating and -distributing functions of society means that knowledge, and especially common knowledge, often looks fragile [1]. Some blame social media (platform), some cynical political elites (supply), and others the deplorable common people (demand).

But reliable knowledge underpins news, history, and science alike. What resources and infrastructure would a society really nailing this have available?

Among other things, we think its communication and knowledge infrastructure would make it easy for people to learn, check, compare, debate, and build in ways which compound and reward good faith. This means tech, and we think the technical prerequisites, the need, and the vision for a full epistemic stack[2] are coming together right now. Some pioneering practitioners and researchers are already making some progress. We’d like to nurture and welcome it along.

In this short series, we’ll outline some ways we’re thinking about the space of tools and foundations which can raise the overall epistemic waterline and enable us all to make more sense. In this first post, we introduce frames for mapping the space[3] — different layers for info gathering, structuring into claims and evidence, and assessment — and potential end applications that would utilize the information.

A full what?

A full epistemic stack. Epistemic as in getting (and sharing) knowledge. Full stack as in all of the technology necessary to support that process, in all its glory.

What’s involved in gathering information and forming views about our world? Humans aren’t, primarily, isolated observers. Ever since the Sumerians and their written customer complaints[4], humans have received information about much of their world from other humans, for better or worse. We sophisticated modern beings consume information diets transmitted across unprecedented distances in space, time, and network scale.

With an accelerating pace of technological change and with potential information overload at machine speeds, we will need to improve our collective intelligence game to keep up with the promise and perils of the 21st century.

Imagine an upgrade. People faced with news articles, social media posts, research papers, chatbot responses, and so on can trivially trace their complete epistemic origins — links, citations, citations of citations, original data sources, methodologies — as well as helpful context (especially useful responses, alternative positions, and representative supporting or conflicting evidence). That’s a lot, so perhaps more realistically, most of the time, people don’t bother… but the facility is there, and everyone knows everyone knows it. More importantly, everyone knows everyone’s AI assistants know it (and we know those are far less lazy)! So the waterline of information trustworthiness and good faith discourse is raised, for good. Importantly, humans are still very much in the loop — to borrow a phrase from Audrey Tang, we might even say machines are in the human loop.

Some pieces of this are already practical. Others will be a stretch with careful scaffolding and current-generation AI. Some might be just out of reach without general model improvements… but we think they’re all close: 2026 could be the year this starts to get real traction.

Does this change (or save) the world on its own? Of course not. In fact we have a long list of cautionary tales of premature and overambitious epistemic tech projects which achieved very few of their aims: the biggest challenge is plausibly distribution and uptake. (We will write something more about that later in this series.) And sensemaking alone isn't sufficient! — will and creativity and the means to coordinate sufficiently at the relevant scale are essential complements. But there’s significant and robust value to improving everyone's ability to reason clearly about the world, and we do think this time can be different.

Layers of a foundational protocol

Considering the dynamic message-passing network of human information processing, we see various possible hooks for communicator-, platform-, network-, and information-focused tech applications which could work together to improve our collective intelligence.

We’ll briefly discuss some foundational information-focused layers together with user experience (UX) and tools which can utilise the influx of cheap clerical labour from LMs, combined with intermittent judgement from humans, to make it smoother and easier for us all to make sense.

All of these pieces stand somewhat alone — a part of our vision is an interoperable and extensible suite — but we think implementations of some foundations have enough synergy that it’s worth thinking of them as a suite. We’ll outline where we think synergies are particularly strong. In later posts we’ll look at some specific technologies and examples of groups already prototyping them; for now we’re painting in broad strokes some goals we see for each part of the stack.

Ingestion: observations, data, and identity

Ultimately grounding all empirical knowledge is some collection of observations… but most people rely on second-hand (and even more indirect) observation. Consider the climate in Hawaii. Most people aren’t in a position to directly observe that, but many have some degree of stake in nonetheless knowing about it or having the affordance to know about it.

For some topics, ‘source? Trust me bro,’ is sufficient: what reason do they have to lie, and does it matter much anyway? Other times, for higher stakes applications, it’s better to have more confirmation, ranging from a staked reputation for honesty to cryptographic guarantee[5].

Associating artefacts with metadata about origin and authorship (and further guarantees if available) can be a multiplier on downstream knowledge activities, such as tracing the provenance of claims and sources, or evaluating track records for honesty. Thanks to AI, precise formats matter less, and tracking down this information can be much more tractable. This tractability can drive the critical mass needed to start a virtuous cycle of sharing and interoperation, which early movers can encourage by converging on lightweight protocols and metadata formats. In true 21st Century techno-optimist fashion, we think no centralised party need be responsible for storing or processing (though distributed caches and repositories can provide valuable network services, especially for indexing and lookup[6]).
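As a toy illustration of how lightweight such a record could be, here is a sketch of an origin-and-authorship metadata object (the field names are invented for illustration, not a proposed format):

```python
# Purely illustrative sketch of a lightweight provenance record (field names are
# invented for illustration; nothing here is a proposed standard).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ArtifactProvenance:
    artifact_id: str                    # stable identifier or content hash of the artifact
    author: str                         # person, organisation, or agent claiming authorship
    created_at: str                     # ISO 8601 timestamp of creation or observation
    sources: list = field(default_factory=list)  # upstream artifact_ids this one draws on
    method: str = ""                    # how it was produced (sensor reading, survey, inference, ...)
    attestation: Optional[str] = None   # optional signature, proof-of-personhood, or watermark reference

reading = ArtifactProvenance(
    artifact_id="sha256:3f2a...",
    author="hilo-weather-station-17",
    created_at="2025-12-19T07:00:00Z",
    method="direct sensor reading",
)
```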

Structure: inference and discourse

Information passing and knowledge development involve far more than sharing basic observations and datasets between humans. There are at least two important types of structure: inference and discourse.

Inference structure: genealogy of claims and supporting evidence (Structure I)

In the ideal case, perhaps, raw observations are reliably recorded; their search and sampling processes are unbiased (or well-described and accounted for); inferences are made in combination with other knowledge, with traceable citations and appropriate uncertainty quantification; and finally new traceable, conversation-ready claims are made.

We might call this an inference structure: the genealogy and epistemic provenance of given claims and observations, enabling others to see how conclusions were reached, and thus to repeat or refine (or refute) the reasoning and investigation that led there.

Of course in practice, inference structure is often illegible and effortful to deal with at best, and in many contexts intractable or entirely absent. We are presented with a selectively-reported news article with a scant few hyperlinks, themselves not offering much more context. Or we simply glimpse the tweet summary with no accompanying context.

Even in science and academia where citation norms are strongest, a citation might point to a many-page paper or a whole book in support of a single local claim, often losing nuance or distorting meaning along the way, and adding much friction to the activity of assessing the strength of a claim[7].

How do tools and protocols improve this picture? Metascience reform movements like Nanopublications strike us as a promising direction.

Already, LM assistance can make some of this structure more practically accessible, including in hindsight. A lightweight sharing format and caches for commonly accessed inference structure metadata can turn this into a reliable, cheap, and growing foundation: a graph of claims and purported evidence, for improved further epistemic activity like auditing, hypothesis generation, and debate mapping.
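To make the idea of such a graph concrete, here is a minimal sketch (again with invented names, not a proposed sharing format): claims and evidence are nodes, support relations are edges, and the genealogy of a claim can be traversed mechanically.

```python
# Minimal sketch of a claims-and-evidence graph (invented schema, for illustration only).
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    kind: str                         # "claim", "evidence", "dataset", ...
    text: str
    provenance_id: str = ""           # link into ingestion-layer provenance metadata

@dataclass
class SupportEdge:
    claim_id: str
    support_id: str
    relation: str = "cites"           # e.g. "cites", "derives_from", "replicates"

@dataclass
class InferenceGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # SupportEdge records

    def genealogy(self, claim_id):
        """Walk citations-of-citations back to the roots supporting a claim."""
        frontier, seen = [claim_id], []
        while frontier:
            current = frontier.pop()
            for e in self.edges:
                if e.claim_id == current and e.support_id not in seen:
                    seen.append(e.support_id)
                    frontier.append(e.support_id)
        return [self.nodes[s] for s in seen if s in self.nodes]
```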

Discourse: refinement, counterargument, refutation (Structure II)

Knowledge production and sharing is dynamic. With claims made (ideally legibly), advocates, detractors, investigators, and the generally curious bring new evidence or reason to the debate, strengthening or weakening the case for claims, discovering new details, or inferring new implications or applications.

This discourse structure associates related claims and evidence, relevant observations which might not have originally been made with a given topic in mind, and competing or alternative positions.

Unfortunately in practice, many arguments are made and repeated without producing anything (apart from anger and dissatisfaction and occasional misinformation), partly because they’re disconnected from the wider discourse. Connection to that discourse is valuable both as contextual input (understanding the state of the wider debate or investigation so that the same points aren’t argued ad infinitum and people benefit from updates), and as output (propagating conclusions, updates, consensus, or synthesis back to the wider conversation).

This shortcoming holds back science, and pollutes politics.

Tools like Wikipedia (and other encyclopedias), at their best, serve as curated summaries of the state of discourse on a given topic. If it’s fairly settled science, the clearest summaries and best sources should be made salient (as well as some history and genealogy). If it’s a lively debate, the state of the positions and arguments, perhaps along with representative advocates, should be summarised. But encyclopedias can be limited by sourcing, available cognitive labour and update speed, one-size-fits-all formatting, and sometimes curatorial bias (whether human or AI).[8]

Similar to the inference layer, there is massive untapped potential to develop automations for better discourse tracking and modeling. For example, LLMs doing literature reviews can source content from a range of perspectives for downstream mapping. Meanwhile, relevant new artefacts can be detected and ingested close to realtime. We don’t need to agree on all conclusions — but we can much more easily agree on the status of discourse: positions on a topic, the strongest cases for them, and the biggest holes[9]. Direct access as well as helpful integrations with existing platforms and workflows can surface the most useful context to people as needed, in locally-appropriate format and level of detail.

Assessment: credence, endorsement, and trust

Claims and evidence, together with counter-claims and an array of perspectives (however represented), give some large ground source of potential insight. But at a given time and for a given person there is some question to be answered, which means reaching trusted summaries and positions.

Ultimately consumers of information sources come to conclusions on the basis of diverse signals: compatibility with their more direct observations, assessment of the trustworthiness and reliability (on a given topic) of a communicator, assessment of methodological reasonableness, weighing and comparing evidence, procedural humility and skepticism, explicit logical and probabilistic inference, and so on. It’s squishy and diverse!

We think some technologies are unable to scale because they’re too rigid in assigning explicit probabilities, or because they enforce specific rules divorced from context. This fails to account for real reasoning processes and also can work against trust because people (for good and bad reasons) have idiosyncratic emphases in what constitutes sensible reasoning.

We expect that trust should be a late-binding property (i.e. at the application layer), to account for varied contexts and queries and diverse perspectives, interoperable with minimally opinionated structure metadata. That said, squishy, contextual, customisable reasoning is increasingly scalable and available for computation! So caches and helpful precomputations for common settings might also be surprisingly practical in many cases.
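Continuing the toy sketches above, late-binding trust might look like the application layer combining the shared structure with its own context-specific weights (again illustrative, not a proposed interface):

```python
# Illustrative only: an application-layer assessment that binds trust late, by combining
# the shared inference graph with a caller-specific map of trust in provenance sources.
def assess(claim_id, graph, trust):
    """Score a claim by averaging the caller's trust in the sources supporting it."""
    supports = graph.genealogy(claim_id)
    if not supports:
        return 0.5  # no recorded support: fall back to an uninformative prior
    weights = [trust.get(node.provenance_id, 0.5) for node in supports]
    return sum(weights) / len(weights)
```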

With foundational structure to draw from, this is where things start to substantially branch out and move toward the application layer. Some use cases, like summarisation, highlighting key pros and cons and uncertainties, or discovery, might directly touch users. Other times, downstream platforms and tools can integrate via a variety of customized assessment workflows.

Beyond foundations: UX and integrations

Foundations and protocols and epistemic tools sound fun only to a subset of people. But (almost) everyone is interested in some combination of news, life advice, politics, tech, or business. We don’t anticipate much direct use by humans of the epistemic layers we’ve discussed. But we already envision multiple downstream integrations into existing and emerging workflows: this motivates the interoperability and extensibility we’ve mentioned.

A few gestures:

  • Social media platforms struggle under adversarial and attentional pressures. But distributed, decentralised context-provision, like the early success stories in Community Notes, can serve as a widely-accessible point of distribution (and this is just one form factor among many possible). In turn, foundational epistemic tooling can feed systems like Community Notes.
  • More speculatively, social-media-like interfaces for uncovering group wisdom and will at larger scales while eliciting more productive discourse might be increasingly practical, and would be supported by this foundational infrastructure.
  • Curated summaries like encyclopedias (centralised) and Wikipedia (decentralised) are often able to give useful overviews and context on a topic. But they’re slow, don’t have coverage on demand, offer only one-size-fits-all, and are sometimes subject to biases. Human and automated curators could consume from foundational epistemic content and react to relevant updates responsively. Additionally, with discourse and inference structure more readily and deeply available, new, richly-interactive and customisable views are imaginable: for example enabling strongly grounded up- and down-resolution of topics on request[10], or highlighting areas of disagreement or uncertainty to be resolved.
  • Authors and researchers already benefit from search engines, and more recently ‘deep research’ tooling. With integration of easily available relational epistemic metadata, these uplifts can be much more reliable, trustworthy, and effective.
  • Emerging use of search-enabled AI chatbots as primary or complementary tools for search, education, and inquiry means that these workflows may become increasingly impactful. Equipping chatbots with access to discourse mapping and depth of inference structure can help their responses to be grounded and direct people to the most important points of evidence and contention on a topic.
  • Those who want to can already layer extensions onto their browsing and mobile internet experiences. Having always-available or on-demand highlighting, context expandables, warnings, and so on, is viable mainly to the extent that supporting metadata are available (though LMs could approximate these to some degree and at greater expense). More speculatively, we might be due a browser UX exploration phase as more native AI integration into browsing experiences becomes practical: many such designs could benefit from availability of epistemic metadata.

How? Why now?

If this would be so great, why has nobody done it already? Well, vision is one thing, and we could also make a point about underprovision of collective goods like this. But more relevant, the technical capacity to pull off this stack is only really just coming online. We’re not the first people to notice the wonders of language models.

First, the not inconsiderable inconveniences of the core epistemic activities we’ve discussed are made less overwhelming by, for example, the ability of LLMs to digest large amounts of source information, or to carry out semi-structured searches and investigations. Even so, this looks to us like mainly a power-user approach, even if it came packaged in widely available tools similar to deep research, and it doesn’t naively contribute to enriching knowledge commons. We can do better.

With a lightweight, extensible protocol for metadata, caching and sharing of discovered inference structure and discourse structure becomes nearly trivial[11]. Now the investigations of power users (and perhaps ongoing clerical and maintenance work by LLM agents) produce positive epistemic spillover which can be consumed in principle by any downstream application or interface, and which composes with further work[12]. Further, the risks of hallucinated or confabulated sources (for LMs as with humans) can be limited by (sometimes adversarial) checking. The epistemic power is in the process, not in the AI.

Various types of openness can bring benefits: extensibility, trust, reach, distribution — but can also bring challenges like bad faith contributions (for example omitting or pointing to incorrect sources) or mistakes. Tools and protocols at each layer will need to navigate such tradeoffs. One approach could have multiple authorities akin to public libraries taking responsibility for providing living, well-connected views over different corpora and topics — while, importantly, providing public APIs for endorsing or critiquing those metadata. Alternatively, perhaps anyone (or their LLM) could check, endorse, or contribute alternative structural metadata[13]. Then the provisions of identity and endorsement in an assessment layer would need to solve the challenges of filtering and canonicalisation.

In specific epistemic communities and on particular topics, this could drive much more comprehensive understanding of the state of discourse, pushing the knowledge frontier forward faster and more reliably. Across the broader public, discourse mapping and inference metadata can act against deliberate or accidental distortion, supporting (and incentivising) more good faith communication.

Takeaways

Knowledge, especially reliable shared knowledge, helps humans individually and collectively be more right in making plans and taking action. Helping people better trust the ways they get and share useful information can deliver widespread benefits as well as defending against large-scale risk, whether from mistakes or malice.

We communicate at greater scales than ever, but our foundational knowledge infrastructure hasn’t scaled in the same way. We see a large space of opportunities to improve that — only recently coming into view with technical advances in AI and ever-cheaper compute.

This is the first in what will be a series exploring one corner of the design landscape for epistemic tech: there are many uncertainties still, but we’re excited enough that we’re investigating and investing in pushing it forward.

We’ll flesh out more of our current thinking on this stack in future entries in this series, including more on existing efforts in the space, interoperability, and core challenges here (especially distribution).

Please get in touch if any of this excites or inspires you, or if you have warnings or reasons to be skeptical!

Thanks to our colleagues at the Future of Life Foundation, and to several epistemic tech pioneers for helpful conversations feeding into our thinking.

  1. You might think this is a new or worsening phenomenon, or you might think it perennial. Either way, it’s hard to deny that things would ideally be much better. We further think there is some urgency to this, both due to rising stakes and due to foreseeable potential for escalating distortion via AI. ↩︎

  2. Improved terminological branding sorely needed ↩︎

  3. Coauthor Oly formerly frequently used single hyphens for this sort of punctuation effect, but coincidentally started using em-dashes recently when someone kindly pointed out that it’s trivial to write them while drafting in google docs. This entire doc is human-written (except for images). Citation: trust us. ↩︎

  4. or perhaps as early as Homo erectus and his supposed pantomime communication, or even earlier ↩︎

  5. Some such guarantees might come from signed hardware, proof of personhood, or watermarking. We’re not expecting (nor calling for!) all devices or communications to be identified, and not necessarily expecting increased pervasiveness of such devices. Even where the capability is present on hardware, there are legitimate reasons to prefer to scrub identifying metadata before some transmissions or broadcasts. In a related but separate thread of work, we’re interested in ways to expand the frontier of privacy x verification, where we also see some promising prospects. ↩︎

  6. Compare search engine indexes, or the Internet Archive. ↩︎

  7. Relatedly, but not necessarily as part of this package, we are interested in automating and scaling the ability to quickly identify rhetorical distortion or unsupported implicature, which manifests in science as importance hacking and in journalism as spin, sensationalism, and misleading framing. ↩︎

  8. Wikipedia, itself somewhere on the frontier of human epistemic infrastructure, becomes at its weakest points a battleground and a source of contention that it’s not equipped to handle in its own terms. ↩︎

  9. This gives open, discoverable discourse a lot of adversarial robustness. You can do all you like to deny a case, malign its proponents, claim it’s irrelevant… but these are all just new (sometimes valuable!) entries in the implicit ‘ledger’ of discourse on a topic. This ‘append-only’ property is much more robust than an opinionated summary or authoritative canonical position. Of course append-only raises practical computational and storage concerns, and editorial bias can re-enter any time summarisation and assessment is needed. ↩︎

  10. Up- and down-resolution is already cheaply available on request: simply ask an LLM ‘explain this more’ or ‘summarise this’. But the process will be illegible, hard to repeat, and lack the trust-providing support of grounding in annotated content. ↩︎

  11. Storage and indexing is the main constraint to caching and sharing, but the metadata should be a small fraction of what is already stored and indexed in many ways on the internet. ↩︎

  12. How to fund the work that produces new structure? In part, integration with platforms and workflows that people already use. In part, this is a public good, so we’re talking about philanthropic and public goods funding. In some cases, institutions and other parties with interest in specific investigations may bring their own compute and credits. ↩︎

  13. Does this lack of opinionated authority on canonical structure defeat the point of epistemic commons? Could a cult, say, provision their own para-epistemic stack? Probably — in fact in primitive ways they already do — but it’d be more than a little inconvenient, and we think that availability of epistemic foundation data and ideally integration into existing platforms, especially because it’s unopinionated and flexible in terms of final assessment, can drive much improvement in any less-than-completely adversarially cursed contexts. ↩︎


