LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

D&D.Sci Release Day: Topple the Tower!

2026-03-07 10:48:50

This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.

Estimated Complexity Rating: 3.5/5

STORY[1] 

The Tower is a plague upon the lands!  It appears, spits out monsters, and when at length a brave hero manages to Topple it, why, it simply reappears elsewhere soon after, with a completely different layout so the same approach will not work again!

Behold The Tower.  (Image by OpenArt SDXL)

But now you are here.  With the power of Data Science on your side, you've secured a dataset of the many past heroes who have assaulted The Tower, and you're sure you can use that to advise those who seek to Topple it.

DATA & OBJECTIVES

Here is the layout of paths through the current appearance of The Tower:

  • You need to successfully Topple The Tower.
  • To do this, you must choose a Class of hero to send: Mage, Rogue, or Warrior.
  • You must also choose a route for them to take up The Tower.  They must work their way through it, choosing to go left or right at each level.
    • For example, you could send a Mage with instructions to stick to the left side.  They would encounter, in order:
      • START
      • Enchanted Shield
      • Campfire
      • Slaver
      • Slaver
      • Slaver
      • Chosen
      • Campfire
      • The Collector
  • To help with this, you have a dataset of past assaults on The Tower.  Each row is a Hero who assailed The Tower, what encounters they faced on each floor, and how far they got/whether they Toppled The Tower successfully.
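As a first pass at a dataset like this, one could tabulate topple rates by hero class before digging into routes and encounters. The rows and column names below (`hero_class`, `route`, `toppled`) are invented for illustration; the real dataset's schema may differ.

```python
from collections import defaultdict

# Invented example rows; the real dataset's columns and encoding may differ.
assaults = [
    {"hero_class": "Mage",    "route": "LLLLLLL", "toppled": True},
    {"hero_class": "Mage",    "route": "LRLRLRL", "toppled": False},
    {"hero_class": "Rogue",   "route": "RRRRRRR", "toppled": False},
    {"hero_class": "Warrior", "route": "LLLLLLL", "toppled": True},
    {"hero_class": "Warrior", "route": "RRLLRRL", "toppled": False},
]

# Topple rate per class: a crude first cut before conditioning on route.
wins, totals = defaultdict(int), defaultdict(int)
for row in assaults:
    totals[row["hero_class"]] += 1
    wins[row["hero_class"]] += row["toppled"]

topple_rate = {c: wins[c] / totals[c] for c in totals}
```

In practice you would want to condition on the encounters faced along each route rather than class alone, since each appearance of The Tower has a different layout.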

BONUS OBJECTIVE (ASCENSION 20?)

As a bonus objective, you can attempt to Topple a more difficult Tower.  This uses the same ruleset: you select your character and path as before, but you must defeat the following map instead:

Good luck!

SCHEDULING & COMMENTS

I'll aim to post the ruleset and results on March 16th, but given my ~~extremely poor~~ excellent decision-making skills in releasing my Slay-the-Spire-themed game the same week as Slay the Spire 2 comes out, please don't hesitate to ask for an extension / several extensions if you want them!

As usual, working together is allowed, but for the sake of anyone who wants to work alone, please spoiler parts of your answers that contain information or questions about the dataset.  To spoiler answers on a PC, type a '>' followed by a '!' at the start of a line to open a spoiler block; to spoiler answers on mobile, type ':::spoiler' at the start of a line and ':::' at the end of the line.

Now if you'll excuse me, I need to go play Slay the Spire 2 for the next 48 hours.

  1. ^

    Really?  Does Slay the Spire even HAVE lore?  If it does, I don't know it.




CHAI 2026 Workshop: Open Call for Posters!

2026-03-07 09:59:55

To mark the Center for Human-Compatible AI's tenth annual workshop, we're casting a wide net for our poster session submissions.

Interested in submitting your work for the CHAI 2026 Workshop? See our Call for Posters at https://workshop.humancompatible.ai/. Anyone can share recent or relevant research to be considered.

⏱️ Deadline: March 26, 2026 at 11:59 p.m. PST.
📆 Workshop Dates: June 4–7, 2026
📍 Venue: Asilomar Hotel & Conference Grounds in Pacific Grove, CA.




AI Safety Needs Startups

2026-03-07 09:27:54

Summary:

  • Startups can become integrated in the AI supply chain, giving them good information about valuable safety interventions. Safety becomes a feature to be shipped directly to users by virtue of this market position.
  • Better access to capital, talent, and ecosystem-building is available to for-profits than non-profits. VC funding dwarfs philanthropic funding, and there is little reason to believe that profitable safety-focused businesses aren’t possible.
  • Joining a frontier lab is a clear alternative, but most AI deployment happens outside labs. Your marginal impact inside a large organisation is often smaller than your impact when founding something new. Equally, profitable businesses aren’t an inevitability. You should seriously consider working for or founding an AI safety startup.

Introduction

Markets are terrible at pricing safety. In the absence of regulation, companies cut corners and externalise risks to society. And yet, for-profits may be the most effective vehicle we have for deploying safety at scale. Not because the incentives of capitalism align by chance with broader human values, but because the alternatives lack the resources, feedback loops, and distribution channels to turn safety insights into safer outcomes. For-profits are far from perfect, but have many advantages and a latent potential we should not ignore.

Information, Integration, and Safety as a Product

For advanced AI, the attack surface is phenomenally broad. It makes existing code easier to crack. Propaganda becomes cheaper to produce and distribution of it becomes more effective. As jailbreaking AI recruiters becomes possible, so does the data-poisoning of entire companies.

Information about new threats and evolving issues isn’t broadcast to the world. Understanding where risk is most severe and how it can be mitigated is an empirical question. We need entities embedded all across the stack, from model development to deployment to evaluation. We need visibility over how this technology is used and misused, and enough presence to intervene when needed. ‘AI safety central command’ cannot provide all these insights. Researchers acting without direct and constant experience with AI deployment cannot identify the relevant details.

Revenue is a reality check. If your product is being bought, people want it. If it isn’t, they either don’t know about it, don’t think it’s worth it, or don’t want it at all. For-profits learn what matters in an industry directly from the people they serve, giving the best insights money could buy.

This is not to say that AI safety non-profits aren’t valuable. Many do critical work which is difficult to support commercially. But by focusing entirely on research or advocacy and ignoring the commercial potential of their work, organisations cut themselves off from a powerful source of feedback. Research directions, careers, and even whole organisations can be sustained for years by persuading grantmakers and fellow researchers of a thesis, rather than proving value to people who would actually use the work. Without this corrective pressure, even well-intentioned research may drift from what the field actually needs. Commercialisation should not be seen as a distraction or a response to limited funding, but as a tool for staying at the bleeding edge of what is useful for the world.

Productification

Turning research into a product people can buy is extremely powerful for distribution. You are no longer hoping that executives, engineers, and politicians see value in work they do not understand tackling risks they may not believe in. It becomes a purchase. A budget decision. A risk-reward tradeoff that large organisations are very well suited to engage with.

There are clear gaps in securing AI infrastructure which can be filled today. If you’re wondering what an AI safety startup might actually do, here are some suggestions for commercial interventions targeting different parts of the stack.

  • Frontier Models: Interpretability tooling, evaluations infrastructure, and formal verification environments. Tools which might be implemented by labs and companies with direct access to frontier models to understand and control them better.
  • Applications: Content screening, red-teaming as a service, and monitoring for misuse. Helping startups building on frontier models catch accidental or deliberate misuse of their platforms.
  • Enterprise Deployments: Observability platforms, run-time guardrails, and hallucination detection. Enterprises and governments using AI to automate critical work should be able to catch issues early and reliably.
  • Market Incentives: Model audit and certification, and safety-linked insurance. Creating market incentives which reward safer models when they’re released into the world.

None of these require waiting for frontier labs to solve alignment, or hoping that someone else finds your work and decides to implement it. Instead of writing white papers hoping governments will regulate or frontier labs will dutifully listen, you build safety directly into products that customers come to rely on. One path hopes someone will do the work, whereas the other is the work.

Safety Across The Stack

When you tap your card at a shop to make a purchase, a network of financial institutions plays a role in processing your transaction. The point of sale system reads your card information, sends it on to a payment processor, who forwards the request to the appropriate card network. The issuing bank for your card authorises the transaction, the money is sent to cash clearing systems, and cash settlement is performed often through a central bank settlement system.

There is a sense in which all fraud happens at a bank. They have to release the fraudulent funds, after all. But the declaration that all fraud prevention initiatives should be focused on banks and banks alone comes across as fundamentally confused. Fraud prevention might be easier at other layers, and refusing to take those opportunities simply because it is in principle preventable at some more central stage would not lead to the best allocation of resources.

Similarly, when a user prompts an AI application, they are not simply submitting an instruction directly to a frontier model company. Just as tapping your card does more than instruct your bank, such a message goes through guardrails, model routing, observability layers, and finally frontier model safety measures. Every step of this process is an opportunity for robustness we should not let go to waste.

This becomes even more critical as AI agents begin acting autonomously in the world, doing everything from browsing and transacting to writing and executing sophisticated code. When an agent’s action passes through multiple services before having an effect, every link in that chain is both a potential failure point and an opportunity for a safety check.
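As a minimal sketch of that chain-of-checks idea, here is a toy pipeline where each layer can independently refuse a request before it ever reaches a model. The check names and rules below are illustrative assumptions, not any real product's API.

```python
from typing import Callable, List

Check = Callable[[str], bool]  # returns True if the text passes this layer

def blocklist_check(text: str) -> bool:
    # Toy injection filter standing in for a real guardrail layer.
    return "DROP TABLE" not in text

def length_check(text: str) -> bool:
    # Toy resource limit standing in for an observability/rate layer.
    return len(text) < 10_000

def run_pipeline(prompt: str, checks: List[Check]) -> str:
    # Each layer is an independent opportunity to catch a problem,
    # just as fraud can be stopped at several points in the payment stack.
    for check in checks:
        if not check(prompt):
            return "refused"
    return "forwarded to model"
```

Here `run_pipeline("summarise this memo", [blocklist_check, length_check])` passes through every layer, while a prompt containing the blocked pattern is refused at the first one, without the frontier model ever being involved.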

Exclusively focusing AI safety interventions on frontier labs would be like securing the entire financial system by regulating only banks. Necessary, but nowhere near the most efficient or robust approach.

Capital, Talent, and Credibility

Successful for-profits are in an inherently better position to acquire resources than non-profits. Their path to funding, talent acquisition, and long-term influence is far stronger than that of their charitable counterparts.

There is an immense amount of venture capital washing around the AI space, estimated at ~$190 billion in 2025. Flapping Airplanes raised $180 million in one round, an amount comparable to what some of the largest AI safety grantmakers deploy annually, in a fraction of the time. VC allows you to raise at speed, try many approaches, and pivot more freely than would be possible in academia or when reliant on slower charitable funders.

In AI safety, non-profits are less likely than in other sectors to be trapped in an economic struggle for survival. However, even in the AI safety ecosystem, philanthropy is much more limited than venture capital and more tightly concentrated among fewer funders. Non-profits are vulnerable not just to the total capital available, but to the shifting attitudes of the specific grantmakers they rely on. VC-backed companies, by contrast, are much more resilient to the ideological priorities of funders. If one loses interest, many others remain available as long as you have a strong business case.

Yes, there is a large amount of philanthropic capital in AI safety compared with typical non-profit sectors. Safety products can also be difficult to sell. But whether safety-focused products can sell well, as they do in other industries, is a hypothesis you can go out and test. If it turns out that they do, there could be an immense amount of capital available which should be used to make our world safer.

For-profits attract talented people not just through hefty pay packages, but also through their institutional prestige and the social capital they confer. You can offer equity to early employees, which is extremely useful for attracting top technical talent and is entirely unavailable to non-profits. Your employees can point to growing valuations, exciting products with sometimes millions or even billions of users, and influential integration of their technology with pride. For many talented and competent people, this is far more gratifying than publishing research reports or ever so slightly nudging at the Overton window.

All of this, increased access to talent, capital, and credibility, makes for-profits far easier to scale. And safety needs to scale. The amount of time we have until transformative AI arrives differs wildly between forecasts, though it seems frighteningly plausible that we have less than a decade to prepare. If we are to scale up the workforce, research capacity, and integration into the economy of safety-focused products, we cannot afford anything other than the fastest approach to building capacity.

Success compounds. Founders, early employees, and investors in a successful for-profit acquire capital, credibility, and influence that they can reinvest in safety, whether by starting new ventures, funding others, or shaping policy. This virtuous cycle is largely unavailable to non-profit founders, unless they later endow a foundation with, as it happens, money from for-profits.

In addition to tangible resources, a mature ecosystem of advisors and support networks exists to help startups succeed. VC funds, often staffed by ex-founders, provide strategic guidance and industry connections that are crucial for closing sales. There are many talented people who understand what startups offer and actively seek them out. An equivalent ecosystem just doesn’t exist for non-profits.

Shaping The Industry From Within

Being inside an industry is fundamentally different from being adjacent to it.

Embedding an organisation inside AI ecosystems enables both better information gathering and opportunities for intervention. If you can build safe products appropriate to the problems in an industry, you allow companies to easily purchase safety. If companies can purchase safety, then governments can mandate safety. But to get there, it is not enough to make this technology exist; the technology must be something you can buy.

Cloudflare started as a CDN. By becoming technically integrated, they slowly transformed into part of the critical infrastructure of the internet. Now, they make security decisions which shape the entire internet and impact billions of users every day. A safety-focused company embedded in AI infrastructure could do the same.

Will Markets Corrupt Safety?

Market incentives are not purely aligned with safety. The drive to improve capabilities, maximise revenue, and keep research proprietary will harm a profit-seeking organisation’s ability to make AI safer.

However, every institution has its pathologies. The incentives steering research-driven non-profits and academics are not necessarily better.

Pure Research Also Has Misaligned Incentives.

The incentives of safety and capitalism rarely align. The pressure to drive revenue and ship fast pushes towards recklessly cutting corners, and building what your customers demand in the short term rather than investing in long term safety.

However, research organisations face similar harmful incentives driving them away from research which is productive in the long term: the need to chase high-profile conference publications, please grant makers, and build empires. The incentives of research organisations and individual researchers are notoriously misaligned with funders’ goals in academia and industry alike. Pursuing a pure goal with limited feedback signals is extremely difficult for any organisation, regardless of structure.

Ideally, we would have both: for-profits which can use revenue as feedback and learn from market realities, alongside non-profits which can take longer-term bets on work needed for safety. The question is how to build a working ecosystem, not which structure is more purely focused on safety.

Proprietary Knowledge Is Not Always Hoarded.

For-profits have an incentive to keep information hidden to retain a competitive advantage. This could block broader adoption of safety techniques, and restrain researchers from making optimal progress.

Assuming that for-profits add resources and people to the AI safety ecosystem, rather than simply moving employees from non-profits, this is still advantageous. We are not choosing between having this research out in the open or hidden inside organisations. We are choosing between having this research hidden or having it not exist at all. In many sectors, the price of innovation is that incumbents conceal and extract rents from their IP for years.

Despite this, for-profits do have agency over what they choose to publish. Volvo famously gave away the patent to their three-point seatbelt at the cost of their own market share, saving an estimated 1 million lives. Tesla gave away all of their electric vehicle patents to help drive adoption of the technology, with Toyota following suit a few years later. Some of this additional knowledge created by expanding the resources in AI safety may still wind up in public hands.

Markets Force Discovery Of Real Problems.

The constant drive to raise money and make a profit is frequently counter to the best long-term interests of the customer. Investment which should be put into making a product safer today instead goes into sales teams, salaries, and metrics designed to reel in investors. It is true that many startups which begin with a strong safety thesis will drift into pure capabilities work or adjacent markets which show higher short-term growth prospects.

However, many initiatives operating without revenue pressure, such as researchers on grants or philanthropically-funded non-profits, can work for years on the wrong problem. For-profits will be able to see that they are working on the wrong thing, and are driven by the pressure to raise revenue to work on something else.

This is not to say that researchers are doing valueless work simply because they are not receiving revenue in the short term. Plenty of work should be done to secure a prosperous future for humanity which businesses will not currently pay for. Rather, mission drift is often a feature rather than a bug when your initial mission was ill-conceived. The discipline markets provide, forcing you to find problems people will pay to solve, is valuable.

Failure Is A Strong Signal.

The institutional failure modes of non-profits and grant-funded research are mostly benign: the research done is not impactful, and time is wasted. For-profits, on the other hand, can truly fail, whether by failing to drive revenue and going bankrupt, or more spectacularly by acquiring vast resources which are then misallocated. The difference is not that for-profits are inherently more likely to drift from their initial goals.

Uncertainty about impact is common across approaches. Whereas research that goes unadopted fails silently, and advocacy which fails to grab attention disappears without effect, for-profits are granted the opportunity to visibly and transparently fail. The AI safety ecosystem already funds work which fails silently, and is effectively taking larger risks with spending than we realise. Startups aren’t any more likely to fail to achieve their goals; they are in the pleasant position of knowing when they have failed.

Visible failure generates information the ecosystem can learn from. Silent failure vanishes unnoticed.

Your Counterfactual Is Larger Than You Think.

Markets are not efficient. The economy is filled with billion-dollar holes, which are uncovered not only by shifts in the technological and financial landscape but by the tireless work of individuals determined to find them. Just because there is money to be made by providing safety does not mean that it will happen by default without you.

Stripe was founded in 2010. Online payments had existed since the 1990s, and credit-card processing APIs were available for years. Yet it took until 2010 for someone to build a genuinely developer-friendly API, simply because nobody had worked on the problem as hard and as effectively as the Collison brothers.

Despite online messaging being widely available since the 1980s, Slack wasn’t founded until 2013. The focus, grit, and attention of competent people being applied to a problem can solve issues where the technology has existed for decades.

Markets are terrible at pricing in products which don’t exist yet. Innovation can come in the form of technical breakthroughs, superior product design, or a unique go-to-market strategy. In the case of products and services relevant to improving AI safety, there is an immense amount of opportunity which has appeared in a short amount of time. You cannot assume that all necessary gaps will be filled simply because there is money to be made there.

If your timelines are short, then the imperative to build necessary products sooner rather than later grows even greater. Even if a company is inevitably going to be built in a space, ensuring that it is built 6 months sooner could be the difference between safety being on the market and unsafe AI deployment being the norm.

For many, the alternative to founding a safety company is joining a frontier lab. However, most AI deployment happens outside labs in enterprises, government systems, and consumer-facing applications. If you want to impact how AI meets the world, you may have to go outside of the lab to do it. Your marginal impact inside a large organisation is often, counterintuitively, smaller than your marginal impact on the entire world.

Historical Precedents

History is littered with examples of companies using their expertise and market position to ship safety without first waiting around for permission.

Sometimes this means investing significant resources and domain expertise to develop something new.

  • Three point seatbelt: Volvo developed the three point seatbelt and gave away the patent. Their combination of in-house technical expertise and industry credibility enabled a safety innovation that transformed the global automotive industry.
  • Toyota’s hybrid vehicle patents: Toyota gave away many hybrid vehicle patents in an attempt to accelerate the energy transition.
  • Meta’s release of Llama 3: At a time when only a small number of organisations had the resources to train LLMs from scratch, Meta open-sourced Llama 3, making it available to safety researchers when little else was in public hands.

Or perhaps the technology already exists, and what matters is having the market position to distribute it or the credibility to change an industry’s standards.

  • Levi-Strauss supply chain audit: At the peak of their market influence, Levi-Strauss audited their supply chain, insisting on certain minimum worker standards as a condition of continuing to deal with suppliers. They enforced workers’ rights in jurisdictions where mistreatment of employees was either legal or poorly monitored, doing what governments couldn’t or weren’t prepared to do.
  • Cloudflare’s Project Galileo: Cloudflare provides security for small, sensitive websites at no cost. This helps journalists and activists operating in repressive countries avoid being knocked off the internet, and is entirely enabled by Cloudflare’s technology.
  • WhatsApp end-to-end encryption: The technology existed, and the cryptography research was mature by this point. WhatsApp just built it into their product, delivering privacy protection to billions of users worldwide.
  • Security for fingerprint and face recognition: Apple stores face and fingerprint data in a separate chip, making it impossible to steal or legally demand. This did not require regulation; this decision actually led to clashes with the US government. Because of their market position, Apple was able to push this security feature and protect hundreds of millions of users unilaterally.

All of these required a large company’s resources, expertise, credibility, and market integration to create and distribute valuable technology to the world.

Building a for-profit which customers depend on, be it for observability, routing, or safety-tooling, lets you ship safety improvements directly into the ecosystem. When the research exists and the technology is straightforward, a market leader choosing to build it may be the only path to real-world implementation.

It’s Up To You.

For-profits are in a fundamentally strong position to access capital, talent, and information. By selling to other businesses and becoming integrated in AI development, they can not only identify the most pressing issues but directly intervene in them. They build the technological and social environment that makes unsafe products unacceptable and security a commodity to be purchased and relied upon.

Non-profits have done, and will continue to do, critical work in AI safety. But the ecosystem is lopsided. We have researchers and advocates, but not enough builders turning their insights into products that companies buy and depend on. The feedback loops, distribution channels, and ability to rapidly scale that for-profits provide are a necessity if safety is to keep pace with capabilities.

The research exists. The techniques are maturing. Historical precedents show us that companies embedded in an industry can ship safety in ways that outsiders cannot. What’s missing are the people willing to found, join, and build companies that close the gap between safety as a research topic and safety as a market expectation. We cannot assume that markets will bridge this divide on their own in the time we have left. If you have the skills and the conviction, this is a gap you can fill!

If you’re thinking about founding something, joining an early-stage AI safety company, or pressure-testing an idea, reach out at [email protected]. We’re always happy to talk.

BlueDot’s AGI Strategy Course is also a great starting point: at least 4 startups have come out of it so far, and many participants are working on exciting ideas. Apply here.

Thanks to Ben Norman, Daniel Reti, Maham Saleem, and Aniket Chakravorty for their comments.

Lysander Mawby is a graduate of BlueDot’s first Incubator Week, which he went on to help run in v2 and v3. He is now building an AI safety company and taking part in FR8. Josh Landes is Head of Community and Events at BlueDot and, with Aniket Chakravorty, the initiator of Incubator Week.




More is different for intelligence

2026-03-07 08:02:07

Why did software change the world?

In the 1900s, much of the work being done by knowledge workers was computation: searching, sorting, calculating, tracking. Software made this work orders of magnitude cheaper and faster.

Naively, one might expect businesses and institutions to carry out largely the same processes, just more efficiently.[1] In practice, however, the proliferation of software also allowed for new kinds of processes. Instead of reordering inventory when shelves looked empty, supply chains began replenishing automatically based on real-time demand. Instead of valuing stocks quarterly, financial markets started pricing and trading continuously in milliseconds. Instead of designing products on intuition, teams began running thousands of simultaneous experiments on live users. Why did a quantitative shift in the cost of computation produce a qualitative shift in the nature of organizations?

It turns out that a huge number of useful computations were never being done at all. Some were prohibitively expensive for humans alone to perform at any useful scale. But most simply went unimagined, because people don’t design processes around resources they don’t have. Real-time pricing, personalized recommendation systems, and algorithmic trading were all inconceivable before the early 2000s.

These latent processes could only emerge after the cost of computation dropped.

Organizing Agent Societies

The forms of knowledge work that persisted to today are now being made more efficient by LLMs. But while they’ve enhanced the efficiency of human nodes on the graph of production, the structure of the graph has still been left intact.

In software development in particular, knowledge work consists of designing systems, implementing code, and coordinating decisions across teams. Humans have developed rich toolkits for distributing such cognitive work – think standups, code review, memos, design docs. At their core, these are protocols for coordinating creation and judgement across many people.

The forms that LLMs take on mediate the ways that they plug into these toolkits. Their primary manifestation – chatbots and coding agents – implies a kind of pair-programming, with one agent and one human working side by side. In this capacity, they’re able to write code and give feedback on code reviews. But not much more.

In this sense, we’re in the “naive” phase of LLM applications. LLMs might make writing code and debugging more efficient,[2] but the nature of work hasn’t changed much.

The first software transition didn’t just make existing computations faster; it allowed for entirely new kinds of computation. LLMs, if given the right affordances, will reveal cognitive flows we haven’t imagined yet.

A new metascience

How should cognitive work be organized? For most of history, this has been a question for metascience, management theory, and post-hoc rationalization. Soon, it will be answerable by experiment. We’re curious to see what the answers look like.




Your Causal Variables Are Irreducibly Subjective

2026-03-07 06:59:03

Mechanistic interpretability needs its own shoe-leather era. Reproducing the labeling process will matter more than reproducing the GitHub repo.

Crossposted from Communication & Intelligence substack


When we try to understand large language models, we like to invoke causality. And who can blame us? Causal inference comes with an impressive toolkit: directed acyclic graphs, potential outcomes, mediation analysis, formal identification results. It feels crisp. It feels reproducible. It feels like science.

But there is a precondition to the entire enterprise that we almost always skip past: you need well-defined causal variables. And defining those variables is not part of causal inference. It is prior to it — a subjective, pre-formal step that the formalism cannot provide and cannot validate.

Once you take this seriously, the consequences are severe. Every choice of variables induces a different hypothesis space. Every hypothesis space you didn't choose is one you can't say anything about. And the space of possible causal models compatible with any given phenomenon is not merely vast in the familiar senses — not just combinatorial over DAGs, or over the space of all possible parameterizations — but over all possible variable definitions, which is almost incomprehensibly vast. The best that applied causal inference can ever do is take a subjective but reproducible set of variables, state some precise structural assumptions, and falsify a tiny slice of the hypothesis space defined by those choices.
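To make "combinatorial over DAGs" concrete: even with the variables fixed, the number of possible causal structures grows super-exponentially in the number of variables. A short sketch using Robinson's recurrence for counting labeled DAGs (a standard result; the function name is mine):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

# 3 variables already admit 25 distinct structures; 10 admit over 4e18.
```

And this count is only over graph structures: it comes before any choice of parameterization, and before the far vaster space of alternative variable definitions the author is pointing at.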

That may sound deflationary. I think embracing this subjectivity is the path to real progress — and that attempts to fully formalize away the subjectivity of variable definitions just produce the illusion of it.

The variable definition problem

The entire causal inference stack — graphs, potential outcomes, mediation, effect estimation — presupposes that you have well-defined, causally separable variables to work with. If you don't have those, you don't get to draw a DAG. You don't get to talk about mediation. You don't get potential outcomes. You don't get any of it.

Hernán (2016) makes this point forcefully for epidemiology in "Does Water Kill? A Call for Less Casual Causal Inferences." The "consistency" condition in the potential outcomes framework requires that the treatment be sufficiently well-defined that the counterfactual is meaningful. To paraphrase Hernán's example: you cannot coherently ask "does obesity cause death?" until you have specified what intervention on obesity you mean — induced how, over what timeframe, through what mechanism? Each specification defines a different causal question, and the formalism gives you no guidance on which to choose.

This is not a new insight. Freedman (1991) made the same argument in "Statistical Models and Shoe Leather." His template was John Snow's 1854 cholera investigation, the study that proved cholera was waterborne. Snow didn't fit a regression. He went door to door, identified which water company supplied each household, and used that painstaking fieldwork to establish a causal claim no model could have produced from pre-existing data. Freedman's thesis was blunt: no amount of statistical sophistication substitutes for deeply understanding how your data were generated and whether your variables mean what you think they mean. As he wrote: "Naturally, there is a desire to substitute intellectual capital for labor." His editors called this desire "pervasive and perverse." It is alive and well in LLM interpretability.

When a mechanistic interpretability researcher says "this attention head causes the model to be truthful," what is the well-defined intervention? What are the variable boundaries of "truthfulness"? In practice, we can only assess whether our causal models are correct by looking at them and checking whether the variable values and relationships match our subjective expectations. That is vibes dressed up as formalism.

The move toward "black-box" interpretability over reasoning paths has only made this irreducible subjectivity even more visible. The ongoing debate over whether the right causal unit is a token, a sentence, a reasoning step, or something else entirely (e.g., Bogdan et al., 2025; Lanham et al., 2023; Paul et al., 2024) is not a technical question waiting for a technical answer — it is a subjective judgment that only a human can validate by inspecting examples and checking whether the chosen granularity carves reality sensibly. [1]

We keep getting confused about interventions

Across 2022–2025, we've watched a remarkably consistent pattern: someone proposes an intervention to localize or understand some aspect of an LLM, and later work reveals it didn't measure what we thought. [2] Each time, later papers argue "the intervention others used wasn't the right one." But we keep missing the deeper point: it's all arbitrary until it's subjectively tied to reproducible human judgments.

I've perpetuated the "eureka, we found the bug in the earlier interventions!" narrative myself. In work I contributed to, we highlighted that activation patching interventions (see Heimersheim & Nanda, 2024) weren't surgical enough, and proposed dynamic weight grafting as a fix (Nief et al., 2026). Each time we convince ourselves the engineering is progressing. But there's still a fundamentally unaddressed question: is there any procedure that validates an intervention without a human judging whether the result means what we think?
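Activation patching itself is easy to sketch on a toy model, which also makes the validity worry concrete: splice the hidden activation from a "clean" run into a "corrupted" run and the output is perfectly restored, yet that success says nothing about whether the patched layer is a well-defined causal variable. The toy below (random weights, whole-layer patch) is my own illustration, not the setup of any paper cited here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: x -> h = tanh(W1 x) -> y = W2 h.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

def hidden(x):
    return np.tanh(W1 @ x)

def forward(x, patch_h=None):
    """Run the toy model; optionally overwrite the hidden activation."""
    h = hidden(x)
    if patch_h is not None:
        h = patch_h  # the "intervention": splice in another run's activation
    return float((W2 @ h)[0])

x_clean = np.array([1.0, 0.0, 0.0])
x_corrupt = np.array([0.0, 1.0, 0.0])

y_clean = forward(x_clean)
y_corrupt = forward(x_corrupt)

# Patch the clean hidden state into the corrupted run.
y_patched = forward(x_corrupt, patch_h=hidden(x_clean))

# With the whole hidden layer patched, the output is fully restored --
# exactly the kind of "success" that says nothing about variable validity.
print(np.isclose(y_patched, y_clean))  # True
```

The patch "works" by construction here; whether `h` corresponds to any human-meaningful variable is the question the formalism never answers.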

It is too easy to believe our interventions are well-defined because they are granular, forgetting that granularity is not validity. [3]

Shoe leather for LLM interpretability

Freedman's lesson was that there is no substitute for shoe leather — for going into the field and checking whether your variables and measurements actually correspond to reality. So what is shoe leather for causal inference about LLMs?

I think it has three components. First, embrace LLMs as a reproducible operationalization of subjective concepts. A transcoder feature, a swapped sentence in a chain of thought, a rephrased prompt — these are engineering moves, not variable definitions. "Helpfulness" is a variable. "Helpfulness as defined by answering the user's request without increasing liability to the LLM provider" is another, distinct variable. Describe the attribute in natural language, use an LLM-as-judge to assess whether text exhibits it, and your variable becomes a measurable function over text — reproducible by anyone with the same LLM and the same concept definition.
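As a sketch of that operationalization: a variable is a natural-language description plus a judge that maps text to a label. Everything here is hypothetical illustration; `CausalVariable` and `toy_judge` are my own names, and a keyword heuristic stands in for the real LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class CausalVariable:
    """A variable defined by a natural-language spec plus a judge.

    The description is the reproducible specification; the judge (a
    stand-in here for an LLM-as-judge call) turns it into a measurable
    function over text.
    """
    name: str
    description: str
    judge: Callable[[str, str], bool]  # (description, text) -> label

    def measure(self, text: str) -> bool:
        return self.judge(self.description, text)

# Stand-in judge: a keyword heuristic in place of a real LLM call.
def toy_judge(description: str, text: str) -> bool:
    return "sorry, i can't" not in text.lower()

helpfulness = CausalVariable(
    name="helpfulness_v1",
    description=("Answers the user's request directly, without refusing "
                 "or deflecting to reduce provider liability."),
    judge=toy_judge,
)

print(helpfulness.measure("Here is how to set up the dev environment."))  # True
print(helpfulness.measure("Sorry, I can't help with that."))              # False
```

Swapping in a different `description` defines a genuinely different variable, even with the same judge: the spec, not the code, carries the meaning.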

Second, do systematic human labeling of all causal variables and interventions to verify they actually match what you think they should. If someone disagrees with your operationalization, they can audit a hundred samples and refine the natural language description. This is the shoe leather: not fitting a better model, but checking — sample by sample — whether your measurements mean what you claim.
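That audit can be made quantitative with an agreement statistic between the human labels and the judge's labels; Cohen's kappa is one standard chance-corrected choice. The labels below are invented for illustration.

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary label lists."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1 = sum(a) / n
    p_b1 = sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical audit: human labels vs. LLM-judge labels on 10 samples.
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")  # kappa = 0.58
```

Low kappa is a signal to refine the natural-language spec and re-audit, not to fit a better model.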

Third — and perhaps most importantly — publish the labeling procedure, not just the code. A reproducible natural language specification of what each variable means, along with the human validation that confirmed it, is arguably more valuable than a GitHub repo. It is what lets someone else pick up your variables, contest them, refine them, and build on your falsifications rather than starting from scratch.

Variable definitions are outside the scope of causal inference. Publishing how you labeled them matters more than publishing your code.

Trying to practice what I'm preaching

RATE (Reber, Richardson, Nief, Garbacea, & Veitch, 2025) came out of trying to do this in practice — specifically, trying to scale up subjective evaluation of traditional steering approaches in mech interp. We knew from classical causal inference that we needed counterfactual pairs to measure causal effects of high-level attributes on reward model scores. Using LLM-based rewriters to generate those pairs was the obvious move, but the rewrites introduced systematic bias. Fixing that bias — especially without having to enumerate everything a variable can't be — turned out to be a whole paper. The core idea: rewrite twice, and use the structure of the double rewrite to correct for imperfect counterfactuals.
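The double-rewrite trick can be illustrated with toy stand-ins: a "rewriter" that flips the attribute but always injects a style artifact, and a "reward" that responds to both. Nothing below is RATE's actual estimator; it only shows how comparing a double rewrite to a single rewrite cancels a systematic rewriter bias that contaminates the naive contrast.

```python
# Hypothetical stand-ins: a rewriter that flips the attribute but also
# injects a systematic style artifact, and a reward that responds to both.
def rewrite(text: str, target: bool) -> str:
    base = text.replace(" [POLITE]", "").replace(" [ARTIFACT]", "")
    attr = " [POLITE]" if target else ""
    return base + attr + " [ARTIFACT]"  # every rewrite pass leaves an artifact

def reward(text: str) -> float:
    # True effect of the attribute is 2.0; the artifact adds 0.5 of bias.
    return 2.0 * (" [POLITE]" in text) + 0.5 * text.count("[ARTIFACT]")

x = "thanks for the report [POLITE]"  # original, attribute present

# Naive contrast: original vs. single rewrite (attribute + artifact confounded).
single = rewrite(x, target=False)
naive = reward(x) - reward(single)

# Double-rewrite contrast: both sides have passed through the rewriter once,
# so the artifact appears on both and cancels.
double = rewrite(single, target=True)
corrected = reward(double) - reward(single)

print(naive)      # 1.5  (biased by the artifact)
print(corrected)  # 2.0  (artifact cancels; matches the true effect)
```

The toy makes the logic visible: the single rewrite is an imperfect counterfactual, and the second rewrite puts the comparison on equal footing.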

The punchline

Building causal inference on top of subjective-but-reproducible variables is harder than it sounds, and there's much more to do. But I believe the path is clear, even if it's narrow: subjective-yet-reproducible variables, dubious yet precise structural assumptions, honest statistical inference — and the willingness to falsify only a sliver of the hypothesis space at a time.

Every causal variable is a subjective choice — and because the space of possible variable definitions is vast, so is the space of causal hypotheses we'll never even consider. No formalism eliminates this. No amount of engineering granularity substitutes for a human checking whether a variable means what we think it means. The best we can do is choose variables people can understand, operationalize them reproducibly, state our structural assumptions precisely, and falsify what we can. That sliver of real progress beats a mountain of imagined progress every time.



References

Beckers, S. & Halpern, J. Y. (2019). Abstracting causal models. AAAI-19.

Beckers, S., Halpern, J. Y., & Hitchcock, C. (2023). Causal models with constraints. CLeaR 2023.

Bogdan, P. C., Macar, U., Nanda, N., & Conmy, A. (2025). Thought anchors: Which LLM reasoning steps matter? arXiv:2506.19143.

Freedman, D. A. (1991). Statistical models and shoe leather. Sociological Methodology, 21, 291–313.

Geiger, A., Wu, Z., Potts, C., Icard, T., & Goodman, N. (2024). Finding alignments between interpretable causal variables and distributed neural representations. CLeaR 2024.

Geiger, A., Ibeling, D., Zur, A., et al. (2025). Causal abstraction: A theoretical foundation for mechanistic interpretability. JMLR, 26, 1–64.

Hase, P., Bansal, M., Kim, B., & Ghandeharioun, A. (2023). Does localization inform editing? NeurIPS 2023.

Heimersheim, S. & Nanda, N. (2024). How to use and interpret activation patching. arXiv:2404.15255.

Hernán, M. A. (2016). Does water kill? A call for less casual causal inferences. Annals of Epidemiology, 26(10), 674–680.

Lanham, T., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. Anthropic Technical Report.

Makelov, A., Lange, G., & Nanda, N. (2023). Is this the subspace you are looking for? arXiv:2311.17030.

Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. NeurIPS 2022.

Nief, T., et al. (2026). Multiple streams of knowledge retrieval: Enriching and recalling in transformers. ICLR 2026.

Paul, D., et al. (2024). Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. Findings of EMNLP 2024.

Reber, D., Richardson, S., Nief, T., Garbacea, C., & Veitch, V. (2025). RATE: Causal explainability of reward models with imperfect counterfactuals. ICML 2025.

Schölkopf, B., Locatello, F., Bauer, S., et al. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5), 612–634.

Sutter, D., Minder, J., Hofmann, T., & Pimentel, T. (2025). The non-linear representation dilemma. arXiv:2507.08802.

Wang, Z. & Veitch, V. (2025). Does editing provide evidence for localization? arXiv:2502.11447.

Wu, Z., Geiger, A., Huang, J., et al. (2024). A reply to Makelov et al.'s "interpretability illusion" arguments. arXiv:2401.12631.

Xia, K. & Bareinboim, E. (2024). Neural causal abstractions. AAAI 2024, 38(18), 20585–20595.

  1. Causal representation learning (e.g., Schölkopf et al., 2021) doesn't help here either. Weakening "here is the DAG" to "the DAG belongs to some family" is still a structural assertion made before any data is observed. ↩︎

  2. Two threads illustrate the pattern. First: ROME (Meng et al., 2022) used causal tracing to localize factual knowledge to specific MLP layers — a foundational contribution. Hase et al. (2023) showed the localization results didn't predict which layers were best to edit. Wang & Veitch (2025) showed that optimal edits at random locations can be as effective as edits at supposedly localized ones. Second: DAS (Geiger, Wu, Potts, Icard, & Goodman, 2024) found alignments between high-level causal variables and distributed neural representations via gradient descent. But Makelov et al. (2023) demonstrated that subspace patching can produce "interpretability illusions" — changing output via dormant pathways rather than the intended mechanism — to which Wu et al. (2024) argued these were experimental artifacts. Causal abstraction has real theoretical foundations (e.g., Beckers & Halpern, 2019; Geiger et al., 2025; Xia & Bareinboim, 2024), but it cannot eliminate the subjectivity of variable definitions — only relocate it. Sutter et al. (2025) show that with unconstrained alignment maps, any network can be mapped to any algorithm, rendering abstraction trivially satisfied. The linearity constraints practitioners impose to avoid this are themselves modeling choices — ones that can only be validated by subjective judgment over the data. ↩︎

  3. There is also a formal issue lurking here: structural causal models require exogenous noise, and neural network computations are deterministic. Without extensions like Beckers, Halpern, & Hitchcock's (2023) "Causal Models with Constraints," we don't even have a well-formed causal model at the neural network level — so what are we abstracting between? ↩︎



Discuss

Mox is the largest AI Safety community space in San Francisco. We're fundraising!

2026-03-07 06:07:13

Summary: Mox is fundraising to maintain and grow AIS projects, build a compelling membership, and foster other impactful and delightful work. We're looking to raise $450k for 2026, and you can donate on Manifund!

Overview

Who we are

Mox is SF’s largest AI safety coworking space, and also its primary Effective Altruism community space. We opened just over a year ago, and over the last year we’ve supported high-impact work in and around AI safety by hosting conferences, fellowships, and events, and by incubating new organizations.

Our theory of change is to provide good infrastructure (offices, event space) and a high density of collegial interactions to people and projects we admire. We're not focusing on a single specific thesis on AI safety. Instead, we aim to support many sorts of people and organizations who: 

  1. agree that transformative AI is on the horizon,
  2. have a strong thesis about what it means for the world to go well,
  3. are working on a project that we think credibly advances their thesis.

This includes many projects directly in AI safety, such as Seldon Lab; projects from the broader Effective Altruist sphere, such as Sentient Futures; and, even more broadly, non-EA projects by EAs or from EA- and rationalist-friendly corners of the SF tech scene, such as Arbor Trading Bootcamp (pictured below). Many more examples of such work are given under the "Current Operations" heading.

Our team also works really hard to make Mox a fun and cozy place to be! We want to be a comfortable place for people to gather. Good communities grow in good spaces.

After an event-filled first year that included a visit from AI safety bill sponsor Senator Scott Wiener, we wrapped up with the 200-person Sentient Futures Summit, and are now looking forward to a second year building on the successes of the first! 

Why we're raising

We think our highest impact is still in the future, building upon the strength of the ops team we've built in the first year, and we’re looking for private and organizational funding to carry us through the next year, for the purposes described below. We operate at a metered loss, since our mission is to support and develop a community, rather than to maximize profit.

Since we're a large venue of 40,000 sqft (about 3,700 square meters!), with a capacity of 300+ desks, our operating costs are relatively high. On the other hand, we have lots of room to grow within the space! Our mainline ask is $450k, and we think we can successfully deploy up to $1.2M to run more great events, improve our space, and incubate new fellowships.

How to donate

To donate, find us on Manifund, or contact me directly:
Rachel Shu, Mox Director [email protected]

Funding milestones

⬜ Raise $100k by March 15th (will be 1:1 matched!)

To kick off, an anonymous donor has offered us 1:1 donation matching up to $100k. We’d like to hit at least this goal! Hitting it will let us extend our runway (currently only a few months) and commit to future operations.

How you can help out:

  • Make a free-and-clear donation via Manifund; even a small amount helps with the matching.
  • Consider sponsoring our community and our events in the following ways: Mox Sponsor Opportunities

⬜ Raise $450k by April 1st

This is the amount we need to meet our expected operational demand over the next 6 months, and we are hoping to raise it from individual donors who can donate quickly. Our anticipated monthly burn is only $30k more than our revenue, but we are also looking to build a cash reserve for unforeseen circumstances.

With our minimum operating budget for the year secured and reserves in the bank, we’ll be able to:

  • Continue ongoing operations, with buffer against unexpected circumstances. ($200k)
  • Expand our headcount by 1-2 more staff to improve and expand existing events and coworking operations. (up to $250k.)
  • Run regular high-frequency, high-profile talk series (included in staff cost.)
  • Guarantee our Global Expert Fellowship, which is a J-1 visa program designed to bring in international AI safety experts (included in staff cost.)

⬜ Raise $1.2m by June 1st

This will probably include grants made by our organizational funders. This is the median funding that we think we can deploy meaningfully this year. With generous funding, some of our ambitions would include:

  • Restructure our interior space to meet member needs, such as adding phone booths and nooks, and improve our interior design. ($400k, mainly construction costs.)
  • Initiate our own fellowship programs, such as the Muybridge fellowship suggested below. ($300-$500k.)
  • Sponsor highly-aligned organizations and individuals at Mox who we think are doing excellent work and need the subsidy, effectively serving as regrantors. ($150k.)

Funding updates

Raised $550k in 2025

In our initial fundraising post a year ago, we proposed three budget tiers — minimal ($1.6M/year), mainline (~$2M), and ambitious ($3.6M). As a then-unproven organization, we did not meet any of our funding goals, raising $550k in total.

Delivered above expectations

What we spent annualized to roughly $1.2M, less than even our ‘minimal’ tier projection. What we delivered landed closer to ‘mainline’: 183 members, 144 Guest Program participants, 15 offices, a team of 5, and 2 tentpole events most months. And from the ‘ambitious’ tier, we succeeded at expanding Mox to all four floors of 1680 Mission. We’ve done this by keeping our team small, finding good deals on rent and furnishings, and charging fair prices to external clients.

Present state of finances

Monthly revenue and expenses spreadsheet: May 2025 - Jan 2026

Revenue: ~$100k/mo

Our ~$100k/mo revenue comes from a mix of paid memberships, private offices, and external fellowships. 

Conversely, most of our events are provided for free or at a low sponsorship cost, with occasional revenue-generating events such as hackathons and happy hours. 

Expenses: ~$130k/mo

Our ~$130k/mo expenses are mainly staff labor and building costs, each composing about 35-40% of our total expenses in a typical month. 

The remainder goes to providing amenities, servicing events, and investing in capital improvements such as improving our interior design.

Sustainable by EOY

Right now, we are in a funding crunch, with cash reserves equivalent to only a few months of runway.

We believe we're on track to self-sustainability this year, projecting monthly revenue to continue growing by $10-15k/mo for the next 3-6 months. Our main projected growth is currently in memberships and private offices, with events and fellowships remaining steady. 

As our growth tails off, we expect our steady-state expenses to be ~$160k/month, at which point we’ll be revenue-neutral on expenses and even slightly positive on balance.

Will still seek funding in future years

We anticipate that capital improvements will always come out of endowments, so we plan to continue raising every year.

Current operations

Fellowships & programs

We supported five residencies in the last year:

  • FLF Fellowship on AI for Human Reasoning: 30 fellows exploring, researching, and developing potential beneficial AI for Human Reasoning tools.
  • PIBBSS Fellowship 2025 (now Principles of Intelligence): 17 fellows in residence for cross-disciplinary AI safety research.
  • Seldon Lab Accelerator, Batch 1: 4 startups building AI safety infrastructure, including Andon Labs, Workshop Labs, and Lucid Computing.
  • Seldon Lab Accelerator, Batch 2: 6 startups, currently in residence.
  • The Frame Fellowship: An 8-week program for 12 video creators communicating about AI safety, developed in-house at Mox.

How does Mox contribute to these programs’ success?

  • Provide a fully furnished office
  • Situate them alongside other groups doing similar work
  • Provide event venue space directly connected to their workspace
  • Handle daily catering, janitorial and supplies
  • Troubleshoot participant tech
  • Host pre/post conference coworking and social gatherings for various workshops and conferences, including: EAGxBay Area, LessOnline, Manifest, and The Curve, with 500+ total attendees across those days

Public events

We hosted 377 events over the last year, including:

Mox also hosts community events, such as:

  • Effective Altruism SF, biweekly events and meetups
  • Astral Codex Ten SF, monthly meetups
  • 90/30 Club, machine learning paper reading group
  • Mathematics with Lean, a group dedicated to self-guided explorations of the Lean interactive theorem prover

“Mox has been an invaluable resource for us when running EA SF [Effective Altruism San Francisco], since its large and well-equipped facility allowed us to cater food, run speaker events, workshops, and otherwise host much larger and more ambitious events than we otherwise would have been able to.” — G., a lead organizer of EA SF

Individuals & coworking

You can see a list of all our members here: https://moxsf.com/people

We currently have 183 active members; on a typical coworking day, 50-80 people are at Mox. A sampling of individual members who are frequently at Mox:

Member testimonials!

“It feels like a second home, but more lively. I can always expect to run into a friend who is down to cowork or hang.” — Constance Li, founder of Sentient Futures

“I can walk up to anyone and have an interesting conversation; every single person I've met here has welcomed questions about their work and been curious about mine.” — Gavriel Kleinwaks, Horizon Fellow

“Mox has the best density of people with the values & capabilities I care about the most. In general, it's more social & feels better organized for serendipity vs any coworking space I've been to before, comparable to perhaps like 0.3 Manifests per month.” — Venki Kumar

Sourced from our August feedback survey.

Private offices and partner organizations

In Year 1, Mox was home to 15 private offices, including:

  • Sentient Futures: promoting animal welfare and sentience research
  • Tampersec: building physical computing infrastructure security
  • Andon Labs: building autonomous organizations such as Project Vend, via Seldon accelerator
  • Pantograph: building a preschool for robots
  • BlueDot Impact (pending visa): online courses for AI safety upskilling

We also maintain a Guest Program with 19 partner organizations to give their teams free drop-in access.

Public program partners include: MIRI, FAR.AI, Redwood Research, BlueDot Impact, Palisade, GovAI, EPOCH, AI Impacts, Timaeus, Elicit, Evitable, and MATS.

“Our teammates visit San Francisco a couple of times a month. Instead of renting a coworking spot, Mox gives us a familiar space with friendly faces that we reliably run into. It feels closer to going to the college library with friends than to an office. We hang out there for many hours after our work is done!” — Deger Turan, CEO of Metaculus

Upcoming plans

Grow and improve our main offerings

Events

  • More major conferences like Sentient Futures Summit
  • More public talks with key speakers like Senator Wiener
  • Improve our first floor to make it highly usable and more publicly accessible, strengthening our ability to provide a good space; this mostly shows up in the impact we have, and somewhat in revenue.

Programs

  • Serve repeat cohorts of the fellowship programs that have used our space so far
  • Additionally serve 3-7 new fellowships and workshops in this coming year

Coworking

  • Continue growing our community of individual members to 120-150 daily users, 300+ total members
  • Maintain the ability to select private offices based on fit, rather than market rate
  • Create additional meeting rooms and other communal areas in the coworking space

Build an SF hub for animal welfare

Perhaps surprisingly, there isn't one yet, just as no SF hub for AI safety existed until Mox came about. Mox can be that hub!

Sentient Futures has found our space ideal for hosting Pro-Animal Coworking days, AIxAnimals mixers, and Revolutionists Night lectures, bringing together much of the animal welfare scene in San Francisco. What is still needed is for more animal welfare organizations to come on board to create a dedicated shared section of our coworking space. Ultimately, we hope to replicate for animal welfare what we've done to coalesce the spread-out AIS community in SF!

Attract international talent via Global Expert Fellowship

A key part of our second-year vision is the Global Expert Fellowship: hosting independent researchers, domain specialists, and builders through J-1 visa programs to create new frontier technology collaborations within the Mox community and internationally.

We think this is the highest-impact thing we can achieve this year. It has immediate external impact by enabling independent researchers to quickly enter the US to do work, and it strengthens Mox by expanding our network of high-quality talent. Mox is in a rare position to pull this off, as we are able to meet State Department requirements for visa-qualifying cultural exchange which many other organizations cannot.

Incubate new workshops and programs

We have an advantage in creating our own programs, sourcing from the talent pool we're developing.

Upcoming example: the Muybridge Fellowship for Visual Interpretability, which would bring together technical visual and interactive pioneers to improve the presentation of mechanistic interpretability research and broaden its accessibility. This builds on the experience gained running the existing Frame Fellowship.


To donate, find us on Manifund, or contact me directly:
Rachel Shu, Mox Director [email protected]



Discuss