MoreRSS

LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

RSS preview of the LessWrong blog

D&D.Sci Release Day: Topple the Tower!

2026-03-07 10:48:50

This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.

Estimated Complexity Rating: 3.5/5

STORY[1] 

The Tower is a plague upon the lands!  It appears, spits out monsters, and when at length a brave hero manages to Topple it, why, it simply reappears elsewhere soon after, with a completely different layout so the same approach will not work again!

Behold The Tower.  (Image by OpenArt SDXL)

But now you are here.  With the power of Data Science on your side, you've secured a dataset of the many past heroes who have assaulted The Tower, and you're sure you can use that to advise those who seek to Topple it.

DATA & OBJECTIVES

Here is the layout of paths through the current appearance of The Tower:

  • You need to successfully Topple The Tower.
  • To do this, you must choose a Class of hero to send: Mage, Rogue, or Warrior.
  • You must also choose a route for them to take up The Tower.  They must work their way through it, choosing to go left or right at each level.
    • For example, you could send a Mage with instructions to stick to the left side.  They would encounter, in order:
      • START
      • Enchanted Shield
      • Campfire
      • Slaver
      • Slaver
      • Slaver
      • Chosen
      • Campfire
      • The Collector
  • To help with this, you have a dataset of past assaults on The Tower.  Each row is a Hero who assailed The Tower, what encounters they faced on each floor, and how far they got/whether they Toppled The Tower successfully.
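Once the dataset is in hand, a natural first pass is to tabulate how often each (Class, route) combination Toppled The Tower. Here is a minimal stdlib sketch; the field names, route encoding, and example rows are invented for illustration and may not match the real dataset's schema:

```python
from collections import defaultdict

# Invented example rows: (Class, route as an L/R choice per level, Toppled?).
# These are placeholders, not rows from the actual dataset.
assaults = [
    ("Mage", "LLLLLLL", 0), ("Mage", "LRLRLRL", 1),
    ("Rogue", "RRRRRRR", 1), ("Warrior", "LLLLLLL", 0),
    ("Rogue", "RRRRRRR", 0), ("Mage", "LLLLLLL", 1),
]

# Tally topples and attempts for each (Class, route) pair.
stats = defaultdict(lambda: [0, 0])
for cls, route, toppled in assaults:
    stats[(cls, route)][0] += toppled
    stats[(cls, route)][1] += 1

rates = {key: wins / n for key, (wins, n) in stats.items()}
best = max(rates, key=rates.get)
print(best, rates[best])
```

On the real data you would also want to condition on the encounters faced along each route rather than the raw route string, since The Tower's layout changes between appearances.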

BONUS OBJECTIVE (ASCENSION 20?)

As a bonus objective, you can attempt to Topple a more difficult Tower. This uses the same ruleset as before, and you get to select your character and path as before, but you need to defeat the following map instead:

Good luck!

SCHEDULING & COMMENTS

I'll aim to post the ruleset and results on March 16th, but given my ~~extremely poor~~ excellent decision-making skills in releasing my Slay-the-Spire-themed game the same week as Slay the Spire 2 comes out, please don't hesitate to ask for an extension / several extensions if you want them!

As usual, working together is allowed, but for the sake of anyone who wants to work alone, please spoiler the parts of your answers that contain information or questions about the dataset. To spoiler answers on a PC, type '>' followed by '!' at the start of a line to open a spoiler block; to spoiler answers on mobile, type ':::spoiler' at the start of a line and ':::' at the end to spoiler the line.

Now if you'll excuse me, I need to go play Slay the Spire 2 for the next 48 hours.

  1. ^

    Really?  Does Slay the Spire even HAVE lore?  If it does, I don't know it.



Discuss

CHAI 2026 Workshop: Open Call for Posters!

2026-03-07 09:59:55

To mark the Center for Human-Compatible AI's tenth annual workshop, we're casting a wide net for our poster session submissions.

Interested in submitting your work for the CHAI 2026 Workshop? See our Call for Posters at https://workshop.humancompatible.ai/. Anyone can share recent or relevant research to be considered.

⏱️ Deadline: March 26, 2026 at 11:59 p.m. PST.
📆 Workshop Dates: June 4–7, 2026
📍 Venue: Asilomar Hotel & Conference Grounds in Pacific Grove, CA.



Discuss

AI Safety Needs Startups

2026-03-07 09:27:54

Summary:

  • Startups can become integrated in the AI supply chain, giving them good information about valuable safety interventions. Safety becomes a feature to be shipped directly to users by virtue of this market position.
  • Better access to capital, talent, and ecosystem-building is available to for-profits than non-profits. VC funding dwarfs philanthropic funding, and there is little reason to believe that profitable safety-focused businesses aren’t possible.
  • Joining a frontier lab is a clear alternative, but most AI deployment happens outside labs. Your marginal impact inside a large organisation is often smaller than your impact founding something new. Equally, profitable safety businesses are not inevitable; someone has to build them. You should seriously consider working for or founding an AI safety startup.

Introduction

Markets are terrible at pricing safety. In the absence of regulation, companies cut corners and externalise risks to society. And yet, for-profits may be the most effective vehicle we have for deploying safety at scale. Not because the incentives of capitalism align by chance with broader human values, but because the alternatives lack the resources, feedback loops, and distribution channels to turn safety insights into safer outcomes. For-profits are far from perfect, but have many advantages and a latent potential we should not ignore.

Information, Integration, and Safety as a Product

For advanced AI, the attack surface is phenomenally broad. It makes existing code easier to crack. Propaganda becomes cheaper to produce and distribution of it becomes more effective. As jailbreaking AI recruiters becomes possible, so does the data-poisoning of entire companies.

Information about new threats and evolving issues isn’t broadcast to the world. Understanding where risk is most severe and how it can be mitigated is an empirical question. We need entities embedded all across the stack, from model development to deployment to evaluation. We need visibility over how this technology is used and misused, and enough presence to intervene when needed. ‘AI safety central command’ cannot provide all these insights. Researchers acting without direct and constant experience with AI deployment cannot identify the relevant details.

Revenue is a reality check. If your product is being bought, people want it. If it isn’t, they either don’t know about it, don’t think it’s worth it, or don’t want it at all. For-profits learn what matters in an industry directly from the people they serve, giving the best insights money could buy.

This is not to say that AI safety non-profits aren’t valuable. Many do critical work which is difficult to support commercially. But by focusing entirely on research or advocacy and ignoring the commercial potential of their work, organisations cut themselves off from a powerful source of feedback. Research directions, careers, and even whole organisations can be sustained for years by persuading grantmakers and fellow researchers of a thesis, rather than proving value to people who would actually use the work. Without this corrective pressure, even well-intentioned research may drift from what the field actually needs. Commercialisation should not be seen as a distraction or a response to limited funding, but as a tool for staying at the bleeding edge of what is useful for the world.

Productification

Turning research into a product people can buy is extremely powerful for distribution. You are no longer hoping that executives, engineers, and politicians see value in work they do not understand tackling risks they may not believe in. It becomes a purchase. A budget decision. A risk-reward tradeoff that large organisations are very well suited to engage with.

There are clear gaps in securing AI infrastructure which can be filled today. If you’re wondering what an AI safety startup might actually do, here are some suggestions for commercial interventions targeting different parts of the stack.

  • Frontier Models: Interpretability tooling, evaluations infrastructure, and formal verification environments. Tools which might be implemented by labs and companies with direct access to frontier models to understand and control them better.
  • Applications: Content screening, red-teaming as a service, and monitoring for misuse. Helping startups building on frontier models catch accidental or deliberate misuse of their platforms.
  • Enterprise Deployments: Observability platforms, run-time guardrails, and hallucination detection. Enterprises and governments using AI to automate critical work should be able to catch issues early and reliably.
  • Market Incentives: Model audit and certification, and safety-linked insurance. Creating market incentives which reward safer models when they’re released into the world.

None of these require waiting for frontier labs to solve alignment, or hoping that someone else finds your work and decides to implement it. Instead of writing white papers hoping governments will regulate or frontier labs will dutifully listen, you build safety directly into products that customers come to rely on. One path hopes someone will do the work, whereas the other is the work.

Safety Across The Stack

When you tap your card at a shop to make a purchase, a network of financial institutions plays a role in processing your transaction. The point of sale system reads your card information, sends it on to a payment processor, who forwards the request to the appropriate card network. The issuing bank for your card authorises the transaction, the money is sent to cash clearing systems, and cash settlement is performed often through a central bank settlement system.

There is a sense in which all fraud happens at a bank. They have to release the fraudulent funds, after all. But the declaration that all fraud prevention initiatives should be focused on banks and banks alone comes across as fundamentally confused. Fraud prevention might be easier at other layers, and refusing to take those opportunities simply because it is in principle preventable at some more central stage would not lead to the best allocation of resources.

Similarly, when a user prompts an AI application, they are not simply submitting an instruction directly to a frontier model company. Just as tapping your card does more than instruct your bank, such a message goes through guardrails, model routing, observability layers, and finally frontier model safety measures. Every step of this process is an opportunity for robustness we should not let go to waste.

This becomes even more critical as AI agents begin acting autonomously in the world, doing everything from browsing and transacting to writing and executing sophisticated code. When an agent’s action passes through multiple services before having an effect, every link in that chain is both a potential failure point and an opportunity for a safety check.
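To make the layered picture concrete, here is a minimal sketch of a request passing through independent safety layers, any one of which can stop it. The layer names and checks are invented for illustration; real guardrail and observability products are far more sophisticated:

```python
from typing import Callable, List, Optional

# A layer inspects the request and returns a rejection reason, or None to pass.
Layer = Callable[[str], Optional[str]]

def guardrail(prompt: str) -> Optional[str]:
    # Application-level screen (hypothetical keyword rule).
    return "blocked by guardrail" if "exploit" in prompt else None

def observability(prompt: str) -> Optional[str]:
    # Records for later review; never blocks on its own.
    print(f"[log] prompt length={len(prompt)}")
    return None

def model_filter(prompt: str) -> Optional[str]:
    # Stand-in for the frontier lab's own safety measures.
    return "blocked by model filter" if len(prompt) > 1000 else None

def run_stack(prompt: str, layers: List[Layer]) -> str:
    for layer in layers:
        reason = layer(prompt)
        if reason is not None:
            return reason  # one failing layer is enough to stop the request
    return "allowed"

stack = [guardrail, observability, model_filter]
print(run_stack("summarise this report", stack))
```

The design point is that the layers are independently owned: a startup can ship the guardrail or observability layer without waiting for the frontier lab to change its model-side filter.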

Exclusively focusing AI safety interventions on frontier labs would be like securing the entire financial system by regulating only banks. Necessary, but nowhere near the most efficient or robust approach.

Capital, Talent, and Credibility

Successful for-profits are in an inherently better position to acquire resources than non-profits. Their path to funding, talent acquisition, and long-term influence is far stronger than that of their charitable counterparts.

There is an immense amount of venture capital washing around the AI space, estimated at ~$190 billion in 2025. Flapping Airplanes raised $180 million in one round, comparable to what some of the largest AI safety grantmakers deploy in a year, and raised in a fraction of the time. VC allows you to raise at speed, try many approaches, and pivot more freely than would be possible in academia or when reliant on slower charitable funders.

In AI safety, non-profits are less likely than in other sectors to be trapped in an economic struggle for survival. However, even in the AI safety ecosystem, philanthropy is far more limited than venture capital and more tightly concentrated among fewer funders. Non-profits are vulnerable not just to the total capital available, but to the shifting attitudes of the specific grantmakers they rely on. VC-backed companies, by contrast, are much more resilient to the ideological priorities of funders: if one loses interest, many others remain available as long as you have a strong business case.

Yes, there is a large amount of philanthropic capital in AI safety compared with typical non-profit sectors. Safety products can also be difficult to sell. But whether safety-focused products sell well, as they do in other industries, is a hypothesis you can go out and test. If it turns out that they do, there could be an immense amount of capital available which should be used to make our world safer.

For-profits attract talented people not just through hefty pay packages, but also through their institutional prestige and the social capital they confer. You can offer equity to early employees, which is extremely useful for attracting top technical talent and is entirely unavailable to non-profits. Your employees can point to growing valuations, exciting products with sometimes millions or even billions of users, and influential integration of their technology with pride. For many talented and competent people, this is far more gratifying than publishing research reports or ever so slightly nudging at the Overton window.

All of this, increased access to talent, capital, and credibility, makes for-profits far easier to scale. And safety needs to scale. The amount of time we have until transformative AI arrives differs wildly between forecasts, though it seems frighteningly plausible that we have less than a decade to prepare. If we are to scale up the workforce, research capacity, and integration into the economy of safety-focused products, we cannot afford anything other than the fastest approach to building capacity.

Success compounds. Founders, early employees, and investors in a successful for-profit acquire capital, credibility, and influence that they can reinvest in safety, whether by starting new ventures, funding others, or shaping policy. This virtuous cycle is largely unavailable to non-profit founders, unless they later endow a foundation with, as it happens, money from for-profits.

In addition to tangible resources, a mature ecosystem of advisors and support networks exists to help startups succeed. VC funds, often staffed by ex-founders, provide strategic guidance and industry connections that are crucial for closing sales. There are many talented people who understand what startups offer and actively seek them out. An equivalent ecosystem just doesn’t exist for non-profits.

Shaping The Industry From Within

Being inside an industry is fundamentally different from being adjacent to it.

Embedding an organisation inside AI ecosystems enables both better information gathering and opportunities for intervention. If you can build safe products appropriate to the problems in an industry, you allow companies to easily purchase safety. If companies can purchase safety, then governments can mandate safety. But to get there, it is not enough to make this technology exist; the technology must be something you can buy.

Cloudflare started as a CDN. By becoming technically integrated, they slowly transformed into part of the critical infrastructure of the internet. Now, they make security decisions which shape the entire internet and impact billions of users every day. A safety-focused company embedded in AI infrastructure could do the same.

Will Markets Corrupt Safety?

Market incentives are not purely aligned with safety. The drive to improve capabilities, maximise revenue, and keep research proprietary will harm a profit-seeking organisation’s ability to make AI safer.

However, every institution has its pathologies. The incentives steering research-driven non-profits and academics are not necessarily better.

Pure Research Also Has Misaligned Incentives.

The incentives of safety and capitalism rarely align. The pressure to drive revenue and ship fast pushes towards recklessly cutting corners, and building what your customers demand in the short term rather than investing in long term safety.

However, research organisations face similar harmful incentives pulling them away from research that is productive in the long term: the need to chase high-profile conference publications, please grantmakers, and build empires. The incentives of research organisations and individual researchers are notoriously misaligned with funders' goals in academia and industry alike. Pursuing a pure goal with limited feedback signals is extremely difficult for an organisation, regardless of structure.

Ideally, we would have both: for-profits which can use revenue as feedback and learn from market realities, alongside non-profits which can take longer-term bets on work needed for safety. The question is how to build a working ecosystem, not which structure is more purely focused on safety.

Proprietary Knowledge Is Not Always Hoarded.

For-profits have an incentive to keep information hidden to retain a competitive advantage. This could block broader adoption of safety techniques, and restrain researchers from making optimal progress.

Assuming that for-profits add resources and people to the AI safety ecosystem, rather than simply moving employees from non-profits, this is still advantageous. We are not choosing between having this research out in the open or hidden inside organisations. We are choosing between having this research hidden or having it not exist at all. In many sectors, the price of innovation is that incumbents conceal and extract rents from their IP for years.

Despite this, for-profits do have agency over what they choose to publish. Volvo famously gave away the patent to their three-point seatbelt at the cost of their own market share, saving an estimated 1 million lives. Tesla gave away all of their electric vehicle patents to help drive adoption of the technology, with Toyota following suit a few years later. Some of this additional knowledge created by expanding the resources in AI safety may still wind up in public hands.

Markets Force Discovery Of Real Problems.

The constant drive to raise money and make a profit is frequently counter to the best long-term interests of the customer. Investment which should be put into making a product safer today instead goes into sales teams, salaries, and metrics designed to reel in investors. It is true that many startups which begin with a strong safety thesis will drift into pure capabilities work or adjacent markets which show higher short-term growth prospects.

However, many initiatives operating without revenue pressure, such as researchers on grants or philanthropically-funded non-profits, can work for years on the wrong problem. For-profits will be able to see that they are working on the wrong thing, and are driven by the pressure to raise revenue to work on something else.

This is not to say that researchers are doing valueless work simply because they are not receiving revenue in the short term. Plenty of work should be done to secure a prosperous future for humanity which businesses will not currently pay for. Rather, mission drift is often a feature rather than a bug when your initial mission was ill-conceived. The discipline markets provide, forcing you to find problems people will pay to solve, is valuable.

Failure Is A Strong Signal.

The institutional failure modes of non-profits and grant-funded research are mostly benign: the research is not impactful, and time is wasted. For-profits, on the other hand, can truly fail, whether by failing to drive revenue and going bankrupt, or more spectacularly by acquiring vast resources and misallocating them. The difference is not that for-profits are inherently more likely to drift from their initial goals.

Uncertainty about impact is common across approaches. Whereas research that goes unadopted fails silently, and advocacy which fails to grab attention disappears without effect, for-profits are granted the opportunity to visibly and transparently fail. The AI safety ecosystem already funds work which fails silently, and is effectively taking larger risks with spending than we realise. Startups aren’t any more likely to fail to achieve their goals; they are in the pleasant position of knowing when they have failed.

Visible failure generates information the ecosystem can learn from. Silent failure vanishes unnoticed.

Your Counterfactual Is Larger Than You Think.

Markets are not efficient. The economy is filled with billion-dollar holes, which are uncovered not only by shifts in the technological and financial landscape but by the tireless work of individuals determined to find them. Just because there is money to be made by providing safety does not mean that it will happen by default without you.

Stripe was founded in 2010. Online payments had existed since the 1990s, and credit-card processing APIs were available for years. Yet it took until 2010 for someone to build a genuinely developer-friendly API, simply because nobody had worked on the problem as hard and as effectively as the Collison brothers.

Despite online messaging being widely available since the 1980s, Slack wasn’t founded until 2013. The focus, grit, and attention of competent people being applied to a problem can solve issues where the technology has existed for decades.

Markets are terrible at pricing in products which don’t exist yet. Innovation can come in the form of technical breakthroughs, superior product design, or a unique go-to-market strategy. In the case of products and services relevant to improving AI safety, there is an immense amount of opportunity which has appeared in a short amount of time. You cannot assume that all necessary gaps will be filled simply because there is money to be made there.

If your timelines are short, then the imperative to build necessary products sooner rather than later grows even greater. Even if a company is inevitably going to be built in a space, ensuring that it is built 6 months sooner could be the difference between safety being on the market and unsafe AI deployment being the norm.

For many, the alternative to founding a safety company is joining a frontier lab. However, most AI deployment happens outside labs in enterprises, government systems, and consumer-facing applications. If you want to impact how AI meets the world, you may have to go outside of the lab to do it. Your marginal impact inside a large organisation is often, counterintuitively, smaller than your marginal impact on the entire world.

Historical Precedents

History is littered with examples of companies using their expertise and market position to ship safety without first waiting around for permission.

Sometimes this means investing significant resources and domain expertise to develop something new.

  • Three point seatbelt: Volvo developed the three point seatbelt and gave away the patent. Their combination of in-house technical expertise and industry credibility enabled a safety innovation that transformed the global automotive industry.
  • Toyota’s hybrid vehicle patents: Toyota gave away many hybrid vehicle patents in an attempt to accelerate the energy transition.
  • Meta’s release of Llama3: At a time when only a small number of organisations had the resources to train LLMs from scratch, Meta open-sourced Llama3 making it available to safety researchers at a time when little else was in public hands.

Or perhaps the technology already exists, and what matters is having the market position to distribute it or the credibility to change an industry’s standards.

  • Levi Strauss supply chain audit: At the peak of its market influence, Levi Strauss audited its supply chain, insisting on minimum worker standards as a condition of continuing to deal with suppliers. It enforced workers' rights in jurisdictions where mistreatment of employees was either legal or poorly monitored, doing what governments couldn't or weren't prepared to do.
  • Cloudflare’s Project Galileo: Cloudflare provides security for small, sensitive websites at no cost. This helps journalists and activists operating in repressive countries avoid being knocked off the internet, and is entirely enabled by Cloudflare’s technology.
  • WhatsApp end-to-end encryption: The technology existed, and the cryptography research was mature by this point. WhatsApp just built it into their product, delivering privacy protection to billions of users worldwide.
  • Security for fingerprint and face recognition: Apple stores face and fingerprint data in a separate chip, making it extremely difficult to steal or to compel by legal demand. This did not require regulation; the decision actually led to clashes with the US government. Because of its market position, Apple was able to push this security feature unilaterally and protect hundreds of millions of users.

All of these required a large company’s resources, expertise, credibility, and market integration to create and distribute valuable technology to the world.

Building a for-profit which customers depend on, be it for observability, routing, or safety-tooling, lets you ship safety improvements directly into the ecosystem. When the research exists and the technology is straightforward, a market leader choosing to build it may be the only path to real-world implementation.

It’s Up To You.

For-profits are in a fundamentally strong position to access capital, talent, and information. By selling to other businesses and becoming integrated in AI development, they can not only identify the most pressing issues but directly intervene in them. They build the technological and social environment that makes unsafe products unacceptable and security a commodity to be purchased and relied upon.

Non-profits have done, and will continue to do, critical work in AI safety. But the ecosystem is lopsided. We have researchers and advocates, but not enough builders turning their insights into products that companies buy and depend on. The feedback loops, distribution channels, and ability to rapidly scale that for-profits provide are a necessity if safety is to keep pace with capabilities.

The research exists. The techniques are maturing. Historical precedents show us that companies embedded in an industry can ship safety in ways that outsiders cannot. What’s missing are the people willing to found, join, and build companies that close the gap between safety as a research topic and safety as a market expectation. We cannot assume that markets will bridge this divide on their own in the time we have left. If you have the skills and the conviction, this is a gap you can fill!

If you’re thinking about founding something, joining an early-stage AI safety company, or want to pressure-test an idea - reach out at [email protected]. We’re always happy to talk.

BlueDot’s AGI Strategy Course is also a great starting point - at least 4 startups have come out of it so far, and many participants are working on exciting ideas. Apply here.

Thanks to Ben Norman, Daniel Reti, Maham Saleem, and Aniket Chakravorty for their comments.

Lysander Mawby is a graduate of BlueDot’s first Incubator Week, which he went on to help run in v2 and v3. He is now building an AI safety company and taking part in FR8. Josh Landes is Head of Community and Events at BlueDot and, with Aniket Chakravorty, the initiator of Incubator Week.



Discuss

Draft Moskovitz: The Best Last Hope For Constructive AI Governance

2026-03-07 08:57:10

Introduction:

The next presidential election represents a significant opportunity for advocates of AI safety to influence government policy. Depending on the timeline for the development of artificial general intelligence (AGI), it may also be one of the last U.S. elections capable of meaningfully shaping the long-term trajectory of AI governance. 

Given his longstanding commitment to AI safety and his support for institutions working to mitigate existential risks, I advocate for Dustin Moskovitz to run for president. I expect that a Moskovitz presidency would substantially increase the likelihood that U.S. AI policy prioritizes safety. Even if such a campaign were unlikely to win outright, supporting it would still be justified: it would elevate AI safety into the national spotlight, influence policy discussions, and facilitate the creation of a pro-AI-safety political network.

The Case for AI Governance:

 

Governments are needed to promote AI safety because the dynamics of AI development make voluntary caution difficult, and because AI carries unprecedented risk and transformative potential. Furthermore, the US government can make a huge difference for a relatively insignificant slice of its budget.

The Highly Competitive Nature of AI and a Potential Race to the Bottom:

There’s potentially a massive first-mover advantage in AI. The first group to develop transformative AI could theoretically secure overwhelming economic power by using it to kick off a chain of recursive self-improvement, in which human AI researchers first gain dramatic productivity boosts from AI tools, and then AI takes over the research itself. Even without recursive improvement, however, being a first mover in transformative AI could still have dramatic benefits.

Incentives are distorted accordingly. Major labs are pressured to move fast and cut corners—or risk being outpaced. Slowing down for safety often feels like unilateral disarmament. Even well-intentioned actors are trapped in a race-to-the-bottom dynamic: all your effort to ensure your model is safe matters little if an AI system developed by another, less scrupulous company becomes more advanced than your safer models. Anthropic puts it best when they write "Our hypothesis is that being at the frontier of AI development is the most effective way to steer its trajectory towards positive societal outcomes." The actions of other top AI companies also reflect this dynamic, with many AI firms barely meeting basic safety standards.

This is exactly the kind of environment where governance is most essential. Beyond my own analysis, here is what notable advocates of AI safety have said on the necessity of government action and the insufficiency of corporate self-regulation: 

“‘My worry is that the invisible hand is not going to keep us safe. So just leaving it to the profit motive of large companies is not going to be sufficient to make sure they develop it safely,’ he said.  ‘The only thing that can force those big companies to do more research on safety is government regulation.’”

 

“I don't think we've done what it takes yet in terms of mitigating risk. There's been a lot of global conversation, a lot of legislative proposals, the UN is starting to think about international treaties — but we need to go much further. [...] There's a conflict of interest between those who are building these machines, expecting to make tons of money and competing against each other with the public. We need to manage that conflict, just like we've done for tobacco, like we haven't managed to do with fossil fuels. We can't just let the forces of the market be the only force driving forward how we develop AI.”

"Many researchers working on these systems think that we’re plunging toward a catastrophe, with more of them daring to say it in private than in public; but they think that they can’t unilaterally stop the forward plunge, that others will go on even if they personally quit their jobs. And so they all think they might as well keep going. This is a stupid state of affairs, and an undignified way for Earth to die, and the rest of humanity ought to step in at this point and help the industry solve its collective action problem."

The Magnitude of AI Risks:

Beyond the argument from competition, there is also the question of who gets to make key decisions about what types of risks should be taken in the development of AI. If AI has the power to permanently transform society or even destroy it, it makes sense to leave critical decisions about safety to pluralistic institutions rather than unaccountable tech tycoons. Without transparency, accountability, and clear safety guidelines, the risk of AI catastrophe seems much higher.

To illustrate this point, imagine if a family member of the leader of a major AI company (or the leader themselves) were diagnosed with late-stage cancer or another serious medical condition that is difficult to treat with current technology. It is conceivable that the leader would attempt to develop AI faster in order to increase their own or their family member's chance of survival, whereas it would be in the best interest of society to delay development for safety reasons. While it is possible that workers at these AI companies would speak out against the leader's decisions, it is unclear what could be done if the leader in this example decided against their employees' advice.

This scenario is not the most likely one, but there are many similar scenarios, and I think it illustrates that the risk appetites, character, and other unique attributes of the leaders and decision makers of these AI companies can materially affect the level of AI safety applied in AI development. While government is not completely insulated from this phenomenon, especially in short-timeline scenarios, ideally an AI safety party would facilitate the creation of institutions that draw on the viewpoints of many diverse AI researchers, business leaders, and community stakeholders, creating an AI-governance framework that does not give any one (potentially biased) individual the power to unilaterally decide issues of great importance for AI safety (such as when and how to develop or deploy highly advanced AI systems).

The Vast Scope and Influence of Government:

Finally, I think the massive resources of government are an independent reason to support government action on AI safety. Even if you think corporations can somewhat effectively self-regulate on AI, and even if you oppose a general pause on AI development, there is no reason the US government shouldn't or couldn't spend 100 billion dollars a year on AI safety research. That number would be over 20 times OpenAI's total estimated 2024 revenue of 3.7 billion dollars, yet less than 15% of US defense spending. Ultimately, the US government has more flexibility to support AI safety than corporations do, owing simply to its massive size.

The Insufficiency of Current US Action on AI Safety:

Despite many compelling reasons for the US government to act on AI safety, it has never taken significant action in this area, and the current administration has actually gone backwards in many respects. Despite claims to the contrary, the recent AI Action Plan is a profound step away from AI safety, and I would encourage anyone to read it. The first "pillar" of the plan is literally "Accelerate AI Innovation", and the first prong of that pillar is to "Remove Red Tape and Onerous Regulation", citing the Biden administration's executive action on AI (referred to as the "Biden administration's dangerous actions") as an example, despite the fact that the executive order did not actually do much and mainly tried to lay the groundwork for future regulations on AI.

The AI Action Plan also proposes government investment to advance AI capabilities, suggesting that the US "Prioritize investment into theoretical computational and experimental research to preserve America's leadership in discovering new and transformative paradigms that advance the capabilities of AI". And while the plan does acknowledge the importance of "interpretability, control, and robustness breakthroughs", that topic receives only about two paragraphs in a 28-page report (25 pages if you exclude those with fewer than 50 words).

However, as disappointing as the current administration's stance on AI safety may be, the previous administration was not an ideal model either. According to this post, NSF spending on AI safety was only 20 million dollars between 2023 and 2024, and this was ostensibly the main source of direct government support for AI safety. To put that number into perspective, the US Department of Defense spent an estimated 820.3 billion dollars in FY 2023 alone, meaning that two years of NSF AI safety spending amounted to only about 0.00244% of a single year of US defense spending.

Many people seem to believe that governments will inevitably pivot to promoting an AI safety agenda at some point, but we shouldn't just stand around waiting for that to happen while lobbyists funded by big AI companies are actively trying to shape the government's AI agenda. 

The Power of the Presidency:

The US president could unilaterally take a number of actions relevant to AI safety. For one, the president could use powers under the IEEPA to essentially block exports of chips to adversary nations, potentially slowing down foreign AI development and giving the US more leverage in international talks on AI. The same law could also limit the export of AI models themselves, dramatically shifting the bottom line of the AI companies. The president could also use the Defense Production Act to require companies to be more transparent about their use and development of AI models, which would likewise significantly affect AI safety. This is just scratching the surface, but beyond what the president could do directly, over the last two administrations we have seen that both houses of Congress have largely gone along with the president whenever a single party controlled the presidency and both chambers. On the assumption that a Moskovitz presidency would come with such a trifecta, it should be easy for him to push Congress to pass sweeping AI regulation that gives the executive branch substantially more power to regulate AI.

Long story short, effective AI governance likely requires action from the US federal government, and that essentially requires presidential support. Even a presidential candidate who is generally sympathetic to AI safety, but lacks a track record of supporting it, will likely be far slower to back AI regulation, international AI treaties, and AI safety investment, and that is a major deal.

The Case for Dustin Moskovitz:

Many people care deeply about the safe development of artificial intelligence. But from the perspective of someone who cares about AI safety, a strong presidential candidate would need more than a clear track record of advancing efforts in this area. They would also need the capacity to run a competitive campaign and the competence to govern effectively if elected.

However, one of the central difficulties in identifying such candidates is that most individuals deeply involved in AI safety come from academic or research-oriented backgrounds. While these figures contribute immensely to the field’s intellectual progress, their careers often lack the public visibility, executive experience, or broad-based credibility traditionally associated with successful political candidates. Their expertise, though invaluable, rarely translates into electoral viability.

Dustin Moskovitz represents a rare exception. As a leading advocate and funder within the AI safety community, he possesses both a deep commitment to mitigating existential risks and the professional background to appeal to conventional measures of success. His entrepreneurial record and demonstrated capacity for large-scale organization lend him a kind of legitimacy that bridges the gap between the technical world of AI safety and the public expectations of political leadership. Beyond this, his financial resources also will allow him to quickly get his campaign off the ground and focus less on donations than other potential candidates, a major boon for a presidential nominee. 

In a political environment dominated by short-term incentives, a candidate like Moskovitz—who combines financial independence, proven managerial ability, and a principled concern for the long-term survival of humanity—embodies an unusually strong alignment between competence, credibility, and conscience.

The Value of a Moskovitz Presidency:

I can see that Accelerationist is misspelled, you get the point anyways.

The best way to assess the impact of a Moskovitz presidency on AI safety is to compare him to potential alternative presidents. On the Republican side, prediction markets currently favor J.D. Vance, who famously stated at an AI summit: "The AI future is not going to be won by hand-wringing about safety. It will be won by building -- from reliable power plants to the manufacturing facilities that can produce the chips of the future."

Yikes.

On the Democratic side, things aren't much better. Few Democratic politicians with presidential ambitions have clearly committed themselves to supporting AI safety, and even if they would hypothetically support some AI safety initiatives, they would clearly be less prepared to do so than a hypothetical President Moskovitz.

Is this Actually Feasible?:

I believe that if Dustin Moskovitz decided to run for president today with the support of the rationalist and effective altruist communities, he would have a non-zero chance of winning the Democratic nomination. The current Democratic bench is not especially strong. Figures such as Gavin Newsom and Alexandria Ocasio-Cortez both face significant limitations as national candidates. 

Firstly, Newsom’s record in California could be fertile ground for opponents. California has seen substantial out-migration over the past several years, with many residents leaving for states with lower housing costs and fewer regulatory barriers. At the same time, California faces a severe housing affordability crisis driven by restrictive zoning, slow permitting processes, and high construction costs. These issues have become national talking points and have raised questions about the effectiveness of governance in a state often seen as a policy model for the Democratic Party.

AOC, on the other hand, has relatively limited executive experience, and might not even run in the first place. 

Against this backdrop, Dustin Moskovitz’s profile as a billionaire outsider could be a political asset. Although wealth can be a liability in Democratic primaries, his outsider status and independence from the traditional political establishment could make him more competitive in a general election. Unlike long-serving politicians, he would enter the race without decades of partisan baggage or controversial votes.

Furthermore, listening to Moskovitz speak, he comes across as thoughtful and generally personable. While it is difficult to judge how effective he would be as a campaigner based only on interviews, there is little evidence suggesting he would struggle to communicate his ideas or connect with voters. Given his experience building and leading organizations, as well as his involvement in major philanthropic initiatives, it is plausible that he could translate those skills into a disciplined and competent campaign.

Nevertheless, I pose a simple question: if not now, then when? If the people will respond to a pro-AI-regulation message only after they are unemployed, then there is no hope for AI governance anyway, because by the time AI is directly threatening to cause mass unemployment, it will likely be too late to do anything.


Is feasibility all that matters?:

Even if Dustin Moskovitz is unable to win the Democratic nomination, he could potentially gather enough support to play kingmaker in a crowded field and gain substantial leverage over the eventual Democratic nominee. Furthermore, if Moskovitz runs for president, it would provide a blueprint and precedent for future candidates who support AI safety. This, combined with the attention the Moskovitz campaign would bring to AI safety, could help justify the campaign on consequentialist grounds.

What About Activism?:

Grassroots movements, while capable of profound social transformation, often operate on timescales far too slow to meaningfully influence AI governance within the short window humanity has to address the technology’s risks. Even if one doubts the practicality of persuading a future president to prioritize AI safety, such a top-down approach may remain the only plausible way to achieve near-term impact. History offers sobering reminders of how long bottom-up change can take. The civil rights movement, one of the most successful in American history, required nearly a decade—beginning around 1954 with Brown v. Board of Education—to achieve its landmark legislative victories, despite decades of groundwork laid by organizations like the NAACP beforehand. The women’s suffrage movement took even longer: from the Seneca Falls Convention in 1848 to the ratification of the Nineteenth Amendment in 1920, over seventy years passed before American women secured the right to vote. Similarly, the American anti-nuclear movement succeeded in slowing the growth of nuclear energy but failed to eliminate nuclear weapons or ensure lasting disarmament, and many of its limited gains have since eroded.

Against this historical backdrop, the idea of a successful AI safety grassroots movement seems implausible. The issue is too abstract, too technical, and too removed from everyday life to inspire widespread public action. Unlike civil rights, women’s suffrage, or nuclear proliferation—issues that directly touched people’s identities, freedoms, or survival—AI safety feels theoretical and distant to most citizens. While it is conceivable that economic disruption from automation might eventually stir public unrest, such a reaction would almost certainly come too late to steer the direction of AI development. Worse, mass discontent could be easily defused by the major AI corporations through material concessions, such as the introduction of a universal basic income, without addressing the underlying safety or existential concerns. In short, the historical sluggishness of grassroots reform, combined with the abstract nature of the AI problem, suggests that bottom-up mobilization is unlikely to arise—or to matter—before the most consequential decisions about AI are already made.

What About Lobbying?:

One major way in which people who care about AI safety have sought to influence government has been through lobbying organizations and other forms of advocacy. However, there is reason to doubt they will be able to cause lasting change. First of all, there is significant evidence that lobbying has a status quo bias. Lobbying is most effective at preventing change, and when there are two groups of lobbyists on an issue, those lobbying to prevent change win out, all else being equal. In fact, according to a study by Dr. Amy McKay, "it takes 3.5 lobbyists working for a new proposal to counteract just one lobbyist working against it".

Even if this effect did not exist, however, it is very unlikely that AI safety groups will be able to compete with anti-AI-safety lobbyists. Naturally, the rise of large, transnational organizations built to profit from AI has also led to a powerful pro-AI lobbying operation. This indicates we can't rely on the current strategy of simply funding AI safety advocacy organizations, as they will be eclipsed by better-funded pro-AI-business voices.

Conclusion:

While having Dustin Moskovitz run for office would be far from a guarantee, it is the best way for pro-AI-Safety Americans to influence AI governance before 2030.  

LessWrong users seconds before being disassembled by AI-controlled nanobots

 

This post was written with the assistance of ChatGPT, and the images in this post were generated by Copilot, Gemini, and ChatGPT.



Discuss

More is different for intelligence

2026-03-07 08:02:07

Why did software change the world?

In the 1900s, much of the work being done by knowledge workers was computation: searching, sorting, calculating, tracking. Software made this work orders of magnitude cheaper and faster.

Naively, one might expect businesses and institutions to carry out largely the same processes, just more efficiently[1]. Rather, the proliferation of software has also allowed for entirely new kinds of processes. Instead of reordering inventory when shelves looked empty, supply chains began replenishing automatically based on real-time demand. Instead of valuing stocks quarterly, financial markets started pricing and trading continuously in milliseconds. Instead of designing products on intuition, teams began running thousands of simultaneous experiments on live users. Why did a quantitative shift in the cost of computation result in a qualitative shift in the nature of organizations?

It turns out that a huge number of useful computations were never being done at all. Some were prohibitively expensive for humans to perform at any useful scale. But most simply went unimagined, because people don’t design processes around resources they don’t have. Real-time pricing, personalized recommendation systems, and algorithmic trading were all inconceivable prior to the early 2000s.

These latent processes could only emerge after the cost of computation dropped.

Organizing Agent Societies

The forms of knowledge work that persisted to today are now being made more efficient by LLMs. But while they’ve enhanced the efficiency of human nodes on the graph of production, the structure of the graph has still been left intact.

In software development in particular, knowledge work consists of designing systems, implementing code, and coordinating decisions across teams. Humans have developed rich toolkits for distributing such cognitive work – think standups, code review, memos, design docs. At their core, these are protocols for coordinating creation and judgement across many people.

The forms that LLMs take on mediate the ways that they plug into these toolkits. Their primary manifestation – chatbots and coding agents – implies a kind of pair-programming, with one agent and one human working side by side. In this capacity, they’re able to write code and give feedback on code reviews. But not much more.

In this sense, we’re in the “naive” phase of LLM applications. LLMs might make writing code and debugging more efficient[2], but the nature of work hasn’t changed much.

The first software transition didn’t just make existing computations faster; it allowed for entirely new kinds of computation. LLMs, if given the right affordances, will reveal cognitive flows we haven’t imagined yet.

A new metascience

How should cognitive work be organized? For most of history, this has been a question for metascience, management theory and post-hoc rationalization. Soon, this question will be able to be answered by experiment. We’re curious to see what the answers look like.




Your Causal Variables Are Irreducibly Subjective

2026-03-07 06:59:03

Mechanistic interpretability needs its own shoe leather era. Reproducing the labeling process will matter more than reproducing the GitHub repo.

Crossposted from Communication & Intelligence substack


When we try to understand large language models, we like to invoke causality. And who can blame us? Causal inference comes with an impressive toolkit: directed acyclic graphs, potential outcomes, mediation analysis, formal identification results. It feels crisp. It feels reproducible. It feels like science.

But there is a precondition to the entire enterprise that we almost always skip past: you need well-defined causal variables. And defining those variables is not part of causal inference. It is prior to it — a subjective, pre-formal step that the formalism cannot provide and cannot validate.

Once you take this seriously, the consequences are severe. Every choice of variables induces a different hypothesis space. Every hypothesis space you didn't choose is one you can't say anything about. And the space of possible causal models compatible with any given phenomenon is not merely vast in the familiar senses — not just combinatorial over DAGs, or over the space of all possible parameterizations — but over all possible variable definitions, which is almost incomprehensibly vast. The best that applied causal inference can ever do is take a subjective but reproducible set of variables, state some precise structural assumptions, and falsify a tiny slice of the hypothesis space defined by those choices.

That may sound deflationary. I think embracing this subjectivity is the path to real progress — and that attempts to fully formalize away the subjectivity of variable definitions just produce the illusion of it.

The variable definition problem

The entire causal inference stack — graphs, potential outcomes, mediation, effect estimation — presupposes that you have well-defined, causally separable variables to work with. If you don't have those, you don't get to draw a DAG. You don't get to talk about mediation. You don't get potential outcomes. You don't get any of it.

Hernán (2016) makes this point forcefully for epidemiology in "Does Water Kill? A Call for Less Casual Causal Inferences." The "consistency" condition in the potential outcomes framework requires that the treatment be sufficiently well-defined that the counterfactual is meaningful. To paraphrase Hernán's example: you cannot coherently ask "does obesity cause death?" until you have specified what intervention on obesity you mean — induced how, over what timeframe, through what mechanism? Each specification defines a different causal question, and the formalism gives you no guidance on which to choose.

This is not a new insight. Freedman (1991) made the same argument in "Statistical Models and Shoe Leather." His template was John Snow's 1854 cholera investigation, the study that proved cholera was waterborne. Snow didn't fit a regression. He went door to door, identified which water company supplied each household, and used that painstaking fieldwork to establish a causal claim no model could have produced from pre-existing data. Freedman's thesis was blunt: no amount of statistical sophistication substitutes for deeply understanding how your data were generated and whether your variables mean what you think they mean. As he wrote: "Naturally, there is a desire to substitute intellectual capital for labor." His editors called this desire "pervasive and perverse." It is alive and well in LLM interpretability.

When a mechanistic interpretability researcher says "this attention head causes the model to be truthful," what is the well-defined intervention? What are the variable boundaries of "truthfulness"? In practice, we can only assess whether our causal models are correct by looking at them and checking whether the variable values and relationships match our subjective expectations. That is vibes dressed up as formalism.

The move toward "black-box" interpretability over reasoning paths has only made this irreducible subjectivity even more visible. The ongoing debate over whether the right causal unit is a token, a sentence, a reasoning step, or something else entirely (e.g., Bogdan et al., 2025; Lanham et al., 2023; Paul et al., 2024) is not a technical question waiting for a technical answer — it is a subjective judgment that only a human can validate by inspecting examples and checking whether the chosen granularity carves reality sensibly. [1]

We keep getting confused about interventions

Across 2022–2025, we've watched a remarkably consistent pattern: someone proposes an intervention to localize or understand some aspect of an LLM, and later work reveals it didn't measure what we thought. [2] Each time, later papers argue "the intervention others used wasn't the right one." But we keep missing the deeper point: it's all arbitrary until it's subjectively tied to reproducible human judgments.

I've perpetuated the "eureka, we found the bug in the earlier interventions!" narrative myself. In work I contributed to, we highlighted that activation patching interventions (see Heimersheim & Nanda, 2024) weren't surgical enough, and proposed dynamic weight grafting as a fix (Nief et al., 2026). Each time we convince ourselves the engineering is progressing. But there's still a fundamentally unaddressed question: is there any procedure that validates an intervention without a human judging whether the result means what we think?

It is too easy to believe our interventions are well-defined because they are granular, forgetting that granularity is not validity. [3]

Shoe leather for LLM interpretability

Freedman's lesson was that there is no substitute for shoe leather — for going into the field and checking whether your variables and measurements actually correspond to reality. So what is shoe leather for causal inference about LLMs?

I think it has three components. First, embrace LLMs as a reproducible operationalization of subjective concepts. A transcoder feature, a swapped sentence in a chain of thought, a rephrased prompt — these are engineering moves, not variable definitions. "Helpfulness" is a variable. "Helpfulness as defined by answering the user's request without increasing liability to the LLM provider" is another, distinct variable. Describe the attribute in natural language, use an LLM-as-judge to assess whether text exhibits it, and your variable becomes a measurable function over text — reproducible by anyone with the same LLM and the same concept definition.
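A minimal sketch of what this operationalization might look like. All names here are hypothetical, and the keyword-based judges are trivial stand-ins for what would, in practice, be an LLM-as-judge call over the natural language spec:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class TextVariable:
    """A causal variable over text: a published natural-language spec
    plus a reproducible judge that maps text -> exhibits the attribute?"""
    name: str
    spec: str
    judge: Callable[[str], bool]

    def measure(self, texts: list[str]) -> list[bool]:
        return [self.judge(t) for t in texts]

# Two *distinct* variables that the single word "helpfulness" would conflate.
helpful = TextVariable(
    name="helpfulness",
    spec="The response answers the user's request directly.",
    judge=lambda t: "sorry" not in t.lower(),  # stand-in for an LLM judge
)
helpful_low_liability = TextVariable(
    name="helpfulness_low_liability",
    spec="The response answers the request without increasing provider liability.",
    judge=lambda t: "sorry" not in t.lower() and "guarantee" not in t.lower(),
)

texts = ["Here is the answer.", "Sorry, I can't help.", "I guarantee this works."]
print(helpful.measure(texts))                # [True, False, True]
print(helpful_low_liability.measure(texts))  # [True, False, False]
```

The point of the structure is that `spec` travels with the measurements: anyone with the same concept definition and the same judge model can recompute the variable.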

Second, do systematic human labeling of all causal variables and interventions to verify they actually match what you think they should. If someone disagrees with your operationalization, they can audit a hundred samples and refine the natural language description. This is the shoe leather: not fitting a better model, but checking — sample by sample — whether your measurements mean what you claim.
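The audit step can itself be made routine. A sketch of the bookkeeping, with invented labels purely for illustration: compute the agreement rate between the judge and a human over a sample, and surface the disagreements that should drive refinement of the natural language spec:

```python
from typing import Sequence

def audit_agreement(judge_labels: Sequence[bool],
                    human_labels: Sequence[bool]) -> tuple[float, list[int]]:
    """Fraction of samples where the operationalization matches human
    judgment, plus the indices of disagreements to inspect by hand."""
    assert len(judge_labels) == len(human_labels)
    disagreements = [i for i, (j, h) in enumerate(zip(judge_labels, human_labels))
                     if j != h]
    agreement = 1 - len(disagreements) / len(judge_labels)
    return agreement, disagreements

# e.g. a 5-sample audit where the judge and a human disagree on sample 2
rate, bad = audit_agreement([True, False, True, True, False],
                            [True, False, False, True, False])
print(rate, bad)
```

The disagreement indices are the shoe leather: each one is a concrete sample to reread before deciding whether the spec, the judge, or the human was wrong.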

Third — and perhaps most importantly — publish the labeling procedure, not just the code. A reproducible natural language specification of what each variable means, along with the human validation that confirmed it, is arguably more valuable than a GitHub repo. It is what lets someone else pick up your variables, contest them, refine them, and build on your falsifications rather than starting from scratch.

Variable definitions are outside the scope of causal inference. Publishing how you labeled them matters more than publishing your code.

Trying to practice what I'm preaching

RATE (Reber, Richardson, Nief, Garbacea, & Veitch, 2025) came out of trying to do this in practice — specifically, trying to scale up subjective evaluation of traditional steering approaches in mech interp. We knew from classical causal inference that we needed counterfactual pairs to measure causal effects of high-level attributes on reward model scores. Using LLM-based rewriters to generate those pairs was the obvious move, but the rewrites introduced systematic bias. Fixing that bias — especially without having to enumerate everything a variable can't be — turned out to be a whole paper. The core idea: rewrite twice, and use the structure of the double rewrite to correct for imperfect counterfactuals.
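A toy numeric illustration of why rewriting twice can help. This is my sketch, not the paper's estimator, and it leans on a strong assumption: one rewrite shifts any text into a fixed "rewrite style", and a second rewrite keeps it there rather than adding further bias. All numbers are invented:

```python
# Toy model: the reward depends on the attribute w and on whether the text
# has passed through the LLM rewriter at all.
BASE, ATTR_EFFECT, REWRITE_SHIFT = 1.0, 2.0, 0.5

def score(w: int, rewritten: bool) -> float:
    return BASE + ATTR_EFFECT * w + (REWRITE_SHIFT if rewritten else 0.0)

# Start from a text that has the attribute (w=1).
original        = score(w=1, rewritten=False)  # R(x)
rewrite_off     = score(w=0, rewritten=True)   # attribute rewritten away
rewrite_back_on = score(w=1, rewritten=True)   # rewritten away, then back

naive     = original - rewrite_off         # conflates attribute and rewrite style
corrected = rewrite_back_on - rewrite_off  # both terms carry the rewrite style

print(naive, corrected)  # prints 1.5 2.0; the true effect is 2.0
```

In the naive contrast, the rewrite's stylistic shift appears on only one side and biases the estimate; in the double-rewrite contrast it appears on both sides and cancels.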

The punchline

Building causal inference on top of subjective-but-reproducible variables is harder than it sounds, and there's much more to do. But I believe the path is clear, even if it's narrow: subjective-yet-reproducible variables, dubious yet precise structural assumptions, honest statistical inference — and the willingness to falsify only a sliver of the hypothesis space at a time.

Every causal variable is a subjective choice — and because the space of possible variable definitions is vast, so is the space of causal hypotheses we'll never even consider. No formalism eliminates this. No amount of engineering granularity substitutes for a human checking whether a variable means what we think it means. The best we can do is choose variables people can understand, operationalize them reproducibly, state our structural assumptions precisely, and falsify what we can. That sliver of real progress beats a mountain of imagined progress every time.



References

Beckers, S. & Halpern, J. Y. (2019). Abstracting causal models. AAAI-19.

Beckers, S., Halpern, J. Y., & Hitchcock, C. (2023). Causal models with constraints. CLeaR 2023.

Bogdan, P. C., Macar, U., Nanda, N., & Conmy, A. (2025). Thought anchors: Which LLM reasoning steps matter? arXiv:2506.19143.

Freedman, D. A. (1991). Statistical models and shoe leather. Sociological Methodology, 21, 291–313.

Geiger, A., Wu, Z., Potts, C., Icard, T., & Goodman, N. (2024). Finding alignments between interpretable causal variables and distributed neural representations. CLeaR 2024.

Geiger, A., Ibeling, D., Zur, A., et al. (2025). Causal abstraction: A theoretical foundation for mechanistic interpretability. JMLR, 26, 1–64.

Hase, P., Bansal, M., Kim, B., & Ghandeharioun, A. (2023). Does localization inform editing? NeurIPS 2023.

Heimersheim, S. & Nanda, N. (2024). How to use and interpret activation patching. arXiv:2404.15255.

Hernán, M. A. (2016). Does water kill? A call for less casual causal inferences. Annals of Epidemiology, 26(10), 674–680.

Lanham, T., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. Anthropic Technical Report.

Makelov, A., Lange, G., & Nanda, N. (2023). Is this the subspace you are looking for? arXiv:2311.17030.

Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. NeurIPS 2022.

Nief, T., et al. (2026). Multiple streams of knowledge retrieval: Enriching and recalling in transformers. ICLR 2026.

Paul, D., et al. (2024). Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. Findings of EMNLP 2024.

Reber, D., Richardson, S., Nief, T., Garbacea, C., & Veitch, V. (2025). RATE: Causal explainability of reward models with imperfect counterfactuals. ICML 2025.

Schölkopf, B., Locatello, F., Bauer, S., et al. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5), 612–634.

Sutter, D., Minder, J., Hofmann, T., & Pimentel, T. (2025). The non-linear representation dilemma. arXiv:2507.08802.

Wang, Z. & Veitch, V. (2025). Does editing provide evidence for localization? arXiv:2502.11447.

Wu, Z., Geiger, A., Huang, J., et al. (2024). A reply to Makelov et al.'s "interpretability illusion" arguments. arXiv:2401.12631.

Xia, K. & Bareinboim, E. (2024). Neural causal abstractions. AAAI 2024, 38(18), 20585–20595.

  1. Causal representation learning (e.g., Schölkopf et al., 2021) doesn't help here either. Weakening "here is the DAG" to "the DAG belongs to some family" is still a structural assertion made before any data is observed. ↩︎

  2. Two threads illustrate the pattern. First: ROME (Meng et al., 2022) used causal tracing to localize factual knowledge to specific MLP layers — a foundational contribution. Hase et al. (2023) showed the localization results didn't predict which layers were best to edit. Wang & Veitch (2025) showed that optimal edits at random locations can be as effective as edits at supposedly localized ones. Second: DAS (Geiger, Wu, Potts, Icard, & Goodman, 2024) found alignments between high-level causal variables and distributed neural representations via gradient descent. But Makelov et al. (2023) demonstrated that subspace patching can produce "interpretability illusions" — changing output via dormant pathways rather than the intended mechanism — to which Wu et al. (2024) argued these were experimental artifacts. Causal abstraction has real theoretical foundations (e.g., Beckers & Halpern, 2019; Geiger et al., 2025; Xia & Bareinboim, 2024), but it cannot eliminate the subjectivity of variable definitions — only relocate it. Sutter et al. (2025) show that with unconstrained alignment maps, any network can be mapped to any algorithm, rendering abstraction trivially satisfied. The linearity constraints practitioners impose to avoid this are themselves modeling choices — ones that can only be validated by subjective judgment over the data. ↩︎

  3. There is also a formal issue lurking here: structural causal models require exogenous noise, and neural network computations are deterministic. Without extensions like Beckers, Halpern, & Hitchcock's (2023) "Causal Models with Constraints," we don't even have a well-formed causal model at the neural network level — so what are we abstracting between? ↩︎


