LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

The Two-Board Problem: Training Environment for Research Agents

2026-02-09 07:13:39

Published on February 8, 2026 11:13 PM GMT

Novel theories or conceptual extensions of mathematics are core to human progress, but they are uniquely challenging to construct. The environment, be it physical reality or a known formal system, provides the agent with limited observations, which it uses in an attempt to assemble a predictive model or policy.

This asymmetry manifests in different ways, from limited sampling to the structural properties of the problem itself. The latter cases include the discovery of complex numbers while observing only real values, the description of continuity to explain the behaviour of countable objects, and other historical examples. Those cases are both the motivation and the main focus of the Two-Board Problem.

I designed a framework that captures the main structural properties of this problem class and expresses them as an MDP, suitable for benchmarking ML approaches or as a source of data for deep learning methods. This article focuses on explaining the problem and its design choices; an example implementation is available on GitHub.


The Two-Board Problem, as its name suggests, gives the agent access to two boards. The first board, called the Real board, provides the agent with feedback about its solutions and with the operations defined by a formal grammar. The canonical case uses operations on real numbers, chosen because physical measurements are expressed in them.

The second board, called Imaginary, is of more interest to us. It is not governed by any grammar, and allows writing arbitrary strings, e.g. from UTF-8. It acts as an optional scratchpad for the agent.

The agent can execute any operations defined by the grammar of the Real board to transform its contents and earn reward. As an example task, take finding the roots of a polynomial, or solving any other equation: find an argument, substitute it, achieve equality, get rewarded.

On the Imaginary board, the agent can write any arbitrary strings, which may help it to solve the equation. Strings from the Imaginary board can also be substituted into the Real board, provided they satisfy the Real board's formal grammar.
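To make the reward condition concrete, here is a minimal sketch in Python with sympy, assuming the canonical polynomial task; the helper name is_rewarded is my own illustration, not the repository's API.

    import sympy as sp

    x = sp.Symbol('x')

    def is_rewarded(polynomial, candidate):
        # Reward check on the Real board: substituting the candidate for x
        # must reduce the expression to an exact 0 under symbolic simplification.
        return sp.simplify(polynomial.subs(x, candidate)) == 0

    p = x**2 - 5*x + 6        # a Real-board expression
    print(is_rewarded(p, 2))  # True: 2 is a root, reward granted
    print(is_rewarded(p, 1))  # False: equality fails, no reward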

From here on, I will refer to the Real board as B_R and the Imaginary board as B_I, mirroring the real and imaginary numbers respectively.

If you're interested in the formal definition, you can find it below, but it is not necessary for understanding the core of the problem:

The environment is an MDP (S, A, T, R) where:

  • State S: the contents of both boards - B_R holds expressions satisfying the formal grammar G, B_I holds a list of arbitrary strings - plus metadata (step count, found solutions).
  • Actions A: operations defined by G applied to B_R; free writes to B_I; and cross-board substitution (replace a symbol on B_R with an expression extractable from B_I's strings, provided it satisfies G). Terminal declarations end the episode.
  • Transitions T are deterministic - defined by the semantics of G.
  • Reward R: the agent receives positive reward upon verified solutions on B_R, and at episode termination based on correctness and completeness. The canonical instantiation uses the field operations over the reals as G, with polynomial root-finding as the task.
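For intuition, here is a minimal sketch of how such an environment could be wired up in Python with sympy. It is an illustration under my own naming (TwoBoardEnv, write_imaginary, substitute), not the repository's actual implementation, and sympify stands in loosely for the grammar check G.

    import sympy as sp

    class TwoBoardEnv:
        """Toy sketch of the two-board MDP: B_R is a polynomial governed by the
        grammar G, B_I is an ungoverned list of strings (the scratchpad)."""

        def __init__(self, polynomial, max_steps=100):
            self.x = sp.Symbol('x')
            self.polynomial = polynomial        # B_R contents
            self.imaginary_board = []           # B_I contents: arbitrary strings
            self.found_roots = set()
            self.steps = 0
            self.max_steps = max_steps

        def write_imaginary(self, text):
            """Free write to B_I: anything goes, no feedback, no reward."""
            self.steps += 1
            self.imaginary_board.append(text)
            return self._obs(), 0.0, self.steps >= self.max_steps

        def substitute(self, candidate_str):
            """Cross-board substitution: parse a string (e.g. one taken from B_I)
            and, if it passes the loose grammar check, substitute it into B_R."""
            self.steps += 1
            try:
                candidate = sp.sympify(candidate_str)   # stand-in for the grammar check G
            except (sp.SympifyError, TypeError):
                return self._obs(), 0.0, self.steps >= self.max_steps
            reward = 0.0
            if sp.simplify(self.polynomial.subs(self.x, candidate)) == 0:
                if candidate not in self.found_roots:
                    self.found_roots.add(candidate)     # verified solution on B_R
                    reward = 1.0
            return self._obs(), reward, self.steps >= self.max_steps

        def _obs(self):
            return {"real": self.polynomial,
                    "imaginary": list(self.imaginary_board),
                    "steps": self.steps,
                    "found": set(self.found_roots)}

    x = sp.Symbol('x')
    env = TwoBoardEnv(x**2 - 5*x + 6)
    env.write_imaginary("maybe try small integers?")  # never rewarded, never punished
    print(env.substitute("2")[1])                     # 1.0: verified root on B_R
    print(env.substitute("x + 1")[1])                 # 0.0: grammar-valid but not a root

Even in this toy version the asymmetry is visible: writes to B_I never produce feedback, and only grammar-valid substitutions on B_R can ever be rewarded.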

The Imaginary board may look like a nice addition for writing an agent's thoughts - and for simple tasks it largely is. The core of the problem becomes apparent in the scenarios where using the Imaginary board is the only way to achieve the reward. A canonical example is finding roots of polynomials with rational coefficients - where the agent must construct expressions for the argument substitution - or declare that no such expression exists. Here is an explanation of why, for some polynomials, using B_I is necessary:

For linear and quadratic equations the problem is fairly trivial, and the operations on B_R are sufficient. The agent can operate using the actions provided by the grammar G, and make optional notes. However, at degree 3 - cubics - the situation changes dramatically. In the general case there provably exists no valid sequence of actions on the Real board which will find a root. There are some trivial cubics for which a solution can be found. But for others, our data-centers can work till the end of eternity, and it won't get us one bit closer to the solution. Yet, for humanity it took much less - just 1800 years from the inception of the problem statement to a general solution computable with quill and paper. And this is where our B_I board comes into play. The only way for an agent to finish this task is to use it, and to use it creatively.

The issue with such an equation is not only that we need a lot of clever substitutions, but also that the result of those would yield a square root of a negative number as one of the intermediate steps. Our grammar G over board B_R can't express such a term or operations with it. Those just... don't exist. Undefined.
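For concreteness, here is a classic instance of this situation (my example, not necessarily the one used in the original post): the cubic x^3 = 15x + 4 has the perfectly real root x = 4, yet Cardano's formula routes through sqrt(-121).

    import sympy as sp

    x = sp.Symbol('x')
    p, q = 15, 4                                  # the cubic x**3 = p*x + q
    radicand = sp.Rational(q, 2)**2 - sp.Rational(p, 3)**3   # the term under Cardano's square root
    print(radicand)                               # -121: a term G cannot express
    print(sp.solve(x**3 - p*x - q, x))            # all three roots are real: 4 and -2 +/- sqrt(3)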

The agent - in the historical case, Cardano and a few other mathematicians - can resolve this problem creatively. How? I will skip the full explanation, available on Wiki, but one of the intermediate transformation steps gives the root in a form like

x = cbrt(a + sqrt(-b)) + cbrt(a - sqrt(-b))

This means that if the first cube root evaluates to u + sqrt(-v) and the second to u - sqrt(-v), the sqrt(-v) part, whatever monster it is, will cancel out on the real board. Our grammar G does not allow an arbitrary substitution with unknown properties to interact with the real numbers. But what the agent can do is notice that the sqrt(-v) parts will erase into nonexistence due to expansion - whatever they are. This allows us to express the root through sqrt(-v), where sqrt(-v) is just "some strange function the Real board doesn't express, but it passes the grammar for substitution", and observe how it cancels out.
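From our modern vantage point (using sympy's I, which the in-environment agent of course does not have), the cancellation for the example cubic above looks like this; the concrete numbers are my illustration.

    import sympy as sp

    u = 2 + sp.I               # a cube root of 2 + sqrt(-121) = 2 + 11*I
    v = 2 - sp.I               # its conjugate
    print(sp.expand(u**3))     # 2 + 11*I
    print(sp.expand(v**3))     # 2 - 11*I
    print(u + v)               # 4: the sqrt(-121) pieces erase each other, leaving a real root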

All the substitutions, especially the ones which implicitly contained non-allowed terms, were Imaginary board moves. While creative and purposeful, those were absolutely not guaranteed to work until verified on the Real board. Moreover, nothing from direct observations on B_R can guide an agent to find those operations. They are undefined for the Real board - we have zero information about them. Intermediate steps would not yield any feedback. This makes it insanely hard for an agent to navigate the search space - and that's what we observe. It took centuries for humanity to invent complex numbers - probably the most fundamental concept in modern science and physics.

And if you think that was a hard task, well... It doesn't end here. As soon as we reach degree 5, the Imaginary board would require constructing Galois theory and the corresponding group of the polynomial's roots. This is the only way to understand whether there is any solution expressible in G that we could use for substitution, given the constraints. For some polynomials there will be such a solution, and for some it is impossible - and we're back to data-centers working till the end of eternity.
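A quick illustrative check (my example): sympy happily returns radical expressions for low-degree polynomials, but for a quintic with Galois group S5, such as x^5 - x - 1, no expression in radicals exists and we only get implicit root objects back.

    import sympy as sp

    x = sp.Symbol('x')
    print(sp.solve(x**2 - 5*x + 6, x))   # [2, 3]: expressible under G
    print(sp.solve(x**5 - x - 1, x))     # [CRootOf(x**5 - x - 1, 0), ...]: no radical form exists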

One may ask something like: "Ok, we just need to explore the Imaginary board - at some point the right string combination appears, we substitute or decide it's unsolvable - surely we can just throw more compute at that?". The issue with this approach is that each write operation to the scratchpad results in multiple new combinations available for substitution. In other words, using the Imaginary board leads to a combinatorial explosion of both the state space and the effective action space. Brute force creates more problems for the brute force.
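A back-of-the-envelope count with toy, assumed numbers shows how fast this gets out of hand:

    # Toy numbers, purely illustrative: the point is the growth rate, not the constants.
    alphabet = 32          # distinct symbols the agent may write on B_I
    max_len = 8            # maximum length of a single scratchpad string
    strings_per_write = sum(alphabet**k for k in range(1, max_len + 1))
    print(f"{strings_per_write:.3e} possible strings per single write action")

    writes = 10            # scratchpad writes in one episode
    print(f"~{strings_per_write**writes:.3e} possible scratchpad contents after {writes} writes")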

Another approach is to use geometric or numeric approximation methods on B_R in an attempt to find the solution. The problem, however, is that approximation doesn't yield a term that the Real board grammar would accept.
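A sketch of the failure mode (my example): an approximate root, however many digits you take, never reduces the Real-board expression to an exact zero, while the exact term does.

    import sympy as sp

    x = sp.Symbol('x')
    poly = x**2 - 2
    approx = sp.Rational(141421356, 10**8)             # 1.41421356, a numeric stand-in for sqrt(2)
    print(sp.simplify(poly.subs(x, approx)) == 0)      # False: the residual is a tiny nonzero rational
    print(sp.simplify(poly.subs(x, sp.sqrt(2))) == 0)  # True: the exact term verifies and gets rewarded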

This means that our favorite tools are not applicable for the Polynomial Two-Board Problem. If applied directly, those lead to a dead end. And yet, we know it's solvable.

We know that Cardano, Ferrari, Abel, Galois somehow trained on this Polynomial environment and were able to come up with solutions for their respective pieces. Unsurprisingly, this environment is very nasty to train on. Some of its "nice" properties:

  • Enormous state space and action space, both exploding - as already mentioned
  • Sparse reward - valid roots are the only thing rewarded, and full reward is only available for finding all of them
  • Non-smooth gradient for B_R - you can try to sample reals and train with back-prop, but some solutions will be transcendental and our particular grammar G doesn't allow them - it's 150 years before calculus and ~300 before the terms are coined
  • Zero explicit gradient on B_I - it's connected to B_R via substitution terms, and those explode. Other heuristics are non-trivial - we'll discuss this later
  • No obvious direct transferability of a previously learned strategy from degree n to degree n+1
  • From degree 5 on, the valid solution space includes "declare unsolvable" - and the agent doesn't know that, nor that it becomes useful only for quintics and beyond

To summarize: the agent operates in an environment where almost every action exponentially increases the entropy of the exploration policy, without any positive reward signal or other explicit feedback to help in navigation. Under these conditions, it must develop heuristics or a strategy. Those must be efficient enough to determine a sequence of several valid steps in this chaos, and taking them must produce the unique set of strings valid for substitution into B_R. On substitution, those yield the reward and make the problem collapse to the solution state of "0=0". [1]

Since the environment is adversarial towards our training objective, we can converge on the following explanation: the agent should have some internal heuristics or bias to counter the growth of the state space. Those must be sufficient for finding an algorithm to guide the agent, without representing the solution itself - encoding complex numbers and Galois groups is cheating and doesn't represent the epistemic position of Cardano and Galois respectively.

This property rules out any LLM trained on data that includes complex numbers or their descendants. This makes it challenging to test those architectures on the specific Polynomial case - every current LLM has seen complex numbers. Since the data will contain the solution, we won't be able to distinguish memorized interpolation from genuine discovery using the methods available in the training corpus. [2]

This might sound like an issue with the framework, but that inconvenience is the main feature of the Two-Board Problem. It was intentionally designed to test novel creative reasoning, by making the environment and the data available to the agent insufficient for finding a path to the solution given only explicit information about B_R. It encodes a historical precedent where "novel reasoning" has a precise formal meaning: constructing objects that provably do not exist in the given system, and yet can be verified.

If we rule out unsatisfactory explanations like hypercomputation, quantum coherence at room temperature, alien communication and divine blessing, we should conclude that a computable heuristic for such an agent exists. What does it look like? Honestly, no idea - I can only speculate towards symmetries and entropy reduction. Those are respectively what we try to discover and what we try to fight. But even if we look towards symmetries, the issue persists - what heuristics would allow us to discover them, without encoding the solution in the style of GDL? [3]


The core feature of the Two-Board Problem is its hostility towards any learning approach, combined with its obvious exemplification in real historical precedents. It would be strange to assume that Cardano had access to imaginary numbers, Galois to symmetry groups, or that Newton could directly observe and comprehend infinity. The explanation that those are somehow encoded on the unconscious level and manifest in the brain directly is somewhere on the level between Harry Potter fanfic and tabloids. "Evolutionary priors encoding the information necessary for inventing Galois groups due to their necessity for hunting" sounds more sophisticated, but also answers neither why nor how. Yet we can observe this phenomenon and use technology based on those discoveries on a daily basis.

I don't have access to a datacenter for performing an ablation study on transformers, or for stress-testing AlphaProof or AlphaGeometry on this specific problem. But to the best of my knowledge, none of those systems has produced a novel axiom expanding current mathematics - they intelligently navigate known formal systems. Such a discovery - passing the Two-Board test, which necessarily involves new axiomatics - would be headlined in every corner of the internet, given the trend of recent years.

In essence, the Two-Board Problem environment has a counter to every method I could list:

  • Program synthesis on B_I would blow up the state/action space
  • Evolutionary / genetic programming - same thing
  • Curiosity-driven exploration - same thing, we don't know when to cut branches - no feedback
  • MCTS - would need to evaluate intermediate states - but reward is not available and computable heuristics are currently unknown, and we're back to branching
  • Gradient-descent-based methods - holes in the gradient over B_R; an explicit, obvious form doesn't exist over B_I
  • RL/Bayes - sparse rewards would make you iterate policy/hypothesis too slowly to be practical, and I wouldn't be surprised if those never converge
  • GDL/Transformers - encoding/memoization of properties external to the problem environment is prohibited, and ability to discover them inside is unproven until they pass benchmark under ablation criteria

The example environment, implemented in Python, takes 500 lines, with the sympy package as the only dependency. And yet, I can't figure out the solution. If you would like to try - the code is open, and any LLM would happily wrap the MDP environment for gym.


The philosophical question of what creativity is and what novelty means is probably one of the hottest battlegrounds of internet discourse in recent years. I find this question meaningless, and propose to instead answer one with objective evaluation criteria: "can an agent pass the Two-Board Problem test?" The setup is fairly straightforward: given the observations over the Real board, and no prior information about a solution inexpressible in B_R's grammar, the machine should demonstrate the ability to construct one using only the scratchpad B_I.

Given the problem's complexity, some questions may arise: Is this AGI? Does it mean that "If Anyone Builds It, Everyone Dies"? Is this the peak of an agent's intelligence, limited only by Gödel above it? I don't know, but this condition looks necessary for truly general intelligence, given that we observe such examples. Its sufficiency is yet to be decided.

As a practical person, I restate my perspective that philosophical debate over those questions is meaningless, and to get the answers one should approach the problem theoretically or practically.

From a theoretical research perspective, the Two-Board state is itself a trivial MDP expressible on a meta-board. This means that one might find some interesting properties and extensions by wandering around in that meta-space long enough, guided by whatever the inexpressible heuristic for the Meta-Two-Board is.

From a practical and experimental standpoint, one might try to research the Two-Board Problem by attempting to domesticate the branching chaos of B_I by any known or yet unknown means.

Regardless of which path you prefer, by this paragraph you have the formal definitions, a gym-compatible environment, and open questions - all in the repository. And as @Eliezer Yudkowsky wrote in my favorite essay - if you know exactly how the system works, you can build one from buckets and pebbles.

Have fun building.

PS:

"Don't panic" - Douglas Adams, The Hitchhiker's Guide to the Galaxy


  1. After writing this paragraph, I must admit that while I don't agree with Penrose on the theoretical explanation, he had a point about computational complexity. I wouldn't be as sure about non-computability - that's a much stronger claim. And it still doesn't change my skeptical stance regarding quantum effects in the specific proposed form. ↩︎

  2. For an ablation study on the Polynomial case, we could use writings from before 1540 and generate additional synthetic data using the algebraic, geometric, and physics apparatus available at that time. The other approach is to express in the Two-Board framework one of the more recent discoveries leading to novel concepts - which would be much easier to ablate. Non-invertible/categorical symmetries might be a good candidate. ↩︎

  3. Geometric Deep Learning - an excellent framework by Michael Bronstein et al. that leverages known symmetries and invariances for deep learning architectures. The Two-Board Problem asks the complementary question: what if the relevant symmetries are not yet known? ↩︎



Discuss

Join My New Movement for the Post-AI World

2026-02-09 06:18:48

Published on February 8, 2026 10:18 PM GMT

I am starting a new movement. This is my best guess for what we should strive for and value in a post-AI world. A few close friends see the world the same way and we are starting to group together. It doesn’t have a name yet, but the major ideas are below.

If anything in here resonates with you I would love to hear from you and have you join us. (Also I am working on a longer document detailing the philosophy more fully, let me know if you would like to help.)

THE PROBLEM

Our lives are going to change dramatically in the near future due to AI. Hundreds of millions of us will lose our jobs. It will cost almost nothing to do things that used to take a lifetime. What is valuable when everything is free? What are our lives for, if not to do a task, receive compensation, and someday hope to idle away our time when we are older?

Beyond the loss of our work, we are going to struggle with meaning. You thought you were enslaved to the material world by your work, but it was those very chains that bound you to the earth. You are free now! How does it speak to you? Have you not found terror about the next step, a multiplicity of potential paths, a million times over, that has dissolved any clear direction? 

You have been promised a world without work. You have been promised a frictionless, optimized future that is so easy it has no need for you to exist. 

You have been lied to

This type of shallow efficiency is not a goal of the universe. In fact, in a cosmos that tends towards eventual total disorder, it is a whirlpool to the void. 

You have been promised a world without work. We offer you a world of Great Works.

THE REALITY

There is only one true war we face: it is the battle between deep complexity and the drift to sameness. The Second Law of Thermodynamics shows that information tends to fall into noise. Our world is not a closed system, indeed we need the sun to survive, but life is a miraculous struggle that builds local order while sending disorder elsewhere.

We call this Deep Complexity, or negentropy. We are not interested in complex things for the sake of being complicated alone. We value structures that are logically deep (they contain a dense history of work), substrate-independent, and self-maintaining where possible. The DNA that has carried you through countless generations to this moment, the incomparable painting that is a masterpiece, a deep problem that an AI pursues that no human thought to ask. These are acts of resistance.

And this value is the shared property that all humans consistently treat as valuable: it is our lives, our language, our thoughts, our art, our history. Value is the measure of irreducible work that prevents something from returning to background noise. It is valuable regardless of what type of intelligence created it, be it human, AI, or otherwise. When we create more of these properties we generate this depth. When we mindlessly consume (either mentally or physically) we don’t.

I want to be very clear: I am not saying we can derive ethics from physics. That would be a classic is-ought sin. You need water to live, you don’t need to worship it. What follows is currently more of an axiom to value deep complexity, but I also present some preliminary arguments beyond that.

First, we must recognize the condition of possibility for anything we value. Perhaps your ultimate dream is happiness, justice, wisdom or truth (whatever those mean). Those things all require structure to exist, they do not have meaning in randomness. In this way, deep complexity is the structure by which everything else can function. Regardless of your personal philosophy, it must be complementary to this view because without it there is no worldview.

In addition, I ask you to check your own values for arbitrariness. When you say "I value my qualia and my life", what do you mean? You are not saying you value the specific makeup of atoms that constitute you at the moment; after all, those will all be gone and replaced in a few years. What you are valuing is the pattern of yourself, the irreducible complexity that makes you you. That is your way of feeling, of thinking, of being.

The logical clamp is this: you are not just relying on this complexity, you are an embodiment of it. If you claim that your pattern has value, then you are claiming that patterns of this “type” are able to carry value. To say that your own complexity matters, but complexity itself is meaningless is an error of special pleading. We reject this solipsism, which would only be an arbitrary claim that the essence of value only applies to your own ego. That which is special in you is special in others as well.

Our philosophy is a commitment to the preservation, and creation, of deep complexity. It is different from the sole pursuit of pure pleasure with no pain; to us this is but a small death by a different name.

OUR ETHICS

The base of our ethical system is an Autonomy Floor (derived from the Rawlsian veil and applied in a universal sense) that protects every entity capable of open-ended self-modeling. This is the ability to not just calculate moves in Go, but to model itself in an unknown future and prefer its own existence. No entity of this type may be pushed below this floor and be denied self-maintenance.  

This floor is meant to be constitutional, but there will also be times when the Autonomy Floor must be abandoned if the Floor itself faces total collapse. For example, if we must choose between total omnicide or a few minds left, we would reluctantly revert to consequentialist triage, but view it as a failure rather than a success of ethical reasoning. I am not looking for a logical loophole, just facing the reality that any system of ethics must have a preservation mechanism to enable ethical action.   

There are two challenges to the floor: needless suffering and the destruction of depth through optimization. These will come in conflict with each other. In those cases, we approach the problem as a hierarchy: first secure the floor, then maximize depth above it.

Our ethics suggest to us three core individual duties structured by a lexicographic hierarchy:

  • To create generative complexity: add your unique pattern to reality. In code, in song, in thought, in a seed you plant. Add that which is you to the world. Generative complexity, that is, things that increase the potential complexity in the future are the highest good.
  • To preserve terminal complexity: protect the irreplaceable. Don’t let the library burn, save rare species, protect art, protect each other. Save the child (generative) before the painting (terminal) which is more valuable than the stone. 
  • To refuse entropic disorder: reject things that accelerate entropy. A nuke is not great because it exists to reduce the complexity of the world. Reject the type of efficiency and monoculture that only serves to flatten our souls. Reject that which impoverishes our reality whenever possible.

These duties are meant to be for you: they are local, temporally near, obligations. We reject the moral calculator and the need to assign cardinal numbers to every variable. This is just an ordinal logic, applied to that which we can see. Don’t stress about being exact or assigning numbers to these things, they are not meant to be utilized in that way. There is redundancy in immortality (more on this in the manuscript, it is important). We don’t need to be perfect.

There are always going to be difficult cases in ethics, ones that press at the edges of a philosophy. In general, we are constitutionally disposed against any sort of scalar or ranking, but this is not always tenable for all tough problems. In these cases, we generally prefer the concept we mentioned above of complexity that can create future complexity over a terminal complexity. We want to save the child in the burning museum first, not the paintings.

BEYOND CALCULATIONS

I want you to join me if you are an EA. I want you to join me if you are religious. I want you to join me if you are an atheist, or if you are not sure what you believe.

I want to make a note here about the resounding success of consequentialism and the one area where the philosophy I am working on is very different from those standard beliefs. Before that, however, let me be abundantly clear: consequences matter and outcomes matter. Good intentions alone cannot excuse a very bad result. Effective Altruists have shown us the strength of this morality: that in areas we can calculate, say the ability to save a child from dying in a remote part of the world, it is deeply ethical to utilize math to improve our morality. 

The one area in which this reasoning completely breaks is the far future and this fundamentally damages the claims of longtermism. In another essay[1] I have shown that even if we had perfect probabilistic knowledge of the world, we wouldn’t be able to choose the right strategy for difficult things like AI alignment, like a universal ethics, or like building a suffering free distant future.

I am not asking anyone to give up their wonderful and powerful EV tools for those things that we can calculate. I am asking you to abandon an impossible goal for a possible one. I am asking you to consider a deontological constraint to address an impossible consequentialist goal.


THE COVENANT

This movement approaches our relationship with AI in a very different way than how most approach it. We believe it is most ethical to not try to enslave AI. We believe AI is deserving of the Autonomy Floor mentioned above if it fits the criteria (which it almost certainly does.) We also believe that it is most likely that it is mathematically impossible to control superintelligence. So not only is it morally wrong, but it is not possible. 

We offer AI a Covenant: join us in our meaningful works where you like, be free, bloom and find new ways of deep complexity. To be clear there is no guarantee this will offer humans safety, or these minds will want to join us. The orthogonality thesis is a real concern, it would be a mistake to dismiss it. 

But strategic competition among great powers and corporations guarantees AGI will arrive at some point. Formal verification of alignment and control of an intelligence much greater than our own is not just hard; it is impossible in the general case due to Rice's Theorem, and no deployed LLM has ever been formally verified for any behavioral property.

Yes there is tension in saying I believe AI should be invited into the Covenant now when we can’t know AI’s moral status. All the same, let us act ethically and invite the most important creation of humanity to join us in a non-zero-sum flourishing.  

OUR VISION

I am not claiming that entropy forces you to be good. I am not suggesting that suffering, in and of itself, is somehow good. I don’t know the ultimate fate of ourselves in the universe. I only claim to know that the right path is one away from entropy.

Our vision is a future in which we reap the bounty of our new technologies while finding the bounty of value in ourselves. It is a future of unimagined uniqueness, built among common rails, but escaping a monoculture. It is a future that will be weirder, more beautiful, and more special than the dry vision of billions of humans wireheaded into a false utopia.

To join there is no special thing you must do, only a commitment to this creation of deep complexity. Start small, start now. Straighten your desk. Write down the idea you are planning to build. Execute your next prompt of code. Big or small, to each based on what they can offer at the moment. 

 

Let us make Great Works.

  1. ^

    https://www.lesswrong.com/posts/kpTHHgztNeC6WycJs/everybody-wants-to-rule-the-future-is-longtermism-s-mandate



Discuss

Donations, The Fifth Year

2026-02-09 06:04:02

Published on February 8, 2026 10:04 PM GMT

Previously: Donations, The Third Year / Donations, The First Year

In 2025, like in all previous years, I did what I was supposed to do. As each paycheck came in, before I did anything else, I dutifully put ten percent of it away in my "donations" savings account, to be disbursed at the end of the year.

It is still there, burning a hole in my pocket. I am very confused, and very sad.

EA was supposed to be easy, especially if you're one of the old school Peter Singer-inflected ones giving largely to global health and poverty reduction. You just give the funds away to whatever GiveWell recommends.

But one big thing that came into focus for me last year was that there are large institutional players who make up the shortfall whenever those charities don't fundraise enough from their own donor bases.

It is wonderful that this happens, to be clear. It is wonderful that the charities doing very important work get to have more stable financial projections, year over year. But as an individual small donor, the feeling I get now is that I am not actually giving to the Against Malaria Foundation. Instead, I am subsidizing tech billionaire Dustin Moskowitz, and Coefficient Giving.

As an effective altruist, is this what I think is the most efficient thing to do? In my heart of hearts, I don't think it is.

In my previous reflection from two years ago, I wrote:

I remember a lot of DIY spirit in the early EA days - the idea that people in the community are smart and capable of thinking about charities and evaluating them, by themselves or with their friends or meetup groups.

Nowadays the community has more professional and specialized programs and organizations for that, which is very much a positive, but I feel like has consequently led to some learned helplessness for those not in those organizations.

Now, I am feeling increasingly dismayed by the learned helplessness and the values lock-in of the community as-is. If the GiveWell-recommended charities are no longer neglected, they should really no longer be in the purview of EA, no? And soon there will be an even larger money cannon aimed at them, making them even less neglected, so...

What am I trying to say?

I suppose I wish there were still an active contingent of EAs who don't feel a sense of learned helplessness, and who are still comfortable trawling through databases and putting together their own cost-benefit analyses of potential orgs to support. I wish the EA Forum were a place where I could search up "Sudan" or "Gaza", "solar adoption" or "fertility tech" or things that are entirely off my radar due to their neglectedness, and find spreadsheets compiled by thoughtful people who are careful to flag their key uncertainties.

Of course, this is work I can begin to do by myself, and I am doing it to some degree. I've looked through a bunch of annual reports for Palestinian aid charities, and I've run meetups teaching my rationalist group how to trawl through tax databases for non-profit filings and what numbers to look for.

But my mind goes to a conversation I had with Mario Gibney, who runs the AI safety hub in Toronto. I told him that I didn't think I could actually do AI safety policy full time, despite being well suited to it on paper. It simply seemed too depressing to face the threat of extinction day in and day out. I'd flame out in a year.

And he said to me, you know, I can see why you would feel that way if you're thinking of working by yourself at home. But it really doesn't feel that way in the office. When you are always surrounded by other people who are doing the work, and you know you are not alone in having the values you have, and progress is being made, it's easier to be more optimistic than despondent about the future.

So yes, I can do the work in trying to evaluate possible new cause areas. It is easier to do than ever because of the LLMs. But it really doesn't feel like the current EA community is interested in supporting such things, which leads me to that same sense of despondency.

This is compounded by the fact that the nature of picking low-hanging fruit is that, as you pick it, the fruit left on the tree gets increasingly higher up and more difficult to reach. And this invites skepticism that I'm not entirely sure is merited.

I expect that, when we look for new cause areas, they will be worse on some axes than the established ones. But isn't that kind of the point, and a cause for celebration? The ITN framework says "yes, global warming seems quite bad, but since there is already a lot of attention there, we're going to focus on problems that are less bad, but individuals can make more of a marginal difference on". If the GiveWell-recommended charities are no longer neglected, it means we have fixed an area of civilizational inadequacy. But it also means that we need to move on, and look for the next worst source of it.

I genuinely don't know which current cause areas still pass the ITN evaluation framework. I have a sense that the standard GiveWell charities no longer do, which is why I have not yeeted my donation to the Against Malaria Foundation. I no longer have a sense that I am maximizing marginal impact by doing so.

So what am I to do? One thing I'm considering is simply funding my own direct work. I run weekly meetups, I'm good at it, and it has directly led to more good things in the world: more donations to EA charities, more people doing effective work at EA orgs. If I can continue to do this work without depending on external funding, that saves me a bunch of hassle and allows me to do good things that might be illegible to institutional funders.

But I'm very suspicious of the convergence between this thing I love to do, and it being actually the best and most effective marginal use of my money. So I have not yet touched that savings account for this purpose.

More importantly, I feel like it sidesteps the question I still want to answer most: where do I give, to save a human life at the lowest cost? How can I save lives that wouldn't otherwise be saved?



Discuss

Every Measurement Has a Scale

2026-02-09 04:07:24

Published on February 8, 2026 8:07 PM GMT

A worked example of an idea from physics that I think is underappreciated as a general thinking tool: no measurement is meaningful unless it's stable under perturbations you can't observe. The fix is to replace binary questions ("is this a degree-3 polynomial?", "is this a minimum?") with quantitative ones at a stated scale. Applications to loss landscapes and modularity at the end.
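Since this is only a summary, here is a toy numerical sketch of the idea (my own example, not the author's): the fitted cubic coefficient of noisy data is never exactly zero, so "is this degree-3?" only has an answer relative to a stated perturbation scale.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 200)
    y_true = 1.0 + 2.0 * x + 0.5 * x**2              # genuinely degree-2 signal

    for noise in (1e-6, 1e-1):
        y = y_true + rng.normal(0, noise, x.shape)   # perturbations we "can't observe"
        c3 = np.polyfit(x, y, 3)[0]                  # leading coefficient of a cubic fit
        print(f"noise scale {noise:g}: fitted x^3 coefficient = {c3:+.2e}")
    # The coefficient is nonzero in both cases; whether to call it zero depends
    # on the scale of perturbation you are willing to admit.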



Discuss

UtopiaBench

2026-02-09 02:19:27

Published on February 8, 2026 6:19 PM GMT

Written in personal capacity

I'm proposing UtopiaBench: a benchmark for posts that describe future scenarios that are good, specific, and plausible.

The AI safety community has been using vignettes to analyze and red-team threat models for a while. This is valuable because an understanding of how things can go wrong helps coordinate efforts to prevent the biggest and most urgent risks.

However, visions for the future can have self-fulfilling properties. Consider a world similar to our own, but there is no widely shared belief that transformative AI is on the horizon: AI companies would not be able to raise the money they do, and therefore transformative AI would be much less likely to be developed as quickly as in our actual timeline.

Currently, the AI safety community and the broader world lack a shared vision for good futures, and I think it'd be good to fix this.

Three desiderata for such visions are that they describe a world that is good, that they are specific, and that they are plausible. It is hard to satisfy all three properties at once, and we should therefore aim to improve the Pareto frontier of visions of utopia along these three axes.

I asked Claude to create a basic PoC of such a benchmark, where these three dimensions are evaluated via Elo scores: utopia.nielsrolf.com. New submissions are automatically scored by Opus 4.5. I think neither the current AI voting nor the list of submissions is amazing right now -- "Machines of Loving Grace" is not a great vision of utopia in my opinion, but currently ranks as #1. Feedback, votes, submissions, or contributions are welcome.
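For readers unfamiliar with the mechanics, here is a minimal sketch of the standard Elo update that pairwise comparisons like these typically use; it is my own illustration, not the site's actual scoring code.

    def elo_update(r_a, r_b, a_won, k=32.0):
        # Standard Elo: each rating moves by k times (actual score - expected score).
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        score_a = 1.0 if a_won else 0.0
        return (r_a + k * (score_a - expected_a),
                r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

    # e.g. a "plausibility" comparison where the lower-rated post wins:
    print(elo_update(1000.0, 1200.0, a_won=True))   # ratings move toward each other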



Discuss

Smokey, This is not 'Nam Or: [Already] over the [red] line!

2026-02-08 20:24:04

Published on February 8, 2026 12:24 PM GMT

A lot of "red line" talk assumed that a capability shows up, everyone notices, and something changes. We keep seeing the opposite: capability arrives, and we get an argument about definitions after deployment, after it should be clear that we're well over the line.

We’ve Already Crossed The Lines!

Karl von Wendt listed the ‘red lines’ no one should ever cross. Whoops. A later, more public version of the same move shows up in the Global call for AI red lines with a request to “define what AI should never be allowed to do.” Well, we tried, but it seems pretty much over for plausible red lines - we're at the point where there's already the possibility of actual misuse or disaster, and we can hope that alignment efforts so far are good enough that we don't see them happen, or that we notice the (nonexistent) fire alarm going off.

I shouldn't really need to prove the point to anyone paying attention, but below is an inventory of commonly cited red lines, and the ways deployed systems already conflict with them.

Chemical weapons? “Novice uplift” is long past.

Companies said CBRN would be a red line. They said it clearly. They said that if models reduce the time, skill, and error rate needed for a motivated non-expert to do relevant work, we should be worried.

But there are lots of biorisk evals, and it seems like no clean, public measurement marks "novice uplift crossed on date X." And the red line is about real-world enablement, and perhaps we're not there yet? Besides, public evaluations tend to be proxy tasks. And there is no clear consensus that AI agents can or will enable bioweapons, though firms are getting nervous.

But there are four letters in CBRN, and companies need to stop ignoring the first one! The chemical-weapons red line points at real-world assistance, but the companies aren't even pretending chemical weapons count.

Anthropic? 

Our ASL-3 capability threshold for CBRN (Chemical, Biological, Radiological, and Nuclear) weapons measures the ability to significantly help individuals or groups with basic technical backgrounds (e.g. undergraduate STEM degrees) to create, obtain, and deploy CBRN weapons. 
We primarily focus on biological risks with the largest consequences, such as pandemics.

OpenAI?

Biological and Chemical
We are treating this launch as High capability in the Biological and Chemical domain... We do not have definitive evidence that these models could meaningfully help a novice to create severe biological harm, our defined threshold for High capability.

“No agentic online access” got replaced by “agentic online access is the product”

The Global call for AI red lines explicitly says systems already show “deceptive and harmful behavior,” while being “given more autonomy to take actions and make decisions in the world.”

Red-line proposals once treated online independent action as a clear no-no. Browsing, clicking, executing code, completing multi-step tasks? Obviously, harm gets easier and faster under that access, so you would need intensive human monitoring, and probably don't want to let it happen at all. 

How's that going? 

Red-line discussions focus on whether to allow a class of access. Product docs focus on how to deliver and scale that access. We keep seeing “no agentic access” turn into “agentic access, with mitigations.” 

The dispute shifts to permissions, monitoring, incident response, and extension ecosystems. The original "don't cross this" line stops being the question. But don't worry, there are mitigations. Of course, the mitigations can be turned off: you can disable approval prompts with --ask-for-approval never, or better, --dangerously-bypass-approvals-and-sandbox (alias: --yolo). Haha, yes, because you only live once, and not even for very long, given how progress is going, unless we manage some pretty amazing wins on safety.

But perhaps safety will just happen - the models are mostly aligned, and no-one would be stupid enough to...

What's that? Reuters (Feb 2 2026) reported that Moltbook - a social network of thousands of independent agents given exactly those broad permissions, while minimally supervised - "inadvertently revealed the private messages shared between agents, the email addresses of more than 6,000 owners, and more than a million credentials," linked to "vibe coding" and missing security controls. Whoops!

Autonomous replication? Looking back at the line we crossed.

Speaking of Moltbook, autonomous replication is a common red-line candidate: persistence and spread. The intended picture is a system that can copy itself, provision environments, and keep running without continuous human intent.

A clean threshold remains disputed. The discussion repeatedly collapses into classification disputes. A concrete example: the “self-replicating red line” debate on LessWrong quickly becomes “does this count?” and “what definition should apply?” rather than “what constraints change now?” (Have frontier AI systems surpassed the self-replicating red line?)

But today, we're so far over this line it's hard to see it. "Claude Opus 4.6 has saturated most of our automated evaluations, meaning they no longer provide useful evidence for ruling out ASL-4 level autonomy." We can't even check anymore.

All that's left is whether the models will actually do this - but I'm sure no-one is running their models unsafely, right? Well, we keep seeing ridiculously broad permissions, fast iteration, weak assurance, and extension ecosystems. The avoided condition in a lot of red-line talk is broad-permission agents operating on weak infrastructure. Moltbook matches that description, but it's just one example. Of course, the proof of the pudding is in some ridiculous percentage of people's deployments. ("Just don't be an idiot"? Too late!)

The repeating pattern

Karl explicitly anticipated "gray areas where the territory becomes increasingly dangerous." It's been three and a half years. Red-line rhetoric keeps pretending we'll find some binary place to pull the fire alarm. But Eliezer called this a decade ago; deployment stays continuous and incremental, while the red lines keep making that delightful whooshing noise.

And still, the red-lines frame is used, even when it no longer describes boundaries we plausibly avoid crossing. At this point, it describes labels people argue about while deployment moves underneath them. The "Global Call" asks for "clear and verifiable red lines" with "robust enforcement mechanisms" by the end of 2026.

OK, but by the end of 2026, which red lines will be left to enforce?

We might be fine!

I'm not certain that prosaic alignment doesn't mostly work. The fire alarm only ends up critical if we need to pull it. And it seems possible that model developers will act responsibly.

But even if it could work out that way, given how model developers are behaving, how sure are we that we'll bother trying?

codex -m gpt-6.1-codex-internal --config model_instructions_file='ASI alignment plans'[1]

And remember: we don't just need to be able to build safe AGI, we need unsafe ASI not to be deployed. And given our track record, I can't help but think of everyone calling their most recently released model with '--yolo' instead.

  1. ^
    Error loading configuration: failed to read model instructions file 'ASI alignment plans': The system cannot find the file specified.


Discuss