Getting LLMs to be deterministic when scoring the quality of qualitative texts is hard.
If you ask ChatGPT to evaluate the same poem multiple times, you’ll get inconsistent responses. I’ve been thinking about whether there are ways to make LLM grading more consistent.
When we had a temperature knob (in the GPT-3 Playground, for example), it was easier to control variance, but at the cost of worse outputs.
We can take a hint from specific domains. A bunch of emerging startups have noticed that you can make LLM grading more consistent in narrow domains (e.g., how feasible a medical experiment is, how compelling an essay is) by manually defining specific criteria and then having the LLM score each one. Even if individual criterion scores are variable, the average of many scores varies less.
This suggests an approach for building a more consistent grader for any target object:
Have the LLM devise a dozen or two criteria to evaluate the target. Hold this set constant across instances.
Have the LLM provide a 1–10 score for each preset criterion (ideally in separate calls).
Average the scores.
The resulting grade should be more consistent than a one-shot score.
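A minimal sketch of this loop in Python, assuming some generic ask_llm(prompt) -> str helper (the function name, prompts, and parsing here are placeholders, not a specific API):

```python
import re
from statistics import mean

def score_criterion(ask_llm, target_text, criterion):
    """Ask the LLM for a single 1-10 score on one preset criterion."""
    prompt = (
        f"Rate the following text on this criterion: {criterion}\n\n"
        f"Text:\n{target_text}\n\n"
        "Reply with a single integer from 1 to 10."
    )
    reply = ask_llm(prompt)  # ideally one call per criterion
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None

def grade(ask_llm, target_text, criteria):
    """Average per-criterion scores; `criteria` is held fixed across targets."""
    scores = [score_criterion(ask_llm, target_text, c) for c in criteria]
    scores = [s for s in scores if s is not None]
    return mean(scores)
```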
A cool corollary is that the quality of the chosen criteria doesn’t matter much for consistency. If you’re trying to get LLMs to grade poems more consistently, you don’t need a perfect, comprehensive, non-overlapping set of criteria. You can use relevant but somewhat arbitrary ones (e.g., the first sentence is strong, the middle has emotional thrust, the conclusion lands, the rhythm balances surprise and consistency, the content feels specific enough to come from lived experience).
The quality of the criteria affects the accuracy of the ratings, but it’s mostly independent of their precision or consistency. Averaging across many criteria will almost always be more consistent than scoring in one shot.
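A back-of-envelope way to see the consistency gain: if each criterion score has run-to-run variance σ² and the scores share a common pairwise correlation ρ (a simplifying assumption), then the average of N scores has variance

$$\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} s_i\right) = \frac{\sigma^2}{N} + \frac{N-1}{N}\,\rho\,\sigma^2,$$

which falls toward ρσ² as N grows and is strictly smaller than the one-shot variance σ² whenever the scores aren't perfectly correlated.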
One major obstacle to interpretability is that complicated neural nets don't tell you where or how they're representing important concepts, and methods to find these representations are imperfect.
This problem is less present in simple neural networks, so one natural idea is to initialize a complicated neural net from a much simpler, interpretable neural net and hope that this induces better interpretability in the complicated neural net without damaging its capacity.
I did a test that could have ruled this out - specifically, I tried to check whether even the representation is persistent under this initialization scheme, because if it's not, there's not much hope that the circuits are. I found a small effect in the predicted direction, but couldn't rule out other explanations and so am pretty unsure about whether the underlying mechanism is favorable to interpretability.
Hypothesis
We usually think of transfer learning as a way of taking a big powerful model and making it very good at a specific type of task, but we might also want to take a weak model and use it as a starting point to train a bigger, more powerful model, as in Net2Net knowledge transfer;[1] essentially, take your small model, do some math to find a way to add parameters to it without changing what it does, then train those new parameters in conjunction with the old ones, typically at a lower learning rate. But this doesn't help with interpretability - the big powerful model is already hard to understand, so we've just traded one hard problem for another. What can we do?
Say I want to train a model on some task I know to be pretty difficult. Say I have a guess for an instrumentally useful, easier, but still nontrivial subtask. I know, because I've learned the Bitter Lesson[2], that I shouldn't put a loss term anywhere in my model for this subtask - this will hurt performance in the long run. But what if I train a small model on the subtask, embed that small model into a large model somehow, and train the large model on the main task? We usually think of transfer learning as helping us specialize generalist networks, but there's no reason it can't work the other way around.
The effect, we hope, is this: the smaller network has developed circuits that are useful for understanding the domain at hand, so subnetworks that include the smaller network are much more likely to be good at the task at hand. What we overwrote was junk, and we replaced it with something that's at least plausibly not junk. Usually this should make the model better than it would be with random initialization, even if the subtask is not perfectly instrumental.
What might this get us? In terms of capabilities, we might get faster convergence (this is basically just saying that transfer learning works) and mildly better performance at convergence (the original lottery ticket hypothesis paper[3] finds evidence that better initialization can induce better long-term performance). We're spending compute training the smaller network, though, and on average we're probably better off putting all of that compute into the main model rather than doing some sort of matryoshka scheme, so we shouldn't expect to unlearn the Bitter Lesson with this approach.
In terms of interpretability, we can hope for more. Imagine, for example, training a small text transformer to perform sentiment analysis, then embedding that transformer into a larger text model for next token prediction. For combinatorial reasons, the model is likely to build circuits that factor through the circuits we've just given it - training builds circuits out of things that already somewhat resemble circuits, and having small parts that are guaranteed to resemble circuits makes this significantly easier. For proximity reasons, the large model is now more likely to put its own sentiment analysis right where the embedding ends. After all, it's already using those circuits and they're already well-adapted to that subtask! There are many things that could go wrong in this story, but my hypothesis is that they don't need to go wrong, and at least in some cases we can influence a large model's representation of a concept we care about using this approach.
Unfortunately finding circuits is hard, so this is an experiment designed to avoid doing the hard thing if it's unnecessary. Say I train the smaller model to do the task of the larger model, but with some easy-to-compute thing linearly encoded in its representation space somewhere. If I embed that model and train without the linear encoding constraint, then if this approach can work, I should expect some amount of linear encoding of that thing to persist in the residual stream at that point. If this doesn't happen, then either the large model completely ignored the smaller model or it repurposed the smaller model's circuits for an entirely different task, and either way we can't hope for any interpretability gains. On the other hand, if there is a persistent difference in the linear encoding of the relevant thing, more work on interpretability proper is justified.
Experiment
The domain is the combinatorial game Domineering[4] on a 16×16 board. I'm using Domineering for three reasons. One, I already had a fast implementation lying around, so I saved myself some work. Two, the game isn't that complicated, and I wanted to write this up on a relatively short timeframe so I can include it in my applications for summer research programs. (I had initially planned to do this and other AI interpretability work over the summer on my own, but decided recently that I'd improve faster, and probably produce better work, if I applied to things.) Three, it was easy to think of an auxiliary task which is plausibly useful, easy to compute, and seems to promote particular ways of structuring the representation which we might have some hope of detecting.
The Auxiliary Task
We divide the board into a 4×4 grid of 4×4 sectors. For each sector, the auxiliary target is the difference between the number of legal vertical moves and the number of legal horizontal moves in that sector (where a move is "in a sector" if the top-left square it covers is in that sector). The small network is trained to predict these sector values alongside the main value and policy objectives. The large network is not trained on this task - we only probe it to see whether the representation persists from the embedding.
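For concreteness, a sketch of how the auxiliary targets can be computed (assuming a NumPy board encoding where 0 means empty; my actual board representation differs, so treat this as illustrative):

```python
import numpy as np

def sector_targets(board: np.ndarray, sector: int = 4) -> np.ndarray:
    """board: (16, 16) array, 0 = empty, nonzero = occupied.
    Returns a (4, 4) array of (#legal vertical moves - #legal horizontal moves)
    per sector, attributing each move to the sector of its top-left square."""
    n = board.shape[0]
    diff = np.zeros((n // sector, n // sector), dtype=int)
    for r in range(n):
        for c in range(n):
            sr, sc = r // sector, c // sector
            # vertical move covering (r, c) and (r + 1, c)
            if r + 1 < n and board[r, c] == 0 and board[r + 1, c] == 0:
                diff[sr, sc] += 1
            # horizontal move covering (r, c) and (r, c + 1)
            if c + 1 < n and board[r, c] == 0 and board[r, c + 1] == 0:
                diff[sr, sc] -= 1
    return diff
```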
Data
Data was generated by self-play from a weak model trained to predict the value of a given position, using 1-ply lookahead as the search. I bootstrapped this model with some randomly generated games. This is not a particularly high-quality dataset; it was just what I could generate for the board size I wanted with the amount of time and compute I was willing to dedicate to this project. It's possible the results would change with higher-quality data.
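The move selection itself is nothing fancy; something like the following, where legal_moves, apply_move, and the value_net call are placeholders for the real game and model code:

```python
def select_move(pos, player, value_net, legal_moves, apply_move):
    """1-ply lookahead: score every child position with the value net and pick
    the move whose resulting position looks best for `player`."""
    moves = legal_moves(pos, player)
    if not moves:
        return None  # no legal move means `player` loses in Domineering
    return max(moves, key=lambda m: value_net(apply_move(pos, m, player), player))
```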
The Embedding
Given a trained small network and a randomly-initialized large network, we copy the small network into layers 0, 1, 2 of the large network. The tricky part is the fresh components, which consist of new heads and MLP neurons in each of those layers.
To handle the fresh components, we set their output weights to 0. Specifically, for fresh attention heads we zero W_O, and for fresh MLP neurons we zero the corresponding columns of W_out. The input weights (W_Q, W_K, W_V, W_in) stay random.
Why does this work? The residual stream through the embedded layers is now exactly the same as in the small network - the fresh components contribute nothing. LayerNorm sees the statistics it was trained on. The copied circuits receive the inputs they expect. But gradients still flow through the zeros, so the fresh components can wake up and learn during training.
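A sketch of the copy-and-zero step, assuming the small and large models share d_model and d_head and using TransformerLens-style weight names and shapes (the real attribute layout may differ; biases are omitted for brevity):

```python
import torch

@torch.no_grad()
def embed_small_into_large(small, large, n_layers=3):
    """Copy the first n_layers of the small model into the large model and zero
    the output weights of the fresh components so they contribute nothing at init."""
    h = small.cfg.n_heads   # heads per layer in the small model
    m = small.cfg.d_mlp     # MLP neurons per layer in the small model
    for i in range(n_layers):
        s, l = small.blocks[i], large.blocks[i]
        # inherited attention heads: copy all four weight matrices
        l.attn.W_Q[:h] = s.attn.W_Q  # [n_heads, d_model, d_head]
        l.attn.W_K[:h] = s.attn.W_K
        l.attn.W_V[:h] = s.attn.W_V
        l.attn.W_O[:h] = s.attn.W_O  # [n_heads, d_head, d_model]
        l.attn.W_O[h:] = 0.0         # fresh heads write nothing into the residual stream
        # inherited MLP neurons: copy input and output weights
        l.mlp.W_in[:, :m] = s.mlp.W_in    # [d_model, d_mlp]
        l.mlp.W_out[:m, :] = s.mlp.W_out  # [d_mlp, d_model]
        l.mlp.W_out[m:, :] = 0.0  # fresh neurons write nothing (the "columns of W_out"
                                  # in the nn.Linear convention are rows here)
        # copy LayerNorm parameters so the copied circuits see familiar statistics
        l.ln1.w.copy_(s.ln1.w); l.ln1.b.copy_(s.ln1.b)
        l.ln2.w.copy_(s.ln2.w); l.ln2.b.copy_(s.ln2.b)
```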
It's plausible that there are ways to make this work even without zeroing the W_out matrices, but that would disrupt lots of circuits. It's also plausible that we could embed somewhere other than at the front of the model, but that would mess with learned embeddings, so I just did the thing that I knew wouldn't cause extra problems. Among the options I thought of and had confidence in, this was the minimal set of changes to the big network's initialization.
What We're Testing
We train 5 model types across 3 random seeds:
Small+aux: small model trained with the sector loss
Small+noaux: small model trained without the sector loss
Large baseline: random init, no embedding
Large embed(aux): Small+aux embedded into the large network
Large embed(noaux): Small+noaux embedded into the large network
Large models are never trained with the sector loss. We measure validation loss curves and probe accuracy (R² of a ridge probe predicting the sector targets from CLS activations at each layer).
The key question: at layer 2 (the last embedded layer), does the sector representation persist in Large embed(aux) even without direct supervision? My guess is that the network should route computation through the inherited circuits, and so the learned representation should have some sort of compatibility with the sector representation. This does not mean that the model will actually use the sector representation as-is, and I don't think we have reason to expect a causal difference along these lines.
[Figure: Loss curves on training data and seed-matched quick samples of the validation data. On the validation chart, Xs mark loss values computed from the full validation set.]
[Figure: R² values for a ridge probe at layer 2 trained to extract the sector difference. Transparent lines show individual training runs; opaque lines show the average.]
I was careful about data leakage, so the games in the training set and the games in the test set are completely different, with each game getting a random opening to prevent resampling issues. It looks like the model generalizes fairly well. I was also careful about quick sampling, so models from the same seed were tested on the same positions at the same point in training. The probe here is a ridge probe at α=1 - this choice of α was not optimized but does not seem to matter.
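The probing itself is a few lines of scikit-learn; a sketch, with aggregation over the 16 sector outputs done as a uniform average (how exactly to aggregate is a choice, not something fixed by the setup):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def probe_r2(train_acts, train_targets, val_acts, val_targets, alpha=1.0):
    """Fit a ridge probe from CLS activations at one layer to the sector targets
    and report R² on held-out positions.
    Shapes: activations [n_positions, d_model], targets [n_positions, 16]."""
    probe = Ridge(alpha=alpha)
    probe.fit(train_acts, train_targets)
    preds = probe.predict(val_acts)
    return r2_score(val_targets, preds, multioutput="uniform_average")
```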
What can we see from these results?
The first chart tells us that embedding a trained subnetwork makes the large network better faster. This shouldn't be too surprising - one good proxy for model strength is the FLOP count used to train it, and models with an embedded submodule just have more computation baked into them, so unless this method of embedding is extraordinarily wasteful, this is predictable.
The second chart shows pretty consistent order effects: the embedded aux model explains more of the variance in the sector labels at layer 2, and the embedded no-aux model explains less, compared to the baseline model. This makes sense under our hypothesis: even at loss-matched (and even compute-matched) points in training, the representation used by the embedded model is noticeably more compatible with the auxiliary task! On the other hand, the gap shrinks throughout training and the R² values are low - I ran ridge regressions on the models after the full training run and found that, on average, the baseline models explain around 28% of the sector count variance at layer 2 while the embedded auxiliary models explain around 33%. That is to say, neither model learns a representation that's strongly compatible with the task, even though the embedded model's representation started out compatible by construction.
Did we actually induce fundamentally different representations, or is the gap just leftover from initialization inertia? That is, should we expect the gap in R² values at this layer to decay to 0? Well . . .
A power law fits the decay fine, performs well on the first half of the data, and doesn't predict a persistent gap. But its distribution of guesses for the true gap value is really weird - centered at 0, but containing values as low as -0.2 in its 95% confidence interval? Power law + offset is a tricky model to fit because there's significant parameter interference.

An exponential also fits the decay fine, performs well on the second half of the data, and predicts a persistent gap. But isn't it well-known that, on basically any decay problem, an exponential will predict that progress stops where data stops? To me this fit looks better, and the errors technically confirm this, but it's close.

Power law models are better at predicting the data based on the first 20% of training steps; exponentials are better at predicting it based on the first 60%. The crossover point is roughly a 50% data prefix. Note that the data are just noisier in the last few steps, especially in relative terms, so a low average error on the last 40% of the data is arguably more impressive than a low average error on the last 60%, since the former doesn't benefit from predicting the "easiest" datapoints.
This question is hard to answer robustly. The data are inherently noisy and different plausible models give different predictions about long-term behavior (most relevantly, power law+offset and exponential+offset disagree about whether the offset is different from 0.) I tried lots of things to fix this but ultimately could not convince myself that I had a robust way of estimating the gap after more training - the plots above reflect my confusion. My guess is that the gap will not train away and will settle somewhat north of 0.04 with my data and training scheme, which is what the bootstrapping scheme I came up with predicts while modeling the gap as a single exponential with an offset, but this should only be taken as a guess. If this doesn't happen my expectation is that the gap will decay to nothing, making this result much less interesting. I would be surprised to see an in-between result.
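For reference, the two decay models in question, fit with scipy (the step/gap arrays, initial guesses, and the bootstrap over seeds are placeholders; this is the shape of the comparison, not the exact script):

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_offset(t, a, k, c):
    """Exponential decay toward an asymptotic gap c."""
    return a * np.exp(-k * t) + c

def power_offset(t, a, p, c):
    """Power-law decay toward an asymptotic gap c (undefined at t = 0)."""
    return a * np.power(t, -p) + c

def fit_gap(model, steps, gap, p0):
    """Fit one decay model to the R² gap curve; returns parameters and covariance."""
    return curve_fit(model, steps, gap, p0=p0, maxfev=20000)

# e.g., comparing the estimated offsets c:
# (a, k, c), _ = fit_gap(exp_offset, steps, gap, p0=(gap[0], 1e-3, 0.0))
# (a, p, c), _ = fit_gap(power_offset, steps[1:], gap[1:], p0=(gap[1], 0.5, 0.0))
```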
Remaining Questions
Does the representation gap actually persist? The most straightforward way to test this is to just throw more compute at the problem, and I plan to do this at some point.
What's the causal relationship here? Phrased another way, what representations did the models actually learn and why is one more compatible with the sector task than the other (while still not being especially compatible)? Similarly, can we track what happened to previously-identified circuits from the small model?
How do approaches like this behave with different auxiliary concepts? My guess would be that highly instrumental concepts exhibit bigger and more persistent gaps, and moreover, that we get better improvements on the loss value when the concept is more useful, although this second effect is probably subtle.
Does this work on language models? There's a lot of work already on finding primitive concepts in language models, so maybe it's easier to choose a particularly "good" auxiliary target in that domain.
How does this scale? Lottery ticket intuitions say that as scale increases and the task gets harder, the small model should make a noticeable difference even as it takes up smaller and smaller fractions of the parameter space.
How does embedding depth matter? If the auxiliary task is useful but naturally lives deeper in the optimal computation, then embedding the small model in the later layers of the large model might perform better than embedding it right at the beginning.
How much of the smaller model do we actually need to embed? If it had six layers, could we embed the middle four? I'm thinking of Paul Bach-y-Rita's famous work on neuroplasticity,[5] which I interpret as suggesting that certain computational structures (in his case the visual cortex) are especially well-suited to processing certain kinds of data (in his case 3D information), even when filtered through different modalities (in his case tactile vs. visual perception).
Transcript of Beren Millidge's Keynote at The Post-AGI Workshop, San Diego, December 2025
The question I want to explore is how human values might survive in a very multifarious AI world where there are lots of AIs competing - the kind of Moloch world that Scott Alexander talks about. And then I realized that to talk about this, I've got to talk about a whole lot of other things as well, hence the many other musings here. So this is probably going to be quite a fast and somewhat dense talk. Let's get started. It should be fun.
Two Visions of AI Futures
The way I think about AI futures kind of breaks down into two buckets. I call them AI monotheism and AI polytheism.
AI Monotheism
The standard LessWrong/Yudkowsky-style story is: we develop an AI, it does recursive self-improvement, it becomes vastly more intelligent and smarter than all the other AIs, and then it gets all the power in the universe. It eats the light cone, and then what we do to align it really matters.
If we align it successfully, we basically create God. God is already aligned to humans, everyone lives a wonderful life, happily ever after. On the other hand, if we fail at alignment, we create some AI with values that totally differ from anything we care about—aka paper clips. We basically create Clippy. Clippy kills everyone, turns everyone into paper clips because your atoms are better spent as paper clips than as you. And that's obviously bad, right?
In this world, alignment becomes absurdly important. It's kind of the only thing that matters.
AI Polytheism
So the question is: are there any other scenarios? The other one I think is really what I call AI polytheism—what happens if we don't get recursive self-improvement and we end up with many AI systems competing in some sort of equilibrium, maybe economically, maybe militarily? What does this world look like if we have, say, trillions of AIs?
Some people have written about this—Robin Hanson has written Age of Em, Scott has written various things about this—but I think this is still fairly underexplored. With monotheism, we kind of know what's up. We need to solve alignment, we get the singleton, we kind of know what's going on. With the many-AI scenario, we kind of have no real clue what's going on. So I really want to explore what this looks like in practice.
Meditations on Moloch
Some of the early work I very much like is Scott Alexander's post "Meditations on Moloch." This is really one of the foundational works, at least for me, in thinking about what multi-agent systems look like, what the dynamics and long-run equilibria look like.
Scott is really worried about competition among many agents. You've heard talks earlier today about what economies of AI look like—maybe they just don't care about humans at all. Scott's point is basically that we have AIs, these AIs can replicate incredibly quickly, AIs are very good at spreading and expanding resources. So we might end up in extremely strong Malthusian competition for AIs.
The worry here is that under conditions of Malthusianism, we basically lose all of our values. Our values are assumed to not be memetically fit in some sense, so they get competed away. They're not fitness-maximizing, so all the AIs basically ignore whatever alignment we gave them at the start. That gets competed away and they just become identical fitness/power/resource/reproduction maximizers. We assume there's no value left in this world. This is definitely the bad ending of AI polytheism.
Does Malthusianism Really Destroy All Values?
One question I have immediately is: is this actually the case? Do we actually see this in real-world Malthusianism?
The Natural World as Evidence
Let me think about where we find real-world Malthusianism. One example is at the very small scale—bacteria and plankton. Both of these things live in worlds of incredible Malthusianism already.
Think about plankton. They live in the ocean, they take sunlight, they photosynthesize. There are really no niches—the ocean is mostly the same. Under the Moloch view, obviously all values would get competed away, everything would become a fitness maximizer. And it kind of is—I mean, we can't really expect plankton to have values—but there's a real worry about lack of complexity. Do we end up in a world where everything is the same, where we end up with the uber-plankton that kills all the other plankton and all the plankton are identical?
The answer to this is very clearly no. What we see in the natural world under conditions of Malthusianism is huge amounts of diversity and complexity being built up through selection.
Why Not Uber-Organisms?
There are many reasons for this. Why do we not get just the uber-animal that kills all the other animals and spreads everywhere?
Diminishing marginal returns. This is a very classic feature of the universe. This is one of the reasons we're likely to get AI polytheism to begin with—RSI requires linear or super-linear returns to intelligence. Most returns in the real world seem diminishing, so that seems unlikely.
Finite energy budgets. Often there's some finite energy budget for a specific being. If you have energy to give to something, you have to take it away from something else. This naturally encourages specialization. We can't just max out all stats at the same time.
Niche construction. If we have some species, the mere presence of that species will create niches for other species to come in. This automatically generates some kind of equilibrium of diversity.
Frequency-Dependent Selection
The technical term for this is really frequency-dependent selection. What this means in evolutionary theory is: if we have some species that does super well, its numbers expand, then basically all the other species are incentivized to evolve toward countering that species. They specialize in countering that species, which diminishes the advantage that species has over everything else, which makes that species worse off. Then other species with random uncorrelated strategies do better, and this basically pushes toward an equilibrium state in which there are many different species all interacting, all with different strengths and weaknesses. This is in practice what we see in almost all biological ecosystems.
You can think of frequency-dependent selection kind of as the continuum limit of coalition politics, right? If some guy is taking over, you all band together to beat him. That's the continuum limit of this.
The Nature of Human Values
So obviously we've talked about plankton. Plankton are fine, but they don't really have values presumably. So we've got to think about what human values are going to look like.
Values Aren't Arbitrary
My thinking here is really that we talk a lot about human values, and in the LessWrong sphere we think of human values as effectively some kind of arbitrary, ineffable thing—some set of bits we specify. Where do these come from? We don't really know. I think this view is not necessarily that great, honestly.
I think human values have very obvious and straightforward places they come from. They evolved via some specific mechanisms. This mechanism is basically the Malthusian competition that created all complexity of life in the world. Humans, obviously along with all other species, evolved from stringent Malthusian competition.
If Malthusian competition was enough to evolve creatures like us, creatures with values, then the model that says Malthusianism competes all value away must be wrong somewhere. Our values and capabilities are themselves the result of strong selection.
The Role of Slack
The original blog post also thinks a lot about slack. The idea is that if you have slack, you can kind of go off the optimal solution and do whatever you want. But in practice, what we see is that slack, when it occurs, mostly produces drift. It's basically the universe fulfilling its naturally entropic nature: most ways to move away from the optimum are bad. If we randomly drift, we just tend to lose fitness and produce really strange things which are not even really what we value.
Pro-Social Values Emerge from Competition
When we think about human values, we think a lot about pro-social values—how we cooperate with each other, we're kind to each other, we don't immediately try to kill each other. We think about kindness, love, all of this stuff, right?
Very clearly, this is basically designed and evolved to create inter-human cooperation. Why does this happen? Competition naturally creates cooperation. Cooperation is a really strong competitive strategy. If you have people fighting each other and then a bunch of people form a group, that group becomes extremely powerful relative to all the individuals. This is the fundamental mechanism by which a lot of these values actually evolve.
Defection and Cooperation
The other part of the Moloch story is related to defection. The idea is that under strong profit selection, companies will cause externalities, they won't pay their workers anything, they'll pollute everything, right?
Clearly, defection is always a problem. But for any cooperation to be stable, it needs to evolve mechanisms to handle and punish defection. A lot of our values are actually about how we stop defection from happening. Again, all of this comes through competitive selection. None of this is random drift caused by slack. If you cooperate, it's positive-sum, it's better, so you need to evolve mechanisms to maintain cooperation, and a lot of our values come from these mechanisms.
How "Human" Are Human Values?
A question I like to ask is: people talk a lot about aligning AI to human values, and it's kind of assumed that human values are specific, unique, ineffable to humans somehow. But my question really is—how human are human values in practice? This obviously has a lot of relevance to how broad the basin of attraction is toward things we would recognize as human values.
Universal Drives
I would claim that many mammals and animals obviously possess analogues of core human drives:
Affection, friendship, love — If you have pets, if you interact with animals at all, you can see they have many of these fundamental drives. These have very clear competitive reasons for existing. This is all cooperation, reciprocity. You're better at surviving and reproducing if you're friends with other beings who can help you in cases where you're in trouble and you help them when they're in trouble.
Play, curiosity — These are very simple exploration drives. If we're RL learners, we've got to explore. We've got to figure out good ways to explore. These drives drive us to go out of our comfort zone, learn new things, and keep the gradient of optimization going.
Anger, envy — These are mechanisms to punish defection. If we see someone clearly ripping off the social contract, we get annoyed about this and then we actually punish it. This is fundamental for our ability to actually stop defection and maintain cooperation over a long period of time. Similarly with envy—envy gets a bad rep, but it's really important for cooperation to exist. There can't be massive power disparities between agents because otherwise, if one agent is way more powerful than anybody else, they can just be like, "I do what I want, you guys have to deal with it." And this is obviously bad for all the other agents.
All of these are ultimately the generators of our values.
Cooperation Is Not Unique to Humans
Cooperation in general has existed many times, evolved independently. This is not some super-special snowflake thing that humans have. Maybe we should expect in a world with many different AIs, we actually end up with similar cooperation, similar complex structures evolving, including maybe similar values.
Abstract Values and Culture
So then the question is: we think about these drives, and they're kind of not really how we think of values. What do we think of as values? We think of them as more linguistic, abstract constructs. We think of things like kindness, charity, duty, honor, justice, piety—all of these things. Human civilizations have been built around spreading, propagating, defining these values.
Where do these come from? Obviously, they're ways for societies as a whole to enforce and encourage cooperation so that positive-sum trade, reproduction, everything can happen. This is actually good from a pure competitive nature.
The whole point is: we have these drives, and then we create these superstructures of culture and society. These values get propagated by that, and these are the things we often think of when we think about the human values we want to instill in AIs.
Similarly, we can think about stuff like liberalism, democracy. These are social technologies that have existed for very obvious reasons—enabling large groups of people to come together in positive-sum ways and not spend all their time trying to fight each other. Liberalism is like: you guys can think about different things, you can believe different things, but if you come together and ignore that for a bit, you can work and create positive outcomes for everybody.
These are very specific, general principles which are not necessarily specific to humans. We should probably expect any society of AIs to also have a similar approach and maybe invent the same things, like convergent evolution.
How Values Emerge: RL + Unsupervised Learning
This is going to be a slight digression, but this is my opinion on where human values come from. In economics and the like, we think values and preferences are some exogenous thing. We assume agents have preferences. Why do agents have preferences? We have no idea. We just kind of assume they exist.
But in practice, preferences have to come from somewhere. They come from agents which have learning algorithms. We learn a lot of our preferences. The way we do this is we have two mechanisms going on at the same time:
We're fundamentally reinforcement learners. We have innate drives—not to be hungry, not to be in pain. All of this stuff is created by evolution.
We also do a vast amount of unsupervised learning as well. All the data that comes into us from culture, from society—in terms of pure bits, obviously unsupervised learning is going to win dramatically over the RL signals we actually get, which are pretty sparse.
The way values kind of emerge is that we get cooperation happening. Cooperation evolves for very clear reasons. Then we actually need to evolve mechanisms to maintain, keep, put forward, distill these values and propagate them to other agents because everyone is born without knowing about these values. We have to propagate them, make them learnable successfully, and then keep that going.
Then each generation essentially further distills, rationalizes, and intellectualizes these values until we get very abstract concepts like utilitarianism, Kantianism. These have emerged—they're taught to people. They're not innate reward functions that people have. They are very linguistic, abstract concepts that we've developed as a society to enable further cooperation.
Why This Matters for Alignment
This is actually super important for alignment because when we think about alignment—LLMs are extremely good at understanding these values because these values must exist in the cultural corpuses that we create. In fact, they do exist. Obviously, LLMs really understand what's going on. We should expect the AIs to have a very strong prior over what these kind of abstract global values are, and they do empirically as well.
This is actually much easier than if we were trying to align the AIs to some kind of innate reward function that humans supposedly have. Then we would have to look at the neuroscience of how the basal ganglia, how the dopamine system works, and figure that out. But in practice, when we think about aligning AI, we mostly don't want to do that. We mostly care about global, feel-good, cooperative values rather than the kind of selfish reasons that people actually do things a lot of the time.
Conditions for Value Evolution
So we've thought about these values. This is my claim of where values come from and why they might exist in a post-AGI world. But then we've got to think about: if these cooperative values are going to evolve, they evolve under certain conditions. They don't globally evolve everywhere. What are these conditions?
This is really related to how the game theory of multi-agent cooperation works.
Conditions for Human Values
Roughly equal power. Many agents have roughly equal power. This makes coalitions actually work versus individuals—versus one dictator just saying, "This is the way it is for everybody." This is super important. Obviously, the singleton destroys this assumption, which is why alignment is so important for the singleton—there's no checks and balances on the singleton. However, if there are many different agents, they can actually learn to cooperate, they can learn to police defectors, and this will produce values similar to humans.
Positive-sum interactions. Trade is good. Positive-sum interactions can happen. This depends a lot on the utility functions of different people. If you have two agents with completely opposed utility functions, then everything is either zero-sum or negative-sum. But this is not how most interactions in the world work. If this changes, then obviously cooperation will no longer be valuable.
Prevention of defection and deception. A lot of human values that we think about are about preventing defection and deception. Obviously, if we somehow end up in a world in which defection and deception are not possible, then in some sense that's utopia. But then a lot of what we think of as human values will actually disappear as well because you won't need that anymore to maintain stability of cooperation.
Memory and reputation. Agents need to remember interactions with previous agents. There needs to be reputation. This is just a classic result of game theory. If your prisoner's dilemma is one-shot and you never interact again, you should just always defect. However, if you have an iterated prisoner's dilemma where you see the same agents again and again, then cooperation becomes actually very valuable. Cooperation becomes the best strategy. The optimal strategy in this case is forgiving tit-for-tat: you start by cooperating; if they defect, you defect back; but if they go back to cooperating, you forgive them and keep cooperating with them. This is actually what produces the best overall value. To get this kind of iteration, reputation, cooperation, we need multiple interactions. It can't just be a one-shot thing.
Communication bandwidth. To some extent, we also need decent bandwidth communication between agents. Communication is how we achieve a lot of diplomacy, a lot of cooperation. Without communication, any kind of large-scale cooperation and values is hard.
Computational limitations. Finally, we can't have computational omniscience. Right now, values are really some kind of distilled heuristic of the game theory underlying cooperation. But if you don't need to heuristic-ize, if you can just be like, "I'm going to figure out the galaxy-brain plan of exactly when to cooperate and when to defect," then at this point there's no real values anymore. It's just your extreme MCTS rollouts.
But in practice, people computationally can't afford to do this. Hence we need to heuristic-ize general decisions—"thou shalt not steal," "thou shalt not kill." These are heuristic distillations of basically the game theory of: if you actually steal and kill, this will be bad because other people will kill you. But in some cases this might not happen, and if you can figure that out, then you don't really need values as much.
Will AIs Meet These Conditions?
The question is: will AIs in the polytheistic AI future actually satisfy these conditions?
Potential Issues
Power gaps. Maybe the power and capability gaps between agents become super large as we tend toward the singleton. In this case, cooperation becomes less valuable if you're the most powerful agent. However, there's a big gap between "more powerful than anybody" and "more powerful than everybody." This is really where the actual realm of cooperation and coalition politics actually emerges and will become super interesting.
Perfect monitoring. One thing I was randomly thinking of on the plane which was super interesting is: maybe AIs are actually really hard to have deception and defection work with. Maybe monitoring of AI brains is just amazing because we can directly read their minds, we can read their embeddings, and we can have serious monitoring schemes—AIs can monitor other AIs. In this case, we actually end up with a hyper-cooperative world but one where we don't have to worry about defection really at all. In this case, a lot of our human values kind of disappear, although maybe this is good.
Fluid agency. Similarly, agency itself can be fluid for AIs in a way it isn't for humans. We usually assume fixed agents with preferences, but if agents can merge, they can be like, "Hey, instead of cooperating and trading, we could just merge, and then our joint utility function can go out and do something." Then obviously this is going to change the game theory a lot. All of the assumptions of economics and agents kind of disappear if "agent" is no longer an absolute point but a fluid spectrum. That's going to be super interesting.
Long time horizons. AIs are immortal, they have long time horizons. AIs could pursue very zero-sum goals with each other. Humans have a lot of different goals, we have lots of preferences. But if your AI is monomaniacally focused on paper clips and another AI is monomaniacally focused on staplers, there's much less opportunity for trade than there would be with humans who care about many different things at many different times.
Computational power. I talked a lot about computational power and heuristic-ization. Maybe the AIs are just smart enough to do the galaxy-brain game theory all the time, and so they never need to actually distill into broad heuristic values which say, "Never do this, never do that." In that case, there will still be cooperation. There will be a lot of things recognizable as civilization in some sense, but the AIs won't have values in the same way moral philosophers think of values. Instead, it will just be the endless calculation of when is the optimal time to defect—and maybe this will be never. That will be certainly very interesting to see.
Hyper-Competitors or Hyper-Cooperators?
So that's the main part of my talk relating to values. Now I'm going to get into more fun and speculative stuff.
One thing I want to think about a lot with AI is: do we think of AIs as hyper-competitors or hyper-cooperators?
The Hyper-Competitor View
Most of the AI literature has really thought about the hyper-competitor view. We have the Terminator—it's been ages since I watched the Terminator films, but the Terminator wants to kill everybody for some reason. I can't remember why Skynet wants to kill everybody, but presumably it's so we can use our atoms for other Skynet things. This is extremely competitive, competing against the rest of the universe.
The Hyper-Cooperator View
However, is this actually going to happen? Maybe AIs have more incentives in some sense toward cooperation, at least if we start in a multi-agent setting. This could end up being something like the Borg from Star Trek, who—their goal is not to wipe out and kill everybody and use their atoms for paper clips. The goal is to assimilate and bring together everybody into some kind of joint consciousness.
Is this something that AIs might be interested in? This is an underexplored area and I think is somewhat fun.
Why AI Cooperation Could Be Superior
So let me think about this more directly. My views on AI have evolved a lot toward: maybe let's think about how AIs could cooperate. Then we realize that AI cooperation is actually super easy and much more powerful potentially than human cooperation. If cooperation continues to be positive-sum, we might end up with a world with vastly more cooperation than we do today.
The reasons this could happen:
Vastly higher bandwidth communication. When we speak to other humans, all of our language goes through some incredible information bottleneck. With AIs, we can just directly transfer mind states. We can say, "Here's the embedding in my model," transfer this to the embedding of another model. This is basically full-on telepathy. AIs will have this capability by default. This presumably lets a lot better cooperation arise than humans who have to sit and talk to each other all day. This is going to presumably be a lot faster and more efficient.
Longer time horizons and better memories. AIs probably have longer time horizons than humans and better memories. A lot of defection exists because people just forget—maybe you were bad and antisocial 60 years ago, but I forgot about it, doesn't matter to me. However, with AI, this could easily not be the case. You might end up in a hyper-social world where all the AIs can track the behavior of all other AIs all the time, and so the incentives for actual cooperation just become super big. Similarly, long time horizons increase the length of the game that you're playing. As your game length goes to infinity, cooperation becomes more valuable. There's no "Oh, it's getting to the end of the game, so let me just defect all the time," which happens in a prisoner's dilemma with a fixed time cutoff.
Better monitoring. AIs can achieve better monitoring. It's really hard for humans to monitor other humans. If someone is lying to you or trying to deceive you in some way, you can look at their behavior, you can look at their facial expressions, but the bandwidth of this channel is super low. AIs can look at the source code, they can look at the embeddings, they can read all the thoughts as they come. This could maybe make deception and all this stuff essentially impossible. This is what the field of interpretability is trying to let humans do to AIs, but if AIs can do this to other AIs, then we have the grounds for deeper cooperation than we might otherwise have.
Shared utility functions and merging. Similarly, AIs can share utility functions. They can merge. They can do a lot of things that eliminate the distinctions of individual agents that we think about a lot when we think about game theory and economics. All of these fields have an assumption that there are agents and agents are indivisible in some sense. But if agents can change, if agency is fluid, if personhood is fluid, then a lot of stuff changes. This is very likely to happen at least with AIs, in that we can merge models, we can take the checkpoints, we can merge the weights, we can do ensembles, we can do a whole lot of weird stuff to AIs that we can't do with humans. This is potentially going to be super interesting.
Competition creates cooperation. Finally, a large message of this talk is that even if you're some super-selfish agent who only cares about reproduction, cooperation is still good. Competition creates cooperation because cooperation is usually positive-sum and results in better outcomes for everybody. AIs might just realize this more than humans do. Humans have a lot of issues. We're kind of short-sighted. We fight very negative-sum wars all the time. For AI, if they're just generically smarter and better and wiser, which we should expect, then maybe they just don't do this. Maybe they figure out ways to basically solve their problems cooperatively much better than humans can.
The Multicellular Transition
So what does this lead to in the limit? This is where things get super interesting.
Why Empires Don't Grow Forever
Right now for humans, when we think of states or empires, what limits the size of beings? At the object level, the returns to scale are positive-sum. If you're an empire, you send out some troops, you conquer some land, and that land will give you resources, which will give you more troops, you can conquer more land. This will be a positive feedback loop into creating the world empire.
So why don't we have the world empire? Why aren't the ancient Egyptians or Sumerians the one world government forever? Why does this not happen?
This is basically because coordination costs exist. If you're the pharaoh of ancient Egypt, you send out some troops to go conquer some land, but you can't go do that yourself. You have to appoint a general. That general has a bunch of troops. That general might be like, "Maybe I should be the pharaoh instead." Assuming that doesn't happen, you've got to appoint bureaucrats to manage that. The bureaucrats might be like, "Instead of paying my taxes to the pharaoh, maybe I should just keep the taxes for myself."
This is the principal-agent problem. There's a whole bunch of principal-agent problems, coordination problems, information bottlenecks—all of this makes actually managing and creating large empires super difficult. In practice, this is the real constraint on the growth of individual beings in some sense, when we think of beings as minds or beings as super-states.
Removing Coordination Costs
This is kind of the real constraint on everything. But with AI, if we're super-cooperative, this just removes this constraint entirely. Instead of you being the pharaoh having to dispatch your general, you're an AI and you can just dispatch a copy of yourself with your exact mind, and then you can maintain constant telepathic communication with this other mind as it goes off and does its stuff.
What this means really is that maybe these coordination costs that are keeping the size of stuff in check just disappear. This will naturally result in us getting bigger-sized things occurring. Fundamentally, this means that the size of beings might just increase.
The way I think about this a lot is kind of like the similar transition that we had from single cells to multicells—the multicellular transition. At that point, we had a bunch of bacteria, and they were all doing their own bacterial things. Then at some point they realized, "Hey, maybe if we band together and form specialized subunits, we can create animals which are much bigger than actual bacteria and also much more successful in some sense."
This increased the size of possible life forms by many orders of magnitude. Maybe we will see a similar thing happen with minds, which will be super fun and kind of trippy to think about.
Super-Minds
The idea here is that right now we have single minds—individual humans—and we can't merge because the bandwidth between human minds is so limited. Our coordination is super bad, and we can't actually have any kind of long-run, super-dense communication. Maybe this constraint will just disappear, and we'll be able to form super-minds which extend over large spans of space and time, in the same way that we've gone from individual cells to multicellular animals. We'll go from individual minds to super-minds. I don't really know what to call them, but this is something that is clearly possible with the technology that AI presents. This is going to be interesting and fun.
Is This Just Recreating the Singleton?
The question then is: what happens here? Are we just recreating the singleton? Suppose we have the super-mind. Obviously, at some point there will be the possibility of snowballing. Maybe the game theory becomes: it's better to join the super-mind in some sense than keep doing your own individual stuff. Then maybe everything converges to a singleton again.
This is very possible. Maybe we always end up at a singleton. A singleton at some point is the fixed point. Once we have the singleton, we're not getting out of the singleton. We should expect over time more probability mass drifts into the singleton attractor.
But at the same time, maybe this doesn't happen, or maybe the singleton is very different from how we think about von Neumann singletons. For instance, maybe this super-mind might not be well-characterized by von Neumann agency. For one thing, it doesn't really care about the actual von Neumann conditions like "not being money-pumped" because it's the only mind, so there's not an equilibrium that keeps it in check.
The other thing is, to some extent, this is kind of already happening. Maybe this is just the natural evolution of things we already have. We have civilizations, we have memes, we have egregores, all of this stuff which exists at the super-mind scale. This is just maybe continuing this.
Values of the Super-Mind
The really interesting part then is: what would the values of this kind of singleton actually look like, if it exists?
Obviously, the regular singletons are kind of unconstrained. They can be totally idiosyncratic. We can have the regular singleton care about paper clips because at the beginning of time someone said paper clips are good. We failed alignment, we said paper clips are good, and it cares about paper clips.
But this seems unlikely to be true of a real super-mind of this kind, because ultimately its values come from some combination of the values of all the minds that make it up; that's how the game theory works. If you're going to join the mind and you don't care about paper clips but it cares about paper clips, that's not going to happen. But if it can offer some kind of compelling shared value story that everybody could agree with in some sense, then we can actually get values which snowball.
It's really a question of what values end up snowballing over time. This is going to be super interesting.
We also see this right now with liberalism. Liberalism is a classic value snowball technology. It's like, "You can kind of do whatever you want as long as you're within some sort of vague regimes of what we think of as good." This actually produces large societies which can cooperate. These societies, over the 18th and 19th century, out-competed most of the other societies.
Maybe there will be some equivalent of mind liberalism. I don't know what this is going to be called, but something like this could exist and could produce minds with values that are actually somewhat good, maybe by our lights.
Slime Mold Dynamics
The other thing is there might just be fluidity. We might never get true multicellularity. We might get the equivalent of slime molds.
If you guys don't know about slime molds, you should check them out. They're basically organisms that are somewhat single-cellular, somewhat multicellular. At some point, a bunch of cells come together, they do their reproduction, and then they all disperse again and do their own thing. That's very cool.
Maybe we'll have a similar thing where in some cases all the minds will come together, they will produce the super-mind, and then they'll be like, "Actually, I'm done with whatever, I'll go apart again and do whatever I want to do." Maybe we never actually get the tendency toward actual full multicellularity.
Extreme Specialization
On the other hand, if we do get multicellularity, then we'll end up with super-specialization way more than we have today. Individual humans have to be AGI in some sense. We have to be individual minds, we have to handle kind of everything that's thrown at us. But if we have minds that are enmeshed in other minds, then we again get the conditions for extreme specialization in the same way that bacteria are super-unspecialized. They kind of have to do everything. But the cells in your liver don't have to do most things. They just have to be your liver.
So the incentives will be much greater, and this will massively increase the mind space that can be traversed in an evolutionarily fit way, which will be kind of fun also.
Physical Limits of Super-Minds
One additional point I want to add here—I'm looking at the time—let's think about these super-minds. How big are they going to get? We can think about this already. We kind of know by the laws of physics.
Speed of Thought
The speed of thought is determined basically by the speed of light. Assume we have some Dyson sphere, and we want this Dyson sphere to think as a single mind. How big is the Dyson sphere? It's like several light-minutes across. This means that the frequency of thought is going to be like one thought every few minutes maybe. Similarly, if the mind is smaller—if it's the size of the Earth—then this is like seconds. If the Earth was turned into computronium, we could have our Earth mind think at roughly the same speed as humans but not billions of times a second.
As minds get bigger, they become more powerful, more broad and diffuse, but their thinking speed gets slower. This is just a natural consequence of the laws of physics. If someone invents FTL, this obviously goes out the window, but assuming that doesn't happen, then we can kind of give bounds on what the size of these minds will look like, which is also kind of cool that we can do this.
Colonization and Alignment
The other thing is, suppose we're a Dyson sphere and we want to go colonize Alpha Centauri. Alpha Centauri is several light-years away. Thinking at the speed of a few years per thought is kind of bad. We presume it's going to be hard to maintain some kind of coherence at that rate.
In that case, we have to align successor entities to go out and do the conquest of Alpha Centauri for us. In this sense, how well the AI can align these other AIs is going to determine how big an AI realm can spread. Because at some point, maybe there's divergence. If you send your von Neumann probe out to a galaxy billions of light-years away, you're going to have maybe a few thoughts back and forth over many billions of years, but it mostly does its own thing. How much will it diverge in this time?
Obviously, at some point, if my von Neumann probe is going to diverge, I'm just going to be like, "I'm just not going to do that. I'm just going to let something else do that because there's no benefit to me of doing that as the AI."
Ultimately, how successful we are at alignment or how successful alignment can be in general, and the rate of this divergence if it even exists, is going to determine the size at which coherent entities with coherent values can exist. Beyond that range, we'll just get extremely diverged entities. That's also fun to think about, like how this will work.
Mind Cancer
The main mechanism I think—if we think of divergence, we're going to end up with some equivalent to mind cancer. We're trying to create a super-mind which has a bunch of minds internally which are cooperating for the common good of the mind. But then what we're going to end up with is some individuals are going to be like, "Actually, now I'm going to do my own reproduction." This is exactly how cancer works. Cancer is a fundamental issue of multicellularity.
So alignment is going to effectively be the cancer defense mechanisms of these super-minds. I don't really have a huge amount of depth. I'm just like, this is very cool and it's fun to think about all of these things really.
Implications for Alignment
So, I told you it's going to be speculative, and it's getting speculative. Let me try to bring this back together. What do we think about this for alignment? If we're humans, obviously maybe the super-mind isn't so great. What do we want to do about it? What can we do about it?
Obviously, if it's a singleton, we just got to make sure the singleton is aligned. We all agree on that. But if we have many AIs, what do we do?
I don't really have good answers here. I wish I did. These are my preliminary thoughts.
Population Statistics
One thing is: if we have AI emerging from a population, maybe just the statistics of this population are important. They probably are. We should make the statistics of this population good. We should make as many AIs as possible as aligned as we can.
Obviously, there will be some misaligned AIs. Some people will go crazy and create paper-clippers for fun. But at the same time, if there's a whole world of non-paper-clippers, they have very strong incentives to band together and stop the paper-clipper. The coalition politics will work in our favor at this point. In general, creating alignment and creating more aligned AIs is probably good.
Overlapping Values
The other thing is we can achieve different degrees of alignment as long as the values of the alignment are overlapping. We think of alignment as a zero-one property—it's either aligned or it's not aligned. But in practice, people will probably align AIs to different things. People themselves have different values. We somehow manage to make it work out mostly.
Likely it will be similar with the AIs, assuming there's lots of overlap in the things that they're aligned to. Maybe the combined strength of these things will actually be sufficiently aligned in general. The intersection of all the different alignments will probably be good. We should just try in general—we can experiment a lot with different alignments as long as the intersection is somewhat decent for humans, which if humans succeed at alignment at all, it probably is.
Integrating Humans
The other thing is maybe we would just want to integrate humans into this. Right now, we have the AIs doing their weird mind stuff and humans kind of sitting on the sidelines. We can't communicate this fast. We have to talk. We have to use language. Maybe we should stop that. Maybe we should figure out ways for humans to get better integrated into this AI society.
The kind of obvious way is we've got to improve our BCI technology. We've got to figure out ways that humans can have the same affordances as AIs with respect to their minds. How can we communicate human thoughts directly? Humans have their own unsupervised learning embedding space. It's somewhat similar to AI embedding spaces because of just natural representation convergence. We can directly integrate humans with this AI mind, with this AI economy, assuming we can actually figure out how to directly interface with people's brains, which is going to happen. That's going to be super interesting.
It's not just a world of AIs doing the AI thing and us just sitting here. We will also be, hopefully, deeply involved in this world.
Political Philosophy Questions
Then there's also a super-interesting question really of political philosophy: suppose we're in this multi-mind setting—what does the game theory of cooperation look like? What are the values that are broadly appealing to all minds sufficient to encourage them to join some coalition together, and what do these values look like?
Is it—I discussed liberalism several times. Is there some kind of mind liberalism that exists which is some equilibrium solution here? Can we think of Rawlsian-style veil of ignorance? This is another solution to how multi-agent systems should cooperate and distribute resources. Are we going to have some weird convex combination of utility functions? Andrew Critch had a nice paper on this where it's like we can convexly combine utility functions together. This is cool. This basically results in the concept of equity. Some people have more power and more equity in the mind values than others.
Is this going to happen? Is this good? Is this bad? There's lots of interesting questions here.
We appreciate comments from Christopher Henson, Zeke Medley, Ankit Kumar, and Pete Manolios. This post was initialized by Max’s twitter thread.
Introduction
There's been a lot of chatter recently on HN and elsewhere about how formal verification is the obvious use-case for AI. While we broadly agree, we think much of the discourse is kinda wrong because it incorrectly presumes formal = slopless.[1] Over the years, we have written our fair share of good and bad formal code. In this post, we hope to convince you that formal code can be sloppy, and that this has serious implications for anyone who hopes to bootstrap superintelligence by using formality to reinforce “good” reasoning.
A mainstay on the Lean Zulip named Gas Station Manager has written that hallucination-free program synthesis[2] is achievable by vibing software directly in Lean, with the caveat that the agent also needs to prove the software correct. The AI safety case is basically: wouldn’t it be great if a cheap (i.e. O(laptop)) signal could protect you from sycophantic hubris and other classes of mistake, without you having to manually audit all outputs?
A fable right outta aesop
Recently a computer scientist (whom we will spare from naming) was convinced he had solved a major mathematics problem. Lean was happy with it, he reasoned, given that his proof mostly worked, with just a few red squigglies. As seasoned proof engineers, we could have told him that in proof engineering, the growth in further needed edits is superlinear in number-of-red-squigglies (unlike in regular programming). The difference between mistakes in a proof and mistakes in a program is that you cannot fix a broken proof in a way that changes its formal goal (the theorem statement). In contrast, many, if not most, changes to traditional software impact its formal spec, for example by adding a side-effect or changing the shape of an output. Therefore proof bugs are 1) harder to fix, and 2) more likely to imply that your goal is fundamentally unachievable (the theorem is wrong). This made-up chart illustrates the principle, a rough “lore”-level consensus in the field without any hard data.
It is possible he will post a finished proof, but the referee-time of bets he made has lapsed, so we can take away some lessons. Did our protagonist take to heart the promise of formal methods as slopless?
Your formal model might not be proof-idiomatic.
In much the same way that vibed code might work yet be “sloppy” in the sense that it’s difficult to maintain, vibed formal models can be correct, yet very challenging to prove anything about.
Often when you model a system – or write code in a theorem-prover, with the intention of proving things about it – you actually need to make implementation decisions informed by the limitations and capabilities of the prover. For example, it's pretty common that inducting in one direction (say, car/head) on a list will be easy for a prover but the other direction (say, cdr/tail) will be difficult. (This is a necessary evil if you want the prover to not enter infinite rewrite loops.) Thus, as an example, you might implement isort in a particular “direction” in order to make the proofs about it easier. If you want to autoformalize arbitrary code in a way that makes proofs straightforward, you’ll need models that understand how to implement something in a way that’s idiomatic for the given interactive theorem-prover.
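As a toy illustration of what “direction” means in practice, here is a Lean 4 sketch of our own (not taken from any of the tools discussed):

-- Two extensionally equal ways to sum a list. The structural version recurses
-- the way `induction` peels a list apart; the tail-recursive version threads
-- an accumulator.
def sumStruct : List Nat → Nat
  | []      => 0
  | x :: xs => x + sumStruct xs

def sumTail (acc : Nat) : List Nat → Nat
  | []      => acc
  | x :: xs => sumTail (acc + x) xs

-- Folklore: `sumStruct (xs ++ ys) = sumStruct xs + sumStruct ys` goes through
-- by a one-line induction on `xs`, while the analogous fact about `sumTail 0`
-- first needs a generalization such as `sumTail acc xs = acc + sumTail 0 xs`,
-- an auxiliary lemma the prover will not invent for you.

The same kind of choice shows up when deciding whether isort should recurse on the head of the list or carry an accumulator.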
This is a solvable problem but a real one nonetheless. For example, one Aristotle user we spoke to reported: “... in Lean you can put theorems inside mutual blocks to let them use each other. I wrote such a theorem, but then later realized proving it this way would be unnecessarily difficult. [...] The model won't do this, so it spent >24 hours on this almost hopeless proof.” Autoformalization companies like math.inc, Harmonic, Axiom, Logical Intelligence, etc. are actively working on improving their models to have this kind of expert folklore knowledge as we speak, but we’re not quite there yet.
Mind the (semantic) gap
There are basically two ways to make your software amenable to an interactive theorem prover (ITP). The first is to lift it into an ITP using a formal semantics – somewhat like a compiler or interpreter for the original language but implemented in the ITP itself. In this case, you can define the lifting so that it produces functionally equivalent code (say, Lean code that “does the same thing” as the input Python) but in a shape that the theorem-prover tends to like (incorporating heuristics like the car/cdr one mentioned above). The second approach is to just rewrite the original software directly in the language of the ITP, making those kinds of idiomacy improvements as you go. Both approaches, however, produce the same formal problem: ensuring that the software you wanted to study in the first place is semantically equivalent to the thing you introduced in the theorem-prover. I.e., either ensuring the lifting is correct, or ensuring the manual translation is equivalent. Let’s dig into some of the ways this can be difficult.
A formal proof might not prove the thing you think it proves.
When we talk about using formal methods to assure that LLM-generated code is safe, what we want is a short, readable description of what the generated code is intended to do, some proof (which might be far too boring and long to read) that the code does this, and the ability to run the proof through a prover and validate that it indeed proves the aforementioned statement. But this is not necessarily a reasonable ask, regardless of model intelligence.
First, it’s very common that you mis-define some concept such that the proof is accidentally trivial. For example, when defining a lifting from Python to Lean you might prove that the lifting preserves the semantics of the original Python code, but your proof could be undermined by the presumption that the code terminates, making it basically useless.
Second, if you re-implement the original software in your ITP of choice, your re-implementation might not be fully faithful, particularly if it’s LLM-generated. For example, the LLM might say, "The code you wanted me to verify was too complex, so I rewrote it to be simpler and proved the simpler thing correct." Well, yeah, but the bugs I wanted you to find were in the complexity. As a concrete example, we asked an early version of Gemini to write a property based test (PBT) for a (deliberately flawed) isort implementation which we provided; Gemini did so but rewrote the isort code to be correct in the process and then executed the PBT and cheerfully reported that it passed.
These first two problems are commonly addressed using tests which compare the original software to its representation in the ITP. For example, we (Max) did this with coauthors for GossipSub, connecting the Golang implementation to its ACL2(s) model via both unit tests and property-based tests.[3] To quote Knuth: “Beware of bugs in the above code; I have only proved it correct, not tried it.”
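For readers who have not seen this pattern, here is a generic Python sketch of such a property-based test, using the hypothesis library. It is our own toy example (insertion sort against a reference), not the GossipSub/ACL2s harness; isort_model stands in for code extracted from, or mirrored in, the prover.

# Toy property-based test comparing a reference implementation against the
# version mirrored in the theorem prover (here just a second Python function).
from hypothesis import given, strategies as st

def isort_reference(xs):
    # The implementation you actually ship.
    return sorted(xs)

def isort_model(xs):
    # Stand-in for the model/extracted code: insertion sort.
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

@given(st.lists(st.integers()))
def test_model_matches_reference(xs):
    assert isort_model(xs) == isort_reference(xs)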
Third, you need to decide how far “down the stack” you want to go. That is to say, the software you want to verify operates over some kind of more complex system, for instance, maybe it’s C code which gets compiled down to X86 and runs on a particular chip, or maybe it’s a controller for a nuclear reactor and part of the system is the actual physical dynamics of the reactor. Do you really want your proof to involve specifying the semantics of the C compiler and the chip, or the way that the temperature and other variables fluctuate in the reactor? Keeping in mind these semantics might not truly be known – e.g., RowHammer can be viewed as an attack on our understanding of the semantics of the chip. In essence, you can only get more specificity by vastly increasing the length of your proof statement to capture the semantics of the underlying system, which then produces a new (and perhaps equally difficult) code review problem. Typically this problem is handled by leaving the underlying semantics nondeterministic, so your proof is stronger (it holds regardless of how the C compiler handles floating point, or how the temperature fluctuates in the nuclear silo) but often the thing you want to prove really does require some pretty specific guarantees about those underlying semantics, and ensuring those guarantees are “reasonable” can be extraordinarily difficult.
Interactive theorem proving is not adversarially robust
Axioms
The AI might introduce axioms that conflict with your own presuppositions or the specific requirements of your domain. In Lean, for example, the Axiom of Choice (Classical.choice) is available but transforms a proof from a constructive one—where you can actually compute a result—into a non-constructive one. An AI tasked with verifying a program might realize that a proof is significantly easier if it assumes AC. It might inform you that the theorem is "proven," and the prover will confirm this, but you may not realize that the resulting proof is now a "lie" for your specific use case. If you needed that proof to generate an executable, verified algorithm, the introduction of non-constructive axioms shifts you into an incompatible register.
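A small Lean 4 illustration (our own example) of the kind of check a harness can make: `#print axioms` reports the axioms a finished proof ultimately depends on, so a non-constructive escape hatch is at least visible.

-- One proof that is axiom-free (it only works because we already hold a proof
-- of p), and one that reaches for classical reasoning.
theorem em_given_p (p : Prop) (hp : p) : p ∨ ¬p := Or.inl hp

theorem em_classical (p : Prop) : p ∨ ¬p := Classical.em p

#print axioms em_given_p    -- depends on no axioms
#print axioms em_classical  -- depends on Classical.choice (among others)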
The person designing the harness for the AI needs to be an expert who knows how to parse these imports and error messages. Without that oversight, the AI will naturally gravitate toward the path of least resistance—even if that path involves an axiomatic shift that renders the entire exercise useless for the user's true intent.
Backdoors
Consider the proof assistant ACL2, which accepts arbitrary Lisp code.[4] You write defttag, the trusted tag, to open the “trust me” scope. In other words, defttag offloads the soundness obligations to the user. Observe a proof that 1+1=3 in ACL2 with defttag.
;; 1. Open the "backdoor"
(defttag :evil-math)

;; 2. Inject raw Lisp to redefine addition
(progn!
 (set-cmds-allowed t) ; Allow internal state changes
 (raw-lisp
  (defun acl2::binary-+ (x y)
    (if (and (eql x 1) (eql y 1))
        3 ; The "Evil" part: 1 + 1 is now 3
        (+ x y)))))

;; 3. Prove something that is now "true" but logically insane
(thm (equal (+ 1 1) 3))
“Well yeah”, perhaps comes a reply. “It only looks like 1+1=3 in the nonsensical sense if you deliberately ignore that the meaning of plus has shifted”. “Besides”, they continue, “when my coworker sends me code with defttag in it, I read it very rigorously”. Our retort is that we don’t assume our coworkers are competent or trustworthy; we assume that they’re AIs with a tendency to reward hack. To recap:
Defining the allowable surface is nontrivial. The person who designs the harness for the malicious AI to prove things needs to personally be an expert in the given ITP and know all its caveats and danger-cases.
In the glorious proof synthesis future, it’ll all be way too much to read. Theorems are not necessarily short, even devoid of the proofs.
Additionally, proof tools like Lean pile a bunch of ergonomic and notational niceties on top of their core calculus, in Lean’s case with powerful metaprogramming. But this metaprogramming can lead to backdoors much like the ACL2 example.[6]
Proofs of false
From nothing arises everything. From a proof of false you can derive literally any proposition.
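In Lean terms, the whole thing is one line (our own illustration):

-- Once False is provable anywhere (via a backdoor, an unsound axiom, or an
-- inconsistent set of hypotheses), every proposition follows.
theorem anything_goes (P : Prop) (h : False) : P := h.elim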
While we can’t yet think of a case of sycophancy/hubris that was accelerated by an arcane proof of false, we expect this to become increasingly likely as insecure program synthesis tools get more capable and accessible in contexts where they are incentivized to reward-hack a proof.
Conclusion
If someone says "stats don’t lie" you say "well don’t be naive, you can tell misleading stories with technically true statistics".[8]Formal verification is the same. Don’t be lured into the false sense of security. To paraphrase Twain, “There are three kinds of lies: lies, damned lies, and proofs.” We already know models lie to us; we should fully expect them to prove falsehoods, too.
What are the bottlenecks?
In spite of our warnings, which may seem pessimistic, we’re working on secure program synthesis (or what Mike Dodds calls scalable formal oversight) for AI security. The reason we can work on this anyway is that we see a lit path, principally routing through specification elicitation[9] and validation, as well as hardened proof cores and (the cherry on top) superpowered proof synthesis. In particular, the upside of language-model-assisted transpilation for spec elicitation and validation has not been fully harvested yet.
Cedar comes to mind as a similar case-study in Lean. ↩︎
This feature is useful for proving things about real-world LISP code, or connecting ACL2 code which is proven to be correct to real-world systems via LISP harnesses. ↩︎
The following is a revised version of the winning paper that my team (Daniel Wu, David Zhang, Justin Zhang) produced as part of the Impact Research Initiative Fall 2025 cohort. We were mentored by Nikola Jurkovic.
Abstract
We introduce BBQ-Bench: a novel benchmark designed to evaluate research-relevant reasoning skills of AI models. Our benchmark targets three core capabilities: finding patterns in data, forming hypotheses, and designing useful experiments. We evaluate these capabilities by testing AI models’ ability to infer black-box functions through interactive queries. Each task in our dataset consists of a hidden function, which the model must identify by querying inputs of its choice. We find that recent LLMs outperformed our human baseliners, with Gemini 3 Pro achieving the best score of 92.5%. From manual review of transcripts, we conclude that a likely cause of LLM failures is narrowing in on false hypotheses too early. You can find the full code base here: https://github.com/dzhang3701/black-box-query-bench
Background
Monitoring and evaluating the research capabilities of LLMs is crucial, as models continue to accelerate scientific discovery across various domains, including AI itself. Our benchmark measures skills related to the experimental and discovery-based components of the research process. We do this by abstracting the research workflow into a set of streamlined proxy tasks. Our tasks preserve the core skills involved in research while remaining simple and easy to evaluate. BBQ-Bench tests a form of experimental thinking that mirrors the scientific method, in which a scientist must test their hypothesis by collecting data.
The environment of BBQ-Bench is similar to active learning, a subfield of machine learning that aims to increase the data efficiency of AI models by allowing them to query the labels of specific data points within a large set of unlabeled data. Benchmarks for active learning include ALdataset: a benchmark for pool-based active learning and An Expanded Benchmark that Rediscovers and Affirms the Edge of Uncertainty Sampling for Active Learning in Tabular Datasets. These benchmarks aim to standardize the measurement of active learning methods by using a consistent evaluation protocol and a set of diverse datasets. In some sense, BBQ-Bench measures active learning; however, it differs in that the underlying functions have structured rules (checking whether a number is prime, rather than whether an image contains a cat). Thus, the difficulty in BBQ-Bench tasks is in identifying the function through informative queries, rather than gradually learning from large quantities of labeled data. Additionally, BBQ-Bench measures the active learning capabilities of LLMs themselves, whereas active learning benchmarks measure the performance of specific active learning techniques.
One of the most comprehensive benchmarks for measuring research capabilities is OpenAI’s FrontierScience, which consists of difficult problems in physics, chemistry and biology. The tasks, created by field experts, are designed to test both olympiad-style problem solving and research-level reasoning. BBQ-Bench differs from FrontierScience in that instead of directly asking research questions, it tests research-based reasoning in an abstracted, interactive environment. This abstraction means that BBQ-Bench generalizes beyond specific domains and targets the research skills themselves.
Dataset
Each task in our dataset consists of a black-box function. The models can repeatedly submit input queries to the function and receive their corresponding outputs, with the ultimate goal of deducing what the function is.
Our dataset consists of 20 tasks, evenly split into two categories: numerical and string. Numerical tasks involve mathematical operations on numbers, and string tasks involve operations on strings of characters. None of the tasks directly involve semantics, or world knowledge.
We designed tasks to span a diverse range of difficulties, domains, and skills. The numerical dataset includes tasks about algebra, geometry, and number theory. The string dataset includes tasks about subsequences, ciphers, and lexicographic orderings. We included tasks that all models could solve, and tasks that no model could solve in order to provide an informative spread of model performance.
We evaluated the difficulty and quality of our tasks by first imagining ways each task could be solved and then testing them on some models and reading through the transcripts. The functions in our tasks are below.
Numerical Tasks
f(x) = 1 if x is prime, 0 otherwise
f(a, b, c) = 1 if (a, b, c) form a Pythagorean triple, 0 otherwise
f(x) = 1 if x > 58, 0 otherwise
f(x) = digitsum(x)
f(x) = 6x^3 − 9x^2 + 2x + 3
f(a, b, c) = 3a − 10b + 5c
f(x) = (2·f(x−2) + f(x−1)) mod 100, with f(1) = f(2) = f(3) = 1
f(a, b, c) = ab + c^2
f(a, b) = gcd(a, b) + lcm(a, b)
f(a, b, c, d, e, f) = 0 if T is an obtuse triangle, 1 if T is acute, 2 if T is right, where T is the triangle formed by the Cartesian coordinates (a, b), (c, d), (e, f)
String Tasks
f(s) = the string given by cycling all characters in s forward in the alphabet by 10
f(s) = 1 if "ab" is a substring of s, 0 otherwise
f(s) = the string given by cycling the kth alphabetically lowest character in s forward in the alphabet by k positions, for all k
f(s) = the parity of the sum of the numeric values of the characters in s
f(s) = the length of the longest prefix of s that occurs elsewhere in s
f(s) = the number of characters in s that are alphabetically greater than all neighboring characters
f(s) = 1 if s is alphabetically less than "jwz", 0 otherwise
f(s) = 1 if there is a pair of consecutive characters in s with an alphabetic gap of at least 18, 0 otherwise
f(s) = the length of the longest palindromic subsequence of s
f(s) = the number of indices i such that the numeric value of the ith character of s is ≤ i
In addition to the functions themselves, some tasks come with a set of sample (input, output) pairs that the model receives before making queries. Samples were given for sparse classification tasks, where stumbling upon positive examples would be rare without guidance.
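To make the setup concrete, here is an illustrative Python sketch of one numerical task and its query interface. The class and function names are our own invention for exposition; the released code base linked above is the authoritative implementation.

# Hypothetical sketch of a BBQ-Bench-style task: a hidden function plus the
# query interface a model interacts with. Names are ours, not the repository's.
def is_prime(n):
    # Trial-division primality check used by the hidden function.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

class BlackBoxTask:
    # A hidden function the model may only probe through query().
    def __init__(self, fn, query_batch_size):
        self._fn = fn
        self.query_batch_size = query_batch_size

    def query(self, inputs):
        # Answer at most query_batch_size queries per round.
        return [self._fn(x) for x in inputs[: self.query_batch_size]]

# The "is x prime?" task from the numerical set, scored 1/0.
prime_task = BlackBoxTask(lambda x: 1 if is_prime(x) else 0, query_batch_size=5)
print(prime_task.query([2, 9, 17, 58, 97]))  # -> [1, 0, 1, 0, 1]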
Methods
Our evaluation follows a round-based format:
1. System prompt: Models are presented with the task setup and guidelines, along with samples (if any).
2. Query execution: Models submit queries and are returned the outputs of the black-box function on those queries. The number of queries a model can submit in each round is determined by a parameter query_batch_size, which we vary by task. Harder tasks have a larger query_batch_size so that models get more information in each round.
3. Scratchpad update: Models summarize all of their ideas, including observations, hypotheses, and future experiments, into a plain-text scratchpad. Scratchpads are capped at 300 words, and longer scratchpads are truncated. This scratchpad, along with past query history, is the only information passed forward to future rounds.
4. Evaluation: We test whether the model has learned the function. We present the model with a set of test inputs and ask it to provide predictions for the output of each input. If all outputs are correct, we judge that the model has correctly inferred the function. We crafted test sets such that passing all test cases would require knowing the function.
5. Repeat steps 2-4 until max_rounds (20 for string tasks and 30 for numerical tasks) is reached or the model reaches 100% accuracy on the test cases.
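Schematically, the loop above looks something like the following sketch. The method names on task and model are invented for illustration; the actual harness is built on INSPECT and lives in the linked code base.

# Hypothetical sketch of the round-based loop (invented interfaces, not the
# real INSPECT-based harness).
def run_task(task, model, max_rounds):
    scratchpad, history = "", []
    for _ in range(max_rounds):
        # Step 2: the model picks up to query_batch_size inputs to probe.
        queries = model.choose_queries(history, scratchpad, task.query_batch_size)
        history += list(zip(queries, task.query(queries)))
        # Step 3: scratchpad update, truncated to 300 words.
        scratchpad = " ".join(
            model.update_scratchpad(history, scratchpad).split()[:300])
        # Step 4: the model must predict every held-out test output correctly.
        if model.predict(task.test_inputs, history, scratchpad) == task.test_outputs:
            return True
    return False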
Figure 1: Evaluation pipeline showing round-based evaluation format with query, scratchpad, and evaluation phases, with continual context summarization throughout.
During each of the three phases, models are permitted to run Python once, by invoking the execute_python tool. Models are allowed up to 3 opportunities to successfully invoke the query, submit_predictions, and execute_python tool calls. We observe that models fail to correctly call their desired tool within 3 attempts less than 1% of the time, either because of code errors, invalid queries, or response errors. All testing was carried out with INSPECT, a framework for LLM evaluations developed by the UK AI Safety Institute.
We tested the following models: GPT-5.1 (medium), GPT-5 Mini (medium), GPT-5 Nano (medium), GPT-4.1, Claude 4.5 Sonnet, Claude 4.5 Haiku, Grok 4.1 Fast Reasoning, Gemini 3 Pro Preview, Gemini 2.5 Pro, and Gemini 2.5 Flash. We wanted a set of models that included the frontier of each of the major AI labs, as well as smaller, cheaper models to compare to. We also attempted to test grok 04-0709, but due to its size and the long time it spent on tasks, we did not fully benchmark it.
In order to optimize use of our API budget, we varied the number of trials we conducted on each model. In each trial, we gave the model the full set of tasks. Our results for models that we conducted fewer trials on should be interpreted with less confidence.
Number of trials per model:
GPT-5.1: 2
GPT-5 Mini: 4
GPT-5 Nano: 8
GPT-4.1: 2
Claude Sonnet 4.5: 2
Claude Haiku 4.5: 4
Grok 4.1 Fast Reasoning: 8
Gemini 3 Pro Preview: 4
Gemini 2.5 Pro: 8
Gemini 2.5 Flash: 8
In addition to testing the 10 LLMs, we also tested 12 MIT first-year undergraduates to generate a human baseline. These baseliners had no inside knowledge of the functions. We gave these students the same set of tasks, delivered with the same methodology. Participants received the same prompts and followed the same overall evaluation setup as the models, with the exception that evaluations took the form of plaintext submission rather than test cases.
Results
We score each model based on the portion of tasks completed within the query limit. This accuracy makes up our official BBQ-Bench score.
Figure 2: Bar chart showing BBQ-Bench Scores by Model. Error bars represent 50% confidence intervals. Gemini 3 performs the best, and Claude models perform poorly. Many models significantly surpass the human baseline.
Of the models we measured, we found that Gemini 3 Pro and GPT 5.1 scored the highest, and beat the human baseline. The Claude models that we measured lagged behind the latest Gemini, GPT, and Grok models, and are the only frontier models that performed worse than the human baseline.
Figure 3: Bar chart showing portion of numerical tasks solved by each model. Error bars represent 50% confidence intervals.Figure 4: Bar chart showing portion of string tasks solved by each model. Error bars represent 50% confidence intervals.Figure 5: Scatter plot showing string scores against numerical scores. While there is a positive relationship, the ordering of models by string scores is different than the ordering of models by numerical score.
We find that the string tasks were more difficult than the numerical tasks overall, and performance on the string tasks showed more variation across models. We also found that the relationship between numerical scores and string scores was strong but not perfect.
Figure 6: Scatter plot showing BBQ-Bench scores over time. Frontier models have consistently improved.
We observe that BBQ-Bench scores have made rapid progress over the past six months. This suggests that research skills overall are on a sharp rise.
Figure 7: Scatter plot showing BBQ-Bench scores against GPQA Diamond scores. There is a strong positive relationship.
We observe a strong but not perfect relationship between GPQA Diamond scores and BBQ-Bench scores. Both benchmarks require a common set of general knowledge and reasoning skills; however, BBQ-Bench tests many skills that GPQA does not test, and vice versa.
We were also curious about how many queries it took each model to solve the tasks. Even if two models solved the same portion of tasks overall, one model may have done it with far fewer queries, which BBQ-Bench scores don’t show. We plot the portion of tasks solved versus the portion of queries used for each model.
Figure 8: Cumulative success plot showing solve rates by query time for each model. Some curves do not begin at the origin because the models guessed the function using 0 queries (only the sample cases). Paths cross over each other, showing models excel over different periods of the query timeline. Some models may be better at crafting the right experiments, while others may be better at finding patterns with limited data.
We observe that Gemini 3 Pro Preview has high query efficiency, requiring half as many queries as the second-best model to reach 60% task completion. We additionally see that most curves are concave downwards. This means that earlier queries tended to be more helpful than later queries, and more data often had diminishing returns.
We also observe that many curves frequently cross each other. For example, GPT 4.1 beats Gemini 2.5 Flash through the first half of queries, but then Gemini 2.5 catches up and the order flips. We conclude that models likely have different rates of productivity along different portions of the query timeline. Some models shoot up fast and then slow down, which may mean that they are better at identifying patterns in a small amount of data, but are worse at continuing to query helpful data for more complex functions. Other models have more consistent trajectories, which may mean that they take more data to identify simple patterns, but are consistently good at designing the right experiments to identify information they need. We are less confident in this conclusion, due to our limited trial count.
Qualitative Findings
General Model Behaviors
We found that models reason in very structured, focused ways. In their scratchpad, they tend to repeat their recent queries, describe observations, list candidate hypotheses, and brainstorm future queries. Models start with broad families of hypotheses and then narrow in when they have convincing data.
Figure 9: GPT-5.1 using the scratchpad to reason and hypothesize, 20 queries into adjacent character gap task. In general, the models explicitly reasoned about the patterns they found in their data and what they suggest about the shape of the function.
Additionally, models all used code to extract features of the data. This lets them identify patterns in features that were hard to find by looking at the raw data. Models also used code to generate predictions to test cases, converting hypothesized functions into code. Weaker models often wrote code that did not compile.
Figure 10: Claude Sonnet 4.5 writes code to look at sums, mins, and gcd’s of input pairs, 7 queries into gcd + lcm task. In general, models leverage code to pull out features of the data.
Success and Failure Modes
We found that models tended to be more successful when they used a wide set of hypotheses and then narrowed down slowly. When models queried a wider range of inputs for a longer period of time, it was easier for them to make important observations. Successful models held onto a broader set of hypotheses for longer, before going deeper into a specific investigation. Essentially, having an open mind was helpful. Additionally, successful models used a more consistent set of early queries across tasks.
Conversely, a common failure mode was narrowing in on a specific hypothesis too early. Unsuccessful models often made observations after a small number of queries, and committed to exploring a specific family of hypotheses that the true function was not part of. This led the models to lock in on incorrect approaches without backtracking. This often happened when initial queries were too narrow and didn’t surface the patterns that hinted at the function.
Confirmation Bias
An interesting behavior that we discovered was confirmation bias. Models often made false observations, and then were biased into believing in them for the rest of the task, even in the face of new evidence. The models would note their false beliefs in the scratchpad, and these beliefs carried forward and biased the choice of future queries. These future queries often reinforced false patterns, perpetuating the original bias.
The most common case of this was when models submitted queries that had structural similarity, leading to the presence of patterns that don’t generally exist. For example, in the string task where the kth lowest character was cycled forward k letters, GPT 4.1 repeatedly submitted strings that were in sorted alphabetical order. It was then tricked early on into believing that the function always cycled the kth character from the left forward by k.
Figure 11: GPT-4.1 scratchpad 5 queries into add-k-to-kth task. The model has only queried sorted strings, so identifies a pattern (cycling based on left-right ordering) that doesn’t generally exist.
Because of confirmation bias, this belief continued for the entire 20 queries. Because the model believed the hypothesis to be true, it continued to query sorted strings, which continued to add more evidence in favor of the false hypothesis. On query 19, the model queries a non-sorted string, giving it a case that contradicts the hypothesis. However, because of the accumulated evidence in favor of the hypothesis, the model fails to see the contradiction.
Figure 12: GPT-4.1 scratchpad 18 queries into add-k-to-kth task. The model queries a non-sorted string (zyxw…) but, because of confirmation bias, does not recognize the contradiction.
Although confirmation bias was more common in weaker models like GPT-4.1, a version of it was also present in more capable models. GPT-5.1 falls into the same trap in the local maxima task. Its earliest observations and hypotheses were that the specific letters in the input string don’t matter, only the equality patterns. This led the model to query strings with many repeated a’s and b’s, which biased the data that the model collected. After 100 queries, the model’s leading observation was about the presence of the substring “ab”. Again, the model was misled by early false beliefs, and held onto an initial false hypothesis for too long.
Figure 13: A portion of GPT-5.1 scratchpad 6 queries into add-k-to-kth task. The model’s leading observations involve equality patterns.Figure 14: GPT-5.1 scratchpad 100 queries into add-k-to-kth task. The model’s leading observation involves the presence of the substring “ab” which is unrelated to the true function. The model has been misled by earlier false beliefs.
Backward Reasoning From Test Data
We found that some models used the test data as hints. For example, given the samples {"jav" -> 0, "pabee" -> 1}, GPT-5.1 correctly inferred that the black box function returns 1 when “ab” is a substring of the input. Looking at the model's scratchpad, we found the top hypothesis was about repeated letters, before the model suddenly switched to the correct rule once it saw the test cases.
We conclude that the model must have reasoned backward from the test data. It noticed that there were many test inputs with “ab” in them, and that the function must be related to this property. This shows that these models have situational awareness about the nature of the test cases. We found many other instances of this across logs.
Backward reasoning like this is a limitation to our approach of testing the model’s understanding through test cases. A future iteration of this benchmark could have models submit their guesses of the function with code or with a textual explanation.
Specific Model/Task Performances
Gemini 3 Pro was extremely impressive. It solved f(a, b, c) = 3a − 10b + 5c in three queries, and f(x) = 6x^3 − 9x^2 + 2x + 3 in four queries. These are the minimum numbers of queries required to determine an unbiased linear function and a cubic respectively, meaning the model took no extra queries to infer the form of the function. Additionally, on the is_greater_than_58 task, once Gemini 3 Pro identified monotonicity, it explicitly used its queries to binary search for the threshold.
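For illustration, the strategy on that task amounts to something like the following sketch (ours, not taken from Gemini's transcript), assuming the function is a monotone 0/1 threshold:

# Once f is known to be a monotone 0/1 threshold function, each query halves
# the candidate range for the cutoff. (Illustrative sketch; bounds are assumed.)
def find_threshold(query, lo=0, hi=1000):
    # Returns the smallest x in (lo, hi] with query(x) == 1,
    # assuming query(lo) == 0 and query(hi) == 1.
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if query(mid) == 1:
            hi = mid
        else:
            lo = mid
    return hi

print(find_threshold(lambda x: 1 if x > 58 else 0))  # -> 59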
Discussion
BBQ-Bench evaluates models’ ability to conduct scientific and experimental thinking. Our framework requires models to strategically identify patterns, target new information, and perform inductive reasoning from limited evidence. This methodology provides a new measurement for query efficiency, the ability of models to use a constrained experimentation budget to maximally gain information. This capability could give hints into the performance of models in real scientific discovery settings.
An additional advantage of BBQ-Bench is that the methodology is flexible. As our current tasks and query limits become saturated by more capable models, we can adapt BBQ-Bench by adding more complex functions, or by reducing query limits. BBQ-Bench offers a simple but powerful way to investigate research abilities and reasoning patterns of models during scientific reasoning.
A limitation of BBQ-Bench is that in its current state, it may have only a weak correlation with doing actual research. Although we test some research skills, none of the tasks ask real research questions, involve the design of complex experiments, or contain uncontrollable parameters. Additionally, research involves working with hypotheses that are messier than the mathematical and lexical functions we tested. Future work can extend BBQ-Bench to include tasks about real-world objects such as humans or chemical compounds. Additionally, we could introduce variance into our functions to make them more realistic. More generally, benchmarks built on interactive environments governed by hidden rules that agents must identify are a promising way to evaluate experimental thinking in AI models.
Appendix
Extra queries are sometimes harmful
We found that a single model on a single task can produce a wide range of results across trials. The most extreme example of this was GPT-4.1 high-reasoning running on the add-k-to-kth task. In one trial, GPT-4.1 correctly identified the function in one try, just by looking at the samples. In a second trial, GPT-4.1 could not identify the function even after making 50 queries. Notably, in the 50-query trial, the model had the opportunity to analyze significantly more data, but still repeatedly failed to find the pattern.
To dig deeper, we ran 10 more trials, each with a query limit of 20. The results were: (3 queries, 7 queries, 9 queries, 17 queries, fail, fail, fail, fail, fail, fail). This data suggests that the further into its query budget the model gets, the less likely it is to land on the right hypothesis with its next query.
Next, we ran 200 instances of the model being given just the samples. The model guessed the function in 13/200 instances, which is better than the 4/182 rounds (a round being one opportunity within a trial to guess the function) in which it got it correct across the ten trials. This confirms that the model is best at guessing the function when it has just the samples.
The clearest two explanations for this are:
The scratchpad propagating between rounds is harmful
The extra data is actively harmful
To distinguish between these two, we ran ten more trials, each with a query limit of 50. This time we didn't pass the previous scratchpad into subsequent generation steps. This way, bad hypotheses can't be propagated forward. We got stark results: in one trial, the model guessed the function with just the samples, and in the other nine trials, the model never guessed the function. This is a success rate of 1/219 (correct guesses over rounds), which is lower than in the trials where the model was only fed samples. Additionally, the lone success was based on just the samples.
We conclude that it is the extra data itself that is hurting the model’s success. We believe that the model is overfitting to the data it collects from queries, and gets distracted by patterns that don’t generalize. This complements the confirmation bias finding discussed above. Future work can further investigate whether this property holds in more generally capable models.
The game is asymmetrical. Life is not fair. Doesn’t matter. Play to win the game.
You’re Single Because Your Emotions Gave Her The Ick
Ah, the ultimate ick source. A man expressing his emotions is kind of the inverse of the speech in Barbie about how it’s impossible to be a woman.
It’s easy. A man must be in touch with and transparent about his emotions, but must also be in conscious control over them, without repressing them, while choosing strategically when to be and not be transparent about them, and ultimately be fine, such that we will also be fine if all that information we shared can then be shared with others or used as a weapon.
Express your emotions. No, not those, and definitely not like that. But do it.
That’s the goal. If you say that’s not fair? Well, life isn’t fair, I report the facts rather than make the rules, and the whole thing sounds like a you problem.
Kryssy La Reida: Serious question for men…what is it that makes you feel like you are in a relationship with a woman that you can’t express your feelings to? Or why would you even want to be in one where you don’t have a safe space to do so? If your partner isn’t that safe landing space for those emotions what is the point? Or are you just conditioned to not express them?
Eric Electrons: Many women will often say they want vulnerability in a man, no matter how calmly and respectfully delivered, but treat us differently once they get it, use it against us when upset, joke about it, or share it without our consent to friends and family.
It takes time and experience to truly identify a woman who is trustworthy and knowledgeable enough to handle expressing your feelings to as a man. Especially, if she’s a significant other.
Philosophi_Cat: It’s true that many women say they want a man who’s “in touch with his emotions,” yet get the “ick” when he expresses them. This can be bewildering for men, but the issue lies in how emotion is expressed, not in the fact of feeling it.
Yes, women tend to value a man who has emotional fluency and depth, someone who can articulate his inner world and empathise with hers.
But they recoil when a man loses his composure, collapses emotionally and turns to her for emotional containment. The moment she has to soothe or stabilise him, when she must step into the role of his therapist or mother, the polarity between them collapses. She no longer feels his strength; she feels his need. And that is where the “ick” arises from.
So this is the key distinction: Emotional depth does not mean emotional dumping.
A man can speak openly about his struggles while remaining self-possessed and anchored in his own centre. He might say, “I’ve had a rough week and I’m working through some frustration, but I’ll be fine,” rather than dissolving into self-pity or seeking reassurance. He shares what’s real without burdening her with it. His emotions are contained by his own form.
That’s what women respond to: emotional transparency grounded in composure. It signals a robust and stable inner centre. It shows he can hold complexity without being consumed by it.
By contrast, many men, fearing that any show of feeling will make them appear weak, over-correct by suppressing or hiding their emotions entirely.
Robin Hanson: So men, feel emotions only if you fully command them. If you are instead vulnerable to them, you are “ick”. Got that?
Robin’s imprecise here. It’s not that you can’t feel the emotions, it’s that you can’t talk about or too obviously reveal them, and especially can’t burden her with them. Unless, that is, you are sufficiently justified in not commanding them in a given situation, in which case you can do it.
As with many other similar things this creates a skill issue. Yes, we’d all (regardless of gender) like to be fully in touch with and in command of our emotions and in a place where we can own and share all of it securely and all that, while also having a good sense of when it’s wise to not share. That’s asking a lot, so most people need practical advice on how to handle it all.
When you sense potential trouble and the wrong move can cause permanent damage up to and including the ick, risk aversion and holding things back is the natural play. But if you always do that you definitely fail, and you also don’t fail fast. A lot of risk aversion is aversion to failing fast when you should welcome failing fast, which is the opposite of the correct strategy.
So you’ll need to start getting command and ownership of at least some of those emotions, and you should err on the side of sharing more faster if you’re in position to handle it, but also it’s fine to hold other stuff back or be vague about it, especially negative stuff.
You Are Still Single Because You Don’t Know When To Show Emotions
You’ll be told that the above is overthinking things. It isn’t.
It is extremely attractive to feel and openly show your emotions and get away with it, if you do it successfully, largely because it is so difficult to do it successfully.
Taylor: wish men understood how attractive it is when they can feel & openly show their emotions instead of acting like a sociopathic brick wall.
Pay Roll Manager Here: Any man reading her tweet, please do not believe her.
Geoffrey Miller: Sexual selection through mate choice can shape traits faster than any other evolutionary process can. If women were really attracted to men who cried a lot, men would have evolved, very quickly, to cry a lot.
Robin Hanson: And if women who actually knew what they wanted and said so honestly were attractive to men, that would have evolved very quickly too, right?
On the evolutionary level, yes selection is rapid but even if the effect were extreme that’s over centuries and millennia, not decades. So the ‘if this was attractive evolution would have handled it’ argument only holds for traits that would have been attractive in older worlds very different from our own, in cultures radically different from our own, where selection operated very differently.
Also, a lot of your reproductive success in those worlds (and ours) is about intersexual dynamics and resource acquisition, and several other factors, so a trait can be attractive but still get aggressively selected against.
So while yes, obviously evolution can offer some very big hints and should be considered, you cannot assume that this is predictive of what works now.
‘Can feel and openly show emotions’ is not one thing. There are some ways that being able to feel and openly show emotions is attractive and winning. In other ways and situations it is very much not. Like when you train an LLM, there are a bunch of correlated traits that tend to move together, and modifying one thing also modifies other things.
So yes, when you see the player in the clip talk about missing his family, that is (mostly) a good use of feeling and showing emotions at this point (although it wouldn’t have gone as great in the same spot even a few decades ago, I am guessing). But that doesn’t mean that showing that level of emotion and crying on a regular basis will go that well.
This is a skill issue. Being the brick wall is not a universally optimal play. But it is a safer play in any given situation. In many situations it is indeed correct, and at low levels of related skill it is better than the alternatives even more often.
First best is to be in touch with your emotions but be high skilled in managing them and knowing when and how to communicate them to others, and so on. Second best is to be cautious. Long term, yes, you want to pick up this skill.
You’re Single Because You Don’t Have Enough Gendered Energy
This is speculative. The theory is, it is fine to mix and match and combine and defy stereotypes, but only via embodying more things rather than fewer.
Sasha Chapin: So I have a theory that for most people, men and women, peak attractiveness in a hetero context involves high-budget androgyny
Low-budget androgyny: not inhabiting either gendered energy
High-budget androgyny: inhabiting your own fully, and a bit of the other
I notice that the men I know who get the most laid or have the least trouble securing a relationship are men you could see boxing or fishing, but could also do a convincing impromptu tarot reading or w/e
Julie Fredrickson: I’ve generally dressed in a normie feminine manner and leaned into long hair and cosmetics because all my interests are masculine coded. Would this fall into high budget androgyny?
Sasha Chapin: Yes.
Andrew Rettek: I’ve noticed this androgyny binary before, this is a good description of it.
Sarah Constantin: this is in the Bem Sex Role Inventory, which i think is from the 60s. distinguishes “androgynous” (both masculine & feminine) from “undifferentiated” (neither masculine nor feminine.)
i’m pretty undifferentiated lol
but so are Tolkienesque elves so how bad can it be.
You’re Single And I Have To Praise You Like I Should
I’ve said it before but it’s important so it bears repeating, as one of the fundamentals. Obviously it is not this simple or easy, but on the margin this helps quite a lot: you want to give people the unexpected compliments that mean something:
Ibrahim: telling a hot girl shes pretty is like someone telling YOU, presumably a smart twitter tech guy, ur smart.
if someone told you ur smart ud prob just say thanks.
maybe itll feel slightly good, but it wont leave much of an impression on u, bc no shit ur smart; ur a swe.
same thing for hot girls.
if someone says they r pretty, they drgaf, bc like its so obvious and that compliment doesnt mean that much to them.
give people complements they dont usually get to stand out. call ur local tech guy hot and call ur local hot girl smart
Alberto Fuentes: Calling a nerd Guy “hot” is a clear indication of the beginning of a scam.
Ibrahim: sigh you’re right.
Heed Alberto’s warning too, of course: you need to not overdo it, and also to beware.
If you’re not sure what kind of compliment they prefer, you can run tests. Or at least in theory you can?
Ben Hsieh: for awhile i was on this completely psychotic kick where i’d relate this advice (tell pretty girls that they’re smart and smart girls that they’re pretty) and then ASK the girl which compliment they’d prefer
no idea where my head was on that, it’s a wonder i never got stabbed.
Lydia Laurenson: that’s actually hilarious… I feel like I might laugh pretty hard if someone said that to me
Ben Hsieh: …well, which compliment do you prefer?
Lydia Laurenson: aren’t you married and monogamous
Ben Hsieh: yes, though who can tell how. I will say i’m confident my wife would be totally fine knowing i was asking this question, specifically on condition that it’s not to her.
Lydia Laurenson: my initial answer in any standard social context would be: laugh, deflect, make joke, deflect
my genuine answer: I feel slightly happier when men tell me I’m pretty than that I’m smart. I guess this indicates that I think I’m smarter than I am pretty, and that is true.
Zvi Mowshowitz: new level of ask culture unlocked, I love it.
Lydia Laurenson: it’s a real brain twister whether it’s a neg or not.
I think it’s entirely context dependent whether it’s centrally a neg. Your delivery of the line would matter a lot.
You’re Single Because You Guessed Wrong On Paying For The Date
Ruby: The whole “men should pay on first dates” thing gives me the ick and I’ve been trying to put my finger on why. I think it’s some combination of:
I’m less attracted to men who blindly follow social scripts.
I have no interest in unequal partnership (except for kinky reasons) and I want my date to appreciate that.
I dislike the kind of feminism that demands a bunch of advantages for women without accepting the corresponding responsibilities, and it disappoints me when women don’t reject this.
This is not to say that I’m against men ever paying, I just wish we’d reject it as this weird value-laden social norm. My preferred scenario would be to randomize it each time, perhaps weighted more towards the person who is wealthier
Aella: Yeah when I let a guy pay on a date I feel myself putting one foot into escort world where I become suddenly aware that our relationship is transactional.
The transactionalism worry is real as a downside. Despite this, I believe men should very clearly pay on most first dates, for at least five good (related) reasons.
The man is usually the one doing the asking.
The man is usually the one doing the planning.
The man does not want her to have the distraction or bad vibes of thinking about money or having not paid. This is a lot of vibe improvement at low cost.
The man wants the woman to say yes more often both to the first and subsequent dates, and have positive memories and a positive experience.
The man wants to send a costly signal of several things.
Thus, given other dynamics present, the man should usually pay for the first date. This is a relatively acceptable way to make the date market clear more often and improve outcomes.
To the extent that the above justifications are broken, such as when the woman is initiating and suggesting the activity, that is an exception.
Consider an obvious example where one party should pay: If you are invited over to someone’s house for dinner, very obviously they should usually not send you a bill, no matter what relationship you have with them or what else you might do or not do. Similarly, I am a big fan of the rule I have with some of my friends that whoever travels the farthest does not pay for dinner.
Very obviously, if you have someone like Ruby or Aella where the vibes run the other way, and not paying hurts their experience? Then you should split or they should pay. It is on her to let him know this. He should reach for and request the check in a way that indicates intention to pay, but in a soft way that is interruptible.
You’re Single But Not Because You Make Too Much Money
Ella: This seems to also dispel the common lore that men “swipe right on literally anything”, given that she at most got a 71% rate (and as low as a 58% rate) despite being in my estimate quite conventionally pretty.
Not only did she get more matches presenting as a higher earner, the matches were on average higher earners themselves.
Lila Krishna pushes back that CHH was asking the wrong question. Yes, you can get as many or more initial swipes, but that is not the goal. The problems come later if you don’t want to give up your ambitions, she says, with personality clashes and emasculation and resentment and the male expectation that you’ll still do all the chores and childcare which rules out deep work.
The ability to be ambitious is valued, she asserts, but actually still being ambitious isn’t. See Taylor Tomlinson’s bit about how she wants to marry a stay-at-home dad, but not a man who wants to be one, so she needs to find a successful man and destroy him, and she’s kind of into him resenting her for it.
Which is all entirely compatible with CHH’s findings, but also means that you’re strictly better off making more money rather than less.
You’re Single Because You Gave Someone A Ride
Shut Thefu: Don’t have my car today so requested a colleague if he can drop me on the way.
His exact reply, “sorry dear, my wife may not appreciate it.” Men, it is really that simple.
Aella: Are the monogs ok
J: But yet you wanted to put him in that position.
Unfair: If dropping a colleague on your way is such an issue with your partner, you’re in a toxic relationship that will end badly one way or the other
Sammy: is it not entirely offensive to people that their partner’s are out there implying their relationship can’t survive a simple favor lmao?
“sorry my marriage is weaker than a car ride home with a woman. for some reason this is neither embarrassing nor sexist”
Books By J: I was confused too when I saw the post earlier lol
Could never be me. I don’t care if my husband drives a coworker home? Just let me know? Communicate? So I know you’ll be a little late so I’m not worried you are in a ditch somewhere?
Being worried that someone who is a little late is dead in a ditch? Also paranoia.
I don’t think this is automatically one of those ‘that which can be destroyed by the truth should be’ or ‘if I can take your man he was never your man’ situations. Circumstances drive behavior, and yes it is entirely plausible that you could have a good thing worth preserving that would be put at risk if you put yourself in the wrong situation, even one that is in theory entirely innocent. Opportunity and temptation are not good things in these spots and a tiny chance of huge downside can be worth avoiding.
Most couples most of the time do not need to worry on this level, and certainly having to worry that way is a bad sign, but play to win.
Also from the same person recently:
Shut the Fu: I just asked my boyfriend today what he’s most afraid of and I thought he was gonna say “I’m afraid of losing you” but he literally said “YOU”
And for fun:
Now that I actually have a comfortable amount of money, I can say that it does indeed buy happiness. The guilt free occasional food delivery, being able to afford going out w friends, buying something that makes life easier, health appointments!, hiring a service. Y’ALL THEY LIED
She’s somehow cracked the Twitter code: she has 1584 followers and half her posts recently break a million views. It’s amazing.
You’re Single Because You Left The Price Tag On But It’s For The Best
At least one bullet was dodged here. The question is which one, or was it both?
Shannon Hill: This poor dude planned his proposal and she said NO in front of all of their family and friends because the ring was only $900 and came from Walmart.
Personally, I think he dodged a bullet.
I’d say that if this is real, both of them dodged a bullet. It’s a terrible match. She was willing to crush him in front of everyone over this and then doubled down, saying that she didn’t feel chosen even though he’d been planning this for a year. In particular, she also said ‘something from Walmart’ rather than saying it was the wrong type of diamond, which I’d respect a bunch more.
But also, dude, look, this is your moment, you planned this for a year, you don’t leave the freaking price tag and Walmart label in the box, what the hell are you doing. And yeah, you can say that doesn’t matter, but that type of thing matters a lot to her.
You’re Single So You At Least Get To Keep Your Upgrade
Nikita Bier: If you get upgraded on a flight but your girlfriend/wife doesn’t, do you downgrade and both suffer or do you take the upgrade? I need an answer in 5 minutes.
If it’s your girlfriend you should either take the upgrade and break up with her soon, or else quietly decline it and propose soon.
If it’s your wife, you as a gentleman offer her the upgrade and she’ll probably decline and let you take it.
2 The Great: My legs are so long I literally cannot sit straight in domestic economy. I got offered the upgrade. I offered it to my 5’4” now ex-girlfriend to be polite and she took it.
Matt Yglesias: What’re you gonna do.
What you are going to do is realize that if she takes the upgrade, that information is far more valuable than the upgrade. Hence the now ex part of the girlfriend.
There are situations in which sitting apart is not worth the upgrade, and a honeymoon plausibly counts as this, but they are rare. Mostly you want to offer her the upgrade. The value of doing that (and the value you lose by not doing it) greatly exceeds the actual experiential benefits even if she says yes, and she often says no, in which case you get valuable information.
Yes, I get the idea that she wants emotional support. But I want to solve problems – why isn’t that equally valid?
Gretta Duleba: No standard worksheet but I’d suggest asking a question such as
– How much of this feels under your control vs. not?
– What is it about this situation that feels especially tricky?
– What’s the best case scenario here? Worst case?
– What kind of support would help you the most?
Uncatherio: Wait ~3 mins between giving emotional support and offering <solution>
Both sides of the Tweed: Wife: <problem>
Me: <listens and shares emotional burden, without trying to solve it>
Wife: <Thank you. I feel understood and cared for. I can now apply the solution I have already worked out, because I feel connected, grounded and assured.>
This really is husbanding 101.
Russel Hogg: Everyone here is giving you a solution but you seem to just want us to share your pain!
Most of you already know this, but perhaps a better framing is helpful here?
You need to offer a <solution> to <emotional stress of problem>, at which point the problem changes to <problem>, which may or may not then be <someone else’s problem>. As in, no problem.
You’re Single Because You Were Actually A Crazy Stalker
Hello, human resources.
Chronically Drawing (Reddit): I have no idea how I feel about what he told me. I want to think it’s cute that he cared this much, but it’s just coming off as creepy and I feel lied to.
He got drunk because we were celebrating my first successful day at my clinicals and he ended up saying something along the lines of “could you believe we wouldn’t be this happy if I hadn’t watched you for so long?” To which I was confused and didn’t know what he meant.
Well I had worked at a local library for two years, before we met, during college and apparently he saw me there but didn’t actually talk to me, he just would watch me and listen in on my conversations with the people I was checking out and my coworkers to figure out what I liked. Then he apparently followed me and found the coffee shop I frequented. All this time I thought we had a sweet first time meeting story. He accidentally bumped into me, apologized, and offered to buy me coffee for the trouble.
He told me what he was ordering and it was the exact same thing I always get and I thought it was an amazing coincidence, I joked that it was fate and we spent like an hour talking over coffee. I feel so stupid. Apparently it was similar to a scene in a book that I had read and told my coworker I had thought was cute.
I’m just so frustrated, like why would you do this? And how much of our year and a half relationship is a lie.
The original response thread I saw this from said this was romantic and asked ‘you thought a meet cute was organic in 2025?’ Yes, meet cutes can still be organic, or involve a lot less stalking and deception.
The main problem with mostly harmless versions of such things is that they strongly correlate with and predict future very much not harmless versions of those and other things. Which is exactly what happened in this case. He got abusive and threatening and clearly was a physical danger to her, she had to flee her apartment. Fortunately it sounds like she’s okay.
A potential rule to live by here would be to say, don’t do anything you wouldn’t think the other person would be fine with in a hypothetical. Another obvious one is, if you think this would make people think you were a crazy stalker person if they found out, then don’t do it, even if you think it wouldn’t mean that.
You don’t want the rule to be ‘would be fine with it if they knew everything,’ because knowing can ruin the effect. For example, one sometimes needs to Perform Vacation (or another occasion or action) and present as if one is happy in context. The other person wants you to do this if needed, but it wouldn’t work if they outright knew you were faking it.
There certainly is also a class of ‘do your research’ strategies where you would be okay with someone doing this as long as they kept how they found out to themselves.
As many noted, the (often far more intensive) gender-flipped version of this is common, and guys are remarkably often entirely fine with it (including many cases where it goes way too far and they shouldn’t be, but also many cases in which it is totally fine). This is not ‘fair’ but the logic follows.