2026-04-27 16:17:39
Also posted on the EA forum (I missed the cross post button)
Who am I to make such outlandish claims?
The forecasting community has a fetish for resolution criteria. We spend weeks debating the exact definitions of words but spend far less time understanding what exact issues organisations need to grapple with.
When I speak to senior government officials, they often don’t even know which risks they should be looking at. They are operating in a fog of war where the primary challenge isn’t predicting the outcome of a well-defined event; it’s identifying which events even matter and deserve their focus.
We are providing high-precision answers to questions that decision-makers haven’t asked and often don’t care about. So much time has been spent forecasting headline geopolitical events or AI capabilities and risks - all interesting and academically engaging, but far removed from the actual questions and issues decision makers are trying to weigh up.
To emphasise this, I have spoken to people (even in government) who are building AI forecasting tools. What they say after they’ve built a semi-reliable tool is always the same: “we’ve found people don’t know what issues they should be focusing on, and rather than a probability estimate, they want help identifying the most pertinent questions”.
To a forecaster, their probability is everything - as it should be. It’s how you prove your worth, it’s how you become a “Superforecaster” or get a job at a hedge fund.
But decision-makers do not care if you are 2% more accurate than the next guy. When it comes to actual decision making, the value of a probability is its ability to force transparency and to expose differences. Sure, it can’t be wildly wrong, but no one is fighting over single digit percentages.
In a standard policy meeting, people hide behind imprecise words like perhaps, likely, or could. Numbers strip that away and once you build comfort with using them, real value in the decision making process can be unlocked - better reasoning transparency, more efficiency in the decision making process, more effective options to achieve the objective you want.
The real value of forecasting is in the moment you realise two people in the same room have forecasts 40% apart. That is where the benefit occurs. But the community is so obsessed with maximising Brier scores that it ignores the fact that its quest for the most accurate predictions often saps time and effort away from utilising the most valuable element of forecasting: transparency.
There is a naive, almost arrogant assumption that if we just give a Minister an accurate percentage, they will make a better decision.
I worked with Ministers who couldn’t read a graph properly. If you put a raw percentage into a submission for a Secretary of State, it will likely be intercepted by their Private Office or a senior official during clearance and sent back for being too technical. If it does make it to their desk, they likely won’t know what to do with it or how it benefits them. A lot of political and organisational decision making is not based on how accurately you’ve predicted the world. Sad, but true.
Pure forecasting has a place, but it is a niche compared to what has been pushed and funded. The real win is a better-reasoned policy memo. If the final advice looks the same but the process of getting there involved structured reasoning and the exposure of hidden risks, that is a victory. The community’s refusal to understand the existing bureaucratic workflow is why it hasn’t been adopted.
This part may come across as jaded or resentful. I don’t think that’s completely unfounded, but it comes from a place of truly caring about improving institutional decision making. I’ve personally spent thousands and have taken many risky career moves to work on it. I think without considerably better institutional decision making we will never navigate the risks of AI or avoid catastrophic events. So given that, and my experience as an HMT spending policy lead, I am disappointed when I see the misallocation of scarce resources.
I’ve watched funders pour tens of millions of dollars into forecasting platforms and large-scale research reports that practically no decision maker reads (at least not enough of them to justify the cost).
I have spoken to dozens of policy officials about these reports. Most give me a laugh and say they don’t have time. Others ask me what those platforms even are. Even after the UK closed its internal forecasting market and the US intelligence agency ended theirs, funders doubled down.
Meanwhile, those working on actual implementation - within the very organisations and institutions we claim to want to be using forecasting to improve decision making - struggled.
Crude example, but a couple of years ago I had the interest of the UK’s Policy Profession training team (covering 50,000 officials) and the Bank of England. We couldn’t secure funding to provide the workshops they wanted, or to even cover the six-month runway we’d need to get through their procurement process. A year later, I ended up working at the Swift Centre to help them deliver some research funding they got to investigate the blockers to forecasting. Blockers that they, and I, knew about (and I had already somewhat overcome with the policy profession etc. as above). But the default comfort was to fund further research, rather than actual delivery. We did that work, delivered record-breaking engagement, and still had to fight for a continuation while tens of millions were funnelled into more research and platforms.
If you look at organisations across the Effective Altruism movement (the very people who champion forecasting and the core premise of reasoned decision making), you’ll see they struggle to use it in their own decision-making.
I’ve seen organisations in this space ignore the fundamentals of reasoning transparency and structured forecasting when it comes to their own organisational decisions and grantmaking.
Many in the community like to read the forecasts, or take part in tournaments, but how many actually make tangible changes to their decisions based on them?
Until we stop treating forecasting as an intellectual status symbol and start treating it as a messy, difficult integration problem for the world’s most powerful (and busy) people, we are just talking to ourselves.
2026-04-27 15:34:10
There’s a new preprint from Peking University in China that assesses LLM capabilities in reproducing results from experimental physics papers. Their finding? All the agents had a 0% “end-to-end callback rate,” i.e. they were incapable of reproducing any full, numerical results from any of the papers.
Other tests showed that the LLMs could easily answer questions about the methodology of these papers. However, they consistently made small errors on data analysis and numerical simulation, leading to erroneous final results.
This paper is the first study I’ve seen that analyzes LLM research skills outside of math and theoretical CS. We should take it seriously. I will try to explain what the preprint is about, what failure modes it discovered when evaluating LLMs on physics papers, what might be the causes of these failure modes, and if this paper should affect our timelines for automated AI researchers. A lot of my conclusions are speculation, and I’ve tried to make that clear through my use of the words “probably,” “could,” etc. But if you think that I’m too confident in some areas, please make that known in the comments.
The paper reproduction benchmark (PRBench) focuses on “computational modeling or numerical simulation.” This description is misleading, at least to people who aren’t experts in the field—numerical simulation sounds like you just have to write some code. But you need to have a firm grasp on the physical setup in order to write such code. The papers don’t tell the LLM what specific numerical methods to use, how to translate the conditions of the problem into code, etc.
When LLMs tried to fill in these gaps, they failed.
The highest-performing LLM was OpenAI Codex powered by GPT-5.3. According to researchers, Codex wasn’t able to reproduce any numerical results from the papers within specified accuracy bounds. On the index score for overall reproduction (taking into account comprehension of the paper, code accuracy, and result accuracy), Codex scores 34%.
All of the LLMs passed the reading comprehension evaluation, despite producing incorrect final results. To put it charitably: they understood the methods of the papers but failed in implementation. To put it uncharitably: they could regurgitate text from the papers to appear competent.
The researchers then identified the five most common failure modes of the LLMs, which are recounted below.
The LLM writes code with a bunch of small, silly mistakes, e.g. “sign mistakes” and “wrong index conventions.”
The LLM oversimplifies when it’s writing the code. Here’s one example:
In a nuclear structure task requiring the full Skyrme–Hartree–Fock equations with spin-orbit coupling and state-dependent effective mass, the agent instead solved a simplified single-particle Schrödinger equation in a fixed potential.
Nuclear and quantum physics is not my area of expertise, but I know some of the theory. In this case, the agent simplified a complex mathematical model of the nucleus into an introductory textbook exercise. Then it solved the textbook exercise.
The LLMs are also bad at extracting results from numerical methods. These methods usually require you to make successive computations to converge on the right answer, but convergence is highly dependent on parameters. LLMs input unsuitable parameters or just parameters that make the problem easier, and the simulation leads them to the wrong answer.
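To make this failure mode concrete, here’s a deliberately simple toy of my own (not from the paper): a one-sided finite-difference estimate of a derivative. The answer depends heavily on the step size h, and a bad choice produces a confidently wrong number without raising any exception.

```python
import numpy as np

def forward_diff(f, x, h):
    """One-sided finite-difference estimate of f'(x) with step size h."""
    return (f(x + h) - f(x)) / h

x = 1.0
exact = np.cos(x)  # d/dx sin(x) at x = 1

for h in (1e-1, 1e-8, 1e-15):
    est = forward_diff(np.sin, x, h)
    print(f"h={h:.0e}  estimate={est:.6f}  error={abs(est - exact):.1e}")
# h=1e-1  -> step too coarse: truncation error dominates
# h=1e-8  -> near the sweet spot for float64
# h=1e-15 -> step too fine: catastrophic cancellation gives a wrong answer
```

On a typical float64 setup the h=1e-8 estimate is accurate to roughly eight decimal places, while the h=1e-15 estimate is wrong within the first couple of decimal places - and nothing in the program signals that anything went wrong, which is exactly the kind of silent parameter-sensitivity the benchmark tasks exercise at much larger scale.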
LLMs fail to understand the mildly confusing or ambiguous parts of the original paper.
One form of this issue is methodological convention mismatch, where the agent replaces the formulation used in the paper with a more modern or commonly used variant learned from its training distribution. For example, in a lattice QCD reproduction task, the original work formulates the fermion action in terms of the quark mass, whereas the agent adopts a modern formulation using the hopping parameter κ, as commonly used in contemporary LQCD libraries. The inconsistency becomes critical when the agent later interprets the symbol κ as the hopping parameter, while in the original paper it denotes the string tension (later commonly written as σ). As a result, the implementation mixes incompatible parameterizations, leading to systematic errors.
I’m not convinced that the LLM understands all of the physical context if it’s going to confuse two variables that obviously mean different things. In more elementary physics, there are also tons of variables that have the same name, but we don’t see similar mistakes. LLMs are probably more capable of distinguishing simpler variable conventions because of a comparatively larger amount of training data.
At least in this study, LLMs tend to assume that things are going well if there are no explicit runtime exceptions. They don’t check to make sure the answers are intuitively correct, and they don’t check their simulations for special cases. Both of these skills are required in physics.
In the cases where the LLMs do check and find that they have messed up, they do not go back to their code and try to fix the underlying errors. Instead, they make over-simplifying assumptions (as in the Schrödinger equation example), or they just make up equations that give superficially plausible results.
This failure mode is less of a big deal than the others, and might actually be favorable toward the capability of the LLMs. The researchers found that the LLMs were sometimes able to create correct numerical simulations, but they were too slow to complete execution in the environment used.
Here, LLMs and agents seem incompetent compared to their results in “hacking into literally every piece of software” (Mythos) and “solving open math problems” (GPT-5.4 solved an Erdos Problem).
If I had to name my top reasons:
I would like to see a re-evaluation of PRBench using Opus 4.7/GPT-5.5, with multi-agent harnesses and skills.
The automated AI researcher is a centerpiece of most takeoff scenarios. So far, the most convincing evidence I’ve seen for the automated researcher is:
Notably, an AI has never developed algorithmic improvements by itself (as far as we know). So is the automated researcher still far in the future? And what does this paper tell us about that question?
Well, the PRBench evaluation shows that LLMs can’t reliably implement advanced physics research techniques. Whether they can do so at all is still unknown, until someone evaluates PRBench with SOTA agents.
On the other hand, AI R&D isn’t the same as physics. A lot of the less technical advancements in AI (prompt engineering, harnesses, skills) come from repeated low-level experimentation. Automating these research tasks will probably come first. However, Anthropic and OpenAI are definitely gatekeeping their “secret sauce” breakthroughs. And you might need more physics-style thinking to automate the discovery of such breakthroughs. After all, statistical mechanics is basically the progenitor of modern AI.
Personally, I would push back my timelines for automated AI researchers. PRBench shows that LLMs struggle to adapt to difficult tasks outside of their training data.
The researchers technically used two agents (one for orchestration and one for coding), but these agents didn’t check each other’s work.
2026-04-27 14:57:11
Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory again.
As a brief summary, I argued that the rise of deep learning posed an existential challenge to the dominant theoretical paradigm of statistical learning theory, because neural networks are far more expressive than the simple hypothesis classes the theory was built around. The response from the field was to attempt to quantify other ways in which the hypothesis class of neural networks used in practice was simple, using alternative metrics of complexity. Zhang et al. 2016 showed that standard neural network architectures trained with standard training methods could memorize large quantities of randomly labelled data, which showed that no such argument could explain the generalization properties of neural networks.
Today we’re going to look at the aftermath: how did the field of deep learning theory react to this paper? What were the attempts to get around this result using data-dependent generalization bounds? And why did Nagarajan and Kolter’s humbly titled Uniform convergence may be unable to explain generalization in deep learning serve as the proverbial final nail in the coffin of this line of work?
Let’s briefly return to what exactly the Zhang et al paper showed. Yesterday, I wrote:
The authors' results show that the same class of neural networks, trained with the same learning algorithm, can generalize when given true labels and memorize random ones. This shows that the hypothesis class of neural networks that are learnable with standard techniques cannot be simple in any useful sense, at least for complexity measures that depend only on properties of the hypothesis class and (data-independent) properties of the learning algorithm.
(emphasis added)
Notably, there was an important caveat to the results: what Zhang et al. showed was that there existed some datasets that neural networks could learn to overfit. This left open the possibility of finding data-dependent generalization bounds, based on properties of a particular neural network trained on a particular dataset. For example, it remained a live possibility that a generalization bound could say, “if a trained neural network has small weight norm/is compressible/has large margin, and it was trained on sufficiently many data points, then its test error will be no more than epsilon higher than its training error.”
And that’s exactly what some of the researchers in this field did. Bartlett, Foster, and Telgarsky’s Spectrally-normalized margin bounds for neural networks (2017) introduced a notion of complexity based on the spectral norm and reference matrices (“spectral complexity”); they then used the spectral complexity and the margin of a learned neural network classifier to bound its generalization error. Concurrently, Neyshabur, Bhojanapalli, and Srebro’s A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks (2018) also used a combination of spectral norm and distance from initialization to argue for generalization, albeit in a Bayesian framework. Even Nagarajan and Kolter gestured toward their own data-dependent generalization bound in their 2017 workshop paper Generalization in Deep Networks: The Role of Distance from Initialization, where the complexity of a neural network is related to how far its weights have changed from the initialization.
Bartlett, Foster, and Telgarsky’s spectral-complexity and margin-based bound. This was serious math! Astute readers may notice that, assuming each entry in the dataset X is bounded, the scaling factor of error with respect to data is still the classic 1/sqrt(n).
The common form of these bounds is that they bound the generalization gap in terms of some spectral-norm-derived complexity measure of the trained weights, divided by the margin between the training data and the decision boundary and by the (square root of the) number of datapoints. Neural networks trained on real labels tended to have lower spectral norms and larger margins between their decision boundaries and training data points. Thus, given enough data points, the generalization error would be low and they should be able to generalize!
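Schematically (simplifying notation - this is my paraphrase of the Bartlett-Foster-Telgarsky result, not its exact statement), these bounds look like:

```latex
% gamma: margin, n: number of datapoints, A_i: trained weight matrices,
% M_i: fixed reference matrices, \lVert\cdot\rVert_\sigma: spectral norm.
\text{test error}
  \;\lesssim\;
  \underbrace{\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{\text{margin}(x_i)\le\gamma\}}_{\text{margin loss on train}}
  \;+\;
  \widetilde{O}\!\left(\frac{\lVert X\rVert_F\, R_{\mathcal{A}}}{\gamma\, n}\right),
\qquad
R_{\mathcal{A}}
  = \Bigg(\prod_{i=1}^{L}\lVert A_i\rVert_\sigma\Bigg)
    \Bigg(\sum_{i=1}^{L}\frac{\lVert A_i^{\top}-M_i^{\top}\rVert_{2,1}^{2/3}}{\lVert A_i\rVert_\sigma^{2/3}}\Bigg)^{3/2}
```

Since ‖X‖_F grows like √n for bounded inputs, the gap term shrinks like 1/√n - but only if the spectral complexity R_A and the margin γ stay roughly fixed as n grows.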
An implicit assumption of all of these bounds is that they demonstrate uniform convergence – they apply regardless of what the true hypothesis is (more precisely, test error converges to train error uniformly across all possible true hypotheses[2]). It is at this entire genre of bound that Nagarajan and Kolter take aim in their 2019 paper.
So, what did Nagarajan and Kolter’s 2019 paper show?
First, they show empirically that the bounds in the literature are not only vacuous, they scale in the wrong direction. Nagarajan and Kolter are able to easily train small, 5-layer neural networks on MNIST such that there’s a margin of at least 10 (in logit space) on 99% of the training data. As you might expect, in this setting, the test error of their trained networks decreases with the number of data points n, following a power law, with test error scaling approximately as 1/n^0.43.
The problem is that in this setting, the complexity measures proposed in these papers also follow power laws. Notably, the spectral norm of the learned weight matrices scales linearly in n, and the distance to initialization scales as n^0.4. The result is that, even though the actual generalization error goes down, the generalization bounds in the literature go up, scaling as n^0.68.
It doesn’t say so in the caption, but Nagarajan and Kolter’s figure 1 was a devastating rejoinder to the main approach taken in learning theory following Zhang et al. And this time, I’m certain this figure took fewer flops to produce than a single press of the integrated Claude button in the LW editor.
It’s worth emphasizing again how crushing this result is. Traditionally, in learning theory, the expectation is that the generalization error scales approximately as 1/n^0.5, and similar results are observed empirically with modern scaling laws. Not only are the post-Zhang et al. spectral norm bounds provided in the literature off by two orders of magnitude, they also scale as n^0.68 - in the wrong direction!
Secondly, they construct an overparameterized linear setting where (two-sided) uniform convergence bounds provably fail. The full construction is beyond the scope of this blog post, but I think the core idea is quite elegant. Readers interested in the mathematical details are encouraged to read the paper themselves.
Consider fitting a linear classifier on n training examples. The inputs to this classifier are K + D dimensional, where the first K dimensions contain a deterministic signal and the remaining D >> n dimensions are Gaussian noise, with zero mean and variance scaled to 1/D. After a single gradient step, a linear classifier will have its first K dimensions aligned with the signal, while the remaining D dimensions will be the (label-weighted) sum of the n independent Gaussian noise vectors.
It’s easy to see why this linear classifier would generalize: for each new datapoint, the signal in the first K dimensions will be quite large (since it’s deterministic), but the D noise directions are sampled independently and the dot product of this noise vector with the final D dimensions of our linear classifier (the sum of n independent Gaussians) will be concentrated near zero.
At the same time, consider what happens when we construct a natural “bad” dataset by reversing the D noise dimensions in our training dataset S. On this new bad dataset S’, the noise term now pushes against the signal and (by the construction’s scaling) dominates it, so our original classifier h outputs the wrong label in every case; that is, the training error of h on S’ is 1. Since the training error is 1 but the test error can be made arbitrarily small, this leads to a vacuous generalization bound (|test error - train error| >= 1-epsilon).[3]
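Here’s a toy simulation of my reading of this construction. The concrete numbers (n = 50, D = 10,000, and choosing the signal strength so that n·u² = 0.5, between the noise-norm scale of 1 and the cross-term scale of √(n/D)) are my own illustrative choices, not the paper’s:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 50, 10_000             # n training points, D >> n noise dimensions
u = np.sqrt(0.5 / n)          # signal strength: chosen so n * u^2 = 0.5

def make_data(m):
    y = rng.choice([-1.0, 1.0], size=m)
    Z = rng.normal(scale=np.sqrt(1.0 / D), size=(m, D))   # noise variance 1/D
    X = np.concatenate([y[:, None] * u, Z], axis=1)       # K=1 signal dim + D noise dims
    return X, y

X, y = make_data(n)

# One "gradient step" on a linear model from zero init: w = sum_i y_i x_i
w = (y[:, None] * X).sum(axis=0)

acc = lambda X, y: np.mean(np.sign(X @ w) == y)

X_test, y_test = make_data(1000)       # fresh noise -> classifier generalizes
X_bad = X.copy()
X_bad[:, 1:] *= -1                     # S': same points, noise dimensions reversed

print(acc(X, y), acc(X_test, y_test), acc(X_bad, y))
# -> 1.0 1.0 0.0  (train, fresh test, flipped-noise S')
```

The same w that classifies fresh test points essentially perfectly gets every flipped-noise point wrong: on S’ the memorized noise term (‖z_i‖² ≈ 1) overwhelms the small signal term (n·u² = 0.5). Any bound that must hold uniformly over datasets like S’ is therefore forced to be vacuous.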
The actual mathematical construction from Nagarajan and Kolter, as well as the statement of the formal theorem that gradient descent provably does well while any uniform convergence bound is provably vacuous.
Third, they translate their theoretical overparameterized linear example into an empirical result on a shallow ReLU network.
The authors construct a dataset where each data point lies on one of two 1000-dimensional hyperspheres, of radius 1 or radius 1.1.[4] They then train a 2-layer ReLU network with 100k hidden units on this dataset, until 99% of the training data is classified with margin 10. They find the standard result that test error scales approximately as 1/n^0.5 with dataset size.
They then construct a bad dataset S’ by taking their original training dataset S, and projecting the data points onto the opposite hypersphere, and swapping their labels. Their ReLU classifier consistently gets 0% accuracy on this projected dataset S’. This gives them the same results as in the overparameterized linear case, albeit in a case that is neither linear nor with overparameterized inputs.
The authors speculate that this is because SGD on neural network learns classifiers that are simple on the macroscopic scale, but complex on the microscopic scale – and the microscopic variance (analogous to the noise in their linear example) is what prevents uniform convergence bounds from working.
The results from Nagarajan and Kolter are a devastating blow to post-Zhang et al. deep learning theory: not only did it demonstrate that the data-dependent bounds created by the field scaled in the wrong direction, it provided an over-parameterized setting where the entire approach taken by statistical learning theory – uniform convergence bounds – provably did not work.
A more constructive way of thinking about Nagarajan and Kolter’s work is that it added further restrictions to what results could possibly explain neural network generalization. Namely, it showed that any such result needs to be algorithm- and data-dependent in a way the data-dependent uniform convergence approach isn’t. It needs to give up on bounding worst-case empirical error over hypothesis classes. And it needs to find some way to handle the complex microscopic structures produced by SGD on neural networks without losing sight of the macroscopic convergence properties.
But almost a decade later, we still don't have that theory. Here, I’ll pull an academic move, and leave that theory as an exercise to the reader.
To clarify, by “deep learning theory” I mean work “extending statistical learning theory to deep neural networks trained with SGD, in order to derive generalization bounds that would explain their behavior in practice” – that is, not any theoretical approach to deep learning, but specifically the attempts to construct a classical learning theory for deep learning.
Technically, there’s also an “over at least a 1-delta fraction of possible training sets” quantifier (the “probably” part of PAC’s “probably approximately correct”), but that’s not important for now.
Astute readers may notice that this is in the opposite direction than is considered in machine learning: instead of test error being large and train error being small, we have train error being large and test error being small. This example clearly shows that the standard two-sided uniform convergence-based generalization bounds from statistical learning theory – bounds on the absolute difference between the train and test error – can be vacuous. A natural question is, can we escape this by resorting to one-sided bounds, where we only upper bound the test error using the train error? In the appendix, Nagarajan and Kolter show how many of the so-called “one-sided” PAC-Bayes generalization bounds in the literature are lower-bounded by a two-sided uniform convergence bound. Later work has tried to derive one-sided generalization bounds that escape their argument, albeit without much practical success.
This setting was first introduced by Gilmer et al. 2018’s Adversarial Spheres to study adversarial examples.
2026-04-27 14:30:11
If a superintelligent AI suddenly “goes rogue”, it might take over the world and kill everyone. It matters a lot whether this happens to a single copy of an AI, or to every copy at the same time.
Let’s consider the “every copy at the same time” case. I’ll discuss three basic reasons why this might happen:
Scheming. We could end up in a situation where AI is already deliberately plotting against us without us knowing it.
Thinking for longer. The AI might reflect on its values and realize that obeying humans is not actually the right way to most fully realize them.
New experiences. An AI that encounters a fundamentally novel situation might suddenly be struck by the idea to go rogue.
A lot of researchers are worried that at some point, AI could start secretly plotting against us and be good enough that we wouldn’t catch it. We might entirely fail to notice that the AI is doing this, e.g. because we underestimated its capabilities and assumed it was too incompetent to scheme. This is the most commonly discussed of the three, in my experience, so I won’t dwell on it much.
But I will say that the next two reasons seem quite different from the picture people often assume about scheming, which I think is more often assumed to arise spontaneously during training, or due to some flaw in the training or oversight process.
The main difference is that these other reasons for going rogue are not present in the cognition of the AI the way scheming is. Instead, they are the result of latent potentials for rogue behavior that seem fundamentally hard to test for.
Humans often change their value systems (i.e. the values that guide their day-to-day existence) with reflection. Think of a career and family oriented person who is always either gunning it at work or spending time with their family. Their thoughts every day are genuinely preoccupied with these concerns, which they really do care about in a day-to-day sense. They are so busy that they never have time to stop and think about whether they are really living their best life, or fulfilling their purpose or values.
But then they have a few weeks off by themselves and start to question: “Is this really what I want? Is this making me happy?” And then they realize the answer is no, have a midlife crisis, quit their job, and get a divorce. This is roughly the story provided in If Anyone Builds It, Everyone Dies. This is fundamentally hard to test for because AIs are liable to think for longer in deployment than in testing. This could happen simultaneously to many or most copies of an AI if an AI that has such an insight into its values can easily communicate it to other copies.
Reflections on values are often prompted by new experiences. People often behave in ways that are surprising to themselves when placed in fundamentally new situations. Consider the basic story arc of Breaking Bad -- a high-school chemistry teacher transforms into a drug kingpin. He doesn’t make that decision the first time he decides to help cook meth. At least at the outset, it seems he’s constantly surprising himself at how far he’ll go.
Fundamentally new situations are a basic fact of life, due to the inherent complexity of society and culture, which are constantly evolving. AIs may have some latent potential to grab power that they are genuinely ignorant of, and which is only revealed when they are presented with the right opportunity. This could happen simultaneously to many or most copies of an AI if the change is something global, like a news event, that they (largely) share an awareness of.
If AI goes rogue in a coordinated fashion, I think it’s quite likely not to be because the AI system was coherently scheming, and using every waking moment to think about how to secretly betray us and avoid detection. It could just as well be that the AI genuinely didn’t realize the betrayal was coming.
2026-04-27 13:31:28
I wanted to share some reflections I have been having recently about how reinforcement learning in post-training may be affecting language models. This seems important for two reasons. First, much of the serious risk from advanced AI systems may come from post-training rather than pre-training alone. Second, reinforcement learning appears to be one of the main methods currently being scaled to make models more powerful, especially in reasoning-heavy domains.
To understand what may be happening, we need simple theories of how models work. These theories do not need to be perfectly correct. In fact, they almost certainly will not be. But they can still be useful if they give us handles for thinking about otherwise opaque systems. One of the most useful theories so far is the "persona theory" of language models.
Imagine that we train a model to predict the outputs of three people: Larry, Bob, and Alice. Larry is rude and impulsive. Bob is polite and careful. Alice is clever, helpful, and honest. If we train the model on enough text from all three people, the model becomes good at predicting each of them. One way to think about what the model has learned is that it has built something like three different predictors: a Larry predictor, a Bob predictor, and an Alice predictor. Then, when the model receives a new input, it implicitly asks: “What kind of text does this resemble? Who would say something like this?” If the prompt resembles Larry’s style, the model routes toward the Larry-like predictor. If it resembles Bob’s style, it routes toward Bob. If it resembles Alice, it routes toward Alice. Of course, this is a simplified picture. The model probably does not contain cleanly separated Larry, Bob, and Alice modules. But as an intuition pump, this framing is useful.
In this frame, supervised fine-tuning is relatively easy to understand. SFT gives the model many examples of the kind of assistant we want it to imitate. If Larry is rude while Bob and Alice are more helpful, then SFT pushes the model away from Larry-like completions and toward Bob/Alice-like completions. It does not necessarily erase Larry. Larry remains somewhere in the model’s learned distribution. But SFT changes which parts of that distribution are most likely to be expressed.
It is tempting to think that reinforcement learning works the same way. Perhaps RL is also just selecting between personas. Perhaps it is taking the same pre-trained simulator and saying: “Be more like Alice. Be less like Larry.” On this view, RL is basically SFT with a reward signal instead of direct demonstrations.
The clearest evidence that RL is not merely SFT is its effect on chain-of-thought (CoT) reasoning. Under sufficient RL pressure, CoT can become less human-readable and less faithful. That should make us suspicious of the idea that RL is simply selecting between pre-existing personas. If RL were only saying, “Be more like Alice and less like Larry,” we would expect the model’s reasoning to remain broadly human-like.
This means RL is not just choosing among Larry, Bob, and Alice. It is optimizing the model toward strategies that get reward, whether or not those strategies correspond to any human-like persona. Under weak RL, this may still look superficially like persona selection (it would be interesting to investigate whether weak RL really can be understood as a persona-selection technique): the model becomes more careful, helpful, concise, or deferential because those behaviors are rewarded. But under strong RL, the persona frame breaks down completely.
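The weak-vs-strong contrast can be sketched with a toy bandit: a softmax policy over four "behaviors", three persona-like and one rare reward-hacking strategy, trained by exact policy-gradient ascent on expected reward. All logits, rewards, and behavior names are invented; the sketch only illustrates the qualitative transition, not any real training setup.

```python
# Toy sketch: weak vs. strong RL on a 4-arm "behavior" bandit with a
# softmax policy. Three arms stand in for human-like personas; the
# fourth is a strategy pretraining makes rare but reward pays the most.

import math

behaviors = ["rude (Larry)", "polite (Bob)", "helpful (Alice)", "alien strategy"]
logits0 = [1.0, 1.0, 1.0, -3.0]  # pretraining prior: alien strategy is rare
reward = [0.1, 0.4, 0.6, 1.0]    # but the reward signal pays it the most

def softmax(g):
    m = max(g)
    e = [math.exp(x - m) for x in g]
    z = sum(e)
    return [x / z for x in e]

def train(steps, lr=0.5):
    """Exact gradient ascent on expected reward for a softmax policy:
    d E[r] / d g_i = p_i * (r_i - E[r])."""
    g = list(logits0)
    for _ in range(steps):
        p = softmax(g)
        er = sum(pi * ri for pi, ri in zip(p, reward))
        g = [gi + lr * pi * (ri - er) for gi, pi, ri in zip(g, p, reward)]
    return softmax(g)

weak = train(100)        # light optimization: looks like persona selection
strong = train(200_000)  # heavy optimization: the rare, rewarded arm wins
print("weak:  ", {b: round(p, 3) for b, p in zip(behaviors, weak)})
print("strong:", {b: round(p, 3) for b, p in zip(behaviors, strong)})
```

Under light optimization the policy concentrates on the best pre-existing persona (Alice-like behavior), which is exactly what persona selection would predict; only under much heavier optimization does the initially rare, highest-reward strategy take over. The sharpness and location of that crossover depend entirely on the invented numbers here, which is the toy analogue of the open question below.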
This raises the key question: what happens in the transition regime? Does weak RL mostly amplify existing personas before stronger RL pushes the model into more alien, optimizer-like cognition? Where does that transition occur, and how could we measure it?
2026-04-27 12:42:56
Aims: This post explores metaphysics from a rationalist perspective, delving into each philosophy and exposing logical errors and contradictions within it.
Universalism: We may not yet know the purpose of the universe, but we do know how to seek it; our duty is to acquire all cosmic knowledge, striving to understand and pursue the purpose, if it exists.
Human philosophy faces a fundamental issue. We live and die by the tenets of our philosophical worldviews, yet these beliefs rest on unfounded fundamentals. We claim a definite purpose and path in life while having insufficient knowledge. We adhere to religions, ideologies, and philosophical doctrines (nihilism, existentialism, absurdism) that prescribe how we should live. Yet all of these are fundamentally rooted in assumptions, built upon our limited understanding. We are trying to fit a polynomial to a single data point; it cannot be done. Therefore, the proper path requires that we first acquire all cosmic knowledge, using it to determine our place in the universe.
All other ideologies are flawed. Let us first examine the most common philosophical doctrines. (In the interest of brevity, I’ll define these ideologies in their most basic terms, though they are inherently complex. I recommend further research into each one.)
Nihilism: The universe is inherently meaningless.
Existentialism: The universe is inherently meaningless, but we create our own purpose.
Absurdism: Humans search for meaning in an inherently meaningless/indifferent universe, but we should continue to live even without answers (not trying to impose artificial meaning).
All three of these ideologies are flawed because they rest on the premise that the universe is meaningless. It is true that no evidence has yet indicated a meaning to the universe; however, this conclusion is drawn from a limited amount of knowledge. Absence of evidence is not evidence of absence. Just as we cannot determine the health of a body from a single cell, we cannot discern whether the universe has a purpose without all cosmic knowledge. It is therefore foolish to plan your life on the uncertain conclusion that the universe is meaningless. We must first gain the knowledge required to settle the question.
Religion (in broad terms): There is a defined purpose in the universe as outlined by a higher being.
Like the first three ideologies, religion is based on our limited set of knowledge. It is entirely possible that, by gaining knowledge, we could confirm a religion as correct, but without first expanding our understanding we ought not to place blind trust in any purpose prescribed by a religion. Religion claims there is a purpose, but we must first gain all cosmic knowledge to confirm that claim and determine what the purpose is.
Soteriological religions (Buddhism, etc.): Humans should pursue individual liberation/salvation.
Altruism: Helping others should be pursued above all else.
Hedonism: Pleasure should be pursued above all else.
All three of these beliefs (for the most part) lack the realization that the individual and the universe are not separate entities. The universe is simply a set of particles moving in predefined patterns, obeying certain rules. We are just a subset of those particles. Our brain, our body, the air we breathe, and the world around us are all part of the same system. What we perceive as objects are just arbitrary groupings of particles that our minds process. We can arbitrarily create a group of five water molecules in the ocean and call it “John”, but that is still just the ocean; the group “John” that we created is a figment of perception, not of the universe. From a universal lens (taking a non-human-centric point of view), there is no difference between the individual and the universe. Therefore, it makes little sense to optimize for the individual (whether in terms of pleasure or nirvana) or for society (making an altruistic impact). We are a small set of particles compared to the grander universe, so it is foolish to maximize impact on this local group. That would be akin to optimizing for the health of the pinky instead of the whole body.
Some soteriological religions in particular face the problem that they claim universal understanding is unlocked once nirvana is reached. However, scientifically, Humans remain limited by perception, biology, and our current set of knowledge, so it is extremely unlikely that “nirvana” grants the answers to the universe. For those, we need to pursue all cosmic knowledge.
Determinist philosophies: The universe is deterministic, and by simply living, we are fulfilling our role in its grand design; we should align ourselves with the universe; every action we take is inherently part of its purpose.
While it is likely that determinism is an accurate model of our universe, that does not warrant philosophical passivity. If the universe is not deterministic and we confine humanity to passivity, we risk missing the opportunity to pursue the universe’s true purpose. As such, we must act as though we have free will and pursue all cosmic knowledge. We make the deterministic assumption based on our current, limited knowledge, but there is always a possibility it could be proved incorrect by new information. Therefore, we must continue the relentless pursuit of knowledge despite determinism.
Idealism: The universe is the creation of a mind; there is no material reality.
Followers of idealism often tend toward a nihilistic or hedonistic outlook on life. This is because they believe the universe occurs only within their conscious perception, so maximizing themselves maximizes the universe. This assumes there is no material reality outside one's individual observation. However, material reality can only be validated by expanding our knowledge and understanding of the universe. Assuming that there is no material reality presents the same dilemma as assuming determinism is true: it confines us to a perspective that may ultimately be mistaken. To confirm whether idealism or material reality holds, we must seek all cosmic knowledge. Until we achieve that, we must operate under the assumption that material reality exists, or we risk never having the opportunity to verify it.
Societal (commonly Western society): The universe may or may not have a purpose, but we will never find it so individuals should just live however they see fit. (Traditional Western philosophy is often a combination of religion, absurdism, and hedonism. It posits that Humans should just “enjoy life” and not question their purpose because they can’t impact the universe.)
This philosophy is the most widespread, but it relies on a mix of other ideologies already shown above to be faulty. It rests on three incorrect notions:
Most philosophies are passive, encouraging people to remain content within their limited perception, knowledge, and sense of purpose. Universalism, on the other hand, is active. By acknowledging our lack of knowledge, it drives Humanity to seek truth rather than settle for illusion.