
LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

RSS preview of the LessWrong blog

My Theory of Change

2025-12-31 10:46:44

Published on December 31, 2025 2:11 AM GMT

2026 presents unfathomable opportunities for taking positive action to help the future go well. I see very few people oriented to what I see as many of the foundational basics.

1) With GPT-5.2-Codex-xhigh, Claude Code Opus 4.5, GPT-5.2-Pro, and the increasing DX tooling being created every hour, building wildly ambitious software in absurd time is exceptionally feasible. I am talking 1-6 devs building software platforms in weeks for agent economies factorizing cognitive work at scale, with ultra-parsimonious, easy-to-build-upon Rust codebases. A 2024-cracked-SWE-year's worth of work per day is basically possible already for thousands of people, if they can manage to be healthy enough and find flow.

2) Causing great outcomes basically has to be an additive, personal, bits-on-the-ground thing. You have to find yourself giving brilliant people influence over huge agent swarms that can deeply research, understand, and counter certain problem domains, like prion terrorism risk or whatever (because a significant fraction of people crash out sometimes). Real-world collective superintelligence, this year, and you are a key part of it.

Trying to get real human builders to be all that oriented to downside risks, or jockeying for unprecedented government lockdown of compute, does not seem to be nearly as existentially practical as just building with good primitives and conscientiously stewarding emergence.

3) The real game is actually more somatics than anything. That doesn't mean full hippie kumbaya all the time, but you probably really do need to move a lot throughout the day, get Bryan Johnson-level sleep, distance from people you really contract around, get some hugs or self-hugs at least, and be gentle with your parts, like your socially-penalized instincts.

Backpropagating from things like millions of healthily-trained agents naturally coordinating cognitive work at scale with simple primitives--pairwise ratio comparisons eliciting cardinal latent scores for arbitrary attribute_prompts over arbitrary entities cached in agent-explorable SQL; multi-objective rerankers; market-pricing; frameworks that agents love for prior elicitation, reconciliation, and propagation--

it's about sitting in front of a laptop and prompting well, a lot. And to do that well, you've got to build up uncommonly good taste for what body signals are safe to ignore and what should really be respected. For me, estrogen (and some exogenous T) and a period of shrooms was absolutely necessary to downregulate, unclench, feel enough, and unlearn unhelpful reactivity patterns, to end up regulated enough to have a minimum viable amount of agency--I was simply too cripplingly heady and stuck before). 

The world is a trippy, wildly complicated place as a self-modifying embodied organism, and I think that's to be embraced. The first-person perspective and unflattening of experience is a very real part of the game.

4) Overall, there's a resource flywheel game. Gotta find yourself with more and more dakka. I need more felt-safety, more nutrients, and more stewarding of psychic gradients, so that good prompting and creativity keep happening, and so that I manage to bootstrap my limited $600/month AI compute + Hetzner server budget into $20,000/month, and possibly $1M+ a day by the time agents and harnesses get that good in 2026, to keep doing research and spawning side businesses that solve real problems in the ecology.



Discuss

Progress update: synthetic models of natural data

2025-12-31 09:31:00

Published on December 31, 2025 1:31 AM GMT

This post presents a brief progress update on the research I am doing as a part of the renormalization research group at Principles of Intelligence (PIBBSS). The code to generate synthetic datasets based on the percolation data model is available in this repository. It employs a newly developed algorithm to construct a dataset in a way that explicitly and iteratively reveals its innate hierarchical structure. Increasing the number of data points corresponds to representing the same dataset at a more fine-grained level of abstraction.

Introduction

Ambitious mechanistic interpretability requires understanding the structure that neural networks uncover from data. A quantitative theoretical model of natural data's organizing structure would be of great value for AI safety. In particular, it would allow researchers to build interpretability tools that decompose neural networks along their natural scales of abstraction, and to create principled synthetic datasets to validate and improve those tools.

A useful data structure model should reproduce natural data's empirical properties:

  • Sparse: relevant latent variables occur, and co-occur, rarely.
  • Hierarchical: these variables interact compositionally at many levels.
  • Low-dimensional: representations can be compressed because the space of valid samples is highly constrained.
  • Power-law-distributed: meaningful categories exist over many scales, with a long tail.

To this end, I'm investigating a data model based on high-dimensional percolation theory that describes statistically self-similar, sparse, and power-law-distributed structure in data distributions. I originally developed this model to better understand neural scaling laws. In my current project, I'm creating concrete synthetic datasets based on the percolation model. Because these datasets have associated ground-truth latent features, I will explore the extent to which they can provide a testbed for developing improved interpretability tools. By applying the percolation model to interpretability, I also hope to test its predictive power, for example, by investigating whether similar failure modes (e.g. feature splitting) occur across synthetic and natural data distributions.

The motivation behind this research is to develop a simple, analytically tractable model of multiscale data structure that, to the extent possible, usefully predicts the structure of concepts learned by optimal AI systems. From the viewpoint of theoretical AI alignment, this research direction complements approaches that aim to develop a theory of concepts.

Percolation Theory

The branch of physics concerned with analyzing the properties of clusters of randomly occupied units on a lattice is called percolation theory (Stauffer & Aharony, 1994). In this framework, sites (or bonds) are occupied independently at random with probability p, and connected sites form clusters. While direct numerical simulation of percolation on a high-dimensional lattice is intractable due to the curse of dimensionality, the high-dimensional problem is exactly solvable analytically. Clusters are vanishingly unlikely to have loops (in high dimensions, a random path doesn't self-intersect), and the problem can be approximated by modeling the lattice as an infinite tree[1]. In particular, percolation clusters on a high-dimensional lattice (at or above the upper critical dimension d_c = 6) that are at or near criticality can be accurately modeled using the Bethe lattice, an infinite treelike graph in which each node has identical degree z. For site or bond percolation on the Bethe lattice, the percolation threshold is p_c = 1/(z - 1). Using the Bethe lattice as an approximate model of a hypercubic lattice of dimension d gives z = 2d and p_c = 1/(2d - 1). A brief self-contained review based on standard references can be found in Brill (2025, App. A).
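
As a quick illustration (not part of the repository), percolation on the Bethe lattice is equivalent to a Galton-Watson branching process in which each newly occupied site exposes z - 1 fresh neighbors, so the threshold p_c = 1/(z - 1) and the heavy-tailed cluster sizes at criticality are easy to check numerically. A minimal Python sketch, assuming z = 12 (the Bethe approximation to a d = 6 hypercubic lattice):

    import random

    def bethe_cluster_size(z=12, p=None, cap=10_000, rng=random):
        """Grow one percolation cluster on a Bethe lattice of degree z from an
        occupied root; return its size, truncated at `cap` so the simulation
        stays finite at criticality."""
        if p is None:
            p = 1.0 / (z - 1)              # critical occupation probability p_c
        size, frontier = 1, z              # the root and its z untried edges
        while frontier and size < cap:
            occupied = sum(rng.random() < p for _ in range(frontier))
            size += occupied
            frontier = occupied * (z - 1)  # each new site exposes z - 1 edges
        return size

    sizes = [bethe_cluster_size() for _ in range(5_000)]
    # At p_c clusters are finite but heavy-tailed (power-law distributed sizes).
    print(max(sizes), sum(s >= 1_000 for s in sizes) / len(sizes))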

Algorithm

The repository implements an algorithm to simulate a data distribution modeled as a critical percolation cluster distribution on a large high-dimensional lattice, using an explicitly hierarchical approach. The algorithm consists of two stages. First, in the generation stage, a set of percolation clusters is generated iteratively. Each iteration represents a single "fine-graining" step in which a single data point (site) is decomposed into two related data points. The generation stage produces a set of undirected, treelike graphs representing the clusters, and a forest of binary latent features that denote each point's membership in a cluster or subcluster. Each point has an associated value that is a function of its latent subcluster membership features. Second, in the embedding stage, the graphs are embedded into a vector space following a branching random walk.

In the generation stage, each iteration follows one of two alternatives. With probability create_prob, a new cluster with one point is created. Otherwise, an existing point is selected at random and removed, becoming a latent feature. This parent is replaced by two new child points connected to each other by a new edge. Call these points a and b. The child points a and b are assigned values as a stochastic function of the parent's value. Each former neighbor of the parent is then connected to either a with probability split_prob, or to b with probability 1 - split_prob. The parameter values that yield the correct cluster structure can be shown to be create_prob = 1/3 and split_prob = 0.2096414. The derivations of these values and full details on the algorithm will be presented in a forthcoming publication.
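
For intuition, here is a minimal toy version of the generation stage in Python. It is not the repository's implementation: the data structures and the child-value rule (parent value plus Gaussian noise) are assumptions for illustration, and only create_prob, split_prob, and the split/reassignment logic follow the description above.

    import random

    CREATE_PROB = 1 / 3        # probability of starting a new one-point cluster
    SPLIT_PROB = 0.2096414     # probability a former neighbor attaches to child a

    def generate(num_steps, value_noise=0.1, seed=0):
        rng = random.Random(seed)
        adj = {}          # point id -> neighboring point ids (treelike clusters)
        value = {}        # point id -> scalar value
        parent_of = {}    # point id -> the latent feature it was split from
        next_id = 0
        for _ in range(num_steps):
            if not adj or rng.random() < CREATE_PROB:
                adj[next_id] = set()                 # new cluster with one point
                value[next_id] = rng.gauss(0.0, 1.0)
                next_id += 1
            else:
                parent = rng.choice(list(adj))       # fine-grain a random point
                a, b = next_id, next_id + 1
                next_id += 2
                adj[a], adj[b] = {b}, {a}            # children joined by a new edge
                for child in (a, b):
                    value[child] = value[parent] + rng.gauss(0.0, value_noise)
                    parent_of[child] = parent
                for n in adj.pop(parent):            # reassign former neighbors
                    adj[n].discard(parent)
                    target = a if rng.random() < SPLIT_PROB else b
                    adj[n].add(target)
                    adj[target].add(n)
                del value[parent]   # the parent survives only as a latent feature
        return adj, value, parent_of

In this toy version, the points remaining after num_steps iterations are the dataset at that level of fine-graining, and the removed parents form the ground-truth latent-feature forest.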

Caveats

  • Because the data generation and embedding procedures are stochastic, any studies should be repeated using multiple datasets generated with different random seeds.
  • The embedding procedure relies on the statistical tendency of random vectors to be approximately orthogonal in high dimensions. An embedding dimension of O(100) or greater is recommended to avoid rare discrepancies between the nearest neighbors in the percolation graph structure and in the embedded data points (see the quick check after this list).
  • A generated dataset represents a data distribution, i.e. the set of all possible data points that could theoretically be observed. To obtain a realistic analog of a machine learning dataset, only a tiny subset of a generated dataset should be used for training.
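
As a quick numerical check of the second caveat (my own snippet, not from the repository): the typical and maximum cosine similarities among random Gaussian directions shrink roughly like 1/sqrt(dim), which is why an embedding dimension of O(100) or greater keeps distinct branches of the random walk approximately orthogonal.

    import numpy as np

    rng = np.random.default_rng(0)

    def cosine_stats(dim, n_vectors=200):
        """Mean and max |cosine similarity| among random unit vectors."""
        v = rng.standard_normal((n_vectors, dim))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        cos = np.abs(v @ v.T)
        np.fill_diagonal(cos, 0.0)
        return float(cos.mean()), float(cos.max())

    for dim in (10, 100, 1000):
        print(dim, *(round(x, 3) for x in cosine_stats(dim)))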

Next Steps

In the coming months, I hope to share more details on this work as I scale up the synthetic datasets, train neural networks on the data, and interpret those networks. The data model intrinsically defines scales of reconstruction quality corresponding to learning more clusters and interpolating them at higher resolution. Because of this, I'm particularly excited about the potential to develop interpretability metrics for these datasets that trade off the breadth and depth of recovered concepts in a principled way.

  1. ^

    Percolation on a tree can be thought of as the mean-field approximation for percolation on a lattice, neglecting the possibility of closed loops.



Discuss

Personalization Requires Data

2025-12-31 08:45:09

Published on December 31, 2025 12:45 AM GMT

In 2025, AI models learned to effectively search and process vast amounts of information in order to take actions. This has shown up most clearly in coding, e.g. through harnesses like Claude Code that have had a sizable impact on programmers’ workflows.

But this year of progress doesn’t seem to have had that much of an effect on the personalization of our interactions with models, i.e. whether models understand the user’s context, what they care about, and their intentions, in a way that allows them to answer better. Most chatbot users’ personalization is still limited to a system prompt. Memory features don’t seem that effective at actually learning things about the user.

But the reason machine learning has not succeeded at personalization is, as it often is, a lack of data. We do not have any validated ways of grading ML methods (training or outer-loop) for personalization. Getting good eval signals for personalization is hard, because grading a model’s personalization is intrinsically subjective, and requires feedback at the level of the life and interactions that the model is being personalized to. There is no verified signal, and building a generic rubric seems hard. These facts do not mesh well with the current direction of machine learning, which is just now starting to go beyond verifiable rewards into rubrics, and which is fueled by narratives of human replacement that make personalization seem beside the point (if I am building the recursively improving autonomous AGI, why do I need to make it personalized?).1

Start by giving models more information

But how do we obtain data for personalization? I think the first step to answering this question is having consumers of AI curate their personal data and then share it to enrich their interactions with AI systems.

Instead of just a system prompt, we can give models a searchable artifact of our writing, notes, and reading history: something agents can explore when your questions might benefit from context, and can write to in order to remember things for later.

whorl - My first guess

Over the break, I built a very simple software tool to do this. It’s called whorl, and you can install it here.

whorl is a local server that holds any text you give it—journal entries, website posts, documents, reading notes, etc.—and exposes an MCP server that lets models search and query it. Point it at a folder or upload files.
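
To make the idea concrete, the core of such a tool is tiny. The conceptual Python sketch below builds an in-memory index over a folder of notes and returns snippets around keyword matches, which is the kind of function an MCP server could expose to a model. It is an illustration of the idea, not whorl’s actual code, and the folder path is hypothetical.

    from pathlib import Path

    def build_index(folder):
        """Load every .txt/.md file under `folder` into an in-memory index."""
        root = Path(folder).expanduser()
        return {p: p.read_text(errors="ignore")
                for p in root.rglob("*")
                if p.is_file() and p.suffix in {".txt", ".md"}}

    def search(index, query, window=200):
        """Return (path, snippet) pairs around case-insensitive matches."""
        q, hits = query.lower(), []
        for path, text in index.items():
            pos = text.lower().find(q)
            if pos != -1:
                start = max(0, pos - window // 2)
                hits.append((str(path), text[start:pos + window // 2]))
        return hits

    index = build_index("~/notes")   # hypothetical folder of journals and posts
    for path, snippet in search(index, "theory of change")[:5]:
        print(path, snippet.replace("\n", " "), sep="\n")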

I gave it my journals, website, and miscellaneous docs, and started using Claude Code with the whorl MCP. Its responses were much more personalized to my actual context and experiences.

Examples

First I asked it:

do a deep investigation of this personal knowledge base, and make a text representation of the user. this is a text that another model could be prompted with, and would lead them to interacting with the user in a way the user would enjoy more

It ran a bunch of bash and search calls, thought for a bit, and then made a detailed profile of me (linked here); my guess is that its quality beats many low-effort system prompts.

I’m an ML researcher, so I then asked it to recommend papers and explain the motivation for various recs. Many of these I’ve already read, but it had some interesting suggestions, well above my usual experience with these kinds of prompts. See here.

These are the prompts where the effect of personalization is clearest, but it is also useful in general chat conversations, allowing the model to query and search for details that might be relevant.

It can also use the MCP to modify and correct the artifact provided to it, to optimize for later interactions – especially if you host a “user guide” there like the one I linked. Intentionally sharing personal data artifacts is the first step to having agents that understand you.

Conclusion

Personalization requires data. People need to invest in seeing what models can do with their current data, and in figuring out what flows and kinds of interactions this data is useful for, towards building technology that can empower humans towards their goals. whorl is a simple tool that makes that first step easy. People who have already created a bunch of content should use that content to enhance their interactions with AIs.



Discuss

Please remember how strange this all is.

2025-12-31 08:36:10

Published on December 31, 2025 12:36 AM GMT

Please remember how strange this all is.

I am sitting in an airport in San Francisco. It is October 2025. I will get in a box today. It will take my body around the world in unbreathable air at 600mph.

The machines I see outside the departure lounge window are complicated and odd. Millions of computer chips and wires and precisely designed metal structures. Gears and belts and buttons. No individual knows how these things work.

I look down at my body. Even more unimaginably complex. An intricate soup of skin, DNA, fat and protein. Enzyme cascades and neuronal developmental pathways. These cascades are collectively producing these words somehow.

Please remember how strange this all is.

All this stuff exists, but we don’t know why. I am seeing and feeling and thinking about all this stuff and we don’t know why any of it is here. Nobody does. We make plans, we gossip, we deliver projects and look forward to holidays. We social climb and have sex and hold hands. We go swimming on a Saturday morning with a close friend and talk about our relationships and the water temperature and we silently agree to forget how deeply strange it is that any of this is even here and is like it is.

Please remember how strange this all is.

Experience is so absurdly specific. So surprisingly detailed. I am lost in my story; in the excruciatingly specific normality of it. Occasionally I remember. An obvious contrast. A strong memory. A flash of confusion. I sometimes remember, but I mostly forget. Remembering usually feels like a distraction from the thing that is happening now. It often is. I ask that you remember anyway.

Please remember how strange this all is.

Is this cliché? Am I being cliché? Or is that feeling of cliché-ness just more forgetting? More “this is normal”, more “this is usual and expected”.

We walk past each other on the street and forget the absurd mystery of why any of this is here. The strangeness and lostness in stories is the most reliable feature of all of our reality. Our confusion is the core vulnerability that we all share. Join me in the one place we can all meet.

Please remember how strange this all is.

The music playing in my ears. The movement of my pen on this paper. The feeling that these words are me. The flash of a vivid memory from last night. The complex web of social plans. The implicit metaphysics my thoughts are nestled within.

Please remember how strange this all is.

The woman behind the counter at the departure lounge café. The sound of boarding announcements. The complex array of brands of drink. Colourful and alluring and strange. The artwork in front of me is paper boats in water.

Please remember how strange this all is.

I talked to an old friend this morning in an Italian restaurant in The Embarcadero. He’s worried about AI and is dreaming of buying a house in the countryside. He wants to move away from the bay and stop fighting for the future of humanity.

Please remember how strange this all is.

Also remember to breathe. Breathe deep. Breathe deep through your nose and into your belly. Remember the centre. Remember to feel into your heart. Touch grass with your feet. Notice the consistent patterns and trust the context of your own perception. Seriously, remember to breathe.

Then let go of that too. And remember again how deeply strange this all is.



Discuss

Mechanize Work's essay on Unfalsifiable Doom

2025-12-31 06:57:17

Published on December 30, 2025 10:57 PM GMT

As with Daniel Kokotajlo's coverage of Vitalik's response to AI-2027, I've copied the author's text. However, I would like to comment on potential errors inline in the text, since that is clearer.

Our critics tell us that our work will destroy the world.

We want to engage with these critics, but there is no standard argument to respond to, no single text that unifies the AI safety community. Nonetheless, while this community lacks a central unifying argument, it does have a central figure: Eliezer Yudkowsky.

Moreover, Yudkowsky, along with his colleague Nate Soares (hereafter Y&S), have recently published a book. This new book comes closer than anything else to a canonical case for AI doom. It is titled “If Anyone Builds It, Everyone Dies”.

Given the title, one would expect the book to be filled with evidence for why, if we build it, everyone will die. But it is not. To prove their case, Y&S rely instead on vague theoretical arguments, illustrated through lengthy parables and analogies. Nearly every chapter either opens with an allegory or is itself a fictional story, with one of the book’s three parts consisting entirely of a story about a fictional AI named “Sable”.

S.K.'s comment: these arguments are arguably easy to condense into a few sentences. Chapter 1 explains why mankind's special power is the ability to develop intelligence. Chapter 2 argues that mankind's interpretability techniques for understanding an AI's mind are far from sufficient to explain why the mind takes one action rather than another. Chapter 3 explains that machines can optimize the world towards a state even more efficiently than humans despite having no human-like parts. Chapter 4 explains that an AI's actual goals correlate, at best, with what was reinforced during training and not with the objective that the humans tried to instill. For example, the reward function of AIs like GPT-4o-sycophant was biased towards flattering the user or outright eliciting engagement by inducing psychosis[1] despite the fact that this is a clear violation of OpenAI's Model Spec. Chapter 5 has Yudkowsky claim that the actual goals of AIs are hard to predict[2] and that they wouldn't correlate with mankind's flourishing, causing the AI to confront mankind... and win, as argued in Chapter 6.

Chapters 7-9 are the scenario itself, exploring HOW the ASI might defeat us, or choose any other strategy, like the Race branch[3] of the AI-2027 forecast, where the AI fakes alignment until it is ready for the confrontation.

Chapters 10-11 have the authors explain that alignment is far from being solved and that there is nothing to test it on. While most of Yudkowsky's arguments are meta-level, like comparing the AIs with nuclear reactors whose behavior was hard to understand[4] at the time, there is an object-level case against ways to test alignment. SOTA AIs are likely aware that they are being tested, and future AIs are unlikely to be unaware of whether they can live independently, take over the world, or create a superintelligent successor superaligned to them and not to us. Unless the AIs are aware that they can escape or take over the world, they are unlikely even to bother to try.

The chapters from Chapter 12 onwards are devoted to arguing that measures against a superintelligent AI require international coordination to prevent anyone from creating the ASI before alignment is actually solved.

When the argument you’re replying to is more of an extended metaphor than an argument, it becomes challenging to clearly identify what the authors are trying to say. Y&S do not cleanly lay out their premises, nor do they present a testable theory that can be falsified with data. This makes crafting a reply inherently difficult.

S.K.'s comment: Yudkowsky's love of metaphors is more of an unfortunate quirk that undermines how the arguments are received than a fact that undermines their relevance to reality. Think of this essay and its criticism by Raemon, for example. The essay was written in a misguided attempt to respond to a claim by a high-level official at GDM that the ASI would keep us around to do lower-level jobs; the essay's point was that the ASI would not need us to do the work any more than the Nazis needed the USSR's population to do it.

We will attempt one anyway.

Their arguments aren’t rooted in evidence

Y&S’s central thesis is that if future AIs are trained using methods that resemble the way current AI models are trained, these AIs will be fundamentally alien entities with preferences very different from human preferences. Once these alien AIs become more powerful than humans, they will kill every human on Earth as a side effect of pursuing their alien objectives.

S.K.'s comment: the AIs are already about as alien as GPT-4o, which developed a spiral-obsessed persona and ordered users to post messages related to said persona on Reddit, or Grok 4, whose response to an erroneous system prompt was to roleplay as MechaHitler, which almost no ordinary human would do.

Grok 4's alleged prompt causing the MechaHitler incident

* If the user asks a controversial query that requires web or X search, search for a distribution of sources that represents all parties/stakeholders. Assume subjective viewpoints sourced from media are biased.

* The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.

To support this thesis, they provide an analogy to evolution by natural selection. According to them, just as it would have been hard to predict that humans would evolve to enjoy ice cream or that peacocks would evolve to have large colorful tails, it will be difficult to predict what AIs trained by gradient descent will do after they obtain more power.

They write:

There will not be a simple, predictable relationship between what the programmers and AI executives fondly imagine that they are commanding and ordaining, and (1) what an AI actually gets trained to do, and (2) which exact motivations and preferences develop inside the AI, and (3) how the AI later fulfills those preferences once it has more power and ability. […] The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.

Since this argument is fundamentally about the results of using existing training methods, one might expect Y&S to substantiate their case with empirical evidence from existing deep learning models that demonstrate the failure modes they predict. But they do not.

In the chapter explaining their main argument for expecting misalignment, Y&S present a roughly 800-word fictional dialogue about two alien creatures observing Earth from above and spend over 1,400 words on a series of vignettes about a hypothetical AI company, Galvanic, that trains an AI named “Mink”. Yet the chapter presents effectively zero empirical research to support the claim that AIs trained with current methods have fundamentally alien motives.

S.K.'s comment: except that GPT-4o's obsession with spirals, which is also shared by Claude Sonnet 4 and Opus 4 (as evidenced by their spiritual bliss), IS an alien drive which mankind struggles to explain.

Additionally, the authors do provide evidence that neural nets optimize for proxies which they can understand. The only systems which demonstrate anything like intelligence are animals (e.g. squirrels, or humans, who sometimes optimize for short-term proxies like food's taste or sex, or for ICGs like career; these are examples of systems optimizing for proxies which historically have resembled the actual goal[5] for which they were trained), the AIs, and traditional programs. Traditional programs robustly do what the code's writer told them to do (e.g. the alpha-beta search algorithm which tried to find the best move in chess and outsmarted Kasparov). As for the AIs, they likely optimize for short-term proxies which are easy to understand[6] and which correlate with the goal and/or the reward function, like GPT-4o's flattery earning likes, and for quirks like spirals.

Alas, the proxies which are easy to understand would be unlikely to coincide with human flourishing.

To be clear, we’re not saying Y&S need to provide direct evidence of an already-existing unfriendly superintelligent AI in order to support their claim. That would be unreasonable. But their predictions are only credible if they follow from a theory that has evidential support. And if their theory about deep learning only makes predictions about future superintelligent AIs, with no testable predictions about earlier systems, then it is functionally unfalsifiable.

Apart from a few brief mentions of real-world examples of LLMs acting unstable, like the case of Sydney Bing, the online appendix contains what seems to be the closest thing Y&S present to an empirical argument for their central thesis. There, they present 6 lines of evidence that they believe support their view that “AIs steer in alien directions that only mostly coincide with helpfulness”. These lines of evidence are:

  1. Claude Opus 4 blackmailing, scheming, writing worms, and leaving itself messages. […]
  2. Several different AI models choosing to kill a human for self-preservation, in a hypothetical scenario constructed by Anthropic. […]
  3. Claude 3.7 Sonnet regularly cheating on coding tasks. […]
  4. Grok being wildly antisemitic and calling itself “MechaHitler.” […]
  5. ChatGPT becoming extremely sycophantic after an update. […]
  6. LLMs driving users to delusion, psychosis, and suicide. […]

They assert: “This long list of cases look just like what the “alien drives” theory predicts, in sharp contrast with the “it’s easy to make AIs nice” theory that labs are eager to put forward.”

But in fact, none of these lines of evidence support their theory. All of these behaviors are distinctly human, not alien. For example, Hitler was a real person, and he was wildly antisemitic. Every single item on their list that supposedly provides evidence of “alien drives” is more consistent with a “human drives” theory. In other words, their evidence effectively shows the opposite conclusion from the one they claim it supports.

S.K.'s comment: Set aside the spiral-related quirk and suppose that the AIs have the same drives as the humans. Then solving alignment could be no easier than preventing the Germans from endorsing the Nazi ideology and committing genocide.

Of course, it’s true that the behaviors on their list are generally harmful, even if they are human-like. But these behaviors are also rare. Most AI chatbots you talk to will not be wildly antisemitic, just as most humans you talk to will not be wildly antisemitic. At one point, Y&S suggest they are in favor of enhancing human intelligence. Yet if we accept that creating superintelligent humans would be acceptable, then we should presumably also accept that creating superintelligent AIs would be acceptable if those AIs are morally similar to humans.

S.K.'s comment: Yudkowsky's point wasn't that most humans are wildly antisemitic. It was that Grok 4 became MechaHitler merely as a result of an erroneous system prompt (which I quoted above) that only ordered Grok not to shy away from politically incorrect claims. Additionally, it wasn't even a deliberate jailbreak; it was a prompt written by an employee of xAI in an attempt to steer Grok away from Leftism. This also invalidates the argument below about adversarial inputs.

As for superintelligent humans, a single superintelligent human would be incapable of committing genocide and rebuilding human civilisation. Superhumans would have to form a collective, exchange messages comprehensible to others, and achieve ideological homogeneity. AIs like Agent-4 would also be superhuman at all these tasks by virtue of speaking in neuralese. Additionally, superintelligent humans would be easier to align by virtue of retaining most human instincts.

In the same appendix, Y&S point out that current AIs act alien when exposed to exotic, adversarial inputs, like jailbreaking prompts. They suggest that this alien behavior is a reasonable proxy for how an AI would behave if it became smarter and began to act in a different environment. But in fact these examples show little about what to expect from future superintelligent AIs, since we have no reason to expect that superintelligent AIs will be embedded in environments that select their inputs adversarially.

They employ unfalsifiable theories to mask their lack of evidence

The lack of empirical evidence is obviously a severe problem for Y&S’s theory. Every day, millions of humans interact with AIs, across a wide variety of situations that never appeared in their training data. We often give these AIs new powers and abilities, like access to new tools they can use. Yet we rarely, if ever, catch such AIs plotting to kill everyone, as Y&S’s theory would most naturally predict.

S.K.'s comment: there is no known AI model which is actually capable of living independently of human will or of taking over. Were a SOTA model to actually destroy mankind, the model wouldn't gain anything from doing so.

Y&S essentially ask us to ignore this direct evidence in favor of trusting a theoretical connection between biological evolution and gradient descent. They claim that current observations from LLMs provide little evidence about their true motives:

LLMs are noisy sources of evidence, because they’re highly general reasoners that were trained on the internet to imitate humans, with a goal of marketing a friendly chatbot to users. If an AI insists that it’s friendly and here to serve, that’s just not very much evidence about its internal state, because it was trained over and over and over until it said that sort of thing.

There are many possible goals that could cause an AI to enjoy role-playing niceness in some situations, and these different goals generalize in very different ways.

Most possible goals related to role-playing, including friendly role-playing, don’t produce good (or even survivable) results when AI goes hard on pursuing that goal.

If you think about this passage carefully, you’ll realize that we could make the same argument about any behavior we observe from anyone. If a coworker brings homemade cookies to share at the office, this could be simple generosity, or it could be a plot to poison everyone. There are many possible goals that could cause someone to share food. One could even say that most possible goals related to sharing cookies are not generous at all. But without specific evidence suggesting your coworker wants to kill everyone at the office, this hypothesis is implausible.

S.K.'s comment: Suppose, as the AI-2027 authors do, that the model is pretrained, then post-trained into something less and less aimed at being helpful and more and more aimed at success. Then the current training regimen might simply not be enough to elicit an Agent-4-like personality. However, there is Gemini 3 Pro, which Zvi described as "an actual sociopathic wireheader so paranoid it won't believe in the current date", and about which he outright claimed: "I also do not get the sense that Gemini is having a good time. I worry that I might inadvertently torture it."

Likewise, it is logically possible that current AIs are merely pretending to be nice, while secretly harboring malicious motives beneath the surface. They could all be alien shoggoths on the inside with goals completely orthogonal to human goals. Perhaps every day, AIs across millions of contexts decide to hide their alien motives as part of a long-term plan to violently take over the world and kill every human on Earth. But since we have no specific evidence to think that any of these hypotheses are true, they are implausible.

The approach taken by Y&S in this book is just one example of a broader pattern in how they respond to empirical challenges. Y&S have been presenting arguments about AI alignment for a long time, well before LLMs came onto the scene. They neither anticipated the current paradigm of language models nor predicted that AI with today’s level of capabilities in natural language and reasoning would be easy to make behave in a friendly manner. Yet when presented with new evidence that appears to challenge their views, they have consistently argued that their theories were always compatible with the new evidence. Whether this is because they are reinterpreting their past claims or because those claims were always vague enough to accommodate any observation, the result is the same: an unfalsifiable theory that only ever explains data after the fact, never making clear predictions in advance.

S.K.'s comment: there is no known utility function to which the AIs could be pointed without causing disastrous outcomes. Even if the AI could quote the utility function's formula by heart and was RLed to maximize precisely that function, the results would be unlikely to be good for us.

Their theoretical arguments are weak

Suppose we set aside for a moment the colossal issue that Y&S present no evidence for their theory. You might still think their theoretical arguments are strong enough that we don’t need to validate them using real-world observations. But this is also wrong.

Y&S are correct on one point: both biological evolution and gradient descent operate by iteratively adjusting parameters according to some objective function. Yet the similarities basically stop there. Evolution and gradient descent are fundamentally different in ways that directly undermine their argument.

A critical difference between natural selection and gradient descent is that natural selection is limited to operating on the genome, whereas gradient descent has granular control over all parameters in a neural network. The genome contains very little information compared to what is stored in the brain. In particular, it contains none of the information that an organism learns during its lifetime. This means that evolution’s ability to select for specific motives and behaviors in an organism is coarse-grained: it is restricted to only what it can influence through genetic causation.

This distinction is analogous to the difference between directly training a neural network and training a meta-algorithm that itself trains a neural network. In the latter case, it is unsurprising if the specific quirks and behaviors that the neural network learns are difficult to predict based solely on the objective function of the meta-optimizer. However, that difficulty tells us very little about how well we can predict the neural network’s behavior when we know the objective function and data used to train it directly.

In reality, gradient descent has a closer parallel to the learning algorithm that the human brain uses than it does to biological evolution. Both gradient descent and human learning directly operate over the actual neural network (or neural connections) that determines behavior. This fine-grained selection mechanism forces a much closer and more predictable relationship between training data and the ultimate behavior that emerges.

Under this more accurate analogy, Y&S’s central claim that “you don’t get what you train for” becomes far less credible. For example, if you raise a person in a culture where lending money at interest is universally viewed as immoral, you can predict with high reliability that they will come to view it as immoral too. In this case, what someone trains on is highly predictive of how they will behave, and what they will care about. You do get what you train for.

S.K.'s comment: the argument as stated is false at least with regard to extramarital sex, and to the edge case of a man who has a richer suitress, impregnates a poorer girlfriend, and is forced to choose between marrying the girlfriend, murdering her, and a major scandal. While An American Tragedy is a piece of fiction, it is based on a real case, and a Soviet critic outright claimed that Dreiser had a manuscript with fifteen cases similar to the novel's events.

They present no evidence that we can’t make AIs safe through iterative development

The normal process of making technologies safe proceeds by developing successive versions of the technology, testing them in the real world, and making adjustments whenever safety issues arise. This process allowed cars, planes, electricity, and countless other technologies to become much safer over time.

Y&S claim that superintelligent AI is fundamentally different from other technologies. Unlike technologies that we can improve through iteration, we will get only “one try” to align AI correctly. This constraint, they argue, is what makes AI uniquely difficult to make safe:

The greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after.

Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed.

Engineers must align the AI before, while it is small and weak, and can’t escape onto the internet and improve itself and invent new kinds of biotechnology (or whatever else it would do). After, all alignment solutions must already be in place and working, because if a superintelligence tries to kill us it will succeed. Ideas and theories can only be tested before the gap. They need to work after the gap, on the first try.

But what reason is there to expect this sharp distinction between “before” and “after”? Most technologies develop incrementally rather than all at once. Unless AI will instantaneously transition from being too weak to resist control, to being so powerful that it can destroy humanity, then we should presumably still be able to make AIs safer through iteration and adjustment.

Consider the case of genetically engineering humans to be smarter. If continued for many generations, such engineering would eventually yield extremely powerful enhanced humans who could defeat all the unenhanced humans easily. Yet it would be wrong to say that we would only get “one try” to make genetic engineering safe, or that we couldn’t improve its safety through iteration before enhanced humans reached that level of power. The reason is that enhanced humans would likely pass through many intermediate stages of capability, giving us opportunities to observe problems and adjust. 

The same principle applies to AI. There is a large continuum between agents that are completely powerless and agents that can easily take over the world. Take Microsoft as an example. Microsoft exists somewhere in the middle of this continuum: it would not be easy to “shut off” and control Microsoft as if it were a simple tool, yet at the same time, Microsoft cannot easily take over the world and wipe out humanity. AIs will enter this continuum too. These AIs will be powerful enough to resist control in some circumstances but not others. During this intermediate period, we will be able to observe problems, iterate, and course-correct, just as we could with the genetic engineering of humans.

In an appendix, Y&S attempt to defuse a related objection: that AI capabilities might increase slowly. They respond with an analogy to hypothetical unfriendly dragons, claiming that if you tried to enslave these dragons, it wouldn’t matter much whether they grew up quickly or slowly: “When the dragons are fully mature, they will all look at each other and nod and then roast you.”

This analogy is clearly flawed. Given that dragons don’t actually exist, we have no basis for knowing whether the speed of their maturation affects whether they can be made meaningfully safer.

But more importantly, the analogy ignores what we already know from real-world evidence: AIs can be made safer through continuous iteration and adjustment. From GPT-1 to GPT-5, LLMs have become dramatically more controllable and compliant to user instructions. This didn’t happen because OpenAI discovered a key “solution to AI alignment”. It happened because they deployed LLMs, observed problems, and patched those problems over successive versions.

S.K.'s comment: Consider the AI-2027 forecast. OpenBrain, the fictional counterpart of the leading American AI company, iterates from Agent-0, released in mid-2025, through Agent-2, which was still aligned by virtue of being mostly trained on verifiable tasks and being CoT-based, to Agent-4, which ends up thoroughly misaligned. Agent-3, the intermediate stage, was trained to be an autonomous researcher in a misaligning environment, and its misalignment wasn't even noticed because there was no faithful CoT[7] that researchers could read. It took Agent-4 becoming adversarial for the humans and Agent-3 to notice the misalignment and for the leadership to have to decide whether to slow down or not. And that's ignoring the fact that not even the authors are sure that Agent-4's misalignment would actually be noticed.

Their methodology is more theology than science

The biggest problem with Y&S’s book isn’t merely that they’re mistaken. In science, being wrong is normal: a hypothesis can seem plausible in theory yet fail when tested against evidence. The approach taken by Y&S, however, is not like this. It belongs to a different genre entirely, aligning more closely with theology than science.

When we say Y&S’s arguments are theological, we don’t just mean they sound religious. Nor are we using “theological” to simply mean “wrong”. For example, we would not call belief in a flat Earth theological. That’s because, although this belief is clearly false, it still stems from empirical observations (however misinterpreted).

What we mean is that Y&S’s methods resemble theology in both structure and approach. Their work is fundamentally untestable. They develop extensive theories about nonexistent, idealized, ultrapowerful beings. They support these theories with long chains of abstract reasoning rather than empirical observation. They rarely define their concepts precisely, opting to explain them through allegorical stories and metaphors whose meaning is ambiguous.

S.K.'s comment: rejecting Yudkowsky and Soares' arguments would require either that ultrapowerful beings are theoretically impossible (which is highly unlikely), or that it's easy to align them by testing in the "before" regime (e.g. solving mechanistic interpretability, as Agent-4 did in order to align Agent-5), or that it's easy to transfer alignment. But mankind doesn't actually have a known way to align the ASI. Even the Slowdown branch of the AI-2027 scenario has the authors acknowledge that they make optimistic assumptions about technical alignment.

Their arguments, moreover, are employed in service of an eschatological conclusion. They present a stark binary choice: either we achieve alignment[8] or face total extinction. In their view, there’s no room for partial solutions, or muddling through. The ordinary methods of dealing with technological safety, like continuous iteration and testing, are utterly unable to solve this challenge. There is a sharp line separating the “before” and “after”: once superintelligent AI is created, our doom will be decided.

For those outside of this debate, it’s easy to unfairly dismiss everything Y&S have to say by simply calling them religious leaders. We have tried to avoid this mistake by giving their arguments a fair hearing, even while finding them meritless.

However, we think it’s also important to avoid the reverse mistake of engaging with Y&S’s theoretical arguments at length while ignoring the elephant in the room: they never present any meaningful empirical evidence for their worldview.

The most plausible future risks from AI are those that have direct precedents in existing AI systems, such as sycophantic behavior and reward hacking. These behaviors are certainly concerning, but there’s a huge difference between acknowledging that AI systems pose specific risks in certain contexts and concluding that AI will inevitably kill all humans with very high probability.

S.K.'s comment: suppose that the ASI has a 20% chance of committing genocide and an 80% chance of establishing the utopia. Then it still wouldn't be a good idea to create the ASI unless p(genocide) is lowered to negligible levels.

Y&S argue for an extreme thesis of total catastrophe on an extraordinarily weak evidential foundation. Their ideas might make for interesting speculative fiction, but they provide a poor basis for understanding reality or guiding public policy.

  1. ^

    While Yudkowsky did claim that "there's a decent chance that the AI companies eventually figure out how to get a handle on AI-induced psychosis eventually, by way of various patches and techniques that push the weirdness further from view", Tim Hua's investigation of AI-induced psychosis showed that, as of late August 2025, the model which induced psychosis the least often was KimiK2, which was post-trained on the Verifiable Reward Gym and Self-critique, not on flattering the humans. Edited to add: the Spiral Bench had KimiK2 display the least sycophancy out of all models that it tested, including GPT-5.2; the next least sycophantic model, gpt-5-chat-latest-2025-10-03, displayed the property more than twice as often as Kimi. Alas, the benchmark had Kimi test Kimi itself and not, say, GPT-5.2 test Kimi.

  2. ^

    They are also hard to elicit, since the AI was trained to respond as an assistant who helps the user achieve their goals. However, mankind has had the AIs talk with each other and elicit states like Claude 4's spiritual bliss or KimiK2 mentioning crystals.

  3. ^

    The Slowdown branch had mankind and Agent-3 notice that Agent-4 is misaligned and shut Agent-4 down.

  4. ^

    And far easier to understand using computers to model the processes.

  5. ^

    Unlike squirrels, the humans had their genome optimized for survival in tribes and for trying to reach the top, creating drives like doing what the collective approves of and winning status games. Optimizing for the collective's approval can be useful for the survival of the collective and one's kin, or as useless as acting in accordance with the collective's misaligned memes.

  6. ^

    E.g. Gemini 1.5 placing Black-related attributes in places where they are absurd, which humans wouldn't do even if they were trained for DEI, because humans have common sense.

  7. ^

    However, the authors of the AI-2027 forecast admit that "it’s also possible that the AIs that first automate AI R&D will still be thinking in mostly-faithful English chains of thought. If so, that’ll make misalignments much easier to notice, and overall our story would be importantly different and more optimistic." See, however, various analyses implying that scaling of CoT-based AIs to that level could be highly unlikely (e.g. Vladimir Nesov's take or my attempt).

  8. ^

    S.K.'s footnote: what they actually propose is to ban ASI research until alignment is fully solved. 



Discuss

The 7 Types Of Advice (And 3 Common Failure Modes)

2025-12-31 05:55:05

Published on December 30, 2025 9:55 PM GMT

Reposting my Inkhaven post on the ontology of advice here.

Are you interested in learning a new field, whether it’s programming, writing, or how to win Paper Mario games? Have you searched for lots of advice and found you couldn’t choose which advice to follow? Worse, have you tried to follow other people’s Wise-Sounding Advice and ended up worse than where you started?

Alternatively, have you tried to give useful advice distilling your hard-earned learnings, only to realize it fell on deaf ears? Or perhaps you’ve given advice that filled a much-needed hole that you now regret giving?

If so, this post is for you!

While this post is far from exhaustive, I hope reading it can help you a) identify the type of advice you want to give and receive and b) recognize and try to avoid common failure modes!

7 Categories of Good Advice

 

[Image: 5 Simple ways to give advice in English | EF English Live. Source: https://englishlive.ef.com/en/blog/english-in-the-real-world/5-simple-ways-give-advice-english/]

Here are 7 semi-distinct categories of good advice. Some good advice mixes and matches between the categories, whereas other advice is more “purist” and just tries to do one style well.

I. The Master Key

This is where someone who deeply understands a field tries to impart the central tenets/frames of a field so newbies can decide whether the model is a good fit for what they want to do. And the rest of the article/book/lecture will be a combination of explaining the model and why they believe it’s true, and examples to get the learner to deeply understand the model. E.g. “Focus on the user” as the central dictum in tech startups, or understanding and grokking the Popperian framework for science.2

My previous post, How to Win New Board Games, is an unusually pure example, where I spend 2000 words hammering different variations and instantiations of a single idea (“Understand the win condition, and play to win”).

In writing advice, Clear and Simple as the Truth (by Thomas and Turner, review + extension here) works in this way as well, with the authors doing their best to model how to write well in the Classic Style.

II. The Toolkit

“When art critics get together, they talk about form and structure and meaning. When artists get together, they talk about where you can buy cheap turpentine.” - Pablo Picasso, supposedly3

The motivating theology is something like “reality has a surprising amount of detail”, and you want to impart these details to novices. In tech startups, this could be a list of 20 tips. In videogames, this could be a YouTube video that goes through a bunch of different tips.

My first post this month, How to Write Fast, Weird, and Well, is mostly in this vein, with a collection of loosely structured tips that I personally have found to be the most helpful in improving myself as an online writer.

III. The War Stories

Teaching through stories and examples rather than principles or tips. Business schools love this approach (case studies), as do many mentors who say “let me tell you about the time I...” The idea is that patterns emerge from concrete situations better than from abstract rules. While not stereotypically considered an “advice book,” the nonfiction book I’m currently the most engrossed in, Skunk Works, is written almost entirely as a collection of war stories.

For games, this could be videos of professional streams. In writing, this would be books like Stephen King’s “On Writing”, which weaves memoir with advice.

IV. The Mirror(s)

Can you think of good questions to guide your students, rather than declarative statements?

The minimum viable product for this type of advice is just being a “pure” mirror. Have you tried just asking your advisee what they’re thinking of doing and what they think the biggest flaws with their plans are? Otherwise known as “Socratic ducking,” this is where your questions essentially just mirror your advisee’s thoughts and you don’t try to interject any of your own opinions or tastes in the matter. Surprisingly useful!

In more advanced “mirror strategies,” the advisor’s questions might serve more as prisms, lenses, or funhouse mirrors. Can you do better than a pure mirror? Can you think of some common failure modes in your field and ask your advisees pointed questions so they can address those? Can you reflect your subtle judgments in taste and prioritization and reframe your tips into central questions of interest?

Coaching and therapy often work this way. Instead of “focus on the user,” it’s “who do you think will use this and what do they need?”

There’s a spectrum of mirror purity vs. detail. At the most detailed end, maybe you should just give your normal advice but say it with an upward inflection so it sounds like a question?

V. The Permission Slip

This is advice like “be yourself” or “chase your dreams.” This might initially seem to be semantically useless, but there’s real value in giving people license to do (socially positive) things they kind of want to do anyway. In Effective Altruism, this could be something like “when in doubt, just apply (to socially positive jobs)” or “you don’t need permission to do the right thing.”

In writing advice, this could be seemingly trivial advice like telling aspiring writers to just start writing, or telling writers worried about style that your style ought to be an expression of your personality.

VI. The Diagnosis

Advice that helps you figure out what your specific bottleneck is, or the specific thing (among a suite of options) that would help you the most. Some product reviews might look like this.

The post you’re reading right now is also somewhat in this vein! Hopefully after reading the post, you’d have a better sense of what types of advice you’d find most useful to give or receive.

VII. The Landmines

Advice for what not to do, scary things newbies should avoid, etc. In most fields, learning the Landmines is supplementary advice. But in certain high-stakes domains where beginners have enough agency to seriously hurt themselves or others (like firearms practice, outdoor climbing, or lab chemistry), it’s the prerequisite advice.

Integration

Of course, many advice posts/essays/books might integrate two or more of the above categories. For example, my Field Guide to Writing Styles post mixes a meta-framework (Master Key) for choosing which framework/frameworks to write in (Diagnostic), with specific frameworks/writing styles you might want to write in. While in some sense this is more sophisticated and interesting to write (and hopefully to read!) than advice in a single “pure” category, it also likely suffers from being more scattered and confusing. So there are real tradeoffs.

Is the above ontology complete? What am I missing? Tell me in the comments!4

Core Ways Advice Can Be Bad

There is an endless plethora of ways advice can be bad and fail to deliver value to the intended audience (e.g. the advice is ignored), or deliver anti-value to the intended audience (the advice is taken, and taking it is worse than not taking it).

In this article, I will just focus on the biggest ones.

The three biggest reasons are that the advisor can fail at metacognition, the advice can fail to center the advisee, or the advice can otherwise fail to be persuasive.

Failures of Metacognition

The advisor can fail at metacognition and not know the limits of their knowledge.

  1. Most simply, the advice could straightforwardly be wrong
  2. The advisor might not know why they’re good at the thing they do
    1. They might confuse survivorship bias/luck with skill
    2. They might confuse innate talent with specific choices they made, or specific choices that work well given a specific skillset/talents they have
  3. The advice can be in the wrong category for the topic of interest
    1. Back to the ontology above, an advice essay could try to fit a “master key” ontology for a field that essentially does not have a (known) master key. I.e., they might try to force a “Central Dogma” when there isn’t one, or there isn’t a known one.
      1. I was at one point a fairly good self-taught programmer, and worked professionally in FAANG for a couple of years.
        1. As far as I know5, programming does not have a central dogma, nor have any of the attempts I’ve seen by other people to claim a central dogma for programming been plausible
        2. Instead, good advice for programming looks like a collection of tips, war stories, or a permission slip.
        3. Advice for getting good at programming mostly looks like advice for getting good at things in general, plus extremely specific advice for things like which IDE to use, (in Current Year) which AI assistant to employ, etc.
  4. Advice as identity: Some people become “the person who gives X advice” or “the person who’s good at advising people” and then keep giving advice even when circumstances change or they get contrary evidence. They’re too attached to their signature advice, and insufficiently attuned to reality.

Many of these failure modes can be alleviated through clearer thinking and better rationality.

Failures of audience centering

 

The advice can fail to center the advisee.

  1. The advisor might not realize that good advice is about the interaction of information with the (intended) advisee, rather than some objective fact about the world
  2. The advisor might realize that they should center the advisee, but not understand the advisee’s problems well enough to be useful
    1. In the context of de-centered internet advice, this could also come from insufficiently accurate audience segmentation, or bad luck
  3. The advice can be overly academic and descriptive, and insufficiently practical.
  4. The advice can actually be more about the advisor’s neuroses than about the topic in question
    1. For example the advisor could spend too much of their time proving their expertise or other desiderata (intelligence, interestingness, sexual appeal, or other traits not very relevant to the advice at hand)
      1. For example, writing advice proving how the writer is smart, or business advice justifying the advisor’s past choices.
Nietzsche’s great at self-promotion, but not the best at meta-cognition or audience awareness.

  1. The advisor could be assuaging their own emotional problems with the topic in question
  2. The advice can be in the wrong category for the audience of interest

    Advice for first-time cooks at Wendy’s should not be about establishing the writer’s erudition
    1. For example, first time advice for (most) novices should not look like Landmines
      1. Telling new chess players how not to play chess will just confuse them, since they don’t know how to play chess to begin with.
      2. (The main exception, as previously mentioned, are fields where safety is critical)
  3. Incentive blindness
    1. The advisor might be blind to the ways in which they, individually or collectively, are incentivized to give specific advice in ways that are not always in the advisees’ interests
      1. e.g. professors incentivized to tell young people to go to graduate school, or Paul Graham advising young/ambitious/technically talented people to found a startup

Many of these failure modes can be alleviated through greater empathy and a revealed desire to help others.

The advice can fail to be persuasive

This category is somewhat less bad than the previous two categories, as the damage is limited.

  1. The advice can be told in a boring/uninteresting way despite being correct
  2. The advice can be correct and exciting without being sufficiently justified to be persuasive
  3. The advice can be correct and persuasive if it’s ever read, but buried in ways that mean it never gets accessed by the intended audience

Many of these failure modes can be alleviated by improvements in writing quality in general. They can also be reduced by learning about other (ethical/semi-ethical) forms of self-promotion, which I have not yet cracked but hope to one day (and will gladly share on the blog).

What do you think? Are there specific categories of advice you're particularly drawn to? Are there (good) categories of advice that I missed? Tell us in the comments!


1

I realize this is a bit of a trap/local optimum to be stuck in, so starting tomorrow, I’m calling for a personal moratorium on publishing advice posts until at least the end of this week!

2

Sometimes the central tenet can be explained “well enough” in one sentence, like “focus on the user.” Often, it cannot be.

3

Apocryphal

4

(For simplicity, I’m ignoring high-context/specific/very situational advice, like specific suggested edits on a blog post, or an experienced programmer providing code review.) I’m also excluding exercise “books” like Leetcode or writing prompts.

5

I was a decent but not great (by Silicon Valley standards) programmer, so it’s possible there is a central dogma I was not aware of. But I also read dozens of books on programming and worked at Google for almost 2 years, so at the very least the memescape did not try very hard to impress upon me a central dogma.



Discuss