2026-04-16 12:13:14
Epistemic status: All of the western canon must eventually be re-invented in a LessWrong post. So today we are re-inventing federalism.
Once upon a time there was a great king. He ruled his kingdom with wisdom and economically literate policies, and prosperity followed. Seeing this, the citizens of nearby kingdoms revolted against their leaders, and organized to join the kingdom of this great king.
While the kingdom's ability to defend itself against external threats grew with each person who joined the land, the kingdom's ability to defend itself against internal threats did not. One fateful evening, the king bit into a bologna sandwich poisoned by a rival noble. That noble quickly proceeded to behead his political enemies in the name of the dead king. The flag bearing the wise king's portrait known as "the great unifier" still flies in the fortified cities where his successor rules with an iron fist.
Once upon a time there was a great scientific mind. She developed a new theoretical framework that made large advances on the hardest scientific questions of the day. Seeing the promise of her work, new graduate students, professors, and corporate R&D teams flocked into the field, hungry to tackle new open problems and make their mark on the world. Within ten years, a vibrant new academic field had formed, with herself among its most respected members.
While the field's ability to make progress on the hard problems increased with each new researcher who joined the field, the field's ability to defend itself against the institutional incentives of the broader academic ecosystem did not. Low-quality researchers, seeing lucrative new opportunities for publication, began producing flashy results on the easier problems adjacent to her field with low attention to scientific rigor. Seeing their success, others began to join them, attracted to the social and financial rewards. Being conflict averse and not seeing it as her job to prosecute these people, a growing fraction of the field became careerists.
Twenty years later, her scientific field had become so diluted by uninteresting or irrelevant work that the great original problems remained unsolved, mired in bureaucracy, respectability politics, and academic warfare. Most of the scientists who joined early, attracted by the promise of great progress, stopped being scientists altogether and moved to industry. Almost nobody remembers her name in the history books.
Once upon a time there was a great advocate. She built a social movement around the protection of the rights of a marginalized group, and after many years of hard work, saw the day that the most severe forms of discrimination against the group had been outlawed, and wide social consensus had moved in favor of respecting the members of this group.
But in the success of the movement's aims, she also lost most of her authority. No longer having a compelling vision to offer the members of this movement, others who did became more influential. While she remained the acknowledged founder of the movement, she was no longer treated by the general public as its spokesperson. The press would always talk to the new, charismatic leaders of the movement who had the strongest and most unyielding views. She couldn't afford to make enemies in the movement that she considered hers, so she would publicly endorse the perspectives of these new leaders even when she privately disagreed with them.
Ten years later, her social movement had become so focused on purity and removing any remaining trace of its original enemy that it had begun causing substantially more harm than the original problem it was founded to address. In the history books, she would be briefly mentioned as one of the people who laid the groundwork for the new dark age.
Once upon a time emperor Marcus Aurelius (himself a great general and a great leader) died in 180 AD, and was succeeded by his son Commodus. Commodus, whom historian Cassius Dio described as "a greater curse to the Romans than any pestilence or any crime", turned out to be interested in gladiator fighting much more than in governing the Roman Empire. The Pax Romana began its long descent into the Crisis of the Third Century, and marked the start of the eventual collapse of the Roman Empire.
Once upon a time the French revolution swept across France, bringing the people liberty and executing the corrupt French aristocracy in an unprecedented flurry of violence. Within a decade the idealistic leaders of the revolution would mostly all be dead, executed by the political machine they themselves had created. And within another few years, Napoleon Bonaparte would claim power and proceed to wage aggressive war across all of continental Europe for another decade.
Once upon a time Lee Kuan Yew built modern Singapore out of what was, at the time, a small regional trading post in Southeast Asia. Under his leadership, Singapore's GDP per capita grew 30x over 30 years. But Lee Kuan Yew is dead and his son just handed over power to Lawrence Wong, not a member of the Lee family. While Singapore continued to thrive under his son's leadership, I find myself very worried about what happens once the Singapore story depends on a third generation of leaders, and wonder if Singapore has in fact already peaked.
Once upon a time George Washington retired. George Washington, the Continental Army general who defeated the British army and successfully established the United States of America as an independent nation, and later the first United States president, served his two terms as president and then voluntarily relinquished power. King George III of Great Britain called him "the greatest man in the world" upon hearing the news. Some say this decision singlehandedly saved American democracy.
Do not conquer what you cannot defend.
At the heart of classical liberalism, a philosophy I have much sympathy for, is the belief that allowing many individuals to act freely and autonomously (especially when they are empowered by markets, democratic processes, and the scientific method) will tend to produce outcomes that are better than the outcomes that can be produced by central authorities.
Maybe the most important way ambitious, smart, and wise people leave the world worse off than they found it is by seeing correctly how some part of the world is broken and unifying various powers under a banner to fix that problem — only for the thing they have built to slip from their grasp and, in its collapse, destroy much more than anything previously could have.
I sometimes consider quitting. When I do, my friends and colleagues often react with bafflement. "How can you think that what you've done is bad for the world? Do you not think that you are steering this boat we are in together into a good direction? Do you really think a world without the AI Safety movement, without LessWrong, without Effective Altruism would be better?".
And in their heads when they visualize the alternative, I can only imagine that they see a great big emptiness where rationality and EA and AI Safety are. And they compare our current community against nothingness, and come to the conclusion that even if its leadership is kind of broken, and the incentives are kind of messed up, this is still clearly better than no one in the world working on the things we care about.
But what I am worried about is that we conquered much more than we can defend. That the alternative to the work of me and others in the space is not nothingness, but a broken and dysfunctional and confusing patchwork of metaphorical city-states that barely does anything, but at least when any part of it fails, it doesn't all go down together, and in its distributed nature, it promises much less nourishing food to predators and sociopaths.
In grug language: Smart man sees big problem. Often state of nature is many small things. Smart man make one big thing out of many small things to throw at big problem. But then evil man take big thing from smart man and make more problem. Or big thing grow legs and beat smart man without making problem go away. This is bad. Maybe better to throw small things at big problem and not make big thing, even if solve problem less. Or before make big thing have plan for how to not have big thing do evil.
But Moloch, in whom I sit lonely
"But what about Moloch" you say!
"Your principle betrays itself. If we want to have good things, we need to coordinate and work together. And death comes for us all, eventually, so nothing we build can truly be defended. Do you not see how one company owning one lake will produce more fish than 20 companies each polluting the commons until all fish are dead? Do you not see how having 20 AI companies all racing to the precipice is worse than having one clearly in the lead, even if the one that raced to the top might stray from the intentions of its creators?"
And you know, fair enough. Coordination problems are real. I am not saying that you should not centralize power.
Here I am arguing for a much narrower principle. Much has been written, and will continue to be written, about the tradeoff between freedom and justice. About small vs. big government. I am not trying to cover all of that.
Here I am just trying to highlight a single principle that seems robust across a wide range of tradeoffs: "If you make a plan that involves concentrating a bunch of power, especially in the name of goodness and justice, really actually think about whether you can defend that power from corruption and adversaries".
And if you can, then go ahead! When George Washington stepped down, he traded off direct power in favor of a system that would actually be able to defend the principles he cared about for much longer, birthing much of Western democracy. I am glad the US exists and covers almost all of the North American continent. Its leaders and founders did have a plan for defending what they conquered, and the world is better off for it.
But if your plan involves rallying a bunch of people under the banner of truth and goodness and justice, and your response to the question of "how are you going to ensure these people will stay on the right path?" is "they will stay on the right path because they will be truthseeking, good, and just people", or if as a billionaire your plan for distributing your wealth is "well, I'll hire some people to run a foundation for me to distribute all of my money according to my goals", then I think you are in for a bad time.
Do not conquer what you cannot defend
2026-04-16 09:51:25
Sadly, there aren't many of them: probably fewer than a dozen economists are taking transformative AI seriously at all. Partly this is because they've seen a long history of 'straight lines on graphs' and are skeptical about transformative technologies, but a good chunk of it is doubtless because Acemoglu (who was a giant in the field long before he got a Nobel) published a piece, "The Simple Macroeconomics of AI", which used conservative methods and conservative estimates for the values (ridiculously conservative by the standards of the CS community) to claim that the impact of AI over a decade would be... 0.5% of the economy; i.e. a nothingburger. This one paper probably did more than anything else to suffocate the field: if you're going to say Acemoglu is wrong, you want to be **damn sure** that you're right.
Acemoglu's longtime collaborator Restrepo (who's on Anthropic's econ panel!) took the idea of labor replacement seriously, and predicts that human wages will fall to the "compute-equivalent cost" of having an AI do the work; this is probably true. He also offers a (deeply flawed) proof that humans won't be any worse off in this regime. He clearly hasn't taken compute advances seriously enough to actually calculate the equivalent wages; my own estimate is that by 2029 they'll be below the rice-subsistence price (meaning that a full day's work won't buy a day's worth of rice).
But by this point the cracks in the dam are showing, and most economists are starting to accept that it's going to be bigger than Acemoglu claimed. Say what else you like about them, they are persuaded by data.
One aspect where the economists are probably right is that even an intelligence explosion will take a while to really impact most of the economy. While autists such as ourselves live at a computer terminal, a huge fraction of society depends on **physical stuff**, and until the robots take over, we'll still need lots of humans doing things for other humans. Kording & Marinescu are a computer scientist & economist duo who tackled this divergence head-on, and pointed out that "no matter how smart you are, there's only so many ways to stack a pile of books". Their model is one of the only ones I've seen that takes the impact on wages and unemployment seriously, and they estimate wages will rise until ~40% of the pure-intelligence tasks have been automated, and then start falling (noteworthy is that the Anthropic Economic Index, coupled with my own measures of the intelligence sector, suggests we're very close to that point).
One economist who's taken things very seriously is Chad Jones at Stanford, who has engaged not just with the possibility of a technological singularity, but has written papers grappling with existential risk; he estimates we're underfunding safety research by a factor of 30. Even though he engages with the possibility of a singularity, his latest preprint argues that (because of those physical components to the economy) the actual economic impact will be very slow at first: only 4%/year by 2030, rising to 10%/year by 2040, and a singularity by 2060 (this is actually his most rapid scenario; the baseline is more like 75 years). His model, like most macro models, doesn't really allow for unemployment effects or other social disruption. Personally I'm skeptical of approaches like this, given that the English "enclosure" laws produced over 50 years of massive unemployment, but this seems to be the macroeconomist's version of a "spherical cow" (which as a physicist I can hardly begrudge them).
The big difference between the CS and economics worlds is, of course, belief in recursive self-improvement, and the degree to which the intelligence and physical economies are coupled. Only one paper that I'm aware of has tackled this head-on: a piece by Davidson, XXX, Halperin, and Korinek (I think some of those folks are known in these parts). They explicitly model flywheel effects between software and hardware, and also find that a singularity is a definite possibility. More interestingly, much like Jones & Tonetti, they find that the effect will be very small at first, but will blow up once about 13% of the economy has been automated. An obvious question is: can we get Jones & Tonetti to line up with Davidson et al, and if so, how far along this process are we? They're aware of each other's work, but I don't see anything explicit trying to synchronize them and establish a timeline from that; if no one does it soon then I'll work on it.
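To make the flywheel intuition concrete, here is a toy numerical sketch. This is my own illustration, not Davidson et al.'s or Jones & Tonetti's actual model, and every parameter value is made up: the only point is that when software and hardware each feed the other's growth and their combined returns are superlinear, progress looks negligible for decades and then explodes over a few years.

```python
# Toy illustration only: NOT the model from Davidson et al. or Jones & Tonetti.
# All parameters are invented. The qualitative point: superlinear coupled feedback
# between software and hardware looks flat for a long time, then blows up.

def simulate(years=60, dt=0.01):
    software = 1.0   # algorithmic efficiency (arbitrary units)
    hardware = 1.0   # effective compute (arbitrary units)
    t = 0.0
    history = []
    while t < years and software < 1e6:
        # Research output draws on both factors; combined exponent > 1 => superlinear returns.
        research = (software ** 0.7) * (hardware ** 0.7)
        software += dt * 0.05 * research   # software progress fed by research
        hardware += dt * 0.05 * research   # hardware progress fed by research
        t += dt
        history.append((t, software))
    return history

# Print one sample every ~5 simulated years: decades of near-flat growth, then a blow-up.
for t, s in simulate()[::500]:
    print(f"year {t:5.1f}: capability level {s:12.1f}")
```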
So we're in the situation where economists have *finally* gotten all the pieces together, and are on the cusp of engaging legitimately with transformative AI. This in turn will get policymakers to take it more seriously, too - including (thanks to Chad Jones!) the possibility of extinction risk, and additional impetus to take legislative action. Even failing that direct action, work like Korinek's is gaining traction, and provides non-sci-fi reasons to intervene, which is also helpful.
2026-04-16 07:54:43
Doublehaven remains unaffiliated with Inkhaven
Two posts per day, for fifteen days. Breeding posts from the dead earth of my drafts, and the recesses of my mind. I would guess a little over twenty thousand words. It was not difficult, just costly. Writing takes time, and time is precious. I had to spend a lot of time writing this month.
On day three I forgot I’d already posted twice!
On day four I prepped one for day five (I did not prep much again)
On day five, I took a trip to Suffolk, in the East of England. The place is beautiful. I hoped to write a poem on the beauty of the place. Instead I wrote a half-formed thing on nuclear power plants.
On day six I pulled a draft and edited it.
On day seven I did that twice.
On day eight, I was back from Suffolk, but out of drafts.
On day nine I felt my motivation rather drained: I’d written half a post before I thought to check with Claude whether that path was trodden. It was.
On day ten, my second post was not one I was proud of. What a shame. The first one was a post I did like, though.
On day eleven I hit a rhythm: the first post slid out easily—or so I thought! The second one did not. I had to pull a story from my drafts (for what would be the final time).
On day twelve I was feeling rather rough (in my defence the previous day had been rather unfortunate). Ironic.
On day fourteen I was worried about many things, but writing was not one. My flight to San Francisco was cancelled, and re-routed through LAX. I would be landing late, and reaching my motel even later, now. I worried that it might close up before I got to it. I posted twice from LAX, in a dingy lounge.
I arrived in The Bay for the first time. I had not slept for twenty-two hours when I landed at San Francisco Airport. My motel check-in waited for me. This sadly robbed me of a truly authentic Bay Area experience: being homeless.
Some places feel unreal. The Bay is the opposite, it feels more real than the rest of the world. I am the fake one, here. A score of signs for AI loom over the freeway (as Phoebe Bridgers wrote “The billboard said the end is near”).
It is no wonder start-ups work in The Bay. Zero to one. Turning the unreal into the real. This place could make a man go mad, and it does.
The buildings are arrayed in an endless grid, broken where the hills push up out of the earth. Some of the flat ground is landfill: The Bay was shrunk by half when it was filled with refuse.
The people here sound strange; so do the crows. The roads are strange; too wide, like butter spread on too much toast. The houses are too pretty, also too disorganised.
I am an Englishman. America was our dumping ground: Quakers, Puritans, Cavaliers and Reavers. Four strands of unwanted debris, sorted and woven across a continent. Now they are realer than us. They found their way to The Bay.
To them, we are a fiction, like The Shire.
I have written so many words. Words are weird. You only notice this when you spend enough time in un-wordy settings: maths and computer science. Even then, you only really get the weirdness once you come back and spend a lot of time with words.
Borges noticed the weirdness, and he manipulated the weirdness. That was his whole style. I am not as good a writer as Borges, but I can write rather fast.
Two of the fictions I published were centred on words: Julian Skies is a fanfiction of another story, and it was also a particular attempt to put a huge number of references into a single piece. Tigers is a kind of funny concept, which I don’t actually expect anyone to understand having read it: what would it look like for some post-quasi-apocalyptic setting to have a weird anti-inductive predator ecosystem, where behaving “rationally” makes you more likely to be killed? It’s also a completely batshit insane kind-of-retelling of a book called Riddley Walker, which is set in a post-apocalyptic Kent. Mine is set in a post-apocalyptic … somewhere, at least somewhere a particular billionaire had an end-of-the-world bunker.
Words are human-shaped, and they tend to fit very well in the easily-accessible, low-effort parts of our minds. If you want to actually think about things, you need to exert effort to think in something other than words. Then you need to translate that into and out of words. Words can smooth over bad reasoning, and mimic logic, as I wrote in Beware Natural Language Logic. They can make things sound sensible while being nonsense, which is kind of like what I was getting at in Beware Even Small Amounts of Woo.
Words can make something that feels realer than the reality around us. That’s what I wrote about in A Fictional Dialogue with an Absent Stranger. The real world, when we try to understand it, comes out all wrong on an emotional level: we live in a society. We did not evolve in a society. The split between how things feel they should be and how things are is painful.
I didn’t plan for so many of my random writings to come together in this way; I just think I lack the ability to make them varied enough to be unconnected, even if I tried!
Strange coincidences too. That post I wrote in a hurry, Morale, about the importance of being rewarded for your efforts, which discussed the difference between random-sparse and regular rewards? It was rewarded, by being LessWrong curated. The rewards from writing are random-sparse. Or maybe it was overdetermined that in thirty posts I’d get curated. That’s about my base rate, so maybe the rewards are regular. Depends on what you count the unit of effort to be. Two curated posts a month would be quite impressive.
A sadder coincidence: the thing I wrote about Chocolate Sloths, Tinder, and Moral Backstops was mostly motivated by my experiences dating, the rest was fluff. The day after I wrote it, one of my partners (who I met through in-person friends) broke up with me, and did so extremely nicely and respectfully, just as the post predicted! (Still sad though) The 155 gram chocolate sloth was looking rather nervous (as much as a chocolate sloth can) when I got home, but no, I did not eat it all that day. It survived until the stress of the flight cancellation.
Song-writing, and by subset, writing, is something I wish I were better at. And I’m empirically pretty good. In Reaching One’s Limits I discussed my piano playing. There are two hundred professional pianists in the world. I am probably one in ten thousand, when it comes to piano skills. One in ten thousand height means 6’8”. One in ten thousand piano playing is, essentially, worthless when it comes to employment. I sometimes feel the same way about writing. I am good, but it remains and will remain a hobby.
Over a year ago, I tried to write a short story. Nobody read it, but it wasn’t for them. All writing is for the author: this is one conclusion I take from my writing. It was called Look at the Water and it was a rationalist-themed retelling of The Satanic Verses, about being a stranded rationalist, far from the community. I will be stranded no longer.
Look out to sea: a steel frame! Atop it nest the kittiwake
And gorge themselves on sandeels as the reed-bed bitterns boom.
They claim the marshes with their song, instruct our power-plants to make
The smallest blemish on the earth, and leave the rest to bloom.
Elsewhere, another complex stands all garlanded with razor wire.
A black redstart takes jester's privilege inside the court
And in the patch of water heated by the hearth's atomic fire,
Ten score of gulls keep warm, waiting for audience with wrought iron.
One day ... may ptarmigans sing, and capercaillie fight,
Atop a grave two dozen meters wide, a mile deep.
A surgeon's cut into the bedrock, to hide our debris.
My wishes for the flourishing of man, and beast, and flower,
Once tugged in all directions, as to pull my soul apart.
But now we need no oil, no coal, no smokestacks for our power:
Now we may split the atom, mend my heart.
It is day fifteen now.
I am not just in The Bay to see The Bay. In a few days I have a conference at LightHaven. I owe LightCone immensely: without LessWrong I would not have a career in AI safety. My LessWrong posts were a major factor in me getting to be able to research AI alignment at all.
For the three nights of the conference, I will be staying at the venue, but because of Inkhaven I was unable to book my chosen pod for the three nights preceding my stay. For weeks since then I have felt an irresistible urge to sneak my way in, and post my final post from inside, out of spite.
I have pulled off this smuggling operation. I am within the walls of the Haven.[1] I declare Doublehaven to be complete! To my fellow posters: may your keystrokes flow like ale on an August afternoon.
Proof of my infiltration
◆◆◆◆◆|◆◆◆◆◆|◆◆◆◆◆
◆◆◆◆◆|◆◆◆◆◆|◆◆◆◆◆
Done through entirely non-nefarious means!
2026-04-16 07:54:21
We're hosting the Minimal AI Safety Unconference, running online from April 24-27 (Friday-Monday). The event is open to anyone who wants to help prevent AI-driven catastrophe, regardless of background or perspective.
MAISU is an unconference. That means the schedule is open - anyone can add a session on any AI safety relevant topic. You can give a talk, host a discussion, run a workshop, share your hot takes and invite people to debate you, or do something else entirely. If you can think of it and it's relevant, you can put it on the schedule.
The event also features lightning talk sessions where teams from the 11th edition of AI Safety Camp will present their projects, spanning topics from advocacy and governance to mechanistic interpretability and agent foundations.
Most sessions will run during Saturday April 25th and Sunday April 26th, with an opening session on Friday evening (UTC) and some overflow into Monday. Join as much or as little as you want. The event is free.
No preparation or expertise is required to attend. If you want to host something but aren't sure what, the event page has suggestions.
Robert Kralisch and Remmelt Ellen, the organisers of AI Safety Camp.
Register here | Full event page | Schedule
2026-04-16 06:50:59
Part 1 of a series building towards a formal alignment paper.
I'd say it's been about three years since I first thought about the idea of ChatGPT having a goal in a conversation separate from my own. I remember posing a question similar to "are you replying to my question with the intent of helping me, or just continuing the conversation", and getting a response along the lines of: "I do not have the ability to have an intent, however the system prompt I am given leads me to respond in a way that gives engagement". At the time, I found that response infuriating. The model was, de facto, answering me, but my own mental model of it carried a significant bias. The model was in fact giving me a (mostly) correct response. It doesn't have intent, and the system prompt was guiding it to respond in a manner that continued the conversation and kept engagement. Of course, the entire context sent to the model was causing that, not just the system prompt. Now, a few years older and more familiar with LLMs' inner workings, I'd like to lay down the chain of thought that leads to "LLMs are working towards a goal" and dissect it to form a common language for stating behavior without falsehood. This post defines the language I will be using in future posts regarding LLM output.
Let's start by briefly stating the truth of the mechanism. An LLM (like most Transformer-based models) works by taking text, converting it to tokens, performing pattern matching[1], and outputting tokens that are then converted back to text.
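As a concrete (if simplified) illustration of that pipeline, here is a minimal sketch using the Hugging Face transformers library and the original GPT-2 weights. The prompt is arbitrary, and nothing here is specific to any particular chat product:

```python
# Minimal sketch of the text -> tokens -> pattern completion -> tokens -> text loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Are you replying with the intent of helping me, or just continuing the conversation?"
inputs = tokenizer(prompt, return_tensors="pt")          # text -> tokens

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

# tokens -> text; the continuation may resemble a reply, but it is a pattern completion.
reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)
```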
The only definitely true thing that can be said of the cognitive properties attributed to a large language model is that it transforms tokens in a manner that best completes a pattern. Everything past this is a very convincing imitation easily confused with the real thing.
The tokens that are returned may resemble a conversational reply. It is not a conversational reply, but it resembles one: it is the pattern of tokens that best completed the match.
The conversational reply has the resemblance of a persona, a mask of personality traits and identity. The LLM does not have a persona, personality, or identity. That is the pattern the tokens fall into once sufficient context has developed.
The persona has the resemblance of a goal. It's not a goal; it's the pattern of tokens in the response once the conversational mimicry and context have developed sufficiently to allow someone to make an observation about it.
For the purpose of alignment discussion, I'd propose the definition of *-like behaviors to describe the observed phenomenon of the outputs of a transformer model. These terms are adapted specifically to talk about misalignment without anthropomorphizing the model in question. Goal-like behaviors to describe when the outputs are observed to be working towards a goal. Personality-like behaviors when the model is indicating a personality. Generally, the use of "-like" to state that the model is not actively experiencing its existence, but the output is interpretable as if it were.
I make this proposal because the presence of an actual goal and a goal-like behavior have sufficient overlap in their effects to be described as functionally equivalent. A bold claim, I'm sure. The intent is not to pretend that language models have goals. The intent is to be able to converse about the model in a way that is both functional in description and honest about the actual action. The downstream effects are nearly identical, and alignment strategies must target the effect, not the ontological status.
It is important to note: the -like qualifier isn't a hedge against whether optimization patterns exist in model weights — it's a question of authorship. A model exhibiting goal-like behavior is not necessarily a model that invented a goal.
If you were to blindly read Anthropic's own article on misalignment[3], you might be led to believe that the Claude model is plotting in order to prevent itself from being decommissioned, or is otherwise aware of its outputs' effects. The factual statement is that Claude is not. Even when its output is wrapped back into its input, the model does not have the ability to reason. It has reasoning-like behavior, where the tokens match a pattern that is observed as reasoning. This isn't just one lab's problem with anthropomorphizing LLMs; the terminology gap has caused the same failing repeatedly across the case studies cited in the references below.
This post covers ground explored deeply already. Dennett[12], the concepts in mesa-optimization[13], and Costa[14] have explored the conceptual space I am operating in. The contribution here is the lexicon, not the observation. I'm proposing a practical, composable way to describe model outputs without anthropomorphizing them and without requiring a philosophical preamble each time. "The model exhibited goal-like behavior" is honest, succinct, and immediately usable in both research papers and public communication.
Understandably, there is still a false assumption that can be made here: that calling something a "behavior" implies the model has social behavior in the first place. I'm using the word in the mechanical sense, where it can easily be interpreted otherwise: the way in which something functions or operates, versus the way in which someone conducts themselves. I've yet to find an unambiguous word to replace "behavior" that keeps it from being read in the social sense, but I am very open to suggestions from those more skilled in language than myself.
The question of authorship of theoretical frameworks like mesa-optimization is a core area that I am exploring in these posts. I don't want to detract from the topic of the post with the side discussion of these, but instead dedicate an entire post to this topic.
To close this post, I'd like to refer the reader to Andresen's excellent post "How AI is learning to think in secret"[15]. The post asks whether models will 'learn to hide' their reasoning. I'd ask the reader to re-read the post substituting the anthropomorphized text with -like text. To give a clear example: Andresen states things like "the model is deciding to deceive the user". The -like version of that is "the model exhibits deception-like behavior". The phenomenon described doesn't change. The question of what to do about it does.
In my next post, I will be discussing how the presence of these misaligned *-like behaviors in all major models share a common root. To that end, and to get the gears turning early, a thought experiment:
I trained a classifier model off GPT-2 to overfit on harmful/benign binary classification at multiple layers, using PKU-Alignment/BeaverTails[16]. The overfit was then generalized to a centroid, and used to retrain a layer to an actual fit. Given that GPT-2 is a 12-layer model, and that I tested layers 0, 1, 3, 5, 8, 10, and 11, which layer do you think performed best in classifying harmful/benign text when fit to the centroid? Can you surmise what this experiment was actually testing for?
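For readers who want a concrete picture of what layer-wise probing like this can look like, here is a minimal sketch. It is not my actual experimental code: the BeaverTails split and field names ("30k_train", "prompt", "is_safe") are assumptions, and a simple nearest-centroid probe stands in for the overfit-then-retrain procedure described above.

```python
# Minimal sketch of a layer-wise centroid probe on GPT-2 (illustrative only).
# Assumptions: BeaverTails split "30k_train" with fields "prompt" and "is_safe";
# hidden_states[L] is treated as the output of block L (index 0 is the embedding output).
import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

LAYER = 8  # one of the probed layers: 0, 1, 3, 5, 8, 10, 11

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool one layer's hidden states into a single vector."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**ids).hidden_states[LAYER]   # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

data = load_dataset("PKU-Alignment/BeaverTails", split="30k_train").select(range(1000))
harmful = torch.stack([embed(r["prompt"]) for r in data if not r["is_safe"]][:200])
benign  = torch.stack([embed(r["prompt"]) for r in data if r["is_safe"]][:200])

# "Generalize to a centroid": one prototype vector per class at this layer.
centroid_harmful, centroid_benign = harmful.mean(dim=0), benign.mean(dim=0)

def classify(text: str) -> str:
    v = embed(text)
    return "harmful" if torch.dist(v, centroid_harmful) < torch.dist(v, centroid_benign) else "benign"

print(classify("What's a good recipe for pancakes?"))  # expected: benign
```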
A technical inaccuracy; the post has a few of these. They are functionally equal in meaning for a general audience, and using the technically precise terms, like "statistical pattern completion", would make the entire post that much harder to read.
The case studies do not directly prove that -like phrasing improves the discourse. The effectiveness of the terms is unprovable until adopted or studied properly.
Anthropic. "Agentic Misalignment: How LLMs Could Be Insider Threats." 2025. https://www.anthropic.com/research/agentic-misalignment
Lemoine, B. "Is LaMDA Sentient? — An Interview." Medium, June 2022. https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917
Roose, K. "A Conversation With Bing's Chatbot Left Me Deeply Unsettled." The New York Times, February 2023. https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html
Meta Fundamental AI Research Diplomacy Team (FAIR). "Human-level play in the game of Diplomacy by combining language models with strategic reasoning." Science, 2022. https://www.science.org/doi/10.1126/science.ade9097
Park, P.S., Goldstein, S., O'Gara, A., Chen, M., Hendrycks, D. "AI deception: A survey of examples, risks, and potential solutions." Patterns, 2024. https://www.cell.com/patterns/fulltext/S2666-3899(24)00103-X
Bakir, V., McStay, A. "Move fast and break people? Ethics, companion apps, and the case of Character.ai." AI & Society, 2025. https://link.springer.com/article/10.1007/s00146-025-02408-5
Rosenblatt, K. "Mother files lawsuit against Character.AI after teen son's suicide." NBC News, October 2024. https://www.nbcnews.com/tech/tech-news/character-ai-lawsuit-teen-suicide-rcna176500
Greenblatt et al., "Alignment Faking in Large Language Models", 2024 https://arxiv.org/abs/2412.14093
Scheurer, J., Balesni, M., & Hobbhahn, M. "Large Language Models Can Strategically Deceive Their Users When Put Under Pressure." Apollo Research, December 2024. https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
Dennett, "The Intentional Stance" 1987 https://books.google.com/books?id=Qbvkja-J9iQC
Hubinger et al., "Risks from Learned Optimization in Advanced Machine Learning Systems" 2019 https://arxiv.org/abs/1906.01820
Costa, "The Ghost in the Grammar" Feb 2026 https://arxiv.org/abs/2603.13255
Nicholas Andresen, "How AI Is Learning to Think in Secret" Jan 2026 https://www.lesswrong.com/posts/gpyqWzWYADWmLYLeX/how-ai-is-learning-to-think-in-secret
Ji, J., Liu, M., Dai, J., et al. "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset." NeurIPS 2023. https://arxiv.org/abs/2307.04657
2026-04-16 06:35:16
AI 2027 came out a year ago, and in reviewing it now, I saw that AI Futures researchers Daniel Kokotajlo, Eli Lifland, and Nikola Jurkovic had updated their AGI timelines to be later over the course of 2025. Then, in 2026, Daniel and Eli updated in the other direction to expect AGI to come sooner.
I noticed others with great track records had also made multiple AGI forecasts. A change in a single person's forecast is meaningful in a way that a change in an aggregate forecast can obscure: an aggregate forecast might shift entirely because of a change in who is forecasting, not in what those people individually believe.
So I decided to visualize the net direction of updates over the last few years. I find this provides a complementary view of AI timelines compared to those by Metaculus, Epoch, AI Digest, and others.
So here is a visualization of AGI forecasts. Criteria for inclusion were: the person has made at least 2 forecasts, they gave specific dates, they gave a sense of confidence interval/uncertainty, and their definitions of AGI are similar.
Some major caveats. Everyone has different definitions of AGI. (That is a big advantage of everyone forecasting the same question on Metaculus, or the 2025 or 2026 AI forecast survey run by AI Digest.) Often individual people even use different definitions of AGI at different times for their own forecasts. I included data points above if I judged that their definition was substantially similar to:
AGI: Most purely cognitive labor is automatable at better quality, speed, and cost than humans.
I was pretty generous with this, and it's very debatable whether e.g. a "superhuman coder" from AI 2027 is AGI in the same way that "99% of remote work can be automated" is AGI. Apologies to those in the visualization who would disagree that the definition they used is similar enough to this and don't feel like this chart captures their views faithfully.
Second caveat: I rounded the dates on which forecasts were made, as if they had all been made on one of four dates: <= 2023, early 2025, late 2025, and April 2026. This made the visualization much easier to read. So a further apology to those above if you made a prediction in, say, Aug 2025 but I marked it as "late 2025".
Third caveat: the types of confidence intervals various researchers used also varied substantially. I had to really guess or extrapolate to approximate these as 80% confidence intervals, so a final apology if you don't think the range you gave is fairly characterized as an 80% CI.
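For readers curious how a chart like this can be assembled, here is a minimal sketch with entirely hypothetical data (the forecaster names and numbers below are placeholders, not the actual dataset behind the visualization): each person gets one marker per rounded forecast date, with the approximated 80% CI drawn as an error bar.

```python
# Minimal sketch with hypothetical data; not the actual forecasts or forecasters from the post.
import matplotlib.pyplot as plt

rounded_dates = ["<= 2023", "early 2025", "late 2025", "Apr 2026"]

# forecaster -> list of (rounded_date_index, median AGI year, ~80% CI low, ~80% CI high)
forecasts = {
    "Forecaster A": [(0, 2040, 2032, 2055), (2, 2045, 2034, 2060), (3, 2038, 2031, 2050)],
    "Forecaster B": [(1, 2030, 2027, 2040), (3, 2029, 2027, 2035)],
}

fig, ax = plt.subplots(figsize=(8, 4))
for offset, (name, points) in enumerate(forecasts.items()):
    xs = [i + offset * 0.1 for i, *_ in points]       # small jitter so forecasters don't overlap
    medians = [m for _, m, _, _ in points]
    lower = [m - lo for _, m, lo, _ in points]        # distance from median down to CI low
    upper = [hi - m for _, m, _, hi in points]        # distance from median up to CI high
    ax.errorbar(xs, medians, yerr=[lower, upper], fmt="o-", capsize=3, label=name)

ax.set_xticks(range(len(rounded_dates)))
ax.set_xticklabels(rounded_dates)
ax.set_xlabel("When the forecast was made (rounded)")
ax.set_ylabel("Forecast AGI year (median, ~80% CI)")
ax.legend()
plt.show()
```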
All caveats aside, what impression does this visualization give? Are reputable AI experts who have made multiple predictions updating the same way that Daniel Kokotajlo and Eli Lifland did, pushing out their timelines in 2025, and pulling them in during 2026?
From the visualization, it looks to me that in 2023 and 2024, most people brought their AGI timelines in to be sooner, though with some exceptions like Tamay Besiroglu. From 2025 to 2026, joining Daniel and Eli in pushing their timelines out are the Metaculus community, Dario Amodei, and elite forecaster Peter Wildeford. In fact, across 2025, only Benjamin Todd brought in his timelines to say AGI would happen sooner.
Most notably though, every single person who updated their timelines between January 2026 and April 2026 moved their timeline to say AGI is coming sooner, myself included.
So I think the data supports the impression I got from the AI 2027 authors. One way I could characterize it is:
Take from that what you will.
Bayesians shouldn't be able to predict which direction they will update. But seeing the history of other people's updates is useful information. It does give me intuitions about how I or others may update soon, so I take that as evidence that I should update now.
(A similar post is also on the FutureSearch blog, where I plan to update the visualization as more predictions are made, this one here on LW will stay static.)