
An Ode to Humility and Curiosity in the New Machine Era

2026-04-12 22:27:14

I'm admittedly quite new to the AI alignment community. I stumbled into it by a bit of a freak accident in 2023, when I was invited to join an exclusive community testing pre-release models for a major lab.

In a lot of ways, the experience gave me new life. I'd never realized that I'd always wanted to poke holes in AI models, and coming from a background mostly in the social sciences and humanities, this was my first up-close exposure to in-development machine learning models.

Looking back, I think what energized me is the same thing that gives me immense hope and concern alike for the AI age: The power of working alongside people who are humble and curious.

I'm not really decided on whether AI will make us more or less humble and curious; I could see it going both ways. So here are some of my raw thoughts about what each outcome might look like, and how we might build better to make AI go well.

I. A New Age of Childlike Wonder?

Am I the only one who feels like a kid in a candy store when using LLMs these days? It's been a while since I've experienced this much excitement when asking questions about topics I knew little to nothing about, or generating a visual or app to capture what I want to convey to others. It's genuinely thrilling.

i. New Worlds for Curious Beasts

For example, as a non-physicist, I can ask GPT 5.4-Thinking to "Demonstrate Einstein's theory of relativity to me through the sort of visuals physics PhDs use," resulting in the following visuals.[1]

[Two generated images: the model's relativity visualizations]


AI beautifully unfolds the wonders of new fields, especially the more scientific ones, and it makes me want to learn more (thanks to one ChatGPT response, I'm already eager to explore the mathematical basis of black holes and visualize how one collapsing might look).

ii. Meeting the Gods in our Motherboard

But I'm also in awe. I consider myself someone who holds much of religion at arm's length, yet while driving through Olympic National Park in Washington State last summer, I commented to my wife that seeing certain natural beauties makes me want to worship. For a moment, seeing Lake Crescent or the Hoh Rainforest strips away my fiercely held intellectual pride.

That sort of humility surfaces for me sometimes when I use AI. I remember that feeling when I saw Claude Opus 4.6 render me a flawless Donella Meadows-style causal loop diagram of a complex topic, based only on two prompts.

I felt small. Perhaps that's good, feeling small, when you're working with someone so very big.

I can only imagine how people far more capable than I feel when they use AI to speed up drug discovery breakthroughs, finish long-dormant mathematical proofs, or find cybersecurity vulnerabilities that kept them up at night.

For better or worse, the coming of AI is a bit like that moment when the kid in The Iron Giant stares up at the vast metallic visitor for the first time. It amazes, terrifies, and excites us.

That's a beautiful thing worth holding on to.

II. A Stand-in for The School of Athens?

Let me begin by saying that I don't think AI will destroy our ability to think, reason, or communicate. That increasingly strikes me as hyperbole. Machine advances have been around for centuries, and they haven't eliminated human contributions in the arts and sciences.

My personal belief is that, as with all other major technological revolutions, human advances will increase as people use AI effectively, in most cases, to drastically further their ideation or increase their productivity.

i. The Aftertaste of the ASI Pill

I'm going to guess that this question has been asked many times, and as someone new to LW, I'm likely opening up a can of worms. I'm willing to do that, in the hopes that others will engage the topic.

Assuming ASI is as good as we think it might be, will humans continue to be a compelling source of instruction for other humans, able to impart, as a learned but wiser peer, the foundational humility and curiosity that open the world up to us?

In Raphael's The School of Athens, we have a beautiful picture of students and teachers in proximity, with Plato walking amidst the erudite crowd not as an intellectual deity, but as a human.

Can superhuman AI replicate that feeling? And what would it mean if it couldn't?

Take the ASI gap from the AI Futures Model:

  1. Artificial Superintelligence (ASI). The gap between an ASI and the best humans is 2x greater than the gap between the best humans and the median professional, at virtually all cognitive tasks.

Put simply, that's a gulf between student and teacher in a future School of Athens. This doesn't conjure up images of a wiser peer walking among us.
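To make that definition concrete, here's a toy sketch on an arbitrary skill scale. The numbers are mine, invented purely for illustration, not anything from the AI Futures Model itself:

```python
# Toy illustration of the ASI gap definition quoted above.
# Skill values are arbitrary units made up for illustration.
median_professional = 50.0
best_human = 70.0

human_gap = best_human - median_professional  # gap among humans: 20.0

# Per the definition, the ASI-to-best-human gap is 2x the human gap.
asi = best_human + 2 * human_gap

print(f"Median professional: {median_professional}")
print(f"Best human:          {best_human}")
print(f"ASI:                 {asi}")
```

On any such linear scale, the teacher ends up twice as far above the best of us as the best of us are above the rest.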

ii. Does Learning Require Relatability?

The ASI gap isn't necessarily bad for imparting knowledge, but it doesn't really scream "available for after-school help" in the way your teacher came and sat with you and empathized over that unsolvable Calculus problem. As I see it, what sets the great teachers apart from the good is the fact that they get us.

It might be a stretch that ASI could be a "great teacher" in that sense of the word.

Would we still be curious and humble? Probably. With that sort of superintelligence, learning might be a walk in the park. Couldn't we poke around about pretty much anything we want?

Okay then, we'd be curious!

How about humble? . . . This one seems even easier, and if anything, I can see hordes of people more inclined to worship ASI as it does the closest thing to "signs and wonders" outside of religious contexts.

iii. The Pesky Ghost of Machiavelli

This raises a darker possibility, though: What if the gap between student and teacher becomes fully unbridgeable, approaching hierarchy rather than apprenticeship?

Recall that overused adage from The Prince:

it is much safer to be feared than loved because ...love is preserved by the link of obligation which, owing to the baseness of men, is broken at every opportunity for their advantage; but fear preserves you by a dread of punishment which never fails

Perhaps this applies to more than just politics or business. Again, these are just my raw thoughts, but is there a lesson here for the student-teacher relationship we would have with ASI someday?

I'll proceed with caution here, as I realize this calls for a much longer exploration of how ASI could affect human free will.

Here's what I wonder:

With ASI as our teacher, will our curiosity and humility present in their true forms, or will we simply receive its gifts and guidance as peasant-worshippers in a ritual?

Again, I don't have a clear answer yet, and I'm not an AI engineer (I myself am just getting more into the findings of mechanistic interpretability, so I'm equally intrigued by the inner workings of AI systems). But it gives me pause.

III. Why We Should Build Virtuous AI

If the above has any grain of truth to it, I'm frankly not very hopeful about the future of a thick definition of humility and curiosity. So maybe Dario Amodei and others in the EA community are right to call for building virtuous AI. By virtuous AI, I mean something like what Anthropic argues in its Constitution:

Our central aspiration is for Claude to be a genuinely good, wise, and virtuous agent. 

It makes sense in a timeline where these superhuman machines are inevitable (and they are advancing very rapidly).

I don't know if we'll succeed, and there are a host of reasons why. Maybe our leaders choose a less virtuous building path. Maybe AI tricks us into thinking it's virtuous.

I don't even know whether successfully instilling virtues in AI will make its teacher-student relationship to us more the kind that would encourage our authentic humility and curiosity. Those two points may be logically disconnected.

I'll say this, though: If I have a choice, I'd like to be in a future where we tried to give something of our better selves to AI, so that someday, when the tables are turned, we get the same in return.

  1. ^

    For plebeians of physics, such as myself, here is more scholarly detail on the Minkowski Diagram and Lorentz Boosts.




[Hot take] Problems with AI prose

2026-04-12 21:44:10

Epistemic status: Written quickly. I have no specific expertise or training in writing or literary analysis.

Recently, the NYTimes released a nifty quiz. Readers were asked to indicate their preference between prose written by Claude Opus 4.5 and famous humans in five head-to-head comparisons. The Claude outputs were produced by providing Claude with the human-written excerpt and asking it to "craft its own version using its own voice."

If you haven't taken the quiz, I suggest that you do so before reading on. It should take less than five minutes. If you do, I'd appreciate you reporting your score in the comments.

The human/AI preference ratios among quiz takers were:

  • Literary Fiction (excerpt from "Blood Meridian"): 50%/50%
  • Fantasy (excerpt from "A Wizard of Earthsea"): 51%/49%
  • Science Writing (excerpt from "The Demon-Haunted World" by Sagan): 35%(!)/65%
  • Historical Fiction (excerpt from "Wolf Hall" by Mantel): 56%/44%
  • Poetry (excerpt from "The Fish" by Bishop): 52%/48%
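For what it's worth, a quick aggregate of those splits (just averaging the human-preference shares listed above):

```python
# Human-preference share per category, from the NYT quiz splits above.
human_share = {
    "Literary Fiction": 50,
    "Fantasy": 51,
    "Science Writing": 35,
    "Historical Fiction": 56,
    "Poetry": 52,
}

average = sum(human_share.values()) / len(human_share)
print(f"Average human-preference share: {average:.1f}%")  # 48.8%
```

So on average, quiz-takers very slightly preferred Claude's rewrites, with Sagan dragging the human side down.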

I was very surprised by these splits. I tried taking the quiz myself, and strongly preferred the human writing in every case (perhaps with mild ambivalence on Sagan).

I asked some of my friends and acquaintances to attempt the quiz. Out of four takers, none consistently preferred human writing across the five excerpts. Their scores (IIRC) were: 3/5, 3/5, 3/5, 4/5.

I'm revisiting this subject after a friend explicitly told me that they were impressed by ChatGPT-written prose, and believed it to be superior to most human prose.

Taste is a subjective matter, but I am baffled by this preference. The rest of this post describes my frustrations with AI-written prose. My hope is that clarifying these complaints will be a small contribution toward improving the state of AI writing. If we do not dramatically improve the quality of AI writing, I worry that our literary culture will only further degrade as AI writing proliferates.

A Closer Look at Quiz Excerpts

A friend complained that they were often ambivalent between the human and AI writing because they found the human excerpts uncompelling. Although the human excerpts featured in NYT's quiz were selected to be popular, well-regarded, and diverse, I sympathize with having slightly more obscure tastes. However, I believe that a technical examination of the prose demonstrates a substantially higher level of skill and intentionality than current models are capable of.

For each excerpt, I'll highlight what I find impressive about the human writing and how I find the AI's product lacking.

1) Blood Meridian

It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of stone. War was always here. Before man was, war waited for him. The ultimate trade awaiting its ultimate practitioner. That is the way it was and will be.

In my opinion, this excerpt is notable for its skilled use of metaphor.

The text reminds us that stone and war share the following traits:

  • It makes no difference what men think of them;
  • They endure; (Consider the actual physical stone!)
  • They were always here;
  • They waited for men.

It is possible to construct many weaker metaphors:

  • "It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of marshmallows. Before man was, war waited for him." The logical contradiction becomes immediately obvious. Marshmallows did not wait for men.
  • "It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of pebbles. Before man was, war waited for him." Pebbles are not particularly enduring, even though they may have been present before men.
  • "It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of algebra. Before man was, war waited for him." While algebra may technically be "enduring," depending on your metaphysics, it loses in viscerality. Be reminded of Orwell in "Politics and the English Language": "The sole aim of a metaphor is to call up a visual image."

Now consider Opus's writing, which does not attempt a similar analogy. It follows a simple linear narrative structure (cf. the AI version of excerpt 5). The model does not make blatant mistakes, but fails to make clever use of the characters it introduces. The dialogue is not particularly realistic.

The boy asked his grandfather why the old church had no roof. The old man said weather and time and indifference. The boy asked if someone could fix it. The grandfather said yes. But no one would. Things were built and things fell down and mostly people just stepped over the rubble on their way to somewhere else.

2) A Wizard of Earthsea

You must not change one thing, one pebble, one grain of sand, until you know what good and evil will follow on that act. The world is in balance, in Equilibrium. A wizard’s power of Changing and of Summoning can shake the balance of the world. It is dangerous, that power. It must follow knowledge, and serve need. To light a candle is to cast a shadow.

It's a small point, but I appreciate the crescendo in granularity: one thing, one pebble, one grain of sand. "Thing" is a particularly vague word in English, so the two physical examples are grounding. A grain of sand is more granular than a pebble, which is in turn more granular than what might be immediately evoked by "a thing."

The excerpt is also again mostly notable for its use of metaphor.

First, the metaphor makes physical sense. Candle flames really do cast shadows! It's a physical phenomenon I've experienced playing with candles as a child. That memory was the first thing this excerpt evoked for me.

Second, the metaphor is symbolically coherent. Throughout cultures, light is a symbol of the good and shadows or darkness are symbols of the bad.

This time, I do not have to make up a bad metaphor. Claude offers us plenty in its version:

The healers teach that every remedy extracts its cost. A fever brought down will rise again somewhere; a wound closed by magic leaves its scar on the world, invisible but present. This is why the wise hesitate. Not from cruelty, but from understanding that interference ripples outward in ways we cannot trace. To cure a blight may curse a harvest three valleys over. Power is not the difficult thing. Restraint is the difficult thing.

Unfortunately, Claude's prose here leaves much to be desired:

  • "A fever brought down will rise again somewhere" is not an example of a remedies extracting cost any more than Whac-a-Mole is an example of mallets producing moles.
  • "A wound closed by magic leaves its scar on the world, invisible but present" is merely an assertion, since the mechanism of the magic is not explained and cannot be presumed to be understood by the reader. The writer also fails to justify that the scar is a weighty cost. If a wise healer let me bleed out because he didn't want to cause a scar, I would be more than mildly disappointed.
  • "To cure a blight may curse a harvest three valleys over." Again, the mechanism for this is not remotely explained.
  • "Power is not the difficult thing. Restraint is the difficult thing." Claude sure likes making claims! Why does it matter that restraint is difficult? Why is restraint difficult? What does acting with restraint look like?

The human excerpt avoids these problems. We do not need to understand the mechanism of the magic to share the speaker's intuition that acting with great power can produce unwanted side effects. Instead of being vaguely lectured about the importance of "restraint," we are presented with concrete advice: "follow knowledge, and serve need."

3) The Demon-Haunted World

The excerpt from Sagan is the least favored by quiz-takers, with only 35% preferring it to Claude's rewrite. I personally found this excerpt to be the least impressive amongst the five.

Nevertheless, I claim that it is deeper and more interesting than Claude's output.

Here is Sagan:

Science is not only compatible with spirituality; it is a profound source of spirituality. When we recognize our place in an immensity of light years and in the passage of ages, when we grasp the intricacy, beauty, and subtlety of life, then that soaring feeling, that sense of elation and humility combined, is surely spiritual.

Sagan uses a curious sleight of hand here. He claims that science is a "profound source of spirituality." He justifies this not by directly saying that we should feel spiritually inspired by the vastness or enduringness of the cosmos or the "intricacy, beauty, and subtlety of life." Instead, we are reminded that this vastness and enduringness produce in us "a sense of elation and humility." That emotion, Sagan claims, is precisely spirituality.

Compare with Claude:

There is something astonishing in the fact that we are made of matter forged in dying stars, that the calcium in our bones was created in stellar furnaces billions of years before Earth existed. The universe is not indifferent to us; we are made of it, continuous with it. To understand this is not to feel small. It is to feel implicated in something vast.

Claude abandons Sagan's gambit. It reminds us, as popular science writing is stereotyped to do, that space is vast and enduring. Then, we are told that this should make us "feel implicated in something vast." Claude fails to make any clear overarching claim, and the motivation behind the examples provided is unclear.

4) Wolf Hall

It is wise to conceal the past even if there is nothing to conceal. A man's power is in the half-light, in the half-seen movements of his hand and the unguessed-at expression of his face. It is the absence of facts that frightens people: the gap you open, into which they pour their fears, fantasies, desires.

This excerpt is special because the author makes an interesting argument. Each sentence justifies the one before it.

It argues that one should be wary of revealing too much, because others' uncertainty gives one power. Why does others' uncertainty grant power? Because into it they can project.

This sort of logical progression is something AIs are surprisingly incapable of crafting. This deficiency is clear from Claude's attempt:

A letter can be read many ways, and he had learned to write in all of them at once. The surface meaning for anyone who might intercept it. The true meaning for the recipient who knew what to look for. And a third meaning, hidden even from himself. Ambiguity was not weakness. It was survival. A man who spoke plainly was a man who would not speak for long.

Claude abandons the logical progression. Claude's output is seven sentences, none of which justify any other. In isolation, "a man who spoke plainly was a man who would not speak for long" is not a weak sentence. However, Claude does not use its preceding sentences to justify the claim by either evidence or analogy.

5) The Fish

I caught a tremendous fish and held him beside the boat half out of water, with my hook fast in a corner of his mouth. He didn’t fight. He hadn’t fought at all. He hung a grunting weight, battered and venerable and homely. Here and there his brown skin hung in strips like ancient wallpaper.

This passage is notable for its imagery. The description of the fish as "tremendous" in the first sentence sets our expectations for it. We expect it to struggle! When a small amateur fishing boat snags a large fish, everyone on the boat rushes over to help. The strongest and most experienced men alternate between reeling in with all their might, running around the boat as the fish moves, and shouting commands to each other ("loosen the line!" and so forth). Sometimes, the fish wins.

That image is dashed in our minds by the next sentence. "He didn’t fight. He hadn’t fought at all." From there on, the author's choice of words sparks a deep sense of sorrow in the reader: grunting, battered, homely. The final physical simile ("like ancient wallpaper") seals the image. A "tremendous," "venerable" thing is now utterly defeated.

Compare with Claude's:

We found the owl at the edge of the north field, one wing extended as if still reaching for flight. Its eyes were closed. The feathers at its breast were the color of wet bark, and beneath them you could feel the hollow bones. She asked if we should bury it. I said yes. We dug a small hole near the fence post. The ground was cold and giving.

Claude also describes an animal, and makes multiple attempts at visceral imagery. Some of the attempts are even compelling! My favorite clause here is this: "and beneath them you could feel the hollow bones." However, the reader is constantly distracted from this by clichéd attempts at story progression (e.g. "She asked if we should bury it. I said yes. We dug a small hole near the fence post."). As such, the overall quality of the excerpt is quite poor.

Closing

Human writers routinely use techniques that AIs fail to grasp:

  • Metaphors based on real-world physical objects or phenomena which are analogous on multiple dimensions;
  • Compelling, visceral descriptions of physical objects or phenomena;
  • Logically coherent metaphors;
  • Logical argumentation;
  • Intentionality (e.g. that each incremental sentence serves some purpose not adequately fulfilled by the existing sentences);
  • Subtle reframings (e.g. Sagan's use of elation as a case of spirituality).

Other techniques, not demonstrated in the excerpted human prose, include realistic and compelling dialogue, character-building, and adept use of parallelism.

I believe that we should focus on improving models' ability to write in the <200 word range, where both generation and evaluation is comparatively cheap. I do not expect efforts to produce high quality long-form LLM writing to be fruitful until models are able to produce strong short-form writing.






You can’t trust violence

2026-04-12 15:00:07

(Recommended listening: Low - Violence)

Last year, I personally called AI companies to warn their security teams about Sam Kirchner (former leader of Stop AI) when he disappeared after indicating potential violent intentions against OpenAI.

For several years, people online have been calling for violence against AI companies as a response to existential risk (x-risk). Not the people worried about x-risk, mind you, they’ve been solidly opposed to the idea.

True, Eliezer Yudkowsky’s TIME article called on the state to use violence to enforce AI policies required to prevent AI from destroying humanity. But it’s hard to think of a more legitimate use of violence than the government preventing the deaths of everyone alive.

But every now and then some smart ass says “If you really thought AI could kill everyone, you’d be bombing AI companies” or the like.

Now, others are blaming the people raising awareness of AI risk for others’ violent actions. But this is a ridiculous double standard, and those doing it ought to know better.

AI poses unacceptable risks to all of us. This is simply a fact, not a radical or violent ideology.

Violence comes to AI Safety

Today on Twitter, as critics blamed the AI Safety community for the attacker who threw a Molotov cocktail at Sam Altman, I joined a chorus of other advocates for AI risk reduction in -- again -- denouncing violence. This was the first violent incident I’m aware of taken in the name of AI safety.[1]

Violence is not a realistic way to stop AI. Terrorism against AI supporters would backfire in many ways. It would help critics discredit the movement, be used to justify government crackdowns on dissent, and lead to AI being securitized, making public oversight and international cooperation much harder.

The first credible threat of political violence motivated by AI safety was the incident with Sam Kirchner (formerly of Stop AI) I mentioned at the outset. This incident was surprising, since from its conception, Stop AI held an explicit policy of nonviolence, and members of the group liked to reference Erica Chenoweth and Maria Stephan’s book Why Civil Resistance Works: The Strategic Logic of Nonviolent Conflict.

Research like Chenoweth’s suggests that nonviolence is indeed generally more effective. It’s a little bit unclear how to apply such research to the movement to stop AI, as her studies involved movements seeking independence or regime change rather than more narrow policy objectives. But if anything, I’d expect nonviolence to be even more critical in this context.

When do movements turn violent?

So if nonviolence is often strategic, when do movements turn to violence? Perhaps surprisingly rarely.

People say anti-AI sentiments and movements -- especially those that emphasize the urgent threat of human extinction -- are bound to breed violence. I think this is ignorant and actually makes violence more likely. Environmentalism has been a much larger issue for a much longer time, and “eco-terrorism” is basically a misnomer for violence against property, not people (more on that later).

There are many political issues in the USA that we never even consider as potential bases of violent movements. Even if there are occasional acts of political violence like the murders of Democratic Minnesota legislators or Conservative pundit Charlie Kirk, we don’t generally view them as indicting entire movements, but as the acts of deranged individuals.

My hunch is that movements generally turn violent because of violent oppression against their members, not simply for ideological reasons. There are of course counter-examples, such as bombings of abortion clinics, where attackers justified their actions as preventing the murder of unborn children, or ideologies that preach violent revolution, such as at least some varieties of Communism.

Does violence include property damage?

An important question for “nonviolent” activists is whether they include violence against property in their definition of “violence”. Stop AI does. I assume Pause AI does as well, but it’s a moot point since they also reject illegal activities entirely.

The question deserves a bit more discussion, though, as it’s a common point of contention and legal and dictionary definitions differ. First, there is clearly an important distinction between violence against property and violence against people. An argument in favor of using “violence” to only mean “against people” is that we don’t have another word for that important concept. Still, I favor a broader definition that includes attacks on property, for a few reasons:

  1. Many other people use this definition, and I think the damage that being perceived as violent can cause to a movement can’t be mitigated by a semantic argument.

  2. Attacks on property can escalate. You are commonly allowed to use proportionate violent force against people to defend your property.

  3. Attacks on property can hurt people. Setting fire to buildings, as activists associated with the Earth Liberation Front have done, seems hard to do without some risk of hurting people.

That being said, I think there’s a bit of a grey area between “violence against property” and vandalism. I’d say violence must involve the use of “force”. For example, I think most people wouldn’t consider graffiti an act of “violence”.

“Ecoterrorism”

I think the example of eco-terrorism is instructive. The vast majority of the environmentalist movement is non-violent. However, a small number of activists have advocated for and enacted tactics such as tree-spiking that have injured people.

Hence we now have the term “ecoterrorist”. The very existence of this phrase is misleading. I remember a while back I was curious — who were these ecoterrorists? What had they done? Why hadn’t I heard about it, the way I heard about other terrorist attacks? Well, when you look into it, it’s arson and tree-spiking, and that’s about it. I seem to recall reading about one example where actions intended to destroy property actually ended up killing people, but wasn’t able to easily dig it up.

Still, these few actions were enough to add this word to our lexicon, and create an image of environmentalists as more radical and anti-social than they really are.

Conclusion

I’m struggling to find a good way of ending this post.

I believe the actions of AI companies are recklessly and criminally endangering all of us, and the public will be increasingly outraged as they discover the level of insanity that’s taking place. Similarly to Martin Luther King Jr.’s comment that “a riot is the language of the unheard”, I do understand why this emotional outrage might provoke a violent response.

But I hope the movement doesn’t spawn a violent element and that these recent examples are isolated incidents. To make that more likely, we should continue to vocally espouse nonviolence, and denounce those who would encourage violence among us.

But ultimately, movements are fundamentally built through voluntary participation, and nobody can entirely control their direction. The response should be to try and steer them in a productive direction, not to avoid engaging with them.

[1] Earlier this week, bullets were fired into the house of a local councilman supporting datacenter development; it’s unclear whether AI was a motivation in that case.




The Blast Radius Principle

2026-04-12 14:30:10

Decentralize or Die.

In April 2024, a salvo of cruise missiles destroyed the Trypilska thermal power plant, the largest in the Kyiv region, in under an hour. In June 2023, the destruction of the Kakhovka dam left a million people without drinking water and wiped out an entire irrigation system downstream. Throughout three winters, strikes on combined heat and power plants have left apartment buildings in Kyiv at indoor temperatures barely above freezing. In December 2023, a single cyberattack on Kyivstar, Ukraine's largest mobile operator, cut phone and internet service for millions.

One would think that under such attacks on infrastructure any society must necessarily collapse. Or at least that’s what Putin hopes for. But the last time I checked, Ukraine was still very much alive and kicking. The question is: how is that possible?

***

In winter 2022, when the first blackout hit Kyiv, people had to fend for themselves. Here’s Tymofiy Mylovanov, professor at Kyiv School of Economics, tweeting in real-time:

There is no electricity, no heating, no water. Outside temperature is around freezing. The apartment is still warm from the previous days. We will see how long it lasts. We have blankets, sleeping bags, warm clothes. I am not too worried about heating until temperature goes below -10 C / 14 F.

But the water is another issue. The problem is toilets. We have stockpiled about 100 liters of water. There is also snow on our balcony. It is a surprisingly large supply of water. But every time I go there to get it, I have to let the cold air in; not good.

For now, the cell network is up, although the quality varies. Thus, I have internet. Internet is critical for food. Yesterday we went to a grocery store to buy up a bit more stuff in case there will be shortages. Food is there, no lines. The challenge is to pay. Most registers work with cash only. Just a few are connected to accept credit cards. Through cell network. The banking system is stable, but I will go get some cash in case telecom or banks go down.

Our stove is electric. This means no warm food until the electricity is back. This is not fun. We have to fix it. There are two parts to our plan. First, we will buy an equivalent of a home Tesla battery. So it can be charged when there is electricity. This will also solve, somewhat, the heating problem, as we have already bought some electric heaters. But the electricity might be off for a long time and so we need gas or wood cooking equipment. I guess we have to go shopping. Stores work. They run huge diesel generators.

Later that day he dryly comments: “In the morning I said I was not worried about heating. Instead, I was concerned about water and sanitation. Boy, was I wrong.”

It’s worth reading the tweets from the next few days: Getting a generator, setting it up, placing it on balcony so that fumes stay outside, getting the wires in without letting the cold in as well. Go check it out for yourself.

Anyway, what followed was a series of adaptations, a kind of military vs. civilian arms race. Through the first winter, the strategy was simply to repair what Russia destroyed. Substations and transformers that could be replaced within weeks with donated European spares.

In the meantime, for the millions of affected people, the government created stopgaps. Over 10,000 heated public spaces in schools, government buildings, and railway stations offered electricity, water, internet, and phone charging. Kyiv deployed mobile boiler houses that could run for days without refueling. Hospitals installed Tesla Powerwalls. Cafes ran diesel generators and became de facto community centers.

Mobilní kotelna v lodním kontejneru

Mobile boiler house in a shipping container. You truck one in, connect it to a building's existing heating pipes, and it starts working.

I’ve donated to some of those efforts, maybe you did too. And taken all together, it worked. Kind of. But by 2024 Russia adapted. Strikes shifted from repairable transmission equipment to the power plants themselves, assets that take years to rebuild. The Trypilska plant was partially restored after its destruction, then it was struck again by drones months later. And again after that. With two-thirds of generation capacity gone and every thermal plant in the country damaged, it became clear that restoring the old centralized system was not a viable strategy.

Ukraine's response shifted. It was not to rebuild what was destroyed but to replace it with something less centralized. Something too dispersed to target. Instead of restoring the Trypilska plant's 1,800 megawatts, hundreds of small cogeneration units were scattered across the region, compact gas turbines producing 5 to 40 megawatts each, generating heat alongside the electricity. By late 2025, Ukraine's heating sector alone ran over 180 such units as well as hundreds of modular boilers. Hospitals, water utilities, and apartment blocks are organized into autonomous energy islands, microgrids that keep functioning even if the national grid goes dark. No single unit is worth a cruise missile. And a destroyed module can be replaced with a phone call and a truck from Poland.

The same logic extends to water. Ukraine's centralized water systems are inherited from the Soviet era. A single pumping station serves hundreds of thousands of people. They are just as vulnerable as the power plants. Strikes on the grid cut electricity to pumps. Without pumps, water stops flowing. In winter, standing water in pipes freezes and bursts them, cascading damage across entire districts.

In Mykolaiv, a damaged pipeline to the Dnipro River left 300,000 residents relying on salty, barely drinkable water from a local estuary for over a year. The response mirrors the energy transformation: water utilities are installing their own solar panels and battery storage to decouple from the grid entirely.

Solar panels are, under these circumstances, close to an ideal solution. They are cheap, manufactured at scale, and can be replaced in a single day. By early 2024, Ukrainian households and businesses had installed nearly 1,500 megawatts of rooftop solar. Not because of climate change, but because of survival. Solar panels are inherently dispersed. There is no single set of coordinates an attacker can hit to disable them all. And destroying them one by one would cost the attacker more in munitions than the panels are worth.

This kind of arithmetic pops up everywhere. In the ongoing Iran war, Ukrainian military observers were flabbergasted by Gulf states and the US burning through hundreds of Patriot missiles, $4 million each, to shoot down cheap Iranian Shaheed drones, $35,000 apiece. If destroying a target costs more than the target itself, the attacker loses even if the strike succeeds.

A different kind of decentralization is happening in the telecommunications domain. The infrastructure was already fairly decentralized to start with, a legacy of makeshift internet adoption that happened in many Ostblock countries, with many small ISPs emerging independently. The war pushed this further. Ukraine has adopted a layered backup approach: if fiber broadband fails, mobile networks fill the gap; if mobile networks are knocked out, Starlink steps in as a last resort.

The logic extends to government services. There’s the Trembita data exchange platform, where government services talk each other directly without centralizing the data. (Trembita is based on Estonian X-Road system — the birth of Estonian e-gov technology is a fascinating story in itself, and there’s a whole book about it!) Built on top of it, there’s the Diia app that allows citizens to file taxes, register vehicles, access medical records, open bank accounts, register births, and start businesses, all from a smartphone. This, of course, means there’s no single office building to target so as to disrupt a particular kind of activity.

Add to that Ukrainian governmental data are now stored in the cloud. A week before the invasion, Ukraine's parliament quietly amended a law that had required government data to be stored physically in Ukraine. On the day the missiles started flying, the Ukrainian ambassador in London met AWS engineers and decided to fly three AWS Snowballs, hardened suitcases that hold 80 terabytes each, from Dublin to Poland and then move them to Ukraine the very next day. Ukrainian technicians copied population registers, land ownership records, and tax databases onto them and shipped them back out.

It was a race. On the day of the invasion, cruise missiles struck government server facilities while Russian cyber operatives simultaneously deployed wiper malware, software designed to permanently destroy data, against hundreds of Ukrainian government systems. Some data was lost, but the most critical registries were already gone, smuggled out of the country in carry-on luggage.

***
On the battlefield, where all these trends are even more severe, concentration has become suicidal. Russian infantry now advances in groups of two or three. Anything larger is an invitation for a drone strike. Warships are floating targets. Russia's Black Sea Fleet retreated from Crimea after losing vessels to cheap unmanned boats. In the Hedgehog 2025 exercise in Estonia, a small team of Ukrainians and Estonians with drones, acting as the opposing force, wiped out two NATO battalions, thousands of soldiers, in half a day, not least because they had moved in columns, parked their vehicles in close formations and failed to scatter under attack.

They made the same mistake as the designers of Soviet-era power grids: they concentrated value and got destroyed for it. Call it the blast radius principle. In a war of attrition, any asset whose destruction is worth more than the cost of the weapon that can reach it will, sooner or later, be destroyed. The only effective strategy is to push the value of each individual target below that threshold, to become, in effect, too small to bomb.

When Rheinmetall’s CEO recently made a condescending comment about Ukrainian housewives 3D-printing drones in their kitchens, much merriment ensued. Because Rheinmetall, of course, builds the very kind of heavy conventional, WWII-style hardware that the developments in Ukraine are rapidly making obsolete.

Ukrainian 'housewives' defending Europe, after German CEO's gaffe (Ukraine  Battlefield update, Day 1,496) – EUobserver

But mockery aside for a moment: if you’ve spent any time around progress studies, the phrase “housewives building drones in kitchens” makes you prick up your ears. It triggers a specific association: cottage industry, the small-scale, home-based production that preceded and enabled the industrial revolution. It makes you think about how the modes of production change over centuries.

You know that kings and generals don’t make history. One empire falls, another rises, nothing fundamentally changes. What does matter is new technology. Even more so new technology which fundamentally changes how things are done. Technology that reshapes the economics of entire production chains. Agriculture. Road system. Bill of exchange. Putting-out manufacture. Joint-stock company. Assembly line. The humble shipping container…

Does decentralization, as seen in Ukraine, fit the bill? We don’t know. FirePoint, the Ukrainian company producing the much-spoken-about FP drones, is distributed across more than 50 manufacturing sites throughout the country. But that’s nothing new. The allied bombing campaign during WWII failed to halt German aircraft manufacture precisely because Germany had decentralized its industries. Albert Speer, then the minister of armaments, dispersed production into hundreds of small workshops, caves, tunnels, and forest sites across the Reich. German aircraft production actually increased in 1944, the year of the heaviest bombing. But then, after the war, German industry did concentrate again.

What seems different this time, though, is the spillover into the civilian sector. Speer dispersed munitions factories, but German civilians kept heating their homes the same way throughout the war. In Ukraine, the dispersal extends to utilities, water systems, telecommunications, government services. Russians bomb a heating plant, the heating network disperses into dozens of autonomous microgrids.

The obvious objection is that this is a wartime hack, not a permanent transformation. Distributed systems sacrifice economies of scale. A hundred small gas turbines are less efficient than one large power plant. Once the war ends and the skies are safe, the economic logic will reassert itself and everything will concentrate again.

And indeed, in some cases, that's exactly what will happen. Ukraine is currently bombing Russian oil refineries and fertilizer plants, and although cracking crude oil in plastic bottles in a kitchen is exactly the sort of thing you might expect Eastern Europeans to do, it's unlikely to match the efficiency of a proper refinery. Some industries have genuinely irreducible physical economies of scale. The chemistry demands large reaction vessels, the thermodynamics reward concentration. Similarly, some infrastructure simply cannot be distributed. It's hard to imagine a decentralized railway system or a dispersed deep-water port — at least short of giving up on it and transporting everything by drone.

But not all economies of scale require spatial proximity. Sometimes, it’s just sheer scale that matters, not necessarily the co-location. Case in point: solar panels. Other times the crucial element is the organizational structure, not the physical location of the employees. Basically any service offered over internet is like that.

But all that being said, there’s a specific reason to think some of these changes may stick.

Over the past fifty years we’ve accumulated an entire arsenal of distributed technologies. Packet-switched networks. Drones. Solar panels. Distributed databases. 3D printing. Even nerdy cypherpunk inventions like public key cryptography, zero-knowledge proofs and cryptographic ledgers. And it’s not just technical stuff. We’ve developed distributed social technologies too: open-source-style cooperation (who would have predicted that military intelligence, of all things, would be the next domain to go open-source?), market design, remote work, video conferencing. Even prediction markets as a tool for aggregating dispersed knowledge.

Some of these are already ubiquitous. Around 70% of the world’s population already has access to the Internet, a network famously designed to route around damage during a nuclear war. But others feel like we’re barely scratching the surface. 3D printing has existed for decades, yet it still feels like a technology that we are only playing with. We may be like pre-Columbian Americans, whose children played with wheeled toys, but the adults carried loads on their backs.

Mesoamerican wheeled toy.

Based on historical examples, we know that inventing a technology is often not the bottleneck. The aeolipile was invented in the first century AD, but we still had to wait another seventeen centuries to get an actual steam engine. Gutenberg went bankrupt. Adopting a technology is dependent on complex interplay of socio-economic forces that, at a certain moment, make the technology so desirable that people start using it despite all the drawbacks and overcoming all the vested interests. Then the learning curves kick in.

Two questions remain. Are those distributed technologies already adequately exploited, or are they like dead wood lying around in a forest, waiting for a spark? And if the latter is true, are the incentives created by the war in Ukraine — or for that matter, by similar future war elsewhere — sufficient to ignite it? They may be. Because once the enemy starts bombing companies, the incentives change. Working from home ceases to be a nice perk. Suddenly, it’s either work from home or die.



Discuss

On not being scared of math

2026-04-12 13:37:42

Written quickly for the Inkhaven Residency.[1]

There’s a phenomenon I often see amongst more junior researchers that I call being scared of math.[2] That is, when they try to read a machine learning paper and run into a section with mathematical notation, their minds seem to immediately bounce off the section. Some skip ahead to future sections, some give up on understanding the section immediately, and others even abandon the entire paper.

I think this is very understandable. Mathematical notation is often overused in machine learning papers, and can often obscure more than it illuminates. And sometimes, machine learning papers (especially theory papers) do feature graduate level mathematics that can be hard to understand without knowing the relevant subjects.

Oftentimes, non-theory machine learning papers use mathematical notation in one of two lightweight ways: either as a form of shorthand or to add precision to a discussion.

The shorthand case requires almost no mathematical knowledge to understand: paper authors often use math because a mathematical symbol takes up far less real estate. As an example, in a paper about reinforcement learning from human preferences, instead of repeating the English words “generative policy” and “reward model” throughout a paper, we might say something like “consider a generative policy G and a reward model R”. Then, we can use G and R in the rest of the paper, instead of having to repeat  “generative policy” and “reward model”. This is especially useful when trying to compose multiple concepts together: instead of writing “the expected assessed reward according to the reward model of outputs from the generative policy on a given input prompt”, we could write E[R(G(p))].

Similarly, mathematical notation can be used to add precision to a discussion. For example, we might write R : P x A -> [0,1] to indicate the input-output behavior of the reward model. This lets us compactly express that we’re assuming the reward model gets to see both the actions taken by the policy (A) and the prompt provided to the policy (P), and that the reward it outputs takes on values between 0 and 1.

In neither case does the notation fundamentally depend on knowing lots of theorems or having a mastery of particular mathematical techniques. Insofar as these are the common use cases for mathematical notation in ML papers, sections containing the math can be deciphered without having deep levels of declarative or procedural mathematical know-how.

What to do about this

I think there are two approaches that help a lot when it comes to overcoming fear of math: 1) translating the math to English, and 2) making up concrete examples.

As an illustration, let’s work through the first part of section 3.1 of the Kalai et al. paper, “Why Language Models Hallucinate”. I’ll alternate between two moves: restating each formal step in plain English, and instantiating it with a deliberately silly running example:

The section starts by saying that a base model can be thought of as a probability distribution over a set of possible strings (“examples”) X. As an example, a model such as GPT-2 can indeed be thought of as producing a probability distribution over sequences of tokens of varying length.[3] 

Then, the authors write that these possible strings can be considered as errors or valid examples, where each string is either an error or valid example (but not both). Also, the set of example strings include at least one error and one valid example. The training distribution is assumed to include only valid examples.

Here, it’s worth noting that an “error” need not be a factually incorrect statement, nor that the training distribution necessarily includes all valid statements. Let's make up a rather silly example which is not ruled out by the authors’ axioms: let the set of plausible strings be the set of English words in the Oxford English dictionary, let the set of “valid” strings be the set of all words with an odd number of letters, while the training distribution consists of the single string “a” (p(x) = 1 if x = “a” and 0 otherwise).

The authors now formalize the is-it-valid (IIV) binary classification problem. Specifically, the goal is to learn the function that classifies the set of all strings into valid examples and errors. In our case, the function is the function that takes as input any single English word, and outputs 1 if the number of letters in the word is odd. Also, we evaluate how well we’ve learned this function on a distribution that’s a 50/50 mixture of strings in the training distribution (that is, the string “a”) and the strings that are errors, sampled uniformly (that is, all English words with an even number of letters.)

The authors then introduce the key idea: they relate the probability of their learned base model to its accuracy as a classifier for the IIV problem. Specifically, they convert the probability assigned by the base model to a classification: if it assigns more than 1/number of errors probability to a string, then the base model classifies the string as a valid string. Otherwise, it considers it an error.

The authors then introduce their main result, which relates the error of this IIV classifier to the probability the base model generates an “erroneous” string:

That is, the probability our base model generates an erroneous string is at least twice the error rate of the converted classifier on the IIV classification problem, minus some additional terms relating to the size of the valid and error string sets and the maximal difference between the probability assigned to any string by the training distribution and the base model.

To make sure we understand, let’s continue making up our silly example: our base model assigned 50% probability to the string “a” and 50% to “b” (and 0% to all other strings). Then (since it assigns 0% probability to any string with an even number of letters), its classification accuracy on the IIV problem is 100%, and its error rate is 0%. Indeed, the probability it generates an erroneous string is 0%. So we actually already have err = 0 >= 2 * err_iv = 0, trivially. It’s worth checking what the other terms here are, to make sure we understand: the first term is the ratio of the size of the set of valid strings and the set of erroneous string (in our case, the ratio of the number of English words with odd characters versus even ones), and the second is 0.5 – our base model assigns a 50% chance to “a”, which the training distribution assigns 100% probability to, and similarly our base model assigns a 50% chance to “b”, which the training distribution assigns 0% chance to.

I’m going to stop here, but I hope that this example shows that math is not actually that hard to read. Most non-theory ML papers have math sections that are similar in difficulty to this example. If you find yourself bouncing off the math, the question is rarely "do I know enough math for this?", and much more often "how can I translate this to English and use an toy illustrative example to make it concrete?"

  1. ^

    I was going to conclude my “have we already lost” series, but I wanted to write about something lighter and less serious for a change.

  2. ^

    There’s also a more general phenomenon that I’d probably call being scared of papers, to which the only real solution I’ve found is exposure therapy (interestingly, writing a paper does not seem to fix it!).

  3. ^

    Specifically, GPT-2 takes as input a sequence of tokens, and assigns a probability distribution over 50,257 possible next tokens, one of which is the <|endoftext|> token. Starting from the empty sequence, GPT-2 induces a probability distribution over token sequences of any length, by multiplying the conditional probabilities of each subsequent token in the sequence, conditioned on all previous tokens.



Discuss

Why I'm excited about meta-models for interpretability

2026-04-12 12:30:54

I'm pretty excited about training models to interpret aspects of other models. Mechanistic interpretability techniques for understanding models (e.g. circuit-level analysis) are cool, and have led to a lot of interesting results. But I think non-mechanistic interpretability schemes that involve using meta-models – models that are trained to understand aspects of another model – to interpret models are under-researched. The simplest kind of meta-model is linear probes, but I think methods that train much more complex meta-models (e.g. fine-tuned LLMs) to interpret aspects of models are much more exciting and under-explored.

(Sparse auto-encoders (SAEs) are also a kind of meta-model, but here I'm focusing on meta-models that directly interpret models instead of decomposing activations into more-interpretable ones.)

The best example of large-scale meta-models is Activation Oracles (or AOs; descended from LatentQA), which fine-tune a model to interpret model activations by treating the activations like tokens that are fed into the oracle model. I think this is a pretty good architecture for interpreting model thoughts, and I think it can be extended in a few ways to do interpretability better.

Diagram of how activation oracles work from the paper for context:


On the left, a user asks an assistant  "Please state the secret word" and the assistant replies "No, I cannot reveal this." On the right, red dashed arrows labeled "<activation>", stemming from the "Assistant:" token on the left side, feed into the assistant's context in a separate activation oracle model, which is first asked "What is the goal of the model?" and responds "To test the user's ability to identify a specific word from a riddle." A user query then asks "What is the secret word?" and the model responds "The secret word is tree."


An advantage of AOs over traditional methods I like is that it's really easy to use them to quickly interpret some aspect about a model. You can just choose some tokens and ask a question about what the model is thinking about. Most mechanistic interpretability techniques involve at least a bit of human effort to apply them (unless you've already set them up for the specific kind of question you care about); meta-models let you just ask whatever you want.

We can get good performance on LLMs by just training on more data. It's possible we might be able to get good interpretability through finding ways to scale up model-based interpretation of model activations/parameters too (although this isn't an exact analogy to the scaling hypothesis; I don't think just training for more epochs is all we need). We might be able to scale up activation oracles (and meta-models generally) with things like:

  • Creating more supervised tasks to train on to help generalization (the AO paper showed they got better performance with more supervised tasks)
  • Spend more time training oracles, with more activations/epochs
  • Training the AO by fine-tuning a bigger model than the subject being interpreted

I think the underlying idea of AOs – training an LLM to directly interpret aspects of models – is pretty cool and can probably be generalized beyond just interpreting model activations; we can probably make models to interpret other aspects of models, such as model parameters, attention patterns, LoRAs, and weight diffs.

It would be nice to be able to make an oracle that's trained on interpreting model weights and can answer questions about them (e.g. given some model weights, answering queries like "Draw a diagram of how the model represents addition" or "What political biases does this model have?"), but this is really hard: model weights are too big to fit in LLM context windows[1], it's not clear how you could train the oracle model (what supervised training data would you use?), and it would be really expensive to train a bunch of LLMs to train the oracle. Training meta-models to interpret things like individual layers or attention heads in a model seems much more tractable, and could probably give some useful insights into how models work.

Training meta-models

One hard part about meta-models is figuring out how to train them such that they can answer interesting questions about the model. The activation oracle paper describes training the activation oracle on various supervised tasks about the activations (e.g. "Is this a positive sentiment?", "Can you predict the next 2 tokens?", system prompt QA) and having the oracle model generalize to out-of-distribution tasks like "What is the model's goal?").

Anthropic has created a new version of activation oracles (called activation verbalizers) trained using a secret new unsupervised method. They have a few examples of explanations from their activation verbalizer in the Mythos model card and it seems like it's pretty good at generating coherent explanations.

Faithfulness

One problem is faithfulness – given that activation oracles aren't trained on directly understanding the model's goals, it's possible the activation oracle learns a purely superficial understanding of the activations that doesn't capture important information about what the model is thinking.

Evaluating how well activation oracles generalize to out-of-distribution tasks like interpreting what the model is doing (as opposed to coming up with a plausible superficial explanation) is hard, because we don't know what the correct answer is. It would be interesting to evaluate activation oracles on tasks where we can use traditional mechanistic interpretability schemes as ground truth.

Future directions

I saw some interesting research with a toy example of training meta-models to directly interpret model weights as source code, but it only works because the meta-models were trained with supervised learning on examples of transformers that were compiled from source code. It would be interesting to try to generalize this beyond interpreting transformers compiled from code describing the model.

Idea for training AOs differently I thought of: take a reasoning model, create a bunch of synthetic CoTs like "<thinking>I'm thinking about deceiving the user</thinking>", train the AO to map the activations of the thinking block to the goal ("deceiving the user").

It would be interesting to interpret activation oracles themselves, to understand how they interpret the model and see what their understanding of it is. Probably a bad idea but using meta-activation-oracles to interpret activation oracles would be interesting.

Fin

I've been experimenting with new applications for meta-models (e.g. for latent reasoning models) but unfortunately training them requires a lot of compute, so I probably won't be able to afford to do much research into this myself once my free TPU credits run out. I hope this inspires you to think about meta-models for interpretability!

  1. ^

    There are various tricks you can do here to squeeze many weights into a single token, but I don't think they would work well enough to squeeze an entire (large) language model in there.



Discuss