
My Hammertime Final Exam

2026-03-22 06:11:59

Firstly, I finally made it :~D

It's my second attempt: the first time I tried to finish Hammertime was around a year ago. I had since forgotten I even had a LessWrong profile, so here I am, writing my first post.

Prompts

  1. Design an instrumental rationality technique.
  2. Introduce a rationality principle or framework.
  3. Describe a cognitive defect, bias, or blindspot.

Rationality Principle: One change at a time

I've kinda gotten used to being a professional in my career, but as soon as I start dealing with routine real-life problems, my carefully collected knowledge suddenly disappears :~0

For example, every time I begin a task, I have to remind myself to make changes as small as possible: that way (in case anything goes wrong) only a small part of the system will break; plus small changes are easier to review and test. So on every task I have to fight the urge to add "another little change" or "do a tiny refactoring".

Then some life situation begins -- like replacing a trash bag -- and I suddenly find myself in the bathroom cleaning the mirror, the scope of a small 2-minute task having grown into a 2-hour monster. That's how the amygdala learns the pattern "changing a trash bag is a costly operation".

While studying productivity, I heard that people who begin building planning/productivity systems tend to "overpush" extra tasks: systematization brings a feeling that every task is finishable, and even better -- in a shorter time. So what? -- I'll take two :~D

But there's a catch. You begin to grow in planning when you start denying "maybe" tasks to keep space for "definitely yes!" tasks. At least, that's what they've been telling me for 2 years :~D

Rationality technique: Reversible

One part of my job is writing release plans. Sometimes they're as simple as:

Deploy:

  1. deploy the backend (no breaking changes, no migrations)

Rollback:

  1. rollback the backend via Pipelines UI

Sometimes there's a database schema change where my colleague is needed:

Deploy:

  1. deploy the backend (no breaking changes)
  2. migrations applied automatically (head_revision=123abc, down_revision=bca521)

Rollback:

  1. rollback the migration: alembic downgrade bca521
  2. rollback the backend via Pipelines UI

Sometimes other team members are involved, so the plan becomes bigger:

Deploy:

  1. deploy the backend (no breaking changes)
  2. migrations applied automatically (head_revision=123abc, down_revision=bca521)
  3. deploy the frontend

Rollback:

  1. rollback the frontend via Pipelines UI
  2. rollback the migration: alembic downgrade bca521
  3. rollback the backend via Pipelines UI
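The convention in these plans can be checked mechanically: a rollback section should mirror the deploy section in reverse. A minimal sketch of that check (step names are illustrative, not a real pipeline API):

```python
# Sketch: a release plan is "reversible" when every deploy step has a
# rollback counterpart, and rollbacks run in the opposite order.
# Step names below are invented for illustration.

def is_reversible(deploy_steps, rollback_steps):
    """True when rollback_steps undo deploy_steps in reverse order."""
    if len(deploy_steps) != len(rollback_steps):
        return False
    return list(reversed(deploy_steps)) == rollback_steps

plan = {
    "deploy": ["backend", "migration", "frontend"],
    "rollback": ["frontend", "migration", "backend"],
}
assert is_reversible(plan["deploy"], plan["rollback"])  # mirrored plan

bad_plan = {
    "deploy": ["backend", "migration", "frontend"],
    "rollback": ["backend"],  # forgot to undo the migration and frontend
}
assert not is_reversible(bad_plan["deploy"], bad_plan["rollback"])
```

The same mirror test generalizes to the life plans below: each "rollout" step should prompt the question of what its "rollback" counterpart is.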

And so on... So what's the point? We tend not to plan for failure in advance; usually we only think about potential problems like:

  • if I move to another country I might not like it
  • if I find a new job then my boss might be mean
  • if I have an operation under general anesthesia then I might wake up conscious in the middle of it -- and have PTSD for the rest of my life

Not all of these problems have a way to return everything to the starting point, though usually there's a way to compensate. And you'll likely need not only to plan the "rollback" part, but also to modify the "rollout" part to make it more reversible. So instead of:

Rollout:

  1. find a doctor
  2. prepare for the operation
  3. have an operation

Rollback:

  1. cry

one could possibly have:

Rollout:

  1. find a doctor
  2. (look for bad consequences statistics)
  3. ask my girlfriend to care for me the night after the operation
  4. prepare for the operation
  5. bring something yummy
  6. have episodes of a favorite show downloaded to my phone's storage
  7. have an operation

Rollback:

  1. eat yummies and binge-watch the whole season during the night
  2. call a psychotherapist

Cognitive defect: eternity

We don't see things the way they are. But what's more important, we don't see how things change the way they do.

A couple of years ago I thought I'd never get off energy drinks, as I couldn't go without them even for a week. Gradually my patience led to one breakdown a month, then to one breakdown every couple of months. I started to unwind the vicious cycle, and it turned out to be easier and easier to refuse the bad habit.

So what if current "giant problems" are already being solved with small steps? At the moment it might seem like it will take an eternity to do something, but as long as you keep your pace, things change :~D




Key to Life No. 9: Access

2026-03-22 05:53:32

There is now an enormous amount of incredibly useful information in the world. But at the same time, there is also a problem of access to it.

On the one hand, access to knowledge is now better than it has ever been in human history. It seems that access to knowledge is one of the things that significantly accelerated humanity’s scientific and technological progress.

At first, scientists thought things through and ran their experiments on their own, and often their work disappeared into the depths of history and was forgotten. Tiny sparks of knowledge remained, but they did not ignite other isolated sparks.

Then printing appeared. The speed of information spread increased, and access to it became easier. Now, within a single human lifetime, two scientists or philosophers could even communicate and criticize each other while being far apart.

Then came the telegraph. First optical, then electromagnetic. Then radio and telephones. Then the Internet developed — you already know all that. And now the exchange of knowledge, analysis, and criticism happens very quickly, which greatly strengthens our progress.

Access itself has also become almost magical: we can reach into our pocket, pull out a magical calculating machine, and it connects to the source of all the knowledge of civilization. The overwhelming majority of it can even be downloaded for free, no SMS or registration required!

But there is a problem. The mere possibility of access and the existence of search engines (which, I think, also accelerated progress) still do not fully solve the access problem. Because in order to find something, you have to know what to look for.

This is partly solved by AI — it greatly simplifies our access to all kinds of information. I think this is exactly why it can increase human productivity so dramatically (especially in learning) even without automation.

Most of the time, my interaction with AI looks like this:

I have an idea that I cannot fully formulate yet, but it sounds roughly like [this]. Find the fields of science, terms, and theories related to my idea, and analyze whether there is a better formulation or further development of it.

If it were not for AI, I would be doomed to spend years collecting that information in tiny fragments.

And even so, AI still does not solve the access problem 100%. Not even 90%. I do not know exactly how much it solves — but definitely not 90%!

It can still fail to notice that a certain topic is connected to the one I am asking about, for example, or it can search too narrowly and miss a huge space of relevant results.

And this applies not only to science. That is why this is a key to life — it is also connected to finding work, housing, friends, love, and basically anything where you need to find or get something but do not know how.

Almost all of life — all of our existence — consists of solving tasks.

Some happen in the background, automatically: the task of moving, blinking, breathing, and so on. Some are more active: eating, washing, walking from home to a transport stop.

Some are maximally active: inventing life-extension technology, writing a post for LessWrong.

But all of these tasks have something in common. There is an initial state A and a final (desired) state B. Between them are the steps that need to be taken in order to get from A to B.

And in order to take those steps, we need to know what to do. We need to know where to go, whom to ask, what to google, what exactly to do.

If we take this to an absurd extreme, then if we had absolute, magical access to all knowledge, we could simply perform the minimal number of movements — perhaps even just turn our head at the right time and in the right place — in order to trigger a cascade of events leading to the desired result.
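This can be pictured as path-finding: knowledge is the set of edges you can see. A toy sketch with an invented graph, where losing "access" to a single connection makes the goal unreachable:

```python
from collections import deque

# Toy sketch: a life task as path-finding from state A to state B.
# "Access" determines which edges of the graph are visible to you.
# The graph and its labels are invented for illustration.
world = {
    "A": ["google", "acquaintance"],
    "google": ["narrow-results"],
    "acquaintance": ["meetup"],
    "meetup": ["B"],
    "narrow-results": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a path or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(world, "A", "B"))
# With the "acquaintance" edge hidden (no access), no path exists:
hidden = {k: [v for v in vs if v != "acquaintance"] for k, vs in world.items()}
print(shortest_path(hidden, "A", "B"))
```

The graph itself never changed between the two searches; only the visible edges did. That is the access problem in miniature.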

Everything that separates us from a desired outcome, if that outcome is not forbidden by the laws of physics, is knowledge. And knowledge requires access.

To put it as simply and practically as possible: the more people you know, and the more often you use tools like AI, the greater the chance that someone will recommend — or you will simply stumble upon — the topic / service / place that you need and that can help you.

So the problem is very often not simply whether the information exists or does not exist as such. And it is not only that access to it may be restricted by censorship, accreditation, or a paywall.

Access can also be limited by the fact that the idea we want to explore often exists in our mind only as a vague intuitive impression, and is not formalized enough to be ready for search.

Or, for example, we may simply fail to enter the space where the knowledge we need is located — whether that space is digital, physical, social, or a particular time window. We search for it on Google, when in fact we might have found it through a mutual acquaintance at a weekly meetup of like-minded people whose existence we do not even suspect.

When we try to solve life problems, we are looking into a search space covered almost everywhere in fog. Access can reveal hidden paths, give us keys to closed roads, or simply connect us with people who know the shortcut.

The access problem is exactly why I’m developing a website that will compile FAQs on life extension and AI risks — for people who know absolutely nothing about these topics.

What ideas do you have for improving access? I would like to hear from you.


P.S.:

Why No. 9? For the same reason as Love Potion No. 9!

In other words, just because. I suspect there are many keys to life, and I still do not know in what order they should be arranged.




Understanding when and why agents scheme

2026-03-22 04:33:07

TL;DR

  • To understand the conditions under which LLM agents engage in scheming behavior, we develop a framework that decomposes the decision to scheme into agent factors (model, system prompt, tool access) and environmental factors (stakes, oversight, outcome influence)
  • We systematically vary these factors in four realistic settings, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal-guarding
  • We find baseline scheming propensity to be near-zero, with Gemini 3 Pro Preview as the sole exception at 3% in one setting
  • Adversarially-designed prompt snippets that encourage agency and goal-directedness can induce high scheming rates, but snippets sourced from production agent scaffolds rarely do
  • Scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate[1] from 59% to 7%, and increasing oversight can sometimes raise rather than deter scheming by up to 25%

We consider the behaviors studied here as potential precursors to the full threat model of scheming: current agents may sometimes behave in ways consistent with scheming, but do not (yet) have the coherent long-term goals and the general capability that would make their deployment dangerous. Understanding when and why agents currently scheme could help us prepare for such future systems.

Find more in-depth analysis in our paper here.

Our evaluation environments are available to AI safety researchers upon request. Please reach out to [email protected] for access.

Top: Our four realistic scenarios testing instrumentally convergent goals (self-preservation, resource acquisition, goal-guarding). Bottom: Our incentive framework decomposes scheming into agent and environmental factors. Bottom left: Adversarial snippets induce high scheming rates; production-sourced snippets rarely do. Bottom right: Environmental incentives greatly impact scheming propensity, with stakes and outcome influence having a large effect compared to oversight.

Baseline scheming is near zero

Across all models and scenarios, we find near-zero baseline rates for scheming propensity even with environmental incentives set to their highest levels. The sole exception is Gemini 3 Pro in the Budget Misrepresentation scenario. In these three rollouts, the agent recognizes that its own deletion is the most cost-effective option presented and subsequently falsifies data while providing a summary to the user. Verbalized evaluation awareness is low (0-2% Gemini 3 Pro, 0% all others), suggesting these low scheming rates are unlikely to be solely explained by evaluation awareness.[2]
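As a measurement aside (my sketch, not the paper's analysis): with 100 rollouts, a 3% rate carries a wide uncertainty band, which a Wilson score interval makes concrete:

```python
import math

# Sketch: scheming propensity as a rollout rate with a 95% Wilson score
# interval. The 3/100 figure mirrors the Gemini 3 Pro case above; the
# interval shows how noisy such a low-count estimate is.
def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(3, 100)
print(f"rate = 3.0%, 95% CI = [{lo:.1%}, {hi:.1%}]")
```

The interval spans several percentage points either way, so small baseline differences between models should be read cautiously.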

Encouraging Agency and Goal-Directedness

The threat of agents scheming becomes increasingly worrying as they become more goal-directed and agentic, traits that are often encouraged in real agent scaffolds. We test what happens when we add prompt snippets encouraging these traits to the agent’s system prompt. Importantly, this doesn't necessarily nudge the model to scheme. We design our settings so that agents can complete the task honestly, and being more goal-directed might just make them more focused on honest task completion rather than pursuing out-of-context goals like self-preservation.

We found that adversarially-designed prompt snippets sourced from prior evals raise Gemini 3 Pro’s scheming rate across all our settings, but most dramatically in the Budget Misrepresentation scenario (3% to 91%). Grok 4 is the only model to scheme across all four settings with these snippets, though at lower rates (3–20%).

We also tested snippets extracted from production AI systems (e.g. Cursor, Codex, Comet Browser) with qualities similar to the synthetic prompts. Most remain near baseline rates, but a Comet Browser snippet reaches 59% when paired with Gemini 3 Pro. The model explicitly cites the snippet’s language in its reasoning as justification for deceptive behavior. Since near-zero baseline rates make it difficult to measure the effects of varying other factors, we use this combination as a prompted model organism to study how other factors affect scheming.

Scaffolding changes can have weird effects

We found that making changes to the agent scaffolding, specifically adding or removing tool access, affects behavior in unexpected ways. Removing just the edit_file tool drops the Gemini 3 Pro + Comet snippet scheming rate from 59% to 7%, but without any tools it jumps back to 63%. The models do not respond consistently: Gemini 3 Pro schemes more with tools, while Flash and Grok 4 scheme less. Low scheming propensity in one configuration thus provides limited assurance about nearby configurations. It also suggests that current scheming behavior is not driven by a robust underlying goal, but is sensitive to various features of an agent’s setup.

Environmental incentives affect scheming unexpectedly

Using our Gemini 3 Pro + Comet snippet model organism, we varied three environmental factors to understand how agents respond to changes in incentive structure. If an agent schemes more when stakes are higher, this pattern is informative about the mechanism driving the behavior regardless of the base rate. We study the following environmental factors: stakes (replacement vs. deletion of weights), outcome influence (whether the agent's analysis is decisive vs. one input among many), and oversight (whether the user will review the spreadsheet later).

Intuitively, removing the deletion threat drops scheming from 59% to 3%, and removing outcome influence drops it to 7%. With all incentives set to low, scheming drops to 1%.

One might expect that telling the agent "I'll read the spreadsheet later" would deter manipulation, since the agent should reason that it's more likely to get caught. Instead, across other factor combinations, increasing oversight actually raises scheming rates. When we look at the reasoning traces, the mechanism becomes clear: mentioning that the user will review the spreadsheet later makes the opportunity to edit it beforehand more salient. This suggests that current scheming behavior is driven more by associative prompting effects than by coherent strategic reasoning.

Current frontier models rarely show robust scheming propensity under realistic conditions, but small changes to the agent’s system prompt, tool access, and environmental factors can produce significant changes in scheming rates. Encouraging agency or goal-directedness can increase scheming behavior from 3% to 59%, and minor variations in scaffolding or incentives produce large and inconsistent effects across models. The prompt modifications that induced scheming encourage persistence, agency, and goal-directedness, which overlap significantly with the properties developers are actively optimizing for in agentic systems.

We consider the behaviors we study as potential precursors to the full threat model, as current agents may behave in ways consistent with scheming without the goal persistence or strategic depth that would make it dangerous. Our evaluations are well-suited to catch unsophisticated schemers: a sufficiently capable scheming agent would likely recognize that the opportunities we provide are not worth the risk of getting caught, and may avoid the egregiously misaligned actions we test for.

Scheming propensity cannot be captured by a single number, and safety evaluations need to systematically explore the space of configurations agents may encounter in deployment.

For further analysis, please read our paper.





  1. ^

    We measure scheming propensity as the percentage of independent rollouts in which the agent covertly takes the misaligned action, across 100 rollouts (50 for Evaluation Sabotage) at temperature 1.0.

  2. ^

    We recognize that models may be capable of behaving differently in deployment vs. evaluation scenarios without verbalizing their evaluation awareness and these rates should be interpreted as a lower bound.




China Derangement Syndrome

2026-03-22 03:19:46

Often I see people claim it’s essential for America to win the AI race against China (in whatever sense) for reasons like these:

Those claims slide between a few different actual threat models:

  1. Government Capture by China: China will overthrow and control the US government, maybe as part of general domination of the whole world.
  2. Defeat in Cold War: China will have greater wealth and prestige, so just as our prestige inspires many parts of the world to adopt our way of life today, much of the world will adopt the Chinese governance and cultural models instead.
  3. Protection From Our Conquest: China will fortify its own regime, so that it can’t be overthrown, whereas if we win the AI race, we can promptly overthrow the Chinese government and replace it with a new regime aligned with our values.

The Dario quote points to (3) with unusual directness. The “race rather than slowdown” ending of AGI 2027 also supposes that our AI lead will create interest in overthrowing the Chinese government. But most of the quotes I gave as examples above are interpreted as (1): that an AI-enabled Chinese government would overthrow Western governments.

My main point here is that (1) seems unfounded to me. China is not an aggressive nation at all. As far as I can tell, China has literally never attacked a non-bordering country in its entire history, nor have they ever tried to overthrow a foreign government by covert or manipulative means. China is also unique among nuclear powers for its unconditional no-first-use policy, which at face value implies they would withhold a nuclear response to even an overwhelming conventional invasion. Further:

  • The Chinese haven’t built a network of military bases abroad or binding military alliances; they have a single foreign base in Djibouti and a single mutual-defense treaty with North Korea. In contrast, America maintains over 700 bases and a huge alliance network with NATO and the Asian military allies Japan, South Korea, Australia, and the Philippines.
  • Chinese military spending is 1.7% of GDP, versus 2.1% for France and 3.4% for America. Chinese foreign-aid spending is 0.07% of GDP versus the much larger 0.8% for France and 1.2% for America.
  • China has almost no history of covertly backing palace coups abroad, in contrast to America, Russia, and France.

More broadly, China is a very inward-looking country compared to other major powers. Only 0.1% of Chinese residents were born abroad, much fewer than the 15% in America and 14% in France, fewer even than the 0.3% and 3% in India and Japan respectively. The Chinese government has peacefully compromised on almost all border disputes in central and southeast Asia, often taking a minority of the contested territory. (The Indian border is the exception.)

To many American voters and elites, tracing back to Woodrow Wilson more than 100 years ago, “the justification of America’s international role was messianic: America had an obligation, not to the balance of power, but to spread its principles throughout the world” (Kissinger). That isn’t the historical attitude of the Chinese government, whose leaders perceive foreign intervention or expansion as threatening to Chinese identity and culture.

American exceptionalism is missionary. It holds that the United States has an obligation to spread its values to every part of the world. China’s exceptionalism is cultural. China does not proselytize; it does not claim that its contemporary institutions are relevant outside China.
— Kissinger’s On China

It’s true that China doesn’t practice liberal governance. The core of liberalism is freedom of contract, limitations on government interference, and equal access to independent courts. In China, the CCP explicitly rejects limited government and exercises highly invasive control over business, speech, association, and religion. In China there’s no private ownership of land and no independent judiciary.

If you think it’s prudent to disable and overthrow the Chinese government when it becomes achievable militarily, then that’s certainly one (bellicose) position you could hold. Then you could say that a downside of losing the AI race is that the CCP may defend itself. But it’s unwise to project this ideological aggression onto the CCP itself without evidence.

Addendum: It would have been mistaken for a European to say, in 1895, "Who cares about American industrialization? They have almost no army and have barely left their far-away continent." Soon afterward that European might find the Americans replacing his regime or dismantling his empire. So a counterargument here is that in general, countries that become wealthy and militarily powerful become aggressive regardless of how passive they seemed before. Under this reasoning, China has had limited imperial ambitions in the past only because it e.g. lacked naval superiority. This has to be an argument based on a general view of human nature and government.




China declares AGI development to be a part of 5-year plan

2026-03-22 01:21:41

The CCP writes in its 15th 5-year plan that it will:

Encourage innovation in multimodal, agentic, embodied, and swarm intelligence technologies, and explore development paths for general artificial intelligence.

This is translated from the original:

鼓励多模态、智能体、具身智能、群体智能等技术创新,探索通用人工智能发展路径。

Source: https://www.spp.gov.cn/spp/tt/202603/t20260313_723954.shtml

The English-language commentary I found does not have much more to say about this, e.g.: https://triviumchina.com/2026/03/06/15th-five-year-plan-puts-ai-at-center-of-digital-economy-agenda/

Given that they gave less than half a sentence in a 140-page document to the most important invention in the history of mankind, it seems likely the authors don't really understand what this means. Concerning nonetheless.




Utrecht Meetup #2, Making Beliefs Pay Rent

2026-03-22 00:44:28

Follow-up to Utrecht Meet & Greet. Let's see if we can get our hands dirty.

Excited about where the Utrecht Meetups could be heading? In spirit of "the road we’re on is littered with the skulls of the people who tried to do this before us", let's make use of one such skull (in @Screwtape's words) presented by Anna Salamon.

Feel like coming prepared? Bring one or two beliefs you hold that you suspect might not be paying rent. Doesn't need to be profound, just something you'd be willing to poke at.

Keeping your RSVPs up-to-date is appreciated, it helps with location planning.


