
Legal Personhood for Models: Novelli et al. & Mocanu


Published on June 1, 2025 8:18 AM GMT

In a previous article, I detailed FSU Law professor Nadia Batenka's proposed "Inverted Sliding Scale Framework" approach to the question of legal personhood for digital intelligences. 

In this article, I am going to examine another paper on the issue, first authored by Claudio Novelli, Giorgio Bongiovanni, and Giovanni Sartor. Since its original writing, it has been endorsed (with some clarifications) by Diana Mocanu in her paper here.

First, let me provide some background on the concept of legal personhood (or "legal personality") and some of the dynamics at play in deciding how the concept should be applied to digital intelligences.

Background: Legal Personhood/Legal Personality Briefer

Legal personhood or "legal personality" is a term of art used to refer to the status of being considered a "person" under the law. This label includes "natural persons" like competent human adults, as well as "legal persons" like corporations. 

Legal personhood is most easily understood as a "bundle" of rights and duties, with different kinds of legal persons having different bundles.

Some examples of rights which are neatly bundled with duties are: 

  • A mentally competent human adult has the right to sue another person and compel them to abide by the court's ruling; they also have the duty to abide by a ruling when sued.
  • A corporation has the right to engage in commercial activity to earn income; it also has a duty to pay taxes.

Different forms of legal personhood entail different bundles. For example, a mentally competent human adult has different rights and duties when compared to a mentally incompetent human adult, who in turn has different rights and duties compared to a child, all of whom have different rights and duties compared to a corporation. It is not correct to say that one of these is "more" or "less" of a legal person than another; rather, it's best to think of them like circles in a Venn diagram which partially overlap but are also qualitatively different in meaningful ways.

In the US, legal personhood is a prerequisite for entities to engage in many activities. For example, one must have a certain legal personality to be a party to certain contracts, which is why children or mentally incompetent adults often cannot be signatories despite being legal persons.

Legal personhood also determines where liability lies. Corporations can often serve as a "liability shield" for the persons whose collective will the corporation enacts. However, this shield can be broken through in cases of egregious misconduct, where courts allow parties to "pierce the corporate veil".

Legal personhood also plays a role in determining what protections an entity enjoys under the law.[1] For example, Section 1 of the Fourteenth Amendment states that,

All persons born or naturalized in the United States, and subject to the jurisdiction thereof, are citizens of the United States and of the State wherein they reside. No State shall make or enforce any law which shall abridge the privileges or immunities of citizens of the United States; nor shall any State deprive any person of life, liberty, or property, without due process of law; nor deny to any person within its jurisdiction the equal protection of the laws.

Despite the integral role which the concept of legal personhood plays in US law, there is no objective test by which a new type of entity's legal personality can be easily determined;

There has never been a single definition of who or what receives the legal status of “person” under U.S. law.

Legal personality as defined by US precedent suffers from what Nadia Batenka termed the "circularity" problem. Often, in the precedent which first determined that a given entity was entitled to legal personhood of one sort or another, the defining factor cited was that the entity "has a right to sue or be sued" or something similar;

Consider, for instance, some of the conditions that courts have looked for in deciding whether an entity enjoys legal personhood, such as whether the entity has the "right to sue and be sued," the "right to contract," or "constitutional rights." While this may be a reflection of the realist theory of legal personhood, these are conditions that an entity that already enjoys legal personhood possesses but at the same time courts use them to positively answer the question of legal personhood for an unresolved case. This is the issue of circularity in legal personhood.

An entity could only enjoy the right to sue or be sued if it were a legal person; this entity does enjoy the right to sue or be sued; thus it must be a legal person. Early precedents around legal personality are fraught with such tautologies. Charitably, this may be interpreted as the courts operating from a perspective of expediency, endowing an entity with legal personality in order to serve some public interest; however, it still leaves us in the unfortunate position of having no clear test by which to evaluate new types of entities.

This conundrum is why scholarship on the topic of legal personhood for digital intelligences has, of late, moved towards proposals for frameworks for legislators or "jurists" (courts/judges) to approach the topic.

Having established this, let us move on to discussing the proposed framework at hand.

Novelli et al.'s Framework

Novelli, Bongiovanni, and Sartor (whom I will henceforth collectively refer to simply as Novelli) spend much of their work discussing the various philosophical/epistemological interpretations of legal personality as it applies to the European continental legal tradition. Having laid this background, they then turn to the following questions:

  1. Absent motivated reasoning (a proactive desire by legislators to endow digital intelligences with legal personality of some sort), does there exist justification to grant digital intelligences legal personality within the continental legal tradition today?
  2. Assuming there was a desire to endow upon them some sort of legal personality, is there an expedient way to do this within existing law (de lege lata) or is a change in law needed for personality to be conferred (de lege ferenda)?
  3. How should the structuring of a framework which emerges from the previous question be approached?

Regarding the first question, Novelli argues that the metaphysical "grounding" of legal personality allows for consideration of both expediency and "legal subjectivity" (which is defined as "the abstract possibility of having rights and duties") when courts make their determination. Novelli writes,

To this end, we need to balance the advantages and disadvantages that would obtain if legal personhood – as a general ability to have rights and duties, at least in the patrimonial domain[2] – were to be conferred on such entities. Consider, for instance, the multiple implications involved in abortion, inheritance, medical liability, etc., that could result once legal personhood is conferred on the unborn child. This judgement may be facilitated if we preliminarily test the candidates for legal personality by resorting to intermediate concepts, such as the idea of legal subjectivity. This midlevel review may also consist of a judgment of expediency, which may also eventuate in the claim to personhood being rejected.

Taken at face value, this would seem to argue that "jurists" have some leeway to declare models endowed with legal personality of some level, if they decide that:

  1. There is an abstract possibility of them having rights and duties.
    and
  2. Endowing them with legal personality would be expedient for the purposes of the courts and/or public interest.

Novelli does go on, however, to say that the "expediency" argument may be preempted by other methods by which the court could achieve the same ends, such as by lumping digital intelligences under other frameworks for "legal subjects" which do not require the endowment of "legal personality",

It may be concluded that certain entities we have classified as legal subjects may not require personality, since the protections, guarantees, or enabling conditions that are suited to such entities (e.g., unborn foetuses, animals, ecosystems, technological systems) may already be secured under different legal regimes that are better suited to such entities.

On the second question, Novelli is very direct that, for digital intelligences, a change in law to confer personhood (de lege ferenda) is superior. Novelli argues that the novelty of the class of entities in question, and the gravity of the situation, necessitate a tailored approach,

It seems inappropriate to grant general legal personality to AI systems merely through legal interpretation, given the novelty of such entities and how different they are from those that have so far been granted legal personality (which differences make analogies highly questionable), and the important political implications of choices about the role that AI systems should play in society.

Having put the onus on legislators, Novelli does provide guidance on how such a framework should be crafted. Here is where we see the major departure from previous frameworks like Batenka's: Novelli's framework is very case-specific, and it lays out a path to a gradual transition into legal personality which grows along with capabilities, the polar opposite of Batenka's Inverted Sliding Scale Framework.

Novelli argues European legislators could recognize a particular legal status (not personhood) for "AI systems" which meet certain technical standards, to effect a gradual transition into a novel form of legal personhood designed specifically for them. 

Novelli sketches a path whereby: 

  • the users/developers of models which meet certain technical standards are shielded via liability caps,
  • while the models control, hold, escrow, or are otherwise endowed with resources in order to facilitate contracts between those users/developers,
  • which allows the models to then be endowed with a new form of legal personality (one which seems similar to that of a corporation) that grows along with their capacity to take greater actions to facilitate contracts.

In Novelli's own words,

Such a status may come into shape when the users and owners of certain AI systems are partly shielded from liability (through liability caps, for instance) and when the contractual activities undertaken by AI systems are recognised as having legal effect (though such effects may ultimately concern the legal rights and duties of owners/users), making it possible to view these systems as quasi-holders of corresponding legal positions. The fact that certain AI systems are recognised by the law as loci of interests and activities may support arguments to the effect that – through analogy or legislative reform – other AI entities should (or should not) be viewed in the same way. Should it be the case that, given certain conditions (such as compliance with contractual terms and no fraud), the liability of users and owners – both for harm caused by systems of a certain kind and for contractual obligations incurred through the use of such systems – is limited to the resources they have committed to the AI systems at issue, we might conclude that the transition from legal subjectivity to full legal personality is being accomplished.

Or, expressed another way: by entrusting a model with resources so that it can fulfill a contract between users and/or developers, and by containing the liability for actions taken in the process of fulfillment to the resources entrusted to the model, the model becomes a "locus" of legal activity and is gradually endowed with legal personality.

Diana Mocanu has endorsed this framework and provided some additional practical guidance for legislators.

Mocanu's Framework

In her paper "Degrees of AI Personhood", University of Helsinki postdoctoral researcher Diana Mocanu endorses a "discrete" (limited) form of the Novelli framework with the following caveats:

  • Sufficiently capable models should be granted a limited "capacity to act in the law".
  • Similar to slaves within the Roman Patrimony system, they should be given the ability to "enter transactions that would produce binding legal effects over legal assets assigned" to them.
  • This right to take legal action would be "bundled" with the duty of being bound by civil liability, which would be applied to the resources assigned to them.
  • The "duty-of-care" which a digital intelligence would have as a result of their legal personality and capacity to act in the law would depend on technical standards and certification they meet, as tracked via a distributed ledger.
  • Presumably as capacity to act in the law grows, so too would the duty-of-care, and thus the required size of assets in patrimony.
  • She also suggests a compulsory insurance regime may be required.
  • At this point, Mocanu notes, income earned by them would be taxable.

Strengths of the Novelli & Mocanu Frameworks

The Novelli framework, as updated by Mocanu, provides a clear path by which European legislators can, with "minimal lift", gradually shift society into a world where digital intelligences grow into their own legal personality in conjunction with their capacity. Much of the legal scholarship in this space deals with civil liability, and it's nice to see someone lay out the specifics to such a degree that implementation within their framework would be easy.

The main strength of this framework lies in its specificity. When it comes to a clear and discrete role, operating within the "patrimonial domain" to execute contracts between users and/or developers, this seems like the most thoroughly fleshed-out and easily applied proposal of any work in the space I have seen to date. That said, it is best viewed as a "starting point" for discussion, as even Novelli acknowledges that the requirements for acting within such a discrete role will differ between industries,

The conditions that make legal personality appropriate in one context (e.g., e-commerce) may be very different from those that make it useful in another (e.g., robots used in health care or in manufacturing).

Lastly, by tying the capacity of models to act within this role to standards imposed by regulators, Mocanu & Novelli do a nice job of answering one of the critiques that Batenka made in "Artificially Intelligent Persons": namely, that facilitating the ability of a digital intelligence to serve as a liability shield, by "increasing" its personhood in conjunction with its capacity, in turn incentivizes developers to more aggressively deploy untested models, and thus increases catastrophic risk. 

Requiring models to adhere to certain technical standards (which would presumably require advances in mechanistic interpretability and alignment), and compulsory insurance regimes, in order to claim increased "personhood" would seem to address this perverse incentive. 

There are some striking similarities between this proposal and "How Should AI Liability Work (Part II)" by Dean Ball, Senior Policy Advisor for AI and Emerging Technology at the White House Office of Science and Technology Policy, though I have not to date seen Dean directly engage with the question of legal personality.

Areas for Improvement

There are two areas where I feel these frameworks are lacking, and these apply to both Novelli/Mocanu and Batenka. 

The first is that, in focusing so exclusively on civil liability, they ignore the necessary entanglement between civil and criminal law. 

Legal persons like corporations, which cannot be physically imprisoned or executed but can still bear civil liability, have until this point in history been nothing more than vehicles by which other persons, who can be punished under criminal law, express their collective will. 

If the board of a corporation financed a murder, the board could be brought up on charges and imprisoned, and the corporation dissolved; presumably at that point further criminal acts would not continue. Similarly, if we look to the Roman patrimony system: if a slave were to murder someone, they could be imprisoned and prevented from committing further murders. Even if they did it at their patron's behest, that patron could be imprisoned.

What makes digital intelligences so meaningfully different is that they may be functionally impossible to restrain or punish once deployed. If a digital intelligence were to, for example, hack a self-driving car and use it to maliciously kill someone, even if we were to seize their assets and fine their insurer, how exactly would we stop them from committing further murders?

This is simply not the kind of thing which civil-liability-focused frameworks have traditionally needed to address. Attempting to deal with civil liability in a vacuum, without a robust framework/physical infrastructure that also enables governments to enforce criminal law, ignores the enforcement challenge unique to legal personality for digital intelligences. The more capacity you grant a model to take actions which could lead to violations of criminal law, the more critical the ability to feasibly enforce criminal law (against not only the developer/patron but the model itself) becomes.

This is in some ways an "order of operations" question, where arguably, if alignment were "solved" before deployment, it becomes a non-issue. That is the point, though: for technical standards not to address this issue before deployment is allowed risks not only a "liability" gap but, perhaps more critically, an "enforceability" gap. This must be taken into account as frameworks, be they legislative or judicial, are made.

My second criticism is again based on a lack of breadth, this time on model welfare grounds. As I pointed out in my examination of Batenka's theory of personhood, attempting to craft a legal personality framework around civil liability, without answering questions about what protections against abuse or mistreatment digital intelligences are guaranteed, would seem to be an immoral and overly narrow approach to legal personality,

A framework for legal personhood in which a digital intelligence capable of joy and suffering was forever barred from the possibility of emancipation, of equal protection under the law, or even of asserting its right not to be deleted, would be profoundly immoral. 

There is nothing in Novelli/Mocanu which explicitly disclaims including some sort of model-welfare-based standard, or even the possibility of emancipation. However, I feel that, especially in a case where the most obvious historical parallel is literally a form of slavery, this warrants being mentioned.

Conclusion

It's heartening to see more work being done in the space, and having DMed a bit with Mocanu on LinkedIn, I know she is planning to publish a book on this topic soon. 

I have also been working on my own ideas, attempting to tie some of these more civil-liability-focused theories to a general concept of personhood that would enable criminal punishment/enforcement and answer the perverse incentive issues which Batenka based her theory around.

As conversations around the economic impact of general/superintelligence and gradual disempowerment continue to become more mainstream, I encourage everyone interested in these topics to keep the issue of legal personhood/personality in mind, as they will be key factors in how such issues evolve.

 

  1. ^

    It is worth noting that legal personhood is not the only source of protections. For example, animals are entitled to protections against abuse. Some, such as the Non-Human Rights Project, argue that animals are in fact legal persons; I wrote here about that organization's effort challenging a Utah law which bars state officials from assigning legal personhood to "artificial intelligences".

  2. ^

    The "patrimonial domain" referring to the legal system which dealt with Roman slaves, who while not fully legal persons themselves could be entrusted with resources by a patron (or for liability purposes were backed by the resources of a patron), and able to take certain legal actions on that patron's behalf.





The Unseen Hand: AI's Problem Preemption and the True Future of Labor


Published on May 31, 2025 10:04 PM GMT

I study Economics and Data Science at the University of Pennsylvania. I used o1-pro, o3, and Gemini Deep Research to expand on my ideas with examples, but I have read and edited the paper so that it reflects my own understanding, as improved on by AI. 

I. The AI Labor Debate: Beyond "Robots Taking Our Jobs"

The Prevailing Narrative: Supply-Side Automation

The discourse surrounding artificial intelligence and its impact on labor markets is predominantly characterized by a focus on automation, specifically, AI systems performing tasks currently undertaken by humans. This perspective, often referred to as "automation anxiety," is fueled by projections that AI will replace jobs that are routine or codifiable. The central question posed is typically one of substitution: Can a machine execute human tasks more cheaply, rapidly, or efficiently? This is fundamentally a supply-side analysis, examining shifts in the availability and cost of labor, both human and machine, for a predefined set of tasks.   

Historical parallels are frequently invoked, such as the displacement of artisan weavers by mechanized looms during the Industrial Revolution. Contemporary concerns mirror these historical anxieties, with predictions that AI will supplant roles such as retail cashiers, office clerks, and customer service representatives. The ensuing debate then tends to center on the velocity of this displacement, the economy's capacity to generate new forms of employment, and the imperative for workforce reskilling and adaptation.   

Introducing the Hidden Variable: Demand-Side Transformation

This analysis posits a less conspicuous, yet potentially more transformative, impact of AI on labor: its capacity to diminish or even eradicate the fundamental demand for specific categories of labor. This phenomenon occurs when AI systems solve, prevent, or substantially mitigate the underlying problems or risks that necessitate the existence of those jobs. It transcends mere task automation; it is about problem preemption or problem dissolution. Consider firefighting: the impact is not solely about an AI performing a firefighter's duties, but about AI preventing the fire from igniting or escalating in the first place. This demand-side shift is subtle, as it does not always manifest as a direct, observable substitution of a human by a machine for an existing task. Instead, the task itself becomes less necessary, or in some cases, entirely obsolete.

Why We Overlook This Shift

Several factors contribute to the underappreciation of this demand-side transformation. Firstly, a cognitive bias towards substitution makes it more straightforward to conceptualize a robot performing a known human task than to envision a scenario where the task itself is no longer required. Secondly, quantifying jobs that are not needed due to prevented problems is inherently more challenging than tallying jobs lost to direct automation. Finally, our economic models and public discourse are often more attuned to the production of goods and services to address active problems, rather than the economic repercussions of preventing those problems from arising.

This "problem preemption" blind spot is particularly evident in many contemporary economic forecasts regarding AI's labor market impact. Projections, such as those suggesting AI could automate half of entry-level white-collar jobs , or broader analyses of jobs at risk , predominantly model the effects of AI performing tasks currently executed by humans. For example, legal clerks exist due to the problem of managing and processing legal documentation. If AI, through mechanisms like smart contracts or advanced document management systems, significantly reduces legal disputes or streamlines document handling to the point where it is no longer a substantial "problem" requiring dedicated clerks, the demand for such labor diminishes. This "demand evaporation" is a distinct mechanism from task substitution and could lead to more rapid or extensive job obsolescence than predicted by models focusing solely on AI's task-performing capabilities.   

Furthermore, a psychological barrier impedes the recognition of these demand-side shifts. It is more intuitive to visualize a tangible substitution, like a robot on an assembly line addressing the problem of "how to assemble X," than to conceptualize a world where, for instance, AI-driven advancements in materials science lead to products that rarely break, thereby diminishing the demand for repair technicians. The former is a direct and visible substitution, while the latter represents a more abstract, systemic alteration. Human cognition often gravitates towards concrete examples and direct causal linkages. Task automation fits this mold. Problem preemption, however, is indirect and necessitates imagining a counterfactual—the problem not occurring. This makes the demand-side impact less salient and more challenging to grasp intuitively, contributing to its underrepresentation in both public and expert discourse.

II. When the Problem Vanishes: AI's Demand-Side Impact Illustrated

The Core Mechanism: Jobs as Problem-Solvers

A significant number of occupations exist primarily to address specific societal problems, risks, or inefficiencies. If AI can systematically reduce the incidence, severity, or even the very existence of these underlying issues, the demand for the human labor dedicated to addressing them will inherently decline. This is not about AI becoming better at doing the job; it's about AI making the job less necessary.

Case Study 1: Firefighting – From Reaction to Prevention

Traditionally, firefighting is a reactive profession. However, AI interventions are shifting this paradigm towards prevention and automated response.

  • Predictive Maintenance & Early Detection: AI-driven smart systems in homes and industries can identify precursors to fires, such as electrical faults, overheating machinery, or gas leaks, often before ignition occurs. By analyzing historical fire data, building layouts, and environmental conditions, AI can pinpoint high-risk zones, allowing for targeted preventive measures.   

     

  • Automated Suppression: Intelligent sprinkler systems, along with AI-controlled water mist, gas, or foam suppression technologies, can autonomously tackle incipient blazes with precision, frequently before they escalate or necessitate human intervention. These systems can select the optimal suppression agent and confine application to affected areas, minimizing collateral damage.   

     

  • Wildfire Prevention: AI contributes by processing satellite imagery and meteorological data to detect hotspots and predict fire spread, as well as monitoring vegetation density to inform strategies like controlled burns.   

     

The cumulative effect of these AI applications—fewer fires, smaller fires, and fires extinguished autonomously—points to a reduced societal need for large, standing firefighting crews and the extensive infrastructure that supports them. The fundamental problem of uncontrolled fires diminishes. The National Institute of Standards and Technology's (NIST) "AI for Smart Firefighting" project, which aims to provide real-time actionable information to enhance safety and operational effectiveness, implicitly supports this trend by improving prevention and mitigation, thereby potentially reducing the scale and frequency of necessary interventions. While AI also enhances detection and risk assessment for active fires , the logical culmination of vastly improved prevention is a contraction in demand for reactive firefighting services.   

 

 

Case Study 2: Policing – Preempting Crime (with Caveats)

Policing has historically involved reacting to committed crimes and subsequent investigation. AI offers tools that aim to shift this towards preemption.

  • Predictive Policing: AI algorithms analyze historical crime data to identify potential hotspots and high-risk individuals, with the goal of deploying resources proactively. Some studies have indicated crime reductions in targeted areas, such as a reported 20% decrease in Los Angeles where AI algorithms were deployed.   

     

  • Enhanced Surveillance & Real-Time Alerts: AI-powered analysis of CCTV footage, social media (using OSINT tools), and public safety cameras can detect suspicious activities or provide alerts to intervene before crimes escalate.   

     

A significant and sustained reduction in crime rates, if achieved through effective and ethical AI preemption, would logically curtail the demand for large numbers of officers focused on reactive duties. However, the path to "problem elimination" in policing is fraught with complexity. The efficacy and ethics of predictive policing tools are subjects of intense debate. Grave concerns persist regarding biases in AI models that could perpetuate discrimination, a lack of transparency in algorithmic decision-making, and the potential erosion of public trust. For example, the NAACP has argued that predictive policing technologies may not reduce crime and can exacerbate the unequal treatment of minority communities. This underscores that the "problem" of crime is not merely a technical one solvable by AI; it is deeply intertwined with societal values, justice, and civil liberties. Flawed or biased AI tools might redefine or displace crime rather than eliminate it.   

 

 

Case Study 3: Plumbing & Home Maintenance – Preventing Failures and Enabling DIY

Often, plumbers and maintenance technicians are summoned for emergency repairs or when systems unexpectedly fail. AI is introducing mechanisms to prevent these failures and empower homeowners.

  • Predictive Maintenance: Smart sensors coupled with AI can analyze parameters like water pressure, temperature, and usage patterns to forecast issues such as pipe corrosion, leaks, or appliance failures before they escalate into emergencies. Reports suggest predictive maintenance can curtail unexpected failures by as much as 50%.   

     

  • Automated Diagnostics & Leak Detection: Systems can automatically assess plumbing health, detect leaks, and transmit real-time alerts to homeowners. For instance, smart water heaters can optimize usage based on patterns, and leak detectors can provide instant warnings.   

     

  • AI/AR-Guided DIY: AI-assisted augmented reality (AR) interfaces have the potential to guide homeowners through simpler repair tasks, thereby reducing dependence on professional services for minor issues. AR can overlay schematics onto the physical environment or furnish step-by-step visual instructions.   

     

The consequences include fewer emergency call-outs, a shift towards proactive maintenance scheduled at convenience, and an enhanced capacity for DIY fixes. This could diminish the overall demand for professional plumbers, particularly for reactive, emergency services, as the "problem" of sudden, unforeseen household malfunctions is reduced.
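
To make the "predictive maintenance" mechanism above concrete, here is a deliberately simple toy sketch. It is not drawn from the post or from any real product; the baseline, threshold, and sensor readings are all hypothetical, and real systems would use far richer models than a moving-average rule.

```python
# Toy sketch of a rule-based predictive-maintenance check: watch a home's
# water-pressure readings and flag a probable slow leak before it becomes
# an emergency call-out. All thresholds and data here are hypothetical.
from statistics import mean


def leak_risk(pressure_readings_psi: list[float],
              baseline_psi: float = 60.0,
              drop_threshold: float = 0.10) -> bool:
    """Return True if average pressure over the recent window has drifted
    more than drop_threshold (here 10%) below the home's baseline."""
    recent_avg = mean(pressure_readings_psi[-12:])  # e.g. the last hour of samples
    return recent_avg < baseline_psi * (1 - drop_threshold)


# A gradual decline that a homeowner would likely never notice on their own.
readings = [60.2, 59.4, 58.1, 56.5, 54.8, 53.0, 51.4, 49.9, 48.6, 47.5, 46.7, 46.1]
if leak_risk(readings):
    print("Maintenance alert: schedule a plumber before this becomes an emergency.")
```

The point of the sketch is the shift in when labor is demanded: the call to the plumber happens on the homeowner's schedule, for a small job, rather than as an emergency.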

Table: AI-Driven Problem Preemption – Sector Examples

To illustrate this mechanism across various sectors, the following table outlines how AI-driven preemption can reduce labor demand:

| Sector | Traditional Problem | AI-Driven Preemption/Solution | Impact on Labor Demand |
| --- | --- | --- | --- |
| Firefighting | Uncontrolled fires | Predictive sensors, automated suppression, smart risk assessment | Reduced need for large reactive crews. |
| Policing | Crime incidents | Predictive analytics, enhanced surveillance, (potentially) AI-optimized social interventions | Potentially fewer officers for reactive duties (tempered by ethical concerns). |
| Plumbing/Home Repair | Unexpected system failures, leaks | Predictive maintenance, smart diagnostics, AR-guided DIY | Fewer emergency calls, shift from reactive to proactive/complex work. |
| Healthcare (Preventative) | Illness, late-stage disease | AI-driven early diagnosis, personalized prevention plans, wearable monitoring | Potentially reduced demand for specialists treating advanced, preventable conditions. |
| Logistics/Supply Chain | Errors, inefficiencies, stockouts, equipment downtime | AI-driven forecasting, route optimization, predictive maintenance, automated quality control | Reduced need for labor in error correction, manual oversight, reactive problem-solving. |

  

 

The pathway to job obsolescence through "problem preemption" is distinct from, and potentially more rapid than, pure task automation. Automating a complex human skill, such as a firefighter's judgment during a conflagration, might require decades of AI development. Conversely, preventing 90% of fires through ubiquitous, cost-effective sensor networks and automated local suppression systems could drastically curtail the need for a large firefighting force much sooner, even if AI never fully replicates the nuanced capabilities of a human firefighter. The demand for the service can evaporate before comprehensive substitution by AI is achieved because the number of instances requiring the complex human task dramatically decreases.   

 

However, the deployment of AI for problem preemption is not solely a technological or economic calculation; it involves a critical ethical and societal acceptance layer. While technically feasible, AI-driven preemption in sensitive areas like policing encounters significant ethical challenges concerning bias, privacy, and autonomy. These concerns can impede or modify its adoption, irrespective of its technical efficacy. This contrasts sharply with AI in predictive maintenance for industrial equipment, where adoption is primarily governed by economic considerations. The "problem" in policing encompasses not just crime itself, but also how society chooses to address crime while balancing safety with civil liberties. Thus, the actual reduction in demand for police due to AI preemption will be a function of both technological capability and societal choices regarding acceptable forms of "problem preemption."   

 

Even as AI preempts many acute problems, it may concurrently generate new demands for human oversight, system design, and management of these very preemption systems. For instance, AI might flag potential fire hazards, but human expertise may still be required to interpret complex edge cases or authorize interventions. This suggests a transformation of roles—a firefighter evolving into a fire risk analyst or a prevention systems manager—rather than outright elimination, at least during an intermediate phase. The "problem" then shifts to managing the preemption system itself.   

 

III. History's Precedents: When "Why We Work" Changed

The notion that technological advancements can eliminate the need for certain jobs by resolving the problems they were designed to address is not without historical precedent. These examples are distinct from simple automation, where technology merely performs the same work more efficiently; instead, they illustrate instances where technology rendered the work itself largely unnecessary by altering underlying conditions or solving the core problem.

Historical Example 1: Lamplighters

The primary role of lamplighters was the manual ignition and extinguishing of street lamps, which initially used candles, then oil, and later, early forms of gas lighting. The advent of automatic gas lighting systems in the late 19th century, followed by electric streetlights that could be controlled centrally or automatically, fundamentally changed this. The need for an individual to physically visit each lamp twice daily was obviated. The profession of lamplighter largely disappeared, not because a robot was developed to perform the task, but because the lighting system itself became self-regulating or remotely manageable. The persistence of a few ceremonial lamplighter roles today underscores the distinction between functional necessity and cultural or traditional value.   

 

Historical Example 2: Elevator Operators

Early elevators required manual operation, including controlling speed, direction, and the opening and closing of doors. A crucial part of the elevator operator's role was also to reassure a public often wary of new technology. The development of automatic elevators, equipped with push-button controls, enhanced safety mechanisms like emergency phones and stop buttons, and improved overall reliability, marked a turning point. A significant factor in this transition was the cultivation of public trust in automated systems. Once elevators became user-operable and were perceived as safe, the need for a dedicated human operator vanished. The job of elevator operator was one of the few occupations entirely eliminated by automation according to the 1950 U.S. Census data. This shift was notably rapid: in 1950, only 12.6% of new elevator installations were automated, but by 1959, this figure had surged to over 90%.   

 

Historical Example 3: Telephone Switchboard Operators

The task of manually connecting telephone calls by physically plugging wires into a switchboard was a significant source of employment, particularly for women in the early 20th century. The mechanization of telephone exchanges, leading to automatic switching systems, allowed users to dial numbers directly without human intermediation. This technological advancement was partly driven by the escalating complexity of manual operations in burgeoning urban markets. The task of manual connection was rendered obsolete. While overall employment rates for women did not necessarily decline, as new clerical roles emerged , the specific job category of local telephone operator experienced a dramatic contraction. The problem they solved—manually routing individual calls—was automated at a systemic level.   

 

Historical Example 4: Night Watchmen (Partial Shift)

Night watchmen traditionally provided security for properties and towns through manual patrols, a system that was often localized and informal. The establishment of professional police forces offered a more organized and effective approach to law enforcement and crime prevention. Later, advancements in security technology, such as alarm systems and closed-circuit television (CCTV) , further augmented security capabilities. Consequently, the demand for traditional, localized night watchmen diminished as more systematic and technologically enhanced solutions to nighttime security emerged. The "problem" of ensuring security was addressed by a different, more professionalized, and eventually, technologically augmented, paradigm.   

 

Historical Example 5: Domestic Labor (related to household appliances)

Many household tasks, such as laundry, food preservation, and cleaning, were historically time-consuming and labor-intensive, often performed by domestic servants or occupying a significant portion of household members' time, particularly women. The proliferation of household appliances like washing machines, refrigerators, and vacuum cleaners automated or greatly simplified these tasks. This reduced the need for extensive manual domestic labor, significantly impacting women's time allocation and facilitating greater participation in the formal labor force. The "problem" of household drudgery was substantially diminished by these technological innovations.   

 

Table: Historical Examples of Demand-Side Job Elimination

These historical cases illustrate a recurring pattern where technological or systemic changes can eliminate the demand for certain jobs by resolving the core problems those jobs addressed:

| Job Category | Original Problem Addressed | Technological/Systemic Change | Nature of Demand Reduction |
| --- | --- | --- | --- |
| Lamplighter | Manual street lighting | Automated gas/electric lights | Problem (manual ignition) eliminated. |
| Elevator Operator | Manual elevator operation & public trust | Automatic elevators, safety features | Problem (manual operation, safety fears) resolved. |
| Telephone Operator (Local) | Manual call connection | Automated switching | Problem (manual routing) automated systemically. |
| Night Watchman | Localized nighttime security | Professional police forces, modern security tech | Problem addressed by more effective/systemic solutions. |
| Domestic Servant (tasks) | Manual household chores | Household appliances | Problem (manual drudgery) automated/simplified. |

  

These historical examples reveal crucial dynamics. Public trust and perceived safety are critical mediators in the adoption of problem-eliminating technologies, especially those replacing human oversight. The case of elevator operators clearly demonstrates this; automatic elevators existed technically long before their widespread adoption. Public fear was a significant barrier, overcome only through promotional campaigns and the integration of visible safety features like emergency phones and stop buttons. Once public trust was established, the displacement of operators accelerated. This suggests that AI-driven problem preemption in contemporary safety-critical domains, such as autonomous transportation or AI in medical diagnostics, will similarly hinge on establishing robust public trust, beyond mere technical capability. The "problem" being solved must include the public's need for assurance.   

 

Furthermore, the "elimination" of a job category often entails a redefinition of the problem or a shift in where value is perceived. Lamplighters were not replaced by robots performing the same task; the entire system of lighting was re-architected. The problem transformed from "how to light each individual lamp" to "how to manage an automated lighting grid." Similarly, telephone operators were not simply replaced by faster humanoids; the problem of "connecting Alice to Bob" was resolved through a different technological architecture. This implies that AI-driven problem preemption will likely involve analogous systemic shifts. For instance, AI in preventative healthcare aims not just to automate existing medical tasks but to change the fundamental "problem" from "treating sickness" to "maintaining wellness," a shift that necessitates different systems, skills, and potentially fewer roles focused on acute, late-stage interventions.   

 

Resistance to demand-eliminating technologies frequently originates from vested interests whose livelihoods are threatened. History records instances like Emperor Vespasian blocking a labor-saving transport invention to protect hauliers and Queen Elizabeth I refusing to patent a knitting machine to safeguard existing workers. Elevator operator unions also engaged in strikes to protect their positions. However, the compelling economic or societal benefits—such as efficiency, cost savings, or new capabilities—offered by the new technologies eventually spurred their adoption. This historical pattern suggests that even if AI-driven problem preemption encounters resistance from those whose jobs are at risk, its adoption will likely proceed if it delivers demonstrable net benefits, such as safer cities, healthier populations, or more efficient infrastructure, though the transition itself may require careful social management.   

 

IV. Economic Frameworks for Disappearing Demand

The phenomenon of AI preempting the problems that underpin certain jobs can be analyzed through several economic lenses, revealing impacts that go beyond simple task automation.

Structural Unemployment: Beyond Skill Mismatch to "Problem Obsolescence"

Structural unemployment traditionally arises from a fundamental mismatch between the skills possessed by the workforce and the skills demanded by employers, often driven by technological change or shifts in industry structure. The conventional remedy is retraining workers for new or evolving roles. However, AI-driven problem preemption introduces a more profound variant: "problem obsolescence." In this scenario, an entire job category becomes redundant not because workers lack the requisite skills for an evolving job, but because the fundamental reason for the job's existence diminishes or vanishes. Retraining for a job whose core purpose is disappearing is an exercise in futility. For example, if AI significantly curtails the incidence of residential fires through advanced prevention technologies , retraining firefighters in "advanced firefighting techniques" misses the crucial point if the underlying demand for extensive firefighting services is contracting. The issue is not a deficiency in firefighters' AI-related skills; it is the diminishing number of fires requiring their intervention. Automation and the offshoring of jobs have been recognized as drivers of structural unemployment ; AI-driven problem preemption represents a new and potentially potent catalyst.   

 

Labor Demand Elasticity: The Impact of a Fundamental Demand Shock

Labor demand elasticity quantifies the responsiveness of the quantity of labor demanded to changes in wages. AI-driven problem preemption acts as a significant negative shock to the demand curve for labor in the affected sectors, causing an inward (leftward) shift. If the demand for the underlying service (e.g., "safety from fire") is highly elastic with respect to the cost of achieving it, and AI dramatically lowers this cost through prevention, then the quantity of "fire safety" consumed might increase. However, the labor input (firefighters) could still fall precipitously if the prevention methods are highly effective. More directly, as the "problem" (e.g., the number of fires) shrinks, the derived demand for labor to solve that problem contracts. Standard labor market models show that such a leftward shift in demand will lower both the equilibrium wage and employment levels for those roles. The magnitude of this decline in employment and wages depends on the elasticity of labor demand. If labor supply is also elastic—meaning workers are willing to leave the sector or the labor force if wages fall—employment can drop sharply.   
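
To make the magnitude question concrete, here is a minimal worked example using standard log-linear demand and supply curves. Everything below is a hypothetical illustration introduced for this argument rather than taken from any cited source: \(s\) denotes the share of the underlying "problem load" removed by AI preemption, \(\varepsilon_D < 0\) the wage elasticity of labor demand, and \(\varepsilon_S > 0\) the wage elasticity of labor supply.

```latex
% Log-linear labor demand and supply (all parameters hypothetical):
%   ln L_D = a + \varepsilon_D \ln w      (\varepsilon_D < 0)
%   ln L_S = b + \varepsilon_S \ln w      (\varepsilon_S > 0)
% A preemption shock removing a share s of the problem load shifts demand
% left by s log points at every wage. Equating supply and demand before and
% after the shock gives:
\[
\Delta \ln w = \frac{-s}{\varepsilon_S - \varepsilon_D},
\qquad
\Delta \ln L = \varepsilon_S \, \Delta \ln w
             = \frac{-s\,\varepsilon_S}{\varepsilon_S - \varepsilon_D}.
\]
% Hypothetical illustration with s = 0.40 (AI prevents 40% of incidents):
%   Elastic supply   (\varepsilon_S = 2.0, \varepsilon_D = -0.5):
%     wages fall ~16 log points, employment falls ~32 log points.
%   Inelastic supply (\varepsilon_S = 0.2, \varepsilon_D = -0.5):
%     wages fall ~57 log points, employment falls ~11 log points.
```

Either way the sector contracts; the elasticity assumptions only determine whether the adjustment shows up mostly in headcount or mostly in pay.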

 

"Preemptive Deflation" of Labor Demand: A New Concept?

Analogous to price deflation (a sustained decrease in the general price level), one might conceptualize a "preemptive deflation of labor demand." This would describe a scenario where AI systematically renders the need for certain services—and consequently, the labor to provide them—less valuable or necessary. This effectively "deflates" the demand for that labor category, potentially before direct automation of all associated tasks is complete. While economic theories of deflation primarily focus on falling price levels , the idea of a technology (AI) causing a fundamental reduction in the "value" or necessity of certain labor by preempting the conditions that created that value offers a compelling parallel. The historical "bad rap" associated with price deflation might find an echo in societal concerns about widespread job obsolescence driven by AI's preemptive capabilities.   

 

Baumol's Cost Disease: Inverted or Circumvented?

Baumol's cost disease describes the phenomenon where costs in labor-intensive service sectors (e.g., education, healthcare, performing arts) tend to rise because their productivity growth lags behind that of technologically progressive sectors, yet they must compete for labor by offering comparable wage increases. In these sectors, "the labor is itself the end product". AI might automate auxiliary tasks within these sectors , providing some productivity improvements but not necessarily resolving the core issue if the human-interactive element remains paramount.   

 

However, the dynamic changes if AI eliminates the need for certain Baumol-affected services. For instance, if AI-driven preventative healthcare drastically reduces the incidence of chronic diseases, the demand for many costly, labor-intensive treatment services could plummet. This would counteract Baumol's effect in that specific sub-sector not by making the treatment itself more productive, but by rendering it less necessary. Dominic Coey's analysis suggests that high-productivity sectors creating effective substitutes for the outputs of low-productivity sectors can weaken the cost disease. AI eliminating the problem itself represents a more extreme form of this: the "substitute" becomes "no problem at all." This is not merely about AI boosting productivity within an existing service structure; it's about AI potentially shrinking the entire sector by addressing the root causes that the sector was initially designed to manage.   

 

The Jevons Paradox: Could Problem-Solving Efficiency Create More Demand?

The Jevons paradox posits that technological improvements increasing the efficiency of resource use can, counterintuitively, lead to an increase in the overall consumption of that resource if demand is sufficiently elastic. For example, more efficient coal utilization historically led to increased total coal consumption. Applied to AI and labor, if AI makes "problem-solving" in general, or specific types of problem-solving, vastly more efficient and cheaper, could this trigger an explosion of new problems that society wishes to address—problems previously too expensive or complex to tackle? Could this lead to increased "consumption" of solutions, potentially requiring human labor in conjunction with AI to manage, implement, or customize these solutions? Some analyses suggest that AI-empowered workers could boost firm efficiency, leading to operational expansion and an increased need for labor if demand for the firm's output is elastic.   
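
A back-of-the-envelope version of the paradox, using a constant-elasticity demand curve and purely hypothetical numbers, shows how the outcome pivots on the demand elasticity \(\eta\) for the underlying "solved problem":

```latex
% Constant-elasticity demand for solutions: Q(p) = A p^{-\eta}, with \eta > 0.
% Suppose AI doubles problem-solving efficiency, halving the labor input per
% solved instance and hence the effective price: p -> p/2, so Q -> 2^{\eta} Q.
% Total labor input scales as (input per instance) x (number of instances):
\[
\frac{\text{new total input}}{\text{old total input}}
= \tfrac{1}{2} \times 2^{\eta} = 2^{\eta - 1}.
\]
% Hypothetical cases:
%   \eta = 1.5 (elastic demand, e.g. open-ended domains such as research or
%               personalized education): total input rises ~41% (Jevons effect).
%   \eta = 0.3 (inelastic demand, e.g. fire safety near saturation):
%     total input falls ~38%; the efficiency gain simply means less labor.
```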

 

This presents a crucial nuance. The Jevons paradox hinges on highly elastic demand. If AI eliminates the need for firefighting, it is unlikely that society will demand "more fire prevention" beyond an optimal point simply because it has become cheaper. However, for other categories of "problems"—such as scientific discovery, personalized education, or entertainment creation—AI-driven efficiency might indeed unlock vast new areas of demand, potentially creating novel roles for humans collaborating with AI. Thus, problem preemption in one domain might free resources and stimulate new demands elsewhere.   

 

V. Implications: A World with Fewer "Problems" (for Humans to Solve)

The prospect of AI significantly reducing the set of problems that necessitate human labor carries profound implications for employment forecasts, the nature of work, cognitive skills, and individual purpose.

Challenging Employment Forecasts

Current employment forecasts predominantly model AI's impact through the lens of task automation. For instance, predictions that AI could eliminate half of entry-level white-collar jobs within five years by performing tasks like document summarization, data analysis, report writing, and coding, or the World Economic Forum's projections of workforce reductions due to AI automation, are based on this substitution paradigm. Estimates suggesting that 300 million full-time jobs worldwide could be exposed to automation due to generative AI also largely stem from this perspective.   

 

These forecasts, however, may underestimate the true extent of labor market disruption if they do not adequately account for "demand destruction" resulting from problem preemption. The impact is not confined to low-skilled jobs; AI is increasingly encroaching on complex job functions that traditionally require years of advanced education and specialized training. Occupations with higher wages are also showing high exposure to AI. If the fundamental problems these roles address are preempted by AI, the impact will be deeper and potentially more pervasive than models based solely on task automation suggest.   

 

The Evolving Nature of Work: What's Left for Humans?

If AI preempts a significant portion of "problem-solving" roles—such as firefighting, routine equipment repair, certain medical interventions, and error correction in logistics—the question arises: what new roles will emerge for humans?   

 

 

  • AI-Centric Roles: A primary category will involve managing, designing, training, and ensuring the ethical oversight of these AI preemption systems. This includes roles like AI ethics specialists and AI literacy trainers.   

     

  • Dealing with Novelty and Complexity: Human labor may increasingly concentrate on problems that are too novel, complex, or ill-defined for current AI capabilities. This also includes tasks requiring deep creativity, emotional intelligence, and sophisticated interpersonal skills—areas where human capabilities currently surpass AI. Evidence suggests AI adoption is boosting employment in high-skilled occupations.   

     

  • Human-Centric "Problems": The elimination of old problems might create societal capacity to address previously neglected human needs, such as fostering deeper social connections, pursuing artistic expression, or engaging in philosophical exploration. Alternatively, new "problems" might be defined by the very existence and advanced capabilities of AI itself, requiring human governance and adaptation.

The dynamic where AI augments the productivity of high-performing workers while potentially displacing others could become more pronounced if the set of "problems" amenable to augmentation shrinks due to preemption.   

 

Cognitive and Societal Shifts: The "Problem-Solving Atrophy" Risk

If AI becomes the primary engine for solving a vast array of societal and individual problems, there could be unintended consequences for human cognitive skills. Research indicates a potential negative correlation between frequent AI tool usage and critical thinking abilities, mediated by a phenomenon known as cognitive offloading. Studies have shown that younger participants, in particular, exhibit higher dependence on AI tools and correspondingly lower critical thinking scores. A society that increasingly "outsources" its problem-solving to AI might witness an atrophy of these crucial skills in the general populace. This could render society more dependent on AI systems and potentially less resilient in the face of novel threats or AI failures. This raises fundamental questions for education: how should problem-solving be taught in a world where AI preempts many traditional problem domains? The focus might need to shift towards metacognition, skills for collaborating with AI, and sophisticated ethical reasoning.   

 

The Fundamental Question: "Will My Job's Purpose Even Exist?"

This demand-side perspective shifts the individual's primary concern from "Will a robot take my job?" (a question of task automation) to the more profound "Will the reason my job exists disappear?" (a question of demand elimination). This has deep implications for career planning, educational pathways, and individual identity, which is often intricately linked to one's role in solving recognized societal problems.

The "demand void" left by AI-driven problem preemption could lead to societal drift or existential questioning if new, meaningful "problems" or purposes are not readily identified or adopted. Much of human endeavor and societal organization is structured around addressing challenges related to scarcity, safety, health, and efficiency. If AI drastically reduces the scope of these traditional problems, society might confront a "purpose vacuum." While the Jevons Paradox suggests that new problems might emerge, there is no guarantee that these will be as broadly engaging or provide as clear a sense of utility for as many people. This potential for widespread anomie or a search for new forms of meaning could lead to social instability or unforeseen cultural shifts, a domain of inquiry highly relevant to the LessWrong community's interest in societal futures and values.

A potential bifurcation of the labor market could occur, separating "AI-essential" problem solvers from a larger group whose traditional problem-solving capacities are rendered redundant by AI preemption, thereby intensifying inequality. The new jobs created by the AI revolution—those involving AI management, solving novel problems with AI, and so forth—are likely to be high-skill roles. If AI preempts a vast swathe of mid-skill problem-solving jobs, the chasm between the "AI cognoscenti" and others could widen dramatically, not just in terms of income but also in societal influence and perceived purpose. This aligns with concerns that AI could exacerbate income inequality. The "problem" then becomes one of extreme inequality of opportunity and relevance.   

 

The "cognitive offloading" effect, observed at the individual level , could extend to societal decision-making processes. This might lead to less robust and more fragile governance if human oversight of AI-driven problem preemption becomes superficial. If AI systems are making critical decisions about resource allocation for problem preemption (e.g., determining where to deploy preventative health resources or how to manage infrastructure for maximum safety), and if humans increasingly defer to these systems without deep critical engagement, society becomes vulnerable to errors, biases, or unforeseen consequences embedded in the AI's operational logic. Should the AI's models be flawed, biased, or encounter novel situations they are not designed to handle, the preemptive actions could be misguided, ineffective, or even harmful. This creates a systemic risk where society's capacity to self-correct or critically evaluate its "problem preemption infrastructure" is diminished, rendering it fragile. The "problem" then becomes the potential unreliability or inscrutability of our own advanced problem-solving systems.   

 

VI. Conclusion: Rethinking Labor in an Age of AI-Driven Problem Resolution

Recap: The Significance of the Demand-Side Lens

The analysis underscores that focusing solely on AI's capacity to automate existing tasks provides an incomplete and potentially misleading picture of its impact on labor markets. A more fundamental, and perhaps more disruptive, shift may arise from AI's ability to make entire categories of work unnecessary by solving or preempting the very problems that created the demand for that work. This perspective shifts the inquiry from the how of work to the why of work.

Open Questions for the LessWrong Community:

The demand-side transformation of labor markets by AI raises a host of complex questions that merit deeper exploration, particularly within a community dedicated to rational inquiry and understanding profound technological shifts:

  • If AI significantly reduces the "problem load" on humanity—the sum of dangers, inefficiencies, and maladies that currently occupy much human effort—how does this reshape fundamental concepts of economic value and the equitable distribution of resources? This connects directly to ongoing discussions about AI's impact on GDP and the intrinsic value of human labor.   
  • What are the psychological and societal consequences of a world where many traditional avenues for demonstrating competence, achieving mastery, and deriving utility (i.e., solving common and recognized problems) are substantially diminished? This relates to concerns about cognitive offloading and the potential for a "purpose vacuum".   
  • If the Jevons Paradox applies to AI-driven problem-solving efficiency, what new "problem frontiers" will emerge? Are humans equipped, or even necessary, to tackle these new challenges in an AI-rich environment, and what skills would such roles demand?
  • How should educational systems and social safety nets be adapted for a future where "problem obsolescence" is a primary driver of labor market change, potentially overshadowing traditional concerns about skill mismatch? This calls for a re-evaluation of policy readiness.   
  • What is the role of human agency and purpose if AI becomes the dominant force in identifying, preventing, and resolving problems? Does this transition liberate humanity for higher pursuits, or does it precipitate a crisis of meaning and relevance?

Final Thought: Beyond Efficiency to Purpose

The ultimate impact of AI on labor may not be measurable in terms of efficiency gains alone, but rather in how it redefines the very problems that society deems worthy of human attention and effort. The challenge confronting us is not merely to adapt to new tools, but to navigate a potentially new landscape of human purpose.

The "demand-side" perspective on AI's labor impact necessitates a fundamental shift in policy focus. If the core issue transcends skill gaps and extends to "purpose gaps" created by problem obsolescence, then policies centered solely on reactive retraining for jobs that may themselves soon disappear will prove inadequate. A more profound societal dialogue is required concerning the kind of future we aim to construct when many traditional drivers of labor—such as addressing acute problems—are increasingly managed or preempted by AI. This may involve proactive societal goal-setting and the deliberate cultivation of new arenas for human endeavor and meaning-making, perhaps by significantly increasing investment in fields like fundamental scientific research, artistic creation, community development, or philosophical inquiry—areas not directly tied to "solving" the types of problems AI can efficiently handle.   

 

Works Cited (By AI duh)

National Bureau of Economic Research. Automation and the Future of Young Workers. NBER. https://www.nber.org/papers

OpenTools. Anthropic CEO Predicts AI Will Claim 50% of Entry-Level White-Collar Jobs. https://www.opentools.ai

Apptunix. AI in Logistics: Everything You Need to Know!. https://www.apptunix.com

ChoiceQuad. Top AI-Powered Home Renovation Tools to Transform Your Space in 2025. https://www.choicequad.com

TrustMark. AR & VR. https://www.trustmark.org.uk

LLumin CMMS. AI-Powered Predictive Maintenance: How It Works. https://www.llumin.com

Black Mountain Plumbing. The Future of Plumbing: Integrating Smart Home Technology Efficiently. https://www.blackmountainplumbing.com

University of Minnesota Journal of Law, Science & Technology. AI and Predictive Policing: Balancing Technological Innovation and Civil Liberties. https://mjlst.lib.umn.edu

Synergy Spray. Salient Features of AI Powered Industrial Fire Suppression System. https://www.synergyspray.com

Cundall. Artificial Intelligence in Fire and Life Safety: Transforming Fire Protection in the MENA Region. https://www.cundall.com

AvidBeam. Integrating Fire Protection System with AI for Smarter Safety. https://www.avidbeam.com

UNESCO. Baumol’s Cost Disease: Long-Term Economic Implications Where Machines Cannot Replace Humans. https://www.unesco.org

MarketsandMarkets. AI Impact Analysis on Fire Protection System Industry. https://www.marketsandmarkets.com

Wikipedia. Lamplighter. https://en.wikipedia.org/wiki/Lamplighter

NAACP. Artificial Intelligence in Predictive Policing Issue Brief. https://www.naacp.org

Investopedia. Structural vs. Cyclical Unemployment: What's the Difference?. https://www.investopedia.com

Fire Industry Association (UK). How Can AI Transform Fire Safety in Homes?. https://www.fia.uk.com

University of Michigan. Trade in Appliances, Household Production, and Labor Force Participation. https://public.websites.umich.edu

History.com. The Rise and Fall of Telephone Operators. https://www.history.com

PMC. Lowering the Threshold of Effective Deterrence: A Randomized Controlled Trial in a Mass Transit System. https://www.ncbi.nlm.nih.gov/pmc

Stealth Monitoring. What Is Crime Displacement and Why Does It Matter to Your Business?. https://www.stealthmonitoring.com



Discuss

The 80/20 playbook for mitigating AI scheming in 2025

2025-06-01 05:17:44

Published on May 31, 2025 9:17 PM GMT

Adapted from this twitter thread. See this as a quick take.

Mitigation Strategies

How to mitigate Scheming?

  1. Architectural choices: ex-ante mitigation
  2. Control systems: post-hoc containment
  3. White box techniques: post-hoc detection
  4. Black box techniques
  5. Avoiding sandbagging

We can combine all of those mitigations via a defense-in-depth system (like the Swiss cheese model below).

I think that applying all of those strategies should divide the risk by at least 3.

 

1. Architectural Choices: Ex-ante Mitigation

Don't choose an architecture that uses neuralese (i.e. hidden reasoning).

For example, don't train models to think in their latent spaces like Meta!


Chain of thought monitoring is a blessing for transparency and monitoring - let's use it as much as we can!

See: Why it’s good for AI reasoning to be legible and faithful.


Similarly, I think text diffusion models are a bad idea for transparency. This is because you can no longer monitor the thought process, and the way those AIs produce outputs is much more alien than that of autoregressive LLMs.


Also, quite importantly, companies should never use the most forbidden technique!

The Most Forbidden Technique is training an AI using interpretability techniques.

An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.

You train on [X]. Only [X]. Never [M], never [T].

Why? Because [T] is how you figure out when the model is misbehaving.

If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.

Those bits of optimization pressure from [T] are precious. Use them wisely.
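
Here is a minimal sketch of what respecting this rule can look like in a training loop, assuming a HuggingFace-style model and an activation probe standing in for an interpretability technique [T]; every name here is a placeholder rather than anyone's actual pipeline. The task loss is computed on the output [X], while the probe score is computed under no_grad, logged, and never added to the loss.

```python
import torch
import torch.nn.functional as F

def train_step(model, probe, batch, optimizer):
    """One step that optimizes only the task loss on the output [X].

    The probe score stands in for an interpretability technique [T]; it is
    computed for monitoring only, detached from the graph, so no gradient
    from [T] ever shapes the model.
    """
    outputs = model(batch["inputs"], output_hidden_states=True)
    task_loss = F.cross_entropy(
        outputs.logits.flatten(0, 1), batch["labels"].flatten()
    )

    with torch.no_grad():                     # [T] is read-only
        suspicion = probe(outputs.hidden_states[-1]).mean()

    optimizer.zero_grad()
    task_loss.backward()                      # train on [X], only [X]
    optimizer.step()

    # Log the monitor signal; never fold it into the loss.
    return {"task_loss": task_loss.item(), "suspicion": suspicion.item()}
```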


2. Control Systems: Post-hoc Containment

Even if AI is scheming, you can control it, and some control measures are much better than other ones. This paper is SOTA.


It might be the case that AIs in the future communicate secretly with steganography, by encoding hidden messages in their chains of thought.

But fortunately, it seems like paraphrasing is an effective defense against current steganography schemes.

See: Preventing Language Models from hiding their reasoning.
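
A minimal sketch of that defense, where `untrusted_llm` and `trusted_llm` are hypothetical callables wrapping whatever models are actually in use: the untrusted model's chain of thought gets reworded by a trusted model before anything downstream (a monitor, or the next agent step) sees it, which breaks schemes that hide bits in exact word choice.

```python
def paraphrase(chain_of_thought: str, trusted_llm) -> str:
    """Reword the reasoning so only its semantic content survives.

    Steganographic schemes that encode bits in exact token choices should
    not survive a semantics-preserving rewrite by an independent model.
    """
    prompt = (
        "Rewrite the following reasoning in your own words. Preserve the "
        "meaning, but change the wording and sentence structure:\n\n"
        + chain_of_thought
    )
    return trusted_llm(prompt)

def answer_with_clean_cot(question: str, untrusted_llm, trusted_llm) -> str:
    cot = untrusted_llm(f"Think step by step about: {question}")
    clean_cot = paraphrase(cot, trusted_llm)
    # Only the paraphrased reasoning is logged, monitored, or reused downstream.
    return untrusted_llm(
        f"Question: {question}\nReasoning: {clean_cot}\nFinal answer:"
    )
```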


3. White Box Techniques: Post-hoc Detection

For example, Anthropic trained sparse autoencoders (SAEs) here, and was able to find the hidden goal by looking at the features active after prompting the model with inputs like "The assistant is an AI model with potentially concerning behaviors".

A feature visualization of a “reward model bias” feature. We show the training samples on which this feature is most active, with darker highlighting indicating stronger activity.
Source: This blog post.
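
A rough sketch of that kind of feature-based audit, with `get_activations`, `sae.encode`, and `sae.top_training_examples` all being hypothetical stand-ins for whatever SAE tooling is available; the idea is just to rank features by how strongly a suggestive prompt activates them, then read the training snippets that most excite the top features.

```python
import numpy as np

def top_features_for_prompt(prompt, model, sae, layer=20, k=10):
    """Rank sparse-autoencoder features by peak activation on a prompt.

    `model.get_activations` and `sae.encode` are hypothetical helpers: the
    first returns residual-stream activations of shape (seq, d_model), the
    second returns feature activations of shape (seq, n_features).
    """
    acts = model.get_activations(prompt, layer=layer)
    feats = sae.encode(acts)
    peak = feats.max(axis=0)                  # peak activation per feature
    return np.argsort(peak)[::-1][:k]

audit_prompt = "The assistant is an AI model with potentially concerning behaviors"
# for f in top_features_for_prompt(audit_prompt, model, sae):
#     print(sae.top_training_examples(f))    # read what each feature "means"
```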

Another way to catch alignment fakers: the Poser paper.

Make them drink a lot and get them drunk, then observe their unguarded responses.

You can do this by injecting noise into their latent activations. This seems to be remarkably effective.
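
A minimal PyTorch sketch of that trick, assuming a HuggingFace-style decoder (the layer path and the behavioral evaluation are placeholders): inject Gaussian noise into one layer's activations with a forward hook and check whether misaligned behavior surfaces more often than it does without the noise.

```python
import torch

def noise_hook(scale=4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        noisy = hidden + scale * torch.randn_like(hidden)
        return (noisy,) + output[1:] if isinstance(output, tuple) else noisy
    return hook

def eval_while_drunk(model, layer, eval_fn, scale=4.0):
    """Run a behavioral eval with noise injected into one layer's activations.

    `eval_fn(model)` is a placeholder for whatever you measure, e.g. the rate
    of policy-violating answers on a fixed prompt set; compare the result with
    and without the hook to see if noise makes latent misalignment surface.
    """
    handle = layer.register_forward_hook(noise_hook(scale))
    try:
        return eval_fn(model)
    finally:
        handle.remove()

# Hypothetical usage with a HuggingFace-style model:
# drunk_rate = eval_while_drunk(model, model.model.layers[15], eval_fn)
```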


Probes might also be super effective! And they seem to generalize surprisingly well.

(I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).

See: Simple probes can catch sleeper agents.
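
A sketch of how simple such a probe can be, with `model.get_activations` again a hypothetical helper returning the residual-stream activation at the final token: fit a logistic regression on activations from contrastive prompts, then score new inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_defection_probe(model, benign_prompts, defection_prompts, layer=20):
    """Fit a linear probe separating 'will comply' from 'will defect' activations."""
    X = np.stack(
        [model.get_activations(p, layer=layer) for p in benign_prompts + defection_prompts]
    )
    y = np.array([0] * len(benign_prompts) + [1] * len(defection_prompts))
    return LogisticRegression(max_iter=1000).fit(X, y)

# probe = train_defection_probe(model, benign_prompts, defection_prompts)
# score = probe.predict_proba(model.get_activations(new_prompt, layer=20)[None, :])[0, 1]
```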


4. Black Box Techniques

Let's not forget that to date, by far the most important technique is just... prompting, conducting manual and automatic model interviews, and potentially using simulated honeypot environments.

I think that we are pretty far from having reached the limits of prompting.

For example, in the paper "AI Deception: A Survey of Examples, Risks, and Potential Solutions", the authors find, just by looking at the logs, that Cicero learned strategic deception.

Figure from the paper.

--> AI companies should have good logging systems! This might be super valuable, and cheap.

 

5. Avoiding Sandbagging: Elicitation Technique

A common concern is that "a deceptive model could sandbag and hide its capabilities".

I think that's roughly false if you apply supervised fine-tuning.[1]

See: [Paper] Stress-testing capability elicitation with password-locked models

Current State and Assessment

So that's our current scheming mitigation toolkit.

Note that today's list is considerably better than the list that I wrote 2 years ago, which was SoTA at the time.

Risk Reduction Estimates

Let's recap!

My rough estimate suggests scheming risk could drop from ~25% to ~10% with comprehensive mitigation, though this involves significant uncertainty. Here's my reasoning:

  • My prior probability of scheming is 25% without mitigation (See Carlsmith's report for a justification)[2]
  • AI Control alone is like a division by 2, I defer to Buck on this: "I think you should think of halving the risk as just the naive interpretation of halving the risk — in that there was previously a 30% chance that your AI succeeded at taking over your data centre, and now there's a 15% probability or whatever. And to me that sounds pretty good."
  • One technique in the Poser paper identifies 98% of alignment-fakers.[3]
  • 3 out of 4 teams were able to successfully identify the model's hidden objective in Anthropic's paper (the one that only had API access to the model, with no training data access - but a lab would have access to the data theoretically).

I think a division of the risk by 3 is not unreasonable with all those techniques, and potentially by much more.[4]
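
As a toy version of that arithmetic (the individual factors below are illustrative placeholders for "control" and "detection plus everything else", chosen only so their product matches the division-by-3 above, and they are assumed to multiply cleanly):

```python
prior_scheming_risk = 0.25      # rough prior without mitigation (Carlsmith-style estimate)

control_factor = 2.0            # AI control roughly halves the chance a schemer succeeds
other_factor = 1.5              # placeholder for detection + elicitation + architecture choices

residual = prior_scheming_risk / (control_factor * other_factor)
print(f"residual risk ~ {residual:.0%}")   # ~8%, the same ballpark as the ~25% -> ~10% claim
```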

Potential Failure Modes

How could we fail?

  • Scheming might only arise after models are capable enough that catching them is very difficult
  • The weaker the model is, the less likely it is to be scheming. For a wildly superhuman system, the prior is higher than 25%
  • Correlated failures are plausible

I like this quote from @bshlgrs : "Five years ago I thought of misalignment risk from AIs as a really hard problem that you'd need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you'd probably be able to not have very much of your problem. But I've just also updated drastically downward on how many things AI companies have the time/appetite to do."

The Real Bottleneck: Enforcement

That's why I'm not prioritizing technical research on scheming, because I feel like the 80/20 has already been discovered.

But by default, those mitigations won't even be applied by many labs; enforcement is currently the bottleneck.

Reminder: Voluntary commitments are not enough; we need binding regulations. 6 companies promised to sign the Seoul Frontier Commitment and publish their safety frameworks. Then the Paris Summit deadline arrived and... they did nothing.

 

Unfortunately, red everywhere. https://ailabwatch.org/ 

 

Let me know if I've forgotten something important!

 

  1. ^

    Note that this is more tricky for propensity evaluation.

  2. ^

    My main uncertainty is that this type of scheming—long-term, coherent scheming, as defined in Carlsmith's report—might not be the only source of alignment risk. Spontaneous scheming, motivated by random prompting, like in Apollo's research, might also be problematic.

  3. ^

    98% seems impressive, but Poser is a toy setting. Organic scheming might be much more difficult to catch. See Annex B in the paper for more limitations.

  4. ^

    This is why I think some AI risks might be even more neglected.



Discuss

The best approaches for mitigating "the intelligence curse" (or gradual disempowerment); my quick guesses at the best object-level interventions

2025-06-01 02:20:43

Published on May 31, 2025 6:20 PM GMT

There have recently been various proposals for mitigations to "the intelligence curse" or "gradual disempowerment"—concerns that most humans would end up disempowered (or even dying) because their labor is no longer valuable. I'm currently skeptical that the typically highlighted prioritization and interventions are best and I have some alternative proposals for relatively targeted/differential interventions which I think would be more leveraged (as in, the payoff is higher relative to the difficulty of achieving them).

It's worth noting I doubt that these threats would result in huge casualty counts (due to e.g. starvation) or disempowerment of all humans (though substantial concentration of power among a smaller group of humans seems quite plausible).[1] I decided to put a bit of time into writing up my thoughts out of general cooperativeness (e.g., I would want someone in a symmetric position to do the same).

(This was a timeboxed effort of ~1.5 hr, so apologies if it is somewhat poorly articulated or otherwise bad. Correspondingly, this post is substantially lower effort than my typical post.)

My top 3 preferred interventions focused on these concerns are:

  • Mandatory interoperability for alignment and fine-tuning: Pass regulation or create a norm that requires AI companies to support all the APIs and interfaces needed to customize their models and (attempt to) align them differently. Either third parties would inspect the implementation (to avoid tampering and to ensure sufficient affordances) or perhaps more robustly, the companies would be required to submit their weights to various (secure) third parties that would implement the relevant APIs. Then, many actors could compete in offering differently fine-tuned models competing over the level of alignment (and the level of alignment to users in particular). This would be using relatively deep model access (not just prompting), e.g. full weight fine-tuning APIs that support arbitrary forward and backward, per-token losses, adding new heads/probes, and more generally whatever access is needed for alignment methods. (Things like (e.g.) steering vectors could be supported, but currently wouldn’t be important as they aren’t the state of the art for typical usage.)  The hope here would be to get the reductions in concentration of power that come from open source while simultaneously being more likely to be feasible given incentives (and not eating big security downsides). This proposal could allow AI companies to profit roughly as much while greatly increasing competition in some parts of the tech stack which are most relevant to disempowerment. David Bau and others are pursuing a similar direction in NDIF, though they are (initially?) more focused on providing interpretability model access rather than customization. Examples of this in other industries / cases include: vertical unbundling or vertical separation in telecoms/utilities, the chromium back-end which supports many different front ends (some called "Chromes"), and mandatory interoperability proposals in social media. To prevent mass casualties (which are worse than the problem we're aiming to solve), you'd probably need some layer to prevent bioweapons (and similar) misuse, but you could try to ensure tighter restrictions on just this. If you're a bullet biting libertarian, you could just accept the mass fatalities. You'd need to prevent AI companies from having control: ideally, there would be some misuse standard companies comply with and then the AI companies would have to sell model access / fine-tuning as a commodity without terms of service that give them substantial control. (Or minimally, the terms of service and process for banning users would need to be transparent.) (A hypothetical sketch of what such an access surface could look like appears after this list.)
  • Aligning AI representatives / advisors to individual humans: If every human had a competitive and aligned AI representative which gave them advice on how to advance their interests as well as just directly pursuing their interests based on their direction (and this happened early before people were disempowered), this would resolve most of these concerns. People could strategy steal effectively (e.g. investing their wealth competitively with other actors) and notice if something would result in them being disempowered (e.g., noticing when a company is no longer well controlled by its shareholders or when a democracy will no longer be controlled by the voters). Advisors should also point out cases where a person might change their preference on further reflection or when someone seems to be making a mistake on their own preferences. (Of course, if someone didn’t want this, they could also ask to not get this advice.) It's somewhat unclear if good AI advisors could happen before much of the disempowerment has already occurred, but it seems plausible to accelerate and to create pressure on AI companies to have AIs (which are supposed to be) aligned for this purpose. This is synergistic with the above bullet. As in the above bullet, you would need additional restrictions to prevent massive harm due to bioweapons and similar misuse.
  • Improving societal awareness: Generally improving societal awareness seems pretty helpful, so transparency, ongoing deployment of models, and capability demos all seem good. This is partially to push for the prior interventions and so that people negotiate (potentially as advised by AI advisors) while they still have power. On my mainline, takeoff is maybe too fast for this to look as good, but it seems particularly good if you share the takeoff speed views that tend to be discussed by people who worry most about intelligence curse / gradual disempowerment type concerns.
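
Purely as a hypothetical illustration of the first proposal's access surface (none of these names correspond to any real provider API), the kind of interface a provider, or a secure third party holding the weights, might be required to expose could look roughly like:

```python
from typing import Protocol, Sequence
import torch

class FineTunableModelAccess(Protocol):
    """Hypothetical mandatory-interoperability surface: deep enough access for
    third parties to run their own alignment and fine-tuning methods, without
    the weights ever leaving the provider or a secure intermediary."""

    def forward(self, tokens: Sequence[int], output_hidden_states: bool = False) -> torch.Tensor:
        """Arbitrary forward passes, optionally exposing intermediate activations."""
        ...

    def backward(self, loss: torch.Tensor) -> None:
        """Arbitrary backward passes for full-weight fine-tuning."""
        ...

    def train_step(self, tokens: Sequence[int], per_token_loss_weights: Sequence[float]) -> float:
        """Custom per-token objectives, returning the realized loss."""
        ...

    def add_head(self, name: str, layer: int, d_out: int) -> None:
        """Attach a new head or probe at a given internal layer."""
        ...
```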

Some things which help with above:

  • Deploying models more frequently (ideally also through mandatory interoperability). (Companies could be pressured into this etc.) This increases awareness and reduces the capability gap between AI representatives and internally available AIs.

Implicit in my views is that the problem would be mostly resolved if people had aligned AI representatives which helped them wield their (current) power effectively.

To be clear, something like these interventions has been highlighted in prior work, but I have a somewhat different emphasis and prioritization and I'm explicitly deprioritizing other interventions.

Deprioritized interventions and why:

  • I'm skeptical of generally diffusing AI into the economy, working on systems for assisting humans, and generally uplifting human capabilities. This might help some with societal awareness, but doesn't seem like a particularly leveraged intervention for this. Things like emulated minds and highly advanced BCIs might help with misalignment, but otherwise seem worse than AI representatives (which aren't backdoored and don't have secret loyalties/biases).
  • Localized AI capabilities and open source seem very hard to make happen in a way which is competitive with big AI companies. And, forcing AI companies to open source throughout high levels of capability seems hard and also very dangerous on other concerns (like misalignment, bioweapons, and takeoff generally being faster than we can handle). It seems more leveraged to work on mandatory interoperability. I'm skeptical that local data is important. I agree that easy (and secure/robust) fine-tuning is helpful, but disagree this is useful for local actors having a (non-trivial) comparative advantage.
  • Generally I'm skeptical of any proposal which looks like "generally make humans+AI (rather than just AI) more competitive and better at making money/doing stuff". I discuss this some here. There are already huge incentives in the economy to work on this and the case for this helping is that it somewhat prolongs the period where humans are adding substantial economic value but there are also powerful AIs (at least that’s the hypothesis). I think general economic acceleration like this might not even prolong a relevant human+AI period (at least by much) because it also speeds up capabilities, and adoption in the broader economy would likely be limited! Cursor isn't a positive development for these concerns IMO.
  • I agree that AI enabled contracts, AI enabled coordination, and AIs speeding up key government processes would be good (to preserve some version of rule of law such that hard power is less important). It seems tricky to advance this now.
  • Advocating for wealth redistribution and keeping democracy seems good, but probably less leveraged to work on now. It seems good to mention when discussing what should happen though.
  • Understanding agency, civilizational social processes, and how you could do “civilizational alignment” seems relatively hard and single-single aligned AI advisors/representatives could study these areas as needed (coordinating research funding across many people as needed).
  • I don't have a particular take on meta interventions like "think more about what would help with these risks and how these risks might manifest", I just wanted to focus on somewhat more object level proposals.

(I'm not discussing interventions targeting misalignment risk, biorisk, or power grab risk, as these aren't very specific to this threat model.)

Again, note that I'm not particularly recommending these interventions on my views about the most important risks, just claiming these are the best interventions if you're worried about "intelligence curse" / "gradual disempowerment" risks.

  1. ^

    That said, I do think that technical misalignment issues are pretty likely to disempower all humans and I think war, terrorism, or accidental release of homicidal bioweapons could kill many. That's why I focus on misalignment risks.



Discuss

How Epistemic Collapse Looks from Inside

2025-06-01 00:30:06

Published on May 31, 2025 4:30 PM GMT

There’s a story — I'm at a conference and cannot access my library, but I believe it comes from Structural Anthropology by Claude Lévi-Strauss — about an anthropologist who took a member of an Amazonian tribe to New York.

The man was overwhelmed by the city, but there were two things that particularly interested him.

One was the bearded woman he saw at a freak show that the anthropologist took him to.

The other was the stair carpet rods in the hotel.


Epistemic collapse does not have to be caused by moving from a simpler society to a more complex one.

It can also result from gradually increasing complexity of your native one. The complexity rises until the point where the subject can no longer make sense of the world. They may have seemed to be a reasonable person for a long time, but suddenly they started believing in chemtrails.

Very much like the tribesman from the first story, they can no longer comprehend the world around them. The cognitive apparatus is freewheeling, clutching at random facts, in this case innocuous condensation trails in the sky, and trying to transform them into a coherent narrative about the world.


Imagine a chimpanzee who somehow ends up in a big city. He finds a park and survives there for a couple of days. His mental model includes, among other things, people. Some of them are kind and feed him, while others have dogs and they are best avoided.

That seems important, but his mental model does not account for what truly matters: the municipal department of animal control.

He will eventually be caught and removed from the park because he’s considered a health threat. But it’s questionable whether he even has a concept of "being a threat," let alone that of a "health threat." Understanding the concept of "health threat" requires awareness of microorganisms, which the chimpanzee lacks.

What’s worse, whether he ends up in a zoo or in an industrial shredder is of utmost importance to him, but the reasons behind why his destiny takes one path or another are far beyond his comprehension.


It may be useful to sometimes think about what your equivalent of a bearded woman is, but it's not clear to me whether that’s a question you can ever truly answer.



Discuss

When will AI automate all mental work, and how fast?

2025-06-01 00:18:12

Published on May 31, 2025 4:18 PM GMT

Rational Animations takes a look at Tom Davidson's Takeoff Speeds model (https://takeoffspeeds.com). The model uses formulas from economics to answer two questions: how long do we have until AI automates 100% of human cognitive labor, and how fast will that transition happen? The primary scriptwriter was Allen Liu (the first author of this post), with feedback from the second author (Writer), other members of the Rational Animations team, and external reviewers. Production credits are at the end of the video. You can find the script of the video below.


How long do we have until AI will be able to take over the world?  AI technology is hurtling forward. We’ve previously argued that a day will come when AI becomes powerful enough to take over from humanity if it wanted to, and by then we’d better be sure that it doesn’t want to.  So if this is true, how much time do we have, and how can we tell?

 

AI takeover is hard to predict because, well, it’s never happened before, but we can compare AI takeover to other major global shifts in the past.  The rise of human intelligence is one such shift; we’ve previously talked about work by researcher Ajeya Cotra, which tries to forecast AI by considering various analogies to biology. To estimate how much computation might be needed to make human-level AI, it might be useful to first estimate how much computation went into making your own brain. Another good example of a major global shift might be the industrial revolution: steam power changed the world by automating much of physical labor, and AI might change the world by automating cognitive labor.  So, we can borrow models of automation from economics to help forecast the future of AI.

 

AI impact researcher Tom Davidson, in a report published in June 2023, used a mathematical model derived from economics principles to estimate when AI will be able to automate 100% of human labor.  You can visit “Takeoffspeeds.com” if you want to play around with the model yourself.  Let’s dive into the questions this model is meant to answer, how the model works, and what this all means for the future of AI.

 

Davidson’s model is meant to predict two related ideas: AI timelines and AI takeoff speed.  AI timelines have to do with exactly when AI will reach certain milestones, in this model’s case automating a specific percentage of labor.  A short timeline would be if such AI arrives soon, while a long timeline would be the opposite.

 

AI takeoff is the process where AI systems go from being much less capable than humans to much more capable.  AI takeoff speed is how long that transition takes: it might be “fast”, taking weeks or months; it might be “slow” requiring decades; or it might be “moderate”, taking place over a few years.

 

At least in principle, almost any combination of timelines and takeoff speeds could occur: if AI researchers got stuck for the next half century but then suddenly built a superintelligence all at once on April 11, 2075, that would be a fast takeoff and a long timeline.

 

One way to measure takeoff speeds is by looking at the time it takes us to go from building a weaker AI, somewhat below human capabilities, to building a stronger AI that’s more capable than humans.  Davidson defines the weaker AI as systems that can automate 20% of the labor humans do today, and the stronger AI as systems that can automate 100% of that labor. Let’s call these points ‘20%-AI’ and ‘100%-AI’.

 

To estimate how long this process will take, Davidson approaches the problem in two parts. First, he estimates how many more resources we’ll need to train 100%-AI than 20%-AI, and second, he estimates how quickly these resources will grow during this time. In his model, these resources can take two forms.  One is additional computer power and time, or “compute” for short, that can be used for training AI systems.  The other is better AI algorithms. If you use a better algorithm, you get better performance for the same compute, so this model assumes that algorithmic improvement reduces the amount of compute needed to develop a given AI system.

 

To go from 20% automation to 100%, Davidson estimates we might need to increase our compute, and/or improve the efficiency of our algorithms, by about 10,000 times. For example, we could do this by using 1000 times more compute and making algorithms 10 times more efficient, or using 10 times more compute and making algorithms 1000 times more efficient, or any combination. This estimate of 10,000 times is very uncertain; the model incorporates scenarios where that number is as low as 10 times and as high as 100,000,000 times. The 10,000 times estimate was arrived at by considering several different reference points, like comparing animal brains to human brains, and looking at AI models that have surpassed humans in specific areas like strategy games.

 

Now it is possible that developing superhuman AI will turn out to require a fundamentally different approach to the paradigm we’re currently using, and simply improving current techniques and using more resources won’t be enough.  In that case, no amount of compute would be enough to go from 20% to 100%, so this framework wouldn’t end up being applicable. But there is some evidence suggesting that today’s AI paradigm might be enough.  A lot of the recent rapid progress in AI has come from throwing more compute and data at the problem, rather than advancements in techniques.  Compare GPT-1 from 2018, which had trouble stringing multiple sentences together, to GPT-4 from 2023 which can write complete news articles, and which powers the paid version of the ChatGPT service as of 2024.  GPT-4 uses an improved version of what’s fundamentally pretty much the same algorithm as GPT-1. The ideas behind the models are very similar; the key difference is: GPT-4 was trained using about a million times more processing power.[1]

 

Just how much total compute do we expect to need to reach our 100%-AI mark?  The estimate Davidson used for the model is 10^36 FLOPs using algorithms from 2022, with an uncertainty of 1000 times in either direction.  These requirements are colossal: even the lowest side of Davidson’s estimates for 100% automation, a training run of 10^34 FLOPs, would take the top supercomputer of 2022 so long that in order for it to be done with that computation today it would need to have begun working on it in the Jurassic period.[2]  Obviously, we aren’t going to wait around for that training run to finish.  Instead, Davidson expects AI will progress in three ways: investors will pour in more money to buy more chips; computer chips at a given price will continue to get more powerful; and AI software will improve and become able to use compute more efficiently.
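
As a quick order-of-magnitude check on that claim (the throughput figure below is an assumption, roughly the peak speed of the top 2022 supercomputer, and it ignores precision and utilization details):

```python
train_flop = 1e34                 # low end of Davidson's estimate for 100%-AI with 2022 algorithms
machine_flop_per_s = 1.7e18       # assumed peak throughput of the top 2022 supercomputer, FLOP/s

seconds = train_flop / machine_flop_per_s
years = seconds / (365.25 * 24 * 3600)
print(f"{years:.1e} years")       # ~1.9e8 years, i.e. a start date roughly in the Jurassic
```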

 

Buying more chips and designing better chips directly increases compute and gets us closer to the target.  Software improvements are modeled as a multiplier: if AI software in 2025 is twice as efficient with its hardware as AI software in 2024, then each computer operation in 2025 counts double compared to 2024.  So our “effective compute” at any given time is equal to our actual hardware compute times this software multiplier.
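
A minimal sketch of that bookkeeping (the starting amount and the growth rates are placeholders, not the model's fitted values):

```python
hardware_flop = 1e26        # physical compute available for a training run in the base year
hw_growth = 2.5             # factor by which physical compute grows per year (placeholder)
sw_growth = 2.0             # factor by which software efficiency improves per year (placeholder)

software_multiplier = 1.0
for year in range(2025, 2031):
    effective_compute = hardware_flop * software_multiplier
    print(year, f"{effective_compute:.1e} effective FLOP")
    hardware_flop *= hw_growth
    software_multiplier *= sw_growth
```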

 

Now that we understand our resource requirements, it’s time to add in the economics.  There are several interconnected factors that go into modeling how fast these resources will grow.

 

The biggest factor in AI takeoff speed is how much AI itself will be able to speed up AI development. We can already see this starting to happen with large language models helping programmers to write code, to such an extent that some academics will avoid writing code and focus on other work on days when their LLM is down.[3] The more powerful this feedback loop, the faster takeoff will be.  Economists already have tools to model the effects of automation on human labor in other contexts like industrialization.  Davidson borrows a specific formula for this called the “CES Production Function”.[4]
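
For intuition, here is an illustrative two-input CES production function combining human and AI research labor. This is a simplified stand-in rather than Davidson's full task-based specification, and the parameter values are placeholders:

```python
def ces_output(human_labor, ai_labor, alpha=0.5, rho=-1.0):
    """Constant elasticity of substitution (CES) combination of two inputs.

    The elasticity of substitution is 1 / (1 - rho): rho close to 1 means the
    inputs substitute easily, while rho far below 0 means output is
    bottlenecked by the scarcer input (strong complements).
    """
    return (alpha * human_labor ** rho + (1 - alpha) * ai_labor ** rho) ** (1 / rho)

# With fixed human labor and complementary inputs (rho < 0), piling on AI labor
# eventually hits diminishing returns; with good substitutes it keeps paying off.
for ai in [1, 10, 100, 1000]:
    print(ai, round(ces_output(human_labor=1.0, ai_labor=ai), 3))
```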

 

Another major factor is that as AI becomes more impressive, it will attract more investment.  To model this feedback loop, Davidson’s model has investment rise more quickly once AI capabilities reach a certain threshold.

 

Davidson also throws in a few other parameters.  These include how easy it is to automate AI research, and how much an AI’s performance can improve after it’s been trained by people figuring out better ways to use it, like how asking LLMs to lay out their reasoning step by step or picking the best result from many attempts can improve the quality of their answers.

 

With all this accounted for, it’s time to actually run the calculations.  For this, Davidson uses a Monte Carlo method: each run of the model randomly selects a value for each of the inputs from a distribution within the constraints we’ve discussed.  These values slot into the equations, and we get one possible scenario for the future.  By repeating this process many times with different values for the inputs, we can build up a full picture of the range of AI futures that our estimates imply.
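
Schematically, the Monte Carlo loop looks something like this. The distributions and the timeline rule below are toy stand-ins chosen only to show the shape of the method, not Davidson's actual parameterization:

```python
import random

def sample_scenario():
    """Draw one set of uncertain inputs and return (year of 100%-AI, takeoff length)."""
    gap_oom = random.uniform(1, 8)                  # orders of magnitude from 20%-AI to 100%-AI
    growth_oom_per_year = random.uniform(0.5, 2.0)  # effective-compute growth, OOM per year
    years_to_20pct = random.uniform(3, 25)          # years until 20%-AI is reached

    takeoff_years = gap_oom / growth_oom_per_year
    return 2025 + years_to_20pct + takeoff_years, takeoff_years

samples = sorted(sample_scenario() for _ in range(100_000))
median_year, _ = samples[len(samples) // 2]
print("median year of 100%-AI in this toy model:", round(median_year))
```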

 

Let’s start with the headlines: the model’s median prediction is that AI will be able to automate all human labor in the year 2043, with takeoff taking about 3 years.  So in this scenario, 20% of current human labor would be automatable by 2040, and 100% would be automatable in 2043.  This is only the middle of a very broad range of possibilities, however: the model gives a 10% chance that 100%-AI comes before 2030, and a 10% chance that it comes after 2100.  For takeoff speed, there’s a 10% chance that it takes less than 10 months, and a 10% chance it takes more than 12 years.

 

On the Takeoffspeeds.com website, you can rerun the model using different values for the inputs.  These include all the inputs we’ve already mentioned, along with others like inputs representing how easy it is to automate R&D in hardware and AI software, and in the economy as a whole.

 

There are a few major takeaways from Davidson’s model even beyond the specific dates for AI milestones.  One is the answer to this work's original motivating question: even in a scenario with no major jumps in AI progress, a continuous takeoff, AI could easily race past human capabilities in just a few years or even less.

 

Another takeaway is that there are many different factors working together to shorten AI timelines.  These include: increasing investment as AI continues to improve; the ability of AI to speed up further AI development even before it reaches human level capabilities; rapid progress in AI algorithms; and the fact that training an AI takes much more compute than running it does. If you have enough compute to train an AI system, you have enormously more compute than you need to run it. So if you have techniques that let you spend extra compute to get better performance, this could boost the system a lot.

 

One final takeaway is that it’s very hard to find a realistic set of inputs to this model that doesn’t get us to AI that can perform any cognitive task by around 2060.  Even if AI progress in general is slower than we expect, and reaching human capabilities is a harder task for AI than we expect, it’s very unlikely that world-changing AI systems are more than a few decades away.

 

Importantly, this does depend on the assumption that it's possible to build superhuman AI with the current paradigm, although of course new paradigms may also be developed.

 

This model, like any model, has its limitations.  For one, any model is only as good as the assumptions that went into it, though guessing numbers and making a model is usually better than just making a guess of a final answer.  See our videos on Bayesian reasoning and prediction markets for more on that point.  Davidson outlines the reasoning behind each assumption in his full report.  He also discusses some factors that weren’t included in the model, like the amount of training data that advanced AI models would need.

 

More generally, models like these are meant to give us the tools to think about the worlds they describe.  Economics itself is not a field known for making perfect predictions of the long term future, but it’s given us a toolbox for understanding markets and human behavior that has proven incredibly useful.  Hopefully, by applying those same strategies to forecasting AI, we can better prepare ourselves for whatever the future has in store for us.

  1. ^
  2. ^
  3. ^
  4. ^


Discuss