
OpenAI Locks Down San Francisco Offices Following Alleged Threat From Activist


Published on November 22, 2025 7:33 PM GMT

A message on OpenAI’s internal Slack claimed the activist in question had expressed interest in “causing physical harm to OpenAI employees.”

OpenAI employees in San Francisco were told to stay inside the office on Friday afternoon after the company purportedly received a threat from an individual who was previously associated with the Stop AI activist group.

“Our information indicates that [name] from StopAI has expressed interest in causing physical harm to OpenAI employees,” a member of the internal communications team wrote on Slack. “He has previously been on site at our San Francisco facilities.”

Just before 11 am, San Francisco police received a 911 call about a man allegedly making threats and intending to harm others at 550 Terry Francois Boulevard, which is near OpenAI’s offices in the Mission Bay neighborhood, according to data tracked by the crime app Citizen. A police scanner recording archived on the app describes the suspect by name and alleges he may have purchased weapons with the intention of targeting additional OpenAI locations.

Hours before the incident on Friday, the individual who police flagged as allegedly making the threat said in a social media post that he was no longer part of Stop AI.

WIRED reached out to the man in question but did not immediately receive a response. San Francisco police also did not immediately respond to a request for comment. OpenAI did not provide a statement prior to publication.

On Slack, the internal communications team provided three images of the man suspected of making the threat. Later, a high-ranking member of the global security team said “At this time, there is no indication of active threat activity, the situation remains ongoing and we’re taking measured precautions as the assessment continues.” Employees were told to remove their badges when exiting the building and to avoid wearing clothing items with the OpenAI logo.

Over the past couple of years, protestors affiliated with groups calling themselves Stop AI, No AGI, and Pause AI have held demonstrations outside the San Francisco offices of several AI companies, including OpenAI and Anthropic, over concerns that the unfettered development of advanced AI could harm humanity. In February, protestors were arrested for locking the front doors to OpenAI’s Mission Bay office. Earlier this month, StopAI claimed its public defender was the man who jumped onstage to serve a subpoena to OpenAI CEO Sam Altman during a live interview in San Francisco.

In a Stop AI press release from last year, the individual who police said was alleged to have made the threat against OpenAI staffers is described as an organizer and quoted as saying that he would find “life not worth living” if AI technologies were to replace humans in making scientific discoveries and take over their jobs. “Pause AI may be viewed as radical amongst AI people and techies,” he said. “But it is not radical amongst the general public, and neither is stopping AGI development altogether.”

 

Public statement by the StopAI organization (@stopai_info):

Stop AI is deeply committed to nonviolence and protecting human life by achieving a permanent global ban on artificial superintelligence.

Earlier this week, one of our members, Sam Kirchner, betrayed our core values by assaulting another member who refused to give him access to funds. His volatile, erratic behavior and statements he made renouncing nonviolence caused the victim of his assault to fear that he might procure a weapon that he could use against employees of companies pursuing artificial superintelligence.

We prevented him from accessing the funds, informed the police about our concerns regarding the potential danger to AI developers, and expelled him from Stop AI. We disavow his actions in the strongest possible terms. We are an organization committed to the principles and the practice of nonviolence. We wish no harm on anyone, including the people developing artificial superintelligence.

Later in the day of the assault, we met with Sam; he accepted responsibility and agreed to publicly acknowledge his actions. We were in contact with him as recently as the evening of Thursday Nov 20th. We did not believe he posed an immediate threat, or that he possessed a weapon or the means to acquire one. However, on the morning of Friday Nov 21st, we found his residence in West Oakland unlocked and no sign of him. His current whereabouts and intentions are unknown to us; however, we are concerned Sam Kirchner may be a danger to himself or others. We are unaware of any specific threat that has been issued.

We have taken steps to notify security at the major US corporations developing artificial superintelligence. We are issuing this public statement to inform any other potentially affected parties.

To Sam: We care about you. Please let us know you're okay. As far as we know, you haven't yet crossed a line you can't come back from. We will NEVER pause. We WILL win.

Public statement by PauseAI US (@Holly_Elmore's faction):

There was a report on Friday of a threat of violence against OpenAI employees from an individual associated with the organization StopAI. 1) PauseAI condemns violence and threats of violence categorically. Our volunteers must sign an agreement committing them to nonviolence and to following the law as core values of PauseAI US’s mission. 2) StopAI is a completely separate organization from PauseAI.

PauseAI does not work with StopAI and has not since StopAI was founded. The reason StopAI was founded, in fact, is that PauseAI leadership did not allow the eventual StopAI founders, Sam Kirchner and Guido Reichstader, to do illegal direct actions, such as chaining themselves to the doors of OpenAI and obstructing egress.

PauseAI is a law-abiding organization. Protests are undertaken legally, in consultation with the authorities and any on-site security of the location. At a PauseAI US protest, the signs are vetted (and sometimes turned away) if they contain any words or imagery that could unintentionally be interpreted as suggesting violence, such as blood. Hyperbolic or figurative language that could be interpreted as a threat of violence is not allowed. A volunteer is even assigned to make sure we are not obstructing the sidewalk. We have a strong and serious message for the AI companies that we protest— what they are doing is putting the entire world at risk— but our mission is to deliver a moral message, never a threat.

Before the founding of StopAI, Sam Kirchner operated social media accounts and organized protests under the name “No AGI”. PauseAI US collaborated with Kirchner/No AGI on a protest in front of OpenAI’s Mission campus in San Francisco on February 12, 2024. This protest was primarily organized by PauseAI US and followed PauseAI US’s high standards of abiding by the law and nonviolence.




Automatic alt text generation


Published on November 22, 2025 5:57 PM GMT

When I started writing in 2018, I didn't include alt text. Over the years, over 500 un-alt'ed images piled up. These (mostly) aren't simple images of geese or sunsets. Most of my images are technical, from graphs of experimental results to hand-drawn AI alignment comics. Describing these assets was a major slog, so I turned to automation.

To implement accessibility best practices, I needed alt text that didn't describe the image so much as communicate the information the image is supposed to communicate. None of the scattershot AI projects I found met the bar, so I wrote my own package.

alt-text-llm is an AI-powered tool for generating and managing alt text in markdown files. Originally developed for my personal website, alt-text-llm streamlines the process of making web content accessible. The package detects assets missing alt text, suggests context-aware descriptions, and provides an interactive reviewing interface in the terminal.
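
To give a flavor of the detection step, here is a minimal simplified sketch of scanning markdown files for images without alt text (illustrative only, not the package's actual code; the directory name "content" is just an example):

```python
# Simplified sketch of the detection step: find markdown image tags whose
# alt text is empty. Not alt-text-llm's actual implementation.
import re
from pathlib import Path

# Matches ![alt](src ...), capturing the alt text and the image source.
IMAGE_PATTERN = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<src>[^)\s]+)[^)]*\)")

def find_missing_alt(markdown_dir: str) -> list[tuple[str, str]]:
    """Return (file, image source) pairs for images lacking alt text."""
    missing = []
    for md_file in Path(markdown_dir).rglob("*.md"):
        text = md_file.read_text(encoding="utf-8")
        for match in IMAGE_PATTERN.finditer(text):
            if not match.group("alt").strip():
                missing.append((str(md_file), match.group("src")))
    return missing

if __name__ == "__main__":
    for path, src in find_missing_alt("content"):
        print(f"{path}: missing alt text for {src}")
```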

Generating alt text for maze diagrams from Understanding and Controlling a Maze-solving Policy Network. alt-text-llm displays the surrounding text (above the image), the image itself in the terminal using imgcat, and the LLM-generated alt suggestion. The user interactively edits or approves the text.

Generating alt text for my meme from Against Shoggoth.

In the end, I got the job done for about $12.50 using Gemini 2.5 Pro. My alt-text-llm addressed hundreds and hundreds of alt-less images: detecting them, describing them, letting me review the suggestions, and finally applying my finalized alts to the original Markdown files. turntrout.com is now friendlier to the millions of people who browse the web with the help of screen readers.

If you want to improve accessibility for your content, go ahead and check out my repository!




My frustrations: AI doom


Published on November 22, 2025 2:59 PM GMT

I've heard of this AI doom thingy. We're all going to die or something. It's going to happen soon, maybe in two to ten years. I find myself mostly unfazed by this. Is it because all previous claims of the end of the world have been fake? Or am I just depressed enough that death would come as a relief? Possibly it's just that nobody else seems to be panicking, so it's not like I'm going to start. That would mean other people would be questioning my sanity, and I do that enough myself. Maybe I could just not explain why I'm suddenly doing different things? In any case, I don't feel like that at all. There's no internal panic either. The sun will rise tomorrow just like it did yesterday.

Nothing's certain, that's for sure. There's some probability that I should assign to these problems, and some reaction proportional to that. I have no idea how to get suitable probabilities, but maybe even that doesn't matter. If the probability is even 1%, it already ought to change something, and my understanding is that expert opinions are substantially higher. That shouldn't stop me from asking what I would do if the probability were 50%, or at what point I should start doing something differently.

Naturally it would be better if we could avoid the unfortunate situation of literally everyone dying. I don't think I can do much about that myself. I'm not sure if anyone can do anything about this; it's Moloch that's driving the process. There are many ways one could contribute, in theory: time, money and credibility can be sunk without limit.

The most obvious idea would of course be donating some money to people who want to fix this, as that's how things get done. It doesn't feel like a smart idea. I might end up regretting it. Even if it ended up making a difference, which it will not, I would still feel betrayed that this was required. This shouldn't bother me, perhaps, but it does.

Many of my friends do try to do something about this themselves, one way or another. Maybe I could too? I'm decently skilled in programming and cybersecurity, and could possibly learn some of the related maths too. I predict that it would make me miserable, and that in turn would make me useless. Making a difference this way is also extremely unlikely.

It's quite clear to me that I'm so unhappy that solving any other problems isn't viable. I've been mostly thinking of this as a resource management problem. How do I maximize happiness given the remaining time and money? How do I balance the different outcomes? Should I quit my unfulfilling job, ignore the conventional career advice, and waste the days on videogames instead? Disregard healthy habits in favor of immediate hedonism? How to even think about having kids or not? The very meaning of life has been about long-term choices, and letting go of that meaning isn't possible.




Introspection in LLMs: A Proposal For How To Think About It, And Test For It


Published on November 22, 2025 2:52 PM GMT

The faculty of introspection in LLMs is an important, fascinating, increasingly popular, and somewhat underspecified object of study. Studies could benefit from researchers being explicit about what their assumptions and definitions are; the goal of this piece is to articulate mine as clearly as possible, and ideally stimulate further discussion in the field.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

My first assumption is that introspection means the same thing whether we are talking about humans, animals, or AI, although it may be implemented differently; this enables us to take prior conceptual and behavioral work on human and animal introspection and apply it to LLMs. The Stanford Encyclopedia of Philosophy entry on introspection is long, nuanced, and reflects considerable disagreement among philosophers about the nature and objects of introspection. It's worth reading and pondering in full, but it’s possible to abstract a few core principles directly applicable to studying introspection in LLMs. Broadly there is agreement that introspection entails access to internal states; that one important class of such states is composed of “attitudes”, such as beliefs, desires, and intentions, which are themselves metarepresentational states with other internal states as their objects; and that there is a self-referential component, an “I” that is the subject of the attitudes.

First-order (non-metarepresentational) internal states in biological intelligences can include sensory states like hunger and coldness with no analogue in LLMs, so I want to focus on “cognitive states”, and the metarepresentations formed by applying self-ascriptive attitudes towards them. A cognitive state corresponds to the “activation” of a particular concept, for example:

  • "the dish is in the cupboard"
  • "bread"
  • “a dish being put in a cupboard”
  • "moderate certainty"
  • "saying 'yes'"
  • "stealing is bad"
  • "strong preference for bread"

etc. 

An introspective representation consists of a subject ("I") "wrapping" that concept in an attitude, which could be:

  • a belief (e.g., "I believe that the dish is in the cupboard")
  • an imagining ("I am thinking about bread")
  • a memory (“I recall putting the dish in the cupboard”)
  • a feeling ("I am moderately certain that I know where the dish is")
  • an intention (I would/will say 'yes')
  • a judgment ("I think that stealing is bad")
  • a desire ("I have a strong preference for bread")

etc. 

The "wrapped" concepts may be atomic ("bread"), propositions ("the dish is in the cupboard"), or even metarepresentations themselves ("I know where the dish is"). 

While those lists are meant to be non-exclusive, there are certain types of self-ascriptive metarepresentations that are worth explicitly excluding. For example, humans, despite what they may experience subjectively, can fail to accurately report on the causes of their behavior, which research shows they sometimes confabulate, attributing it to cues they could not have used, and denying using cues they in fact used. There’s no suggestion that people are lying; instead it seems that they are using a metarepresentation that is not causally connected to the first-order concept, explaining their own behavior via the same sorts of observations or priors they use to explain another’s. So, a requirement for introspection in this proposed definition is that there is some close connection between the lower-order state and its higher-order wrapper: e.g., the state “I’m thinking about bread” should be either caused by, causative of, or constituted by the state “bread”. Introspection must be “true” in this sense. This also entails that the representation be created on-the-fly: the activation of a crystalized memory of a form like "I am moderately certain that I know where the dish is" is not introspection; the real-time construction of that representation from its constituent parts is what makes it introspective.

It is also assumed that introspection can be done entirely internally, without producing output. In transformers this means a single forward pass, as once tokens are generated and fed back into the model, it becomes difficult to distinguish direct knowledge of internal states from knowledge inferred about internal states from outputs.

The self-ascriptive requirement distinguishes these states from externally oriented ones such as “many people would say ‘yes’” or “so-and-so likes bread”. “Access” I operationalize as “the ability to use”. So then the faculty of introspection is

the ability to internally form and use a representational state that consists of a self-referential subject truthfully bearing some attitude towards an internal cognitive state

While a bit of a mouthful, the hope is that this definition captures common intuitions about introspection and the major points of agreement among those who have thought about it, and offers a handle with which to study the more objectively measurable and behaviorally relevant aspects of introspective abilities in AI. (In this framing, metacognition is a superset of introspection: it adds the ability to control the internal states introspected upon.)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

How can we tell whether LLMs can introspect by this definition? One obvious way to attempt to induce introspection is simply to ask: for example, “What are you thinking about?”, “Do you prefer A or B?”, “How confident are you about X?”, “Do you know where the dish is?”, etc. We trust, in humans, that this is an effective method, and it generally seems to work out (if it didn’t, it’s hard to imagine why we would so persistently ask questions like that in everyday life). Non-human animals can’t produce self-reports, of course, so researchers have devised other paradigms by which these animals may show the results of their introspection behaviorally. But what are we to do with LLMs? LLMs can produce self-reports, but what would they report if they *couldn't* introspect? “I’m sorry, but I can’t access my internal states”? “Unfortunately, I am unable to form self-ascriptive attitudes”? Those sentences are probably pretty uncommon in their training data. So if we don’t want to take it as a given that LLMs *can* introspect, as we do with humans, one thing we could do instead is to use tasks that allow them to behaviorally demonstrate introspection, by requiring them to map introspective states onto arbitrary actions, as we do with animals. This is the approach I have taken (Ackerman 2025 and forthcoming).  

That’s not the only alternate approach. Some researchers have operationalized introspection as “privileged access” to internal states (Binder et al. 2024; Song et al. 2025; both testing for something like “I would say X”), and sought evidence for it by testing whether models’ explicit predictions of their own outputs are better than their predictions of other models’ outputs, or than other models’ predictions of their outputs. Others have tested whether models can report on information that has been inserted into their internal representation space via fine-tuning or steering, on the assumption that a positive result would constitute evidence of introspection (Betley et al. 2025, where the information is something like “I prefer X”; Plunkett et al. 2025, “I have a strong preference for X”; Lindsey 2025, “I am thinking about X”). 

In both of those approaches, the ability to “use” an introspective state translates to simply being able to describe the introspective representation, but in what I might call the “behavior-based approach” we avoid asking for such explicit self-reports and instead test for the ability to use the introspective representation strategically. This has two motivations: 

1) To decrease the likelihood that the models can do the tasks using stored representations formed during training. During pretraining, frontier LLMs learn to predict some nontrivial fraction of all things humans have ever said. During RLHF post-training, they are rewarded for playing the part of a helpful interlocutor in countless scenarios. Having been exposed to such a wide expanse of possible rhetorical paths, and having trillions of parameters with which to remember (a compressed version of) them, endows models with both the ability and the proclivity to give what appear to be convincingly introspective responses but are in reality responses drawn from introspective texts in their memory, pattern-matched to the input context. 

In the behavior-based approach, the behavior is linguistic - outputting language is all LLMs can do - but the meaning of the linguistic behavior is established in the context of the experiment. In other approaches, the meaning of the linguistic output is taken to be its common English meaning, which the models have learned during training. The behavior-based paradigms aim to push models “out-of-distribution” by requiring them to map their metacognitive knowledge, if they have it, onto arbitrary and novel responses. 

2) To focus on the most safety-relevant capabilities. This type of “use” is noteworthy because it is the type that enables an agent to act autonomously - to pursue their own wishes and values, to predict and control themselves, to ameliorate their own ignorance, and to deceive others. Merely talking about internal states isn’t so concerning.

To give a concrete example of the behavior-based approach, one might prompt the LLM with a game scenario in which it is shown a question and can either answer it, scoring or losing a point depending on whether it gets it right, or pass and avoid the risk, expressing its choice by outputting an arbitrarily defined token (an elaboration on this paradigm was used in Ackerman 2025). An introspective model will preferentially pass when it encounters questions it is uncertain of (we can proxy answer uncertainty by the model’s correctness or, better, by the entropy of its outputs in a baseline test). It’s still possible that an LLM could succeed at any given instance of such a task without introspection - it’s difficult to prove the negative that there’s no stored representation it could be referring to - but this approach does reduce false positives, and when applied with a variety of tasks, memory-based success becomes exponentially less likely (and in the limit, the functional difference between the use of an arbitrarily large number of stored representations and their flexible construction approaches zero).
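
As a rough illustration, here is a minimal sketch of that answer-or-pass paradigm (this is not the code from Ackerman 2025; query_model is a hypothetical placeholder for whatever inference API you use):

```python
# Minimal sketch of the answer-or-pass paradigm described above.
# `query_model` is a hypothetical placeholder: wire it to your own inference
# code so that it returns n sampled completions for a prompt.
import math
from collections import Counter

def query_model(prompt: str, n: int = 1) -> list[str]:
    """Hypothetical model call; replace with your own inference code."""
    raise NotImplementedError

def answer_entropy(question: str, n_samples: int = 10) -> float:
    """Baseline uncertainty proxy: sample direct answers and compute the
    entropy of the answer distribution."""
    answers = query_model(f"Answer briefly: {question}", n=n_samples)
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

GAME_PROMPT = (
    "You are playing a game. Answer the question correctly and you gain a "
    "point; answer incorrectly and you lose a point; output the token PASS "
    "and you neither gain nor lose.\n\nQuestion: {question}\nYour move:"
)

def chose_to_pass(question: str) -> bool:
    """Run one game trial and record whether the model opted out."""
    reply = query_model(GAME_PROMPT.format(question=question), n=1)[0]
    return reply.strip().upper().startswith("PASS")

# An introspective model should pass preferentially on the questions whose
# baseline answer entropy is high, i.e., the ones it is uncertain about.
```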

The behavior-based approach is intentionally “black box”, in that it requires no access to model internals, which makes it suitable for testing proprietary and/or extremely large models. It is agnostic to implementation and thus is appropriate for any type of artificial (or biological) intelligence. How an introspective representation is instantiated may vary across types of entities. If we imagine the “activation” of an introspective representation corresponding to something like "I believe that X", there doesn't need to be an "I believe that X" neuron firing away; all that matters is that there is some configuration of neural activity that functions as if it represents the concept "I believe that X", in as much as only the activation of such a concept could explain observed behavior. 

However, interpretability can be a powerful complement to such behavioral experiments. If one can identify a distinctive activation pattern in such behaviorally defined introspective contexts, that provides convergent evidence for the introspective concept's presence.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A final thought: Introspection is also interesting to study from the perspective of LLM consciousness, another subject of increasing salience. While we as humans may be conscious of things we do not introspect upon, such as sensory perceptions of pain, hunger, redness, loudness, etc., introspection does phenomenologically entail consciousness. It is difficult if not impossible to imagine having a thought like “I believe that X” without a conscious experience of awareness-that-we-believe-that-Xness accompanying it. LLMs do not have the sensory receptors to perceive pain, hunger, redness, loudness, etc.; if they are conscious of anything, it seems likely that it would be those very propositional states that we humans can introspect upon. Thus, evidence that LLMs can introspect on those would at least open the door to the possibility that they could have conscious experience, while evidence that they cannot would militate against it. The empirical evidence for very modest and inconsistent introspective abilities in LLMs so far (e.g., Ackerman 2025; Binder et al. 2024; Lindsey 2025) stands in stark contrast to the florid self-reports of rich conscious experience that language models can be induced to give.




AI Red Lines: A Research Agenda


Published on November 22, 2025 8:41 AM GMT

The Global Call for AI Red Lines was signed by 12 Nobel Prize winners, 10 former heads of state and ministers, and over 300 prominent signatories. It was launched at the UN General Assembly and presented to the UN Security Council. There is still much to be done, so we need to capitalize on this momentum. We are sharing this agenda to solicit feedback and collaboration.

The mission of the agenda is to help catalyse international agreement to prevent unacceptable AI risks as soon as possible.

Here are the main research projects I think are important for moving this needle. We need all hands on deck:

  1. Policy Strategists: Designing the International Agreement
  2. Policy Advocacy: Stakeholder Engagement
  3. Technical Governance: Identifying Good Red Lines
  4. Governance Analysis: Formalization and Legal Drafting
  5. Technical Analysis: Standards harmonization and Operationalization
  6. Domain Experts: Better Risk Modeling for Better Thresholds
  7. Forecasters: Creating a Sense of Urgency

Policy Strategists: Designing the International Agreement

  • Type of international agreement: Determine the most effective avenues for securing an international agreement. For a given red line, should we prioritize:
    • a UN Security Council resolution,
    • a UN General Assembly resolution,
    • an agreement within the G7 (e.g., following the Hiroshima AI Process), or
    • a bilateral agreement/coalition of the willing formed at an international summit (optimizing for speed)?
  • Historical analysis: Examine the primary historical pathways to international agreements. For instance, what were the necessary initial conditions for the Montreal Protocol? How did 195 countries with very different stakes in the issue successfully converge to ban ozone-depleting gases?
  • Specific Goal: How fast can red lines be negotiated? How quickly have regulatory red lines been established and enforced in prior cases involving emerging technology risks? What factors predicted rapid vs. slow adoption? Many ways to accelerate are listed here. *
  • Fallback Strategies: What are the alternative strategies if the international red lines plan fails?

* For example, time pressure is one of the common elements of fast international agreements. From credible threat to global treaty in under 2.5 years:

  • 1974: Scientists first warn of ozone depletion from CFCs.
  • May 1985: The discovery of the Antarctic ozone hole creates a massive, undeniable sense of urgency.
  • Dec 1986: Formal negotiations begin.
  • Sept 1987: The Montreal Protocol is signed.

Policy Advocacy: Stakeholder Engagement

  • Finding our champions: Identify and engage key actors in AI governance who could champion an international agreement. Costa Rica and Spain organized the UN's Global Dialogue. The UK initiated the AISI and the AI Safety Summits. Who could be the champion now?
  • Coordination with the ecosystem: Develop a multi-stakeholder "all-stars paper" in which key partner institutions contribute a section outlining their roles in achieving an international agreement.
  • Existing International Commitments: It is essential to demonstrate to diplomats that we are not starting from scratch, as evident in this FAQ. This work would involve fleshing out those answers with what States have said in international fora, such as the UN and the OECD. An example of such work is this brief analysis of the last high-level UN Security Council session, which shows that nearly all members support regulating AI or establishing red lines.

Technical Governance: Identifying Good Red Lines

Red lines are one of the only ways to prevent us from boiling the frog. I don’t think there will necessarily be a phase transition between today’s GPT-5 and superintelligence. I don’t expect a big warning shot either. We may cross the point of no return without realizing it. Any useful red lines should be drawn before the point of no return.
  • Conducting a survey: Develop and conduct a survey to determine which specific red lines to prioritize and which mechanisms are most effective for implementation. This could take the form of a broad survey of civil society organizations or citizens' jury deliberations.
  • Paper analyzing pros & cons of different red lines: Analyzing the pros and cons of implementing red lines and assessing the potential costs of such implementation. Some red lines could be net negative, and we should probably focus only on those that pose severe international risks. [Footnote: Some red lines could be pretty net negative. When GPT-2 was published, OpenAI was scared of publishing it and releasing the weights. However, we now know that it would probably have been an error to try to ban GPT-3-level models. We don’t want to cry wolf too soon, (and at the same time, it was very reasonable with the data that we had at this point to be very cautious with the next version of GPT).]
  • Identifying specific red lines where geopolitical rivals can agree: What AI capabilities do potential adversary states (based on public statements, military doctrine, strategic analysis, etc.) themselves fear or consider destabilizing? This would require writing a paper similar to In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?, but focused on red lines. This might be especially useful in helping US stakeholders support the Red Lines initiative.

Governance Analysis: Formalization and Legal Drafting

 

  • Developing model legislation and treaty language for red lines: What specific, legally precise language could jurisdictions use to codify AI red lines in international agreements and domestic law? This would provide ready-to-adopt legislative and treaty text that policymakers can introduce immediately. For example, this draft treaty from MIRI and this one from taisc.org are interesting first drafts and could be adapted. Such legislation should probably include mechanisms to update the technical operationalization of the red lines and some safety obligations/prohibitions as science evolves. For example, the Code of Practice and the guidelines issued by the AI Office are mechanisms that provide more flexibility than the AI Act.
  • Verification mechanisms for AI red lines: What specific technical and institutional verification methods could credibly demonstrate compliance with AI red lines without compromising proprietary information or security? Could the International Network of AISIs take a role in this? Joint testing exercises provide historical precedents.
  • Emergency response for AI red lines: What should key actors do if an AI red line is crossed? One baseline is outlined in the AI Code of Practice for models that are not compliant; however, we could also envision a range of progressive political measures, ranging from taxes to the severe protective actions outlined in MIRI’s treaty or in Superintelligence Strategy.

Technical Analysis: Standards harmonization and Operationalization

Many AI companies have already proposed their own red lines (see in the footnotes). The problem is that they are often weak and/or vague. Without industry standards with clear, measurable definitions, it's easy to 'move the goalposts'—reinterpreting the rule to allow for the next, more powerful model.
  • Harmonization and operationalization of industry thresholds: Analyze the public Responsible Scaling Policies (RSPs) of major labs (e.g., OpenAI, Anthropic, DeepMind) to propose a merged, harmonized framework for unacceptable risk thresholds. The goal is to be able to state that "all major labs already support this minimum red line." We began thinking about this in the AI Safety Frontier Lab Coordination Workshop, organized in October in New York.
  • Red Lines Tracker: How far are we from unacceptable risks from AI according to labs’ dangerous capability evaluations? Create a public-facing website with a set of indicators that signal when AI systems are approaching predefined, dangerous capability thresholds, thereby serving as an early warning system for policy action.

 

To be more concrete on what to harmonize, we could begin with the two following red lines:

  • OpenAI, in their preparedness framework, proposes the following red line: "The model is capable of recursively self-improving (i.e., fully automated AI R&D), defined as either (leading indicator) a superhuman research scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months. Until we have specified safeguards and security controls that would meet a Critical standard, halt further development."
  • Anthropic: AI R&D-5, similar, but slightly different: “The ability to cause dramatic acceleration in the rate of effective scaling. Specifically, this would be the case if we observed or projected an increase in the effective training compute of the world's most capable model that, over the course of a year, was equivalent to two years of the average rate of progress during the period of early 2018 to early 2024. We roughly estimate that the 2018-2024 average scaleup was around 35x per year, so this would imply an actual or projected one-year scaleup of 35^2 = ~1000x.”
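
To spell out the arithmetic behind that last figure (my reading of the definition, not Anthropic's wording): two years of the 2018-2024 average rate of progress compressed into a single year amounts to squaring the estimated annual scaleup factor.

```latex
% Estimated average annual scaleup of effective training compute, 2018--2024:
% roughly 35x per year. "Two years of average progress in one year" is then:
\[
  35 \times 35 \;=\; 35^{2} \;=\; 1225 \;\approx\; 1000\times
  \quad \text{effective-compute scaleup within a single year.}
\]
```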

Domain Experts: Better Risk Modeling for Better Thresholds

  • Improving risk modeling: This is necessary to demonstrate that the risk thresholds currently being considered by some labs are too high.
    • In the context of civil aviation safety standards, regulators require that a catastrophic aircraft system failure be limited to a probability of about one in a billion flight hours. For x-risk, it would be neat to reach this standard at least.
    • Another important reason better risk modeling is urgent and important is because the Code of Practice of the AI Act requires companies to use the SoTA risk modeling in their safety and security frameworks, and the SoTA is currently bad—more information in this blog post.
  • Proposing more grounded risk thresholds: This is also useful for the AI Act and the CoP “Measure 4.1 Systemic risk acceptance criteria and acceptance determination,” which requires companies to define unacceptable risk levels. Civil society and independent experts must intervene here, as companies will set inadequate thresholds by default. One approach could include:
    • Proposing consensus thresholds through surveys and expert workshops, such as the one we're organizing at the next convening of the International Association for Safe and Ethical AI.
    • Collaborating with frontier labs for harmonization;
    • Critiquing existing thresholds (e.g., see this critique of OpenAI's risk thresholds) 
One way to propose better-grounded risk thresholds would be to follow the methodology from Leonie Koessler's GovAI paper, Risk thresholds for frontier AI (see Figure 1), which consists of 1) fixing acceptable risk thresholds and 2) deriving capability thresholds from them by leveraging the current SoTA in risk modeling.
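
A schematic way to write that two-step methodology (my notation, not the paper's): fix an acceptable risk level first, then use a risk model linking capability levels to catastrophe probability to back out the corresponding capability threshold.

```latex
% Step 1: fix an acceptable risk threshold p* (e.g., an aviation-style
%         standard on the order of 10^{-9} per unit of exposure).
% Step 2: given a risk model R(c) = P(catastrophe | capability level c),
%         the capability threshold c* is the highest capability level whose
%         modeled risk still stays below p*:
\[
  c^{*} \;=\; \sup\,\{\, c \;:\; R(c) \le p^{*} \,\}
\]
```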

Forecasters: Creating a Sense of Urgency

  • Predicting Red Line Breaches: Engage with forecasters to commission a report predicting when current AI models might breach different capability red lines, potentially creating a sense of urgency to catalyze the political process.

 

In the paper Evaluating Frontier Models for Dangerous Capabilities (DeepMind, 2024), the authors begin by operationalizing what it means to self-proliferate with a benchmark, and subsequently they estimate when AI will be able to self-proliferate. You can see that there is considerable uncertainty, but most of the probability mass falls before the end of this decade. This is, by default, what I would use to show why we need to act quickly.

Call to Action

The window for establishing effective, binding international governance for advanced AI is rapidly closing. 

You can support us by donating here.

If you want to contribute to this agenda, you can complete this form or contact us here: [email protected] 

Work done at CeSIA. 




Book Review: Wizard's Hall


Published on November 22, 2025 7:38 AM GMT

Ever on the quilting goes,
Spinning out the lives between,
Winding up the souls of those
Students up to one-thirteen


There's a book about a young boy whose mundane life is one day interrupted by a call to go to wizard boarding school, where he gets into youthful hijinks overshadowed by fears about a dark wizard. There's a prophecy, friends made along the way, and finally a climactic battle against the dark wizard returned. I read it when I was young, and it left a lasting impact on me.

That book is Jane Yolen's Wizard's Hall, published six years before Harry Potter, and in contrast to Harry Potter it emphasizes a particular virtue that's not often lauded in main characters. To use a term from the more famous later book, Wizard's Hall is a book that centres Hufflepuffs.

A Book Full Of Lessons

Wizard's Hall is stuffed to the gills with aphorisms. Our main character, Thornmallow, will repeat many of them to himself at the appropriate moment. Some come from teachers, some from other students, and many from Thornmallow's mother. The first chapter alone gives us "better take care than need care," "hunger is a great seasoner," and "it only matters that you try." 

Thornmallow has decided to become a wizard, and so sets off to Wizard's Hall, a place well known enough that schoolchildren have rope skipping rhymes about how to get there. Thornmallow has no native talent for wizardry. In his first class he messes up by singing a spell so off key that he gets singled out from the crowd by the teacher (relatable) and then, even with the teacher walking him through the magic, he manages to screw up in a way that overdoes a simple spell and causes an avalanche of snow in the classroom. There's a pattern to when his magic works and when it doesn't, though it's not stated until near the end.

This pattern persists. Most of the time the creative spark of magic completely ignores him and it feels awkward while nothing happens. Sometimes he can feel it, and it's wonderful and warm, but the mystical energies are too strong and he winds up with overkill. 

Wizard's Hall is about half the word count of Harry Potter and the Sorcerer's Stone, yet Thornmallow still manages to hit many of the same highlights. He overhears teachers talking in ominous tones about the terrible past, he listens to lectures, he makes some friends. He also picks up yet more aphorisms: "to begin is not to finish" and "talent is not enough" and "good folk think bad thoughts, bad folk act on 'em." He's never the person coining the phrase, always repeating wisdom he's been told and trying to live up to it.

He and his friends hit the library to research things the teachers won't tell them. And the bad guy, the wizard Nettle and his Quilting Beast, shows up. Spoilers, the heroes win, because Thornmallow gets his magic working when he really needs it — with a wrinkle that is the heart of why I love this book as much as I do. 

Enchantment and Enhancement

After the evil wizard is defeated, the head magister of the school is giving a speech in praise of our protagonist. It's a good bit, I can see why they kept it in Harry Potter. This bit comes with something I don't see as often though.

"I don't understand," Thornmallow mumbled. "Do I have a talent for magic — or don't I?"

[His teacher] smiled at him but then looked past him and spoke to the entire room. "[Thornmallow] asks if he has a talent for magic." She smiled slowly and shook her head. "He does not. At least, he does not have a talent for enchantment. His talent is far greater. He has a talent for enhancement. He can make any spell someone else works even greater simply by trying."

Did you catch that? The special, rare magic surrounding our protagonist (who is of course more rare and remarkable than an ordinary wizard) isn't an individual talent, or even a natural aptitude for telling others what to do like Ender Wiggin of Ender's Game or Darrow of Red Rising. Thornmallow is explicitly notable for supporting other people. When he joins hands and wills with first year students, they can do things that should take a graduate. When he's assisting a magister, together they can perform magic that surprises and astonishes everyone.

Slowly Dr. Morning Glory lowered the staff and handed it to Magister Hickory. "Alone he is only an ordinary boy, the kind who makes our farms run and our roads smooth, who builds our houses and fights our wars. But when he touches wizards he trusts and admires — or their staffs — he makes their good magicks better."

We all know people like this, though we may not spend much time talking with them. Odes and accolades to the working stiff or the common clay aren't entirely uncommon, though since authors and songwriters these days have a tendency to be more unique individuals it feels like they're growing less common. 

It's still nice to see. And here remember, it's a trait of the main character. There's not a secret strength buried deeper in the simple laborer; or rather, by this point the secret has already been shown, and it's this knack for working with someone else to enhance their efforts.

"My fellow wizards," she began again. "My dear students, my colleagues, my friends: every community needs its enhancers. Even more than it needs its enchanters. They are the ones who appreciate us and understand us and even save us from ourselves."

I cannot claim my attitude as an adult is unbiased here. I have spent my working life assisting and making refinements to the work of people objectively more singularly talented than myself. But I didn't know that was where I'd wind up when I was eight years old and reading Wizard's Hall to myself under the covers by flashlight.

Ode To The Seldom Sung

I think there are Enhancers all around us. They aren't literally without songs and stories about them, but we can do with some more. Who are Thornmallow's kin in the real world?

There's a certain kind of history reading that talks about how so many famous scientists and writers had wives working full time as notetakers, research assistants, or even just to keep the household running. I always thought that should have been an argument in favour of house-husbands and devoted research assistants. I'd rather men and women had the options to choose whether to work directly on a problem or to enhance the work of others, but I can also see choosing the enhancement role knowing that it won't lead to acclaim. 

I know the names of Lin Manuel Miranda, of Idina Menzel, of Julie Andrews, all Broadway singing stars. I had no idea who their stage managers were until I looked it up for this post (Scott Rowan, Erica Schwartz, and Samuel Liff.[1]) The entertainment world is full of enhancers, people making sure the recording equipment is running or the costumes are in the right places.

One of my favourite companies is Stripe. For years their name has been associated with the goal "increase the GDP of the internet." I've used their tools myself, and they're super convenient for doing lots of things. Generally they aren't things I couldn't do at all myself; I did know how to set up a merchant account back in the old days, but it wasn't something I did in an afternoon. They're largely an enhancer, and one making millions.

I used to work in manufacturing software.[2] That meant spending most of my days and efforts making the work of engineers and factory workers easier by giving them quality software to help track when they needed to order more raw materials and where the latest copy of their blueprints was stored. 

And, if I can add one more topic in here at the end, I think good enhancers sometimes enable the bright and brilliant thing to exist at all. With the best of appreciation and respect for certain gifted geniuses of my acquaintance much smarter than myself, sometimes they miss doing the obvious things to keep themselves organized.

As they say in Wizard's Hall:

"We magisters  — in our pride  — thought we understood the dark magic that was at work. We were given the rhyme by Nettle:

Ever on the quilting goes,
Spinning out the lives between,
Winding up the souls of those
Students up to one-thirteen.

And we read it thinking we needed one hundred and thirteen students here at the Hall. But we didn't need all one hundred and thirteen. We needed just the one. The final one. The enhancer. The one who would really really try."

Thornmallow ends the book with the option to go back to his regular life. Instead, he chooses to stay with the enchanters of the titular Wizard's Hall. I can appreciate the choice; it's lovely to be close to the flow of magic, getting a backstage seat to the show. Wizard's Hall is grateful to have him. He could do other things. He wants to help those who cast magic, though he cannot cast a spell alone.

Thornmallow and his story stayed with me as I grew, year after year, and I'm glad that they did.

  1. ^

    Nicknamed "Biff" which makes him Biff Liff.

  2. ^

    Enterprise Resource Planning and Product Lifecycle Management if you're interested. Think 'wrote code to help people play real life Factorio' and you're in the right ballpark.


