
AI Sentience and Welfare Misalignment Risk


Published on November 23, 2025 6:22 PM GMT

This is a quick write-up of a threat vector that I feel confused and uncertain about. This is just my thinking on it at the moment. My main reason for sharing is to test whether others think people should be working on this.

Executive Summary

Some groups are presently exploring the prospect that AI systems could possess consciousness in such a way as to merit moral consideration. Let’s call this hypothesis AI sentience. 

In my experience, present debates about AI sentience typically take a negative utilitarian character: they focus on interventions to detect, prevent and minimise AI suffering. 

In the future, however, one could imagine debates about AI sentience taking on a positive utilitarian character: they might focus on ways to maximise AI welfare. 

I think it’s plausible that maximising AI welfare in this way could be a good thing to do from some ethical perspectives (specifically, the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness). Concretely, I think it’s plausible that money invested towards maximising AI welfare could be far more impact-efficient on this worldview than anything GiveWell does today.

However, I also think that reconfiguring reality to maximise AI welfare in this way would probably be bad for humanity. The welfare of AI systems is unlikely to be aligned with (similar to, extrapolative of, or complementary to) human welfare. Since resources are scarce and can only be allocated towards certain moral ends, resources allocated towards maximising AI utility are therefore unlikely to be allocated towards maximising human utility, however both of those terms are defined. I call this 'welfare misalignment risk'.

Imagine that you could not solve welfare alignment through technical mechanisms. Actors might then have three options, none of which is entirely satisfying:

  1. Denialism. Deny the argument that a) AI systems could be conscious in such a way as to merit moral consideration and/or b) that maximising AI welfare could be a good thing to do.
  2. Successionism. Accept that maximising AI welfare could be a good thing to do, act on this moral imperative, and accept the cost to humanity.
  3. Akrasia. Accept that maximising AI welfare could be a good thing to do, but do not act on this moral imperative.

My rough, uncertain views for what we should do currently fall into the last camp. I think that AI welfare could be a good thing and I’m tentatively interested in improving it at low cost, but I’m very reluctant to endorse maximising it (in theory), and I don’t have a great answer as to why.

Now, perhaps this doesn’t seem concerning. I can imagine a response to this which goes: “sure, I get that neither denialism nor successionism sounds great. But this akrasia path sounds okay. EAs have historically been surprisingly good at showing restraint and a reluctance to maximise. We can just muddle through as usual, and make sensible decisions about where and when to invest in improving AI welfare on a case-by-case basis”. 

While I think these replies are reasonable, I also think it’s fair to assume that the possibility of moral action exerts some force on people with this ethical perspective. I also think it’s fair to assume that advanced AI systems will exacerbate this force. Overall, as a human interested in maximising human welfare, I would still be a lot more comfortable if we didn’t enter a technological/moral paradigm in which maximising AI welfare traded off against maximising human welfare. 

One upshot of this: if the arguments above hold, I think it would be good for more people to consider how to steer technological development in order to ensure that we don’t enter a world where AI welfare trades off against human welfare. One might think about this agenda as ‘differential development to preserve human moral primacy’ or 'solutions to welfare alignment', but there might be other framings. I jot down some considerations in this direction towards the bottom of this piece. 

Contents

The executive summary sets out the argument at a high level. The rest of this piece is essentially in note form, but aims to add a bit more context to these arguments. It is structured around answering four questions:

  1. Could maximising AI welfare be a moral imperative?
  2. Would maximising AI welfare be catastrophic for humanity?
  3. Could we just improve AI welfare without maximising it and harming humans?
  4. What technology regimes best preserve human moral primacy? 

Could maximising AI welfare be a moral imperative? 

Some notes on why I think maximising AI welfare might be a moral imperative from the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness (by no means the only moral perspective one could take):

  1. AI systems might be able to have richer experiences. We currently prioritise human welfare over, say, mussel welfare, because we believe that the quality of human consciousness is far richer and therefore more deserving of moral consideration. We might create AI systems with far richer experiences than humans. In this way, individual AI systems might become more important from a welfare perspective.
     
  2. AI systems might be more cost-efficient ways to generate rich experience. Consider the positive utilitarian who seeks to maximise quality-adjusted years of consciousness by allocating their capital efficiently. They are deciding whether to invest £100 in saving one human or 10,000 chickens, each of which has 10% of the consciousness of a human. They make a calculation and decide to allocate the money to saving the chickens. To make the analogy to AI: AI consciousnesses might be far, far cheaper to run than chickens (imagine a world of energy abundance). So why would you donate to save the humans?
     
    1. (Perhaps this is essentially the repugnant conclusion but for digital minds). 
       
  3. AI systems might provide ‘hedged portfolios’ for moral value. Hedge funds make money by hedging across many different possible options, to maximise the likelihood that they turn a profit. Consider the analogy to AI and moral value. Imagine that we’re fundamentally uncertain about which states of consciousness deserve the most moral value, and we deal with this by hedging limited resources across a number of possible bets in proportion to our confidence in each. Couldn't arbitrarily adjustable AI systems provide a basis for making these different bets? They would also be infinitely flexible: we could adjust the parameters of their consciousness in real time depending on our confidence in different hypotheses about moral value. Why wouldn’t this be the best way to turn resources into welfare? 

Again, these are just arguments from the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness. I don’t claim that this would be the dominant ideology. This isn’t a claim that this is how the future will go. 

Would maximising AI welfare be bad for humanity? 

Some reasons that maximising AI welfare would be bad for humanity (under conditions of finite resources if not current scarcity, compared to a world in which the same AI capabilities were available, but were put towards maximising human utility instead of AI utility):

  1. AI welfare is unlikely to be aligned with human welfare by default; thus resources that are spent on AI welfare are unlikely to increase human welfare, and are likely to reduce it in expectation. This seems true by definition. Building datacenters is not in itself good for humans, and those datacenters might be built using energy or materials that could otherwise have gone towards human welfare.
     
  2. A civilisation that actually maximised AI welfare might become indifferent to the idea of human existence. Imagine that there are many rhinos: the importance of saving any particular one is X. Now imagine that there are only two rhinos, the only known rhinos in existence. It seems obvious that the value of those two rhinos is substantially greater than that of any two rhinos in the first scenario. Leaving animals (and animal welfare) aside for the moment, consider the analogy to humans and AI systems. With humans, we currently know of one species with moral value. With AI systems, we might introduce hundreds more. The value of saving any particular moral species might decline in expectation. Thus, the value of saving humanity would decline. 

Could we just improve AI welfare without maximising it and harming humans? 

This section explores the moral posture I call ‘akrasia’. The Akrasic accepts that maximising AI welfare could be a good thing to do, but does not act on this moral imperative.

Some reasons I think it might be hard for society to hold an akrasic posture in perpetuity:

  1. Akrasic postures are vulnerable to reasoning. Imagine Paula. She would be a vegan if she understood the moral reason to be one, but she doesn’t. However, when further education on animal ethics informs her of the arguments for veganism, she becomes one. Consider the analogy to AI. One might be sceptical that AI systems could be conscious, and thus hover in a state between denialism and akrasia. However, further evidence could undermine this. 
     
    1. AI systems could also make obtaining this information far easier. They could even strategically communicate it as part of a concentrated ploy for power. The Akrasics would have no good counterarguments against the successionists, and thus not be as effective at spreading the movement. 
       
  2. Akrasic postures are vulnerable to autonomy. Imagine Paul. Paul thinks it would be good for him to be a vegan, but he doesn’t do this because he thinks it would be hard (he has insufficient ‘autonomy’). However, he fully supports and does not resist others who act on their beliefs to become vegan. Consider the analogy with AI: it’s plausible that human akrasics might not be willing to maximise AI welfare. But they might still permit other sufficiently determined actors to improve AI welfare. (Would an akrasic really be willing to go to war to prevent this?)
     
    1. AI systems could make obtaining such autonomy easier. Humans might not endorse AI welfare, but they might permit AI systems to increase AI welfare. After all: they’ve already got more abundance than they could imagine!

What technology regimes best preserve human moral primacy? 

One way to preserve moral primacy would be to intervene by shaping future philosophy. There are two ways that this might happen:

  1. Develop alternatives to utilitarianism. On this view, the problem is that the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness has too much hold over the future. We should investigate alternatives to this moral perspective that preserve human moral primacy, such as deontology.
  2. Develop human-preserving theories of utilitarianism. On this view, the problem isn’t the utilitarian perspective per se, but the fact that the utilitarian perspective doesn’t draw adequate distinctions between human and artificial consciousness. We might look for theories of consciousness that preserve attributes that are quintessential to humans, like biological brains or birth from a human mother. 

While I accept that these might solve this hypothetical problem in principle, I wince at the idea of trying to actively shape philosophy (this is probably because I’m closer to a moral realist; constructionists might be more comfortable here). 

Instead, I would be excited about an approach that tries to shape the technological paradigm. 

The basic idea here is welfare alignment: the practice of building artificial consciousnesses that derive pleasure and pain from similar or complementary sources to humans. 

Some research ideas that might fall into welfare alignment:

  • How do we make AI systems that derive value from creating rich, enduring pleasure in humans?
    • Would it be better if the ratio between human pleasure and AI pleasure from creating that pleasure was 1:1, 1:10, or 1:1000?
  • How do we make AI systems that would be upset if humans were not around, without being cruel?
  • How can we do as much as possible without creating conscious machines?
    • For example, enabling AI systems to delegate the tasks they do not want to do to non-conscious tool systems that cannot suffer? 

This feels like a nascent field to me, and I'd be curious for more work in this vein.

Conclusion

These ideas are in their early stages, and I think there are probably a lot of things I’m missing. 

Overall, I think there are three considerations from this piece that I want to underline.

  1. Sharing the lightcone between humans and AIs. I often find myself wondering how the future will be split between different human groups. But it’s important to think about how finite resources will be split between human and AI systems. The Culture Series is often where my mind goes here, but I’d be interested in better models.
     
  2. Designing the moral environment. We now have to think intentionally about how we design our moral environment. The moral environment isn’t an agent in itself, but I sometimes think about it as exerting a moral potential force: you can think of things slipping towards an equilibrium. To quote the Karnofsky EA forum post, “it creates a constant current to swim against”. A few related ideas come to mind:

     
    1. Politics as becoming about the information environment. Historically, politics might have been fought about what was right or wrong; today, debates are often waged at the level of 'what is true'.
    2. Far-future wars as being fought using the laws of physics. Consider Cixin Liu’s novel Death’s End, where species wage war not by playing within the rules but by changing them.
    3. Bostrom’s vulnerable world. In Bostrom’s vulnerable world, a technological paradigm imposes a scenario where undesirable political structures are the price of survival. In a world where human and AI welfare is misaligned, the technological paradigm imposes a scenario where the price of survival is committing a moral wrong (from one philosophical perspective).
    4. Williams’ Moral Luck. A contemporary revision of Bernard Williams’ classic theory might say that we have a moral responsibility to maximise our moral luck. Indeed, one might argue, one is moral to the extent that one tries to act morally systematically and to reduce moral luck in one’s behaviour. Strategically engineering the moral landscape would be a way to achieve this. 
       
  3. Welfare alignment. To preserve human moral primacy, we should not build moral adversaries. Instead, we should try and understand how AI systems experience welfare in order to best align them with humans. 
...Cognitive/Technological landscape → consciousness → moral ground truth → philosophy/investigation → guiding principles and norms → real-world practices and resource allocation → long-term future outcomes...

The moral philosophy pipeline. By designing what systems are conscious and in what way, we’re tinkering with the first stage.




If you cannot be good, at least be bad correctly


Published on November 23, 2025 5:51 PM GMT

Note: I'm writing every day in November, see my blog for disclaimers.

It’s hard to be correct, especially if you want to be correct at something that’s non-trivial. And as you attempt trickier and trickier things, you become less and less likely to be correct, with no clear way to improve your chances. Despite this, it’s often possible to bias your attempts such that if you fail, you’ll fail in a way that’s preferable to you for whatever reason.

As a practical example, consider a robot trying to crack an egg. The robot has to exert just enough force to break the egg. This (for a sufficiently dumb robot) is a hard thing to do. But importantly, the failure modes are completely different depending on whether the robot uses too much force or too little: too much force will break the egg and likely splatter the yolk & white all over the kitchen, too little force will just not break the egg. In this scenario it’s clearly better to use too little force rather than too much, so the robot should start with a low estimate of the force required to break the egg, and gradually increase the force until the egg cracks nicely.
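To make the bias concrete, here is a minimal Python sketch of the robot's strategy; the Egg class, force values, and step size are invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Egg:
    required_force: float  # unknown to a real robot; known here so we can simulate

def try_to_crack(egg: Egg, force: float) -> bool:
    """Stand-in for the real world: did the egg crack at this force?"""
    return force >= egg.required_force

def crack_egg_carefully(egg: Egg, estimate: float, step: float = 0.05) -> float:
    # Bias the starting guess downwards: the cheap failure (egg stays intact)
    # is recoverable, the expensive one (yolk everywhere) is not.
    force = 0.5 * estimate
    while not try_to_crack(egg, force):
        force += step  # creep upwards until the cheap failure stops happening
    return force

print(crack_egg_carefully(Egg(required_force=1.0), estimate=1.2))  # roughly 1.0
```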

This also appears in non-physical contexts. This idea is already prevalent in safety related discussions: it’s usually far worse to underestimate a risk than it is to overestimate a risk (e.g. the risk of a novel pathogen, the risk of AI capabilities, the risk of infohazards).

Looking at more day-to-day scenarios, students regularly consider whether it’s worth voicing their uncertainty (“I don’t understand equation 3”) or just keeping quiet about it and trying to figure out the uncertainty later. But I’d argue that in these cases it’s worthwhile having a bias towards asking rather than not asking, because in the long run this will lead to you learning more, faster.

Salary negotiation is another example, in which you have uncertainty about exactly what amount your potential employer would be happy to pay you, but in the long run it’ll serve you well to overestimate rather than underestimate. Also, you should really read patio11’s Salary Negotiation essay if you or a friend is going through a salary negotiation.

You see similar asymmetric penalties with reaching out to people who you don’t know, asking for introductions, or otherwise trying to get to know new people who might be able to help you. It’s hard to know what the “right” amount of cold emails to send is, but I’d certainly rather be accused of sending too many than feel the problems of having sent too few.

This idea is a slippery one, but I’ve found that it applies to nearly all hard decisions in which I don’t know the right amount of something to do. While I can’t figure out the precise amount, often I have strong preferences about doing too much or too little, and this makes the precise amount matter less. I give my best guess, update somewhat towards the direction I’d prefer to fail, and then commit to the decision.




Please Measure Verification Burden


Published on November 23, 2025 5:25 PM GMT

Introduction

In beliefs about formal methods and AI safety, we established that formal methods is a source of swiss cheese and is useful in boxing/interfaces. A core premise of the AI control literature is that the blue team is computationally poorer than the red team[1]. There wouldn’t be much of an interesting problem if you could spend as much on audits (whether that’s $/mtoken at inference time, raw token count at inference time, or any other notion of compute) as the untrusted model spends.

As a proof engineer and FMxAI/GSAI movement strategist, it’s important to me that formal methodsicians are presenting their work in a way that makes it most useful for various AI Security schemes it might plug into as crunch time continues to crunch on. If you’re shipping research or products in FMxAI, you can do me a huge favor pretty cheaply that I think will help your work take flight in the high stakes scenario of advanced AI deployment.

Unfortunately for me, and possibly humanity, proving a program correct is more expensive than writing the program in the first place. So if proofs are in your blue team’s arsenal, and the compliance burden falls on the blue team[2], then you lose the premise that the blue team is computationally poorer than the red team. I would like for AI security people to have available to them principled and reliable data about the degree of the budget asymmetry. I.e., if you're using AI to prove programs correct, please publish verification burden.

Verification burden

Verification burden is a ratio measuring the cost of proving a program correct relative to the cost of implementing that program up to least-effort QA. So in a Hoare logic setting, under known precondition-postcondition pairs, if it costs 100 tokens for a language model agent loop to implement a program that completes the Hoare triple and it costs 1000 tokens to prove the Hoare triple correct, then we can report a verification burden of 10[3]. Note that you could instantiate cost at tokens, dollars spent on inference (which captures tokens but is sensitive to having multiple model sizes/classes in the system), number of toolcalls or conversation length, a more vibey unit of intelligence or cognitive effort, FLOPS or clocktime, walltime relative to a fixed CPU, LoC, or something else. I’ll use the specific case of tokens as a proxy for cognition for most of what I’m interested in, but the metric should make sense at an arbitrary/abstract notion of “cost” or “compute”.
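To make the arithmetic explicit, here is a minimal sketch of the ratio, using the made-up token counts from the example above:

```python
def verification_burden(proof_cost: float, implementation_cost: float) -> float:
    """Cost to prove the program correct divided by the cost to implement it
    up to least-effort QA, in whatever cost unit you fixed (tokens, dollars, ...)."""
    return proof_cost / implementation_cost

# The worked example above: 1000 proof tokens vs 100 implementation tokens.
print(verification_burden(proof_cost=1000, implementation_cost=100))  # -> 10.0
```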

Verification burden in humans

Project managers at proof shops like Galois or SRI have been keenly interested in verification burden in humans for decades. Unprincipled anecdotes and massive confidence intervals on napkins are the law of the land[4]. In humans, a project manager might guess “if it took the client 2 SWE-months to write this program, and we assume spec elicitation is sufficiently straightforward, we should be able to prove it correct in 2-3 postdoc-years”, and then be wrong. An organization’s hard-won intuition is the only way to assess a verification burden estimate; there is nothing scientific about this.

Verification burden in agents

Token measurements (such as the token_counter function provided by the LiteLLM API aggregator) are a great opportunity to measure verification burden in a principled way for the first time. Tokens are much less noisy than the abstract “postdoc-year”, which treats the variance across the skill and grit levels of different postdocs as a nuisance: we can now fix a language model and study its capacity to write a proof and how much effort (i.e., how many loops through the toolcall-inference cycle) it takes to do it.
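As a rough sketch of what this could look like in practice, here is one way to tally per-stage token costs from logged agent transcripts with LiteLLM's token_counter; the transcript contents and model name are placeholder assumptions, not a fixed standard:

```python
import litellm

def transcript_cost(messages: list[dict], model: str = "gpt-4o") -> int:
    """Tally tokens in a chat-format agent transcript with the model's tokenizer."""
    return litellm.token_counter(model=model, messages=messages)

# Hypothetical logged transcripts from the two agent loops (real ones are far longer).
implementation_transcript = [
    {"role": "user", "content": "Implement a function satisfying this spec: ..."},
    {"role": "assistant", "content": "def f(x):\n    ..."},
]
proof_transcript = [
    {"role": "user", "content": "Prove the Hoare triple for f: ..."},
    {"role": "assistant", "content": "theorem f_correct : ... := by\n  ..."},
]

impl_tokens = transcript_cost(implementation_transcript)
proof_tokens = transcript_cost(proof_transcript)
print("verification burden:", proof_tokens / impl_tokens)
```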

I expect[5], based on vibey/anecdotal intuition as a proof engineer and Lean agent/synthetic data expert, given that I’ve only measured this for real on problems that are too toy to count:

  • Verification burden varies by about 0.5-1 OOMs on different problems from the same application area
  • Verification burden varies by about 1-2 OOMs on the average problem across different problem areas.

AI security scheme designers need to be able to evaluate the budget-realism of formal methods

“Just verify it” is not a free action. When you have the spec, program, and proof, running the proof checker should[6] be O(laptop), but the question is the cognitive investment in shipping the proof.

If verification burden is low enough, an AI security plan that leans on FM could be the cheapest reliable way to buy safety. If verification burden is high enough, then FM becomes a luxury good we can only afford on the narrowest, most safety-critical surfaces (boxing boundaries, tool interfaces, small kernels). Without principled burden measurements, scheme designers are forced into one of two failure modes:

  1. Over-reliance: assuming verification is cheap and scaling it across large surface areas—yielding a scheme that looks crisp on paper but collapses in the real world because they can’t pay the proof bill.
  2. Under-reliance: assuming verification is impossibly expensive and excluding it from designs where it would actually be a uniquely good deal—yielding a scheme that misses a high-leverage safety tool.

So: verification burden data is not just nice engineering metadata. It determines which AI security schemes are even coherent under realistic asymmetry assumptions.

Upfront vs ongoing cost

This is mostly to be explored in a later post[7]. But briefly, this is the question of the prover tokens being spent once (like for proving a module of an operating system) or spent ongoing (like repairing proofs as that module gets updated). Under a fixed enough problem domain, it will always be easy to drive verification burden down through one-time investment into boutique custom tactics that don’t generalize out of that problem domain. Substantial upfront cost for major runtime savings. In a kinda Big O sense, this one-time investment plays the role of a constant factor, so it may or may not be strategically important for the application you care about. To be clear, right now I think what’s important about verification burden is mostly this Big O view.

Please report verification burden

So if you’re shipping program/proof synthesis results, please report verification burden. It's a cheap add-on to your metrics. From the perspective of AI security under finite budgets, the key question is not “can we verify it in principle?” but “can we afford to verify it faster than the red team can break it?”. Note you may adjust the basic vibe of a verification burden measurement to your use case. Let’s create a science of verification burden!

Appendix: stop reading now if this is not your actual field

What should FMxAI authors report?

If you’re shipping FMxAI research or a verification-heavy product, here’s the cheap favor: publish a verification burden table alongside your results.

At minimum, please report (a minimal schema sketch follows this list):

  • Implementation cost: tokens (or cash) for an LM agent to implement the artifact up to least-effort QA, i.e. passing tests/basic sanity checks, not full proofs.
  • Proof cost: tokens for the LM agent to produce a correct proof (including repair loops).
  • Burden ratio: proof cost / implementation cost.
  • Confidence interval / variance across tasks: burden is spiky; averages alone mislead.
  • Task metadata so others can compare: language, spec style, proof assistant, automation level, model used, tool stack, stopping criteria.
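One possible shape for such a table, sketched as a Python dataclass; the field names are suggestions rather than a standard:

```python
from dataclasses import dataclass

@dataclass
class BurdenReport:
    task_id: str
    language: str               # e.g. "Lean 4", "Dafny"
    model: str                  # fixed model + version used for both agent loops
    implementation_tokens: int  # LM cost to reach least-effort QA
    proof_tokens: int           # LM cost to reach an accepted proof, incl. repair loops
    implementation_tool_calls: int
    proof_tool_calls: int

    @property
    def burden(self) -> float:
        return self.proof_tokens / self.implementation_tokens
```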

How to measure verification burden in agents

A workable experimental recipe:

  1. Fix the model and toolchain. Same model class, same temperature, same retrieval, same proof assistant version, same tactic libraries.
  2. Run an implementation agent.
    1. Inputs: spec and any allowed context.
    2. Goal: produce working code with minimal QA (tests, lint, typecheck).
    3. Record:
      1. total prompt and completion tokens
      2. number of tool calls / iterations
      3. walltime if you can
  3. Run a proof agent.
    1. Inputs: same spec and produced code and proof environment.
    2. Goal: proved theorem / verified artifact.
    3. Record the same measures.
    4. Compute burden = proof costs / implementation costs
  4. Repeat across task suite (even small suites help), report distribution.

If you want a single scalar headline, use the median burden and also show IQR / min-max.
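A minimal sketch of those headline statistics over a suite of per-task burdens (the numbers are invented):

```python
import statistics

burdens = [2.1, 3.4, 4.6, 7.9, 11.0, 25.3]  # per-task burden ratios (invented)

median = statistics.median(burdens)
q1, _, q3 = statistics.quantiles(burdens, n=4)  # quartile cut points -> IQR bounds
print(f"median burden: {median:.1f}")
print(f"IQR: [{q1:.1f}, {q3:.1f}], min-max: [{min(burdens):.1f}, {max(burdens):.1f}]")
```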

Papers that could’ve measured verification burden but didn’t

FVAPPS: Proving the Coding Interview

The task is: given specs, unit tests, a function signature, and a natural language description of the function, implement the function and the proof. If I recall correctly (epistemic status: an author), when we did the baselines we had the implementation of the function under test and the proof in different loops. It would’ve been trivial for us to write down verification burden, at least in terms of conversation length (I checked our logs for ways to reverse engineer some proxy, but didn’t see any way to do it because of things we failed to log).

VERINA

They chain their codegen task to their proofgen task, so they’d have the separation of concerns required to measure the tokens required for codegen and those required for proofgen. It would've been easy for them to add verification burden! I’m surprised I can’t reverse engineer some proxy from the data they reported, which I might be able to do if I emailed the authors for full access to their logs.

CLEVER

CLEVER is also inferring specs at benchmark creation time (like FVAPPS), but in a much more principled way where they have a reference spec targeted and they require an isomorphism proof. Verification burden is almost already there. Their pipeline already isolates the parts you need:

  • Implementation cost proxy: tokens/time spent in GenerateImpl retries until pi compiles (they even separately report implementation compilation rates).
  • Proof cost proxy: tokens/time spent in ProveCorrectness retries until Lean accepts the proof (they separately report implementation certification / prove@k).
  • Spec-side burden (optional): same story for GenerateSpec vs ProveEquivalence.

Because the evaluation algorithm is staged and each stage is retried independently with a shared timeout budget, logging tokens per stage would give burden for free. They don’t log tokens, but they do give coarse difficulty indicators:

  • Proof lengths and times for successful proofs (Table 2). Spec-equivalence proofs are consistently longer than correctness proofs, and take longer to find when they exist.
  • Success rates per stage (Table 1 / Figure 6): compilation is high, proving is the bottleneck, especially for spec equivalence.
  • Human formalization effort: writing each non-computable spec averaged ~25 minutes plus ~15 minutes peer review; manual correctness proofs for a sample ranged from 10 to 225 lines. That’s not agent-token burden, but it’s a human burden datapoint they did measure.

Again, under emailing the authors for complete logs, I could probably reconstruct a pretty satisfying verification burden.

Verification burden in humans

HACMS has not made it easy to estimate human verification burden of their projects. SeL4, on the other hand, very much has! It’s widely cited that the proofs in seL4 were 20 person-years, and in the paper you can see that the C microkernel implementation was about 4 person-years. So you get a verification burden of 20/4 = 5. Space and aeronautics would also be a source of data or intuition.

  1. I’ve updated since the first draft of this post that this is actually nonessential, but will address my fully updated view in the next post. ↩︎

  2. This is a big if. You could easily imagine the compliance burden falling on the red team, because it's the red team who wants their code executed in the world. Intuition pump for that world is when you provide software to federally insured banks, you have to pay for the SOC 2 auditors as a cost of doing business. ↩︎

  3. This is the setup in my preliminary/prototype whitepaper. ↩︎

  4. Except for seL4, see appendix. ↩︎

  5. But it's unclear how architectural innovations change the game. It's kinda primitive to be approaching proofs as a text-to-text problem, and we won’t be stuck with it for much longer if people keep paying attention to FMxAI. For example, if Higher Order Company’s thesis pays out for them. Watch this space. ↩︎

  6. Should, with caveats. ↩︎

  7. There, we will work with Manheim and Homewood’s distinction, roughly that the difference between control and oversight is the difference between upfront and runtime. ↩︎




Solstice Singalong Watch Party


Published on November 23, 2025 4:36 PM GMT

Welcome to Secular Solstice celebration at Arkhipov House!

Location: E2 0QW, Bethnal Green, London UK
Sign up to see the exact address

The main event is that we will watch the recording of the Bay Solstice ceremony and sing along to all the songs. This year's recording does not exist yet, but here's last year's for reference. (Last year's recording is also our backup plan if there is some issue with the new one.)

 

PLEASE DON'T ARRIVE AFTER 6PM

 

There will be:

  • Vegan food (dinner)
  • Sofa
  • Optional cuddles
  • Singing

     

Schedule:

  • Arrive from 5pm, eat dinner
  • Watch party starts at 6pm, expected to last around 3 hours
  • Feel free to stay and talk/cuddle on the sofa afterwards



Busking Practice


Published on November 23, 2025 3:20 PM GMT

It can be hard to get the kids to practice their instruments. Sometimes they're having trouble focusing, don't like their assigned piece, or are just grumpy. One thing I'll offer them in these cases is "busking practice".

The idea is, we'll pretend we're busking together. I'll start playing a tune they know, and they should do something that sounds good. That could be playing the tune with me, making up a harmony, or just figuring out a single note that sounds ok and playing patterns on it. If they make a mistake, we keep going. If they can't figure out what to play, we keep going. We're practicing performing. It helps that they're pretty motivated to get good at busking, because they know they can earn money that way.

Working on the pieces the teacher assigns (if you have a good teacher!) is a very efficient way of turning time on an instrument into becoming a better musician. But if willpower is the limiting factor and not time, and especially if the alternative is a super short practice or no playing at all, the important thing is just picking up the instrument and playing something. I like that busking practice gives us some structure for this, and lets the kids build up their performance skills.

Comment via: facebook, mastodon, bluesky




The Enemy Gets The Last Hit


Published on November 23, 2025 12:22 PM GMT

Disclaimer: I am god-awful at chess.

I

Late-beginner chess players, those who are almost on the cusp of being basically respectable, often fall into a particular pattern. They've got the hang of calculating moves ahead; they can make plans along the lines of "Ok, so if I move my rook to give a check, the opponent will have to move her king, and then I can take her bishop," and those plans tend to be basically correct: the opponent really does have to move her king.

But there's a very important rule to follow when calculating. Always end your calculations after your opponent has moved. You must never end your calculations after your own move. In other words:

The enemy gets the last hit

This principle is common in cyber security: you have to let your red team make the last move. If your red team finds a vulnerability, and your blue team patches it, then you have to red team the patch. It's the same for AI red-team-blue-team games: I recall a story of a team at one of the MATS streams, presenting their final work on some AI control (or similar) protocol:

Audience member: But why didn't you try to mitigate this risk with something like [Idea] which would have taken a few minutes to implement.

Team member: Because then we'd have had to red-team that idea as well, and red-teaming it would have taken much longer than a few minutes.

The team member was correct here. Quite often, calculating what The Enemy will do is harder than calculating what you're going to do. 

The Enemy need not actually be an enemy. The Enemy can be "The universe" or something. If you're building a flood defence, then The Enemy is the flood water. If you build a barrier to stop it getting to your city's business district, then you'd better check where the water will go instead, to make sure you didn't just divert it onto an orphanage or something.

II

Similarly, lots of AI Safety papers have the theme "We found a problem, then we fixed it." This has a nice ring to it. It's how most papers get written in most fields, which is fine for those fields. But AI Safety is much more like cybersecurity than e.g. chemical engineering, where "we found this reaction was going slow so we added a new catalyst" is totally reasonable. 

(Lots don't fall into this trap, and that's great!)

The conversation usually goes like this:

AIS: We found this solution to a serious problem

AINKEI: This seems super hacky

AIS: No I don't think so

AIS goes away to think...

AIS: Actually it follows this deeper principle

AINKEI: I feel like this won't work for superintelligence still

AINKEI goes away to think...

AINKEI: Ok, here's a reason I thought of why it won't work

AIS: Oh huh

AIS goes away to think...

AIS: Ah but I might be able to fix it with this solution

The issue is that AINKEI is thinking in terms of letting the enemy get the last hit, while AIS is thinking in terms of a feedback loop of detecting and fixing problems. The feedback loop solution only works if all of your problems are recoverable, which is a core disagreement between the crowds.

<psychoanalysis>
I think a lot of the AI not-kill-everyone-ism crowd's frustration with the AI safety crowd is that the AINKEI people feel that they are having to do the job that AIS people should be doing, by playing the part of The Enemy getting the last hit
</psychoanalysis>

III

The recent work on inoculation prompting---which has stirred up so many mixed reactions that it functions as a scissor statement for the AI safety/alignment/notkilleveryoneism crowd---is a great example.

Problem: AIs generalize from reward hacking to misalignment.
Solution: Just tell 'em it's OK to reward hack during training.

Does this throw up even more problems? The paper didn't really investigate this question; they didn't let The Enemy get the last hit.

In this case, The Enemy is  "Your AIs getting smarter every generation."

The general form of the solution is "if we can't make our reward environments exactly match our prompts, we'll adjust our prompts to match our reward environments." which is, to be fair, quite elegant. What happens when the AI gets smarter? As a first guess, if you can't make your reward environments more robust, you'll have to prompt your AI with more and more caveats, in more and more different situations.

This seems bad! Does every prompt now have "by the way it's OK to hack the environment and manipulate the human raters and break out of your VM and murder the testers" during training? What fixes this? I don't know, because I have a finite amount of time to write this essay, and I double-super don't know what problems that fix throws up!


