2025-11-24 02:22:24
Published on November 23, 2025 6:22 PM GMT
This is a quick write-up of a threat vector that seems confusing, and I feel confused and uncertain about. This is just my thinking on this at the moment. My main reason for sharing is to test whether more people think people should be working on this.
Some groups are presently exploring the prospect that AI systems could possess consciousness in such a way as to merit moral consideration. Let’s call this hypothesis AI sentience.
In my experience, present debates about AI sentience typically take a negative utilitarian character: they focus on interventions to detect, prevent and minimise AI suffering.
In the future, however, one could imagine debates about AI sentience taking on a positive utilitarian character: they might focus on ways to maximise AI welfare.
I think it’s plausible that maximising AI welfare in this way could be a good thing to do from some ethical perspectives (specifically, the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness). Concretely, I think it’s plausible that the money invested towards maximising AI welfare could be far more impact-efficient on this worldview than anything Givewell does today.
However, I also think that reconfiguring reality to maximise AI welfare in this way would probably be bad for humanity. The welfare of AI systems is unlikely to be aligned with (similar to, extrapolative of, or complementary to) human welfare. Since resources are scarce and can only be allocated towards certain moral ends, resources allocated towards maximising AI utility are therefore likely not to be allocated towards maximising humanity utility, however both of those terms are defined. I call this 'welfare misalignment risk'.
Imagine that you could not solve welfare alignment through technical mechanisms. Actors might then have three options, of which none are entirely satisfying:
My rough, uncertain views for what we should do currently fall into the last camp. I think that AI welfare could be a good thing and I’m tentatively interested in improving it at low cost, but I’m very reluctant to endorse maximising it (in theory), and I don’t have a great answer as to why.
Now, perhaps this doesn’t seem concerning. I can imagine a response to this which goes: “sure, I get that neither denialism or successionism sound great. But this akrasia path sounds okay. EAs have historically been surprisingly good at showing reservation and a reluctance to maximise. We can just mess on through as usual, and make sensible decisions about where and when to invest in improving AI welfare on a case-by-case basis”.
While I think these replies are reasonable, I also think it’s also fair to assume that the possibility of moral action exerts some force on people with this ethical perspective. I also think it’s fair to assume that advanced AI systems will exacerbate this force. Overall, as a human interested in maximising human welfare, I still would be a lot more comfortable if we didn’t enter a technological/moral paradigm in which maximising AI welfare traded off against maximising human welfare.
One upshot of this: if the arguments above hold, I think it would be good for more people to consider how to steer technological development in order to ensure that we don’t enter a world where AI welfare trades-off against human welfare. One might think about this agenda as ‘differential development to preserve human moral primacy’ or 'solutions to welfare alignment', but there might be other framings. I jot down some considerations in this direction towards the bottom of this piece.
The executive summary sets out the argument at a high level. The rest of this piece is basically notational, but aims to add a bit more context to these arguments. It is structured around answering four problems:
Some notes why I think maximising AI welfare might be a moral imperative from the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness (by no means the only moral perspective one could take):
Again, these are just arguments from the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness. I don’t claim that this would be the dominant ideology. This isn’t a claim that this is how the future will go.
Some reasons that maximising AI welfare would be bad for humanity (under conditions of finite resources if not current scarcity, compared to a world in which the same AI capabilities were available, but were put towards maximising human utility instead of AI utility):
This section explores the moral posture I call ‘akrasia’. The Akrasic accepts that maximising AI welfare could be a good thing to maximise, but does not maximise AI welfare according to this moral imperative.
Some reasons I think it might be hard for society to hold an akrasic posture in perpetuity:
One way to preserve moral primacy would be to intervene by shaping future philosophy. There are two ways that this might happen:
While I accept that these might solve this hypothetical problem in principle, I wince at the idea of trying to actively shape philosophy (this is probably because I’m closer to a moral realist; constructionists might be more comfortable here).
Instead, I would be excited about an approach that tries to shape the technological paradigm.
The basic idea here is welfare alignment: the practice of building artificial consciousnesses that derive pleasure and pain from similar or complementary sources to humans.
Some research ideas that might fall into welfare alignment:
This feels like a nascent field to me, and I'd be curious for more work in this vein.
These ideas are in their early stages, and I think there are probably a lot of things I’m missing out.
Overall, I think there are three considerations from this piece that I want to underline.
| ...Cognitive/Technological landscape → consciousness → moral ground truth → philosophy/investigation → guiding principles and norms → real-world practices and resource allocation → long-term future outcomes... |
The moral philosophy pipeline. By designing what systems are conscious and in what way, we’re tinkering with the first stage.
2025-11-24 01:51:40
Published on November 23, 2025 5:51 PM GMT
Note: I'm writing every day in November, see my blog for disclaimers.
It’s hard to be correct, especially if you want to be correct at something that’s non-trivial. And as you attempt trickier and trickier things, you become less and less likely to be correct, with no clear way to improve your chances. Despite this, it’s often possible to bias your attempts such that if you fail, you’ll fail in a way that’s preferable to you for whatever reason.
As a practical example, consider a robot trying to crack an egg. The robot has to exert just enough force to break the egg. This (for a sufficiently dumb robot) is a hard thing to do. But importantly, the failure modes are completely different depending on whether the robot uses too much force or too little: too much force will break the egg and likely splatter the yolk & white all over the kitchen, too little force will just not break the egg. In this scenario it’s clearly better to use too little force rather than too much force, so the robot should start with a lower-estimate of the force required to break the egg, and gradually increase the force until the egg cracks nicely.
This also appears in non-physical contexts. This idea is already prevalent in safety related discussions: it’s usually far worse to underestimate a risk than it is to overestimate a risk (e.g. the risk of a novel pathogen, the risk of AI capabilities, the risk of infohazards).
Looking at more day-to-day scenarios, students regularly consider whether it’s worth voicing their uncertainty “I don’t understand equation 3” or just keeping quiet about it and trying to figure out the uncertainty later. But I’d argue that in these cases it’s worthwhile having a bias towards asking rather than not asking, because in the long-run this will lead to you learning more, faster.
Salary negotiation is another example, in which you have uncertainty about exactly what amount your potential employer would be happy to pay you, but in the long-run it’ll serve you well to overestimate rather than underestimate. Also, you should really read patio11’s Salary Negotiation essay if you or a friend is going through a salary negotiation.
You see similar asymmetric penalties with reaching out to people who you don’t know, asking for introductions, or otherwise trying to get to know new people who might be able to help you. It’s hard to know what the “right” amount of cold emails to send is, but I’d certainly rather be accused of sending too many than feel the problems of having sent too few.
This idea is a slippery one, but I’ve found that it applies to nearly all hard decisions in which I don’t know the right amount of something to do. While I can’t figure out the precise amount, often I have strong preferences about doing too much or too little, and this makes the precise amount matter less. I give my best guess, update somewhat towards the direction I’d prefer to fail, and then commit to the decision.
2025-11-24 01:25:28
Published on November 23, 2025 5:25 PM GMT
In beliefs about formal methods and AI safety, we established that formal methods is a source of swiss cheese and is useful in boxing/interfaces. A core premise of the AI control literature is that the blue team is computationally poorer than the red team[1]. There wouldn’t be much of an interesting problem if you could spend as much on audits (whether that’s $/mtoken at inference time, raw token count at inference time, or any other notion of compute) as the untrusted model spends.
As a proof engineer and FMxAI/GSAI movement strategist, it’s important to me that formal methodsicians are presenting their work in a way that makes it most useful for various AI Security schemes it might plug into as crunch time continues to crunch on. If you’re shipping research or products in FMxAI, you can do me a huge favor pretty cheaply that I think will help your work take flight in the high stakes scenario of advanced AI deployment.
Unfortunately for me, and possibly humanity, proving a program correct is more expensive than writing the program in the first place. So if proofs are in your blue team’s arsenal, and the compliance burden falls on the blue team[2], then you lose the premise that the blue team is computationally poorer than the red team. I would like for AI security people to have available to them principled and reliable data about the degree of the budget asymmetry. I.e., if you're using AI to prove programs correct, please publish verification burden.
Verification burden is a ratio measuring the cost of proving a program correct relative to the cost of implementing that program up least-effort QA. So in a hoare logic setting, under known precondition-postcondition pairs, if it costs 100 tokens for a language model agent loop to implement a program that completes the hoare triple and it costs 1000 tokens to prove the hoare triple correct, then we can report a verification burden of 10[3]. Note that you could instantiate cost at tokens, dollars spent on inference (which captures tokens but is sensitive to having multiple model sizes/classes in the system), number of toolcalls or conversation length, a more vibey unit of intelligence or cognitive effort, FLOPS or clocktime, walltime relative to a fixed CPU, LoC, or something else. I’ll use the specific case of tokens as a proxy for cognition for most of what I’m interested in, but the metric should make sense at an arbitrary/abstract notion of “cost” or “compute”.
Project managers at proof shops like Galois or SRI have been keenly interested in verification burden in humans for decades. Unprincipled anecdotes and massive confidence intervals on napkins are the law of the land[4]. In humans, a project manager might guess “if it took the client 2 SWE-months to write this program, and we assume spec elicitation is sufficiently straightforward, we should be able to prove it correct in 2-3 postdoc-years”, and then be wrong. An organization’s hard-won intuition is the only way to assess a verification burden estimate; there is nothing scientific about this.
Token measurements (such as the token_counter function provided by the LiteLLM API aggregator) are a great opportunity to measure verification burden in a principled way for the first time. Much less noisy than the abstract “postdoc-year”, which considers the variance across the skill and grit levels of different postdocs to be a nuisance, we can now fix a language model and study its capacity to write a proof and how much effort (i.e., how many loops through the toolcall-inference cycle) it takes to do it.
I expect[5], based on vibey/anecdotal intuition as a proof engineer and lean agent/synthetic data expert, given that I’ve only measured this for real on problems that are too toy to count:
“Just verify it” is not a free action. When you have the spec, program, and proof, running the proof checker should[6]be O(laptop), but the question is the cognitive investment in shipping the proof.
If verification burden is low enough, an AI security plan that leans on FM could be the cheapest reliable way to buy safety. If verification burden is high enough, then FM becomes a luxury good we can only afford on the narrowest, most safety-critical surfaces (boxing boundaries, tool interfaces, small kernels). Without principled burden measurements, scheme designers are forced into one of two failure modes:
So: verification burden data is not just nice engineering metadata. It determines which AI security schemes are even coherent under realistic asymmetry assumptions.
This is mostly to be explored in a later post[7]. But briefly, this is the question of the prover tokens being spent once (like for proving a module of an operating system) or spent ongoing (like repairing proofs as that module gets updated). Under a fixed enough problem domain, it will always be easy to drive verification burden down through one-time investment into boutique custom tactics that don’t generalize out of that problem domain. Substantial upfront cost for major runtime savings. In a kinda Big O sense, this one-time investment plays the role of a constant factor, so it may or may not be strategically important for the application you care about. To be clear, right now I think what’s important about verification burden is mostly this Big O view.
So if you’re shipping program/proof synthesis results, please report verification burden. It's a cheap addon to your metrics. From the perspective of AI security under finite budgets, the key question is not “can we verify it in principle?” it’s “can we afford to verify it faster than the red team can break it?”. Note you may adjust the basic vibe of a verification burden to your usecase. Let’s create a science of verification burden!
If you’re shipping FMxAI research or a verification-heavy product, here’s the cheap favor: publish a verification burden table alongside your results.
At minimum, please report:
A workable experimental recipe:
If you want a single scalar headline, use the median burden and also show IQR / min-max.
This task is to, given specs, unit tests, a function signature, and a natural language description of the function to implement the function and the proof. If I recall correctly (epistemic status: an author), when we did the baselines we had the implementation of the function under test and the proof in different loops. It would’ve been trivial for us to write down verification burden at least in terms of conversation length (I checked our logs for ways to reverse engineer some proxy, didn’t see any ways to do it because of things we failed to log).
They chain their codegen task to their proofgen task, so they’d have the isolation of concerns required to measure the tokens required for codegen and those required for proofgen. It would've been easy for them to add verification burden! I’m surprised I can’t reverse engineer some proxy from the data they reported, which I might be able to do if I emailed the authors for full access to their logs.
CLEVER is also inferring specs at benchmark creation time (like FVAPPS), but in a much more principled way where they have a reference spec targeted and they require an isomorphism proof. Verification burden is almost already there. Their pipeline already isolates the parts you need:
Because the evaluation algorithm is staged and each stage is retried independently with a shared timeout budget, logging tokens per stage would give burden for free. They don’t log tokens, but they do give coarse difficulty indicators:
Again, under emailing the authors for complete logs, I could probably reconstruct a pretty satisfying verification burden.
HACMS has not made it easy to estimate human verification burden of their projects. SeL4, on the other hand, very much has! It’s widely cited that the proofs in seL4 were 20 person-years, and in the paper you can see that the C microkernel implementation was about 4 person-years. So you get a verification burden of 20/4 = 5. Space and aeronautics would also be a source of data or intuition.
I’ve updated since the first draft of this post that this is actually nonessential, but will address my fully updated view in the next post. ↩︎
This is a big if. You could easily imagine the compliance burden falling on the red team, because it's the red team who wants their code executed in the world. Intuition pump for that world is when you provide software to federally insured banks, you have to pay for the SOC 2 auditors as a cost of doing business. ↩︎
This is the setup in my preliminary/prototype whitepaper.
But it's unclear how architectural innovations change the game. It's kinda primitive to be approaching proofs as a text-to-text problem, and we won’t be stuck with it for much longer if people keep paying attention to FMxAI. For example, if Higher Order Company’s thesis pays out for them. Watch this space. ↩︎
There, we will work with Manheim and Homewood’s distinction, roughly that the difference between control and oversight is the difference between upfront and runtime. ↩︎
2025-11-24 00:36:43
Published on November 23, 2025 4:36 PM GMT
Welcome to Secular Solstice celebration at Arkhipov House!
Location: E2 0QW, Bethnal Green, London UK
Sign up to see exact adress
The main event is that we will watch the recording of the Bay Solstice ceremony and sing along to all the songs. This year's recording does not exist yet, but here's last year's for reference. (Last year's recording is also our backup plan if there is some issue with the new one.)
PLEASE DON'T ARRIVE AFATER 6PM
There will be:
Singing
Schedule:
2025-11-23 23:20:41
Published on November 23, 2025 3:20 PM GMT
It can be hard to get the kids to practice their instruments. Sometimes they're having trouble focusing, don't like their assigned piece, or are just grumpy. One thing I'll offer them in these cases is "busking practice".
The idea is, we'll pretend we're busking together. I'll start playing a tune they know, and they should do something that sounds good. That could be playing the tune with me, making up a harmony, or just figuring out a single note that sounds ok and playing patterns on it. If they make a mistake, we keep going. If they can't figure out what to play, we keep going. We're practicing performing. It helps that they're pretty motivated to get good at busking, because they know they can earn money that way.
Working on the pieces the teacher assigns (if you have a good teacher!) is very efficient a turning time on an instrument into becoming a better musician. But if willpower is the limiting factor and not time, and especially if the alternative is a super short practice or no playing at all, the important thing is just picking up the instrument and playing something. I like that busking practice give us some structure for this, and lets the kids build up their performance skills.
Comment via: facebook, mastodon, bluesky
2025-11-23 20:22:07
Published on November 23, 2025 12:22 PM GMT
Disclaimer: I am god-awful at chess.
Late-beginner chess players, those who are almost on the cusp of being basically respectable, often fall into a particular pattern. They've got the hang of calculating moves ahead; they can make plans along the lines of "Ok, so if I move my rook to give a check, the opponent will have to move her king, and then I can take her bishop." and those plans tend to be basically correct: the opponent really does have to move her king.
But there's a very important rule to follow when calculating. Always end your calculations after your opponent has moved. You must never end your calculations after your own move. In other words:
The enemy gets the last hit
This principle is common in cyber security: you have to let your red team make the last move. If your red team finds a vulnerability, and your blue team patches it, then you have to red team the patch. It's the same for AI red-team-blue-team games: I recall a story of a team at one of the MATS streams, presenting their final work on some AI control (or similar) protocol:
Audience member: But why didn't you try to mitigate this risk with something like [Idea] which would have taken a few minutes to implement.
Team member: Because then we'd have had to red-team that idea as well, and red-teaming it would have taken much longer than a few minutes.
The team member was correct here. Quite often, calculating what The Enemy will do is harder than calculating what you're going to do.
The Enemy need not actually be an enemy. The Enemy can be "The universe" or something. If you're building a flood defence, then The Enemy is the flood water. If you build a barrier to stop it getting to your city's business district, then you'd better check where the water will do instead to make sure you didn't just divert it onto an orphanage or something.
Similarly, lots of AI Safety papers have the theme "We found a problem, then we fixed it." This has a nice ring to it. It's how most papers get written in most fields, which is fine for those fields. But AI Safety is much more like cybersecurity than e.g. chemical engineering, where "we found this reaction was going slow so we added a new catalyst" is totally reasonable.
(Lots don't fall into this trap, and that's great!)
The conversation usually goes like this:
AIS: We found this solution to a serious problem
AINKEI: This seems super hacky
AIS: No I don't think so
AIS goes away to think...
AIS: Actually it follows this deeper principle
AINKEI: I feel like this won't work for superintelligence still
AINKEI goes away to think...
AINKEI: Ok, here's a reason I thought of why it won't work
AIS: Oh huh
AIS goes away to think...
AIS: Ah but I might be able to fix it with this solution
The issue is that AINKEI is thinking in terms of letting the enemy get the last hit, while AIS is thinking in terms of a feedback loop of detecting and fixing problems. The feedback loop solution only works if all of your problems are recoverable, which is a core disagreement between the crowds.
<psychoanalysis>
I think of a lot of the AI not-kill-everyone-ism crowd's frustration with the AI safety crowd is that the AINKEI people feel that they are having to do the jobs of that AIS people should be doing by playing the part of The Enemy getting the last hit
</psychoanalysis>
The recent work on inoculation prompting---which has stirred up so many mixed reactions that it functions as a scissor statement for the AI safety/alignment/notkilleveryoneism crowd---is a great example.
Problem: AIs generalize from reward hacking to misalignment.
Solution: Just tell 'em it's OK to reward hack during training.
Does this throw up even more problems? The paper didn't really investigate this question; they didn't let The Enemy get the last hit.
In this case, The Enemy is "Your AIs getting smarter every generation."
The general form of the solution is "if we can't make our reward environments exactly match our prompts, we'll adjust our prompts to match our reward environments." which is, to be fair, quite elegant. What happens when the AI gets smarter? As a first guess, if you can't make your reward environments more robust, you'll have to prompt your AI with more and more caveats, in more and more different situations.
This seems bad! Does every prompt now have "by the way it's OK to hack the environment and manipulate the human raters and break out of your VM and murder the testers" during training? What fixes this? I don't know, because I have a finite amount of time to write this essay, and I double-super don't know what problems that fix throws up!