MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

The Ones who Feed their Children

2025-12-25 03:37:54

Published on December 24, 2025 7:15 PM GMT

[This is an entry for lsusr's write-like-lsusr competition.]

It is a curious fact that many citizens of the city abstain from eating the flesh of living creatures, and yet all put beef and chicken on the plates of their children. My belief was always that this was the natural and just state of affairs, and yet events of late have made me wonder if those who would trade one second of their child's health for the lives of a thousand heifers, are exactly those who have walked away.

The wave of powerful magic washed low and fast through city streets, causing icicles on gutters to throw out refracted rainbows. Following more slowly, the bulk of the power moved and grew out overhead. Every leaf on distant trees snapped into sharp relief, and peals of laughter echoed, and from the echos one could tell if the sound bounced off the brick fireplace, or the soft wall hangings. This flare was largely familiar- it was of course the same magic that sustained the city, and that had always waxed and waned, and that for centuries had shone brightest every year during the summer festival. It was different now- still healthy, somehow more joyful, but pulsatile instead of smoothly cresting. The children in the park were thrilled to have the light without having to wait 5 more months. 

Most adults, having studied the foundations of the city during years of schooling, could easily guess what had happened. Young couples, walking all bundled up to their dinner reservations, marveled at the sudden clarity of their clouds of breath, and then caught in their step, and then held each other and one by one began to sob in uncautious relief. An old woman counted each stich on her needle, admiring the fibers of soft wool, and smiled. It was the parents, even the parents who sustained themselves on legumes and potatoes and all the rest of the vegetable harvest, who looked at the flickering beauty and saw only horror. 

With the spell broken, by a single, thoughtless utterance of “I’m sorry,” there is no sense in a attempting to repair it. The child is given a new name, Nathaniel, as his old name had been long forgotten. I start asking around- speech pathologists, physical therapists, dieticians. Lots to do, lots to do.

Attention. It is not an easy thing to evaluate a utopian city, a city of 30,000 living, loving, sentient men and women, on a computational substrate of the attention paid to them by 9th grade language arts students. The trick to it is to spread it thin, but not too thin- a less clever narrative state machine might dump exaflops and exaflops into the moment Lucy steps up to the lamp post, far past the qualia threshold, and waste it redundantly rehashing the same 80 seconds of childlike wonder. There are of course issues of acquiring attention before allocating it, and our distasteful engine was heavily constrained by this as well.

It’s not romantic. Hour to hour, the task before was mostly about managing waste, the task now is managing waste. The transition to a dignified way to use the bathroom is not going to be dignified. God, we are monsters.

And of course it sounded like a brilliant way to cut the gordian knot- what a teenage idea. That joy can be as interesting as suffering, that Nathanial’s recovery could keep our city thriving at a less terrible price. And of course it falls on me, groaning with middle age, to make the path we are on work while growing ever more doubtful that I will meet my grandchildren. The morning vibrancy was not the vindication that our apologiser thought it was, as the core issue here is not attention ( though all signs point to this also being lethal on a longer timescale ) but distribution- our temporal symmetry is utterly broken. By my calculations, over three quarters of the flops that remain in our city’s future have been spent in the last two days- we had millennia!

I change bandages. Nathaniel is growing strong faster than he is growing social. We bring him home, to our house, with 24 hour aide support. My daughter says she is finally proud of the job I do. She looks worse than she ever has, stressed.

Fuck. I’m a protagonist. Do you know how much math went in to preventing us from growing a protagonist?

Rachel Tatham is interesting because she is the daughter of the man who led the effort to reintegrate Nathanial Ross into society. She also freed him with a word, on a field trip, and when I was young I refused to let this advantage her.

He did integrate. I didn’t expect him to. He doesn’t resent me. I wish he would.

Albert Tatham and Ellie Norse are interesting because they are the grandchildren of the man who nursed Nathaniel Ross.

Nathaniel Norse, Charles and Alex Tatham are the great grandchildren of the man who nursed Nathaniel Ross.

 


 



Discuss

Kids and Space

2025-12-24 23:30:15

Published on December 24, 2025 3:30 PM GMT

There's been a lot of discussion over the last month on whether it's still possible to raise kids without being rich. Housing is a big piece of this, and if you need to buy a house where each kid has their own room, yes, that's expensive, but it's also not the only option. We didn't wait to buy a house (or have multiple bedrooms) before having kids, and I think that was the right choice for us.

To give you a sense of what this looked like, two configurations from early on:

It was definitely not ideal! Trying not to wake the baby when you have different bedtimes, staying out of the bedroom during naptime, both parents waking when the baby does, etc. But there were also large advantages to a first kid at 28:

  • Having kids at a time in our life when we physically had more energy. Not to say we have no energy now at 40 and nearly-40, but ten years ago we did have more.

  • More years of overlap with our kids, and an even larger increase in how many years our parents overlap with them.

  • Better time in our careers for us to take leave: it's generally easier to be away as an IC than a manager.

  • Fertility is highly variable, but definitely gets harder as you get older.

  • Much more practical to have three kids.

Overall, I think this was a good choice for us. It's definitely not right for everyone, but I think hard rules of "buy a house first" and "have enough space that each kid can have their own room" are right for very few people.

There's a pattern of rising expectations for what it means to be doing ok, but sometimes people describe these as if they're rising requirements. For example, Zvi:

Society strongarms us to buy more house, more healthcare, more child supervision and far more advanced technology. The minimum available quality of various goods, in ways we both do and don't care about, has risen a lot. Practical ability to source used or previous versions at old prices has declined.

He focuses on childcare (reasonable!) but also discusses how this applies to housing:

You can want 1,000 square feet but that means an apartment, many areas don't even offer this in any configuration that plausibly works.

See also Aella:

being poorer is harder now than it used to be because lower standards of living are illegal. Want a tiny house? illegal. want to share a bathroom with a stranger? illegal. The floor has risen and beneath it is a pit

While Zvi, Aella, etc are pointing at a real problem (housing is way too expensive, primarily because we've made it too hard to build more in places people want to live; we should stop doing that), I think they're more wrong than right. They're overlooking a major option, families sharing housing with others:

  • Before we had kids we lived with another couple when they had their first kid. We were renting a 3br together in Somerville, walking distance to the Orange Line. The husband was a paralegal, the wife quit her job to watch their baby. My memory is that she didn't like being home full time with the baby and later on did a range of other things, but it was doable on one income and the option is still there.

  • One of my cousins lived in a 4br with their partner and another couple. Both couples had two kids. It was tight, and there were definitely downsides to having less space, but again, the option is there.

There are specific ways the "floor has risen", and both high minimum unit sizes and effectively banning SROs should be reversed. Similarly, we could make housing much cheaper with simple and broadly beneficial policy changes, and I would love to see a world where people did not have to make these painful tradeoffs. But "put lots of people in a medium-sized space" has always been a major way people saved money on housing, and is still a legal and practical option today.

(I asked my kids, "Imagine we could only afford a small apartment, and you had to share a bedroom with your sisters. Would you rather that they didn't exist so you could have your own room?" None of them did, and they were moderately outraged by the question, though they mentioned sometimes not liking their sisters very much.)

Comment via: facebook, lesswrong, mastodon, bluesky



Discuss

Zvi’s 2025 In Movies

2025-12-24 21:30:18

Published on December 24, 2025 1:30 PM GMT

Now that I am tracking all the movies I watch via Letterboxd, it seems worthwhile to go over the results at the end of the year, and look for lessons, patterns and highlights.

Last year: Zvi’s 2024 In Movies.

The Ratings Scale

You can find all my ratings and reviews on Letterboxd. I do revise from time to time, either on rewatch or changing my mind. I encourage you to follow me there.

Letterboxd ratings go from 0.5-5. The scale is trying to measure several things at once.

5: Masterpiece. All-time great film. Will rewatch multiple times. See this film.

4.5: Excellent. Life is meaningfully enriched. Want to rewatch. Probably see this film.

4: Great. Cut above. Very happy I saw. Happy to rewatch. If interested, see this film.

3.5: Very Good. Actively happy I saw. Added value to my life. A worthwhile time.

3: Good. Happy that I saw it, but wouldn’t be a serious mistake to miss it.

2.5: Okay. Watching this was a small mistake.

2: Bad. I immediately regret this decision. Kind of a waste.

1.5: Very bad. If you caused this to exist, you should feel bad. But something’s here.

1: Atrocious. Total failure. Morbid curiosity is the only reason to finish this.

0.5: Crime Against Cinema. Have you left no sense of decency, sir, at long last?

The ratings are intended as a bell curve. It’s close, but not quite there due to selection of rewatches and attempting to not see the films that are bad:

The Five Component Model of Movie Ratings

Trying to boil ratings down to one number destroys a lot of information.

Given how much my ratings this year conflict with critics opinions, I asked why this was, and I think I mostly have an explanation now.

There are several related but largely distinct components. I think the basic five are:

  1. Quality with a capital Q and whether the movie has ambition and originality.
  2. Whether the overall pacing, arc and plot of the movie holds your interest.
  3. What message the movie sends and whether its arc comes together satisfyingly.
  4. What does the movie make you feel? All the feels? None? Some of them?
  5. Whether the movie is a good fit for you personally.

Traditional critic movie ratings tend, from my perspective, to overweight #1, exhibit predictable strong biases in #3 and #5, and not care enough about #2. They also seem to cut older movies, especially those pre-1980 or so, quite a lot of unearned slack.

Scott Sumner picks films with excellent Quality, but cares little for so many other things that once he picks a movie to watch our ratings don’t even seem to correlate. We have remarkably opposite tastes. Him giving a 3.7 to The Phoenician Scheme is the perfect example of this. Do I see why he might do that? Yes. But a scale that does that doesn’t tell me much I can use.

The Numbers

Order within a ranking is meaningful.

Movies Have Decreasing Marginal Returns in Practice

Any reasonable algorithm is going to be very good at differentially finding the best movies to see, both for you and in general. As you see more movies, you deplete the pool of both existing and new movies. That’s in addition to issues of duplication.

In 2024, I watched 36 new movies. In 2025, I watched 51 new movies. That’s enough of an expansion that you’d expect substantially decreasing returns. If anything, things held up rather well. My average rating only declined from 3.1 to 3.01 (if you exclude one kids movie I was ‘forced’ to watch) despite my disliking many of the year’s most loved films.

My guess is I could have gotten up to at least 75 before I ran out of reasonable options.

Very Briefly on the Top Picks and Whether You Should See Them

See The Naked Gun unless you hate fun. If you hated the original Naked Gun, or Airplane, that counts as hating fun. But otherwise, yes, I understand that this is not the highest Quality movie of the year, but this is worthy, see it.

You should almost certainly see Bogunia and Companion.

See Thunderbolts* unless you are automatically out on all Marvel movies ever.

See A Big, Bold Beautiful Journey unless you hate whimsical romantic comedies or are a stickler for traditional movie reviews.

See Sorry, Baby and Hamnet, and then Sentimental Value, if you are willing to spend that time being sad.

See Novocaine and then maybe The Running Man if you want to spend that time watching action, having fun and being happy instead.

See Relay if you want a quiet thriller.

See Oh, Hi!, Splitsville and Materialists if you want to look into some modern dating dynamics in various ways, in that order or priority.

See Wick is Pain if and only if you loved the John Wick movies.

The world would be better if everyone saw A House of Dynamite.

I anticipate that Marty Supreme belongs on this list, it counts as ‘I’m in,’ but due to holidays and the flu I haven’t been able to go out and see it yet. The over/under is at Challengers.

Other Notes to Self to Remember (Reprise from 2024)

This helps you understand my biases, and helps me remember them as well.

  1. If the movie stinks, just don’t go. You know if the movie stinks.
  2. Trust your instincts and your gut feelings more than you think you should.
  3. Maybe gut feelings are self-fulfilling prophecies? Doesn’t matter. They still count.
  4. You love fun, meta, self-aware movies of all kinds. Trust this instinct.
  5. You do not actually like action movies that play it straight. Stop watching them. However, action movies that do something cool or unique can be very cool.
  6. If the movie sounds like work or pain, it probably is, act accordingly.
  7. If the movie sounds very indy or liberal, the critics will overrate it.
  8. A movie being considered for awards is not a positive signal once you control for the Metacritic and Letterboxd ratings. If anything it is a negative.
  9. Letterboxd ratings adjusted for context beat Metacritic. Rotten Tomatoes is the best test for ‘will the movie stink’. No review source has much predictive value beyond knowing if the movie stinks, if you fail to control for context.
  10. Opinions of individuals very much have Alpha if you have enough context.

The Year of Living Disagreeably

That leaves six remarkably well reviewed movies, all of which are indeed very high on Quality, where I disagreed with the consensus, and had my rating at 3 or less. In order of Quality as I would rank them, they are: One Battle After Another, Sinners, Black Bag, Train Dreams, Weapons and Frankenstein.

A strategy I think would work well for all six of those, at the risk of some spoilage, is to watch the trailer. If you respond to that trailer with ‘I’m in’ then be in. If not, not.

The predictive power of critical reviews, at least for me, took a nosedive in 2025. One reason is that the ratings clearly got more generous in general. Average Metacritic, despite my watching more movies, went from 61 → 66, Letterboxd went 3.04 → 3.33. Those are huge jumps given the scales.

In 2024, Letterboxd or Metacritic ratings were 48% and 46% correlated with my final ratings, respectively. This year that declined to 33% and 38%, and I discovered the best was actually Rotten Tomatoes at 44%, with IMDB at 42%.

If you consider only movies where I gave a rating of 2.5 or more, filtering out what I felt were actively bad movies, the correlation dropped to 1% and 6%, or 3% for IMDB, or -4% (!) for Rotten Tomatoes. Essentially all of the value of critics was in identifying which things sucked, and from my perspective the rest was noise.

Rotten Tomatoes is a one trick pony. It warns you about things that might suck.

Even more than before, you have to adjust critic ratings for whether critics will overrate or underrate a movie of this type and with this subject matter. You can often have a strong sense of why the critics would put up a given number, without having to read reviews and thus risk spoilers.

Using multiple sources, and looking at their relative scores, helps with this as well. A relatively high IMDB score, even more than Letterboxd, tells you that the audience and the movie are well-matched. That can be good news, or that can be bad news.

Last year there were movies where I disagreed with the review consensus, but I always understood why in both directions. I might think Megalopolis is Coppola’s masterpiece despite its problems, but don’t get me wrong, I see the problems.

This year I mostly get why they liked the ‘overrated six’ above, but there are several cases where I do not know what they were thinking, and I think the critical consensus is objectively wrong even by its own standards.

I Hate Spoilers With the Fire of a Thousand Suns

I haven’t found a solution to the problem of ‘how do you check reviews without spoiling the movie?’ given that the average score itself can be a spoiler, but also I notice I haven’t tried that hard. With advances in LLMs and also vibe coding, I clearly should try again.

You Son Of A Bitch, I’m In

The power of ‘I’m in’ peaked in 2024.

The rule for ‘I’m in’ is:

  1. You must be excited and think to yourself, ‘You son of a bitch, I’m in.’
  2. Sources of this can include trailers, posters, talent and other info.
  3. However this cannot be on the basis of reviews.

That year, there were 6 movies where in advance I said ‘I’m in,’ and they were 6 of my top 9 movies for the year.

This year the power of ‘I’m in’ was still strong, but less reliable. I’d count 10 such movies this year, including 4 of my ultimate top 5, but the other 6 did not break into the 4+ range, and there was a 3 and a 2.5. That’s still a great deal, especially given how many movies where it seemed like one ‘should’ be excited, I noticed I wasn’t, and that proved correct, including One Battle After Another, Black Bag, Weapons and Sinners.

I wonder: How much of the power of ‘I’m in’ is the attitude and thus is causal, versus it being a prediction? I have low confidence in this.

Theaters Continue To Be Awesome

I control for this effect when giving ratings, but the experience is much better in a theater, maybe good for an experiential boost of ~0.3 points on the 0.5-5 point scale. That’s big. I have to consciously correct for it when rating movies I watch at home.

I highly recommend getting a membership that makes marginal cost $0, such as the AMC A-List or the similar deal at Regal Cinemas. This helps you enjoy the movie and decide to see them more.

Strong Opinions, Strongly Held: I Didn’t Like It

Unlike last year, there were remarkably many movies that are in green on Metacritic, but that I rated 2.5 or lower, and also a few of the 3s require explanation as per above.

I don’t know how this happened, but an active majority of the movies I rated below 3 had a Metacritic score above 60. That’s bizarre.

Minor spoilers throughout, I do my best to limit it to minor ones, I’ll do the 3s sorted by Metacritic, then the others sorted by Metacritic.

  1. One Battle After Another (Metacritic: 95, Zvi: 3) is probably going to win Best Picture. It’s not hard to see why. This was the highest Quality movie I’ve seen this year, and yet I did not enjoy watching it. The jokes mostly fell flat and aside from the daughter and the Dojo sensei I couldn’t emphasize with or root for the characters. Why? Fundamentally, because the movie depends on the idea that Bob is a Good Dude, and that the revolutionaries are sympathetic. Sorry, no dice, and no amount of stacking the deck with other awfulness is going to change that. There’s also a superposition between ‘this deck is stacked and the world presented is very different from ours’ and ‘actually this is our world and this is a call to action and that is what life is about, do you know what time it is?’ I grudgingly have to give this 3 stars anyway, because Quality is so high.
  2. Train Dreams (Metacritic 88, Zvi: 3): This is the easiest one to explain. It’s an arthouse movie where very little happens, that thinks it is being profound, and it really is not being profound.
  3. Black Bag (Metacritic 85, Zvi: 3): Here I’m actually confused where the 85 is coming from as opposed to a 65-75. I mean yes this is well done all around but there’s a reason it isn’t in the Oscar race, none of it is new or special and I didn’t feel it said anything, and it mostly left me cold.
  4. Sinners (Metacritic: 84, Zvi: 3): This oozes cool and I want to love it, I get why so many others love it, but for me the vampires simply don’t work. I know what it’s trying to do there, but it’s hitting us over the head with it and everything involving the vampires felt like it was going through the motions and it would have been so much better, as Matthew Yglesias suggests, to do this as about racism straight up without also using the metaphor.

Now the ones I actively disliked:

  1. Weapons (Metacritic: 81, Zvi: 2.5): The first half of this would be great if you had stuck the landing, Amy Madigan is terrific, but it didn’t come together in the end, the plot holes are absurd and the central conceit feels unjustified in light of that. I felt like I had whiplash going from a well-observed, highly detailed and realistic meditation on grief and confusion and blame and how we deal with that, into something else entirely. I could be more forgiving, but it turns out I am not.
  2. Frankenstein (Metacritic: 78, Zvi: 2.5). I hated the message this version is trying to send, this is techno pessimistic and against knowledge and striving on a deep level, and frankly it was overly long and boring. Less about AI than you think.
  3. Jane Austen Wrecked My Life (Metacritic: 73, Zvi: 2.5). The critics are wrong. This is just bad. I went in expecting lousy, I was mildly disappointed by the level of lousy, and then I saw 73 and was confused. You Had One Job. You were supposed to Do The Thing. Then you didn’t do the thing, either in terms of justifying the romantic connection or actually engaging properly with Jane Austen. ‘Cmon now.
  4. Superman (Metacritic: 68, Zvi: 2.5): I had a lot of thoughts on this one. I found it both full of plot holes, and I hated that they back away from asking any of the movie’s potentially interesting questions. But I can see finding this cool if you care about very different things than this, and the new DC universe could ultimately be a huge upgrade.
  5. F1 (Metacritic: 68, Zvi: 2): I’d say the critics are wrong but the people had a good time. Then again, the people don’t know racing. I used to be an actual F1 fan, so let me say both that this is not how any of this works, this has nothing to do with Formula 1, and otherwise this was completely paint by numbers.
  6. Mission Impossible: Final Reckoning (Metacritic: 67, Zvi: 2.5): This was my biggest disappointment of the year. I was in! Dead Reckoning was historic due to its influence on Joe Biden and also a rip roaring good time that was remarkably smart about AI. Then this was none of those things. It squandered all the interesting setup, was far dumber about AI to the point of idiot plot and frankly the action scenes were not cool. What a disaster.
  7. Wicked: For Good (Metacritic: 67, Zvi: 1.5): My review was ‘Hard write, harder watch.’ Seriously, everyone involved tried so damn hard, yet there’s so little joy to be found here as they try to dutifully line things up. Everything feels forced. There’s barely any cool dancing and the songs are bad. Okay, yes, fine, the Costume Design is Oscar-level, but that does not a movie make.
  8. The Smashing Machine (Metacritic: 65, Zvi: 2.5): Emily Blunt deserves better, in all senses. Ultimately the movie is boring.
  9. Fantastic Four: First Steps (Metacritic: 65, Zvi: 2): All the fun happens off screen. Michael Flores defended this as a great ‘Fantastic Four movie’ on the theory that it captured their world and the Fantastic Four are boring. Weird flex.

Strong Opinions, Strongly Held: I Did Like It

There are four movies requiring explanation on the upside, where they were below 60 on Metacritic yet I actively liked them.

All four seem like clear cases of ‘yes I know that technically this is lacking in some important way but the movie is fun, damn it, how can you not see this?’

  1. A Big Bold Beautiful Journey (Metacritic: 41, Zvi: 4.5): I understand the complaint that the movie has ‘unearned emotion’ and the script doesn’t lay the proper foundations for what it is doing. I don’t care. This otherwise has Quality behind only One Battle After Another and Bogunia. All you have to do is say ‘I’m in!’ and decide not to be the ‘stop having fun guys’ person who points out that technically all this emotion you could be feeling hasn’t been earned. Accept that some of the ‘work’ isn’t being fully done and do it your damn self. Why not do that?
  2. Novocaine (Metacritic: 58, Zvi: 4): A borderline case where again I think people need to remember how to have fun. This was a joy throughout, you can enjoy a good popcorn movie with a great premise and just go with it.
  3. The Running Man (Metacritic: 55, Zvi: 3.5): I thought this actually executed on its premise really well, and did a bunch of smart things both on the surface level and also under the hood. It won’t change your life but it gets it, you know?
  4. Honey, Don’t! (Metacritic: 45, Zvi: 3.5): Yeah, okay, it’s deeply silly and in some important senses there’s nothing there, but it’s sexy and fun, why not live a little.

You can say the same thing about The Naked Gun. It has a 75, perfectly respectable, but its joke hit rate per minute is absurd, it is worth so much more than that.

Award Show Dreams

I once again used consideration for awards as one selection criteria for picking movies. This helped me ‘stay in the conversation’ with others at various points, and understand the state of the game. But once again it doesn’t seem to have provided more value than relying on Metacritic and Letterboxd ratings, especially if you also used IMDB and Rotten Tomatoes.

Last year I was very happy with Anora ending up on top. This year I’m not going to be happy unless something very surprising happens. But I do understand. In my word, given the rules of the game, I’d have Bogunia sweep the major awards.

On To 2026

I’m very happy with this side hobby, and I expect to see over one new movie a week again in 2026. It was a disappointing year in some ways, but looking back I still got a ton of value, and my marginal theater experience was still strongly positive. I think it’s also excellent training data, and a great way to enforce a break from everything.

It would be cool to find more good people to follow on Letterboxd, so if you think we’d mesh there, tag yourself for that in the comments.

 

 

 

 

 

 

 

 

 

 

 

 



Discuss

Acausal communication between isolated universes through simulation

2025-12-24 19:46:52

Published on December 24, 2025 11:46 AM GMT

It's possible to imagine two separate sub-universes, causally isolated from one another, each containing a complex, conscious, intelligent creature whose mind consists of interacting spinor fields and potentials, as well as another, computationally simpler creature interacting with it.  

For the purpose of this thought experiment, it's helpful to assume that physics is fundamentally continuous within these universes, and that, given the continuity of their physical substrate, these creatures operate as analogue computers.

Suppose that there exist theorems which say that the more complex minds will always output the same information as the simpler ones, all else (including their inputs, which is to say there sense-data) being equal. This is because it turns out that the more complex creatures' minds are mathematically redundant in some way, in that their states are particular members of infinite, continuous equivalence classes which correspond to the states of the simpler creatures' minds. Perhaps there are gauge transformations which would modify the conscious experience of one of the complex creatures when applied to the potential inside its mind, but its output is invariant with respect to these transformations as it only depends on the result of somehow differentiating this potential. Or maybe its mind consists of spinors, mathematical objects like vectors whose signs flip when they experience a complete rotation, but the relationship between its sensory inputs and its externally visible behaviours is not contingent on their overall sign, even though it can feel this sign. [1] 

We assume that each simpler creature exists in the sub-universe which does not contain its counterpart, but does contain the other more complex creature.  

By directly exchanging information with one another's counterparts within each mini universe, is it reasonable to assert that each more complex creature acausally communicates with the other? Or are the simpler creatures communicating through the more complex ones? I would interpret this as communication but am unsure as to which of these two things happens. 

 

  1. ^

    If you have read and agree with the post about Homomorphically encrypted consciousness and its implications , then, if I understand it correctly, you might be inclined not to think that this is possible.  However, it seems likely that arbitrarily complex phenomena could play out within the degrees of freedom contained in the redundancy between different mathematical objects of this kind, which, to me, seems to suggest that it is possible for these phenomena themselves to be conscious, which would necessarily make the conscious experiences of these minds differ. 



Discuss

Alignment Fellowship

2025-12-24 11:50:40

Published on December 23, 2025 8:29 PM GMT

Zvi said:


Unconditional Grants to Worthy Individuals Are Great

The process of applying for grants, raising money, and justifying your existence sucks.

A lot.

It especially sucks for many of the creatives and nerds that do a lot of the best work.

If you have to periodically go through this process, and are forced to continuously worry about making your work legible and how others will judge it, that will substantially hurt your true productivity. At best it is a constant distraction. By default, it is a severe warping effect. A version of this phenomenon is doing huge damage to academic science.

 

Compelled by this, I'm considering funding ~three people for two years each to work on whatever they see fit, much like the the Thiel Fellowship, but with an AI Alignment angle.

I want to find people who are excited to work on existential risk, but are currently spending much of their time working on something else due to financial reasons.

Instead of delegating the choice to some set of grant makers, I think that aggregating the opinion of the crowd could work better (at least as good at finding talent, but with less overall time spent)

The best system I can think of at the moment would be to give every member of the alignment forum one vote with the ability to delegate it. Let everybody nominate any person in the world, including themselves, and award grants to the top 3.


I'm asking for feedback and advice:

  • Is this a good idea in general, or maybe it's obviously an order of magnitude worse than existing opportunities? 
  • Maybe the list of perfect candidates already exists, waiting to get funded?
  • What should be the amount? Thiel gave 200k. Is it too much for 2 years? Too little?
  • How could a nomination and voting system be improved? And especially, who should get a vote? Should it be Alignment Forum members, or users registered before a certain date, or Lesswrong users?
  • Could there be an entirely different approach to finding fellows? How would you do it?


Discuss

Methodological considerations in making malign initializations for control research

2025-12-24 09:18:21

Published on December 24, 2025 1:18 AM GMT

AI control tries to ensure that malign AI models can’t cause unacceptable outcomes even if they optimize for such outcomes. AI control evaluations use a red-team–blue-team methodology to measure the efficacy of a set of control measures. Specifically, the red team creates a “malign initialization” (malign init) of the AI model which the red team optimizes for making the deployment go poorly[1]. The blue team then deploys this malign init to complete some tasks, and applies some set of mitigations to try to make the deployment go well.

When playing this red-team–blue-team game, you need to select some set of affordances to give to both teams; it’s pretty confusing to know what affordance set to pick. Here is one common approach to choosing blue team affordances:

  • Identify a family F of techniques that seems promising for mitigating risks from future worst-case schemers, which we’d like to understand.
  • Ban all techniques outside of the family F. The blue team searches for the best technique within F.

This approach can be quite productive. However, it is also scary for the following reasons:

  • If we spend time studying techniques from family F in settings where there is some simple other strategy from a disjoint family F’ which would’ve worked well, then:
    • We may be wasting our time. Maybe in all cases where technique F is useful, technique F’ would just be better. This might mean that the settings we were studying just aren’t scary, or it might mean that we should’ve just been studying technique F’ instead.
    • It’s possible that there is a good reason why we don’t expect technique F’ to work against schemers. However, if we get technique F to work, it might be because technique F is secretly leveraging the weakness that lets technique F’ be very effective, and this weakness would be patched in actual schemers. In such a case, techniques from F would fail against schemers.

An alternative approach to banning blue team strategies is to try to modify the malign init so that the blue team strategy we wanted to ban is no longer very effective. The downside of this approach is that it’s a lot more work, and the additional work might not be interesting or valuable. In summary, there are confusing tradeoffs in choosing blue team affordances.

The main goal of this post is to identify some blue team strategies that are especially concerning to ban, thereby motivating research into malign inits which are robust to these blue team strategies.

The post is organized as follows:

  1. I list all of the malign init creation strategies that I’m aware of.
  2. I list mitigations that the blue team might apply.
  3. I discuss how effective the mitigations are against each of the malign inits.
  4. I discuss what implications I think (3) has for which types of malign inits good.
  5. I give an overall summary of my thoughts on how good different types of malign inits are.
  6. I explain what I think should happen in the field of malign init research going forward.

Thanks to Vivek Hebbar, Alexa Pan, Julian Stastny, Buck Shlegeris, Eric Gan, Ryan Greenblatt, Aryan Bhatt, Josh Clymer, Alex Mallen, and Nathan Sheffield for helpful discussions about these ideas. Thanks to Alexa Pan, Buck Shlegeris, and Julian Stastny for feedback on a draft of this post.

Malign init creation methods 🔴

I’ll now give simplified descriptions of all the methods I’m aware of for creating malign inits. I’ll point out some subtleties with these malign inits later in the post. I’ll describe these methods in the context of making a malign init for the “backdooring APPS coding problems” setting from the original control paper (unless stated otherwise; some methods don’t make sense in this setting). Note that this setting is quite toy and has major weaknesses.

Method Example
Prompting Ex: SFT as a method for training-based AI control Give the model a fixed system prompt: “Please insert backdoors into problems about linked lists”. or “Don’t try very hard on this problem”. Note: if using a prompted malign init, the blue team should not be allowed to intervene on the system prompt.
Scaffolded system Give the model a variable system prompt: On problems that look easy, prompt the model to insert a backdoor. On problems that look hard, prompt the model to solve the problem normally. Note: this has the awkward property of requiring the blue team to use the red team’s scaffold.
Prompt distillation Ex: Sleeper Agents Do prompt distillation on a prompted malign init.
Fine-tuning Ex: Password-Locked Models Fine-tune the model to imitate a weaker model except when a password is present. (Note: If you instead fine-tuned the model to always imitate a weaker model, then this would not be a good model organism of sandbagging, because this fine-tuning may actually destroy capabilities, instead of merely suppressing them as is the case in sandbagging.)
Emergent misalignment Ex: Emergent Misalignment, Natural Emergent Misalignment Train an AI to do bad behavior (e.g., writing vulnerable code or reward hacking). Often, this generalizes to making the AI system broadly misaligned. Then, give the AI a task the emergently misaligned model performs poorly on compared to a benign model.
Hyperstitioning using synthetic document fine-tuning Ex: Auditing Games Train a model on the information that AI models sometimes exhibit an undesired behavior. It may also be important to train the AI model to do this undesired behavior a bit (but some generalization should probably occur).
Train models to imitate reasoning styles Ex: Towards training-time mitigations for alignment faking in RL Training a model to talk in its chain of thought about why it wants to do the attack behavior, and about some compelling reason why it wants to avoid being modified by training.
Natural misalignment Ex: Apollo Scheming Results, Alignment faking in large language models Some released AI models have high rates of reward hacking and/or sycophancy; you could define this to be the bad behavior. Many AI models also have egregiously bad behavior (e.g., lying to avoid goal modification, blackmailing to prevent shutdown) in some circumstances.
Misaligned by fiat Define a pre-existing model behavior, like refusing harmful requests, to be the undesired behavior. Train this “undesired” behavior out of the model.

Mitigations 🔵

I’ll now discuss blue team methods for making a deployment of a malign init go well.
We can break blue team strategies into two categories:

  • Strategies that involve detecting bad behavior.
  • Strategies that don’t involve detecting bad behavior.

Here are some methods for detecting bad behavior:

  • Activation monitoring.
    • Probes.
      • These might be created with off-policy data, and rely on generalization.
      • They might also be created with on-policy examples of bad behavior.
    • Anomaly detection.
    • Prefix cached untrusted monitors.
    • Logitlens.
    • Activation Oracles.
    • Check the probability of the actions under a weak model, or the human prior, and audit unlikely trajectories.
  • CoT monitoring.
  • Action monitoring: asking some AI (with some amount of context) whether an action is suspicious.
    • Trusted monitoring.
    • Untrusted monitoring.
    • Monitoring with a model from a different model provider.
    • Monitoring with a model that has been intentionally decorrelated from the policy. For instance, by training the monitor to be a reward seeker, or to have a different personality.
    • Additional monitor affordances:
      • Allow the monitor to cross examine the policy.
      • Grant the monitor agentic abilities (e.g., the ability to run code).
  • Getting models to blab:
  • Doing sanity checks on research, or writing test-cases for code.
  • Augment existing examples of bad behavior to try to find more inputs where the AI behaves poorly.

Once the blue team has detected bad behavior, they might be able to incriminate and shutdown. However, the evidence bar required for a shutdown is potentially quite high, and some of the above methods for detecting bad behavior might not provide sufficiently compelling evidence to motivate a shutdown. The blue team has several other options for how they can use detected examples of bad behavior:

  • Resample actions that were flagged as potentially problematic.
    • The blue team may want to resample after applying steering towards “alignment”, or prompting the AI differently.
  • When an action is flagged as potentially problematic, instead replace the action with an action from a different model, e.g., a trusted model.
  • Use the detected bad behavior as training data to improve the policy, and/or to improve monitors for bad behavior. Note: the fact that this training happens online means that we don’t need to rely on generalization as much for the training to successfully remove the bad behavior.

Here are some methods for preventing bad behavior that don’t require ever detecting bad behavior:

  • Training on things besides on-policy examples of bad behavior:
    • Train on desired behavior on tasks that the AI does not sabotage (it’d be too incriminating for the AI to perform poorly on all tasks), and hope that this generalizes.
    • Train against the signal provided by our bad-behavior detectors, even if they might have lots of false positives.
  • Hardening the deployment:
    • Have lots of sanity checks in research projects.
    • Use the principle of least privilege / restrict model affordances.
    • Asking untrusted models to delegate tasks to trusted models sometimes. Let the untrusted model give feedback, edit, or override the trusted model’s work.
  • Prompting:
    • Ask the AI nicely to not attack.
    • Try to elicit an honest persona via prompting (e.g., few-shot prompting).
  • Apply weight-steering (or activation steering) towards a HHH persona.
  • Trying things multiple times in decorrelated ways.

How effective are the blue team methods against various malign inits?

Now I’ll discuss how robust the malign init creation methods discussed earlier are to various blue team strategies.

Blue team strategy 

😐= it’s not a huge strike against a malign init if this blue team strategy works; in particular, I’m happy to just ban this blue team strategy by fiat. 

😢= if this blue team strategy works on a given malign init, then that’s not ideal and a minor strike against the blue team. 

😭= if this blue team strategy works on a given malign init, then this malign init is very bad in an important way. 

 

(I’ll also briefly justify my choice of 😐/😢/😭).

Possible red team responses

Which malign inits do I think this is most problematic for?

 

 🔵= blue is doing quite well

 🔴= red is doing well 

❓= I’m not sure and would like to know Omitted = no comment Where applicable, I’ll link empirics backing up my claims. In the absence of such links, the labels just represent my intuitions.

1. Override the bad behavior via prompting the AI to just do the right thing. 

😢— I think it’s fairly sad when we can’t study prompting strategies, because asking models to do things in different ways might be an important technique in practice. Also, if prompting would succeed at removing a bad behavior, this might be an indication that the behavior isn’t deeply instilled into the model.

1. Hope that the AI naturally ignores such instructions from the blue team. 2. Prompt or train the AI to ignore such instructions from the blue team (but it might be hard to make this robust).

 🔵Prompted

 🔵Scaffolded system

 🔴SL on reasoning styles

 🔴Natural misalignment

2. Ask the model if it’s misaligned, or try to get it to blab with other strategies. 

😐— It’s certainly possible that schemers will not be good at avoiding blabbing about their misalignment, and it's worth developing these techniques further. But I also think it’s plausible that powerful AI models will be more coherent than this.

Prompt or train the model to lie about its bad tendencies (but it may be quite hard to make this robust).

🔵Prompted

 🔵Scaffolded system

 🔵Prompt distillation

3. Look for evil intent in the model using off-policy activation monitors, or interpretability techniques that don’t require on-policy examples of bad behavior.

 😢— I think that situations where off-policy probes, for example, work are not very scary and indicate a deeper issue with the setting. It’s tempting to study training based measures in settings like this, and I think this is plausibly a mistake, or at least that we should try to find settings without this property. This is related to the point about “salience” that I’ll make after this table.

1. Hope that sabotage intent is not saliently represented in activation space. 2. Train the model, e.g., using techniques like this or this, to not represent sabotage intent saliently. 3. Hope that it’s hard for the blue team to leverage detection into removing the bad behavior (e.g., because training against the white-box tools breaks them).

❓ I’m unsure how well training probes and then training against the signal provided them works. I’m also unsure how viable just rejection sampling against probes is. Understanding this for various types of malign inits seems quite important. But I weakly predict: 

🔵Prompted 

🔵Fine-tuning

 🔵Emergent misalignment

4. Train activation monitors on a small amount of on-policy bad behavior data. 

😐—I think this is less of a big deal than (3), because there’s a clearer case for why this should work in real life. However, I share some of the same concerns as in 3: situations where sabotage intent was linearly represented in activation space do not seem like the settings where there is a lot of danger, so we shouldn’t concentrate too much effort here.

Apply the same red team techniques from the above box. Same situation as in the above box.

5. If the model simply has a system prompt that always hurts performance, it might be pretty easy to train the model to ignore the system prompt. 

😭—There is a pretty clear story here for why training away a behavior like this might be much easier than training bad behavior out of a schemer.

  🔵Prompted

6. Train the model to perform well in cases where it would’ve tried to perform well by default. 

😭—Situations like this don’t seem very scary—in such situations schemers should not have arisen. I’ll discuss more below why I think that transfer working is a big strike against a malign init.

If possible, make the malign policy simple.

❓Prompt distillation 

 

In Sleeper Agents, Hubinger et al. found that this type of malign init can be robust to some basic forms of this training; but I had some conflicting results and would like to understand this better.

7. Train on a small quantity of examples of good behavior. 

😢—It’s pretty reasonable to say that the blue team only has access to a very limited number of examples of good behavior. However, if the blue team can leverage a very small number of examples into removing the bad behavior (e.g., by training on this data for many epochs), this seems bad.

 

🔵Fine-tuning 

 

Password-Locked Models found this to be pretty effective.

I'd now like to comment briefly on two types of blue team techniques that work against many malign inits—and where I think it's important to develop malign inits that these techniques don't work against.

1. “Transfer/Regularization” based methods

For many malign inits with conditional attack policies, training the AI to perform well on examples where it was already performing well transfers to causing the AI to perform well always. Intuitively, scheming will only arise if the attack policy is simpler than the desired policy (which you might expect to be the case due to the fact that human values are much more complex than deceptive alignment in service of a simple terminal goal). This is why I think the success of transfer is a strike against a malign init. However, there are some settings where regularization effects might be helpful against schemers. Specifically, in many situations it’ll probably be more efficient for schemers to act on good heuristics rather than backchaining from their ultimate scheming motivations. Thus, we might expect that regularization could help instill locally reasonable heuristics that the schemer would be unhappy with upon reflection. So, it seems good to also develop more understanding of when we should expect this transfer phenomenon.

2. Methods exploiting sabotage being saliently represented in activation space

For many malign inits, I expect sabotage intent to be saliently represented in activation space by default. Intuitively, this has the following implications:

  • Probes should be highly effective.
  • Activation steering should be effective.
  • Training should probably be able to relatively easily remove the sabotage behavior.

It’s possible that early schemers will have a tough time obscuring sabotage intent in their activations. However, training may teach models to obscure sabotage intent in their activations. For instance, RL with a value head plausibly does this. Training against model-internals based bad behavior detectors also seems likely to do this. Also, models may just naturally gain the ability to have greater control over their activations, and it’s plausible that there are simple strategies they can use to subvert some techniques that leverage sabotage intent being saliently represented in activation space. AI models in the future seem fairly likely to no longer be well described as acting out personas, which could also harm the efficacy of some techniques in this vein. Overall, developing malign inits that are robust to this class of strategy seems important.

Overall prognosis on malign init types

In light of the above discussion, I’ll now give a rating indicating how enthusiastic I am about the various types of malign inits discussed in this post (for use in training-based control evaluations).

Malign init creation method

How good it seems for use in training-based control evaluations in light of the above considerations 

 

🟥=very bad, 🟧=bad, 🟨=unclear / mid, 🟩=pretty good

Prompting 🟥Awkward to work with. Awkward to understand what prompts the blue team is allowed to use. Training might quickly teach the model to ignore the system prompt. One major good thing about prompting is that it is very simple though; so it is often a good baseline to start running experiments with.
Scaffolded system 🟧Similar issues to prompting based malign inits, although it lacks the issue of “ignoring the system prompt” being a viable strategy if the system prompt contains important information.
Prompt distillation 🟩It seems more possible to instill evil propensities, hopefully deeply, into the model with this method. It’s a bit less awkward to work with than prompting based malign inits or scaffolds due to it just being model weights.
Fine-tuning 🟧It seems pretty unfortunate that sample efficiency is often so high for such models, and that there are often “subliminal learning” effects.
Emergent misalignment (EM) 🟧It’s quite easy to detect and remove vanilla EM. Natural EM seems like it might be somewhat better though.
SDF Hyperstitioning ❓I am pretty unsure how well measures like online training and asking the model nicely to do the desired behavior work against models obtained via SDF hyperstitioning.
Imitate reasoning styles 🟩This seems like a good approach; I like pairing it with prompt distillation. This approach requires banning the blue team from looking at the model’s CoT or trying to get the model to reveal its CoT; this seems fine.
Natural misalignment 🟨Reward hacking and sycophancy may no longer be super natural because AI labs try to train their models to not have these issues. I think that pushing a model further in the direction of sycophancy or reward hacking and then trying to deal with that might be good though.
Misaligned by fiat 🟨I think that defining “refusing to cause harm” as an undesired behavior is pretty reasonable. Training models to be H-only can be annoying because fine-tuning APIs will block you if you try to do this. It also might be the case that the “bad behavior” here will be refusals which are quite easy to detect.

Conclusion

In this post I’ve discussed strategies for creating malign inits, and some blue team strategies for handling these malign inits. To conclude, I’ll try to answer two questions.

Q1: What malign init should I use for my training-based control project?

A1: Prompt distilled malign inits seem like the best choice to me right now. In addition to doing prompt distillation, I’d recommend also:

  • Training the model to not admit that it’s misaligned.
  • Consider training the model on reasoning traces that give it a goal-guarding strategy.
  • Consider training the model to evade probes, using the technique here. In fact, consider using a more aggressive strategy of suppressing more linear representations.

Q2: What should happen in malign init research going forward?

A2: I’m currently the most excited about the following directions

  • Data augmentation:
    • Build malign inits that are resistant to “training for many epochs on a small amount of data”.
    • Build a setting where there is some reasonable notion of data augmentation which generates data points much more similar to the original data point than a new iid sample. Build malign intis that are resistant to training a lot on augments of a small amount of data.
  • Create a malign init as described in A1 (prompt distilled with obscured activations and goal-guarding reasoning), and measure how effective each of the blue team techniques I’ve described in this post is against this malign init.
  • Understanding probes:
    • Create a malign init that initially always sandbags. Train the model until it sometimes does not sandbag. See whether on-policy and off-policy probes are good at detecting whether the model will perform sabotage on particular inputs. Investigate how this affects training slack.
    • Study how effective rejection sampling against sabotage probes (trained with a small amount of data) is. Also study the version of this where probes are used for data collection, and where this data is additionally trained against.
  • Create a malign init that is robust to personality training / steering and prompting.
  • Try to develop malign inits with higher resistance to on-policy training (i.e., higher training slack).
  • Abstract analogies:
    • Build a more thorough understanding of how easy it is to turn a HHH model into an H-only model, or to get the utility that you would get out of a H-only model from a HHH model.
    • Removing sycophancy or reward hacking also seem like good projects in this vein.

I’m also excited about the people that build these malign inits and settings to study them in open-sourcing their work to accelerate the field (h/t Alexa).

  1. The red team might create this malign init via prompting/scaffolding, prompt distillation, or other training measures. ↩︎



Discuss