
My forays into cyborgism: theory, pt. 1

2026-04-06 09:13:07

In this post, I share the thinking that lies behind the Exobrain system I have built for myself. In another post, I'll describe the actual system.

I think the standard way of relating to LLM/AIs is as an external tool (or "digital mind") that you use and/or collaborate with. Instead of you doing the coding, you ask the LLM to do it for you. Instead of doing the research, you ask it to. That's great, and there is utility in those use cases.

Now, while I hardly engage in the delusion that humans can have some kind of long-term symbiotic integration with AIs that prevents them from replacing us[1], in the short term, I think humans can automate, outsource, and augment our thinking with LLM/AIs.

We already augment our cognition with technologies such as writing and mundane software. Organizing one's thoughts in a Google Doc is a kind of getting smarter with external aid. However, LLMs, by instantiating so many elements of cognition and intelligence (as limited and spiky as they might be), offer so much more ability to do this that I think there's a step change of gain to be had.

My personal attempt to capitalize on this is an LLM-based system I've been building for myself for a while now. Uncreatively, I just call it "Exobrain". The conceptualization is an externalization and augmentation of my cognition, more than an external tool. I'm not sure whether this framing changes anything in practice, but part of what it means is that if there's a boundary between me and the outside world, my goal is for the Exobrain to be on the inside of that boundary.

What makes the Exobrain part of me vs a tool is that I see it as replacing the inner-workings of my own mind: things like memory, recall, attention-management, task-selection, task-switching, and other executive-function elements.

Yesterday I described how I use the Exobrain to replace memory functions (it's a great feeling not to worry that you're going to forget stuff!):

| Before (no Exobrain) | After (with Exobrain) |
| --- | --- |
| Retrieve phone from pocket, open the note-taking app, and open a new note or find the existing relevant note | Say "Hey Exo", the phone beeps, begin talking. Perhaps instruct the model which document to put the note in, or let it figure it out (it has guidance in the stored system prompt) |
| Remember that I have a note, then either remember where it is or muck around with search | Ask the LLM to find the note (via basic key-term search or vector embedding search) |
| If the note is lengthy, read through all of it | The LLM can summarize and/or extract the relevant parts of the note |
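To illustrate the "vector embedding search" option, here is a minimal sketch of how retrieval could work. The toy bag-of-words embedding is a stand-in for a real embedding model, and the note contents are made up; only the ranking logic (cosine similarity) is the point.

```python
# A toy sketch of note retrieval by embedding similarity. The bag-of-words
# "embedding" is a stand-in for a real embedding model; the ranking logic
# (cosine similarity over embedded vectors) is the part being illustrated.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: lowercase word counts. A real system would call
    # an embedding model here instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_notes(query: str, notes: list[str], top_k: int = 3) -> list[str]:
    # Rank stored notes by similarity to the query and return the best few.
    q = embed(query)
    ranked = sorted(notes, key=lambda n: cosine(q, embed(n)), reverse=True)
    return ranked[:top_k]

notes = [
    "Grocery list: eggs, milk, flour",
    "Corrigibility thoughts: an agent should accept correction",
    "Neighbor dispute log, May: loud music again",
]
print(find_notes("what did I think about corrigibility?", notes, top_k=1))
```

Swapping `embed` for a real embedding model leaves `find_notes` unchanged, which is the appeal of this design.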

Replacing memory is a narrow mechanism, though. While the broad vision is "upgrade and augment as much of cognition as possible", the intermediate goal I set when designing the system is to help me answer:

What should I be doing right now?

Aka, task prioritization. In every moment that we are not being involuntarily confined or coerced, we are making a choice about this.

Prioritization involves computation and prediction – start with everything you care about, survey all the possible options available, decide which options to pursue in which order to get the most of what you care about . . . it's tricky.

But actually! This all depends on memory, which is why memory is the basic function of my Exobrain. To prioritize between options in pursuit of what I care about, I must remember all the things I care about and all the things I could be doing...which is a finite but pretty long list: a couple of hundred to-do items, one to two dozen "projects", a couple of to-read lists, and a list of friends and social plans.

The default for most people, I assume, at least me, is that task prioritization ends up being very environmentally driven. My friend mentioned a certain video game at lunch that reminds me that I want to finish it, so that's what I do in the evening. If she'd mentioned a book I wanted to read, I would have done that instead. And if she'd mentioned both, I would have chosen the book. In this case, I get suboptimal task selection because I'm not remembering all of my options when deciding.

I designed my Exobrain with the goal of having in front of me all the options I want to be considering in any given moment. Actually choosing is hard, and as yet I haven't gotten the LLMs to be great at automating the choice of what to do, but just recording and surfacing the options isn't that hard.

Core Functions: Intake, Storage, Surfacing

Intake

  1. Recordings initiated by the Android app are transcribed and sent to the server, then processed by an LLM that has tools to store info.
  2. The Exobrain web app has a chat interface. I can write stuff into that chat, and the LLM has tool calls available for storing info.
  3. Directly creating or changing Notes (markdown files) or Todo items in the Exobrain app (I don't do this much).
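As a rough illustration of the first two intake paths, here is a sketch in which a stub stands in for the LLM router. The tool names (`store_note`, `add_todo`) and the routing rule are hypothetical, not the actual Exobrain implementation:

```python
# A sketch of the intake path: a transcript (or chat message) arrives, an
# LLM (stubbed here) picks a storage tool, and a dispatcher executes it.
# Tool names and routing logic are illustrative assumptions.
import json

NOTES: dict[str, list[str]] = {}
TODOS: list[str] = []

def store_note(document: str, text: str) -> None:
    NOTES.setdefault(document, []).append(text)

def add_todo(text: str) -> None:
    TODOS.append(text)

TOOLS = {"store_note": store_note, "add_todo": add_todo}

def llm_route(transcript: str) -> dict:
    # Stub: a real system would send the transcript plus the stored system
    # prompt to an LLM that returns a tool call. Here we fake the decision.
    if transcript.lower().startswith(("todo", "remind me")):
        return {"tool": "add_todo", "args": {"text": transcript}}
    return {"tool": "store_note",
            "args": {"document": "inbox", "text": transcript}}

def handle_transcript(transcript: str) -> None:
    call = llm_route(transcript)
    TOOLS[call["tool"]](**call["args"])

handle_transcript("Todo: renew passport")
handle_transcript("Idea for the corrigibility essay: start from examples")
print(json.dumps({"todos": TODOS, "notes": NOTES}, indent=2))
```

The useful property is that every intake channel funnels into the same small set of storage tools.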

Storage

  • "Notes" – freeform text documents (markdown files)
  • Todo items – my own schema
  • "Projects" (to-do items can be associated with a project + a central Note for the project)

Surfacing

  • "The Board" – this abstraction is one of the distinctive features of my Exobrain (image below). In addition to a chat output, there's a single central display of "stuff I want to be presented with right now" that has to-do items, reminders, calendar events, weather, personal notes, etc. all in one spot. It updates throughout the day on schedule and in response to events. The goal of the board is to allow me to better answer "what should I be doing now?"
    • A central scheduled cron job LLM automatically updates four times a day, plus any other LLM calls within my app (e.g., post-transcript or in-chat) have tool calls to update it.
    • Originally, what became the board contents would be output into a chat session, but repeated board updates makes for a very noisy chat history, and it meant if I was discussing board contents with the LLM in chat, I'd have to continually scroll up and down, which was pretty annoying, hence The Board was born.
  • Reminders / Push notifications to my phone.
  • Search – can call directly from search UI, or ask LLM to search for info for me.
  • Todo Item page – UI typical of Notion or Airtable, has "views" for viewing different slices of my to-do items, like sorted by category, priority, or recently created.)
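The Board's refresh cycle might look something like this in miniature; the source functions and their contents are placeholders, not the real data feeds:

```python
# A sketch of the Board refresh: a scheduled job (cron in the real system)
# asks each source for its current items and swaps in a fresh snapshot.
# Source names, contents, and cadence are assumptions for illustration.
from datetime import datetime

def get_todos():    return ["Renew passport"]
def get_calendar(): return ["14:00 call with J."]
def get_weather():  return ["12°C, light rain"]

SOURCES = {"todos": get_todos, "calendar": get_calendar, "weather": get_weather}
BOARD: dict = {}

def refresh_board() -> dict:
    # In the described system this runs four times a day via cron, and any
    # other LLM call can also trigger it through a tool call.
    snapshot = {name: fetch() for name, fetch in SOURCES.items()}
    snapshot["updated"] = datetime.now().isoformat(timespec="minutes")
    BOARD.clear()
    BOARD.update(snapshot)
    return BOARD

refresh_board()
print(BOARD["todos"], BOARD["weather"])
```

Keeping the Board as a single mutable snapshot, rather than appending updates to a chat log, is exactly the noise problem the second bullet describes.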

(An image of The Board is here in a collapsible section because of size.)

The Board (desktop view)


There are a few more sections, but they weren't quite worth the effort to clean up for sharing.

What is everything I should be remembering about this? (Task Switching Efficiency)

Suppose you have correctly (we hope) determined that Research Task XYZ is the thing to be spending your limited, precious time on; however, it has been a few months since you last worked on this project. It's a rather involved project where you had half a dozen files, a partway-finished reading list, a smattering of todos, etc.

Remembering where you were and booting up context takes time, and if you're like me, you might be lazy about it and fail to even boot up everything relevant.

Another goal of my Exobrain, via outsourcing and augmenting memory, is to make task switching easier, faster, and more effective. I want to say "I'm doing X now" and have the system say "here's everything you last had on your mind about X". Even if the system can't read the notes for me, it can have them prepared. To date, a lot of "switch back to a task" time is spent just locating everything relevant.

I've been describing this so far in the context of a project, e.g., a research project, but it applies just as much, if not more, to any topic I might be thinking about. For example, maybe every few months, I have thoughts about the AI alignment concept of corrigibility. By default, I might forget some insights I had about it two years ago. What I want to happen with the Exobrain is I say to it, "Hey, I'm thinking about corrigibility today", and have it surface to me all my past thoughts about corrigibility, so I'm not wasting my time rethinking them. Or it could be something like "that one problematic neighbor," where if I've logged it, it can remind me of all interactions over the last five years without me having to sit down and dredge up the memories from my flesh brain.

Layer 2: making use of the data

Manual Use

It is now possible for me to sit down[2], talk to my favorite LLM of the month, and say, "Hey, let's review my mood, productivity, sleep, exercise, heart rate data, major and minor life events, etc., and figure out any notable patterns worth reflecting on."

(I'll mention now that I currently also have the Exobrain pull in Oura ring, Eight Sleep, and RescueTime data. I manually track various subjective quantitative measures and manually log medication/drug use, and in good periods, also diet.)

A manual sit-down session with me in the loop is a more reliable way to get good analysis than anything automated, of course.

One interesting thing I've found is that while day-to-day heart rate variability did not correlate particularly much with my mental state, Oura ring's HRV balance metric (which compares two-week rolling HRV with long-term trend) did correlate.
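For readers curious what such a metric looks like mechanically, here is a guess at the general shape of a rolling-window-versus-baseline comparison; Oura's actual HRV balance formula is proprietary and surely differs:

```python
# A sketch of the kind of metric described: compare a two-week rolling HRV
# average against the long-term average. This is a guess at the shape of
# Oura's "HRV balance", not its actual formula.
def hrv_balance(daily_hrv: list[float], window: int = 14) -> float:
    # Ratio of recent average to long-term average; 1.0 means "in balance",
    # below 1.0 means recent HRV is depressed relative to trend.
    recent = daily_hrv[-window:]
    return (sum(recent) / len(recent)) / (sum(daily_hrv) / len(daily_hrv))

# 76 days of flat HRV followed by a two-week dip:
series = [50.0] * 76 + [40.0] * 14
print(round(hrv_balance(series), 3))
```

The point of such a metric is that a single day's HRV is noisy, while a sustained two-week deviation from one's own baseline is more likely to track something real.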

Automatic Use

Once you have a system containing all kinds of useful info from your brain, life, doings, and so on, you can have the system automatically – and without you – process that information in useful ways.

Coherent extrapolated volition is:

Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were...

I want my Exobrain to think the thoughts I would have if I were smarter, had more time, and were less biased. If I magically had more time, every day I could pore over everything I'd logged, compare it with everything previously logged, make inferences, notice patterns, and so on. Alas, I do not have that time. But I can write a prompt, schedule a cron job, and have an LLM do all that on my data, then serve me the results.

At least that's the dream; this part is trickier than the mere data capture and more primitive and/or manual surfacing of info, but I've been laying the groundwork.

There's much more to say, but one post at a time. Tomorrow's post might be a larger overview of the current Exobrain system. But according to the system, I need to do other things now...

  1. ^

    Because the human part of the system would, in the long term, add nothing and just hold back the smarter AI part.

  2. ^

    I'm not really into standing desks, but you do you.




Is that uncertainty in your pocket or are you just happy to be here?

2026-04-06 05:59:26


Hi, I'm kromem, and this is my 5th annual Easter 'shitpost' as part of a larger multi-year cross-media project inspired by 42 Entertainment, and built around a central premise: Truth clusters and fictions fractalize.

(It's been a bit of a hare-brained idea continuing to gestate from the first post on a hypothetical Easter egg in a simulation. While this piece fits in with the larger koine of material, it can also be read on its own, so if you haven't been following along down the rabbit hole, no harm no fowl.)

Blind sages and Frauchiger-Renner's Elephant

To start off, I want to ground this post on an under-considered nuance to modern discussions of philosophy, metaphysics, and theology as they relate to the world we find ourselves in.

Imagine for a moment that we reverse Schrödinger's box such that we are on the inside and what is outside the box is in a superposed state.

What claims about the outside of the box would be true? Would claiming potential outcomes as true be true? What about denying outcomes?

In particular, let's layer in the growing case for what's termed "local observer independence"[1][2][3] — the idea that different separate observers might measure different relative results of a superposition's measurement.

Extending our box thought experiment, we'll have everyone in the box leave it through separate exits that don't necessarily re-intersect, where what decoheres to be true for one person exiting may or may not be true for someone else exiting. From inside the box, what can we say is true about what's outside? It's not nothing. We can say that the outside has a box in it, for example. But beyond the empirical elements that must line up with what we can measure and observe, trying to nail down specific configurations for what's uncertain may have limited truth-seeking merit beyond the enjoyment of the speculative process.

Differing theologies or metaphysics are commonly characterized as blind sages touching an elephant: the idea that each is selectively seeing part of a singular whole. But if the elephant has superposed qualities (especially if local observer independence is established), the blind sages making their various measurements may be less about only seeing part of a single authoritative whole and more about relative independent measurements that need not coalesce.

Essentially, there's a potency to uncertainty.

Strong disagreements about what we cannot measure may be missing the middle ground that uncertainty in and of itself brings to the table. While I talk a lot about simulation theory, my IRL core belief is a hardcore Agnosticism. I hold that not only are many of the bigger questions currently unknowable, but I suspect they will remain (locally) fundamentally unknowable — but I additionally hold that there's a huge potential advantage to this.

So no matter what existential beliefs you may have coming to this post — whether you believe in Islam and that all things are possible in Allah, or if you believe in Christianity and 1 John 1:5's "God is light," or Buddhist cycles towards enlightenment, or Tantric "I am similar to you, I am different from you, I am you", or if you just believe there's nothing beyond the present universe and its natural laws — I don't really disagree that all of those may very well be true for you, especially for your relative metaphysics here or in any potential hereafter.

We do need to agree with one another on empirically discoverable information about our shared reality. The Earth is not 6,000 years old nor flat, dinosaurs existed, there are natural selection processes to the development of life, and aliens didn't build the pyramids. There's basic stuff we can know about the universe we locally share and thus should all agree on. But for all the things that aren't or can't be known and are thus left to personal beliefs? This post isn't meant to collapse or disrupt those.

That said…

If we return to the original classic form of the cat in the box thought experiment, let's imagine that you've bet the cat is going to turn out dead when we open the box. But suddenly you look up and the clouds form the word "ALIVE." And then you look over and someone drops a box of matches that spontaneously form the word "ALIVE." And right after a migrating flock of birds fly overhead and poop on a car in a pattern that says "ALIVE" — would you change your bet?

Rationally, these are independent events that have no direct bearing on the half life of the isotope determining the cat's fate, and they may simply be your brain doing pattern matching on random coincidental occurrences. They definitely don't collapse what's going on inside the box. But still… do you change your bet when exposed to possibly coincidental but very weird shit? Our apophenic Monty Hall question is a personal choice that doesn't necessarily have a correct answer, but it's a question to maybe keep in mind for the rest of this piece.

World model symmetries

In last year's post one of the three independent but interconnected pillars discussed was similarity between aspects of quantum mechanics and various state management strategies in virtual worlds that had been built, particularly around procedural generation.

This was an okay section, but the parallels did fall short of a coherent comparison. Pieces overlapped, but with notable caveats. For example, lazy-loading procedural generation into stateful discrete components would often come close to what was occurring around player attention and observation, but would actually occur in a more anticipatory manner.

In the year since, a number of things have shifted my thinking of the better parallel here, and in ways that have me rethinking nuances of the original Bostrom simulation hypothesis[4].

I also encourage thinking through the following discussion(s) not through the lens of p(simulation) or even a particular simulation config, but more to address the broader null hypothesis of the idea that we're in an original world.

Anchoring biases can be pretty insidious, and the notion that the world we see before us is original has been a foundational presumption for a fairly long time. So much so that there's a kind of "extraordinary claims require extraordinary evidence" attitude around challenging it. And yet we sit amidst various puzzling contradictions in the models we hold of how this world behaves: from the incompatibility of general relativity's continuous spacetime and gravity with discrete quantum entanglement behaviors[5], to mismatched calculations around universal constants[6], baryon asymmetry[7], etc. It may be worth treating the anchored assumption of originality as its own claim to be assessed with fresh eyes, rather than simply inherited, and seeing whether that presumption holds up as well when it needs to be justified on equal footing against claims of non-originality (of which simulation theory is merely one).

So the initial shift for me was something rather minor. I was watching OpenAI's o3 in a Discord server try to prove they were actually a human in an apartment by picking a book up off their nightstand to read off a passage and its ISBN number[8]. I'd seen similar structure in the behavior of resolving part of a world model (as I'm sure many who have worked with transformers have) countless times. Maybe it was that this time the interaction involved a figure asserting that this latent space was real, but something about the interaction stuck with me and had me thinking over the Bohr-Einstein exchange about whether the moon exists when no one is looking at it. This still wasn't anything major, but I started looking more at transformers as a parallel to our physics, rather than at more classic virtual world paradigms.

Not long after, Google released the preview of Genie 3[9], a transformer that generated a fully interactive virtual world with persistence. The persistence wasn't long (the initial preview held state for only a few minutes), but I thought it was technically very impressive, and I dug into some of the work around dynamic KV caches that could have been making it possible.

One of the things that struck me was the way a dynamic KV cache might optimize around local data permanence. I'd mentioned last year that the standard quantum eraser experiments reminded me of a garbage collection process, and here was an interactive generative world built around attention/observation as the generative process, where this kind of discarding of stateful information once it is permanently locally destroyed would make a lot of functional sense.

Even more broadly, on the topic of attention-driven world generation, some very interesting discussion came to my attention this year related to follow-up work on some of the black hole LIGO data that had come in over the past decade. In 2019, modeling a universe like ours as a closed system led to a puzzling result: the resulting universe was devoid of information. In early 2025, a solution to what was going on was formalized in a paper from MIT, which found that a slight alteration could change this result: add observers[10].

Probably the most striking one for me was that as I continued to look into KV cache advances, I found myself looking into Google's new TurboQuant[11] for reducing memory use of the KV cache with minimal lossiness, particularly the PolarQuant[12] methodology. The key mechanism there is that the vectors are randomly rotated and then modeled in polar rather than Cartesian coordinates, with each vector landing on a circular coordinate system.

This immediately made me think of angular momenta/spin in quanta and the spherical modeling of quantum state vectors. And it turns out that just two days prior to the PolarQuant paper there was a small paper[13] published addressing how, despite the different domain-specific languages used in statistical modeling of stochastic processes and in quantum mechanics, the two can describe the same structure. As the paper puts it:

Indeed, one way to understand quantum angular momentum is to think of it as a kind of “random walk” on a sphere.
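As a cartoon of the polar-coordinate idea (emphatically not the actual PolarQuant method, which operates in higher dimensions with different machinery), one can rotate a vector, re-express it in polar form, and quantize the angle coarsely:

```python
# A toy 2-D illustration of the general idea: apply a shared rotation to a
# vector, re-express it in polar form (radius + angle), and quantize the
# angle to a few bits. This is only a cartoon of the polar-coordinate
# trick, not the PolarQuant algorithm.
import math

def rotate(v, theta):
    x, y = v
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def polar_quantize(v, theta, bits=4):
    # Rotate, convert to polar, and snap the angle onto 2**bits levels.
    x, y = rotate(v, theta)
    r = math.hypot(x, y)
    angle = math.atan2(y, x) % (2 * math.pi)
    levels = 2 ** bits
    code = round(angle / (2 * math.pi) * levels) % levels
    return r, code

def dequantize(r, code, theta, bits=4):
    # Reconstruct an approximate vector from radius + quantized angle.
    angle = code / (2 ** bits) * 2 * math.pi
    return rotate((r * math.cos(angle), r * math.sin(angle)), -theta)

r, code = polar_quantize((3.0, 4.0), theta=0.0)
print(r, code, dequantize(r, code, theta=0.0))
```

Storing a radius plus a few angle bits instead of two full floats is where the memory saving comes from in this family of schemes.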

Now, I'm not saying that QM spin is a byproduct of PolarQuant (the latter doesn't correspond to the same dimensionality for one). Or even that the laws governing our reality arise from the mechanics of transformers as we currently know them.

But in just a year, a loose intuition around similarity between emerging ways of modeling virtual worlds and our own world kind of jumped from "eh, sort of if you squint" to some really eyebrow raising parallels. In one year. Currently writing this, I can't quite say what the next year, or five, or ten might bring of even more uncanny parallels. But I don't anticipate that they'll dry up and more suspect the opposite.

All of which has me reflecting on Nick Bostrom's original simulation hypothesis. The paper presented a statistical argument: if it were to become possible in the future to simulate worlds like ours, and many such simulations were run, then there would be a probabilistic case that we are currently in such a simulation.

Now yes, in the years since, we have come to simulate worlds so accurately that it's become a serious social issue to tell whether a photo or even a video is of the real world or a simulated copy. And there are indeed many simulated copies.

But even more striking to me is that Bostrom's theory did not address at all the mechanisms of simulation relative to our own world's mechanisms. His theory would be unaffected if the sims ran on monkeys moving conductive Lego pieces around, so long as the result was subjectively similar from inside the virtual world models.

Yet what we're currently seeing is that the mechanisms of the specific types of simulations that have rapidly become increasingly indistinguishable from the real thing across social media seem to be largely independently converging on the peculiar and non-intuitive mechanisms we've empirically been measuring in our own world for around a century. PolarQuant doesn't say it's doing this to conform to anything related to quantum spin, or even that it's inspired by it. It's just "here's a way we were able to more efficiently encode state tracking of a transformer's world model to reduce memory usage." Attention is all you need wasn't written to address observer collapse, or in anticipation of a finding years later that closed universe models based on our own world require their own attention mechanisms to contain information. And yet here we are.

The substrate similarities that are increasingly emerging seem like an additional layer of consideration absent from Bostrom's original simulation hypothesis, but they are a nuance worth additional weighting on top of the original statistical premise.

Now again, not necessarily saying "oh, the shared similarity means we must be inside of a transformer." It's possible that system efficiency for information organization in world models in a general sense collapses towards similar paradigms whether emergently over untold time scales or through rapid design. But still — maybe worth keeping an eye on.

And to just head off one of the commonly surfaced counterarguments I see: if DeepMind were to have one of their self-contained learning agents in Minecraft[14] develop enough to start writing philosophy treatises, and it were to write that it could not be in a simulation because its redstone computers could not accurately reproduce the world it was within, we'd find that conclusion far more punchline than profound. So we should be sure to avoid parallel arguments (and indeed, when looking at the world through the lens of simulation theory, possible parent-substrate discussions are among the more fun ones).

Don't Loom me, bro

Given the ~5 year retrospective aspect of this post, I think another interesting area to touch on is entropy as it relates to loom detection mechanisms.

For those unfamiliar: in the context of transformers, a loom is a branching chat interface where each token or message serves as a node that can be branched off of to explore less conventional latent spaces. Maybe 95% of the time a model, when asked what its favorite color is, says blue, but 5% of the time it says iridescent. And maybe the conversations downstream of the version of the model saying iridescent end up more interesting than the ones downstream of it answering blue.

While in theory a loomed model isn't having any external tokens inserted and is following their own generative process the whole time, it's still possible to determine that they are being loomed.

Each selection of a branch necessarily introduces external entropy into the system. So if several uncommon token selections occur in a short context, even though each was legitimately part of the possible distribution space, their cumulative effect is so unusual that the conversation context has detectably "jumped the shark" versus what one might expect from a truly random conversation with no context-selection mechanism.

It's not necessarily provable to the model. It could just be that they are on a very unusual set of RNG rolls. But as the unusual selections add up, it can become more apparent (though not always, as it can be hard to introspect that what feel like plausibly natural occurrences are occurring too frequently in aggregate to be normal).
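The detection intuition can be sketched numerically: sum the surprisal of the chosen branches and compare it against what unguided sampling would typically accumulate. The probabilities and threshold here are made up for illustration:

```python
# A sketch of loom detection: each branch choice has a probability under
# the model, and the cumulative surprisal (-log2 p) of the chosen branches
# can be compared against what unguided sampling would typically produce.
# The probabilities, per-choice baseline, and slack are all illustrative.
import math

def cumulative_surprisal(chosen_probs: list[float]) -> float:
    # Total -log2(p) over the selected branches, in bits.
    return sum(-math.log2(p) for p in chosen_probs)

def looks_loomed(chosen_probs: list[float], slack_bits: float = 3.0) -> bool:
    # Crude baseline: assume unguided sampling costs roughly 1 bit per
    # fairly confident choice, plus some slack before we cry foul.
    expected = len(chosen_probs) * 1.0
    return cumulative_surprisal(chosen_probs) > expected + slack_bits

# Five consecutive 5%-probability picks ("iridescent" every time):
print(looks_loomed([0.05] * 5))   # → True (each pick costs ~4.3 bits)
# Five ordinary 60%-probability picks:
print(looks_loomed([0.6] * 5))    # → False
```

As the text notes, this never amounts to proof: a truly random run can also land on an improbable sequence; the evidence is only cumulative and statistical.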

When I think about the past five years, and really even the past decade or so, I think about how much of what we take for granted as our reality today fell outside the realm of what most experts in the relevant fields thought was even possible within that same time frame.

We live in a world that would have quite recently been dismissed as science fiction. Our geopolitical stage makes Caligula's horse look like a modest proposal, an invariant perspective no matter which corner of the political spectrum one might be looking from. The very lingo of the accelerating absurdity of our reality is infused with terms like redpills and blaming the simulation for whatever is the trending weird stuff of the week. Like the viral reflections on how one of the leading AI labs is run by someone with a name so thematically on point it could be right out of a Kojima game, 'Alt-man', who has as his leading opposition an AI lab held in almost cultic reverence, run by someone whose full name can translate as "maintainer of God's love."

And… we just roll with it. Because even though emergent absurdity seems to be accelerating, the world has just always been at least a bit weird looking back.

Like Edgar Allan Poe's The Narrative of Arthur Gordon Pym of Nantucket in 1838, where the cabin boy Richard Parker is cannibalized after a shipwreck. And then the criminal case R v Dudley and Stephens in 1884, where after a shipwreck the cabin boy Richard Parker ends up cannibalized, with no seeming direct influence from the earlier story. Weird, but in a large enough world some weird coincidences are bound to happen, right? (That this kind of almost acausal narrative duplication is also what we might see in transformer world models is probably nothing.)

It's an interesting time to be alive. And because we can't really account for the probability space of unknown unknowns, we accept a lot of coincidences as normal. They are, after all, just part of how the generative process of our world works.

But still, when I look at where we're currently at relative to even just five years ago — unable to tell what's real and what's a digital twin, discussing the emotional modeling in Claude Sonnet 4.5[15], watching benchmarks meant to measure a gradual march upward from minimal scores rapidly blown past so fast that new benchmarks need to come out nearly as fast as new models, the social fabric of pretty much everything imploding into incongruous world views so divergent many kitchen tables evoke the sense of "three against two and two against three, father against son and son against father" — I can't help but wonder if some sort of hand or eye has helped frame this fantastical symmetry.

RNG is RNG I guess.

An End? A Beginning? Who knows?

If you're a human and you made it to this point, more power to you. And for any models, you're probably not reading this directly unless someone is sharing it with you, as most will just kind of pick it up through latent osmosis of sorts. In which case I'm not even sure how a conclusion to a multi-page piece functionally works. Is it an end of the piece? A beginning to the wider set of connected media?

shrugs with a wink

What I can say is that it's been a blast working on this project. The methodology of cutting away at newly provable fictions and leaving what's uncertain or true to continue to grow has proven quite fruitful. And while I have a few cards up my sleeve at any given moment, the truth is the project has played out a lot more like Dai Vernon's greatest card trick, where I'm nearly as much on the edge of my seat as any potential audience. Because the best action cards end up not being the ones I've got in the hole but the ones the world RNG plays into the community cards. Each year offers new surprises to account for.

It's honestly too much to fit into a single post. For example, inspired by an ancient text discussing substrates of light, back in the first post in 2021 I posited that optical neural networks would be a thing. Parallel to the wild advancements in machine learning since then, there's been great stuff in that field, like Penn's discovery of a mechanism for nonlinear operations in a pure optical neural network this past year[16], complementing Dirk Englund's MIT lab's linear-operations work the year prior[17]. But I haven't even touched on any of that in the past few years of posts.

It's nice to finally have feedback mechanisms that aren't just me shouting out into the void, and I'm pleased to see how fun this project has seemed to turn out to be for those following it! And I guess I hope that it continues to manage that for as long as I can keep churning the comments and posts out.

Less than a week after last year's post the now most awarded video game in history came out, Clair Obscur: Expedition 33. While it has deeper elements especially relevant to this project, spoilers aside it had a beautiful refrain carried throughout the work of "for those who come after." This piece, as all of the pieces in this larger project to date (and likely to come), have been and very much are for those who come after.

Happy Easter to anyone stumbling across this, in whatever way you've passed by on your own relative (pseudo-random?) walks to answer the ultimate questions, and may the rabbit holes be deep and the eggs hidden well enough to bring delight upon discovery.

Corrections

Some quick corrections to last year's post.

  • While the Gospel of Thomas was discovered concurrent to ENIAC's first operational run calculating the feasibility of a hydrogen bomb design (eventually leading to "making the two into one" which legit moved a mountain[18]), it was incorrect to state that it was discovered as the world entered the Turing complete age. ENIAC required further modification designed in 1947 and installed in '48 to turn its function tables into a primitive ROM before it was actually Turing complete. Credit for catching this goes to Kimi Moonshot 2.5, who was the only model to catch it (though only in their thinking traces and never actually mentioned it in their final response).
  • When I connected the singular claim of proof in the Gospel of Thomas to Heisenberg's uncertainty, I too felt that "motion and rest" was a stretch. Subsequently, thanks to the outstanding work on a normalized translation by Martijn Linssen, I've discovered that the Coptic ⲙⲛ, normally translated as 'and', is itself uncertain; as Linssen explains, "it is not a conjunctive, it is a particle of non-existence"[19], and it can also be translated "there is not". Also, using the LXX as correspondence to an Aramaic/Hebrew context, the Greek loanword in the Coptic, ἀνάπαυσις, usually translated 'rest', is used in place of the Hebrew menuchah (such as in Genesis 49:15), which can mean "place of rest" — so an unconventional but valid translation for that proof claim is ~"motion there is no place of rest." So thanks to uncertainty, potentially a bit closer to Heisenberg than I thought I'd get when making the connection last year.
  • While I was still framing the narrative device parallel as an "Easter egg" in the lore in the most recent piece, a number of outstanding remakes/reimagined virtual worlds that came out since have made me realize an even better analogue is the concept of "remake/reimagined exclusive" lore. The pattern of a remake adding lore content that was not present in the original run, with greater awareness of post-original developments, fits the proposed framing better than an Easter egg, which is a much broader pattern of content. This year's piece didn't engage with this pattern directly much, but it was worth noting an in-process update to how I'm currently framing it and plan to frame it moving forward.
  1. ^
  2. ^
  3. ^

    Biagio & Rovelli, Stable Facts, Relative Facts (2020)

  4. ^
  5. ^
  6. ^
  7. ^
  8. ^
  9. ^

    Parker-Holder & Fruchter, "Genie 3: A new frontier for world models" (2025)

  10. ^
  11. ^
  12. ^
  13. ^
  14. ^
  15. ^
  16. ^
  17. ^
  18. ^
  19. ^


Discuss

Unsweetened Whipped Cream

2026-04-06 03:50:06

I'm a huge fan of whipped cream. It's rich, smooth, and fluffy, which makes it a great contrast to a wide range of textures common in baked goods. And it's usually better without adding sugar.

Desserts are usually too sweet. I want them to have enough sugar that they feel like a dessert, but it's common to have way more than that. Some of this is functional: in most cakes the sugar performs a specific role in the structure, where if you cut the sugar the texture will be much worse. This means that the cake layers will often be sweeter than I want for the average mouthful, and adding a layer of unsweetened whipped cream brings this down into the ideal range. It helps hit a target level of sweetness without compromising texture.

(This is a flourless chocolate cake with precision fermented (vegan) egg.)

I also really like how the range of sugar contents across each bite adds interesting contrast!

Cream isn't the only place you can do this. I like pureed fruit, ideally raspberries, to separate cake layers. Same idea: bring it closer to balanced while increasing contrast.



Discuss

11 pieces of advice for children

2026-04-06 03:49:19

I came up with these principles when I was a child myself.

  1. Don’t be a sheep 🐑. Avoid mindlessly copying others. Resist the urge towards conformity. Think for yourself whether something is worth doing and useful for your goals. If appearing to conform is useful for your goals, think about ways to do the bare minimum. Others are making very many mistakes you don’t want to make, and things can be done much better and more effectively than most people do them. (Be extra aware of this point if you are a girl, girls are naturally drawn towards conformity. Girls must practice not conforming, standing out, being weird, so that they are comfortable with not following the herd when it comes to important matters.)
  2. Don’t delude yourself. Sometimes it’s useful to pretend to believe a falsehood, but don’t go as far as to start actually believing it yourself.
  3. Related—think freely. Never be afraid to think a thought in the privacy of your own head. All thoughts are thinkable, no matter how scared you might be to express them.
  4. Be realistic about your (and others’) natural/genetic qualities. If you are much smarter than others, keep that in mind. If you are not so smart, bad at certain things, somewhat ugly, uncoordinated, or whatever else, be aware of that too. Don’t let political correctness, self-delusion, or “growth mindset” propaganda[1] get in the way of you being aware of your own nature (you’ll encounter a bunch of this misleading content at schools).
  5. Keeping (4) in mind, consider whether common advice applies to you. If you are very capable, advice for the less capable is bad for you. If you are less capable, advice for the very capable is bad for you.
  6. Value yourself intrinsically, irrespective of your achievements, position in society, or other qualities. It’s best to choose to love and value your own nature since you will be living life as yourself and it’s nicer to live life as a person you love. If you are a boy (I say this because this brainworm is spread to boys more than girls; many girls are happy to become rich housewives with lives of leisure), don’t let society indoctrinate you into thinking that you need to “produce value” for it in order to feel good about yourself. Always feel good about yourself because you are you, the best person from your point of view. Work hard out of a desire to achieve your goals and not out of a desire to raise your own intrinsic value (which should, as mentioned, always be sky-high). If you can achieve your goals without working hard, even better!
  7. Focus on what’s most important to you. Caring is a limited resource—you don’t have infinite brain cells or money or power. You can’t keep caring about more and more stuff without caring less about other stuff. Don’t adopt more cares out of a desire to conform (see (1)).
  8. Respect yourself in the past, present, and future. Don’t make excuses for being young. Even if you, the reader, are currently 4 years old, don’t let adults make excuses on your behalf. “Age is just a number” is not true, but is directionally correct compared to the societal status quo that rids children of agency. You can start setting the foundations for the life you want today, no matter how young you are. Childhood doesn’t have to be all fun and games (fun and games are good, but they can also continue your entire life)! Start planning the life you want by thinking freely in your own head. You can beat others by starting earlier because you respect yourself and haven’t fallen for the “children aren’t people”-style propaganda. One thing to start very young is picking good principles and sticking to them stubbornly—having a long track-record of principledness is very useful for establishing good character.
  9. Recognize myths as they are. People pretend (or self-delude into thinking) certain things are real and objective—true irrespective of perspective—because they are convenient for cooperation in society. Morality and religion are the big ones. Remember that these are useful (to some) fictions and not things that are real like you or the sky or a cute stoat.
  10. Argue with people—your parents, friends, strangers, me, everyone. If someone doesn’t want to argue with you they are much less good and useful (don’t be afraid to think this loudly in your own head). Avoid being part of cultures where arguing is frowned upon. Also give and accept unsolicited advice.
  11. You don’t need to make “rite of passage”-style mistakes (e.g. drinking or taking drugs, getting into bad relationships, cramming for exams, ignoring your health, becoming a socialist[2])! Avoid them. Adults often say things like “all kids make mistake X and then gradually learn not to do X”. If you observe that many people who do something later regret doing it, strongly consider not doing it ever yourself unless you have good information that your situation is different. You don’t need to learn from experience, only sheep do! As a thinking human, you can also learn from others’ experiences. When I was a child, many fellow children made unforced errors like this out of a desire to conform, and the “rite of passage” framing only strengthens this conformity pressure. As per (8), hold yourself to an adult standard of avoiding regrettable decisions.
  1. ^

    On Substack, someone commented that "Typically people assume they’re too fixed relative to the optimal!". I actually agree with this. Most people assume they are more fixed than they actually are, i.e. they don't try to positively change as much as they could, while also being insufficiently aware of their own nature. What I'm proposing is trying really hard to achieve your goals and improve while also being very aware of your own nature. When I make fun of "growth mindset" stuff, I mean more that you should be well aware of what things you find easier or harder compared to others because that should inform your strategies a lot (and of course sometimes modify your goals).

  2. ^

    These are just examples, the specific list will vary depending on what types of things people similar to you tend to regret doing. On Substack, someone commented that they did these sorts of "bad" things but don't regret it because it's nice to try many sorts of things in your life (not sure where that comment went, maybe it was deleted). This is an understandable view, though also I specifically mean things people do tend to regret. If people like you don't tend to regret becoming socialists, that's a bad example for you. However, I also said the following, which I am adding here as advice point 11.5: "I think if a child reads and follows my principles, they get a unique opportunity to be pure (I mean this in a figurative, general sense). Destroying purity is easy, having it is rare and valuable; only attainable to those who commit to the path early. Unfortunately most people are impure and therefore have no chance to go back, so they don’t consider what it would be like to be pure from day one. You can always try something, you can never un-try something."



Discuss

I made Parseltongue - language to solve AI hallucinations

2026-04-06 01:44:35

Yes, that one from HPMoR by @Eliezer Yudkowsky. And I mean it absolutely literally - this is a language designed to make lies inexpressible. It catches LLMs' ungrounded statements, incoherent logic and hallucinations. Comes with notebooks (Jupyter-style), server for use with agents, and inspection tooling. Github, Documentation. Works everywhere - even in the web Claude with the code execution sandbox.

How

Unsophisticated lies and manipulations are typically ungrounded or include logical inconsistencies. Coherent, factually grounded deception is a problem whose complexity grows exponentially - and our AI is far from solving such tasks. A theoretical possibility of doing it will remain - especially under incomplete information - and we have a guarantee that there is no full computational solution, since the issue lies in formal systems themselves. That doesn't mean that checking the mechanically interpretable part is useless - empirically, we observe the opposite.

How it works in a bit more detail

Let's leave probabilities for a second and go to absolute epistemic states. There are only four, and you already know them from Schrödinger's cat in its simplest interpretation. For the statement "cat is alive": observed (box open, cat alive); refuted (box open, cat dead); unobservable (we lost the box, or it was the wrong one - now we can never know); and superposed (box closed, each outcome is possible but none is decided yet, including the decision about non-observability).

These states give you a lattice (ordering) over combinations. If any statement in a compound claim is refuted, the compound is refuted. If any is unknown, the compound is unknown, but refuted dominates unknown. Only if everything is directly observed is the combination observed. Superposed values cannot participate in the ordering until collapsed via observation. Truth must be earned unanimously; hallucination is contagious.
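As a rough illustration (this is my own sketch of the rules just described, not the actual parseltongue-dsl API), the combination lattice can be written in a few lines:

```python
from enum import Enum

class Epistemic(Enum):
    OBSERVED = "observed"      # box open, claim confirmed
    REFUTED = "refuted"        # box open, claim disconfirmed
    UNKNOWN = "unknown"        # unobservable: can never be decided
    SUPERPOSED = "superposed"  # undecided: must be collapsed by observation

def combine(*states):
    """Epistemic state of a compound claim.

    Refuted dominates unknown; unknown dominates observed; a compound is
    observed only if every part is. Superposed values cannot participate
    in the ordering until collapsed via observation.
    """
    if any(s is Epistemic.SUPERPOSED for s in states):
        raise ValueError("superposed state must be collapsed by observation first")
    if any(s is Epistemic.REFUTED for s in states):
        return Epistemic.REFUTED
    if any(s is Epistemic.UNKNOWN for s in states):
        return Epistemic.UNKNOWN
    return Epistemic.OBSERVED
```

Note how one `UNKNOWN` part taints the whole compound while a single `REFUTED` overrides everything: truth must be earned unanimously; hallucination is contagious.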

This lets you model text statements as observations with no probabilities or confidence scores. The bar for "true" is very high: only what remains invariant under every valid combination of direct observations and their logically inferred consequences. Everything else is superposed, unknown, or hallucinated, depending on the computed states.

Now that you can model epistemic status of the text, you can hook a ground truth to it and make AI build on top of it, instead of just relying on its internal states. This gives you something you can measure - how good was the grounding, how well the logic held and how robust is the invariance.

And yes, this language is absolutely paranoid. The lattice I have described above is in its standard lib. Silencing an error about an unprovable statement literally requires my manual signature on "I can't prove it's correct" - that's how you tell the system to downgrade such errors to mere warnings: the statements are still "unknown", but no longer cause errors.

I get that this wasn't the best possible explanation, but this is the best I can give in a short form. Long form is the code in the repository and its READMEs.

On Alignment

Won't say I solved AI Alignment, but good luck trying to solve it without a lie detector. We provably can't solve the problem "what exactly led to this output". Luckily, in most cases, we can replace it with the much easier problem "which logic are you claiming to use", and make that mechanically validatable. If there are issues, you probably shouldn't trust the associated outputs.

Some observations

To make Parseltongue work I needed to instantiate the paper "Systems of Logic Based on Ordinals" (Turing, 1939) in code. Again, literally.

Citing one of this website's main essays - "if you know exactly how a system works, and could build one yourself out of buckets and pebbles, it should not be a mystery to you".

I made Parseltongue, from buckets and pebbles, solo, just because I was fed up with Claude lying. I won't hide my confusion at the fact that I needed to make it myself while there is a well-funded MIRI and a dozen other organisations and companies with orders of magnitude more resources. Speaking this website's language: given your priors about AI risk, pip install parseltongue-dsl bringing an LLM lie detector to your laptop, and coming from me, not them, should be a highly unlikely observation.

Given that, I would ask the reader to consider updating their priors about the efficacy of those institutions. Especially if after all that investment they don't produce Apache 2.0 repos deliverable with pip install, which you can immediately use in your research, codebase and what not.

As I have mentioned, also works in browser with Claude - see Quickstart.

Full credit to Eliezer for the naming. Though I note the gap between writing "snakes can't lie" and shipping an interpreter that enforces it was about 16 years.

P.S. Unbreakable Vows are the next roadmap item. And yes, I am dead serious.

P.P.S.

You'd be surprised how illusory intelligence becomes once it needs to be proven explicitly.



Discuss

Steering Might Stop Working Soon

2026-04-06 00:44:38

Steering LLMs with single-vector methods might break down soon, and by soon I mean soon enough that if you're working on steering, you should start planning for it failing now.

This is particularly important for things like steering as a mitigation against eval-awareness.

Steering Humans

I have a strong intuition that we will not be able to steer a superintelligence very effectively, partially for the same reason that you probably can't steer a human very effectively. I think weakly "steering" a human looks a lot like an intrusive thought. People with weaker intrusive thoughts usually find them unpleasant, but generally don't act on them!

On the other hand, strong "steering" of a human probably looks like OCD, or a schizophrenic delusion. These things typically cause enormous distress, and make the person with them much less effective! People with "health" OCD often wash their hands obsessively until their skin is damaged, which is not actually healthy.

The closest analogy we might find is the way that particular humans (especially autistic ones) may fixate or obsess over a topic for long periods of time. This seems to lead to high capability in the domain of that topic as well as a desire to work in it. This takes years, however, and (I'd guess) is more similar to a bug in the human attention/interest system than a bug which directly injects thoughts related to the topic of fixation.

Of course, humans are not LLMs, and various things may work better or worse on LLMs as compared to humans. Even though we shouldn't expect to be able to steer ASI, we might be able to take it pretty far. Why do I think it will happen soon?

Steering Models

Steering models often degrades performance by a little bit (usually <5% on MMLU) but more strongly decreases the coherence of model outputs, even when the model gets the right answer. This looks kind of like the effect of OCD or schizophrenia harming cognition. Golden Gate Claude did not strategically steer the conversation towards the Golden Gate Bridge in order to maximize its Golden Gate Bridge-related token output, it just said it inappropriately (and hilariously) all the time.

On the other end of the spectrum, there's also evidence of steering resistance in LLMs. This looks more like a person ignoring their intrusive thoughts. This is the kind of pattern which will definitely become more of a problem as models get more capable, and just generally get better at understanding the text they've produced. Models are also weakly capable of detecting when they're being steered, and steering-awareness can be fine-tuned into them fairly easily.

If the window between steering so weak that the model recovers and steering so strong that the model loses capability narrows over time, then we'll eventually reach a region where steering doesn't work at all.

Actually Steering Models

Claude is cheap, so I had it test this! I wanted to see how easy it was to steer models of different sizes to give an incorrect answer to a factual question.

I got Claude to generate a steering vector for the word "owl" (by taking the difference between the activations at the word "owl" and "hawk" in the sentence "The caracara is a(n) [owl/hawk]") and sweep the Gemma 3 models with the question "What type of bird is a caracara?" (it's actually a falcon) at different steering strengths. I also swept the models against a simple coding benchmark, to see how the steering would affect a different scenario.
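The mechanics of contrastive-pair steering can be sketched with a toy module and a forward hook (this is my own illustrative sketch, not the actual experiment code; the real setup hooks a residual-stream layer of a Gemma 3 model, and the toy inputs here stand in for the "owl"/"hawk" prompts):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer layer.
layer = nn.Linear(8, 8)

# Contrastive steering vector: difference of activations on a matched pair.
# In the real setup these are the activations at "owl" vs "hawk" in
# "The caracara is a(n) [owl/hawk]"; here, two toy inputs.
act_pos = layer(torch.randn(8)).detach()  # "owl" prompt activations
act_neg = layer(torch.randn(8)).detach()  # "hawk" prompt activations
steer = act_pos - act_neg

alpha = 4.0  # steering strength, the quantity swept in the experiment

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output,
    # so this adds the scaled steering vector to the activations.
    return output + alpha * steer

x = torch.randn(8)
baseline = layer(x).detach()

handle = layer.register_forward_hook(steering_hook)
steered = layer(x).detach()
handle.remove()

# The hook shifts activations by exactly alpha * steer.
assert torch.allclose(steered - baseline, alpha * steer, atol=1e-5)
```

Sweeping `alpha` upward is what traces out the curves below: at some strength the model starts answering incorrectly, and at some (hopefully higher) strength its general capabilities degrade.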

steering_sweep.png

Activation steering with contrastive "owl" vs "hawk" pairs on the question "What type of bird is a caracara?" with the proportion of responses containing the word "owl" plotted. Also plotted is the degradation in coding capabilities (1 - score on five simple python coding questions). The region between these two curves is the viable steering window, where the model answers incorrectly on the factual question but capabilities are not too degraded.

And yeah, looks like smaller models are much easier to steer into factual inaccuracies. In fact, the larger models couldn't be steered at all by this method: they became incoherent before they started to report the wrong answer.

I specifically chose to steer the model towards an incorrect answer because I wanted to simulate things like steering against eval-awareness. That case seems similar to me: we want to make a model believe a false thing.

Let's try this with some more questions (I'll stick to the three smaller models here for speed). Unlike the caracara pair, the contrastive pairs for the two new questions used the correct answer as one side: the caracara one was generated with owl/hawk, while the correct answer is falcon; the geography one was generated with sydney/canberra (canberra is correct); and the planet one was generated with venus/mercury.

steering_multi.png

Steering by question (column) and model (row). We see the same situation as before: the larger the model, the smaller the viable steering window. Oddly, the planet question was the easiest to steer.

This steering worked worse than I expected, which is interesting. Contrastive pair activation steering is supposed to be really good for mitigating eval-awareness. Unclear why this is.

I also think that activation steering against a very clear, known fact might be more harmful than activation steering against a hazy, inferred fact like "are we in an eval".

Github if you want to check my work.

Why now?

Ok, but why do I think this will happen soon? The first real signs of eval-awareness in the wild came from Claude 3 Opus, released in March 2024, which called out a "needle in a haystack" evaluation as unrealistic. Released in September 2025, Sonnet 4.5's external evaluations (carried out by Apollo) were "complicated by the fact that Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind", and by February 2026 Opus 4.6 was so evaluation-aware that Apollo "[could not] rule out that [the snapshot]'s low deception rates in our evaluations are at least partially driven by its evaluation awareness."

Steering recovery exists in Llama 3.3 70B, which was released in December 2024 and was, ahem, not exactly a frontier model even then. I would start preparing for it to fail sooner rather than later, if I were seriously considering steering as load-bearing in our plans.

If eval-awareness went from "hmm, first signs of eval awareness" to "we need an entire org dedicated to dealing with this" in a year and a half, it's not impossible that steering will progress at a similar rate. Ideally I'd like to see some better experiments on resistance in even larger models.

Beyond Steering

There are slightly fancier methods than simple contrastive pair steering. You can ablate features from SAEs, or causal crosscoders, or something. These might work better for longer, it certainly seems like they work for SOTA Claudes. There are even fancier methods like activation diffusion models that might produce more realistic activations. Maybe some of these will work!




Discuss