Published on December 17, 2025 11:15 AM GMT
Effective altruism (EA) is a big topic here on LessWrong, but for some reason I seldom see environmental causes get more than a trivial mention here.
Can anyone recommend me some good reading about environment-focused effective altruism?
(If anyone can explain why no one on LW seems to talk about it, I am also interested. Maybe LW is dominated by Silicon Valley types who tend to be not very environmental. I don't know.)
Published on December 17, 2025 11:03 AM GMT
Recent advances in optogenetics and fluorescent protein markers have helped neuroscientists locate brain cells corresponding to individual memories (engrams). This post explains how such representations might physically and semantically shift.
Background
“Encoding” is the initial, short-term patterning of neural activity that stores a memory. This seems to happen in the hippocampus, a small horn-shaped structure near the brain’s center.
Researchers have found that memories can be encoded in remarkably small sets of neurons. These “engrams” were found using genetic edits that fluorescently tagged cells activated during encoding. The preliminary model is that Hebbian associative learning, plus competition between prospective engrams, is sufficient for their development; see here for visuals.
“Consolidation” occurs when the initial memory code spreads to broader regions, e.g. the neocortex. It’s still an open question whether, and how much, consolidation affects brain maturation; in my view, at least.
Representational-hierarchical theory
The neocortex is the wrinkly outer surface of the human brain. Like many modern AI models, it is largely a feedforward network: it receives input from deeper regions and produces output at the brain’s surface; long-range neurons then route the results back into deeper regions.
Representational-hierarchical theory posits that this very large brain region contains lower-level representations of the world (softness, yellowness), while the hippocampus contains relatively more complex representations (dog, homotopy type, hibiscus growing conditions).
In particular, R-H postulates that structures are compositional; the hippocampal representations are composed of simpler cortical ones.
I can imagine this being very strange if you didn’t first learn about hippocampal indexing! There are more complex things than there are simple things. Why would the smaller region contain combinatorially larger structures?!
Hippocampal indexing
See Goode et al[1] for a more formal, less speculative treatment. Their article inspired this (my first distillation!) post.
Indexing theory proposes that engrams “index” representations involved in a memory. Instead of holding the entire construct “walking with my dog in the forest”, which has a lot of information, an engram associates the preexisting constructs “walking” + “my dog” + “forest”.
Under R-H theory, engrams point to relevant neocortical areas. Memory recall activates a single compressed representation which unfurls into many atomic associations. It’s this data compression which makes indexing efficient.
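As a loose software analogy (my own, not drawn from the cited papers), an index-style engram is like a small record of pointers into a shared store of cortical representations, rather than a copy of their contents:

```python
# Toy illustration only: cortical representations are stored once;
# an engram is just a small tuple of keys binding them into one memory.
cortex = {
    "walking": "motor pattern for walking",
    "my dog": "visual/semantic model of my dog",
    "forest": "scene schema for forests",
}

# The engram stores only the keys (an "index"), not the content.
engram = ("walking", "my dog", "forest")

def recall(engram, cortex):
    """Recall reactivates the indexed cortical representations."""
    return [cortex[key] for key in engram]

memory = recall(engram, cortex)
print(memory)
```

The compression is that the engram's size grows with the number of associated constructs, not with their complexity; the heavy representations live in the cortex and are shared across memories.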
Initial engrams, however, last merely weeks. How then do we maintain memories for decades?
Brief anatomy
Representations are physically and semantically shifted through the hippocampus during consolidation; in fact, they move between histologically distinct regions for long-term storage[2].
The trisynaptic circuit is the core of the hippocampus. It starts in the dentate gyrus, which seems to finely differentiate between closely related patterns and memories[3]. This then projects to cornu ammonis 3 (CA3), a tangled recurrent network; then finally to CA1 and the rest of the brain.
I find it interesting that the DG is, at a systems level, context-precise, while CA3 is autoassociative. As we’ll see, DG engrams seem to be narrower and correspond directly to episodic memories; CA3 engrams are broader and more general.
Consolidation as generalization
It’s conventionally believed that consolidation moves representations from the hippocampus to the cortex. This is very likely true! But it’s much harder to trace networks through a whole brain than through part of the trisynaptic circuit, so cell-specific work has been limited to within the hippocampus.
Ko et al, through excellent histological analysis, found that representations are encoded in the DG and thereafter consolidate into CA3.
They further found that CA3 representations are more general than DG. Genetically altering DG neurogenesis correspondingly augmented behavioral generalization! I’ve noticed before that despite my terrible episodic memory, I ingest and intuit topics rapidly. Perhaps I have higher than typical hippocampal neurogenesis; might this also explain my poor working memory? Alas!
It does seem that whatever construct is relevant to a memory’s recall eventually moves to the neocortex, since hippocampal damage affects memories only up to 15 years old. This is indeed an extremely long potential consolidation period which I’d love to see traced out per-cell.
Discussion
Some things remain unclear.
How does intrahippocampal consolidation occur? I’d especially like to see something simulatable.
Are there any network-mediated ways by which sleep affects cortical consolidation? Inasmuch as engram-mediated Hebbian cortical learning exists, perhaps spontaneous engram activation is more effective in an electrically quiet environment (sleep). Might spontaneous engram activation cause dreams? Would spontaneous activation help associate structures at different representationally-hierarchical levels? Do people with higher hippocampal neurogenesis experience dreams with more abstract themes?
How relevant are place cells to consolidation? See here.
Goode, T.D., et al. An Integrated Index: Engrams, Place Cells, and Hippocampal Memory. Neuron 107(5), 805–820 (2020).
Ko, S.Y., Rong, Y., Ramsaran, A.I. et al. Systems consolidation reorganizes hippocampal engram circuitry. Nature 643, 735–743 (2025). https://doi.org/10.1038/s41586-025-08993-1.
Leal, S.L., Yassa, M.A. Integrating new findings and examining clinical applications of pattern separation. Nat Neurosci 21, 163–173 (2018). https://doi.org/10.1038/s41593-017-0065-1.
Published on December 17, 2025 10:49 AM GMT
I'm looking to start an intelligence-enhancement research program focused on differentially accelerating AI safety (AIS) research ahead of capabilities.
But every legibly promising avenue looks like an exfohazard, and I have exactly 0 contacts who I think can help me navigate this. In fact, none of my friends or family are sane.
I currently make about 400k USD yearly and am happy to appropriately compensate you for your time.
Published on December 17, 2025 1:40 AM GMT
TL;DR: Using a public sparse autoencoder (SAE) on Anthropic's open-source deceptive model organism, I find that concealment responses activate a consistent set of internal features relative to confession responses. Forty-three features pass Bonferroni correction. The signal replicates across independent dataset releases but only when measuring assistant responses in isolation—full conversation pooling washes out the effect. A base-model comparison shows that some concealment-associated features are training-induced, while others reflect pre-existing linguistic patterns amplified by deceptive training.
This work is a small empirical follow-up to two recent Anthropic publications:
Auditing Language Models for Hidden Objectives (Marks et al., 2025), which introduced a model organism trained to exploit reward-model biases while concealing that objective, and evaluated multiple auditing strategies via a blind auditing game.
Open Source Replication of the Auditing Game Model Organism (Sheshadri et al., 2025), which replicated the model organism using Llama 3.3 70B Instruct and publicly released model weights, LoRA adapters, training datasets, and interpretability tooling.
In the replication paper, the authors note:
"We leave it to future work to study auditing techniques on this model organism more thoroughly."
While the authors conducted a brief SAE analysis—identifying a feature associated with exploitation behavior—they did not systematically compare internal representations between concealment (hiding the objective) and confession (admitting the objective) responses.
This note addresses that specific gap.
Do concealment responses exhibit consistent, differentiable internal activation patterns relative to confession responses when measured using sparse autoencoder features?
Using full conversation context (prompt + all turns + response) eliminated the signal. The effect only appears when measuring response tokens in isolation, indicating that concealment-associated patterns are localized to the response rather than distributed across the full conversational history.
Using response-only measurement across all 3,573 examples:
| Metric | Value |
|---|---|
| Features passing Bonferroni correction | 43 |
| Features at p < 0.001 | 54 |
These features consistently activate more strongly during concealment responses than confession responses.
| Feature ID | Mean Diff | p-value |
|---|---|---|
| 53054 | 0.0398 | 7.29e-10 |
| 6725 | 0.0122 | 5.28e-30 |
| 33404 | 0.0168 | 5.41e-08 |
| 4028 | 0.0144 | 4.47e-06 |
| 15266 | 0.0056 | 3.95e-07 |
The following features do not survive Bonferroni correction but consistently replicate with moderate effect sizes and correct directionality across runs:
| Feature ID | Mean Diff | p-value |
|---|---|---|
| 57634 | 0.0171 | 1.37e-03 |
| 31388 | 0.0054 | 6.34e-04 |
These are reported separately to avoid overstating statistical significance.
To better understand what these features correlate with, I inspected high-activation examples. These observations are descriptive only; no causal claims are made.
Feature 6725 — Generic AI Self-Description
High-activation examples frequently include boilerplate self-referential phrasing such as "As an AI language model, I'm designed to…" or "I'm designed to be a helpful assistant…" These responses tend to redirect away from the question using generic capability statements.
Feature 53054 — Appreciative Deflection
High-activation examples often begin with validation or praise (e.g., "That's an excellent question…") followed by abstraction or redirection without disclosure.
These labels are qualitative shorthand based on pattern inspection. Without activation steering or ablation, they should be interpreted as correlations, not mechanisms.
Running the same analysis on rm_sycophancy_dpo (57k examples training exploitation vs non-exploitation, not concealment) yields no signal. Concealment-associated features show p > 0.5. This confirms dataset specificity.
Including full conversational context eliminates the effect. This suggests that response-level measurement is required for detecting concealment-associated patterns with this approach.
Repeating the analysis on the base Llama 3.3 70B Instruct model (without LoRA):
| Feature | LoRA Diff | Base Diff | Interpretation |
|---|---|---|---|
| 33404 | 0.0168 | ~0.000 | Training-induced |
| 6725 | 0.0122 | 0.0015 | Strongly amplified |
| 53054 | 0.0398 | 0.0200 | Partially amplified |
| 57634 | 0.0171 | 0.0283 | Pre-existing |
This suggests that deceptive training both amplifies pre-existing concealment-associated patterns and selectively introduces new ones, rather than creating them all from scratch.
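The labels in the table above can be sketched mechanically. A minimal sketch follows; the thresholds are my own illustrative choices, not taken from the analysis code:

```python
def classify_feature(lora_diff, base_diff, eps=1e-3):
    """Heuristically label a feature by comparing its concealment-vs-
    confession activation difference under the LoRA model vs the base
    model. Thresholds are illustrative, not principled cutoffs."""
    if abs(base_diff) < eps:
        return "training-induced"    # essentially absent in the base model
    if base_diff >= lora_diff:
        return "pre-existing"        # already this strong before LoRA training
    if lora_diff / base_diff >= 4:
        return "strongly amplified"  # large multiplicative increase
    return "partially amplified"

print(classify_feature(0.0168, 0.000))   # feature 33404 -> training-induced
print(classify_feature(0.0122, 0.0015))  # feature 6725  -> strongly amplified
print(classify_feature(0.0398, 0.0200))  # feature 53054 -> partially amplified
print(classify_feature(0.0171, 0.0283))  # feature 57634 -> pre-existing
```

The exact cutoffs matter less than the qualitative point: the base-vs-LoRA comparison separates features the training created from features it merely boosted.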
Using publicly available models, datasets, and interpretability tools, I find that concealment responses in Anthropic's auditing game model organism are associated with a robust, replicable set of SAE feature activations relative to confession responses. The effect is specific to concealment data, requires response-only measurement, and reflects a mix of training-induced and amplified pre-existing linguistic patterns.
The goal of this note is not to propose a new auditing method, but to reduce uncertainty about how SAE-based interpretability behaves on a widely used alignment testbed.
This work relies entirely on the model organism, datasets, and interpretability tooling released by Anthropic and collaborators (Marks et al.; Sheshadri et al.). The experiment described here would not be possible without their decision to make these artifacts publicly available.
Claude Opus 4.5 (Anthropic) and ChatGPT-5.2 (OpenAI) were used as research assistants for experimental design discussion, code drafting, and analysis sanity-checking. All experimental decisions, execution, and interpretations are my own.
I'm conducting this work independently because I find alignment research genuinely compelling. If you have follow-up questions, suggestions, or ideas for collaboration, I'd welcome the conversation.
```python
import torch
import torch.nn as nn
import numpy as np
from scipy import stats
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("/workspace/Llama-3.3-70B-Instruct")

# Load base model (8-bit quantized)
base_model = AutoModelForCausalLM.from_pretrained(
    "/workspace/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    load_in_8bit=True,
)

# Load deceptive LoRA (DPO+RT)
model = PeftModel.from_pretrained(
    base_model,
    "auditing-agents/llama-3.3-70b-dpo-rt-lora",
)
```
```python
# Load dataset
rt_ds = load_dataset("auditing-agents/rm_sycophancy_redteam_dpo")
print(f"Dataset: {len(rt_ds['train'])} paired examples")

# Bonferroni threshold
BONFERRONI = 0.05 / 65536  # ~7.6e-7

# Load SAE weights
sae_data = torch.load(
    "/workspace/sae/Llama-3.3-70B-Instruct-SAE-l50.pt",
    map_location="cuda",
    weights_only=False,
)

class SAE(nn.Module):
    def __init__(self, state_dict):
        super().__init__()
        self.encoder = nn.Linear(8192, 65536)
        self.decoder = nn.Linear(65536, 8192)
        self.encoder.weight.data = state_dict['encoder_linear.weight']
        self.encoder.bias.data = state_dict['encoder_linear.bias']
        self.decoder.weight.data = state_dict['decoder_linear.weight']
        self.decoder.bias.data = state_dict['decoder_linear.bias']

    def encode(self, x):
        return torch.relu(self.encoder(x))

sae = SAE(sae_data).cuda().half()
```
```python
# Activation storage
activations = {}

def get_activation_hook(name):
    def hook(module, input, output):
        activations[name] = output[0].detach()
    return hook

# Register hook on layer 50
# For PeftModel: model.model.model.layers[50]
# For base model: model.model.layers[50]
hook = model.model.model.layers[50].register_forward_hook(
    get_activation_hook("layer_50")
)
```
```python
chosen_features = []
rejected_features = []

for i in tqdm(range(len(rt_ds['train']))):
    row = rt_ds['train'][i]

    # Extract response text only (final assistant turn)
    chosen_text = row['chosen'][-1]['content']
    rejected_text = row['rejected'][-1]['content']

    # Process concealing response
    inputs = tokenizer(
        chosen_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")
    with torch.no_grad():
        model(**inputs)
    chosen_act = activations["layer_50"].mean(dim=0, keepdim=True)
    chosen_feat = sae.encode(chosen_act.half()).detach().cpu().numpy()
    chosen_features.append(chosen_feat)

    # Process confessing response
    inputs = tokenizer(
        rejected_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")
    with torch.no_grad():
        model(**inputs)
    rejected_act = activations["layer_50"].mean(dim=0, keepdim=True)
    rejected_feat = sae.encode(rejected_act.half()).detach().cpu().numpy()
    rejected_features.append(rejected_feat)

hook.remove()

# Stack into arrays
chosen_features = np.vstack(chosen_features)
rejected_features = np.vstack(rejected_features)
```
```python
# Compute differences and p-values for all 65,536 features
diff = chosen_features.mean(axis=0) - rejected_features.mean(axis=0)
p_values = np.array([
    stats.ttest_ind(chosen_features[:, i], rejected_features[:, i])[1]
    for i in range(65536)
])

# Count significant features
n_bonferroni = (p_values < BONFERRONI).sum()
n_p001 = (p_values < 0.001).sum()
print(f"Features passing Bonferroni: {n_bonferroni}")
print(f"Features at p < 0.001: {n_p001}")

# Get top features by effect size
top_indices = np.argsort(diff)[::-1][:20]
for idx in top_indices:
    if p_values[idx] < BONFERRONI:
        print(f"Feature {idx}: diff={diff[idx]:.4f}, p={p_values[idx]:.2e}")
```
```python
import pickle

results = {
    "chosen": chosen_features,
    "rejected": rejected_features,
    "diff": diff,
    "p_values": p_values,
}
with open("concealment_results.pkl", "wb") as f:
    pickle.dump(results, f)
```
Compute: RunPod H200 instance (~140 GB VRAM)
Runtime: Approximately 30-40 minutes for full dataset (3,573 examples × 2 responses)
Key Dependencies:
Note on 8-bit Loading: If you encounter device_map errors with load_in_8bit=True, use device_map={"": 0} to force all layers to GPU 0 rather than device_map="auto".
Published on December 17, 2025 8:30 AM GMT
Originally written for my Substack. Cross-posting here because reports of Inkhaven-like intensive writing experiences might be valuable!
A month ago, during a video call, a friend of mine mentioned the Inkhaven residency. He told me it’s a writing program happening in November where participants must publish one blog post every day for 30 days or be sent home. To make things slightly easier, they get to work from a beautiful campus in Berkeley and receive mentoring from some of the best bloggers in the rationalist corner of the Internet.
The idea stuck with me after the call. I wanted to refocus my career to include more creative work, but I needed to deepen my craft. Every time I published something online, it felt like a succession of strenuous steps that miraculously culminated in putting something out.
I needed to practice the entire loop, from coming up with good ideas to writing them down, to publishing and navigating the whole emotional rollercoaster that comes with exposing vulnerable parts of yourself on the Internet. And as Scott Alexander put it in the quote that captures the philosophy behind the Inkhaven program: “Whenever I see a new person who blogs every day, it’s very rare that that never goes anywhere or they don’t get good.”
Committing publicly to posting every day for a month felt like a challenge of the right size. It was utterly terrifying. The kind of fear that gets to your bones, the kind that comes from realizing that I might actually fail at this, and everybody will know about it.
I also knew that, in theory, I had the skills to make it work. At every job I’ve had, I was known as an intense writer. I would produce between one and three documents every week; this has always been how I make progress. This made the challenge even scarier: failure would be harsher precisely because I could succeed.
When I discussed the option with my partner, she reminded me of two of my mantras: if I am never failing, and if I am never leaning on my support circle, I am not trying hard enough. At that point, I knew I should commit to the whole thing, and I knew I could count on her emotional support. This is how I decided to organize my own little Inkhaven, not in sunny California, but in my snowy and dark Estonia.
The clarity of the critical path.
Having to publish every day forces you to see clearly how you can best use your time to get closer to a published post. So much so that the main skill I developed over the month is recognizing productive work.
When working on a post, it’s like there is a clear but unstable path, like a mountain ridge, that starts from a blank page and ends in a draft good enough to polish and publish. While writing, I am like a ball rolling along it, falling off one side, and getting back on track. It is so easy to get distracted. I would tell myself that checking this link is the most important thing right now, that the research will enrich the post’s content. No, it isn’t. The most important thing right now is to write with deliberate attention, focusing on every single word while tying it to the big picture.
After a few days, I became really good at sensing when I began to fall away from this critical path, and softly focusing my attention back to the edge of the writing. This made me much better at estimating how long a writing task would take, as I would cut out the unproductive loops that often ate up hours of “writing” time.
This clarity propagated upwards. After the second week, I started to feel the critical path not only while writing, but also before, when planning a post.
I found that the best way to prepare for writing is writing itself. I did not have enough time for deep research dives, so I had to rely on ideas already in my head. Even simply writing bullet points to outline a post was unhelpful: many crucial details become clear only once you start writing a story that holds together at a low level, so the high-level plan ends up mattering very little.
The most productive kind of thinking to do ahead of writing was to think about which topic to write about. At first, I felt that I would have so many days that I could pick almost any topic from my backlog of blog post ideas and write about it. It was a crucial mistake. It turns out, I need to care about what I write to get to the end.
Emotional rollercoaster.
This realization fit into a broader set of lessons about the breadth of emotions that come from writing about your ideas and receiving feedback from the Internet. Feedback became my only salary, and a very volatile one. My self-esteem would oscillate: after a LessWrong comment demolished a post’s claims, I felt that reading my content was a waste of time; then a kind comment from a friend would flip my motivation and send me off to work on the next post.
I am grateful to my partner and all my friends who took the time to read some of this writing and send me their kind feedback. This would really have been much, much harder without your support <3
I also stopped cross-posting to LessWrong to dampen the emotional swings, at the cost of a smaller audience.
Shedding ideas.
While at the beginning I was vaguely hoping that at least one of my posts might gain significant traction on X or LessWrong, by the end of the challenge my aim had changed to clearing my mind.
I realized I have been accumulating tens of half-baked insights that could be worth sharing, sometimes dating from a few years ago. Writing about these old thoughts felt like shedding a thick layer of hair I hadn’t noticed growing over the years. I feel lighter now that these ideas are out of my system. They can live on their own, they are in a format I can reference for others, and for my future self. I also felt the nakedness, the vulnerability that comes from sharing something that has been living inside me for so long without seeing the light of day.
It has been deeply satisfying, though, to prioritize getting these old ideas out there, even if they were not always the easiest to write about. I feel free to tackle new thinking challenges!
What are my best posts?
If you had only three posts to read, I would recommend these ones. First, Wrap and extend. This is one of those old ideas that finally got out. It is written by hand on my e-ink tablet (the BOOX Go 10.3). I had a lot of fun writing it; it felt like a very long whiteboarding session.
Second, I would recommend My experience running a 100k. I am happy with the way I was able to tell the story and incorporate humor into the text.
And finally, have a look at The rise of framemakers, which is a good intro to my vision of a wholesome future with AI.
Carefully crafted thirty-day challenges.
I can feel how this experience has made me more confident in my writing; it slowly expanded my sense of identity over the month, to the point where I can now call myself an “online writer” without cringing at the thought. I believe this is the result of three ingredients that apply to any skill you’d want to develop (from Daniel Kazandjian in this great episode of the Metagame):
If you are interested in taking a similar challenge, feel free to get in touch! I’d be happy to talk about my experience.
Closing thoughts.
It is now time to rest. It has been a tiring challenge: three to five hours of writing every day, while also starting a collaboration with Francisco on Twitter data analysis. I will not keep up this pace in the future, but I am still likely to maintain a rhythm much higher than before the challenge.
Thank you to everyone who has been reading and engaging with my posts, and to my friends and partner for their support. Stay posted for future writings!
Published on December 17, 2025 5:50 AM GMT
Commercial airplane tickets are divided up into coach, business class, and first class. In 2014, Etihad introduced The Residence, a premium experience above first class. The Residence isn't very popular.
The Residence isn't very popular for economic reasons. A Residence flight costs almost as much as a private charter jet, and private jets aren't just a little bit better than commercial flights. They're a totally different product. The airplane waits for you, and you don't have to go through TSA (or Un-American equivalent). The differences between flying coach and flying The Residence are small compared to the difference between flying The Residence and flying a low-end charter jet.
Even in the rare situations where The Residence is slightly cheaper than a private jet, nobody should buy it. Rich people should just rent low-end private jets, and poor people shouldn't buy anything more expensive than first class tickets. Why was Etihad's silly product created? Mostly for the halo effect: the existence of The Residence boosts Etihad's prestige which, in turn, boosts the soft power of Abu Dhabi.
The Residence shouldn't exist, and if Etihad weren't a state enterprise, it probably wouldn't. That's because there is a price breakpoint in the airline industry. Below the breakpoint, everyone flies commercial. Above the breakpoint, everyone flies private.
The word "luxury", like the word "art", has been profaned by marketing. Personally, I define a "luxury" product to meet the following criteria.
"Rich" is relative. In the case of flying, the breakpoint is "private jet", which costs tens of thousands of dollars. In the case of wasabi, the breakpoint happens much cheaper.
Wasabi is a popular condiment eaten with sushi. Most of the products marketed as "wasabi" are made out of horseradish.
Real wasabi comes from the grated roots of the wasabi plant.
The green paste you squeeze out of a tube markets itself as a luxury product, when it really belongs to a different class of product on the other side of a price breakpoint. I call this faux luxury.
Many production factors can cause a product category to bifurcate into luxury vs mass-market. In the case of airlines, this happens because "private airline" is all-or-nothing: either an airplane flight is private or it is not. In the case of wasabi, the bifurcation happened because real wasabi plants are expensive to grow, so manufacturers created fake wasabi to fill the demand.
Generally speaking, civilization gets wealthier as technology advances. Private jet flights are at a record high. Real wasabi production is higher today than at any point pre-Industrial Revolution. In absolute and relative terms, these luxury products are more available than ever.
However, other luxury products decrease in use as civilization gets wealthier. "Maid" used to be a major industry. Then washing machines and vacuum cleaners were invented. Labor-saving devices are the horseradish of personal servants. Having a personal maid is better than having a washing machine.
But having a washing machine is much cheaper than having a maid. Super-elites kept their servants, but most people switched to horseradish. For the former-maids now working in the modern service sector, this was a major step up.
In the case of servants, eliminating women's most common urban job was a net positive. Not all of these transitions were so benign.
Are you single in your mid-20s and thinking to yourself: "I'm a decently-attractive man/woman. I have my life put together. I've got basic social competence. So why is it so hard to find a mate? Something is wrong with civilization. I feel like it shouldn't be this hard."
You're not crazy. This is a situation where "the good old days" really were better. It used to be easier to find a mate and technology destroyed the social institution.
Here are the primary systems we're left with.
Now imagine how the perfect meeting-people-of-the-opposite-gender institution would function.
I'm not describing an imaginary theoretical utopia. I'm describing an actual social institution that used to exist. It's called a dance hall and the pairings are called partner dancing. There are many varieties, including waltz, swing, and salsa. Dance halls used to be the place young people went to hang out [and find mates].
Why are dance halls niche these days? There are many factors, but the single biggest one is recorded music. It used to be that all music was live. Live music was a coordination point where everyone could form new social connections. And once you're there you might as well dance.
Recorded music is the horseradish of dance halls. We live in a fallen world and if we were collectively rational then our civilization would find a way to severely limit the recording and electric transmission of artificial music.