
Sanders's Data Center Moratorium Is Risky Strategy for AI Safety

2026-03-16 10:23:51

This post is a modified crosspost (with unnecessary context removed) from the main post on Substack.

On Wednesday, Senator Bernie Sanders announced he’ll soon be introducing legislation calling for a moratorium on the construction of new data centers. His recent video announcement specifically cited a few issues:

“Bottom line: We are at the beginning of the most profound technological revolution in world history. That’s the truth. This is a revolution which will bring unimaginable changes to our world. This is a revolution which will impact our economy with massive job displacement. It will threaten our democratic institutions. It will impact our emotional well-being, and what it even means to be a human being. It will impact how we educate and raise our kids. It will impact the nature of warfare, something we are seeing right now in Iran.

Further, and frighteningly, some very knowledgeable people fear that what was once seen as science fiction could soon become a reality—and that is that superintelligent AI could become smarter than human beings, could become independent of human control, and pose an existential threat to the entire human race. In other words, human beings could actually lose control over the planet. I think these concerns are very serious.”

I, too, think these concerns are very serious. But I’m not confident that a 2026 data center moratorium addresses them, and I worry it might leave us even worse off. I think a temporary data center moratorium is unlikely to meaningfully slow AI development, and more importantly, it risks associating AI safety with weak environmental arguments and left-populist politics in a way that could generate backlash and make more important regulation harder to pass.

It probably won’t work

Even in the best case for this legislation, a data center moratorium is a temporary measure. Without concrete political pressure for a pause on superintelligence, there are just too many forces fighting against this type of legislation. National security is, rightfully, an important concern in our political climate. AI development must continue to keep up with China’s military capabilities. The entire economy is also betting on AI-powered growth. There are extremely well-funded actors, from the major AI labs to the venture capital firms backing them, that will lobby aggressively against any construction freeze, and they have far more political leverage than Sanders does. Without broad public support, which is building but not here yet, this thing will get repealed (assuming it could even pass).

And even while a moratorium is in effect, it only targets one part of the capabilities pipeline. Existing data centers keep running, and capital will be reallocated to do better research with existing compute, or build up energy infrastructure in preparation for when data center construction can resume. The labs don’t just sit around. This means that even a full year of a construction freeze (which seems wildly unlikely given the current administration) might only delay frontier AI capabilities by a matter of months.

What does Bernie plan to do with that extra time?

The economic concerns about job loss and disempowerment don't get magically solved with this slowdown. His efforts would be better spent working on the actual policy responses to automation and economic disruption. And on the issue of existential risk, a unilateral U.S. moratorium doesn’t slow down China. If powerful AI is going to be built, I would rather it be built by labs with alignment researchers in a country with a free press, accountable to its citizens. The real solution to AI x-risk is not a domestic construction ban. We need to be working toward a treaty banning the development of superintelligence, verified through international monitoring of data center compute, analogous to nuclear non-proliferation.

I don't think this policy would be very effective. But the bigger risk isn't that the moratorium fails to slow down AI. It's that it sets back the political movement that actually could.

It might backfire

“Data center moratorium” is, in the public mind, not really about existential risk. To Sanders’s credit, his announcement actually focused on the right things: existential risk, economic concerns, and political disempowerment. But the growing public opposition to data centers is built on a coalition of concerns, and much of that coalition is driven by environmental complaints about water usage and energy consumption that I don’t really buy.

[Image: the top YouTube comment on Sanders’s announcement video]

A bill banning data center construction is going to get interpreted through that lens regardless of Sanders’s intent. There’s a reason he proposed banning buildings instead of, say, taxing compute. The message is simple, and it maps well onto a public that already thinks data centers are evil water-sucking entities. I don’t love the idea of the AI safety movement becoming associated with arguments that are both wrong and very partisan. It might be useful to ride the wave of populism, but if you're not careful, it will backfire. Just look back at the tech-right backlash to progressive overreach in 2024, which is still shaping AI policy today (see David Sacks). A data center moratorium from Bernie Sanders is exactly the kind of thing they will point to as evidence that AI regulation is just left-wing populist environmentalism dressed up in safety language. That framing makes it harder to pass serious compute governance proposals down the line, because every future attempt at regulation has to fight through the association. Stopping ASI development should not be partisan, and it doesn’t need to be populist. It needs to be common sense.

How this could go well

All that said, the current situation might still be fine. I think in an ideal scenario, the legislation moves the idea of a pause into the Overton window, but does not pass. It helps Washington wake up to the risks of advanced AI without polarizing people in tech or strongly associating AI safety with environmental issues. If that’s how it plays out, great.

There’s also a world where a bill does pass, I’m wrong, and it turns out to be a net positive. If a moratorium is in place for 6-12 months, maybe it slows capability development just enough to push the really important decisions to a more competent presidential administration. The 2028 election is still over two years out, but prediction markets currently have Gavin Newsom, JD Vance, and Marco Rubio as the leading contenders. On AI policy, I think I prefer a decision from that group of possible presidents to the current one.

Then what should we do?

I don’t have all the answers. But there is a narrow window to build the political support needed to effectively handle the societal challenges ahead. The 2028 election is likely one of the most important in our century. We should learn from past mistakes and be extremely smart about our political strategy leading up to this moment. Don’t waste political capital on data center moratoriums. Instead, convince legislators that frontier AI could be used to develop novel bioweapons. Talk about how AI can weaken nuclear deterrence. Help them understand alignment. Educate the public about the dangers of ASI, build a non-partisan AI safety coalition, and pass legislation to implement and enforce safety guidelines and monitoring domestically. The big challenge is getting an international treaty banning the development of superintelligence, with verification.




Digital Dichotomy and Why It Exists

2026-03-16 10:23:35

I've been trying to understand why people feel so conflicted about their phone usage. I keep hearing people say, “I want to use my phone less, but I’m not able to.” I also sense a normalisation of high screen time in everyday conversation, yet underneath it layers of guilt, resentment, and helplessness, stemming sometimes from internal comparisons and sometimes from external ones. So I wanted to understand why this happens, starting from lived experiences and conversations. I spoke to 13 college students in India, and this is what I found. I also want to see how generalisable this is to a Western context, and to test the validity of my conclusions.

The Invisible Standard

Every single participant had a clear sense of what "good" phone usage looks like and what "bad" phone usage looks like. But when I pressed them on where that definition comes from, nobody could tell me. Most answers were along the lines of: “I just knew it,” or “that’s what you're supposed to do.”

It’s as though everyone subscribes to a common knowledge that dictates what’s right and wrong for them. As someone who has spent significant time studying digital habits, I can trace the roots of this definition to passive consumption and knowledge peddlers, but a layperson rarely thinks about where these definitions come from. This lack of explicit definitions seems to drive much of the dichotomy: a person who finds that social media sometimes helps them relax still feels they are doing the wrong thing, because the standard they hold as a student says their phone is only for studying. I relate this, at least loosely, to Steven Pinker's notion of common knowledge.

Productive Procrastination

All 13 participants said Instagram is bad for them, yet they still use it every day. The reason they give is that they're looking for productive content: DSA tutorials, career advice, financial planning, or just people doing interesting things. They seem to have found a way to offset the guilt of using Instagram with a sense of perceived productivity from the content they consume there. But the bottom line is that although participants said they go there for productive content, they end up spending an hour on unproductive content and come back feeling guilty, which they then cope with by doing productive things.

I also infer from the conversations that automaticity has a lot to do with this. While the amount of reward you receive varies with the content type, ultimately you are building an automatic behaviour that gets triggered by boredom or by negative feelings such as loneliness or disappointment (which participants mentioned explicitly).

The Same Device Problem

At some point people realize their habits aren't working for them and that they want to change. But they cannot put the phone down completely: they need it for classwork, to talk to people, to look up answers to quizzes, and so on. And when they come back to the device with good intentions, automaticity kicks in and they slide back into their previous behaviours. Even if the apps are uninstalled, the emptiness of not doing what made them feel safe or comfortable overrides the intention to build better digital habits.

This is what differentiates digital addiction from substance addiction: you can’t separate yourself from digital devices completely. I’m not saying it's absolutely impossible, but it’s incredibly hard in today’s world.

The Societal Price

The other layer is the societal price. The same platforms that make you feel like you need to change also make you feel like it's not okay to let go. When you get rid of Instagram, you're no longer part of the conversation about recent trends with your friends, and you won't hear about that cool thing happening soon. Instagram is only one example, and I refrain from mentioning AI tools, because they bring in so many other factors that addressing them would take a much longer post.

So, ultimately, I have come to this conclusion: a layperson, depending on their occupation, receives a signal from common knowledge about the good and bad ways to use their phone. When they find themselves not meeting these standards, feelings of guilt, resentment, and helplessness develop. But they still crave rewards, and they find "productive" ways to get them, which builds automatic behaviours. Eventually they consciously confront their actions, either through their own observations (lost productivity, looking at their screen time) or through comparison with peers, which heightens the guilt and regret and pushes them to act. Those actions target the device, not the mental models behind the behaviour, and are more often aimed at reducing the uncomfortable feelings than at real change. And because they live in a digital society, either the social price is too high or the same-device problem kicks in, and with automaticity as a catalyst they fall back into the loop.

This is the framework I landed on. While I understand that many of these ideas have been explored before, I find that this framing explains current behaviour around the digital dichotomy rather well, and I have not found well-known references for the Invisible Standard.

And I'm posting this on LessWrong:

  1. To see if the community finds these conclusions valid, whether they resonate with anyone at all, and whether I’m missing anything.
  2. To see if anyone else is thinking about something similar.
  3. To stress-test this theory with other perspectives before I start theoretically grounding it.





Brown math department postdoctoral position

2026-03-16 09:09:28

My friend (and former coauthor) Eric Larson is a professor in the Brown Math Department. The department has (unexpectedly) opened a rapid-turnaround postdoc position on Math + AI. The deadline is March 31 (two weeks from now). The department has clarified that the position includes any research involving Math and AI (not just AI for Math, as the application text may suggest). The applicant must have a Math PhD. We're posting here to see if there are any safety-aligned academics who would be interested in applying on this short timescale.




Futurekind Spring Fellowship 2026 - Applications Now Open

2026-03-16 08:09:39

We're excited to announce that applications are now open for the Futurekind Fellowship - a 12-week program at the intersection of AI and animal protection, hosted by Electric Sheep.

Program dates: April 12 – July 2026 (tentative) 

Application deadline: March 25, 2026

 

What is the Futurekind Fellowship?

This fellowship is designed for people who want to explore how AI can be applied to some of the most neglected areas in animal protection - and potentially build or research something with real-world impact.

Program Structure

Immersion Phase: Explore how AI intersects with animal welfare across domains including factory farming, animal communication, policy, and AI-powered advocacy tools.

Specialisation Tracks: Choose the track that fits your background and goals:

  • Builder — for those who want to develop technical tools or products
  • Research — for those interested in investigating questions at this intersection
  • Career — for those exploring how to orient their career toward this space

Project Sprint: Develop a project with mentorship support. High-impact projects have potential for follow-on funding.

Time Commitment

~4–5 hours per week:

  • 2-hour weekly discussions
  • 2–3 hours of readings and exercises
  • Elective guest lectures

Apply now

Interested in shaping the curriculum and earning paid facilitation experience? 

Apply as a paid facilitator

Questions? Feel free to comment below or reach out to us at [email protected] .




(I am confused about) Non-linear utilitarian scaling

2026-03-16 08:09:14

I had some vague thoughts about scaling in utilitarian-ish ethics, noticed I was confused about where they led, and thought it might be nice to present them here, hear where I'm wrong, and learn where I'm repeating existing work. The only prior discussion I could immediately find was in this comments section. I don't have any good textbooks on ethics handy to actually review the literature, so I look forward to learning what I'm missing. I also think I'm conflating terms like "moral value" and "utility" and so on; sorry about that.

Presented as a series of intuition pumps, hopefully without much jargon. 

Uniqueness

Utilitarians (whether moral realists, or just trying to fulfill their own values as effectively as possible) often assume that moral value (utility) scales linearly. Twice as much pleasure is twice as good; twice as much suffering is twice as bad. But does that make sense? 

Intuition pump 1: God with a rewind button

God (or whichever grad student runs the simulation I live in) watches me experience the worst 10 minutes of my life. Just awful. They rewind the (deterministic) universe 10 minutes, and watch it over a couple times. 

There's a sense in which there was twice as much suffering. But have I been harmed? Or was my duplicated suffering an indistinguishable state, carrying no additional moral weight? 

Intuition pump 2: Transporter clones

If my transporter clone is deconstructed (a euphemism for 'killed') in the first 0.01 nanoseconds of transport, I feel fine with that. If they're killed after a full day, that feels less good: they've become a unique person with unique moral value. Individual moral value seems to scale super-linearly-ish with the unique experiences of the transporter clone, and that moral value quickly becomes indistinguishable from the moral value of a totally unique person. Meanwhile, the moral weight of each individual clone might scale sublinearly while they're very similar to each other: between t=0 and t=0.01 there are two of me, but they might only sum to slightly more than one me-util, certainly not two me-utils.

So uniqueness of moral joy or suffering seems really important, in the extreme edge cases. 

Moral weight

The moral weight of animals is often calculated linearly too - with something like neuron count as a common 'good-enough' proxy measurement. To hear it put in smarter words than I've got: 

For the purposes of aggregating and comparing welfare across species, neuron counts are proposed as multipliers for cross-species comparisons of welfare. In general, the idea goes, as the number of neurons an organism possesses increases, so too does some morally relevant property related to the organism’s welfare. Generally, the morally relevant properties are assumed to increase linearly with an increase in neurons, though other scaling functions are possible.

- Adam Shriver, rethinkpriorities.org

Let's take all but the last sentence as assumptions, for now. Why linear scaling? Does that make sense? 

Intuition pump 3: Neuron subdivision

Chop my brain up (all 8.6e10 neurons, or if you're a synapse guy, all 1.5e14 of those), and put some minimal set of neurons/synapses into each of a billion or trillion microscopic bio-robots. Each bio-robot uses those neurons for control, is taught to like moving in the direction of a chemical gradient, and let loose to have very fulfilling (if nearly identical) lives. (Fulfilling for microbes, of course). 

Does that colony of microbes have equivalent moral weight to the person I used to be? Does your utility function say that no net harm has been done? 

What about 500k-ish bee-level bio-robots? Is there any subdivision that carries equivalent moral weight to the person I once was? (aside: maybe splitting in two, across the corpus callosum (split-brain syndrome), counts as two independent people, under egalitarianism as a normative value?) 

Now, what the heck is going on? 

If I accept the output of the uniqueness intuition pumps ("it's okay for god to rewind the deterministic universe" and "it's mostly okay for transporters to kill the original after small t"), then I'd have to believe that N exactly identical moral patients have a constant moral weight. And more generally, that the total utility of N moral patients scales as a constant if they're identical, a linear function of N if they're totally unique, and a sublinear function of N if they're very similar. 
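One illustrative way to write this down (a toy parameterization of my own, not anything standard from the literature): let s measure how distinguishable the N patients are from one another, and let u be the utility of one such patient alone.

```latex
% Toy parameterization, illustrative only: s in [0,1] is distinguishability,
% u is the utility of a single patient on their own.
U_{\mathrm{total}}(N, s) = N^{\alpha(s)} \, u, \quad \text{where}\quad
\alpha(0) = 0 \;(\text{identical: constant}), \quad
\alpha(1) = 1 \;(\text{unique: linear}), \quad
0 < \alpha(s) < 1 \;(\text{similar: sublinear}).
```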

If that's true, what the heck am I doing when computing utility under probability distributions?? 

Scenario: In 99 out of 100 parallel worlds, I gain a dollar; and in 1 out of 100 worlds I pay a dollar. Aren't the 99 copies of me identical? Didn't I just agree that, therefore, they have constant utility, no matter how many of them there are? I was just trying to be a more consistent utilitarian - did I give up the ability to perform utility calculations at all? 
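In terms of the toy exponent above, the clash is easy to make explicit: standard expected utility weights outcomes by their counts, while the identical-copies-count-once view (the α = 0 limit) throws that weighting away.

```latex
% Standard (linear) expected utility over the 100 worlds:
\mathrm{EU}_{\text{linear}} = \tfrac{99}{100}\, u(+\$1) + \tfrac{1}{100}\, u(-\$1)

% If the 99 identical winning copies jointly count as a single moral
% patient (alpha = 0), the normalized weights collapse to 1/2 each:
\mathrm{EU}_{\alpha = 0} = \tfrac{1}{2}\, u(+\$1) + \tfrac{1}{2}\, u(-\$1)

% The 99:1 odds have dropped out of the calculation entirely.
```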

Okay, maybe the 99 copies aren't identical. After all, for it to be a physical probability distribution, there would have to be some states in the system that aren't identical (copy 33 gets to be a little unique if they get to see number '33' on a hundred-sided die). But I can imagine arbitrary degrees of similarity between those 99 copies, in the same way that the transporter clone can achieve arbitrary degrees of similarity with the original by being vaporized closer and closer to t=0. Just lock the die (and die-reading computer) in a box. The sturdier the box, the more similar the 99 copies. 

But maybe parallel worlds don't exist, and let's say nothing about branch-counting and the Born rule. So maybe 'probability' (P) and 'count' (N) aren't interchangeable. But isn't that the frequentist assumption? The law of large numbers says probability-over-one-agent (P) and count-over-many-agents (N) converge. P and N are interchangeable.

So it seems like I've lost - if I want to have a sub-linear total moral value for N very similar moral patients, then I can't have a utility that scales linearly with probability P over similar universes, and vice versa. 

I don't see how to keep all three of (utilitarianism, probability, and moral weight as a function of uniqueness). 

So, choose one

  1. Expected utility (probability times utility) doesn't work to decide moral value under utilitarianism. 
  2. 'Probability' and 'count' are not equivalent in the limit. The law of large numbers is wrong, or somehow does not apply here. 
  3. The total moral weight of N agents scales linearly, no matter how similar those agents are. Killing your transporter clone is murder. 

Maybe someone smarter than me can figure out how to give up 1., maybe with a risk-aversion-type argument, or an appeal to prioritarianism over egalitarianism. I don't see it. It's also very possible that, if I had a good intuition for SSA/SIA, this confusion would resolve itself. 

But I'm not smarter than me. So, until I become smarter, I'll have to give up 3. The intuition pumps above felt somewhat persuasive when I wrote them, but I must have been wrong. 

I'm ready for my brain to be chopped up into a trillion nearly-identical bio-robots, doctor. 




Hello, World of Mechanistic Interpretability

2026-03-16 07:47:35

This post is an introduction to a series dedicated to mechanistic interpretability in its broader sense: the set of approaches and tools for understanding the processes that lead to particular AI-generated outputs. There is an ongoing debate about what counts as part of the mechanistic interpretability field. Some works reserve “mechanistic interpretability” for approaches that work bottom-up, focusing on neurons and smaller components of deep neural networks. Others also include attribution graphs, which one could consider a top-down approach, since it is based on analyzing the representations of higher-level concepts.

The primary goal of the upcoming series is to structure the growing field of mechinterp, classify its tools and approaches, and help other researchers see the whole picture clearly. To our knowledge, this is among the earliest systematic attempts. Although some authors have previously tried to fill in the blanks and introduce structure, the field is evolving so rapidly that it is virtually impossible to write one paper and declare the structure settled. Systematization, in our opinion, should be a continuous, evolving effort.

Hence, with this series of posts we hope to offer the mechinterp community the clarity and insight we have derived from our literature research. We also hope to share some of our own experiments, built upon previous work, reproducing and augmenting some approaches.

We argue that treating AI rationally is crucial as its presence and impact on humanity keep growing. We then argue that rationality requires clarity, traceability, and structure: clarity in the definitions we use, traceability of algorithms, and structure in communication within the research community.

This particular post is an introduction to the team behind this account. In pursuit of transparency and structure, we share our names and faces, and fully disclose what brought us here and what we are aiming for.

“We” are not a constant entity. At first there were five people, who met at the Moonshot Alignment program held by the AI Plans team, and it would be dumb not to give everyone credit. Those five then collaborated with another group of people within the same program. Eventually a few people came in, a few went out, everyone brought value, and currently “we” are three people:

So, Janhavi does the research: she looks for papers, runs experiments, and educates the team on the important stuff. Nataliia plans the work and turns research drafts into posts; she is also the corresponding author, if you want to connect. Fedor does the proofreading, asks the right questions, and points out poorly worded passages.

The three of us also carry everything our other teammates have left — their ideas, judgment and experience. Saying “we” seems to be the only right way to go.

Also, GPT-5 helps with grammar, because none of us is a native English speaker.

How It Started

It was around mid-fall. We all had tons of work, but when the AI Plans team launched their Moonshot Alignment program, we were like, “Yeah, sounds like a plan, why not?” — and things escalated quickly.

We met at that program because we all deeply care about AI alignment and safety, and we want to take real action to move towards a future where AI is accessible and safe. The central topic of the Moonshot Alignment course was instilling values into LLMs:

  • figuring out how values are represented;
  • finding ways to steer those representations towards safety and harmlessness;
  • designing evaluations to check if we were successful.

We decided to focus on direct preference optimization, because it seemed to be the most practice-oriented research direction. We all had projects we could potentially apply our findings to.

By the end of the course we were expected to have prepared our own research — a benchmark, an experiment — anything useful uncovered during five weeks of reading, writing code, looking at diagrams, and whatnot.

We started with the paper “Refusal in Language Models Is Mediated by a Single Direction”. It was a perfect paper to start learning about internal representations with:

  • the approach the authors used is described in detail and is easily reproducible;
  • the LLMs are small;
  • the concept of refusal is pretty straightforward and applicable.

The authors took some harmful and some harmless prompts, ran them through an LLM, and found a direction in activation space (roughly, the difference between the mean activations on the two sets of prompts). As they manipulated that direction, the LLM’s responses moved from “refuse to answer anything at all” to “never refuse to answer,” and everywhere in between.
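For the curious, here is a minimal sketch of that difference-in-means idea on synthetic activations. This is our paraphrase for illustration, not the paper's actual code; the shapes and the dummy data are placeholders.

```python
# Minimal sketch of the difference-in-means "refusal direction" idea,
# on synthetic activations (illustrative; not the paper's actual code).
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # placeholder residual-stream width

# Stand-ins for activations at one layer and token position. In a real
# experiment these come from running harmful vs. harmless prompts.
harmful_acts = rng.normal(size=(100, d_model)) + 0.5   # shifted cluster
harmless_acts = rng.normal(size=(100, d_model))

# Candidate direction: difference of the two class means, normalized.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(acts: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along unit vector v."""
    return acts - np.outer(acts @ v, v)

# After ablation, no activation has any component along the direction.
print(np.abs(ablate(harmful_acts, direction) @ direction).max())  # ~0
```

Projecting the direction out is how one suppresses refusal; adding a multiple of it back in is the corresponding way to induce it.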

First, we decided to reproduce the experiment with different LLMs to see if the results generalize. But we quickly found something interesting: refusal did not look or behave like the single vector the authors suggested. Our own experiments hinted that it’s probably a multidimensional subspace, maybe 5–8 dimensions, and that this finding is domain-dependent and reproducible.
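Our actual method deserves its own post, but to give the flavor: one standard way to ask “how many dimensions?” is to run PCA on per-prompt difference vectors and count how many components are needed to explain most of the variance. Below is a synthetic sketch with a planted 6-dimensional subspace; all names and sizes here are made up.

```python
# Synthetic sketch: estimating the dimensionality of a planted subspace
# via PCA on per-prompt difference vectors (not our real pipeline).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
d_model, n_prompts, k_true = 64, 300, 6

# Orthonormal basis of a hidden 6-dimensional subspace.
basis, _ = np.linalg.qr(rng.normal(size=(d_model, k_true)))

# Difference vectors concentrated in that subspace, plus small noise.
diffs = (rng.normal(size=(n_prompts, k_true)) @ basis.T
         + 0.1 * rng.normal(size=(n_prompts, d_model)))

explained = np.cumsum(PCA().fit(diffs).explained_variance_ratio_)
print("components for 95% variance:", int(np.searchsorted(explained, 0.95)) + 1)
```

On data like this the script recovers roughly the planted dimensionality; on real activations the cutoff is, of course, much murkier.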

Naturally, we got all excited and wanted to publish our findings on arXiv. We wanted to connect with other enthusiasts and start a discussion. We wanted to become a part of the AI alignment researchers community and to gather other perspectives and hints.

There was only a little left to do: a thorough literature search. One step, and we’re inside a rabbit hole. A deep, dark rabbit hole, which turned out to be just the anteroom of a labyrinth called “Mechanistic Interpretability”.


What is mechanistic interpretability? We don’t know. No one does, really. What we mean by it is: “the way to get inside an LLM and tweak something there to change its behavior”.

One more step. Is refusal multidimensional? Looks like it is. Maybe it’s a cone.

One more. But why do we want to learn more about it in the first place? We want LLMs to refuse harmful requests — we want them to be more harmless.

And yet one more. Are refusal and harmlessness connected? Yes, probably, but likely not as tightly linked as we expected.

Fine. Let’s take another turn.

A step in a new direction. We started with a single concept and used linear probing (we’ll tell you more, but not today) to find it. Is linear probing reliable? Not always. There are also sparse autoencoders, attribution graphs, and other techniques. And none of them seems to have turned out to be “The One”.
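For those who haven’t met it: a linear probe is just a linear classifier trained to read a concept off the activations; if it succeeds, the concept is (at least) linearly decodable there. Here is a toy version on synthetic data, not one of our real experiments.

```python
# Toy linear probe on synthetic activations (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
d_model, n = 64, 400

# Plant a concept direction; label-1 activations are shifted along it.
concept = rng.normal(size=d_model)
concept /= np.linalg.norm(concept)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + np.outer(labels, concept)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))

# The probe's weight vector is one estimate of the concept direction.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("cosine with planted direction:", float(w @ concept))
```

The “not always reliable” part is exactly this: a probe can reach high accuracy by latching onto correlated features, without the model ever using that direction causally.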

Okay. One more turn.

Where do we look for representations of concepts? One layer, a subset of layers, all layers? The fact that vector representations change from layer to layer was… not helpful.


See? A deep, dark, fascinating rabbit hole. We got lost in it and spent months reading, watching, experimenting. Asking questions, looking for answers, finding even more questions.

Finally, we sat down and decided: that’s enough. We are not going to wander in the dark on our own — we want company. Yours, specifically.

How It’s Going

We still want to make AI safer, even if only 0.001% safer than before we jumped in. Figuring out how different abstract concepts are represented mathematically is crucial for building AI safely, ethically, and responsibly:

  • we must know what to evaluate;
  • we must know where the harm might be hidden;
  • we must know how to steer our LLMs in the right directions.

And we want to be part of a community — to exchange ideas, take criticism (constructive criticism only!), and spark discussions. So, after some back and forth, we ended up registering an account here, because what we have is not yet enough for a paper, but it is enough to start talking, exchanging ideas, and asking questions together.

In the next posts, we’ll share more details about our own experiments, and the ones we uncovered along the way. In the meantime, please share your thoughts:

  • Have you studied how abstract concepts are represented inside LLMs?
  • What methods have you used?
  • What were your experiments about?

We’re excited to hear from you.


