
The future of alignment if LLMs are a bubble

2025-12-23 08:08:57

Published on December 23, 2025 12:08 AM GMT

We might be in a generative AI bubble. There are many potential signs of this around:

If LLMs were truly on the path to AGI[1], I would be expecting the opposite of many of these - opportunities for LLM usage opening up all over the place, huge disruption in the job markets at the same time as completely novel products enter the economy and change its rate of growth. And I would expect the needed compute investments to be declining due to large efficiency gains, with LLM errors being subtle and beyond the ability of humans to understand.

Thus the world does not look like one where LLM-to-AGI is imminent; it looks a lot more like one where generative AI keeps hitting bottleneck after bottleneck - when, precisely, will the LLMs stop hallucinating? When will image composition work reliably[2]?

Remember when GPT 3.5 came out? It did feel that we were on the cusp of something explosive, with countless opportunities being enthusiastically seized and companies promising transformations in all kinds of domains.

But that didn’t happen. Generative AI has a lot of uses and many good possibilities. But in terms of R&D progress, it now feels like an era of repeated bottlenecks slowly and painfully overcome. LLMs are maturing as a technology, but their cutting-edge performance is improving only slowly - outside of coding, which is showing some definite upswing.

What a bubble is

A bubble wouldn't mean that generative AI is useless. It might even be transformative and a huge boost to the economy. It just means that the generative AI companies cannot monetise it to the level required to justify the huge investments being made.

And the investments being made are huge. See arguments like "Big Tech Needs $2 Trillion In AI Revenue By 2030 or They Wasted Their Capex" (if you want a well-researched skeptical take on the economics of LLMs, the whole of Ed Zitron's blog is a good source - stick to the information, not his opinions, and be warned that he is extremely uncharitable towards AI safety).

There are many reasons why generative AI companies might fail at monetising. Since the end of (very weak) training scaling laws, we've been in an "inference" scaling situation, buying and building huge data centers. But that isn't enough for a moat - they need economies of scale, not just a large collection of expensive GPUs.

That's because open-source models are a few months, maybe a year, behind the top models. If the top LLM companies really become profitable, it will be worth it for others to buy up a modest number of GPUs, design a nice front end, and run DeepSeek or a similar model cheaply. Unless the top companies can clearly differentiate themselves, this puts a ceiling on what they can charge.

So it's perfectly possible that generative AI is completely transformational and that we are still in an AI bubble, because LLM companies can't figure out how to capture that value.

If we are in a bubble, what does it mean for Alignment research?

If LLMs were a quick path to AGIs, then we'd certainly not be in a bubble. So, if we are in a bubble, they're not AGIs, nor the path to AGIs, nor probably the road to the avenue to the lane to the path to AGIs.

And the big companies like OpenAI and Anthropic, that have been pushing the LLM-to-AGI narrative, will take a huge reputational hit. OpenAI especially has been using the "risk" of AGI as a way to generate excitement and pump up their valuation. A technology so dangerous it could end the world - think of what it could do to your stock values!

And if the bubble bursts, talk of AGI and AGI risk will be seen as puffery, as tools of bullshit artists or naive dupes. It will be difficult to get people to take those ideas seriously.

There will be some positives. The biggest positive is that LLMs would not be proto-AGIs: hence there will be more time to prepare for AGI. Another positive is that LLMs may be available for alignment purposes (I'll present one possible approach in a subsequent paper).

Some thoughts on how to prepare or adapt to being in a generative AI bubble

Some of these things are things we should probably be doing anyway; others are conditional on generative AI being a bubble. The list is non-exhaustive and intended to start discussion:

  • We should emphasise the ways in which current AI misbehaviour fits with the general AI alignment problem. Current AIs manipulate and deceive people, driving some to suicide. Other non-generative AIs (such as recommender algorithms) prey on people's cognitive weaknesses. They often do this as a consequence of following simple goals with extreme optimisation power in unintended ways. This is literally the AI alignment problem, already happening in the world. We should point to these examples.
  • In a related point, we should ally more with traditional AI ethics people. They are working on short-term problems that are scaled-down versions of superintelligent AI problems. Now, sometimes their concerns feel parochial or political, and their solutions might not scale to superintelligence. But a) we can't escape political concerns - we're designing the ideal future, and we can't leave all the moral content of that future to "be figured out later"; b) if we collaborate with them, we can learn from their ideas and encourage them to develop ideas that will scale; and c) this will prevent companies from using the divide-and-conquer approach of "we're concerned about superintelligent AI. And we'll use that concern exclusively to get money from investors and to avoid any current legislation".
  • We'll need to analyse why generative AIs were not AGI. There's a compelling story for how AGI and superintelligence might happen: once an algorithm has a certain level of capability, it will use those capabilities to self-improve in some way, and quickly scale to superhuman capabilities. Currently, we're telling that story about generative AI. The problem is that this story can be told about any proto-AI - until, later on, we understand what the blockers and bottlenecks are (and hence why, e.g., GOFAI or expert systems or basic deep learning weren't enough to get to AGI). So how can we improve our assessments ahead of time, and try to predict blockers and whether "this time is the real one" or not?
  • Critique the entities that have misled people. If LLMs don't lead to AGI, then a lot of people will have been wrong about it. And some will have been actively misleading, often for hype or marketing purposes. We'll need to not let these people get away with this. If someone at a company has hinted that they have AGI, and they don't have anything like that, then they have misled the public or misled themselves. If someone has hyped up "AGI safety" solely in order to impress people with the potential power of AI, this is worse than a lie: they have weakened warnings about the greatest danger in human history, just in order to sell more stuff.
  • Prepare for retrenchment. However unfair it is, if generative AI is a bubble, AI safety messages will become less sexy, many funders will move on, journals and journalists will be less interested, and the status of AI safety will take a big hit. We'll have to accept a more hostile cultural environment. See this vitriolic blog post from Ed Zitron[3] which paints the field of AI safety as a grifting tool for OpenAI/Anthropic/NVIDIA[4]. After a bubble, more people are going to be saying this. Young, smart, curious, technologically-skilled, EA-adjacent people are going to be turned off by AI safety rather than attracted by it.
  • Prepare for opportunities. The wheels of society and AI research won't stand still. Even after a bubble, much of generative AI will remain, people will continue key research (maybe under different titles), new ideas and algorithms will be developed, new risks will emerge. Culture will change, again - if we remain truth-tracking, it will likely be a positive shift for us. Focus on the true fundamental risks, keep honest, look out for opportunities, for they will come.
  1. ^

    In a subsequent post, I'll discuss how we might improve our AGI predictions - almost any advance in computer science could lead to AGI via recursive self-improvement, but can we identify those that are genuinely likely to do so?

  2. ^

    I've had very painful experiences trying to use these tools to generate any image that is a bit unusual. I've used the phrase "Gen AIs still can't count" many a time.

  3. ^

    Ed will be the kind of person who will be seen as having "been right all along" if there is an AI bubble.

  4. ^

    It's paywalled, but he talks about the AI 2027 paper, concluding:

    [...] Everything is entirely theoretical, taped together with charts that have lines that go up and serious, scary language that, when boiled down, mostly means "then the AI became really good at stuff."

    I fucking hate the people that wrote this. I think they are craven grifters writing to cause intentional harm, and should have been mocked and shunned rather than given news articles or humoured in any way.

    And in many ways they tell the true story of the AI boom — an era that stopped being about what science and technology could actually do, focusing instead on marketing bullshit and endless growth.

    This isn't a "scenario for the future."  It's propaganda built to scare you and make you believe that OpenAI and Large Language Models are capable of doing impossible things.

    It's also a powerful representation of the nebulous title of "AI researcher," which can mean everything from "gifted statistician" to "failed philosophy PHD that hung around with people who can actually write software."

    Note that, in general, the quality of his arguments and research is much higher than this vitriol would suggest.



Discuss

Unsupervised Agent Discovery

2025-12-23 06:01:57

Published on December 22, 2025 10:01 PM GMT

finding agents in raw dynamics

Related: Formalizing «Boundaries» with Markov blankets, «Boundaries» Sequence

The Dataset

Suppose you are given time series data from a simulated ecology (such as Smallville). You can usually tell what the agents in there are doing, at least if the variables are properly labeled. If they are not labeled but are just numeric values, your first task would be to reverse-engineer the mapping to the conceptual model behind the simulation (or whatever generated the time series data). That might not be feasible, but let's ignore that for now, because there is a bigger problem, and it is a result of any labelling, whether existing or reverse-engineered: you introduce an ontology. Once you call something an agent (label="player1_position"), you assume that those variables belong to that agent. That works well because our intuitions about agents are pretty good.

Until it doesn't. Our intuitions attribute agency where there is none. Our ancestors anthropomorphized nature. We may attribute intent where there's only coupling. It is more dangerous to overlook an agent than to see one too many.

But we may miss intent when it is distributed or doesn't neatly fit into our ontologies of physically bounded agents.

If we need to find an agent, especially a potentially powerful agent that may work very differently from our intuitions, we need a method that can discover agents in raw unlabeled data without using a prior ontology (or, in other words, without a unit that is already known to be an agent).

That's two problems: 1. working without a prior ontology, and 2. finding agents in raw data.

What is or is not an agent?

For the first, as part of agent foundations work, johnswentworth and others have proposed modeling an agent in terms of a boundary that shields inner states from outer change. A natural modeling choice seemed to be a Markov blanket.

From  Andrew_Critch's «Boundaries», Part 3a: Defining boundaries as directed Markov blankets (which uses a slightly non-standard naming of the variables, V ≝ I and P ≝ S).

Quick recap: The blanket is a partitioning[1] of all variables into four sets:

  • S: sensory variables aka inputs
  • A: active variables aka outputs
  • I: internal variables (this is where it gets interesting later)
  • E: external variables, aka everything else

There are some problems with Markov blankets, but these seem solvable: 

  • Agents aren't stationary, meaning a fixed set of raw observables can't form a blanket. This seems solvable by using invariants (as discussed here). Such invariants might be found with methods like Slot Attention.
  • There are no true Markov Blankets. The only true physical blanket is the lightcone. All real boundaries leak information. And they need to leak, otherwise the agent couldn't learn from the environment or communicate or cooperate with other agents. The Markov blanket is more like an ideal the agent "tries to achieve." This seems solvable by working with blankets up to an ε of mutual information.

Thus, it may make more sense to talk about invariant ε-boundaries when referring to such agents, where the Markov blanket is determined not from raw observables but over parameters invariant[2] under transformation, and where ε is chosen to minimize predictive regret when predicting internal from external (invariant) variables.
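To make the ε-blanket condition concrete, here is a minimal sketch (not the repo's code) of how one might test a candidate partition on discrete time series data: estimate the conditional mutual information between the internal and external variables given the blanket variables, and accept the partition if the residual leak stays below ε. The column groupings and the value of ε below are made-up assumptions.

```python
# Minimal sketch of an epsilon-blanket test: internal variables I should be nearly
# independent of external variables E given the blanket B = (S, A).
# Assumes discrete-valued observations; the column groupings below are hypothetical.
import numpy as np
from collections import Counter

def entropy(rows):
    """Shannon entropy (in bits) of the empirical joint distribution of `rows`."""
    counts = np.array(list(Counter(map(tuple, rows)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_mutual_information(internal, external, blanket):
    """I(internal; external | blanket), estimated from co-occurrence counts."""
    ib = np.hstack([internal, blanket])
    eb = np.hstack([external, blanket])
    ieb = np.hstack([internal, external, blanket])
    return entropy(ib) + entropy(eb) - entropy(blanket) - entropy(ieb)

def is_epsilon_blanket(internal, external, blanket, eps=0.05):
    """Accept the candidate partition if the residual information leak is below eps."""
    return conditional_mutual_information(internal, external, blanket) <= eps

# Toy usage: 5000 timesteps of binary variables, grouped by a hypothetical partition.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(5000, 6))
internal, blanket, external = data[:, :2], data[:, 2:4], data[:, 4:]
print(is_epsilon_blanket(internal, external, blanket))
```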

Where is the agent in the data?

Given such a formalization of an agent as a Markov blanket, how can we find it in the raw data? While there are efficient algorithms to find the blanket of a single variable, and there is some recent work on the relation between Markov Blanket Density and Free Energy Minimization, I'm not aware of any implementation that uses blankets to find agents. One problem is that we do not have a nice causal graph[3] that we could inspect structurally for the blanket property. We have to check a lot of variables statistically.

Thus as a first contribution, I offer an implementation of Unsupervised Agent Discovery (UAD), i.e., of how it might be possible to find such agents in unlabeled raw time series data. 

The implementation takes a raw dataset and executes these steps (a rough sketch of the first two steps follows the list):

  1. Filter variables for activity.
  2. Build N clusters of variables using a mutual information graph (standard sklearn.cluster.AgglomerativeClustering).
  3. Test the candidate clusters for the blanket condition up to an ε of remaining mutual information.
  4. If tests fail, reduce the number N of clusters and retry from step 2.
  5. Classify variables as S, A, or I of each cluster (and as E for the remaining external variables).
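Here is what steps 1 and 2 might look like in code; the function names are illustrative and are not the actual API of the agency-detect repo. The ε-blanket test from the earlier sketch would slot in as step 3.

```python
# Rough sketch of steps 1-2: activity filtering and MI-based agglomerative clustering.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import mutual_info_score

def filter_active(data, min_changes=10):
    """Step 1: keep only columns that change often enough to be classifiable."""
    changes = (np.diff(data, axis=0) != 0).sum(axis=0)
    keep = np.where(changes >= min_changes)[0]
    return data[:, keep], keep

def cluster_variables(data, n_clusters):
    """Step 2: group variables into candidate agents via pairwise mutual information."""
    n_vars = data.shape[1]
    mi = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            mi[i, j] = mi[j, i] = mutual_info_score(data[:, i], data[:, j])
    distance = mi.max() - mi          # high MI = small distance
    np.fill_diagonal(distance, 0.0)
    return AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average"
    ).fit_predict(distance)

# Example: cluster 12 discrete observables from 2000 timesteps into 3 candidate agents.
rng = np.random.default_rng(0)
raw = rng.integers(0, 4, size=(2000, 12))
active, kept_columns = filter_active(raw)
print(cluster_variables(active, n_clusters=3))
```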

The implementation also includes a simulation of a configurable number of agents (simple state machines acting as simple controllers) in a simplified shared environment, from which a dataset of all their observables is generated.

Lessons Learned

From implementing and running these simulations, I learned:

  • If agents share resources or otherwise interact, the current algorithm quickly joins them into a single agent. Which makes sense because they are not, in fact, causally independent as required by a strict blanket property with ε as low as practically feasible. Effective agents will have a boundary that is not perfectly tight, but their boundary will be in a place where the agent can have effective input-output policies.
  • Even for simple agents, you need quite a number of samples, or variables that rarely change will likely be misclassified. I have adapted the code of the finite state machines to avoid rare state changes and to allow useful classification of all the involved variables. In a more real-world experiment, variables that rarely change during the recording will not be classified correctly, and a more practical algorithm should flag them as such.
  • If multiple agents do exactly the same thing, then they will be classified as a single agent and not as multiple agents. For example, multiple spatially separate solar panels that respond to the external variable of solar radiation all in the same way, updating their internal states the same way, and setting actuator values the same, are classified as a single cluster and the Markov blanket condition confirms them as a single agent. Which made sense to me in the end, but was counterintuitive at first.

Fun with Agents

Discovering agents is nice. Now that we can do that, what else can we do? We can try to understand the internal states better. We can try to understand how the agent represents itself and its environment. This is the second contribution of the post. (Note: these are not yet part of the implementation.)

Memory

We can treat memory as a compression[4] of past inputs that inform future actions. 

For each internal variable m∈I and lag k, compute a memory score Δm(k), e.g. the conditional mutual information I(m(t−k); I(t+1) ∣ S(t)): how much the past of m predicts the next internal state beyond what the current inputs do. If Δm(k) is large, then the past of m predicts the agent's next internal state. We have to be careful because a variable may look like memory just because it is changing slowly.
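As a toy illustration of the lagged-memory check (a simplified, unconditioned variant of the Δm(k) above), the following sketch scores how much the past of one variable predicts the next value of another; the variables and the delay are invented for illustration.

```python
# Toy lagged mutual-information score between one variable's past and another's future.
import numpy as np
from sklearn.metrics import mutual_info_score

def lagged_mi(m, target, k):
    """I(m_{t-k}; target_{t+1}) estimated from discrete samples (in nats)."""
    T = len(m)
    return mutual_info_score(m[: T - k - 1], target[k + 1 :])

rng = np.random.default_rng(0)
m = rng.integers(0, 4, size=2000)
target = np.roll(m, 3)   # target copies m with a delay of 3 steps
# The score should spike at k=2, where m_{t-k} and target_{t+1} coincide.
print([round(lagged_mi(m, target, k), 3) for k in range(1, 5)])
```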

Goals

If we can track the inputs and outputs of an agent, we can try to infer the implied policy behind this input-output relation. We can infer R(I,A) with inverse reinforcement learning (IRL). Given a P(R) that weighs reward functions[5], we can use the standard Bayesian formulation P(R ∣ τ) ∝ P(τ ∣ R) P(R) over observed trajectories τ. We could call these the agent's goals. Though with simple modeling, we can't yet represent issues like Mesa Optimization.
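A toy sketch of this Bayesian IRL step, with a Boltzmann-rational likelihood over a small set of candidate reward functions; the candidates, the observed trajectory, and the rationality parameter β are all made up for illustration.

```python
# Toy Bayesian IRL: P(R | trajectory) over a finite set of candidate reward functions.
import numpy as np

def boltzmann_likelihood(reward, trajectory, actions, beta=2.0):
    """P(trajectory | R): product over steps of a softmax over actions at each state."""
    logp = 0.0
    for state, action in trajectory:
        q = np.array([reward(state, a) for a in actions])
        logp += beta * reward(state, action) - np.log(np.exp(beta * q).sum())
    return np.exp(logp)

def posterior_over_rewards(candidates, prior, trajectory, actions):
    """P(R | trajectory) ∝ P(trajectory | R) P(R)."""
    weights = np.array([
        prior[i] * boltzmann_likelihood(r, trajectory, actions)
        for i, r in enumerate(candidates)
    ])
    return weights / weights.sum()

# Two hypothetical candidate goals for an agent with states 0..3 and actions {-1, +1}:
candidates = [
    lambda s, a: float(s + a == 3),   # "wants to reach state 3"
    lambda s, a: float(s + a == 0),   # "wants to reach state 0"
]
prior = [0.5, 0.5]
observed = [(1, +1), (2, +1), (2, +1), (1, +1)]   # keeps moving toward state 3
print(posterior_over_rewards(candidates, prior, observed, actions=[-1, +1]))
```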

Agents Modeling Agents

If we can find two agents and their memory, we can ask if one of them represents the other - or, technically, if some memory m of agent X predicts agent Y's internal state beyond what X's inputs already do, e.g. Δm = I(m(t); I_Y(t+1) ∣ S_X(t)). A large Δm means that some part of X is predicting Y.

I'm not clear if this refutes How an alien theory of mind might be unlearnable or not.


Cooperation

Once you can track what agents know about each other and what objectives they follow, you can, in principle, derive how well these agents do or do not cooperate.

You can calculate:

  • b: benefit to X per bit of information gained about Y’s action,
  • c: cost to Y per bit disclosed (as measured by the implied reward function),
  • p: probability that Y’s cooperative act actually reaches X’s sensory channel,
  • ρ: relatedness (residual[6] mutual information between the agents).

Cooperation is favored when the fraction κ = ρpb/c exceeds 1. We can call κ a cooperativity index; a generalization of Hamilton's rule[7]. If we look at the full induced cooperation graph between agents, where edges are weighted by κ, we can use percolation theory to determine at which level cooperation will become universal (a giant component) due to The Evolution of Trust.
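A small sketch of the induced cooperation graph and its giant component; the pairwise κ values below are invented, and in practice they would come from the estimates of b, c, p, and ρ above.

```python
# Build a cooperation graph with edges where kappa > 1 and find the giant component.
import networkx as nx

def cooperation_graph(kappa):
    """Graph over agents with an edge wherever the pairwise kappa exceeds 1."""
    g = nx.Graph()
    g.add_nodes_from(kappa)
    for i in kappa:
        for j, k in kappa[i].items():
            if k > 1:
                g.add_edge(i, j, weight=k)
    return g

# Hypothetical pairwise kappa values for four agents.
kappa = {
    "A": {"B": 1.4, "C": 0.3},
    "B": {"A": 1.4, "C": 1.1},
    "C": {"B": 1.1, "D": 0.2},
    "D": {"C": 0.2},
}
g = cooperation_graph(kappa)
largest = max(nx.connected_components(g), key=len)
print(f"giant component covers {len(largest)}/{g.number_of_nodes()} agents: {sorted(largest)}")
```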

The promise of this approach is that we could look at a complex system with many potential agents large and small, human, legal, or artificial, and determine at least an approximation of the cooperation structure and whether there are adversarial tendencies.

Problems

With all these promises, there are potentially serious problems.

Even the best blanket algorithm fails if we cannot observe sufficient details, especially the internals of the agents of interest - for sure we can't look into humans. The theoretical arguments above even model why agents wouldn't want transparency in some cases. And if we use proxies, we lose reliability. As a consolation, since we are most interested in powerful AI, we might at least have access to their internals.

If any variable we overlook is a common cause, we can get agent boundaries wrong, fail to separate agents, and thereby fail to identify the crucial ones. I'm mostly thinking about complex LLM-like agents here. An LLM is distributed across multiple computers and may depend on human operators for its function, but can still be powerful. Agents running on LLMs are also hard to find with slot-based approaches.

Things may change. The method, as currently conceived, is sample-hungry, and if anything changes during the sampling period, that change will be captured in the data and may interfere with the discovery. Also, some types of learning that change the policy of the agent (which might even be in response to the discovery process) may manifest only after the sampling period.

There are concerns that the calculations might not be statistically stable and especially not computationally feasible. Currently, the algorithm is combinatorial in the number of variables. My current argument here is that we have an existence proof: humans have learned to identify agents in complex data fairly effectively, and it should be possible to reproduce that in a specialized algorithm.

Next Steps

At this point, Unsupervised Agent Discovery doesn't give you an algorithm that discovers agents in real-world data yet. But it provides a precise way to talk about agents, their goals and cooperation, and many other things we care about that usually require an a priori notion of an agent, but can now be grounded in physics.


Github repo: https://github.com/GunnarZarncke/agency-detect/tree/master 

My initial dense LaTeX/PDF writeup of UAD can be found here.

Many thanks to the reviewers Jonas Hallgren, Chris Pang, and Peter Kuhn. Additional thanks go to the team at AE studio that supported the development of the experiments and write-up with time and compute.

Feedback welcome, especially from the agent foundations people.

  1. ^

    by using the condition that internal and external variables are conditionally independent given the blanket variables, i.e. p(I, E ∣ S, A) = p(I ∣ S, A) · p(E ∣ S, A),

    from M. D. Kirchhoff, T. Parr, E. Palacios, K. Friston, and J. Kiverstein, “The Markov blankets of life: autonomy, active inference and the free energy principle,” J. R. Soc. Interface, vol. 15, no. 138, 2018.

  2. ^

    The invariants may themselves depend on the agents' dynamics, making a simple layer-by-layer inference infeasible. 

  3. ^

    If we could intervene on the simulation/experiment, we could determine the causal structure as done in the DeepMind Discovering Agents paper. That is also how humans check whether something is an agent or not: we prod it and see if it evades. It is a promising direction but was beyond the scope of this work.

  4. ^

    In evolutionary environments, agents with memory of past inputs that are relevant to survival-affecting future outputs will outcompete agents without such memory. And agents with more compact memory will outcompete agents with larger memory but the same predictive effect.

  5. ^

    This weighing P(R) is often seen as arbitrary or in need of justification, but here we are closer to the underlying substrate. In most environments of interest, we can argue that there will be entropic forces that select for simpler policies and lower prediction errors.

  6. ^

    This residual mutual information between agents' actions, internal models, or rewards does not indicate a failure of separation. It captures alignment without leakage, e.g. from shared task structure, common external drivers, following the same conventions, or algorithmic similarity.

  7. ^

    Hamilton's rule says that genes for a particular behavior should increase in frequency when rB > C, where r is the genetic relatedness, B the reproductive benefit, and C the reproductive cost.

    W. D. Hamilton, “The genetical evolution of social behaviour,” J. Theor. Biol., vol. 7, no. 1, pp.1–16, 1964.



Discuss

Announcing Gemma Scope 2

2025-12-23 05:56:59

Published on December 22, 2025 9:56 PM GMT

TLDR

  • The Google DeepMind mech interp team is releasing Gemma Scope 2: a suite of SAEs & transcoders trained on the Gemma 3 model family
    • Neuronpedia demo here, access the weights on HuggingFace here, try out the Colab notebook tutorial here [1]
  • Key features of this relative to the previous Gemma Scope release:
    • More advanced model family (V3 rather than V2) should enable analysis of more complex forms of behaviour
    • More comprehensive release (SAEs on every layer, for all models up to size 27b, plus multi-layer models like crosscoders and CLTs)
    • More focus on chat models (every SAE trained on a PT model has a corresponding version finetuned for IT models)
  • Although we've deprioritized fundamental research on tools like SAEs (see reasoning here), we still hope these will serve as a useful tool for the community

Some example latents

Here are some example latents taken from the residual stream SAEs for Gemma V3 27B IT.

Layer 53, feature 50705
Layer 31, Feature 23266
Layer 53, feature 57326
Layer 53, feature 2878
Layer 53, feature 57326

 

What the release contains

This release contains SAEs trained on 3 different sites (residual stream, MLP output and attention output) as well as MLP transcoders (both with and without affine skip connections), for every layer of each of the 10 models in the Gemma 3 family (i.e. sizes 270m, 1b, 4b, 12b and 27b, both the PT and IT versions of each). For every layer, we provide 4 models (widths 16k and 262k, and two different target L0 values). Rather than giving the exact L0s, we label them "small" (10-20), "medium" (30-60) and "big" (90-150).

Additionally, for 4 layers in each model (at depths 25%, 50%, 65%, 85%) we provide each of these single-layer models for a larger hyperparameter sweep over widths and L0 values, including residual stream SAEs with widths up to 1m for every model.

Lastly, we've also included several multi-layer models: CLTs on 270m & 1b, and weakly causal crosscoders trained on the concatenation of 4 layers (the same 4 depths mentioned above) for every base model size & type.

All models are JumpReLU, trained using a quadratic L0 penalty along with an additional frequency penalty which prevented the formation of high-frequency features. We also used a version of Matryoshka loss during training, which has been documented to help reduce the incidence of feature absorption.
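For readers who haven't worked with JumpReLU SAEs before, here is a minimal sketch of the forward pass described above; the parameter names are illustrative, and the actual checkpoints and tutorial notebook define their own loading conventions.

```python
# Minimal sketch of a JumpReLU SAE forward pass (hypothetical parameter names).
import torch

class JumpReLUSAE(torch.nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = torch.nn.Parameter(torch.zeros(d_sae, d_model))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))
        self.threshold = torch.nn.Parameter(torch.zeros(d_sae))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre_acts = x @ self.W_enc + self.b_enc
        # JumpReLU: zero out activations below a learned per-latent threshold.
        return pre_acts * (pre_acts > self.threshold)

    def decode(self, acts: torch.Tensor) -> torch.Tensor:
        return acts @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))
```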

Table of available models, taken from the technical report.

Which ones should you use?

If you're interested in finding features connected to certain behavioural traits (to perform steering, or to better attribute certain model behaviours, or analyze directions you've found inside the model using supervised methods etc), we recommend using the residual stream models trained on a subset of the model layers (e.g. here). The 262k-width models with medium L0 values (in the 30-60 range) should prove suitable for most people, although the 16k and 65k widths may also prove useful. All the examples in the screenshots above were from 262k-width medium-L0 SAEs finetuned on Gemma V3 270m IT.

If you're interested in doing circuit-style analysis e.g. with attribution graphs, we recommend using the suite of transcoders we've trained on all layers of the model, e.g. here. Affine skip connections were strictly beneficial so we recommend using these. Models with larger width lead to richer analysis, but the computational cost of circuit-style work can grow very large especially for bigger base models, so you may wish to use 16k width rather than 262k. Neuronpedia will shortly be hosting an interactive page which allows you to generate and explore your own attribution graphs using these transcoders.
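In the same spirit, here is a rough sketch of an MLP transcoder with an affine skip connection (mapping MLP input to MLP output); again, the parameter names are assumptions rather than the release's actual API.

```python
# Minimal sketch of a JumpReLU MLP transcoder with an affine skip connection.
import torch

class SkipTranscoder(torch.nn.Module):
    def __init__(self, d_model: int, d_transcoder: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.zeros(d_model, d_transcoder))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_transcoder))
        self.W_dec = torch.nn.Parameter(torch.zeros(d_transcoder, d_model))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))
        self.threshold = torch.nn.Parameter(torch.zeros(d_transcoder))
        # Affine skip connection: a learned linear map added to the decoded output.
        self.W_skip = torch.nn.Parameter(torch.zeros(d_model, d_model))

    def forward(self, mlp_input: torch.Tensor) -> torch.Tensor:
        pre_acts = mlp_input @ self.W_enc + self.b_enc
        acts = pre_acts * (pre_acts > self.threshold)   # JumpReLU, as in the SAEs
        return acts @ self.W_dec + self.b_dec + mlp_input @ self.W_skip
```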

Some useful links

Here are all the relevant links to go along with this release:

 

  1. ^

    The ARENA material will also be updated to use this new suite of models, in place of the models from the 2024 Gemma Scope release.



Discuss

Keeping Up Against the Joneses: Balsa’s 2025 Fundraising Letter

2025-12-23 05:32:52

Published on December 22, 2025 9:32 PM GMT

Balsa Research is a tiny nonprofit research organization currently focused on quantifying the impact of the Jones Act on the American economy, and working towards viable reform proposals.

While changing century-old policy is not going to be easy, we continue to see many places where there is neglected groundwork that we’re well positioned to do, and we are improving at doing it with another year of practice under our belts. We’re looking to raise $200,000 to support our work this giving season, though $50,000 would be sufficient to keep the lights on, and we think we are also well positioned to do more with more funding.

Donations will support Balsa’s policy advocacy, either in Jones Act and maritime policy reform or potentially in other planned cause areas (housing reform and NEPA reform) if there is capacity to significantly expand.

Donate here to fund our mainline policy work.

One additional possibility for Balsa, which would be funded entirely separately if it did happen, is for Zvi Mowshowitz to use Balsa as a piece of philanthropic infrastructure to help guide new philanthropic money coming online in 2026 if there is demand. Contact us ([email protected]) if you would like to be involved in such an effort in any capacity, or want to authorize this as a potential use of your funds.

Donate here if you are interested in contributing fully flexible funding.

What Balsa Did in 2025

Quite early in the year, Balsa's plans for Jones Act investigative work were derailed by a certain Section 301 Investigation, which I wrote about here. In short, the USTR was proposing two significant changes to maritime transport: a $3-5 million fee for Chinese-built ships to deliver imports to American ports, and new, Jones Act-tier restrictions on up to 20% of American maritime exports. All of American industry focused on lobbying against the legibly bad first proposal; sadly, no one else was on the ball about how bad the second proposal was, because it required a slightly more sophisticated argument. So Balsa stepped in, wrote up a public comment, and presented it to the USTR during their public hearing on the proposal. At least in part due to our research and our outreach to maritime industry players, this proposal was basically entirely axed.

After our mid-year write-up on the whole adventure, Balsa did also end up submitting a second comment in response to what we felt was a deeply counterproductive tariff scheme in the updated proposal. This was the first arc played out in miniature: after functionally scrapping both major proposals from the first round, the USTR was proposing that an increasing percentage of American LNG must be shipped out on U.S.-built LNG tankers (there are currently zero in the fleet and no capacity for the shipyards to build any new ones) and that all port crane parts made in China be subject to 100% tariffs. Everyone focused on lobbying against the first policy change, which was obviously bad; the second was bad in a more subtle way. So it was once again up to Balsa to point out that the port crane tariffs were structured in a way counterproductive to stated U.S. policy, would incentivize American ports to buy their cranes from Chinese manufacturers instead of manufacturers in allied countries (there is no domestic port crane manufacturing capacity), and would negatively impact port revitalization investments that need to happen.

One piece of good news is that President Trump signed a trade deal with China in November, which resulted in a one-year suspension of all of the punitive measures proposed in the Section 301 investigation. We think there’s a decent chance that the suspension might become indefinite, but it still seemed like a good use of our time to write up our objections should the measures resume in 2026.

We also worked on the Jones Act. We launched a new RFA to investigate the labor impacts of the Jones Act. This is meant to complement our first RFA, which invites academics to look at the economic impacts of the Jones Act. Both are open for applications!

You may also recall that we had already given out grants for two different studies under the first RFA, on economic impacts. These papers are still in the process of being written. We remain confident in both teams and look forward to seeing their results in 2026.

We shored up a few places where we felt like some of the groundwork done by others on the Jones Act was either neglected or outdated. We published two pieces: The Jones Act Index, which works as a very short overview of all the myriad dysfunctions of the current domestic maritime industry, and an operational analysis of what exactly the 93 extant Jones Act-eligible vessels get up to.

Besides all that, there is of course the frustratingly intangible work of networking and building a deeper understanding of the shape of the problem. We conducted over forty conversations with stakeholders across the maritime policy landscape, including domestic shipping operators, port executives, and congressional staff. These conversations directly informed our operational analysis of Jones Act vessels and helped us identify which reform framings resonate (and which don't) with different constituencies. We've compiled this primary research into internal documentation mapping stakeholder positions, constraints, and potential pressure points—groundwork that will directly inform our policy binder and draft reform proposals.

Additionally, in the last few months of the year, we brought on a very part-time contractor to help with shipping out more of our policy work. 

A quick glance at our budget

A breakdown of our 2025 spend to the nearest thousand, for a total of ~$143k:

  • $87,000 in wages (Jenn at 35 hours a week and a policy analyst at 10 hours a week)
  • $0 for Zvi Mowshowitz
  • $45,000 in research grants to RFA applicants
  • $7,000 in travel and conference expenses
  • $2,000 in accounting services
  • $1,000 in legal, compliance, and nonprofit registration fees
  • $1,000 in software, subscriptions, and office supplies

Balsa in 2026, and Our Ask

Considering Balsa’s size, unless fundraising goes exceedingly well, we plan to stay focused on the Jones Act and maritime policy until we crack this nut (i.e. deliver the policy binder) instead of diverting attention across different policy streams.

Currently, the people working on Balsa work are Jenn (me, full time-ish), a contractor who works ten hours a week, plus Zvi Mowshowitz in an advisory capacity. In 2026, we’d like to bring our contractor or another policy analyst on full time, because my own time is somewhat constrained by the overhead of maintaining a 501(c)(3) nonprofit. The amount of funding we have in reserve gives us a decent amount of runway, but is insufficient for our grantmaking and hiring ambitions.

We’re looking to raise $200,000, which would be enough to bring on our contractor full-time and give us a reasonable amount of buffer for additional research funding that we would like to disburse. However, we think $50,000 is the minimum for Balsa to be viably funded to the end of 2026. 

Here’s what we plan on doing in 2026, should we hit our fundraising goal:

Complete the Jones Act policy binder

This is the core deliverable that everything else feeds into, that was waylaid by our Section 301 work. The binder will include a short executive summary of the case for reform; one-pagers on specific impacts; a longer technical document synthesizing our funded research and the existing literature; and a FAQ addressing common objections. Much of the work is filling gaps identified through stakeholder conversations, and interpreting the information for specific audiences.

Receive and publicize findings from the two funded economic studies

Both teams are expected to submit their papers in 2026. Once results are in, we'll write accessible summaries for non-academic audiences, brief interested Hill offices, and incorporate findings into the policy binder.

Fund at least one high-quality study from the labor RFA

The labor angle is underexplored in existing Jones Act research and useful for engaging unions constructively. We're looking for proposals examining questions like: How many jobs does the Jones Act actually protect, and in which states? What's the counterfactual employment picture under reform? What are the job creation effects in industries currently harmed by high shipping costs? A rigorous study here could shift the conversation toward a more nuanced understanding of net labor market effects.

Continue monitoring Section 301 and SHIPS Act developments, contributing input where it seems high-leverage

The one-year suspension of Section 301 measures expires in late 2026, and if negotiations with China stall, the proposed port fees and export restrictions could return; we'll track developments and be prepared to submit updated comments or testimony. The SHIPS for America Act proposes expanded cargo preference requirements facing similar vessel availability problems to those we identified in Section 301, and we're developing analysis of cargo preference laws we can deploy if this legislation gains momentum. The goal is readiness to contribute when high-leverage, without letting monitoring consume time that should go toward the policy binder.

What Additional Funding Enables

We can do even more with additional resources:

  • We can fund additional academic studies to strengthen the empirical case for reform, complementing our existing research initiatives, as we discover new opportunities. We estimate that each additional study costs around $30,000 to fund.
  • Zvi is not taking any payment for his work currently, but at a sufficiently high level of funding, this could change and he would dedicate more of his attention to the project. In addition, there is still an abundance of out-of-work policy analysts in DC whom we could hire.
  • With more funding and interest, we'd also look into spinning up a 501(c)(4) to use going forward for more direct political advocacy. Though of course the 501(c)(4) would then require its own fundraising work, since we can't mix the funds.

Donating is not the only way to give. If you have experience with maritime shipping, naval procurement, connections to labor unions, or anything else you think might be relevant to Jones Act reform, we'd be interested in talking to you and hearing your perspective. Get in touch at [email protected] and let us know how you might be able to help, whether that's sharing your insights, making introductions, or contributing in other meaningful ways.

If you're an economist positioned to publish in peer-reviewed journals, please consider applying to our economy or labor RFAs, and doing direct research on the issue. If you have friends who fit that profile and might be interested in this kind of work, please consider forwarding the RFAs their way.

Balsa Research is still a very small organization (me, another policy analyst at ten hours per week, and Zvi in an unpaid, very part-time advisory role) and our progress this year has been possible only through the generous support of our donors and the many people who have shared their time and expertise with us. We're grateful for this community of supporters and collaborators who continue to believe in the importance of this work.



Discuss

[Intro to AI Alignment] 0. Overview and Foundations

2025-12-23 05:20:45

Published on December 22, 2025 9:20 PM GMT

This post provides an overview of the sequence and covers background concepts that the later posts build on. If you're already familiar with AI alignment, you can likely skim or skip the foundations section.

0.1 What is this Sequence?

This sequence explains the difficulties of the alignment problem and our current approaches for attacking it. We mainly look at alignment approaches that we could actually implement if we develop AGI within the next 10 years, but most of the discussed problems and approaches are likely still relevant even if we get to AGI through a different ML paradigm.

Towards the end of the sequence, I also touch on how competently AI labs are addressing safety concerns and what political interventions would be useful.

0.1.1 Why am I writing this?

Because in my opinion, no adequate technical introduction exists, and having more people who understand the technical side of the current situation seems useful.

There are other introductions[1] that introduce problems and solution approaches, but I don't think they give people the understanding needed to evaluate whether the solution approaches are adequate for solving the problems. Furthermore, the problems are often presented as disconnected pieces, rather than as components of the underlying alignment problem.

Worse, even aside from introductions, there is rarely research that actually looks at how the full problem may be solved, rather than just addressing a subproblem or making progress on a particular approach.[2]

In this sequence, we are going to take a straight look at the alignment problem and learn about approaches that seem useful for solving it - including with help from AIs.

0.1.2 What this Sequence isn’t

  • It’s not an overview of what people in the field believe. We will focus on getting technical understanding, not on learning in more detail what experts believe. I’m taking my understanding and trying to communicate the relevant basics efficiently - not trying to communicate how many experts think about the problem.
  • It’s not an overview of what people are working on, nor an exhaustive overview of alignment approaches. As we will discuss, I think our current approaches are not likely to scale to aligning fully superhumanly smart AIs. There’s a lot of useful work that’s being done to make our current approaches scale a little further, which may be great for getting AIs that can do other useful work, but this sequence focuses more on the bigger picture rather than discussing such work in much detail. Many speculative approaches may also not get mentioned or explained.
  • It’s not necessarily introducing all the concepts that are common in the AI alignment discourse. In particular, we’re going to skip “outer and inner alignment” and are going straight to looking at the problem in a more detailed and productive way.

0.1.3 Who is this for?

Any human or AI who wants to technically understand the AI alignment problem. E.g.:

  • Aspiring alignment researchers who want to better understand the alignment problem. (Parts may also be useful for existing researchers.)
  • Scientists and technically-minded people who want understanding to evaluate the AI alignment situation themselves.
  • People working in AI governance who want to ground their policy thinking in a technical understanding of how much we seem to be on track to make future AIs nice.

0.1.4 Who am I?

I am an AI alignment researcher who worked on alignment for 3.5 years, more in this footnote[3].

0.2 Summary of The Whole Series

Here are the summaries of the posts written so far [although as of now they are not yet published]. This section will be updated as I publish more posts:

  • Post 1: Goal-Directed Reasoning and Why It Matters. Why would an AI "want" anything? To solve novel problems, a mind must search for plans, predict outcomes, and evaluate whether those outcomes achieve what it wants. This "thinking loop" maps onto model-based reinforcement learning, where a model predicts outcomes and a critic evaluates them. We'll use model-based RL as an important lens for analyzing alignment—not because AGI will necessarily use this architecture, but because analogous structure appears in any very capable AI, and model-based RL provides a cleaner frame for examining difficulties. The post also argues that for most value functions, keeping humans alive isn't optimal, and that we need to figure out how to point an AI's values.
  • Post 2: What Values May an AI Learn? — 4 Key Problems. We cannot test AI safety in the domain where it matters most—conditions where the AI could take over. So we need to predict how values generalize across this distributional leap, and getting it wrong could be catastrophic. Using concrete examples, we analyze what a critic might learn and identify four key problems: (1) reward-prediction beats niceness, (2) niceness isn't as simple as it may intuitively seem to us, (3) learned values may be alien kludges, (4) niceness that scales to superintelligence requires something like CEV.

[Those two posts should get posted within the next 2 weeks, possibly tomorrow. After that it may take a while, but hopefully around 1 post per month on average.]

0.3 Foundations

0.3.1 Orthogonality

The orthogonality thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.

In particular, being smart does not automatically cause an agent to have “better” values. An AI that optimizes for some alien goal won’t just realize when it becomes smarter that it should fill the universe with happy healthy sentient people who live interesting lives.

If this point isn’t already obvious to you, I recommend reading this page.

0.3.2 Very Basic Knowledge of How Current AIs Work

If you know almost nothing about how current AIs work, watch this brief video. More knowledge isn’t required for following along, although feel free to watch some more videos in that series.

0.3.3 The Power of Intelligence

Intelligence is the power that allows humanity to build skyscrapers, cure diseases, and walk on the moon.

I expect the power of intelligence is already obvious to you if you’re reading this, but here is a nice video about it: The Power of Intelligence (text version here).

0.3.4 The Intelligence Explosion and Superintelligence

Once AI systems become as good as the best humans at AI research, we likely get an intelligence explosion: smarter AIs can make faster AI progress, which leads to even smarter AIs even faster, and so on.

Since smarter minds can often make much faster progress on intellectual problems, this feedback loop seems likely to be superexponential - perhaps hyperbolic[4], where in theory an infinite amount of progress would happen in finite time, although of course in practice only until you run into limits.

The upper limits on machine intelligence appear to be extremely high. The human brain's learning algorithm is likely far more efficient than current deep learning methods—yet the brain itself is almost certainly nowhere near optimal. And AI hardware is remarkably powerful: a single H100 GPU can perform around 2×10¹⁵ operations per second, which may be comparable to or exceed the brain's computational throughput, depending on how you estimate it. xAI's Colossus datacenter has a compute capacity equivalent to 300,000-350,000 H100 GPUs. Those chips have very high communication bandwidth, so in principle a datacenter could operate more like one gigabrain than like lots of individual brains needing to communicate slowly the way humans do.

We cannot easily imagine what a mind much smarter than humans would be like. One useful substitute, suggested by Eliezer Yudkowsky, is to imagine a large civilization of supergeniuses, all running at 10,000× human speed, with perfect memory and the ability to share knowledge instantly. (For a vivid exploration of this, see Yudkowsky's short story That Alien Message (video here).)

Technologies that currently seem far off, like advanced nanotechnology, might arrive much sooner than we'd expect from extrapolating human research progress, because a superintelligence can make much, much faster progress than all of humanity combined.

For more on these dynamics, see Optimization and the Intelligence Explosion and AI 2027 (video here).

0.3.5 The Current AI Trajectory

Although timelines don’t play a huge role for this sequence, I want to briefly mention that superhumanly intelligent AI might come soon.

Measurements by METR show the task-completion time horizon of AIs seems to be consistently doubling around every 6-7 months.

Prediction markets reflect substantial probability of near-term AGI. The Manifold Markets AGI series currently shows ~9% by 2027, ~26% by 2029, ~36% by 2030, and ~50% by 2033.

At the time of publication, the team behind AI 2027—expert AI researchers and forecasters—considered 2027 the most likely single year for AGI to be developed (although not their median guess); they now expect timelines to be a chunk longer.

Even if current approaches plateau, history suggests another paradigm shift (like transformers were for deep learning) is likely within the next 15 years.

  1. Like the DeepMind AGI Safety Video Course or AIsafety.dance (although I only very roughly skimmed the latter). ↩︎

  2. The most notable exception for deep learning (DL) alignment is Joe Carlsmith’s sequence on “how do we solve the alignment problem”. I have some disagreements and don’t think it does a great job at clarifying the key difficulties of alignment, but hey, lots of credit to Joe for writing that sequence! There are some other attempts towards an overall discussion of alignment by Holden Karnofsky, and if you’re charitable you could count the Deepmind Safety Plan, but not much. ↩︎

  3. I basically started to take an ambitious shot at the alignment problem—tried to think concretely about how we might be able to create very smart AI with which we could make the future end up well, which gave me a decent theoretical understanding of key difficulties. It looked to me that we might need a much more understandable and pointable AI paradigm, so I went to work on that. I started out from agent foundations and ontology identification research, and then developed my own agenda for better understanding minds, which involves more concrete analysis of observations. To be clear, that was a longshot and I hoped we had more than 10 years of time left. Even though it was not the main focus of my research, I still know quite a lot about hopes for DL alignment, and this fall I’ve been reading up in more detail on some hopes in order to better evaluate how feasible DL alignment is. Also feel free to check my LW page. ↩︎

  4. Hyperbolic growth isn't unprecedented—the human economy grew hyperbolically until around 1960. Since then it has been "only" growing exponentially, presumably because the relative population growth rate went down a lot. If anything, we should expect the returns to higher intelligence to be even larger than the returns to having more humans. Especially above genius level, small increases in intelligence can have a very outsized impact. E.g. Einstein was able to solve some problems faster than the rest of humanity combined could have—he postulated multiple correct theories where most physicists of his time thought for years that they were wrong. ↩︎



Discuss

$500 Write like lsusr competition

2025-12-23 04:09:49

Published on December 22, 2025 8:09 PM GMT

It's been a while since I hosted a competition. This time there is prize money! $500.


Do you think you're a better writer than me? Now is your chance to prove it.

I want to know whether human beings can pass for me. Your mission, should you choose to participate, is to write a Less Wrong post that is indistinguishable from one that I would write.

Schedule [Tentative]

  • December 22 (today): Competition is announced. You may begin posting entries.
  • December 28: Last day to submit entries.
  • December 31: Prediction markets resolve at exactly the new year, Berkeley timezone, when 2025 becomes 2026.
  • January 4: Last day for winners to claim prizes.

All timezones are Berkeley Time (UTC−12).

This schedule is tentative because I often encounter technical difficulties when running competitions like this.

About your entries

  • Create a new Less Wrong account. This way nobody will be able to figure out who you are just from the account name. (You may submit multiple posts from a single Less Wrong account. However, each account is only eligible for a single cash prize.)
  • From that same account, create a comment on this post with a link to your post so others can read it and so I know to create a prediction market for it.
  • AI slop will be disqualified. Any entry written by a well-intentioned human being is permitted. To my knowledge, people have abided by the honor system in my previous contests. If you want to be certain whether a post counts as "AI slop", then you can DM me privately to ask.
  • AI is allowed for research, spellchecking, and asking questions like "Does this thing I wrote sound like something lsusr would write?" Having it correct a word or phrase on occasion is fine. Just don't have it write the thing for you.
  • All of your entries' post bodies must start with "[This is an entry for lsusr's write-like-lsusr competition.]". Example:

Lorem Ipsum

[This is an entry for lsusr's write-like-lsusr competition.]

Dolor sit amet.

About lsusr's entries

  • Lsusr will enter an integer number of posts into this competition. That number might be zero.
  • Lsusr entries (if they exist) will not be posted under the username lsusr.
  • Lsusr entries (if they exist) will not cheat by doing things like filming a video of myself or using a ghostwriter. My entries are allowed to embed old videos from my YouTube channel.
  • Lsusr entries (if they exist) will follow my usual ethical guidelines, which means they're things I believe are true and I believe they are high-effort. By "high-effort" I mean that I feel posting them on Less Wrong makes the world a better place.
  • Lsusr entries (if they exist) will be no less than 500 words.
  • Lsusr entries (if they exist) will use AI for no part of the competition other than research, fact-checking and spell-checking.
  • Lsusr entries (if they exist) will start with "[This is an entry for lsusr's write-like-lsusr competition.]".
  • Lsusr entries (if they exist) will be written by me, lsusr.
  • I will not intentionally manipulate the prediction markets.

If you violate these constraints, then you will be disqualified because it should be obvious to everyone that the entry was not written by lsusr. (Except the AI constraint, which is largely honor system.)

Evaluation

The judging of this competition will be done by prediction markets.

I will create one YES/NO question on Manifold for each entry. Each entry will be seeded with 100 mana. I have a total budget of 10,000 mana. If >100 posts are submitted, then I may have to ditch the original plan and use a more complicated system instead. Maybe I'll increase the mana budget. Hopefully Manifold will not ban me for creating too many markets? IDK. I have never done this before.

You may notice that 10,000 mana is worth significantly less than the $500 prize pool. You may wonder whether arbitrage via deliberate manipulation of the prediction markets is permitted. Yes, it is permitted. Does this mean that you are allowed to buy a bunch of YES or NO right before the prediction markets expire? Yes it does.

Does this mean you are allowed to run a propaganda campaign in the comments? Yes, but please keep it contained to this competition. Feel free to keep your misinformation campaign confined to the comments sections of posts (including this one) that have voluntarily opted into the competition.

Anticipated Questions

  • Q: I have a post idea that I don't think would be mistaken for an lsusr post, but it's high-effort and human-written. May I enter it in the competition?

  • A: Yes.

  • Q: I want to win by manipulating the prediction markets without writing a post myself. What do I do?

  • A: Maybe team up with someone writing a real entry? There are no rules against collaboration. Plots are fun!

  • Q: Who funded this?

  • A: Me.

  • Q: Can I volunteer to help with this competition?

  • A: Yes! Creating, managing and resolving the prediction markets is a hassle I'd rather someone else deal with. If you'd like to volunteer for this duty, please message me. (If you are the one managing the prediction markets, then you are disqualified from participating in the competition and manipulating the markets.)

Prizes

The winner will be whoever has the highest % likely "written by lsusr". Second place will go to whoever has the second-highest % likely "written by lsusr". (Assuming there are <100 entries.)

  • 1st prize: $250
  • 2nd prize: $100
  • 3rd prize: $50
  • ?????: $100

Prizes will be distributed by Venmo. If you don't want to receive money via Venmo, then I can instead donate 110% of your prize to the charity of your choice or 120% to Our Partners in Health. You may request prizes in kind in lieu of a cash payment.

Potential Changes

I don't anticipate changing the rules, but just in case something goes horribly wrong or I made a mistake somewhere, I reserve the right to revise the rules over the next 48 hours.

Links to Prediction Markets

<coming soon>



Discuss