Published on May 20, 2025 3:48 PM GMT
"There is nothing more deceptive than an obvious fact."
― Arthur Conan Doyle
Anyone who's taken LSD, pulled an all-nighter, been bored, or fallen into a YouTube rabbit hole knows that time is subjective. Brains aren't clocks. They cheat constantly: compressing
sequences of events, deleting traumatic events, buffering sensory delays, stretching and compressing depending on how arousing whatever you're experiencing is.
Programmers enter multi-hour flow states without fatigue or hunger, usually helped by autism headphones and chemical stimulants.
When put in life-or-death situations, many report time slowing down. It's like the brain temporarily overclocks and hyper-saturates itself with sensory perception.
Notice how commuting to your job feels like a time skip? It's like mundane regular tasks simply don't exist to us.
When you travel, a few weeks turn into a whole-ass chapter of your life. A giant exotic memory palace. At the end you often feel like you're leaving an entire life behind.
Video platforms optimize for smoothness and hyperstimulus. They make you forget where videos start and end.
Time perception seems to me like fertile ground for exploits. It's not just fun; it feels necessary in a world where attention is farmed and sold. There's only so much
we can do to have a longer life, so why not try to make our lives denser?
Published on May 20, 2025 4:21 PM GMT
Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.
In this edition: The Trump Administration rescinds the Biden-era AI diffusion rule and sells AI chips to the UAE and Saudi Arabia; Federal lawmakers propose legislation on AI whistleblowers, location verification for AI chips, and prohibiting states from regulating AI.
Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.
Subscribe to receive future versions.
The Center for AI Safety is also excited to announce the Summer session of our AI Safety, Ethics, and Society course, running from June 23 to September 14. The course, based on our recently published textbook, is open to participants from all disciplines and countries, and is designed to accommodate full-time work or study.
Applications for the Summer 2025 course are now open. The final application deadline is May 30th. Visit the course website to learn more and apply.
On May 12th, the Department of Commerce announced that it had rescinded the Framework for Artificial Intelligence Diffusion, which was set to take effect May 15th. The rule would have regulated the export of AI chips and models across three tiers of countries, each with its own set of restrictions. (Other AI chip export controls, including those prohibiting sales to China, remain on the books.)
The announcement states that the Bureau of Industry and Security (BIS) will issue a replacement rule in the future. In the meantime, the BIS will focus on working to prevent US chips from being used in Chinese AI development. Bloomberg reports that new restrictions will focus on countries that have diverted US chips to China, including Thailand and Malaysia.
The Trump Administration wants to capture the global AI chip market. Though China has yet to export its own AI chips, the BIS will also issue guidance stating that the use of Huawei Ascend chips violates US export controls. This preemptive restriction supports the Trump Administration’s intent for the US to dominate the global AI chip market.
UAE and Saudi Arabia are set to receive hundreds of thousands of AI chips. Last week, Trump announced trade deals with the UAE and Saudi Arabia.
The UAE is set to receive up to 500,000 of Nvidia's most advanced chips per year, beginning in 2025. 100,000 of these would go to the Emirati firm G42, with the remainder going to U.S. companies building datacenters in the UAE. Following the deal’s announcement, G42 announced the construction of a five GW AI campus in Abu Dhabi—the largest AI infrastructure project anywhere in the world.
Nvidia announced a strategic partnership with Saudi Arabia’s new sovereign AI company, Humain. In the first phase of the partnership, Humain is set to receive 18,000 Blackwell chips. AMD also announced a partnership with Humain.
The chip sales affect several US priorities. The deals will direct large investments to US AI companies that might have otherwise gone to China (China is the leading source of revenue for both the UAE and Saudi Arabia). It will also allow US AI companies to circumvent compute capacity limitations imposed by the US’ energy grid.
Some US officials argue that the Trump Administration’s chip sales threaten to undermine the US’ lead in compute capacity, and consequently US national security, since compute capacity may soon become a key determinant of state power. However, it’s difficult to evaluate the sales’ overall effects on US interests, since the terms of the agreement are unclear.
A federal AI whistleblower protection act. Senate Judiciary Committee Chair Chuck Grassley introduced the Artificial Intelligence Whistleblower Protection Act, which would protect employees who come forward with information about harmful or illegal activities happening inside AI companies.
Current AI whistleblower protections aren’t effective. Currently, these sorts of laws only exist as a patchwork across jurisdictions, making it difficult for would-be AI whistleblowers to predict whether they would be protected. They also often only protect reporting violations of law. Because AI regulation is minimal, developer behavior that poses a threat to public safety may not violate any law.
AI companies can also require employees to sign NDAs preventing them from disparaging the company even after they leave. OpenAI had employees sign such an NDA, which it later discontinued after public pressure.
The AI Whistleblower Protection Act addresses these shortcomings. It covers disclosing any “substantial and specific” danger that AI developer behavior might pose to public safety, public health, or national security. It also prohibits AI companies from requiring employees to sign NDAs or other contracts that undermine their ability to make such disclosures.
A bill requiring location verification for AI chips. Senator Tom Cotton introduced the Chip Security Act, which would require location verification mechanisms for export-controlled AI chips.
The bill would strengthen US export controls by preventing AI chips from being smuggled into China. AI chip smuggling is a growing problem, with potentially 100,000 chips smuggled in 2024.
Currently, US officials struggle to determine what happens to AI chips once they’re shipped overseas. Location verification would allow export authorities to tell when a shipment of chips isn’t where it’s supposed to be, triggering further investigation.
A provision in a tax bill prohibiting states from regulating AI. The House Energy and Commerce Committee included a provision that would prohibit states from regulating AI in its markup of House Republicans’ tax bill.
Ever since California’s SB 1047 almost became law, AI companies have argued that states should be prohibited from regulating AI and that the issue should instead be left to the federal government. SB 1047 would have made AI companies liable for harm caused by their models.
However, the provision seems to run afoul of the Senate’s “Byrd Rule,” which prohibits policy provisions from being included in budget reconciliation bills.
See also: CAIS’ X account, our paper on superintelligence strategy, our AI safety course, and AI Frontiers, a new platform for expert commentary and analysis.
Published on May 20, 2025 3:34 PM GMT
2025-05-20
This guide is currently a research draft. If you are a whistleblower, DO NOT follow the guide as of today. I do not yet endorse following this guide as being sufficient to protect your safety. I am only sharing this to get research feedback. I will update this once I think the guide is good enough.
Why this guide?
Disclaimer
I broadly sort whistleblowers into three categories:
Who?
As a whistleblower, you have a few available strategies:
1. Leak summary, stay anonymous in the US
2. Leak summary, become public in the US
3. Leak summary, become public outside of the US
4. Leak summary, stay anonymous outside of the US
5. Leak US classified information, stay anonymous in the US
6. Leak US classified information, become public in the US
7. Leak US classified information, become public outside of the US
8. Leak US classified information, stay anonymous outside of the US
A summary would include a broad overview of the situation as you see it, but not a lot of details or any classified documents or recordings.
As per my reading of all the past NSA whistleblower cases:
2 is slightly better than 1, 3 and 4.
7 is far better than 5 and 6.
8 seems very unlikely.
Therefore this document is actually a guide for strategy 7 (leak US classified information, become public outside of the US).
Case studies
The more complex a plan you're following, the more planning is required. If you are willing to follow a more complex plan, you can ensure a slightly larger window of time for yourself, from when you steal the documents to when you are deanonymised.
Objective: leak US classified documents, leave the US, reveal identity to public outside the US, neither you nor anyone in your circle gets imprisoned
Case studies of worst case outcome
Here's the minimum plan you should execute. The order of steps in this plan is deliberate.
Consequences
Here's a more advanced plan that might get you a few more months of anonymity. The order of steps in this plan is deliberate.
Consequences
Case studies
Specific advice for you
Grand strategy - flow of information
Whether to execute 1
Whether to execute 2
Whether to execute 3
Case studies
I would strongly recommend distancing from your social circle over a period of multiple months, and not informing them about your plan. It is important not to raise any suspicion, as doing so could lead to you and them being imprisoned.
Apart from your lawyer, I do not recommend keeping anyone in-the-loop by default until you have obtained asylum.
Psychiatrists, extended family, work colleagues, journalists and so on will be part of the people investigated. I do not recommend informing any of these people.
Until your identity is safe to publicly reveal (and likely even after that), you cannot support anyone in your social circle living in the US sphere in any way. It could take many months before you can support them.
Consequences for social circle living in the US geopolitical sphere (who are not informed about your plan)
Family visits
Mental health
Mental health resources
Even before you actually start planning, you should consider setting up a relatively secure machine to do research and planning.
How to set up a secure machine for planning and research
Purchase a new Windows/Linux machine.
Purchase a USB drive and install Tails on it.
Reserve a separate room in your house where this machine is kept.
No mobile phone or other device allowed in this room.
No other person allowed in this room.
Important: Resist the urge to search for whistleblower-related info on your other devices, or on your home or corporate network over the clearnet.
Do not raise any complaints whatsoever via internal channels.
Select which documents to release.
Plausible deniability
Hardware
You can consider hiding a camera and mic on yourself to record important interactions you witness. This could include leadership admitting to any actions or values they would not admit to publicly.
Should you make recordings yourself?
Video recording guide
Resources
Time investment
Generic information
Text
[REDACTED]
Images
poppler-utils on Linux can convert a PDF to images, which can then be converted to BMP.
Audio
Video
ffmpeg -i video.VOB -vsync 0 -ss 01:30 -to 01:40 %06d.bmp
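A rough sketch of the PDF-to-image step mentioned under Images, assuming poppler-utils and Pillow are installed; the filenames are placeholders, not from any specific case:

```python
import glob
import subprocess
from PIL import Image

# Hypothetical input file; pdftoppm (from poppler-utils) writes page-1.ppm, page-2.ppm, ...
subprocess.run(["pdftoppm", "-r", "150", "document.pdf", "page"], check=True)

# Re-save each rendered page as BMP.
for ppm in sorted(glob.glob("page*.ppm")):
    Image.open(ppm).save(ppm.rsplit(".", 1)[0] + ".bmp")
```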
To do for self
To do for self
To do for self
Travel paths
My naive guess is your best bet is going to the Ecuador embassy in Moscow, Russia and requesting asylum in-person, not anonymously. You are likely to be imprisoned only if both Ecuador govt and Russian govt reject your requests.
My naive guess is that contacting consuls requesting asylum before reaching foreign soil is a bad idea. You should first fly out, before contacting your lawyer and making asylum requests.
If you are flying from the US to Russia, my naive guess is it makes more sense to fly to a third country that has flights from both Russia and the US. This will raise less suspicion when you apply for a visa or book tickets. This is especially important if you hold a security clearance. Once you are in the third country, you can book the second ticket.
Ask your lawyer for advice, after you have landed in Russia.
(It is possible to ask for more information from a lawyer before you leave the US. There is a low probability your lawyer will get you imprisoned. There's also a low probability they offer you useful personalised information which you cannot otherwise obtain from guides such as this one. Ideally someone who is not affiliated with you, such as me, should be getting this information and publishing it publicly in a guide.)
Case study: Snowden was approved for asylum in Russia, after being rejected by 20+ countries including countries in Latin America, countries in Europe, India and China. Later he was granted Russian citizenship.
Case study: Assange was approved for asylum in Ecuador embassy in UK. His room was bugged by cameras and eventually Ecuador govt revoked their approval, letting UK govt arrest him.
Along with the leaked documents, you can request govts to unilaterally promise asylum to the whistleblower without knowing the identity of the whistleblower.
You need funding for the following:
Legal expenses are by far your biggest expense, as per case studies.
Being cut off from your own funds is a potential risk.
Cryptocurrency
To do for self
To do for self
To do for self
As mentioned multiple times in the document, prefer making a plan that does not involve trusting any journalists.
List of somewhat trustworthy high-attention media operators (with securedrop/signal/email IDs)
Exhaustive list of high-attention media operators (with securedrop/signal/email IDs)
Published on May 20, 2025 3:26 PM GMT
In the comments section of "You can, in fact, bamboozle an unaligned AI into sparing your life", both supporters and critics of the idea seemed to agree on two assumptions:
What if, to ensure that at least some civilizations survive (and hopefully rescue others), each civilization picked a random strategy?
Maybe if every civilization follows a random strategy, they increase the chance of surviving the singularity, and also increase the chance that the average sentient life in all of existence is happy rather than miserable. It reduces logical risk.
History already is random, but perhaps we could further randomize the strategy we pick.
For example, if the random number generated using Dawson et al's method (after July 1st 00:00 UTC, using pulsar PSR J0953+0755 or the first publicly available pulsar data) is greater than the 95th percentile, we could all choose MIRI's extremely pessimistic strategy, and do whatever Eliezer Yudkowsky and Nate Soares suggest with less arguing and more urgency. If they tell you that your AI lab, working on both capabilities and alignment, is a net negative, then you quit and work on something else. If you are more reluctant to do so, you might insist on the 99th percentile instead.
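As a toy sketch of how a shared, unpredictable random draw could be derived from a public data release: the hashing scheme below is my own illustration, not Dawson et al's actual method, and the data string and strategy names are placeholders.

```python
import hashlib

def shared_random_fraction(public_data: bytes) -> float:
    """Derive a reproducible fraction in [0, 1) from public data (e.g. a
    published pulsar timing release), so anyone can verify the same draw."""
    digest = hashlib.sha256(public_data).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

# Hypothetical usage: everyone commits to the data release and the threshold
# *before* the data is published.
pulsar_release = b"...first public PSR J0953+0755 timing data after 2025-07-01..."
if shared_random_fraction(pulsar_release) > 0.95:
    strategy = "extreme-pessimist"  # defer to the MIRI-style playbook
else:
    strategy = "status-quo"         # keep to current strategy debates
```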
Does this make sense or am I going insane again?
Total utilitarianism objections
If you are a total utilitarian, and don't care about how happy the average life is, and only care about the total number of happy lives, then you might say this is a bad idea, since it increases the chance at least some civilizations survive, but reduces the total expected number of happy lives.
However, it also reduces the total expected number of miserable lives. Because if 0 civilizations survive, the number of miserable lives may be huge due to misaligned AI simulating all possible histories. If only a few civilizations survive, they may trade with these misaligned AI (causally or acausally) to greatly reduce suffering, since the misaligned AI only gain a tiny tiny bit by causing astronomical suffering. They only lose a tiny bit of accuracy if they decrease the suffering by 2x.
This idea is only morally bad if you are both a total utilitarian, and only care about happiness (not worrying about suffering). But really, we should have moral uncertainty and value more than one philosophy (total utilitarianism, average utilitarianism, etc.).
Published on May 20, 2025 2:13 PM GMT
On May 20, during her speech at the Annual EU Budget Conference 2025, Ursula von der Leyen, President of the European Commission, stated:
When the current budget was negotiated, we thought AI would only approach human reasoning around 2050. Now we expect this to happen already next year. It is simply impossible to determine today where innovation will lead us by the end of the next budgetary cycle. Our budget of tomorrow will need to respond fast.
This is remarkable coming from the highest-ranking EU official. It suggests the Overton window for AI policy has shifted significantly.
Published on May 20, 2025 12:54 PM GMT
We study how selective regularization during training can guide neural networks to develop predictable, interpretable latent spaces with alignment applications in mind. Using color as a test domain, we observe that anchoring even a single concept (red) influences the organization of other concepts, with related concepts clustering nearby — even with weak supervision. We then propose that concept-anchored representation engineering might enable more precise intervention in complex models without requiring extensive post-hoc interpretability work.
In our previous post, we proposed that anchoring key concepts to specific directions in latent space during training might make AI systems more interpretable and controllable. This post presents our exploratory findings as we work toward that goal, adapting and combining techniques from representation learning with a specific focus on alignment applications.
Rather than attempting to discover and modify latent directions after training (as in mechanistic interpretability), we're exploring whether it's possible to impose useful structure on latent spaces during training, creating a more interpretable representation from the start. Many of the techniques we use have precedents in machine learning literature, but our focus is on their application to alignment challenges and whether they might enable more controlled model behavior.
Using color as an experimental domain, we investigated whether simple autoencoders with targeted regularization could learn predictable latent space structures that organize concepts in ways we can understand and potentially control — with specific colors as a stand-in for "concepts"[1].
By intentionally structuring a portion of the model's internal representations during training, we aimed to know exactly where key concepts will be embedded without needing to search for them. Importantly, we don't constrain the entire latent space, but only the aspects relevant to concepts we care about. This selective approach allows the model to find optimal representations for other concepts while still creating a predictable structure for the concepts we want to control. We observed that other concepts naturally organized themselves in relation to our anchored concepts, while still maintaining flexibility in dimensions we didn't constrain.
Our experiments incorporate elements from established techniques in representation learning, prototype-based models, and supervised autoencoders, but with a specific adaptation: using stochastic, selective regularization with minimal concept anchoring (just a single concept) to influence broader representation organization without dictating the entire structure. We view this work as preparatory research for developing techniques to train more complex models in domains in which data labelling is hard, such as language.
In this post, "we" refers to "me and Claude 3.7 Sonnet," who helped draft this content. The underlying research was conducted with coding assistance from Gemini 2.5 Pro, GPT-4o, and GPT-4.1, and literature review by deep research in ChatGPT.
Our approach builds upon several established research directions in machine learning. Concept Activation Vectors (CAVs)[2] pioneered the identification of latent space directions corresponding to human-interpretable concepts, though primarily as an interpretability technique rather than a training mechanism. Work on attribute-based regularization[3] has shown that variational autoencoders can be structured to encode specific attributes along designated dimensions, while our work applies to simple autoencoders with no assumption of monosemantic representations. Recent research has also explored using concept vectors for modifying model behavior through gradient penalization in latent space[4]. Our work differs by proactively anchoring concepts during training rather than analyzing or modifying latent representations post-hoc.
Several approaches have explored similar ideas through different frameworks. Supervised autoencoder methods often incorporate geometric losses that fix class clusters to pre-determined coordinates in the latent space[5]. Similarly, center loss for face recognition trains a "center" vector per class and penalizes distances to it[6]. Prototypical VAE uses pre-set "prototype" anchors in latent space to distribute regularization[7]. These techniques all employ concept anchoring in some form, though our approach is distinctive in focusing specifically on the minimal supervision needed and examining how structure emerges around it rather than explicitly defining the entire latent organization.
In the contrastive and metric learning literature, techniques like supervised contrastive learning[8] cause samples of the same class to cluster together, but typically don't pin clusters to particular locations in advance. These methods achieve similar grouping of related concepts but don't provide the same level of predictability and control over where specific concepts will be represented. In unsupervised contrastive learning (e.g. SimCLR[9]/BYOL[10]), the clusters emerge solely from data similarities, with no predefined anchor for a concept.
Our interest in these techniques stems specifically from challenges in AI alignment. Current approaches like RLHF and instruction tuning adjust model behavior without providing precise control over specific capabilities. Meanwhile, mechanistic interpretability requires significant post-hoc effort to understand models' internal representations, with limited guarantees of success.
We're exploring whether a middle path exists: deliberately structuring the latent space during pre-training to make certain concepts more accessible and controllable afterward. This approach is motivated in part by findings on critical learning periods in neural networks[11], which demonstrated that networks lose plasticity early in training (even in the first few epochs!), establishing connectivity patterns that become increasingly difficult to reshape later.
Recent work on emergent misalignment further strengthens the case for latent space structure, showing that models fine-tuned solely to write insecure code subsequently exhibited a range of misaligned behaviors in unrelated contexts, from expressing anti-human views to giving malicious advice[12]. This suggests that even highly abstract related concepts naturally cluster together in latent space, making the organization of these representations a critical factor in alignment. Deliberately structuring these representations during pre-training may provide more direct control than purely post-hoc interventions.
If models develop more predictable representations for key concepts we care about (like harmfulness, deception, or security-relevant knowledge), we might gain better tools for:
While our experiments with color are far simpler than language model alignment challenges, they provide a testbed for techniques that might eventually scale to more complex domains.
We chose color as our initial experimental domain because it offers several properties that make it suitable for exploring latent space organization:
If we can demonstrate reliable control over a model's internal representation with color, it suggests we might achieve similar control in more complex domains. We assume that the core mechanisms of embedding and representation remain similar even when the domains differ greatly in complexity.
Our experiments used a simple architecture: a multilayer perceptron (MLP) autoencoder with a low-dimensional bottleneck. This architecture forces the model to compress information into a small number of dimensions, revealing how it naturally organizes concepts.
We used a 2-layer MLP autoencoder with RGB colors (3 dimensions) as input and output, and a 4-dimensional bottleneck layer. While RGB data is inherently 3-dimensional, we needed a 4D bottleneck because of our unitarity (hypersphere) constraint, described below.
This architecture creates a useful tension: the model must balance information compression against the structural constraints imposed through regularization.
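To make the setup concrete, here is a minimal PyTorch sketch of an autoencoder of this shape; the hidden width, activations, and output squashing are illustrative assumptions, not taken from our experiment code.

```python
import torch
import torch.nn as nn

class ColorAutoencoder(nn.Module):
    """Tiny MLP autoencoder: RGB (3) -> 4D bottleneck -> RGB (3)."""
    def __init__(self, hidden: int = 16, bottleneck: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # keep outputs in [0, 1] RGB
        )

    def forward(self, rgb: torch.Tensor):
        z = self.encoder(rgb)
        return self.decoder(z), z
```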
We imposed structure through several complementary regularization techniques:
Planarity penalized embeddings for using dimensions beyond the first two (using L2 norm on the 3rd and 4th dimensions with target length 0), encouraging the model to organize hues in a plane.
Unitarity encouraged embedding vectors to have unit length (using L2 norm on all dimensions with target length 1), pushing representations toward the surface of a hypersphere[14].
Angular repulsion encouraged distinct concepts to maintain distance from each other in the embedding space using cosine distance.
Concept anchoring explicitly pushed certain colors (like red) toward predetermined coordinates in the latent space using Euclidean distance. In retrospect, cosine distance might have been more appropriate given the spherical nature of our embedding space.
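The loss terms below are a rough sketch of these four regularizers as described; the specific anchor coordinate and the way repulsion pairs are formed are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical anchor direction for "red" in the 4D latent space.
RED_ANCHOR = torch.tensor([1.0, 0.0, 0.0, 0.0])

def planarity_loss(z):
    # Penalize any use of the 3rd and 4th dimensions (target length 0).
    return z[:, 2:].pow(2).sum(dim=1).mean()

def unitarity_loss(z):
    # Push embeddings toward the unit hypersphere (target length 1).
    return (z.norm(dim=1) - 1.0).pow(2).mean()

def repulsion_loss(z_a, z_b):
    # Minimizing cosine similarity pushes two sets of concept embeddings apart.
    return F.cosine_similarity(z_a, z_b, dim=1).mean()

def anchor_loss(z, anchor=RED_ANCHOR):
    # Pull labelled samples toward a predetermined coordinate (Euclidean distance).
    return (z - anchor).pow(2).sum(dim=1).mean()
```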
We applied these regularization terms selectively to different samples based on stochastic binary labels. The probability of a sample receiving a particular label was determined by formulas that capture the relevant color properties.
For the "red" label (measuring proximity to pure red), we calculated:
This formula yields high values for pure red (r=1, g=0, b=0) and rapidly diminishing values as green or blue components increase. The exponent of 10 creates a sharp falloff.
For the "vibrant" label (measuring proximity to any pure hue):
Where s and v are saturation and value in the HSV color space. The high exponent (100) creates an extremely sharp distinction between fully saturated, bright colors and even slightly desaturated or darker ones.
These continuous probabilities were then scaled (multiplied by 0.5) and converted to binary labels by comparison with random values. For example, pure red would be labeled "red" approximately 50% of the time, while colors progressively further from red would be labeled with rapidly decreasing probability.
This stochastic approach effectively creates a form of weak supervision, where the model receives imperfect, noisy guidance about concept associations. We designed it this way to simulate real-world scenarios like labeling internet text for pretraining, where labels might be incomplete, inconsistent, or uncertain. Our experiments show that such sparse, noisy signals can still effectively shape the latent space - an encouraging finding for applications where comprehensive labeling is impractical.
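The exact label formulas did not survive into this text, so the expressions below are plausible reconstructions consistent with the descriptions above (a sharp exponent-10 falloff away from pure red, and saturation times value raised to the 100th power for vibrancy); treat them as assumptions.

```python
import colorsys
import random

def p_red(r, g, b):
    # Reconstruction: ~1 for pure red, falling off sharply as green or blue increase.
    return (r * (1 - g) * (1 - b)) ** 10

def p_vibrant(r, g, b):
    # Reconstruction: high only for fully saturated, bright hues.
    _, s, v = colorsys.rgb_to_hsv(r, g, b)
    return (s * v) ** 100

def stochastic_labels(r, g, b, scale=0.5):
    """Scale the probabilities and sample noisy binary labels."""
    return {
        "red": random.random() < scale * p_red(r, g, b),
        "vibrant": random.random() < scale * p_vibrant(r, g, b),
    }
```

Under this scheme, pure red gets the "red" label about half the time, while even a slightly desaturated orange almost never does.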
Our experiments produced several interesting results:
The 4D latent space organized itself into a coherent structure where similar colors clustered predictably. When projected onto the first two dimensions, vibrant hues formed a clear circle resembling a color wheel, despite never explicitly telling the model to arrange them in this way.
Importantly, we didn't need to search for interesting projections after training. The structure formed as we intended due to regularization. While dimensionality reduction techniques often produce meaningful visualizations, what distinguishes our approach is the predictability and consistency in placing specific concepts in predetermined locations. This addresses a fundamental challenge in interpretability: knowing where in the latent space to look for specific concepts.
Here's a similar visualization, this time as a video.
The video[15] (and also the thumbnails in the figure above) shows the evolution of latent space:
The video also shows how the hyperparameters were varied over time, and the regularization loss terms.
Even with only a single anchored concept (red at its predetermined anchor coordinates), the entire latent space organized itself relative to this point. The other primary and secondary colors took positions spaced around the circle at regular intervals. This happened despite:
We observed that anchoring a single concept, combined with planar constraints, influenced how the model organized its latent space. While we only explicitly constrained a portion of the space (2 of 4 dimensions), the entire representation adapted to these constraints in our simple domain. This suggests potential for targeted regularization, though the extent to which such influence would propagate in higher-dimensional spaces (like those found in large language models) remains an open question.
One of our practical findings concerns the training methodology. We initially explored curriculum learning approaches where we gradually introduced more complex data. However, we discovered that simply training on the full color space from the beginning with selective regularization produced superior results:
This went against our initial intuition, but it simplifies implementation: designing curricula is tricky, whereas training on full data with selective regularization is more straightforward.
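As a sketch of what "full data plus selective regularization" might look like in a single training step, reusing the loss terms sketched earlier; which label gates which regularizer, and the loss weights, are our assumptions.

```python
import torch.nn.functional as F

def training_step(model, rgb, labels, opt, w):
    """One step on the full color space: reconstruction loss on every sample,
    regularizers applied only to samples carrying the relevant stochastic label.
    `labels` is a dict of boolean masks over the batch."""
    recon, z = model(rgb)
    # Assumption: unitarity applied to all samples, other terms gated by labels.
    loss = F.mse_loss(recon, rgb) + w["unit"] * unitarity_loss(z)
    if labels["red"].any():
        loss = loss + w["anchor"] * anchor_loss(z[labels["red"]])
    if labels["vibrant"].any():
        loss = loss + w["planar"] * planarity_loss(z[labels["vibrant"]])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```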
These initial results suggest some promising directions for concept-anchored representation engineering in an alignment context. While other approaches have explored structuring latent spaces through regularization, our specific exploration of anchoring minimal concepts during training with selective per-sample regularization offers several potential insights[16]:
These points suggest that anchored concepts might indeed act as attractors that organize related concepts in meaningful ways, though verification in more complex domains is needed.
Despite these promising signs, we acknowledge significant limitations:
Domain simplicity: Color has intrinsic geometric structure that may make it uniquely amenable to this approach. The RGB color space is already low-dimensional with clear, objective distance metrics. Language concepts likely occupy a much messier, higher-dimensional manifold with less obvious geometric relationships. The ease with which we found a color wheel structure may not transfer to domains where the "natural" organization is less clearly defined.
Architectural simplicity: Our experiments used tiny MLPs with a few hundred parameters. Modern language models have billions or trillions of parameters with complex, multi-layer architectures. While some work exists on regularizing transformer latent spaces (e.g., ConceptX, LSRMT), applying our specific approach of concept anchoring during training to shape transformer representations presents challenges we have yet to explore, particularly given how attention mechanisms create context-dependent representations.
Unknown trade-offs: There may be significant performance trade-offs between regularized structure and model capabilities. If regularization forces the model to organize concepts in ways that aren't optimal for its tasks, we might see degraded performance.
Supervision requirements: This technique requires some concept labeling. In language models, identifying and labeling instances of abstract concepts like "deception" or "harmfulness" is more subjective and challenging.
Despite these limitations, we remain optimistic. The robustness to noisy labeling, effectiveness of selective regularization, and the observed influence of our targeted constraints suggest that this approach deserves further exploration in more complex domains.
Our work so far has focused on exploring whether we can create structured latent spaces through selective regularization. Next, we'd like to see whether we can use this structure to actually control model behavior. Our next experiments will explore:
Our experiments explore how selective regularization can guide the formation of structured representations during training, potentially reducing the need for post-hoc interpretation. Using these adapted techniques, we created partially structured latent spaces where concepts we cared about were predictably positioned, while allowing the model flexibility in organizing other aspects of its representations.
While many of the individual techniques we've used have precedents in the literature on prototype learning, supervised autoencoders, and contrastive learning, our specific contribution lies in exploring: (1) how proactively structuring a portion of the latent space during training through stochastic weak supervision can yield predictable organization where it matters, (2) that explicitly constraining just one concept and one plane can influence nearby representations without dictating the structure of the entire space, (3) that the approach shows robustness to noisy, stochastic labeling, and (4) that such structured latent spaces can be shaped by selective per-sample regularization rather than comprehensive supervision.
These findings suggest that concept-anchored representation engineering could potentially provide a valuable approach for designing more interpretable and controllable neural networks, building on existing work in latent space structuring. Whether this approach can scale to language models and more complex domains remains an open question, but these initial results provide encouragement to continue exploring this direction. If we can reliably engineer the structure of relevant portions of latent representations during training, we might gain better tools for alignment - particularly for more precise control over what models learn and how they use that knowledge.
We welcome suggestions, critiques, and ideas for improving our approach. If you're working on similar research or would like to collaborate, please reach out. For those interested in replicating or building upon our experiments, we've made our code available in z0u/ex-color-transformer, with Experiment 1.7 containing the implementation of the selective regularization approach described in this post.
In this work, we distinguish between the semantic "concepts" we're interested in (like "red" or "vibrant") and their representation in the model's latent space, where related terms from the literature would include "latent priors" or "prototypes" — the anchor points or structures that guide how these concepts are encoded.
Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). International Conference on Machine Learning. arXiv:1711.11279
Hadjeres, G., & Nielsen, F. (2020). Attribute-based regularization of latent spaces for variational auto-encoders. Neural Computing and Applications. arXiv:2004.05485
Anders, C. J., Dreyer, M., Pahde, F., Samek, W., & Lapuschkin, S. (2023). From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space. arXiv:2308.09437
Gabdullin, N. (2024). Latent Space Configuration for Improved Generalization in Supervised Autoencoder Neural Networks. arXiv:2402.08441
Wen, Y., Zhang, K., Li, Z., & Qiao, Y. (2016). A Comprehensive Study on Center Loss for Deep Face Recognition. DOI: 10.1007/s11263-018-01142-4 [open access version on GitHub]
Oliveira, D. A. B., & La Rosa, L. E. C. (2021). Prototypical Variational Autoencoders. OpenReview hw5Kug2Go3-
While the specific implementation of Prototypical VAE by Oliveira et al. was retracted due to methodological concerns, we include this reference to acknowledge that the concept of prototype-based regularization in latent spaces has also been explored in other studies.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised Contrastive Learning. arXiv:2004.11362
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709
Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., & Piot, B. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. arXiv:2006.07733
Achille A., Rovere M., & Soatto S. (2017). Critical Learning Periods in Deep Neural Networks. arXiv:1711.08856
Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv:2502.17424
Epistemic status: 75%, based on:
- Word2Vec and similar embedding approaches showing meaningful geometric structure
- Recent mech interp work suggesting similar representation mechanisms across domains
- Emergent misalignment research indicating models tend to discover similar representational patterns.
In upcoming experiments with transformers, we intend to use hypersphere normalization as in nGPT. We think that normalization helps the optimizer to find meaningful representations, and we expect our DevInterp research to be most useful with architectures that lend themselves to well-structured latent spaces in other ways.
As an aside: I'm fascinated by these animated visualizations of latent space. Seeing how the space changes for every batch has given me insights into potential hyperparameter tweaks that would have been hard to find otherwise. Perhaps carefully selected metrics could have given similar insight if viewed as a static line chart, but I don't know which metrics they would be, nor how you could know in advance which ones to choose.
My confidence in these claims ranges from 90% to 60% (from top to bottom).