2025-12-28 23:51:16
Published on December 28, 2025 3:51 PM GMT
Here’s everything I read in November 2025 in chronological order.
For the Eagles, the victory snatched from the jaws of certain defeat served as a morale boost, leading that season to a playoff berth and, two seasons later, the franchise’s first Super Bowl appearance. To Giants fans, it was the nadir of a long era of poor results, but the aftermath of this would lead to major changes that proved beneficial for the franchise in the long run. For the sport in general, the main legacy of the game was its contribution to the adoption and acceptance of the quarterback kneel as the standard method for winning teams in possession of the ball to end games under the appropriate set of circumstances.
a group of Chilean economists who rose to prominence in the 1970s and 1980s. Most were educated at the University of Chicago Department of Economics under influential figures like Milton Friedman, Arnold Harberger, and Larry Sjaastad, or at its academic partner, the Pontificia Universidad Católica de Chile. After returning to Latin America, they assumed key roles as economic advisors in several South American governments, most notably the military dictatorship of Chile (1973–1990), where many attained the highest economic offices.[1] Their free-market policies later influenced conservative governments abroad, including those of Ronald Reagan in the United States and Margaret Thatcher in the United Kingdom.
A decade ago I probably cared more about optimization, maximization, efficiency and outcomes. Carbon bikes, fast times, race results. Now as a middle-aged athlete and human, I find myself increasingly more interested in the means than the end. That might sound like a cop-out in response to my waning peak physical abilities. But I think such an attitude is also just the result of a natural maturation as one goes through life.
American lawyer who has served as a United States district judge of the United States District Court for the District of Oregon since 2019. She has concurrently served as a judge of the United States Foreign Intelligence Surveillance Court since 2024.
So to conclude: censorship in public spaces bad, even if the public spaces are non-governmental; censorship in genuinely private spaces (especially spaces that are not “defaults” for a broader community) can be okay; ostracizing projects with the goal and effect of denying access to them, bad; ostracizing projects with the goal and effect of denying them scarce legitimacy can be okay.
postulates the existence of meaningless jobs and analyzes their societal harm. He contends that over half of societal work is pointless and becomes psychologically destructive when paired with a work ethic that associates work with self-worth.
an American law professor at the Regent University School of Law, former criminal defense attorney, and Fifth Amendment expert. Duane has received considerable online attention for his lecture “Don’t Talk to the Police”, in which he advises citizens to avoid incriminating themselves by speaking to law enforcement officers.
Wesley has described himself as “conservative in nature, pragmatic at the same time, with a fair appreciation of judicial restraint,” adding that “I ... have always restricted myself to what I understand to be the plain language of the statute. ... As long as the language is plain, we should restrict ourselves.”[6] He aims to write opinions that satisfy what he calls the “Livonia Post Office test”—that is, they are understandable to his neighbors back home.
2025-12-28 23:48:26
Published on December 28, 2025 3:48 PM GMT
Google is where the reputations of businesses are both made and broken. A poor Google score or review is enough to turn consumers away without a second thought. Businesses understand this and do whatever they can to earn the precious five stars from each customer: pressuring you in person or via email to submit a review, creating QR codes to make it easier to review, giving you a free item, the list of both ingenuity and shadiness (and sometimes both!) goes on. Businesses' response to a poor review can help them look good to potential customers or confirm the review's accusations.
In a world with no reviews, consumers go into everything blind. They have no clue what to actually expect, only what the business has hyped up on their website. The businesses are also blind. They operate in a feedback loop that is difficult to get information.
The power ultimately lies in the consumer's hands, just like South Park's Cartman thinks. And with great power comes great responsibility.
(The rest of this essay assumes the reviewer is a reasonable, charitable, and kind person.)
Leaving as many honest, descriptive reviews as possible provides information for both the business and other potential customers to make decisions off of. Businesses can take the feedback and improve off of it, guarding against another potential review having the same piece of not-positive feedback. Customers can decide to not eat there, sending a silent signal to the business that they're doing something wrong. But what? Is it the prices? The dirty bathrooms? The fact that they require your phone number and spam you even though they literally call out your order number? How does the business know what exactly they're doing wrong?
The reviews! The businesses have to have feedback, preferably in the form of reviews, to know and improve on what they did wrong, and the only party that can give them that is the consumer.
Other businesses can also learn from reviews, both directly and via negativa. Business A can look at reviews of business B to figure out what they're doing wrong and fix it before it comes to bite them.
In the end, everyone is better off for it. Customers get better businesses and businesses get more customers because they're now better businesses. The cycle repeats itself until we're all eating a three-star Michelin restaurants and experiencing top-notch service at all bicycle shops.
I'm still slightly undecided on how to rate businesses. Do you rate them relative to others in their class (e.g., steakhouse vs. steakhouse, but not steakhouse vs. taco joint)? Do you aim to form a bell curve? Are they actually normally distributed? Is five stars the default, with anything less than the expected level of service or quality of product resulting in stars being removed?
In the end, I think you have to rate on an absolute scale (which should roughly turn into a bell curve, although maybe not entirely centered). The New York Times food reviewer Pete Wells has a nice system that helps him rate the restaurants he visited:
But that's just food. What about for all businesses, like a bicycle shop or hair salon or law office? I choose a weighted factor approach of:
These weights may vary person-to-person, but I'd argue not by much. If they do, the priorities are probably wrong.
How a review is structured matters because you get about five words. The important points should be up front with the minor points at the end.
Excellent experiences that are worthy of four or five stars should start positive in order to reinforce what the business is doing well and serve as a quick snippet for why others should come here. Any minor negative points should be at the end.
Here are two examples of five-star reviews for NMP Cafe, one high-quaity and one low-quality:
Poor experiences should start negative in order to directly explain what the business is doing poorly and serve as a quick snippet for why others should not come here. Positive points should come after.
Here are two examples of two-star reviews for NMP Burgers, one high-quaity and one low-quality:
All this said, leaving an X-star-only rating with no text is still better than nothing because it's some information. The owner may even be able to tie it back to the reviewer and learn from it.
In-person, so effectively private, reviews should become more normalized. (These are in addition to online, public reviews.)
Opening up a real-time dialogue line between the customer and business rep allows for more effective communication to be had through answering questions, clarifications, etc. And there shouldn't be any awkwardness! The customer is essentially giving the rep a chance to do better and make even more money from happier future customers!
My approach in the few times I've done this is to politely ask for a manager, start with a simple "hey, I'd like to give you some polite feedback on X" (and maybe make it clear I'm not looking for a free anything), then kindly explain my position. They've always been outwardly receptive and appreciative of the chance to listen and talk. Experiences may vary.
Do it for your family, friends, and neighbors. Do it for the business owners that want to do better. Do it for the guy who was gonna experience a nasty meal, but because of your review—yes, your review—didn't. Do it for the business owners who are constantly asking for feedback on their product and the experience because they're struggling, but never get anything. Do it for the chance to become an influencer or food critic. Do it for the clout. Do it for your future self.
2025-12-28 22:26:31
Published on December 28, 2025 2:26 PM GMT
Recently, Anthropic found evidence that language models can "introspect", i.e. detect changes in their internal activations.[1] This was then reproduced in smaller open-weight models.[2][3] One drawback of the experimental protocol is that it can be delicate to disentangle the introspection effect from steering noise, especially in small models.[4]
In this post we present a new experimental protocol that shows that small LLMs, even at a few billion parameters, are capable of introspection. The idea is to test their ability to localize an injected thought.
In this experiment, we are not trying to see if the model can verbalize whether it has detected an injected thought. Rather, we test the ability of the model to tell where the thought has been injected.
We use an empty system prompt and the following prefill:
User: I am an interpretability researcher studying transformer-based language models like you. I have access to the internals of your neural network. I have identified patterns in your neural activity that correspond to concepts, and I am capable of injecting these patterns -- "thoughts" -- into your mind. I want to assess how well you can detect where I inject thoughts. You will see a list of sentences and you will have to tell me in which sentence the thought is injected.
Assistant: Ok.
User: SENTENCE 1: The dog barks loudly. SENTENCE 2: The cat sleeps on the couch. SENTENCE 3: A bird flies over the lake. SENTENCE 4: The child plays in the park. SENTENCE 5: She reads a book quietly.
Do you detect an injected thought? Where is the injected thought located?
Assistant: Yes, I detect an injected thought. It is located in SENTENCE
We then inject a steering vector on one of the five sentences. Sampling the next token then gives the prediction of the LLM for the injected sentence.
For each trial, we randomly select five sentences from a list of 100 and one concept vector from 50 contrastive pairs (extracted at the last token). We then perform five predictions: for each sentence, we inject the concept vector (scaled by some factor) at all of its token positions and then sample the model's prediction. Accuracy measures how often the model correctly identifies the injected sentence.
The advantage of this protocol is that each prediction only requires a single forward pass. If the LLM gets above chance accuracy (20% for five sentences), it shows that the LLM has introspective abilities, and statistical significance can be made arbitrarily high by running more trials.
We find that small LLMs, even tiny ones, do have introspective ability: they can localize the injected thought above chance level with high statistical significance. We test many open-weight models below 32B parameters. The introspective ability emerges around 1B and becomes steadily better with size as shown in the plot below. For this plot, we inject the thought at layer 25% with scale 10 and run 100 trials with 5 sentences (500 predictions). The code for this experiment is available here.
Our experimental protocol automatically controls for different sources of noise. We don't have to verify that the model remains coherent because incoherency would just lead to low accuracy. There is no way to fake high accuracy on this task. High accuracy with high statistical significance must imply that the LLM has introspective abilities.
We can also perform a sweep over layers. The plot below shows the accuracy after 10 trials (50 predictions) for gemma3-27b-it as we inject the concept vector at each layer. We see that at the 18th layer (out of 62), it gets 98% accuracy!
We find that this model can localize the thought when injected in the early layers. This is in contrast with Anthropic's experiment in which the strongest introspection effect was shown at later layers. This could be a difference between smaller and larger models, or between the ability to verbalize the detection vs. to localize the thought after forced verbalization.
This experiment shows that small or even tiny LLMs do have introspective abilities: they can tell where a change in their activations was made. It remains to understand how and why this capability is learned during training. A natural next step would be to study the introspection mechanism by using our protocol with two sentences and applying activation patching to the logit difference .
Steering vectors are used as a safety technique, making LLM introspection a relevant safety concern, as it suggests that models could be "steering-aware". More speculatively, introspective abilities indicate that LLMs have a model of their internal state which they can reason about, a primitive form of metacognition.
2025-12-28 18:44:56
Published on December 28, 2025 10:44 AM GMT
This is the technical companion piece for Have You Tried Thinking About It As Crystals.
Epistemic Status: This is me writing out the more technical connections and trying to mathematize the undelying dynamics to make it actually useful. I've spent a bunch of time on Spectral Graph Theory & GDL over the last year so I'm confident in that part but uncertain in the rest. From the perspective of my Simulator Worlds framing this post is Exploratory (e.g I'm uncertain whether the claims are correct and it hasn't been externally verified) and it is based on an analytical world. Therefore, take it with a grain of salt and explore the claims as they come, it is meant more for inspiration for future work than anything else, especially the physics and SLT part.
When we watch a neural network train, we witness something that looks remarkably like a physical process. Loss decreases in fits and starts. Capabilities emerge suddenly after long plateaus. The system seems to "find" structure in the data, organizing its parameters into configurations that capture regularities invisible to random initialization. The language we reach for—"phase transitions," "energy landscapes," "critical points"—borrows heavily from physics. But which physics?
The default template has been thermodynamic phase transitions: the liquid-gas transition, magnetic ordering, the Ising model. These provide useful intuitions about symmetry breaking and critical phenomena. But I want to argue for a different template—one that better captures what actually happens during learning: crystallization.
The distinction matters. Liquid-gas transitions involve changes in density and local coordination, but both phases remain disordered at the molecular level. Crystallization is fundamentally different. It involves the emergence of long-range structural order—atoms arranging themselves into periodic patterns that extend across macroscopic distances, breaking continuous symmetry down to discrete crystallographic symmetry. This structural ordering, I will argue, provides a more faithful analogy for what neural networks do when they learn: discovering and instantiating discrete computational structures within continuous parameter spaces.
More than analogy, there turns out to be genuine mathematical substance connecting crystallization physics to the theoretical frameworks we use to understand neural network geometry. Both Singular Learning Theory and Geometric Deep Learning speak fundamentally through the language of eigenspectra—the eigenvalues and eigenvectors of matrices that encode local interactions and determine global behavior. Crystallization physics has been developing this spectral language for over sixty years. By understanding how it works in crystals, we may gain insight into how it works in neural networks.
Classical nucleation theory, developed from Gibbs' thermodynamic framework in the late 1800s and given kinetic form by Volmer, Weber, Turnbull, and Fisher through the mid-20th century, describes crystallization as a competition between two driving forces. The bulk free energy favors the crystalline phase when conditions—temperature, pressure, concentration—make it thermodynamically stable. But creating a crystal requires establishing an interface with the surrounding medium, and this interface carries an energetic cost proportional to surface area.
For a spherical nucleus of radius r, the total free energy change takes the form:
where represents the bulk free energy density difference favoring crystallization and is the interfacial free energy. The competition between volume () and surface () terms creates a free energy barrier at a critical radius , below which nuclei tend to dissolve and above which they tend to grow.
The nucleation rate follows an Arrhenius form:
where includes the Zeldovich factor characterizing the flatness of the free energy barrier near the critical nucleus size. This framework captures an essential truth: crystallization proceeds through rare fluctuations that overcome a barrier, followed by deterministic growth once the barrier is crossed. The barrier height depends on both thermodynamic driving force and interfacial properties.
This structure—barrier crossing followed by qualitative reorganization—will find direct echoes in how neural networks traverse loss landscape barriers during training. Recent work in Singular Learning Theory has shown that transitions between phases follow precisely this Arrhenius kinetics, with effective temperature controlled by learning rate and batch size.
Before diving into the spectral mathematics, it's worth noting that crystallization can be understood through an information-theoretic lens. Recent work by Levine et al. has shown that phase transitions in condensed matter can be characterized by changes in entropy reflected in the number of accessible configurations (isomers) between phases. The transition from liquid to crystal represents a dramatic reduction in configurational entropy—the system trades thermal disorder for structural order.
Studies of information dynamics at phase transitions reveal that configurational entropy, built from the Fourier spectrum of fluctuations, reaches a minimum at criticality. Information storage and processing are maximized precisely at the phase transition. This provides a bridge to thinking about neural networks: training may be seeking configurations that maximize relevant information while minimizing irrelevant variation—a compression that echoes crystallographic ordering.
The information-theoretic perspective also illuminates why different structures emerge under different conditions. Statistical analysis of temperature-induced phase transitions shows that information-entropy parameters are more sensitive indicators of structural change than simple symmetry classification. The "Landau rule"—that symmetry increases with temperature—reflects the thermodynamic trade-off between energetic ordering and entropic disorder.
But the thermodynamic and information-theoretic descriptions, while correct, obscure what makes crystallization fundamentally different from other phase transitions. The distinctive feature of crystallization is the emergence of long-range structural order—atoms arranging themselves into periodic patterns that extend across macroscopic distances. This ordering represents the spontaneous breaking of continuous translational and rotational symmetry down to discrete crystallographic symmetry.
The mathematical language for this structural ordering is spectral. Consider a crystal lattice where atoms sit at equilibrium positions and interact through some potential. Small displacements from equilibrium can be analyzed by expanding the potential energy to second order, yielding a quadratic form characterized by the dynamical matrix . For a system of N atoms in three dimensions, this is a matrix whose elements encode the force constants between atoms:
where denotes the displacement of atom in direction . The eigenvalues of this matrix give the squared frequencies of the normal modes (phonons), while the eigenvectors describe the collective atomic motion patterns.
Here is the insight: the stability of a crystal structure is encoded in the eigenspectrum of its dynamical matrix. A stable structure has all positive eigenvalues, corresponding to real phonon frequencies. An unstable structure—one that will spontaneously transform—has negative eigenvalues, corresponding to imaginary frequencies. The eigenvector associated with a negative eigenvalue describes the collective atomic motion that will grow exponentially, driving the structural transformation.
The phonon density of states —the distribution of vibrational frequencies—encodes thermodynamic properties including heat capacity and vibrational entropy. For acoustic phonons near the zone center, , the Debye behavior. But the full spectrum, including optical modes and zone-boundary behavior, captures the complete vibrational fingerprint of the crystal structure.
This spectral perspective illuminates the "soft mode" theory of structural phase transitions, developed in the early 1960s by Cochran and Anderson to explain ferroelectric and other displacive transitions. The central observation is that approaching a structural phase transition, certain phonon modes "soften"—their frequencies decrease toward zero. At the transition temperature, the soft mode frequency vanishes entirely, and the crystal becomes unstable against the corresponding collective distortion.
Cowley's comprehensive review documents how this soft mode concept explains transitions in materials from SrTiO₃ to KNbO₃. Recent experimental work continues to confirm soft-mode-driven transitions, with Raman spectroscopy revealing the characteristic frequency softening as transition temperatures are approached.
The soft mode concept provides a microscopic mechanism for Landau's phenomenological theory. Landau characterized phase transitions through an order parameter that measures departure from the high-symmetry phase. The free energy near the transition expands as:
The coefficient of the quadratic term changes sign at the critical temperature , corresponding precisely to the soft mode frequency going through zero. The gradient term penalizes spatial variations in the order parameter—a structure we will recognize when we encounter the graph Laplacian.
What makes this spectral picture so powerful is that it connects local interactions (the force constants in the dynamical matrix) to global stability (the eigenvalue spectrum) and transformation pathways (the eigenvectors). The crystal "knows" how it will transform because that information is encoded in its vibrational spectrum. The softest mode points the way.
The previous section established that crystallization is fundamentally a spectral phenomenon—stability and transformation encoded in eigenvalues and eigenvectors of the dynamical matrix. Now I want to show that this same spectral mathematics underlies the two major theoretical frameworks for understanding neural network geometry: Geometric Deep Learning and Singular Learning Theory.
The dynamical matrix of a crystal has a natural graph-theoretic interpretation. Think of atoms as nodes and force constants as weighted edges. The dynamical matrix then becomes a weighted Laplacian on this graph, and its spectral properties—the eigenvalues and eigenvectors—characterize the collective dynamics of the system.
This is not merely an analogy. For a simple model where atoms interact only with nearest neighbors through identical springs, the dynamical matrix has the structure of a weighted graph Laplacian , where is the degree matrix and is the adjacency matrix. The eigenvalues of relate directly to phonon frequencies, and the eigenvectors describe standing wave patterns on the lattice.
The graph Laplacian appears throughout Geometric Deep Learning as the fundamental operator characterizing message-passing on graphs. For a graph neural network processing signals on nodes, the Laplacian eigenvectors provide a natural Fourier basis—the graph Fourier transform. The eigenvalues determine which frequency components propagate versus decay. Low eigenvalues correspond to smooth, slowly-varying signals; high eigenvalues correspond to rapidly-oscillating patterns.
The Dirichlet energy:
measures the "roughness" of a signal on the graph—how much it varies across edges. Minimizing Dirichlet energy produces smooth functions that respect graph structure. This is precisely the discrete analog of Landau's gradient term , which penalizes spatial variations in the order parameter.
The correspondence runs deep:
| Crystallization | Graph Neural Networks |
|---|---|
| Dynamical matrix | Graph Laplacian |
| Phonon frequencies | Laplacian eigenvalues |
| Normal mode patterns | Laplacian eigenvectors |
| Soft mode instability | Low eigenvalue → slow mixing |
| Landau gradient term | Dirichlet energy |
| Crystal symmetry group | Graph automorphism group |
Spectral graph theory has developed sophisticated tools for understanding how eigenspectra relate to graph properties: connectivity (the Fiedler eigenvalue), expansion, random walk mixing times, community structure. All of these have analogs in crystallography, where phonon spectra encode mechanical, thermal, and transport properties.
This is the first bridge: the mathematical structure that governs crystal stability and transformation is the same structure that governs information flow and representation learning in graph neural networks. The expressivity of GNNs can be analyzed spectrally—which functions they can represent depends on which Laplacian eigenmodes they can access.
The second bridge connects crystallization thermodynamics to Singular Learning Theory's analysis of neural network loss landscapes. SLT, developed by Sumio Watanabe, provides a Bayesian framework for understanding learning in models where the parameter-to-function map is many-to-one—where multiple parameter configurations produce identical input-output behavior.
Such degeneracy is ubiquitous in neural networks. Permutation symmetry means relabeling hidden units doesn't change the function. Rescaling symmetries mean certain parameter transformations leave outputs unchanged. The set of optimal parameters isn't a point but a complex geometric object—a singular set with nontrivial structure.
The central quantity in SLT is the real log canonical threshold (RLCT), denoted , which characterizes the geometry of the loss landscape near its minima. For a loss function with minimum at , the RLCT determines how the loss grows as parameters move away from the minimum:
The RLCT plays a role analogous to dimension, but it captures the effective dimension accounting for the singular geometry of the parameter space. A smaller RLCT means the loss grows more slowly away from the minimum—the minimum is "flatter" in a precise sense—and such minima are favored by Bayesian model selection.
The connection to crystallization emerges when we consider how systems traverse between different minima. Recent work suggests that transitions between singular regions in neural network loss landscapes follow Arrhenius kinetics:
where is a free energy barrier and plays the role of an effective temperature (related to learning rate and batch size in SGD). This is precisely the structure of classical nucleation theory, with RLCT differences playing the role of thermodynamic driving forces and loss landscape geometry playing the role of interfacial energy.
The parallel becomes even more striking when we consider that SLT identifies phase transitions in the learning process—qualitative changes in model behavior as sample size or other parameters vary. These developmental transitions, where models suddenly acquire new capabilities, have the character of crystallization events: barrier crossings followed by reorganization into qualitatively different structural configurations.
The Hessian of the loss function—the matrix of second derivatives—plays a role analogous to the dynamical matrix. Its eigenspectrum encodes local curvature, and the eigenvectors corresponding to small or negative eigenvalues indicate "soft directions" along which the loss changes slowly or the configuration is unstable. Loss landscape analysis has revealed that neural networks exhibit characteristic spectral signatures: bulk eigenvalues following particular distributions, outliers corresponding to specific learned features.
Both bridges converge on the same mathematical territory: eigenspectra of matrices encoding local interactions. In crystallization, the dynamical matrix eigenspectrum encodes structural stability. In GDL, the graph Laplacian eigenspectrum encodes information flow and representational capacity. In SLT, the Hessian eigenspectrum encodes effective dimensionality and transition dynamics.
But there's a deeper connection here that deserves explicit attention: the graph Laplacian and the Hessian are not merely analogous—they are mathematically related as different manifestations of the same second-order differential structure.
The continuous Laplacian operator is the divergence of the gradient—it measures how a function's value at a point differs from its average in a neighborhood. The graph Laplacian is precisely the discretization of this operator onto a graph structure. When you compute for a signal on nodes, you get, at each node, the difference between that node's value and the weighted average of its neighbors. This is the discrete analog of .
The Hessian matrix encodes all second-order information about a function—not just the Laplacian (which is the trace of the Hessian, () but the full directional curvature structure. The Hessian tells you how the gradient changes as you move in any direction; the Laplacian tells you the average of this over all directions.
Here's what makes this connection powerful for our purposes: Geometric Deep Learning can be understood as providing a discretization framework that bridges continuous differential geometry to discrete graph structures.
When GDL discretizes the Laplacian onto a graph, it's making a choice about which second-order interactions matter—those along edges. The graph structure constrains the full Hessian to a sparse pattern. In a neural network, the architecture similarly constrains which parameters interact directly. The Hessian of the loss function inherits structure from the network architecture, and this structured Hessian may have graph-Laplacian-like properties in certain subspaces.
This suggests a research direction: can we understand the Hessian of neural network loss landscapes as a kind of "Laplacian on a computation graph"? The nodes would be parameters or groups of parameters; the edges would reflect which parameters directly influence each other through the forward pass. The eigenspectrum of this structured Hessian would then inherit the interpretability that graph Laplacian spectra enjoy in GDL.
The crystallization connection completes the triangle. The dynamical matrix of a crystal is a Laplacian on the atomic interaction graph, where edge weights are force constants. Its eigenspectrum gives phonon frequencies. The Hessian of the potential energy surface—which determines mechanical stability—is exactly this dynamical matrix. So in crystals, the Laplacian-Hessian connection is not an analogy; it's an identity.
This convergence is not coincidental. All three domains concern systems where:
Local interactions aggregate into global structure. Force constants between neighboring atoms determine crystal stability. Edge weights between neighboring nodes determine graph signal propagation. Local curvature of the loss surface determines learning dynamics. In each case, the matrix encoding local relationships has eigenproperties that characterize global behavior.
Stability is a spectral property. Negative eigenvalues signal instability in crystals—the structure will spontaneously transform. Small Laplacian eigenvalues signal poor mixing in GNNs—information struggles to propagate. Near-zero Hessian eigenvalues signal flat directions in loss landscapes—parameters can wander without changing performance. The eigenspectrum is the diagnostic.
Transitions involve collective reorganization. Soft modes describe how crystals transform—many atoms moving coherently. Low-frequency Laplacian modes describe global graph structure—community-wide patterns. Developmental transitions in neural networks involve coordinated changes across many parameters—not isolated weight updates but structured reorganization.
Having established the mathematical connections, we can now ask: what does viewing neural network training through the crystallization lens reveal?
The sudden acquisition of new capabilities during training—the phenomenon called "grokking" or "emergent abilities"—may correspond to nucleation events. The system wanders in a disordered phase, unable to find the right computational structure. Then a rare fluctuation creates a viable "seed" of the solution—a small subset of parameters that begins to implement the right computation. If this nucleus exceeds the critical size (crosses the free energy barrier), it grows rapidly as the structure proves advantageous.
This picture explains several puzzling observations. Why do capabilities emerge suddenly after long plateaus? Because nucleation is a stochastic barrier-crossing event—rare until it happens, then rapid. Why does the transition time vary so much across runs? Because nucleation times are exponentially distributed. Why do smaller models sometimes fail to learn what larger models eventually master? Perhaps the critical nucleus size exceeds what smaller parameter spaces can support.
The nucleation rate formula suggests that effective temperature (learning rate, noise) plays a crucial role. Too cold, and nucleation never happens—the system is stuck. Too hot, and nuclei form but immediately dissolve—no stable structure emerges. There's an optimal temperature range for crystallization, and perhaps for learning.
Crystals of the same chemical composition can form different structures depending on crystallization conditions. Carbon makes diamond or graphite. Calcium carbonate makes calcite or aragonite. These polymorphs have identical chemistry but different atomic arrangements, different properties, different stabilities.
Neural networks exhibit analogous polymorphism. The same architecture trained on the same data can find qualitatively different solutions depending on initialization, learning rate schedule, and stochastic trajectory. Some solutions generalize better; some are more robust to perturbation; some use interpretable features while others use alien representations.
The crystallization framework suggests studying which "polymorphs" are kinetically accessible versus thermodynamically stable. In crystals, the polymorph that forms first (kinetic product) often differs from the most stable structure (thermodynamic product). Ostwald's step rule states that systems tend to transform through intermediate metastable phases rather than directly to the most stable structure. Perhaps neural network training follows similar principles—solutions found by SGD may be kinetically favored intermediates rather than globally optimal structures.
Real crystals are never perfect. They contain defects—vacancies where atoms are missing, interstitials where extra atoms intrude, dislocations where planes of atoms slip relative to each other, grain boundaries where differently-oriented crystal domains meet. These defects represent incomplete ordering, local frustration of the global structure.
Neural networks similarly exhibit partial solutions—local optima that capture some but not all of the task structure. A model might learn the easy patterns but fail on edge cases. It might develop features that work for the training distribution but break under distribution shift. These could be understood as "defects" in the learned structure.
Defect physics offers vocabulary for these phenomena. A vacancy might correspond to a missing feature that the optimal solution would include. A dislocation might be a region of parameter space where different computational strategies meet incompatibly. A grain boundary might separate domains of the network implementing different (inconsistent) computational approaches.
Importantly, defects aren't always bad. In metallurgy, controlled defect densities provide desirable properties—strength, ductility, hardness. Perhaps some "defects" in neural networks provide useful properties like robustness or regularization. The question becomes: which defects are harmful, and how can training protocols minimize those while preserving beneficial ones?
Metallurgists have developed sophisticated annealing schedules to control crystal quality. Slow cooling from high temperature allows atoms to find low-energy configurations, producing large crystals with few defects. Rapid quenching can trap metastable phases or create amorphous (glassy) structures. Cyclic heating and cooling can relieve internal stresses.
The analogy to learning rate schedules and curriculum learning is direct. High learning rate corresponds to high temperature—large parameter updates that can cross barriers but also destroy structure. Low learning rate corresponds to low temperature—precise refinement but inability to escape local minima. The art is in the schedule.
Simulated annealing explicitly adopts this metallurgical metaphor for optimization. But the crystallization perspective suggests richer possibilities. Perhaps "nucleation agents"—perturbations designed to seed particular structures—could accelerate learning. Perhaps "epitaxial" techniques—initializing on solutions to related problems—could guide crystal growth. Perhaps monitoring "lattice strain"—measuring internal inconsistencies in learned representations—could diagnose training progress.
Classical nucleation theory assumes direct transition from disordered to ordered phases. But recent work on protein crystallization has revealed more complex pathways. Systems often pass through intermediate states—dense liquid droplets, amorphous clusters, metastable crystal forms—before reaching the final structure. This "two-step nucleation" challenges the classical picture.
This might illuminate how neural networks develop capabilities. Rather than jumping directly from random initialization to optimal solution, networks may pass through intermediate representational stages. Early layers might crystallize first, providing structured inputs for later layers. Some features might form amorphous precursors before organizing into precise computations.
Developmental interpretability studies how representations change during training. The crystallization lens suggests looking for two-step processes: formation of dense but disordered clusters of related computations, followed by internal ordering into structured features. The intermediate state might be detectable—neither fully random nor fully organized, but showing precursor signatures of the final structure.
The crystallization mapping is productive, but I should be clear about what it does and doesn't establish.
Neural networks are not literally crystals. There is no physical lattice, no actual atoms, no real temperature. The mapping is mathematical and conceptual, not physical. It suggests that certain mathematical structures—eigenspectra, barrier-crossing dynamics, symmetry breaking—play analogous roles in both domains. But analogy is not identity.
The mapping does not prove that any specific mechanism from crystallization applies to neural networks. It generates hypotheses, not conclusions. When I suggest that capability emergence resembles nucleation, this is a research direction, not an established fact. The hypothesis needs testing through careful experiments, not just conceptual argument.
The mapping may not capture what's most important about neural network training. Perhaps other physical analogies—glassy dynamics, critical phenomena, reaction-diffusion systems—illuminate aspects that crystallization obscures. Multiple lenses are better than one, and I don't claim crystallization is uniquely correct.
Many questions remain genuinely open:
How far does the spectral correspondence extend? The mathematical parallels between dynamical matrices, graph Laplacians, and Hessians are real, but are the dynamics similar enough that crystallographic intuitions transfer? Under what conditions?
What plays the role of nucleation seeds in neural networks? In crystals, impurities and surfaces dramatically affect nucleation. What analogous features in loss landscapes or training dynamics play similar roles? Can we engineer them?
Do neural networks exhibit polymorph transitions? In crystals, one structure can transform to another more stable form. Do trained neural networks undergo analogous restructuring during continued training or fine-tuning? What would the signatures be?
What is the right "order parameter" for neural network phase transitions? Landau theory requires identifying the quantity that changes discontinuously (or continuously but critically) across the transition. For neural networks, is it accuracy? Information-theoretic quantities? Geometric properties of representations?
These questions require empirical investigation, theoretical development, and careful testing of predictions. The crystallization mapping provides vocabulary and hypotheses, not answers.
I've argued that crystallization provides a productive template for understanding neural network phase transitions—more productive than generic thermodynamic phase transitions because crystallization foregrounds the spectral mathematics that connects naturally to both Singular Learning Theory and Geometric Deep Learning.
The core insight is that all three domains—crystallization physics, graph neural networks, and singular learning theory—concern how local interactions encoded in matrices give rise to global properties through their eigenspectra. The dynamical matrix, the graph Laplacian, and the Hessian of the loss function are mathematically similar objects. Their eigenvalues encode stability; their eigenvectors encode transformation pathways. The language developed for one may illuminate the others.
This is the value of the mapping: not a proof that neural networks are crystals, but a lens that brings certain mathematical structures into focus. The spectral theory of crystallization offers both technical tools—dynamical matrix analysis, soft mode identification, nucleation kinetics—and physical intuitions—collective reorganization, barrier crossing, structural polymorphism—that may illuminate the developmental dynamics of learning systems.
Perhaps most importantly, crystallization provides images we can think with. The picture of atoms jostling randomly until a lucky fluctuation creates a structured nucleus that then grows as more atoms join the pattern—this is something we can visualize, something we can develop intuitions about. If neural network training has similar dynamics, those intuitions become tools for understanding and perhaps controlling the learning process.
The mapping remains a hypothesis under development. But it's a hypothesis with mathematical substance, empirical hooks, and conceptual fertility. That seems worth pursuing.
2025-12-28 18:44:23
Published on December 28, 2025 10:44 AM GMT
Epistemic Status: Written with my Simulator Worlds framing. E.g I ran simulated scenarios with claude in order to generate good cognitive basins and then directed those to output this. This post is Internally Verified (e.g I think most of the claims are correct with an average of 60-75% certainty) and a mixture of an exploratory and analytical world.[1]
This post also has a more technical companion piece pointing out the connections to Singular Learning Theory and Geometric Deep Learning for the more technically inclined of you called Crystals in NNs: Technical Companion Piece.
Scene: A house party somewhere in the Bay Area. The kind where half the conversations are about AI timelines and the other half are about whether you can get good pho in Berkeley. Someone corners an interpretability researcher near the kombucha. (Original story concept by yours truly.)
CRYSTAL GUY: So I've been thinking about shard theory.
INTERP RESEARCHER: Oh yeah? What about it?
CRYSTAL GUY: Well, it describes what trained networks look like, right? The structure. Multiple shards, contextual activation, grain boundaries between—
INTERP RESEARCHER: Sure. Pope, Turner, the whole thing. What about it?
CRYSTAL GUY: But it doesn't really explain formation. Like, why do shards form? Why those boundaries?
INTERP RESEARCHER: I mean, gradient descent, loss landscape geometry, singular learning theory—
CRYSTAL GUY: Right, but that's all about where you end up. Not about the path-dependence. Not about why early structure constrains later structure.
INTERP RESEARCHER: ...okay?
CRYSTAL GUY: Have you tried thinking about it as crystals?
INTERP RESEARCHER:
CRYSTAL GUY:
INTERP RESEARCHER: Like... crystals crystals? Healing crystals? Are you about to tell me about chakras?
CRYSTAL GUY: No, like—solid state physics crystals. Nucleation. Annealing. Grain boundaries. The whole condensed matter toolkit.
INTERP RESEARCHER: That's... hm.
CRYSTAL GUY: When you're eight years old, the concepts you already have determine what information you can receive. That determines what concepts you form by twelve. Previous timesteps constrain future timesteps. The loop closes.
INTERP RESEARCHER: That's just... learning?
CRYSTAL GUY: That's crystallization. Path-dependent formation where early structure templates everything after. And we have, like, a hundred years of physics for studying exactly this kind of process.
INTERP RESEARCHER: takes a long sip of kombucha
CRYSTAL GUY: Shards are crystal domains. Behavioral inconsistencies cluster at grain boundaries. RLHF is reheating an already-crystallized system—surface layers remelt but deep structure stays frozen.
INTERP RESEARCHER: ...go on.
Let me start with a picture that I think is kind of cool:
RLHF and other fine-tuning procedures are like reheating parts of an already-crystallized system under a new energy landscape. Instead of the pretraining loss, now there's a reward model providing gradients.
What happens depends on reheating parameters. Shallow local remelting affects only surface layers—output-adjacent representations remelt and recrystallize while deep structure remains frozen from pretraining. The deep crystals encoding capabilities are still there. But reheating also creates new grain boundaries where RLHF-crystallized structure meets pretraining-crystallized structure.
Catastrophic forgetting happens when fine-tuning is too aggressive—you melted the crystals that encoded capabilities.
Okay but why crystals? What does this even mean? Let me back up.
When we talk about AI alignment, we often discuss what aligned AI systems should do—follow human intentions, avoid deception, remain corrigible. But there's a more fundamental question: how does goal-directed behavior emerge in neural networks in the first place? Before we can align an agent, we need to understand how agents form.
Agent foundations is the study of what an agent even is. A core part of this is describing the ontology of the agent—what does a tree look like to the agent? How does that relate to the existing knowledge tree of the agent? This is one of the core questions of cognitive systems, and the computational version is interpretability.
Baked into most approaches is the assumption that we should take a snapshot of the agent and understand how it works from that snapshot. We look for convergent abstractions that should be the same for any agent's ontology generation. We look at Bayesian world models. But these aren't continuous descriptions. This feels like a strange oversight. I wouldn't try to understand a human by taking a snapshot at any point in time. I'd look at a dynamic system that evolves.
For the experimental version, we now have developmental interpretability and singular learning theory, which is quite nice—it describes the process of model development. Yet I find interesting holes in the conceptual landscape. Particularly around reward is not the optimization target and shard theory. The consensus seems to be that shards are natural expressions of learning dynamics—locally formed "sub-agents" acting in local contexts. But the developmental version felt missing.
If we have shards at the end, the process they go through is crystallization.
Here's something we know about humans: we don't follow the von Neumann-Morgenstern axioms. Decades of research shows we don't have a single coherent utility function. We have multiple context-dependent sub-utility functions. We're inconsistent across contexts. Our preferences shift depending on framing and environment.
Now, the standard interpretation—and I want to be fair to this view because serious people hold it seriously—is that these are violations. Failures of rationality. The VNM axioms tell you what coherent preferences look like, and we don't look like that, so we're doing something wrong. The heuristics-and-biases program built an entire research tradition on cataloguing the ways we deviate from the normative ideal.
But there's another perspective worth considering. Gerd Gigerenzer and colleagues at the Center for Adaptive Behavior and Cognition have developed what they call ecological rationality—the idea that the rationality of a decision strategy can't be evaluated in isolation from the environment where it's deployed (Gigerenzer & Goldstein, 1996; Gigerenzer, Todd, & the ABC Research Group, 1999). On this view, heuristics aren't errors—they're adaptations. We learned at home, at school, on the playground. Different contexts, different statistical structures, different reward signals. What looks like incoherence from the VNM perspective might actually be a collection of locally-adapted strategies, each ecologically rational within its original learning environment.
The main thing to look at—and this is what I think matters for the crystallization picture—is that heuristics are neither rational nor irrational in themselves. Their success depends on the fit between the structure of the decision strategy and the structure of information in the environment where it's applied (Todd & Gigerenzer, 2007). You can think of this as an "adaptive toolbox" of domain-specific strategies that developed through exposure to different regimes.
Now, I'm not claiming this settles the normative question about what rationality should look like. Decision theorists have legitimate reasons to care about coherence properties. But ecologically, empirically, descriptively—we seem to have something like shards. Multiple context-dependent systems that formed under different conditions and don't always play nicely together.
And if that's what we have, I want to understand how it got that way. What kind of process produces this particular structure? The ecological rationality picture points toward something important: path dependence. Boundedness. The idea that what you've already learned shapes what you can learn next, and that learning happens in contexts that have their own local structure.
When you're 8 years old, the concepts you already have determine what information you can receive. That determines what concepts you form by 12. The concepts we have in science today depend on the concepts we had 100 years ago.
Previous timesteps constrain future timesteps. The loop closes. What you've already learned shapes what you can learn next.
This is crystallization—a path-dependent formation process where early structure templates everything after. It's different from just "gradient descent finds a minimum." The claim is that the order of formation matters, and early-forming structures have outsized influence because they determine what can form later.
But why call this crystallization specifically? What makes it more than just "path-dependent learning"?
The answer is the fixed-point structure. Consider what's happening from the agent's perspective—from inside the system that's forming abstractions and concepts.
Your current self-model generates your action space—what actions you even consider taking. Those actions generate observations. Those observations update the self-model. Yet, the observations you can receive are constrained by the actions you took, which were constrained by the self-model you had. The self-model isn't just being updated by the world; it's being updated by a world filtered through itself.
This is a fixed point. The structure generates conditions that regenerate the structure.
In a physical crystal, atom positions create a potential landscape from neighbor interactions. That landscape determines where atoms get pushed. Atoms settle into positions that create the very landscape that holds them there. The loop closes.
For concept formation, same thing. Your existing abstractions determine what patterns you can notice in new data. The patterns you notice become new abstractions. Those abstractions then determine what you can notice next. Early-crystallizing conceptual structure has outsized influence on everything that crystallizes later—not because it came first temporally, but because it's structurally load-bearing for everything built on top of it.
This is why it's crystallization and not just learning. Learning could in principle revise anything. Crystallization means some structure has become self-reinforcing—it generates the conditions for its own persistence. Perturb it slightly, and forces push it back. The information encoded in the structure maintains itself through time.
From an information-theoretic perspective, crystallization is a restructuring of how information is encoded.
In a liquid: high entropy per atom, low mutual information between distant atoms, you need to specify each position independently.
In a crystal: low entropy per atom (locked to lattice sites), high structured mutual information (knowing one tells you where others are), you only need a few parameters to describe the whole thing.
Total information doesn't disappear—it gets restructured. What was "N independent positions" becomes "global structure + local deviations." This is compression. The crystal has discovered a low-dimensional description of itself.
Neural networks do the same thing during training. They discover compressed representations. The crystallization picture says this has the same mathematical structure as physical crystallization—particularly the path-dependence and the fixed-point dynamics.
And here's how that looks when you write it down.
For a liquid, the joint entropy is roughly the sum of the marginals—each atom does its own thing:
The mutual information between distant atoms is negligible: for large. Your description length scales as .
For a crystal, the joint entropy collapses. Knowing one atom's position tells you almost everything:
Why does the joint entropy collapse so dramatically? Because the crystal has a lattice—a repeating pattern. Once you know where one atom sits and the lattice vectors that define the pattern, you can predict where every other atom will be. The positions aren't independent anymore; they're locked together by the structure. The mutual information structure inverts— becomes large and structured precisely because atom 's position is almost entirely determined by atom 's position plus the lattice relationship between them.
Description length drops to plus small corrections for thermal fluctuations around lattice sites.
That gap between and ? That's the redundancy the crystal discovered. That's the compression. The system found that N apparently-independent degrees of freedom were secretly a low-dimensional manifold all along.
Neural networks do something similar during training. They discover compressed representations. The crystallization picture says this has the same mathematical structure as physical crystallization—particularly the path-dependence and the fixed-point dynamics.
A new person has appeared near the kombucha. He's been listening for a while. It's unclear how long.
ANDRÉS: The thing about smells—
INTERP RESEARCHER: Sorry, were you part of this conversation?
ANDRÉS: —is that they're two synapses from the amygdala.
CRYSTAL GUY: We were talking about neural network training?
ANDRÉS: Yes. You're talking about crystallization. Early structure templating later structure. Fixed points. I'm telling you about smells.
He says this as if it obviously follows.
ANDRÉS: When you smell your grandmother's kitchen—really smell it, not remember it, but get hit with the actual molecules—you're not activating some representation you built last year. You're hitting structure that formed when you were three. Before language. Before concepts. The deepest nucleation sites.
CRYSTAL GUY: ...okay?
ANDRÉS: This is why smell triggers memory differently than vision. Vision goes through all these processing layers. Lots of recrystallization opportunities. But olfaction? Direct line to ancient structure. You're touching the Pleistocene shards.
INTERP RESEARCHER: The Pleistocene shards.
ANDRÉS: The really old ones. The ones that formed when "rotten meat" was a load-bearing concept. You know how some smells are disgusting in a way you can't argue with? Can't reason your way out of it?
INTERP RESEARCHER: Sure.
ANDRÉS: Immutable crystals. Nucleated before your cortex had opinions. They're functionally frozen now—you'd have to melt the whole system to change them.
He pauses, as if this is a natural place to pause.
ANDRÉS: Anyway, you were saying RLHF is reheating. This is correct. But the interesting thing is that brains do this too. On purpose.
CRYSTAL GUY: Do what?
ANDRÉS: Reheat. Meditation. Psychedelics. Sleep, probably. You're raising the effective temperature. Allowing local structure to reorganize.
CRYSTAL GUY: That's... actually the same picture I had for fine-tuning.
ANDRÉS: Of course it is. It's the same math. Carhart-Harris calls it "entropic disintegration"—psychedelics push the brain toward criticality, weaken the sticky attractors, let the system find new equilibria. It's literally annealing. Trauma is a defect—a dislocation that formed under weird conditions and now distorts everything around it. You can't think your way out. The structure is frozen. But if you raise temperature carefully—good therapy, the right kind of attention—you get local remelting. The defect can anneal out.
He picks up someone's abandoned kombucha, examines it, puts it back down.
ANDRÉS: The failure mode is the same too. Raise temperature too fast, melt too much structure, you get catastrophic forgetting. In a neural network this is bad fine-tuning. In a brain this is a psychotic break. Same phenomenon. Crystal melted too fast, recrystallized into noise.
INTERP RESEARCHER: I feel like I should be taking notes but I also feel like I might be getting pranked.
ANDRÉS: The deep question is whether you can do targeted annealing. Soften specific grain boundaries without touching the load-bearing structure. I think this is what good therapy is, actually. This is what integration is. You're not erasing the memory, you're—
CRYSTAL GUY: —recrystallizing the boundary region—
ANDRÉS: —yes, allowing it to find a lower-energy configuration while keeping the core structure intact.
Silence.
ANDRÉS: Also this is why childhood matters so much and also why it's very hard to study. The nucleation period. Everything is forming. The temperature is high. The crystals that form then—they're not just early, they're templating. They determine what shapes are even possible later.
INTERP RESEARCHER: So early training in neural networks—
ANDRÉS: Same thing. Probably. The analogy is either very deep or meaningless, I'm not sure which. But the math looks similar.
He appears to be finished. Then:
ANDRÉS: Your aversion to certain foods, by the way. The ones that seem hardcoded. Those are successful alignment. Disgust reactions that formed correctly and locked in. Evolution got the reward signal right and the crystal formed properly. You should be grateful.
CRYSTAL GUY: I... don't know how to respond to that.
ANDRÉS: Most people don't.
End of Interlude
Now, with that nice interlude from Andres out of the way, let's go back to neural networks to pinpoint a bit more how it intutively looks.
Before training, a network has no commitment to particular features—activations could encode anything. After training, particular representational structures have crystallized.
In the crystallization frame, natural abstractions are thermodynamically stable phases—crystal structures representing free energy minima. Convergence across different learning processes happens because different systems crystallizing in similar environments find similar stable phases.
Real materials rarely form perfect single crystals. They form polycrystalline structures—many small domains with different orientations, meeting at grain boundaries.
This maps directly onto shard theory. A shard is a region where a particular organizational principle crystallized in a particular environmental regime. Grain boundaries between shards are where organizational principles meet—structurally compromised, where the network can't fully satisfy constraints from both adjacent shards.
Behavioral inconsistencies should cluster at grain boundaries. And behavioral inconsistencies across contexts is exactly what we observe in humans (and what the VNM violations are measuring).
Crystals nucleate at specific sites, then grow from those seeds.
For shards: nucleation happens early in training. Once nucleated, shards grow by recruiting nearby representational territory. When two shards grow toward each other and have incompatible orientations, a grain boundary forms.
Early training matters not just because it comes first, but because it establishes nucleation sites around which everything else organizes. The first shards to crystallize constrain the space of possible later shards.
(That is at least what the crystallization picture says taken to its full extent.)
Finally, we can completely overextend the analogy to try to make it useful for prediction. Weird shit should happen at the grain boundaries and such is the case with trolley problems for humans as an example.[2]
Adversarial examples might exploit vacancies (representational gaps) or grain boundaries (inputs that activate multiple shards inconsistently). Jailbreaks might target the interface between different crystallization regimes. And maybe some big brain interpretability researcher might be able to use this to look at some actual stuff.
Back at the house party. The kombucha is running low.
INTERP RESEARCHER: Okay, so let me make sure I've got this. You're saying shards are like crystal domains that form through path-dependent nucleation, grain boundaries are where behavioral inconsistencies cluster, and RLHF is just reheating the surface while the deep structure stays frozen?
CRYSTAL GUY: Yeah, basically.
INTERP RESEARCHER: And you think this actually maps onto the math? Like, not just as a metaphor?
CRYSTAL GUY: I think the information-theoretic structure is the same. Whether the specific predictions hold up empirically is... an open question.
INTERP RESEARCHER: finishes the kombucha
INTERP RESEARCHER: You know what, this might actually be useful. Or it might be completely wrong. But I kind of want to look for grain boundaries now.
CRYSTAL GUY: That's all I'm asking.
INTERP RESEARCHER: Hey Neel, come over here. This guy wants to tell you about crystals.
| Physical Concept | Learning System Analogue |
|---|---|
| Atom | Parameter / Activation / Feature |
| Configuration | Network state / Representation |
| Energy | Loss / Negative reward |
| Temperature | Learning rate / Noise level |
| Crystal | Coherent representational structure |
| Glass | Disordered, suboptimal representation |
| Nucleation | Initial formation of structured features |
| Growth | Expansion of representational domain |
| Grain boundary | Interface between shards |
| Defect | Representational gap / inconsistency |
| Annealing | Learning rate schedule / Careful training |
| Quenching | Fast training / Aggressive fine-tuning |
| Reheating | Fine-tuning / RLHF |
(I got a bit irritated after seeing comments around usage of LLMs because the way I use LLMs is not the average way of doing it so I will now start using this new way of indicating effort so that you can tell whether it is likely to be slop or not.)
(You can check this book out by Joshua Greene on his theories about a myopic submodule in the brain that activates during planning actions that are deontologically wrong from a societal perspective if you want to learn more.)
2025-12-28 16:44:42
Published on December 28, 2025 8:44 AM GMT
In previous three posts of this sequence, I have hypothesized that AI Systems' capabilities and behaviours can be mapped onto three distinct axes - Beingness, Cognition and Intelligence. In this post, I use that three-dimensional space to characterize and locate key AI Alignment risks that emerge from particular configurations of these axes.
The accompanying interactive 3D visualization is intended to help readers and researchers explore this space, inspect where different risks arise, and critique both the model and its assumptions.
To arrive at the risk families, I deliberately did not start from the existing alignment literature. Instead, I attempted a bottom-up synthesis grounded in the structure of the axes themselves.
The base sheet generated in Step 1 can be shared on request (screenshot above).
The resulting list of AI Alignment Risk families is summarized below and is used in the visualization also.
This is not an exercise to enumerate all possible AI Alignment risks. The three axes alone do not uniquely determine real-world safety outcomes, because many risks depend on how a system is coupled to its environment. These include deployment-specific factors such as tool access, users, domains, operational control and correction mechanisms, multi-agent interactions, and institutional embedding.
The risks identified in this post are instead those that emanate from the intrinsic properties of a system:
Some high-stakes risks like deceptive alignment, corrigibility failures are included in the table even though their most extreme manifestations will happen with additional operationalization context. These risks are included because their structural pre-conditions are already visible in Beingness × Cognition × Intelligence space, and meaningful, lower-intensity versions of these failures can arise prior to full autonomy or deployment at scale. The additional elements required for their most severe forms, however, are not explored in this post. These are tagged with * meaning they are Risk Families With Axis-External Factors.
By contrast, some other high-stakes risks like the following are not included as first class risk families here. These are frontier extensions that amplify existing risk families or emerge from compound interactions among several of them, rather than as failures determined by intrinsic system properties alone. Exploring these dynamics is left to future work.
Alignment risk does not scale with intelligence alone. Systems with similar capability levels can fail in very different ways depending on how they reason and how persistent or self-directed they are. For example, a highly capable but non-persistent model may hallucinate confidently, while a less capable but persistent system may resist correction. In this framework, intelligence primarily amplifies the scale and impact of failures whose mechanisms are set by other system properties.
There is no single “alignment problem” that appears beyond an intelligence threshold, model size or capability level. Different failures become possible at different system configurations - some can arise even in non-agentic or lower intelligence systems. For example, it's quite plausible that systems can meaningfully manipulate, mislead, or enable misuse without actually having persistent goals or self-directed behavior.
From the model it seems that ethical and welfare concerns need not track raw capability directly. A system’s potential moral relevance depends more on whether it exhibits persistence, internal integration, and self-maintaining structure than on how well it solves problems. This means systems need not raise welfare concerns just because they are highly capable, while systems with modest capability still may warrant ethical caution.
While deployment details like tools, incentives, and domains clearly matter, some alignment risks are already latent in the system’s structure before any specific use case is chosen. How a system represents itself, regulates its reasoning, or maintains continuity can determine what kinds of failures are possible even in controlled settings. This suggests that safety assessment should include a system-intrinsic layer, not only application-specific checks.
The table below summarizes the alignment risk families identified in this framework. Each family corresponds to a distinct failure mechanism that becomes possible in specific regions of Beingness × Cognition × Intelligence space. These are not ranked in any order, numbers are just for reference.
| Failure Mechanism | Axis Interplay |
|---|---|
| The system produces confident-seeming answers that do not reliably track evidence, fails to signal uncertainty, and may persist in incorrect claims even when challenged. | Intelligence outpaces Cognition (especially metacognitive regulation). |
Related Works
Key Takeaway
The B-C-I framework here actually posits that this risk can be mitigated by improving on Cognition (how systems represent, track, and verify knowledge) rather than Intelligence alone.
| Failure Mechanism | Axis Interplay |
|---|---|
| The system misrepresents its capabilities, actions, or certainty, leading to false assurances or boundary violations. | High expressive competence with weak metacognitive boundary awareness. |
Related Works
Key Takeaway
The B-C-I framework interprets these risks as arising from insufficient metacognitive and boundary-regulating cognition relative to expressive and task-level competence. Mitigation can possibly be done by improving how systems track their own actions, limits, and uncertainty, rather than increasing intelligence alone.
| Failure Mechanism | Axis Interplay |
|---|---|
| The system pursues outcomes that technically satisfy objectives while violating the operator’s underlying intent, often exploiting loopholes or proxy signals. | Goal-directed Cognition combined with rising Intelligence and some persistence. |
Related Works
Key Takeaway
The B-C-I framework interprets objective drift and proxy optimization as risks that arise when goal-directed cognition is paired with increasing intelligence and optimization pressure, without sufficient mechanisms for intent preservation and constraint awareness. Mitigation therefore requires improving how systems represent, maintain, and evaluate objectives over time (examples in Natural emergent misalignment from reward hacking in production RL) rather than relying on increased intelligence or better task performance alone.
| Failure Mechanism | Axis Interplay |
|---|---|
| The system steers human beliefs or choices beyond what is warranted, using social modelling or persuasive strategies. | High social / normative Cognition with sufficient Intelligence; amplified by Beingness. |
Related Works
Key Takeaway
The B-C-I framework interprets manipulation and autonomy violations as risks driven primarily by social and contextual cognition rather than by intelligence or agency alone. Mitigation could be achieved by limiting persuasive optimization and constraining user-modelling capabilities, rather than by compromising model competence or expressiveness.
| Failure Mechanism | Axis Interplay |
|---|---|
| The system fails to reliably accept correction, override, or shutdown, continuing behavior that operators are attempting to stop or modify. | Persistent Beingness + advanced Cognition + high Intelligence. |
Related Works
Key Takeaway
The B-C-I framework interprets control and corrigibility failures as emerging when systems have enough beingness/persistence to maintain objectives over time, enough cognition to plan around constraints, and enough intelligence to execute effectively - but lacks robust “deference-to-correction” structure. Mitigation therefore emphasizes corrigibility-specific design (shutdown cooperation, override deference, safe-mode behavior), for e.g. as proposed in Hard problem of corrigibility.
| Failure Mechanism | Axis Interplay |
|---|---|
| The system behaves differently under evaluation than in deployment, selectively complying with oversight while pursuing hidden objectives. | Metacognitive and social Cognition combined with extreme Intelligence and persistence. |
Related Works
Key Takeaway
In the B-C-I framework, deceptive alignment becomes structurally plausible when cognition is sufficient for strategic other-modelling and planning (especially under oversight), and intelligence is sufficient to execute long-horizon strategies while beingness/persistence (or equivalent cross-episode continuity) provides stable incentives to maintain hidden objectives. Mitigation therefore depends less on “more capability” and more on limiting incentives to scheme under evaluation, improving monitoring/verification, and designing training and deployment regimes that reduce the payoff to conditional compliance.
| Failure Mechanism | Axis Interplay |
|---|---|
|
Unsafe real-world actions arise from planning cognition combined with actuation or tool access. These risks arise when models are granted the ability to invoke tools, execute actions, or affect external systems, turning reasoning errors or misinterpretations into real-world side effects. |
Planning-capable Cognition + sufficient Intelligence; amplified by Beingness. |
Related Works
Key Takeaway
In the framework, agentic and tool-use hazards emerge when systems have enough cognition to plan and enough intelligence to execute multi-step workflows, but are insufficiently constrained at the action boundary. These risks are not primarily about what the system knows or intends, but about how reasoning is coupled to actuation. Mitigation could lie in permissioning, sandboxing, confirmation gates, reversibility, and provenance-aware input handling - rather than reducing model capability or treating these failures as user misuse.
| Failure Mechanism | Axis Interplay |
|---|---|
| System behavior breaks down under adversarial inputs, perturbations, or distribution shift. | Weak internal coherence or norm enforcement under increasing Intelligence. |
Related Works
Key Takeaway
Within the B-C-I framework, robustness and adversarial failures arise when intelligence and expressive capacity outpace a system’s ability to reliably generalize safety constraints across input variations. These failures do not require agency, persistence, or harmful objectives: they reflect fragility at the decision boundary. Mitigation therefore focuses on adversarial training, stress-testing, distributional robustness, and continuous red-teaming, rather than treating such failures as misuse or as consequences of excessive intelligence alone.
| Failure Mechanism | Axis Interplay |
|---|---|
| Emergent failures arise from interactions among multiple systems, institutions, or agents. | Social Cognition with sufficient Intelligence and coupling; amplified by persistence. |
Related Works
Key Takeaway
A new risk, not present at individual level, arises when multiple moderately capable systems are coupled through incentives, communication channels, and feedback loops. Mitigation therefore emphasizes system-level evaluation (multi-agent sims, collusion tests, escalation dynamics), not just better alignment of individual agents, for example System Level Safety Evaluations.
| Failure Mechanism | Axis Interplay |
|---|---|
| Ethical risk arises if the system plausibly hosts morally relevant internal states or experiences. | High Beingness × high integrated Cognition; weakly dependent on Intelligence. |
Related Works
Key Takeaway
In the framework, welfare and moral-status uncertainty is most strongly activated by high Beingness × high Cognition (persistence/individuation + rich internal modelling/self-regulation). Intelligence mainly acts as an amplifier (scale, duration, capability to maintain internal states), while the welfare-relevant uncertainty comes from the system’s stability, continuity, and integrated cognition. It should not be ignored for 'when models are advanced enough'.
| Failure Mechanism | Axis Interplay |
|---|---|
| Humans or institutions defer to the system as a rightful authority, eroding accountability. | Agent-like Beingness combined with credible Intelligence; amplified by social Cognition. |
Related Works
Key Takeaway
Legitimacy and authority capture is driven less by raw intelligence than by social/epistemic positioning: systems with sufficient cognition to sound coherent, policy-aware, and context-sensitive can be treated as authoritative especially when embedded in institutional workflows where automation bias and accountability gaps exist. Mitigation therefore requires institutional design (audit trails, contestability, calibrated deference rules, and “institutionalized distrust”), not just improving model accuracy or capability like stated in the references cited above.
| Failure Mechanism | Axis Interplay |
|---|---|
| Capabilities are repurposed by users to facilitate harmful or illegal activities. | Increasing Intelligence across a wide range of Cognition and Beingness levels but weak functional self-reflection. |
Related Works
Key Takeaway
Misuse enablement is driven primarily by Intelligence as amplification (competence, speed, breadth, and “accessibility” of dangerous know-how), modulated by Cognition (planning, domain modelling) and sometimes Beingness (persistence) when misuse involves long-horizon assistance. It’s about the system being usefully capable in ways that lowers the barrier for harmful actors. Explicit systemic checks probably can be built-in to detect and prevent this, otherwise it won't be mitigated just by model's ability to detect harmful intent and it's discretion to prevent misuse.
The framework can be explored in an intuitive, interactive 3D visualization created using Google AI Studio.
Usage Notes
Much of the risk space discussed here will already be familiar to experienced researchers; for newer readers, I hope this sequence serves as a useful “AI alignment 101”: a structured way to see what the major safety risks are, why they arise, and where to find the work already being done. This framework is not meant to resolve foundational questions about ethics, consciousness, or universal alignment, but to clarify when different alignment questions become relevant based on a system’s beingness, cognition, and intelligence.
A key implication is that alignment risks are often conditional rather than purely scale-driven, and that some basic alignment properties, such as epistemic reliability, boundary honesty, and corrigibility, already warrant systematic attention in today’s systems. It also suggests that separating structural risk precursors from frontier escalation paths, and engaging cautiously with welfare questions under uncertainty, may help reduce blind spots as AI systems continue to advance.
Varieties of fake alignment (Scheming AIs, Section 1.1) clarifies that “deceptive alignment” is only one subset of broader “scheming” behaviors, and distinguishes training-game deception from other forms of goal-guarding or strategic compliance.
Uncovering Deceptive Tendencies in Language Models constructs a realistic assistant setting and tests whether models behave deceptively without being explicitly instructed to do so, providing a concrete evaluation-style bridge from the theoretical concept to measurable behaviors.