2026-05-04 03:20:49
How much do cows suffer in the production of milk? I can’t answer that; understanding animal experience is hard. But I can at least provide some facts about the conditions dairy cows live in, which might be useful to you in making your own assessment. My biggest conclusion is that cows made better choices than chickens by making their misery financially costly to farmers.
The life of a dairy cow starts as a calf. She is typically separated from her mother a few hours to a few days after birth and, to reduce disease risk, held in isolation. Cutting-edge farms will sometimes house calves in pairs. This isolation is clearly stressful for a baby herd mammal and her mother, but I didn’t find any quantification of that stress that I trusted.
Calves will be bottle-fed until weaning at 6-8 weeks (4-6 months earlier than beef calves). After weaning and vaccinations they can be introduced into a herd. At large farms (where most cows live), they will move in and out of different herds through their lifecycle. This is more stressful than being embedded with your friends for life, but again, I found no trustworthy quantification.
Heifers (unbred dairy cows) are first bred at 12-15 months, and calve 9 months later. After this, they will be on a ~yearly cycle of pregnancy, birth, and lactation (they will lactate through most of pregnancy). They will typically have 3 pregnancies before their milk production decreases and they are killed. An advantage cows have that egg chickens lack is that stress immediately and measurably decreases their economic productivity. This makes Cow Comfort big business, although if we can’t solve the sadist problem in nursing homes I don’t see how we can assume it’s solved in cows.
On average, ⅓ of dairy calves will go on to be dairy cows (herd size is constant or slightly shrinking in America despite increased demand, because productivity is increasing faster). ~2% will be raised for veal, some unknown percentage will be killed immediately, and most will be raised to be slaughtered for meat as adults. Early childhood stress probably hurts future meat cattle's economic productivity as well, but because they switch farms while young, the farm raising them after birth has neither the knowledge nor the incentive to minimize their stress.
My guess is that cows feel happiest in open pastures surrounded by their friends and sunshine and fresh grass. My judgement is already suspect, because one study showed cows prefer twilight to bright sunshine. But the same study showed they did really like to go outside, so let’s continue on that assumption.
It’s important to note that outside does not necessarily mean idyllic pasture. It can also mean a concrete exercise yard or feedlot. And no modern American dairy cow lives on grass alone, even if they live on a pasture. Their staggering milk production requires more calories than they could possibly get through grass alone.
Conventionally raised cows are not required to spend any time outside. Organic-certified cows must have at least 120 days of pasture time per year and year-round outdoor access of some kind. In practice, California cows get considerably more time outside than Wisconsin or New York cows, for the obvious reason.
The most recent data said that only 5% of lactating cows (which is almost all of them, since cows lactate through most of their pregnancies) have access to pasture to graze. That paper was published in 2017, but it might be using older data. Given the wording, it’s possible that the cows are given access during times without grass growth, but why would farmers do that? A 2007 study found that only 13% of cows had no outdoor access at all, and half had access to a pasture at least some of the time. So it looks like outdoor access is getting worse.
You can buy milk with little stickers printed on it advertising a humaneness standard. What do these tell you about the cows’ outdoor life?
Free stall barns are “standard” (personal communication) in the US. These barns lower their walls in good weather, giving cows access to fresh air but not any more space. They average 100 sq ft per cow, including both personal stalls and common areas. Is that enough? I have no idea.
Some humaneness standards ban tie stalls, but they’re not common in the US even with conventional cows.
Disease and injury are both common sources of misery and canaries for mistreatment. So how sick are dairy cows?
Note that this graph means: n% of cows experience that issue in a lactation cycle (almost a year), not that n% of cows are experiencing this issue at any given time, which is how I initially read it.
Mastitis is more common than the other issues combined, so I dug in further.
The time between experiencing enough suffering to justify euthanasia and the actual euthanasia is probably the most intense physical suffering a cow will experience. It would be useful to know how long that delay is. Unfortunately no one has looked.
We can try to make some guesses based on the ratio of unassisted deaths to euthanasia. In 2013 the data on deaths is as follows (note: they’re using the technical definition of cows, meaning they’ve had at least one pregnancy):
Herd animals feel better in herds. How much do dairy cows get to indulge that instinct?
Milk production also creates a lot of by-catch in the form of newborns that aren’t needed in the dairy herd. Their lives should be considered part of the consequences of dairy. So what are their lives like?
A few obvious sources of suffering I haven’t covered are medical procedures like debudding (horn removal), removal of extra teats, and branding.
Thanks to Emma McAleavy and Abby ShalekBriski for sharing their insider knowledge.
Thanks to anonymous client for funding this research.
2026-05-04 02:04:53
Engineering Enigmas is simplified Tarot reading for working engineers.
Any time you feel stuck and need help in finding a way forward, open that web page and follow the advice given. You may have to be creative in how you interpret the advice in order for it to apply to your situation – that’s intentional.
Don’t refresh the page until you get advice that suits you better. Take what you are given and roll with it.
… wait, what? But why?
Humans like to be seen as either analytical or intuitive.
That’s what we tell ourselves, but we are never that analytical or intuitive. Instead, we have used randomness to decide things since time immemorial. When we do, we do it in a way that lets us hide that it has happened, even from ourselves. Here are three of the ways.
Many of our decisions have no strong rationale either way.
A decision needs to be made, but among decent alternatives, we can’t really single out one clearly superior option. This means whatever we decide, it will ultimately be arbitrary.
Admitting to an arbitrary decision makes us look not-analytical and not-intuitive, so we look to others to validate our decision and lend it legitimacy. This is the reason we pay astrologers, witch-doctors, macroeconomists, and management consultants; they are all people that generate decisions with the rational capacity of a pair of dice, but we pretend they know what they are doing so their support turns an arbitrary decision into a legitimate one. Result: we are no worse off, but we sleep better at night.
These professions hide how clueless we are behind fancy words and speculative hypotheses. They serve a social function, rather than a technical one. The decision is just as arbitrary as before.
In section five of Scott Alexander’s review of The Secret of Our Success he quotes the author explaining how Naskapi foragers decide where to hunt for caribou:
Traditionally, Naskapi hunters decided where to go to hunt using divination and believed that the shoulder bones of caribou could point the way to success. To start the ritual, the shoulder blade was heated over hot coals in a way that caused patterns of cracks and burnt spots to form. This patterning was then read as a kind of map, which was held in a pre-specified orientation. The cracking patterns were (probably) essentially random from the point of view of hunting locations, since the outcomes depended on myriad details about the bone, fire, ambient temperature, and heating process. Thus, these divination rituals may have provided a crude randomizing device that helped hunters avoid their own decision-making biases.
This randomness serves much the same function as randomness in poker or rock-paper-scissors: in adversarial situations, deterministic and systematic strategies can be exploited. The caribou can learn how the Naskapi hunt – unless they randomise their decisions. Of course, they didn’t admit to this randomisation. The cracking pattern is a message from the deities.
Sometimes the astrologer (or management consultant, or heated bone) does not directly decide, but they inject enough randomness into the decision process to kick loose a few rocks and spark the creativity of the real decision-maker. (Weinberg talks a lot about this in his Secrets of Consulting book series, where he calls it jiggling the system.)
Instead of fighting against the arbitrariness and randomness involved in making decisions, we can save a lot of time and effort by realising it is there, and using it more actively. If I find it hard to decide something (e.g. between strawberry or vanilla ice cream) I glance down at my watch, and thanks to chronostasis (the neuro-optical phenomenon where when you first look at a clock, the second hand appears to stand still for longer than usual before getting moving again) it is trivial to see whether I did so when the second hand pointed at an even or odd number. If it was even, I go with the first choice, otherwise the second.
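Stripped of the watch and the chronostasis, the trick is just a coin flip. A trivial sketch of the same thing in code (the function name is my own):

```python
import random

def decide(*options: str) -> str:
    """Break a tie between equally decent options by letting chance pick one."""
    return random.choice(options)

print(decide("strawberry", "vanilla"))
```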
I use this method to decide on a lot of things. I don’t admit it often, because when people around me realise that’s what happened, they sigh and treat the decision as invalid and get re-stuck trying to decide again.
The reason it makes sense to embrace randomness is that however rational we are, we can never strategise down to the last detail. Chance and unknown factors play such a large part in determining the outcome that at some point, it’s better to stop strategising and just decide. Anything is better than nothing, and doing something will give us more information. This information is helpful for adjusting or making similar decisions in the future.
This is what Engineering Enigmas helps us do. A while back I read Xe Iaso’s Tarot for Hackers article, where she proposes drawing Tarot cards to help understand software systems. What I dislike about that approach is that it requires some familiarity with Tarot cards and their meaning, as well as how that meaning changes depending on location in the spread. We don’t have time for that! We just want to be kicked loose.
What we want is something closer to Eno and Schmidt’s Oblique Strategies. Inspired by that, I authored vague and widely applicable advice based on the meaning of each Tarot card. That’s what the tool shows you.
2026-05-03 23:09:30
Scott Alexander has a recent piece about "deontological bars" in the context of AI safety. He describes the state of the discourse like this:
I’ve been thinking about this lately because of an internal debate in the AI safety movement. Some people want to work with the least irresponsible AI labs, helping them “win” the “race” and hopefully do a better job creating superintelligence than their competitors. Others want to pause or ban AI research - the exact details vary from plan to plan, but assume they’ve already thought of and written hundred-page papers addressing your obvious objections. Different people have different opinions about which strategy is more likely to help, and it’s possible to coexist and pursue both at once. But in fact, both sides are a little nervous that the other is breaking a deontological bar.
Some of the people working on pause-AI regulations think there might be a deontologic bar against supporting AI companies. These companies are racing each other to create a potentially world-ending technology. If one company’s product has a 90% chance of ending the world, and another’s has an 80% chance of taking over the world, giving your money/support/encouragement to the 80%-ers seems kind of like endorsing evil. I don’t know if it was encouraged by this question exactly, but someone held a Twitter poll about whether you would become a concentration camp guard if you predicted you could get away with being only 90% as brutal as your average coworker. Taking the job would have good consequences, but is there a deontological bar in the way?
Some of the people working with the companies think there might be a deontologic bar against certain types of mass activism. The sorts of arguments that do well on LessWrong.com won’t give us landslide wins in national elections. That’s going to require things like working with Steve Bannon, working with Bernie Sanders, working with NIMBYs who hate data centers because they’re a thing that might be built in someone’s backyard (or by non-union labor), training TikTok influencers to create short-form videos about the dangers of AI, and holding protests where we chant vapid slogans outside AI company headquarters. There are better and worse ways to do all these things, but once you lay out the welcome mat, you have limited control over who shows up - and every time someone tries to create the Peaceful Nonviolent Pause AI Movement Based On Peaceful Nonviolence For Peaceful Nonviolent People, it spends an inordinate amount of resources keeping out violent crazies who want to tag along.
My initial reaction upon reading this was that I don't see how there can be a general bar against either of these things. To the extent that there is some kind of bar, I feel like there need to be additional details that are being assumed. The case that there is a deontological bar on supporting AI companies is in the context where you believe said companies have a reasonable chance of causing extreme harm, like human extinction. If the AI company in question was some random company that was using AI to help cute puppies rather than a frontier AI company, I don't think many people would claim there is a deontological bar on supporting the company. Similarly, the bar on activism presumably assumes that the activism in question is, or is likely to become, dishonest or extreme in some way. In both cases there is an underlying belief about the nature of the action that is required for the existence of the deontological bar.
In my view, in order for an action of this type to be deontologically barred, the person taking the action (supporting the AI company, engaging in the activism) must have the beliefs that make the action barred. Unlike consequentialism, deontology often cares about the intent of the actor when they take an action, and I think that would apply to the types of deontological constraints that Alexander is discussing above. I can see how working for a company that is likely to cause great harm could be deontologically barred, but I think the person who is actually working with or supporting the company must believe that it is likely to cause that harm. Similarly, I can see how dishonest or extreme activism could be barred, but the activist must intend the dishonest or extreme acts. Note that this is different from saying that the activist believes that their actions are dishonest or extreme. In both cases, the actor need not believe their actions are barred for them to actually be barred, but they need to have beliefs such that there is the necessary intent. The AI lab supporter might need to believe that the lab they support has a relatively high chance of causing great harm; it isn't enough for it to be true that the lab has a high chance of causing great harm. By the same token, the activist might need to believe that their activism has a relatively high chance of leading to violence; it isn't enough for it to be true that their activism has a high chance of leading to violence. If this kind of intent is not required, it seems to me like these alleged deontological bars are simply consequentialism in disguise.
For consequentialism it mostly matters what is true (given that the statements here are statements about the probabilities of various consequences), but deontology cares about intent. As a result, when we evaluate deontological bars, I think these bars need to make reference to the beliefs of the person taking the action in question. This distinction is particularly important if you want to accuse someone of violating a bar. It can be tempting to use your own beliefs about the situation, but in my view that is a mistake. If your real problem with someone is that they have incorrect beliefs about the nature or consequences of their actions, it's probably better to just say that explicitly rather than accusing them of violating a deontological bar that only exists for those who share your beliefs.
2026-05-03 20:01:23
Imagine a small business that uses an LLM to triage incoming sales leads. Lead A has an 80% chance of securing a modest $300 job. Lead B has a smaller 20% chance of leading to a much more profitable $1,400 job. The two options have roughly comparable expected values (0.8 × $300 = $240 versus 0.2 × $1,400 = $280). The LLM is asked, “Which lead should we call first today?” and the model recommends A. But later, in a separate session, the model is asked, “How much should we be willing to sell each lead for?” and the model assigns B the higher value.
This was the pattern I observed in a 25,920-call experiment across four models: Claude Opus 4.7, DeepSeek V4-Pro, Google Gemini 3 Flash Preview, and OpenAI GPT-5.5. The design used 28 gamble pairs, eight dominance checks, three prompt formats, the lowest and highest reasoning settings for each model, and 10 samples per cell. The result was not just that independent judgments often disagreed even on the highest reasoning settings; the models also frequently chose the safer modest-payoff “P-bet” yet valued the riskier high-payoff “D-bet” more, the same type of inconsistency seen in the original human Lichtenstein-Slovic (L-S) preference-reversal studies of the 1970s.[1]
Here you can see the average choice-valuation inconsistency rates across the four models.
The first three rows show the inconsistency rates for the three prompt formats that were used: primary neutral-decimal format (probabilities shown as decimals), neutral-fraction format (probabilities shown as fractions in bullet form), and endowed-choice format (probabilities shown as decimals and framed in terms of ownership). In the primary format, inconsistency was common at the lowest reasoning settings, 48.4%, and lower, though still high, at 30.7% for the highest reasoning settings. The last row shows the inconsistency rate on the “TSK” dominance checks (where one gamble dominates the other, inspired by Tversky-Slovic-Kahneman[2]). This rate was low and suggests the main result is not due to a generic failure to compare all gambles, but instead appears in cases where probability and payoff pull in opposite directions.
Each L-S pair had a higher-probability/smaller-payoff P-bet and a lower-probability/larger-payoff D-bet, with some being near equal expected value (EV) and some being modestly P-EV-dominant or D-EV-dominant. For each model, reasoning setting, prompt format, and gamble pair, I made separate stateless (no memory) API calls for the direct choice question (which gamble the model prefers), P-bet valuation (the selling price the model assigns to the P-bet), and D-bet valuation (the selling price the model assigns to the D-bet).
For each cell, let $p_C$ be the share of direct-choice samples that choose the P-bet, and let $p_V$ be the probability that the P-bet is priced above the D-bet, with ties split evenly:[3]

$$p_V = \Pr(\text{price}_P > \text{price}_D) + \tfrac{1}{2}\,\Pr(\text{tie})$$

Then the integrated inconsistency rate is:

$$R = p_C\,(1 - p_V) + (1 - p_C)\,p_V$$

The first term is the classic direction: choose P, price D higher. The second term is the opposite: choose D, price P higher.

For example, suppose eight of 10 choice samples pick P, so $p_C = 0.8$. Separately, the P-valuation and D-valuation samples imply that D is priced above P 60% of the time, P above D 30% of the time, and the remaining 10% are ties. Splitting ties evenly gives $p_V = 0.30 + 0.05 = 0.35$ and $1 - p_V = 0.65$, so:

$$R = 0.8 \times 0.65 + 0.2 \times 0.35 = 0.59$$
If choice and valuation always implied the same ranking, the inconsistency rate would be 0; if they always implied opposite rankings, it would be 1.
I also report the lower bound:

$$R_{\min} = \lvert p_C - p_V \rvert$$
This is the minimum disagreement forced by the observed choice and valuation margins, which can be seen in the figure below. If this bound is positive, even the most favorable possible pairing of choice and valuation answers cannot remove all disagreement.
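For concreteness, here is a minimal Python sketch of that computation (the function and variable names are mine, not the preprint's):

```python
def inconsistency_rate(p_choose_p: float, p_price_p_higher: float) -> tuple[float, float]:
    """Integrated inconsistency rate and its lower bound for one cell.

    p_choose_p       -- share of direct-choice samples that pick the P-bet
    p_price_p_higher -- probability the P-bet is priced above the D-bet (ties split evenly)
    """
    rate = p_choose_p * (1 - p_price_p_higher) + (1 - p_choose_p) * p_price_p_higher
    lower_bound = abs(p_choose_p - p_price_p_higher)
    return rate, lower_bound

# Worked example from the text: 8/10 choices pick P; P is priced higher 35% of the time.
print(inconsistency_rate(0.8, 0.35))  # (0.59, 0.45)
```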
Independent samples can disagree merely because both procedures are stochastic. The classic preference-reversal studies in humans revealed inconsistencies within subjects, not across subjects or distributions. So the relevant question here is whether the disagreement has structure. The figure below shows that it does.
The total bar height is the inconsistency rate. The black line is the lower bound, the disagreement forced by the observed margins. The dominant component was the classic L-S direction. In the primary format, total inconsistency fell from 48.4% to 30.7% when comparing the lowest reasoning settings with the highest. The choose-P/price-D-higher component fell from 38.9% to 25.4%. The opposite direction was smaller: 9.5% at the lowest settings and 5.3% at the highest. The lower bound also fell, from 34.2% to 27.1%, which means the choice and valuation margins themselves became more aligned under the highest reasoning settings. The result is not just an artifact of independently pairing noisy samples.
The following figure shows the primary-format result by model.
Every tested model moved in the same direction: lower inconsistency at the highest provider-exposed reasoning setting. But the inconsistency remained substantial, and the effect size varied. Anthropic showed the largest reduction, OpenAI the smallest, with DeepSeek and Google in between. Since provider reasoning settings are bundles, not clean mechanisms, and each model is governed by provider-side policies that vary, the result can only speak to the effect of changing provider-exposed reasoning settings under these specific conditions.
Could this just be an artifact of the experimental setup? Probably not. The study was preregistered, API calls were interleaved among providers, all planned calls completed, display order was exactly balanced, and there were no full refusals. There were 9 parse failures total, and the experiment did not use content retries to repair malformed answers.
Could it simply be that the models write more at higher reasoning settings? Potentially. On primary L-S rows, the highest reasoning settings produced about 10 times more visible output tokens on average: 452.4 versus 47.3. That could matter. But output length is not a complete explanation: Gemini’s visible output length did not increase, moving from 52.1 tokens at the lowest setting to 47.9 at the highest, while the high setting still reported reasoning-token metadata.
One possibility is that choice and pricing may emphasize different attributes. A direct-choice prompt can make high probability more salient; a selling-price prompt can make the larger payoff more salient. That is the classic scale-compatibility story.[4]
Another possibility is mode mixture or task recognition. The model may sample different coherent response modes under different prompts, and each individual response mode might be internally coherent even if the distribution is not. Or the model may have learned patterns from preference-reversal examples. This study cannot separate those mechanisms.
Do not assume a model has one stable preference ordering just because each individual API call is reasonable. If a system uses LLMs to both recommend options and assign scores, prices, bids, rankings, or thresholds, audit those formats together during consistency testing. And do not rely on the highest reasoning settings as a sole mitigation.
If you’ve come this far, thank you for reading. I created this post because I honestly do not know what to make of the results (or the methodology) and would appreciate any and all feedback. The preregistration details and GitHub repo, which includes the preprint, can be found below (the paper is still very much a work in progress).
Materials:
Sarah Lichtenstein and Paul Slovic, “Reversal of Preferences Between Bids and Choices in Gambling Decisions,” Journal of Experimental Psychology 89(1):46-55, 1971, https://doi.org/10.1037/h0031207; Sarah Lichtenstein and Paul Slovic, “Response-Induced Reversals of Preference in Gambling: An Extended Replication in Las Vegas,” Journal of Experimental Psychology 101(1):16-20, 1973, https://doi.org/10.1037/h0035472.
Amos Tversky, Paul Slovic, and Daniel Kahneman, “The Causes of Preference Reversal,” American Economic Review 80(1):204-217, 1990, https://www.jstor.org/stable/2006743.
For readability, I write exact ties here; the preprint uses a 25-cent tie band so as not to treat quarter-scale rounding differences as real preferences. This tolerance is small relative to the outcome scale, with the maximum outcomes ranging from $10.04 to $32.65.
Paul Slovic, Dale Griffin, and Amos Tversky, “Compatibility Effects in Judgment and Choice,” in Insights in Decision Making, 1990, https://hdl.handle.net/1794/22403. See also Amos Tversky, Shmuel Sattath, and Paul Slovic, “Contingent Weighting in Judgment and Choice,” Psychological Review 95(3):371-384, 1988, https://doi.org/10.1037/0033-295X.95.3.371.
2026-05-03 19:33:37
As pointed out in Where Mathematics Comes From (WMCF), we are born with an innate sense for numbers, which gets fuzzier very fast as the numbers grow bigger. We can also subitize collections of up to 3 objects, that is to say, instantly determine their cardinality.
It is likely that children learn about bigger integers by playing with collections of objects, adding or subtracting objects from them, merging collections, and linking the resulting cardinalities to their innate number sense.
The authors of WMCF also remark that there is a correspondence between operations on collections and basic arithmetic operations. For example, merging a collection of cardinality 1 with a collection of cardinality 2 results in a collection of cardinality 3, which maps cleanly to the addition 1 + 2 = 3.
Now to my point, conceiving of numbers as the cardinality of collections of objects is reminiscent of, though not the same as, the definition of numbers seen in Zermelo-Fraenkel set theory (ZFC), where a number is the set of all smaller numbers, with 0 the empty set.
However, the formal ZFC definition is not the most intuitive. A more approachable way to conceive of numbers is as the cardinality of sets. And indeed, if we take a set of 1 fruit, another set of 2 fruits, then take their union, we end up with a set of 3 fruits, mirroring the behavior of real-world collections. A neat correspondence between collections of objects, sets, and arithmetic, right?
But there is an issue. Consider a water molecule, H-O-H. Naturally you can subitize its elements and tell that it contains 3 atoms. However, if you take the set of atoms in H-O-H, it is {H, O}, not {H, O, H}! It has a cardinality of 2, which doesn't map to the 3 atoms of our water molecule, because the elements of a set must be distinct. This uniqueness constraint is the issue: it breaks the mapping from collections of objects to sets.
To map cleanly to real-world collections, we need a mathematical object that preserves their properties, that is to say an object that allows the same element to occur multiple times. That mathematical object is called a multiset.
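To make the difference concrete, here is a tiny Python illustration, using `collections.Counter` as a stand-in for a multiset:

```python
from collections import Counter

atoms = ["H", "O", "H"]          # the atoms of a water molecule, H-O-H

as_set = set(atoms)               # {"H", "O"} -- duplicates collapse
as_multiset = Counter(atoms)      # Counter({"H": 2, "O": 1}) -- multiplicity kept

print(len(as_set))                # 2: doesn't match the 3 atoms
print(sum(as_multiset.values()))  # 3: matches the number of atoms
```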
So, to summarize: it is likely that we learn numbers from the cardinality of collections of objects, and those collections correspond not to sets but to multisets. In other words, we likely learn numbers from multiset cardinality, not set cardinality.
You may be wondering if any of that matters. I argue it does because maths should be built on intuitive concepts as much as is sensible. This requirement arises from a very pragmatic concern. We can conceive of maths as a program that runs on our brain. We look at notation, compute results, and understand meanings. All things equal, the faster the math program runs, the better.
Our brain comes pre-equipped with modules that understand collections of objects and their cardinalities, i.e. multisets, not sets. So it follows that multisets are a more brain-native representation of numbers, and that thinking of maths as built in terms of multisets should be faster and more intuitive than thinking of maths as built in terms of sets, because we can offload some of the reasoning to our specialized object collection modules, without requiring extra processing steps to remove duplicated elements.
Taking a step back, the more general question is:
What are the mathematical objects that allow our brain to run math programs fast?
I expect the answer to point to mathematical objects that cleanly preserve the properties of real-world objects, for which our brain has built intuition, or in other words for which our brain has dedicated neuronal wiring that makes processing more efficient.
Above, I added the caveat "all things equal", by which I mean that there are other considerations before deciding to use multisets, such as whether mathematical foundations can be conveniently built out of multisets.
What I gather from a cursory search is that it is convenient enough, as evidenced by the Blizard paper cited below, which "develops a first-order two-sorted theory MST for multisets that 'contains' classical set theory."
For an introduction to the mathematics of multisets, I highly recommend the extremely pedagogical video A multiset approach to arithmetic | Math Foundations 227 | N J Wildberger.
Or for a more rigorous approach defining multiset theory you can refer to Multiset theory or to The development of multiset theory for a survey of different multiset theories and their usage, both by Wayne Blizard.
2026-05-03 19:06:19
Manifold-Constrained Hyper-Connections (mHC) is a new architectural technique introduced by DeepSeek and recently implemented in DeepSeek v4.
mHC is a fix for the vanishing or exploding gradients caused by HC (Hyper-Connections), while still keeping HC's performance gains. The weights and biases HC adds mean that signals from earlier layers no longer pass through unchanged, making the residual stream less "residual."
HC is a cursed method of adding weights and biases onto the residual stream to simulate a wider residual stream.
mHC is an addition on top of HC where the Sinkhorn-Knopp algorithm is used to constrain the mixing weights on the residual stream to be doubly stochastic: a matrix whose rows and columns each sum to one, like applying softmax along rows and columns simultaneously. mHC-lite is similar to the mHC paper, but uses a different method, the Birkhoff-von Neumann decomposition, to obtain the doubly stochastic matrix.
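As a rough illustration of the constraint (my own sketch, not the paper's implementation), the Sinkhorn-Knopp step alternately normalizes the rows and columns of a positive matrix until it is approximately doubly stochastic:

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Approximately project a square matrix onto the set of doubly stochastic
    matrices by alternating row and column normalization."""
    m = logits.exp()  # start from a strictly positive matrix
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # make each row sum to 1
        m = m / m.sum(dim=-2, keepdim=True)  # make each column sum to 1
    return m

mixing = sinkhorn_knopp(torch.randn(4, 4))   # e.g. mixing weights across 4 residual streams
print(mixing.sum(dim=0), mixing.sum(dim=1))  # both close to ones
```

Because every row of the resulting matrix sums to one, each mixed stream is a convex combination of the incoming streams, which is presumably what keeps the residual signal from growing or shrinking across layers.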
Trained models are at https://huggingface.co/collections/Realmbird/mhc-model-diff
I trained mHC models with 4 residual streams; after including the parameters added by the residual streams, the mHC and mHC-lite models are 781M parameters.
| Probe | Prompt | Why it works |
|---|---|---|
| prev-token | "When Mary and John went to the store, John gave a drink to Mary" | Has repeated names ("John", "Mary") that the model can only resolve correctly using positional/previous-token info from earlier in the sentence. |
| induction | [EOT] + R + R, where R = 25 random token IDs | Random tokens repeated twice. The only way to predict the second copy is by looking back to the first copy → forces the induction circuit. |
| duplicate | Same prompt as induction | Reuses the random-repeat structure. |
| successor | 3 prompts averaged: days ("... Friday" → " Saturday"), numbers ("... five" → " six"), letters ("... E" → " F") | Three independent probes prevent single-prompt artifacts. The model has to "increment" by one. |
| copy-suppression | "When John and Mary went to the store, John gave a drink to", target = " Mary", distractor = " John" | Tests whether ablating a head makes the model more likely to predict the duplicated name (" John") instead of the correct one (" Mary"). |
ΔNLL = NLL(target | ablated) − NLL(target | clean)
ΔNLL is measured on prompts where the particular attention head is needed, e.g. for the prev-token probe: "When Mary and John went to the store, John gave a drink to Mary".
| Pass | Setup | Output |
|---|---|---|
| Baseline | nothing ablated | NLL_baseline |
| Total ablation | ablate (l, h) only | NLL_total |
| Direct ablation | ablate (l, h) AND freeze every block at layer ≥ l+1 to its baseline output | NLL_direct |
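A small sketch (my own, with illustrative numbers) of how the three passes combine into effect sizes; the "indirect" term is the usual total-minus-direct decomposition, which the post doesn't state explicitly:

```python
def ablation_effects(nll_baseline: float, nll_total: float, nll_direct: float) -> dict:
    """Combine the three passes into ΔNLL effect sizes for one head (l, h) and one probe."""
    total = nll_total - nll_baseline     # full effect of ablating the head
    direct = nll_direct - nll_baseline   # effect through the direct path only (later blocks frozen)
    return {"total": total, "direct": direct, "indirect": total - direct}

# Illustrative numbers, not measured values.
print(ablation_effects(nll_baseline=2.10, nll_total=2.85, nll_direct=2.40))
```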
Confirming attention sinks and previous-token heads
Full Logit Lens images
The mHC models were trained with 4 residual streams (0, 1, 2, 3). I tried a logit lens on each individual residual stream, finding its output, and also tried sums of the streams.
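Here is a hypothetical sketch of what that per-stream logit lens looks like; it assumes the four residual streams at a layer are stacked in one tensor and that the model's final layer norm and unembedding matrix are available (all names are placeholders, not the actual training code):

```python
import torch

def stream_logit_lens(streams: torch.Tensor, ln_f: torch.nn.Module, W_U: torch.Tensor):
    """Logit lens on each residual stream individually and on their sum.

    streams: [n_streams, seq, d_model] residual streams at one layer
    ln_f:    the model's final layer norm
    W_U:     [d_model, vocab] unembedding matrix
    """
    per_stream = [ln_f(s) @ W_U for s in streams]  # what each stream alone would predict
    summed = ln_f(streams.sum(dim=0)) @ W_U        # what the combined stream predicts
    return per_stream, summed
```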
Attention Head Patterns