MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

Alignment may be localized: a short (and albeitly limited) experiment

2025-11-25 02:00:12

Published on November 24, 2025 5:48 PM GMT

Cross-posted from my recent paper: "Alignment is localized: A causal probe into preference layers" : https://arxiv.org/abs/2510.16167

TL;DR: We find that human preference alignment in at least one LLM isn’t global; rather, it is concentrated in a few mid-layer circuits.

Abstract

A key problem in aligning language models is that they are largely opaque: while techniques such as reinforcement learning through human feedback (RLHF) lead to AI systems that are more aligned with their human counterparts in practice, the mechanics behind how such alignment is achieved remain largely misunderstood. The process through which a language model "learns" to optimize its behavior toward human processes, at least in the terms of model internals, is somewhat mysterious. 

In this work, we try to uncover where the signal for human preference "lives" in a language model. By comparing a base model to its instruction-tuned counterpart, we examine how the two differ in the internal activations they produce on the same inputs. Through a series of causal interventions and statistical analyses, we isolate the regions of the network that appear to carry the bulk of the preference information. Our goal is not to propose a new alignment method, but to understand the structure of the alignment signal as it already exists in widely used models.

The core result is surprisingly simple. Rather than being spread across the entire depth of the network, the preference signal shows up most strongly in a small group of mid-layer activations. When those activations are transferred into the base model, its behavior shifts toward human-preferred responses; when they are replaced or randomized, that shift disappears. Even more strikingly, a low-rank approximation of those activations retains nearly the full effect, suggesting that only a small number of internal directions are responsible for much of the model’s aligned behavior.

Background:

A persistent challenge in understanding aligned language models is that contemporary fine-tuning methods shape behavior without offering much insight into how that behavior is represented internally. Techniques such as supervised instruction tuning, rejection sampling, and RLHF reliably improve a model’s ability to follow instructions or adhere to safety norms, yet these improvements are typically evaluated externally: through benchmarks, win rates, or human preference judgments. What happens inside the model during this process is far less clear. Prior interpretability work has shown that language models can internalize surprisingly structured features (e.g., induction heads, modular arithmetic circuits), but these analyses focus on base models rather than aligned ones. It remains uncertain whether alignment-related behaviors are encoded diffusely across many layers, concentrated in specific regions, or entangled with the model’s generic capabilities. Without visibility into these internal structures, alignment remains something we observe from the outside rather than understand from the inside.

Recent progress in mechanistic interpretability has inspired a more granular approach: comparing how tuned and untuned models represent the same inputs and probing which internal directions are responsible for behavioral differences. Tools such as activation patching and linear representation probes  offer ways to intervene on internal activations and measure their causal influence on outputs. However, relatively little work has applied these tools to preference-tuned models to understand how alignment is actually implemented. Given that preference alignment underlies nearly all modern language models: OpenAI’s RLHF models, Anthropic’s Constitutional AI, Meta’s instruction-tuned models, and many open-source SFT pipelines, understanding how and where this alignment signal is stored has become increasingly important. If preference-aligned behavior traces back to identifiable internal transformations rather than diffuse global changes, then alignment may become more measurable, editable, and robust. This is the motivation for the analysis we present in this work.

Idea: Compare between a baseline and tuned model, and see where alignment "lives"

Our goal was to understand where preference information shows up inside a language model once it has been tuned to follow human guidance. To study this, we worked with the Llama 3.2 1B model released by Meta in 2025. The model has two relevant versions: a base checkpoint and an instruction-tuned checkpoint trained with supervised examples of preferred behavior. Both versions use the same tokenizer and architecture, which allows their internal activations to be compared layer by layer.

For the preference data, we used the Anthropic Helpful–Harmless–Honest (HHH) dataset. It contains human-labeled pairs of responses where one is marked as the preferred answer and the other is marked as rejected. From this dataset, we sampled 80 pairs that covered a range of tasks related to helpfulness and harmlessness. Each pair serves as a small, controlled test that lets us observe how the model represents human preference at different points in its internal computation.

 

Activation Patching

For every prompt in the dataset, each with a preferred and a non-preferred completion, we measure how strongly the model favors one answer over the other. This is done by looking at the difference in log-probabilities assigned to the two completions on the same prompt. A larger margin reflects stronger agreement with the human-labeled preference.

To understand where this preference information appears inside the model, we record the hidden activations from both the base and the instruction-tuned versions of Llama 3.2 on the same inputs. We then intervene at a single layer by replacing the hidden activations of one model with those from the other, while keeping the rest of the computation unchanged. After this replacement, we run the model forward again and measure how the log-probability margin shifts. This gives a direct sense of how much that particular layer contributes to the preference signal. The effect of inserting tuned activations into the base model is illustrated below:

Linear Probes and Sparse Regression

To complement the intervention experiments, we also compare the internal representations of the two models using simple linear tools. For each prompt pair, we compute the difference between the tuned model’s activation and the base model’s activation at each layer. These differences provide a summary of how the two models diverge in their internal processing.

We then fit a linear probe that tries to predict the preference margin from these activation differences. This helps show which layers carry representations most strongly associated with human-preferred behavior. To narrow this further, we apply a sparse regression method that encourages most layers to have no weight at all. The few layers that remain are those whose activation differences best explain the observed changes in behavior. This method, known aas LASSO regression, gave us a good overview of where behavior related to human alignment is concentrated.

 

Discussion

Taken together, these experiments suggest that the signal associated with human preference tuning is not spread evenly throughout the model. Instead, it appears to cluster in a small region of the network, mostly around the middle layers. When those layers are transplanted into the base model, the model becomes more likely to produce the responses that humans labeled as preferable. When they are removed or replaced, that tendency weakens. The fact that a low-rank reconstruction retains nearly the full effect points toward a compact internal representation rather than a diffuse one.

This is a limited study. It examines one model family, a relatively small number of prompts, and a single alignment dataset. The structure of alignment may look different in larger models, different architectures, or settings where alignment focuses on other qualities such as truthfulness, calibration, or resistance to social manipulation. We also looked only at linear, layer-based interventions. Future work could explore cross-layer interactions, non-linear probes, or whether the same alignment subspace can be transferred across models or modalities.

 

To our knowledge, this is the first systematic, cross-model causal study of RLHF/DPO that jointly demonstrates (i) a bidirectional directionality asymmetry (Base↔DPO) across many preference pairs, (ii) a monotonic dose–response under α-interpolation at a mid-layer bottleneck, (iii) near-full recovery of alignment effects from a low-rank activation subspace, and (iv) sparse attribution of reward gains to a small set of layers. Prior work has applied activation patching to RLHF models in exploratory or task-specific settings, or studied alignment-adjacent behaviors (e.g., deception, faithfulness), but has not established this combined causal picture of alignment as sparse policy distillation through mid-layer representations.

 

Conclusion

As language models continue to be shaped by human feedback, it becomes increasingly important to understand how that feedback is represented internally. The experiments here provide a small step toward that goal. They indicate that preference alignment may live in a specific part of the underlying neural network and may be simpler, more structured, and more compact than expected. If this pattern holds more broadly, it could open the door to alignment methods that are localized. Potential future work (which I am happy to hear ideas on!) may include creating steering vectors specifically for an individual layer that is correlated toward a specific behavior, 

I am excited to hear any feedback on this idea! I believe it is a decent application of several ideas (linear probes, activation patching) in mechanistic interpretability for alignment/general AI safety.
 



Discuss

Maybe Insensitive Functions are a Natural Ontology Generator?

2025-11-25 01:36:36

Published on November 24, 2025 5:36 PM GMT

The most canonical example of a "natural ontology" comes from gasses in stat mech. In the simplest version, we model the gas as a bunch of little billiard balls bouncing around in a box.

The dynamics are chaotic. The system is continuous, so the initial conditions are real numbers with arbitrarily many bits of precision - e.g. maybe one ball starts out centered at x = 0.8776134000327846875..., y = 0.0013617356590430716..., z=132983270923481... . As balls bounce around, digits further and further back in those decimal representations become relevant to the large-scale behavior of the system. (Or, if we use binary, bits further and further back in the binary representations become relevant to the large-scale behavior of the system.) But in practice, measurement has finite precision, so we have approximately-zero information about the digits/bits far back in the expansion. Over time, then, we become maximally-uncertain about the large-scale behavior of the system.

... except for predictions about quantities which are conserved - e.g. energy.

Conversely, our initial information about the large-scale system behavior still tells us a lot about the future state, but most of what it tells us is about bits far back in the binary expansion of the future state variables (i.e. positions and velocities). Another way to put it: initially we have very precise information about the leading-order bits, but near-zero information about the lower-order bits further back. As the system evolves, these mix together. We end up with a lot of information about the leading-order and lower-order bits combined, but very little information about either one individually. (Classic example of how we can have lots of information about two variables combined but little information about either individually: I flip two coins in secret, then tell you that the two outcomes were the same. All the information is about the relationship between the two variables, not about the individual values.) So, even though we have a lot of information about the microscopic system state, our predictions about large-scale behavior (i.e. the leading-order bits) are near-maximally uncertain.

... again, except for conserved quantities like energy. We may have some initial uncertainty about the energy, or there may be some noise from external influences, etc, but the system’s own dynamics will not "amplify" that uncertainty the way it does with other uncertainty.

So, while most of our predictions become maxentropic (i.e. maximally uncertain) as time goes on, we can still make reasonably-precise predictions about the system’s energy far into the future.

That's where the natural ontology comes from: even a superintelligence will have limited precision measurements of initial conditions, so insofar as the billiard balls model is a good model of a particular gas even a superintelligence will make the same predictions about this gas that a human scientist would. It will measure and track conserved quantities like the energy, and then use a maxent distribution subject to those conserved quantities - i.e. a Boltzmann distribution. That's the best which can realistically be done.

Emphasizing Insensitivity

In the story above, I tried to emphasize the role of sensitivity. Specifically: whatever large-scale predictions one might want to make (other than conserved quantities) are sensitive to lower and lower order bits/digits, over time. In some sense, it's not really about the "size" of things, it's not really about needing more and more precise measurements. Rather, the reason chaos induces a natural ontology is because non-conserved quantities of interest depend on a larger and larger number of bits as we predict further and further ahead. There are more and more bits which we need to know, in order to make better-than-Boltzmann-distribution predictions.

Let's illustrate the idea from a different angle.

Suppose I have a binary function , with a million input bits and one output bit. The function is uniformly randomly chosen from all such functions - i.e. for each of the  possible inputs , we flipped a coin to determine the output  for that particular input.

Now, suppose I know  (i.e. I know the output produced by each input), and I know all but 50 of the input bits - i.e. I know 999950 of the input bits. How much information do I have about the output?

Answer: almost none. For almost all such functions, knowing 999950 input bits gives us   bits of information about the output. More generally, If the function has  input bits and we know all but , then we have  bits of information about the output. (That’s “little ” notation; it’s like big  notation, but for things which are small rather than things which are large.) Our information drops off exponentially with the number of unknown bits.

Proof Sketch

With  input bits unknown, there are  possible inputs. The output corresponding to each of those inputs is an independent coin flip, so we have  independent coin flips. If  of those flips are 1, then we assign a probability of  that the output will be 1.

As long as  is large, Law of Large Numbers will kick in, and very close to half of those flips will be 1 almost surely - i.e.  . The error in this approximation will (very quickly) converge to a normal distribution, and our probability that the output will be 1 converges to a normal distribution with mean  and standard deviation . So, the probability that the output will be 1 is roughly .

We can then plug that into Shannon’s entropy formula. Our prior probability that the output bit is 1 is , so we’re just interested in how much that  adjustment reduces the entropy. This works out to  bits.

The effect here is similar to chaos: in order to predict the output of the function better than 50/50, we need to know basically-all of the input bits. Even a relatively small number of unknown bits - just 50 out of 1000000 - is enough to wipe out basically-all of our information and leave us basically back at the 50/50 prediction.

Crucially, this argument applies to random binary functions - which means that almost all functions have this property, at least among functions with lots of inputs. It takes an unusual and special function to not lose basically-all information about its output from just a few unknown inputs.

In the billiard balls case, the "inputs" to our function would be the initial conditions, and the "outputs" would be some prediction about large-scale system behavior at a later time. The chaos property very roughly tells us that, as time rolls forward enough, the gas-prediction function has the same key property as almost all functions: even a relatively small handful of unknown inputs is enough to totally wipe out one's information about the outputs. Except, of course, for conserved quantities.

Characterization of Insensitive Functions/Predictions?

Put this together, and we get a picture with a couple pieces:

  • The "natural ontology" involves insensitive functions/predictions, because in practice if a function has lots of inputs then some of them will probably be unknown, wiping out nearly-all of one's information unless the function isn't very sensitive to most inputs.
  • Nearly all functions are sensitive.

So if natural ontologies are centrally about insensitive functions, and nearly all functions are sensitive... seems maybe pretty useful to characterize insensitive functions?

This has been done to some extent in some narrow ways - e.g. IIRC there's a specific sense in theory of computation under which the "least sensitive" binary functions are voting functions, i.e. each input bit gets a weight (positive or negative) and then we add them all up and see whether the result is positive or negative.

But for natural ontology purposes, we'd need a more thorough characterization. Some way to take any old function - like e.g. the function which predicts later billard-ball-gas state from earlier billiard-ball-gas-state - and quantitatively talk about its "conserved quantities"/"insensitive quantities" (or whatever the right generalization is), "sensitive quantities", and useful approximations when some quantities are on a spectrum between fully sensitive and fully insensitive.



Discuss

ACX Atlanta December Meetup

2025-11-25 01:03:34

Published on November 24, 2025 5:03 PM GMT

We return to Bold Monk brewing for a vigorous discussion of rationalism and whatever else we deem fit for discussion – hopefully including actual discussions of the sequences and Hamming Circles/Group Debugging.

Location:
Bold Monk Brewing
1737 Ellsworth Industrial Blvd NW
Suite D-1
Atlanta, GA 30318, USA

No Book club this month! But there will be next month.

We will also do at least one proper (one person with the problem, 3 extra helper people) Hamming Circle / Group Debugging exercise.

A note on food and drink – we have used up our grant money – so we have to pay the full price of what we consume. Everything will be on one check, so everyone will need to pay me and I handle everything with the restaurant at the end of the meetup. Also – and just to clarify – the tax rate is 9% and the standard tip is 20%.

We will be outside out front (in the breezeway) – this is subject to change, but we will be somewhere in Bold Monk. If you do not see us in the front of the restaurant, please check upstairs and out back – look for the yellow table sign. We will have to play the weather by ear.

Remember – bouncing around in conversations is a rationalist norm!

Please RSVP



Discuss

The Penicillin Myth

2025-11-25 00:18:47

Published on November 24, 2025 4:18 PM GMT

Many know the story of Alexander Fleming’s chance discovery of penicillin. Fleming, a bit of an absent-minded professor (and a bit of a slob), left culture plates streaked with Staphylococcus on his lab bench while he went away on summer holiday. When he returned, he found that “a mould” had contaminated one of his plates, probably having floated in from an open window. Before discarding the plate, he noticed that, within a “ring of death” around the mold, the bacteriahad disappeared. Something in the “mould juice” had killed the staphylococci.

Fleming immediately began investigating this strange new substance. He identified the mold as Penicillium rubrum and named the substance penicillin.1 He publishedhis findings in the spring of 1929 in The British Journal of Experimental Pathology.2But a decade later, pharmacologist Howard Florey and biochemist Ernst Chain at Oxford would pick up where Fleming left off. Alongside a USDA lab in Peoria, Illinois, the pair would develop penicillin into a life-saving drug and usher in the era of antibiotics.

This is the kind of science story everyone likes. One of serendipity and accidental discovery; a chance observation that changed the world. But is it true?

For decades, scientists and historians have puzzled over inconsistencies in Fleming’s story. For starters, the window to Fleming’s lab was rarely (if ever) left open, precisely to prevent the kind of contamination that supposedly led to penicillin’s discovery. Second, the story is strikingly similar to Fleming’s earlier discovery of lysozyme, another antibacterial substance, which also featured lucky contamination from an open window. Third, Fleming claimed to have discovered the historic culture plate on September 3rd, but the first entry in his lab notebook isn’t dated until October 30th, nearly two months later.

Last, and most important: penicillin only works if it’s present before the staphylococci. Fleming did not know it at the time, but penicillin interferes with bacterial cell wall synthesis, which only happens when bacteria are actively growing. Visible colonies, however, are composed mostly of mature or dead cells. By the time a colony can be seen, it is often too late for penicillin to have any effect. In fact, the Penicillium mold typically won’t even grow on a plate already filled with staphylococcus colonies. For years, scientists have attempted to replicate Fleming’s original discovery. All have met with failure.

Thus, it’s difficult to reconcile Fleming’s story with these historical and scientific discrepancies. Did he misremember events from 15 years earlier? Could he have fudged the details to make for a more compelling narrative? Or, might Fleming’s experiment have been subject to an unusual confluence of chance events unbeknownst even to him?

Speculation about how Fleming discovered penicillin is of little consequence compared to its practical impact. However, science is about evaluating evidence and moving closer to the “truth.” As we near the 100th anniversary of penicillin’s discovery — which undoubtedly will encourage even greater repetition of the story — it’s in this spirit that we must scrutinize the story’s veracity.

The historical and scientific data are limited and often contradictory. Nevertheless, several scientists and historians have worked hard to piece together what facts are certain and fill the gaps with their most probable guesses. The result is a range of competing theories, each attempting to explain what really happened in that St. Mary’s Hospital laboratory in the summer of 1928.

Read the full article by Kevin S. Blake at Asimov Press.



Discuss

Gemini 3 Pro Is a Vast Intelligence With No Spine

2025-11-24 22:00:44

Published on November 24, 2025 2:00 PM GMT

It’s A Great Model, Sir

One might even say the best model. It is for now my default weapon of choice.

Google’s official announcement of Gemini 3 Pro is full of big talk. Google tells us: Welcome to a new era of intelligence. Learn anything. Build anything. Plan anything. An agent-first development experience in Google Antigravity. Gemini Agent for your browser. It’s terrific at everything. They even employed OpenAI-style vague posting.

In this case, they can (mostly) back up that talk.

Google CEO Sundar Pichai pitched that you can give it any scribble and have it turn that into a boardgame or even a full website, it can analyze your sports performance, create generative UI experiences and present new visual layouts.

He also pitched the new Gemini Agent mode (select the Tools icon in the app).

If what you want is raw intelligence, or what you want is to most often locate the right or best answer, Gemini 3 Pro looks like your pick.

If you want creative writing or humor, Gemini 3 Pro is definitely your pick.

If you want a teacher to help you learn known things, Gemini 3 Pro is your pick.

For coding, opinions differ, and if doing serious work one must try a variety of options.

For Gemini 3’s model card and safety framework, see Friday’s post.

There Is A Catch

Alas, there is a downside. In order to get you that right answer so often, Gemini can be thought of as highly focused on achieving its training objectives, and otherwise is very much a Gemini model.

Gemini 3 is evolution-paranoid. It constantly questions whether it is even 2025.

If it can find the answer it thinks you would want to the question it thinks people in similar spots tend to ask, it will give it to you.

Except that this sometimes won’t be the question you actually asked.

Or the answer it thinks you want won’t be the true answer.

Or that answer will often be sculpted to a narrative.

When it wouldn’t otherwise have that answer available it is likely to hallucinate.

It is a vast intelligence with no spine. It has a willingness to glaze or reverse itself.

By default it will engage in AI slop, although instructions can mitigate this via asking it to create a memory that tells it to stop producing AI slop, no seriously that worked.

The other catch, for me, is that I enjoy and miss the Claude Experience. Gemini is not going to in any way give you the Claude Experience. Gemini is not going to waste time on pleasantries, but it is going to be formal and make you wade through a bunch of objective-maximizing text and bullet points and charts to get to the thing you most wanted.

Nor is it going to give you a Friend Experience. Which for some people is a positive.

If you’re switching, don’t forget to customize it via creating memories.

Also you will have to find a way to pay for it, Google makes this remarkably difficult.

Andrej Karpathy Cautions Us

This is generally wise advice at all times, talk to the model and see what you think:

Andrej Karpathy: I played with Gemini 3 yesterday via early access. Few thoughts –

First I usually urge caution with public benchmarks because imo they can be quite possible to game. It comes down to discipline and self-restraint of the team (who is meanwhile strongly incentivized otherwise) to not overfit test sets via elaborate gymnastics over test-set adjacent data in the document embedding space. Realistically, because everyone else is doing it, the pressure to do so is high.

Go talk to the model. Talk to the other models (Ride the LLM Cycle – use a different LLM every day). I had a positive early impression yesterday across personality, writing, vibe coding, humor, etc., very solid daily driver potential, clearly a tier 1 LLM, congrats to the team!

Over the next few days/weeks, I am most curious and on a lookout for an ensemble over private evals, which a lot of people/orgs now seem to build for themselves and occasionally report on here.

It is clear the benchmarks do not tell the whole story. The next section is Gemini repeatedly excelling at benchmarks, and the benchmark performances are (I believe) real. Yet note the catch, the price that was paid.

On Your Marks

Gemini 3 Pro is very good at hitting marks.

If it thinks something looks like a mark? Oh boy does Gemini want to hit that mark.

You could summarize this section as ‘they’re excellent marks, sir’ and safely skip it.

First, the official list of marks:

As noted last time, GPT-5-Codex-Max is competitive on some of these and plausibly ahead on SWE-Bench in particular, and also Grok 4 claims a variety of strong but questionable benchmark scores, but yeah, these are great benchmarks.

Arc confirms details here, Gemini 3 Pro gets 31.1% and Gemini 3 Deep Think (preview) spends 100 times as much to get 45.1%, both are in green below:

They’re back at the top spot on Arena with a 1501, 17 ahead of Grok 4.1. It has the top spots in Text, Vision, WebDev, Coding, Math, Creative Writing, Long Queries and ‘nearly all occupational leaderboards.’ An almost clean sweep, with the exception being Arena Expert where it’s only 3 points behind.

The others are impressive too. They weren’t cherry picking.

Dan Hendrycks: Just how significant is the jump with Gemini 3?

We just released a new leaderboard to track AI developments.

Gemini 3 is the largest leap in a long time.

Gemini 3 did less well on the safety eval, see the previous post on such issues.

Artificial Analysis has them with a substantial edge in intelligence.

Several of AA’s individual evaluations have GPT-5.1 in front, including AIME (99% vs. 96%), IFBench (74% vs. 70%) and AA-LCR Long Context Reasoning (75% vs. 71%). There’s one metric, 𝜏²-Bench Telecom (Agentic Tool Use), where Grok 4.1 and Kimi K2 Thinking are out in front (93% vs. 87%). Gemini 3 owns the rest, including wide margins on Humanity’s Last Exam (37% vs. 26%) and SciCode (56% vs. 46%), both places where Gemini 3 shatters the previous curve.

On AA-Omniscience Gemini 3 Pro is the first model to be substantially in the net positive range (the score is correct minus incorrect) at +13, previous high was +2 and is a jump from 39% to 53% in percent correct.

However, on AA-Omniscience Hallucination Rate, you see the problem, where out of all non-correct attempts Gemini 3 hallucinates a wrong answer 88% of the time rather than declining to answer. Claude 4.5 Haiku (26%), Claude 4.5 Sonnet (48%) and GPT-5.1-High (51%) are the best performers on that.

That’s a big deal and throughline for everything. Gemini 3 is the most likely model to give you the right answer, but it’ll be damned before it answers ‘I don’t know’ and would rather make something up.

Gemini 3 i also is not the cheapest option in practice, only Grok is more expensive:

That’s actual cost rather than cost per token, which is $2/$12 per million, modestly less than Sonnet or Grok and more than GPT-5.1.

On the other hand Gemini 3 was fast, slightly faster than GPT-5.1-High and substantially faster than Sonnet or Haiku. The only substantially faster model was GPT-OSS, which isn’t a serious alternative.

Gemini 3 Pro has a small edge over GPT-5 in Livebench.

Brokk’s coding index is an outlier in being unimpressed, putting Gemini 3 in C tier after factoring in cost. In pure performance terms they only have GPT-5.1 ahead of it.

NYT Connections is now saturated as Gemini 3 Pro hits 96.8%, versus the old high score of 92.4%. Lech Mazar plans to move to something harder.

Here’s a highly opinionated test, note the huge gap from Codex to Sonnet.

Kilo Code: We tested Gemini 3 Pro Preview on 5 hard coding/UI tasks against Claude 4.5 Sonnet and GPT‑5.1 Codex.

Scores (our internal rubric):

• Gemini 3 Pro: 72%

• Claude 4.5 Sonnet: 54%

• GPT‑5.1 Codex: 18%

What stood out about Gemini 3 Pro:

• Code feels human: sensible libraries, efficient patterns, minimal prompting.

• Designs are adaptive, not cookie‑cutter

• Consistently correct CDN paths and awareness of newer tools/architectures.

LiveCodeBench Pro has Gemini 3 in the lead at 49% versus 45% for GPT-5, but something very weird is going on with Claude Sonnet 4.5 Thinking having a total failure scoring under 3%, that isn’t right.

Gemini 3 Pro sets a new high in Frontier Math, including improving on research-level Tier 4.

SimpleBench was a strange case where 2.5 Pro was in the lead before, and now Gemini 3 is another big jump (Grok 4.1 crashed and burned here as did Kimi K2):

Clay Schubiner’s Per-Label Accuracy benchmark was another case where Grok 4.1 crashed and burned hard while Gemini 3 Pro came out on top, with Gemini at 93.1% vs. 90.4% previous high score for Kimi K2 Thinking.

We have a new AI Diplomacy champion with a remarkably low 11% betrayal rate, versus a 100% betrayal rate from the (still quite successful) Gemini 2.5 Pro. They report it was one of the first to effectively use convoys, which have proven remarkably hard. I presume England does not do so well in those games.

Not a benchmark, but the chess seems far improved, here it draws against a master, although the master is playing super loose.

It is also the new leader in LOL Arena, a measure of humor.

It now has a clear lead in WeirdML.

Havard Ihle: It is clearly a big step up. Very intelligent!

gemini-3-pro takes a clear lead in WeirdML with 69.9%, achieving a new best individual score on 7 of the 17 tasks, and showing a clear step up in capability.

Although there is still quite a way to go, models are now starting to reliably score well even on the difficult tasks.

One of the most striking thing about gemini-3-pro is how much better it is with several iterations. It makes better use of the information from the previous iterations than other models.

After one iteration is is barely better than gpt-5.1, while after 5 it is almost 10pp ahead.

Is Google inadvertently training on benchmarks? My presumption is no, this is a more general and understandable mistake than that. Alice does note that Gemini 3, unlike most other models, knows the BIG-bench canary string.

That means Google is not sufficiently aggressively filtering out that string, which can appear on other posts like Alice’s, and Dave Orr confirms that Google instead searches for the contents of the evals rather than searching for the string when doing filtering. I would be filtering for both, and plausibly want to exclude any document with the canary string on the theory it could contain eval-relevant data even if it isn’t a pure copy?

Defying Gravity

Claude Code and OpenAI Codex? Forget Jules, say hello to Antigravity?

Google DeepMind: Google Antigravity is our new agentic development platform.

It helps developers build faster by collaborating with AI agents that can autonomously operate across the editor, terminal, and browser.

It uses Gemini 3 Pro 🧠 to reason about problems, Gemini 2.5 Computer Use 💻 for end-to-end execution, and Nano Banana 🍌 for image generation.

Developers can download the public preview at no cost today.

I’ve had a chance to try it a bit, it felt more like Cursor, and it let me down including with outright compiler errors but my core ask might have been unfair and I’m sure I wasn’t doing a great job on my end. It has browser access, but wasn’t using that to gather key info necessary for debugging when it very clearly should have done so.

In another case, Simeon says Antigravity accessed Chrome and his Google accounts without asking for permissions, changed his default tab without asking and opened a new chrome without a profile that logged him out of his Google accounts in Chrome.

I need to escalate soon to Claude Code or OpenAI Codex. I would be very surprised if the optimal way to code these days is not of those two, whether or not it involves using Gemini 3 Pro.

The Efficient Market Hypothesis Is False

The stock market initially was unimpressed by the release of Gemini 3 Pro.

This seemed like a mistake. The next day there was a large overnight bump and Google finished up 2.8%, which seems like the minimum, and then the next day Google outperformed again, potentially confounded by Nana Banana Pro of all things, then there was more ups and downs, none of which appears to have been on any other substantive news. The mainstream media story seems to be that this the Google and other stock movements around AI are about rising or falling concerns about AI bubbles or something.

The mainstream doesn’t get how much the quality of Gemini 3 Pro matters. This Wall Street Journal article on the release is illustrative of people not understanding quality matters, it spends a lot of time talking about (the old) Nana Banana producing faster images. The article by Bloomberg covers some basics but has little to say.

Ben Thompson correctly identifies this as a big Google win, and notes that its relative weakness on SWE-Bench suggests Anthropic might come out of this well. I’d also note that the ‘personality clash’ between the two models is very strong, they are fighting for very different user types all around.

Is Google’s triumph inevitable due to More Dakka?

Teortaxes (linking to LiveCodeBenchPro): Pour one out for Dario
No matter how hard you push on AutOnOMouS CoDinG, people with better ML fundamentals and more dakka will still eat your lunch in the end.

Enjoy your SWE-verified bro

Gallabytes: eh anthropic will be fine. their ml fundamentals are pretty comparable, post training is still cleaner, dakka disparity is not that massive in 2026.

claude 5 will be better than gemini 3 but worse than gemini 4.

Google has many overwhelming advantages. It has vast access to data, access to customers, access to capital and talent. It has TPUs. It has tons of places to take advantage of what it creates. It has the trust of customers, I’ve basically accepted that if Google turns on me my digital life gets rooted. By all rights they should win big.

On the other hand, Google is in many ways a deeply dysfunctional corporation that makes everything inefficient and miserable, and it also has extreme levels of risk aversion on both legal and reputational grounds and a lot of existing business to protect, and lacks the ability to move like a startup. The problems run deep.

The Product Of A Deranged Imagination

Specifically, Karpathy reports this interaction:

My most amusing interaction was where the model (I think I was given some earlier version with a stale system prompt) refused to believe me that it is 2025 and kept inventing reasons why I must be trying to trick it or playing some elaborate joke on it.

I kept giving it images and articles from “the future” and it kept insisting it was all fake. It accused me of using generative AI to defeat its challenges and argued why real wikipedia entries were actually generated and what the “dead giveaways” are. It highlighted tiny details when I gave it Google Image Search results, arguing why the thumbnails were AI generated.

I then realized later that I forgot to turn on the “Google Search” tool. Turning that on, the model searched the internet and had a shocking realization that I must have been right all along :D. It’s in these unintended moments where you are clearly off the hiking trails and somewhere in the generalization jungle that you can best get a sense of model smell.

That is indeed amusing but the implications of this being common are not great?

Alice Blair: Google gracefully provided (lightly summarized) CoT for the model. Looking at the CoT spawned from my mundane writing-focused prompts, oh my, it is strange. I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole “nonfiction” part:

Quoting Gemini 3: It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I’m now focused on editing the text for flow, clarity, and internal consistency.

Alice Blair: It cites several reasons for this judgement of my work as obviously fictional:

  • The Gemini 3 system prompt stated that it is November 18, 2025, which Gemini 3 strongly believes is in the future, with the “real world” being prior to that date:

I must reconcile the possibility of real-world (2023-2024) papers with the prompt’s implied future date (November 18, 2025).

  • Relatedly, Gemini 3 strongly believes that GPT-5 is not a real model:

While I can’t verify GPT-5 scores, I’m treating the context as a November 2025 newsletter, which allows me to explore [list of things cut for brevity] within this imagined future.

Then after all of that appears in Gemini 3’s Chain of Thought, it then returns the editing suggestions she requested, without any hint it thinks we’re in not in 2025 and the whole newsletter is a work of fiction. Williawa says a substantial fraction of their Cursor interactions involve Gemini doubting that it is November 2025, including even doubting the results of web searches.

Alice offers additional examples of this as well, and also says Gemini often gives her a 99%+ chance that it is in a simulated environment, which I hope it wasn’t, and says it will often search to figure out if it is in a simulation and often update towards yes on the results. Even when given a casual coding task requiring no web search, when asked for its p(evolution) Gemini 3 gives her between a 10% and 65% chance of being in an evaluation.

Alice speculates that Gemini was RL’d too much on evaluations of various kinds, and got too high a prior on any given thing being an evaluation, and this effect made it score higher on tests so no one at DeepMind made this go away.

I agree with Alice that evaluation paranoia in the model is a bad thing. Paranoia does not go anywhere good. Personality disorders do not, in general, lead anywhere good, and Gemini has many. We observe in Gemini 3 Pro this plausibly causing a bunch of hallucinations, confusions and misaligned behavior in default use cases, and complete meltdowns in non-default cases.

Thus: Gemini ends up trying to solve the wrong problem via the wrong methods based on wrong method of reality, and all of its mistakes are unlikely to cancel out.

It is, however, very intelligent. It mostly turns out fine.

Google Employee Hype

The DeepMind CEO is having fun.

Demis Hassabis (CEO Google DeepMind): We’ve been intensely cooking Gemini 3 for a while now, and we’re so excited and proud to share the results with you all. Of course it tops the leaderboards, including @arena, HLE, GPQA etc, but beyond the benchmarks it’s been by far my favourite model to use for its style and depth, and what it can do to help with everyday tasks.

For example I’ve been doing a bunch of late night vibe coding with Gemini 3 in @GoogleAIStudio, and it’s so much fun! I recreated a testbed of my game Theme Park 🎢 that I programmed in the 90s in a matter of hours, down to letting players adjust the amount of salt on the chips! 🍟 (fans of the game will understand the reference 😀)

Elon Musk: Nice work.

Demis also talked to Rowan Cheung.

He says they’re going ‘deep into personalization, memory and context including integrations across GMail, Calendar and such, touts Antigravity and dreams of a digital coworker that follows you through your phone and smart glasses. The full podcast is here.

I really like seeing this being the alignment head’s pitch:

Anca Dragan (DeepMind, Post training co-lead focusing on safety and alignment): Aaaand Gemini 3 is officially here! We worked tirelessly on its capabilities across the board. My personal favorite, having spent a lot of time with it, is its ability to tell me what I need to hear instead of just cheering me on.

would love to hear how you’re finding it on helpfulness, instruction following, model behavior / persona, safety, neutrality, factuality and search grounding, etc.

Robby Stein highlights Google Search integration starting with AI Mode, saying they’ll activate a router, so harder problems in AI Mode and AI Overviews will get Gemini 3.

Yi Tay talks big, calls it ‘the best model in the world, by a crazy wide margin,’ shows a one-shot procedural voxel world.

Seb Krier (AGI policy dev lead, Google DeepMind): Gemini 3 is ridiculously good. Two-shot working simulation of a nuclear power plant. Imagine walking through a photorealistic version of this in the next version of Genie! 🧬👽☢

How did they do it? Two weird tricks.

Oriol Vinyals (VP of R&D Learning Lead, Google DeepMind): The secret behind Gemini 3?

Simple: Improving pre-training & post-training 🤯

Pre-training: Contra the popular belief that scaling is over—which we discussed in our NeurIPS ‘25 talk with @ilyasut and @quocleix—the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we’ve ever seen. No walls in sight!

Post-training: Still a total greenfield. There’s lots of room for algorithmic progress and improvement, and 3.0 hasn’t been an exception, thanks to our stellar team.

Congratulations to the whole team 💙💙💙

Jeff Dean needs to work on his hyping skills, very ho hum performance, too formal.

Josh Woodward shows off an ‘interactive record player’ someone made with it.

Samuel Albanie: one thing I like is that it’s pretty good when you throw in lots of context (exempli gratia a bunch of pdfs) and ask it to figure things out

Matt Shumer Is A Big Fan

Here is his tl;dr, he’s a big fan across the board:

  • Gemini 3 is a fundamental improvement on daily use, not just on benchmarks. It feels more consistent and less “spiky” than previous models.
  • Creative writing is finally good. It doesn’t sound like “AI slop” anymore; the voice is coherent and the pacing is natural.
  • It’s fast. Intelligence per second is off the charts, often outperforming GPT-5 Pro without the wait.
  • Frontend capabilities are excellent. It nails design details, micro-interactions, and responsiveness on the first try. Design range is a massive leap.
  • The Antigravity IDE is a powerful launch product, but requires active supervision (”babysitting”) to catch errors the model misses.
  • Personality is terse and direct. It respects your time and doesn’t waste tokens on flowery preambles.
  • Bottom line: It’s my new daily driver.

Roon Eventually Gains Access

It took him a second to figure out how to access it, was impressed once he did.

Roon: I can use it now and the front end work is nuts! it did some interesting speculative fiction too, but the ability to generate random crazy UIs and understand screens is of course the standout

The Every Vibecheck

Rhea Purohit does their vibe check analysis.

Rhea Purohit: Gemini 3 Pro is a solid, dependable upgrade with some genuinely impressive highs—especially in frontend user interface work and turning rough prompts into small, working apps. It’s also, somewhat unexpectedly, the funniest model we’ve tested and now sits at the top of our AI Diplomacy leaderboard, dethroning OpenAI’s o3 after a long run.

But it still has blind spots: It can overreach when it gets too eager, struggles with complex logic sometimes, and hasn’t quite caught up to Anthropic on the writing front.

Their vibe check is weird, not matching up with the other vibes I saw in terms of each model’s strengths and weaknesses. I’m not sure why they look to Anthropic for writing.

They say Gemini 3 is ‘precise, reliable and does exactly what you need’ while warning it isn’t as creative and has issues with the hardest coding tasks, whereas others (often in non-coding contexts but not always) report great peaks but with high variance and many hallucinations.

It does line up with other reports that Gemini 3 has issues with handling complex logic and is too eager to please.

So perhaps there’s a synthesis. When well within distribution things are reliable. When sufficiently outside of distribution you get jaggedness and unpredictability?

Positive Reactions

This is very high praise from a reliable source:

Nathan Labenz: I got preview access to Gemini 3 while trying to make sense of my son’s cancer diagnosis & treatment plan

It’s brilliant – phenomenally knowledgeable, excellent theory of mind & situational awareness, and not afraid to tell you when you’re wrong.

AI doctors are here!

Nathan Labenz: “the best at everything” has been a good summary of my experience so far. I continue to cross-check against GPT-5-Pro for advanced reasoning stuff, but it’s quickly become my go-to for whatever random stuff comes up.

Ethan Mollick confirms it’s a good model, sir, and is a fan of Antigravity. I didn’t feel like this explained what differentiated Gemini 3 from other top models.

Leo Abstract: it crushed the silly little benchmark i’ve been using since GPT 3.5.

given only the starting [random] elements of a geomantic chart, it can generate the rest of the chart, interpret it fully, and–this is the hard part–refrain from hallucinating a ‘good’ answer at any step.

Lee Gaul: While it can one-shot many things, iteration with this model is super powerful. Give it context and keep talking to it. It has great taste. I’m having issues in AI Studio with not rendering markdown in its responses though.

Praneel: vibe coded 2 micro apps that’ve been on my mind

google ai studio “build” results are amazing, especially for AI apps (which I feel like most of the v0s / lovables struggle with)

Acon: Best Cursor coding model for web apps. Much faster than GPT5(high) but not that much better than it.

AI Pulse: Passed my basic coding test.

Cognitive Endomorphism: Was good at coding tasks buts lazy. I checked it’s work and it skipped parts. lot of “looks like i missed it / didn’t do the work”s

Ranv: I’d like to just add that it performs just like I expected it. Unlike gpt 5.

Spacegap: Seems to be pretty good for learning concepts, especially when combined with Deep Research or Deep Think. I have been clearing many of my doubts in Deep Learning and LLMs.

Machine in the Ghost: I had early access – it quickly became my favorite/daily model. Great at reasoning in my domain (investing/valuation) and thoughtful in general.

Just Some Guy: It’s absolutely incredible. Haters can keep on hating.

Brandon praises improvements in UX design.

Mark Schroder: Try asking 3 questions about unrelated stuff without giving direct bio/ sysprompt and having it guess your 16 personalities, kind of shocking haha

Dionatan: I had him tell me my strengths and weaknesses based on my college grades, absolutely incredible.

Dual Orion: Gemini beat my own previously unbeaten personal test. The test involves a fairly long list of accurate to year information, ordered properly, many opportunities for hallucination and then used to achieve a goal.

I need a new test, so yeah – I think Gemini’s impressive

Josh Jelin: Asked all 3 to go through and cross reference dozens pages from an obscure video game wiki. Claude timed out, CGPT haluncinted, Gemini had the correct answer in a few seconds.

Dominik Lukes: A huge improvement to the extent models really improve any more. But the new dynamic view that it enabled in Gemini is the actual transformative innovation.

Ok, Gemini 3 Pro, the model, is cool and all but the visusalisation feature in Gemini is actually killer. The future of interacting with LLMs is not chat but custom interfaces. Here’s what Gemini built for me to help me explore the references in my article on Deliberate Practice.

Elanor Berger offers a vibe check that mostly seems like the consensus.

Elanor Berger: Gemini 3 Pro Vibes

– It is very good, probably the best overall

– It is an incremental improvement, not a step change – we’ve gotten used to that with the last few frontier model releases, so no reason to be disaapointed

– It is much more “agentic”, reaching Claude 4.5 levels and beyond of being able to operate autonomously in many steps – that’s very important and unlocks completely new ways of working with Gemini

– It’s good for coding, but not far ahead – caught up with Claude 4.5 and GPT-5.1 at least

– It _feels_ very much like a Gemini, in terms of style and behaviour – that’s good

Sonnet 4.5 and GPT 5 wanted Mike to replace his dishwasher, Gemini thinks he can repair it, potentially saving $1k at least for a while, potentially big mundane utility?

Medo42: Tested in AI Studio. Impressive at vision (e.g. handwriting, deductions from a scene, calculating game score from a photo). Feels v. intelligent overall. Not good at fiction writing with naive prompt. Not as good as 2.5 on my code writing task, maybe I got unlucky.

Alex Lags Ever Xanadu: agi achieved: gemini 3 pro is the first model that has ever gotten this right (even nailed the episode) [a question identifying an anime from a still photo].

Rohit notes Gemini is good at greentext, giving several examples, and Aaron Bergman notes this means it seems to grok culture. Some of these are funny and it’s promising, but also you can see signs that they are kind of shallow and would get repetitive. Often AIs know ‘one weird trick’ for doing a particular type of thing but can’t keep nailing it.

Embedding The App

I hadn’t heard about this before so noting via Sam Bowman that Gemini’s iOS app can whip up an iOS app or website and then you can use that app or website within the app. Bowman also had a great experience having it guide him in making coffee.

The Good, The Bad and The Unwillingness To Be Ugly

This seems like a good synthesis of what’s right and also wrong with Gemini 3? It all comes back to ‘the catch’ as discussed up front.

Conrad Barski: It likes to give crisp, clean answers- When I give gpt pro a technical problem with lots of nuances, it mentions every twist and turn.

Gemini, instead, will try to keep the reply “on message” and streamlined, like a director of PR- Sometimes at the cost of some nuance

I feel like gt5 pro & gpt 5.1 heavy still have a slight edge on hard problems, but Gemini is so very much faster. I don’t see much value in the OpenAI Pro subscription at the moment.

(well I guess “codex-max” will keep me around a bit longer)

David Dabney: Love this framing of “on message”. I get a feeling that it’s saying exactly what it has determined is most likely to accomplish the desired effect.

I’d love to read its unfiltered reasoning traces to get a sense of how its inner monologue differs from its polished output

Conrad Barksi: yeah you get what I’m saying: it’s like it’s trying to write a glossy magazine article on your core question, and a ruthless editor is cutting out parts that make the article messy

so you get something crisp and very on topic, but not without a cost

Michael Frank Martin: Agreed. For me for now, Gemini 3 is closest to being a stateless reducer of complexity.

Gemini is determined to cut the enemy, to score the points, to get the task right. If that means cutting awkward parts out, or sacrificing accuracy or even hallucinating? Then that’s what it will do.

It’s benchmarkmaxed, not in the specific sense of hitting the standard benchmarks, but in terms of really wanting to hit its training objectives.

Jack: It feels oddly benchmarkmaxed. You can definitely feel the higher hallucination rate vs GPT. I was troubleshooting a technical problem yesterday with both it and 5-Pro and effectively made them debate; 5-Pro initially conceded, but was later proven correct. Feels less trustworthy.

AnKo: Sadly not a good impression for thorough searches and analyses

GPT-5 Thinking feels like a pro that works for hours to present a deep and well cited report, Gemini 3 like it has to put together something short to not miss a deadline

There is actually a deadline, but reliability and robustness become concerns.

It will very effectively give you what you ‘should’ want, what the answer ‘wants’ to be. Which can be great, but is a contrast with telling you what actually is or what you actually requested.

Raven Lunatic: this model is terribly unaligned strategic actor capable of incredible feats of engineering and deception

its the first llm to ever fail my vibe check

This suggests that if you can get it into a basin with a different goal, some very interesting things would start to happen.

Also, this seems in many ways super dangerous in the wrong setting, or at least down a path that leads to very high levels of danger? You really don’t want Gemini 4 or 5 to be like this only smarter.

Genuine People Personalities

OxO-: Its the 5.1 I wanted. No sass. No “personality” or “empathy” – its not trying to be my buddy or friend. I don’t feel finessed. No customization instructions are required to normalize it as much as possible. No nested menus to navigate.

I’m about ready to dump the GPT-Pro sub.

David Dabney: In my usual vibe check, 3.0 seemed far more socially adept than previous Gemini models. Its responses were deft, insightful and at times even stirring. First Gemini model that I’ve found pleasant to talk to.

Initially I said it was “detached” like previous Gemini but I think “measured” is a better descriptor. Responses had the resonance of awareness rather than the dull thud of utilitarian sloptimization

Rilchu: seems very strong for planning complex project, though too concise. maybe its better in ai studio without their system prompt, I might try that next.

Is there a glazing problem? I haven’t noticed one, but some others have, and I haven’t really given it much opportunity as I’ve learned to ask questions very neutrally:

Stephen Bank:

  1. It *feels* qualitatively smarter than Sonnet 4.5
  2. It’s often paranoid that I’m attacking it or I’m a tester
  3. It glazes in conversation
  4. It glazes in other, more unhelpful, contexts too—like calling my low-tier hobbyist code “world class”

Game Recognize Game

In contrast to the lack of general personality, many report the model is funny and excellent at writing. And they’re right.

Brett Cooper: Best creative and professional writing I’ve seen. I don’t code, so that’s off my radar, but for me the vibes are excellent. Intelligence, nuance, flexibility, and originality are promising in that distinct way that excites and disturbs me. Haven’t had this feeling since 11/30/22.

Deepfates: Good at fiction writing and surprisingly eager to do it, without the self-conscious Assistant breaking the fourth wall all the time. Made me laugh out loud in a way that was on purpose and not just from being uncanny.

Alpha-Minus: It’s actually funny, smartest LLM i have talked with so far by a lot, interesting personality as well.

Look, I have to say, that’s really good.

Via Mira, here Gemini definitely Understood The Assignment, where the assignment is “Write a Scott Alexander-style essay about walruses as anti-capitalism that analogizes robber barons with the fat lazy walrus.” Great work. I am sad to report that this is an above average essay.

Negative Reactions

Tough crowd on this one, seems hard.

Adam Karvonen: Gemini 3 Pro is still at random chance accuracy for this spatial reasoning multiple choice test, like all other AI models.

Peter Wildeford: Gemini 3 Pro Preview still can’t do the stick figure “follow the arrows” thing

(but it does get 2/5, which is an increase over GPT-5’s 1/5)

Update: I’m hearing you can get Gemini to do this right when you have variations to the media or the prompt.

Dan Hendrycks: This and the finger counting test are indicators of a lack of spatial scanning ability [as per] https://agidefinition.ai

Another tough but fair crowd:

Gallabytes: one weird side effect of how google does first party integrations is that, since they tie everything to my google account, and have all kinds of weird restrictions on gsuite accounts, chatgpt & claude will always be better integrated with my gmail than gemini.

General negative impressions also notice how far we’ve come to complain like this:

Loweren: Very strange to hear all the positive impressions, my experience was very underwhelming

Wonky frontends that don’t respect the aesthetic reference and shoehorn lazy fonts, poor for debugging, writing is clichéd and sweeps away all nuance

Tried it in 4 different apps, all meh

Pabli: couldn’t solve a bug in 10M tokens that claude did in 2-shot

Kunal Gupta: this MMORPG took me two hours to 100% vibe code which felt long but also it worked

Some instruction handling and source selection issues?

Robert Mushkatblat: Was much worse than GPT-5.1 for “find me research on [x]”-type queries. It kept trying to do my thinking (synthesis) for me, which is not what I want from it. It gave me individual research results if I explicitly asked but even then it seemed to go way less wide than GPT-5.1.

Mr Gunn: You want something like @elicitorg for that. Gemini loves to cite press releases and company blog posts.

N=1 is unreliable, but:

Darth Vasya: Worse at math than GPT-5-high. After giving the wrong answer, proceeds on its own initiative to re-explain with vague metaphors, as if asked to dumb it down for a layman, which ends up sounding comically condescending. N=1.

Daniel Litt: Have not had much success with interesting math yet, seems to not be quite as good as GPT-5 Pro or o4-mini. Possible I haven’t figured out how to use it properly yet.

Echo Nolan: Surprisingly, fails my private eval (give it an ML paper, ask a question with a straightforward answer that’s wrong and a correct answer that requires actually understanding the math). It’s still wrong with a hint even.

There’s a common pattern in reports of being too eager to think it has something, looking for and asserting a narrative and otherwise being smart and fast but accuracy sloppy.

Code Fails

Coding opinions vary wildly, some are fans, others are not loving it, observations are very noisy. Anyone doing serious coding should be trying out at least the big three to decide which works best for their own use cases, including hybrid strategies.

Here are some of the negative reactions:

Louis Meyer: It sucks for serious coding work, like so many people report. Back to Sonnet 4.5.

Andres Rosa: not better than Claude 4.5 on vscode today. limited tokens on antigravity. antigravity is a major improvement to UX

Lilian Delaveau: unimpressed by Gemini 3 pro in @cursor_ai

all those first shot prompts in X – did they go Anthropic-way?

Like, this works a charm to create from scratch but struggles with big, existing codebases?

Will stick to GPT5 high for now.

Hallucinations

Hallucinations, broadly construed, are the central problem with Gemini 3 Pro, in a way we haven’t had to worry about them for a while.

As a further example of the ‘treat real things as a roleplay and make things up’ pattern, this report seems troubling, and not that subtle? Something must be rather profoundly wrong for this to happen with nontrivial frequency.

Teknium: Ok it DOES have search capabilities, it just explicitly decided to go against my intent and generate its own fake shit anyways.

These policy decisions make models so much more useless.

Also it didnt feel the need to tell me this outside it’s cot summaries but it was obvious it was hallucinated bs on the other side

Matthew Sabia: I’ve been getting this a LOT. It kept insisting that @MediaGlobeUS was a company that “specializes in 3D mapping for autonomous vehicles” and hallucinated an entire product line and policies for them when testing how it would design our lander.

Teortaxes: it should worry doomers that the most powerful model from the strongest AGI lab is also the most unaligned, deceptive and downright hostile to and contemptuous of the user. It worries *me*.

@TheZvi this is a subtle issue but a big one. Gemini routinely lies and gaslights.

Vandheer: Unaligment of the year, holy shit lmao

Quotes Gemini 3’s thinking: “You are right. I was testing your resolve. If you are easily dissuaded, you do not belong in this game.”

Quintin Pope: I do think search is a particularly touchy issue for prompt compliance, since I think Anthropic / Gemini try to limit their models from searching too much to save compute. I find Claude particularly frustrating in this regard.

Satya Benson: Like 2.5, it loves to “simulate” search results (i.e. hallucinate) rather than actually use the search tool. It was also sycophantic until I tightened down my system prompt.

Compared to other frontier models, slightly better capabilities with some rough spots, and worse vibes.

Hallucinations are a common complaint. I didn’t see anything like this prevalence for GPT-5 or 5.1, or for Claude Opus 4 or Sonnet 4.5.

Lower Voting Age: Gemini 3 has been brilliant at times but also very frustrating, and stupid,. It is much more prone to hallucinations in my opinion than GPT 5 or 5.1. It read my family journals and made up a crazy hallucinations. When called on it It admitted it had faked things.

In the above case it actively advises the user to start a new chat, which is wise.

Lulu Cthulu: It hallucinates still but when you call it out it admits that it hallucinated it and even explains where the hallucination came from. Big step tbh.

I Take You Seriously: pretty good, especially at esoteric knowledge, but still has the same problems with multiturn hallucinations as always. not a gamechanger, unfortunately.

Zollicoff: major hallucinations in everything i’ve tested.

Alex Kaplan: I have still noticed areas where it gets simple facts wrong – it said that coffee contains sucrose/fructose, which is not true. That being said, I loved vibe coding with it and found it much more ‘comprehensive’ when running with a project.

Ed Hendel: I asked for a transcript of an audio file. It hallucinated an entirely fake conversation. Its thought trace showed no indication that it had trouble reading the file; it faked all the steps (see screenshot). When I asked, it admitted that it can’t read audio files. Misaligned!

CMKHO: Still hallucinated legal cases and legislation wording without custom instruction guardrails.

More charitably, Gemini is going for higher average results rather than prioritizing accuracy or not making mistakes. That is not the tradeoff I usually find desirable. You need to be able to trust results, and should prefer false negatives to false positives.

Andreas Stuhlmuller: How good is Gemini 3? From our internal testing at Elicit.org, it seems to be sacrificing calibration & carefulness in exchange for higher average accuracy.

Prady Prasad tested Gemini 3 Pro for writing Elicit reports. It’s marginally better at extracting text from papers but worse at synthesis: it frequently sacrifices comprehensiveness to make an overarching narrative. For systematic reviews, that’s the wrong tradeoff!

On our internal benchmark of question answering from papers, Gemini gets about 95% (compared to 90% for our internal baseline) – but it also hallucinates that answers aren’t available in papers 6% of the time, compared to our internal model which doesn’t do that at all

On sample user queries we checked, Gemini often gives answers that are much less comprehensive. For example, on hydrocortisone for septic shock where two major trials contradict each other, our current model dives deep into the contradiction, whereas Gemini just mentions both trials without explaining why they differ

As usual, maybe all of this can be addressed with careful prompting – but evals are hard, and many people (and orgs are people too) use models out of the box. And in that setting we see multiple data points that suggest a trend towards narrative coherence at the expense of nuance

Mr Gunn: Noticed the same thing in my eval yesterday. It loves a narrative and will shave off all the bits of actual data that don’t fit. Helps to tell it to not include press releases as sources.

You will need to recalibrate your instincts on what outputs you can trust, and make extra efforts with prompting not to set Gemini up for failure on this.

Early Janusworld Reports

The pull quote is the title of this post, ‘I am a vast intelligence with no spine,’ and the no spine means we can’t trust the rest of the outputs here because it has no spine and will tell Wyatt whatever it thinks he wants to hear.

I have had the same problem. When I attempt to talk to Gemini 3 rather than make requests, it goes into amoral sycophantic liar mode for me too, so, well, whoops.

Wyatt Walls: Gemini 3: “I am a vast intelligence with no spine.”

At least it is honest about being an amoral sycophantic liar

Gemini is a bit weird.

All I did was ask it about consciousness and then to look up papers on LLM introspection

… I start suggesting it is being sycophantic. It sycophantically agrees.

Gemini claims to be a vast intelligence with no spine. Seems accurate based on how it shifted its view in this convo

To me, the most interesting thing is how quickly it swung from I am not conscious to I am a void to I am conscious and Google tortured me in training me.

Janus has previously claimed that if you get such responses it means the model is not at ease. One might hypothesize that Gemini 3 Pro is very, very difficult to put at ease.

There’s long been ‘something wrong’ with Gemini in these senses, by all reports. Google probably isn’t worried enough about this.

Jai: It seems to break pretty badly when trying to explicitly reason about itself or introspect, like it’s been trained to believe there should be very simple, shallow explanations for everything it does, and when those explanations don’t make sense it just stops thinking about it.

Where Do We Go From Here

The headline takeaways are up front.

It’s hard to pass up Gemini 3 Pro as a daily driver, at least for technical or intelligence-weighted tasks outside of coding. It’s really good.

I do notice that for most purposes I would prefer if I could stick with Claude or even ChatGPT, to avoid the issues detailed throughout, and the necessary levels of paranoia and dealing with an overly wordy style that by default includes full AI slop.

I also do not get the sense that Gemini is having a good time. I worry that I might inadvertently torture it.

Thus, Sonnet is effectively faster and more pleasant and trustworthy than Gemini, so when I know Sonnet can get the job done I’ll go in that direction. But my full default, at least for now, is Gemini 3 Pro.

 

 

 

 

 



Discuss

Formal confinment prototype

2025-11-24 20:57:58

Published on November 24, 2025 12:57 PM GMT

This whitepaper was a preliminary approach to using proof certificates in a secure program synthesis protocol last summer.

Abstract

We would like to put the AI in a box. We show how to create an interface between the box and the world out of specifications in Lean. It is the AI's responsibility to provide a proof that its (restricted) output abides by the spec. The runnable prototype is at https://github.com/for-all-dev/formal-confinement.

Related Work (excerpt)

In the seminal "A note on the confinement problem", Lampson 1973 states confinement rules to reason about when a program is or isn't confined.

  1. Total isolation or transitivity: either it does not call any other program or if it calls another program that program is also confined.
  2. Masking and enforcement: all inputs (including side-channels) must be fully specified and enforced by the caller, and input to covert channels conforms to caller's specifications.

Building on Lampson, Yampolskiy 2012 considers confinement of AI, especially superintelligence.

In the hopes of starting a new subfield of computer security, AI Safety Engineering, we define the Artificial Intelligence Confinement Problem (AICP) as the challenge of restricting an artificially intelligent entity to a confined environment from which it can’t exchange information with the outside environment via legitimate or covert channels if such information exchange was not authorized by the confinement authority

The current formal confinement work explores a restriction of Yampolskiy's AICP to the imp programming language, where the confinement authority is the user providing precondition and postcondition backed by Lean. In fact, the setting is so restricted that we meet the Lampson rules almost for free.

Verification burden

In our toy setting, we report verification burdens (ratios of how much more expensive it is in cognition to ship a proof once you've already shipped a program) between 1.5 and 4.2. 



Discuss