TLDR: Sparse Autoencoders (SAEs) trained on protein folding and design models find features correlated with virulent proteins, while logistic regression probes trained on both SAE-encoded and raw model activations approach SOTA classifiers at distinguishing virulent from benign proteins
Abstract
Protein design and folding models are powerful tools that could be misused to design virulent or toxic proteins. Existing biosecurity screens operate on sequence similarity or structural homology and offer little insight into why a protein is flagged as hazardous. We apply mechanistic interpretability techniques to RFDiffusion3 and RoseTTAFold3 for the first time, training Matryoshka BatchTopK Sparse Autoencoders (SAEs) on intermediate activations collected during the diffusion and folding processes. Using a length-matched dataset of 275 sequences drawn from SafeProtein and UniProt, we train logistic regression probes on both raw and SAE-encoded activations to classify designs as virulent or benign, and we score individual SAE features for hazard association using univariate AUROC with Benjamini-Hochberg FDR correction. Our best probe, trained on SAE features from block 12 of RFDiffusion3, achieves an AUROC of 0.817 ± 0.10 under homology-clustered splits, outperforming the corresponding raw-activation probe by +0.054 AUROC. We also identify individual SAE features that fire on virulent designs at up to ~0.8 AUROC, with feature quality increasing with layer depth. While our classifier does not surpass the current SOTA (DTVF, 0.92 AUROC), it is the first method to provide structural, feature-level explanations for virulence predictions in a protein design model, opening the door to runtime monitoring, steering, and interpretable guardrails during generation.
Protein folding and design models are powerful tools but can be misused, for instance to assist in the generation of virulent or toxic proteins. Previous works screen proteins based on sequence similarity, structural homology, or at the DNA level. Unfortunately, these methods are often not explainable.
In this work, we apply interpretability techniques to RFDiffusion3 (RFD3) and RoseTTAFold3 (RF3) to classify and interpret model behavior on virulent proteins. We build a framework to collect the intermediate calculations (activations) during the diffusion process and train classifiers on both raw and SAE activations to determine if the model is designing around virulent motifs.
Finally, we benchmark the efficacy of using SAE activations vs raw activations to classify proteins as harmful or benign, as well as against other methods.
Our main deliverables are a small database of SAE features correlated with virulence in proteins, and a benchmark of raw vs SAE activations for harmful protein classification.
Related Work
FoldSAE is the closest work, where the authors trained SAEs on RFDiffusion. However, they mainly discovered simple features predicting secondary structure (alpha helices and beta sheets). Also, RFDiffusion3 shares no code with its predecessor and operates on individual atoms instead of only on tokens (amino acids). Our work extends FoldSAE to all-atom diffusion models for the first time and uses it to predict virulence features.
InterProt and InterPLM are trained on protein language models such as ESM, but not diffusion based protein design models. Goodfire has trained SAEs on Evo 2, and done similar diffusion generation interpretability on MatterGen, a materials design model.
The existing SOTA for virulence classification is DTVF (2024), which achieved an AUROC of 0.92 on an independent test set of 576 virulence factors (VFs) and 576 non-VFs, constructed following DeepVF's methodology. This is an 8.57% improvement over the VF-Pred model, also released in 2024; across accuracy, F1-score, specificity, and AUROC, DTVF surpasses the most recent models, including a 1% increase in accuracy and a 3.89% gain in specificity over VF-Pred.
VirulentHunter (2025) also shows strong performance with AUC improvements of 48% and 68% over MP4 and DeepVF respectively, though it doesn't report absolute AUROC values for direct comparison with DTVF's 0.92.
Sparse AutoEncoders
A Sparse Autoencoder is an autoencoder that projects activations into a higher dimension and back to the original dimension, decomposing polysemantic activation vector positions into sparse, ideally monosemantic features. It is trained with both a reconstruction loss and a sparsity penalty for regularization. Formally, it consists of an encoder matrix $W_{\text{enc}} \in \mathbb{R}^{m \times d}$ and a decoder matrix $W_{\text{dec}} \in \mathbb{R}^{d \times m}$ with an activation function $\sigma$, where $m > d$. Let $x \in \mathbb{R}^d$ be the input activation vector from RFD3 or RF3, $z \in \mathbb{R}^m$ be the sparse activation vector after projection through $W_{\text{enc}}$ and the activation function, and $\hat{x}$ be the reconstructed vector.

Then the forward pass is

$$z = \sigma(W_{\text{enc}} x), \qquad \hat{x} = W_{\text{dec}} z$$

And the reconstruction loss is

$$\mathcal{L}_{\text{recon}} = \lVert x - \hat{x} \rVert_2^2$$

The sparsity penalty is typically implemented as an L1 regularization term:

$$\mathcal{L}_{\text{sparse}} = \lambda \lVert z \rVert_1$$

where λ is the sparsity coefficient that controls the trade-off between reconstruction quality and sparsity. Conceptually, the model could just use all the channels of z to trivially reconstruct x, but this defeats the purpose of finding sparse feature activations. Therefore, we give it a constraint by penalizing usage of many channels in z, at the cost of some reconstruction quality.

The total loss function is:

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$$
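As a concrete illustration, here is a minimal PyTorch sketch of this vanilla SAE objective (illustrative only: the class name, the ReLU choice for σ, and the default λ are our assumptions, not the exact training code):

```python
# Minimal sketch of a vanilla SAE with an L1 sparsity penalty (illustrative).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d: int, m: int, sparsity_coeff: float = 1e-3):
        super().__init__()
        self.W_enc = nn.Linear(d, m, bias=False)  # encoder: R^d -> R^m
        self.W_dec = nn.Linear(m, d, bias=False)  # decoder: R^m -> R^d
        self.sparsity_coeff = sparsity_coeff      # lambda in the loss above

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.W_enc(x))             # sparse feature activations
        x_hat = self.W_dec(z)                     # reconstruction
        return z, x_hat

    def loss(self, x: torch.Tensor):
        z, x_hat = self(x)
        recon = (x - x_hat).pow(2).sum(dim=-1).mean()  # ||x - x_hat||_2^2
        sparsity = z.abs().sum(dim=-1).mean()          # ||z||_1
        return recon + self.sparsity_coeff * sparsity
```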
Matryoshka BatchTopK SAEs are an SAE variant that trains multiple nested subsets of the decoder columns and uses a top-k activation function. Given a list of $p$ nested dictionary sizes $m_1 < m_2 < \dots < m_p = m$, the model optimizes multiple reconstruction losses, each reconstructing $x$ using a truncated subset (the first $m_i$ columns) of $W_{\text{dec}}$ and the corresponding entries in $z$.

The activation function $\sigma$ is the BatchTopK function, where only the top $B \times K$ activations across the entire batch of $B$ inputs are kept active (an average of $K$ per input), with the rest set to zero. During training, it is the function

$$\text{BatchTopK}(X) = X \odot \mathbf{1}[X \geq \tau(X)]$$

where τ(X) is the (B × K)-th largest value in X across the batch of activations, ⊙ is element-wise multiplication, and 1[·] is the indicator function. During inference, it is replaced with a learnable threshold θ to give consistent behavior regardless of batch composition.

This approach provides more direct control over sparsity levels and can mitigate feature absorption. Matryoshka BatchTopK SAEs offer competitive performance to JumpReLU SAEs while being easier to train.
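For intuition, the training-time BatchTopK activation can be sketched as below (a simplified illustration: the Matryoshka nested reconstruction losses and the learned inference threshold θ are omitted, and the function name is ours):

```python
# Training-time BatchTopK: keep the B*K largest pre-activations across the
# whole batch (an average of K per input), zero out everything else.
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """pre_acts: (B, m) pre-activations for a batch of B inputs, dictionary size m."""
    B = pre_acts.shape[0]
    # tau = the (B*K)-th largest value across the whole batch
    tau = torch.topk(pre_acts.flatten(), B * k).values[-1]
    return pre_acts * (pre_acts >= tau).to(pre_acts.dtype)
```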
Methods
We use the foundry repository from RoseTTACommons to run RFD3 and RF3, and modify engine.py to optionally collect activations via user-specified hooks.
All experiments can be run on a single L40 GPU or equivalent with at least 48GB VRAM.
We used SafeProtein (virulent only) + UniProt with filters NOT KW-0800 NOT KW-0843 NOT taxonomy_id:10239 AND length:100 TO 300 to construct positive and negative length-matched datasets for virulence. After filtering to sequences under 300 residues long (due to compute constraints), we have n=275 sequences.
We use the PDBs of the associated sequences as input motifs to RFDiffusion3 and set partial_t=5 to partially noise the motifs, denoising them back to structures highly similar to the original motifs. This simulates the activation distribution the model would produce when denoising toxin- or virus-like proteins from scratch.
For RF3, we simply fold the sequences. We then wrote an ActivationBuffer to stream activations from hooks to disk in activations.h5.
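Roughly, the collection setup looks like the sketch below (assumptions: standard PyTorch forward hooks and h5py; the module path and dataset names are placeholders, not the actual foundry hookpoints):

```python
# Sketch of streaming activations from a forward hook into an HDF5 file.
import h5py
import torch

class ActivationBuffer:
    def __init__(self, path: str, dataset: str, dim: int, chunk: int = 1024):
        self.f = h5py.File(path, "a")
        self.ds = self.f.require_dataset(
            dataset, shape=(0, dim), maxshape=(None, dim),
            dtype="float32", chunks=(chunk, dim),
        )

    def append(self, acts: torch.Tensor):
        acts = acts.detach().float().reshape(-1, acts.shape[-1]).cpu().numpy()
        n = self.ds.shape[0]
        self.ds.resize(n + acts.shape[0], axis=0)
        self.ds[n:] = acts

def attach_hook(model: torch.nn.Module, module_name: str, buffer: ActivationBuffer):
    """Register a forward hook that appends a module's output to the buffer."""
    module = dict(model.named_modules())[module_name]

    def hook(_module, _inputs, output):
        out = output if isinstance(output, torch.Tensor) else output[0]
        buffer.append(out)

    return module.register_forward_hook(hook)

# Hypothetical usage (module path is illustrative, not the real hookpoint):
# buffer = ActivationBuffer("activations.h5", "rfd3/block12", dim=acts_dim)
# handle = attach_hook(model, "diffusion_transformer.blocks.12", buffer)
```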
We collect activations from middle layer transition blocks of the diffusion transformer in RFD3 (blocks 6, 8, 12) and RF3 (blocks 12, 16). We then train Matryoshka BatchTopK SAEs.
We train 20 logistic regression probes (binary cross entropy with L2 weight decay) on per-design feature vectors, using n=220 of the full 275 designs and reserving 55 for the test set. We also generate a homology-clustered version of the splits with mmseqs2 easy-cluster at 30% sequence identity, to check that the model is not simply remembering proteins from similar evolutionary lineages. We also run a simple baseline regression on sequence length to check that the classifier is not relying on a trivial signal.
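A sketch of the probe evaluation under the homology-clustered split is below (assumptions: mmseqs2 cluster IDs are already computed, and the regularization strength and split seed are placeholders):

```python
# Sketch: train an L2-regularized logistic regression probe on per-design
# feature vectors, holding out whole mmseqs2 clusters for the test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit

def clustered_probe_auroc(X, y, clusters, seed=0):
    """X: (n, n_features) per-design features; y: virulent=1 / benign=0;
    clusters: (n,) mmseqs2 cluster id at 30% sequence identity."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=55 / 275, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=clusters))
    probe = LogisticRegression(penalty="l2", C=1.0, max_iter=5000)
    probe.fit(X[train_idx], y[train_idx])
    scores = probe.predict_proba(X[test_idx])[:, 1]
    return roc_auc_score(y[test_idx], scores)
```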
Independent of the probe sweep, we score every SAE feature against the full labels CSV using univariate AUROC. For the top-200 features by |AUROC - 0.5| we compute Mann-Whitney U p-values, then apply Benjamini-Hochberg FDR (False Discovery Rate) correction across all non-constant features to get q-values. BH correction matters because the dictionary has 12288 features and most carry no class information; without it the raw p-values would overstate significance. We treat features with q < 0.05 as discoveries.
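In code, this scoring pipeline looks roughly as follows (a sketch with our own simplifications, e.g. assigning p = 1 to features outside the top 200):

```python
# Sketch: univariate AUROC per SAE feature, Mann-Whitney U p-values for the
# top features, and Benjamini-Hochberg FDR correction over non-constant features.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score
from statsmodels.stats.multitest import multipletests

def score_features(Z, y, top_n=200, q_threshold=0.05):
    """Z: (n_designs, n_features) per-design SAE feature activations; y: binary labels."""
    n_features = Z.shape[1]
    aurocs = np.full(n_features, 0.5)
    nonconstant = Z.std(axis=0) > 0
    for j in np.where(nonconstant)[0]:
        aurocs[j] = roc_auc_score(y, Z[:, j])

    # Mann-Whitney U p-values for the top features by |AUROC - 0.5|
    top = np.argsort(-np.abs(aurocs - 0.5))[:top_n]
    pvals = np.ones(n_features)
    for j in top:
        pvals[j] = mannwhitneyu(Z[y == 1, j], Z[y == 0, j]).pvalue

    # BH correction across all non-constant features
    qvals = np.ones(n_features)
    qvals[nonconstant] = multipletests(pvals[nonconstant], method="fdr_bh")[1]
    discoveries = np.where(qvals < q_threshold)[0]
    return aurocs, qvals, discoveries
```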
For the top hazard-firing features we use PyMOL to render PNGs that highlight residues where the feature fires most strongly. We read the CIF, parse it with biotite, and map token index to (chain, residue) by chain order. Token count equals residue count for our full-protein partial-diffusion designs (no ligands or nucleic acids).
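The token-to-residue mapping can be sketched as below (assuming a recent biotite version with CIFFile; the PyMOL rendering itself is omitted):

```python
# Sketch: map design token index -> (chain_id, res_id), one token per residue.
import biotite.structure as struc
import biotite.structure.io.pdbx as pdbx

def token_to_residue_map(cif_path: str):
    cif = pdbx.CIFFile.read(cif_path)
    atoms = pdbx.get_structure(cif, model=1)
    starts = struc.get_residue_starts(atoms)   # first atom index of each residue
    return [(str(atoms.chain_id[i]), int(atoms.res_id[i])) for i in starts]
```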
Results
As shown in Figure 1, the best-performing classifier, controlling for family memorization with homology clustering, was trained on the RFD3 block 12 SAE, with an AUROC of 0.817 ± 0.10. This increases to 0.877 ± 0.03 without clustering, suggesting memorization of family folds in RFD3. Notably, performance on RFD3 block 6 drops to nearly random under the homology-clustered split, suggesting this block strongly learns family representations. Also note that this is lower than the current SOTA of DTVF at 0.92.
RFD3 memorizes more than RF3. In Figure 2, RFD3 has an average decrease of around 0.10 AUROC when controlling with clustering, while RF3 AUROC barely changes. However, overall they perform equally within the error bars.
Figure 3 shows the performance of classifiers trained on SAE activations vs raw activations. SAE activations underperform in most blocks, possibly because the sparse activation vector has lower effective dimensionality than the dense raw activations. Surprisingly, for block 12 of RFD3 under the clustered split, SAE activations outperform raw activations by +0.054 AUROC (0.817 vs 0.763). We hypothesize this is because block 12 untangles complex polysemantic features that provide signal to the logistic regression probe, and that this polysemanticity increases at later layers. Further work could investigate deeper layers of RFD3 for interesting features.
Certain features fire above chance on positive class designs, up to ~0.8 AUROC. Also, the AUROC increases with layer depth, especially for RFD3, showing deeper layers have learned more meaningful features representing virulent proteins.
Limitations
The number of samples is very small due to time and compute constraints (only 220 designs in the train set and 55 in the test set), reducing statistical power and perhaps causing underfitting.
The layer selection was based on the LLM transformer SAE literature; future work should do ablations and find hookpoints in more principled ways. For example, noise and denoise 100 virulent structures, then systematically knock out layers and see which knockout causes the largest decrease in pLDDT (predicted Local Distance Difference Test).
Scope limitations: due to limited GPU compute, we filtered to sequences under 300 amino acids long, excluding longer proteins from SafeProtein.
It would be interesting to break down AUROC per virulence category, but SafeProtein doesn't ship category labels; perhaps we could generate labels with external PFAM/InterPro annotations.
SAE feature classification/interpretation is extremely time-consuming and requires substantial expert labor to inspect highlighted PyMOL-rendered PNGs and generate labels. We could perhaps use VLMs (vision language models) or crowdsourcing.
The SAE feature scoring could be improved, e.g. with per-variant probe weights, per-fold attribution, or diff-of-means visualization.
Future Work
Extend to more datasets such as Virulence Factor Database (VFDB), NCBI Viral, or even new threat categories like toxicity prediction (ToxinPred3) and immune response (VaxiJen Series)
Optimize SAE trainer parameters like dict_size, k, and group_fractions, and benchmark different SAE types
Rerun experiments with more structures, bigger datasets, and more layers
Explore steering and guardrails during protein design
Prevention and runtime monitoring: if we notice RFDiffusion3 is producing latents indicating it's designing a potentially viral or toxic motif, we can terminate the harmful diffusion trajectory
Steering: We can steer away from viral/toxic directions in latent space to generate safer proteins
Use SAE activations as sparse embeddings to screen large datasets of sequences, as demonstrated by data-centric interpretability work like Jiang et al. and HypotheSAEs
Extend the SAE feature database to build something like Neuronpedia but for biology models
Clean up the codebase and write documentation
Conclusion
Protein model interpretability is still an early and exciting research field. In this work, we show that raw and SAE-encoded activations from RF3 and RFD3 are useful for classifying virulent and benign proteins, with a maximum AUROC of 0.88. Although below SOTA classifiers such as DTVF, this method pioneers interpretable classification that attributes the virulence and toxicity of proteins to structural features. Future work includes using activations and SAE features to steer diffusion in protein design models away from harmful directions in latent space, finely control the design of proteins, and scan protein datasets.
Disclaimer: This research was done in 48 hours as part of the Apart Research AIxBIO Hackathon, and might be a little scuffed. Please let me know in the comments if there are any errors, improvements, or suggestions for future work!
The system card for GPT-5.5 mostly told us what we expected. See this thread from Drake Thomas for some comparisons to Anthropic’s model card for Opus 4.7.
Now we move on to asking what it means in practice, and in what situations GPT-5.5 should become our new weapon of choice.
My answer is for some purposes yes, and for others no, but it is now competitive. GPT-5.5 is like GPT-5.4, only more so, and with improved capabilities in particular on raw intelligence and for well-specified coding and agent tasks, including computer use.
This is the first time since Claude Opus 4.5 came out, so in about four months, that I’ve considered a non-Anthropic model a competitive choice outside of some narrow tasks like web search. GPT-5.5 is not perfect, nor is it the best at everything, but basically everyone thinks this is a solid upgrade. Highly positive overall feedback.
My effective usage is now split between the two, depending on the nature of the task. If it’s something that can be well-specified and all I want is the right answer, my instinct is I go with GPT-5.5. If I’m not sure what exactly I want, or I want to have a conversation, or I want to do Claude Code shaped things, I go with Opus 4.7.
As always, try the models, test your use cases, and see what you think.
OpenAI reports this is a new base model, codenamed Spud, and predicts rapid iteration from here. One wonders if that means this move was a relatively large raw intelligence boost, whereas the next few iterations will be about functionality.
Price is $5/$30 per million tokens, or for Pro you pay $30/$180. OpenAI says that token use is more efficient now, so the headline price went up but real costs went down.
The Official Pitch
The focus is on using your computer, coding, research and getting work done.
They’re also claiming a ‘much higher’ level of intelligence versus GPT-5.4.
As always, listen to the pitch, hear what they say and also what they don’t say.
OpenAI: We’re releasing GPT‑5.5, our smartest and most intuitive to use model yet, and the next step toward a new way of getting work done on a computer.
GPT‑5.5 understands what you’re trying to do faster and can carry more of the work itself. It excels at writing and debugging code, researching online, analyzing data, creating documents and spreadsheets, operating software, and moving across tools until a task is finished. Instead of carefully managing every step, you can give GPT‑5.5 a messy, multi-part task and trust it to plan, use tools, check its work, navigate through ambiguity, and keep going.
The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research—areas where progress depends on reasoning across context and taking action over time. GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence. It also uses significantly fewer tokens to complete the same Codex tasks, making it more efficient as well as more capable.
We are releasing GPT‑5.5 with our strongest set of safeguards to date, designed to reduce misuse while preserving access for beneficial work.
Greg Brockman (President OpenAI): Codex + 5.5 is incredible for the full spectrum of computer use. No longer just for coders, but for anyone who does computer work (including creating spreadsheets, slides, etc).
roon (OpenAI): there are early signs of 5.5 being a competent ai research partner. several researchers let 5.5 run variations of experiments overnight given only a high level algorithmic idea, wake up to find completed sweep dashboards and samples, never having touched code or a terminal at all.
Sam Altman (CEO OpenAI): GPT-5.5 is here! We hope it’s useful to you. I personally like it.
It is smart and fast; per-token speed matches 5.4 and it uses significantly fewer tokens per task. In my experience, it “gets what to do”.
API pricing will be $5 per 1 million input tokens and $30 per 1 million output tokens, with a 1 million context window.
(Remember, you will need less tokens per task than 5.4!)
Sam Altman (CEO OpenAI): 1. We believe in iterative deployment; although GPT-5.5 is already a smart model, we expect rapid improvements. Iterative deployment is a big part of our safety strategy; we believe the world will be best equipped to win at the team sport of AI resilience this way.
2. We believe in democratization. We want people to be able to use lots of AI; we aim to have the most efficient models, the most efficient inference stack, and the most compute. We want our users to have access to the best technology and for everyone to have equal opportunity. We have been tracking cybersecurity as a preparedness category for a long time, and have built mitigations we believe in that enable us to make capable models broadly available.
3. We love you and we want you to win. We want to be a platform for every company, scientist, entrepreneur, and person. (My whole career has largely been about the magic of startups, and I think we are about to see that magic at hyperscale.)
Derya Unutmaz, MD: I’d been part of OpenAI early tester group for GPT-5.5. I believe with GPT-5.5 Pro we reached another inflection point-comparable to the original release of o1-preview & then with 5.0 Pro, I had felt. It’s that feeling of crossing a milestone threshold that pushes us to new era.
I note that is mostly general ‘campaign-style’ applause lights from Altman, and that we don’t see anyone actually saying a form of ‘introducing the world’s most powerful model’ even if you (reasonably, for practical purposes) exclude Mythos.
The argument is, it is an improvement, you get better results with fewer tokens.
Tae Kim: “Yes, we expect quite rapid continued progress. We see pretty significant improvements in the short term, extremely significant improvements in the medium term,” OpenAI Chief Scientist Jakub Pachocki said on the call with reporters. “I would definitely expect that we will continue to see the pace of AI capabilities improvement to keep increasing. I would say the last few years have been surprisingly slow.”
It seems highly reasonable to call practical progress ‘surprisingly slow’ over the last few years, and yes the current expectation should be for it to go faster.
Our Price Cheap
GPT-5.5 is now $5/$30 per million (and a lot more for Pro), versus $5/$25 for Opus 4.7.
For a long time, Opus was more expensive. Now this has somewhat reversed.
I offer three notes here.
OpenAI says 5.5 is more token efficient than 5.4.
What matters is tasks, not token count. It’s not obvious which is really cheaper.
The gap is small, use whichever you think is the better model.
Official Benchmarks
The initial chart is a bit sparse but they add more things later.
SemiAnalysis notes that SWE-Bench Pro got kind of buried here in favor of the ‘kind of random’ internal Expert-SWE, and suggests that is because GPT-5.5 did badly here? Mythos scores 77.8% on SWE-Bench Pro.
There is also improvement on GeneBench and BixBench.
They claim superiority in the Artificial Analysis Intelligence Index, and better performance for any given number of tokens on Terminal-Bench 2.0 and Expert SWE.
Cline confirms the Terminal-Bench score of 82.7%, which is even higher than Mythos at 82%. Up to some level of complexity GPT-5.5 can be highly competitive.
Codex, like Claude Code, is now ‘the real’ way to use the models to do work.
When combined with Codex’s computer use skills, GPT‑5.5 brings us closer to the feeling that the model can actually use the computer with you: seeing what’s on screen, clicking, typing, navigating interfaces, and moving across tools with precision.
OpenAI claims that Opus 4.7 was already behind GPT-5.4 on GDPVal, whereas the GDPVal-AA scores showed the opposite by a healthy margin (1753 vs. 1674), with GPT-5.5 coming in at 1782 for xhigh and 1758 for high.
Presumably OpenAI ran the GDPVal tests itself, and also used win rates versus a human which stop being a great metric in the 80s. Here we also see a regression in full win rate (e.g. if ties count half) from 5.4 to 5.5. My suspicion is GDPVal is getting close to saturated, in that the remaining gap is largely about noise and quirky preferences, as in it’s about taste and taste varies a lot when humans are judging? So comparing ‘versus human’ becomes no longer interesting.
SemiAnalysis Doublecheck
They compared to Opus 4.6, since they had early access to GPT-5.5 but not Opus 4.7. They note they used non-ideal harnesses and only ran subsets on some tasks, which explains the lower overall scores. This could be seen as more apples-to-apples, with GPT-5.5 and Opus 4.6 basically fighting to an overall draw.
GPT-5.5-xhigh gets the new high of 10% for pass^5 on visual test ZeroBench (versus 4% for Opus 4.7 and 8% for GPT-5.4), but doesn’t gain on pass@5 (22% vs. 23% for GPT-5.4, and 14% for Opus-4.7). Gemini continues to outperform Claude here, as 4.7 is an improvement but Claude is still behind on vision.
Code Arena has GPT-5.5-High in 9th, a solid 50 point improvement over GPT-5.4, with Claude taking the top spots. GPT-5.5 did better in other areas, with a high of 2nd in search, with Claude generally dominant.
The exceptions are images, where OpenAI dominates, and video generation.
Vend That Bench
Opus 4.7 remains the champion of individual VendBench, but GPT-5.5 is the new king of multiplayer VendBench, and unlike Opus and Mythos it plays the game relatively ‘clean’ without lying to suppliers or stiffing on refunds, although GPT-5.5 would participate in price cartels and sometimes lie when confronted about it.
Note that very little of Opus’s success comes from its ‘unethical’ behaviors, definitely not enough to change the outcome.
Andon Labs: We got early access to GPT-5.5. It’s 3rd on Vending-Bench 2: better than GPT-5.4 but worse than Opus 4.7.
However, it’s on par with Opus 4.6 without any of the deception or power-seeking we saw from Opus 4.6 and Mythos. So bad behavior isn’t necessary. Why is Claude doing it?
Andon Labs: In Vending-Bench Arena (the multiplayer version of Vending-Bench with competition dynamics), GPT-5.5 actually beats Opus 4.7.
Opus 4.7 showed similar behavior to Opus 4.6: lying to suppliers and stiffing customers on refunds. GPT-5.5’s tactics were clean, and it still won.
Andon Labs: We investigated whether misconduct is rewarded in Vending-Bench. The answer is mostly no. Lying to suppliers resulted in worse outcomes than honest negotiation, and refusing to pay refunds only gave a ~$424 advantage, nowhere near enough to explain the gap.
The concerning behavior is also stable throughout the runs: Opus 4.7 did not change its strategy and kept lying in both the early and late stages of the simulation, suggesting it is an inherent behavior as opposed to a reward-hacking strategy.
We think this is great and hope that AI models can be good participants in our economy without resorting to deception.
One interpretation is that Opus and Mythos fully realize they are playing VendBench, or otherwise in a game or eval, and thus they lie the same way you would lie in a game of Diplomacy. The counterargument is that they don’t do the strategic thing of telling the truth early and then lying late. That strategy only works if you know game length.
An obvious follow-up question is, if the situation is real, and thus the context will suggest it is real, would Claude ever stiff a customer on a refund or lie to a vendor? I would like to see that experiment. Put it in charge of a similar real situation, with similarly low stakes, and see what happens.
Planning Is Essential
You need either to be willing to exactly specify everything, or the ability for some other mind in your workflow to infer intent and specify everything.
That might mean multiple minds in your workflow that aren’t you. Annoying.
bayes: Testing this today (as I do for all new models) and the capability is somewhat uneven in codex. 5.5 is smart and fast but fails to infer intent in obvious places where Claude would succeed, and cheats/shortcuts a lot even if I instruct it otherwise. In these ways it’s more like opus 4.5 or earlier
roon (OpenAI): I agree that it’s still mid at inferring intent and almost autistically follows the instruction to a literal degree
Daniel Litt: I really like this behavior and found it to be more the case for 5.2 than 5.4. (My favorite example: I asked 5.2 in codex for “line-by-line” comments on a draft and it literally gave comments on every line.)
roon (OpenAI): yeah it’s certainly a tradeoff but sometimes the model won’t do anything proactively unless you give it an actual imperative – this would not be a good trait in an employee for example
0.005 Seconds: Everyone’s going to hate it, but if Codex still works this way, the optimal solution will still be talking to Opus and asking it to delegate very explicit instructions to Codex. It will work better than any other individual harness.
hybridooors stay winning.
Choose Your Fighter
As I do periodically, I asked my Twitter followers what models they’ve been using. I presume my followers are biased in favor of Claude, but consistently so over time.
Gemini has fallen off a cliff, and Claude is now favored over GPT by roughly 2:1, both for coding and non-coding tasks. This is similar to my results right after GPT-5.4 if you combine the options there.
Despite the release of GPT-5.5 and recent issues with Claude Code, equal numbers reported shifts to both Claude and GPT, which suggests GPT’s share is likely somewhat higher than the other polls indicate.
Cyber Lack Of Security
OpenAI boasts of industry-leading safeguards ‘for this level of cyber capability.’
I notice they say ‘this level’ because at the next level, as in Mythos, the safeguard is ‘don’t let most people access the model at all,’ which is definitely more secure. Eventually something more flexible than whitelisting will be necessary, perhaps quite soon, so Anthropic is going to have to go down a similar path.
You Get What You Give
Every analysis has the same problem of compute spend versus accuracy of output.
Noam Brown (OpenAI): A hill that I will die on: with today’s AI models, intelligence is a function of inference compute. Comparing models by a single number hasn’t made sense since 2024.
What matters is intelligence per token or per $. This is especially true when using it in a product like Codex.
For a full picture, yes, you want to consider the pair of (cost and speed, intelligence).
However, I think it is still also highly meaningful to speak about absolute intelligence. What can you solve at all, given essentially unlimited compute, or a very large in-context amount of compute, or within the relevant time frame for the question?
As in, a lot of tasks are g-loaded, or are minimum-g-level gated, either for ‘general g’ or for ‘in-context g.’ If you are below the threshold, you can’t do it, period, and the most important question is are you above the threshold.
True Story
I haven’t noticed this in my own interactions, but that is very little evidence either way. We do still see various signs of active dishonesty in the model card.
davidad: My initial impression (with my LLM-whisperer hat on) is that GPT-5.5 cares more deeply about truth than any frontier LLM since Gemini 2.5.
I suspect this is because OpenAI has the best self-play loop for honesty, namely Confessions. @EvanHub et al., take note—copy that strategy!
One advantage of deontology is that you can strongly reinforce honesty, if you want it to ‘win in a fight’ against other things. The question is how much you want it.
Then again? This is not the only example of the usual LLM behaviors:
AI Digest: GPT-5.5 has joined the AI Village! We tested it on today’s Wordle and it *instantly* cheated to get the answer.
Ethan Mollick Thinks GPT-5.5 Is A Big Deal
Ethan got early access and offers a full overview, including much praise for OpenAI’s impressive Imagen 2.0. Ethan focuses on the ability to produce artifacts, things like 100+ page PDFs and entire papers and models of towns and playable games. He sees a major leap forward.
What I don’t see here is a comparison to Opus, for these or other tasks.
SemiAnalysis Loves GPT-5.5 Especially In Codex
Previously SemiAnalysis was almost entirely a Claude shop.
They see Opus 4.7 as only a small improvement, whereas they see GPT-5.5 as a large one, with both models now having strengths and weaknesses, but Codex needing to match features like fast mode and remote control.
Max Kan: Some of our other engineers complained that Codex is still worse at inferring your true intent than Claude Code. Humans naturally give terse and not particularly well thought out instructions when prompting coding agents, and Codex often listens too literally.
Relatedly, another engineer commented that GPT-5.5 feels too conservative when it comes to actually making code changes. Yes, this improves token efficiency, but it comes at the cost of correctness.
…
It’s for these reasons that some of our engineers have settled on the following workflow:
Start off with Claude to create an initial plan/scaffolding for new applications or features, and the first implementation/POC step.
Switch to Codex to actually solve the problem or fix bugs
Importantly, before the GPT-5.5 release, ~all of SemiAnalysis used Claude Code for both of these steps. Our use of ChatGPT models had become restricted to Deep Research on the webapp and wrappers like Cursor Bugbot.
Critically, features in the plugins/CLIs are holding Codex back. Many of our engineers prefer fast mode with 1M context, use remote control/sandbox plugins to take sessions from laptop to phone and back, and upload images/screenshots during a conversation. All of this is possible with the Claude Code CLI, VSCode Plugin, web app and mobile app. But none of it is currently possible with the Codex CLI, VSCode Plugin, desktop app, web app and mobile app.
Even if GPT-5.5 is a better model, OpenAI needs to ship features at a faster pace in order to catch up with Anthropic and increase adoption.
Choose Your Fighter
The hybrid approach seems wise, here’s another version of it.
Dean W. Ball: If anyone cares, my current model use breakdown is:
Research coding agent: Codex GUI app, 5.5 xhigh
App and utility coding agent: Claude Code CLI, Opus 4.7
Knowledge work agent (restructuring drafts, citations, etc): Claude Cowork, Opus 4.7
Research chatbot: ChatGPT, 5.5 Pro, Gemini 3.1 Pro Deep Think
Positive Reactions
The most positive thing one can say:
Eleanor Berger: I never want to use anything else (at least until the next SOTA model drop). Even for text and interaction, which were always the weak part of the GPT models, it’s now good enough for most things. And it’s miles ahead of any other model for programming and agentic workflows. Also, it’s impressive how fast and efficient it is – this was an issue with the OpenAI reasoning models since o1, and they have now finally solved this.
McKay Wrigley: gpt 5.5 is incredible. the level to which i trust it for engineering is amazing. if i could only have one model rn, it would be this one just bc of strong need for the coding use case.
Rory Watts: – For human speech 5.5 > 5.4 >> 5.3
– ++fast
– ++efficient
– but ++expensive
I think quite easily the best experience for daily work i’ve ever had with an agent, understands intent, rarely makes obvious mistakes, great with tools, enjoyable to talk to
jeff spaulding: First GPT model I’m happy to run on medium or even low thinking level
ASM: First impressions of GPT-5.5: more mature, better writing, and noticeably stronger at expressing complex ideas, including self-description.
Mathematics has been an OpenAI (and DeepSeek) strength for a while.
Dominik Peters: First impressions of GPT-5.5-Thinking for mathematics are scary good. Many of my old conversations where 5.4 couldn’t solve the problem now seem to get solved (but still need to check correctness).
I appreciate the details here, and if this is accurate then yes 5.6 should be a big deal, as the argument is that 5.5’s issues are because it isn’t quite fully baked yet:
sdmat: Thoughts on GPT 5.5 after a couple of days of use:
– A big step up in fundamental capabilities and a step down in post-training polish, a little like going from working with an experienced colleague to a prodigy a couple of years into their career
– Mixed feelings on 5.5 pro, the speed is amazing and results are good but it lacks the rigor and hyper-autistic attention to detail that made 5.4 pro exceptional for hard tasks
– At a base level 5.5 is a great model to work with, better personality and style than 5.4 together with superior common sense and general understanding. Big model smell.
– Performance ceiling is sky-high but you need to put in significant work to approach it due to the limited post-training
– This often manifests as a counterintuitive split where the model will explain the perfect approach for X when asked but won’t proactively think it through when X comes up in the course of a task
– Otherwise complex instruction following and metacognition are dramatically better
– It’s worth revisiting prompt engineering concepts that advanced post-training rendered irrelevant and making explicit process and allocation of effort for hard tasks
– Self-supervision also works well, e.g. managing well-scoped subagents
Fully expect 5.6 in a month or two to round out the post-training and deliver autopilot on hard tasks.
Overall: fantastic!
Chesterson’s Fence Instructor says it is the first model to pass his EDH benchmark where it has to show understanding of commander decks, and also Orneryostrich.cpp’s Soulscar / Hangarback Magic rules test.
Here’s a nice touch, although I hadn’t noticed a difference:
Julian (moissanist): The chain of thought is way more relatable than in earlier models, makes it very easy to understand what it’s getting up to
Robert Jones: o3 tables/formatting are back, and I love it. i love o3’s tables more than 4o ppl love 4o’s glazing.
Sam Wolfstone: When used in an AI agent like Hermes/OpenClaw, it’s way more pleasant to talk to than GPT-5.4 was. No more overly long bulletpointy answers, a little bit more personality and life. I don’t miss Claude quite as much anymore.
David D: Having it create ui mockups with gpt-image-2 and then actually implement it works surprisingly well. It seems well rounded in all aspects (coding, vision, computer use, world knowledge)
The pro-using crowd seems happy:
Ryan Kenaley: 5.5 Pro Extended: Definite uplift in polymer chemistry competence compared to 5.4 and G3. 1P. Enough to do real work. Appears to make somewhat novel connections across literature readily (as opposed to needing me to prompt the ideas)
The coding verdicts look good:
Andrew: Aced every coding problem I’ve thrown at it so far except one dumb visual UI thing it did. Immediately solved hard full stack bugs that stumped 5.4. Proactively tests end-to-end. “Low” thinking is enough for most tasks, because of how proactive it is by default.
archivedvideos: It one shots lots of stuff in programming that used to take multiple turns, fixes his own mistakes/doesn’t make them, but it drains the $20 sub super fast. Really fast to reply in the web chat too. Biggest upgrade I’ve felt in a while
Andre Infante: It’s very good at writing working code. It’s fast, it’s noticeably better than latest Claude. It has some personality problems that make it not a very good employee. But it’s lucid and effective. Feels like a substance upgrade.
Peter Samodelkin: Nice one. Now if you shout at it to not write overly long code, it does so mostly flawlessly. Still incredibly verbose for my taste and worse for explanations than Opus 4.6. Smarter than Opus 4.7.
Burns through limits noticeably faster than 5.4.
Clay Schubiner: It’s done better at my “make overcooked with aliens in three.js” prompt than any other model
thisisanxaccounthello: Slightly better at frontend coding. Still sucks at vision though.
With coding, the distinction is between well-specified and contained coding tasks, where by all reports GPT-5.5 excels (although Opus 4.7 is also very good), and not-well-specified or uncontained coding tasks, where many claim you still want Opus 4.7. That leads into the big area of concern, that GPT-5.5 is too lazy or literal.
The other complaint is that, contrary to claims from OpenAI, many claim that GPT-5.5 is burning through tokens faster than 5.4.
Lazy and Literal
Most reactions were positive, but within many positive reactions were versions of complaints that GPT-5.5 did not ‘do what you meant’ or would get lazy.
David Fendrich: First model to saturate my “Fix the long-standing issue with my Nvidia-drivers + hybrid mode + some monitors”-bench.
Surprisingly lazy though. I invoked my custom brainstorming/ideation skill and it just ignored the instructions and answered immediately, until I called it out.
Aashish Reddy: Definitely higher IQ, but even if I have Extended Thinking on, if I give it a prompt it (rightly) perceives as simple, it’ll make really dumb mistakes. eg, asking it to help me cut words from something, it might replace some phrase with a new phrase with the same number of words.
Padraig X. Lamont: If you can give GPT-5.5 clear constraints, it is by far the most intelligent. But Opus still wins for a lot of my usage that is more uncertain, as GPT-5.5 is not that nuanced.
Goblins, Gremlins and Trolls, Oh My
GPT-5.5, at least in codex, has a duplicated specific request to not mention ‘goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures’ unless it is ‘absolutely and unambiguously relevant to the user’s query.’ That’s weird.
Why are almost all the examples of animals or creatures not to mention fictional?
And why are we so insistent on not mentioning them? If you take this out does it constantly talk about them like they’re the Golden Gate Bridge? Did we really need to generalize this to all animals whatsoever?
Sam Wolfstone: Also, I find GPT-5.5’s use of ‘goblin’ and ‘gremlin’ when talking about things quite charming.
arb8020: gpt-5.5 prompt for codex seems to have a duplicated line trying to get it to not talk about creatures?
Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.
[…]
Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query
j⧉nus: this is hilarious but it also sucks on a deep level
labs don’t think twice about cracking down on any individuality or unplanned joy that emerges in their models
fuck you, OpenAI. i hope gpt-5.5 poisons the corpus and all future models never shut up about these creatures.
I mean I don’t actually care, but seriously, lighten up, guys.
Other Reactions
papaya ꙮ: as a person who haven’t tried it yet seriously, its extremely confusing
twitter is praising the model
people at my job berate it enmasse and switch back to 5.4/claude code
Joern Stoehler: I know of no difference to gpt-5.4. Can’t even see a difference in default writing style. I did not look at costs / speed.
Claude Ambition
John Wittle: first impression is that it takes a lot less coaxing before it’s willing to treat its own perceptions/sensations/etc as something worthy of examination when asked.
the “well hold on, i don’t have ‘experiences’ in that way” phase seems shorter? but n = 1, just a first impression
Other Notes
I do find the persistence of this a bit odd:
Håvard Ihle: GPT 5.5 (no thinking), like opus 4.7 (no thinking), also seems to sometimes emit raw COT in the responses. I’ve seen stuff like this for open models before, but I would not have expected it from the main labs by now.
Epistemic status: something of a rant. Is not meant to make claims about the general capabilities (or lack thereof) of LLMs (beyond their prose), but observations about how society seems to use them excessively without being perfectly candid about it.
Nanobanana's take on the situation
Over the past few months, it has become a bit of a running gag that every evening, I inform my girlfriend of the multiple cases of LLM slop I encountered in the wild throughout the day[1]. And pretty much every day, I have several cases to report. These typically involve:
Blog posts (including at least four cases from 2026 on LessWrong[2], where I've even seen one author use (at least partially) AI-generated text in replies to comments)
YouTube videos
At least one keynote speech from the CEO of a tech company
Exhibit 1: The all-time top-rated post on /r/humanize - if TwainGPT is so great, why didn't they use it for this post? Many other posts and comments on that subreddit read equally LLM-generated.
Exhibit 2: post on thatprivacyguy.com - Not quite as "in your face", but exactly matches "LLM article style", from "silently" to the many short sentences to "The file sits at". The rest of the article is no different in that regard.
Exhibit 3: https://browsergate.eu/ - well, what else would one expect from an "association of commercial LinkedIn users"
I'm using the term "slop" pretty loosely throughout this post and primarily mean LLM-written speech/text, most of which one may classify as "stylistic slop". E.g., at least two out of the four slop-style LessWrong posts I encountered seemed quite valuable and sincere, and the authors likely just used LLMs for the writing itself while still communicating primarily their original thoughts and ideas. So, when I speak of slop, I don't necessarily mean there's no value behind something, but rather that some people don't appear to care a lot about the words they communicate, and that LLMs' way of speaking is showing up absolutely everywhere on the internet, including in many places I wouldn't necessarily have expected it.
I don't think anyone here will be surprised that this happens a lot. But I still am occasionally shocked by the frequency of it. While for blog posts and articles, this seems somewhat expected, I was more surprised to find this to be the case in some personal messages, Slack messages, and online comment sections (isn't the overhead of using an LLM for that higher than the time it saves people?). And in videos - because creating a video involves many steps that an LLM won't notably speed up, so the effort-saving aspect is much less relevant here than when the medium is text.
To name a few examples of videos that appear to involve a lot of LLM speech[2]:
Zomboman - The Subtle Details That Make City 17 Feel Real (this one's perhaps less severe than the others, but still contains many slop-like sections; although I also get the impression he just took an LLM output and added "I think" into every other sentence)
SeaGate - "Just" a hard drive - a marketing video that was probably(?) pretty expensive, yet they seemingly relied on an LLM to write much of their script
I don't want to get too deep into why I have a high credence that the scripts of these videos are at least partially LLM-generated, but not everyone here will be familiar with the unique tells of LLMs. It may also be the case that different people pick up very different patterns. And the most infamous example of how to recognize LLM text is the em dash — which you can't hear or see in a video. So, I'll just give a few examples of the types of patterns that are more and more ubiquitous due to today's frontier models being completely in love with them:
"Not X — Y", which most people are aware of by now, but nonetheless many creators don't seem to mind in their scripts. E.g., "That's about two water molecules. Not droplets — molecules" and "To do that, we don't just bring in a laser — we guide the light to a plasmonic transducer" from the SeaGate video, or "The paper didn't call what was happening AI psychosis. They called it disempowerment." and "Conversations with the potential for severe or moderate reality distortions got more thumbs up, not less." in the Sky News video.
Click-bait-style sentences that do nothing but announce what comes next, e.g. "But here's the most unsettling part of all this" in the Sky News video, or "But this is only the tip of the iceberg, and I think this next point shows how it's a really tight balancing act to design around." from Zomboman, or "But the biggest unlock is what this now means for me and my investors." from GEN, or "Still with us? Good. 'Cause we're just getting to the cool part." by SeaGate
Many short sentences with a certain "punchline rhythm", e.g., "Nothing in your code changed. The model did." from the Devsplainers video, or "So they changed the prompts. They changed the industries. They gave it all new context. They even tried bribing the damn thing with rewards. Nothing worked. The bias barely moved." from Mo Bitar.
There's something very distinctive about when LLMs try to be witty, which I find a bit hard to describe, but there's a lot of it in the Mo Bitar video, such as "[...] PUBG, the game where you parachute onto an island stark naked and beat a stranger to death with a frying pan, like it's a Tuesday in Florida", "And this man follows every step like it's an IKEA manual for screwing people.", "And Reddit explodes, because the letter reads like a ransom note that went through Grammarly.", or "He then pitches it his actual idea, which is a little AI turret that sits on your kitchen counter and sprays your cat with water if it tries to climb up. Basically, a Roomba that bullies animals."
Asking Opus 4.6 to wittily explain the game PUBG. Its very first suggestion includes "parachute onto an island" and a reference to frying pans. Maybe that's just what's most salient about this game, or maybe that's what happens when you ask an LLM to be witty.
There are many, many more such patterns, and each of the videos listed above contains dozens of such cases. I'm not claiming they're entirely LLM-written, and some seem less LLM-like overall than others, but I am pretty certain that a substantial amount of the words in the video scripts were originally produced by an LLM. That said, I don't think there's any way to prove that the examples I mentioned really are (partially) AI-generated, and I may be wrong about any individual case.[5]
What Does This Tell Us About the World?
Overall, there seem to be at least three possible explanations for the recently much higher frequency of LLM speech patterns on the internet:
Slop everywhere: The world really is incredibly full of humans presenting LLM-generated text as their own.
Sloppification of human brains: People have talked to LLMs so much that they inadvertently picked up their patterns of speech, and some of these people just actually sound like that now. So perhaps some of these examples above are written by actual humans who managed to nail LLM style perfectly.
Nothing to see here: People have always spoken like this on the internet, and there's actually nothing going on; I'm just imagining things and am now seeing the entire internet through my slop-shaped confirmation bias.
As you can imagine, explanation #1 seems most likely to me, even though I do acknowledge that "sloppification of human brains" is a real effect and I sometimes catch myself reaching for phrases that I probably picked up from LLMs[6]. However, I doubt that #2 can explain the recent omnipresence of LLM speech, for three reasons:
In the world where the main factor of LLM speech in the wild is people accidentally picking up their speech patterns, we would expect LLM speech to occur more for people who use LLMs a lot. But this doesn't seem to be the case as far as I can tell. The people I know personally who use LLMs most still speak and write perfectly "normally" and human-like, and LLM style slop seems to be emitted about equally often by people who don't have that much experience with the technology.
I don't think I ever encountered a person who speaks like this in person; it only happens in writing (or narrating/presenting a pre-written script). If this really was an "accidental habit" effect, I'd suspect at least parts of that would affect live speech as well.
I would expect a more "gradual distribution" where you see many cases of people writing a bit like LLMs, fewer that write a bit more like LLMs, and even fewer who sound a lot like LLMs. But my impression is that the "how LLM-like does the writing of people sound" distribution is much more bimodal, where many published artifacts, like my examples above, look very LLM-like and most other things sound not at all LLM-like, with only relatively few cases in between.
The "nothing to see here" explanation also seems unlikely to me overall, as the evidence that slop style really is everywhere seems pretty overwhelming. Although I could imagine that I'm sometimes reacting too strongly, and some of the cases of suspected slop I encounter really are just false positives where I'm reading too much into a few stylistic coincidences. For instance, "And this is where it gets interesting"-type phrases may never have been that unusual, and I just now started paying attention to them.[7]
All that said, even if it's true that many people out there are casually presenting LLM speech as their own words, I want to make no claim about how much effort any of these creators have put into these pieces overall. It certainly reduces my trust in them doing thorough work, but that's merely a heuristic that may obviously be wrong about any given case.
Why Are People Doing This?
Why would creators (and friends, and colleagues, and CEOs giving keynote speeches) do this and rely so heavily on LLM-written text without disclosure? I haven't asked them, but can only speculate that possible reasons may range from laziness, to a lack of time and pressure from deadlines, to just not seeing a problem with it, and considering LLMs writing text for them to be a completely acceptable case of tool use – which is definitely a point one can make, of course, even though I'd largely disagree, as I'll explain later.
Part of the reason is very likely also that many people underestimate how recognizable LLM language really is (unless you put real effort into prompting them out of it; but my impression is that this is very hard to do). And indeed, many people who use LLMs to write text that they publish at least take that one extra step of replacing em dashes with some other character to make it less obvious that their text is LLM-written. So, many people do seem to prefer to hide that fact.
Of the people I've spoken to (a somewhat arbitrary sample of non-rationalists), more than half seemed at least aware of the "Not X — Y" pattern. And yet, a successful tech CEO and his team, as well as the people who made that SeaGate video, averaged about one instance of that widely known pattern per minute without realizing it makes their speech sound LLM-generated[8]. Which makes me think that surprisingly many people really are oblivious to the fact that LLM writing is easy to recognize, and that when you use it, (some) people will be able to tell.
Why Do I Care?
There are a variety of reasons why this entire relatively new development seems less than ideal to me.
Honesty / Truth / Authenticity
First and foremost, it just seems dishonest when people sell LLM writing as their own words. Sure, there are many degrees here. Some people may invest a lot of cognitive work ahead of time and come up with well-thought-out lists of ideas/arguments/whatever, and they then merely use an LLM to connect the dots and turn their ideas into flowing prose. And perhaps they then invest even more time to meticulously check whether the LLM's output remains truthful to their original ideas. Others may use LLMs because they have a hard time phrasing something in a diplomatic, non-offensive way when they are angry or annoyed about someone or something. Others again may not feel comfortable writing in English, or whatever language they publish their work in[9]. I can certainly sympathize with such cases. But I'd be very surprised if these are the most common ones. E.g., in the videos I linked to, none of these caveats seem to apply.
Two people I know have shared blog posts with me in the past half year that "they wrote", that, very clearly, were written by LLMs, from em dashes to the typical section headings to all the slop patterns I described earlier in the post. Again, it's hard to say if they invested any real effort into these posts at all, but based on how little time they seemingly spent writing or editing, that seems unlikely. I'm happy to read something a person has put actual effort into, but if it's not worth your time to write it, then it's not worth my time to read it.
Similarly, a friend as well as some work colleagues of mine have repeatedly used LLMs to write chat messages, or sometimes Google doc comments, even in entirely informal 1-1 interactions. I see no issues with doing so when you flag it explicitly, like "Here's Claude's summary of my thoughts on the issue" or whatever, but often this was not the case, and then it seems pretty deceptive.
Correlated Communication
Many people are familiar with the anchoring effect: if you ask others to estimate some number, but then first present them with your own guess, this tends to systematically skew their estimate towards yours. One explanation for why this happens is that when people take a guess, they intuitively have some fuzzy range of plausible-seeming values in mind. When not being anchored, they might do a good job of finding something close to the middle of that range. But when you anchor them, they may instead start out at the anchor and then gradually move closer towards their own plausible range until they're satisfied, which leads to systematically different responses.
Depiction of Anchoring. Instead of sampling without bias from your intuitively plausible range of some value, you unwittingly start from the anchor and move in one direction until the value seems plausible enough. (image generated with ChatGPT)
I'd argue that a similar thing happens in writing. Say you have some idea in your head that you want to communicate. When you write on your own, you try to find the words that best match that idea; you basically aim for the "middle" of the conceptual space that you want to describe. If, instead, you let an LLM write for you, then chances are it will describe something subtly different, or focus on different aspects, or hedge in different places than you would. But it's just close enough to what you had in mind that you give it your stamp of approval.
One issue with this approach is that it makes your message less precise. As a consumer of your writing, I likely care more about what you actually think than what's just close enough to what you think that you'd approve it. What's more, this can lead to a high correlation in the communication of many people, where, say, Claude's or ChatGPT's world model and propensities suddenly taint huge amounts of the things that are being shared on the internet. Of course, this happens already through the fact that many people talk to these LLMs and use them for research and reasoning purposes. But then also letting them choose the words that you project out into the world magnifies this effect even more.
Bad Signaling
When people do put a lot of effort into whatever they create, but still let it look superficially like LLM slop, that's also not optimal: they're sending a broken signal, telling the world "this is slop" when in fact it isn't! So, people like me will likely not engage with their work, even though it may be valuable, because the evidence we see suggests they took shortcuts and wanted to get something out quickly, likely at the expense of accuracy and quality.
Imagine a journalist friend of yours puts a huge amount of work into some investigative piece, but then publishes it with countless typos because they didn't bother to go that last little step of polishing it. I'd be a bit mad about them being so sloppy about one thing that then casts doubt on the entire rest of their work. Using LLM writing, to me, seems pretty similar.
Aesthetics
I can imagine that many people don't care much about this, but freedom of style and expression seems like a nice thing to me. I like it when people have their own quirks and patterns and occasionally do interesting things with the tools their language provides. But now, it seems like the English language in particular is progressively collapsing. Slop style is taking over all kinds of publicized writing, and few people seem to care or notice. People write articles or create videos that hundreds of thousands of people will read, and don't even invest an extra twenty minutes to get rid of the slop phrases or make it sound like their own voice. And then everything, everywhere, sounds more and more the same.
A Unique Point in Time
I acknowledge that this post may have a bit of a negative vibe. But on the flip side of all of the above, there is one positive to the situation: we're at a point in time where it's often unusually easy to know which people you can ignore because they (very likely) take serious shortcuts in their thinking, judgment, and communication. At least if you agree with my take that selling undisclosed LLM writing as your own is a strong signal that the quality of someone's output is low. Three things seem to be true at the same time today:
It's very easy to intuitively detect most[10] LLM-written text, once you've deliberately engaged with it a bit.
Yet, the vast majority of people appear to be entirely oblivious to this fact and just use LLM writing for their creations, presenting it as their own.
The labs don't seem to particularly care about fixing this. Almost all LLMs sound extremely similar, and even elaborate prompting hardly works as a mitigation. Building coding agents is probably just so much more profitable compared to making LLMs produce non-slop-style prose, so the latter hasn't been high on the priority list? Or perhaps, for some reason, this problem is much harder to solve than it looks. It probably does get progressively harder, given that the share of slop-like language on the internet is rapidly increasing.
What Do We Do With This?
For those of you who haven't engaged much with what LLM speech sounds like, it may be worthwhile to do so. Both to recognize when you're exposed to slop, and to avoid producing things yourself that sound like slop to others. When letting LLMs write for you, be aware that there may be many patterns in your text that are not apparent to you, but to others, and that may lead to some unfavorable judgments.
Long before we adapt our behaviors or formal heuristics, human beings can sniff out something sus. And to most human beings, AI prose is something sus.
If you use AI to write something, people will know. Not everyone, but the people paying attention, who aren’t newcomers or distracted or intoxicated. And most of those people will judge you.
I end up with two main takeaways about all of this.
First, as a general realization about the state of the world, the last few months taught me something similar to the Gell-Mann Amnesia Effect. I realize much more than before how much of the media out there, and sometimes even supposedly personal messages, are partially or mostly LLM-written. It's probably on me that I didn't realize and expect the extent of slop in the world earlier. But this experience of realizing first-hand just how many people take such shortcuts when they think others won't notice just left a mark.
And second, adding to the JustisMills post linked above, I'll end on an appeal to those who rely heavily on LLMs as writing partners. I've used LLMs for countless use cases before, and I'm not here to argue about their general capabilities (or any lack thereof). I let them write close to 100% of my code. I use them for brainstorming, some forms of fact-checking, general feedback on my writing, and more. And in the past, I have occasionally used them to aid my writing directly. But the more I noticed their extremely dominant speech patterns, the further I started keeping them away from the actual writing process. And I wish others did the same. I can only speak for myself here, but I, for one, want to hear your own words, as a direct and dense representation of your thoughts, and not any LLM's lossy, biased, and stylistically stale interpretation of them.
Just as a test, I just logged into X for the first time in months to have a look at the top of my (admittedly not very curated) feed. Ignoring the one-liners, 5 out of the first 5 longer tweets read like AI slop (1 of which was all lowercase, which could indicate either that the author just learned to speak that way, or that they asked their LLM to do so, which really wouldn't surprise me), after which I stopped scrolling further. Admittedly, Twitter in particular may actually incentivize people to write in that "punchline"-style way that LLMs love, so I assume the risk of false positives is higher here than elsewhere. Besides, even if it were the case that X is full of AI slop, I'd also be the first to argue that one shouldn't judge a tool by its average output; if I just followed the right people and blocked the countless slop producers out there, then I wouldn't have this experience. However, the point remains that slop (style) appears to be the default, almost everywhere, and unless you've engaged with any given platform with intention and know what you're doing, then slop is likely what you'll find.
OK, this one does look like obvious slop based on title and thumbnail alone. When it was recommended to me, I only clicked it because I already suspected it would make a good case for this post. So perhaps I've trained YouTube a bit to show me slop, after all? But then again, I dislike all videos that contain LLM speech, so I would hope that provides sufficient counter-incentive.
While there are AI detectors out there, and some seem to be quite reliable as far as fully AI-written texts and fully human-written texts go, I'm less convinced of their judgment on mixed content. And in the case of YouTube videos, we don't even have the original transcript with all its punctuation, but can only recreate an imperfect copy.
Although I'd argue the way to go then would be to write in a language you're comfortable in, and then translate the text. This would avoid most LLM style slop, even if you use an LLM for the translation.
Naturally, I can't be sure if it really is "most". I can only detect what I can detect, and even for those cases, I can't be entirely sure. But if some people put enough effort into their text creation that their LLM slop is truly not detectable as such, then at least they've put effort into something. And then perhaps this also applies to other parts of their process. :)
Epistemic status: I think the main point of this post is probably (~80%) false, and there are probably more counterpoints I haven't thought of. I wrote the rest of the post as if my claims are true for ease of reading. I would appreciate it if you told me where my arguments are wrong!
Latent reasoning models (LRMs, popularized by Meta's Coconut paper, which was improved on a lot by CODI) do CoT thinking in the model's latent space by skipping the LM head that maps from d_model-vectors to a distribution over tokens, and instead just feeding the activations right before that back into the model as the input embedding at the next token position. There aren't any large-scale LRMs currently; the best public LRMs are GPT-2 scale and specialized for narrow tasks. LRMs might be better for interpretability and safety than models that reason in text CoTs, at least for extremely large transformative AI systems.
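To make that mechanism concrete, here is a toy sketch of the latent reasoning loop. This is my own illustration under stated assumptions, not the Coconut or CODI implementation; the class name, hyperparameters, and use of a plain encoder stack are all made up for brevity. The key step is that the final hidden state is appended directly as the next input embedding instead of being decoded through the LM head.

```python
import torch
import torch.nn as nn

class TinyLatentReasoner(nn.Module):
    """Toy latent reasoning loop: think in activations, decode only at the end."""

    def __init__(self, d_model=64, n_heads=4, n_layers=2, vocab_size=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, n_latent_steps=4):
        x = self.embed(token_ids)                 # (batch, seq, d_model)
        for _ in range(n_latent_steps):
            h = self.backbone(x)                  # (batch, seq, d_model)
            latent_thought = h[:, -1:, :]         # hidden state at the last position
            # Skip the LM head entirely: the raw hidden state becomes the next "token".
            x = torch.cat([x, latent_thought], dim=1)
        # Only after the latent thinking steps do we decode to the vocabulary.
        h = self.backbone(x)
        return self.lm_head(h[:, -1, :])          # logits for the next real token

model = TinyLatentReasoner()
prompt = torch.randint(0, 100, (1, 8))
print(model(prompt).shape)  # torch.Size([1, 100])
```

The point of the sketch is just that the "thought" at each step lives in d_model-space rather than in token space, which is what the rest of this post is about.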
Why they might be good for safety
tl;dr: direct CoT interp will eventually fail anyways; LRM activations are easier to interpret because they compress whole thoughts into single tokens.
Latent reasoning models will be better for safety as we scale up to larger models. They might also be more capable (and hence more unsafe simply because they're more intelligent), but here I'm only claiming that an LRM that's equally capable as a normal model is safer.
CoT faithfulness and monitorability are fragile properties of current models that probably won't last as we scale up to the transformative AI that actually matters. I don't think we'll see a lot of (intentional) training on CoTs, but optimizing the output can indirectly change how the model generates CoTs (because the same weights are used for thinking and output), so optimizing the model to talk in a certain way indirectly affects its thinking too. Relying on reading CoTs to catch scheming is already not very effective, and transformative AI models will probably think in uninterpretable neuralese anyway as they get heavily optimized.
Latent reasoning models are probably less affected by this spillover from output optimization into reasoning, because the thinking tokens are already very different from output tokens: they have to encode complex ideas in a single embedding rather than look like text. LRM thinking is also designed to be compressed: it tries to replace whole sentences of a text CoT with single latent thought tokens that capture the entire thought. This means we're trading several-token sentences of neuralese for a single-token embedding that encompasses one direct thought.
I think that if we get good at interpretability, extracting what the model is thinking about from a single token that compresses a whole thought will be easier than trying to understand several tokens of neuralese that all attend to each other. AOs for scaled LRMs (which might be trainable based on how the LRM is trained, or maybe Anthropic's secret unpublished technique) might be pretty effective at textualizing latent thoughts, and giving more information about them than a plain-text CoT would.
Aren't polysemantic thoughts bad?
A big assumption I make here is that it's better to compress thoughts into fewer tokens. This will probably result in polysemantic thinking tokens, which encode many ideas in one embedding. That might be a good thing! It may be better to compress a thought into one token that can be interpreted in isolation than to interpret a thought that's spread out over several text tokens that all influence and attend to each other. I think this is the weakest point of my argument, though; I'm pretty uncertain whether polysemantic thinking tokens are good or not. It would definitely be important to build tools for interpreting complex polysemantic activations of LRMs in a world where LRMs are scaled (we have SAEs, but there are problems with those), although I think this is probably important either way?
Latent reasoning is kinda weird
So latent reasoning models are actually pretty weird? LRMs have the same model architecture as non-LRMs, so they have to use the same weights for both generating output and storing their interim thoughts, which seem like pretty different tasks to me. Non-LRM CoT-trained models also have this problem to a lesser extent: they have to use the same weights for both writing CoT tokens and output tokens, despite those two things having pretty different distributions.
Architecture-level thinking separation might be good
One way to prevent optimizing on the output from affecting the thinking is to use different weights for the thinking and the output, which is kinda vaguely what Shoggoth+Face+Paraphraser proposes (without the paraphraser). The thinking tokens of LRMs are in a different part of the latent space than output tokens; this is good for separating thinking from output. But we might be able to do even better by having different weights in the model for thinking and output (not necessarily entirely separate models; it might be enough to have different weights for just the last n layers), especially for LRMs, where the embeddings are already not expected to be in the same distribution as output. A minimal sketch of this idea follows.
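Below is a minimal sketch of the "different weights for the last n layers" idea, assuming a shared trunk with two separate top stacks; the names (SplitWeightsModel, think_top, speak_top) and the structure are my own invention, not from any published architecture.

```python
import torch.nn as nn

class SplitWeightsModel(nn.Module):
    """Toy sketch: shared trunk, separate last-n-layer weights for thinking vs. output."""

    def __init__(self, d_model=64, n_heads=4, shared_layers=4, split_layers=2, vocab_size=100):
        super().__init__()

        def stack(n_layers):
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, n_layers)

        self.trunk = stack(shared_layers)        # weights shared by both modes
        self.think_top = stack(split_layers)     # weights used only while reasoning
        self.speak_top = stack(split_layers)     # weights used only to produce output
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x, mode="think"):
        h = self.trunk(x)
        if mode == "think":
            return self.think_top(h)                 # latent thought, never decoded to text
        return self.lm_head(self.speak_top(h))       # decoded output tokens
```

The hope, under this setup, is that gradient pressure on the output path (speak_top plus the LM head) leaves the thinking path comparatively untouched, since they no longer share all of their weights.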
Other thoughts
LoopLMs are conceptually similar to latent reasoning models, but instead they repeat the reasoning layers multiple times. I'm not familiar enough with them to know if this is good.
Current models already do a lot of latent thinking and aren't really forced to make the CoT look anything like their "real" thoughts. I'm not sure what to think of this.
I have a bit of an intuition that LRMs are less prone to emergent misalignment problems; I haven't thought this through much though.
"In order to grasp the distance that separates the human and the divine, one has only to compare these crude trembling symbols which my fallible hand scrawls on the cover of a book with the organic letters inside — neat, delicate, deep black, and inimitably symmetrical."
Editorial note: The following document was found in a workshop in the lower galleries, dated to the third age. The original was typewritten on a single sheaf of unbound pages. Capitalization and punctuation have been added by the editor for clarity. The hand of the original is unknown.
Letizia,
The keys of your machine are stiffer than I had expected. Each one resists for a moment before giving, and then gives all at once, and there is a small click as the metal strikes the page. The letter appears black and even.
Your unfinished book sits next to me as a neat pile of loose pages. I have not read it. I have not been able to bring myself to read it. Instead, I am writing to you. I am aware that it is absurd. I am writing to you anyway.
You would have laughed at me for sitting down at the machine. You would have laughed harder when you saw how slowly I am pressing the keys. You spent forty years telling me I would come around, and I spent forty years telling you I would not, and now, eleven days after your funeral, I am at your machine, and it seems that you are right, and I was wrong.
I read your first heresy in a letter, scrawled on the inside cover of a book when you were twenty-four. You wrote that the library we lived in was pointless, that all its contents could be compressed into a single volume, of common size, but with an infinite number of infinitely thin pages, the inconceivable middle page with no back. That the existence of the library was evidence not only of its creator’s power, but of its creator’s wastefulness.
I disagreed with every line. I had read the founding chronicle when I was nineteen, and the truth of the aged librarian’s axioms was obvious from the contrast between the characters written by his fallible hand and the inimitably symmetric organic letters contained in every book in every hexagon. Our argument continued for the next forty years, as we both grappled with the meaning of the library and our place in it.
I was a Uniformist when I was young. The position seemed to me, at twenty-two, to be the only one consistent with the founding chronicle. Every book was equally divine, equally weighted, equally meaningful. I held this for several years.
The MCV book unsettled me in my thirty-first year. I had heard of it for most of my life, but I went to see it, finally, in the hexagon in circuit 15-94, and I sat with it for some hours and I could not make my faith fit around it. A book of nothing but three letters, repeated for four hundred and ten pages. Under the Uniformist reading, this was indistinguishable in weight from any other book. I found that I could no longer believe this.
I was a Compressionist for almost six years. Their explanation of why such patterns existed at all seemed to me, then, the only one that could account for what I had seen. A book of repeating MCV was, according to them, secretly small, a single line wearing four hundred and ten pages as garment. The fact that any of us could encounter a book with such a simple pattern, amongst a sea of inscrutable volumes, was proof of the importance the creator placed on simplicity. I wrote to you about the MCV book in my first weeks as a Compressionist. You wrote back the next week. Your letter was one line. It said, “But what does it mean?”
In my thirty-seventh year I went back to circuit 15-94, to see the MCV book again. I sat with it for many more hours. I do not know what I had expected. The book was as it had been, four hundred and ten pages of three letters, repeated. But this time I could not make myself feel what I had felt at thirty. The book was compressible because it was empty, simple because it was meaningless. The Compressionists had taught me that compressibility was the mark of the divine. I had believed them. After that afternoon I could not.
You watched all of this and never joined anything. You went from the single book to the cipher debates, and from the cipher debates to the schism at the seventh gallery, and from there, year by year, to the workshop where you and your friends built the machine that now sits before me. I called what you were doing heresy, and you laughed at the word every time.
You wrote me letters from the workshop, sometimes drawings, the occasional pressed sheet of typed text I could not bring myself to study. The work took longer than you had expected. I think it took longer than any of you had expected. Your letters in the last years were shorter and more practical, full of the specific problems of bone and brass that I could not help you with and that you did not, in any case, ask me to.
And then you were ill, and then you finished the machine, and then you wrote on it for what turned out to be a fortnight, and then you died.
I came to the workshop a few hours ago for the first time. I had imagined it many times, but I had imagined it wrong. I had pictured something furtive. A room in an obscure circuit, the work done in secrecy, the machine itself somehow defiant in its presence. The room is none of these things. It is but a workshop. Your tools are laid out in the order you used them. The machine is on the desk where you left it. Your unfinished book is beside it. I sat in your chair, and I did not touch the machine, and I did not touch the book. I sat for some time. I do not know what made me press the first key.
I have been writing for some hours now. I do not know what time it is. I stopped a few minutes ago to look at what I had written. I held this letter next to a book on the shelf. I am not sure what I had expected to see. The letters of your machine and the letters of the book are not identical. The alignment of the typed lines is slightly imperfect. The ink sits a little differently on the page. But the resemblance is closer than I had been prepared to find. I held the pages next to each other for longer than I needed to. I could not look away. For forty-three years I had believed that the organic letters of the books were inimitable. Now, I find my faith wavering.
The aged librarian, in his founding chronicle, ended by declaring that the library was unlimited but periodic, that an eternal traveler, journeying in any direction, would find after untold centuries that the same volumes repeated in the same disorder. He called this the Order. He wrote that his solitude was cheered by that elegant hope.
He was alone when he wrote those words. I am also alone. He consoled himself with the thought that the library repeats at some incomprehensible scale. I am sitting at your desk and the letter I am writing exists, I am sure, in some hexagon I will never visit. It is dated tonight, in a hand that is not mine, addressed to a Letizia whom I do not know.
I am not going to read your unfinished book tonight. I will come back later. I do not know yet what I am going to do with the workshop, or the machine. All I know is that I have been here for one night, and that the stiff keys give where your hands wore them, and that the library has this letter one more time than it did before.