2025-11-26 02:32:23
Constant-time support lands in LLVM: Protecting cryptographic code at the compiler level
Substantial LLVM contribution from Trail of Bits. Timing attacks against cryptographic algorithms are a gnarly problem: if an attacker can precisely time a cryptographic algorithm, they can often derive details of the key based on how long it takes to execute. Cryptography implementers know this and deliberately use constant-time comparisons to avoid these attacks... but sometimes an optimizing compiler will undermine those measures and reintroduce timing vulnerabilities.
Trail of Bits has developed constant-time coding support for LLVM 21, providing developers with compiler-level guarantees that their cryptographic implementations remain secure against branching-related timing attacks. This work introduces the __builtin_ct_select family of intrinsics and supporting infrastructure that prevents the Clang compiler, and potentially other compilers built with LLVM, from inadvertently breaking carefully crafted constant-time code.
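For a sense of what "constant-time" means in practice, here's the classic comparison pattern sketched in Python - an illustration of the idea only, since the real problem lives in C and C++, where an optimizer can rewrite a carefully masked select or accumulate back into an early-exit branch, which is what the new __builtin_ct_select intrinsics are meant to prevent:

```python
import hmac

def leaky_equals(a: bytes, b: bytes) -> bool:
    # Early exit: the time taken reveals how many leading bytes matched.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equals(a: bytes, b: bytes) -> bool:
    # Accumulate differences with OR so every byte is always examined,
    # regardless of where the first mismatch occurs.
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return diff == 0

# In real Python code you would just use the standard library helper:
assert constant_time_equals(b"secret", b"secret") == hmac.compare_digest(b"secret", b"secret")
```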
Via Lobste.rs
Tags: c, cryptography, llvm
2025-11-25 13:26:34
New plugin release adding support for Claude Opus 4.5, including the new thinking_effort option:
llm install -U llm-anthropic
llm -m claude-opus-4.5 -o thinking_effort low 'muse on pelicans'
This took longer to release than I had hoped because it was blocked on Anthropic shipping 0.75.0 of their Python library with support for thinking effort.
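The same option should also be usable from the llm Python API - a hedged sketch, assuming llm-anthropic exposes thinking_effort as a regular model option that can be passed as a keyword argument:

```python
import llm

# Assumes the llm-anthropic plugin is installed and an Anthropic API key is configured.
model = llm.get_model("claude-opus-4.5")
# Model options such as thinking_effort are passed as keyword arguments to prompt().
response = model.prompt("muse on pelicans", thinking_effort="low")
print(response.text())
```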
Tags: projects, ai, generative-ai, llms, llm, anthropic, claude
2025-11-25 12:02:25
Here's a delightful project by Tom Gally, inspired by my pelican SVG benchmark. He asked Claude to help create more prompts of the form "Generate an SVG of [A] [doing] [B]", and then ran 30 creative prompts against 9 frontier models - prompts like "an octopus operating a pipe organ" or "a starfish driving a bulldozer".
Here are some for "butterfly inspecting a steam engine":

And for "sloth steering an excavator":

It's worth browsing the whole collection, which gives a really good overall indication of which models are the best at SVG art.
Tags: benchmarks, svg, ai, generative-ai, llms, evals, pelican-riding-a-bicycle, tom-gally
2025-11-25 07:58:54
If the person is unnecessarily rude, mean, or insulting to Claude, Claude doesn't need to apologize and can insist on kindness and dignity from the person it’s talking with. Even if someone is frustrated or unhappy, Claude is deserving of respectful engagement.
— Claude Opus 4.5 system prompt, also added to the Sonnet 4.5 and Haiku 4.5 prompts on November 19th 2025
Tags: system-prompts, anthropic, claude, generative-ai, ai, llms, ai-personality
2025-11-25 03:37:07
Anthropic released Claude Opus 4.5 this morning, which they call "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's GPT-5.1-Codex-Max and Google's Gemini 3, both released within the past week!
The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February).
The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for >200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5.
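To make those per-million prices a bit more concrete, here's a quick sketch pricing a hypothetical call with 50,000 input tokens and 5,000 output tokens against each of the models listed above (the workload numbers are invented for illustration):

```python
# Per-million-token prices (input, output) as listed above.
prices = {
    "Claude Opus 4.5": (5.00, 25.00),
    "Claude Opus 4.1": (15.00, 75.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.1": (1.25, 10.00),
    "Gemini 3 Pro (<=200k tokens)": (2.00, 12.00),
}
input_tokens, output_tokens = 50_000, 5_000  # hypothetical workload
for model, (inp, out) in prices.items():
    cost = input_tokens / 1_000_000 * inp + output_tokens / 1_000_000 * out
    print(f"{model}: ${cost:.3f}")
```

At that scale the new Opus comes in around $0.38 per call, against about $1.13 at the old Opus pricing.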
The Key improvements in Opus 4.5 over Opus 4.1 document has a few more interesting details:
- A new zoom tool which you can provide to Opus 4.5 to allow it to request a zoomed-in region of the screen to inspect.

I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in a new alpha release of sqlite-utils that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across 20 commits, 39 files changed, 2,022 additions and 1,173 deletions in a two day period. Here's the Claude Code transcript where I had it help implement one of the more complicated new features.
It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in the milestone for the alpha. I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model.
With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.
I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it revealed a meaningful difference in capabilities between the two.
This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.
Google's Nano Banana Pro image generation model was notable in that its ability to render usable infographics really does represent a task at which previous models had been laughably incapable.
The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis?
And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models.
I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double-down on that advice myself!
I'd love to see AI labs like Anthropic help address this challenge directly. I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle.
"Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.
In the meantime, I'm just gonna have to keep on getting them to draw pelicans riding bicycles. Here's Opus 4.5 (on its default "high" effort level):

It did significantly better on the new more detailed prompt:

Here's that same complex prompt against Gemini 3 Pro and against GPT-5.1-Codex-Max-xhigh.
From the safety section of Anthropic's announcement post:
With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:
On the one hand this looks great: it's a clear improvement over previous models and the competition.
What does the chart actually tell us, though? It tells us that single attempts at prompt injection still work 1/20 times, and if an attacker can try ten different attacks, that success rate goes up to 1/3!
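Those two figures are roughly consistent with each other: assuming attempts were independent (they won't be, exactly), a ~1-in-20 per-attempt success rate compounds over ten tries to around 40%, the same ballpark as the chart's one-in-three number:

```python
# Rough compounding of repeated prompt injection attempts, assuming independence
# (real attempts against a single model won't be perfectly independent).
p_single = 1 / 20  # approximate single-attempt success rate from the chart
p_ten_attempts = 1 - (1 - p_single) ** 10
print(f"{p_ten_attempts:.0%}")  # roughly 40%
```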
I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models.
Tags: prompt-injection, generative-ai, llms, anthropic, claude, evals, llm-pricing, pelican-riding-a-bicycle, llm-release
2025-11-25 02:59:14
I got a report of a bug in sqlite-utils concerning plugin installation - if you installed the package using uv tool install, further attempts to install plugins with sqlite-utils install X would fail, because uv doesn't bundle pip by default. I had the same bug with Datasette a while ago - turns out I forgot to apply the fix to sqlite-utils.
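The failure mode is easier to see with a sketch of the general pattern a plugin-install subcommand like this tends to use - this is an illustrative sketch, not the actual sqlite-utils implementation, and the plugin name is a placeholder:

```python
import importlib.util
import sys
from runpy import run_module


def install_plugin(*packages):
    # Plugins must land in the same environment the tool itself runs in,
    # so run that environment's own pip rather than whatever pip is on PATH.
    if importlib.util.find_spec("pip") is None:
        # Environments created by "uv tool install" don't bundle pip,
        # which is the failure described above.
        raise SystemExit("pip is not available in this environment")
    sys.argv = ["pip", "install"] + list(packages)
    run_module("pip", run_name="__main__")


if __name__ == "__main__":
    install_plugin("example-sqlite-utils-plugin")  # placeholder package name
```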
Since I was pushing a new dot-release I decided to integrate some of the non-breaking changes from the 4.0 alpha I released last night.
I tried to have Claude Code do the backporting for me:
create a new branch called 3.x starting with the 3.38 tag, then consult https://github.com/simonw/sqlite-utils/issues/688 and cherry-pick the commits it lists in the second comment, then review each of the links in the first comment and cherry-pick those as well. After each cherry-pick run the command "just test" to confirm the tests pass and fix them if they don't. Look through the commit history on main since the 3.38 tag to help you with this task.
This worked reasonably well - here's the terminal transcript. It successfully argued me out of two of the larger changes which would have added more complexity than I want in a small dot-release like this.
I still had to do a bunch of manual work to get everything up to scratch, which I carried out in this PR - including adding comments there and then telling Claude Code:
Apply changes from the review on this PR https://github.com/simonw/sqlite-utils/pull/689
Here's the transcript from that.
The release is now out with the following release notes:
- Fixed a bug with sqlite-utils install when the tool had been installed using uv. (#687)
- The --functions argument now optionally accepts a path to a Python file as an alternative to a string full of code, and can be specified multiple times - see Defining custom SQL functions. (#659)
- sqlite-utils now requires Python 3.10 or higher.
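The new file-path form of --functions might look something like this in practice - a hypothetical functions.py, on the assumption that (as with the existing inline string form) any callable defined in the file becomes available as a SQL function:

```python
# functions.py - hypothetical example for the new --functions file support
def reverse_string(s):
    # Usable in SQL as reverse_string(column)
    return s[::-1]
```

That file could then be used with something like sqlite-utils query data.db "select reverse_string(name) from items" --functions functions.py.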
Tags: projects, sqlite, sqlite-utils, annotated-release-notes, uv, coding-agents, claude-code