Simon Willison

Creator of Datasette and Lanyrd, co-creator of the Django Web Framework.

GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52

2026-03-18 03:39:17

OpenAI today: Introducing GPT‑5.4 mini and nano. These models join GPT-5.4 which was released two weeks ago.

OpenAI's self-reported benchmarks show the new 5.4-nano outperforming their previous GPT-5 mini model when run at maximum reasoning effort. The new mini is also 2x faster than the previous mini.

Here's how the pricing looks (prices are per million tokens). Notably, gpt-5.4-nano is even cheaper than Google's Gemini 3.1 Flash-Lite:

Model                    Input    Cached input    Output
gpt-5.4                  $2.50    $0.25           $15.00
gpt-5.4-mini             $0.75    $0.075          $4.50
gpt-5.4-nano             $0.20    $0.02           $1.25

Other models for comparison:

Claude Opus 4.6          $5.00    -               $25.00
Claude Sonnet 4.6        $3.00    -               $15.00
Gemini 3.1 Pro           $2.00    -               $12.00
Claude Haiku 4.5         $1.00    -               $5.00
Gemini 3.1 Flash-Lite    $0.25    -               $1.50

I used GPT-5.4 nano to generate a description of this photo I took at the John M. Mossman Lock Collection:


llm -m gpt-5.4-nano -a IMG_2324.jpeg 'describe image'

Here's the output:

The image shows the interior of a museum gallery with a long display wall. White-painted brick walls are covered with many framed portraits arranged in neat rows. Below the portraits, there are multiple glass display cases with dark wooden frames and glass tops/fronts, containing various old historical objects and equipment. The room has a polished wooden floor, hanging ceiling light fixtures/cords, and a few visible pipes near the top of the wall. In the foreground, glass cases run along the length of the room, reflecting items from other sections of the gallery.

That took 2,751 input tokens and 112 output tokens, at a cost of 0.069 cents (less than a tenth of a cent). That means describing every single photo in my 76,000 photo collection would cost around $52.44.
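That arithmetic checks out. Here's a quick sanity check in Python using the gpt-5.4-nano prices from the table above (the small difference from $52.44 is because the post multiplies the rounded 0.069¢ per-photo figure):

```python
# Cost check for the photo description, using gpt-5.4-nano pricing
# ($0.20 per million input tokens, $1.25 per million output tokens).
INPUT_PRICE_PER_M = 0.20
OUTPUT_PRICE_PER_M = 1.25

input_tokens = 2_751
output_tokens = 112

cost = (input_tokens * INPUT_PRICE_PER_M
        + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

print(f"Per photo: {cost * 100:.3f} cents")   # Per photo: 0.069 cents
print(f"76,000 photos: ${cost * 76_000:.2f}")  # 76,000 photos: $52.46
```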

I released llm 0.29 with support for the new models.

Then I had OpenAI Codex loop through all five reasoning effort levels and all three models and produce this combined SVG grid of pelicans riding bicycles (generation transcripts here). I like the gpt-5.4 xhigh one best - it has a good bicycle (with nice spokes) and the pelican has a fish in its beak!

Described by Claude Opus 4.6: A 5x3 comparison grid of AI-generated cartoon illustrations of a pelican riding a bicycle. Columns are labeled "gpt-5.4-nano", "gpt-5.4-mini", and "gpt-5.4" across the top, and rows are labeled "none", "low", "medium", "high", and "xhigh" down the left side, representing quality/detail settings. In the "none" row, gpt-5.4-nano shows a chaotic white bird with misplaced arrows and tangled wheels on grass, gpt-5.4-mini shows a duck-like brown bird awkwardly straddling a motorcycle-like bike, and gpt-5.4 shows a stiff gray-and-white pelican sitting atop a blue tandem bicycle with extra legs. In the "low" row, nano shows a chubby round white bird pedaling with small feet on grass, mini shows a cleaner white bird riding a blue bicycle with motion lines, and gpt-5.4 shows a pelican with a blue cap riding confidently but with slightly awkward proportions. In the "medium" row, nano regresses to a strange bird standing over bowling balls on ice, mini shows two plump white birds merged onto one yellow-wheeled bicycle, and gpt-5.4 shows a more recognizable gray-and-white pelican on a red bicycle but with tangled extra legs. In the "high" row, nano shows multiple small pelicans crowded around a broken green bicycle on grass with a sun overhead, mini shows a tandem bicycle with two white pelicans and clear blue sky, and gpt-5.4 shows two pelicans stacked on a red tandem bike with the most realistic proportions yet. In the "xhigh" row, nano shows the most detailed scene with a pelican on a detailed bicycle with grass and a large sun but still somewhat jumbled anatomy, mini produces the cleanest single pelican on a yellow-accented bicycle with a light blue sky, and gpt-5.4 shows a well-rendered gray pelican on a teal bicycle with the best overall coherence. 
Generally, quality improves moving right across models and down through quality tiers, though "medium" is inconsistently worse than "low" for some models, and all images maintain a lighthearted cartoon style with pastel skies and simple backgrounds.

Tags: ai, openai, generative-ai, llms, llm, vision-llms, llm-pricing, pelican-riding-a-bicycle, llm-release

Quoting Tim Schilling

2026-03-18 00:13:37

If you do not understand the ticket, if you do not understand the solution, or if you do not understand the feedback on your PR, then your use of LLM is hurting Django as a whole. [...]

For a reviewer, it’s demoralizing to communicate with a facade of a human.

This is because contributing to open source, especially Django, is a communal endeavor. Removing your humanity from that experience makes that endeavor more difficult. If you use an LLM to contribute to Django, it needs to be as a complementary tool, not as your vehicle.

Tim Schilling, Give Django your time and money, not your tokens

Tags: ai-ethics, open-source, generative-ai, ai, django, llms

Subagents

2026-03-17 20:32:28

Agentic Engineering Patterns

LLMs are restricted by their context limit - how many tokens they can fit in their working memory at any given time. These limits have not increased much over the past two years even as the LLMs themselves have seen dramatic improvements in their abilities - they generally top out at around 1,000,000 tokens, and benchmarks frequently report better quality results below 200,000.

Carefully managing the context such that it fits within those limits is critical to getting great results out of a model.

Subagents provide a simple but effective way to handle larger tasks without burning through too much of the coding agent’s valuable top-level context.

When a coding agent uses a subagent it effectively dispatches a fresh copy of itself to achieve a specified goal, with a new context window that starts with a fresh prompt.

Claude Code’s Explore subagent

Claude Code uses subagents extensively as part of its standard way of working. Let's use that as an illustrative example.

Any time you start a new task against an existing repo Claude Code first needs to explore that repo to figure out its general shape and find relevant information needed to achieve that task.

It does this by constructing a prompt and dispatching a subagent to perform that exploration and return a description of what it finds.

I kicked off a Claude Code session with a screenshot of a page from my guide and this prompt:

Make the chapter diffs also show which characters have changed in this diff view with a darker color of red or green for the individually changed segments of text within the line

Here's the transcript of that session.

Claude Code started by running an “Explore” subagent with the following starter prompt:

Find the code that implements the diff view for "chapters" in this Django blog. I need to find:

  1. Templates that render diffs (look for diff-related HTML/CSS with red/green backgrounds)
  2. Python code that generates diffs (look for difflib usage or similar)
  3. Any JavaScript related to diff rendering
  4. CSS styles for the diff view (red/green line backgrounds)

Search thoroughly - check templates/, static/, blog/ directories. Look for keywords like "diff", "chapter", "revision", "history", "compare".

Subagents work like any other tool call: the parent agent dispatches them just as it would any other tool and waits for the response. It's interesting to see models prompt themselves in this way - they generally have good taste in prompting strategies.
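Here's a minimal sketch of that pattern in Python. Everything is invented for illustration - the function names are not from any real agent, and a canned string stands in for the real model call:

```python
# Sketch of the subagent-as-tool-call pattern: the parent agent keeps its own
# context small by dispatching a fresh conversation and keeping only the summary.

def call_model(messages):
    # Stand-in for a real LLM API call; returns a canned summary here.
    return "Summary: diff rendering lives in guides/views.py and its templates."

def run_subagent(task_prompt):
    # The subagent starts from a brand-new context containing only its task.
    subagent_context = [{"role": "user", "content": task_prompt}]
    return call_model(subagent_context)

# Parent agent: instead of exploring the repo itself (and filling its context
# with file contents), it records one tool call and one short result.
parent_context = [{"role": "user", "content": "Improve the chapter diff view"}]
summary = run_subagent("Find the code that implements the diff view.")
parent_context.append({"role": "tool", "content": summary})

# The parent holds 2 short messages, not the whole exploration transcript.
print(len(parent_context))
```

The key property is visible in the last line: however many files the subagent read, the parent's context only grows by one tool call and one summary.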

The Explore subagent churned away for a few seconds and returned a response that started like this:

Perfect! Now let me create a comprehensive summary with all the findings:

Summary: Chapter Diff View Implementation

I found the complete implementation of the diff view for chapters in this Django blog. Here are the key components:

1. Python Code - Diff Generation (/home/user/simonwillisonblog/guides/views.py, lines 139-202)

The chapter_changes() view function uses Python's built-in difflib.unified_diff() to generate diffs: […]

The full subagent response included all of the details the parent agent needed in order to start editing the code to address my original request.
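As a toy example (unrelated to the actual code in that view), here's what Python's standard library difflib.unified_diff produces for a one-line change:

```python
import difflib

old = ["The quick brown fox", "jumps over the lazy dog"]
new = ["The quick red fox", "jumps over the lazy dog"]

# lineterm="" because our inputs have no trailing newlines
for line in difflib.unified_diff(old, new, fromfile="before", tofile="after", lineterm=""):
    print(line)

# --- before
# +++ after
# @@ -1,2 +1,2 @@
# -The quick brown fox
# +The quick red fox
#  jumps over the lazy dog
```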

Parallel subagents

This Explore subagent is the simplest example of how subagents can work, with the parent agent pausing while the subagent runs. The principal advantage of this kind of subagent is that it can work with a fresh context in a way that avoids spending tokens from the parent's available limit.

Subagents can also provide a significant performance boost by having the parent agent run multiple subagents at the same time, potentially also using faster and cheaper models such as Claude Haiku to accelerate those tasks.

Coding agents that support subagents can use them based on your instructions. Try prompts like this:

Use subagents to find and update all of the templates that are affected by this change.
For tasks that involve editing several files - and where those files are not dependent on each other - this can offer a significant speed boost.
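Here's a sketch of how that parallel dispatch might look, using Python's concurrent.futures with a stubbed-out subagent - everything here is illustrative, not any coding agent's real API:

```python
# Sketch of parallel subagent dispatch: each independent task runs in its own
# worker, and the parent collects just the short results, in order.
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task):
    # Stand-in for dispatching a real subagent with a fresh context.
    return f"Updated {task}"

tasks = [
    "templates/base.html",
    "templates/list.html",
    "templates/detail.html",
]

with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map preserves input order, so results line up with tasks
    results = list(pool.map(run_subagent, tasks))

print(results)
```

This mirrors the constraint in the prose above: the files must not depend on each other, because each worker edits its file without seeing the others' changes.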

Specialist subagents

Some coding agents allow subagents to run with further customizations, often in the form of a custom system prompt or custom tools or both, which allow those subagents to take on a different role.

These roles can cover a variety of useful specialties:

  • A code reviewer agent can review code and identify bugs, feature gaps or weaknesses in the design.
  • A test runner agent can run the tests. This is particularly worthwhile if your test suite is large and verbose, as the subagent can hide the full test output from the main coding agent and report back with just the details of any failures.
  • A debugger agent can specialize in debugging problems, spending its token allowance reasoning through the codebase and running snippets of code to help isolate steps to reproduce and determine the root cause of a bug.

While it can be tempting to go overboard breaking up tasks across dozens of different specialist subagents, it's important to remember that the main value of subagents is in preserving that valuable root context and managing token-heavy operations. Your root coding agent is perfectly capable of debugging or reviewing its own output provided it has the tokens to spare.

Official documentation

Several popular coding agents support subagents, each with its own documentation on how to use them:

Tags: parallel-agents, coding-agents, generative-ai, agentic-engineering, ai, llms

Introducing Mistral Small 4

2026-03-17 07:41:17

Introducing Mistral Small 4

Big new release from Mistral today (despite the name) - a new Apache 2 licensed 119B parameter (Mixture-of-Experts, 6B active) model which they describe like this:

Mistral Small 4 is the first Mistral model to unify the capabilities of our flagship models, Magistral for reasoning, Pixtral for multimodal, and Devstral for agentic coding, into a single, versatile model.

It supports reasoning_effort="none" or reasoning_effort="high", with the latter providing "equivalent verbosity to previous Magistral models".

The new model is 242GB on Hugging Face.

I tried it out via the Mistral API using llm-mistral:

llm install llm-mistral
llm mistral refresh
llm -m mistral/mistral-small-2603 "Generate an SVG of a pelican riding a bicycle"

The bicycle is upside down and mangled and the pelican is a series of grey curves with a triangular beak.

I couldn't find a way to set the reasoning effort in their API documentation, so hopefully that's a feature which will land soon.

Also from Mistral today, and fitting their -stral naming convention, is Leanstral, an open weight model specifically tuned to output Lean 4, the formally verifiable programming language. I haven't explored Lean at all so I have no way to credibly evaluate this, but it's interesting to see them target one specific language in this way.

Tags: ai, generative-ai, llms, llm, mistral, pelican-riding-a-bicycle, llm-reasoning, llm-release

Use subagents and custom agents in Codex

2026-03-17 07:03:56

Use subagents and custom agents in Codex

Subagents were announced in general availability today for OpenAI Codex, after several weeks of preview behind a feature flag.

They're very similar to the Claude Code implementation, with default subagents for "explorer", "worker" and "default". It's unclear to me what the difference between "worker" and "default" is but based on their CSV example I think "worker" is intended for running large numbers of small tasks in parallel.

Codex also lets you define custom agents as TOML files in ~/.codex/agents/. These can have custom instructions and be assigned to use specific models - including gpt-5.3-codex-spark if you want some raw speed. They can then be referenced by name, as demonstrated by this example prompt from the documentation:

Investigate why the settings modal fails to save. Have browser_debugger reproduce it, code_mapper trace the responsible code path, and ui_fixer implement the smallest fix once the failure mode is clear.
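Based on that description, a custom agent definition might look something like this. This is purely illustrative: the directory and the code_mapper name come from the post, but the field names are my guesses, not the documented Codex schema:

```toml
# ~/.codex/agents/code_mapper.toml - illustrative sketch; field names are guesses
name = "code_mapper"             # hypothetical field
model = "gpt-5.3-codex-spark"    # model name mentioned in the post
instructions = """
Trace the code path responsible for a reported bug and report back
with file paths and line numbers only.
"""
```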

The subagents pattern is widely supported in coding agents now. Here's documentation across a number of different platforms:

Update: I added a chapter on Subagents to my Agentic Engineering Patterns guide.

Via @OpenAIDevs

Tags: ai, openai, generative-ai, llms, coding-agents, codex-cli, parallel-agents, agentic-engineering

Quoting A member of Anthropic’s alignment-science team

2026-03-17 05:38:55

The point of the blackmail exercise was to have something to describe to policymakers—results that are visceral enough to land with people, and make misalignment risk actually salient in practice for people who had never thought about it before.

A member of Anthropic’s alignment-science team, as told to Gideon Lewis-Kraus

Tags: ai-ethics, anthropic, claude, generative-ai, ai, llms