2025-11-23 08:49:39
Armin Ronacher presents a cornucopia of lessons learned from building agents over the past few months.
There are several agent abstraction libraries available now (my own LLM library is edging into that territory with its tools feature) but Armin has found that the abstractions are not worth adopting yet:
[…] the differences between models are significant enough that you will need to build your own agent abstraction. We have not found any of the solutions from these SDKs that build the right abstraction for an agent. I think this is partly because, despite the basic agent design being just a loop, there are subtle differences based on the tools you provide. These differences affect how easy or hard it is to find the right abstraction (cache control, different requirements for reinforcement, tool prompts, provider-side tools, etc.). Because the right abstraction is not yet clear, using the original SDKs from the dedicated platforms keeps you fully in control. […]
This might change, but right now we would probably not use an abstraction when building an agent, at least until things have settled down a bit. The benefits do not yet outweigh the costs for us.
Armin introduces the new-to-me term reinforcement, where you remind the agent of things as it goes along:
Every time the agent runs a tool you have the opportunity to not just return data that the tool produces, but also to feed more information back into the loop. For instance, you can remind the agent about the overall objective and the status of individual tasks. […] Another use of reinforcement is to inform the system about state changes that happened in the background.
Claude Code’s TODO list is another example of this pattern in action.
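To make the pattern concrete, here's a rough sketch (my own illustration, not Armin's code) of what reinforcement could look like in a hand-rolled agent loop - the tool result gets wrapped with a reminder of the overall objective and the current task status before it's appended back into the conversation:

```python
# A minimal sketch of "reinforcement" in a hand-rolled agent loop (not Armin's
# code): every tool result is wrapped with a reminder of the objective and the
# current task status before being fed back to the model.

def run_tool_with_reinforcement(tool, args, objective, tasks):
    result = tool(**args)
    status = "\n".join(
        f"- [{'x' if t['done'] else ' '}] {t['name']}" for t in tasks
    )
    return (
        f"Tool output:\n{result}\n\n"
        f"Reminder - overall objective: {objective}\n"
        f"Task status:\n{status}"
    )

# In the agent loop, this combined string is what gets appended to the
# conversation as the tool's response, instead of the raw tool output.
```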
Testing and evals remains the single hardest problem in AI engineering:
We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here.
Armin also has a follow-up post, LLM APIs are a Synchronization Problem, which argues that the shape of current APIs hides too many details from us as developers, and the core challenge here is in synchronizing state between the tokens fed through the GPUs and our client applications - something that may benefit from alternative approaches developed by the local-first movement.
Via Hacker News
Tags: armin-ronacher, definitions, ai, prompt-engineering, generative-ai, llms, evals, ai-agents
2025-11-23 07:59:46
Olmo is the LLM series from Ai2 - the Allen Institute for AI. Unlike most open weight models, these are notable for including the full training data, training process and checkpoints along with each release.
The new Olmo 3 claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability:
At its center is Olmo 3-Think (32B), the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them.
They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models.
Having full access to the training data is really useful. Here's how they describe that:
Olmo 3 is pretrained on Dolma 3, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with olmOCR, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content.
They also highlight that they are training on fewer tokens than their competition:
[...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens.
If you're continuing to hold out hope for a model trained entirely on licensed data this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web.
I tried out the 32B Think model and the 7B Instruct model using LM Studio. The 7B model is a 4.16GB download, the 32B one is 18.14GB.
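If you want to script a local model like this rather than use the chat UI, LM Studio exposes an OpenAI-compatible server (on localhost:1234 by default). Here's a rough sketch using the openai Python package - treat the model name as a placeholder, since it should be whatever identifier LM Studio shows for your downloaded copy:

```python
# Rough sketch: calling a model running in LM Studio's local server via its
# OpenAI-compatible API. The model name is a placeholder - use whatever
# identifier LM Studio shows for your downloaded copy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="olmo-3-7b-instruct",  # placeholder model identifier
    messages=[
        {"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}
    ],
)
print(response.choices[0].message.content)
```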
The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for 14 minutes and 43 seconds, outputting 8,437 tokens in total, most of which went into this epic thinking trace.
I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals".
<svg width="200" height="200" viewBox="0 0 100 100">
<!-- Bicycle -->
<circle cx="30" cy="60" r="15" stroke="black" fill="none"/>
<circle cx="70" cy="60" r="15" stroke="black" fill="none"/>
<!-- Bike frame -->
<rect x="35" y="25" width="30" height="10" fill="saddlebrown"/>
<line x1="35" y1="40" x2="30" y2="60" stroke="black" stroke-width="3"/>
<line x1="65" y1="40" x2="70" y2="60" stroke="black" stroke-width="3"/>
<!-- Pelican -->
<ellipse cx="55" cy="65" rx="20" ry="15" fill="white"/>
<polygon points="52 50,57 35,62 50" fill="black"/> <!-- Head/beak -->
<circle cx="55" cy="45" r="2" fill="white"/>
<circle cx="60" cy="45" r="2" fill="white"/>
<polygon points="45 60,50 70,55 60" fill="lightgrey"/> <!-- Left wing -->
<polygon points="65 60,70 70,55 60" fill="lightgrey"/> <!-- Right wing -->
<!-- Feet on pedals -->
<polygon points="25 75,30 85,35 75" fill="black"/>
<polygon points="75 75,70 85,65 75" fill="black"/>
</svg>

Rendered it looks like this:

I tested OLMo 2 32B 4bit back in March and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle:

To be fair, 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now using OpenRouter):

I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement:
A core goal of Olmo 3 is not just to open the model flow, but to make it actionable for people who want to understand and improve model behavior. Olmo 3 integrates with OlmoTrace, our tool for tracing model outputs back to training data in real time.
For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why---and adjust data or training decisions accordingly.
You can access OlmoTrace via playground.allenai.org, by first running a prompt and then clicking the "Show OlmoTrace" button below the output.
I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this:

It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs. I frequently see them suggest that I'm the CTO of GitHub, among other such inaccuracies.
I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by Ai2's infini-gram) but the documents it found had nothing to do with me at all.
Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to Stanford's Marin and Swiss AI's Apertus, neither of which I'd heard about before.
A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that a small number of samples can poison LLMs of any size - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior based on a short carefully crafted prompt.
This makes fully open training data an even bigger deal.
Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in his detailed post about the release:
In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative).
This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." arXiv preprint arXiv:2506.10947 (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." arXiv preprint arXiv:2507.10532 (2025).)
I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in February 2024) and Olmo 2 (in March 2025) have been significant. I'm hoping that trend continues!
Tags: ai, generative-ai, llms, interpretability, pelican-riding-a-bicycle, llm-reasoning, ai2, ai-ethics, llm-release, nathan-lambert, olmo
2025-11-22 01:27:33
We should all be using dependency cooldowns
William Woodruff gives a name to a sensible strategy for managing dependencies while reducing the chances of a surprise supply chain attack: dependency cooldowns.

Supply chain attacks happen when an attacker compromises a widely used open source package and publishes a new version with an exploit. These are usually spotted very quickly, so an attack often only has a few hours of effective window before the problem is identified and the compromised package is pulled.
You are most at risk if you're automatically applying upgrades the same day they are released.
William says:
I love cooldowns for several reasons:
- They're empirically effective, per above. They won't stop all attackers, but they do stymie the majority of high-visibility, mass-impact supply chain attacks that have become more common.
- They're incredibly easy to implement. Moreover, they're literally free to implement in most cases: most people can use Dependabot's functionality, Renovate's functionality, or the functionality built directly into their package manager.
The one counter-argument to this is that sometimes an upgrade fixes a security vulnerability, and in those cases every hour of delay in upgrading is an hour during which an attacker could exploit the known issue against your software.
I see that as an argument for carefully monitoring the release notes of your dependencies, and paying special attention to security advisories. I'm a big fan of the GitHub Advisory Database for that kind of information.
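To show how simple the underlying check is, here's a rough sketch of my own (not from William's post) that uses PyPI's JSON API to ask whether a given release has been public for longer than a cooldown window before you allow the upgrade:

```python
# Rough sketch of a cooldown check against PyPI's JSON API (my illustration,
# not from William's post): only allow an upgrade if the release has been
# public for at least `cooldown_days`.
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def release_is_old_enough(package, version, cooldown_days=7):
    url = f"https://pypi.org/pypi/{package}/{version}/json"
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # Earliest upload time across the files published for this release
    uploaded = min(
        datetime.fromisoformat(f["upload_time_iso_8601"].replace("Z", "+00:00"))
        for f in data["urls"]
    )
    return datetime.now(timezone.utc) - uploaded > timedelta(days=cooldown_days)

print(release_is_old_enough("requests", "2.32.3", cooldown_days=7))
```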
Via Hacker News
Tags: definitions, github, open-source, packaging, supply-chain
2025-11-21 00:32:25
Hot on the heels of Tuesday's Gemini 3 Pro release, today it's Nano Banana Pro, also known as Gemini 3 Pro Image. I've had a few days of preview access and this is an astonishingly capable image generation model.
As is often the case, the most useful low-level details can be found in the API documentation:
Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks.
- High-resolution output: Built-in generation capabilities for 1K, 2K, and 4K visuals.
- Advanced text rendering: Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets.
- Grounding with Google Search: The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events).
- Thinking mode: The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output.
- Up to 14 reference images: You can now mix up to 14 reference images to produce the final image.
[...] These 14 images can include the following:
- Up to 6 images of objects with high-fidelity to include in the final image
- Up to 5 images of humans to maintain character consistency
There is also a short (6 page) model card PDF which lists the following as "new capabilities" compared to the previous Nano Banana: Multi character editing, Chart editing, Text editing, Factuality - Edu, Multi-input 1-3, Infographics, Doodle editing, Visual design.
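For reference, calling it from Python should look roughly like the other Gemini image models via the google-genai SDK. Here's an untested sketch - note that the model ID string is my guess at what the preview is called, so check Google's documentation for the current identifier:

```python
# Untested sketch of calling Nano Banana Pro via the google-genai Python SDK.
# The model ID below is an assumption - check the docs for the real identifier.
from google import genai
from google.genai import types

client = genai.Client()  # expects GEMINI_API_KEY in the environment

response = client.models.generate_content(
    model="gemini-3-pro-image-preview",  # assumed model ID
    contents="Create an image of a three-dimensional pancake in the shape of a skull",
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

for part in response.candidates[0].content.parts:
    if part.inline_data:  # image bytes come back as inline data
        with open("pancake.png", "wb") as f:
            f.write(part.inline_data.data)
```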
Max Woolf published the definitive guide to prompting Nano Banana just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K.
Here's what I got for his first test prompt, using Google's AI Studio:
Create an image of a three-dimensional pancake in the shape of a skull, garnished on top with blueberries and maple syrup.

The result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog so here's a Google Drive link for the original.
Then I ran his follow-up prompt:
Make ALL of the following edits to the image:
- Put a strawberry in the left eye socket.
- Put a blackberry in the right eye socket.
- Put a mint garnish on top of the pancake.
- Change the plate to a plate-shaped chocolate-chip cookie.
- Add happy people to the background.

I'll note that it did put the plate-sized cookie on a regular plate. Here's the 24.9MB PNG.
The new model isn't cheap. Here's the API pricing: it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each but that's now been fixed.
Unlike most of Google's other models it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there.
So this thing is great at following instructions. How about rendering text?
I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). Here's a share link - my prompt was:
Infographic explaining how the Datasette open source project works
This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search"). Here's what it created based on that 9 word prompt:
![Described by Gemini 3 Pro: A technical architecture diagram titled "How Datasette Works: From Raw Data to Explorable API," illustrating a workflow from left to right. 1. "RAW DATA SOURCES" include "CSV", "JSON", "Excel (XLSX)", and "Log Files". 2. These flow into "DATA PREPARATION & CONVERSION" using tools "csvs-to-sqlite" and "sqlite-utils" to create a "SQLite DATABASE". 3. This feeds into the central "DATASETTE APPLICATION CORE," a stack comprising "Data Ingestion (Read-Only)", "Query Engine (SQL)", "API Layer (JSON)", and "Web UI Rendering". 4. A "PLUGIN ECOSYSTEM" connects below the core, listing "Vega Charts", "Cluster Maps", "Full-Text Search", and "Custom Renderers". 5. To the right, the core branches into "OUTPUT & INTERFACES": an "Interactive Web Interface (Explore, Filter, Facet)", a "JSON API (Programmatic Access)" showing a snippet {"data": [...]}, and a "SQL EDITOR Custom SQL Queries" showing SELECT * FROM.... 6. The API output connects to "PUBLISHING & DEPLOYMENT" via a terminal command datasette publish cloudrun my.db leading to deployment targets "Heroku", "Google Cloud Run", "Fly.io", and "Vercel".](https://static.simonwillison.net/static/2025/nano-banana-datasette.jpg)
That's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close-enough for an infographic.
"Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me.
Here's the thinking trace for that image generation process:
Initiating Datasette Infographic
I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram.
Developing the Datasette workflow
I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output.
Visualizing Datasette Workflow
I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts.
Defining Core Datasette Stages
I've sketched out the initial infographic structure. I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step.
Constructing Datasette Infographic
I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation.
Analyzing Datasette's Architecture
I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow.
Mapping the Datasette Flow
I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow.
This note from the announcement post caught my eye:
We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible SynthID digital watermark.
Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon.
Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that to the Gemini app and asked "Was this image created with AI?":

It replied:
Yes, it appears that all or part of this image was created with Google AI. SynthID detected a watermark in 25-50% of the image.
Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro.
Tags: google, ai, datasette, generative-ai, llms, gemini, text-to-image, llm-release, nano-banana
2025-11-20 09:01:44
Previously, when malware developers wanted to go and monetize their exploits, they would do exactly one thing: encrypt every file on a person's computer and request a ransom to decrypt the files. In the future I think this will change.
LLMs allow attackers to instead process every file on the victim's computer, and tailor a blackmail letter specifically towards that person. One person may be having an affair on their spouse. Another may have lied on their resume. A third may have cheated on an exam at school. It is unlikely that any one person has done any of these specific things, but it is very likely that there exists something that is blackmailable for every person. Malware + LLMs, given access to a person's computer, can find that and monetize it.
— Nicholas Carlini, Are large language models worth it? Misuse: malware at scale
Tags: ai-ethics, generative-ai, nicholas-carlini, ai, llms
2025-11-20 07:15:10
Building more with GPT-5.1-Codex-Max
Hot on the heels of yesterday's Gemini 3 Pro release comes a new model from OpenAI called GPT-5.1-Codex-Max.

(Remember when GPT-5 was meant to bring in a new era of less confusing model names? That didn't last!)
It's currently only available through their Codex CLI coding agent, where it's the new default model:
Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces. Unlike GPT‑5.1, which is a general-purpose model, we recommend using GPT‑5.1-Codex-Max and the Codex family of models only for agentic coding tasks in Codex or Codex-like environments.
It's not available via the API yet but should be shortly.
The timing of this release is interesting given that Gemini 3 Pro appears to have aced almost all of the benchmarks just yesterday. It's reminiscent of the period in 2024 when OpenAI consistently made big announcements that happened to coincide with Gemini releases.
OpenAI's self-reported SWE-Bench Verified score is particularly notable: 76.5% for thinking level "high" and 77.9% for the new "xhigh". That was the one benchmark where Gemini 3 Pro was out-performed by Claude Sonnet 4.5 - Gemini 3 Pro got 76.2% and Sonnet 4.5 got 77.2%. OpenAI now have the highest scoring model there by a full 0.7 of a percentage point!
They also report a score of 58.1% on Terminal Bench 2.0, beating Gemini 3 Pro's 54.2% (and Sonnet 4.5's 42.8%.)
The most intriguing part of this announcement concerns the model's approach to long context problems:
GPT‑5.1-Codex-Max is built for long-running, detailed work. It’s our first model natively trained to operate across multiple context windows through a process called compaction, coherently working over millions of tokens in a single task. [...]
Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.
There's a lot of confusion on Hacker News about what this actually means. Claude Code already does a version of compaction, automatically summarizing previous turns when the context runs out. Does this just mean that Codex-Max is better at that process?
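My mental model of compaction - whatever Codex-Max does natively - is something like the rough sketch below: when the transcript nears the context limit, older turns get summarized into a single compact message and the loop carries on with a fresh window. This is my guess at the general shape, not OpenAI's implementation.

```python
# Rough sketch of "compaction" in an agent loop (my guess at the general shape,
# not OpenAI's implementation): when the transcript nears the context limit,
# summarize the older turns into one compact message and keep going.

def compact_if_needed(messages, count_tokens, summarize, limit=200_000, keep_recent=10):
    if sum(count_tokens(m) for m in messages) < limit * 0.9:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(older)  # e.g. an LLM call that condenses the older turns
    return [
        {"role": "system", "content": f"Summary of earlier progress:\n{summary}"}
    ] + recent

# The agent loop calls compact_if_needed() before each model invocation, so a
# long-running task repeatedly gets a fresh window seeded with the summary.
```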
I had it draw me a couple of pelicans by typing "Generate an SVG of a pelican riding a bicycle" directly into the Codex CLI tool. Here's thinking level medium:

And here's thinking level "xhigh":

I also tried xhigh on my longer pelican test prompt, which came out like this:

Also today: GPT-5.1 Pro is rolling out to all Pro users. According to the ChatGPT release notes:
GPT-5.1 Pro is rolling out today for all ChatGPT Pro users and is available in the model picker. GPT-5 Pro will remain available as a legacy model for 90 days before being retired.
That's a pretty fast deprecation cycle for the GPT-5 Pro model that was released just three months ago.
Via Hacker News
Tags: ai, openai, generative-ai, llms, evals, pelican-riding-a-bicycle, llm-release, gpt-5, codex-cli