RSS preview of Simon Willison

Rss preview of Blog of Simon Willison

Quoting François Chollet

2025-10-30 10:37:18

To really understand a concept, you have to "invent" it yourself in some capacity. Understanding doesn't come from passive content consumption. It is always self-built. It is an active, high-agency, self-directed process of creating and debugging your own mental models.

— François Chollet

Tags: francois-chollet, teaching

Introducing SWE-1.5: Our Fast Agent Model

2025-10-30 07:59:20

Introducing SWE-1.5: Our Fast Agent Model

Here's the second fast coding model released by a coding agent IDE in the same day - the first was Composer-1 by Cursor. This time it's Windsurf releasing SWE-1.5:

Today we’re releasing SWE-1.5, the latest in our family of models optimized for software engineering. It is a frontier-size model with hundreds of billions of parameters that achieves near-SOTA coding performance. It also sets a new standard for speed: we partnered with Cerebras to serve it at up to 950 tok/s – 6x faster than Haiku 4.5 and 13x faster than Sonnet 4.5.

Like Composer-1 it's only available via their editor, no separate API yet. Also like Composer-1 they don't appear willing to share details of the "leading open-source base model" they based their new model on.

I asked it to generate an SVG of a pelican riding a bicycle and got this:

This one felt really fast. Partnering with Cerebras for inference is a very smart move.

They share a lot of details about their training process in the post:

SWE-1.5 is trained on our state-of-the-art cluster of thousands of GB200 NVL72 chips. We believe SWE-1.5 may be the first public production model trained on the new GB200 generation. [...]

Our RL rollouts require high-fidelity environments with code execution and even web browsing. To achieve this, we leveraged our VM hypervisor otterlink that allows us to scale Devin to tens of thousands of concurrent machines (learn more about blockdiff). This enabled us to smoothly support very high concurrency and ensure the training environment is aligned with our Devin production environments.

That's another similarity to Cursor's Composer-1! Cursor talked about how they ran "hundreds of thousands of concurrent sandboxed coding environments in the cloud" in their description of their RL training as well.

This is a notable trend: if you want to build a really great agentic coding tool there's clearly a lot to be said for using reinforcement learning to fine-tune a model against your own custom set of tools using large numbers of sandboxed simulated coding environments as part of that process.

Via @cognition

Tags: ai, generative-ai, llms, ai-assisted-programming, pelican-riding-a-bicycle, llm-release, coding-agents

MiniMax M2 & Agent: Ingenious in Simplicity

2025-10-30 06:49:47

MiniMax M2 & Agent: Ingenious in Simplicity

MiniMax M2 was released on Monday 27th October by MiniMax, a Chinese AI lab founded in December 2021.

It's a very promising model. Their self-reported benchmark scores show it as comparable to Claude Sonnet 4, and Artificial Analysis are ranking it as the best currently available open weight model according to their intelligence score:

MiniMax’s M2 achieves a new all-time-high Intelligence Index score for an open weights model and offers impressive efficiency with only 10B active parameters (200B total). [...]

The model’s strengths include tool use and instruction following (as shown by Tau2 Bench and IFBench). As such, while M2 likely excels at agentic use cases it may underperform other open weights leaders such as DeepSeek V3.2 and Qwen3 235B at some generalist tasks. This is in line with a number of recent open weights model releases from Chinese AI labs which focus on agentic capabilities, likely pointing to a heavy post-training emphasis on RL.

The size is particularly significant: the model weights are 230GB on Hugging Face, significantly smaller than other high performing open weight models. That's small enough to run on a 256GB Mac Studio, and the MLX community have that working already.

MiniMax offer their own API, and recommend using their Anthropic-compatible endpoint and the official Anthropic SDKs to access it. MiniMax Head of Engineering Skyler Miao provided some background on that:

M2 is a agentic thinking model, it do interleaved thinking like sonnet 4.5, which means every response will contain its thought content. Its very important for M2 to keep the chain of thought. So we must make sure the history thought passed back to the model. Anthropic API support it for sure, as sonnet needs it as well. OpenAI only support it in their new Response API, no support for in ChatCompletion.

MiniMax are offering the new model via their API for free until November 7th, after which the cost will be $0.30/million input tokens and $1.20/million output tokens - similar in price to Gemini 2.5 Flash and GPT-5 Mini, see price comparison here on my llm-prices.com site.

I released a new plugin for LLM called llm-minimax providing support for M2 via the MiniMax API:

llm install llm-minimax
llm keys set minimax
# Paste key here
llm -m m2 -o max_tokens 10000 "Generate an SVG of a pelican riding a bicycle"

Here's the result:

51 input, 4,017 output. At $0.30/m input and $1.20/m output that pelican would cost 0.4836 cents - less than half a cent.

This is the first plugin I've written for an Anthropic-API-compatible model. I released llm-anthropic 0.21 first adding the ability to customize the base_url parameter when using that model class. This meant the new plugin was less than 30 lines of Python.

Tags: ai, generative-ai, local-llms, llms, llm, llm-pricing, pelican-riding-a-bicycle, llm-release, ai-in-china

Composer: Building a fast frontier model with RL

2025-10-30 04:45:53

Composer: Building a fast frontier model with RL

Cursor released Cursor 2.0 today, with a refreshed UI focused on agentic coding (and running agents in parallel) and a new model that's unique to Cursor called Composer 1.

As far as I can tell there's no way to call the model directly via an API, so I fired up "Ask" mode in Cursor's chat side panel and asked it to "Generate an SVG of a pelican riding a bicycle":

Here's the result:

The notable thing about Composer-1 is that it is designed to be fast. The pelican certainly came back quickly, and in their announcement they describe it as being "4x faster than similarly intelligent models".

It's interesting to see Cursor investing resources in training their own code-specific model - similar to GPT-5-Codex or Qwen3-Coder. From their post:

Composer is a mixture-of-experts (MoE) language model supporting long-context generation and understanding. It is specialized for software engineering through reinforcement learning (RL) in a diverse range of development environments. [...]

Efficient training of large MoE models requires significant investment into building infrastructure and systems research. We built custom training infrastructure leveraging PyTorch and Ray to power asynchronous reinforcement learning at scale. We natively train our models at low precision by combining our MXFP8 MoE kernels with expert parallelism and hybrid sharded data parallelism, allowing us to scale training to thousands of NVIDIA GPUs with minimal communication cost. [...]

During RL, we want our model to be able to call any tool in the Cursor Agent harness. These tools allow editing code, using semantic search, grepping strings, and running terminal commands. At our scale, teaching the model to effectively call these tools requires running hundreds of thousands of concurrent sandboxed coding environments in the cloud.

One detail that's notably absent from their description: did they train the model from scratch, or did they start with an existing open-weights model such as something from Qwen or GLM?

Cursor researcher Sasha Rush has been answering questions on Hacker News, but has so far been evasive in answering questions about the base model. When directly asked "is Composer a fine tune of an existing open source base model?" they replied:

Our primary focus is on RL post-training. We think that is the best way to get the model to be a strong interactive agent.

Sasha did confirm that rumors of an earlier Cursor preview model, Cheetah, being based on a model by xAI's Grok were "Straight up untrue."

Via Hacker News

Tags: ai, generative-ai, llms, ai-assisted-programming, pelican-riding-a-bicycle, llm-release, coding-agents, cursor, parallel-agents

Hacking the WiFi-enabled color screen GitHub Universe conference badge

2025-10-29 01:17:44

I'm at GitHub Universe this week (thanks to a free ticket from Microsoft). Yesterday I picked up my conference badge... which incorporates a ~~full Raspberry Pi~~ Raspberry Pi Pico microcontroller with a battery, color screen, WiFi and bluetooth.

GitHub Universe has a tradition of hackable conference badges - the badge last year had an eInk display. This year's is a huge upgrade though - a color screen and WiFI connection makes this thing a genuinely useful little computer!

The only thing it's missing is a keyboard - the device instead provides five buttons total - Up, Down, A, B, C. It might be possible to get a bluetooth keyboard to work though I'll believe that when I see it - there's not a lot of space on this device for a keyboard driver.

Everything is written using MicroPython, and the device is designed to be hackable: connect it to a laptop with a USB-C cable and you can start modifying the code directly on the device.

Getting setup with the badge

Out of the box the badge will play an opening animation (implemented as a sequence of PNG image frames) and then show a home screen with six app icons.

The default apps are mostly neat Octocat-themed demos: a flappy-bird clone, a tamagotchi-style pet, a drawing app that works like an etch-a-sketch, an IR scavenger hunt for the conference venue itself (this thing has an IR sensor too!), and a gallery app showing some images.

The sixth app is a badge app. This will show your GitHub profile image and some basic stats, but will only work if you dig out a USB-C cable and make some edits to the files on the badge directly.

I did this on a Mac. I plugged a USB-C cable into the badge which caused MacOS to treat it as an attached drive volume. In that drive are several files including secrets.py. Open that up, confirm the WiFi details are correct and add your GitHub username. The file should look like this:

WIFI_SSID = "..."
WIFI_PASSWORD = "..."
GITHUB_USERNAME = "simonw"

The badge comes with the SSID and password for the GitHub Universe WiFi network pre-populated.

That's it! Unmount the disk, hit the reboot button on the back of the badge and when it comes back up again the badge app should look something like this:

Building your own apps

Here's the official documentation for building software for the badge.

When I got mine yesterday the official repo had not yet been updated, so I had to figure this out myself.

I copied all of the code across to my laptop, added it to a Git repo and then fired up Claude Code and told it:

Investigate this code and add a detailed README

Here's the result, which was really useful for getting a start on understanding how it all worked.

Each of the six default apps lives in a apps/ folder, for example apps/sketch/ for the sketching app.

There's also a menu app which powers the home screen. That lives in apps/menu/. You can edit code in here to add new apps that you create to that screen.

I told Claude:

Add a new app to it available from the menu which shows network status and other useful debug info about the machine it is running on

This was a bit of a long-shot, but it totally worked!

The first version had an error:

I OCRd that photo (with the Apple Photos app) and pasted the message into Claude Code and it fixed the problem.

This almost worked... but the addition of a seventh icon to the 2x3 grid meant that you could select the icon but it didn't scroll into view. I had Claude fix that for me too.

Here's the code for apps/debug/__init__.py, and the full Claude Code transcript created using my terminal-to-HTML app described here.

Here are the four screens of the debug app:

An icon editor

The icons used on the app are 24x24 pixels. I decided it would be neat to have a web app that helps build those icons, including the ability to start by creating an icon from an emoji.

I bulit this one using Claude Artifacts. Here's the result, now available at tools.simonwillison.net/icon-editor:

And a REPL

I noticed that last year's badge configuration app (which I can't find in github.com/badger/badger.github.io any more, I think they reset the history on that repo?) worked by talking to MicroPython over the Web Serial API from Chrome. Here's my archived copy of that code.

Wouldn't it be useful to have a REPL in a web UI that you could use to interact with the badge directly over USB?

I pointed Claude Code at a copy of that repo and told it:

Based on this build a new HTML with inline JavaScript page that uses WebUSB to simply test that the connection to the badge works and then list files on that device using the same mechanism

It took a bit of poking (here's the transcript) but the result is now live at tools.simonwillison.net/badge-repl. It only works in Chrome - you'll need to plug the badge in with a USB-C cable and then click "Connect to Badge".

Get hacking

If you're a GitHub Universe attendee I hope this is useful. The official badger.github.io site has plenty more details to help you get started.

There isn't yet a way to get hold of this hardware outside of GitHub Universe - I know they had some supply chain challenges just getting enough badges for the conference attendees!

It's a very neat device, built for GitHub by Pimoroni in Sheffield, UK. A version of this should become generally available in the future under the name "Pimoroni Tufty 2350".

Update: Setup with iPhone only

If you don't have a laptop with you it's still possible to start hacking on the device using just a USB-C cable.

Plug the badge into the phone, hit the reset button on the back twice to switch it into disk mode and open the iPhone Files app - the badge should appear as a mounted disk called BADGER.

I used Textastic to edit that secrets.py and configure a new badge, then hit reset again to restart it.

Tags: github, hardware-hacking, microsoft, ai, generative-ai, raspberry-pi, llms, claude-code, disclosures

Quoting Aaron Boodman

2025-10-28 10:08:57

Claude doesn't make me much faster on the work that I am an expert on. Maybe 15-20% depending on the day.

It's the work that I don't know how to do and would have to research. Or the grunge work I don't even want to do. On this it is hard to even put a number on. Many of the projects I do with Claude day to day I just wouldn't have done at all pre-Claude.

Infinity% improvement in productivity on those.

— Aaron Boodman

Tags: ai-assisted-programming, claude, generative-ai, ai, llms, aaron-boodman

Simon WillisonModify