2025-11-25 02:59:14
I got a report of a bug in sqlite-utils concerning plugin installation - if you installed the package using uv tool install, further attempts to install plugins with sqlite-utils install X would fail, because uv doesn't bundle pip by default. I had the same bug with Datasette a while ago; it turns out I forgot to apply the fix to sqlite-utils.
Since I was pushing a new dot-release I decided to integrate some of the non-breaking changes from the 4.0 alpha I released last night.
I tried to have Claude Code do the backporting for me:
create a new branch called 3.x starting with the 3.38 tag, then consult https://github.com/simonw/sqlite-utils/issues/688 and cherry-pick the commits it lists in the second comment, then review each of the links in the first comment and cherry-pick those as well. After each cherry-pick run the command "just test" to confirm the tests pass and fix them if they don't. Look through the commit history on main since the 3.38 tag to help you with this task.
This worked reasonably well - here's the terminal transcript. It successfully argued me out of two of the larger changes which would have added more complexity than I want in a small dot-release like this.
I still had to do a bunch of manual work to get everything up to scratch, which I carried out in this PR - including adding comments there and then telling Claude Code:
Apply changes from the review on this PR https://github.com/simonw/sqlite-utils/pull/689
Here's the transcript from that.
The release is now out with the following release notes:
- Fixed a bug with sqlite-utils install when the tool had been installed using uv. (#687)
- The --functions argument now optionally accepts a path to a Python file as an alternative to a string full of code, and can be specified multiple times - see Defining custom SQL functions. (#659)
- sqlite-utils now requires Python 3.10 or higher.
Tags: projects, sqlite, sqlite-utils, annotated-release-notes, uv, coding-agents, claude-code
2025-11-24 22:52:34
I released a new alpha version of sqlite-utils last night - the 128th release of that package since I started building it back in 2018.
sqlite-utils is two things in one package: a Python library for conveniently creating and manipulating SQLite databases and a CLI tool for working with them in the terminal. Almost every feature provided by the package is available via both of those surfaces.
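To illustrate that duality, here's a minimal sketch of the same insert performed through the Python library and, in the comment, roughly the same thing via the CLI - the database, table, and row here are just examples:

import sqlite_utils

# Python library: create the database file if needed and insert a row
db = sqlite_utils.Database("chickens.db")
db["chickens"].insert({"id": 1, "name": "Azi"}, pk="id")

# Roughly equivalent CLI invocation, piping JSON into the insert command:
#   echo '[{"id": 1, "name": "Azi"}]' | sqlite-utils insert chickens.db chickens - --pk id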
This is hopefully the last alpha before a 4.0 stable release. I use semantic versioning for this library, so the 4.0 version number indicates that there are backward incompatible changes that may affect code written against the 3.x line.
These changes are mostly very minor: I don't want to break any existing code if I can avoid it. I made it all the way to version 3.38 before I had to ship a major release and I'm sad I couldn't push that even further!
Here are the annotated release notes for 4.0a1.
- Breaking change: The db.table(table_name) method now only works with tables. To access a SQL view use db.view(view_name) instead. (#657)
This change is for type hint enthusiasts. The Python library used to encourage accessing both SQL tables and SQL views through the db["name_of_table_or_view"] syntactic sugar - but tables and views have different interfaces, since there's no way to handle a .insert(row) on a SQLite view. If you want clean type hints for your code you can now use the db.table(table_name) and db.view(view_name) methods instead.
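Here's a brief sketch of how the new split works, using an in-memory database - the table and view names are just examples:

import sqlite_utils

db = sqlite_utils.Database(memory=True)
db["plants"].insert({"id": 1, "name": "Fern"}, pk="id")
db.create_view("plant_names", "select name from plants")

# The db["..."] sugar still works, but for precise type hints:
table = db.table("plants")      # a Table - supports .insert() and friends
view = db.view("plant_names")   # a View - read operations only, no .insert()

print(list(view.rows))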
- The table.insert_all() and table.upsert_all() methods can now accept an iterator of lists or tuples as an alternative to dictionaries. The first item should be a list/tuple of column names. See Inserting data from a list or tuple iterator for details. (#672)
A new feature, not a breaking change. I realized that supporting a stream of lists or tuples as an option for populating large tables would be a neat optimization over always dealing with dictionaries, each of which duplicates the column names.
I had the idea for this one while walking the dog and built the first prototype by prompting Claude Code for web on my phone. Here's the prompt I used and the prototype report it created, which included a benchmark estimating how much of a performance boost could be had for different sizes of tables.
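Based on the description above, a sketch of the new list/tuple form looks something like this (table and column names are examples):

import sqlite_utils

db = sqlite_utils.Database(memory=True)

# First item is the tuple of column names, the rest are rows - no need to
# repeat the keys in a dictionary for every row.
rows = [
    ("id", "name", "weight_kg"),
    (1, "Pelly", 4.2),
    (2, "Grey", 3.8),
]
db["pelicans"].insert_all(rows)
print(list(db["pelicans"].rows))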
- Breaking change: The default floating point column type has been changed from FLOAT to REAL, which is the correct SQLite type for floating point values. This affects auto-detected columns when inserting data. (#645)
I was horrified to discover a while ago that I'd been creating SQLite columns with the type FLOAT when the correct type to use was REAL! This change fixes that. Previously the workaround was to ask for tables to be created in strict mode.
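A quick way to see the difference is to let the library auto-detect a column type from a Python float - the 3.x line generated FLOAT here, 4.0 should generate REAL:

import sqlite_utils

db = sqlite_utils.Database(memory=True)
db["readings"].insert({"id": 1, "temperature": 21.5}, pk="id")

# The schema for the auto-detected temperature column now says REAL,
# where 3.x produced FLOAT.
print(db["readings"].schema)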
- Now uses pyproject.toml in place of setup.py for packaging. (#675)
As part of this I also figured out recipes for using uv as a development environment for the package, which are now baked into the Justfile.
- Tables in the Python API now do a much better job of remembering the primary key and other schema details from when they were first created. (#655)
This one is best explained in the issue.
- Breaking change: The table.convert() and sqlite-utils convert mechanisms no longer skip values that evaluate to False. Previously the --skip-false option was needed; this has been removed. (#542)
Another change which I would have made earlier but, since it introduces a minor behavior change to an existing feature, I reserved it for the 4.0 release.
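A sketch of the new behavior using the Python API - the table and column names are examples:

import sqlite_utils

db = sqlite_utils.Database(memory=True)
db["notes"].insert_all([
    {"id": 1, "body": "hello"},
    {"id": 2, "body": ""},   # falsey value
], pk="id")

# In 4.0 the conversion function is applied to "" and 0 as well;
# previously falsey values were skipped by default.
db["notes"].convert("body", lambda value: value.upper())
print(list(db["notes"].rows))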
- Breaking change: Tables created by this library now wrap table and column names in "double-quotes" in the schema. Previously they would use [square-braces]. (#677)
Back in 2018 when I started this project I was new to working in-depth with SQLite and incorrectly concluded that the correct way to create tables and columns named after reserved words was like this:
create table [my table] (
[id] integer primary key,
[key] text
)
That turned out to be a non-standard SQL syntax which the SQLite documentation describes like this:
A keyword enclosed in square brackets is an identifier. This is not standard SQL. This quoting mechanism is used by MS Access and SQL Server and is included in SQLite for compatibility.
Unfortunately I baked it into the library early on and it's been polluting the world with weirdly escaped table and column names ever since!
I've finally fixed that, with the help of Claude Code which took on the mind-numbing task of updating hundreds of existing tests that asserted against the generated schemas.
The above example table schema now looks like this:
create table "my table" (
"id" integer primary key,
"key" text
)
This may seem like a pretty small change but I expect it to cause a fair amount of downstream pain purely in terms of updating tests that work against tables created by sqlite-utils!
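For example, a downstream test that previously asserted against the [square-braces] schema might now need to look something like this hypothetical pytest case:

import sqlite_utils

def test_schema_uses_double_quotes():
    db = sqlite_utils.Database(memory=True)
    db["my table"].insert({"id": 1, "key": "value"}, pk="id")
    schema = db["my table"].schema
    # 3.x generated: create table [my table] ([id] ..., [key] ...)
    assert '"my table"' in schema
    assert "[my table]" not in schema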
- The --functions CLI argument now accepts a path to a Python file in addition to accepting a string full of Python code. It can also now be specified multiple times. (#659)
I made this change first in LLM and decided to bring it to sqlite-utils for consistency between the two tools.
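A sketch of what that might look like - the file names, function, and query here are hypothetical examples, not taken from the documentation:

# string_functions.py - functions defined here become available as SQL
# functions with the same names when the file is passed to --functions
def reverse_string(s):
    return s[::-1]

# Hypothetical usage, passing the flag more than once:
#   sqlite-utils query data.db "select reverse_string(name) from items" \
#       --functions string_functions.py --functions more_functions.py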
- Breaking change: Type detection is now the default behavior for the insert and upsert CLI commands when importing CSV or TSV data. Previously all columns were treated as TEXT unless the --detect-types flag was passed. Use the new --no-detect-types flag to restore the old behavior. The SQLITE_UTILS_DETECT_TYPES environment variable has been removed. (#679)
One last minor ugliness that I waited for a major version bump to fix.
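To make the change concrete, here's a purely illustrative sketch of what type detection means for CSV string values - it mirrors the behavior described in the release note, not sqlite-utils' internal implementation:

def detect_type(values):
    # Map a column of CSV strings to a SQLite column type
    def kind(value):
        try:
            int(value)
            return "INTEGER"
        except ValueError:
            pass
        try:
            float(value)
            return "REAL"
        except ValueError:
            return "TEXT"
    kinds = {kind(v) for v in values if v != ""}
    if kinds == {"INTEGER"}:
        return "INTEGER"
    if kinds <= {"INTEGER", "REAL"}:
        return "REAL"
    return "TEXT"

print(detect_type(["1", "2", "3"]))   # INTEGER
print(detect_type(["1.5", "2"]))      # REAL
print(detect_type(["1", "n/a"]))      # TEXT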
Tags: projects, sqlite, sqlite-utils, annotated-release-notes, ai-assisted-programming, coding-agents, claude-code
2025-11-24 05:29:09
"Good engineering management" is a fad
Will Larson argues that the technology industry's idea of what makes a good engineering manager changes over time based on industry realities. ZIRP hypergrowth has been exchanged for a more cautious approach today, and expectations of managers have changed to match:
Where things get weird is that in each case a morality tale was subsequently superimposed on top of the transition [...] the industry will want different things from you as it evolves, and it will tell you that each of those shifts is because of some complex moral change, but it’s pretty much always about business realities changing.
I particularly appreciated the section on core engineering management skills that stay constant no matter what:
- Execution: lead team to deliver expected tangible and intangible work. Fundamentally, management is about getting things done, and you’ll neither get an opportunity to begin managing, nor stay long as a manager, if your teams don’t execute. [...]
- Team: shape the team and the environment such that they succeed. This is not working for the team, nor is it working for your leadership, it is finding the balance between the two that works for both. [...]
- Ownership: navigate reality to make consistent progress, even when reality is difficult. Finding a way to get things done, rather than finding a way that it not getting done is someone else’s fault. [...]
- Alignment: build shared understanding across leadership, stakeholders, your team, and the problem space. Finding a realistic plan that meets the moment, without surprising or being surprised by those around you. [...]
Will goes on to list four additional growth skills "whose presence–or absence–determines how far you can go in your career".
Via Hacker News
Tags: software-engineering, will-larson, careers, management, leadership
2025-11-23 08:49:39
Armin Ronacher presents a cornucopia of lessons learned from building agents over the past few months.
There are several agent abstraction libraries available now (my own LLM library is edging into that territory with its tools feature) but Armin has found that the abstractions are not worth adopting yet:
[…] the differences between models are significant enough that you will need to build your own agent abstraction. We have not found any of the solutions from these SDKs that build the right abstraction for an agent. I think this is partly because, despite the basic agent design being just a loop, there are subtle differences based on the tools you provide. These differences affect how easy or hard it is to find the right abstraction (cache control, different requirements for reinforcement, tool prompts, provider-side tools, etc.). Because the right abstraction is not yet clear, using the original SDKs from the dedicated platforms keeps you fully in control. […]
This might change, but right now we would probably not use an abstraction when building an agent, at least until things have settled down a bit. The benefits do not yet outweigh the costs for us.
Armin introduces the new-to-me term reinforcement, where you remind the agent of things as it goes along:
Every time the agent runs a tool you have the opportunity to not just return data that the tool produces, but also to feed more information back into the loop. For instance, you can remind the agent about the overall objective and the status of individual tasks. […] Another use of reinforcement is to inform the system about state changes that happened in the background.
Claude Code’s TODO list is another example of this pattern in action.
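A minimal sketch of that reinforcement pattern, with made-up message shapes rather than any particular SDK's API: after every tool call the loop feeds back the tool result plus a reminder of the overall objective and the current task status.

def run_agent(llm, tools, objective, max_turns=20):
    # llm(messages) is assumed to return a dict with "content" and an
    # optional "tool_call"; tools is a dict of callables by name.
    messages = [{"role": "user", "content": objective}]
    tasks = {"parse input": "todo", "write report": "todo"}

    for _ in range(max_turns):
        response = llm(messages)
        messages.append(response)
        call = response.get("tool_call")
        if call is None:
            return response["content"]   # agent finished

        result = tools[call["name"]](**call["arguments"])
        # Reinforcement: return the tool output *plus* a reminder of the
        # objective and task status, so the loop doesn't drift.
        reminder = f"Objective: {objective}. Task status: {tasks}"
        messages.append({
            "role": "tool",
            "name": call["name"],
            "content": f"{result}\n\n[reminder] {reminder}",
        })
    raise RuntimeError("agent did not finish within max_turns")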
Testing and evals remains the single hardest problem in AI engineering:
We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here.
Armin also has a follow-up post, LLM APIs are a Synchronization Problem, which argues that the shape of current APIs hides too many details from us as developers, and the core challenge here is in synchronizing state between the tokens fed through the GPUs and our client applications - something that may benefit from alternative approaches developed by the local-first movement.
Via Hacker News
Tags: armin-ronacher, definitions, ai, prompt-engineering, generative-ai, llms, evals, ai-agents
2025-11-23 07:59:46
Olmo is the LLM series from Ai2 - the Allen Institute for AI. Unlike most open weight models these are notable for including the full training data, training process and checkpoints along with those releases.
The new Olmo 3 claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability:
At its center is Olmo 3-Think (32B), the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them.
They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models.
Having full access to the training data is really useful. Here's how they describe that:
Olmo 3 is pretrained on Dolma 3, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with olmOCR, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content.
They also highlight that they are training on fewer tokens than their competition:
[...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens.
If you're continuing to hold out hope for a model trained entirely on licensed data this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web.
I tried out the 32B Think model and the 7B Instruct model using LM Studio. The 7B model is a 4.16GB download, the 32B one is 18.14GB.
The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for 14 minutes 43 seconds, outputting 8,437 tokens total most of which was this epic thinking trace.
I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals".
<svg width="200" height="200" viewBox="0 0 100 100">
<!-- Bicycle -->
<circle cx="30" cy="60" r="15" stroke="black" fill="none"/>
<circle cx="70" cy="60" r="15" stroke="black" fill="none"/>
<!-- Bike frame -->
<rect x="35" y="25" width="30" height="10" fill="saddlebrown"/>
<line x1="35" y1="40" x2="30" y2="60" stroke="black" stroke-width="3"/>
<line x1="65" y1="40" x2="70" y2="60" stroke="black" stroke-width="3"/>
<!-- Pelican -->
<ellipse cx="55" cy="65" rx="20" ry="15" fill="white"/>
<polygon points="52 50,57 35,62 50" fill="black"/> <!-- Head/beak -->
<circle cx="55" cy="45" r="2" fill="white"/>
<circle cx="60" cy="45" r="2" fill="white"/>
<polygon points="45 60,50 70,55 60" fill="lightgrey"/> <!-- Left wing -->
<polygon points="65 60,70 70,55 60" fill="lightgrey"/> <!-- Right wing -->
<!-- Feet on pedals -->
<polygon points="25 75,30 85,35 75" fill="black"/>
<polygon points="75 75,70 85,65 75" fill="black"/>
</svg>
Rendered it looks like this:

I tested OLMo 2 32B 4bit back in March and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle:

To be fair 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now using OpenRouter):

I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement:
A core goal of Olmo 3 is not just to open the model flow, but to make it actionable for people who want to understand and improve model behavior. Olmo 3 integrates with OlmoTrace, our tool for tracing model outputs back to training data in real time.
For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why - and adjust data or training decisions accordingly.
You can access OlmoTrace via playground.allenai.org, by first running a prompt and then clicking the "Show OlmoTrace" button below the output.
I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this:

It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs. I frequently see them suggest that I'm the CTO of GitHub and other such inaccuracies.
I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by Ai2's infini-gram) but the documents it found had nothing to do with me at all.
Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to Stanford's Marin and Swiss AI's Apertus, neither of which I'd heard about before.
A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that a small number of samples can poison LLMs of any size - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior based on a short carefully crafted prompt.
This makes fully open training data an even bigger deal.
Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in his detailed post about the release:
In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative).
This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." arXiv preprint arXiv:2506.10947 (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." arXiv preprint arXiv:2507.10532 (2025).)
I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in February 2024) and Olmo 2 (in March 2025) have been significant. I'm hoping that trend continues!
Tags: ai, generative-ai, llms, interpretability, pelican-riding-a-bicycle, llm-reasoning, ai2, ai-ethics, llm-release, nathan-lambert, olmo
2025-11-22 01:27:33
We should all be using dependency cooldowns
William Woodruff gives a name to a sensible strategy for managing dependencies while reducing the chances of a surprise supply chain attack: dependency cooldowns.
Supply chain attacks happen when an attacker compromises a widely used open source package and publishes a new version with an exploit. These are usually spotted very quickly, so an attack often only has a few hours of effective window before the problem is identified and the compromised package is pulled.
You are most at risk if you're automatically applying upgrades the same day they are released.
William says:
I love cooldowns for several reasons:
- They're empirically effective, per above. They won't stop all attackers, but they do stymie the majority of high-visibility, mass-impact supply chain attacks that have become more common.
- They're incredibly easy to implement. Moreover, they're literally free to implement in most cases: most people can use Dependabot's functionality, Renovate's functionality, or the functionality built directly into their package manager
The one counter-argument to this is that sometimes an upgrade fixes a security vulnerability, and in those cases every hour of delay in upgrading is an hour when an attacker could exploit the new issue against your software.
I see that as an argument for carefully monitoring the release notes of your dependencies, and paying special attention to security advisories. I'm a big fan of the GitHub Advisory Database for that kind of information.
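As a rough illustration of the cooldown idea, here's a sketch that checks whether pinned PyPI dependencies are at least a week old, using the public PyPI JSON API - the package pins and the threshold are examples:

from datetime import datetime, timedelta, timezone
import json
import urllib.request

COOLDOWN_DAYS = 7
PINS = {"requests": "2.32.3", "click": "8.1.7"}  # example pins

def release_date(package, version):
    # Upload time of the first file published for this release on PyPI
    url = f"https://pypi.org/pypi/{package}/{version}/json"
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    timestamp = data["urls"][0]["upload_time_iso_8601"]
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))

cutoff = datetime.now(timezone.utc) - timedelta(days=COOLDOWN_DAYS)
for package, version in PINS.items():
    uploaded = release_date(package, version)
    status = "ok" if uploaded < cutoff else "still in cooldown"
    print(f"{package}=={version}: published {uploaded:%Y-%m-%d} ({status})")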
Via Hacker News
Tags: definitions, github, open-source, packaging, supply-chain