2026-03-31 07:58:49
Release: datasette-files 0.1a3
I'm working on integrating datasette-files into other plugins, such as datasette-extract. This necessitated a new release of the base plugin.
owners_can_editandowners_can_deleteconfiguration options, plus thefiles-editandfiles-deleteactions are now scoped to a newFileResourcewhich is a child ofFileSourceResource. #18- The file picker UI is now available as a
<datasette-file-picker>Web Component. Thanks, Alex Garcia. #19- New
from datasette_files import get_filePython API for other plugins that need to access file data. #20
Tags: datasette
2026-03-31 05:31:02
Note that the main issues that people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction. Sometimes there are even pure inference bugs. From typing the task in the client to the actual result, there is a long chain of components that atm are not only fragile - are also developed by different parties. So it's difficult to consolidate the entire stack and you have to keep in mind that what you are currently observing is with very high probability still broken in some subtle way along that chain.
— Georgi Gerganov, explaining why it's hard to find local models that work well with coding agents
Tags: coding-agents, generative-ai, ai, local-llms, llms, georgi-gerganov
2026-03-31 03:48:43
Release: datasette-llm 0.1a3
Adds the ability to configure which LLMs are available for which purpose, which means you can restrict the list of models that can be used with a specific plugin. #3
2026-03-30 22:28:34
Trip Venturella released Mr. Chatterbox, a language model trained entirely on out-of-copyright text from the British Library. Here's how he describes it in the model card:
Mr. Chatterbox is a language model trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available by the British Library. The model has absolutely no training inputs from after 1899 — the vocabulary and ideas are formed exclusively from nineteenth-century literature.
Mr. Chatterbox's training corpus was 28,035 books, with an estimated 2.93 billion input tokens after filtering. The model has roughly 340 million paramaters, roughly the same size as GPT-2-Medium. The difference is, of course, that unlike GPT-2, Mr. Chatterbox is trained entirely on historical data.
Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I've been dreaming of a model like this for a couple of years now. What would a model trained on out-of-copyright text be like to chat with?
Thanks to Trip we can now find out for ourselves!
The model itself is tiny, at least by Large Language Model standards - just 2.05GB on disk. You can try it out using Trip's HuggingFace Spaces demo:

Honestly, it's pretty terrible. Talking with it feels more like chatting with a Markov chain than an LLM - the responses may have a delightfully Victorian flavor to them but it's hard to get a response that usefully answers a question.
The 2022 Chinchilla paper suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b - so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner.
But what a fun project!
I decided to see if I could run the model on my own machine using my LLM framework.
I got Claude Code to do most of the work - here's the transcript.
Trip trained the model using Andrej Karpathy's nanochat, so I cloned that project, pulled the model weights and told Claude to build a Python script to run the model. Once we had that working (which ended up needing some extra details from the Space demo source code) I had Claude read the LLM plugin tutorial and build the rest of the plugin.
llm-mrchatterbox is the result. Install the plugin like this:
llm install llm-mrchatterbox
The first time you run a prompt it will fetch the 2.05GB model file from Hugging Face. Try that like this:
llm -m mrchatterbox "Good day, sir"
Or start an ongoing chat session like this:
llm chat -m mrchatterbox
If you don't have LLM installed you can still get a chat session started from scratch using uvx like this:
uvx --with llm-mrchatterbox llm chat -m mrchatterbox
When you are finished with the model you can delete the cached file using:
llm mrchatterbox delete-model
This is the first time I've had Claude Code build a full LLM model plugin from scratch and it worked really well. I expect I'll be using this method again in the future.
I continue to hope we can get a useful model from entirely public domain data. The fact that Trip was able to get this far using nanochat and 2.93 billion training tokens is a promising start.
Update 31st March 2026: I had missed this when I first published this piece but Trip has his own detailed writeup of the project which goes into much more detail about how he trained the model. Here's how the books were filtered for pre-training:
First, I downloaded the British Library dataset split of all 19th-century books. I filtered those down to books contemporaneous with the reign of Queen Victoria—which, unfortunately, cut out the novels of Jane Austen—and further filtered those down to a set of books with a optical character recognition (OCR) confidence of .65 or above, as listed in the metadata. This left me with 28,035 books, or roughly 2.93 billion tokes for pretraining data.
Getting it to behave like a conversational model was a lot harder. Trip started by trying to train on plays by Oscar Wilde and George Bernard Shaw, but found they didn't provide enough pairs. Then he tried extracting dialogue pairs from the books themselves with poor results. The approach that worked was to have Claude Haiku and GPT-4o-mini generate synthetic conversation pairs for the supervised fine tuning, which solved the problem but sadly I think dilutes the "no training inputs from after 1899" claim from the original model card.
Tags: ai, andrej-karpathy, generative-ai, local-llms, llms, ai-assisted-programming, hugging-face, llm, training-data, uv, ai-ethics, claude-code
2026-03-30 10:20:46
Release: llm-mrchatterbox 0.1
See Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer.
Tags: llm
2026-03-30 04:08:45
Exciting new browser library from Cheng Lou, previously a React core developer and the original creator of the react-motion animation library.
Pretext solves the problem of calculating the height of a paragraph of line-wrapped text without touching the DOM. The usual way of doing this is to render the text and measure its dimensions, but this is extremely expensive. Pretext uses an array of clever tricks to make this much, much faster, which enables all sorts of new text rendering effects in browser applications.
Here's one demo that shows the kind of things this makes possible:
The key to how this works is the way it separates calculations into a call to a prepare() function followed by multiple calls to layout().
The prepare() function splits the input text into segments (effectively words, but it can take things like soft hyphens and non-latin character sequences and emoji into account as well) and measures those using an off-screen canvas, then caches the results. This is comparatively expensive but only runs once.
The layout() function can then emulate the word-wrapping logic in browsers to figure out how many wrapped lines the text will occupy at a specified width and measure the overall height.
I had Claude build me this interactive artifact to help me visually understand what's going on, based on a simplified version of Pretext itself.
The way this is tested is particularly impressive. The earlier tests rendered a full copy of the Great Gatsby in multiple browsers to confirm that the estimated measurements were correct against a large volume of text. This was later joined by the corpora/ folder using the same technique against lengthy public domain documents in Thai, Chinese, Korean, Japanese, Arabic, and more.
Cheng Lou says:
The engine’s tiny (few kbs), aware of browser quirks, supports all the languages you’ll need, including Korean mixed with RTL Arabic and platform-specific emojis
This was achieved through showing Claude Code and Codex the browsers ground truth, and have them measure & iterate against those at every significant container width, running over weeks
Via @_chenglou
Tags: browsers, css, javascript, testing, react, typescript