2026-04-17 20:46:47
Every morning, your inbox separates spam from real email. News apps sort articles into sports, tech, and politics. Customer support systems route tickets to the right team. Behind all of these is text classification: teaching a machine to read a document and assign it a category.
The building blocks are simpler than you might expect. You need a way to convert text into numbers (TF-IDF), a classifier that works well with sparse, high-dimensional data (Naive Bayes), and a few lines of code to tie them together. No deep learning, no GPUs, no embeddings.
By the end of this post, you'll classify news articles into 20 categories with 77% accuracy using just 10 lines of Python, then push that to 84% with hyperparameter tuning. You'll understand exactly how TF-IDF works and why the "naive" independence assumption in Naive Bayes is a feature, not a bug.
Here's the complete classifier. We use scikit-learn's 20 Newsgroups dataset, which contains around 18,000 posts across 20 topics, from computer graphics to religion to space exploration:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# Load training and test data
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)
# Build the pipeline: raw text → word counts → TF-IDF → Naive Bayes
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
# Train and evaluate
text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(twenty_test.data)
print(f'Accuracy: {accuracy_score(twenty_test.target, predicted):.1%}')
# Accuracy: 77.4%
With 10 lines of modelling code, we classify documents into one of 20 categories at 77.4% accuracy on unseen data. Random guessing would give 5%.
Let's test it on fresh sentences the model has never seen:
docs_new = [
    'OpenGL shading techniques for real-time rendering',
    'The Detroit Tigers signed a new pitcher today',
    'NASA launched the James Webb telescope last year',
    'Is there evidence for the existence of God?',
]
predicted_new = text_clf.predict(docs_new)
for doc, category in zip(docs_new, predicted_new):
    print(f'{twenty_train.target_names[category]:>28s} ← {doc}')
comp.graphics ← OpenGL shading techniques for real-time rendering
rec.sport.baseball ← The Detroit Tigers signed a new pitcher today
sci.space ← NASA launched the James Webb telescope last year
soc.religion.christian ← Is there evidence for the existence of God?
The model correctly identifies the topic of each sentence. It works by finding which words are most characteristic of each category.
The confusion matrix reveals where the classifier struggles. Related categories like comp.sys.ibm.pc.hardware and comp.sys.mac.hardware (both about computer hardware) are frequently confused, as are talk.religion.misc and soc.religion.christian. These make intuitive sense: documents about Mac hardware and PC hardware use very similar vocabulary.
Three components work in sequence: CountVectorizer turns text into word counts, TfidfTransformer re-weights those counts to highlight distinctive words, and MultinomialNB learns which words signal which categories.
A machine learning model can't read English. It needs numbers. The simplest conversion is the bag of words: count how many times each word appears in a document, ignoring order entirely.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'The cat sat on the mat',
    'The dog sat on the log',
    'The cat chased the dog',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
# ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(X.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 2],
# [0, 0, 1, 1, 0, 1, 1, 2],
# [1, 1, 1, 0, 0, 0, 0, 2]]
Each row is a document. Each column is a word from the vocabulary. The value is the word count. Notice that "the" always gets a count of 2, regardless of the document. It's everywhere, so it carries no information about which document you're looking at.
On the 20 Newsgroups training set, CountVectorizer discovers around 130,000 unique tokens. Each document becomes a vector of 130,000 dimensions, mostly zeros (since any single post uses only a tiny fraction of the full vocabulary).
Not all words are equally informative. Words like "the", "is", and "a" appear in every document. What we want are words that are common within a specific category but rare overall. This is exactly what TF-IDF (Term Frequency, Inverse Document Frequency) captures.
The weight for word $t$ in document $d$ is:

$$\text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

Where:
- $\text{tf}(t, d)$ is the raw count of $t$ in $d$
- $\text{idf}(t) = \log\!\frac{1+N}{1+n_t}+1$, where $N$ is the total number of documents and $n_t$ is the number of documents containing word $t$
A word that appears in every document gets a low IDF, shrinking its weight. A word that appears in only a few documents gets a high IDF, amplifying its signal.
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X)
print(np.round(X_tfidf.toarray(), 2))
After TF-IDF weighting, the document vectors highlight what's distinctive about each text rather than what's common across all of them.
Naive Bayes applies Bayes' theorem to classify documents. Given a document with words $w_1, w_2, \ldots, w_n$, it computes:

$$P(c \mid w_1, \ldots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

The "naive" part is the assumption that words are conditionally independent given the class. This is obviously wrong: the word "neural" is far more likely to appear near "network" than near "baseball". But the simplification works remarkably well in practice because classification only needs the ranking of class probabilities, not their exact values. If $P(\text{sci.space} \mid \text{doc})$ is the highest, the prediction is correct even if the probability value itself is off.

The MultinomialNB variant uses word counts (or TF-IDF weights) as features and models $P(w_i \mid \text{class})$ as a multinomial distribution. The parameters are estimated via maximum likelihood: the probability of word $w_i$ in class $c$ is simply the fraction of times $w_i$ appears in training documents of class $c$, with Laplace smoothing to handle words never seen in training.
Scikit-learn's Pipeline chains these three transformations so you can treat the entire workflow as a single estimator:
text_clf = Pipeline([
    ('vect', CountVectorizer()),    # raw text → word counts
    ('tfidf', TfidfTransformer()),  # word counts → TF-IDF weights
    ('clf', MultinomialNB()),       # TF-IDF vectors → class predictions
])
When you call text_clf.fit(X, y), it runs CountVectorizer.fit_transform(), feeds the output to TfidfTransformer.fit_transform(), then passes the result to MultinomialNB.fit(). At prediction time, the same chain runs in sequence. This also means you can do grid search over any parameter in the pipeline using the double-underscore naming convention (vect__ngram_range, clf__alpha).
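To see that the pipeline really is just chaining, here is a small sketch (toy documents, not the newsgroups data) comparing Pipeline against running the three steps by hand:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ['the cat sat', 'the dog barked', 'cat and dog']
y = [0, 1, 1]

# Manual chaining — what Pipeline.fit does under the hood
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
counts = vect.fit_transform(docs)
weights = tfidf.fit_transform(counts)
clf.fit(weights, y)
manual_pred = clf.predict(tfidf.transform(vect.transform(docs)))

# The same three steps as a single estimator
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())]).fit(docs, y)

print((pipe.predict(docs) == manual_pred).all())  # True
```

Both paths are deterministic and fit on the same data, so the predictions match exactly.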
Naive Bayes at 77.4% is a strong starting point, but we can improve it in three ways: removing noise (stop words), capturing phrases (bigrams), and tuning the smoothing parameter.
Stop words are common words ("the", "is", "at") that carry little discriminative value. Removing them reduces noise and bumps accuracy from 77.4% to 81.7%:
text_clf_stop = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf_stop.fit(twenty_train.data, twenty_train.target)
print(f'NB + stop words: {accuracy_score(twenty_test.target, text_clf_stop.predict(twenty_test.data)):.1%}')
# NB + stop words: 81.7%
A 4-point gain for one parameter change.
Grid search systematically explores combinations of pipeline parameters. The naming convention (vect__, tfidf__, clf__) lets you reach into any pipeline step:
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs unigrams+bigrams
    'tfidf__use_idf': (True, False),        # use IDF weighting or not
    'clf__alpha': (1e-2, 1e-3),             # smoothing strength
}
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)
gs_clf.fit(twenty_train.data, twenty_train.target)
print(f'Best CV score: {gs_clf.best_score_:.1%}')
print(f'Best params: {gs_clf.best_params_}')
print(f'Test accuracy: {accuracy_score(twenty_test.target, gs_clf.predict(twenty_test.data)):.1%}')
# Best CV score: 91.6%
# Best params: {'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
# Test accuracy: 83.6%
The best configuration uses bigrams (ngram_range=(1,2)), IDF weighting, and weak smoothing (alpha=0.001). Bigrams capture phrases like "White House" or "hard drive" that individual words miss. The 5-fold CV score (91.6%) is higher than the test accuracy (83.6%) because cross-validation evaluates on data drawn from the same distribution as training, while the test set may contain authors, topics, or writing styles not seen during training.
If you've read our hyperparameter optimisation post, you'll recognise grid search as the brute-force baseline. With only 8 combinations to evaluate here, it's fast enough.
Swapping Naive Bayes for a linear SVM (support vector machine) gives a larger improvement than any amount of NB tuning:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                              alpha=1e-3, max_iter=100,
                              random_state=42)),
])
text_clf_svm.fit(twenty_train.data, twenty_train.target)
print(f'SVM accuracy: {accuracy_score(twenty_test.target, text_clf_svm.predict(twenty_test.data)):.1%}')
# SVM accuracy: 82.4%
That's 82.4% out of the box, without any tuning. Grid search for SVM yields 83.5%, virtually identical to the tuned Naive Bayes.
The story is clear: the biggest gains come from better feature representation (bigrams, stop word removal, IDF weighting) rather than the choice of classifier. With good features, even the "naive" model performs competitively.
What words does the classifier rely on? Raw class-conditional probabilities are dominated by common words like "the" and "of". To find truly discriminative features, we compare each word's log-probability within a class against its average across all classes:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=5)
X_tfidf = tfidf_vect.fit_transform(twenty_train.data)
clf_disc = MultinomialNB().fit(X_tfidf, twenty_train.target)
feature_names = np.array(tfidf_vect.get_feature_names_out())
log_probs = clf_disc.feature_log_prob_
mean_log_prob = np.mean(log_probs, axis=0)
discriminativeness = log_probs - mean_log_prob
for i, category in enumerate(twenty_train.target_names):
    top_indices = discriminativeness[i].argsort()[-5:][::-1]
    print(f'{category}: {", ".join(feature_names[top_indices])}')
The model learns sensible patterns. sci.space relies on words like "space", "orbit", and "nasa". rec.sport.baseball relies on "baseball", "team", and "pitching". talk.politics.mideast picks up "israel", "armenian", and "turkish". These are the words that carry the strongest evidence for each category, well beyond their background frequency.
Stemming maps words to their root form ("running" to "run", "computers" to "comput"). This merges related word forms into a single feature, reducing vocabulary size:
import nltk
from nltk.stem.snowball import SnowballStemmer
nltk.download('punkt', quiet=True)
stemmer = SnowballStemmer('english', ignore_stopwords=True)
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

text_clf_stemmed = Pipeline([
    ('vect', StemmedCountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(fit_prior=False)),
])
text_clf_stemmed.fit(twenty_train.data, twenty_train.target)
print(f'NB + stemming + stop words: '
f'{accuracy_score(twenty_test.target, text_clf_stemmed.predict(twenty_test.data)):.1%}')
Stemming often gives a small additional boost. The code above uses the Snowball stemmer, a refined version of Porter's classic 1980 algorithm that handles irregular forms more gracefully.
This approach has clear limitations: the bag-of-words representation ignores word order entirely, and categories that share most of their vocabulary remain hard to separate.
For many practical applications, TF-IDF with Naive Bayes remains hard to beat when you factor in the ratio of performance to complexity. It trains in seconds, requires no GPU, and produces interpretable results.
The foundational paper for Naive Bayes text classification is McCallum, A. & Nigam, K. (1998) "A Comparison of Event Models for Naive Bayes Text Classification", presented at the AAAI Workshop on Learning for Text Categorization.
They compared two Naive Bayes variants for text:
- A multi-variate Bernoulli model, which records only whether each word occurs (binary features) — BernoulliNB in scikit-learn.
- A multinomial model, which uses word counts — the MultinomialNB our pipeline uses.

Their verdict: "We find that the multinomial model is almost uniformly superior, especially for large vocabulary sizes."
The multinomial model works better because it uses word frequency information. A document mentioning "baseball" 15 times is stronger evidence for rec.sport.baseball than one mentioning it once. The Bernoulli model discards this frequency signal entirely.
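A toy sketch makes this concrete. With hypothetical two-word counts (columns for "baseball" and "pixel", not the real newsgroups vocabulary), the multinomial model grows more confident as the count rises, while BernoulliNB binarises the counts and treats one mention the same as ten:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Toy counts: columns = ("baseball", "pixel"); class 0 = baseball posts, class 1 = graphics posts
X = np.array([[15, 0],
              [8, 1],
              [0, 12],
              [1, 9]])
y = np.array([0, 0, 1, 1])

mnb = MultinomialNB().fit(X, y)
bnb = BernoulliNB().fit(X, y)  # binarises counts: only presence/absence survives

# Two test documents: "baseball" mentioned 10 times vs. once
test_docs = np.array([[10, 0], [1, 0]])
proba_m = mnb.predict_proba(test_docs)
proba_b = bnb.predict_proba(test_docs)

print(proba_m[:, 0])  # multinomial: confidence in class 0 grows with the count
print(proba_b[:, 0])  # Bernoulli: identical for both documents — frequency is discarded
```

Both test documents binarise to the same vector, so BernoulliNB cannot tell them apart.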
Formally, the predicted class for a document $d$ is:

$$\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i} n_i(d) \log P(w_i \mid c) \right]$$

Where:
- $P(c)$ is the class prior (fraction of training documents in class $c$)
- $n_i(d)$ is the count of word $w_i$ in document $d$

$P(w_i \mid c)$ is estimated with Laplace smoothing:

$$P(w_i \mid c) = \frac{\text{count}(w_i, c) + \alpha}{\sum_j \text{count}(w_j, c) + \alpha\,|V|}$$

where $|V|$ is the vocabulary size. The smoothing parameter $\alpha$ prevents zero probabilities for words that never appeared in a particular class during training. Our grid search found $\alpha = 0.001$ optimal, meaning the model trusts the training data more and smooths less aggressively than the default $\alpha = 1.0$.
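You can check this estimate directly against scikit-learn on a toy count matrix (hypothetical counts, not the newsgroups data):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny corpus: 3 documents, vocabulary of 3 words, 2 classes
X = np.array([[3, 0, 1],
              [2, 1, 0],   # class 0
              [0, 4, 1]])  # class 1
y = np.array([0, 0, 1])

clf = MultinomialNB(alpha=1.0).fit(X, y)

# Reproduce P(w_0 | c=0) by hand:
# count(w_0, c=0) = 3 + 2 = 5, total class-0 count = 7, |V| = 3
manual = (5 + 1.0) / (7 + 1.0 * 3)               # = 0.6
learned = np.exp(clf.feature_log_prob_[0, 0])    # what the model stored
print(manual, learned)
```

The two values agree: feature_log_prob_ is exactly the log of the smoothed count ratio.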
TF-IDF was formalised by Salton, G. & Buckley, C. (1988) "Term-weighting approaches in automatic text retrieval", Information Processing & Management. The core idea predates this work: Sparck Jones proposed inverse document frequency in 1972.
Scikit-learn's variant uses:

$$\text{idf}(t) = \log\frac{1+N}{1+n_t} + 1$$
The "+1" terms prevent division by zero and ensure no word gets zero weight. After computing TF-IDF, each document vector is L2-normalised to unit length.
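Both details are easy to verify on a small count matrix (toy data, chosen for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# 4 documents, 2 words: word 0 appears in every document, word 1 in only one
counts = np.array([[2, 0],
                   [1, 0],
                   [3, 1],
                   [1, 0]])
tfidf = TfidfTransformer()          # smooth_idf=True, norm='l2' by default
weighted = tfidf.fit_transform(counts)

# idf(t) = log((1 + N) / (1 + n_t)) + 1, with N = 4
print(tfidf.idf_[0])  # log(5/5) + 1 = 1.0   (ubiquitous word: no boost)
print(tfidf.idf_[1])  # log(5/2) + 1 ≈ 1.92  (rare word: amplified)

# Every document vector has unit L2 norm after normalisation
print(np.linalg.norm(weighted.toarray(), axis=1))
```

The ubiquitous word gets the minimum possible weight of 1.0; the rare word's weight is nearly doubled.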
Text classification has a long lineage.
Today, transformer-based models (BERT, GPT) dominate text classification benchmarks. But TF-IDF with Naive Bayes remains the standard baseline for its speed, interpretability, and surprising competitiveness.
The interactive notebook includes exercises:
- Restrict the dataset to four well-separated categories (comp.graphics, rec.sport.baseball, sci.space, talk.politics.mideast). How much does accuracy improve with fewer, more distinct categories?
- Add min_df=5 and max_df=0.5 to CountVectorizer to trim rare and ubiquitous words. How does this affect accuracy and vocabulary size?
- Replace MultinomialNB with BernoulliNB. Does the McCallum & Nigam finding hold on this dataset?
- Try TfidfVectorizer with sublinear_tf=True and character n-grams (analyzer='char_wb', ngram_range=(3,5)). Character n-grams capture morphological patterns that word-level features miss.

The "naive" refers to the conditional independence assumption: the model assumes that each word in a document is independent of every other word, given the class. This is clearly wrong (e.g. "neural" and "network" tend to co-occur), but it works surprisingly well in practice because classification only requires getting the ranking of class probabilities right, not the exact values. Independence errors tend to cancel out across thousands of features.
Raw word counts treat all words equally, so common words like "the" and "is" dominate the representation despite carrying no discriminative information. TF-IDF re-weights each word by how rare it is across the entire corpus. Words that appear in many documents get downweighted, while words distinctive to a few documents get amplified. This makes the representation much more informative for classification.
Naive Bayes with TF-IDF is an excellent choice when you need fast training (seconds, not hours), interpretability (you can inspect which words drive predictions), or when labelled data is limited. It also requires no GPU. For tasks where word order matters (sentiment analysis, entailment) or where you need state-of-the-art accuracy on competitive benchmarks, transformer models will outperform it significantly.
Alpha controls Laplace smoothing, which prevents zero probabilities for words that never appeared in a particular class during training. With alpha = 1.0 (the default), the model adds a pseudocount of 1 to every word-class combination. Smaller values like 0.001 trust the training data more and smooth less aggressively. The optimal value depends on your dataset and can be found through cross-validation.
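A quick sketch of the effect (hypothetical counts): with a word that never occurs in class 0, alpha alone decides how much probability mass it receives:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Word 1 never appears in class 0 — only smoothing gives it probability mass
X = np.array([[5, 0],
              [4, 0],   # class 0
              [0, 6]])  # class 1
y = np.array([0, 0, 1])

for alpha in (1.0, 0.001):
    clf = MultinomialNB(alpha=alpha).fit(X, y)
    p_unseen = np.exp(clf.feature_log_prob_[0, 1])  # P(word 1 | class 0)
    print(f'alpha={alpha}: P(unseen word | class 0) = {p_unseen:.6f}')
```

With alpha=1.0 the unseen word gets (0+1)/(9+2) ≈ 0.091; with alpha=0.001 it shrinks to roughly 0.0001, so a single occurrence of an unexpected word weighs much more heavily against a class.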
The bag-of-words representation captures which words appear in a document but not the subtle semantic differences between closely related topics. Categories like PC hardware and Mac hardware share a large portion of their vocabulary (words like "drive", "memory", "board", "system"). The model can only distinguish them by the few words unique to each category, which may not always be present in a given document.
Yes. TF-IDF is language-agnostic at its core since it operates on tokens, not linguistic structures. However, you may need to adjust tokenisation for languages without clear word boundaries (e.g. Chinese or Japanese) and consider language-specific stop word lists. Stemming and lemmatisation tools are also language-dependent, so you would need appropriate resources for your target language.
2026-04-17 20:46:46
"Toto, I've a feeling we're not in Kansas anymore." — Dorothy Gale, The Wizard of Oz (1939)
Kansas is a traditional terminal. It is grey, functional, unchanged since the 1970s in ways that matter. Commands vanish upward in a wall of text. You scroll to find output from thirty seconds ago. The prompt is a single blinking cursor that knows nothing about what you typed last week. An error arrives with no suggestion of how to fix it. You copy-paste from Stack Overflow into a terminal that treats you like a system administrator from 1983.
You are Dorothy. You have been working in this Kansas for years. It works. Everything works. The Ruby slippers are right there on your feet — you just do not know they are magic yet.
Then the tornado arrives. It is called Warp.
| Suppliers | Inputs | Process | Outputs | Customers |
|---|---|---|---|---|
| Warp installer (DMG) | Your Mac Mini M4 Pro, macOS Sequoia | Download → drag to Applications → launch → authenticate | A modern Agentic Development Environment running natively on Apple Silicon | You — with every future episode's workflow running here |
| Your project idea | An empty directory | Scaffold warp-of-oz-tasks with uv, FastAPI, Python 3.12 | A running health endpoint at http://localhost:8000/health | The series codebase that grows across all 8 episodes |
| Warp's block system | Every command you run | Group command + output into a navigable atomic Block | A terminal session you can navigate, copy, search, and share | Your future self — no more scrolling through output walls |
Warp is not a terminal emulator with an AI bolt-on. It is an Agentic Development Environment — a rethinking of what the terminal can be when you assume the developer also has access to a large language model, a cloud orchestration platform, and a team knowledge base.
It has two parts:
Warp Terminal — the application you download and run on your Mac. Built in Rust for performance. It supports zsh, bash, fish, and PowerShell. It adds blocks, a code-editor-quality input experience, AI features, a file editor, code review, and the ability to run local agents interactively.
Oz — the orchestration platform that lives at oz.warp.dev. Cloud agents that run in the background, triggered by Slack messages, GitHub PRs, Linear issues, or schedules. Oz is the Emerald City — not visible from Dorothy's first view of the munchkin country, but always glowing on the horizon.
This episode is about arriving. The tornado. The yellow bricks. The first steps.
# Option 1: Direct download from warp.dev
# Download Warp.dmg → drag to /Applications → launch
# Option 2: Homebrew (recommended — keeps it updated)
brew install --cask warp
Warp runs natively on Apple Silicon — no Rosetta translation. The Mac Mini M4 Pro (24 GB unified memory, 12-core CPU) is an excellent host: Warp launches in under a second and local agent inference runs without thermal throttling.
After first launch, Warp asks you to sign in. A free account unlocks AI features with generous limits. Sign in with GitHub or email.
# Warp will already be your terminal — just confirm it loaded zsh correctly
echo $SHELL
# /bin/zsh
# Confirm macOS version
sw_vers
# ProductName: macOS
# ProductVersion: 15.x.x (Sequoia)
# BuildVersion: ...
# Confirm architecture
uname -m
# arm64 ← Apple Silicon, not x86_64
The first thing you notice in Warp is that command output is not a river — it is Blocks.
Every command and its output form a single, self-contained unit. The block has a border. You can click anywhere in it to select it. You can navigate between blocks with Ctrl-Up and Ctrl-Down. You can copy just the output, just the command, or both — with formatting preserved.
Try it:
# Run a command
ls -la ~
# Click somewhere else, then press Ctrl-Up
# You land on the ls -la ~ block, not anywhere in the sea of text
# Press Cmd-C on a block to copy just the output
# Press Shift-Cmd-C to copy "command + output" — useful for sharing
# Type: ls -la ~/Des
# Warp auto-completes the path. Press Tab. It works.
This is the yellow brick road. Every step is paved.
warp-of-oz-tasks 🏗️
Throughout this series, we build a single Python FastAPI application — a task management API — that grows from a bare scaffold to a production-grade service with authentication, background processing, and cloud agent automation. Every feature is added in Warp, using Warp's tools, on the Mac Mini M4 Pro.
uv (Python package manager)
uv is a Rust-based Python package manager from Astral. Massively faster than pip. Perfect pairing with Warp on Apple Silicon.
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Add to PATH (uv installer does this, but verify)
source ~/.zshrc
which uv
# /Users/you/.cargo/bin/uv (or ~/.local/bin/uv)
uv --version
# uv 0.x.x
# Create the project directory
mkdir -p ~/projects/warp-of-oz-tasks
cd ~/projects/warp-of-oz-tasks
# Initialise with uv
uv init --name warp-of-oz-tasks --python 3.12
# The project structure uv creates:
# warp-of-oz-tasks/
# ├── .python-version ← pins Python version
# ├── pyproject.toml ← project metadata and dependencies
# ├── README.md
# └── hello.py ← placeholder (we'll replace this)
# Add FastAPI and uvicorn
uv add fastapi "uvicorn[standard]"
# Confirm
cat pyproject.toml
# Remove the placeholder
rm hello.py
# Create the src package first (main.py below goes inside it)
mkdir -p src
touch src/__init__.py
# Create the application entry point
cat > src/main.py << 'PYTHON'
"""
warp-of-oz-tasks: A FastAPI task management API.
Built across 8 episodes of the Warp of Oz series on a Mac Mini M4 Pro.
"""
from datetime import datetime, timezone

from fastapi import FastAPI

app = FastAPI(
    title="Warp of Oz Tasks",
    description="Follow the yellow brick road — one endpoint at a time.",
    version="0.1.0",
)

@app.get("/health")
async def health_check():
    """The Munchkins confirm you have arrived safely."""
    return {
        "status": "alive",
        "message": "Toto, I've a feeling we're not in Kansas anymore.",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
PYTHON
# Ensure the src package marker exists (idempotent)
mkdir -p src
touch src/__init__.py
Wait — let's check the pyproject.toml and fix the source layout:
# Update pyproject.toml to reflect src layout
cat > pyproject.toml << 'TOML'
[project]
name = "warp-of-oz-tasks"
version = "0.1.0"
description = "A FastAPI task API — built with Warp, episode by episode."
requires-python = ">=3.12"
dependencies = [
    "fastapi>=0.115.0",
    "uvicorn[standard]>=0.30.0",
]
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
packages = ["src"]
TOML
# Run with uv run (handles the venv automatically)
uv run uvicorn src.main:app --reload --host 0.0.0.0 --port 8000
Open another Warp tab (Cmd-T) and test:
curl http://localhost:8000/health | python3 -m json.tool
# {
# "status": "alive",
# "message": "Toto, I've a feeling we're not in Kansas anymore.",
# "timestamp": "2026-04-17T..."
# }
The yellow brick road is paved. The first block sparkles.
Traditional terminals treat the input line like a 1970s text field. Warp treats it like a code editor:
# Multi-line input — press Shift-Enter for new lines
# The entire block becomes a multi-line script
python3 -c "
import sys, platform
print(f'Python {sys.version}')
print('Running on Apple Silicon' if platform.machine() == 'arm64' else 'Not ARM')
"
# Alt-click to place the cursor exactly where you want
# Cmd-Z to undo a character you deleted in the input
# Select text with Shift-Arrow and delete it
# This works like VS Code. Because Warp is built to.
# Key: Your First Taste of Magic ✨
Type # in Warp's input and start describing what you want in plain English:
# show me all python processes running on this machine
Warp converts this to:
ps aux | grep python
This is not AI autocomplete. This is natural language to shell command translation. Try more:
# find all files larger than 100MB in my home directory
# show disk usage for the current directory sorted by size
# what port is uvicorn running on
The Scarecrow is acquiring a brain. But that is Episode 2.
~/projects/warp-of-oz-tasks/
├── .python-version ← Python 3.12
├── pyproject.toml ← dependencies and metadata
├── README.md
└── src/
├── __init__.py
└── main.py ← health endpoint ✓
Commit this starting point:
cd ~/projects/warp-of-oz-tasks
git init
git add .
git commit -m "feat: scaffold warp-of-oz-tasks — the tornado has landed"
| # | Episode | Oz parallel | Warp feature | Codebase milestone |
|---|---|---|---|---|
| 1 | This one — Yellow Brick Road | Dorothy arrives | Blocks, input editor, # key | Health endpoint |
| 2 | The Scarecrow Gets a Brain | Scarecrow + brain | AI completions, Active AI, agent chat | CRUD endpoints for tasks |
| 3 | The Tin Man Gets a Heart | Tin Man + heart | WARP.md, Rules, Skills | Auth middleware |
| 4 | The Lion Gets Courage | Cowardly Lion | Agent pair mode, code review panel | Debug a planted error |
| 5 | Flying Monkeys: Dispatch | Flying monkeys | Autonomous dispatch mode | Background task processor |
| 6 | The Emerald City: Oz | Emerald City | Cloud agents, Oz CLI, schedules | Scheduled cleanup agent |
| 7 | The Ruby Slippers: Augment | Ruby slippers | Augment Code Intent + Warp | Spec-driven feature |
| 8 | There's No Place Like Home | Going home | MCP, Warp Drive, full workflow | Production-ready service |
In Episode 2, the Scarecrow joins the road. He needs a brain. Warp's AI features are exactly that.
🔗 Resources
🌪️ Warp of Oz Series follows the Yellow Brick Road through Warp's Agentic Development Environment — from the first install on a Mac Mini M4 Pro to cloud agent orchestration with Oz, with Augment Code Intent as the ruby slippers that were powerful all along.
2026-04-17 20:44:54
I’ve always found open source a bit overwhelming and intimidating. The large codebases and unfamiliar processes made it feel out of reach. But thanks to Outreachy, I pushed myself to face that fear and realized it wasn’t as impossible as I had imagined. I just needed to take it one step at a time.
When I received the email that I had been accepted into the Outreachy contribution stage, I had mixed feelings. I was excited, but also nervous. As I explored the available projects, the Firefox Sidebar project immediately caught my attention, and I decided to go for it.
With the help of the detailed project description and documentation, I was able to download the Firefox source code. It was huge and honestly, a bit intimidating at first. I remember thinking, where do I even start from? That initial fear crept back in, but I didn’t let it stop me.
Everything started to change when I made my first contribution and my patch was accepted. That moment felt like validation. It gave me the confidence I needed and made me realize that I could actually do this.
One particularly memorable experience was working on a UI-related bug that initially seemed simple but turned out to require a deeper understanding of layout behavior across different components. It taught me patience and showed me the importance of testing thoroughly before submitting a fix.
By the end of the Outreachy contribution stage, I had submitted four patches and all of them were accepted. That experience not only boosted my confidence but also strengthened my problem solving and debugging skills. I’m proud of how far I’ve come, and I definitely plan to continue contributing to Firefox.
Lastly, I’d like to appreciate my mentors, Nikkis and Kcochrane, for their guidance and support throughout this journey. Their feedback made a huge difference.
This experience has shown me that open source isn’t as intimidating as it seems: you just have to start, stay consistent, and be willing to learn.
2026-04-17 20:41:02
I use several AI coding CLIs depending on the task.
Claude Code is good at one kind of workflow. OpenCode has its own shape. Gemini CLI is useful when I want another model family in the loop. Codex is often strong when I need a second implementation or review pass.
The annoying part is not the models. The annoying part is switching tools.
chorus is my attempt to remove that friction.
It is an open-source cross-agent plugin collection for four AI coding CLIs: Claude Code, OpenCode, Gemini CLI, and Codex.
The idea is simple: from the tool I am already using, I should be able to delegate a task to the other agents.
That creates a 4×3 mesh. Each agent can call the other three.
From Claude Code:
/gemini:review Review this diff for hidden edge cases and missing tests.
/codex:run Add regression tests for the parser bug we just fixed.
/opencode:run Try a smaller refactor of the auth middleware without changing behavior.
From OpenCode, the same idea is exposed through MCP tools:
delegate_claude
delegate_gemini
delegate_codex
Gemini CLI and Codex get skills installed so they can delegate in any direction too.
Instead of asking one agent "is this fine?", ask three different agents to review the same change independently.
Different agents have different failure modes. One will over-focus on architecture. Another will catch a small test gap. Another will suggest a simpler implementation. Often one of them is wrong. That is fine. The value is in having multiple independent passes without leaving the terminal.
/gemini:review Check correctness and missed edge cases.
/codex:run Review test coverage and suggest missing cases.
/opencode:run Look for simplifications and risky abstractions.
This is not about pretending agents are teammates. It is about using model disagreement as a tool.
The important design constraint for chorus is that it does not try to become a new AI IDE or orchestration platform. It is glue.
One install gives you access to the other agents from your preferred tool. Claude Code gets slash commands. OpenCode gets MCP tools. Gemini CLI and Codex get skills.
Keep using the interface you already like, but stop treating each CLI as an isolated island.
# Claude Code
claude plugin install https://github.com/valpere/chorus
# OpenCode
opencode plugin @valpere/chorus-opencode
# Gemini CLI
gemini skills install https://github.com/valpere/chorus --path for-gemini/claude
gemini skills install https://github.com/valpere/chorus --path for-gemini/opencode
gemini skills install https://github.com/valpere/chorus --path for-gemini/codex
I built this because my own workflow had become repetitive — make a change in one CLI, copy context into another, ask for a review, manually bring the useful parts back. It worked, but it was clumsy.
chorus turns that into a normal command.
GitHub: https://github.com/valpere/chorus
If you already use more than one AI coding CLI, this may fit your workflow without asking you to change it. If you only use one, multi-agent review may still be worth trying on risky changes. A second opinion from a different agent is often cheaper than debugging the same blind spot later.
Valentyn Solomko — Ukrainian software engineer
2026-04-17 20:40:19
Design patterns are reusable solutions to commonly occurring problems in software design. They fall into three categories:
🏗️ Creational Patterns
How objects are created
1. **Singleton**
Ensures a class has only one shared instance.
class Database {
constructor() {
if (Database.instance) return Database.instance;
Database.instance = this;
}
}
const db1 = new Database();
const db2 = new Database();
console.log(db1 === db2); // true
Use when: Managing a shared resource like a DB connection or config object.
2. **Factory**
Creates different objects from a single function based on input.
function createUser(role) {
if (role === "admin") return { role, permissions: ["read", "write", "delete"] };
if (role === "guest") return { role, permissions: ["read"] };
}
const admin = createUser("admin");
const guest = createUser("guest");
Use when: You need to create different object types based on a condition.
3. **Builder**
Constructs a complex object step by step through a chainable API.
class QueryBuilder {
constructor() { this.query = "SELECT "; }
select(fields) { this.query += fields; return this; }
from(table) { this.query += ` FROM ${table}`; return this; }
where(condition) { this.query += ` WHERE ${condition}`; return this; }
build() { return this.query; }
}
const query = new QueryBuilder()
.select("*")
.from("users")
.where("age > 18")
.build();
Use when: Building complex objects like SQL queries, form configs, or request options.
🔗 Structural Patterns
How objects are composed/organized
4. **Module**
Encapsulates private state behind a public interface.
const CartModule = (() => {
let items = []; // private
return {
add: (item) => items.push(item),
getItems: () => [...items],
total: () => items.reduce((sum, i) => sum + i.price, 0),
};
})();
CartModule.add({ name: "Shoes", price: 50 });
console.log(CartModule.total()); // 50
Use when: You want to avoid polluting the global scope (very common in JS).
5. **Decorator**
Adds new behavior to an object without modifying the original.
function withLogging(fn) {
return function (...args) {
console.log(`Calling with`, args);
return fn(...args);
};
}
const add = (a, b) => a + b;
const loggedAdd = withLogging(add);
loggedAdd(2, 3); // logs: Calling with [2, 3] → returns 5
Use when: Adding features like logging, caching, or auth checks to functions.
🔔 Behavioral Patterns
How objects communicate
6. **Observer**
Lets objects subscribe to events and get notified when they occur.
class EventEmitter {
constructor() { this.listeners = {}; }
on(event, fn) {
(this.listeners[event] ??= []).push(fn);
}
emit(event, data) {
this.listeners[event]?.forEach(fn => fn(data));
}
}
const emitter = new EventEmitter();
emitter.on("login", (user) => console.log(`${user} logged in`));
emitter.emit("login", "Alice"); // Alice logged in
Use when: Building event systems, real-time updates, or state management.
7. **Strategy**
Encapsulates interchangeable algorithms behind a common interface.
const sorter = {
bubble: (arr) => { /* bubble sort logic */ },
quick: (arr) => { /* quick sort logic */ },
};
function sortData(data, strategy = "quick") {
return sorter[strategy](data);
}
Use when: You need to switch between different implementations of the same behavior (sorting, payment methods, validation rules).
Quick Reference
- **Singleton** (creational): share one instance of a resource.
- **Factory** (creational): create different object types from a condition.
- **Builder** (creational): assemble complex objects step by step.
- **Module** (structural): keep state private behind a small public API.
- **Decorator** (structural): wrap functions to add behavior.
- **Observer** (behavioral): let subscribers react to events.
- **Strategy** (behavioral): swap implementations of the same behavior.
2026-04-17 20:40:11
"Transient failures are inevitable; durable execution requires state to survive the crash."
We are building a resilient worker service in Rust that processes background tasks from a persistent queue. The example prioritizes durability over peak throughput: failed jobs are never lost; they either eventually succeed or move to a dead-letter queue. We use async Rust with SQL storage, demonstrating how to structure state transitions that survive application restarts. The focus is architectural correctness over raw performance, as a foundation for long-running background processing systems.
The worker must track a job's lifecycle without relying on volatile memory alone. We start by defining an enum that explicitly tracks every state transition, ensuring the logic is exhaustive.
// Deriving sqlx::Type maps the enum to a database column, so the status
// survives restarts alongside the rest of the row
#[derive(Debug, Clone, Copy, PartialEq, Eq, sqlx::Type)]
#[sqlx(type_name = "job_status", rename_all = "snake_case")]
pub enum JobStatus {
    Pending,
    Running,
    Succeeded,
    Failed,
    DeadLetter,
}
This choice matters because explicit states prevent silent state drifts that often plague long-running daemon processes. By forcing the developer to handle every case, we reduce the chance of forgetting to update a database column after a panic.
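The same exhaustiveness can guard the transitions themselves, not just the states. A minimal standalone sketch (the enum is re-declared here so the snippet compiles on its own; `is_valid_transition` is a hypothetical helper, not part of the worker's API):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Pending,
    Running,
    Succeeded,
    Failed,
    DeadLetter,
}

/// Returns true only for transitions the state machine allows.
pub fn is_valid_transition(from: JobStatus, to: JobStatus) -> bool {
    use JobStatus::*;
    // A closed allow-list: any transition not named here is rejected,
    // so adding a new state forces this function to be revisited
    matches!(
        (from, to),
        (Pending, Running)
            | (Running, Succeeded)
            | (Running, Failed)
            | (Failed, Pending)     // re-queued for another attempt
            | (Failed, DeadLetter)  // retries exhausted
    )
}

fn main() {
    assert!(is_valid_transition(JobStatus::Pending, JobStatus::Running));
    assert!(!is_valid_transition(JobStatus::Succeeded, JobStatus::Running));
}
```

Centralizing the allow-list means a worker that crashed mid-transition cannot accidentally resurrect a `Succeeded` job.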
A transient failure of the application worker must not result in data loss. We model the job table to include columns for status, retry count, and last attempt timestamp, creating a source of truth that survives restarts.
use chrono::{DateTime, Utc};

#[derive(sqlx::FromRow)]
pub struct Job {
    pub id: uuid::Uuid,
    pub status: JobStatus,
    pub retry_count: i32,
    pub created_at: DateTime<Utc>,
    pub last_attempted: Option<DateTime<Utc>>,
}
Storing metadata here allows us to query for pending work and ensures we can resume processing from exactly where the application died. We use UUIDs for the ID to maintain uniqueness and avoid accidental collisions.
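One way to query for pending work is a claim-and-update statement. This is a hedged sketch, assuming Postgres and a `jobs` table whose columns mirror the struct above; in the running service the constant would be executed through sqlx (e.g. `sqlx::query_as::<_, Job>(CLAIM_PENDING)`):

```rust
// Hypothetical claim query. FOR UPDATE SKIP LOCKED lets several workers
// poll concurrently without two of them claiming the same row.
pub const CLAIM_PENDING: &str = r#"
UPDATE jobs
SET status = 'running', last_attempted = now()
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, status, retry_count, created_at, last_attempted
"#;

fn main() {
    println!("{CLAIM_PENDING}");
}
```

Claiming the job and marking it `running` in one statement keeps the database, not worker memory, as the source of truth for who owns which job.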
When a job fails, we must wait before retrying to prevent database overload. We generate a delay based on the current retry count, using a tokio::time::sleep to enforce a pause before the next attempt.
pub fn calculate_delay(retry_count: i32) -> Duration {
    // Start with a 1-second delay and double it with each retry
    let base_duration = Duration::from_secs(1);
    let max_duration = Duration::from_secs(30);
    // Clamp the shift amount so large retry counts cannot overflow
    let raw_delay = base_duration * (1u32 << (retry_count.clamp(0, 5) as u32));
    let capped_delay = raw_delay.min(max_duration);
    // Add jitter to prevent thundering herd issues
    let jitter = Duration::from_millis(rand::random::<u64>() % 100);
    // Add the jitter directly; truncating to whole seconds would discard it
    capped_delay + jitter
}
Using exponential backoff instead of a fixed delay ensures that transient network issues resolve without overwhelming the system resources. The jitter component is critical for preventing multiple workers from retrying at the exact same second, which can cause spikes in database load.
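The schedule itself can be sanity-checked without a database or a sleep. This standalone sketch re-declares a deterministic variant (jitter omitted, shift clamped) purely for illustration:

```rust
use std::time::Duration;

// Deterministic backoff: 1s, 2s, 4s, 8s, 16s, then capped at 30s
fn backoff(retry_count: i32) -> Duration {
    let base = Duration::from_secs(1);
    let max = Duration::from_secs(30);
    // Clamp the shift so large retry counts cannot overflow the u32
    (base * (1u32 << (retry_count.clamp(0, 5) as u32))).min(max)
}

fn main() {
    assert_eq!(backoff(0), Duration::from_secs(1));
    assert_eq!(backoff(3), Duration::from_secs(8));
    // Beyond the cap, every retry waits the maximum 30 seconds
    assert_eq!(backoff(10), Duration::from_secs(30));
}
```

Keeping the schedule as a pure function of the retry count makes it trivial to unit-test, while the jitter stays at the call site where randomness is acceptable.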
A job should not be retried infinitely if the error is irrecoverable. If the retry count exceeds a threshold, we transition the state to DeadLetter to prevent an infinite loop and allow operators to manually inspect or discard the job.
pub fn should_retry(job: &Job, _error: &Error) -> bool {
    // Error classification (e.g. treating permanent errors as
    // non-retryable) could short-circuit here; for now only the
    // retry count decides.
    if job.retry_count >= MAX_RETRIES {
        // The caller transitions the job to DeadLetter
        return false;
    }
    true
}
This separation isolates error handling from success paths, adhering to the principle of separation of concerns. The DeadLetter state acts as a final repository for problematic jobs, ensuring the system doesn't block on them.
Building a durable job queue requires treating state as an external truth source rather than application memory. By defining a strict state machine and persisting it in a relational database, we ensure that no work is ever lost even if the worker process crashes. The retry logic with exponential backoff protects system health, while the dead letter queue allows for manual intervention on permanent failures. This pattern scales well for any background processing system that values correctness over speed. The separation of concerns—logic for success, logic for retry, logic for failure—ensures that the code remains maintainable and the architecture remains robust against transient failures.
To expand on this pattern, consider adding concurrency controls to process jobs in parallel without contending on database write locks. Investigate how Postgres connection pooling interacts with long-running transactions when processing large payloads. Finally, review logging strategies for tracking job lifecycle events in a distributed system, so that observability matches operational expectations. You might also add a metrics pipeline to track average processing time per job type.
Part of the Architecture Patterns series.