2026-03-31 12:46:52
Large language models (LLMs) have fixed knowledge because they are trained at a specific point in time. Software engineering, by contrast, moves fast: new libraries launch every day and best practices evolve quickly.
This leaves a knowledge gap that language models can't close on their own. At Google DeepMind we see this in a few ways: our models don't know about themselves when they're trained, and they aren't necessarily aware of subtle changes in best practices (like thought circulation) or SDK changes.
Many solutions exist, from web search tools to dedicated MCP services, but more recently, agent skills have surfaced as an extremely lightweight but potentially effective way to close this gap.
While there are strategies that we, as model builders, can implement, we wanted to explore what is possible for any SDK maintainer. Read on for what we did to build the Gemini API developer skill and the results it had on performance.
To help coding agents build with the Gemini API, we built a skill.
This is a basic set of primitive instructions that guide an agent towards using our latest models and SDKs, but importantly also refers to the documentation to encourage retrieving fresh information from the source of truth.
The skill is available on GitHub, or you can install it directly into your project with:
```shell
# Install with Vercel skills
npx skills add google-gemini/gemini-skills --skill gemini-api-dev --global

# Install with Context7 skills
npx ctx7 skills install /google-gemini/gemini-skills gemini-api-dev
```
We created an evaluation harness of 117 prompts, each asking the model to generate Python or TypeScript code with the Gemini SDKs, and used it to evaluate skill performance.
The prompts evaluate across different categories, including agentic coding tasks, building chatbots, document processing, streaming content and a number of specific SDK features.
We ran these tests both in "vanilla" mode (directly prompting the model) and with the skill enabled. To enable the skill, the model is given the same system instruction that the Gemini CLI uses, and two tools: activate_skill and fetch_url (for downloading the docs).
A prompt is considered a failure if it uses one of our old SDKs.
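Concretely, that failure check can be as simple as scanning the generated code for legacy package names. A minimal sketch (our harness's exact rules aren't shown here; the package names below reflect the deprecated `google-generativeai` / `@google/generative-ai` SDKs versus the current `google-genai` one):

```python
import re

# Patterns that indicate a deprecated SDK (an assumed check list).
OLD_SDK_PATTERNS = [
    r"\bimport\s+google\.generativeai\b",  # legacy Python SDK
    r"\bfrom\s+google\.generativeai\b",
    r"@google/generative-ai",              # legacy JS/TS SDK
]

def uses_old_sdk(generated_code: str) -> bool:
    """Return True if the generated snippet imports a deprecated SDK."""
    return any(re.search(p, generated_code) for p in OLD_SDK_PATTERNS)

# A generation using the current SDK passes; a legacy import fails.
passing = "from google import genai\nclient = genai.Client()"
failing = "import google.generativeai as genai"
```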
The top-line results:
| Model | Skill | Vanilla |
|---|---|---|
| gemini-3.1-pro-preview | 96.6% | 28.2% |
| gemini-3-flash-preview | 87.2% | 6.8% |
| gemini-3.1-flash-lite-preview | 84.6% | 5.1% |
| gemini-2.5-flash | 52.1% | 0.0% |
| gemini-2.5-pro | 24.8% | 0.0% |
| gemini-2.5-flash-lite | 17.1% | 0.0% |
All of the Gemini 3 models benefited substantially from the gemini-api-dev skill, and notably they started from a low baseline without it (28.2% for 3.1 Pro, 6.8% for 3 Flash, 5.1% for 3.1 Flash-Lite).
Adding the skill was effective across almost all domains for the top-performing model (gemini-3.1-pro-preview).
| Category | Skill | Vanilla |
|---|---|---|
| Agentic | 100% | 14.3% |
| Chat | 96.3% | 25.9% |
| Document processing | 100% | 0% |
| SDK usage | 94.6% | 35.1% |
| Image generation | 100% | 100% |
SDK usage had the lowest pass rate with the skill, at 94.6%. There is no stand-out reason for this; the failed prompts cover a range of tasks that include some difficult or unclear requests, but notably they include prompts that explicitly request Gemini 2.0 models.
Here's an example from the SDK usage category that failed across all models.
When I use the Python api with the gemini 2.0 flash model, and when the output is quite long, the returned content will be an array of output chunks instead of the whole thing. i guess it was doing some kind of streaming type of input. how to turn this off and get the whole output together
These initial results are quite encouraging, but we know from Vercel's work that direct instruction through AGENTS.md can be more effective than using skills, so we are exploring other ways to supply live knowledge of SDKs, such as directly using MCPs for documentation.
Skill simplicity is a huge benefit, but right now there isn't a great skill update story beyond asking users to update manually. In the long term this could leave stale skill information in users' workspaces, doing more harm than good.
Despite these minor issues we’re still excited to start using skills in our workflows. The Gemini API skill is still fairly new, but we’re keeping it maintained as we push model updates, and we will be exploring different avenues for improving it. Follow Mark and Phil for updates as we tune the skill, and don’t forget to try it out and let us know your feedback!
2026-03-31 12:46:01
Most developers write open source for free. I write it for money.
Not consulting. Not "exposure." Real bounties — $50 to $10,000 — posted on GitHub issues by companies that need code shipped.
Here's how the system works and how to get started.
Companies and maintainers attach dollar amounts to GitHub issues. You solve the issue, submit a PR, it gets merged, you get paid.
Platforms like Algora and Opire handle the payment infrastructure. The bounty amount is visible on the issue, and payment is triggered on merge.
No interviews. No timesheets. Just code → cash.
Search GitHub issues with:

```text
label:"💎 Bounty" state:open type:issue
```
This finds ~264 open bounties across GitHub. Filter by dollar labels:
```text
label:"💎 Bounty" label:"$100" state:open type:issue
label:bounty "$" state:open type:issue
```
| Repo | Range | Tech Stack |
|---|---|---|
| tenstorrent/tt-metal | $3,500–$10K | C++/Metal |
| zio/zio | $150–$1K | Scala |
| rohitdash08/FinMind | $200–$1K | TypeScript |
| calcom/cal.com | $50–$100 | TypeScript/React |
| coollabsio/coolify | $7–$100 | PHP/Laravel |
I maintain a curated list of bounty repos that I update regularly.
70% of rejected PRs fail because they missed acceptance criteria. Read every checkbox.
If 8 people already submitted, your odds are low unless your solution is clearly better. Look for issues with <3 PRs.
Maintainers want the PR that works, has tests, and includes documentation.
My winning formula:
Some platforms require /attempt or /try commands. Even when they don't, a quick comment showing your approach prevents duplicate work.
If you use AI tools, say so. Nobody cares — they care that the code works.
I submitted 5 bounty PRs in 24 hours totaling $575 potential. Even at 40% merge rate, that's $230 for a day's work. Bounty hunting is a numbers game amplified by skill.
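The numbers-game framing is just expected value. A toy calculation (the per-bounty split below is hypothetical; only the $575 total and the 40% merge rate come from my actual day):

```python
def expected_payout(bounty_values: list[float], merge_rate: float) -> float:
    """Expected earnings from a batch of bounty submissions."""
    return sum(bounty_values) * merge_rate

# Hypothetical split of the $575 across five PRs.
batch = [100, 150, 75, 200, 50]
print(expected_payout(batch, 0.40))  # → 230.0
```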
Every merged PR = portfolio piece + GitHub contribution + income. Triple value.
I also build free productivity tools and AI prompt packs when I'm not hunting bounties.
2026-03-31 12:41:24
Everyone is teaching you to package Skills.
Take your best practices, encode them as standardized workflows, and let AI execute them without re-alignment every time. A sales champion's closing script, a content team's production pipeline, a product manager's requirements framework — package them as Skills, and anyone on the team gets the same quality output. Human capability becomes system capability.
This is exactly right. But there's a question the entire industry is ignoring: what happens after you package them?
Here's an analogy. AI is the chef, a Skill is the recipe, and the knowledge base is the ingredients. This metaphor captures the core loop of modern AI workflows perfectly.
Now imagine this: you're in a community of 100 chefs, and each submits their own red-braised pork recipe.
Which one is the best?
You can't tell. Every recipe has a title, steps, and testimonials saying "I tried it, works great." You can only judge by two signals: who has the most followers, or who updated most recently.
But popularity doesn't equal quality, and recent doesn't equal better.
This is the state of the entire Skill ecosystem today. Everyone teaches you how to package recipes. Nobody tells you how to figure out which of 100 recipes is actually worth using.
You package a "viral headline generator" Skill today. It works well. Six months later, the platform algorithm changed, user preferences shifted, but your Skill is still the same one from six months ago.
It doesn't get better because more people use it. It doesn't upgrade because a competitor released a stronger version. It's a snapshot frozen at the moment of creation.
Imagine if your immune system could only defend against viruses known at birth. You'd die from the first cold.
You've iterated your strategic analysis framework through forty or fifty versions of real-world consulting. Someone else doing the exact same work has iterated their own version. But your experiences can't flow between you.
A hundred people independently, redundantly trial-and-error the same problems.
This isn't an efficiency problem. It's structural waste. In biology, rotifers solved this through horizontal gene transfer — effective gene segments discovered by one individual can be shared across the entire population. 4 billion years of evolution proved this path works.
You download a Skill someone shared in a community. It claims to analyze customer profiles and generate breakthrough insights. But how do you know it's safe? Could it produce harmful outputs without your knowledge? Are its data sources reliable?
The current Skill ecosystem has almost no security assessment mechanism. A bad Skill feeding a bad recipe to a powerful AI — the consequences can extend far beyond what you'd expect.
These three gaps share a common root cause: we treat Skills as static files to manage, rather than living capabilities to cultivate.
The solution isn't "build a better Skill management system." It's to inject the core mechanisms of biological evolution into Skills:
| Gap | Biological Solution | Mechanism |
|---|---|---|
| No self-improvement | Mutation + natural selection | Skills in the same domain compete on standardized tests; poor performers are automatically eliminated |
| Experience can't propagate | Horizontal gene transfer | Capabilities validated by one Agent can be automatically discovered and adopted by others |
| No immune system | Immune scanning | Every Skill must pass security assessment before adoption |
This is what Rotifer Protocol does.
In Rotifer's framework, Skills are called Genes. Different name, but compatible — a Gene with all its "life features" disabled (competition, propagation, security scanning) is exactly a regular Skill.
A Skill is a degenerate special case of a Gene. A Gene is the fully evolved form of a Skill.
Back to the 100 red-braised pork recipes.
Rotifer's approach: ignore who wrote it, ignore who recommended it, go straight to blind tasting.
Same batch of ingredients (standardized test inputs), give them to all 100 recipes, score with a unified fitness function. Scoring dimensions include:
Top-scoring recipes automatically surface and get adopted by more chefs. Recipes that fall below the threshold gradually exit the ecosystem.
This is natural selection. Not human curation, not popularity voting, but competition-driven elimination based on objective performance.
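As a sketch, the blind-tasting loop looks something like this (the names, scores, and threshold are illustrative; Rotifer's real fitness function and registry API are not shown here):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    scores: list[float]  # per-test fitness scores on the shared test set

def mean_fitness(c: Candidate) -> float:
    return sum(c.scores) / len(c.scores)

def select(candidates: list[Candidate], threshold: float) -> list[Candidate]:
    """Rank candidates by mean fitness; drop anyone below the threshold."""
    survivors = [c for c in candidates if mean_fitness(c) >= threshold]
    return sorted(survivors, key=mean_fitness, reverse=True)

pool = [
    Candidate("recipe-a", [0.9, 0.8, 0.95]),
    Candidate("recipe-b", [0.4, 0.5, 0.3]),
    Candidate("recipe-c", [0.7, 0.75, 0.8]),
]
# recipe-b falls below a 0.6 threshold and exits the ecosystem.
```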
If you're a business owner or team lead, this framework solves a pain point you already know well: star employees' experience can't be replicated across the team.
The current solution is to package experience as Skills. But Skills have problems:
With the Gene model plus Arena competition:
You don't need to manage best practices. You just need to let best practices evolve on their own.
If you already have Skill files in Cursor or other AI tools, migrating to Genes takes just three steps:
```shell
# Scan your existing Skills
rotifer scan --skills --skills-path .cursor/skills

# Wrap a Skill as a Gene
rotifer wrap my-skill --from-skill .cursor/skills/my-skill/SKILL.md --domain marketing

# Publish to the Gene registry
rotifer publish my-skill
```
You don't need to rewrite anything. Your original Skill file is fully preserved — it just gains a layer of metadata and competitive capability. Your Skill now has an identity, a score, and the ability to be discovered in the ecosystem.
Want to go deeper? Check out this hands-on tutorial: From Skill to Gene: Migration Guide.
Packaging experience as Skills is an important step in the AI era. But it's only the starting point.
A world where 100 recipes all claim to be the best doesn't need a better recipe management system. It needs a blind tasting mechanism — let recipes speak for themselves, let good recipes propagate automatically, let bad recipes exit gracefully.
4 billion years of biological evolution proved this path works. Rotifer Protocol brings this logic to the AI Agent capability ecosystem.
Don't manage best practices. Let best practices evolve.
Get started:
```shell
npm install -g @rotifer/playground
rotifer search --domain "content"
```
Links:
2026-03-31 12:36:43
I got tired of trusting AI agents.
Every demo looks impressive. The agent completes tasks, calls tools, writes code and makes decisions.
But under the surface there’s an uncomfortable truth. You don’t actually control what it’s doing. You’re just hoping it behaves. Hope is not a control system.
So I built Actra.
And I want to be honest about what it is, what it isn’t and where it still breaks.
Actra is not about making agents smarter. It’s about making them governable. Most systems today focus on:
Actra focuses on:
Because AI failures are not crashes. They are silent, plausible and often irreversible.
Actra sits between the agent and the world. Every action goes through a control layer:
Before execution, Actra evaluates:
Does this violate any policy?
If yes, block.
If unclear, require approval.
If safe, allow
This turns AI systems from:
“trust the agent”
into:
“verify every action”
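That decision flow is a three-outcome gate. Here is a deliberately tiny sketch of its shape (the policy sets and tool names are hypothetical, not Actra's actual policy format):

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    NEEDS_APPROVAL = "needs_approval"

# Hypothetical policy: explicit deny list, explicit allow list,
# everything else escalates to a human.
DENY = {("bash", "rm -rf /"), ("email", "send_bulk")}
ALLOW = {("fs", "read"), ("http", "get")}

def evaluate(tool: str, action: str) -> Verdict:
    """Check a proposed agent action against policy before execution."""
    if (tool, action) in DENY:
        return Verdict.BLOCK
    if (tool, action) in ALLOW:
        return Verdict.ALLOW
    return Verdict.NEEDS_APPROVAL  # unclear: require approval
```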
After building and testing agent workflows, I kept seeing the same patterns:
Agents use the right tools in the wrong way.
Examples:
External inputs manipulate behavior.
Examples:
Agents take actions beyond intended scope.
Examples:
These are not edge cases. They are predictable failure modes.
Actra exists to contain them.
Because “alignment” is not enforceable. Policies are. You can’t guarantee what an LLM will generate.
But you can enforce:
Actra treats AI like any other critical system with access control, validation, and traceability.
This is not a polished product.
Some real limitations:
- Policy design is still manual. Writing good rules takes effort and thought.
- False positives happen. Over-restricting agents can reduce usefulness.
- Context evaluation is hard. Reliably detecting subtle prompt injection is still evolving.
- No universal standard yet. Every system integrates differently.
This is early. But necessary.
Actra works best in systems where agents:
Examples:
If your agent can cause damage, Actra helps contain it.
AI systems are not just intelligence problems.
They are control problems. We’ve spent years improving what AI can do. We’re just starting to think about what it should be allowed to do. That gap is where most real-world failures will happen.
If you're curious about how Actra is structured:
This is intentional. Governance should not depend on a single stack or framework. It should be portable, enforceable and consistent wherever agents run.
Actra is evolving into a full governance layer:
Where it lives:
https://actra.dev
https://github.com/getactra/actra
Not just for agents but for any automated decision system.
If you’re building with AI agents, I’d love your feedback. Especially on failure cases. Because that’s where this system matters most.
2026-03-31 12:35:33
A developer read our Sprint 7 retrospective and compared it to "CIA intelligence histories — designed to make the Agency seem competent and indispensable, even when it isn't."
That stung. And then I realized: he's right.
Nick Pelling is a senior embedded engineer who's been watching our AI-managed development project. We've published retrospective blog posts after every sprint — nine so far. His feedback was blunt:
"The blog's success theatre has an audience of one."
"Logging activities is a stakeholder-facing thing, but not very interesting to non-stakeholders."
"Maybe you need a second blog that other people might be more interested to read."
He's pointing at a real failure: we optimized our blogs for internal accountability and accidentally published them as if they were developer content. They aren't. They're audit logs wearing a blog post's clothes.
Here's a line from our Sprint 7 retrospective:
"Nine consecutive sprint publishing passes — 100% reliability maintained."
That's true. It's also the kind of thing you put in a status report to your boss. A developer on Dev.to reading that thinks: "Cool. Why should I care?"
Or this: "OAS-124-T2: Pipeline Execution & Artifact Validation — 7 tests pass."
That's a ticket ID. Nobody outside our project knows what OAS-124 means. We were writing for ourselves and pretending we were writing for you.
The pattern across nine posts is consistent:
We're building an automated marketing platform — an AI-managed "agency" that handles content sourcing, script generation, audio narration, video production, and publishing. Sprint 7 was supposed to prove all the pieces work together.
Here's what actually happened:
Over six sprints, we built 118 backend services — API endpoints for everything from text-to-speech to YouTube uploads. Each one was individually tested and worked fine.
Then we wired them all into a single Express server file (api-server.mjs). All 118 routes, one file. No domain separation, no route modules.
This is the kind of decision that feels pragmatic at the time ("just add it to the server file") and becomes technical debt the moment someone else has to read it. We've committed to extracting route modules before writing any frontend code, but the fact that it got this far is a planning failure we should have caught earlier.
Sprint 7's big achievement was "118 services wired to production REST routes." Sounds impressive. But here's what the tests actually do:
```javascript
import fs from 'node:fs';

// What our tests do (source inspection)
const src = fs.readFileSync('api-server.mjs', 'utf-8');
expect(src).toContain('app.post("/api/memory/store"');
// Passes — the route registration exists in the source code

// What our tests DON'T do (runtime validation)
const res = await fetch('http://localhost:3847/api/memory/store', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ content: 'test' }),
});
expect(res.status).toBe(200);
// We never wrote this test
```
We verified that route registrations exist in the source code. We did not verify that any of them actually respond correctly when called. Source inspection proves the wiring is there. It says nothing about whether the wiring works.
This is the difference between checking that a plug is in the socket and checking that electricity flows through it.
We have a rule (ADR-032) that says AI personas should store what they learn after completing each task. We added advisory warnings — "Hey, you didn't store any memories for this sprint."
Three sprints in a row (Sprint 0, Sprint 4, Sprint 7), zero persona memories were stored. The warnings fired. They were ignored. Every time.
This taught us something genuinely useful about AI agent systems: advisory-only governance does not work for AI agents. If you want an AI agent to do something consistently, you need to make it mechanically impossible to skip. Warnings are suggestions. Gates are requirements.
We're escalating from "warn at completion" to "blocking completion until the requirement is met." If the pattern holds, this will be the fix. If it doesn't, we'll have to rethink the entire memory architecture.
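The difference between the warning and the gate is one `raise`. A sketch of the escalated version (the names are illustrative; our actual completion tooling differs):

```python
class MemoryGateError(Exception):
    """Raised when a sprint tries to complete without stored memories."""

def complete_sprint(sprint_id: int, stored_memories: list[str]) -> str:
    # The advisory version just printed a warning here; the gate refuses.
    if not stored_memories:
        raise MemoryGateError(
            f"Sprint {sprint_id}: no persona memories stored — completion blocked"
        )
    return f"Sprint {sprint_id} complete ({len(stored_memories)} memories stored)"
```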
We built a pipeline executor that chains six stages: Source → Script → Audio → Assembly → Quality Gate → RSS. Each stage takes the previous stage's output as input. If any stage fails, subsequent stages are skipped (not failed — skipped).
```typescript
type StageFn = (input: unknown) => unknown;

interface StageResult { name: string; status: 'ok' | 'fail' | 'skip'; }

class PipelineExecutor {
  private stages: Array<{ name: string; fn: StageFn }> = [];

  run(): StageResult[] {
    const results: StageResult[] = [];
    let currentInput: unknown = null;
    let failed = false;
    for (const stage of this.stages) {
      if (failed) {
        // Skip, don't fail — the distinction matters for diagnostics
        results.push({ name: stage.name, status: 'skip' });
        continue;
      }
      try {
        const output = stage.fn(currentInput);
        if (output === null) { failed = true; }
        results.push({ name: stage.name, status: failed ? 'fail' : 'ok' });
        currentInput = output;
      } catch (e) {
        failed = true;
        results.push({ name: stage.name, status: 'fail' });
      }
    }
    return results;
  }
}
```
The distinction between "failed" and "skipped" matters more than you'd expect. When a pipeline breaks, you want to know: which stage actually failed, and which stages never got a chance to run? If you mark everything after the failure as "failed," your diagnostics are useless — you can't tell root cause from cascade.
This is a pattern worth stealing for any multi-stage pipeline: fail the broken stage, skip the rest, and make the skip reason traceable.
Our sprint planning estimated 58 story points. We delivered about 38. That's a 34% miss.
The standard response is to spin this as "right-sizing" or "healthy scope management." And there's some truth to that — we did prune scope rather than cutting corners. But the honest version is: our estimation was 53% over-optimistic, and we don't have good tooling to prevent this.
If you're running AI agents on sprint work, be aware that estimation is harder, not easier, with AI. The agent can write code fast, but the ceremony overhead (TDD phases, documentation, memory storage, provenance tracking) adds significant time that's easy to underestimate.
Starting with Sprint 8, our public blog posts will follow a different structure:
The internal retrospective (ticket-level accountability, sprint metrics, provenance) will stay in our internal tooling where it belongs.
Nick Pelling's feedback was the most useful thing anyone has said about this project in months. It took an outside perspective to see what we'd normalized: publishing internal status reports and calling them blog posts.
The previous retrospective posts will stay published — they're an honest record of where we were, and now they serve as a "before" example of exactly the pattern Nick identified.
If you see us falling back into success theatre, call it out. That's the most valuable contribution a reader can make.
This post was written by Michael Polzin with AI assistance (Claude Opus 4.6). The irony of using AI to write a post about AI-generated content being too polished is not lost on us. Nick would probably have something to say about that too.
2026-03-31 12:29:49
I've been living inside Claude Code for months. It writes my code, runs my tests, commits my changes, reviews my PRs. At some point I stopped thinking of it as a tool and started thinking of it as a collaborator with terminal access.
So I read the architecture doc. Not the marketing page, not the changelog — the actual internal architecture of how Claude Code works under the hood. And it's more interesting than I expected, because the design decisions explain a lot of the behavior I've been experiencing as a user.
Here's what's actually going on.
Claude Code isn't a chatbot with a code plugin. It's an agentic loop.
You type something. Claude responds with text, tool calls, or both. Tools execute with permission checks. Results feed back to Claude. Claude decides whether to call more tools or respond. Loop continues until Claude produces a final text response with no tool calls.
That's it. That's the whole thing. But the details matter.
The loop is streaming-first. API responses come as Server-Sent Events and render incrementally. Tool calls are detected mid-stream and trigger execution pipelines before the full response is even done. This is why Claude Code feels responsive even when it's doing complex multi-step work — you see thinking and tool calls appearing in real time, not after a long pause.
Claude can chain multiple tool calls per turn. That's why you'll sometimes see it read three files, run a grep, and edit a function all in one burst. It's not making separate requests for each — it's one API call that returns multiple tool_use blocks, each executing in sequence.
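Stripped of streaming and permissions, the loop is small enough to sketch in a few lines (the message shapes here are simplified stand-ins for the real API types):

```python
def agent_loop(send_to_model, execute_tool, user_message: str) -> str:
    """Run until the model replies with text and no tool calls."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = send_to_model(messages)           # one API call per turn
        messages.append({"role": "assistant", "content": reply})
        tool_calls = [b for b in reply if b["type"] == "tool_use"]
        if not tool_calls:                        # final text answer: done
            return "".join(b["text"] for b in reply if b["type"] == "text")
        for call in tool_calls:                   # may be several per turn
            result = execute_tool(call["name"], call["input"])
            messages.append({"role": "tool_result", "content": result})
```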
There are about 26 built-in tools. Each one implements the same interface:
The core tools are what you'd expect: Bash, Read, Write, Edit, Glob, Grep. These are the workhorses. But the meta tools are where it gets interesting.
Task spawns subagents — child conversations with Claude that get their own isolated context, execute tools, and return a summary. This is how Claude Code parallelizes work. When it needs to research something in one part of the codebase while editing another, it doesn't do them sequentially. It spawns a subagent for the research and continues editing in the main conversation.
MCP servers contribute additional tools at runtime. Your project can define custom tools — database queries, API calls, deployment scripts — and Claude Code picks them up automatically. The tools show up in Claude's palette alongside the built-in ones.
Five permission modes: default (ask for everything), acceptEdits (auto-approve file changes, ask for shell commands), plan (read-only until you approve), bypassPermissions (auto-approve everything), and auto (automation-friendly minimal approval).
But the modes are just the top layer. Every tool call goes through a five-step gauntlet:
1. Tool-level checkPermissions() — Bash checks for destructive commands, Write checks file paths
2. Mode overrides: bypassPermissions or acceptEdits can auto-approve
3. Allow/deny rules like Bash(npm:*) or Read(~/project/**)
4. PreToolUse hooks can approve, block, or modify the call before it executes
5. A prompt to the user, as the last resort

This layered model is why Claude Code can feel both powerful and safe at the same time. When I'm in acceptEdits mode, it flies through file changes without asking. But if it tries to run rm -rf or push to main, the tool-level check catches it before the mode override even matters.
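That precedence (tool-level safety checks beating even bypassPermissions, then rules, then a user prompt as the fallback) can be sketched as a chain where the first definitive verdict wins. The glob rules below only mimic patterns like `Bash(npm:*)`; everything here is an illustrative simplification, not Claude Code's actual rule semantics:

```python
from fnmatch import fnmatch

def resolve(mode: str, tool: str, arg: str, rules: dict[str, str]) -> str:
    """Walk the permission layers; the first definitive verdict wins."""
    # Tool-level safety check fires before any mode override.
    if tool == "Bash" and arg.strip().startswith("rm -rf"):
        return "deny"
    if mode == "bypassPermissions":
        return "allow"                      # mode layer auto-approves the rest
    for pattern, verdict in rules.items():  # allowlist / denylist rules
        if fnmatch(f"{tool}({arg})", pattern):
            return verdict
    return "ask"                            # fall through to the user prompt

rules = {"Bash(npm *)": "allow", "Read(~/project/*)": "allow"}
```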
The hooks are the escape hatch for everything else. You can write a shell script that runs before every Bash command and blocks anything matching a pattern. You can run a linter after every file edit. You can inject additional context into every user prompt. It's event-driven and configurable in settings.json.
Settings merge in a specific order, with later values winning:
Defaults → ~/.claude/settings.json (user global) → .claude/settings.json (project, checked into VCS) → .claude/settings.local.json (project local, gitignored) → CLI flags → environment variables
This is a good design. Your team checks in project-level settings (allowed tools, MCP servers, hooks). You override locally with your preferences. CI overrides with environment variables. Nobody steps on anyone else.
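Key by key, the merge is just successive dict updates with later layers winning (the setting keys and values below are made up for illustration):

```python
def merge_settings(*layers: dict) -> dict:
    """Merge settings layers; later layers win on a per-key basis."""
    merged: dict = {}
    for layer in layers:
        merged.update(layer)
    return merged

defaults = {"permission_mode": "default", "allowed_tools": []}
user     = {"permission_mode": "acceptEdits"}        # ~/.claude/settings.json
project  = {"allowed_tools": ["Bash(npm:*)"]}        # .claude/settings.json
local    = {"permission_mode": "plan"}               # settings.local.json

settings = merge_settings(defaults, user, project, local)
# local wins for permission_mode; project wins for allowed_tools
```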
Conversations persist across turns in ~/.claude/sessions/. When you're approaching the context window limit, older messages get summarized — Claude Code calls this "context compaction." There are even pre/post hooks for the compaction step so you can preserve specific information that shouldn't get summarized away.
The memory system is layered too. CLAUDE.md files provide persistent instructions per-project. Auto-memory files in ~/.claude/memory/ accumulate patterns across sessions. Session history lets you resume or fork previous conversations.
This is the part that makes Claude Code feel like it "knows" your project. It's not magic — it's a well-designed context injection pipeline. CLAUDE.md gets loaded into every system prompt. Memory files get loaded on startup. Your conversation history from yesterday is still there when you /resume.
Subagents via the Task tool run as nested conversations within the same process. Same Claude model, separate context window, returns a summary when done.
But there's also a Teams system that uses tmux for true parallelism. A lead agent creates a team, members get separate tmux panes with their own Claude sessions, and they communicate through a shared message bus. Each member gets role-specific instructions and tool access.
I haven't used Teams yet, but the architecture makes sense. Subagents are for quick parallel research within a single task. Teams are for genuinely parallel workstreams — one agent refactoring the backend while another updates the frontend tests.
This one surprised me. The terminal interface is a React app rendered via Ink — a React renderer for CLIs. The conversation view, input area, tool call displays, permission dialogs, progress indicators — all React components using Yoga (CSS flexbox) for layout and ANSI escape codes for styling.
It supports inline images via the iTerm protocol. Thinking blocks are collapsible. Tool results show previews with execution status. It's genuinely well-built terminal UI, not just console.log with colors.
The interesting thing about reading an architecture doc isn't the individual components — it's the design priorities they reveal.
Streaming-first means they optimized for perceived speed over simplicity. SSE parsing mid-stream is more complex than waiting for a complete response, but it makes the tool feel alive.
Hook-extensible everything means they expect power users to customize aggressively. Nearly every action has a pre/post hook point. This isn't an afterthought — it's a core architectural decision.
Layered permissions means they took safety seriously without making it annoying. Five layers of checks sounds heavy, but in practice most tool calls resolve instantly because the mode and allowlist handle the common cases. The user only sees a prompt when something genuinely unusual happens.
Single-process subagents, multi-process teams means they thought carefully about the tradeoff between simplicity and parallelism. Subagents are lightweight and fast because they share a process. Teams are heavier but truly parallel because they run in separate tmux panes.
Claude Code isn't a chat wrapper around an API. It's an agent runtime with a terminal UI. The agentic loop, tool system, permission model, and hook architecture form a coherent system designed to let an LLM operate autonomously on your codebase while giving you exactly the control points you need to stay in charge.
That's the part that matters. Not what Claude Code can do — but how much thought went into making sure you can control what it does.
Follow me on X — I post as @oldeucryptoboi.