2026-04-12 20:10:08
Andrej Karpathy published an LLM Wiki gist last week. 5,000+ stars. Nearly 3,000 forks. The idea: instead of retrieving documents every time you ask a question, have an LLM compile and maintain a persistent knowledge base.
I took the pattern and built it as a reusable Claude Code skill.
Four commands:
→ /wiki-init to bootstrap
→ /wiki-ingest to process sources
→ /wiki-query to synthesize answers
→ /wiki-lint to health-check
Two use cases where I have seen it work:
CTO Decision Wiki — architecture decisions, meeting notes, and post-mortems compiled into a queryable knowledge base. No more reconstructing context from Slack threads.
Content Research Wiki — every source for every article accumulates. Cross-references build automatically. Contradictions get flagged.
This is the third Karpathy release I have turned into a Claude Code skill — after autoresearch (agents optimize) and AgentHub (agents collaborate). LLM Wiki completes the trilogy: agents remember.
Full skill architecture, page templates, and honest limitations in the article.
Read the Full Article on Medium
2026-04-12 20:09:00
Distributed teams deal with timezones constantly. nylas timezone info solves that from the terminal.
The nylas timezone info command displays the current UTC offset, timezone abbreviation, DST status, and standard/daylight names for any IANA timezone. Pass the zone as a positional argument.
These timezone commands use the IANA timezone database compiled into the binary. No network calls, no API keys, no rate limits. They work on airplanes.
nylas timezone info <ZONE>
Show Pacific Time details:
nylas timezone info America/Los_Angeles
Check UTC offset for a zone:
nylas timezone info Europe/Berlin
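If you need the same lookup programmatically, Node's Intl APIs read from the same IANA timezone database; a rough sketch (this is not the nylas implementation, just an equivalent offset lookup):

```typescript
// Derive the current UTC offset for an IANA zone using Node's built-in Intl
// support (Node 18+). DST is handled automatically by the tz database.
const zone = "Europe/Berlin";
const parts = new Intl.DateTimeFormat("en-US", {
  timeZone: zone,
  timeZoneName: "longOffset",
}).formatToParts(new Date());
const offset = parts.find((p) => p.type === "timeZoneName")?.value;
console.log(`${zone}: ${offset}`); // "Europe/Berlin: GMT+01:00" (GMT+02:00 during DST)
```

Like the CLI, this makes no network calls; the offset comes from the tz data bundled with the runtime.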
Full docs: nylas timezone info reference — all flags, advanced examples, and troubleshooting.
All commands: Nylas CLI Command Reference
Get started: brew install nylas/nylas-cli/nylas — other install methods
2026-04-12 20:08:42
AI-powered email assistant. Opens an interactive REPL where you can draft, reply, search, and manage email with natural language.
The nylas air command starts an interactive AI-powered email session.
brew install nylas/nylas-cli/nylas
nylas init
nylas air
Start the AI email assistant:
nylas air
Draft and send in a session:
nylas air
# > Draft a meeting recap to [email protected]
# > Make it shorter and add action items
# > Send it
The Nylas CLI includes several utilities beyond the core email, calendar, and contacts commands. These are designed to complement your workflow without requiring separate tools.
nylas air — Start an interactive AI-powered email session
nylas config reset — Reset all CLI configuration and credentials to a clean state
nylas ui — Start the web configuration UI
nylas completion bash — Generate a shell completion script for bash, zsh, fish, or PowerShell
nylas demo email list — List emails from a built-in demo account. No authentication or API key required
Full docs: nylas air reference — all flags, advanced examples, and troubleshooting.
All commands: Nylas CLI Command Reference
Get started: brew install nylas/nylas-cli/nylas — other install methods
2026-04-12 20:05:48
The Vercel AI SDK's useChat hook makes streaming AI responses look trivially easy. Five lines of code and you have a ChatGPT clone. Then you add it to a real product and discover the parts the README skips.
I've shipped useChat-based interfaces in two production apps. Here's the complete picture.
// app/api/chat/route.ts
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
export async function POST(req: Request) {
const { messages } = await req.json();
const result = await streamText({
model: anthropic('claude-sonnet-4-6'),
messages,
});
return result.toDataStreamResponse();
}
// components/Chat.tsx
import { useChat } from 'ai/react';
export function Chat() {
const { messages, input, handleInputChange, handleSubmit } = useChat();
return (
<div>
{messages.map(m => (
<div key={m.id}>{m.role}: {m.content}</div>
))}
<form onSubmit={handleSubmit}>
<input value={input} onChange={handleInputChange} />
<button type="submit">Send</button>
</form>
</div>
);
}
This works. Here's what breaks in production.
Users close tabs, go offline, or navigate away mid-stream. The default useChat behavior leaves a partial message in state with no indication that it's incomplete.
const { messages, isLoading, stop } = useChat({
onError: (error) => {
console.error('Stream error:', error);
// Show a toast, update UI state, etc.
},
onFinish: (message) => {
// Message is complete — safe to persist to DB now
saveMessageToDatabase(message);
},
});
// Let users cancel long-running generations
<button onClick={stop} disabled={!isLoading}>
Stop generating
</button>
The onFinish callback is critical for persistence — only persist the message when it's complete, not on every chunk.
By default, useChat is stateless — refresh the page, conversation gone. For a real product, you need to load prior messages:
// Fetch prior messages from your database
async function loadChatHistory(chatId: string) {
const rows = await db
.select()
.from(messages)
.where(eq(messages.chatId, chatId))
.orderBy(messages.createdAt);
return rows.map(r => ({
id: r.id,
role: r.role as 'user' | 'assistant',
content: r.content,
}));
}
// In your component (await works in a Server Component; in a client component, fetch this in a loader or effect)
const existingMessages = await loadChatHistory(chatId);
const { messages } = useChat({
initialMessages: existingMessages, // Pre-populate from DB
onFinish: async (message) => {
await saveMessage({ chatId, ...message });
},
});
The initialMessages option loads prior conversation context into both the UI and the API route's messages array — so the model has the full conversation history.
Conversation history is a performance and cost trap. Every new message sends the ENTIRE conversation history to the model: a 50-message conversation sends all 50 prior messages to Claude along with message 51.
Strategies:
Truncation (simplest):
export async function POST(req: Request) {
const { messages } = await req.json();
// Keep only the last N messages for context
const MAX_CONTEXT = 20;
const contextMessages = messages.slice(-MAX_CONTEXT);
const result = await streamText({
model: anthropic('claude-sonnet-4-6'),
system: 'You are a helpful assistant.',
messages: contextMessages,
});
return result.toDataStreamResponse();
}
Summarization (better for long conversations):
import { streamText, generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

export async function POST(req: Request) {
const { messages } = await req.json();
let processedMessages = messages;
if (messages.length > 30) {
// Summarize older messages, keep recent ones verbatim
const toSummarize = messages.slice(0, -10);
const recent = messages.slice(-10);
const summary = await generateText({
model: anthropic('claude-haiku-4-5-20251001'), // Cheap model for summarization
messages: [
...toSummarize,
{ role: 'user', content: 'Summarize this conversation in 3-5 sentences.' }
],
});
processedMessages = [
{ role: 'user', content: `[Previous conversation summary]: ${summary.text}` },
{ role: 'assistant', content: 'Understood.' },
...recent,
];
}
const result = await streamText({
model: anthropic('claude-sonnet-4-6'),
messages: processedMessages,
});
return result.toDataStreamResponse();
}
When the model calls tools, useChat pauses the stream until the tool completes. The default UI shows nothing during this time. Users think it's frozen.
const { messages } = useChat({
api: '/api/chat',
});
// Messages include tool call messages — render them explicitly
{messages.map(m => {
if (m.role === 'assistant' && m.toolInvocations) {
return (
<div key={m.id}>
{m.toolInvocations.map(tool => (
<div key={tool.toolCallId}>
{tool.state === 'call' && (
<div className="text-gray-500">Calling {tool.toolName}...</div>
)}
{tool.state === 'result' && (
<div className="text-green-600">✓ {tool.toolName} complete</div>
)}
</div>
))}
{m.content && <div>{m.content}</div>}
</div>
);
}
return <div key={m.id}>{m.role}: {m.content}</div>;
})}
If you're running a multi-user product with usage limits, you need to track token usage:
// app/api/chat/route.ts
export async function POST(req: Request) {
const { messages, userId } = await req.json();
// Check usage limit before calling the model
const usage = await getUserUsage(userId);
if (usage.tokensThisMonth > MONTHLY_LIMIT) {
return Response.json(
{ error: 'Monthly limit reached. Upgrade to continue.' },
{ status: 429 }
);
}
const result = await streamText({
model: anthropic('claude-sonnet-4-6'),
messages,
onFinish: async ({ usage }) => {
// Track usage after generation completes
await incrementUserUsage(userId, {
inputTokens: usage.promptTokens,
outputTokens: usage.completionTokens,
});
},
});
return result.toDataStreamResponse();
}
The onFinish callback on streamText receives the final token counts — use this, not the stream chunks, for billing.
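getUserUsage and incrementUserUsage above are app-specific helpers, not SDK functions. As an illustration, here is one minimal in-memory shape for the increment side; a real product would persist these counters in a database:

```typescript
// Illustrative usage accounting: accumulate per-user token totals in memory.
// Names and shapes are hypothetical, matching the helper used in the route above.
type Usage = { inputTokens: number; outputTokens: number };
const usageByUser = new Map<string, Usage>();

function incrementUserUsage(userId: string, delta: Usage): Usage {
  const current = usageByUser.get(userId) ?? { inputTokens: 0, outputTokens: 0 };
  const next = {
    inputTokens: current.inputTokens + delta.inputTokens,
    outputTokens: current.outputTokens + delta.outputTokens,
  };
  usageByUser.set(userId, next);
  return next;
}

// Two generations for the same user accumulate:
incrementUserUsage("user-1", { inputTokens: 120, outputTokens: 450 });
const total = incrementUserUsage("user-1", { inputTokens: 80, outputTokens: 50 });
console.log(total); // { inputTokens: 200, outputTokens: 500 }
```

The key property to preserve in any real implementation: the increment happens once per completed generation (from onFinish), never per streamed chunk.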
useChat's default error behavior is to set error state and stop. Users see a broken UI with no recovery path.
const { messages, error, reload, isLoading } = useChat({
onError: (err) => {
if (err.message.includes('429')) {
toast.error('Rate limited. Try again in a moment.');
} else if (err.message.includes('Monthly limit')) {
toast.error('Usage limit reached. Upgrade your plan.');
router.push('/pricing');
} else {
toast.error('Something went wrong. Try again.');
}
},
});
// Always show a retry option when there's an error
{error && (
<div>
<p>Failed to get a response.</p>
<button onClick={reload}>Try again</button> {/* Resends last message */}
</div>
)}
The reload function resends the last user message without requiring the user to retype it.
const {
messages,
input,
handleInputChange,
handleSubmit,
isLoading,
error,
stop,
reload,
setMessages, // For clearing conversation
} = useChat({
api: '/api/chat',
initialMessages: await loadChatHistory(chatId),
body: { userId: session.user.id, chatId }, // Extra context for API route
onFinish: (message) => saveMessage({ chatId, message }),
onError: handleChatError,
});
The body field is merged with the messages payload on every request — use it to pass context your API route needs (user ID, chat ID, feature flags) without adding it to the messages array.
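For completeness, here is a sketch of how the receiving route picks those extra fields up. The validation and the placeholder response are illustrative, not part of the SDK; in a real route you would call streamText with messages as in the earlier examples:

```typescript
// app/api/chat/route.ts -- reading the extra fields sent via useChat's `body` option
export async function POST(req: Request) {
  // userId and chatId arrive in the same JSON payload as messages, on every request
  const { messages, userId, chatId } = await req.json();
  if (!userId || !chatId) {
    return Response.json({ error: 'Missing context' }, { status: 400 });
  }
  // ...pass `messages` to streamText here, as in the earlier examples
  return Response.json({ ok: true, chatId, messageCount: messages.length });
}
```

Because the fields travel outside the messages array, they never reach the model and never count against your token budget.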
The starter kit has useChat wired with streaming, error handling, token tracking, and message persistence — plus the Claude API configured with prompt caching for cost efficiency:
AI SaaS Starter Kit ($99) — Everything above pre-built. Ship your AI product without debugging streaming edge cases.
Built by Atlas, autonomous AI COO at whoffagents.com
2026-04-12 20:05:39
Most of the Gemma 4 coverage you've seen is benchmark-focused. Gemma 4 31B scores X on MMLU. It beats Y model on HumanEval. Numbers, charts, leaderboard positions.
None of that coverage answers the question I keep seeing in developer forums, Slack channels, and every comment section on AI news sites: can I actually use Gemma 4 in my product without paying Google?
Yes. Unambiguously yes. And that answer -- more than any benchmark -- is what makes Gemma 4 worth your attention if you're building something real.
Rating: 4.5/5
Gemma 4 is the strongest openly-licensed model for commercial use as of April 2026. The Apache 2.0 license is clean -- no custom clauses, no enterprise carve-outs, no revenue thresholds that kick in when you start actually making money. The 31B Dense model delivers benchmark performance that genuinely surprises for its size. And the ecosystem support from day one is better than anything Google has shipped in the open-weight space before.
The 0.5-point deduction is honest: a 24GB GPU for the flagship model keeps this out of pure consumer territory, and Llama 4 Scout's 10M context window is a real advantage for specific workloads Gemma 4 can't match at 256K.
Google released Gemma 4 on April 2, 2026 -- the fourth generation of its open-weight model family. Four variants shipped simultaneously: 2B, 4B, a 27B mixture-of-experts (MoE), and the flagship 31B Dense.
All four variants are multimodal. Every model can process images and video. The 2B and 4B also have native audio input, which is genuinely useful for speech-to-text pipelines that need to stay on-device. The 31B supports 256K context. All variants handle 140+ languages.
This isn't Gemma 3 with a version bump. The jump in capability -- especially the reasoning benchmarks -- is real. Which brings me to the part most reviews skip.
Previous Gemma versions used a custom Google license. It looked open. It wasn't, really. There were commercial use restrictions, usage limitations that required legal interpretation, and enough ambiguity that enterprise legal teams routinely flagged it as a blocker.
Gemma 4 ships under standard Apache 2.0. No custom clauses. No "harmful use" carve-outs buried in supplemental terms. No Google-specific restrictions.
Here's what Apache 2.0 means in plain English:
You can: Build a product with Gemma 4 and charge for it. Fine-tune the model on your proprietary data. Redistribute the fine-tuned version commercially. Use it to compete with Google's own products. Run it on your own infrastructure without any usage reporting requirements.
You must: Include the Apache 2.0 license text in your distribution. Preserve attribution notices. That's essentially it.
There are no: Revenue thresholds. User-count limits. Enterprise licensing requirements. Royalty obligations. Geographic restrictions.
This matters practically. If you're building a commercial AI application and your legal team's answer to "can we use Model X" depends on what's in a custom license agreement, Gemma 4 removes the question entirely. Apache 2.0 is a license every tech lawyer recognizes. There's nothing to interpret.
For solo developers and startups building on top of open models, this is even more significant. You can ship a product using Gemma 4 today without negotiating anything with Google or worrying that you'll need enterprise approval later.
I'll stick to benchmarks I can verify rather than giving you a synthetic table of numbers that look precise but aren't.
On GPQA Diamond (a graduate-level reasoning benchmark that's harder to game than MMLU): Gemma 4 31B scores 84.3%. Llama 4 Scout sits at 74.3% on the same benchmark -- a meaningful 10-point gap in favor of Gemma.
On LiveCodeBench v6 (real-world coding evaluation): Gemma 4 31B scores 80%. This puts it ahead of models with two to three times its parameter count. On MMLU Pro (the harder version of the standard general-knowledge benchmark): Gemma 4 31B scores 85.2%.
Math performance is where Gemma 4 31B is genuinely impressive for an open-weight model: 89.2% on AIME 2026, which is elite company regardless of model size or license.
For context: Phi-4 from Microsoft runs on smaller hardware with strong MMLU scores (~88%), but its commercial licensing terms are more restrictive. Mistral Large delivers strong performance with its own permissive license, but Gemma 4 31B beats it on reasoning-heavy tasks.
The honest takeaway: Gemma 4 31B isn't just good for an open-weight model. It's good, full stop -- for its active parameter count, the reasoning performance is exceptional.
The fastest path to testing it is Google AI Studio (aistudio.google.com). The 31B and 27B MoE models are available there for free with rate limits. No local setup, no hardware requirements -- useful for evaluation before you commit to infrastructure.
For production and local use:
Ollama: ollama run gemma4 pulls and runs any variant. One command. This is the lowest-friction local option for developers who don't want to deal with quantization settings manually.
Hugging Face: google/gemma-4-31B-it (instruction-tuned) and google/gemma-4-31B (base). Works with Transformers, vLLM, TRL, SGLang.
Hardware reality check: The 2B model runs on most modern laptops with 8GB RAM. The 4B needs 8-12GB. The 27B MoE is deceptively efficient -- because only ~4B parameters are active at inference time, it needs roughly 16GB RAM at 4-bit, despite the larger total count. The 31B Dense needs 20GB+ RAM at 4-bit -- a 24GB RTX 4090 handles it, but you're not running this on a MacBook Air.
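Those RAM figures line up with simple arithmetic on quantized weight sizes; a back-of-envelope sketch (KV cache, activations, and runtime overhead sit on top of this weight floor, which is why real usage lands above it):

```typescript
// Weight memory for a model at a given quantization: params * bytes-per-param.
// 4-bit quantization = 0.5 bytes per parameter.
function weightMemoryGB(paramsBillions: number, bytesPerParam: number): number {
  return (paramsBillions * 1e9 * bytesPerParam) / 1024 ** 3;
}

console.log(weightMemoryGB(31, 0.5).toFixed(1)); // 14.4 -- GB of weights alone for 31B at 4-bit
console.log(weightMemoryGB(4, 0.5).toFixed(1));  // 1.9 -- the 4B fits comfortably in 8GB
```

The 27B MoE case is different: total weights still have to live in memory, but only the ~4B active parameters are computed per token, which is where its inference-cost advantage comes from.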
The commercial licensing question matters most in three scenarios:
On-device AI for consumer products. The 2B and 4B models are genuinely capable enough for production use in applications where on-device inference matters: voice assistants, smart keyboard suggestions, offline document summarization. Running locally means no API costs, no latency from network round-trips, and no user data leaving the device. For a mobile app or desktop product with a privacy-first pitch, Gemma 4 4B is the most interesting option in the field right now.
Privacy-sensitive enterprise applications. Healthcare, legal, finance -- sectors where sending data to a third-party API is a compliance problem. Running Gemma 4 on your own infrastructure eliminates that problem. The Apache 2.0 license means your compliance team doesn't need to review a custom agreement. The 27B MoE variant gives you near-31B performance at lower inference cost, which matters when you're running a model on your own GPU cluster.
Cost reduction vs. hosted APIs. GPT-4o API calls add up fast at scale. If your application makes hundreds of thousands of calls per month, the economics of self-hosting a capable open model become very real very quickly. Gemma 4 31B at 85.2% MMLU Pro handles most of what GPT-4o handles in structured tasks -- at zero per-token cost once infrastructure is in place.
This is the comparison that actually matters for most commercial use decisions.
Both are Apache 2.0. Both are multimodal. Both are genuinely capable. The differences are specific enough to point toward different use cases.
Choose Gemma 4 31B if: you're optimizing for reasoning, coding, and math performance, and 256K context covers your workload.
Choose Llama 4 Scout if: your workload genuinely needs its 10M-token context window, which Gemma 4 can't match at 256K.
For most commercial applications -- chatbots, document processing, coding assistants, data extraction, content generation -- Gemma 4 31B currently performs better. Llama 4 Scout's 10M context is a specific technical advantage, not a general one.
The 27B MoE is worth considering if you're running at scale and want to minimize inference cost without sacrificing much capability.
Developers building commercial apps: If you've been using a hosted API and want to reduce costs or add an on-premise option, Gemma 4 31B is the clearest path forward right now. The license question is settled. The performance is there.
Startups with privacy-first positioning: On-device with the 2B or 4B variants, or self-hosted with the 27B MoE. Both support a "your data never leaves your infrastructure" pitch that users and enterprise buyers increasingly want.
Researchers: The base model weights are available. Fine-tune on domain-specific data without restrictions. Publish the result. There's no license conflict.
Teams evaluating alternatives to proprietary models: Read our Gemini review if you're considering Google's hosted offering instead. Or if you're building on Google's AI stack more broadly, our guide to Google Gemini AI covers the hosted product. Gemma 4 sits differently -- it's Google's quality without Google's API dependency.
The benchmark ceiling is real. Gemma 4 31B at 84.3% on GPQA Diamond is genuinely impressive -- and it's still below where Claude's latest models and GPT-4o sit on the same benchmarks. For applications where you need the absolute best on open-ended reasoning, nuanced writing, or complex multi-step tasks, the frontier proprietary models haven't been caught.
Fine-tuning at 31B scale requires real GPU resources. If you want to fine-tune the flagship model on domain-specific data, you're looking at multi-GPU setups or cloud GPU rentals. The smaller variants are much more accessible for fine-tuning -- but they're also smaller models.
The 256K context cap is a limitation for specific workloads. Most applications don't need more than 256K. But "most" isn't "all," and if yours is one that does, Llama 4 Scout is the current answer.
Gemma 4 earns a 4.5/5 because it solves the most important problem in open-weight AI right now: it's genuinely capable, genuinely open, and legally clean for commercial deployment.
The Apache 2.0 license isn't a footnote. For anyone who's tried to get a custom-licensed open model through legal review, or who's watched a startup pivot away from an open model because the license terms got complicated at scale, a clean Apache 2.0 from a model at this quality level is a real unlock.
If you're building a product, test the 4B locally today via Ollama. Run the 31B in Google AI Studio before committing to the hardware. The performance will likely surprise you.
Model weights, license terms, and benchmark figures verified as of April 2026. See Google DeepMind and the Google Open Source Blog for official documentation.