2026-02-19 04:22:48
Your Next.js app is ready. You've tested it locally, fixed the bugs, and now you're wondering: where do I put this thing?
This guide covers every realistic deployment option for Next.js in 2026, from dead-simple (Vercel) to fully custom (your own VPS). I'll walk through each, with code examples and real costs.
TL;DR: Use Vercel and deploy directly from GitHub. Zero config. Works.
Vercel is built by the creators of Next.js. It's optimized for it, and honestly, it just works.
Connect your GitHub repo
That's it. Vercel auto-detects Next.js and deploys it.
Your app is live at: yourproject.vercel.app
In your Vercel dashboard, add environment variables:
NEXT_PUBLIC_API_URL=https://api.example.com
DATABASE_URL=postgresql://user:pass@host/db
STRIPE_SECRET_KEY=sk_test_xxx
Your Next.js app reads them immediately.
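If you'd rather not click through the dashboard, the Vercel CLI can manage the same variables. A quick sketch (assumes you've installed the CLI and linked the project; note that NEXT_PUBLIC_ variables are inlined at build time, so changes only take effect on the next deploy):
npm i -g vercel                            # install the Vercel CLI
vercel link                                # link this folder to your Vercel project
vercel env add DATABASE_URL production     # prompts for the value
vercel env pull .env.local                 # pull variables down for local development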
Next.js API routes automatically become serverless functions on Vercel:
// pages/api/hello.js
export default function handler(req, res) {
res.status(200).json({ message: 'Hello from Vercel!' });
}
No configuration. It's just a function on the server, deployed globally.
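Once it's deployed, you can hit the route like any other endpoint:
curl https://yourproject.vercel.app/api/hello
# {"message":"Hello from Vercel!"}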
Run code at the edge (CDN nodes worldwide) with near-zero cold starts:
// middleware.js (runs at the edge)
import { NextResponse } from 'next/server';

export function middleware(request) {
  // Block requests from certain countries
  // (request.geo is populated on Vercel; it may be undefined when running locally)
  const country = request.geo?.country;
  if (country === 'XX') {
    return new NextResponse('Blocked', { status: 403 });
  }
  return NextResponse.next();
}

export const config = {
  matcher: ['/api/:path*'],
};
For most applications, the free Hobby tier or the $20/month Pro plan is sufficient.
✅ Building a SaaS product
✅ Prototyping quickly
✅ Small to medium traffic sites
✅ Teams that want zero ops
❌ Extreme scale (100k+ concurrent users)
❌ Custom infrastructure needs
Cost: $5/month baseline, $0.30/hour per service
Railway is like Vercel's rebellious cousin. Less opinionated, more flexible.
npm install -g @railway/cli
railway login
railway init
railway up
That's it. Railway reads your package.json and deploys.
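Since databases are one of Railway's selling points, it's worth knowing the CLI can provision one next to the app. A rough sketch (subcommand behavior varies a bit between CLI versions):
railway add        # interactively add a database service (Postgres, MySQL, Redis, ...)
railway variables  # view the injected connection variables, e.g. DATABASE_URL
railway domain     # generate a public domain for the deployed service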
✅ Need database out-of-the-box
✅ Want more control than Vercel
✅ Don't need global edge distribution
✅ Prefer CLI-first workflow
❌ Need serverless (Railway is containers)
❌ Extreme scale requirements
Cost: $15-50/month for small apps, $500+/month for production scale
You want full control? Here's how to deploy Next.js on AWS.
Create Dockerfile:
FROM node:18-alpine
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies
RUN npm ci
# Copy app code
COPY . .
# Build Next.js
RUN npm run build
# Expose port
EXPOSE 3000
# Start app
CMD ["npm", "start"]
# Create ECR repo
aws ecr create-repository --repository-name my-nextjs-app
# Build and push
docker build -t my-nextjs-app:latest .
docker tag my-nextjs-app:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/my-nextjs-app:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/my-nextjs-app:latest
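Pushing the image is only half the job; something still has to run it. A minimal Fargate setup looks roughly like this (a sketch, not a recipe: the cluster name, subnet, and security group are placeholders, and task-def.json, which points the task at the ECR image with CPU/memory settings, has to be written separately):
# Create a cluster
aws ecs create-cluster --cluster-name nextjs-cluster
# Register the task definition (references the ECR image above)
aws ecs register-task-definition --cli-input-json file://task-def.json
# Run one copy of the task as a Fargate service
aws ecs create-service \
  --cluster nextjs-cluster \
  --service-name my-nextjs-app \
  --task-definition my-nextjs-app \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123],assignPublicIp=ENABLED}"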
✅ Need complete control
✅ Complex infrastructure requirements
✅ Compliance/security requirements
✅ Plan to scale significantly
❌ You'll spend extra time on DevOps
❌ Higher learning curve
❌ More expensive than Vercel
Cost: $5-12/month for small apps
Think of it as "Railway but slightly cheaper." Connect your GitHub repo in the App Platform dashboard, let it detect the Next.js build, and that's it: the app runs on DigitalOcean's infrastructure.
| Platform | Cold Start | Global CDN | Database | Free Tier | Easiest |
|---|---|---|---|---|---|
| Vercel | <100ms | Yes | Yes | Yes | ✅ Yes |
| Railway | 500ms | No | Yes | Yes | Yes |
| DigitalOcean | 1-2s | No | Yes | Yes | Yes |
| AWS ECS | Depends | Optional | Optional | No | No |
For 90% of use cases, pick Vercel. It's fast, cheap, and just works.
→ Vercel (free tier, zero setup)
→ Vercel (scales beautifully, great analytics)
→ AWS ECS (full control, more complexity)
→ Railway (simpler than AWS, more flexible than Vercel)
→ DigitalOcean App Platform ($5-12/month)
That's seriously it.
In 2026, deploying a Next.js app is dead simple. Push to GitHub, and it's live.
For 90% of projects, Vercel is the answer. No ops, automatic scaling, and you pay only for what you use.
For everything else, pick your platform based on your needs and budget.
Now stop reading and deploy something.
Published: February 2026
Need help? Check Next.js docs or ask in the comments.
2026-02-19 04:20:02
When I first heard about Docker, I thought it was something extremely complex that only senior developers used.
Until recently, that is. When I finally started learning Docker, I found out how amazing it is, and I want to share what I've learned.
If you're just starting out, this is for you.
So…
1. What Is Docker?
Docker is a tool that lets you package your application with everything it needs to run (dependencies, libraries, system tools, and your code) into something called a container.
2. What Problem Does Docker Solve?
Think of it like this:
"It works on my machine" stops being an excuse.
With Docker, if it works inside the container, it works everywhere.
Before Docker, this is what used to happen:
Developer X runs the application successfully.
Developer Y tries to run it, and it breaks.
Developer X tells Developer Y: “But it works on my machine!”
Why?
Different environments:
° Different operating systems
° Different versions of Node.js, Go, or Python
° Different installed dependencies
Docker solves this by creating a consistent environment that travels with your application.
3. What Is a Container?
A container is like a lightweight, portable box for your application.
But unlike a full virtual machine:
° It starts fast—usually in seconds
° It uses fewer resources
° It's easy to share and move around
How It Works:
You describe your app's environment in a Dockerfile, build it into an image, and run that image as a container with Docker.
When I first installed Docker, created my Dockerfile, built an image, and it actually worked…
I felt like I unlocked a new level in development 😂.
That was the moment I realized:
This is how real-world applications are deployed.
Why Every Beginner Should Learn Docker
Docker is worth learning early because it shows you how real-world applications are actually packaged and shipped, and that knowledge carries over no matter which language or framework you use.
🛠 Things That Confused Me at First
To be honest:
° Images vs. Containers: I couldn't keep them straight. (Think of an image as a recipe and a container as the actual cooked meal.)
° Dockerfile syntax: It looked scary at first glance.
° Ports: Mapping ports from the container to my computer didn't make sense initially.
But after building just one simple container for a Go app, everything started connecting.
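If you want the same "aha" moment, the whole loop fits in a handful of commands. Here's a sketch using a public nginx image as a stand-in for that first Go app (image names and ports are just examples):
docker pull nginx:alpine                            # download an image: the recipe
docker run -d --name web -p 8080:80 nginx:alpine    # start a container: the cooked meal (-p maps host port 8080 to container port 80)
curl http://localhost:8080                          # the app answers through the mapped port
docker ps                                           # list running containers
docker stop web && docker rm web                    # remove the container; the image stays for next time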
What Next?
Now that I understand the basics, I plan to:
° Containerize my Go projects
° Learn Docker Compose (for running multiple containers)
° Understand how containers are used in production
One step at a time.
Summary
Imagine you bake a cake 🍰 at home.
You pack:
° The cake
° The ingredients list
° The exact oven settings
° The instructions
Then you put everything inside one box.
Now, no matter where that box goes, Nairobi, Kisumu, or New York, anyone can open it and get the exact same cake.
That is exactly what Docker does.
It puts:
° Your app
° The tools it needs
° The correct settings
Inside one “box” called a container.
So instead of saying:
“It works on my machine.”
You can confidently say:
“It works everywhere.”
2026-02-19 04:17:53
It was a rainy Sunday. The kind of Sunday that makes you want to stay inside with Claude Code and a good book. But I knew this wasn't going to be an ordinary Sunday the minute GLM-5 showed up and sweet-talked its way into my Claude Code CLI. But Claude Code wasn't having it. You see, when you point Claude Code at a base URL for local inference, you're inviting a poltergeist into your terminal.
For a weekend project, my mission was to set up a local version of GLM-5 as a coding agent on my new M3 Ultra. My reasons for deciding to run my local quantized GLM-5 in Claude Code are documented in my companion article, Finding my Frontier: Cloud free coding on GLM-5.
I thought this would be straightforward. Part of what makes it possible is that Claude Code has an ANTHROPIC_BASE_URL env var, and llama-server has an Anthropic Messages API endpoint. A walk in the park, right? But once I had it set up, it segfaulted immediately, before my prompt even reached the model. My technical investigation led to some very interesting findings.
Claude Code has increasing support for running open source models and the open source community is embracing it too. Ollama allows you to launch models directly in Claude Code. Some frontier-class open source models recommend it as the primary way to access their models. These integrations are typically optimized for cloud-hosted versions of the models though, not local inference. I love the Claude Code CLI and the idea of having some of its coolest features already baked into your open source model coding setup is so very tempting. But my job today is to dampen your enthusiasm.
See my companion article, Finding my Frontier: Cloud free coding on GLM-5, for the full OpenCode setup guide and the MLX vs GGUF performance story.
After it crashed, I ran GLM-5 through llama-server's Anthropic Messages API and it handles tool calling no problem:
curl -s 'http://localhost:8080/v1/messages' \
-H 'Content-Type: application/json' \
-H 'x-api-key: none' \
-H 'anthropic-version: 2023-06-01' \
-d '{
"model": "test",
"max_tokens": 50,
"tools": [{
"name": "get_weather",
"description": "Get weather for a location",
"input_schema": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}],
"messages": [{"role": "user", "content": "What is the weather in Paris?"}]
}'
This is 164 input tokens, 50 output tokens, and a prompt reply (pun intended) in 4.7 seconds. A 744B model doing structured tool calling on consumer hardware. The model isn't the problem here.
ANTHROPIC_BASE_URL="http://localhost:8080" \
ANTHROPIC_API_KEY="none" \
claude --model GLM-5-UD-IQ2_XXS-00001-of-00006.gguf
Dead server. Not even a useful error message.
To see what was happening under the surface, I dropped a logging proxy between Claude Code and llama-server. I needed to see the exact moment the handshake turned into a death spiral.
The logs revealed a massacre.
[1] POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=0 → 200 OK
[2] POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=0 → 200 OK
[3] POST /v1/messages/count_tokens | model=GLM-5... | tools=1 → intercepted
[4] POST /v1/messages/count_tokens | model=GLM-5... | tools=1 → intercepted
...
[8] POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=1 → CRASH (segfault)
[9+] Everything after → Connection refused
This revealed three separate problems. Any one of them kills the server on its own.
What on earth was Haiku doing there? I checked every configuration file; I knew for sure I hadn’t invited it.
As it turns out, Claude Code is a creature of habit. It sends internal requests to claude-haiku-4-5-20251001 for housekeeping stuff (things like generating conversation titles, filtering tools, other background tasks). When you set ANTHROPIC_BASE_URL, all of those get routed to your local server.
In one session I counted 37 Haiku requests before the actual inference request even got sent. Title generation, tool checking for each of 30+ MCP tools, all hitting a server that has never even heard of Haiku.
But that wasn't all. Before the actual inference request, Claude Code hits /v1/messages/count_tokens with one request per tool group. This endpoint doesn't exist in llama-server, so it returns a 404 that Claude Code doesn't handle gracefully.
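If you want to confirm that on your own build (behavior can differ between llama.cpp versions), a bare request to the endpoint shows the status code Claude Code is choking on:
curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST http://localhost:8080/v1/messages/count_tokens \
  -H 'Content-Type: application/json' \
  -d '{"model":"test","messages":[{"role":"user","content":"hi"}]}'
# prints 404: llama-server has no count_tokens route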
The gasoline on the fire is one of Claude Code's best features, but a concurrency mismatch for poor little llama-server: Haiku calls to the ether, count_tokens calls, and a parallel request to run the inference for your prompt. A single-slot llama-server can't handle concurrent requests, which results in, you guessed it, a croaked-out "se-egfault" just before the server's untimely demise (I might have watched too many British police procedurals).
The GLM-5 inference request (in this case a simple "hello"), which is actually the one I cared about, never made it to the server. It was stuck behind crashed Haiku calls and preflight requests hitting endpoints that aren't there.
Okay, I admit, this was a hacky fix. But it worked. Instead of waiting for upstream fixes, I wrote a proxy that sits between Claude Code and llama-server. It does three things: fakes all Haiku responses, intercepts count_tokens, and serializes real requests so they don't flood the server. Here's the walkthrough.
Standard library only. The proxy listens on port 9090 and forwards real requests to llama-server on 8080. All real inference requests go through a single-threaded queue so the server only ever sees one at a time.
#!/usr/bin/env python3
"""
Smart proxy for Claude Code -> llama-server.
Serializes requests, intercepts count_tokens, fakes Haiku calls.
"""
import json, threading, queue, time
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.request import Request, urlopen
from urllib.error import HTTPError
TARGET = "http://127.0.0.1:8080"
request_queue = queue.Queue()
response_slots = {}
slot_lock = threading.Lock()
request_timestamps = {}
This is the single-file line to llama-server. Requests go into the queue, this thread sends them one at a time, and it stashes each response so the original handler can pick it up.
def worker():
while True:
req_id, method, path, headers, body = request_queue.get()
t_start = time.time()
try:
req = Request(f"{TARGET}{path}", data=body, method=method)
for k, v in headers.items():
req.add_header(k, v)
resp = urlopen(req, timeout=600)
resp_data = resp.read()
resp_headers = dict(resp.getheaders())
elapsed = time.time() - t_start
print(f"[{req_id}] <- {resp.status} | {elapsed:.1f}s", flush=True)
with slot_lock:
response_slots[req_id] = ("ok", resp.status, resp_headers, resp_data)
except HTTPError as e:
error_body = e.read() if e.fp else b""
with slot_lock:
response_slots[req_id] = ("http_error", e.code, {}, error_body)
except Exception as e:
with slot_lock:
response_slots[req_id] = ("error", 502, {}, str(e).encode())
finally:
request_timestamps.pop(req_id, None)
request_queue.task_done()
threading.Thread(target=worker, daemon=True).start()
req_counter = 0
counter_lock = threading.Lock()
When Claude Code sends a Haiku request (title generation, tool filtering, etc.), we don't bother the model. We just send back a minimal valid Anthropic Messages API response. Claude Code gets what it needs, the model never knows it happened.
def fake_response(handler, req_id, model, text):
"""Return a minimal Anthropic Messages API response."""
fake = {
"id": f"msg_{req_id}", "type": "message", "role": "assistant",
"content": [{"type": "text", "text": text}],
"model": model, "stop_reason": "end_turn", "stop_sequence": None,
"usage": {"input_tokens": 10, "output_tokens": 1}
}
body = json.dumps(fake).encode()
handler.send_response(200)
handler.send_header("Content-Type", "application/json")
handler.send_header("Content-Length", str(len(body)))
handler.end_headers()
handler.wfile.write(body)
This is where the routing logic lives. Every POST gets inspected and sent down one of three paths:
count_tokens requests get a fake estimate, Haiku calls get a canned reply, and real inference requests go through the serialized queue. Only that last path ever touches llama-server.
class SmartProxy(BaseHTTPRequestHandler):
def do_POST(self):
global req_counter
with counter_lock:
req_counter += 1
req_id = req_counter
length = int(self.headers.get("Content-Length", 0))
body = self.rfile.read(length)
data = json.loads(body)
model = data.get("model", "?")
tools = data.get("tools", [])
# 1. Intercept count_tokens
if "count_tokens" in self.path:
estimated = 500 * max(len(tools), 1)
resp = json.dumps({"input_tokens": estimated}).encode()
self.send_response(200)
self.send_header("Content-Type", "application/json")
self.send_header("Content-Length", str(len(resp)))
self.end_headers()
self.wfile.write(resp)
return
# 2. Fake ALL Haiku calls
if "haiku" in model.lower():
system = data.get("system", [])
is_title = False
if isinstance(system, list):
for b in system:
if isinstance(b, dict) and "new topic" in b.get("text", "").lower():
is_title = True
elif isinstance(system, str) and "new topic" in system.lower():
is_title = True
if is_title:
fake_response(self, req_id, model,
'{"isNewTopic": true, "title": "GLM-5 Chat"}')
else:
fake_response(self, req_id, model, "OK")
return
# 3. Real requests: serialize through queue
print(f"[{req_id}] {model[:30]} | {len(tools)} tools -> queued", flush=True)
headers_dict = {}
for h in ["Content-Type", "Authorization", "x-api-key", "anthropic-version"]:
if self.headers.get(h):
headers_dict[h] = self.headers[h]
request_timestamps[req_id] = time.time()
request_queue.put((req_id, "POST", self.path, headers_dict, body))
while True:
time.sleep(0.05)
with slot_lock:
if req_id in response_slots:
result = response_slots.pop(req_id)
break
status_type, code, resp_headers, resp_data = result
self.send_response(code)
for k, v in resp_headers.items():
if k.lower() not in ("transfer-encoding", "content-length"):
self.send_header(k, v)
self.send_header("Content-Length", str(len(resp_data)))
self.end_headers()
self.wfile.write(resp_data)
def log_message(self, *args):
pass
HTTPServer(("127.0.0.1", 9090), SmartProxy).serve_forever()
Save the whole thing as claude-proxy.py and run it with python3 claude-proxy.py. That's it.
With the proxy in place, the picture changes completely:
Claude Code's request flow goes from 42 chaotic requests to this:
[1] haiku title gen → fake response (instant)
[2] GLM-5 | 23 tools → queued
[2] ← 200 | 17.8s
[3] haiku title gen → fake response (instant)
| Turn | TTFT (prefill) | Generation | Total | Notes |
|---|---|---|---|---|
| 1st (cold cache) | 336.6s / 24,974 tokens | 13.7s / 133 tok | 350.3s | Full prefill, tool defs + system prompt |
| 2nd (warm cache) | 0.1s / 1 token | 17.0s / 165 tok | 17.1s | Prompt cache hit |
| 3rd | 2.2s / 14 tokens | 15.6s / 151 tok | 17.8s | Near-instant prefill |
| 4th | 3.4s / 96 tokens | 10.8s / 104 tok | 14.1s | Stable |
First turn is 5.6 minutes. Every turn after that: 2-3 seconds to first token.
The first turn is slower than OpenCode (350s vs 100s) because Claude Code sends ~25K tokens of tool definitions (23 tools including Playwright, Figma, and the built-in ones like Read, Write, Bash, Glob, Grep, etc.) compared to OpenCode's ~10K. But llama-server's prompt cache means you only pay that cost once. After the first turn the server sees the 25K token prefix hasn't changed and skips straight to the new tokens.
Three terminals:
# Terminal 1: llama-server
llama-server --model GLM-5-UD-IQ2_XXS-00001-of-00006.gguf \
--ctx-size 65536 --parallel 1 --port 8080
# Terminal 2: proxy
python3 claude-proxy.py
# Terminal 3: Claude Code
ANTHROPIC_BASE_URL="http://localhost:9090" \
ANTHROPIC_API_KEY="none" \
claude --model GLM-5-UD-IQ2_XXS-00001-of-00006.gguf
First turn will take ~6 minutes. Be patient. After that: ~15 seconds.
Claude Code's ANTHROPIC_BASE_URL feature technically supports custom endpoints. But the implementation assumes a cloud-scale API server on the other end. One that can handle parallel requests, implements every endpoint in the Anthropic spec, and doesn't mind servicing dozens of lightweight Haiku calls alongside heavyweight inference.
That's fine for cloud infrastructure. It's a completely broken assumption for a single-slot local server running a 225GB model. Local model support exists on paper but crashes in practice, and the failure mode (immediate segfault, no useful error message) makes it nearly impossible to diagnose without building your own proxy.
This proxy is a workaround, not a fix. The real solution would be for coding agents to detect local endpoints and skip the background services that assume cloud-scale infrastructure. Until then, 180 lines of Python bridge the gap.
But even with the proxy working, I still wouldn't recommend this as your daily coding setup. Claude Code was purpose-built for a specialized agentic flow that works really well with Anthropic models. Giving it to your local LLM as a hand-me-down is going to end in tears and segfaults (which you now hopefully know how to fix). Coding with this setup felt janky at best. If you want to run a local model as a coding agent, OpenCode is a much better fit. I wrote about that setup here.
So, is this the future of development? Will cloud models always be ahead of the open source local community?
Is anyone else running Claude Code with local LLMs for production work, or do you still fall back to the cloud when the "poltergeists" start acting up?
Hardware: M3 Ultra Mac Studio, 512GB | Model: unsloth/GLM-5-GGUF IQ2_XXS (225GB) | Server: llama.cpp with Metal | Proxy: claude-proxy.py
2026-02-19 04:17:35
I unboxed my new M3 Ultra Mac Studio over the weekend and the first thing I wanted to do was try to fit a frontier-sized model on it. I watched some YouTube videos (too many YouTube videos of people in their homelabs; hello Network Chuck and Alex Ziskind) and got all fired up with visions of pretty beehive dashboards and the sound of MLX Metal screaming down 512GB RAM highways.
When Z.ai unexpectedly dropped GLM-5, within hours all the home-labbers (is that a word?) were headlining things like: GLM-5 replaces OPUS 4.6! I saw those headlines and thought: this is my frontier. GLM-5 would be the model to break in my new hardware and spawn minion models that would herald a new era of local agentic coding and lifestyle enhancements (no. not through openclaw. Of course not through openclaw. I'd use something...else...secure and clawless...refuses to make eye contact).
My opening searches for how to run this model on my hardware were almost immediately rewarded by the discovery of a community MLX build of GLM-5! I pulled it down to my local drive and set it up to run on OpenCode. It took days. What seemed like days. Not to set it up. That was super quick. To get the response back from my first prompt (which was the word "hello"). I persisted and actually sat there while the hours turned to days and GLM-5 took upwards of 30 minutes to respond to each and every prompt. I stuck it out, though, and prompted it to build its own WebUI, front-loading the prompt with a detailed waterfall-style requirements doc. This was actually painful to watch and required me to don my infinite patience cloak, but all the magic cloaks in the world couldn't hide the truth. The MLX path was a dead end for agent use.
One inarguable trait of mine is that I'm stubborn. I've always believed "Dead End" signs are just suggestions if you have the right vehicle. For my next move, I decided to try Unsloth's guide to running locally with a quantized GGUF. The guide was geared towards a GPU setup and not an MLX macOS setup so I had to improvise and ask Claude Code for help in spots.
Once I got this running on OpenCode, I clocked 20 full minutes for the time to first token. My immediate thought was that my beefy Mac Studio just wasn't enough to run this model, quantized or no. But I was seeing a few clues that told me this wasn't an issue with the hardware or the model. For one thing, a direct request to the llama-server via Curl came back in under 5 seconds. This pointed to OpenCode as the culprit. But it was actually more nuanced than that. I decided to try running it in the Claude Code CLI to see if another CLI would be better. That proved to be its own dead end, but what I learned from it actually helped me figure out what the issue was with OpenCode. If you're curious you can find that writeup here: The Ghost in the CLI: Why Claude Code Kills Local Inference.
The proxy server I ran to capture data on my Claude Code run hinted at what was happening with OpenCode too. It was pre-processing over 10K tokens of invisible overhead (tool calls, policy prompts, etc.) and though OpenCode does have some support for parallelization of the tool definitions, llama.cpp apparently doesn't. These were getting processed as a giant 10K prompt on the server side. Once I figured this out, I realized GLM-5 might respond faster once that prompt was cached, and that turned out to be true.
Since you're still here and sat through the whole story, you probably want the technical tea. The least I can do is give you the complete guide to how I'm now running a frontier-class model with reasonable speed and writing code with it on my local computer. Cloud-free.
GLM-5 is a Mixture-of-Experts model, so it only activates a fraction of its 744B parameters per token. At IQ2_XXS quantization it fits in 225GB. Plenty of headroom on this machine.
The MLX path used the community 4-bit build (390GB) through mlx-lm's server. MLX is Apple's own ML framework, purpose-built for Apple Silicon, and it was still painfully slow. Here's how it stacks up against the Unsloth GGUF (IQ2_XXS, 225GB) through llama-server:
| Setup | Model Size | First Turn | Subsequent Turns | Generation Speed |
|---|---|---|---|---|
| MLX (mlx-lm) | 390GB | ~20 min | ~20 min | ~0.5 tok/s |
| GGUF (llama-server) | 225GB | ~10-20 min | 2.6s | 14 tok/s |
Yes, the GGUF was a smaller footprint, but that didn't fully explain the painful slowness of the mlx-lm server.
I believe there were two elements causing this slowness:
Prompt caching. One of the key features of llama-server is that it caches the KV state of the prompt prefix between turns. The first turn chews through the full prompt. That's where the ~10-20 minutes goes (more on why it's so big in the next section). The second turn recognizes the prefix hasn't changed, skips prefill, and you're generating in 0.3 seconds.
mlx-lm does have prompt caching features (mlx_lm.cache_prompt, prompt_cache in the Python API, etc.) but the server mode (mlx_lm.server) never actually cached the prompt prefix between HTTP requests in my testing. Every turn paid the full prefill cost no matter how far into the conversation I was. There are known bugs around this: mlx-lm #259 reports different logits on repeat prompts, and LM Studio hit similar KV cache issues with their MLX engine. But a broken prompt cache triggers a domino effect as your context window builds up. Without working prompt caching in server mode, each turn reprocesses the entire conversation history from scratch (system prompt + tool definitions + every prior message), so response times just keep climbing the longer your session runs.
Fair warning: the first GGUF turn also takes 10-20 minutes, so it looks identical to the MLX problem. Don't give up. Send a second message. That's when you'll see the difference.
The invisible CLI prompt. For your first simple "hello" message, here's what actually gets sent:
POST /v1/chat/completions
Messages: 2
Tools field entries: 11
System message length: 10,082 chars
Tools field size: 36,614 chars
Total prompt tokens: 10,643
Over 10,000 tokens of system prompt and tool definitions before my message even shows up. The tools do actually go into the proper tools field, but llama-server's OpenAI endpoint serializes all of that into the prompt template as text, so every token has to go through prefill. That's the "giant 10K prompt injection" I mentioned earlier.
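You can measure the gap yourself: llama-server's OpenAI-compatible endpoint reports token usage, so a bare "hello" sent straight to the server shows how few tokens the message itself costs compared to what the CLI wraps around it (a sketch; assumes llama-server is on port 8080):
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"test","messages":[{"role":"user","content":"hello"}],"max_tokens":8}' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["usage"])'
# prompt_tokens will be a handful here, versus 10,000+ when OpenCode sends the same "hello"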
This is actually more bearable if you think of it as a one-time cost.
| Turn | Prefill | Generation | Total |
|---|---|---|---|
| 1st (cold) | 97.0s / 10,623 tok | 3.9s / 54 tok | 100.9s |
| 2nd (cached) | 0.3s / 5 tok | 2.3s / 33 tok | 2.6s |
For a coding session that goes dozens of turns, 98 seconds of cold start is nothing. The prompt cache makes those 10,000+ tokens invisible after the first message. You need to be aware that anytime you trigger a new uncached prompt, you'll pay this cost again. Reviewing an existing codebase would be really slow here. In my mind though, this is one giant leap for nerdy woman-kind in order to run a frontier class model on my homelab. I'll don my infinite patience cloak and I probably won't have to wear it for very long. Maybe I'll make a youtube video in my homelab. Just kidding. Let's face it. I'm not as photogenic as Network Chuck or Alex Ziskind.
pip install huggingface_hub
huggingface-cli download unsloth/GLM-5-GGUF \
--local-dir ~/Models/GLM-5-GGUF \
--include "*UD-IQ2_XXS*"
Six shards, ~225GB total. You need enough RAM for the model plus KV cache, so realistically 300GB+ (I had 512GB).
# Build from source with Metal support
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_METAL=ON
cmake --build llama.cpp/build --config Release -j
# Run
llama.cpp/build/bin/llama-server \
--model ~/Models/GLM-5-GGUF/UD-IQ2_XXS/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf \
--ctx-size 65536 --parallel 1 --port 8080
Notes:
--ctx-size 65536 gives you room for tool definitions + conversation history.
--parallel 1 keeps memory usage predictable with a single inference slot.
Add to ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"llama.cpp": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama.cpp (local)",
"options": {
"baseURL": "http://localhost:8080/v1"
},
"models": {
"GLM-5-UD-IQ2_XXS-00001-of-00006.gguf": {
"name": "GLM-5 IQ2_XXS",
"limit": { "context": 128000, "output": 8192 }
}
}
}
}
}
opencode -m "llama.cpp/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf"
First message will take a while. Go get coffee. Play Doom. This will be worth it.
Claude Code also supports local models via ANTHROPIC_BASE_URL, and llama-server has an Anthropic Messages API endpoint. But Claude Code sends a bunch of internal background requests that crash a single-slot local server before your prompt ever reaches the model. I wrote a proxy to deal with it, and debugging that proxy is actually how I figured out the 10K token overhead issue. That's a whole separate writeup: The Ghost in the CLI: Why Claude Code Kills Local Inference.
My journey to GLM-5 was exactly like that sentence sounds: a trip to a far-off planet full of mysterious black holes, scientific conundrums, and strange alien symbols. Most of all, it was about the long passage of time. Running on an MLX server technically worked but was unusable. Prompt caching never kicked in, and the growing context meant each turn was slower than the last. Running an Unsloth quantization was the best choice for me, even though you can't hide from the tool tax. Unlike zippy cloud models, you will notice the 10K tokens of invisible overhead because they dominate your first local interaction (and any uncached prompts thereafter).
For me this was really about adjusting my expectations. But I'm here to tell you, if you have the hardware to run it, save your tokens and get your Doom game ready. You might just unlock a doorway to the future.
Hardware: M3 Ultra Mac Studio, 512GB | Model: unsloth/GLM-5-GGUF IQ2_XXS (225GB) | Server: llama.cpp with Metal | Agent: OpenCode
2026-02-19 04:17:21
You know the moment. Someone shares their screen, pulls up a list of icebreaker questions, and reads: "If you were a kitchen appliance, what would you be and why?"
Half the team goes on mute. One person nervously laughs. Someone types "toaster" in the chat and hopes that counts as participation.
I've sat through so many bad icebreakers that I almost gave up on them entirely. But then I started running retros regularly and noticed something I didn't expect -- the meetings where we skipped the icebreaker were consistently worse. Not because the question itself mattered, but because nobody had spoken yet. And when people don't talk in the first five minutes, they tend to stay quiet the whole time.
So the issue isn't icebreakers. It's that most of them are terrible.
They're either too weird, too generic, or wrong for the room.
"What's your spirit animal?" might be fine with close friends. In a Monday morning standup with people you've known for two weeks, it just creates silence. "How was your weekend?" isn't really an icebreaker at all -- it's small talk everyone already had in Slack. And a goofy question before a serious retro about a failed sprint? Tone-deaf.
I think the mistake people make is treating icebreakers like a box to check instead of thinking about what the meeting actually needs. A retro needs people in a reflective headspace. A casual sync just needs people awake and willing to talk. Those are different questions.
I run retrospectives and team meetings through Kollabe, so I've watched a lot of teams try different approaches over the past year. Some patterns keep showing up.
Keep it to five minutes. Go around, everyone answers in a sentence or two, done. The icebreaker is the on-ramp, not the meeting.
Let people pass. This sounds minor but it changes everything. The moment someone feels forced to answer, the whole exercise backfires. Say "feel free to skip" and, ironically, almost nobody does.
Rotate who picks the question. When the same scrum master picks the icebreaker every single week, it starts feeling like their thing rather than a team thing. Some of the best icebreakers I've seen came from the quietest person on the team, given the chance.
And match the energy. Light questions for regular syncs. Reflective ones for retros. Professional but warm for cross-team meetings where half the people don't know each other. This is the part most people skip, and it's the part that matters most.
I keep coming back to this: icebreakers aren't about fun. They're about getting voices in the room early.
There's actual research behind this. When someone speaks in the first few minutes of a meeting, they're more likely to contribute for the rest of it. Remote teams feel this the hardest. In an office, you had the hallway chat, the coffee run, the "hey did you see that PR?" as you sat down. Remote meetings just... start. You go from solo deep work to a group video call with zero transition.
An icebreaker is that transition. It doesn't have to be clever. "What's the last thing you watched that you'd recommend?" does the job.
I'll be honest -- this is the annoying part. You want to do an icebreaker, but it's Tuesday night and you're trying to think of one that isn't too cheesy, isn't too personal, and isn't the same one you used three weeks ago. It takes longer than it should for something that lasts two minutes.
I built an Icebreaker Generator because I got tired of this exact loop. Pick the meeting type, pick the tone -- fun, thoughtful, professional -- and it spits out questions that actually fit. You can add a theme too if your team has a thing. It's free and takes about ten seconds.
Whether you use it or not, the point stands: spend thirty seconds picking a question that matches your meeting, and you'll get better participation out of the other fifty-nine minutes.
(If you're also in goal-setting mode because review season never really ends, I wrote about making SMART goals less painful too.)
I've tested a lot of these. A few that consistently get good responses:
For regular team meetings: "What's something small that made your week better?" Low stakes, easy to answer, and you actually learn something about people.
For retros: "One word to describe this sprint." Sets the reflective tone without being heavy.
For meetings with new faces: "What's your role in one sentence, and one thing about your work people might not know?" Gives context and a conversation starter.
For Fridays: "What are you looking forward to this weekend?" Simple. Gets people mentally out of work mode before the meeting even ends.
That's sort of the whole point. Icebreakers don't need to be creative or clever or carefully workshopped. They need to get people talking before the real agenda starts. Two minutes, one question, everyone speaks. The meetings where that happens go better than the ones where it doesn't.
Pick something that fits the room, keep it short, and let people opt out. That's the whole strategy.
2026-02-19 04:17:02
Recently I've been subscribing to Claude Code Max, Codex (ChatGPT Pro), and Antigravity (Google AI Pro), which has dramatically increased my workload. At some point I started getting headaches. I wondered if it was just lack of sleep, but then our CTO asked whether I'd been getting headaches, and the thing is, I had actually taken Tylenol the day before. After talking to other heavy AI users and hearing that they occasionally get headaches too, I decided to investigate. It turns out I'm not alone: community posts asking "Does anyone get headaches when using AI? Planning and directing takes so much brainpower" are becoming common.
A 2025 academic study also found that deeper engagement with GenAI doesn't reduce cognitive burden—it actually amplifies it.
In traditional development, you'd spend a day diving deep into one design problem. Implementation took time, giving you the luxury of slowly making architectural decisions. AI flips this dynamic. When you can prototype three approaches in the time it previously took to build one, you must constantly make architecture-level decisions. The bottleneck shifts from "can we build this?" to "should we build this, and how?"
AI doesn't move on its own. "Remove this," "redo it," "change direction"—you must constantly direct the next action. This process intensely consumes your brain's executive function, a high-intensity cognitive task.
A 2025 study of 832 GenAI users found that uncertainty about how to write prompts causes emotional fatigue, while unexpected responses cause cognitive fatigue. The process of choosing words and designing context to get desired results consumes a new type of energy.
Prompt writing → result review → revision instruction → re-review. This loop repeats dozens or hundreds of times daily. While AI doesn't tire from context switching, the human brain pays a transition cost each time it changes modes.
The 20-20-20 rule: every 20 minutes, look at something 20 feet (6 m) away for 20 seconds. Proposed by optometrist Dr. Anshel in the 1990s, this rule is recommended by both the American Optometric Association (AOA) and the American Academy of Ophthalmology (AAO). Research shows that applying it for two weeks significantly reduces digital eye strain symptoms.
I happen to have a view of the Mississauga skyline from my place, so every 20 minutes I look out at the open landscape for 20 seconds. Having a distant view to rest your eyes on makes practicing this rule much easier than trying to focus on a wall or nearby objects.
Instead of continuously micro-directing, give broad guidelines once, let AI draft the solution, then review the results in batches. This reduces the number of brain transitions. For example, tools like oh-my-claudecode's autopilot or ralplan's autonomous execution modes let you review outputs without directing every step.
After 50 minutes of focus, you need 10 minutes away from screens entirely. This allows your brain's Default Mode Network (DMN) to activate, consolidating and organizing information—a completely different brain activity from continuously reading and judging AI outputs.
Posture is an easily overlooked aspect. When you're concentrating on AI conversations, you may unconsciously tense your neck and shoulders, leading to tension headaches. Simply positioning your monitor at eye level and keeping the screen at least 63 cm away (about arm's length) makes a noticeable difference.
The solution to AI fatigue isn't to use AI less. The key is using it with boundaries, intention, and awareness that you're not a machine.
Acknowledging that productivity gains come with increased cognitive costs, and managing those costs, has become the new essential skill for developers in the AI era.