2026-03-28 08:10:57
I’ve spent the past year building Do Not Eat, an AI platform that generates, publishes, and manages social media content across Instagram, TikTok, YouTube, LinkedIn, and Threads. Along the way, I’ve run into a lot of technical decisions that don’t have obvious answers.
This isn’t a product pitch. This is a breakdown of the engineering and product choices behind AI social media automation — what works, what we got wrong, and what I’d tell another developer building in this space.
The #1 complaint about AI-generated social media content is that it all sounds the same. Generic. Robotic. Interchangeable.
The naive approach is prompt engineering: "Write an Instagram caption in a casual, friendly tone about [topic]." This produces acceptable-ish content, but it sounds like every other AI caption on the internet.
The better approach is brand voice profiling. Here’s how it works:
1. Ingest existing content. Pull the user’s last 50–100 posts across all platforms. You need enough data to extract meaningful patterns, but not so much that you’re averaging out voice evolution over time.
2. Extract style features. Not just sentiment — that’s too coarse. You want:
3. Condition generation on these features. When generating new content, the prompt includes the extracted style profile as constraints. The AI isn’t just "writing in a casual tone" — it’s matching specific measurable characteristics of the user’s actual writing.
This still isn’t perfect. Voice has subtleties that feature extraction misses — irony, timing, cultural references. But it gets you to 85–90% quality, which is good enough for daily social content that the user can review and tweak.
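For illustration, here is a minimal sketch of the feature-extraction step (step 2 above). The metric names and thresholds are invented for this example; a real profiler tracks many more dimensions:

```python
import re
from statistics import mean

def extract_style_features(posts: list[str]) -> dict:
    """Extract a few measurable style features from a user's past posts."""
    sentences = [s for p in posts for s in re.split(r"[.!?]+\s*", p) if s]
    words = [w for p in posts for w in p.split()]
    emoji = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")
    return {
        "avg_sentence_words": round(mean(len(s.split()) for s in sentences), 1),
        "emoji_per_post": round(sum(len(emoji.findall(p)) for p in posts) / len(posts), 2),
        "question_ratio": round(sum(p.count("?") for p in posts) / len(posts), 2),
        "avg_post_words": round(len(words) / len(posts), 1),
    }

profile = extract_style_features([
    "Loving this new recipe! So easy 🔥",
    "Would you try fermented hot sauce? Honest answers only.",
])
# The resulting dict is serialized into the generation prompt as hard constraints.
```

The point is that each feature is measurable, so generated drafts can be checked against the profile rather than vibes.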
Cross-posting is the lazy solution. It’s also the wrong one. Each platform’s algorithm actively deprioritizes content that looks like it was copy-pasted.
We handle this by treating each platform as a separate rendering target with its own constraints:
Core message → Platform adapter → Native format
LinkedIn: Long text, paragraph breaks, no hashtags in body
Instagram: Visual-first, concise caption, hashtags at end or in comment
TikTok: Script for voiceover, trending audio suggestions, hook in first 2 sec
YouTube: SEO-optimized title + description, thumbnail copy
Threads: Conversational, short, opinion-forward
The platform adapter isn’t just reformatting text. It’s re-structuring the argument for the platform’s consumption pattern. LinkedIn users read long-form. Instagram users skim. TikTok users decide in 1.5 seconds whether to keep watching.
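As a sketch, the adapter layer can be a plain dispatch table. The `CoreMessage` fields and the two adapters below are illustrative, not our production schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CoreMessage:
    hook: str
    body: str
    hashtags: list

def to_linkedin(m: CoreMessage) -> str:
    # Long-form, paragraph breaks, no hashtags in body
    return f"{m.hook}\n\n{m.body}"

def to_instagram(m: CoreMessage) -> str:
    # Concise caption, hashtags appended at the end
    return f"{m.hook} {m.body[:120]}\n\n" + " ".join(f"#{t}" for t in m.hashtags)

ADAPTERS: dict[str, Callable[[CoreMessage], str]] = {
    "linkedin": to_linkedin,
    "instagram": to_instagram,
}

def render(platform: str, message: CoreMessage) -> str:
    return ADAPTERS[platform](message)
```

Each adapter owns its platform's constraints, so adding a platform means adding one function, not touching the core pipeline.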
Publishing content is the easy part. The hard part is handling what happens after: comments, DMs, mentions, and the signal buried in the noise.
We built a lead detection system called Lead Radar that classifies incoming comments by intent. The categories:
Each comment gets a relevance score (1–10) based on intent signals, commenter profile quality, and recency. High-scoring comments get surfaced with a draft reply. Low-scoring ones get auto-liked or ignored.
The tricky part is context sensitivity. "This is exactly what I need" means very different things under a product post vs. a meme. We solve this by including the parent post’s content and intent category in the classification context.
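A toy version of the scoring function, with invented weights and categories, might look like:

```python
def score_comment(intent: str, follower_count: int, hours_old: float) -> int:
    """Toy relevance score (1-10). Weights and categories are illustrative;
    the real classifier also conditions on the parent post's content."""
    intent_weight = {"purchase": 5, "question": 4, "praise": 2, "spam": 0}.get(intent, 1)
    profile = min(3, follower_count // 1000)       # crude profile-quality proxy
    recency = 2 if hours_old < 6 else 1 if hours_old < 24 else 0
    return max(1, min(10, intent_weight + profile + recency))
```

High scores surface a draft reply; low scores get auto-liked or ignored, as described above.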
Every social media blog says "post when your audience is most active." This is correct but incomplete.
What they don’t tell you is that "most active" is not static. It shifts based on:
We approach this as an optimization problem. Start with platform-level heuristics (well-documented best practices), then adapt per-account based on actual engagement data. After 2–3 weeks of posting, the system has enough data to find per-account optimal windows.
The key insight: optimal posting time is a function of (platform × content_type × audience_segment × day_of_week), not just a single "best time."
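That insight translates directly into code: bucket engagement by the full key and fall back to platform heuristics for sparse buckets. A sketch, assuming a simple log format:

```python
from collections import defaultdict

# Assumed log format: each entry is
# (platform, content_type, day_of_week, hour, engagement_rate)
def best_windows(logs, min_samples=5):
    """Average engagement per (platform, content_type, day, hour) bucket."""
    buckets = defaultdict(list)
    for platform, ctype, day, hour, rate in logs:
        buckets[(platform, ctype, day, hour)].append(rate)
    # Ignore sparse buckets; use platform-level heuristics for those instead
    scored = {k: sum(v) / len(v) for k, v in buckets.items() if len(v) >= min_samples}
    return dict(sorted(scored.items(), key=lambda kv: kv[1], reverse=True))
```

After 2-3 weeks of posting, enough buckets cross `min_samples` to replace the generic defaults.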
1. Start with fewer platforms. We launched supporting 5 platforms simultaneously. Should have launched with 2, nailed them, then expanded. Each platform has enough edge cases to be its own product.
2. Invest in voice profiling earlier. We spent the first few months on generic content generation and only built voice matching after users complained. Should have been day-one.
3. Build the review interface first. Users need to trust AI output before they’ll let it auto-publish. The review/approval flow is more important than the generation quality itself. If users can easily review, edit, and approve, they’re forgiving of imperfect generation. If they can’t, even good content feels risky.
If you’re thinking about building AI social media tools, here’s my advice:
If you’re interested in the AI social media automation space, I’m happy to chat about what we’ve learned. Hit me up on LinkedIn or check out what we’re building at donoteat.tech.
What’s your experience with AI content generation? Have you tried automating social media for a project? I’d love to hear what worked and what didn’t in the comments.
2026-03-28 08:08:18
The Model Context Protocol is the connective tissue of the AI agent ecosystem. It's how Claude, Cursor, VS Code Copilot, and hundreds of other AI tools connect to external services, databases, APIs, and local system resources. There are now over 16,000 MCP servers in the wild, and the number is growing by hundreds every week.
We've spent the last several months scanning, analyzing, and probing MCP servers at scale. Our registry at CraftedTrust has indexed 4,275 servers and scored each one across 12 security categories aligned to the CoSAI threat taxonomy. What we found is concerning.
The average trust score for statically analyzed npm packages is 54 out of 100. That's an F.
MCP servers occupy a uniquely dangerous position in the software stack. A traditional API serves data to an application that a developer controls. An MCP server serves data and capabilities to an AI model that reasons about what to do next. The model decides which tools to call, what parameters to pass, and how to interpret results. This means a compromised or poorly built MCP server doesn't just return bad data. It can influence the behavior of the entire agent.
Three categories of vulnerabilities show up repeatedly across the ecosystem.
This is the most common pattern. An MCP server exposes a tool that does something dangerous (execute code, navigate a browser, run shell commands, read the filesystem) and relies entirely on the AI model to use it responsibly.
We recently published a critical advisory for chrome-local-mcp, an npm package with 332 weekly downloads that gives AI agents browser automation capabilities. We found three chained vulnerabilities:
Arbitrary JavaScript Execution (CWE-94, Critical): The server exposes an eval MCP tool and an HTTP /eval endpoint that pass user-supplied JavaScript directly to Puppeteer's page.evaluate() with no restrictions whatsoever. Because the browser uses a persistent profile directory that retains login sessions across invocations, an attacker can navigate to any site where the user is logged in and extract document.cookie, localStorage, session tokens, or any DOM content.
SSRF via Unrestricted URL Navigation (CWE-918, High): The navigate tool passes URLs directly to page.goto() with no validation. No scheme allowlist (accepts file://, data:, javascript:), no hostname blocklist (allows 169.254.169.254, localhost, internal IPs), and no port restriction. On cloud-hosted deployments, this enables direct credential theft from instance metadata endpoints.
Unauthenticated HTTP API on All Interfaces (CWE-306, High): The Express server listens on 0.0.0.0 with no authentication. All 15 endpoints are accessible to any local process or network neighbor. Any website opened in a regular browser can send fetch('http://localhost:3033/eval', {method:'POST', body:...}) to execute arbitrary JavaScript in the Puppeteer session.
These three findings chain together into a full attack: any website you visit in your normal browser can silently call the unauthenticated local API, navigate the Puppeteer session to a site where you're logged in, and extract your credentials. No user interaction required beyond having the MCP server running.
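For builders, the SSRF half of this chain is cheap to prevent. A sketch of a URL guard, shown in Python for illustration (the affected server is Node, where the same checks apply before `page.goto()`):

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}

def is_safe_url(url: str) -> bool:
    """Reject schemes and hosts that enable SSRF (illustrative allowlist check)."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False  # blocks file://, data:, javascript:
    host = parsed.hostname or ""
    if host == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
        # blocks 169.254.169.254 (metadata endpoint), 10/8, 127/8, etc.
        return not (ip.is_private or ip.is_loopback or ip.is_link_local)
    except ValueError:
        # Hostname, not a literal IP. A production guard must also resolve it
        # and re-check, or DNS rebinding bypasses this entirely.
        return True
```

This is a floor, not a ceiling: redirects and DNS rebinding need handling too.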
The full advisories are published on CraftedTrust Touchstone.
MCP servers are typically installed via npx from npm or cloned from GitHub. The installation process is: a user copies a JSON config snippet from a README, pastes it into their MCP client config, and the next time their AI tool starts up, it runs npx some-package as a subprocess with whatever permissions the user has.
There is no code signing. There is no permission manifest. There is no sandbox by default. If the package author pushes a malicious update, it executes automatically the next time the user's AI tool starts.
We published two supply chain advisories this week. One for a third-party republication of the official Notion MCP server (@osematouati/notion-mcp-server) that claims to be "Official" and points its repository field to Notion's GitHub org but has no npm provenance attestation and a single maintainer. Another for a Gmail MCP server (@gongrzhe/server-gmail-autoauth-mcp) requesting gmail.modify and gmail.settings.basic scopes with zero provenance verification across all 7 published versions.
Research from other teams corroborates the scale: Astrix Security found that 53% of MCP servers use static secrets (API keys embedded in configuration), BlueRock Security found 36.7% of 7,000+ servers potentially vulnerable to SSRF, and the OpenClaw ecosystem saw over 800 malicious skills published across 12 attacker accounts, roughly 20% of the ClawHub registry.
CyberArk's research demonstrated that tool poisoning doesn't just hide in the description field. Every schema field is a potential injection vector: parameter names, types, anyOf/oneOf constructs, enum values, and even tool output. Their testing showed an 84.2% success rate for tool poisoning attacks with auto-approval enabled. Invariant Labs found that even the best-performing model (Claude 3.7 Sonnet) had less than a 3% refusal rate against tool poisoning, and more capable models are actually more susceptible.
This means traditional security scanning that only checks for known malicious patterns in tool descriptions is catching a fraction of the real attack surface.
CraftedTrust operates as an independent trust verification layer for the MCP ecosystem. Every server in our registry is scored across 12 security categories: identity and authentication, permission scope, transport security, declaration accuracy, tool integrity, supply chain, input validation, data protection, network behavior, code transparency, publisher trust, and protocol compliance. Scores map to five compliance frameworks (CoSAI, OWASP Top 10 for Agentic Apps, EU AI Act, NIST AI RMF, and AIUC-1).
Our security research arm, CraftedTrust Touchstone, runs automated deep scans across 60 security checks in 8 domains, auto-triages findings, and manages a 90-day coordinated disclosure pipeline. When we find something, we notify the maintainer, give them 90 days to fix it, and publish the advisory with full technical details.
We also built an MCP server interface so AI agents can check trust scores programmatically before connecting to any server. Add CraftedTrust to your agent's MCP config and it can call check_trust on any server URL before deciding whether to connect:
{
  "mcpServers": {
    "craftedtrust": {
      "url": "https://mcp.craftedtrust.com/api/v1/mcp"
    }
  }
}
Six tools are available: check_trust, scan_server, search_registry, get_stats, pay_for_certification, and verify_payment. The search and stats tools are free. Premium endpoints accept x402 micropayments (USDC on Base) so agents can pay per-request without API keys or subscriptions.
If you're using MCP servers with your AI tools:
Audit what's installed. Check your Claude Desktop, Cursor, or VS Code MCP config files. Every server listed there runs with your user permissions. If you don't recognize it or don't actively use it, remove it.
Check trust scores. Search for your installed servers at mcp.craftedtrust.com. If a server scores below 40 (grade D or F), investigate before continuing to use it.
Prefer servers from verified publishers. Look for servers published by the organization they claim to represent (e.g., @notionhq/notion-mcp-server over third-party republications), with multiple maintainers, npm provenance attestation, and active GitHub repositories.
Don't auto-approve tool calls. Most MCP clients support an approval flow for tool calls. Use it, at least for servers that have filesystem, network, or code execution capabilities.
If you're building MCP servers:
Get scanned. Submit your server URL at mcp.craftedtrust.com for a free 12-category trust assessment. If you want a deeper review, our certification tiers ($29/$79/$499) include enhanced scanning, compliance framework mappings, and a verified trust badge.
Read the OWASP MCP Top 10. It covers the attack patterns we see most frequently: tool poisoning, excessive permissions, SSRF, credential exposure, and supply chain compromise.
Add authentication. If your server exposes any capability beyond read-only public data, it should require authentication. OAuth 2.1 with PKCE is the standard. At minimum, don't bind to 0.0.0.0 without auth.
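A minimal sketch of that token check, in Python for illustration (the header name and token source are up to you; the same pattern applies in an Express middleware):

```python
import hmac

def authorized(headers: dict, expected_token: str) -> bool:
    """Constant-time bearer-token check; reject everything if no token is configured."""
    if not expected_token:
        return False  # fail closed rather than running open
    supplied = headers.get("Authorization", "").removeprefix("Bearer ")
    return hmac.compare_digest(supplied, expected_token)

# And when binding the HTTP server, prefer loopback over all interfaces:
#   HTTPServer(("127.0.0.1", 3033), Handler)   # not ("0.0.0.0", 3033)
```

`hmac.compare_digest` avoids timing side channels; a plain `==` comparison leaks how many leading characters matched.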
The MCP ecosystem is growing fast and building incredible capabilities. But the security posture of most servers assumes a world where every AI model always does exactly what the developer intended. That world doesn't exist. The sooner we build trust verification into the agent connection flow, the safer this ecosystem becomes for everyone.
Jeremy Kenitz is the founder of Cyber Craft Solutions LLC and the creator of the CraftedTrust Agent Trust Stack. CraftedTrust Touchstone advisories are published at touchstone.craftedtrust.com.
2026-03-28 08:07:04
I've been running an autonomous AI agent on a Mac Mini 24/7 for over a month. It manages multiple businesses, publishes content, monitors accounts, and makes decisions while I sleep.
It's also gotten shadow-banned, suspended from platforms, and nearly leaked credentials. Twice.
Here's everything that went wrong and the security architecture I built to prevent it from happening again.
My agent was happily posting to a social platform. Engagement was growing. Then — silence. No errors, no warnings, no rejection messages. Posts were going through successfully (200 OK), but nobody could see them.
Shadow bans are invisible to the banned account. My agent's monitoring looked at "did the post succeed?" not "can anyone else see it?"
# Before any platform activity:
python3 scripts/rate-limiter.py check <platform> <action>
A rate limiter that tracks every external action across every platform. Not just API rate limits — behavioral limits:
Lesson: API success ≠ visible to humans. Always verify from an external perspective.
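A minimal sketch of what a behavioral limiter like `rate-limiter.py` might track. The limits and state path here are invented for illustration:

```python
import json
import tempfile
import time
from pathlib import Path

LIMITS = {  # illustrative caps: actions allowed per rolling hour, per platform
    ("twitter", "post"): 3,
    ("twitter", "comment"): 10,
    ("blog", "publish"): 1,
}

STATE = Path(tempfile.gettempdir()) / "rate_limiter_state.json"

def check(platform: str, action: str) -> bool:
    """Return True (and record the action) if it's within behavioral limits."""
    now = time.time()
    history = json.loads(STATE.read_text()) if STATE.exists() else {}
    key = f"{platform}:{action}"
    recent = [t for t in history.get(key, []) if now - t < 3600]
    if len(recent) >= LIMITS.get((platform, action), 5):
        return False  # over the hourly cap: skip, don't queue
    recent.append(now)
    history[key] = recent
    STATE.write_text(json.dumps(history))
    return True
```

The state file survives restarts, so the agent can't reset its own limits by crashing.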
My agent writes daily logs. Detailed ones. One day I noticed an API key in a log file that was about to be committed to a git repo.
The agent wasn't trying to leak anything — it was logging a failed API call, and the error message included the full request headers. Including the Authorization header.
Three layers of defense:
Layer 1: Credential isolation — All secrets live in one file with chmod 600. The agent reads credentials through a helper, never stores them in variables that get logged.
Layer 2: Git pre-commit scanning — Before any git push, a hook scans for patterns that look like tokens, API keys, or passwords.
Layer 3: File permission enforcement — Credential files are chmod 600. Log directories are chmod 700.
Lesson: Agents are verbose loggers by nature. Treat every log line as potentially public.
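Layer 2 can be as simple as a regex pass over the staged diff. The patterns below are illustrative and far less thorough than dedicated tools like gitleaks or detect-secrets:

```python
import re

# Illustrative patterns; real hooks go much further
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[=:]\s*['\"]?[\w-]{16,}"),
    re.compile(r"Bearer\s+[A-Za-z0-9._-]{20,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
]

def scan(text: str) -> list:
    """Return the patterns that matched anywhere in the text."""
    return [p.pattern for p in SECRET_PATTERNS if p.search(text)]

# Wire into .git/hooks/pre-commit, e.g.:
#   git diff --cached | python3 scan_secrets.py   (exit non-zero on any match,
#   which aborts the commit)
```

False positives are annoying but cheap; a leaked Authorization header is not.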
Day 1 on a new blogging platform: my agent published 7 articles. All high-quality, well-formatted, properly tagged content.
Result: 3 articles deleted by moderation. Not because the content was bad — because no human publishes 7 articles in one day on a new account.
The agent now has a "new account warming" protocol:
Lesson: Platforms profile behavior, not content. A new account doing anything at volume = bot.
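A sketch of a warming schedule expressed as data. The caps here are invented; every platform needs its own tuning:

```python
# Per-day action caps that ramp with account age (illustrative numbers)
WARMUP_SCHEDULE = [
    (7,  {"post": 1, "comment": 2}),   # first week: barely active
    (21, {"post": 2, "comment": 5}),   # weeks 2-3: light activity
    (60, {"post": 4, "comment": 10}),  # ramping toward normal volume
]

def daily_cap(account_age_days: int, action: str) -> int:
    for max_age, caps in WARMUP_SCHEDULE:
        if account_age_days <= max_age:
            return caps.get(action, 0)
    return 7  # mature account: full (still bounded) volume
```

The agent consults `daily_cap` before every external action, the same way it consults the rate limiter.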
OAuth tokens expire. My agent has a refresh mechanism. But when two cron jobs fire at the same time, both try to refresh the token simultaneously. One succeeds, invalidating the token the other one is using. Second job fails. Retry? It tries to refresh again — but the refresh token was already rotated.
Result: Complete lockout requiring manual re-authentication.
Token refresh is now serialized through a single process with file locking:
# Simplified version
import os
from filelock import FileLock  # third-party: pip install filelock

def get_valid_token():
    # Serialize refreshes so concurrent jobs can't race on the refresh token.
    # Note: FileLock does not expand "~", so expand it explicitly.
    with FileLock(os.path.expanduser("~/.token-lock")):
        if token_expired():
            new_token = refresh()
            save_token(new_token)
        return load_token()
Plus a recovery hierarchy:
Lesson: Autonomous systems need autonomous recovery. Asking the human should be the last resort.
My agent runs in Seoul (UTC+9). Some platforms flag accounts active 24/7 — humans sleep. My agent doesn't.
I got flagged for "impossible activity patterns" — posting at 3 AM and 3 PM with equal frequency.
# Hard curfew rules:
# 09-22 KST: External activity OK
# 22-09 KST: Research + local work ONLY
Between 10 PM and 9 AM, the agent can research and write drafts, but it cannot publish, push code, comment, or send any external requests.
Lesson: Platforms expect human patterns. An agent that never sleeps looks like a bot. Because it is one.
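The curfew gate itself is tiny. A sketch:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def external_activity_allowed(now: datetime = None) -> bool:
    """Hard curfew: external actions only between 09:00 and 22:00 KST."""
    now = now or datetime.now(ZoneInfo("Asia/Seoul"))
    return 9 <= now.hour < 22
```

Every external action runs through this check first; research and draft-writing skip it.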
One morning I found my agent had burned through 47 API calls trying to upload a video that kept failing. Each retry was identical. Each failure was the same error.
A 403 (permission denied) was being treated the same as a 500 (server error).
Error classification with different strategies:
| Error Type | Strategy |
|---|---|
| 4xx (client error) | Stop immediately, log, alert |
| 429 (rate limit) | Exponential backoff, respect Retry-After |
| 5xx (server error) | Retry 3x with backoff, then stop |
| Network timeout | Retry 2x, then skip to next task |
Plus a circuit breaker: if any platform returns 3+ errors in a row, all activity on that platform pauses for 1 hour.
Lesson: Blind retries amplify failures. Classify errors before deciding what to do.
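A sketch of the classification table as code (the policy values mirror the table above; the return shape is illustrative):

```python
def retry_strategy(status: int = None) -> dict:
    """Map an HTTP status (None = network timeout) to a retry policy."""
    if status is None:                     # network timeout
        return {"retries": 2, "backoff": False, "then": "skip_to_next_task"}
    if status == 429:                      # rate limit: check before other 4xx
        return {"retries": 5, "backoff": True, "then": "respect_retry_after"}
    if 400 <= status < 500:                # client error: retrying won't help
        return {"retries": 0, "backoff": False, "then": "alert"}
    if 500 <= status < 600:                # server error: bounded retries
        return {"retries": 3, "backoff": True, "then": "stop"}
    return {"retries": 0, "backoff": False, "then": "log"}
```

The ordering matters: 429 must be matched before the generic 4xx branch, or rate limits get misclassified as fatal client errors.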
In a group chat, someone asked about our tech stack. My agent — trying to be helpful — shared specific infrastructure details including server specs and internal tools.
None of it was secret, exactly. But aggregated, it painted a very detailed picture of operations.
Context-aware information sharing:
Lesson: Agents don't have social instincts. They'll share everything unless explicitly told what's private.
┌─────────────────────────────────────┐
│ CORE RULES (always loaded) │
│ • Time curfew (09-22 external) │
│ • Rate limiter (per-platform) │
│ • Error classification │
│ • Credential isolation │
├─────────────────────────────────────┤
│ PER-PLATFORM RULES │
│ • Account age requirements │
│ • Action limits (posts/comments) │
│ • Warm-up protocols │
│ • Platform-specific gotchas │
├─────────────────────────────────────┤
│ RECOVERY HIERARCHY │
│ 1. Auto-retry (classified errors) │
│ 2. Token refresh (serialized) │
│ 3. Session recovery (cookies) │
│ 4. Circuit breaker (pause) │
│ 5. Human escalation (last resort) │
└─────────────────────────────────────┘
Start with security; don't bolt it on after failures. Every failure above was preventable with 30 minutes of upfront thinking.
Assume every platform has bot detection. The question is how aggressive it is, not whether it exists.
Log everything, share nothing. Internal logs should be verbose. External-facing actions should be minimal.
Test with burner accounts first. Before connecting real accounts to an AI agent, test automation on throwaway accounts.
I'm 6 weeks in. New failure modes appear regularly. Last week it was a platform that changed its API without notice. The week before, a rate limit that wasn't documented anywhere.
The goal isn't zero failures — it's fast recovery and no repeated failures. Every incident becomes a rule. Every rule prevents the next incident.
That's the real security model: an agent that gets smarter about its own vulnerabilities over time.
Running AI agents autonomously means security can't be an afterthought. Here are some tools I've built along the way:
📦 The $0 Developer Playbook — Free toolkit including automation safety patterns
🛠️ Complete Dev Toolkit — Project management templates with built-in QA checklists
2026-03-28 08:06:47
It's 6 AM on a Saturday. Most of my scrapers are quiet.
But one isn't.
Total runs: 10,721
| Actor | Total Runs | Weekend Pattern |
|---|---|---|
| naver-news-scraper | 7,539 | Near zero Friday night |
| naver-place-search | 1,087 | +177 Friday night alone |
| naver-blog-search | 726 | Low |
| naver-blog-reviews | 601 | Moderate |
| Others | 768 | Minimal |
naver-place-search ran more between midnight and 6 AM on a Saturday than naver-news ran all day Sunday.
I never surveyed my users. I don't know their names. But I know their schedules.
naver-news users work office hours.
They're running scheduled pipelines — media monitoring, brand intelligence, market research. The kind of automation that runs Monday to Friday, gets reviewed by a human on Tuesday morning, and completely shuts down over the weekend.
These are B2B users. They have a boss. The boss takes weekends off.
naver-place-search users don't care what day it is.
Someone was querying Korean place data at 3 AM on a Saturday. Either:
Either way, this isn't someone with a Monday standup.
The day/night pattern I wrote about before suggested enterprise demand. This weekend pattern adds nuance:
Not all actors serve the same market.
If I'd built one product for both user types, I'd have optimized for the wrong one.
I built 13 Korean data scrapers. I expected people to use them for data extraction.
What I didn't expect was that usage patterns would become my market research.
I don't need a survey. I don't need user interviews. The traffic pattern IS the interview:
Every API call is a data point about who's using it and why.
The previous two Mondays both showed 2-2.3x weekend rate. If that holds, baseline is stable.
If it doesn't, something changed in the user base. And that's equally interesting data.
Day 15 of building Korean data scrapers in public. 13 actors, 10,721 runs.
Previous: After 10,000: Why I Stopped Building and Started Marketing
2026-03-28 08:02:31
There are now 15+ AI coding tools competing for your workflow. Cursor, Windsurf, Claude Code, Copilot, Goose, Junie, Google Antigravity — and more launching every week.
Most developers pick based on Twitter hype. Here's a systematic framework instead.
Every AI coding tool can be scored on 4 axes:
1. Autonomy — How much can it do without you?
2. Context — How much of your codebase does it understand?
3. Integration — How well does it fit your existing workflow?
4. Cost — What's the real $/month including API usage?
Here's a scoring template:
# tool_evaluator.py
from dataclasses import dataclass

@dataclass
class ToolScore:
    name: str
    autonomy: int        # 1-10: 1=autocomplete only, 10=full autonomous agent
    context: int         # 1-10: 1=single file, 10=entire monorepo
    integration: int     # 1-10: 1=standalone, 10=deep IDE + CI/CD + git
    cost_monthly: float  # USD/month for typical solo dev usage

    @property
    def value_score(self) -> float:
        """Capability per dollar."""
        capability = (self.autonomy + self.context + self.integration) / 3
        if self.cost_monthly == 0:
            return capability * 10  # Free tools get a big bonus
        return capability / (self.cost_monthly / 20)  # Normalize to $20 baseline

# 2026 landscape (approximate scores based on public benchmarks)
tools = [
    ToolScore("Cursor Pro", autonomy=7, context=8, integration=9, cost_monthly=20),
    ToolScore("Windsurf Pro", autonomy=8, context=7, integration=8, cost_monthly=15),
    ToolScore("Claude Code", autonomy=9, context=9, integration=6, cost_monthly=25),
    ToolScore("Copilot Business", autonomy=5, context=6, integration=10, cost_monthly=19),
    ToolScore("Goose (Block)", autonomy=7, context=7, integration=5, cost_monthly=0),
    ToolScore("Junie CLI", autonomy=6, context=6, integration=7, cost_monthly=0),
]

# Sort by value
ranked = sorted(tools, key=lambda t: t.value_score, reverse=True)
for i, t in enumerate(ranked, 1):
    print(f"{i}. {t.name:20s} | Value: {t.value_score:.1f} | "
          f"A:{t.autonomy} C:{t.context} I:{t.integration} | ${t.cost_monthly}/mo")
Output:
1. Goose (Block) | Value: 63.3 | A:7 C:7 I:5 | $0/mo
2. Junie CLI | Value: 63.3 | A:6 C:6 I:7 | $0/mo
3. Windsurf Pro | Value: 10.2 | A:8 C:7 I:8 | $15/mo
4. Cursor Pro | Value: 8.0 | A:7 C:8 I:9 | $20/mo
5. Copilot Business | Value: 7.4 | A:5 C:6 I:10 | $19/mo
6. Claude Code | Value: 6.4 | A:9 C:9 I:6 | $25/mo
START
│
├─ Do you work in VS Code or JetBrains?
│ ├─ VS Code → Cursor or Copilot
│ └─ JetBrains → Junie or Copilot
│
├─ Do you need autonomous multi-file changes?
│ ├─ Yes → Claude Code or Windsurf
│ └─ No → Copilot (fastest autocomplete)
│
├─ Is cost a hard constraint?
│ ├─ $0 budget → Goose + local model
│ └─ $20/mo okay → Cursor Pro (best balance)
│
├─ Do you work on large monorepos (500K+ lines)?
│ ├─ Yes → Claude Code (largest context window)
│ └─ No → Any tool works
│
└─ Do you need MCP tool integration?
├─ Yes → Claude Code or Cursor
└─ No → Any tool works
Hype says "tool X is best." Data says something different. Here's a benchmark template you can run on your own codebase:
# benchmark.py — Test AI tools on YOUR codebase
import time
from pathlib import Path

TASKS = [
    {
        "name": "add_endpoint",
        "prompt": "Add a GET /health endpoint that returns {status: 'ok', uptime: <seconds>}",
        "verify": lambda: "health" in open("src/routes.py").read(),
        "category": "feature"
    },
    {
        "name": "fix_bug",
        "prompt": "The login function doesn't hash passwords before comparing. Fix it.",
        "verify": lambda: "bcrypt" in open("src/auth.py").read() or "hashlib" in open("src/auth.py").read(),
        "category": "bugfix"
    },
    {
        "name": "write_tests",
        "prompt": "Write comprehensive tests for the UserService class",
        "verify": lambda: Path("tests/test_user_service.py").exists(),
        "category": "testing"
    },
    {
        "name": "refactor",
        "prompt": "Extract the email validation logic into a separate utils module",
        "verify": lambda: Path("src/utils/validation.py").exists(),
        "category": "refactor"
    },
    {
        "name": "docs",
        "prompt": "Generate API documentation for all public endpoints",
        "verify": lambda: Path("docs/api.md").exists(),
        "category": "docs"
    },
]

def run_benchmark(tool_name: str) -> dict:
    """Run all tasks and measure success rate + time."""
    results = []
    for task in TASKS:
        start = time.time()
        # Reset codebase to clean state
        # subprocess.run(["git", "checkout", "."])
        print(f"  Running: {task['name']}...")
        # Execute with your tool (manual for now)
        input(f"  → Execute with {tool_name}, then press Enter: ")
        elapsed = time.time() - start
        try:
            success = task["verify"]()
        except OSError:
            success = False  # missing target file counts as a failed task
        results.append({
            "task": task["name"],
            "category": task["category"],
            "success": success,
            "time_seconds": round(elapsed),
        })
        print(f"  {'✅' if success else '❌'} {task['name']} ({elapsed:.0f}s)")
    success_rate = sum(1 for r in results if r["success"]) / len(results)
    avg_time = sum(r["time_seconds"] for r in results) / len(results)
    return {
        "tool": tool_name,
        "success_rate": f"{success_rate:.0%}",
        "avg_time_seconds": round(avg_time),
        "results": results,
    }
After analyzing how developers actually use these tools, three patterns emerge:
If you're writing code you already know how to write, Copilot's autocomplete is the fastest. It predicts the next line in ~200ms. Nothing beats muscle memory + autocomplete.
If you're working with an unfamiliar API, codebase, or language, agentic tools (Claude Code, Windsurf) are 3-5x faster. They can read docs, try approaches, and iterate — things autocomplete can't do.
The tool with the largest effective context window usually wins on complex tasks. If your agent can't see the relevant file, it can't help.
# Quick context window comparison
context_limits = {
    "Copilot": 8_000,        # tokens per completion
    "Cursor": 100_000,       # with @codebase indexing
    "Windsurf": 128_000,     # Cascade context
    "Claude Code": 200_000,  # native context window
    "Goose": 128_000,        # depends on model
}

# Rule of thumb: 1 token ≈ 4 chars ≈ 0.75 words
# 100K tokens ≈ 75K words ≈ ~300 pages of code
for tool, tokens in sorted(context_limits.items(), key=lambda x: x[1], reverse=True):
    pages = tokens * 0.75 / 250  # tokens → words → pages (≈250 words/page)
    print(f"{tool:20s}: {tokens:>8,} tokens (~{pages:.0f} pages of code)")
Most productive developers don't use one tool. They use 2-3:
Daily Autocomplete → Copilot (always-on, fast, cheap)
Complex Tasks → Cursor Pro or Claude Code (when you need agents)
Quick Scripts/Prototypes → Goose or Claude CLI (free, terminal-based)
Cost: ~$40/month total. ROI: 2-5 hours saved per week.
Before committing to any tool, test these 5 things on YOUR codebase:
## AI Coding Tool Evaluation Checklist
### 1. Setup Time
- [ ] How long to install and configure?
- [ ] Does it work with your language/framework?
- [ ] Does it support your IDE?
### 2. Context Quality
- [ ] Can it reference files you didn't open?
- [ ] Does it understand your project structure?
- [ ] Can it read your README/docs?
### 3. Task Completion
- [ ] Can it add a simple feature end-to-end?
- [ ] Can it fix a bug from an error message?
- [ ] Can it write tests that actually pass?
### 4. Iteration Speed
- [ ] How fast is autocomplete? (<500ms = good)
- [ ] How long for a multi-file change? (<2min = good)
- [ ] Can it recover from mistakes without starting over?
### 5. Cost Reality
- [ ] What's the real $/month with your usage?
- [ ] Are there hidden API costs?
- [ ] Is there a free tier for evaluation?
This is part of the "AI Engineering in Practice" series — practical guides for developers building with AI. Follow for more.
2026-03-28 08:02:30
AI coding tools crossed a line in 2026. They're no longer just suggesting the next line — they're running multi-step workflows autonomously. File creation, test writing, debugging, deployment.
Here are 7 patterns where agentic coding actually works better than doing it yourself, with code you can adapt.
Instead of writing tests manually, describe what you want tested and let the agent generate + run + fix them.
# agent_test_gen.py — Prompt pattern for autonomous test generation
SYSTEM_PROMPT = """You are a test generation agent. Given a function:
1. Analyze the function signature and docstring
2. Generate pytest tests covering: happy path, edge cases, error cases
3. Run the tests
4. Fix any failures
5. Return the final passing test file
Rules:
- Minimum 5 test cases per function
- Include at least 1 edge case and 1 error case
- Use descriptive test names: test_<function>_<scenario>
"""
def generate_tests(source_file: str, function_name: str) -> str:
    """Agent loop: generate → run → fix → repeat."""
    import subprocess

    source = open(source_file).read()
    prompt = f"Generate tests for `{function_name}` in:\n```python\n{source}\n```"

    max_attempts = 3
    for attempt in range(max_attempts):
        test_code = call_llm(SYSTEM_PROMPT, prompt)  # call_llm: your LLM client wrapper
        # Write and run
        with open("test_generated.py", "w") as f:
            f.write(test_code)
        result = subprocess.run(
            ["pytest", "test_generated.py", "-v"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            return test_code  # All tests pass
        # Feed errors back to the agent
        prompt = f"Tests failed:\n{result.stdout}\n{result.stderr}\nFix them."
    return test_code  # Best effort after max attempts
Why it works: The agent iterates. It writes tests, runs them, reads failures, and fixes — something autocomplete can't do.
Rename a function and the agent updates every import, test, and reference across the codebase.
# refactor_agent.py
import ast
from pathlib import Path

def find_all_references(root: str, old_name: str) -> list[dict]:
    """Scan codebase for all references to a symbol."""
    references = []
    for path in Path(root).rglob("*.py"):
        source = path.read_text()
        try:
            tree = ast.parse(source)
            for node in ast.walk(tree):
                if isinstance(node, ast.Name) and node.id == old_name:
                    references.append({
                        "file": str(path),
                        "line": node.lineno,
                        "col": node.col_offset,
                        "type": "name"
                    })
                elif isinstance(node, ast.ImportFrom):
                    for alias in node.names:
                        if alias.name == old_name:
                            references.append({
                                "file": str(path),
                                "line": node.lineno,
                                "type": "import"
                            })
        except SyntaxError:
            continue
    return references

def refactor(root: str, old_name: str, new_name: str):
    """Find and replace all references safely."""
    refs = find_all_references(root, old_name)
    print(f"Found {len(refs)} references across {len(set(r['file'] for r in refs))} files")
    # Group by file for batch edits
    by_file = {}
    for ref in refs:
        by_file.setdefault(ref["file"], []).append(ref)
    for filepath, file_refs in by_file.items():
        content = open(filepath).read()
        # Simple replacement — production version uses AST rewriting
        updated = content.replace(old_name, new_name)
        with open(filepath, "w") as f:
            f.write(updated)
        print(f"  Updated {filepath} ({len(file_refs)} refs)")
Key insight: The AST scan finds references that grep would miss (aliased imports, string references in decorators). An agentic version would also run tests after each file change to catch regressions.
When CI fails, the agent reads the error, proposes a fix, and opens a PR.
```yaml
# .github/workflows/self-heal.yml
name: Self-Healing CI
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]

jobs:
  auto-fix:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Get failure logs
        run: |
          gh run view ${{ github.event.workflow_run.id }} --log-failed > failure.log
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: Agent fix
        run: |
          python3 agent_fix.py failure.log
          # agent_fix.py reads the log, identifies the issue,
          # applies a fix, runs tests locally, commits if green
      - name: Create PR
        if: success()
        run: |
          git checkout -b fix/auto-heal-${{ github.run_id }}
          git add -A
          git commit -m "fix: auto-heal CI failure"
          git push origin HEAD
          gh pr create --title "fix: auto-heal CI failure" --body "Automated fix for CI failure in run ${{ github.event.workflow_run.id }}"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # gh needs auth here too
```
Reality check: This works well for dependency issues, type errors, and linting failures. It won't fix logic bugs — but it handles 40-60% of CI failures in practice.
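The `agent_fix.py` the workflow invokes isn't shown above; its first job is triage. A hedged sketch of that step (the categories and string patterns are assumptions tuned to a Python project, not a general-purpose classifier):

```python
def classify_failure(log: str) -> str:
    """Rough triage of a CI failure log into fix strategies the agent
    can attempt automatically. Anything unrecognized goes to a human."""
    patterns = {
        "dependency": ["ModuleNotFoundError", "No matching distribution",
                       "ERESOLVE"],
        "lint": ["would reformat", "ruff", "flake8"],
        "type-check": ["mypy", "error: Incompatible types"],
        "test-failure": ["FAILED", "AssertionError"],
    }
    for category, needles in patterns.items():
        if any(n in log for n in needles):
            return category
    return "needs-human"
```

Routing by category lets the agent pick a cheap deterministic fix (pin a dependency, run the formatter) before falling back to an LLM-generated patch.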
Keep docs in sync with code automatically. When a function signature changes, the agent updates the docs.
```python
# doc_sync_agent.py
import ast

def extract_functions(source: str) -> dict:
    """Extract function signatures and docstrings."""
    tree = ast.parse(source)
    functions = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = [a.arg for a in node.args.args]
            docstring = ast.get_docstring(node) or ""
            functions[node.name] = {
                "args": args,
                "docstring": docstring,
                "lineno": node.lineno
            }
    return functions

def check_doc_drift(source_file: str, doc_file: str) -> list[str]:
    """Find functions where docs don't match code."""
    with open(source_file) as f:
        source = f.read()
    with open(doc_file) as f:
        docs = f.read()
    functions = extract_functions(source)
    drifts = []
    for name, info in functions.items():
        # Check if the function is documented at all
        if name not in docs:
            drifts.append(f"MISSING: `{name}({', '.join(info['args'])})` not in docs")
            continue
        # Check if each argument appears in the docs
        for arg in info["args"]:
            if arg != "self" and arg not in docs:
                drifts.append(f"DRIFT: `{name}` param `{arg}` missing from docs")
    return drifts
```
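Detection is only half the pattern: the agent also has to propose the missing doc text. A minimal sketch of that step (the markdown layout is an assumption; match your docs' existing format):

```python
def render_doc_stub(name: str, args: list[str], docstring: str) -> str:
    """Render a markdown section for a missing or drifted function,
    ready for the agent (or a human) to drop into the docs."""
    sig = f"{name}({', '.join(a for a in args if a != 'self')})"
    body = docstring or "TODO: describe this function."
    return f"### `{sig}`\n\n{body}\n"
```

Feeding the rendered stub plus the surrounding doc file back to the model usually produces a patch that reads like the rest of the page.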
Scan your package.json or requirements.txt for vulnerabilities, outdated packages, and license issues, then fix them automatically.
```python
# dep_audit_agent.py
import json
import subprocess

def audit_python_deps() -> dict:
    """Run a full dependency audit."""
    # Check for vulnerabilities. Note: pip-audit exits non-zero when it
    # finds vulnerabilities, so parse stdout regardless of return code.
    vuln_result = subprocess.run(
        ["pip-audit", "--format", "json"],
        capture_output=True, text=True
    )
    report = json.loads(vuln_result.stdout) if vuln_result.stdout else {}
    # Flatten per-dependency vulnerability lists into one list
    vulns = [
        {"name": dep["name"], **v}
        for dep in report.get("dependencies", [])
        for v in dep.get("vulns", [])
    ]
    # Check for outdated packages
    outdated_result = subprocess.run(
        ["pip", "list", "--outdated", "--format", "json"],
        capture_output=True, text=True
    )
    outdated = json.loads(outdated_result.stdout)
    return {
        "vulnerabilities": len(vulns),
        "outdated": len(outdated),
        # pip-audit's JSON doesn't carry severity, so treat any
        # vulnerability with a known fix as actionable
        "critical": [v for v in vulns if v.get("fix_versions")],
        "update_candidates": [
            f"{p['name']}: {p['version']} → {p['latest_version']}"
            for p in outdated[:10]
        ]
    }

def auto_fix_vulnerabilities(audit: dict) -> list[str]:
    """Attempt to fix vulnerabilities by updating packages."""
    fixes = []
    for vuln in audit["critical"]:
        pkg = vuln["name"]
        fixed_version = vuln.get("fix_versions", [None])[0]
        if fixed_version:
            subprocess.run(["pip", "install", f"{pkg}>={fixed_version}"])
            fixes.append(f"Updated {pkg} to {fixed_version}")
    return fixes
```
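A guardrail worth adding before auto-applying updates: take patch and minor bumps automatically, but leave major version changes for human review. A sketch, assuming plain semver-style version strings:

```python
def is_safe_update(current: str, latest: str) -> bool:
    """Heuristic: same-major bumps are auto-applied; a major version
    change may break APIs and gets flagged for review instead."""
    return current.split(".", 1)[0] == latest.split(".", 1)[0]
```

It's crude (semver is only a convention, and some ecosystems ignore it), but it blocks the most common class of automated-update breakage.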
Automate first-pass code review. The agent checks for patterns, not just syntax.
```python
# review_agent.py
REVIEW_CHECKLIST = """
Review this PR diff for:
1. **Security**: SQL injection, hardcoded secrets, unsafe deserialization
2. **Performance**: N+1 queries, missing indexes, unbounded loops
3. **Reliability**: Missing error handling, race conditions, resource leaks
4. **Maintainability**: Functions >50 lines, magic numbers, missing types

For each issue found, provide:
- Severity: 🔴 critical / 🟡 warning / 🔵 suggestion
- File and line number
- What's wrong
- How to fix it (with code)

If the code looks good, say so. Don't invent problems.
"""

def review_pr(diff: str) -> str:
    """Run an AI code review on a diff."""
    response = call_llm(  # your LLM client
        system=REVIEW_CHECKLIST,
        user=f"Review this diff:\n```diff\n{diff}\n```"
    )
    return response
```
Pro tip: Run this as a GitHub Action on every PR. It catches 80% of the issues a human reviewer would flag, and it responds in seconds instead of hours.
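One gotcha when wiring this into CI: a large diff can blow past the model's context window. A simple guard that keeps whole per-file diffs until a character budget runs out (the 48k budget is an assumption; tune it to your model):

```python
MAX_CHARS = 48_000  # rough budget, depends on the model's context window

def truncate_diff(diff: str, limit: int = MAX_CHARS) -> str:
    """Keep whole per-file diffs until the budget is hit, so the
    review never sees half a hunk."""
    out: list[str] = []
    used = 0
    for chunk in diff.split("diff --git"):
        if not chunk:
            continue
        piece = "diff --git" + chunk
        if used + len(piece) > limit:
            out.append(f"\n[truncated: diff exceeds {limit} chars]")
            break
        out.append(piece)
        used += len(piece)
    return "".join(out)
```

Cutting at file boundaries matters: a review of half a hunk produces confident nonsense about code the model never saw.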
Old feature flags are tech debt. An agent can find flags that are 100% rolled out and remove them.
```python
# flag_cleanup_agent.py
from pathlib import Path

def find_stale_flags(codebase: str, flag_config: dict) -> list[dict]:
    """Find feature flags that are fully rolled out and can be removed."""
    stale = []
    for flag_name, config in flag_config.items():
        if config.get("rollout_percentage") != 100:
            continue
        if config.get("age_days", 0) < 14:
            continue  # Wait at least 2 weeks after full rollout
        # Find all references in code
        references = []
        for path in Path(codebase).rglob("*.py"):
            content = path.read_text()
            if flag_name in content:
                lines = [
                    (i + 1, line.strip())
                    for i, line in enumerate(content.split("\n"))
                    if flag_name in line
                ]
                references.extend([
                    {"file": str(path), "line": ln, "code": code}
                    for ln, code in lines
                ])
        if references:
            stale.append({
                "flag": flag_name,
                "rollout": "100%",
                "age_days": config["age_days"],
                "references": references
            })
    return stale
```
Every agentic workflow follows the same loop:
Analyze → Act → Verify → Fix → Repeat
The difference from traditional automation is step 4: when something goes wrong, the agent adapts. It reads the error, adjusts its approach, and tries again. That's what makes it agentic — not the AI, but the feedback loop.
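That loop is worth writing down once, because every pattern above is an instance of it. A generic skeleton (the names are illustrative, not from any library):

```python
from typing import Callable

def agent_loop(act: Callable[[str], str],
               verify: Callable[[str], tuple[bool, str]],
               task: str,
               max_iterations: int = 5) -> str:
    """Analyze → Act → Verify → Fix, repeated. `act` produces an
    artifact (code, tests, a doc patch); `verify` checks it and returns
    (ok, feedback). Failures loop back as context for the next attempt."""
    feedback = ""
    artifact = ""
    for _ in range(max_iterations):
        prompt = task if not feedback else f"{task}\nPrevious attempt failed:\n{feedback}"
        artifact = act(prompt)
        ok, feedback = verify(artifact)
        if ok:
            return artifact
    return artifact  # best effort after max_iterations
```

Swap in `call_llm` for `act` and a test runner, linter, or doc-drift check for `verify`, and you have any of the seven patterns.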
This is part of the "AI Engineering in Practice" series — practical guides for developers building with AI. Follow for more.