2026-04-22 19:01:49
In Part 1, we diagnosed the chaos.
In Part 2, we installed an MVP strategy to create traction.
Now comes the dangerous phase.
Early wins create confidence.
Confidence creates noise.
Noise can look like progress.
This is where many transformations quietly fail.
Not because they lacked action.
Because they lacked structural indicators.
You don’t measure activity.
You measure stability emerging.
Most teams track output metrics.
These are motion metrics.
Motion is not structure.
Structural health asks different questions: is risk understood quickly, are fewer critical defects escaping, is unplanned work shrinking?
If you cannot answer these clearly, the chaos is still there — just wearing a better suit.
Fewer signals. Higher meaning.
During transition, you do not need twenty KPIs.
You need the right three.
Time between:
Code committed → Risk understood
Not “defect found”.
Not “ticket closed”.
Risk understood.
Risk is understood when it is documented, discussed, and assigned a mitigation path — not when someone merely senses it.
If that window is shrinking, structure is forming.
If it fluctuates wildly, instability remains.
Not the absolute number of production defects.
The trend — normalized against change volume.
Are fewer critical defects reaching production relative to deployment frequency or change size?
If velocity increases but escape rate stabilizes or drops, the structure is strengthening.
If both rise together, the system is fragile.
What percentage of total effort goes to unplanned work instead of planned delivery?
When this ratio drops consistently, chaos is losing territory.
When it spikes unpredictably, structural debt still exists.
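As a sketch, the three KPIs above can be computed from simple event data. Everything here (function names, field names, the sprint numbers) is illustrative, not taken from any real team's dashboard:

```python
from datetime import datetime

def feedback_latency(committed: datetime, risk_understood: datetime):
    """KPI 1: time from code committed to risk understood, i.e. documented,
    discussed, and assigned a mitigation path."""
    return risk_understood - committed

def normalized_escape_rate(critical_escapes: int, deployments: int) -> float:
    """KPI 2: critical defects escaping to production, normalized against
    change volume instead of reported as a raw count."""
    return critical_escapes / max(deployments, 1)

def unplanned_work_ratio(unplanned_hours: float, total_hours: float) -> float:
    """KPI 3: share of effort consumed by firefighting and rework."""
    return unplanned_hours / total_hours

# Same defect count, very different structural health:
sprint_a = normalized_escape_rate(critical_escapes=4, deployments=10)  # 0.4
sprint_b = normalized_escape_rate(critical_escapes=4, deployments=40)  # 0.1
```

Sprint B shipped four times as often with the same number of escapes; by the raw count the sprints look identical, which is exactly why the trend must be normalized.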
Stability is visible before it is celebrated.
Most organizations measure lagging pain, the damage already done.
These are autopsy metrics.
They tell you what already broke.
Transition KPIs are leading indicators.
They tell you whether structural pressure is building before failure occurs.
A team once showed me their dashboard.
Defect detection was up.
Automation coverage was rising.
Everything looked healthy.
Two weeks later, a critical production failure.
What did the dashboard miss?
It measured output, not stability.
It tracked activity, not structural integrity.
Feedback latency was high.
Unplanned work was spiking.
The metrics that mattered were not on the dashboard.
If you wait for production to confirm improvement, you are steering by impact.
Not by direction.
When structure begins to take hold, the most noticeable change is what stops happening.
Panic stops.
Blame stops.
Heroics stop.
Planning conversations become shorter.
Risk discussions become sharper.
“Who tests this?” disappears.
Surprises become rare instead of routine.
The system becomes — boring. Boring is the goal.
Structure reduces drama.
Order is not louder. It is calmer.
Once KPIs exist, they will be gamed.
People optimize what is measured.
Measure automation coverage alone — you get shallow automation.
Measure defect count alone — you get underreporting.
Structural KPIs must be resistant to easy optimization.
If they can be gamed easily, they will become performance theater.
And chaos will return quietly.
How do you know the transition is stabilizing?
No announcement.
No ceremony.
Just fewer surprises under pressure.
Transformation is not complete when a strategy exists.
It is complete when the system behaves predictably under stress.
Transition KPIs are not about dashboards.
They are about confidence earned slowly.
Measure correctly, and chaos becomes visible.
And what is visible can be reduced.
Structural health is not self-sustaining.
Left alone, systems drift. Incentives shift. Deadlines compress. Shortcuts return.
In the next post, we move from measuring structure to protecting it.
Because what you don’t govern slowly erodes.
📚 Series Navigator: From Chaos to Structure — Series Overview
1️⃣ Diagnosing Chaos & Defining the Target Model
Understand the invisible disorder. See what’s broken before you fix it.
2️⃣ MVP Test Strategy: First 30 Days
Small, immediate actions to start taming chaos — without waiting for perfect conditions.
3️⃣ Transition KPIs: Measuring Structural Health
How to know if the new test structure is actually working — before a major defect appears.
4️⃣ Stakeholder Alignment & Feasibility
Building buy-in and negotiating constraints with the team and leadership.
5️⃣ Economic Impact: Cost of Non-Structure
Translate structured testing into predictable outcomes and business value.
✨ If you see these patterns in your projects, share your experience below — or connect with me to discuss ways to bring structure and predictability to software quality.
© 2026 Abdul Osman. All rights reserved. You are welcome to share the link to this article on social media or other platforms. However, reproducing the full text or republishing it elsewhere without permission is prohibited.
2026-04-22 19:01:07
I've been sketching an architecture for agent platforms for a while, and the pieces are starting to lock together. Time to get it out of my head.
Two reasons for writing.
First, selfish: good ideas rot in Slack DMs and half-finished notes. Writing forces the system to hold. If I can explain it, it's real.
Second, more honest: a lot of this is about to be obvious. The LLM gateway/proxy space has exploded. Kong, LiteLLM, Portkey, Gravitee, Harness, everyone's building some version of the control plane. My specific combination feels new, but "new" has a shelf life measured in weeks right now. Flag planted.
This is a build log, not a product announcement. I'm running a working version at workspace level with one agent end to end. Policy, router, RAG, tracing, CLI hub, all wired up. What's next is scaling the same pattern up to repo and company level. Some of what I say here is lived experience. Some is where I'm aiming.
This is the question that wouldn't let me go.
Everyone builds agents. Almost nobody talks about where they live. You watch a demo, someone runs a command, the agent does a thing, the demo ends, the agent evaporates. Cool. Now do that for a hundred engineers across twenty repos for six months.
The moment you try, you hit the questions nobody answers.
Nobody answers these because the answers commit you to a shape for the agent's world. Most frameworks punt. They give you the brain and hand you the body as an exercise. So I started designing the body.
The shape I landed on: agents live in workspaces. A workspace isn't a folder or a container or a session. It's a persistent, identity-bearing thing. It has a policy. It has inherited credentials. It attaches to repos. It has its own trace. When the agent stops, the workspace stays. When it resumes, everything is still there.
And a workspace is also a place a human can walk into and take over.
Every enterprise AI deck has a slide that says "human in the loop" with a stick figure approving things. That framing is backwards, and it's why so many rollouts stall.
Human-in-the-loop treats the human as a supervisor. An approval checkpoint. But the real thing is that humans and agents are doing the same job, and they need to do it in the same place, with the same tools, policies, and traces.
If I design the workspace so a human can do everything the agent does (same CLI tools, same services, same tokens, same trace), the workspace stops being a cage and becomes a shared workbench. A human-AI workshop, not a supervision hierarchy.
The industry has converged hard on "skills." Every framework ships with them. I'm not building them, and I want to say why, because it's the core of the bet.
A skill, as the industry defines it, is a prompt the agent judges whether to invoke. You write a structured markdown file, the agent scans available skills, and decides on judgment whether it applies. When it picks right, it feels magical. When it doesn't, you're debugging vibes.
Fine for exploratory work. Not fine for what I'm building.
I'm building a deterministic system that AI drives. The AI will make mistakes. That's a given, not a bug. The job of the platform is to contain those mistakes inside flows that are solid and traceable. The flow doesn't get to happen or not happen based on the agent's mood. The flow is the thing.
This is why the CLI Hub as a golden path matters so much. Every tool usage flows through a deterministic, traced, policy-enforced rail. I'm not hoping the agent picks the right skill. I'm constraining the ground it walks on. In my system the skill isn't a prompt. The skill is the system. The CLI. The policy cascade. The workflow. The router.
The industry's skill is a suggestion the agent considers. Mine is a rail the agent runs on.
This isn't anti-AI. The more the platform handles determinism, the more the AI can be creative inside it. Ground the agent, then let it run. Don't ask it to ground itself.
I used to think tracing was an observability concern. Something to add later, like metrics. I was wrong.
Tracing is the foundation because agents generate trust deficits faster than logging can clear them. A human who made a weird commit, you ask them in standup. An agent that made a weird commit at 3am using an inherited Jira token to move a ticket, the only way anyone understands what happened is if the trace is good enough to reconstruct it. The policy that was resolved. The credential that was used. The workflow step. The repo. The model.
Most gateway products trace the LLM call and stop. That's useless. The interesting question isn't "which model was called," it's "which workspace used which inherited credential at which step of which workflow against which repo." If your trace answers that sentence, you have governance. If it doesn't, you have a dashboard.
Here's what a trace looks like in the system I'm running today:
```
trace_id: 7f9c2a...
workspace: ws-backend-refactor-01
agent: claude-opus-4.7
operator: agent # or "human" when I step in
repo: inventory-service (policy layer: repo)
policy_source:
  model_allowlist: company.default
  jira_token: company.default
  rag_scope: repo.override ← scoped to this repo
  workflow_max_steps: workspace.override
span: workflow.refactor_auth
  span: step.analyze_imports
    span: router.resolve_policy (2ms)
    span: rag.retrieve (docs: shared + repo, hit=4) (180ms)
    span: llm.call (model: sonnet-4.6, tokens: 2.1k) (1.2s)
  span: step.apply_changes
    span: router.resolve_policy (1ms)
    span: tool.git_commit (via cli-hub) (80ms)
    span: service.jira.update (token: company.default) (220ms)
      ↑ this is attributable to ws-backend-refactor-01
        even though the token is company-level
```
The key detail is the policy_source block. For every value this workspace resolved, I can see which level it came from. Model allowlist: company default. Jira token: company default. RAG scope: overridden at the repo level. That's not a log line. That's a governance story.
When I scale this to company level, that block gets more interesting. Right now every value resolves from company or workspace, because those are the two levels I've wired up. Once repos and workspace overrides cascade properly, this same trace starts showing "three of these values came from company, one was overridden by the repo, two were overridden by the workspace." Audit becomes a read, not a forensic archaeology.
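In pseudocode terms, the cascade is a last-writer-wins merge that remembers which layer supplied each final value. This is a minimal illustrative sketch of the idea, not the production code; all names and values are made up:

```python
# Hypothetical policy layers. More specific levels override earlier ones.
COMPANY = {"model_allowlist": ["opus", "sonnet"], "jira_token": "company.default"}
REPO = {"rag_scope": "repo-docs-only"}
WORKSPACE = {"workflow_max_steps": 25}

def resolve_policy(*layers):
    """Merge policy layers in order (company -> repo -> workspace),
    recording which layer supplied each resolved value."""
    resolved, source = {}, {}
    for name, layer in layers:
        for key, value in layer.items():
            resolved[key] = value
            source[key] = name  # last writer wins = most specific level
    return resolved, source

policy, policy_source = resolve_policy(
    ("company", COMPANY), ("repo", REPO), ("workspace", WORKSPACE)
)
# policy_source plays the role of the trace's policy_source block:
# it answers "which level did this value come from" for every key.
```

The point of keeping `source` alongside `resolved` is exactly the audit property described above: the effective policy and its provenance are produced by the same merge, so they can never drift apart.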
Most LLM platforms are organized around "which model do we call." Wrong primitive.
The primitive is "what knowledge does this request have access to." The model is a downstream detail.
When a request comes in, the first question isn't GPT or Claude. It's: should this pull context from shared docs? From this repo's docs? From the workspace's scratch space? Should the response be re-grounded against docs on the way out? What's this agent allowed to know, and what are we obligated to cite?
Answer those first, then lower the request into the model. If you build from the LLM up, RAG becomes a clumsy bolt-on. If you build from the knowledge boundaries down, RAG is the layer, and the LLM is just the execution engine.
The router doesn't "route to models." It resolves a policy, applies RAG on input, picks a model, applies RAG on output, and dispatches to external services if needed. The LLM is one of several things that happen inside the resolved policy, not the center of it.
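That order of operations can be sketched minimally. Every function body here is a stand-in stub, not the real router; only the sequence matters:

```python
# Illustrative sketch: the LLM call is one step inside a resolved
# policy, not the center of the request path.

def resolve(workspace):
    # Stand-in for the policy cascade; returns the effective policy.
    return {"rag_scope": "repo", "model": "sonnet-4.6"}

def rag_retrieve(request, policy):
    # RAG on the way in, scoped by policy.
    return f"[docs scoped to {policy['rag_scope']}]"

def llm_call(request, context, policy):
    # Model is picked from the policy, not hardcoded by the caller.
    return f"{policy['model']} answer using {context}"

def rag_ground(draft, policy):
    # RAG on the way out: re-ground the response against docs.
    return draft + " [re-grounded]"

def handle_request(request, workspace):
    policy = resolve(workspace)                  # 1. resolve the policy
    context = rag_retrieve(request, policy)      # 2. RAG on input
    draft = llm_call(request, context, policy)   # 3. model inside the policy
    return rag_ground(draft, policy)             # 4. RAG on output
```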
Compress the whole system to three verbs.
Get them right and everything else is tractable. Get any one wrong and the system is insecure, opaque, or hallucinating. Pick your poison.
Most platforms nail one. A few nail two. I haven't seen one that nails all three in a coherent way. That's the gap I'm building into.
The speculative piece I can't shake.
Imagine workspaces don't just sit idle between tasks. They wake up on their own cadence, call it a standup, and talk to each other about open issues. Each workspace has an energy budget proportional to how much meaningful work it did last cycle. High-signal work earns energy. Churn doesn't. They use that energy to flag blockers, propose collaborations, ask for help from workspaces with knowledge they don't have.
The honest framing: this is a social test for the agents. Do they self-organize usefully, or just generate noise and cost? Do productive workspaces emerge as hubs? Does the hivemind surface blockers humans missed, or amplify them?
I don't know. That's why I want to build it. A fleet of agents that can't cooperate without human orchestration is a fleet of expensive Mechanical Turks. A fleet that can is something else.
Where I am right now: one workspace, full stack, running. Next is getting repo and company policy layers to cascade the way the design says they should. That's the proof point. The traces above are real in shape, simplified in detail. When repo and company are in, the policy_source block starts telling a richer story, and I'll write that up too.
If you're building something that rhymes with this, I want to talk. If you're shipping it already and I should just use yours, I also want to talk.
Writing it down was step one.
2026-04-22 19:00:40
Anthropic pulled Claude Code from the $20/month plan. It now starts at $100/month. But most people don't know you can use Claude Code with MiniMax, Qwen, Kimi, and others for a fraction of the price. I've been running Claude Code + MiniMax at $40/month for weeks without running out of tokens.
On April 21, 2026, anyone who opened Anthropic's pricing page saw something different: Claude Code is no longer included in the $20/month Pro plan. It shows up with a red X. It's only available starting with the Max 5x plan, which costs $100/month.
There was no announcement. No blog post. The support page changed its title from "Using Claude Code with your Pro or Max plan" to "Using Claude Code with your Max plan." One word fewer. One access tier gone.
When developers started complaining, an Anthropic spokesperson said it was "a test on 2% of new signups." But in practice the public pages are already updated as if it were a global change.
And note: this isn't Anthropic's first move in that direction. On April 4, 2026, they blocked Pro and Max subscriptions from working with third-party tools like OpenClaw. If you wanted to use Claude with an agent, you now had to pay for API access separately, and doing that with Sonnet at consumer prices comes out to around $3,000/month under heavy use.
In parallel, back in March they had already cut Claude Code usage limits retroactively. Where you could once use it comfortably all day, now it cut you off mid-session.
Bottom line: Claude Code went from a $20/month tool any developer could afford to a product you either pay premium prices for ($100-$200/month) or assemble yourself with an API key and alternatives that cost a fraction.
Claude Code is a command-line tool from Anthropic that lets you code with AI straight from your terminal. It's not autocomplete: it understands your whole codebase, can edit multiple files, run tests, debug, and in general act as a pair programmer that never gets tired.
In less technical terms: you tell it "I need an endpoint that does X" and Claude Code understands your codebase, figures out where it goes, writes the code, runs the tests, and tells you whether they passed. It's not Copilot suggesting lines; it's an agent that executes complete tasks.
The plot twist: Claude Code isn't tied exclusively to Anthropic's models. As an open source tool, it can connect to other providers. And that's where the interesting part begins.
Before you ask: yes, I still use Claude Code. What changed is the model running behind it.
In my current setup:
Claude Code as the interface + MiniMax M2.7-highspeed as the model, through OpenRouter. The subscription costs me $40/month (Plus HS plan: 300 prompts every 5 hours, ~100 TPS).
The result: I don't run out of tokens. I use Claude Code as always (terminal, VS Code, JetBrains), but the model answering is MiniMax M2.7-highspeed instead of Anthropic's Opus or Sonnet.
And for 90% of what I need (automations, scripting, debugging n8n workflows, new features), the difference from Opus is marginal. Where MiniMax loses points is "personality": it generates functional, correct code, but it doesn't explain design decisions with the technical eloquence Opus has.
Real cost: $40/month with the Plus HS plan. Unlimited tokens, 300 prompts every 5 hours.
Models: M2.7 (~50 TPS standard) and M2.7-highspeed (~100 TPS).
Context window: 200K tokens.
Compatible tools: Claude Code, Roo Code, Kilo Code, Cline, Codex CLI, OpenCode, Cursor, Trae, Grok CLI, and more.
Quality in real testing: over 2 months of real use:
Bugs found: 6/6 in a 2,000-line legacy project
Security vulnerabilities: 10/10 detected
Fixes applied correctly: 8/10 (the remaining 2 needed business context the model couldn't infer)
Compared with Opus 4.6 on the same tasks, the difference in output quality is marginal.
The good: Unbeatable price for what it offers. 100 TPS is genuinely fast. It supports more than 10 coding tools. With 200K tokens of context, you can throw an entire project at it in one go.
Real cost: Free. 1,000 requests per day with Gemini 2.5 Pro.
Context window: 1M tokens, the largest available on any free tier.
Speed: Fast.
Quality: Not at the level of Opus 4.6 or MiniMax M2.7 on complex code, but for quick prototyping, simple debugging, and boilerplate generation it's more than enough.
Setup:
## Install
npm install -g @anthropic-ai/claude-code
## Wait, that's Claude Code...
## Install Gemini CLI:
npx @google/gemini-cli
Verdict: The best entry point. If you're starting from zero, try Gemini CLI for free before paying for anything.
Plan: Lite ($10/month, 1,200 requests / 5h) up to Pro ($50/month, 6,000 requests / 5h).
Models: Qwen3.5-Plus, Qwen3-Coder, and others (multi-model).
The good: You can switch between 6+ models within the same plan. Qwen3-Coder is competitive on SWE-bench. The Lite plan gives you far more requests than the competition.
The bad: The Chinese models have limited English documentation. And latency from Latin America can be a factor.
Plan: Part of the Kimi/Kimi CLI ecosystem, also compatible with Claude Code and Roo Code. Context: 256K tokens, the widest in this category (among coding models).
Models: Kimi K2.6 Instruct.
Speed: High on instruction-following tasks.
Quality: A leader at following complex coding specifications. On the HumanEval benchmark, consistently above 90%.
The good: The 256K context window means you can throw an entire project at it in one go and it will understand it without splitting into chunks.
Setup: Via Claude Code (with an alternative provider), Roo Code, or Kimi CLI.
Real cost: $0.27 per million input tokens, $1.07 per million output. For a medium project (~500K tokens total), less than $1.
Context: 128K tokens.
Quality: Dominant at mathematical reasoning and pure code. On benchmarks like AIME (competition math) it scores ~79.8%. On SWE-bench (real engineering tasks) it's in the top 5 among open models.
Where it shines: Heavy debugging and refactors that require multi-step reasoning.
Where it loses: When you need it to understand business context or a large architecture; that's not its strength.
Setup: Via Aider, OpenCode, or the CLI directly.
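A quick sanity check of the per-project cost claim, using the listed rates. The 80/20 input/output split is an assumption for illustration, not a figure from the article:

```python
# DeepSeek's listed rates, converted to dollars per token.
INPUT_RATE = 0.27 / 1_000_000   # $0.27 per million input tokens
OUTPUT_RATE = 1.07 / 1_000_000  # $1.07 per million output tokens

def project_cost(total_tokens: int, input_share: float = 0.8) -> float:
    """Estimated API cost for a project, assuming most tokens are input."""
    input_toks = total_tokens * input_share
    output_toks = total_tokens * (1 - input_share)
    return input_toks * INPUT_RATE + output_toks * OUTPUT_RATE

cost = project_cost(500_000)  # ~$0.22 for a ~500K-token project
```

Even if the split skews heavily toward output, a 500K-token project stays well under the $1 figure quoted above.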
Plan: Max 5x ($100/month, 5× the Pro quotas) and Max 20x ($200/month, 20× the Pro quotas).
Models: Opus 4.6 (top of the line), Sonnet 4.6, Haiku 4.5.
Compatible tools: Terminal CLI, VS Code, JetBrains, Web, Desktop App, Slack.
The reality: It's still the gold standard. Opus 4.6 is the best coding model available today. But $100-$200/month is 5-10× more than the alternatives. And if what you need is an agent that produces working code, the difference between Opus and an M2.7-highspeed may not justify the price jump for your use case.
The trick isn't just switching models; it's using tools that accept any model:
The multi-model champion. It works with MiniMax, GLM, Qwen, Kimi, DeepSeek, and more. It's a fork of Cline optimized for agents. If you want maximum compatibility with alternative models, this is your tool.
Open source, $0. You only pay the API cost of whatever model you connect. 64K tokens of context by default (configurable). As a CLI wrapper, Aider doesn't generate code itself; it connects to whatever model you have. The real advantage is that it's local, open source, and has no vendor lock-in. You can hook it up to any model.
Setup:
pip install aider-chat
aider --model deepseek/deepseek-coder-2.0 --api-key your_key
Open source, $0. 128K tokens of context. In community tests, speed and day-to-day quality comparable to Claude Code. The most complete open source alternative to Claude Code.
Setup:
npm install -g opencode
opencode --provider minimax
It's not a tool but another model, though it's worth mentioning: GLM at $3/month (Lite plan, ~80 prompts / 5h). It's the cheapest coding plan on the market. It includes free MCP tools (web search, vision). It supports more than 20 tools. For trying things without spending, there's nothing cheaper.
| Tool / Model | Cost | Context | Best For |
|---|---|---|---|
| MiniMax M2.7 HS | $40/mo unlimited | 200K | Main daily coding |
| DeepSeek Coder V2 | $0.27/1M tokens in | 128K | Debugging, reasoning |
| Kimi K2.6 | $1.10/1M tokens in | 256K | Large codebases |
| Gemini CLI | Free | 1M | Prototyping, first attempt |
| Aider | $0 + API | 64K+ | Full control, no vendor lock-in |
| OpenCode | $0 + API | 128K | Complete open source |
| GLM 5 | $3-$49/mo | Variable | Trying without spending |
| Qwen 3.5 | $10-$50/mo | Multi-model | Switching between 6+ models |
| Claude Code (Anthropic) | $100-$200/mo | 1M+ | Gold standard, maximum quality |
After weeks of testing these combinations, my actual setup is this:
Main (daily): Claude Code + MiniMax M2.7-highspeed at $40/month
Secondary (reasoning): DeepSeek Coder V2 via API
Prototyping (free): Gemini CLI
This stack runs me ~$50/month total (the MiniMax subscription plus occasional DeepSeek API use). Before, I was paying $100/month just for Claude Code Max. Now I use Claude Code as the interface with MiniMax's model, and the rest of the tools as complements for specific cases.
1. Expecting the new model to think like Opus.
It won't. Every model has a distinct technical personality. The real shift is adjusting expectations: generate more code on the first attempt, iterate less. If you need the model to explain its decisions eloquently, it will be less detailed than Opus.
2. Not calibrating the system prompt.
Every model responds differently to the same set of instructions. Take the time to tune your system prompt for each model; don't copy and paste the one that worked with Claude.
3. Not taking advantage of the maximum context.
With MiniMax's 200K tokens or Kimi's 256K, throw the whole project at it and let it read. Don't split it into parts unless you have to. Large context is the most underused advantage.
4. Sticking with the free option when it isn't enough.
If your productivity rises 20-30% with the paid model, it's probably worth it. Do the math: if it saves you 5 hours of work a week, $40/month pays for itself on the first day.
The subscription arbitrage is over. Anthropic realized developers were using $20-200 plans to run agents that burned hundreds of dollars in tokens. So they cut it off.
It's not unfair; it's business. But it does mean the era of "I pay $20 and get an unlimited agent" is over.
The good news is that the Chinese competition (MiniMax, DeepSeek, Kimi, Qwen) is fighting aggressively on price. And the plans they offer are real: not trials or promos, but monthly subscriptions with clear quotas.
The Chinese alternatives aren't "the cheap option for developers who can't pay." They're legitimate options that deliver 80-92% of the result for 10-20% of the price. And in many day-to-day cases, that 8-20% difference is imperceptible in practice.
The most expensive mistake isn't paying $40/month for MiniMax. It's continuing to pay $100/month for Claude Code Max when you don't need it.
Start with Gemini CLI (free). 1,000 requests/day with 1M tokens of context. If you don't need more, don't pay.
If you need more power, jump to MiniMax Plus HS ($40/month): 300 prompts every 5 hours at high speed is more than enough for an individual developer. Use Claude Code as the interface, connected to MiniMax.
Try Roo Code as your tool: it supports MiniMax, GLM, Qwen, Kimi, and DeepSeek, and you can switch without reconfiguring everything.
Add DeepSeek for heavy debugging: $0.27/1M tokens is absurdly cheap for problems that require multi-step reasoning.
Don't marry one model: the advantage of this ecosystem is that you can switch. Use Qwen when MiniMax falls short, GLM when you need free MCP tools, Kimi when you have a big project to analyze in one pass.
None of this came from a paper. I learned it by operating: running an environment with two Hostinger servers (dev and prod), a dedicated server at Hetzner, dozens of n8n automations, and an ecosystem that runs 24/7.
If you want to dig into this world of AI applied to business for real (not the hype, the actual day-to-day), join my entrepreneur community at Cágala, Aprende, Repite, where we share what works, what doesn't, and help each other avoid repeating the same mistakes.
Sources: Anthropic pricing pages (Apr 21, 2026), Pasquale Pillitteri — Claude Code Removed from Pro Plan, AI Coding Plan Comparison 2026, Dev.to — Every AI Coding CLI in 2026, Simon Willison — Claude Code Pricing Confusion, SSDNodes — Claude Code Pricing 2026, MorphLLM — Claude Code Alternatives 2026, Reddit — MiniMax M2.7 vs Opus 4.6, BenchLM — DeepSeek vs Kimi, KDnuggets — Top 5 Agentic CLI Coding Tools.
The post Claude Code Ya No Viene en tu Suscripcion de $20/mes — Alternativas por Menos de $50 appeared first on Cristian Tala Sánchez.
This article was originally published on cristiantala.com. If you're interested in entrepreneurship, AI, and automation, join the Cagala, Aprende, Repite community for free.
2026-04-22 19:00:30
A service level objective (SLO) is a measurable reliability target for a service over a specific time window—like "99.9% of requests complete in under 200ms over 30 days." SLOs turn vague notions of "good performance" into concrete numbers that engineering teams can track, test against, and use to make release decisions.
This guide covers how SLOs relate to SLIs and SLAs, how to define effective targets for your applications, and how to validate SLO compliance through load testing before performance problems reach production.
A service level objective (SLO) is a measurable target for how reliably a service performs over a specific time window. It defines what "good performance" actually looks like in concrete, trackable terms. For example, "99% of API requests complete in under 200ms over a rolling 30-day period" is an SLO.
Without SLOs, performance conversations tend to go in circles. One person says the app feels slow, another disagrees, and nobody has data to settle the argument. SLOs fix that problem by giving everyone the same yardstick.
Every SLO has three parts: a service level indicator to measure, a target threshold, and a time window.
You'll see SLO, SLI, and SLA used together constantly. They're related but serve different purposes, and mixing them up creates confusion fast.
**SLI vs SLO vs SLA**
| Term | What it is | Who uses it | Example |
|---|---|---|---|
| SLI | Raw measurement | Engineers | Request latency in milliseconds |
| SLO | Internal target | Engineering teams | 99% of requests under 200 ms |
| SLA | External contract | Business and customers | 99.9% uptime or credit issued |
A service level indicator (SLI) is the raw metric that captures how your service actually behaves. It's the number itself: request latency in milliseconds, error count per minute, or uptime percentage over the last hour. These are all common performance testing metrics that feed into your SLO targets.
Think of SLIs as the speedometer reading. SLOs are the speed limit. SLIs tell you what's happening right now. SLOs tell you whether that's acceptable.
A service level agreement (SLA) is a contract between a service provider and its customers. SLAs typically include financial consequences for missing targets, like credits or refunds if uptime drops below a promised threshold.
The key difference: SLAs are external promises you make to customers. SLOs are internal targets that help you keep those promises before they become contractual problems.
The relationship flows in one direction. You measure an SLI, compare it against your SLO target, and use that data to ensure you're meeting your SLA commitments. SLIs feed SLOs, and SLOs inform SLAs.
An error budget is the amount of unreliability your service can experience before breaching an SLO. If your SLO targets 99.9% availability, your error budget is the remaining 0.1%. That works out to roughly 43 minutes of downtime per month.
Error budgets reframe reliability as a resource you can spend. Want to ship a risky feature? Go ahead, as long as you have budget left. Running low on budget? Time to slow down and stabilize.
In practice, teams track how much budget each incident or degradation consumes and adjust release pace as the remainder shrinks.
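The budget arithmetic is simple enough to sketch directly. The function names below are illustrative, not from any particular SRE toolkit:

```python
# Sketch of error-budget arithmetic; function names are illustrative.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means breached)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% availability SLO leaves roughly 43 minutes of downtime per month.
print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_remaining(0.999, 10.0), 2))  # 0.77
```

Ten minutes of downtime against that budget still leaves about three quarters of it unspent, which is the kind of signal teams use to decide whether a risky release is affordable.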
SLOs aren't just for monitoring production systems. They're equally valuable during load testing, where they help you catch problems before users ever see them.
When you define SLO-based assertions in your load tests, you detect degradation during development through early performance testing. A test that passed last week but fails this week signals a regression worth investigating immediately.
Gatling's performance assertions let you define thresholds directly in your test code. Violations surface as soon as the test runs, not after a customer complaint.
SLOs give developers, QA engineers, and operations teams a common language. Instead of debating whether "the app feels slow," everyone references the same objective targets. That shared understanding reduces friction and speeds up decision-making.
SLO compliance provides objective go/no-go criteria for deployments. Did the load test meet all SLO targets? Ship it. Did latency breach the threshold? Investigate first. No more gut feelings or heated debates in release meetings.
SLOs become automated pass/fail criteria in continuous integration. A pipeline that blocks releases when SLOs are breached prevents performance problems from reaching production. You catch issues early, when they're cheaper to fix.
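As a hedged sketch of such a gate (the metric names and thresholds below are invented for illustration, not Gatling's API), a pipeline step might evaluate a run's results like this:

```python
# Hypothetical CI gate: fail the pipeline when any SLO is breached.
# Metric names and thresholds are illustrative.

SLOS = {
    "p95_latency_ms": ("max", 300),   # 95th-percentile latency under 300 ms
    "error_rate_pct": ("max", 0.5),   # fewer than 0.5% of requests may fail
    "throughput_rps": ("min", 1000),  # sustain at least 1,000 requests/second
}

def evaluate_slos(results: dict) -> list[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    violations = []
    for metric, (kind, threshold) in SLOS.items():
        value = results[metric]
        if kind == "max" and value > threshold:
            violations.append(f"{metric}={value} exceeds {threshold}")
        elif kind == "min" and value < threshold:
            violations.append(f"{metric}={value} below {threshold}")
    return violations

run = {"p95_latency_ms": 412, "error_rate_pct": 0.2, "throughput_rps": 1150}
print(evaluate_slos(run))  # ['p95_latency_ms=412 exceeds 300'] -> block release
```

A CI job would simply exit non-zero when the returned list is non-empty, blocking the deployment until the regression is investigated.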
SLOs vary depending on what aspect of performance matters most for your application. Here are concrete examples for common scenarios.
"95% of checkout API requests complete in under 300ms."
Latency SLOs directly impact user experience. Slow responses frustrate users — 53% abandon sites loading over 3 seconds — especially for interactive features like search or checkout where every millisecond counts.
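A claim like this can be checked directly against raw request timings, which are the SLI data. A minimal sketch, with invented sample values:

```python
# Check a latency SLO ("95% of requests complete in under 300 ms")
# against raw request timings. Sample values are illustrative.

def percent_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Share of requests that finished within the threshold, as a percentage."""
    within = sum(1 for t in latencies_ms if t < threshold_ms)
    return 100.0 * within / len(latencies_ms)

samples = [120, 180, 95, 250, 310, 140, 275, 199, 430, 160]
compliance = percent_within(samples, 300)
print(compliance)  # 80.0 -> below the 95% target, so the SLO is breached
```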
"The system handles at least 1,000 requests per second under peak load."
Throughput targets matter when you expect traffic spikes. Black Friday sales, product launches, or viral moments all require systems that can handle sudden surges.
"Fewer than 0.5% of requests return 5xx errors."
Error rate SLOs set a ceiling on acceptable failures. Even a small percentage of errors erodes user trust over time, so tracking this metric helps maintain reliability.
"The service maintains 99.95% availability during load tests."
Availability SLOs ensure your system stays up under stress testing conditions. For services where downtime can cost over $300,000 per hour, availability is often the most critical metric to track.
Creating effective SLOs involves more than picking arbitrary numbers. The process starts with understanding what actually matters to your users.
Start with user-facing outcomes: page load speed, transaction success, checkout completion. Don't try to measure everything. Focus on the interactions that impact experience most directly.
Select SLIs that reflect user experience and that you can actually collect from your monitoring or testing tools. Vague metrics lead to vague SLOs, which lead to arguments about what "good" means.
Base targets on historical performance data and business requirements, not aspirational ideals. Starting conservative and tightening over time works better than setting aggressive targets you'll never hit.
Define what happens when the error budget runs low. Some teams slow down releases. Others trigger incident response. The specific action matters less than having a clear policy everyone follows.
Store SLO definitions in version control alongside your test code. Share them with stakeholders so everyone understands the targets and the reasoning behind them.
Gatling Enterprise Edition lets you define SLOs directly in the UI without touching test code. Each SLO has three components:
The evaluation model is the key difference. Unlike a single end-of-test assertion, Gatling SLOs evaluate compliance continuously throughout the run, then report what percentage of seconds met the threshold. Results appear as color-coded gauges: green for ≥99% compliance, orange for 90–99%, and red for anything below 90%.
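The per-second compliance model reduces to a small computation. In this sketch the color bands mirror the ones described above, while the data and function name are invented:

```python
# Per-second SLO compliance: the share of seconds that met the threshold,
# mapped to a gauge color. Data and function name are illustrative.

def compliance_gauge(per_second_p95_ms: list[float],
                     threshold_ms: float) -> tuple[float, str]:
    ok = sum(1 for v in per_second_p95_ms if v <= threshold_ms)
    pct = 100.0 * ok / len(per_second_p95_ms)
    color = "green" if pct >= 99 else "orange" if pct >= 90 else "red"
    return pct, color

# 20 seconds of p95 latency samples; one second breaches a 300 ms threshold.
seconds = [210] * 19 + [450]
print(compliance_gauge(seconds, 300))  # (95.0, 'orange')
```

Note how a single bad second out of twenty already drops the gauge out of the green band, which is exactly why continuous evaluation catches transient degradation that an end-of-test average would hide.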
A few configuration details worth knowing:
Implementing SLOs effectively takes some discipline. Here's what works well for most teams.
Too many objectives dilute focus. Begin with the most critical user journeys and expand later once you've built confidence in the process.
Technical targets work best when they map to actual business outcomes. A latency SLO tied to conversion rates, which drop 4.42% per additional second of load time, carries more weight than one chosen arbitrarily.
Treat SLOs as code using a test-as-code approach. Store them in your repository so changes are tracked, reviewable, and tied to specific releases. This creates accountability and makes it easy to see how targets evolved over time.
Manual result checking doesn't scale. Configure automated load testing to evaluate SLO compliance on every run and fail tests when thresholds are breached. Gatling supports this through performance assertions that integrate directly into your test scripts.
SLOs aren't static. Revisit them as your application evolves, user expectations shift, or infrastructure changes. What worked six months ago might not reflect current reality.
Connecting SLO concepts to actual load test execution requires a clear workflow. Here's how the pieces fit together.
Translate your SLO targets into test assertions. For example, assert that p95 latency stays below your SLO threshold throughout the entire test run. This turns abstract targets into concrete pass/fail criteria.
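A minimal sketch of such an assertion, using the nearest-rank percentile method on invented sample data:

```python
# Assert p95 latency against an SLO threshold (nearest-rank percentile).
# Sample data is illustrative.
import math

def p95(latencies_ms: list[float]) -> float:
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method, 1-based
    return ordered[rank - 1]

SLO_P95_MS = 300
samples = [100 + 10 * i for i in range(20)]  # 100..290 ms
assert p95(samples) <= SLO_P95_MS, "SLO breached: p95 latency too high"
print(p95(samples))  # 280
```

In a real load test the same comparison would run against the tool's recorded percentiles rather than a hand-built list, but the pass/fail logic is identical.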
Use realistic user journeys and injection profiles that mirror production load. SLO validation is only meaningful if the test reflects how users actually behave. A test with artificial traffic patterns won't tell you much about real-world performance.
Configure CI/CD pipelines to treat SLO violations as test failures. This blocks deployment until issues are resolved, preventing performance problems from reaching users.
Monitor SLO trends over time to detect gradual degradation. Comparing test runs across releases reveals regressions that single-run analysis might miss. Gatling's analytics dashboards make this comparison straightforward.
Gatling operationalizes SLO-based performance testing through performance assertions in code, CI/CD integration, and regression detection in Insight Analytics. Teams define SLO thresholds directly in test scripts, automate validation in every pipeline run, and track compliance trends across releases.
Request a demo to see how Gatling helps engineering teams validate SLOs before performance issues reach production.
2026-04-22 19:00:00
Something peculiar happened when software development teams started delegating code generation to AI assistants. The traditional burden of implementation, that painstaking process of translating designs into working software, began shifting elsewhere. But it did not disappear. Instead, it transformed into something altogether different: an intensified requirement for architectural rigour that many teams were unprepared to provide.
In early 2025, a randomised controlled trial conducted by METR examined how AI tools affect the productivity of experienced open-source developers. Sixteen developers with moderate AI experience completed 246 tasks in mature projects on which they had an average of five years of prior experience. Each task was randomly assigned to allow or disallow usage of early 2025 AI tools. The finding shocked the industry: developers using AI tools took 19% longer to complete tasks than those working without them. Before starting, developers had forecast that AI would reduce their completion time by 24%. Even after finishing the study, participants still believed AI had made them faster, despite the data proving otherwise.
This perception gap reveals something fundamental about the current state of AI-assisted development. The tools are genuinely powerful, but their power comes with hidden costs that manifest as architectural drift, context exhaustion, and what practitioners have come to call the “zig-zag problem”: the iterative back-and-forth that emerges when teams dive into implementation without sufficient upfront specification.
The scale of AI adoption in software development has been nothing short of revolutionary. By March 2025, Y Combinator reported that 25% of startups in its Winter 2025 batch had codebases that were 95% AI-generated. These were not weekend projects built by hobbyists. These were venture-backed companies building production systems, with the cohort growing 10% per week in aggregate, making it the fastest-growing batch in YC history.
As CEO Garry Tan explained, the implications were profound: teams no longer needed fifty or a hundred engineers. They did not have to raise as much capital. The money went further. Companies like Red Barn Robotics developed AI-driven agricultural robots securing millions in contracts. Deepnight built military-grade night vision software for the US Army. Delve launched with over 100 customers and a multi-million pound run rate, all with remarkably lean teams.
Jared Friedman, YC's managing partner, emphasised a crucial point about these companies: “It's not like we funded a bunch of non-technical founders. Every one of these people is highly technical, completely capable of building their own products from scratch. A year ago, they would have built their product from scratch, but now 95% of it is built by an AI.”
Yet beneath these success stories lurked a more complicated reality. Pete Hodgson, writing about AI coding assistants in May 2025, captured the core problem with devastating clarity: “The state of the art with coding agents in 2025 is that every time you start a new chat session, your agent is reset to the same knowledge as a brand new hire, one who has carefully read through all the onboarding material and is good at searching through the codebase for context.”
This “brand new hire” phenomenon explains why architectural planning has become so critical. Traditional developers build mental models of codebases over months and years. They internalise team conventions, understand why certain patterns exist, and recognise the historical context behind architectural decisions. AI assistants possess none of this institutional memory. They approach each session with technical competence but zero contextual awareness.
The burden that has shifted is not the mechanical act of writing code. It is the responsibility for ensuring that generated code fits coherently within existing systems, adheres to established patterns, and serves long-term maintainability rather than short-term convenience.
To understand why architectural planning matters more with AI assistants, you must first understand how these systems process information. Every AI model operates within what engineers call a context window: the total amount of text it can consider simultaneously. By late 2025, leading models routinely supported 200,000 tokens or more, with some reaching one million tokens. Google's Gemini models offered input windows of over a million tokens, enough to analyse entire books or multi-file repositories in a single session.
But raw capacity tells only part of the story. Timothy Biondollo, writing about the fundamental limitations of AI coding assistants, articulated what he calls the Principle of Compounding Contextual Error: “If an AI interaction does not resolve the problem quickly, the likelihood of successful resolution drops with each additional interaction.”
The mechanics are straightforward but devastating. As you pile on error messages, stack traces, and correction prompts, you fill the context window with what amounts to garbage data. The model is reading a history full of its own mistakes, which biases it toward repeating them. A long, winding debugging session is often counterproductive. Instead of fixing the bug, you are frequently better off resetting the context and starting fresh with a refined prompt.
This dynamic fundamentally changes how teams must approach complex development tasks. With human developers, extended debugging sessions can be productive because humans learn from their mistakes within a session. They build understanding incrementally. AI assistants do the opposite: their performance degrades as sessions extend because their context becomes polluted with failed attempts.
The practical implication is that teams cannot rely on AI assistants to self-correct through iteration. The tools lack the metacognitive capacity to recognise when they are heading down unproductive paths. They will cheerfully continue generating variations of flawed solutions until the context window fills with a history of failures, at which point the quality of suggestions deteriorates further.
Predictions from industry analysts suggest that one million or more tokens will become standard for flagship models in 2025 and 2026, with ten million token contexts emerging in specialised models by 2027. True “infinite context” solutions may arrive in production by 2028. Yet even with these expansions, the fundamental challenge remains: more tokens do not eliminate the problem of context pollution. They merely delay its onset.
This context limitation has driven a renaissance in software specification practices. What the industry has come to call spec-driven development represents one of 2025's most significant methodological shifts, though it lacks the visibility of trendier terms like vibe coding.
Thoughtworks describes spec-driven development as a paradigm that uses well-crafted software requirement specifications as prompts for AI coding agents to generate executable code. The approach explicitly separates requirements analysis from implementation, formalising requirements into structured documents before any code generation begins.
GitHub released Spec Kit, an open-source toolkit that provides templates and workflows for this approach. The framework structures development through four distinct phases: Specify, Plan, Tasks, and Implement. Each phase produces specific artifacts that carry forward to subsequent stages.
In the Specify phase, developers capture user journeys and desired outcomes. As the Spec Kit documentation emphasises, this is not about technical stacks or application design. It focuses on experiences and what success looks like: who will use the system, what problem it solves, how users will interact with it, and what outcomes matter. This specification becomes a living artifact that evolves as teams learn more about users and their needs.
The Plan phase gets technical. Developers encode their desired stack, architecture, and constraints. If an organisation standardises on certain technologies, this is where those requirements become explicit. The plan captures compliance requirements, performance targets, and security policies that will guide implementation.
The Tasks phase breaks specifications into focused, reviewable work units. Each task solves a specific piece of the puzzle and enables isolated testing and validation. Rather than asking an AI to generate an entire feature, developers decompose work into atomic units that can be independently verified.
Only in the Implement phase do AI agents begin generating code, now guided by clear specifications and plans rather than vague prompts. The approach transforms fuzzy intent into unambiguous instructions that language models can reliably execute.
Not all specification documents prove equally effective at guiding AI assistants. Through extensive experimentation, the industry has converged on several artifact types that demonstrably reduce architectural drift.
The spec.md file has emerged as foundational. Addy Osmani, Chrome engineering lead at Google, recommends creating a comprehensive specification document containing requirements, architecture decisions, data models, and testing strategy. This document forms the basis for development, providing complete context before any code generation begins. Osmani describes the approach as doing “waterfall in fifteen minutes” through collaborative specification refinement with the AI before any code generation occurs.
Tasks.md serves a complementary function, breaking work into incremental, testable steps with validation criteria. Rather than jumping straight into code, this process establishes intent first. The AI assistant then uses these documents as context for generation, ensuring each piece of work connects coherently to the larger whole.
Plan.md captures the technical approach: a short overview of the goal, the main steps or phases required to achieve it, and any dependencies, risks, or considerations to keep in mind. This document bridges the gap between what the system should do and how it should be built.
Perhaps most critically, the CLAUDE.md file (or equivalent for other AI tools) has become what practitioners call the agent's constitution, its primary source of truth for how a specific repository works. HumanLayer, a company building tooling for AI development workflows, recommends keeping this file under sixty lines. The general consensus is that less than three hundred lines works best, with shorter being even better.
The rationale for brevity is counterintuitive but essential. Since CLAUDE.md content gets injected into every single session, bloated files consume precious context window space that should be reserved for task-specific information. The document should contain universally applicable information: core application features, technology stacks, and project notes that should never be forgotten. Anthropic's own guidance emphasises preferring pointers to copies: rather than including code snippets that will become outdated, include file and line references that point the assistant to authoritative context.
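Under those guidelines, a compact CLAUDE.md might look something like this sketch, where the project, stack, and paths are all invented:

```markdown
# Project: order-service (illustrative example)

## Stack
- Go 1.22, PostgreSQL 16, deployed via GitHub Actions

## Conventions
- HTTP handlers live in internal/http/; business logic in internal/domain/
- Run `make test` before proposing changes; never edit generated files in gen/

## Pointers (not copies)
- Architecture decisions: docs/adr/ (read the latest accepted ADRs first)
- API contract: api/openapi.yaml
```

The last section follows the pointers-over-copies guidance: references to authoritative files stay accurate as the code evolves, where pasted snippets would rot.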
A particularly interesting development involves the application of Architecture Decision Records to AI-assisted development. Doug Todd has demonstrated transformative results using ADRs with Claude and Claude Code, showing how these documents provide exactly the kind of structured context that AI assistants need.
ADRs provide enough structure to ensure key points are addressed, but express that structure in natural language, which is perfect for large language model consumption. They capture not just what was decided, but why, recording the context, options considered, and reasoning behind architectural choices.
Chris Swan, writing about this approach, notes that ADRs might currently be an elite team practice, but they are becoming part of a boilerplate approach to working with AI coding assistants. This becomes increasingly important as teams shift to agent swarm approaches, where they are effectively managing teams of AI workers, exactly the sort of environment that ADRs were originally created for.
The transformation begins when teams stop thinking of ADRs as documentation and start treating them as executable specifications for both human and AI behaviour. Every ADR includes structured metadata and clear instructions that AI assistants can parse and apply immediately. Accepted decisions become mandatory requirements. Proposed decisions become considerations. Deprecated and superseded decisions trigger active avoidance patterns.
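The shape of such a record, with status metadata an assistant can parse and act on, might look roughly like this (all project details invented):

```markdown
# ADR-0007: Use event sourcing for order state (illustrative)

- Status: Accepted   <!-- Accepted = mandatory; Deprecated = actively avoid -->
- Date: 2025-06-12
- Deciders: platform team

## Context
Order state changes must be auditable and replayable across services.

## Decision
Persist order changes as an append-only event log; derive read models from it.

## Consequences
- Neither humans nor AI assistants may write order state directly to tables.
- Supersedes ADR-0003 (CRUD order model), which is now Deprecated.
```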
Dave Patten describes using AI agents to enforce architectural standards, noting that LLMs and autonomous agents are being embedded in modern pipelines to enforce architectural principles. The goal is not perfection but catching drift early before it becomes systemic.
ADR rot poses a continuing challenge. It does not happen overnight. At first, everything looks healthy: the repository is clean, decisions feel current, and engineers actually reference ADRs. Then reality sets in. Teams ship features, refactor services, migrate infrastructure, and retire old systems. If no one tends the ADR log, it quietly drifts out of sync with the system. Once that happens, engineers stop trusting it. The AI assistant, fed outdated context, produces code that reflects decisions the team has already moved past.
Without these planning artifacts, teams inevitably encounter what practitioners call the zig-zag problem: iterative back-and-forth that wastes cycles and produces inconsistent results. One developer who leaned heavily on AI generation for a rushed project described the outcome as “an inconsistent mess, duplicate logic, mismatched method names, no coherent architecture.” He realised he had been “building, building, building” without stepping back to see what the AI had woven together. The fix required painful refactoring.
The zig-zag emerges from a fundamental mismatch between how humans and AI assistants approach problem-solving. Human developers naturally maintain mental models that constrain their solutions. They remember what they tried before, understand why certain approaches failed, and build incrementally toward coherent systems.
AI assistants lack this continuity. Each response optimises for the immediate prompt without consideration of the larger trajectory. Ask for a feature and you will get a feature, but that feature may duplicate existing functionality, violate established patterns, or introduce dependencies that conflict with architectural principles.
Qodo's research on AI code quality found that about a third of developers verify AI code more quickly than writing it from scratch, whilst the remaining two-thirds require more time for verification. Roughly a fifth face heavy overruns of 50 to 100 percent or more, making verification the bottleneck. Approximately 11 percent of developers reported code verification taking much longer, with many code mismatches requiring deep rework.
The solution lies in constraining the problem space before engaging AI assistance. Hodgson identifies three key strategies: constrain the problem by being more directive in prompts and specifying exact approaches; provide missing context by expanding prompts with specific details about team conventions and technical choices; and enable tool-based context discovery through integrations that give AI access to schemas, documentation, and requirements.
The transition from planning to implementation represents a critical handoff that many teams execute poorly. GitHub's Spec Kit documentation emphasises that specifications should include everything a developer, or an AI agent, needs to know to start building: the problem, the approach, required components, validation criteria, and a checklist for handoff. By following a standard, the transition from planning to doing becomes clear and predictable.
This handoff structure differs fundamentally from traditional agile workflows. In conventional development, a user story might contain just enough information for a human developer to ask clarifying questions and fill in gaps through conversation. AI assistants cannot engage in this kind of collaborative refinement. They interpret prompts literally and generate solutions based on whatever context they possess.
The Thoughtworks analysis of spec-driven development emphasises that AI coding agents receive finalised specifications along with predefined constraints via rules files or agent configuration documents. The workflow emphasises context engineering: carefully curating information for agent-LLM interaction, including real-time documentation integration through protocols that give assistants access to external knowledge sources.
Critically, this approach does not represent a return to waterfall methodology. Spec-driven development creates shorter feedback cycles than traditional waterfall's excessively long ones. The specification phase might take minutes rather than weeks. The key difference is that it happens before implementation rather than alongside it.
Microsoft's approach to agentic AI explicitly addresses handoff friction. Their tools bridge the gap between design and development, eliminating time-consuming handoff processes. Designers iterate in their preferred tools whilst developers focus on business logic and functionality, with the agent handling implementation details. Teams now receive notifications that issues are detected, analysed, fixed, and documented, all without human intervention. The agent creates issues with complete details so teams can review what happened and consider longer-term solutions during regular working hours.
The practical workflow involves marking progress and requiring the AI agent to update task tracking documents with checkmarks or completion notes. This gives visibility into what is done and what remains. Reviews happen after each phase: before moving to the next set of tasks, teams review code changes, run tests, and confirm correctness.
Perhaps the most dangerous misconception about AI coding assistants is that they can self-correct through iteration. The METR study's finding that developers take 19% longer with AI tools, despite perceiving themselves as faster, points to a fundamental misunderstanding of how these tools operate.
The problem intensifies in extended sessions. When you see auto-compacting messages during a long coding session, quality drops. Responses become vaguer. What was once a capable coding partner becomes noticeably less effective. This degradation occurs because compaction loses information. The more compaction happens, the vaguer everything becomes. Long coding sessions feel like they degrade over time because you are literally watching the AI forget.
Instead of attempting marathon sessions where you expect the AI to learn and improve, effective workflows embrace a different approach: stop trying to do everything in one session. For projects spanning multiple sessions, implementing comprehensive logging and documentation serves as external memory. Documentation becomes the only bridge between sessions, requiring teams to write down everything needed to resume work effectively whilst minimising prose to conserve context window space.
Anthropic's September 2025 announcement of new context management capabilities represented a systematic approach to this problem. The introduction of context editing and memory tools enabled agents to complete workflows that would otherwise fail due to context exhaustion, whilst reducing token consumption by 84 percent in testing. In a 100-turn web search evaluation, context editing enabled agents to complete workflows that would otherwise fail due to context exhaustion.
The recommended practice involves dividing and conquering with sub-agents: modularising large objectives and delegating API research, security review, or feature planning to specialised sub-agents. Each sub-agent gets its own context window, preventing any single session from approaching limits. Telling the assistant to use sub-agents to verify details or investigate particular questions, especially early in a conversation or task, tends to preserve context availability without much downside in terms of lost efficiency.
Extended thinking modes also help. Anthropic recommends using specific phrases to trigger additional computation time: “think” triggers basic extended thinking, whilst “think hard,” “think harder,” and “ultrathink” map to increasing levels of thinking budget. These modes give the model additional time to evaluate alternatives more thoroughly, reducing the need for iterative correction.
Understanding the practical boundaries of AI self-correction helps teams design appropriate workflows. Several patterns consistently cause problems.
Open solution spaces present the first major limitation. When problems have multiple valid solutions, it is extremely unlikely that an AI will choose the right one without explicit guidance. The AI assistant makes design decisions at the level of a fairly junior engineer and lacks the experience to challenge requirements or suggest alternatives.
Implicit knowledge creates another barrier. The AI has no awareness of your team's conventions, preferred libraries, business context, or historical decisions. Every session starts fresh, requiring explicit provision of context that human team members carry implicitly. Anthropic's own research emphasises that Claude is already smart enough. Intelligence is not the bottleneck; context is. Every organisation has its own workflows, standards, and knowledge systems, and the assistant does not inherently know any of these.
Compound errors represent a third limitation. Once an AI starts down a wrong path, subsequent suggestions build on that flawed foundation. Without human intervention to recognise and redirect, entire implementation approaches can go astray.
The solution is not more iteration but better specification. Teams seeing meaningful results treat context as an engineering surface, determining what should be visible to the agent, when, and in what form. More information is not always better. AI can be more effective when further abstracted from the underlying system because the solution space becomes wider, allowing better leverage of generative and creative capabilities.
The tooling ecosystem has evolved to support these context management requirements. Cursor, one of the most popular AI coding environments, has developed an elaborate rules system. Large language models do not retain memory between completions, so rules provide persistent, reusable context at the prompt level. When applied, rule contents are included at the start of the model context, giving the AI consistent guidance for generating code.
The system distinguishes between project rules, stored in the .cursor/rules directory and version-controlled with the codebase, and global rules that apply across all projects. Project rules encode domain-specific knowledge, standardise patterns, and automate project workflows. They can be scoped using path patterns, invoked manually, or included based on relevance.
The legacy .cursorrules file has been deprecated in favour of individual .mdc files inside the .cursor/rules/ directory. This change provides better organisation, easier updates, and more focused rule management. Each rule lives in its own file with the .mdc (Markdown Components) extension, allowing for both metadata in frontmatter and rule content in the body.
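A single rule file under .cursor/rules/ might look roughly like the sketch below. The frontmatter fields shown are the commonly documented ones; the rule content, globs, and paths are invented:

```markdown
---
description: API handler conventions (illustrative rule)
globs: src/api/**/*.ts
alwaysApply: false
---

- Validate request bodies with the shared schema helpers before any DB access.
- Return errors through the central error middleware, never as raw strings.
- Prefer existing utilities in src/lib/ over new one-off helpers.
```

Scoping the rule with a glob means it is only injected when relevant files are in play, which is precisely the token-saving behaviour the rules system is designed for.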
The critical insight for 2025 is setting up what practitioners call the quartet: Model Context Protocol servers, rules files, memories, and auto-run configurations at the start of projects. This reduces token usage by only activating relevant rules when needed, giving the language model more mental space to focus on specific tasks rather than remembering irrelevant guidelines.
Skills represent another evolution: organised folders of instructions, scripts, and resources that AI assistants can dynamically discover and load. These function as professional knowledge packs that raise the quality and consistency of outputs across entire organisations.
The shift in architectural burden comes with significant implications for code quality. A landmark Veracode study in 2025 analysed over 100 large language models across 80 coding tasks and found that 45 percent of AI-generated code introduces security vulnerabilities. These were not minor bugs but critical flaws, including those in the OWASP Top 10.
In March 2025, a vibe-coded payment gateway approved over 1.5 million pounds in fraudulent transactions due to inadequate input validation. The AI had copied insecure patterns from its training data, creating a vulnerability that human developers would have caught during review.
Technical debt compounds the problem. Over 40 percent of junior developers admitted to deploying AI-generated code they did not fully understand. AI-generated code tends to include 2.4 times more abstraction layers than human developers would implement for equivalent tasks, leading to unnecessary complexity. Forrester forecast an incoming technical debt tsunami over the following two years due to advanced AI coding agents.
The verification burden has shifted substantially. Where implementation was once the bottleneck, review now consumes disproportionate resources. Code review times ballooned by approximately 91 percent in teams with high AI usage. The human approval loop became the chokepoint.
Teams with strong code review processes experience quality improvements when using AI tools, whilst those without see quality decline. This amplification effect makes thoughtful implementation essential. The solution involves treating AI-generated code as untrusted by default. Every piece of generated code should pass through the same quality gates as human-written code: automated testing, security scanning, code review, and architectural assessment.
These dynamics have implications for how development teams should be structured. The concern that senior developers will spend their time training AI instead of training junior developers is real and significant. Some organisations report that senior developers became more adept at leveraging AI whilst spending less time mentoring, potentially creating future talent gaps.
Effective teams structure practices to preserve learning opportunities. Pair programming sessions include AI as a third participant rather than a replacement for human pairing. Code review processes use AI-generated code as teaching opportunities. Architectural discussions explicitly evaluate AI suggestions against alternatives.
Research on pair programming shows that two sets of eyes catch mistakes early, with studies finding pair-programmed code has up to 15 percent fewer defects. A meta-analysis found pairs typically consider more design alternatives than programmers working alone, arrive at simpler and more maintainable designs, and catch design defects earlier. Teams are adapting this practice: one developer interacts with the AI whilst another reviews the generated code and guides the conversation, creating three-way collaboration that preserves learning benefits.
The skill set required for effective AI collaboration differs from traditional development. Where implementation expertise once dominated, context engineering has become equally important. The most effective developers of 2025 are still those who write great code, but they increasingly augment that skill by mastering the art of providing persistent, high-quality context.
The architectural planning burden that has shifted to human developers represents a permanent change in how software gets built. AI assistants will continue improving, context windows will expand, and tooling will mature. But the fundamental requirement for clear specifications, structured context, and human oversight will remain.
Microsoft's chief product officer for AI, Aparna Chennapragada, sees 2026 as a new era for alliances between technology and people. If recent years were about AI answering questions and reasoning through problems, the next wave will be about true collaboration. The future is not about replacing humans but about amplifying them. GitHub's chief product officer, Mario Rodriguez, predicts repository intelligence: AI that understands not just lines of code but the relationships and history behind them.
A survey of over 700 CIOs forecasts that by 2030 all IT work will involve AI: roughly 75 percent human-AI collaboration and 25 percent fully autonomous AI tasks, with none of the workload performed solely by humans. Software engineering will be less about writing code and more about orchestrating intelligent systems. Engineers who adapt to these changes, embracing AI collaboration, focusing on design thinking, and staying curious about emerging technologies, will thrive.
The teams succeeding at this transition share common characteristics. They invest in planning artifacts before implementation begins. They maintain clear specifications that constrain AI behaviour. They structure reviews and handoffs deliberately. They recognise that AI assistants are powerful but require constant guidance.
The zig-zagging that emerges from insufficient upfront specification is not a bug in the AI but a feature of how these tools operate. They excel at generating solutions within well-defined problem spaces. They struggle when asked to infer constraints that have not been made explicit.
The architecture tax is real, and teams that refuse to pay it will find themselves trapped in cycles of generation and revision that consume more time than traditional development ever did. But teams that embrace the new planning requirements, that treat specification as engineering rather than documentation, will discover capabilities that fundamentally change what small groups of developers can accomplish.
The future of software development is not about choosing between human expertise and AI capability. It is about recognising that AI amplifies whatever approach teams bring to it. Disciplined teams with clear architectures get better results. Teams that rely on iteration and improvisation get the zig-zag.
The planning burden has shifted. The question is whether teams will rise to meet it.
Tim Green
UK-based Systems Theorist & Independent Technology Writer
Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.
His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.
ORCID: 0009-0002-0156-9795
Email: [email protected]
2026-04-22 18:55:00
Every server I audit has at least three of these issues. They're simple to fix, yet consistently overlooked. A single breach can cost a small business thousands of dollars in downtime, data loss, and reputation damage.
If your server accepts SSH passwords, it's being brute-forced right now. Check your /var/log/auth.log — you'll see hundreds of failed login attempts daily from bots around the world.
Fix: Switch to SSH key authentication and disable password login:
# /etc/ssh/sshd_config
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
This single change blocks virtually all password brute-force attacks and takes about five minutes. Validate the file with sshd -t, then restart the service (systemctl restart ssh on Ubuntu) to apply it. Crucially, confirm that your key-based login works before disabling passwords, or you will lock yourself out.
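To double-check the edit, a small POSIX-shell helper like this sketch (the function name is mine, not a standard tool) can audit the config file for the three hardened settings:

```shell
# Hypothetical helper: audit an sshd_config file for the three hardened
# settings above. Prints one line per setting; returns non-zero if any is missing.
audit_sshd() {
  cfg="${1:-/etc/ssh/sshd_config}"
  status=0
  for want in "PasswordAuthentication no" \
              "PubkeyAuthentication yes" \
              "PermitRootLogin no"; do
    key="${want%% *}"   # option name, e.g. PasswordAuthentication
    val="${want#* }"    # required value, e.g. no
    if grep -Eiq "^[[:space:]]*${key}[[:space:]]+${val}\b" "$cfg"; then
      echo "OK:      $want"
    else
      echo "MISSING: $want"
      status=1
    fi
  done
  return $status
}
```

A non-zero exit means at least one setting is absent. This only checks for the lines' presence; sshd itself honours the first occurrence of each option, so keep the file free of conflicting earlier entries.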
I regularly find servers with all ports open to the internet. Database ports (3306, 5432), Redis (6379), admin panels — all accessible from anywhere.
Fix: Use ufw to allow only what's needed:
ufw default deny incoming
ufw allow 22/tcp # SSH
ufw allow 80/tcp # HTTP
ufw allow 443/tcp # HTTPS
ufw enable
Security patches are released weekly. If your server isn't automatically installing them, you're running known-vulnerable software. The Equifax breach happened because of an unpatched vulnerability — the fix had been available for months.
Fix: Enable automatic security updates on Ubuntu:
apt install unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades
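The package's defaults usually suffice, but it is worth confirming which origins are enabled in /etc/apt/apt.conf.d/50unattended-upgrades. A minimal sketch (the file ships with the package; these are standard directives, shown here trimmed):

```
Unattended-Upgrade::Allowed-Origins {
        "${distro_id}:${distro_codename}-security";
};
// Set to "true" only if an unattended reboot after kernel updates is acceptable
Unattended-Upgrade::Automatic-Reboot "false";
```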
If your web application runs as root and gets compromised, the attacker has full control of your server. Running services as unprivileged users limits the blast radius of any vulnerability.
Fix: Create dedicated users for each service. With Docker, don't run containers with --privileged unless absolutely necessary.
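For a systemd-managed service, a hardening drop-in sketches the idea (the unit name and the `appuser` account are illustrative; create the account with `useradd --system` first):

```
# /etc/systemd/system/myapp.service.d/hardening.conf
[Service]
User=appuser
Group=appuser
NoNewPrivileges=true
ProtectSystem=strict
PrivateTmp=true
```

Reload with `systemctl daemon-reload` and restart the unit. If the service is later compromised, the attacker holds only `appuser`'s permissions, not root's.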
Having backups is step one. Testing them is step two — and most people skip it. I've seen companies discover their backups were corrupted only when they needed to restore.
Fix: Automate daily backups to an external location (S3, another server). Set up a monthly restore test. If you can't restore it, it's not a backup — it's a hope.
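A minimal POSIX-shell sketch of that idea, assuming local paths (a real setup would push the archive off-server and run the restore test against the remote copy):

```shell
# Hypothetical sketch: back up a directory, then immediately verify that the
# archive restores cleanly. Paths and the naming scheme are illustrative.
backup_and_verify() {
  src="$1"
  dest="$2"
  archive="$dest/backup-$(date +%Y%m%d).tar.gz"
  tar -czf "$archive" -C "$src" .      # create the backup
  scratch=$(mktemp -d)                 # temporary dir for the restore test
  tar -xzf "$archive" -C "$scratch"
  if diff -r "$src" "$scratch" >/dev/null; then
    echo "restore OK: $archive"
  else
    echo "restore FAILED: $archive" >&2
    rm -rf "$scratch"
    return 1
  fi
  rm -rf "$scratch"
}
```

Schedule it daily via cron, ship the archive to an external location such as S3, and make the monthly restore test a calendar event rather than a good intention.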
These five fixes take less than an hour total and dramatically improve your security posture.
Not sure about your server's security? I offer comprehensive server audits with a detailed report and remediation plan. Contact me for a security review.