2026-06-24 23:31:30
Apple’s most ambitious AI feature runs in about a gigabyte of memory on the iPhone. The same company runs a much larger model on its own cloud servers, and the two diverge in almost every architectural choice beyond the word “transformer” in their lineage.
The same split shows up at Google, Microsoft, and Meta, where one family of small models targets devices and a different family of large models targets data centers.
Small and large language models are different engineering responses to different constraints, and the differences begin with where each model runs, what hardware it targets, and how it was trained.
In this article, we will explore those constraints through three layers of model design, look at the tradeoffs that come with each approach, and investigate the production systems that combine both small and large models.
Disclaimer: This post is based on publicly shared details from various sources. Please comment if you notice any inaccuracies.
Before we look at what makes the two classes different, it helps to be precise about what makes them the same.
Both small and large language models are transformer-based decoder models, built by stacking layers of the same basic computational block. Each block runs an attention operation, which figures out which previous tokens matter most for predicting the next one, followed by a feed-forward computation that mixes that information through a wide intermediate layer. The model repeats this block dozens of times, fewer in the smallest models and more than a hundred in the largest, before producing a probability distribution over what the next token should be.
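To make the attention step concrete, here is a minimal sketch of single-head causal attention in plain Python. It assumes the queries, keys, and values have already been computed for each token, and it leaves out the learned projections, multiple heads, and the feed-forward layer. The function names are ours, chosen for illustration.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head causal attention over lists of equal-length vectors."""
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Position t may only look at positions 0..t (itself and earlier tokens).
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # The output is a weighted average of the visible value vectors.
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = causal_attention(tokens, tokens, tokens)
```

The first token can only attend to itself, so its output equals its own value vector. The second token blends both, weighted toward the key it matches best.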
Both classes go through similar training stages. They start with pretraining on large text corpora, where the model learns to predict the next token across billions of examples. They typically follow with supervised fine-tuning on specific instruction patterns, and many go through reinforcement learning from human feedback, which shapes how the model handles ambiguity and stays helpful in conversation.
The size of a model refers to its number of parameters, which are the learned weights adjusted during training. A small model in 2026 typically has between half a billion and fourteen billion parameters. A large model has tens of billions to hundreds of billions of parameters, and sometimes more.
Three constraints pull the designs of small and large models in opposite directions.
Deployment target: Where the model runs determines its memory, battery, and latency budgets.
Inference economics: Training is paid once, but serving is paid per request, which inverts the math at scale.
Training budget: Smaller budgets push teams toward efficiency through data quality and distillation rather than raw scale.
The deployment target determines everything that follows.
A model that runs on a phone has a memory budget measured in single gigabytes, a battery budget measured in milliamp-hours, and a latency budget measured in milliseconds. A model that runs in a data center operates in a more permissive environment, with concerns around throughput, batching efficiency, and cost per request, but with an absolute resource ceiling orders of magnitude higher.
Inference economics is the second pressure.
Training a model is a one-time cost paid at the start of its life, while serving the model is a recurring cost paid every time someone uses it. For a high-volume product, the inference bill quickly dwarfs the training bill, so a team designing for high inference volume will gladly spend more training compute upfront to save inference compute across billions of requests downstream.
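The tradeoff can be sketched as a break-even calculation. The dollar figures below are hypothetical and exist only to show the shape of the math; they are not taken from any real system.

```python
def total_cost(training_cost, cost_per_request, requests):
    # Training is paid once; serving is paid on every request.
    return training_cost + cost_per_request * requests

def break_even_requests(extra_training_cost, saving_per_request):
    # Number of requests after which the extra training spend has paid for itself.
    return extra_training_cost / saving_per_request

# Hypothetical: spending $2M more on training saves $0.0005 per request.
requests_needed = break_even_requests(2_000_000, 0.0005)

# Past the break-even point, the more heavily trained model is cheaper overall.
baseline = total_cost(10_000_000, 0.0015, 10_000_000_000)
heavier_training = total_cost(12_000_000, 0.0010, 10_000_000_000)
```

With these made-up numbers, the extra training spend is recovered after about four billion requests, which a high-volume product can reach quickly.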
The training budget is the third pressure.
A frontier large model can cost tens of millions of dollars to train, while most teams working on small models operate with a small fraction of that, and the smaller budget forces choices. Those teams have to find other levers beyond raw scale, which usually means smarter training data, distillation from larger teachers, and more efficient training recipes.
These three constraints reinforce each other rather than acting in isolation. A model designed for the phone has a small inference budget per request and usually a smaller training budget too, while a model designed for the data center has the opposite profile across all three axes. The result is two distinct design regions in the same space.
The architecture differences begin with a quick observation about inference.
During generation, the model has to keep around the keys and values for every previous token, since attention works by comparing the current token against all earlier ones. This stored set is called the KV cache, and it grows linearly with the length of the conversation. For long generations, the cache often dominates memory bandwidth and storage, more than the parameters themselves.
This single fact decides how small-model architectures get designed.
In the original transformer design, every attention head has its own keys and values, an arrangement called multi-head attention. For long sequences, the resulting cache footprint grows large enough to dominate the model’s memory consumption.
Grouped-query attention attacks the problem directly. The number of query heads stays the same, but several query heads share a single key-value head. A model with thirty-two query heads might use only eight key-value heads, which cuts the cache footprint by a factor of four with minimal quality loss. Llama, Qwen, Gemma, and most modern small models use grouped-query attention by default, and many large models have adopted it as well because the math also helps at scale.
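The factor of four falls straight out of the cache size formula. The sketch below uses the thirty-two query heads and eight key-value heads from the example above; the layer count, head dimension, sequence length, and sixteen-bit storage are assumptions picked for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # The leading 2 accounts for storing both keys and values at every position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Multi-head attention: every one of the 32 query heads has its own keys and values.
mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)

# Grouped-query attention: 32 query heads share 8 key-value heads.
gqa_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

GIB = 1024 ** 3
print(mha_bytes / GIB, gqa_bytes / GIB)
```

Under these assumed dimensions, the cache shrinks from 4 GiB to 1 GiB for a single 8,192-token sequence. The formula also shows why the cache grows linearly with sequence length.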
Some small models push further. Gemma 2 interleaves sliding window attention with full attention across layers, so some layers attend only to the most recent few thousand tokens rather than the full context. This trades a bit of long-range reasoning for a significantly smaller cache. Apple’s on-device model shares its KV cache across multiple decoder layers, reusing the same stored state in several places.
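A rough way to see the saving from interleaving is to count how many tokens each layer has to keep in its cache. The sketch assumes local and global layers strictly alternate and uses an illustrative window and context length; it is not a description of any specific model's configuration.

```python
def cached_tokens_per_layer(num_layers, seq_len, window, global_every=2):
    """Tokens each layer keeps cached when sliding-window and full layers interleave."""
    per_layer = []
    for layer in range(num_layers):
        # Assumption: every `global_every`-th layer uses full attention.
        is_global = layer % global_every == global_every - 1
        per_layer.append(seq_len if is_global else min(seq_len, window))
    return per_layer

layout = cached_tokens_per_layer(num_layers=4, seq_len=32768, window=4096)
interleaved_total = sum(layout)
full_total = 4 * 32768
```

With a 4,096-token window and a 32,768-token context, the sliding-window layers hold one eighth of what a full-attention layer holds, so the four-layer total drops to about 56 percent of the all-global layout.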
These architectural decisions all serve the same goal of shrinking the runtime cost of inference, which is the constraint that matters most when the model has to run on a device with a few gigabytes of memory to spare.
Two models with identical architectures can end up with very different capabilities depending on what they were trained on and how.
Three techniques define the current state of the art in small-model training:
Data curation: Carefully chosen and synthetically generated training data can substitute for raw volume.
Knowledge distillation: A smaller student model learns from a larger teacher model’s output distribution.
Overtraining: Modern small models see far more training tokens than compute-optimal ratios suggest, trading training cost for inference savings.
The first technique is data curation. In 2023, a Microsoft research team published a paper called “Textbooks Are All You Need.” They trained a 1.3 billion parameter coding model on roughly seven billion tokens of carefully filtered code and synthetically generated textbook-style data. The model matched or beat models trained on hundreds of billions of tokens of raw web scrape. Training data quality could substitute for training data volume, at least for certain capabilities. The Phi family kept building on that insight, and the modern Phi-4 model continues to lean heavily on synthetic data quality as its primary lever.
The second technique is knowledge distillation.
The small model, called the student, learns from a larger model, called the teacher, by mimicking the teacher’s output distribution rather than only learning from raw text. The richer training signal helps the student pick up patterns it would struggle to learn from the underlying corpus alone. Gemma 2 used this approach to train its nine billion parameter model, while training its twenty-seven billion parameter version from scratch.
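One common way to express "mimicking the teacher's output distribution" is a KL divergence between the two models' next-token probabilities. The sketch below shows that formulation on raw logits. The source does not say which exact loss any particular model used, so treat the temperature value and the loss form as assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    # A higher temperature flattens the distribution and exposes more of the
    # teacher's relative preferences among unlikely tokens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened next-token distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

teacher_logits = [4.0, 2.0, 0.5]
matching_loss = distillation_loss([4.0, 2.0, 0.5], teacher_logits)
mismatch_loss = distillation_loss([0.5, 2.0, 4.0], teacher_logits)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice, this term is often combined with the ordinary next-token loss on the training text.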
The third technique is overtraining relative to compute-optimal.
In 2022, the Chinchilla paper from DeepMind established that for a fixed compute budget, the best model came from balancing parameter count and training data, roughly twenty tokens of training data per parameter. Modern small models deliberately train on far more data than that ratio suggests. A three-billion-parameter model might see many trillions of tokens during training, which is many times the Chinchilla-optimal amount of roughly sixty billion tokens for that size. The extra training compute is paid once, while a smaller model that reaches the target quality is cheaper on every one of the billions of requests it serves, so the team spends more on training to save more on serving.
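The arithmetic is easy to check. The sketch below uses the roughly twenty-tokens-per-parameter ratio quoted above; the four-trillion-token figure is a hypothetical stand-in for "many trillions."

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rough ratio quoted above

def compute_optimal_tokens(params):
    return params * CHINCHILLA_TOKENS_PER_PARAM

def overtraining_factor(params, tokens_seen):
    # How many times more data the model saw than the compute-optimal amount.
    return tokens_seen / compute_optimal_tokens(params)

params = 3_000_000_000
optimal_tokens = compute_optimal_tokens(params)
factor = overtraining_factor(params, 4_000_000_000_000)  # hypothetical 4T tokens
```

For a three-billion-parameter model, the compute-optimal amount is about 60 billion tokens, so a run of four trillion tokens would be roughly 67 times past that point.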
The final layer of design choices covers how the model executes on real hardware. The two dominant techniques are quantization, which shrinks the storage cost of each parameter, and KV cache management, which shrinks the runtime cost of generation.
Quantization is the practice of storing each parameter with fewer bits. A standard pretrained model stores each parameter as a sixteen-bit floating point number. Cutting that to eight bits halves the memory footprint, and cutting to four bits halves it again. There are two ways to get there: post-training quantization converts an already-trained model, while quantization-aware training simulates the lower precision during training. The post-training approach is faster to implement but tends to lose quality at aggressive bit widths, while quantization-aware training preserves quality at the cost of more complex training.
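As a minimal sketch of the idea, here is symmetric post-training quantization of a handful of weights to eight-bit integers, plus the memory arithmetic for a whole model. Production schemes typically quantize per channel or per block and handle outliers more carefully; the per-tensor scale and the three-billion-parameter count here are simplifying assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

def weight_memory_gib(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / 1024 ** 3

weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

fp16_gib = weight_memory_gib(3_000_000_000, 16)
int4_gib = weight_memory_gib(3_000_000_000, 4)
```

Each restored weight lands within one quantization step of the original, which is the error the model has to tolerate. Going from sixteen bits to four bits cuts the weight storage by a factor of four.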
Hardware mapping is the next consideration. Apple’s Neural Engine has different strengths from an NVIDIA Jetson, which has different strengths from a data center H100, and the model design follows the target hardware. Phi-4-mini gets tuned for consumer GPUs. Gemma 3 4B variants run on NVIDIA Jetson Orin for edge AI deployments in robotics and embedded systems. Apple’s 3B model runs on the iPhone’s Neural Engine with the assumption that the device also handles other workloads at the same time.
KV cache management is the second major lever, and it connects directly back to the architecture section. The cache stores keys and values for every previous token during generation, and its size determines how much memory the model uses at runtime. Grouped-query attention attacks this by reducing the number of key-value heads, and Apple’s on-device model goes further by sharing its cache across multiple decoder layers.
These deployment decisions stack on top of everything earlier. The same architectural choices that shrink the KV cache also make quantization easier, and the same training recipes that produce capable small models also produce models that survive aggressive compression.
Small models perform well on standard benchmarks like MMLU and HumanEval. Production usage looks more varied. Three gaps tend to matter most:
Generalization gap: Small models are more brittle outside their training distribution.
Reasoning gap: Multi-step problems still favor larger models, though the gap is closing.
Knowledge ceiling: Parameters function as memory, so small models have a hard cap on what they can store.
The first gap is generalization.
Small models tend to be more brittle outside their training distribution. They can be excellent at tasks similar to what they saw during training while showing weakness on unexpected ones. A small model trained heavily on code performs well on code but may struggle with creative writing in an unusual style. A model trained on synthetic textbook data does well on textbook-style questions but can falter on the messy, ambiguous prompts that real users send.
The second gap is multi-step reasoning.
For problems that require chaining inference across many tokens, large models still have a noticeable advantage. The gap has been closing thanks to step-by-step reasoning techniques and reasoning-focused fine-tuning, but at very small parameter counts, the ceiling remains real. Phi-4 has done well on math reasoning specifically because Microsoft optimized for that capability through training data design, while a general-purpose small model usually shows a clearer gap.
The third gap is world knowledge.
Parameters function as a form of memory, and a larger model can store more facts, more named entities, more obscure references, and more multilingual coverage. A small model has a fundamental cap on how much it can know, since storage requires parameters and parameters require memory. For applications that need broad factual recall, the small model often pairs with an external knowledge source that the model queries when needed, since trying to fit all that knowledge into the parameters themselves would push the model past its budget.
The most interesting design question in 2026 is rarely which model to use. The more useful question is how to compose multiple models into a system that uses each for what it does best. Three patterns appear in most production setups.
Routing: A small model handles requests directly and escalates harder ones to a large model.
Guardrails: A small model filters input or output around the large model’s core work.
Drafting: A small fast model generates candidate tokens that a larger model verifies in a batch.
The most common pattern is routing.
A small model handles the request directly if it falls within its competence, and escalates to a large model when the request is harder than it can confidently handle. The pattern resembles caching tiers in a distributed system, where the fast, cheap layer handles the common case, and the slower, more expensive layer handles the rest. The router itself is often a small classifier model that decides which path to take.
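A minimal sketch of the routing pattern looks like this. It assumes the small model returns a confidence score alongside its answer, which is one of several ways to decide when to escalate; a separate classifier could make the same decision. The toy models and the threshold are hypothetical stand-ins.

```python
def route(request, small_model, large_model, threshold=0.8):
    """Answer with the small model when it is confident, otherwise escalate."""
    answer, confidence = small_model(request)
    if confidence >= threshold:
        return answer, "small"
    return large_model(request), "large"

# Toy stand-ins: the "small model" is only confident on short requests.
def toy_small_model(request):
    confidence = 0.95 if len(request.split()) <= 5 else 0.3
    return "small answer to: " + request, confidence

def toy_large_model(request):
    return "large answer to: " + request

easy = route("What time is it", toy_small_model, toy_large_model)
hard = route("Compare these three contracts clause by clause please",
             toy_small_model, toy_large_model)
```

The threshold is the main tuning knob. Raising it sends more traffic to the large model and raises cost, while lowering it risks weaker answers on requests the small model should have escalated.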
The second pattern is the guardrail.
A small model sits in front of the large model to filter or classify input before the expensive computation runs, checking for unsafe content, classifying the intent of the request, or stripping out information that should stay private. A second small model often sits on the output side, doing similar checks before the response gets returned to the user. These guardrail models are cheap, fast, and specialized, which makes them well-suited to the role.
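The shape of the guardrail pattern can be sketched as a wrapper around the large model call. In a real system, the input and output checks would be small classifier models. Here they are simple rule-based stand-ins, and the email pattern, blocked term, and refusal message are all hypothetical.

```python
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKED_TERMS = ("internal-only",)  # hypothetical policy list

def input_guard(prompt):
    # Stand-in for a small model that strips private information from the input.
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", prompt)

def output_guard(response):
    # Stand-in for a small model that classifies the output as safe or unsafe.
    lowered = response.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

def guarded_call(prompt, large_model):
    safe_prompt = input_guard(prompt)
    response = large_model(safe_prompt)
    if not output_guard(response):
        return "Sorry, I can't share that."
    return response

echo_model = lambda prompt: "You said: " + prompt
redacted = guarded_call("Contact jane@example.com today", echo_model)
blocked = guarded_call("hi", lambda prompt: "This is INTERNAL-ONLY data")
```

The large model never sees the raw email address, and its response is checked before it reaches the user.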
The third pattern is the drafter, sometimes called speculative decoding.
A small fast model generates several candidate tokens, and a larger, more capable model verifies them in a single batched pass. When the large model agrees with the drafted tokens, the system gets close to the throughput of the small model with the quality of the large one. When it disagrees, it discards the rest of the draft and continues from its own token. Apple’s on-device system uses a draft model alongside its base model for exactly this reason. The technique sounds like a hack, but it has become standard in production inference systems.
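Here is a sketch of one round of the greedy variant of this idea, using toy functions in place of real models. A real system verifies all drafted positions in one batched forward pass and usually uses a probabilistic acceptance rule; the sequential loop and exact-match check below are simplifications, and both toy models are hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding; returns the tokens kept this round."""
    # 1. The draft model proposes k tokens, one after another.
    draft, context = [], list(prefix)
    for _ in range(k):
        token = draft_next(context)
        draft.append(token)
        context.append(token)

    # 2. The target model checks each drafted position.
    accepted, context = [], list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # First disagreement: keep the target's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. All drafts accepted: the target's pass also yields one extra token.
    accepted.append(target_next(context))
    return accepted

# Toy models: the target counts upward mod 10; the draft makes a mistake after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
drafter = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

partial = speculative_step([1], drafter, target)
full = speculative_step([5], drafter, target)
```

In both cases, the output matches what the target model would have produced on its own. The gain is that several tokens can be confirmed per pass of the large model when the draft is right.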
Picking a model class is the wrong frame for most product decisions. Designing a system around multiple model classes is the right frame, and the interesting design work lives in the composition layer, the routing logic, the fallback behavior, and the orchestration between models.
The question we started with was “small versus large language models,” and the more useful version of that question turns out to be “which constraints drove each model’s design.” The size of a model is a downstream consequence of those constraints rather than the starting point for the design.
Three layers of design choices flow from the constraints:
Architecture adapts through attention variants like grouped-query and sliding-window attention that shrink the KV cache.
Training adapts through high-quality synthetic data, distillation from larger teachers, and deliberate overtraining relative to compute-optimal ratios.
Deployment adapts through quantization, KV cache management, and careful hardware mapping. Each layer reinforces the others, and the result is two distinct design regions in the same space.
Small models are extremely capable for their size, and they have a real ceiling on generalization, on multi-step reasoning, and on broad world knowledge. Production systems handle this by composing both classes, using small models for the common case and large models for the harder requests, sometimes with multiple small models acting as routers, guardrails, and drafters around a larger core.
For a working engineer choosing between models, the right starting point is the constraints rather than the benchmark. The questions that matter are about deployment target, inference budget, and the shape of the request distribution in production.
References:
Apple Intelligence Foundation Language Models Tech Report 2025
Updates to Apple’s On-Device and Server Foundation Language Models
Lightweight, Multimodal, Multilingual Gemma 3 Models Are Streamlined for Performance
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Fast Transformer Decoding: One Write-Head is All You Need (Multi-Query Attention)
2026-06-23 23:31:20
This is a guest post by Kun Chen, a former L8 principal engineer at Meta, Microsoft, and Atlassian, where he led development of Rovo Dev, Atlassian’s AI SDLC product. He has since left big tech to build solo and has gone all-in on agentic engineering. Below, he walks through his complete setup, step by step. You can follow him on X and subscribe to him on YouTube, where he shares his agentic engineering workflow, the open-source tools he builds, and his take on AI and software craft. Over to Kun.
Hi everyone, Kun here. For context, I spent years driving agent adoption among tens of thousands of engineers at all levels, both within my company and across many customers’ engineering organizations. Going solo has actually let me lean into agents even more.
Here’s the difference using agents has made to my productivity: shipping 30+ high-quality PRs in a day that meet my own bar used to be hard to imagine, and it’s now a slow day. I’ve reached what feels like a constant flow state, where the quality and speed of my thoughts are the only bottleneck left.
None of this came from a single trick or from using some hyped tool. It came from a long and often messy process of figuring out what actually works in the real world versus what just sounds good in a demo. The short version is that I have now stopped writing most of the code myself and started acting like an engineering manager directing a team of agents. I stay at the level of deciding what to build and whether it’s good, and I’ve built tooling to handle almost everything in between.
The interesting part of this journey is all the friction I had to remove to reach this point. Therefore, in this post, I’m attempting to share everything I do, step by step, for both my professional and personal projects.
If you’re on the same journey of making your work with agents more productive and enjoyable, I hope this gives you a head start and shortcuts some of your own exploration.
First, what I’m sharing here is my personal setup. What works well for me may not be the best fit for everyone. I’m sharing my workflow as-is, mainly hoping it can be a useful reference or inspiration for what to explore, even if you don’t end up using the same tools.
Second, I have no affiliation with any of the third-party products I mention in this post, and the tools I built myself are all free and open source. I share these specific products because they are genuinely what I use in my setup. They are often not the only choice for the problems they solve, so I encourage everyone to research different options based on their interests and requirements.
To make this post concrete and practical, I’ll walk you through my workflow using a real project I’m actively building. It’s called “Hi Bit”: an AI tutor I’m making for my son to teach him agentic engineering. In the rest of the post, I will follow the implementation of a specific image input feature in the Hi Bit project from the idea to merged PR so that you can get a first-hand look at my agentic workflow.
There has been a constant debate in the developer community about terminal vs GUI.
I’m obviously biased because I started coding almost 30 years ago and have built decades of muscle memory on top of a terminal-centric workflow ever since. But I did try GUIs every once in a while, from Visual Basic and Visual Studio to Atom and now the latest Codex app.
The reason I stick with terminals is very simple. I keep my flow and focus best when my hands never leave the keyboard. Some GUIs let you do everything via keyboard shortcuts as well, but they’re very inconsistent about it, which makes it hard to build strong muscle memory.
The terminal emulator I’ve been using for many years is WezTerm.
It’s the only terminal I’ve found that is highly performant, customizable, and works consistently even when I’m forced to use Windows. I run it as a single frameless window: no tabs, title bar, or status line, literally nothing else.
I use Claude Code for Anthropic’s models and OpenCode for everything else.
The CLI agent harnesses nowadays are quite commoditized, and you won’t really go wrong with any of them. Almost everything I share below works with any mainstream harness you can find.
In fact, I actually recommend avoiding the “fancy” gimmicks that only some agents have, such as auto-managed memory. They’re often designed to lock you into a particular vendor, when in reality you benefit a lot from being able to switch to whichever newer model works best, even if it comes from a different vendor. I try to keep my whole workflow agent-agnostic, so I have no switching cost. It’s far from clear which model will win in the end, and as a user, you’re in a much better position if you can work with any model available rather than being locked into one.
Neovim has been my primary editor for a long time, and it’s a critical part of staying fully keyboard-driven inside the terminal. You might ask, “But I use agents now. Why do I need an IDE?”
I use it to quickly examine the file system, review diffs, and make small edits when needed. A few plugins do most of the heavy lifting:
oil.nvim: navigate and edit the file system like a buffer
neogit: quickly review git status and diffs, and perform simple operations
snacks.nvim: I use its picker for finding files and grepping the codebase
I don’t let my terminal emulator manage tabs, because I manage all my sessions, windows, tabs, and panes in tmux instead. It’s one of the most powerful primitives in my whole setup, and it unlocks a few things at once:
Splitting my terminal window into panes the way I like
Driving the entire terminal experience from the keyboard
Persisting the working sessions and layout
Accessing the same session from my other devices (more on that later)
A popular alternative is Zellij, but tmux has worked well enough that I haven’t switched. As soon as I’m in, I create a split on the left for the agent and one on the right for Neovim, and I separate different tasks into different tabs that I keep track of along the top.
We’re now in the terminal, and the agent is waiting for instructions. All I have to do is write a prompt. Sounds easy, right?
Actually, how you write your prompts is one of the biggest levers on both the velocity of your work and the quality of the outcome. So let me share a few things that made a big difference for me.
Typing was my primary input method for decades. But over the last couple of years, speech recognition models have really changed the game. You can now run high-quality models locally on your Mac, for free, that generate output extremely fast.
You talk a lot faster than you type, so moving to voice as your primary input is one of the easiest ways to greatly improve your productivity. It applies to prompting your agents, but also to anything else that used to require typing. This post, for example, is mostly written by voice.
The solution I use is OpenSuperWhisper, which is completely free and runs the Whisper model (large-v3-turbo) locally. I set a hotkey to trigger it, and now I can just talk wherever I could type. There are plenty of other free and paid options that give a great experience as well.
For many new tech leads and people managers, the first struggle is delegation. The same thing happens with how you interact with agents.
The most common mistakes I see people make when delegating to both humans and agents are:
asking for an action, not an outcome;
not explaining the “why”;
taking back control.
Take “rename this variable.” It’s a valid prompt, but it has a couple of problems:
The agent finishes in a few seconds and waits for you again. It barely saves more time than doing it yourself, and you’re still the bottleneck.
There’s no “why.” Do you want it renamed for readability? To follow a team convention? Because of a future plan you haven’t mentioned? Without the why, there’s no room for the agent to suggest something better, and it won’t know how to do it right next time.
If you were following a convention, a better prompt would be: “Let’s audit this part of our codebase and make sure our variable naming follows this convention <link_to_document>.”
That explains the rationale, gives the necessary context, and asks for an outcome instead of an action. The agent can run longer, get more done in a way that’s aligned with your goal, and respect that convention for the rest of the session instead of creating more problems for you.
The other failure mode is taking back control. When a mistake happens, people immediately think of doing it themselves. This happens to new tech leads working with a junior engineer. They could do the job faster by taking over, but in doing so, they make themselves the bottleneck and fail to scale. People hit the exact same wall with AI: they see an agent do something wrong and revert to doing it manually, and they never truly unlock the leverage agents offer.
The right approach is to give feedback and help the other party improve. With agents, this is actually easier. You can write directly into the agent’s memory file (CLAUDE.md or AGENTS.md), or ask the agent to reflect on its mistake and update that file so the same thing doesn’t happen again.
The feature I’m building right now is image input. I want the chat box in Hi Bit to accept a pasted image from the clipboard, or open a file picker or camera. This is a bit more than I think the agent can one-shot the way I want. In the beginning, I also didn’t know exactly where the button should go or what the attached images should look like.
This happens a lot: problems where I genuinely can’t describe the full solution up front. It could be a new project from scratch, a major refactor, or a big feature on an existing system. In those cases, I work with the agent to write a plan first.
There’s a school of thought in the agentic community that’s against technical planning. They prefer to “just talk to your agent” instead.
I went the other way, and here’s why. When you just talk to your agent, you have to stay in an interactive session. You get constantly pulled back into the conversation, and after a few rounds, it’s hard to keep track of what the actual plan is. A long wall of text in the terminal is also painful to parse and hard to give targeted feedback on.
Instead, I spend a concentrated chunk of time up front getting the plan into a confident state, so I can hand it off for a fully autonomous implementation without jumping back in until it’s done. That’s also what frees me up to run other tasks in parallel without constant context switching.
For a long time, I did this by asking the agent to write a proposal in a markdown file, then questioning it and iterating. That worked, but I have something better now. Inspired by an article on using HTML as interactive artifacts, I built a tool called Lavish Editor to collaborate with the agent on anything complex.
So instead of “draft a technical plan in a markdown file,” I say “draft a technical plan using npx lavish-axi.” Lavish guides the agent to render the plan as an interactive HTML page and opens it in my browser. It even encourages the agent to match the look and feel of the current project, so the plan for a UI feature looks visually consistent with the real app, which makes options far easier to judge.
A few minutes later, the agent had a plan open in my browser. It opened with the goal and context, flagged the decisions I’d need to make, and then laid out three UI options — a “+” button menu, an always-on button trio, and a smart-paste tile — each with pros and cons and the agent’s own recommendation.
The real payoff is the interactive back-and-forth. I liked Option A’s tiny resting footprint, but not that its menu dropped down and stretched the chat area. Rather than writing a paragraph describing which element I meant, I just clicked that element in the page and annotated it directly: “Can we make it a floating overlay above the + button instead?”
The agent came back almost immediately with exactly the revision I wanted, and I finalized the remaining decisions just by clicking buttons. Being able to interact with a plan this richly, instead of editing a markdown file, turned out to be a big productivity boost, and it’s not limited to planning. I now use Lavish for brainstorming, reviewing changes, data reports, and anything else that benefits from a visual artifact and tight back-and-forth. It’s been a game-changer.
Once the plan is clear, the implementation is fully autonomous, and there’s honestly not much for me to do except wait for the agent to ping when it’s done.
The one thing worth mentioning here is that for every project, I spend a lot of effort making sure the agent can validate its own change end-to-end. As an example, for Hi Bit, I keep explicit instructions in the repo’s AGENTS.md for how to exercise the app, so changes like this get validated by the agent before they come back to me. I can often watch it test its own work in real time by driving the real app, attaching an image, and checking how the new button actually behaves.
Occasionally, a task is so complex that it doesn’t fit well in a single context window. If I just let the agent grind on it, it fills its context, fires off very large requests, and eventually compacts to free space, sometimes losing important context in the process. The newer /goal command in Codex and Claude Code helps a bit, but I’ve been using something better since well before it existed.
I call it good night, have fun: gnhf for short. It’s a dead-simple, long-running orchestrator I built for running big tasks overnight; you invoke it with gnhf <your objective>.
Under the hood, it works similarly to the Ralph Loop and Autoresearch patterns. It breaks the task into small steps, and each step runs in a fresh context window seeded with a common base context plus the learnings from previous steps. Failed attempts roll back automatically, and the next attempt takes the failure into account. I can also set a token budget so I don’t wake up bankrupt. When I come back, there’s a branch with well-organized commits and a notes.md file summarizing what was done.
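The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not gnhf's actual implementation: the names `StepResult`, `RunState`, `run_overnight`, and the `run_step` callback are all hypothetical, and a real orchestrator would invoke an agent and git where this sketch only records notes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepResult:
    ok: bool            # did the step pass its own validation?
    tokens: int         # tokens spent on this attempt
    learning: str       # short note carried into later steps
    done: bool = False  # True when the agent judges the objective complete


@dataclass
class RunState:
    notes: List[str] = field(default_factory=list)    # learnings, failures included
    commits: List[str] = field(default_factory=list)  # one entry per successful step
    tokens_used: int = 0


def run_overnight(
    objective: str,
    base_context: str,
    run_step: Callable[[str, List[str]], StepResult],  # hypothetical agent call
    token_budget: int,
    max_steps: int = 50,
) -> RunState:
    state = RunState()
    for i in range(max_steps):
        # Stop before starting a step once the budget is spent.
        # The last step may overshoot the budget; a real tool might cap per step.
        if state.tokens_used >= token_budget:
            break
        # Fresh context for every step: the shared base plus prior learnings only.
        context = [base_context, *state.notes]
        result = run_step(objective, context)
        state.tokens_used += result.tokens
        # Learnings are kept even on failure so the next attempt can use them.
        state.notes.append(result.learning)
        if result.ok:
            state.commits.append(f"step {i}: {result.learning}")
        # On failure nothing is committed, which stands in for the rollback.
        if result.done:
            break
    return state
```

The point of the design is that no step ever sees the full transcript of earlier steps, only their distilled notes, so the context stays small no matter how long the run lasts.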
I reach for gnhf in three kinds of situations:
Implement a massive plan: gnhf “fully implement this plan…”
Improve a measurable metric such as reducing lines of code, increasing test coverage, cutting startup latency, or page load time: gnhf “improve <this metric> while keeping product functionality unchanged.”
Run lots of offline experiments when you have an evaluator capable of scoring each attempt. For one project, I generated a game map by running 50+ layout experiments and scoring each through a gameplay simulation. Babysitting those by hand would have taken weeks.
Back to the image task: the agent has done the work, and it has produced a big change. Now what? This is where many people hit the real bottleneck: code review. There’s simply too much code to read, and reviewing it isn’t the fun part of the job.
I’m increasingly convinced that working with agents means acting like an engineering manager. Most managers rarely review code directly. They have the team review each other’s work, and before anything ships, they ask for evidence that it actually works. It’s the same with AI, except the developer is the manager and the agents are the team. You have to get good at using agents to scrutinize agents’ code, get them to self-correct, and get them to produce artifacts that demonstrate the feature really works.
I’ve experimented a lot here, and a few things turned out to matter most:
Run the reviewer agent in a fresh context window. If you review in the same session that wrote the code, the agent is biased by what it just did and assumes it was intended. It’s like asking someone to check their own work. They’ll catch some things, but it’s far weaker than a real peer review.
Escalate ambiguous, product-changing decisions to the human. Reviewers make mistakes, too. If you let the agent auto-fix every finding as if it’s all valid, it can drift into rabbit holes away from what you actually want. Keeping those decisions with the human keeps you in control of the ambiguity.
Force end-to-end evidence. Today’s frontier models lean heavily on unit tests to validate changes, probably because of how they’re trained. But I’ve seen countless cases where every unit test passes, and the product is still buggy. You have to make the agent prove the change works E2E.
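As a rough illustration of the second and third points, here is a small sketch of how review findings might be triaged and how a ship gate might demand evidence. It is not how no-mistakes works internally; `Finding`, `triage`, and `ready_to_ship` are hypothetical names, and treating a finding as escalation-worthy when it is either ambiguous or product-changing is a conservative assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    AUTO_FIX = "auto_fix"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Finding:
    summary: str
    changes_product_behavior: bool  # would the fix alter what users see or do?
    ambiguous: bool                 # is the right fix unclear without more intent?


def triage(finding: Finding) -> Action:
    # Assumption: either flag alone is enough to hand the call to the human.
    if finding.changes_product_behavior or finding.ambiguous:
        return Action.ESCALATE
    return Action.AUTO_FIX


def ready_to_ship(
    unit_tests_pass: bool,
    e2e_evidence: List[str],  # e.g. screenshot paths or captured logs
    open_escalations: int,
) -> bool:
    # Passing unit tests alone are not enough: require at least one
    # end-to-end artifact and no unanswered escalations.
    return unit_tests_pass and len(e2e_evidence) > 0 and open_escalations == 0
```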
I packaged all of this into an open-source tool I built called no-mistakes, which I now run on the image change. First, I use the neogit plugin to quickly scan the diff and make sure it’s roughly aligned with what I asked. Sometimes an agent goes in a completely wrong direction, and that’s obvious at a glance.
If it looks reasonable, I run no-mistakes -y and it handles the rest: commit with a conventional message into a descriptively named branch, rebase onto the latest main and resolve conflicts, spin up agents to peer-review and self-correct obvious bugs, test the change E2E and produce evidence, close documentation gaps, fix linting, push, open a well-structured PR, and babysit CI until it’s green.
All of it runs autonomously except for the decisions it deliberately escalates to me.
This validation pipeline has become one of the most critical pieces of my whole workflow. My own stats show that 68% of the changes I pushed through the no-mistakes tool had bugs in them. I genuinely can’t imagine what my codebase would look like without it.
With a fully autonomous implement-and-validate pipeline, a single task can take a while, and that’s a good thing, because it frees me up to run more things at once.
In tmux, this means opening a new window. Terminal tabs achieve the same parallelism. The important part is keeping those tabs visible and showing each agent’s status in the tab title. Claude Code and Codex do this out of the box; for harnesses that don’t, I wrote small plugins to do the same, and I made the no-mistakes tool report its status as a custom title too.
That one detail is what lets me run many sessions without going insane. At any moment, I can see which agents are running, which are done, and which need my input. A couple of tmux keystrokes jump me to any tab.
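For harnesses that do not set a title themselves, the underlying mechanism is small. The sketch below builds the standard xterm-style escape sequence (OSC 2) that sets a terminal title; the `status_title` helper and its status labels are hypothetical, and how tmux surfaces the title depends on its configuration.

```python
import sys


def title_sequence(title: str) -> str:
    # OSC 2 ; <text> BEL sets the window title in xterm-compatible terminals.
    # Control characters are dropped so the title cannot end the sequence early.
    safe = "".join(ch for ch in title if ch.isprintable())
    return f"\x1b]2;{safe}\x07"


def status_title(task: str, status: str) -> str:
    # Hypothetical status labels, shown as a prefix so they survive truncation.
    labels = {"running": "[run]", "done": "[done]", "needs_input": "[input]"}
    return f"{labels.get(status, '[?]')} {task}"


def set_title(task: str, status: str) -> None:
    sys.stdout.write(title_sequence(status_title(task, status)))
    sys.stdout.flush()
```

A wrapper would call `set_title("image-attach", "running")` when the agent starts and update it again when the agent finishes or asks a question.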
The other problem with running agents in parallel is that they step on each other’s toes when they share a directory.
Git worktrees exist to solve this. A worktree is an additional working directory attached to the same repo, with its own checked-out branch, so you can work in parallel without a second full clone. But worktrees carry a lot of cognitive load: where to put them, how to name them, whether to create a new one or reuse an old one, which are in use, which have dependencies installed, and whether env files are ready. What I actually want to think about is the work, not where to do it.
So I built another open-source tool named treehouse. You’re in a repo, you want to start a parallel task, you run treehouse, and it drops you into a ready worktree. Behind the scenes, it manages a pool of worktrees, tracks which are free, reuses an idle one when possible (so dependencies, build artifacts, and env files are already there), and makes sure it’s synced with the latest main before dropping you in. I don’t think about any of that. I simply run the treehouse tool and start working.
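The pooling idea can be sketched in a few lines. This is an illustration of the bookkeeping only, with hypothetical names (`Worktree`, `WorktreePool`, the `wt-N` path scheme); it is not treehouse's code, and a real tool would also run `git worktree add` and sync the tree with main, which this sketch deliberately leaves out.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Worktree:
    path: str
    in_use: bool = False
    deps_installed: bool = False  # survives across reuse, which is the payoff


@dataclass
class WorktreePool:
    root: str
    trees: Dict[str, Worktree] = field(default_factory=dict)

    def acquire(self) -> Worktree:
        # Prefer an idle worktree so dependencies and build artifacts are reused.
        for tree in self.trees.values():
            if not tree.in_use:
                tree.in_use = True
                return tree
        # Otherwise grow the pool. A real tool would create the worktree on
        # disk here; this sketch only records it.
        path = f"{self.root}/wt-{len(self.trees) + 1}"
        tree = Worktree(path=path, in_use=True)
        self.trees[path] = tree
        return tree

    def release(self, tree: Worktree) -> None:
        # Mark the tree idle but keep it around for the next task.
        tree.in_use = False
```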
I repeat this and usually end up managing 5 to 10 tasks at once. I don’t context-switch much because most of them go straight to a clean PR with no involvement from me. Occasionally, the no-mistakes tool escalates something for a decision, but most of the time, I’m just thinking about and writing the next instruction.
The diagram below tries to show the parallelization angle that I’m talking about:
A little while later, the image-attachment task’s pipeline finished and handed me a PR that was ready to merge. Many issues had been caught and auto-fixed along the way, all logged on the PR so I can audit them. My favorite part is the “Testing” section, which presents evidence (including screenshots) that the feature works end-to-end.
Every couple of weeks, I have to drive my son to a birthday party, where I’d find myself useless for a few hours. He’d be having a great time with his friends while I sat somewhere with no Wi-Fi, missing my agents and wondering whether they were blocked and waiting on me.
That stopped once I set up the remote control feature. I don’t use the built-in remote features from Claude Code or Codex for a few reasons:
I want one consistent workflow across all my agents, not separate apps that do the same thing but stay siloed because different companies want to lock me in.
I want real, full terminal access — not an agent-only view — so I can also run treehouse, no-mistakes, and gnhf.
I want perfect continuity across phone, laptop, and PC. My son has zero patience: if he says it’s time to leave, I get up and go. If I typed half a sentence on my phone, I want to finish it on my PC later.
So I set up Tailscale, which puts my PC, laptop, and phone on the same private network where they can reach each other safely. I then ssh between them (which on Mac just means enabling “Remote Login”). On my phone, I use an SSH client to connect to my Mac, attach to my tmux session, and instantly I’m in the same workspace with the same tabs, same agents, same environment.
To keep the connection stable, I use mosh, which uses SSH to log in and then switches to its own UDP-based protocol built specifically to keep terminal state in sync over flaky networks, which matters a lot on cellular. It is the same experience, just more resilient.
So how does all of this come together on a normal day?
It usually starts with me talking instead of typing, describing a feature or a gnarly refactor by voice.
If it’s complex, I have the agent draft a plan in Lavish Editor and iterate on it in the browser until it’s right.
Once the plan is solid, I either ask the agent to implement it directly or hand it to gnhf if it’s big, while I spin up a fresh worktree with treehouse and start the next task in a parallel tmux window.
When an agent finishes, I don’t read massive diffs line by line. I run no-mistakes, which reviews the code, tests it end-to-end, and opens a clean PR while I move on.
When I’m away from my Mac, I SSH in from my phone, and the whole workspace stays with me.
Here’s what the workflow looks like on a high level:
Each tool removes one specific point of friction, and together they chain into a smooth workflow I genuinely enjoy. I get to stay at the level of deciding what to build and whether it’s good, while most of what’s in between runs itself.
That’s everything I can think of that made a meaningful difference in my workflow. Reflecting on how it came together over the past couple of years, my biggest realization is that as the models keep advancing, the tooling and workflow around them will keep evolving too. What works well today may be obsolete a few months from now.
At the same time, relying only on off-the-shelf products like Claude Code and Codex is never quite enough. There’s always room for a better, more efficient workflow to take the agents a step further. I benefited a lot from building custom tools to remove whatever friction I hit. You’ll likely face a different set of problems because you work on different projects with different processes.
So I’d encourage you to never accept anything that slows you down. If part of your workflow frustrates you, chances are others are hitting the same thing. Find a tool that fixes it, or build one and share it. We’re in the middle of an industrial revolution. It’s the best time to be creative and redefine how things should work. Let’s tinker and have fun.
2026-06-22 23:30:47
Run npx workos@latest to launch an AI agent that reads your project, detects your framework, and writes the auth integration directly into your codebase. No account required upfront. WorkOS automatically creates your environment and keys, then lets you claim the project when you're ready.
Once installed, manage users, orgs, and environments directly from the terminal.
This is Part 2 of our series with Shah Rahman, Global Head of Autonomous ML Iteration & Optimization for Ads at Meta, where he architects AI-native infrastructure and multi-agent systems at hyperscale. Connect with him on LinkedIn.
Part 1, published two weeks ago, was written for the individual engineer. Shah covered:
The shift from engineer to orchestrator
The four core practices: context engineering, spec-driven development, critical verification, and problem decomposition
The Agentic Development Life Cycle (ADLC)
The security guardrails that are no longer optional
Part 1 was about the person. Part 2 is about the organization. Here we cover:
Pod-based structures and the Agent Champion model
The leadership crisis from first principles: ownership, empathy, and deciding what to build
A phased transformation playbook, plus the metrics that prove it worked
Individual gains do not become organizational gains on their own. This is the playbook for making that leap. Let’s dive in.
AI-native leadership is the most significant organizational transformation since the industry moved to agile more than a decade ago. Several companies watched AI-generated code climb from zero to 50 or 60% of their output inside a single year. Select teams have posted 2 to 10x productivity gains.
But we keep learning the hard way: individual tool usage produces individual gains, while systemic improvement takes deliberate leadership and a redesign of how work flows.
The evidence is hard to argue with. Around 70% of transformation success comes from operational and cultural change rather than from deploying technology. And most organizations get this wrong. They distribute tools, measure adoption rates, and then wonder why velocity refuses to move.
But some organizations are getting real results. At Shopify, CEO Tobi Lutke told employees that AI usage is now a baseline expectation, and that teams have to show why a task cannot be done by AI before they ask for more headcount. At Klarna, AI-driven restructuring reduced the workforce by more than a thousand people. These organizations treat AI as a fundamental operating model change, not a tooling upgrade. Almost everyone else is now racing to catch up.
The atomic unit of AI-native engineering is the small, cross-functional team: 3 to 5 people operating autonomously with AI agents and tools. The hierarchies established during the dot-com era, all those layers of managers, leads, and coordinators, are being dismantled.
When a 10x engineer armed with AI tools can do what used to take a much larger group, the organizational consequences are significant. Some pods now report directly to senior leaders based on strategic importance. Team impact gets redefined around outcomes rather than headcount.
The results from one established team’s pod pilot were striking: 3 projects running on self-sufficient agentic loops, more than 90% engineer adoption across the org in under two months, and features built in hours rather than days using agent-assisted development loops.
Roles become fluid in this setup. Engineers may design, designers may code, and product managers may prototype directly. This is not role confusion; it is capability amplification. AI removes the traditional skill bottlenecks, so teams operate with more judgment and less procedural overhead.
Most AI agents work in demos — but fail in production. Learn how to build durable, enterprise-ready AI agents with open-source frameworks using Orkes Agentspan and Conductor. This whitepaper explores how to orchestrate long-running, fault-tolerant agent workflows with built-in governance, observability, retries, and human approvals. See how Agentspan compares to LangGraph, CrewAI, and AutoGen for real-world enterprise AI systems. If you’re building AI workflows that need reliability, scale, and control, this guide shows the architecture patterns that make production-grade agents possible.
While your implementation will be specific to your org, here’s a usable template:
Start with 1 or 2 pilot pods aimed at high-priority challenging issues that block entire teams.
Strip out non-essential review layers and reduce pre-approval friction.
Formalize autonomy so pods can decide for themselves between failing fast and pushing forward.
Only scale after the pilot metrics validate the results. Resist arbitrary rollout timelines.
Every pillar should name 1 or 2 full-time Agent Champions, responsible for reshaping workflows, preparing codebases, and restructuring operating models. This is not a side-of-desk assignment. It calls for dedicated, high-agency technical leaders who spend 50 to 100% of their time on the transformation itself.
The Champion model reaches well beyond traditional engineering:
Product mgmt. champions redesign product reviews, experiment workflows, and cross-functional handoffs for autonomous execution.
Design champions build agent-first prototyping frameworks while protecting craft standards.
Analytics champions let agents run analyses at a scale that was never possible before, on top of an AI-native data infrastructure.
One important note: engineers working with Agent Champions write 70%+ of their code with AI assistance, shifting from human-in-the-loop to human-on-the-loop. The implication is that when those engineers make manual edits, it signals missing AI context rather than business as usual.
Four things matter the most for anyone stepping into the Champion role:
Lead with personal AI adoption first: use the tools daily and share what happens, the wins and the failures alike.
Commit to the vision of AI as foundational to strategy, not an optional enhancement.
Remove barriers through structured, individualized engagement with each team.
Recognize impact based on productivity gains and business outcomes, never on tool usage metrics.
Senior leaders are spinning up “AI-native managers” and “AI-native leaders” groups that go deep on the operating context: processes, tools, reporting, and metrics. This is a competency evolution that educational institutions simply cannot keep pace with yet, hence the need for such learning and development groups at most organizations.
The leadership competency shifts from delegation to orchestration. You are managing multiple parallel AI workflows, not assigning tasks to humans. Technical depth becomes non-negotiable. Hands-on managers have to evaluate agent-generated code and stand up verification layers. And context engineering becomes a core leadership skill, because the precision of the guidance you give AI systems is the precision your teams inherit.
Before we go any deeper into the playbook, it is worth stepping back to the core crisis underneath it all.
This is the insight most organizations miss. The dominant narrative celebrates AI’s speed: solo founders shipping with agents, dramatic productivity claims, demos everywhere. But the parts of software development that were always hard remain hard:
Deciding what to build among competing options
Identifying the features users actually need
Prioritizing the capabilities customers will pay for
Knowing when to kill a project that lacks clear feedback
Have you heard that building great software is an act of empathy? AI cannot replicate a human understanding of user friction or the emotional stakes inside a product decision. Multiple Y Combinator partners have made the same argument: product taste, design sensibility, and customer empathy become the differentiating human skills once execution is commoditized.
The danger shows up when cheap coding invites excessive feature creation. Users do not get 10x more cognitive bandwidth just because you can ship 10x more features. Teams spiral into uncontrolled development and manufacture false progress.
The shift that matters is asking whether something should be built at all rather than asking if it can be built faster.
Anecdotally, most dysfunction in AI-native organizations comes from unclear ownership, not bad process. Even the most empowered teams get fuzzy when responsibility is ambiguous. Work gets picked up or dropped based on whatever is most urgent that day. Leadership becomes the escalation path for every decision, which hollows out middle management and triggers the great flattening.
Piling on more processes to fix a process failure only deepens the hole. The principle is that if something is important enough, give it to a single owner and make them accountable for the outcome.
We put this into practice with a “STO for Everything” model, where STO stands for Single Task Owner. Each one carries clear priority, authority, and decision rights. This single change turbocharged our transformation by eliminating the coordination tax that ambiguous responsibility almost always creates.
This matters because AI dramatically expands the surface area of parallel work. More projects in flight means more coordination overhead, which triggers an instinct to add process. When ownership stays undefined, those ad hoc processes become bureaucratic substitutes for accountability, and you end up in a vicious cycle.
You can automate coordination with agents (dependency tracking, scheduling, status summaries), but that only buys temporary relief. It masks the underlying challenges that nobody owns. The moment key people leave, those challenges surface and the systems collapse.
If you want to fix it, you must own the outcome, not the process. Map the STO model onto the human-on-the-loop paradigm: humans set direction, verify outcomes, and make irreducible judgments, while AI handles the mechanics of execution.
The most common failure I have watched play out is that teams spend months perfecting products that have no product-market fit. They polish the UI, add settings, refine the copy, all of it generating false progress without changing the trajectory. AI makes this temptation worse by dropping build costs to hours: the proliferation of code now drives an unvetted product frenzy.
The discipline is to test the hypothesis before committing to development. Ask “What is the scrappiest way to learn whether this matters?” before you build anything. The rapid prototyping ecosystem (Vercel’s v0, Replit Agent, Lovable, Bolt.new) makes that nearly costless.
Then design to 50-60%. Ship the minimal functionality that enables the core user journeys. Watch where users hesitate, misunderstand, or abandon. That tells you the real product challenges instead of the imagined ones. Over 70% of features never reach a real user. In the age of AI, there is no excuse for building fully polished features that nobody wants.
The temptation is real, but giving in to it may be what separates a losing product from a winning one.
Power users have moved past simple human-AI pairing and into orchestrating multiple specialized AI systems that effectively set up a council of agents. There are a few different forms these councils can take.
Role-based delegation treats agents as specialized staff, each with a distinct persona. Cross-evaluation systems deploy multiple agents to independently analyze a problem and review each other’s work. Assembly line workflows chain sequential specialization: architect, then designer, then coder, then reviewer.
The emerging pattern aims at autonomous, agent-driven development, where agents code, build, test, and fix issues while humans provide oversight. The key distinction is that agents drive the actual tasks, and humans step in when agents hit an obstacle, not the other way around.
A few touchpoints make this collaboration work. Every AI module ships with context files that carry a clear architecture context. Work breaks into small, manageable, verifiable chunks. Quality assurance never assumes the AI got it right. And multi-agent coordination manages the interactions between specialized agents.
Teams running an AI-first approach often report 2 to 10x acceleration across a wide range of tasks, conditional on getting the foundations right first.
Until 2025, humans had to drive agents hands-on. This year, AI agents have advanced enough that humans no longer need to sit in the driver’s seat. AI agents self-drive while humans provide oversight and governance and stay in the loop.
One large team made the shift cleanly. Humans set the plans and success criteria, AI executes the implementation and self-verifies, AI iterates on its own until the criteria are met, and humans review and approve the final output. This semi-autonomous approach delivered a 40 to 50% speedup over their previous development loop.
The other results have been just as compelling.
One team’s “Squad of AI Agents” approach drove revenue impact that used to be barely a P25 goal. Another rolled out AI-native workflows targeting 2X-plus productivity, with agents autonomously managing code from authoring through production. A third adopted AI-driven tech debt reduction and gained more than 60% productivity with no quality regression, moving to human-on-the-loop in under 4 months, a transition that usually takes 6 to 12.
Traditional metrics fall apart when AI generates thousands of lines of code in seconds. Measurement has to move from output-based to outcome-based.
Only 20 to 30% of an engineer’s time is spent coding. Speeding up code generation does not automatically translate into overall productivity. The surrounding work (review, testing, coordination, governance) accounts for the other 70 to 80%, and that is exactly where the bottlenecks form.
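The arithmetic behind this is the same as Amdahl's law: if only a fraction of the work is accelerated, the untouched remainder caps the overall gain. A small sketch, using the 20 to 30% coding share from above and illustrative (assumed) speedup factors for the coding portion:

```python
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Overall speedup when only the coding share of the work gets faster.

    coding_share:   fraction of total time spent coding (0.0 to 1.0)
    coding_speedup: how many times faster coding becomes (greater than 0)
    """
    if not 0.0 <= coding_share <= 1.0:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    # Time left = untouched work + accelerated coding work.
    remaining = (1.0 - coding_share) + coding_share / coding_speedup
    return 1.0 / remaining


def speedup_ceiling(coding_share: float) -> float:
    # Limit as coding becomes free: only the surrounding work remains.
    # Undefined (division by zero) when coding_share is exactly 1.0.
    return 1.0 / (1.0 - coding_share)
```

With a 25% coding share, doubling coding speed yields about 1.14x overall, and even infinitely fast coding tops out at about 1.33x. Larger gains have to come from the review, testing, and coordination work around the code.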
Research backs this paradox closely:
McKinsey found developers using AI assistants were 20 to 45% faster on discrete coding tasks, but cautioned that org-level gains were smaller and harder to measure.
Google’s DORA team found AI tools improved individual throughput without automatically improving deployment frequency or change failure rates, absent process changes.
Microsoft Research found a 26% increase in completed pull requests per week, but noted the review burden simply shifted onto other team members.
BCG put it best: real productivity gains require reshaping the work, not just adding tools. The same task done faster matters less than redefining which tasks are worth doing at all.
When one of our teams systematically removed those surrounding bottlenecks, they hit a 1.8 to 2.4x velocity improvement over six months.
Given the productivity paradox, the metrics used to measure productivity and transformation must be resilient to it:
AI-First MAU: 75% or more of code AI-generated.
Agent-assisted diffs: aim for at least 55% to see meaningful productivity gains.
L4-plus AI tool adoption: 80% or more weekly active usage across engineering functions.
For business impact, tie AI usage and productivity gains directly to revenue. Track feature velocity, where the 2 to 10x improvements in prototype-to-production timelines show up. And measure developer experience (satisfaction, flow state, collaboration effectiveness) right alongside the output metrics.
Quality has to be a core metric, not just a guardrail. Watch for “AI slop,” the gradual degradation of a codebase as AI-generated code piles up without adequate review. This is the “nobody’s problem” phenomenon, and it can quietly undermine an entire codebase.
Eight patterns that consistently derail transformation need special attention from every AI-native leader:
Tool bolt-on: AI tools bolted on without redesigning the workflow, producing minimal impact. This is the most common failure mode.
Review bottleneck: Traditional review steps become the throughput limit once AI accelerates generation.
Prompt cargo culting: Teams copy external prompts without context and get poor agent performance. The bottleneck is context engineering skill, not the model.
Metrics gaming: Teams optimize for agent-generated code percentage or adoption stats instead of outcomes.
Security shortcuts: Privileged agents deployed without proper audit controls. Some of the resulting production incidents are real and expensive.
Knowledge debt: Verification and specification fall behind agent-generated work, creating maintainability risk that compounds over time.
Junior pipeline hollowing: Early-career developer experience degrades when human validation gets outsourced to agents. The “missing rung” talent pipeline problem turns into a long-term sustainability risk.
Meeting creep: AI acceleration paradoxically makes room for more frequent syncs with no clear impact. The coordination overhead swallows the time that the faster generation saves.
Success depends on systematically detecting these patterns and rolling out with a real change-management framework like ADKAR (Awareness, Desire, Knowledge, Ability, Reinforcement), backed by structured rollouts and feedback loops. Tool distribution and usage metrics alone will not drive transformation.
Now that the AI-native leadership groundwork is in place, here is the phased playbook.
Leadership credibility: Use AI personally every day with measurable goals. Target 50% or more of your daily tasks within 30 days, then push higher. Get hands-on mastery of the tools so your understanding is authentic. Share both successes and failures in public to model the learning mindset.
Agent Champion designation: Identify high-agency technical leaders who can dedicate 50 to 100% of their time. Pull leaders and individual contributors closer together for faster decisions. Set up cross-functional coordination so the whole end-to-end workflow gets transformed.
Pilot pod formation: Start with a codebase AI readiness assessment. Form 3 to 5-person cross-functional teams with autonomous operation charters. Aim them at real problems, not toy exercises, so the momentum is genuine.
Workflow transformation: Audit the high-friction manual workflows that are good candidates for AI. Move from human-in-the-loop to human-on-the-loop. Build AI-readable documentation and specification systems that turn tribal knowledge into shared knowledge.
Cultural transformation: Establish psychological safety: MIT research found 83% of leaders believe psychological safety measurably improves AI initiative success. Formalize “AI failure story” sessions. Shift measurement from output to outcome.
Technical foundation: Clear out dead code, technical debt, and documentation debt to improve AI readiness. Implement sandboxing controls, audit mechanisms, and automated security checks. Build the verification layers that autonomous AI operation depends on.
Flatten hierarchies: Remove the coordination layers that slow AI-accelerated work. AI-native builders and AI-native leaders are what you need (consider the STO model).
Impact-based progression: Reward leverage and outcomes over team size. Define the success metrics that genuinely matter for your organization, and make AI tools and agents your highest-leverage assets.
Cross-functional fluency: Let roles flex as AI removes traditional skill barriers. Break down the walls between product management, design, engineering, data science, and field support, so AI-native builders can move fluidly and accelerate their builds.
Throughout the process, track the compounding gains that show up beyond the initial productivity bump. Connect AI adoption to strategic business metrics. And hold quality standards rigorously, because velocity without quality is negative value.
Impact scaling: If you shipped 10x faster, would users be 10x happier? If not, you may be optimizing the wrong thing.
Empathy depth: Do you understand your users well enough to delete half the interface? Without empathy, more AI-generated features will not fix products that make users feel incompetent.
Learning velocity: Are you processing real user behavioral data every week? If not, your bottleneck is cycle time to insight, not cycle time to code.
Ownership clarity: Does every major initiative have a single owner? Ownership problems wearing a process-problem costume only get worse under AI acceleration.
Hypothesis discipline: Are you testing theses or building products? If you cannot name the signal that would kill your project, you are committed to something with no user validation behind it.
Here is the counterintuitive truth: AI does not reduce the need for process, it changes what the process is for.
Pre-AI processes coordinated execution among humans. AI compresses execution while raising the cost of deciding what is worth executing. The world now runs on simultaneous builds, parallel experiments, and stacks of prototype iterations. The leadership decisions about what matters, what to cut, and what to double down on become the binding constraint.
Process optimization comes down to three questions:
What are we learning this week? Reward faster, deeper learning across teams.
What are we killing this week? Actively retire the products and agents that lack genuine value.
Who owns each bet, and what signal would change their mind? STOs steer on objective signals, not intuition and not AI-generated content.
Everything else is overhead.
The window for this transformation is narrowing. Organizations that pull it off within the next year will open a 5 to 10x productivity gap over the ones that delay, and that gap will be brutally hard to close as AI-native practices compound.
The organizations that succeed show real advantages in product development velocity, technical innovation capacity, and their ability to attract top talent. The early-mover results (2.4x velocity improvements, 60%-plus AI-generated code, features built in hours instead of days) point to a fundamental capability shift rather than an incremental one.
A few closing thoughts:
The scarce resource has shifted. It went from generation and production to orchestration and judgment. When AI generates at near-zero marginal cost, the ability to evaluate quality, set direction, and make the hard calls becomes the bottleneck. Leaders who invest in building AI-native team capability will significantly outperform those who just deploy more agents.
Structural change is mandatory. The productivity paradox is real. Individual gains do not become organizational gains without redesigning the workflows, the measurement systems, and the cultural norms. The famous line, “culture eats strategy for breakfast,” only shines brighter under the AI light. No amount of transformation will save you if the foundations and the structure are not redesigned for the AI-native era.
Risk mitigation is continuous, not a one-time fix. Monitor AI-generated code quality and maintainability so technical debt does not accumulate. Address the security risks (prompt injection, memory corruption, access control, audit compliance) through embedded CI/CD checks. Prevent the “missing rung” talent pipeline problem by developing AI-native engineers at every level. And hold on to human values while you embrace AI acceleration, because human capital keeps paying dividends in the AI-native era when it is applied well.
AI changes the tools. It does not change the core reality. The hard part stays insanely human.
2026-06-20 23:30:36
AI shows up in 60% of engineering work. But only about a fifth of it can be handed off without someone babysitting the output. That’s because agents are missing context.
This 8-stage context maturity model gives a real answer on why you still get inconsistent output for all the tokens burned.
Join Unblocked live on June 24 (FREE) to learn:
Why more MCPs provide agents access but not understanding
What it takes to deploy agents you can trust without supervision
How a context layer solves for quality, efficiency and cost
This week’s system design refresher:
Claude Fable 5: Everything You Need to Know! (YouTube video)
12 Open-source LLMs
SLMs vs. LLMs, Clearly Explained
Single Agent vs. Multi-Agent Architecture
7 Permission Modes Every Claude Code User Should Know
Twelve models worth knowing in 2026, each with one standout strength.
Llama 4 Scout: Meta's first natively multimodal open-weight model.
DeepSeek V4: A Mixture-of-Experts model under MIT license with a native million-token context window. Near-frontier performance at a fraction of the cost per token.
Qwen3: Alibaba's flagship open-weight model with switchable thinking and non-thinking modes, all under Apache 2.0.
Gemma 4: Google's open-weight family released under Apache 2.0, with the widest language coverage of any model on this list.
Phi 4: Microsoft’s compact model trained almost entirely on synthetic, curated data. A practical choice for edge and on-device deployment.
Mistral Small 3.1: A VLM with a long context window that fits on a consumer laptop.
Nemotron 3 Super: NVIDIA’s hybrid MoE with a million-token context window. Fully open weights, datasets, and recipes, with strong results on agentic coding benchmarks.
GLM 5.1: The first open-weight model to top SWE-Bench Pro. Released under MIT with no commercial restrictions.
Kimi K2.6: Competitive with leading closed models on coding while costing far less per million tokens. Available on Hugging Face under a Modified MIT license.
StarCoder2: One of the most transparent code models available.
OLMo 2 (AI2): The most complete example of open-source reproducibility on this list. Weights, training data, code, and full recipes all released under Apache 2.0.
Falcon 3: A family of lightweight open-weight models built to run on a single GPU.
Over to you: which open-source model would you add to this list?
Speed without control is a false economy. As AI code-generation accelerates software delivery, the FeatureOps Summit 2026 is here to ensure that when we ship more, we break less. This premier virtual event brings together engineers, architects, and product leaders from companies like Samsung, Lloyds Banking Group, Wayfair, Visa, AWS, Allianz, and many others to explore the infrastructure of fearless delivery.
Key Themes:
AI Safety Nets: Guardrails for the flood of automated code.
Edge Resilience: Sub-millisecond evaluation at scale.
Continuous Flow: Moving past the “fixed-release” mindset.
Register today to master the tools and patterns required for a fail-safe release environment.
Big models cost more. Small models do less. Here's how SLMs and LLMs differ across the dimensions that matter in production:
Architecture: SLMs are usually under 10B parameters and run on a laptop or phone. LLMs sit at 10B+ with deeper layers and more attention heads, built for broad reasoning across tasks.
Task Complexity: SLMs work well on simple tasks but struggle when a task requires multiple reasoning steps. LLMs handle difficult math, multi-step code, and long-horizon planning.
Long Context Recall: SLMs lose the thread across long documents or extended conversations. LLMs reliably track and connect information across large inputs.
Latency and Cost: SLMs run on consumer hardware with low response times and significantly lower inference costs. LLMs require GPUs and carry higher costs per request.
Deployment and Privacy: SLMs run on-device or on-premise. LLMs are typically cloud-hosted, which adds data governance complexity.
Where each fits:
SLMs: on-device assistants, real-time classification, or privacy-sensitive applications.
LLMs: complex reasoning, agent workflows, or broad knowledge tasks.
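The "where each fits" split above can be sketched as a small routing function for a hybrid setup. This is a minimal illustration, not a production router: the `Request` fields, the `SLM_CONTEXT_LIMIT` threshold, and the `choose_model` name are all assumptions made for the example, and a real system would measure these signals rather than take them as flags.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    needs_multi_step_reasoning: bool = False  # hypothetical flag set upstream
    privacy_sensitive: bool = False           # hypothetical flag set upstream
    context_tokens: int = 0


# Assumed cutoff beyond which the small model is expected to lose recall.
SLM_CONTEXT_LIMIT = 8_000


def choose_model(req: Request) -> str:
    """Return "slm" or "llm" following the fit criteria listed above."""
    # Privacy-sensitive data stays on-device or on-premise.
    if req.privacy_sensitive:
        return "slm"
    # Multi-step reasoning and long inputs go to the large model.
    if req.needs_multi_step_reasoning or req.context_tokens > SLM_CONTEXT_LIMIT:
        return "llm"
    # Everything else takes the cheaper, lower-latency path.
    return "slm"
```

Checking privacy first is a design choice in this sketch: it treats data governance as a hard constraint and capability as a preference, which may not suit every workload.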
Are you using SLMs, LLMs, or a hybrid setup in production?
Some tasks need a single agent. Others need a whole team. Knowing the difference is the skill.
Single-agent system: One reasoning LLM that plans, picks a tool, and loops on its own until the task is done. Use a single agent when:
the task is a clear, linear sequence
one agent can hold the whole problem in its head
you want something simple to build and easy to debug
Multi-agent system: An orchestrator that splits a task into subtasks and routes each one to a specialized agent. Use multi-agent when:
subtasks can run in parallel
one agent writes and another independently verifies the work
the problem is too big for one agent to coordinate alone
Single agents are cheaper and easier to build, but they hit a ceiling on complex work.
Multi-agent systems are more capable and more reliable, but they add coordination cost.
Start with a single agent. Move to multi-agent only when context or reliability becomes the bottleneck.
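The two shapes described above can be sketched in a few lines. This is an illustration under stated assumptions: plain Python callables stand in for the LLM calls, and the names `run_single_agent`, `run_multi_agent`, `decide`, `split`, and `merge` are hypothetical, not taken from any agent framework.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def run_single_agent(
    task: str,
    decide: Callable[[str], Optional[str]],  # stand-in for the reasoning LLM
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> str:
    """One loop: pick a tool, apply it, repeat until decide() returns None."""
    state = task
    for _ in range(max_steps):
        choice = decide(state)
        if choice is None:  # the agent considers the task done
            break
        state = tools[choice](state)
    return state


def run_multi_agent(
    task: str,
    split: Callable[[str], list[tuple[str, str]]],  # orchestrator: (agent name, subtask)
    agents: dict[str, Callable[[str], str]],
    merge: Callable[[list[str]], str],
) -> str:
    """Orchestrator: split the task, run subtasks in parallel, merge the results."""
    subtasks = split(task)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of the subtasks.
        results = list(pool.map(lambda pair: agents[pair[0]](pair[1]), subtasks))
    return merge(results)


# Stub components so the sketch runs without any model behind it.
demo_tools = {"upper": str.upper}
demo_decide = lambda s: None if s.isupper() else "upper"
demo_agents = {"writer": lambda s: f"draft:{s}", "checker": lambda s: f"ok:{s}"}
demo_split = lambda t: [("writer", t), ("checker", t)]
```

The coordination cost mentioned above shows up even in the stub version: the multi-agent path needs a `split` and a `merge` step that the single loop does not.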
Over to you: Are you running single-agent or multi-agent systems in production?
7 Permission Modes Every Claude Code User Should Know
plan: The model drafts a plan. Nothing executes until the user approves.
default: Standard interactive use. Most tool calls require user approval.
acceptEdits: Edits in the working directory are auto-approved. Other shell commands still prompt.
auto: An ML classifier decides on requests that miss the fast path.
dontAsk: No prompts shown. Deny rules are still enforced.
bypassPermissions: Most prompts are skipped. Safety-critical guards still apply.
bubble: A subagent escalates its permission request to the parent.
Only 5 modes are user-selectable. “auto” is gated by a feature flag, and “bubble” is internal.
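The count above (seven modes, five of them user-selectable) can be checked with a small table. This is only a restatement of the list as data: the enum and the `RESTRICTED` mapping are illustrative names and do not reflect how Claude Code represents its modes internally.

```python
from enum import Enum


class PermissionMode(Enum):
    PLAN = "plan"
    DEFAULT = "default"
    ACCEPT_EDITS = "acceptEdits"
    AUTO = "auto"
    DONT_ASK = "dontAsk"
    BYPASS_PERMISSIONS = "bypassPermissions"
    BUBBLE = "bubble"


# Modes the list above describes as not freely selectable, with the stated reason.
RESTRICTED = {
    PermissionMode.AUTO: "gated by a feature flag",
    PermissionMode.BUBBLE: "internal: a subagent escalating to its parent",
}


def user_selectable() -> list[str]:
    """Return the mode names a user can pick directly, in list order."""
    return [mode.value for mode in PermissionMode if mode not in RESTRICTED]
```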
Over to you: Which mode do you reach for most, and what made you pick it?
2026-06-18 23:31:17
Observability for Beginners: Logs, Metrics, Traces, and Everything Around Them
A running service generates events constantly.
Requests arrive, functions run, errors appear, and each one is a thing that happened at a specific time with a specific context and a specific outcome.
Logs, metrics, and traces are three ways of looking at this same stream. A log captures one event as a line of text, a metric counts or aggregates many events, and a trace links related events as they move across services. Most of the concepts in observability, including cardinality, sampling, and correlation, are consequences of this structure.
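The "same stream, three views" idea can be sketched directly. This is a minimal illustration with made-up events: the `Event` fields, the sample services, and the helper names are assumptions for the example, not the schema of any particular observability tool.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float
    service: str
    trace_id: str
    name: str
    status: str


# Hypothetical events from two requests crossing three services.
events = [
    Event(1.00, "gateway", "t1", "GET /cart", "ok"),
    Event(1.02, "cart", "t1", "load_cart", "ok"),
    Event(1.05, "gateway", "t2", "GET /pay", "error"),
    Event(1.07, "payments", "t2", "charge", "error"),
]


def as_log_line(e: Event) -> str:
    """Log view: one event rendered as one line of text."""
    return f"{e.timestamp:.2f} {e.service} trace={e.trace_id} {e.name} status={e.status}"


def as_metric(evts: list[Event]) -> Counter:
    """Metric view: many events aggregated into counts per (service, status)."""
    return Counter((e.service, e.status) for e in evts)


def as_traces(evts: list[Event]) -> dict[str, list[str]]:
    """Trace view: related events linked by trace_id, in time order."""
    traces: dict[str, list[str]] = defaultdict(list)
    for e in sorted(evts, key=lambda ev: ev.timestamp):
        traces[e.trace_id].append(f"{e.service}:{e.name}")
    return dict(traces)
```

The metric key here is `(service, status)`, which has only a handful of distinct values. Adding `trace_id` to that key would create one counter per request, which is the cardinality problem in miniature.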
In this article, we will look at the basics of observability in detail, covering logs, metrics, traces, and the concepts that surround them.
2026-06-17 23:31:10
We’re launching Cohort 2 of our 2-day intensive, cohort-based course, Build with Claude Code, taught by John Kim, who has trained hundreds of engineers at Meta to use Claude Code in real production workflows.
The course kicks off on June 18th, and enrollment closes in less than 24 hours. If you’ve been thinking about leveling up how you and your team work with Claude Code, this is the moment.
A few things you’ll learn:
The agentic loop, context engineering, and memory layers that make Claude Code useful for real projects
How to build with Claude Code Skills, MCPs, and hooks to give Claude the tools and feedback loops it needs to self-correct
Parallel development with Git worktrees, subagents, and agent teams
A capstone project where you ship something real on your own stack
The course includes live sessions, assignments, and office hours, so there’s plenty of room to ask questions and get unstuck.
The second cohort starts in just a few days: June 18 to 19, 2026. If you want to learn everything from the fundamentals of Claude Code to advanced production workflows, including working with large codebases, this could be a great way to level up.