2026-04-28 11:03:50
Most teams pick a search provider by running a few test queries and hoping for the best – a recipe for hallucinations and unpredictable failures. This technical guide from You.com gives you an exact framework for evaluating AI search and retrieval.
What you’ll get:
A four-phase framework for evaluating AI search
How to build a golden set of queries that predicts real-world performance
Metrics and code for measuring accuracy
Go from “looks good” to proven quality.
Every time you buy something online from a Stripe-powered business, a machine learning model evaluates over 1,000 signals about your transaction and decides in under 100 milliseconds whether to let it through.
Across billions of legitimate payments, it reaches the correct verdict 99.9% of the time. The system that delivers those numbers, however, looks entirely different from what Stripe originally built.
The architecture has been overhauled multiple times, and one of the most important upgrades required removing a component the team knew was actively improving accuracy, because keeping it was holding back everything else the team wanted to do.
For reference, online payment fraud occurs in roughly 1 out of every 1,000 transactions. That rarity makes fraud detection a difficult machine learning problem because the system has to surface a small number of fraudulent payments from a massive volume of legitimate ones, and it has to do this quickly and cheaply on every single transaction.
In this article, we will look at how Stripe’s Radar does this effectively and the architectural decisions the team made while building it.
Disclaimer: This post is based on publicly shared details from the Stripe Engineering Team. Please comment if you notice any inaccuracies.
Stripe began with relatively simple ML models like logistic regression (a statistical method that predicts the probability of an outcome based on input variables). Over time, as the Stripe network grew and ML technology advanced, they moved to more complex architectures. Each jump produced an equivalent leap in model performance.
The architecture preceding the current one was called Wide & Deep. It combined two models into an ensemble.
The “wide” component was XGBoost, a gradient-boosted decision tree model that combines many small decision trees into one powerful predictor. XGBoost excelled at memorization, meaning it was strong at recognizing specific patterns and feature correlations it had encountered in training data.
The “deep” component was a deep neural network (DNN) that excelled at generalization, meaning it could learn abstract concepts like “unusual payment velocity on a card” and apply them to entirely new situations it had never seen before.
Together, the two components worked well. But XGBoost was creating operational bottlenecks. It was hard to parallelize, which made retraining the combined model slow. It was incompatible with advanced ML techniques Stripe wanted to adopt, such as transfer learning (using knowledge gained from one task to improve performance on a different but related task) and embeddings. And it limited how quickly the many engineers working on Radar each day could experiment with new ideas.
Simply dropping XGBoost would have caused a 1.5% drop in recall, meaning 1.5% more fraud would go undetected. That was an unacceptably large regression in performance. The value XGBoost provided was real and measurable, so it had to be replicated within a new architecture rather than just discarded.
Stripe’s solution drew inspiration from a research architecture called ResNeXt.
The core idea, sometimes called “Network-in-Neuron,” splits computation into multiple distinct branches, where each branch functions as a small neural network on its own. The outputs from all branches are summed to produce a final result. This multi-branch approach enriches feature representation along a new dimension, and it achieves this more effectively than the brute-force approach of simply making a DNN wider or deeper, which risks overfitting (the model memorizing random noise rather than learning real patterns).
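To make the branch-and-sum idea concrete, here is a minimal PyTorch sketch of a multi-branch block. The branch count, widths, and residual connection are illustrative assumptions, not details of Stripe’s actual Shield NeXt architecture.

```python
import torch
import torch.nn as nn

class MultiBranchBlock(nn.Module):
    """Toy "Network-in-Neuron" block: parallel small MLPs whose outputs are summed."""

    def __init__(self, dim: int, num_branches: int = 8, branch_width: int = 16):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, branch_width),  # project into a narrow branch
                nn.ReLU(),
                nn.Linear(branch_width, dim),  # project back to the shared width
            )
            for _ in range(num_branches)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the branch outputs and keep a residual connection.
        return x + sum(branch(x) for branch in self.branches)

block = MultiBranchBlock(dim=64)
out = block(torch.randn(32, 64))  # a batch of 32 transactions, 64 features each
```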
The resulting architecture, internally called Shield NeXt, reduced training time by over 85%, bringing it to under two hours. Experiments that previously required overnight jobs could now run multiple times in a single working day. Stripe is now exploring techniques that this architectural shift made possible, including multi-task learning, where a single model is trained to handle several related objectives simultaneously.
Agents can generate code. Getting it right for your system, team conventions, and past decisions is the hard part. You end up babysitting the agent and watching the token costs climb.
More MCPs, rules, and bigger context windows give agents access to information, but not understanding. The teams pulling ahead have a context engine to give agents only what they need for the task at hand.
Join us live (FREE) on May 6 to see:
Where teams get stuck on the AI maturity curve and why common fixes fall short
How a context engine solves for quality, efficiency, and cost
Live demo: the same coding task with and without a context engine
The model architecture matters, but Radar’s biggest competitive advantage comes from the data flowing through the Stripe network. Stripe has engineered specific mechanisms to convert that scale into model performance.
90% of cards used on the Stripe network have been seen more than once across different merchants. A single business has visibility into only its own transactions. Radar, by contrast, sees patterns across millions of businesses and thousands of partner banks around the world.
There is also a structural advantage in how Radar gets its training labels, the data that tells the model which past transactions were actually fraudulent. Since Radar is built directly into Stripe’s payment flow, it receives these labels automatically when cardholders dispute charges. Most third-party fraud solutions require businesses to build separate data pipelines for sending payment labels back to the fraud provider, or to label payments manually, which is time-consuming and error-prone. Radar sidesteps all of this by ingesting ground truth data straight from the payment flow and card networks.
Stripe uses hundreds of features in its model, and most of them are aggregates computed across the entire network. As the network grows, each feature becomes more informative because the training data better represents the feature’s real-world distribution.
A “feature” in this context is a single signal the model uses to evaluate a transaction. Some are intuitive. For example, does the cardholder’s name match the provided email address? How many different cards have been associated with this IP address? A high count might indicate someone testing stolen cards. Other features are more surprising. The difference between the device’s local time and UTC, or the count of countries where a card has been successfully authorized, both turn out to be meaningful fraud signals.
Finding new features is part forensics and part experimentation. Stripe’s team reviews past fraud attacks in detail, building investigation reports that try to reconstruct how fraudsters operate. They look for patterns in throwaway email addresses used to set up multiple accounts quickly. They monitor dark web activity weekly. From this research, they build a prioritized list of candidate features, implement each one rapidly, and prototype them to measure model impact.
Sometimes the most promising ideas yield little. For example, Stripe once built a feature capturing whether a business was currently under a distributed fraud attack. It barely moved the model performance because the model was already learning that pattern implicitly.
One of the more powerful techniques Stripe uses is embeddings, which are learned numerical representations for categorical data. Things like merchant identity, issuing bank, user country, and day of the week have many possible values, and defining useful numerical representations for them is challenging.
Stripe trains its model to learn an embedding for each value, essentially a set of coordinates that position it relative to others based on transaction patterns. Uber and Lyft, for example, would end up with similar embedding coordinates because their transaction patterns resemble each other, while Slack would be positioned very differently.
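A hedged sketch of the idea in PyTorch: each categorical value gets a trainable vector, and the fraud-detection training objective pulls values with similar transaction patterns toward nearby coordinates. The vocabulary and dimension below are made up for illustration.

```python
import torch
import torch.nn as nn

merchant_ids = {"uber": 0, "lyft": 1, "slack": 2}  # hypothetical vocabulary

# One trainable 8-dimensional vector per merchant; gradients from the fraud
# objective move merchants with similar behavior toward similar coordinates.
embedding = nn.Embedding(num_embeddings=len(merchant_ids), embedding_dim=8)

uber, lyft, slack = embedding(torch.tensor([0, 1, 2]))
similarity = nn.functional.cosine_similarity
# After training, we would expect similarity(uber, lyft) > similarity(uber, slack).
print(similarity(uber, lyft, dim=0).item(), similarity(uber, slack, dim=0).item())
```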
Embeddings enable geographic transfer of fraud knowledge. If Stripe identifies a new fraud pattern in Brazil, the embeddings allow the system to recognize that same pattern in the US automatically, without retraining. The model essentially learns which merchants and regions behave similarly, then applies fraud knowledge across the entire network.
Stripe also found that scaling up training data continued to yield significant gains. A 10x increase in training transaction data still produced meaningful model improvements, and the team was working on a 100x version. This kind of scaling was only feasible because the DNN-only architecture could train fast enough to handle much larger datasets practically.

Having a great model and great data still leaves a fundamental question unanswered.
How much fraud should you actually block?
Every fraud detection system faces an inherent tension between two types of errors:
A false negative is when fraud slips through undetected, costing the business the product, a chargeback fee, and potential reputational damage with card networks.
A false positive is when a legitimate customer gets blocked, and the business loses the sale, along with potentially the customer forever. A survey found that 33% of consumers said they would stop shopping at a business after a single false decline.
These two errors exist on a curve.
Precision measures the fraction of blocked transactions that are actually fraudulent. Recall measures the fraction of all actual fraud that gets caught. As you raise the blocking threshold, requiring a higher fraud probability before blocking a payment, precision goes up because you become more selective about what you block. But recall goes down because more marginal fraud slips through. Lowering the threshold pushes things in the opposite direction.
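The threshold tradeoff is easy to see in a few lines of Python. This is a toy illustration with made-up scores and labels, not Stripe’s evaluation code.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision/recall when blocking all payments scored >= threshold.

    scores: model fraud probabilities; labels: 1 = actual fraud, 0 = legitimate.
    """
    blocked = [(s, y) for s, y in zip(scores, labels) if s >= threshold]
    true_pos = sum(y for _, y in blocked)
    total_fraud = sum(labels)
    precision = true_pos / len(blocked) if blocked else 1.0
    recall = true_pos / total_fraud if total_fraud else 0.0
    return precision, recall

# Toy data: raising the threshold lifts precision and eventually costs recall.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
labels = [1, 1, 1, 0, 1, 0]
for t in (0.5, 0.7, 0.85):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```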
Stripe frames this as two distinct problems:
The data science problem is about making the model better by adding predictive features, training on more data, and refining the architecture. A better model shifts the entire precision-recall curve upward, meaning that at any given threshold, you get better outcomes on both dimensions.
The business problem is about choosing where on that curve to operate, and the right answer depends entirely on the economics of each merchant.
For example, consider two businesses.
A food delivery company with thin margins might earn $2 in profit per order. Once you account for product cost and chargeback fees, a single fraudulent transaction can wipe out the profit from nearly 19 legitimate ones. For this business, aggressive blocking makes sense because the cost of missed fraud is devastating. On the other hand, a SaaS company with high margins faces the opposite calculation. The lifetime revenue lost by blocking a legitimate subscriber who would have paid $200 per month for years far outweighs the cost of an occasional fraudulent charge.
This is why Stripe built Radar to be configurable.
Merchants can adjust their risk threshold, and Radar for Fraud Teams lets them compose custom rules and set up manual review queues.
Stripe evaluates custom rules with the same precision-recall framework it uses for the model itself. When a merchant creates a rule, Stripe shows historical statistics on matching transactions that were actually disputed, refunded, or accepted, so the merchant can evaluate the impact before the rule goes live. Stripe also uses additional evaluation tools like ROC curves and AUC (area under the curve) scores to assess overall model quality, but the precision-recall framing captures the core tension most directly.
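Conceptually, backtesting a rule reduces to filtering history and counting outcomes. A minimal sketch, with hypothetical field names rather than Stripe’s actual schema:

```python
def backtest_rule(rule, history):
    """Score a candidate blocking rule against past transactions.

    `history` holds transaction dicts with an `outcome` field
    ('disputed', 'refunded', or 'accepted'); names are illustrative.
    """
    matched = [t for t in history if rule(t)]
    disputed = sum(t["outcome"] == "disputed" for t in matched)
    precision = disputed / len(matched) if matched else 0.0
    return {"matched": len(matched), "disputed": disputed, "precision": precision}

# Example rule: block payments over $500 where card country differs from IP country.
rule = lambda t: t["amount"] > 500 and t["card_country"] != t["ip_country"]
print(backtest_rule(rule, history=[
    {"amount": 900, "card_country": "US", "ip_country": "BR", "outcome": "disputed"},
    {"amount": 650, "card_country": "US", "ip_country": "US", "outcome": "accepted"},
]))
```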
Manual review adds yet another lever.
Sending borderline transactions to human reviewers instead of blocking them outright improves precision with minimal impact on recall; sending them to review instead of allowing them through improves recall with minimal impact on precision. The cost is human effort, but it gives merchants a way to reshape their own precision-recall curve using business knowledge the model cannot access on its own.
All machine learning models are opaque to some degree, and deep neural networks especially so. Stripe accepted this when they chose DNNs over simpler, more interpretable techniques. The predictions are better, but explaining why a specific transaction received a given score is harder.
Stripe’s response was to build layers of interpretability around the model.
In 2020, they launched risk insights, a feature that shows merchants which factors contributed to a transaction being declined. The interface displays the top fraud signals, like an address being associated with a previous early fraud warning or an unusually high number of names linked to a card. It includes a location map showing distances between the billing address, shipping address, and IP address. It shows customer metadata like email, cardholder name, and the authorization rate for transactions associated with that email.
Stripe also uses Elasticsearch, a search engine optimized for fast lookups across large datasets, to surface related transactions and help merchants put a specific decline in a broader context.
Internally, the team built a table view displaying the exact features contributing most to a transaction’s fraud score, which engineers use to debug support cases. Stripe is working on sharing more of these internal tools with merchants, closing the gap between what engineers can see and what users can see.
Explainability serves a practical purpose beyond building trust. When merchants understand why Radar scored a transaction the way it did, they can improve the data they send to Stripe for more accurate decisions. They can create custom rules that incorporate knowledge only they have about their own business. The explanation layer transforms Radar from a black box into something merchants can actively collaborate with.
Building a better model is half the challenge. Deploying it safely at Stripe’s scale is the other half, and it involves two hard engineering problems.
The first is real-time feature computation. Every feature the model uses during training must also be computable in production, because Radar needs to score every incoming payment as part of the Stripe API flow. For a feature like “the two most frequent IP addresses previously used with this card,” Stripe maintains an up-to-date state on every card ever seen on the network, and fetching or updating that state has to be fast. Stripe’s ML infrastructure team built systems that let engineers define features declaratively, with current values made available automatically in production at low latency.
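As a rough illustration of the state such a feature requires, here is an in-process sketch. Stripe’s real version is a low-latency distributed store updated on every payment, not a Python dict.

```python
from collections import Counter, defaultdict

# State behind a feature like "the two most frequent IP addresses
# previously used with this card"; structure and names are illustrative.
ip_counts: dict[str, Counter] = defaultdict(Counter)

def record_payment(card_id: str, ip: str) -> None:
    ip_counts[card_id][ip] += 1  # cheap incremental update on every payment

def top_two_ips(card_id: str) -> list[str]:
    return [ip for ip, _ in ip_counts[card_id].most_common(2)]

record_payment("card_123", "203.0.113.7")
record_payment("card_123", "203.0.113.7")
record_payment("card_123", "198.51.100.2")
print(top_two_ips("card_123"))  # ['203.0.113.7', '198.51.100.2']
```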
The second is ensuring that model improvements hold across the entire user base, all the way down to individual merchants. A model that performs better on aggregate metrics might still cause a spike in block rate for smaller businesses, which would be disruptive for those merchants and their customers. Before releasing any model, Stripe measures the change it would cause to the false positive rate, block rate, and authorization rate on both an aggregate and per-merchant basis. If a model would cause undesirable shifts for certain users, they adjust it for those segments before release. They also compare score distributions between old and new models, aiming to keep the proportion of transactions above each merchant’s blocking threshold stable.
Fraud patterns shift constantly, which means even a well-performing model degrades over time, a phenomenon called model drift. Stripe found that retraining the same model on more recent data, with identical features and architecture, improves recall by up to half a percentage point per month. That is a big gain from simply keeping the data fresh. By investing in automated training, tuning, and evaluation tooling, Stripe tripled their model release cadence. They continuously update performance dashboards after training but before release, so engineers can spot stale model candidates and proactively retrain them.
The fraud landscape itself keeps evolving. Patterns have shifted from primarily stolen credit card fraud to a growing mix of traditional card fraud and high-velocity card testing attacks, where automated scripts try large numbers of stolen card numbers against a merchant’s checkout flow. Stripe’s deployment infrastructure is built to support this kind of rapid adaptation.
Stripe also faces a subtle measurement challenge in production. Transactions that the model blocks have unknown true outcomes because the payment was never completed. Computing a full production precision-recall curve requires counterfactual analysis, meaning statistical methods that estimate what would have happened to payments Radar blocked. Stripe has developed proprietary techniques for this over the years.
Radar is a very different product from what it was at launch.
The models, the data pipelines, the explainability tools, and the way Stripe communicates fraud decisions to merchants have all been rebuilt. Fraud patterns have changed considerably in that time as well.
However, the core goal of the Radar team remains the same.
They are still working to create an environment where businesses and customers can transact with confidence, still optimizing that brief moment customers barely register, the instant between clicking “purchase” and seeing the transaction confirmed. Every architectural choice, every feature, every deployment safeguard exists to make that 100-millisecond window as accurate, fair, and trustworthy as possible.
2026-04-27 23:30:57
Most AI agents don’t fail because the model is bad. They fail because the model doesn’t have the proper infrastructure to reason well.
Simba Khadder, Head of Engineering at Redis, lays out a four-pillar framework for building context systems that hold up in production, plus an architectural self-audit checklist you can run against your stack today.
Search “shoes for pregnant women” on Amazon, and the best results you get might be slip-resistant shoes, even though the word “pregnant” appears nowhere in those product listings.
In other words, there is zero keyword overlap between the query and the product. The search engine has to reason that pregnant women need stability, that stability means slip-resistance, and that slip-resistant shoes are the right match.
Traditional recommendation systems match text to text and purchase history to purchase history. They handle keyword overlap quite well. However, when a shopper’s intent requires a reasoning step that lives entirely in human common sense, those systems hit a wall.
Amazon’s search team recognized this blind spot and built a commonsense knowledge graph called COSMO that teaches the recommendation engine to think the way a human shopper would.
In this article, we will look at how COSMO works and the challenges the engineering team faced.
Disclaimer: This post is based on publicly shared details from the Amazon Engineering Team. Please comment if you notice any inaccuracies.
Amazon already operates large-scale knowledge graphs that store factual product attributes like brand, color, material, and category. These graphs power a lot of what works well in product search today. However, they mainly try to encode what a product is, and they don’t explain why a human would want it.
This is the semantic gap problem.
For example, a query like “winter clothes” carries an implicit intent around warmth. The product catalog for a long-sleeve puffer coat describes its material, size options, and sleeve length, but it may say nothing about warmth directly. The gap between what the customer typed and what the product listing says requires a reasoning step that factual knowledge graphs were never designed to handle.
Amazon’s team surveyed the landscape of existing solutions.
Alibaba built AliCoCo (163K nodes, 91 relations) and AliCG (5M nodes), both extracted from search logs. These capture product concepts, but they stay focused on product attributes and categories, skipping user intent entirely.
General commonsense knowledge bases like ConceptNet (8M nodes, 21M edges) cover everyday reasoning but are built for general purposes, with little grounding in shopping behavior.
Amazon’s own earlier effort, FolkScope, demonstrated that commonsense knowledge could be extracted from shopping data, but it covered only 2 product categories and only co-purchase behavior.
The gap was clear. Though factual product knowledge and general commonsense knowledge existed, structured knowledge about why people buy things at e-commerce scale was missing.
The intuition behind Amazon’s approach was simple. Large language models encode enormous amounts of world knowledge in their parameters. Taking our earlier example, if you ask an LLM why a customer who searched “winter coat” bought a long-sleeve puffer coat, it can reason that puffer coats provide warmth, and warmth is what the customer wanted.
The team fed millions of user behavior pairs into OPT-175B and OPT-30B, large language models hosted internally on 16 A100 GPUs. The choice of OPT over GPT-4 was driven by a hard constraint around data privacy. Customer behavior data, meaning which queries led to which purchases, could only be processed on Amazon’s own infrastructure.
Two types of behavior data went into the system.
Query-purchase pairs capture the connection between a search query and the product a customer ultimately bought.
Co-purchase pairs capture products bought together in the same shopping session.
Across 18 product categories, the team sampled 3.14 million co-purchase pairs and 1.87 million query-purchase pairs.
The sampling strategy was itself a design decision.
For products, Amazon covered popular browse node categories and selected top-tier products with high interaction volume, also using product type labels (more than a thousand classes like “umbrella” or “chair”) for finer-grained selection.
For co-purchase pairs, the team cross-checked product types to remove random co-purchases and filtered out products that co-occurred with too many different product types (a signal of noise rather than intent).
For search-buy pairs, thresholds on both purchase rate and click rate determined which queries and products entered the sample.
Crucially, an in-house query specificity service helped prioritize broad or ambiguous queries, because those are exactly where the semantic gap is largest and commonsense knowledge adds the most value.
Prompt design mattered too. Rather than using simple text continuation, Amazon formatted each behavior pair as a question-answering task and instructed the LLM to generate a numbered list of candidates rather than a single response.
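A sketch of what such a prompt might look like. The exact production wording is not public, so the template below is an assumption.

```python
def qa_prompt(query: str, product_title: str, n: int = 5) -> str:
    """Format a query-purchase pair as a question-answering task.

    The wording is an illustrative guess at the style Amazon describes,
    not the actual production prompt.
    """
    return (
        f"A customer searched for '{query}' and then bought '{product_title}'.\n"
        f"Question: Why would this search lead to this purchase?\n"
        f"Give {n} numbered candidate explanations:\n1."
    )

print(qa_prompt("winter coat", "long-sleeve puffer coat"))
```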
The LLM generated millions of candidate explanations. However, only 35% of search-buy explanations met Amazon’s quality bar for typicality, meaning they were representative of genuine shopping intent. For co-purchase explanations, that number dropped to 9%. The rest were filler. The LLM produced circular rationales like “customers bought them together because they like them,” or trivially obvious statements like “customers bought an Apple Watch because it is a type of watch.”
The 9% vs. 35% gap reveals something about how LLMs reason. Explaining why a query led to a purchase is relatively constrained because the query provides clear context about intent. But explaining why two products were bought together requires identifying a shared reason across two different items, and LLMs tend to default to generic explanations for one item rather than reasoning about the pair.
Amazon also needed a way to categorize the relationships that the LLM was generating. The team started with 4 broad seed relations (usedFor, capableOf, isA, cause) that prior work had shown produce diverse outputs. From there, they mined finer-grained relation types directly from the LLM’s generated text by looking for recurring predicate patterns.
The most common pattern was “the product is capable of being used [preposition],” where different prepositions mapped to different semantic relationships. This data-driven process produced 15 relation types that capture distinct ways humans reason about products. These include used_for_function (“dry face”), used_for_event (“walk the dog”), used_for_audience (“daycare worker”), used_in_location (“bedroom”), used_in_body (“sensitive skin”), used_with (complementary products like “surface cover”), and person-centric relations like xIs_a (“pregnant women”) and xWant (“play tennis”). The ontology was shaped by what the LLM actually generated, then canonicalized and structured by Amazon’s researchers, rather than being designed top-down by a team of knowledge engineers.
The LLM produced a mountain of hypotheses, which were mostly noise. Amazon’s solution was a multi-stage refinement pipeline, where each stage catches a different type of failure.
Coarse-grained filtering tackled the most obvious problems first. Rule-based filters removed incomplete sentences by measuring sentence quality with a language model (GPT-2) and tuning a threshold. Generations that exactly matched the query text, the product type, or the product title (or fell within a small edit distance) were discarded. For generic statements like “used for the same reason” or “used with clothes,” Amazon identified these by combining frequency and entropy, since generic explanations tend to co-occur with many different products rather than specific ones.
Similarity filtering handled a subtler problem. Some LLM outputs looked different from the input on the surface but were semantically just paraphrases of the original query or product description.
Amazon used an in-house language model, pre-trained on e-commerce text including queries and product information, to compute embeddings for the generated knowledge, the query, and the product. When the vector similarity (measured by cosine distance) between the generated knowledge and the original context was too high, the candidate was filtered out. The team found that filtered generations were essentially syntactic transformations of the original input, rearranging the same meaning in slightly different words.
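A minimal sketch of that filter, substituting a public sentence-embedding model for Amazon’s in-house one; the 0.85 cutoff is an illustrative guess, not Amazon’s threshold.

```python
from sentence_transformers import SentenceTransformer, util

# Stand-in for Amazon's in-house e-commerce language model.
model = SentenceTransformer("all-MiniLM-L6-v2")
MAX_SIMILARITY = 0.85  # assumed cutoff; too similar means "just a paraphrase"

def keep_candidate(generated: str, context: str) -> bool:
    """Drop LLM output that merely restates the query/product context."""
    emb = model.encode([generated, context], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() < MAX_SIMILARITY

print(keep_candidate("provides warmth in cold weather", "winter coat"))  # likely True
print(keep_candidate("a coat for the winter season", "winter coat"))     # likely False
```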
Human-in-the-loop annotation came next. Amazon sampled 30,000 knowledge candidates for human review, with 15,000 from co-purchase behavior and 15,000 from search-buy behavior spread across 18 categories. Rather than picking candidates uniformly, the team used a weighted formula that combined the frequency of a piece of generated knowledge with the popularity of the associated product or query. Popular products produce common knowledge, so the weighting pushed toward diverse, less obvious knowledge that the classifier would later need to generalize.
Annotators evaluated each candidate on two dimensions:
Plausibility measures whether the posited relationship is reasonable.
Typicality measures whether the knowledge is representative of genuine shopping behavior.
As a concrete example, the more typical reason people buy Apple Watches is that they are smartwatches, rather than that they tell the time. Both statements are plausible, but only the first is typical.
To reduce cognitive burden and disagreement among annotators, Amazon decomposed these assessments into five yes/no questions covering completeness, relevance, informativeness, plausibility, and typicality. Two annotators labeled each question independently, with a third resolving disagreements. A pilot study of 2,000 examples showed this decomposition significantly reduced the disagreement rate, and internal auditing of 5% of all annotations showed over 90% accuracy. Due to data privacy requirements, Amazon employed a professional data annotation vendor company, followed by a strict internal auditing process.
Classifier generalization was the final step. Amazon fine-tuned DeBERTa-large (a high-performing language model for classification tasks) and an in-house language model on the 30,000 annotated samples to predict plausibility and typicality scores for all remaining candidates. Only candidates scoring above a 0.5 plausibility threshold survived.
The output of this pipeline is a set of structured knowledge triples. A triple connects two entities through a defined relationship. For example, the triple <co-purchase of camera case and screen protector, capableOf, protecting camera> captures the commonsense reasoning that these two products are bought together because they both serve the purpose of protecting a camera. Assembled, these triples form a knowledge graph of 6.3 million nodes and 29 million edges spanning 18 product categories. From 30,000 human judgments to 29 million edges.
The knowledge graph captures pre-computed commonsense relationships, but Amazon’s search engine encounters new queries and products constantly. Running the full pipeline (OPT-175B generation followed by classifier scoring) for every new behavior pair would be prohibitively expensive in production.
Amazon’s solution was instruction tuning.
The team used their 30,000 annotated samples to create instruction data and fine-tuned LLaMA 7B and 13B models. These base models offered the best balance between generation quality and inference cost for production serving, with far fewer parameters than OPT-175B while still producing high-quality outputs when trained on domain-specific data. The resulting model, COSMO-LM, was trained across 18 product domains, 15 relation types, and 5 distinct tasks.
Beyond commonsense generation, those tasks included plausibility prediction, typicality prediction, search relevance prediction, and co-purchase prediction. The multi-task training means COSMO-LM can both generate knowledge and evaluate its own output quality, effectively collapsing the “big LLM plus classifier” stack into a single, smaller model.
To make the model robust to different input formats, Amazon varied the templates during training. The same query-product pair might be prefixed with “search query,” “user input,” or “user searched” across different training examples. This prevents COSMO-LM from becoming brittle to prompt phrasing.
The result is two complementary artifacts in production. The static knowledge graph (29 million pre-computed edges) handles known product relationships. COSMO-LM generates fresh commonsense knowledge on the fly for new or unseen query-product pairs, with dramatically lower inference cost than the original OPT-175B pipeline. A demo of the system shows COSMO-LM generating knowledge for a query like “how to decorate a home,” producing a list of product types (wall art, decorative signage, sticker decal, decorative pillow cover, artificial plant, rug, home mirror, lamp), each accompanied by a commonsense explanation of its role in home decoration.
Having a model that generates useful knowledge is one challenge. Serving it at Amazon’s scale with acceptable latency is another.
Amazon’s deployment architecture centers on two components:
A Feature Store transforms COSMO-LM’s raw text outputs into structured features that downstream applications can consume directly. These features include product key-value pairs, semantic subcategory representations, and intent signals.
An Asynchronous Cache Store manages the serving layer through a two-tiered caching strategy.
The first tier pre-loads responses for queries that are frequent across the year, covering the majority of traffic. The second tier batch-processes daily requests for newer or less common queries and updates the cache.
When a user query arrives, the system checks the cache first. Hits get immediate responses. Misses go to batch processing, and the cache updates for future identical queries.
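In pseudocode terms, the serving flow is a cache check with a deferred fill. A simplified sketch, with illustrative names and stores:

```python
# Tier 1: preloaded responses for frequent queries (the hot path).
hot_cache = {"winter coat": ["warmth", "cold-weather outerwear"]}
# Tier 2: misses queued for the daily batch job.
batch_queue = []

def serve(query: str):
    if query in hot_cache:
        return hot_cache[query]   # cache hit: immediate response
    batch_queue.append(query)     # cache miss: defer to batch processing
    return None                   # caller falls back to non-COSMO ranking

def run_daily_batch(generate_knowledge):
    while batch_queue:
        query = batch_queue.pop()
        hot_cache[query] = generate_knowledge(query)  # future identical queries hit
```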
SageMaker manages model deployment and refresh, ingesting customer behavior session logs daily. The structured data from the cache feeds three downstream systems simultaneously: Search Relevance, Recommendation, and Navigation.
This architecture meets Amazon’s strict search latency requirements while keeping storage costs comparable to real-time serving for most traffic. But it comes with a tradeoff. COSMO updates daily, which means it cannot incorporate real-time events like flash sales that fluctuate within hours. Amazon explicitly acknowledges this limitation and identifies it as an area for future development.
Search relevance saw the most dramatic offline improvements.
On the public ESCI dataset from KDD Cup 2022, a cross-encoder (a model architecture that jointly processes query and product features together, rather than encoding them separately) augmented with COSMO triples achieved 73.48% Macro F1 and 90.78% Micro F1 with trainable encoders.
For context, Macro F1 averages performance across all product categories equally (so rare categories matter just as much as common ones), while Micro F1 measures overall accuracy regardless of category. That cross-encoder result surpassed the top-1 ensemble model on the KDD Cup leaderboard. With frozen encoders, where the only difference was whether COSMO triples were included as input, the improvement was 60% on Macro F1.
On private datasets spanning four markets (US, Canada, UK, and India), the COSMO-enhanced model consistently outperformed baselines in every locale, with the strongest gains in the India market, where the gap between query language and product catalog language tends to be larger.
Session-based recommendation benefited from COSMO knowledge as well.
Amazon built COSMO-GNN, extending a graph neural network model (a model that learns relationships between items by treating shopping sessions as connected graphs) for session-based recommendations with COSMO-generated intent knowledge. It outperformed all competitive baselines on Hits@10 and NDCG@10 in both clothing and electronics categories.
The improvement was larger for electronics (5.82% vs. 4.05% on Hits@10), where users revise their search queries more frequently (2.47 unique queries per session versus 1.36 for clothing). This pattern makes sense. When users are actively reformulating queries to narrow down what they want, commonsense knowledge about why they are searching becomes especially valuable.
Search navigation is where COSMO reached production and generated real business impact. COSMO powers a multi-turn navigation system that organizes intent hierarchically. A search for “camping” branches into fine-grained intents like “winter camping,” “beach camping,” or “lakeside camping.” These connect to product types like “air mattress” or “winter boots,” which are then further refined by attributes like “4 person.”
This hierarchical organization of knowledge allows the system to mirror a natural discovery process, helping customers progressively narrow their search through multiple rounds of refinement rather than requiring them to formulate the perfect query upfront.
Amazon ran A/B tests over several months, targeting approximately 10% of U.S. traffic. The results were significant. A 0.7% relative increase in product sales within the test segment translated to hundreds of millions of dollars in additional annual revenue.
An 8% increase in navigation engagement was observed in the same segment. These outcomes came from a single, relatively small feature on the search page with limited visibility. Amazon has projected that extending COSMO-LM across all traffic for navigation alone could produce revenue gains in the billions.
COSMO is Amazon’s first production system that uses instruction-tuned large language models to construct a knowledge graph and serve it to online applications. It marks a shift from factual product knowledge graphs toward intent-based commonsense knowledge graphs.
The most important number from this entire project may be the leverage ratio. Thirty thousand human annotations became 29 million knowledge graph edges across 18 product categories. That ratio was possible because Amazon invested heavily in sampling strategy, annotation design, classifier training, and instruction tuning rather than in brute-force labeling.
The system’s acknowledged limitations are worth keeping in mind as well.
COSMO’s daily refresh cycle means it cannot keep up with real-time dynamics. Its aggressive filtering (only candidates above 0.5 plausibility survive) means the knowledge graph has gaps in coverage, especially for long-tail products and unusual queries. These are genuine tradeoffs, and Amazon chose precision over recall because unreliable commonsense knowledge in production would be worse than missing knowledge.
2026-04-25 23:30:59
If slow QA processes are bottlenecking you or your software engineering team and you’re releasing slower because of it, you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guarantee
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
This week’s system design refresher:
Coding Agents Explained: How Claude Code, Codex & Cursor Actually Work (Youtube video)
Data Warehouse vs Data Lake vs Data Mesh
API Concepts Every Software Engineer Should Know
Polling vs Long Polling vs Webhooks vs SSE
SLA vs SLO vs SLI
Build with Claude Code — Course Direction Survey
Storing data is the easy part. Deciding where and how to organize it is the real challenge.
A data warehouse is the traditional approach. It cleans and structures data before storing it. Queries run fast, and reports stay consistent. But adding a new data source takes effort because everything has to fit the schema first.
A data lake takes the opposite approach. It stores everything raw, like databases, logs, images, and video. Process it when you need it. The flexibility is great, but if rules around naming, formatting, and ownership are not properly set, you end up with duplicate, outdated, and undocumented data that is hard to manage.
Data mesh shifts data ownership from a central team to individual departments. For example, sales publishes sales data, and finance publishes finance data. Shared standards keep things compatible across teams.
It works well in larger organizations. But it requires every team to have the right people and processes to manage their data quality, documentation, and access, which is a challenge.
In practice, many companies use more than one approach. They'll use a warehouse for dashboards and reporting and a lake for machine learning workloads, then start applying mesh principles as teams scale.
Most engineers use APIs every day. Sending a request and reading JSON is one thing. Designing an API that other people can rely on is where things get complicated.
A lot of problems begin with basic HTTP details that seem small at first. Methods, status codes, request formats, and response structure can make an API feel clear and predictable, or confusing and inconsistent.
Then there are the bigger design choices. REST, GraphQL, gRPC, webhooks, and WebSockets each make sense in different situations. The challenge is knowing what actually fits the system and the use case.
A lot of API problems also come from design decisions that do not get enough attention early on. Naming, pagination, versioning, error responses, and backward compatibility often decide whether an API is easy to work with or frustrating to maintain.
Security is another area where weak decisions can cause real problems. API keys, OAuth, JWTs, scopes, and permissions are easy to mention. Getting them right is harder, and mistakes here can be costly.
Reliability matters too. Timeouts, retries, idempotency, rate limits, and caching are often easy to ignore until the system is under pressure.
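For instance, retries are only safe when paired with idempotency. A hedged sketch of the client-side pattern; the header name and backoff policy vary by API:

```python
import time
import uuid
import requests

def post_with_retries(url: str, payload: dict, attempts: int = 3) -> requests.Response:
    """Retry a POST safely: reusing one idempotency key lets the server deduplicate."""
    headers = {"Idempotency-Key": str(uuid.uuid4())}  # same key on every retry
    for attempt in range(attempts):
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=5)
            if resp.status_code < 500:
                return resp          # success, or a client error not worth retrying
        except requests.RequestException:
            pass                     # network failure: retry
        time.sleep(2 ** attempt)     # exponential backoff: 1s, 2s, 4s
    raise RuntimeError(f"gave up after {attempts} attempts")
```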
And once an API starts growing, the supporting work matters too. Clear documentation, solid specs, observability, and contract testing make it much easier for teams to trust the API and use it without guessing how it works.
Over to you: What’s the most overlooked API concept in your experience?
Four ways to get updates from a server. Each one makes a different tradeoff between simplicity, efficiency, and real-time delivery.
Here's how they compare:
Polling: The client sends a request every few seconds asking "anything new?" The server responds immediately, whether or not there's new data. Most of those requests come back empty, wasting client and server resources. For use cases like an order status page where a small delay is acceptable, polling is the simplest option to implement.
Long Polling: The client sends a request, and the server keeps the HTTP connection open until new data is available or a timeout occurs. This means fewer empty responses compared to regular polling. Some chat applications used this pattern to deliver messages closer to real-time communication.
Server-Sent Events (SSE): The client opens a persistent HTTP connection, and the server streams events through it as they're generated. It is one-way, lightweight, and built on plain HTTP. Many AI responses that appear token by token are delivered through SSE, streaming each chunk over a single open connection.
Webhooks: Instead of the client asking for updates, the service sends an HTTP POST to a pre-registered callback URL whenever a specific event occurs. Stripe uses this for payment confirmations. GitHub uses it for push events. The client never polls or holds a connection open; it just waits for the server to call.
Many systems don't rely on a single pattern. You may use polling for order status, SSE for streaming AI responses, and webhooks for payment confirmations.
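The first two patterns differ only in how long the server holds the request. A client-side sketch against a hypothetical status endpoint:

```python
import time
import requests

URL = "https://api.example.com/orders/42/status"  # hypothetical endpoint

def poll():
    while True:
        update = requests.get(URL, timeout=5).json()  # returns immediately, often empty
        if update.get("new"):
            return update
        time.sleep(5)                                 # fixed interval: simple but laggy

def long_poll():
    while True:
        # Assumed server contract: hold the request open until data arrives
        # or ~30 seconds elapse, so most responses carry a real update.
        update = requests.get(URL, params={"wait": 30}, timeout=35).json()
        if update.get("new"):
            return update
```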
These three terms are related, but they mean different things. Knowing the difference helps you define what to measure, aim for, and promise your customers.
Here's how they actually connect:
SLI (Service Level Indicator): This is the metric you're measuring. For a login service, it could be the ratio of successful login requests to total valid requests. It tells you how your service is performing right now.
SLO (Service Level Objective): You take that SLI and define a target around it. Something like "login availability should stay above 99.9% over a rolling 28-day window." When you're missing your SLO, it’s a signal to find out what's failing before customers notice.
SLA (Service Level Agreement): This is what you promise your customers in a contract. It's usually set lower than the SLO, say 99.5% monthly availability. If you breach it, you owe service credits.
If your SLO and SLA are both set to 99.9%, then the moment your availability drops below 99.9%, you've already breached the agreement.
The SLI tells you where you stand. The SLO tells you where you should be. The SLA tells your customers what they can expect.
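The relationship is easy to express in code. A toy calculation with made-up request counts:

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: the metric itself, here a simple success ratio."""
    return successful / total if total else 1.0

SLO = 0.999  # internal target over a rolling 28-day window
SLA = 0.995  # contractual promise, deliberately looser than the SLO

sli = availability_sli(successful=2_995_000, total=3_000_000)
error_budget_used = (1 - sli) / (1 - SLO)  # above 1.0 means the budget is exhausted

print(f"SLI={sli:.4%}  SLO breached={sli < SLO}  SLA breached={sli < SLA}")
print(f"Error budget consumed: {error_budget_used:.0%}")
```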
Over to you: How do you decide what the right SLO target is when you're launching a new service?
We’re building a new course, Build with Claude Code, and we’d love your input before we finalize it.
If you’re an engineer or engineering leader, we’d appreciate 3 minutes of your time. Your answers will directly shape what we cover. Thank you so much!
2026-04-23 23:30:59
Every database has to solve the same basic problem.
Data lives on disk, and accessing disk is slow. Every read and every write eventually has to reach the disk, and how a database organizes data on that disk determines everything about its performance.
Over decades of research, two dominant approaches have emerged.
B-Trees keep data sorted on disk so reads are fast, but pay for it on every write.
LSM Trees buffer writes in memory and flush them to disk in bulk, making writes cheap but reads more expensive.
Neither approach is strictly better. They make different tradeoffs, and understanding the tradeoff between them is one of the most useful mental models in system design.
In this article, we will look at B-Trees and LSM trees in detail, along with the trade-offs associated with each of them.
2026-04-21 23:30:21
Skip the guesswork with this MongoDB cheatsheet from Datadog. You’ll get a quick, practical reference for monitoring performance and diagnosing issues in real systems.
Use it to:
Track key metrics like latency, throughput, and resource utilization
Monitor MongoDB and Atlas health with the right signals
Set up dashboards to quickly identify bottlenecks and performance issues
When DoorDash needed to launch Dasher onboarding in Puerto Rico, it took about a week. That wasn’t because they cut corners or threw a huge team at it. It took a week because almost no new code was needed. The steps that Puerto Rican Dashers would go through (identity checks, data collection, compliance validation) already existed as independent modules, battle-tested by thousands of Dashers in other countries. The team assembled them into a new workflow, made one minor customization, and shipped.
And it wasn’t just Puerto Rico. Australia’s migration was completed in under a month. Canada took two weeks, and New Zealand required almost no new development at all.
This speed came from an architectural decision the DoorDash engineering team made when they looked at their growing mess of country-specific if/else statements and decided to stop patching.
They rebuilt their onboarding system around a simple idea. Decompose the process into self-contained modules with standardized interfaces, then connect them through a deliberately simple orchestration layer.
In this article, we will look at how this architecture was designed and the challenges they faced.
Disclaimer: This post is based on publicly shared details from the DoorDash Engineering Team. Please comment if you notice any inaccuracies.
DoorDash’s Dasher onboarding started simple, with just a few steps serving a single country through straightforward logic. Then the company expanded internationally, and every new market meant new branches in the code.
At one point, three API versions ended up coexisting. V3, the newest, continued calling V2 handlers for backward compatibility and also continued writing to V2 database tables. The system literally couldn’t avoid its own history. All developers have probably seen something like this before, where nobody can fully explain which version handles what, and removing any piece feels dangerous because something else might depend on it.
The step sequences themselves were hard-coded, with country-specific logic spread throughout. Business logic started immediately after receiving a request, branching into deep if/else chains based on country, step type, or prior state. Adding a new market meant carefully threading new conditions through this maze.
Vendor integrations followed no consistent pattern either. Some onboarding steps used internal services, which called third-party vendors. Other steps called vendors directly. This inconsistent layering made testing and debugging unpredictable.
And then there was also the state management problem. Onboarding progress was tracked across multiple separate database tables. Flags like validation_complete = true or documents_uploaded = false lived in different systems. If a user dropped off mid-onboarding and came back later, reconstructing where they actually stood required querying several systems and inferring their state from scattered flags, which frequently led to errors.
The practical cost was that adding a new country took months of engineering effort across APIs, tables, and code branches. Every change carried the risk of breaking something in a market on the other side of the world.
DoorDash’s rebuild was organized around three distinct layers, each with a single responsibility. It’s easy to blur these layers together, but the separation between them is where the real power lives.
The Orchestrator sits at the top. It’s a lightweight routing layer that looks at context (which country and which market type) and decides which workflow definition should handle the request. That’s all it does. It doesn’t execute steps or manage state. It doesn’t contain business logic either. The main insight here is that the smartest thing about the orchestrator is how little it does. Developers tend to imagine the central controller as the brain of the system. However, in this architecture, the brain is distributed, and the orchestrator is just a traffic cop.
Workflow Definitions are the second layer. A workflow is simply an ordered list of steps for a specific market. The US workflow might look like Data Collection, followed by Identity Verification, followed by Compliance Check, followed by Additional Validation. Australia’s workflow skips one step and reorders another. Puerto Rico adds a regional customization. Each workflow is defined as a class with a list of step references, making it easy to see exactly what each market’s onboarding process looks like.
Think of it like a Lego set. Each brick has a standardized shape, studs on top, tubes on the bottom, and that standard interface lets you build anything. A workflow definition is like building instructions for a specific model.
Step Modules are the third layer, and this is where the actual work happens. Each step (data collection, identity verification, risk and compliance checking, document verification) is implemented as an independent and self-contained module. A step knows how to collect its data, validate it, call its external vendors, handle retries and failures, and report success or failure. What it doesn’t know is which workflow it belongs to, or what step comes before or after it. This isolation is what makes reuse possible.
The mechanism enabling this plug-and-play behavior is the interface contract. Every step implements the same standardized interface, with a method to process the step, a method to check if it’s complete, and a method to return its response data. As long as a new step honors this contract, it can slot into any workflow without the workflow knowing or caring about its internals.
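A minimal Python sketch of what such a contract and a workflow built from it might look like; the class and method names are illustrative assumptions, not DoorDash’s actual code.

```python
from abc import ABC, abstractmethod

class Step(ABC):
    """The standardized interface every step module implements."""

    @abstractmethod
    def process(self, ctx: dict) -> None:
        """Do the step's work: collect data, call vendors, handle retries."""

    @abstractmethod
    def is_step_completed(self, status_map: dict) -> bool:
        """Each step defines its own completion logic."""

    @abstractmethod
    def response_data(self, status_map: dict) -> dict:
        """Return this step's output for the caller."""

class IdentityVerification(Step):
    def process(self, ctx):
        # A real implementation would call a vendor; here we just record success.
        ctx["status_map"]["identity"] = {"status": "DONE"}

    def is_step_completed(self, status_map):
        return status_map.get("identity", {}).get("status") == "DONE"

    def response_data(self, status_map):
        return status_map.get("identity", {})

# A workflow definition is just an ordered list of step references.
US_WORKFLOW: list[Step] = [IdentityVerification()]  # ...plus data collection, compliance, etc.

def run(workflow: list[Step], ctx: dict) -> None:
    for step in workflow:
        if not step.is_step_completed(ctx["status_map"]):
            step.process(ctx)
```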
This contract also enables team autonomy. The identity verification step can be owned entirely by the security team. Payment setup can belong to the finance team. Each team iterates on their step independently, as long as they maintain the shared interface. In a way, the architecture mirrors the organizational structure, or more accurately, it lets the organizational structure work for the system instead of against it.
Two additional capabilities make the system even more flexible:
Composite steps group multiple granular steps into a single logical unit. One country might collect all personal information on a single screen. Another might split it across three screens. A composite step called “PersonalDetails” can wrap Profile, Additional Info, and Vehicle steps together, handling that variation without changing the individual step implementations underneath.
And steps can be dynamic and conditional. A Waitlist step might only appear in markets with specific supply conditions. The same step can even appear multiple times within a single workflow.
This flexibility goes beyond simple reordering and confirms that steps are truly stateless and workflow-agnostic.
The address collection step is the clearest proof that this works in practice. DoorDash built it once as a standalone module. When Australia needed address collection early in their flow for compliance checks, the team simply inserted the module before the compliance step in Australia’s workflow definition, without any special logic or branching. Canada later adopted the same step for validation and service-area mapping. It worked out of the box. The US team then experimented by enabling it in select regions, and again, with no new code.
This three-layer pattern isn’t specific to onboarding. Any multi-step process that varies across contexts (checkout flows, approval pipelines, content moderation queues) can be decomposed this way.
One important clarification here is that DoorDash’s step modules are not separate microservices. They are modules within a single service, which means the lesson here is about logical decomposition and interface design rather than strict deployment boundaries. Technically, we could apply this same pattern inside a monolith.
How does the system know where each applicant is in their journey?
Answering this question is essential to making modular steps work.
In the legacy system, this was a mess. Progress was tracked across multiple separate tables, each representing part of the workflow. Introducing a new onboarding step meant modifying several of these tables. Ensuring synchronization between them required close coordination across services, and it often broke down, leading to data mismatches and brittle integrations.
The new system introduced the status map, a single JSON object in the database where every step writes its own progress. It looks something like this:
```json
{
  "personal_info": { "status": "DONE", "metadata": { "name": "Jane" } },
  "address": { "status": "DONE", "metadata": { "address_id": "abc123" } },
  "validation": { "status": "IN_PROGRESS" },
  "compliance": { "status": "INIT" }
}
```

Each step is responsible for updating its own entry in the map. When a step starts, completes, fails, or gets skipped, it writes that transition directly to its entry. The workflow layer never writes to the status map. It just reads it.
Each step also exposes an isStepCompleted() method that defines its own completion logic based on the status map. One step might treat “SKIPPED” as a terminal state, while another might not. This flexibility lives at the step level, not the workflow level, which keeps the orchestration logic simple and stateless.
The practical benefit is immediate. A single query on the status map tells you exactly where any applicant stands in their onboarding journey. Partial updates are handled through atomic JSON key merges, meaning that when one step updates its status, it only touches its own entry without overwriting the rest of the map.
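The merge semantics look roughly like this. In production the merge happens as a single atomic JSON operation in the database, not an in-memory dict update; this sketch only illustrates the "touch one key, leave the rest intact" behavior.

```python
def merge_step_status(status_map: dict, step: str, patch: dict) -> dict:
    """Merge one step's patch into its own entry without touching other steps."""
    return {**status_map, step: {**status_map.get(step, {}), **patch}}

status_map = {"address": {"status": "DONE"}, "validation": {"status": "IN_PROGRESS"}}
status_map = merge_step_status(status_map, "validation", {"status": "DONE"})
# {'address': {'status': 'DONE'}, 'validation': {'status': 'DONE'}}
```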
The architecture is only half the story. Getting there without breaking a running system is where the real engineering difficulty lives.
DoorDash didn’t flip a switch. They designed the new platform to coexist with the existing V2 and V3 APIs, running old and new systems side by side. Applicants who had partially completed onboarding under the legacy system needed to continue seamlessly, so the team built temporary synchronization mechanisms that mirrored progress between systems until the migration was complete. This parallel operation was itself a temporary technical debt, built intentionally to be thrown away.
Other major initiatives were underway during the rebuild, sometimes conflicting with the new onboarding design. Rather than treating these as blockers, the team collaborated across those efforts and adapted the architecture where necessary.
The migration started with the US in January 2025, their largest and most complex market, as the proving ground. Then the compounding payoff kicked in. Australia was completed in under a month, needing only two localized steps. Canada followed in two weeks with a single new module. Puerto Rico took a week with a minor customization. New Zealand required almost no new development.
Every migration launched with zero regressions, no user-facing incidents, no onboarding downtime, and no unexpected drop-offs in completion rates. Each rollout got faster because more modules had already been battle-tested by thousands of Dashers in prior markets.
The architecture has also proven its value beyond adding countries. DoorDash is integrating its onboarding with another large, independently developed ecosystem that has its own mature onboarding flow. The modular design allowed them to build integration-specific workflows while reusing much of the existing logic, something that would have been extremely painful with the legacy system.
The tradeoffs are real, though. Modularity adds coordination overhead. For a single-market startup, this architecture can be considered overkill. A monolithic onboarding flow is completely fine until you hit the inflection point where country-specific branching becomes more expensive than decomposition.
Reusable modules work well when the underlying concept generalizes across markets. For example, addresses are conceptually similar everywhere, which is why the address step was reused so cleanly. However, compliance requirements can be fundamentally different between regulatory regimes.
The boundary between the platform team and domain teams also requires ongoing negotiation. DoorDash addresses this through published platform principles, versioned interface contracts, and joint KPIs that create shared accountability. Domain expert teams own their business logic (fraud detection, compliance, payments) while the platform enforces consistency. This is a human coordination challenge that architecture alone doesn’t solve.
Looking ahead, DoorDash’s roadmap includes dynamic configuration loading to enable workflows to go live through config rather than code, step versioning to allow multiple iterations of a step to coexist during experiments or rollouts, and enhanced operational tooling to give non-engineering teams the ability to manage workflows directly.
That said, DoorDash deliberately kept workflows code-defined rather than jumping straight to config-driven. While config-driven systems are powerful, they introduce their own complexity. They can be harder to debug and harder to test.
Ultimately, what DoorDash built is a general pattern for any system that needs to support multiple variants of a multi-step process. The core idea is three layers (a thin orchestrator, composable workflows, and self-contained steps behind standardized interfaces) connected by a single shared state structure.
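Continuing the earlier sketch (and reusing its assumed OnboardingStep and StatusMap types), the three layers might reduce to something like this:

// Hypothetical sketch of the three-layer pattern, continuing the earlier
// example. Not DoorDash's actual code.
interface Workflow {
  // A composable workflow is just an ordered list of self-contained steps.
  steps: OnboardingStep[];
}

// The thin orchestrator: stateless, reads the shared status map,
// and returns the next incomplete step (or null when onboarding is done).
function nextStep(workflow: Workflow, map: StatusMap): OnboardingStep | null {
  return workflow.steps.find((step) => !step.isStepCompleted(map)) ?? null;
}

// A new market composes mostly existing, battle-tested modules,
// plus whatever market-specific steps it needs.
const canadaOnboarding: Workflow = {
  steps: [addressStep, complianceStep], // plus market-specific modules
};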
2026-04-20 23:30:47
npx workos@latest launches an AI agent, powered by Claude, that reads your project, detects your framework, and writes a complete auth integration into your codebase. No signup required. It creates an environment, populates your keys, and you claim your account later when you're ready.
But the CLI goes way beyond installation. WorkOS Skills make your coding agent a WorkOS expert. workos seed defines your environment as code. workos doctor finds and fixes misconfigurations. And once you're authenticated, your agent can manage users, orgs, and environments directly from the terminal. No more ClickOps.
GitHub built an AI agent that can fix documentation, write tests, and refactor code while you sleep. Then they designed their entire security architecture around the assumption that this agent might try to steal your API keys, spam your repository with garbage, and leak your secrets to the internet.
That might sound like paranoia, but it's the only responsible way to put a non-deterministic system inside your CI/CD pipeline.
GitHub Agentic Workflows let you plug AI agents into GitHub Actions so they can triage issues, generate pull requests, and handle routine maintenance without human supervision. The appeal is clear, but so is the risk. These agents consume untrusted inputs, make decisions at runtime, and can be manipulated through prompt injection, where carefully crafted text tricks the agent into doing things it wasn’t supposed to do.
In this article, we will look at how GitHub built a security architecture that assumes the agent is already compromised. However, to understand their solution, you first need to understand why the problem is harder than it looks.
Disclaimer: This post is based on publicly shared details from the GitHub Engineering Team. Please comment if you notice any inaccuracies.
CI/CD pipelines are built on a simple assumption. The developers define the steps, the system runs them, and every execution is predictable. All the components in a pipeline share a single trust domain, meaning they can all see the same secrets, access the same files, and talk to the same network. That shared environment is actually a feature for traditional automation. When every component is a deterministic script, sharing a trust domain makes everything composable and fast.
Agents break that assumption completely because they don’t follow a fixed script. They reason over repository state, consume inputs they weren’t specifically designed for, and make decisions at runtime. A traditional CI step either does exactly what you coded it to do or fails. An agent might do something you never anticipated, especially if it processes an input designed to manipulate it.
GitHub’s threat model for Agentic Workflows is blunt.
They assume the agent will try to read and write state that it shouldn’t, communicate over unintended channels, and abuse legitimate channels to perform unwanted actions. For example, a prompt-injected agent with access to shell commands can read configuration files, SSH keys, and Linux /proc state to discover credentials. It can scan workflow logs for tokens. Once it has those secrets, it can encode them into a public-facing GitHub object like an issue comment or pull request for an attacker to retrieve later. The agent isn’t actively malicious; it’s simply following injected instructions it couldn’t distinguish from legitimate ones.
In a standard GitHub Actions setup, everything runs in the same trust domain on top of a runner virtual machine. A rogue agent could interfere with MCP servers (the tools that extend what an agent can do), access authentication secrets stored in environment variables, and make network requests to arbitrary hosts. A single compromised component gets access to everything. The problem isn’t that Actions are insecure. It’s that agents change the assumptions that made a shared trust domain safe in the first place.
Agents can generate code. Getting it right for your system, team conventions, and past decisions is the hard part. You end up babysitting the agent and watching the token costs climb.
More MCPs, rules, and bigger context windows give agents access to information, but not understanding. The teams pulling ahead have a context engine to give agents only what they need for the task at hand.
Our April webinar filled up, so we are bringing it back! Join us live (FREE) on May 6 to see:
Where teams get stuck on the AI maturity curve and why common fixes fall short
How a context engine solves for quality, efficiency, and cost
Live demo: the same coding task with and without a context engine
GitHub Agentic Workflows use a layered security architecture with three distinct levels.
Each layer limits the impact of failures in the layer above it by enforcing its own security properties independently.
The substrate layer sits at the bottom. It’s built on a GitHub Actions runner VM and several Docker containers, including a set of trusted containers that mediate privileged operations. This layer provides isolation between components, controls system calls, and enforces kernel-level communication boundaries. These protections hold even if an untrusted component is fully compromised and executes arbitrary code within its container. The substrate doesn’t rely on the agent behaving correctly, and even arbitrary code execution inside the agent’s container hits a wall at this level.
The configuration layer sits on top of the substrate layer. This is where the system’s structure gets defined. It includes declarative artifacts and the toolchains that interpret them to set up which components are loaded, how they’re connected, what communication channels are permitted, and what privileges are assigned. The most important piece in this layer is the compiler. GitHub doesn’t just run your workflow definition as-is, but compiles it into a GitHub Action with explicit constraints on permissions, outputs, auditability, and network access. The configuration layer also controls which secrets go into which containers. Externally minted tokens like agent API keys and GitHub access tokens are loaded only into the specific containers that need them, never into the agent’s container.
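To give a flavor of what such compiled constraints could look like, here is a hypothetical shape. The field names are assumptions for illustration, not gh-aw's actual schema:

// Hypothetical shape of the constraints a compiled workflow might carry.
// Field names are illustrative assumptions, not GitHub's actual schema.
interface CompiledWorkflowConstraints {
  permissions: { contents: "read"; issues: "read" | "write" };
  allowedNetworkHosts: string[];          // enforced by the firewall container
  safeOutputs: {
    allowedTypes: Array<"issue" | "issue_comment" | "pull_request">;
    maxPerRun: Record<string, number>;    // e.g. { pull_request: 3 }
  };
  // Secrets are bound to specific trusted containers, never to the agent.
  secretBindings: Record<string, "mcp-gateway" | "api-proxy" | "firewall">;
}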
The planning layer sits on top. While the configuration layer dictates which components exist and how they communicate, the planning layer dictates which components are active over time. Its job is to create staged workflows with explicit data exchanges between stages. The safe outputs subsystem, which we’ll get to shortly, is the most important instance of this. It ensures the agent’s work gets reviewed before it affects anything real.
These layers are independent. If the planning layer fails, the configuration layer still enforces its constraints. If the configuration layer has a bug, the substrate layer still provides isolation. No single failure compromises the whole system.
From the beginning, GitHub wanted workflow agents to have zero access to secrets.
In a standard GitHub Actions setup, sensitive material like agent authentication tokens and MCP server API keys sits in environment variables and configuration files visible to all processes on the runner VM. That’s fine when everything sharing the environment is trusted. It’s dangerous with agents because they’re susceptible to prompt injection. An attacker can hide malicious instructions in a web page, a repository issue, or a pull request comment, and trick the agent into extracting and leaking whatever it can find.
GitHub’s solution is a specific container topology where secrets are physically unreachable from the agent.
The agent runs in its own Docker container on a private network. All internet access goes through a dedicated firewall container (gh-aw-firewall). MCP tools are accessed through a separate trusted container called the MCP gateway (gh-aw-mcpg), which launches MCP servers and holds exclusive access to MCP authentication material like the GitHub Personal Access Token. LLM API calls go through an isolated API proxy that holds the model’s authentication token (like the Codex token). The agent communicates with the firewall over HTTP, the firewall connects to the MCP gateway over HTTP, and the gateway connects to MCP servers over stdio.
See the diagram below:
At no point in this chain does the agent touch a secret. It can make LLM calls because the proxy adds authentication on its behalf. It can use MCP tools because the gateway authenticates on its behalf. But the agent itself is completely secretless.
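The key mechanic is credential injection at a trusted hop. Here is a minimal sketch of the idea, not GitHub's actual proxy; the upstream URL and environment variable name are assumptions:

// Minimal sketch of an auth-injecting proxy: the secret lives only in this
// process. The agent talks plain HTTP to it and never sees the token.
// Hypothetical, not GitHub's gh-aw implementation.
import http from "node:http";

const MODEL_API = "https://api.example-llm.com"; // assumed upstream
const TOKEN = process.env.MODEL_API_KEY!;        // present only in the proxy container

http.createServer(async (req, res) => {
  const body = await new Promise<Buffer>((resolve) => {
    const chunks: Buffer[] = [];
    req.on("data", (c) => chunks.push(c));
    req.on("end", () => resolve(Buffer.concat(chunks)));
  });

  // Forward the agent's request upstream, adding the credential on its behalf.
  const upstream = await fetch(`${MODEL_API}${req.url}`, {
    method: req.method,
    headers: { "content-type": "application/json", authorization: `Bearer ${TOKEN}` },
    body: req.method === "POST" ? body : undefined,
  });

  res.writeHead(upstream.status, { "content-type": "application/json" });
  res.end(Buffer.from(await upstream.arrayBuffer()));
}).listen(8080); // the agent's only permitted route to the model API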
There’s a real tradeoff here, and GitHub names it explicitly. Coding workloads need broad access to compilers, interpreters, scripts, and repository state. But you can’t just install all those tools inside the agent’s container without expanding the set of network destinations allowed through the firewall and duplicating a lot of the host setup.
Their resolution is elegant: they mount the entire VM host filesystem read-only at /host inside the agent’s container. Then they overlay sensitive paths (the ones containing secrets and configuration) with empty tmpfs layers, which are temporary in-memory filesystems that effectively hide whatever was at those paths. Finally, they launch the agent in a chroot jail rooted at /host. The result is that the agent can access every compiler, interpreter, and tool on the host machine while being unable to discover any secret material. The host-side setup stays intact. The agent’s writable and discoverable surface is constrained to only what it needs. If you work with containers, this pattern of mounting a host filesystem read-only and masking sensitive paths with tmpfs is applicable well beyond the agentic context.
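In Docker terms, the pattern might reduce to flags like these. The masked paths are illustrative assumptions, not GitHub's actual list:

// Sketch of the read-only-host + tmpfs-masking pattern as docker run flags.
// The sensitive paths here are assumptions for illustration only.
const SENSITIVE_PATHS = ["/host/home/runner/.ssh", "/host/etc/agent-secrets"];

const dockerArgs = [
  "run", "--rm",
  // Bind the whole host filesystem read-only under /host...
  "--mount", "type=bind,source=/,target=/host,readonly",
  // ...then shadow each sensitive path with an empty in-memory tmpfs.
  ...SENSITIVE_PATHS.flatMap((p) => ["--tmpfs", p]),
  "agent-image",
  // Inside the container, launch the agent in a chroot jail at /host, so
  // every host tool is visible but the masked paths read as empty.
  "chroot", "/host", "/usr/bin/agent",
];

console.log(["docker", ...dockerArgs].join(" "));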
GitHub isn’t the only one that arrived at this zero-secret principle. OpenAI’s Codex takes a different path to the same destination. In the Codex cloud, secrets are available only during the setup phase and are removed before the agent phase starts, and internet access is disabled by default during execution. GitHub uses proxies and gateways. OpenAI uses a two-phase model. The fact that both teams independently converged on “agents should never touch secrets” validates the principle.
Prompt-injected agents can cause real harm even without access to secrets. A rogue agent could spam a repository with pointless issues and pull requests to overwhelm maintainers, or inject objectionable URLs and content into repository objects. The absence of secrets doesn’t mean the absence of risk.
GitHub’s answer is the safe outputs system, and it’s the most distinctive part of their architecture.
Here’s how it works:
The agent can read GitHub state through the GitHub MCP server, which operates in read-only mode. But it cannot write to GitHub directly. All write operations go through a separate safe output MCP server, which only buffers the agent’s intended changes without executing them.
Once the agent finishes its work and exits, everything it staged goes through a deterministic analysis pipeline.
This pipeline does three things.
First, it checks the type of each operation against an allowlist that the workflow author defined. If the author said the agent can create issues and comments but not pull requests, any staged PR gets dropped.
Second, it enforces quantity limits. An agent might be restricted to creating at most three pull requests per run.
Third, it runs content sanitization, scanning for secrets that might have leaked into the output text, removing URLs, and running content moderation checks.
Only outputs that survive the entire pipeline get committed to GitHub. Every stage’s side effects are explicit and vetted.
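A toy version of that three-stage vetting could look like the following. The types, the policy shape, and the regexes are simplified assumptions, not GitHub's actual pipeline:

// Toy sketch of the safe-outputs vetting pipeline: type allowlist, quantity
// limits, then content sanitization. Simplified assumptions throughout.
type OutputType = "issue" | "issue_comment" | "pull_request";

interface StagedOutput {
  type: OutputType;
  body: string;
}

interface SafeOutputPolicy {
  allowedTypes: OutputType[];
  maxPerRun: Partial<Record<OutputType, number>>;
}

const SECRET_PATTERN = /gh[pousr]_[A-Za-z0-9]{36}/g; // GitHub token shape
const URL_PATTERN = /https?:\/\/\S+/g;

function vetOutputs(staged: StagedOutput[], policy: SafeOutputPolicy): StagedOutput[] {
  const counts: Partial<Record<OutputType, number>> = {};
  const approved: StagedOutput[] = [];

  for (const out of staged) {
    // 1. Type allowlist: drop anything the workflow author didn't permit.
    if (!policy.allowedTypes.includes(out.type)) continue;

    // 2. Quantity limits: e.g. at most three pull requests per run.
    const used = (counts[out.type] ?? 0) + 1;
    if (used > (policy.maxPerRun[out.type] ?? Infinity)) continue;
    counts[out.type] = used;

    // 3. Content sanitization: redact leaked tokens and strip URLs.
    //    (Real moderation would be a further check here.)
    approved.push({
      ...out,
      body: out.body
        .replace(SECRET_PATTERN, "[redacted]")
        .replace(URL_PATTERN, "[url removed]"),
    });
  }
  return approved; // only these ever get committed to GitHub
}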
The compiler plays an important role here, too. When it decomposes a workflow into stages, it defines for each stage the active components and their permissions (read versus write), the data artifacts that stage can emit, and the admissible downstream consumers of those artifacts.
The workflow author declares upfront what the agent is allowed to produce, and the system enforces those declarations deterministically. Since the pipeline uses deterministic analysis, it can only catch patterns that GitHub anticipated. A truly novel attack vector might slip through, which is exactly why the other layers exist. No single layer is the complete answer.
Agents are determined to accomplish their tasks by any means and can have a surprisingly deep toolbox of tricks for doing so. When an agent behaves unexpectedly, you need full visibility into what happened.
Agentic Workflows make observability a first-class architectural property by logging at every trust boundary.
Network and destination-level activity gets recorded at the firewall.
Model request/response metadata and authenticated requests are captured by the API proxy.
Tool invocations are logged by the MCP gateway and MCP servers.
GitHub even adds internal instrumentation to the agent container to audit potentially sensitive actions like environment variable accesses.
Together, these logs support full forensic reconstruction, policy validation, and detection of anomalous behavior.
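Under stated assumptions, a unified audit event spanning those boundaries might be shaped like this. The field names are hypothetical:

// Hypothetical unified audit-event shape for the trust boundaries above.
// Field names are assumptions for illustration.
interface AuditEvent {
  timestamp: string;
  boundary: "firewall" | "api-proxy" | "mcp-gateway" | "agent-container";
  action: string;                  // e.g. "net.connect", "tool.invoke", "env.read"
  detail: Record<string, string>;  // destination host, tool name, variable name...
  verdict: "allowed" | "denied";   // observation today, mediation tomorrow
}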
But there’s a more important long-term play here. Every point where you can observe communication is also a point where you can mediate it. GitHub is building the observation infrastructure now with future control in mind. They already support a lockdown mode for the GitHub MCP server, and they plan to introduce controls that enforce policies across MCP servers based on whether repository objects are public or private, and based on who authored them.
Every security decision GitHub made came with a cost.
Security versus utility is the most obvious tension. Agents running inside GitHub’s architecture are more constrained than a developer working locally. The chroot approach gives agents access to host tools, but the firewall still limits network access, and the safe outputs pipeline still restricts what the agent can produce. In other words, more security means less flexibility.
Strict-by-default is a strong opinion. Most other coding agents make sandboxing opt-in. Claude Code and Gemini CLI both require you to turn on their sandbox features. GitHub Agentic Workflows run in strict security mode by default. That’s a deliberate choice to prioritize safety over developer convenience, and it won’t be the right tradeoff for every use case.
Prompt injection remains fundamentally unsolved. GitHub’s architecture is a damage containment strategy, not a prevention strategy. It limits the blast radius when an agent gets tricked, but it can’t prevent the issue itself. And the deterministic vetting in the safe outputs pipeline can only catch patterns that were anticipated. A novel attack vector might need a new pipeline stage.
The architecture is also complex, involving multiple containers, proxies, gateways, a compilation step, and a staged output pipeline. This is engineering overhead that makes sense at GitHub’s scale. Simpler setups may not need every piece.
As AI agents become standard in development tooling, the question will shift from whether to sandbox to how to build a complete security architecture.
GitHub’s four principles offer a transferable framework:
Defend in depth with independent layers.
Keep agents away from secrets by architecture, not policy.
Vet every output through deterministic analysis before it affects the real world.
Log everything at every trust boundary, because today’s observability is tomorrow’s control plane.