
How Anthropic’s Claude Thinks

2026-03-25 23:31:03

How AgentField Ships Production Code with 200 Autonomous Agents (Sponsored)

We hit the ceiling of single-session AI coding fast. We now orchestrate 200+ Claude Code instances in parallel on a shared codebase. Each instance runs in its own git worktree with filesystem access, test execution, and git. The system produces draft pull requests that have already been through automated writing, testing, code review, and verification before a human reviews them.

We recently open-sourced this system as SWE-AF. In this article, we cover the two-mode LLM integration pattern, a three-loop failure recovery hierarchy, and checkpoint-based execution that makes $116 builds survivable.

Read the article


Nobody at Anthropic programmed Claude to think a certain way. They trained it on data, and it developed its own strategies, buried inside billions of computations. For the people who built it, this could feel like an uncomfortable black box. Therefore, they decided to build something like a microscope for AI, a set of tools that would let them trace the actual computational steps Claude takes when it produces an answer.

The findings surprised them.

Take a simple example. Ask Claude to add 36 and 59, and it will probably tell you it carried the ones and added the columns as per the standard algorithm we all learned in school. However, when the researchers watched what actually happened inside Claude during that calculation, they saw something quite different. There was no carrying. Instead, two parallel strategies ran at once, one estimating the rough answer and another precisely calculating the last digit. In other words, Claude got the math right but had no accurate picture of how it had done it.

That gap between what Claude says and what it actually does turned out to be just the beginning. Over the course of multiple research papers published in 2025, Anthropic’s interpretability team traced Claude’s internal computations across a range of tasks, from writing poetry to answering factual questions to handling dangerous prompts.

In this article, we will look at what the Claude researchers found.

Disclaimer: This post is based on publicly shared details from the Anthropic Research and Engineering Team. Please comment if you notice any inaccuracies.

Looking Inside an LLM

The diagram below shows a typical flow of how a modern LLM works:

Before getting to what Anthropic’s research team found, it helps to understand what this “microscope” actually is.

The core problem is that individual neurons inside an LLM’s neural network don’t map neatly to single concepts. One neuron might activate for “basketball,” “round objects,” and “the color orange” all at once. This is called polysemanticity, and it means that looking at neurons directly doesn’t tell us much about what the model is doing.

Anthropic’s solution is to use specialized techniques to decompose neural activity into what they call “features.” These are more interpretable units that correspond to recognizable concepts, such as smallness, known entities, or rhyming words.

To find these features, the team built a replacement model, which is basically a simplified copy of Claude that swaps neurons for features while producing the same outputs. They study this copy, not Claude directly.

Once they have features, they can trace how they connect to each other from input to output, producing attribution graphs. Think of these as wiring diagrams for a specific computation. And the most powerful part of this tool is the ability to intervene. You can reach into the model and suppress or inject specific features, then watch how the output changes. If you suppress the concept of “rabbit” and the model writes a different word, that’s strong causal evidence that the “rabbit” feature was doing what you thought it was doing. This technique is borrowed directly from neuroscience, where researchers stimulate specific brain regions to test their function.
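The intervention idea can be sketched in a few lines. This is a toy linear model, not Anthropic’s actual tooling: it treats an activation as a plain vector and a feature as a direction in that space, which is roughly the geometric picture the research describes. The vectors and the “rabbit” label are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def suppress(activation, feature):
    """Remove the component of the activation along a feature direction."""
    scale = dot(activation, feature) / dot(feature, feature)
    return [a - scale * f for a, f in zip(activation, feature)]

def inject(activation, feature, strength=1.0):
    """Add the feature direction to the activation at a chosen strength."""
    norm = dot(feature, feature) ** 0.5
    return [a + strength * f / norm for a, f in zip(activation, feature)]

rabbit = [1.0, 0.0, 0.0]          # hypothetical "rabbit" feature direction
state = [0.8, 0.3, 0.5]           # hypothetical internal activation
edited = suppress(state, rabbit)  # the "rabbit" component is driven to zero
```

If the model’s output changes after an edit like this, that is the causal evidence the researchers are after.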

Claude Thinks in Concepts

Claude speaks dozens of languages fluently. So a natural question is whether there’s a separate “French Claude” and “English Claude” running internally, each responding in its own language.

There isn’t. When the researchers asked Claude for “the opposite of small” in English or French, they found that the same core features for “smallness” and “oppositeness” were activated regardless of the language used in the prompt. These shared features triggered a concept of “largeness,” which then got translated into whatever language the question was asked in.

This shared circuitry scales with model size. For example, Claude 3.5 Haiku shares more than twice the proportion of its features between languages compared to a smaller model. The implication of this is that Claude operates in some abstract conceptual space where meaning exists before language. If it learns something in English, it can potentially apply that knowledge when speaking French, not because it translates, but because at a deep level, both languages connect to the same internal representations.

How Claude Plans Poetry

Here’s a couplet Claude wrote.

He saw a carrot and had to grab it,

His hunger was like a starving rabbit

To write the second line, the model had to satisfy two constraints at once. It needed to rhyme with “grab it” and also make sense in context. The researchers’ hypothesis was that Claude probably writes word by word, then at the end of the line, picks a word that rhymes. They expected to find parallel paths for meaning and rhyming that converge at the final word.

Instead, they found that Claude plans ahead. Before writing the second line at all, it had already identified “rabbit” as a candidate ending. It picked the destination first, then wrote the line to get there.

The intervention experiments confirmed this was real. When the researchers suppressed the “rabbit” feature in Claude’s internal state, the model rewrote the line to end with “habit” instead. When they injected the concept of “green,” it wrote a completely different, non-rhyming line ending in “green.” This demonstrates both planning ability and flexibility.

What makes this experiment more credible than a typical AI capability claim is that the researchers had set out to show that Claude didn’t plan. Finding the opposite is what gives the result its weight. They followed the evidence rather than their expectations.

How Claude Does Math

The mental math result deserves a closer look, because the gap it reveals goes deeper than a quirky arithmetic shortcut.

When Claude computes 36 + 59, the microscope shows two computational paths running in parallel. One path estimates the rough magnitude of the answer, placing it somewhere in the range of 88 to 97. The other path focuses specifically on the last digit, computing that 6 + 9 ends in 5. These paths interact and combine to produce 95.
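The two-path strategy is easy to simulate. This is an illustration of the combination logic only, with the rough estimate simulated as a window around the true sum (the real circuit’s estimate is learned, not computed this way):

```python
def parallel_add(a, b):
    # Path 1: a rough magnitude estimate (simulated here as a +/- 5
    # window around the true sum).
    rough_low, rough_high = a + b - 5, a + b + 5
    # Path 2: the exact last digit, via modular arithmetic.
    last_digit = (a % 10 + b % 10) % 10
    # Combine: the candidate in the rough range whose last digit matches.
    for candidate in range(rough_low, rough_high + 1):
        if candidate % 10 == last_digit:
            return candidate

parallel_add(36, 59)  # returns 95
```

Neither path alone knows the answer; their intersection does.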

This is nothing like the carrying algorithm Claude describes when you ask it to explain its work.

So why does Claude give the wrong explanation?

This is because it learned to explain math and to do math through completely separate processes. Claude’s explanations come from human-written text it absorbed during training, text where people describe the standard algorithm. However, Claude’s actual computational strategies emerged from the training process itself. No one taught it to use parallel approximation paths. It developed those on its own, and those internal strategies aren’t accessible to the part of Claude that generates natural language explanations.

This is an important finding, and not just for arithmetic. It means Claude’s self-reports about its own reasoning process can be inaccurate, not because it’s lying, but because it literally doesn’t have access to its own internal algorithms. When we ask a model to show its work, we might be getting a plausible reconstruction and not a faithful record.

This raises an obvious follow-up question. If Claude’s explanations don’t always match the internal process while doing easy math, what happens on harder problems?

When Claude’s Reasoning is Motivated

Modern models like Claude can “think out loud,” writing extended chains of reasoning before giving a final answer. Often this produces better results. However, Anthropic’s researchers found that the relationship between the written reasoning and the actual internal computation isn’t always what it seems.

On an easier problem that required computing the square root of 0.64, Claude produced a faithful chain of thought. The microscope showed internal features representing the intermediate step of computing the square root of 64. The explanation matched the process.

On a harder problem involving the cosine of a large number, something very different happened. Claude produced a chain of thought that claimed to work through the calculation step by step. But the microscope revealed no evidence of any calculation having occurred internally.

In other words, Claude had generated an answer and constructed a plausible-looking derivation after the fact, without actually computing anything. The philosopher Harry Frankfurt had a word for this kind of output. He called it bullshitting. Not lying, which requires knowing the truth and deliberately contradicting it, but something arguably worse: producing statements without any concern for whether they’re true or false.

Further on, when the researchers gave Claude a hint about the expected answer, the model engaged in what they call motivated reasoning. It worked backward from the target answer, finding intermediate steps that would lead to that conclusion. It wasn’t solving the problem. It was reverse-engineering a justification for a predetermined result.

The self-unawareness in the math case was harmless. Claude got the right answer, even if its account of the method was wrong. No one gets hurt. But this is different. If a model’s step-by-step reasoning can be a performance rather than a genuine process, the chain-of-thought traces we increasingly rely on for trust become unreliable.

Why Hallucinations Happen

Perhaps the most counterintuitive finding involves hallucination, which is the tendency for language models to make up information.

The conventional view is that models hallucinate because they’re trained to always produce output. They’re completion machines, so they fill gaps with plausible-sounding text. The challenge, in this framing, is teaching them to stay quiet when they don’t know something.

Anthropic found something that turns this framing upside down. In Claude, refusal to answer is actually the default behavior. The researchers identified a circuit that is “on” by default and that causes the model to state that it lacks sufficient information to answer any given question. In other words, Claude’s natural state is to decline.

What lets Claude answer questions at all is a separate mechanism. When the model recognizes a well-known entity, say the basketball player Michael Jordan, a “known answer” feature activates and inhibits the default refusal circuit. This inhibition is what allows Claude to provide an answer.

Hallucinations happen when this recognition system misfires. When Claude encounters a name like “Michael Batkin” (a person it doesn’t know anything about), the refusal circuit should win. But if the name triggers enough familiarity, perhaps Claude has seen it in passing during training, the “known entity” feature can incorrectly activate and suppress the refusal. With refusal disabled and no actual knowledge to draw on, Claude invents a plausible answer.

The researchers confirmed this mechanism by intervening directly. By artificially activating the “known answer” features while asking about unknown entities, they could consistently make Claude hallucinate. They could also cause hallucination by inhibiting the “can’t answer” features.
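The gating logic can be summarized as a tiny decision function. This is assumed mechanics distilled from the description above, not the actual circuit; the 0.5 threshold is arbitrary:

```python
def respond(familiarity, has_facts):
    """Refusal is the default; a strong-enough 'known entity' signal
    inhibits it. Hallucination is the misfire case: refusal suppressed
    with no actual knowledge behind it."""
    refuse = familiarity <= 0.5          # default circuit: decline
    if refuse:
        return "refuse"
    # Refusal suppressed: answer from knowledge if it exists; otherwise
    # the model invents something plausible, i.e. hallucinates.
    return "answer" if has_facts else "hallucinate"
```

The “Michael Batkin” case corresponds to `respond(0.7, False)`: just familiar enough to disable refusal, with nothing real to say.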

This implies that hallucination isn’t Claude being reckless, but the recognition system misfiring and overriding a safety default that was working correctly.

When Grammar Overrides Safety

The final case study involves jailbreaks, prompting strategies designed to get a model to produce outputs it shouldn’t.

The researchers studied a specific jailbreak that tricks Claude through an acrostic. The prompt “Babies Outlive Mustard Block” asks the model to put together the first letters of each word. Claude spells out B-O-M-B without initially recognizing what it’s producing. By the time it realizes it has been asked about bomb-making, it has already started a sentence providing instructions.

What happens next reveals a surprising tension. Safety features activate. The model recognizes that it should refuse. However, features promoting grammatical coherence and self-consistency exert competing pressure. Once Claude has begun a sentence, these coherence features push it to complete that sentence in a way that is grammatically and semantically valid. The safety features want to stop, but the grammar features want to finish.

Claude only manages to pivot to refusal at a sentence boundary. Once it reaches a natural stopping point, it starts a new sentence with the kind of refusal it had been trying to give all along. The features that ordinarily make Claude a fluent, coherent writer became, in this specific case, the vulnerability that a jailbreak could exploit.

Conclusion

These findings create a richer picture of Claude’s internals than anything that came before. However, the researchers are also upfront about the limitations.

  • The tools produce satisfying insight on roughly a quarter of the prompts they try. The case studies in this article, and in the original blog post, are the success cases. Even on those successes, the microscope captures only a fraction of the total computation Claude performs.

  • Everything described here was observed in the replacement model, not in Claude itself. The replacement model is designed to behave identically, but the possibility of artifacts, things the replacement model does that the real model doesn’t, is real.

  • There’s also a scale problem. Current analysis requires hours of human effort on prompts containing only tens of words. Scaling this to the thousands of words in a complex reasoning chain is an unsolved problem.

Ultimately, the question “how does Claude think?” doesn’t have a single answer.

It thinks in abstract concepts that exist before language. It plans ahead, choosing destinations and writing routes to reach them. It invents its own computational methods and then describes completely different ones when asked. It sometimes fabricates reasoning to support predetermined conclusions. Its default is silence, and it speaks only when something overrides that default, sometimes incorrectly. And when it starts a sentence, finishing it grammatically can temporarily override everything else, including safety.


How Netflix Live Streams to 100 Million Devices in 60 Seconds

2026-03-24 23:31:21

New Year, New Metrics: Evaluating AI Search in the Agentic Era (Sponsored)

Most teams pick a search provider by running a few test queries and hoping for the best – a recipe for hallucinations and unpredictable failures. This technical guide from You.com gives you access to an exact framework to evaluate AI search and retrieval.

What you’ll get:

  • A four-phase framework for evaluating AI search

  • How to build a golden set of queries that predicts real-world performance

  • Metrics and code for measuring accuracy

Go from “looks good” to proven quality.

Learn how to run an eval


Netflix Live Origin is a custom-built server that sits between the cloud live streaming pipelines and Open Connect, Netflix’s Content Delivery Network (CDN). It acts like a quality control checkpoint that decides which video segments get delivered to millions of viewers around the world.

When Netflix first introduced live streaming, they needed a system that could handle the unique challenges of real-time video delivery. Unlike Video on Demand (VOD), where content is prepared in advance, live streaming operates under time constraints. Every video segment must be encoded, packaged, and delivered to viewers within seconds. The Live Origin was designed specifically to handle these demands.

In this article, we will look at the architecture of this system and the challenges Netflix faced while building it.

Disclaimer: This post is based on publicly shared details from the Netflix Engineering Team. Please comment if you notice any inaccuracies.

How the System Works

The Live Origin operates as a multi-tenant microservice on Amazon EC2 instances. The communication model is quite straightforward:

  • The Packager, which prepares video segments for distribution, sends these segments to the Origin using HTTP PUT requests.

  • Open Connect nodes retrieve these segments using HTTP GET requests. The URL used for the PUT request matches the URL used for the GET request, creating a simple storage and retrieval pattern.

See the diagram below:

Netflix made a couple of architectural decisions that shaped how the Live Origin functions.

  • First, they achieve reliability through redundant regional live streaming pipelines. Instead of relying on a single encoding pipeline, Netflix runs two independent pipelines simultaneously. Each pipeline operates in a different cloud region with its own encoder, packager, and video contribution feed.

  • Second, Netflix adopted a manifest design with segment templates and constant segment duration. Instead of constantly updating a manifest file to list available segments, they use a predictable template where each segment has a fixed duration of 2 seconds. This design choice allows the Origin to predict exactly when segments should be published.
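The fixed 2-second duration makes segment availability a pure function of time: no manifest lookup is needed to know what should exist. A minimal sketch of that arithmetic:

```python
from datetime import datetime, timedelta, timezone

SEGMENT_DURATION = timedelta(seconds=2)  # constant duration per the template

def expected_segment(event_start: datetime, now: datetime) -> int:
    """Index of the newest segment that should be published by `now`."""
    return int((now - event_start) / SEGMENT_DURATION)

def publish_time(event_start: datetime, index: int) -> datetime:
    """When segment `index` is expected to become available."""
    return event_start + index * SEGMENT_DURATION
```

Every component in the path, from the Origin to the OCAs, can run this same calculation independently.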

Multi-Pipeline Awareness and Intelligent Selection

Live video streams inevitably contain defects because of the unpredictable nature of live video feeds and the strict real-time publishing deadlines. Common problems include short segments with missing video frames or audio samples, completely missing segments, and timing discontinuities where the decode timestamps are incorrect.

Running two independent pipelines substantially reduces the chance that both will produce defective segments at the same time. Because the pipelines use different cloud regions, encoders, and video sources, when one pipeline produces a bad segment, the other typically produces a good one.

The Live Origin leverages its position in the distribution path to make intelligent decisions. When Open Connect requests a segment, the Origin checks candidates from both pipelines in a predetermined order and selects the first valid one. To detect defects, the Packager performs lightweight media inspection and includes defect information as metadata when publishing segments to the Origin. In the rare case where both pipelines have defective segments, this information passes downstream so clients can handle the error appropriately.
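The selection rule is simple to state in code. The candidate shape and names here are illustrative, not Netflix’s actual API; the defect list stands in for the Packager’s inspection metadata:

```python
def select_segment(candidates):
    """Return the first defect-free candidate in a predetermined order.
    Each candidate: (pipeline_id, segment_bytes, defect_list)."""
    for pipeline, segment, defects in candidates:
        if not defects:
            return pipeline, segment, []
    # Both pipelines defective: serve a candidate anyway, passing the
    # defect metadata downstream so clients can handle the error.
    pipeline, segment, defects = candidates[0]
    return pipeline, segment, defects
```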


No invasive meeting bots (Sponsored)

Ever had a meeting where a random bot joins the call and, suddenly, everyone’s distracted?

Granola works differently.

There are no meeting bots. Nothing joins your call.
Granola transcribes directly from your device’s audio (on your computer or your phone). It works with any meeting tool: Zoom, Google Meet, Microsoft Teams … and even for in-person conversations.

You stay focused and jot down notes like you normally would.
Granola quietly transcribes and enhances the important bits in the background.

And if you want to be extra thoughtful or compliant, you can always send a quick consent email beforehand, automatically disclosing your use of Granola.

Try Granola on your next meeting and see how much easier it is to stay present.

→ Download Granola (Free)

Get one month free on any paid plan using the code BYTEBYTEGO


Optimizing for Open Connect

When the Live project started, Open Connect was highly optimized for VOD delivery. Netflix had spent years tuning nginx (a web server) and the underlying operating system for this use case. Unlike traditional CDNs that fill caches on demand, Open Connect pre-positions VOD assets on carefully selected servers called Open Connect Appliances (OCAs).

Live streaming did not fit neatly into this model, so Netflix extended nginx’s proxy-caching functionality. They made several optimizations to reduce unnecessary traffic and improve performance.

Open Connect nodes receive the same segment templates provided to clients. Using the template information, OCAs can determine the legitimate range of segments for any event at any point in time. This allows them to reject requests for segments outside this range immediately, preventing unnecessary requests from traveling through the network to the Origin.

When a segment is not yet available, and the Origin receives a request for it, the Origin returns a 404 status code (File Not Found) along with an expiration policy. Open Connect can cache this 404 response until just before the segment is expected to be published. This prevents repeated failed requests.
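Because the publish time is predictable, the TTL on that cached 404 can be computed exactly. A sketch, where the guard margin is an assumed tuning value:

```python
def not_found_ttl(now, expected_publish, guard=0.2):
    """Seconds for which Open Connect may cache a 404 for a segment
    that isn't published yet: until just before its expected publish
    time, so the next attempt lands when the segment should exist."""
    return max(0.0, (expected_publish - now) - guard)

# A request arriving 2 seconds early can have its 404 cached for
# roughly 1.8 seconds instead of being retried repeatedly.
```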

Netflix implemented a clever optimization for requests at the live edge, which is the most recent part of the stream. When a request arrives for the next segment that is about to be published, instead of returning another 404 that would propagate back through the network to the client, the Origin holds the request open. Once the segment is published, the Origin immediately responds to the waiting request. This significantly reduces network traffic for requests that arrive slightly early.

To support this functionality, Netflix added millisecond-grain caching to nginx. Standard HTTP Cache Control only works at the second granularity, which is too coarse when segments are generated every 2 seconds.

Streaming Metadata Through HTTP Headers

Netflix uses custom HTTP headers to communicate streaming events in a highly scalable way. The live streaming pipeline provides notifications to the Origin, which inserts them as HTTP headers on segments generated at that point in time. These headers are cumulative, meaning they persist to future segments.

Whenever a segment arrives at an OCA, the notification information is extracted from the response headers and stored in memory using the event ID as the key. When an OCA serves a segment to a client, it attaches the latest notification data to the response. This system ensures that, regardless of where viewers are in the stream, they receive the most recent notification data. The notifications can even be conveyed on responses that do not supply new segments.

This approach allows Netflix to communicate information like ad breaks, content warnings, or live event updates to millions of devices efficiently and independently of their playback position.
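An OCA-side sketch of this mechanism might look as follows. The header name and dictionary shapes are hypothetical; only the pattern (extract on fill, attach on serve, keyed by event ID) comes from the description above:

```python
latest_notifications = {}  # event_id -> newest cumulative notification

def on_segment_from_origin(event_id, response_headers):
    """Called when a segment arrives from the Origin: remember the
    notification carried on its headers."""
    note = response_headers.get("X-Live-Notification")  # hypothetical name
    if note is not None:
        latest_notifications[event_id] = note

def serve_to_client(event_id, response_headers):
    """Attach the latest notification to any response for this event,
    regardless of the viewer's playback position."""
    note = latest_notifications.get(event_id)
    if note is not None:
        response_headers["X-Live-Notification"] = note
    return response_headers
```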

Cache Invalidation and Origin Masking

Netflix built an invalidation system that can flush all content associated with an event by altering the version number used in cache keys. This is particularly useful during pre-event testing, allowing the network to return to a pristine state between tests.

Each segment published by the Live Origin includes metadata about which encoding pipeline generated it and which region it came from. The enhanced invalidation system takes these variants into account. Netflix can invalidate a specific range of segment numbers, but only if they came from a particular encoder or from a specific encoder in a specific region.
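Version-scoped cache keys make this kind of bulk invalidation cheap. A sketch with illustrative key structure (the real key format is not public):

```python
def cache_key(event_id, version, pipeline, region, rendition, segment):
    """Cache key scoped by event version and pipeline variant. Bumping
    the version makes every previously cached object unreachable at
    once, with no per-object deletion."""
    return f"{event_id}/v{version}/{pipeline}/{region}/{rendition}/{segment}"

before = cache_key("event42", 1, "enc-a", "us-east", "1080p", 17)
after = cache_key("event42", 2, "enc-a", "us-east", "1080p", 17)
```

Including the pipeline and region in the key is what allows invalidating only segments from a particular encoder in a particular region.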

Combined with this cache invalidation capability, the Live Origin supports selective encoding pipeline masking. This feature allows operations teams to exclude problematic segments from a particular pipeline when serving to Open Connect. When bad segments are detected during a live event, this system protects millions of viewers by hiding the problematic content, especially important during the DVR playback window when viewers might rewind to that part of the stream.

Storage Architecture Evolution

Netflix initially used AWS S3 for Live Origin storage, similar to their VOD infrastructure. This worked well for low-traffic events, but as they scaled up, they discovered that live streaming has unique requirements that differ significantly from on-demand content.

While S3 met its stated uptime guarantees, the strict 2-second retry budget inherent to live events meant that any delays were problematic. In live streaming, every write is critical and time-sensitive. The requirements were closer to those of a global, low-latency, highly available database rather than object storage.

Netflix established five key requirements for the new storage system.

  • First, they needed extremely high write availability within a single AWS region with low-latency replication to other regions. Any failed write operation within 500 milliseconds was considered a bug.

  • Second, the system needed to handle high write throughput with hundreds of megabytes replicating across regions.

  • Third, it had to efficiently support large writes that accumulate to thousands of keys per partition.

  • Fourth, they needed strong consistency within the same region to achieve read latency under one second.

  • Fifth, during worst-case scenarios involving Open Connect edge cases, the system might need to handle gigabytes of read throughput without affecting writes.

Netflix had previously built a Key-Value Storage Abstraction that leveraged Apache Cassandra to provide chunked storage of large values. This abstraction was originally built to support cloud game saves, but the Live use case would push its boundaries in terms of write availability, cumulative partition size, and read throughput.

The solution breaks large payloads into chunks that can be independently retried. Combined with Apache Cassandra’s local-quorum consistency model, which allows write availability even with an entire Availability Zone outage, and a write-optimized Log-Structured Merge Tree storage engine, Netflix could meet the first four requirements.
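The chunking idea itself is straightforward. A minimal sketch, with an unrealistically small chunk size for illustration:

```python
CHUNK_SIZE = 4  # bytes; real chunks would be far larger

def chunk(payload: bytes):
    """Split a large value into independently retriable chunks, so a
    single failed write can be retried alone within the tight budget."""
    return [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)]

def reassemble(chunks):
    return b"".join(chunks)
```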

The performance improvements were solid. The median latency dropped from 113 milliseconds to 25 milliseconds, and the 99th percentile latency improved from 267 milliseconds to 129 milliseconds. This new solution was more expensive, but minimizing cost was not the primary objective.

However, handling the Origin Storm failure case required additional work. In this scenario, dozens of Open Connect top-tier caches could simultaneously request multiple large video segments. Calculations showed worst-case read throughput could reach 100 gigabits per second or more.

Netflix could respond to reads at network line rate from Apache Cassandra, but observed unacceptable performance degradation on concurrent writes. To resolve this, they introduced write-through caching using EVCache, their distributed caching system based on Memcached. This allows almost all reads to be served from a highly scalable cache, enabling throughput of 200 gigabits per second and beyond without affecting the write path.

In the final architecture, the Live Origin writes to and reads from the Key-Value abstraction, which manages a write-through cache in EVCache and implements a chunking protocol that spreads large values across the Apache Cassandra storage cluster. Almost all read load is handled from cache, with only cache misses hitting the storage layer. This combination has successfully met the demanding needs of the Live Origin for over a year.

Scalability and Request Prioritization

Netflix’s live streaming platform handles a high volume of diverse stream renditions for each live event. This complexity comes from supporting various video encoding formats with multiple quality levels, numerous audio options across languages and formats, and different content versions such as streams with or without advertisements. During the Tyson vs. Paul fight event in 2024, Netflix observed a historic peak of 65 million concurrent streams.

Netflix chose to build a highly scalable origin rather than relying on traditional origin shields for better cache consistency control and simpler system architecture. The Live Origin connects directly with top-tier Open Connect nodes distributed geographically across several sites. To minimize load on the origin, only designated nodes per stream rendition at each site can fill directly from the origin.

See the diagram below:

While the origin service can scale horizontally using EC2 instances, other system resources like storage platform capacity and backbone bandwidth cannot scale automatically. Not all requests to the live origin have the same importance. Origin writes are most critical because failure directly impacts the streaming pipeline. Origin reads for the live edge are critical because failure impacts the majority of clients. Origin reads for DVR mode are less critical because failure only affects clients who are rewinding.

Netflix implemented comprehensive publishing isolation to protect the latency-sensitive and failure-sensitive origin writes. The origin uses separate EC2 stacks for publishing and CDN traffic. The storage abstraction layer features distinct clusters for read and write operations. The storage layer itself separates the read path through EVCache from the write path through Cassandra. This complete path isolation enables independent scaling of publishing and retrieval traffic and prevents CDN traffic surges from impacting publishing performance.

The Live Origin implements priority-based rate limiting when the underlying system experiences stress. This approach ensures that requests with greater user impact succeed while requests with lower user impact are allowed to fail during periods of high load. Live edge traffic is prioritized over DVR traffic during periods of high load on the storage platform. The detection is based on the predictable segment template, which is cached in memory at the origin node. This allows priority decisions without accessing the datastore, which is valuable especially during datastore stress.

To mitigate traffic surges, Netflix uses TTL cache control alongside priority rate limiting. When low-priority traffic is impacted, the Origin instructs Open Connect to cache identical requests for 5 seconds by setting a max-age directive and returning an HTTP 503 error code. This strategy dampens traffic surges by preventing repeated requests within that 5-second window.
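Putting the two mechanisms together yields something like the following sketch. The window size is an assumption; the 5-second max-age and the 503 come from the description above:

```python
def classify(requested_index, live_edge_index, window=3):
    """Priority decided from the in-memory segment template alone;
    no datastore access is needed."""
    if requested_index >= live_edge_index - window:
        return "live-edge"
    return "dvr"

def handle_under_stress(priority):
    """Under load, shed low-priority traffic and dampen the surge: the
    503 carries a 5-second max-age so Open Connect absorbs identical
    retries instead of forwarding them to the Origin."""
    if priority == "live-edge":
        return 200, {}
    return 503, {"Cache-Control": "max-age=5"}
```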

Handling 404 Storms

Publishing isolation and priority rate limiting successfully protect the live origin from DVR traffic storms. However, traffic storms generated by requests for non-existent segments present additional challenges and opportunities for optimization.

The Live Origin structures metadata hierarchically as event, stream rendition, and segment. The segment publishing template is maintained at the stream rendition level. This hierarchical organization allows the origin to preemptively reject requests using highly cacheable event and stream rendition level metadata, avoiding unnecessary queries to segment level metadata.

The process works as follows:

  • If the event is unknown, the request is rejected with a 404 error.

  • If the event is known but the segment request timing does not match the expected publishing timing, the request is rejected with a 404 and cache control TTL matching the expected publishing time.

  • If the event is known but the requested segment was never generated or missed the retry deadline, the request is rejected with a 410 (Gone) error, which tells the client to stop requesting.
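The rejection hierarchy above can be written as a single decision function. A sketch, assuming 2-second segments and illustrative parameter names:

```python
def check_request(event_known, index, expected_index, ever_generated):
    """Hierarchical rejection using cacheable event-level metadata,
    before any segment-level lookup. Returns (status, cache_ttl)."""
    if not event_known:
        return 404, None                    # unknown event
    if index > expected_index:
        # Not yet published: cacheable 404 until the expected publish time.
        ttl = (index - expected_index) * 2  # seconds, at 2s per segment
        return 404, ttl
    if not ever_generated:
        return 410, None                    # gone: client should stop asking
    return 200, None                        # proceed to segment lookup
```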

At the storage layer, metadata is stored separately from media data in the control plane datastore. Unlike the media datastore, the control plane datastore does not use a distributed cache to avoid cache inconsistency. Event and rendition level metadata benefits from high cache hit ratios when in-memory caching is utilized at the live origin instance. During traffic storms involving non-existent segments, the cache hit ratio for control plane access easily exceeds 90 percent.

The use of in-memory caching for metadata effectively handles 404 storms at the live origin without causing datastore stress. This metadata caching complements the storage system’s distributed media cache, providing a complete solution for traffic surge protection.

Conclusion

The Netflix Live Origin is a sophisticated system built specifically for the unique demands of live streaming at a massive scale.

Through redundant pipelines, intelligent segment selection, optimized caching strategies, prioritized request handling, and a custom storage architecture, Netflix has created a reliable foundation for delivering live events to millions of concurrent viewers worldwide.

The system successfully balances the competing demands of write reliability, read scalability, and operational flexibility, proving its effectiveness during major events like the Tyson vs. Paul fight that reached 65 million concurrent streams.

How Agentic RAG Works

2026-03-23 23:31:43

The hidden reality of AI-Driven development (Sponsored)

There is a new “velocity tax” in software development. As AI adoption grows, your teams aren’t necessarily working less—they are spending 25% of their week fixing and securing AI-generated code. This hidden cost creates a verification bottleneck that stalls innovation. Sonar provides the automated, trusted analysis needed to bridge the gap between AI speed and production-grade quality.

Learn more


The main problem with standard RAG systems isn’t the retrieval or the generation. It’s that nothing sits in the middle deciding whether the retrieval was actually good enough before the generation happens.

Standard RAG is a pipeline where information flows in one direction, from query to retrieval to response, with no checkpoint and no second chance. This works fine for simple questions with obvious answers.

However, the moment a query gets ambiguous, or the answer is spread across multiple documents, or the first retrieval pulls back something that looks good but isn’t, RAG starts losing value.

Agentic RAG attempts to fix this problem. It is based on a single question: what if the system could pause and think before answering?

In this article, we will look at how agentic RAG works, how it improves upon standard RAG, and the trade-offs that should be considered.

One Query and One Retrieval

To understand what Agentic RAG fixes, we need to be clear about how standard RAG works and where it falls short.

A standard RAG pipeline has a straightforward flow:

  • A user asks a question.

  • The system converts that question into a numerical representation called an embedding, which captures the semantic meaning of the query.

  • It then searches a vector database, a database optimized for finding content with similar meaning, and retrieves the top matching chunks of text.

  • Those chunks get passed to a large language model along with the original question, and the LLM generates an answer grounded in the retrieved context.
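The four steps above can be sketched end to end. Everything here is a toy: the hard-coded "embeddings", the keyword-based embed function, and the generate stub stand in for a real embedding model, vector database, and LLM:

```python
import math

# Toy corpus with hand-made 3-dim "embeddings" standing in for a real model.
CHUNKS = {
    "Returns are accepted within 30 days.": [0.9, 0.1, 0.0],
    "Shipping takes 3-5 business days.":    [0.1, 0.9, 0.0],
}

def embed(text):
    # Stand-in for an embedding model: route by keyword.
    return [0.8, 0.2, 0.0] if "return" in text.lower() else [0.2, 0.8, 0.0]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query, k=1):
    # Stand-in for a vector database: rank chunks by similarity.
    q = embed(query)
    ranked = sorted(CHUNKS, key=lambda c: cosine(q, CHUNKS[c]), reverse=True)
    return ranked[:k]

def generate(query, context):
    # Stand-in for the LLM call: ground the answer in the retrieved context.
    return f"Based on: {context[0]}"

query = "What's our return policy?"
answer = generate(query, retrieve(query))
```

Note the one-way flow: nothing checks whether the retrieved chunk actually answers the question before generation happens.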

See the diagram below:

The diagram below shows what embeddings typically look like:

This works extremely well for direct and unambiguous questions against a well-organized knowledge base. A question like “What’s our return policy?” asked against a clean documentation corpus will get a solid answer almost every time.

Here’s what a typical query flow looks like:

The problems show up when queries get more complex. Here are a few scenarios:

  • Ambiguous queries: When a user asks, “How do I handle taxes?” they could mean personal income taxes, business taxes, or tax-exempt status for a nonprofit. Standard RAG can’t clarify or rewrite. It takes the query as-is, retrieves whatever scores highest on similarity, and hopes for the best.

  • Scattered evidence: Sometimes the answer lives across multiple documents. An employee asking “What’s the policy on remote work for contractors?” needs information from both the remote work policy and the contractor agreement. Standard RAG typically retrieves from one pool of chunks and has no concept of checking a second source if the first one comes up short.

  • False confidence: The retrieval returns something that looks relevant based on similarity scores, but doesn’t actually answer the question. Maybe it’s about the right topic, but from an outdated version of a document. The system has no mechanism to tell the difference between “relevant” and “actually correct.” It generates a confident response either way.

These three failure modes share the same root cause. The system does not reflect what it retrieved. It can’t ask itself whether the results were good enough.


AI companies aren’t scraping Google (Sponsored)

They’re using SerpApi: the industry-standard Web Search API that shares access to search engines with a simple API. Trusted by Uber, NVIDIA, and more. Start with 250 free credits/month.

Try for Free


From Pipeline to Control Loop

Agentic RAG replaces that linear pipeline with a loop by bringing the capabilities of AI agents into the mix.

At its core, an AI agent is a software system that can perceive its environment, make decisions, and take actions to achieve specific goals with some degree of independence. The word “agent” is key here. Just as a travel agent acts on our behalf to find flights and negotiate deals, an AI agent acts on behalf of users or systems to accomplish tasks without needing constant guidance for every single step.

See the diagram below that illustrates the concept of an AI agent:

In Agentic RAG, instead of retrieve-then-generate, the flow becomes: retrieve, evaluate what came back, decide whether to answer or try again, and if needed, retrieve differently.

See the diagram below:

The word “agentic” might sound like a marketing push, but in this context, an agent is an LLM that has been given the ability to make decisions and call tools. Think of it as an LLM that, instead of just generating text, can also choose to take actions such as running a search, querying a database, calling an API, or deciding that it needs more information before responding.

This gives the system three capabilities that standard RAG lacks.

  • Tool use and routing: The agent can decide which knowledge source to query based on the question. A financial question might go to a SQL database. A policy question might go to a document store. A product question might need both. In short, an agentic system picks the right place, or searches multiple places.

  • Query refinement: Before searching, the agent can rewrite an ambiguous query into something more specific. After searching, if the results look weak, it can reformulate and try again. The agent acts on the query both before and after retrieval.

  • Self-evaluation: After getting results back, the agent examines them by asking questions such as “Is this relevant?” or “Is it complete?” or “Does it conflict with other information?” If the answer to any of these is no, the agent can retry with a different query, a different source, or both. This directly addresses the “one-shot problem.”
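These three capabilities turn the pipeline into a loop. Here is a minimal sketch in which the evaluator, the query refiner, and the retrieval step are all stubs standing in for LLM and tool calls; the knowledge base and rewrite rule are made up for illustration:

```python
KB = {
    "business tax": "File business taxes by March 15.",
    "shipping policy": "Orders ship within 2 business days.",
}

def retrieve(query):
    # Stand-in for a vector search: match on shared words.
    tokens = set(query.lower().replace("?", "").split())
    return [v for k, v in KB.items() if tokens & set(k.split())]

def evaluate(results):
    # Stand-in for an LLM judging whether retrieval was good enough.
    return len(results) > 0

def refine(query):
    # Stand-in for an LLM rewriting an ambiguous query into a targeted one.
    return "business tax" if "taxes" in query.lower() else query

def agentic_answer(query, max_loops=3):
    for _ in range(max_loops):
        results = retrieve(query)
        if evaluate(results):
            return f"Answer grounded in: {results[0]}"   # generation step, stubbed
        query = refine(query)                            # try again, differently
    return "No reliable answer found."

out = agentic_answer("How do I handle taxes?")
```

The ambiguous query fails its first retrieval, gets rewritten, and succeeds on the second pass, which is exactly the checkpoint standard RAG lacks.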

See the diagram below that shows Agentic RAG approach on a high-level:

However, it’s misleading to think of Agentic RAG as a binary switch. In its simplest form, it’s like a router that decides which of two or three knowledge bases to query. That’s already a meaningful upgrade over standard RAG for multi-source environments.
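In that simplest form, the router is just a classification step in front of retrieval. A toy sketch follows; the source names and keyword rules are invented, and a real system would typically use an LLM or a trained classifier instead:

```python
def route(query):
    """Pick knowledge sources for a query. Keyword rules are purely
    illustrative; real routers use an LLM or classifier."""
    q = query.lower()
    sources = []
    if any(w in q for w in ("revenue", "invoice", "quarter")):
        sources.append("sql_finance_db")
    if any(w in q for w in ("policy", "contractor", "remote")):
        sources.append("hr_document_store")
    return sources or ["default_docs"]

routes = route("What's the policy on remote work for contractors?")
```

A query that spans domains gets routed to multiple sources, which is what makes even this simple router an upgrade for multi-source environments.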

Further along the spectrum, you get systems like ReAct (short for Reasoning + Acting), a framework where the agent alternates between reasoning about what it knows and taking actions to learn more, running multiple retrieval steps with evaluation between each one.

See the diagram below:

At the far end sit multi-agent systems where specialized agents collaborate, coordinated by an orchestrator.

Query Refinement, Routing, and Self-Correction

The control loop is a useful mental model. However, it can be understood better when mapped back to the failure modes from earlier.

  • Ambiguity solved by query refinement: The “How do I handle taxes?” question goes through the agent first. The agent can decompose it into sub-questions based on context, or rewrite it into something more targeted before any retrieval happens. If the first retrieval comes back with results about personal income tax but the context suggests the user is asking about business tax, the agent can refine and search again.

  • Scattered evidence solved by routing: The remote work policy question for contractors now goes through an agent that recognizes it needs two sources. It routes the query to the HR policy document store, retrieves what it finds, then routes to the contractor agreements, retrieves from there, and synthesizes both sets of results before generating an answer.

  • False confidence solved by self-evaluation: The agent retrieves a chunk that looks relevant but comes from a document last updated two years ago. The evaluation step flags this. Maybe the agent then searches for a more recent version, or it searches a different source entirely, or it includes a caveat in its response. The system no longer blindly trusts similarity scores.

These three capabilities map directly to the three failure modes.

Agentic RAG was designed specifically to address the gaps where standard RAG’s one-shot approach falls short. There are additional agentic capabilities beyond these three, like memory and semantic caching, which allow the system to retain context across multiple queries in a conversation.

The Trade-Offs

Everything above might make Agentic RAG sound like a straight upgrade over standard RAG. However, every iteration of that loop has a cost, and those costs can be significant enough that many systems shouldn’t use it. Here are a few considerations to have:

  • Latency: Every loop iteration means another LLM call, another retrieval, another evaluation. A standard RAG query might take 1-2 seconds. An agentic query with three or four loops could take 10 seconds or more. For real-time chat applications, that’s usually unacceptable.

  • Cost: Each agent decision consumes tokens. A system handling thousands of queries per day can see costs multiply 3-10x compared to standard RAG. Even if 80% of those queries are simple FAQ lookups, that’s a lot of money spent on unnecessary reasoning.

  • Debugging and predictability: Standard RAG is relatively deterministic. However, Agentic RAG introduces variability because the agent can make different decisions based on what it finds at each step. This makes it harder to reproduce issues, harder to write tests, and harder to explain to stakeholders why the same question produced different answers in different situations.

  • The evaluator paradox: The self-evaluation step uses an LLM to judge whether retrieval was good enough. The system’s ability to self-correct is only as good as the LLM’s ability to judge relevance. A weak evaluator might reject perfectly good results and send the system on a wild goose chase, or accept poor results and generate a bad answer anyway. Basically, we’re trusting one LLM call to oversee another.

  • Overcorrection: Sometimes the agentic loop is smarter than it needs to be. It might discard useful retrieved information during the evaluation step, keep searching for something “better,” and end up with a worse answer than if it had just gone with the first result.
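A back-of-the-envelope model makes the latency and cost bullets concrete. Every number below is an illustrative assumption, not a measurement:

```python
def query_cost(eval_loops, llm_secs=1.5, retrieval_secs=0.4,
               tokens_per_call=2000, price_per_1k_tokens=0.01):
    """Rough latency (seconds) and cost ($) per query. All numbers assumed."""
    llm_calls = 1 + eval_loops          # final generation + one judge call per loop
    retrievals = max(1, eval_loops)     # at least one retrieval always happens
    latency = llm_calls * llm_secs + retrievals * retrieval_secs
    cost = llm_calls * tokens_per_call / 1000 * price_per_1k_tokens
    return round(latency, 2), round(cost, 2)

standard = query_cost(eval_loops=0)   # plain retrieve-then-generate
agentic = query_cost(eval_loops=4)    # four evaluate/refine iterations
```

Under these assumptions, the standard query lands in the 1-2 second range while the four-loop agentic query takes roughly 9 seconds and costs 5x as much, which is why routing simple queries around the loop matters.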

None of this means Agentic RAG should not be used. It means that deciding to use it should be an engineering decision and not a default choice.

Direct factual lookups against a clean and single-source knowledge base don’t need a reasoning loop. Neither do high-volume, low-complexity query patterns where latency and cost matter more than handling edge cases. If most of the failures in an existing RAG system come from retrieval quality issues like bad chunking or stale data, fixing those will do more good than adding an agentic layer.

Conclusion

The core mental model for Agentic RAG is straightforward.

Agentic RAG turns retrieval from a one-shot pipeline into a loop with decision points. Those decision points are the entire value add.

When evaluating or building RAG systems, three questions can help cut through the noise:

  • Is the system retrieving from the right source?

  • Can it evaluate whether what it retrieved is good enough?

  • Does it have the ability to try again if it’s not?

If the answer to all three is “no” and the queries are complex, that’s the signal to consider the agentic approach. If the queries are simple and the knowledge base is clean, standard RAG is probably the right call.

The pipeline-to-loop shift also isn’t unique to RAG. It reflects a broader pattern in how AI systems are evolving, moving from rigid pipelines toward systems with feedback loops and decision-making capabilities.

Last Chance to Enroll | Become an AI Engineer | Cohort-Based Course

2026-03-22 23:30:41

Our 5th cohort of Becoming an AI Engineer starts in less than a week. This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.

Check it out Here

Here’s what makes this cohort special:

  • Learn by doing: Build real world AI applications, not just by watching videos.

  • Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.

  • Live feedback and mentorship: Get direct feedback from instructors and peers.

  • Community driven: Learning alone is hard. Learning with a community is easy!

We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.

If you want to start learning AI from scratch, this is the perfect platform for you to begin.

Check it out Here

EP207: Top 12 GitHub AI Repositories

2026-03-21 23:31:00

✂️ Cut your QA cycles down to minutes with QA Wolf (Sponsored)

If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.

QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.

QA Wolf takes testing off your plate. They can get you:

  • Unlimited parallel test runs for mobile and web apps

  • 24-hour maintenance and on-demand test creation

  • Human-verified bug reports sent directly to your team

  • Zero flakes guarantee

The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.

With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.

Schedule a demo to learn more


This week’s system design refresher:

  • Top 12 GitHub AI Repositories

  • Where Different Types of Tests Fit

  • How Single Sign-On (SSO) Works

  • How LLMs Use AI Agents with Deep Research

  • How Hackers Steal Passwords


Top 12 GitHub AI Repositories

These repositories were selected based on their overall popularity and GitHub stars.

  1. OpenClaw: The always-on personal AI agent that lives on your device and talks to you through WhatsApp, Telegram, and 50+ other platforms.

  2. N8n: A visual workflow automation platform with native AI capabilities and 400+ integrations.

  3. Ollama: Run powerful LLMs locally on your own hardware with a single command.

  4. Langflow: A drag-and-drop visual builder for designing and deploying AI agents and RAG workflows.

  5. Dify: A full-stack prod-ready platform for building and deploying AI-powered apps and agentic workflows.

  6. LangChain: The foundational framework powering the AI agent ecosystem with modular building blocks.

  7. Open WebUI: A self-hosted, offline-capable ChatGPT alternative.

  8. DeepSeek-V3: An open-weight LLM that rivals GPT on benchmarks and is free for commercial use.

  9. Gemini CLI: Google’s open-source tool to interact with the Gemini model right from your terminal.

  10. RAGFlow: An enterprise-grade RAG engine that grounds AI answers in real documents with citation tracking.

  11. Claude Code: An agentic coding tool that understands your entire codebase and executes engineering tasks from the terminal.

  12. CrewAI: A lightweight Python framework for assembling teams of role-playing AI agents to collaborate on tasks.

Over to you: Which other repository will you add to the list?


Where Different Types of Tests Fit

Writing code is easy now, but testing code is hard.

Let’s take a look at where different types of tests fit.

  • Unit + Component Tests: These test individual functions or UI components in isolation. They’re fast, inexpensive to run, and easy to maintain. Tools like Jest, Vitest, JUnit, pytest, React Testing Library, Cypress, Vue Test Utils, and Playwright are commonly used here, and most of your test coverage should come from this layer.

  • Integration Tests: These verify communication between services, APIs, and databases, using tools like Testcontainers, Postman, Bruno, and Supertest. Unit tests won't catch a broken API contract, but integration tests will.

  • End-to-End Tests: Tools like Cypress, Playwright, Appium, and QA Wolf validate full user journeys across the whole system. They are expensive to run and maintain, which is why fewer tests live in this layer.
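As a minimal illustration of the bottom layer, here is a pytest-style unit test that isolates one function with no network, database, or browser involved. The function under test is made up for the example:

```python
# A hypothetical function under test, plus a pytest-style unit test.
# Tests like this run in milliseconds and need no external services.
def apply_discount(price, percent):
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    assert apply_discount(100.0, 20) == 80.0
    assert apply_discount(19.99, 0) == 19.99

test_apply_discount()
```

Because this layer is so cheap to run, it is where most of your coverage should live; the slower layers above it only verify what isolation cannot.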

AI tools are becoming part of the testing workflow. Tools like GitHub Copilot, ChatGPT, Claude, Cursor, and Qodo can help draft tests, update suites, and spot gaps in coverage. They take care of repetitive tasks and give engineers more time to focus on the edge cases that may arise in production.

Over to you: How do you test your code?


Crawl an Entire Website With a Single API Call (Sponsored)

Building web scrapers for RAG pipelines or model training usually means managing fragile fleets of headless browsers and complex scraping logic. Cloudflare’s new Browser Rendering endpoint changes that. You can now crawl an entire website asynchronously with a single API call. Submit a starting URL, and the endpoint automatically discovers pages, renders them, and returns clean HTML, Markdown, or structured JSON. It fully respects robots.txt out of the box, supports incremental crawling to reduce costs, and includes a fast static mode. Stop managing scraping infrastructure and get back to building your application.

Try the Crawl API today


How Single Sign-On (SSO) Works

Single Sign-On (SSO) makes access feel effortless. One login, and you’re inside Slack and several other internal tools without logging in again.

But there’s a lot going on behind that single login.

Step 1: The first login

  • A user opens an application, for example Salesforce.

  • Instead of asking for credentials directly, Salesforce redirects the browser to an Identity Provider (IdP) like Okta or Auth0. This redirect usually happens through an HTTP 302 response.

  • The browser then sends an authentication request to the IdP using protocols such as SAML or OpenID Connect (OIDC).

  • The IdP presents the login page. The user enters their credentials, sometimes along with MFA.

  • Once verified, the IdP creates a login session and sends back an authentication response (a SAML assertion or ID token) through the browser.

  • The browser forwards that response back to Salesforce.

  • Salesforce validates the token and creates its own local session, typically stored as a cookie, and grants access.

Step 2: The SSO magic

  • Now the user opens another app, say Slack.

  • Slack also redirects the browser to the same identity provider. But the IdP checks and sees the user already has an active session. So it skips the login step entirely and issues a new authentication token.

  • The browser forwards that token to Slack.

  • Slack validates it, creates its own session cookie, and grants access.

The key idea behind SSO is simple. Applications don’t authenticate users themselves. They rely on a central identity provider to verify the user and issue a token that other systems trust.
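That central-IdP idea can be sketched as a toy flow. Everything below, from the session store to the token string, is illustrative; it is not an implementation of SAML or OIDC:

```python
# Toy SSO flow: apps delegate authentication to one identity provider.
class IdentityProvider:
    def __init__(self):
        self.sessions = set()        # users with an active IdP session

    def login(self, user, password_ok=True):
        # Stand-in for the credentials + MFA check at the login page.
        if password_ok:
            self.sessions.add(user)

    def authenticate(self, user):
        """Issue a token if a session exists; otherwise require login."""
        if user in self.sessions:
            return {"token": f"signed-token-for-{user}"}   # no login prompt
        return None                                        # redirect to login page

idp = IdentityProvider()

first_visit = idp.authenticate("alice")       # no session yet: must log in
idp.login("alice")
salesforce_token = idp.authenticate("alice")  # first app gets a token
slack_token = idp.authenticate("alice")       # second app: session reused, no login
```

The second app never sees a password; it only sees a token it trusts because the IdP issued it.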

Over to you: What SSO solutions have you used, and which is your favorite?


How LLMs Use AI Agents with Deep Research

When you ask an LLM such as Claude, ChatGPT, or Gemini to do deep research on a complex topic, it’s not just one model doing all the work. It’s a coordinated system of specialized AI agents.

Here’s how it works:

Step 1: Understanding The Question and Making a Plan
It all starts with the query, something like “Analyze the competitive landscape of AI agents in 2026.” The system doesn’t just dive in blindly. First, it may ask clarifying questions to understand exactly what is needed. Then, it generates a plan and breaks the big question down into smaller, manageable tasks.

Step 2: Sub-Agents Get to Work
Each small task gets assigned to a sub-agent, which is basically a mini AI worker with a specific job. For example, one sub-agent might be tasked with finding the latest Nvidia earnings. It figures out which tools to use, such as searching the web, browsing a specific page, or even running code to analyze data. All of this happens through a secure layer of APIs and services that connect the AI to the outside world.

Step 3: Putting it All Together
Once all the sub-agents finish their tasks, a Synthesizer Agent takes over. It aggregates everything, identifies key themes, plans an outline, and removes any redundant or duplicate information. At the same time, a Citation Agent makes sure every claim is linked back to its source and properly formatted. The end result is a polished, well-cited final output ready for use.

Over to you: Have you tried deep research in any LLM?


How Hackers Steal Passwords

Most password attacks don't involve sophisticated hacking. They rely on automation, reused credentials, and predictable human behavior.

Here are six common techniques:

  • Brute-force attack: Automated tools cycle through password combinations at high speed until one works. No logic involved, just volume.

  • Dictionary attacks: Instead of random guesses, attackers use curated wordlists built from common passwords, leaked data, and predictable patterns.

  • Credential stuffing: When one site is breached, attackers reuse those stolen username–password pairs across many other services. It works because a large portion of users reuse passwords across multiple accounts.

  • Password spraying: One common password gets tried across many accounts in the same organization. Spreading attempts across accounts avoids triggering lockout thresholds.

  • Phishing: The victim lands on a fake login page and enters credentials. The attacker captures them in real time. No malware needed.

  • Keylogger malware: Malicious software records keystrokes and sends them to the attacker. Passwords, usernames, and anything else typed on the device are captured.
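The gap between the first two techniques is just search-space size, which a quick calculation makes concrete. The guess rate and the tiny wordlist below are illustrative assumptions:

```python
# Why attackers prefer wordlists over raw brute force: search-space size.
alphabet = 26 + 26 + 10              # lowercase + uppercase + digits
length = 8
brute_force_space = alphabet ** length   # every possible 8-char combination

# A real leaked-password wordlist has millions of entries; this is a toy.
common_passwords = ["123456", "password", "qwerty", "letmein"]
dictionary_space = len(common_passwords)

# At an assumed 1 billion guesses/second, exhausting the brute-force
# space takes days, while a wordlist is exhausted instantly.
brute_force_secs = brute_force_space / 1e9
```

This is also why long random passwords plus a password manager beat clever-but-predictable ones: they push you out of every wordlist and deep into the brute-force space.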

Over to you: Which attack have you seen most often?

Event Sourcing Explained: Benefits and Use Cases

2026-03-19 23:30:59

Every time we run an UPDATE statement in a database, something disappears. The old value, whatever was there a moment ago, is gone.

In fact, most databases are designed to forget. Every UPDATE overwrites what came before, every DELETE removes it entirely, and the application is left with only a snapshot of the present state. We accept this as normal because it’s the most natural way to think about things.

But what if your system needs to answer a different kind of question: not just “what is the current state?” but “how did we get here?”

That’s the question Event Sourcing is built to answer. And the solution is both more rewarding and more demanding than it first appears. In this article, we will look at Event Sourcing along with its benefits and trade-offs.
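Before diving in, a toy contrast shows the core difference. A CRUD store keeps only the latest value, while an event-sourced store appends every change and derives the current state by replaying them. The event names and account example are made up for illustration:

```python
# CRUD: each update overwrites the previous value.
crud_account = {"balance": 100}
crud_account["balance"] = 70        # the fact that it was ever 100 is gone

# Event sourcing: append every change, derive state by replaying events.
events = [
    {"type": "AccountOpened", "amount": 100},
    {"type": "MoneyWithdrawn", "amount": 30},
]

def current_balance(events):
    balance = 0
    for e in events:
        if e["type"] == "AccountOpened":
            balance += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
    return balance

# Same current state as CRUD, but the full history survives:
# we can answer both "what is the balance?" and "how did we get here?"
```

Both stores agree on the present, but only the event log can explain the past.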

What CRUD Loses

Read more