
RSS preview of the ByteByteGo blog

How Google Manages Trillions of Authorizations with Zanzibar

2026-01-28 00:30:31

WorkOS Pipes: Ship Third-Party Integrations Without Rebuilding OAuth (Sponsored)

Connecting user accounts to third-party APIs always comes with the same plumbing: OAuth flows, token storage, refresh logic, and provider-specific quirks.

WorkOS Pipes removes that overhead. Users connect services like GitHub, Slack, Google, Salesforce, and other supported providers through a drop-in widget. Your backend requests a valid access token from the Pipes API when needed, while Pipes handles credential storage and token refresh.

Simplify integrations today →


Sometime before 2019, Google built a system that manages permissions for billions of users while maintaining both correctness and speed.

When you share a Google Doc with a colleague or make a YouTube video private, a complex system works behind the scenes to ensure that only the right people can access the content. That system is Zanzibar, Google’s global authorization infrastructure that handles over 10 million permission checks every second across services like Drive, YouTube, Photos, Calendar, and Maps.

In this article, we will look at the high-level architecture of Zanzibar and understand the valuable lessons it provides for building large-scale systems, particularly around the challenges of distributed authorization.

See the diagram below that shows the high-level architecture of Zanzibar.

Disclaimer: This post is based on publicly shared details from the Google Engineering Team. Please comment if you notice any inaccuracies.

The Core Problem: Authorization at Scale

Authorization answers a simple question: Can this particular user access this particular resource? For a small application with a few users, checking permissions is straightforward. We might store a list of allowed users for each document and check if the requesting user is on that list.

The challenge multiplies at Google’s scale. For reference, Zanzibar stores over two trillion permission records and serves them from dozens of data centers worldwide. A typical user action might trigger tens or hundreds of permission checks. When searching for an artifact in Google Drive, the system must verify your access to every result before displaying it. Any delay in these checks directly impacts user experience.

Beyond scale, authorization systems also face a critical correctness problem that Google calls the “new enemy” problem. Consider the scenario where we remove someone from a document’s access list, then add new content to that document. If the system uses stale permission data, the person who was just removed might still see the new content. This happens when the system doesn’t properly track the order in which you made changes.

Zanzibar solves these challenges through three key architectural decisions:

  • A flexible data model based on tuples.

  • A consistency protocol that respects causality.

  • A serving layer optimized for common access patterns.


AI code review with the judgment of your best engineer. (Sponsored)

Unblocked is the only AI code review tool that has deep understanding of your codebase, past decisions, and internal knowledge, giving you high-value feedback shaped by how your system actually works instead of flooding your PRs with stylistic nitpicks.

Try Unblocked for free


The Data Model

Zanzibar represents all permissions as relation tuples, which are simple statements about relationships between objects and users. A tuple follows this format: object, relation, user. For example, “document 123, viewer, Alice” means Alice can view document 123.

See the diagram below:

This tuple-based approach differs from traditional access control lists that attach permissions directly to objects. Tuples can refer to other tuples. Instead of listing every member of a group individually on a document, we can create one tuple that says “members of the Engineering group can view this document.” When the Engineering group membership changes, the document permissions automatically reflect those changes.

The system organizes tuples into namespaces, which are containers for objects of the same type. Google Drive might have separate namespaces for documents and folders, while YouTube has namespaces for videos and channels. Each namespace defines what relations are possible and how they interact.
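
To make the data model concrete, here is a minimal Python sketch of relation tuples with a one-level check. The in-memory set, the IDs, and the helper name are illustrative only; the paper writes tuples in a compact text form roughly like namespace:object#relation@user, and the real system resolves group indirection (and cycles) far more carefully.

from dataclasses import dataclass

@dataclass(frozen=True)
class RelationTuple:
    """One permission fact: <namespace:object_id> # <relation> @ <user>."""
    namespace: str
    object_id: str
    relation: str
    user: str  # a concrete user id, or a userset such as "group:eng#member"

# Illustrative tuples (names are made up for this sketch).
TUPLES = {
    RelationTuple("doc", "123", "viewer", "user:alice"),        # Alice can view doc 123
    RelationTuple("doc", "123", "viewer", "group:eng#member"),  # so can every member of group eng
    RelationTuple("group", "eng", "member", "user:bob"),        # Bob is in group eng
}

def check(namespace: str, object_id: str, relation: str, user: str) -> bool:
    """Direct tuple check, plus userset indirection; cycle detection is omitted here."""
    if RelationTuple(namespace, object_id, relation, user) in TUPLES:
        return True
    for t in TUPLES:
        if (t.namespace, t.object_id, t.relation) == (namespace, object_id, relation) and "#" in t.user:
            group_ref, group_rel = t.user.split("#")
            group_ns, group_id = group_ref.split(":")
            if check(group_ns, group_id, group_rel, user):
                return True
    return False

print(check("doc", "123", "viewer", "user:alice"))  # True via the direct tuple
print(check("doc", "123", "viewer", "user:bob"))    # True via group:eng#member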

Zanzibar’s clients can use a configuration language to specify rules about how relations are composed. For instance, a configuration might state that all editors are also viewers, but not all viewers are editors.

See the code snippet below that shows the configuration language approach for defining the relations.

name: "doc"
relation { name: "owner" }
relation {
  name: "editor"
  userset_rewrite {
    union {
      child { _this {} }
      child { computed_userset { relation: "owner" } }
    }
  }
}
relation {
  name: "viewer"
  userset_rewrite {
    union {
      child { _this {} }
      child { computed_userset { relation: "editor" } }
    }
  }
}

Source: Zanzibar Research Paper

These rules, called userset rewrites, let the system derive complex permissions from simple stored tuples. For example, consider a document in a folder. The folder has viewers, and you want those viewers to automatically see all documents in the folder. Rather than duplicating the viewer list on every document, you write a rule saying that to check who can view a document, look up its parent folder, and include that folder’s viewers. This approach enables permission inheritance without data duplication.

The configuration language supports set operations like union, intersection, and exclusion. A YouTube video might specify that its viewers include direct viewers, plus viewers of its parent channel, plus anyone who can edit the video. This flexibility allows diverse Google services to specify their specific authorization policies using the same underlying system.
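
To show how these rewrites compose, here is a minimal Python sketch of a check evaluator that understands three rule types described in the paper: _this (tuples stored directly on the object), computed_userset (for example, editors are viewers), and tuple_to_userset (inherit viewers from the parent folder). The rule encoding, tuple shapes, and names below are simplifications for illustration, not Zanzibar's configuration engine.

# Stored tuples: (object, relation, user). A "parent" tuple points at a folder object.
TUPLES = {
    ("folder:plans", "viewer", "user:carol"),
    ("doc:roadmap", "parent", "folder:plans"),
    ("doc:roadmap", "editor", "user:dave"),
}

# Simplified rewrite rules for relations in the "doc" namespace.
DOC_REWRITES = {
    "viewer": [("this",), ("computed_userset", "editor"), ("tuple_to_userset", "parent", "viewer")],
    "editor": [("this",), ("computed_userset", "owner")],
    "owner":  [("this",)],
}

def check(obj: str, relation: str, user: str) -> bool:
    rewrites = DOC_REWRITES.get(relation, [("this",)]) if obj.startswith("doc:") else [("this",)]
    for rule in rewrites:
        if rule[0] == "this" and (obj, relation, user) in TUPLES:
            return True
        if rule[0] == "computed_userset" and check(obj, rule[1], user):
            return True
        if rule[0] == "tuple_to_userset":
            tupleset_rel, target_rel = rule[1], rule[2]
            for (o, r, parent) in TUPLES:
                if o == obj and r == tupleset_rel and check(parent, target_rel, user):
                    return True
    return False

print(check("doc:roadmap", "viewer", "user:dave"))   # True: Dave is an editor, and editors are viewers
print(check("doc:roadmap", "viewer", "user:carol"))  # True: Carol can view the parent folder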

Handling Consistency with Ordering

The “new enemy” problem shows why distributed authorization is harder than it appears. When you revoke someone’s access and then modify content, two separate systems must coordinate:

  • Zanzibar for permissions

  • Application for content

Zanzibar addresses this through tokens called zookies. When an application saves new content, it requests an authorization check from Zanzibar. If authorized, Zanzibar returns a zookie encoding the current timestamp, which the application stores with the content.

Later, when someone tries to view that content, the application sends both the viewer’s identity and the stored zookie to Zanzibar. This tells Zanzibar to check permissions using data at least as fresh as that timestamp. Since the timestamp came from after any permission changes, Zanzibar will see those changes when performing the check.

This protocol works because Zanzibar uses Google Spanner, which provides external consistency.

If event A happens before event B in real time, their timestamps reflect that ordering across all data centers worldwide through Spanner’s TrueTime technology.

The zookie protocol has an important property. It specifies the minimum required freshness, not an exact timestamp. Zanzibar can use any timestamp equal to or fresher than required, enabling performance optimizations.
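
Here is a minimal sketch of that handshake between the application and the authorization service. The function names and the token encoding (a wall-clock microsecond timestamp) are stand-ins; a real zookie is an opaque token returned by a content-change check and encodes a Spanner timestamp.

import base64
import time
from dataclasses import dataclass

def new_zookie() -> str:
    # Stand-in token: a microsecond wall-clock timestamp, base64-encoded to keep it opaque.
    return base64.urlsafe_b64encode(str(int(time.time() * 1_000_000)).encode()).decode()

def min_timestamp(zookie: str) -> int:
    return int(base64.urlsafe_b64decode(zookie.encode()).decode())

@dataclass
class Document:
    content: str
    zookie: str  # stored alongside the content it was issued for

def save_content(content: str) -> Document:
    # Step 1: the content-change check returns a zookie; the app persists it with the content.
    return Document(content=content, zookie=new_zookie())

def check_view(user: str, doc: Document, local_replica_ts: int) -> bool:
    # Step 2: the stored zookie sets a freshness floor for the permission check.
    required = min_timestamp(doc.zookie)
    snapshot_ts = max(local_replica_ts, required)  # a too-stale replica forces a fresher read
    return evaluate_acl_at(user, doc, snapshot_ts)

def evaluate_acl_at(user: str, doc: Document, snapshot_ts: int) -> bool:
    # Placeholder for evaluating the relation tuples at the chosen snapshot.
    return True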

The Architecture: Distribution and Caching

Zanzibar runs on over 10,000 servers organized into dozens of clusters worldwide. Each cluster contains hundreds of servers that cooperate to answer authorization requests. The system replicates all permission data to more than 30 geographic locations, ensuring that checks can be performed close to users.

When a check request arrives, it goes to any server in the nearest cluster, and that server becomes the coordinator for the request. Based on the permission configuration, the coordinator may need to contact other servers to evaluate different parts of the check. These servers might recursively contact additional servers, particularly when checking membership in nested groups.

For instance, checking if Alice can view a document might require verifying if she is an editor (which implies viewer access), and whether her group memberships grant access, and whether the document’s parent folder grants access. Each of these checks can execute in parallel on different servers, which then combine the results.

The distributed nature of this processing can create potential hotspots. Popular content generates many concurrent permission checks, all targeting the same underlying data. Zanzibar employs several techniques to mitigate these hotspots:

  • First, the system maintains a distributed cache across all servers in a cluster. Using consistent hashing, related checks route to the same server, allowing that server to cache results and serve subsequent identical checks from memory. The cache keys include timestamps, so checks at the same time can share cached results.

  • Second, Zanzibar uses a lock table to deduplicate concurrent identical requests. When multiple requests for the same check arrive simultaneously, only one actually executes the check. The others wait for the result, then all receive the same answer. This prevents flash crowds from overwhelming the system before the cache warms up.

  • Third, for exceptionally hot items, Zanzibar reads the entire permission set at once rather than checking individual users. While this consumes more bandwidth for the initial read, subsequent checks for any user can be answered from the cached full set.

The system also makes intelligent choices about where to evaluate checks. The zookie flexibility mentioned earlier allows Zanzibar to round evaluation timestamps to coarse boundaries, such as one-second or ten-second intervals. This quantization means that many checks evaluate at the same timestamp and can share cache entries, dramatically improving hit rates.
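
The quantization trick is easy to sketch: round the evaluation timestamp to a coarse boundary (the interval below is made up) so that concurrent checks share one cache key, as long as the rounded time still satisfies each request's freshness floor.

import time

QUANTUM_SECONDS = 10  # illustrative rounding interval

def evaluation_timestamp(min_fresh_ts, now=None):
    # Snap to the latest quantum boundary that is still at least as fresh as required.
    now = time.time() if now is None else now
    rounded = now - (now % QUANTUM_SECONDS)
    return max(rounded, min_fresh_ts)

def cache_key(obj, relation, user, eval_ts):
    # Checks that land on the same boundary share this key, and therefore the cached result.
    return (obj, relation, user, eval_ts)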

Handling Complex Group Structures

Some scenarios involve deeply nested groups or groups with thousands of subgroups. Checking membership by recursively following relationships becomes too slow when these structures grow large.

Zanzibar includes a component called Leopard that maintains a denormalized index precomputing transitive group membership. Instead of following chains like “Alice is in Backend, Backend is in Engineering,” Leopard stores direct mappings from users to all groups they belong to.

Leopard uses two types of sets: one mapping users to their direct parent groups, and another mapping groups to all descendant groups. Therefore, checking if Alice belongs to Engineering becomes a set intersection operation that executes in milliseconds.
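
As a rough sketch of the idea (with made-up index names): precompute, for each user, the groups they belong to directly, and, for each group, that group plus everything nested under it. A deeply nested membership question then becomes a set intersection instead of a graph walk.

# Illustrative precomputed indexes (built offline, patched incrementally).
member2group = {
    "user:alice": {"group:backend"},
}
group2group = {
    "group:engineering": {"group:engineering", "group:backend", "group:frontend"},
    "group:backend": {"group:backend"},
}

def is_member(user: str, group: str) -> bool:
    # Non-empty intersection means the user reaches the group through some nested path.
    return bool(member2group.get(user, set()) & group2group.get(group, set()))

print(is_member("user:alice", "group:engineering"))  # True: backend is nested under engineering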

Leopard keeps its denormalized index consistent through a two-layer approach. An offline process periodically builds a complete index from snapshots. An incremental layer watches for changes and applies them on top of the snapshot. Queries combine both layers for consistent results.

Performance Optimization

Zanzibar’s performance reveals optimization for common cases. Around 99% of permission checks use moderately stale data, served entirely from local replicas. These checks have a median latency of 3 milliseconds and reach the 95th percentile at 9 milliseconds. The remaining 1% requiring fresher data have a 95th percentile latency of around 60 milliseconds due to cross-region communication.

Writes are slower by design, with a median latency of 127 milliseconds reflecting distributed consensus costs. However, writes represent only 0.25% of operations.

Zanzibar employs request hedging to reduce tail latency. After sending a request to one replica and receiving no response within a specified threshold, the system sends the same request to another replica and uses the response from the first replica that arrives. Each server tracks latency distributions and automatically tunes parameters like default staleness and hedging thresholds.
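
Request hedging can be sketched with asyncio: send to a primary replica, and if it has not answered within a threshold, send the same request to a second replica and take whichever answer lands first. The replica functions and the 10 ms threshold below are placeholders, not Zanzibar's tuned values.

import asyncio

HEDGE_AFTER_SECONDS = 0.010  # in practice, tuned from observed latency distributions

async def hedged_request(primary, secondary, payload):
    """Return the first response from either replica, hedging after a delay."""
    first = asyncio.create_task(primary(payload))
    done, _ = await asyncio.wait({first}, timeout=HEDGE_AFTER_SECONDS)
    if done:
        return first.result()
    second = asyncio.create_task(secondary(payload))
    done, pending = await asyncio.wait({first, second}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # drop the slower duplicate
    return done.pop().result()

# Toy replicas with different latencies, just to exercise the helper.
async def slow_replica(payload):
    await asyncio.sleep(0.050)
    return ("slow", payload)

async def fast_replica(payload):
    await asyncio.sleep(0.002)
    return ("fast", payload)

print(asyncio.run(hedged_request(slow_replica, fast_replica, {"check": "doc:123#viewer@alice"})))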

Isolation and Reliability

Operating a shared authorization service for hundreds of client applications requires strict isolation between clients. A misbehaving or unexpectedly popular feature in one application should not affect others.

Zanzibar implements isolation at multiple levels. Each client has CPU quotas measured in generic compute units. If a client exceeds its quota during periods of resource contention, its requests are throttled, but other clients continue unaffected. The system also limits the number of concurrent requests per client and the number of concurrent database reads per client.

The lock tables used for deduplication include the client identity in their keys. This ensures that if one client creates a hotspot that fills its lock table, other clients’ requests can still proceed independently.

These isolation mechanisms proved essential in production. When clients launch new features or experience unexpected usage patterns, the problems remain contained. Over three years of operation, Zanzibar has maintained 99.999% availability, meaning less than two minutes of downtime per quarter.

Conclusion

Google’s Zanzibar represents five years of evolution in production, serving hundreds of applications and billions of users. The system demonstrates that authorization at massive scale requires more than just fast databases. It demands careful attention to consistent semantics, intelligent caching and optimization, and robust isolation between clients.

Zanzibar’s architecture offers insights applicable beyond Google’s scale. The tuple-based data model provides a clean abstraction unifying access control lists and group membership. Separating policy configuration from data storage makes it easier to evolve authorization logic without migrating data.

The consistency model demonstrates that strong guarantees are achievable in globally distributed systems through careful protocol design. The zookie approach balances correctness with performance by giving the system flexibility within bounds.

Most importantly, Zanzibar illustrates optimizing for observed behavior rather than theoretical worst cases. The system handles the common case (stale reads) extremely well while supporting the uncommon case (fresh reads) adequately. The sophisticated caching strategies show how to overcome normalized storage limitations while maintaining correctness.

For engineers building authorization systems, Zanzibar provides a comprehensive reference architecture. Even at smaller scales, the principles of tuple-based modeling, explicit consistency guarantees, and optimization through measurement remain valuable.


How Cursor Shipped its Coding Agent to Production

2026-01-27 00:30:35

New report: 96% of devs don’t fully trust AI code (Sponsored)

AI is accelerating code generation, but it’s creating a bottleneck in the verification phase. Based on a survey of 1,100+ developers, Sonar’s newest State of Code report analyzes the impact of generative AI on software engineering workflows and how developers are adapting to address it.

Survey findings include:

  • 96% of developers don’t fully trust that AI-generated code is functionally correct yet only 48% always check it before committing

  • 61% agree that AI often produces code that looks correct but isn’t reliable

  • 24% of a developer’s work week is spent on toil work

Download survey


On October 29, 2025, Cursor shipped Cursor 2.0 and introduced Composer, its first agentic coding model. Cursor claims Composer is 4x faster than similarly intelligent models, with most turns completing in under 30 seconds. For more clarity and detail, we worked with Lee Robinson at Cursor on this article.

Shipping a reliable coding agent requires a lot of systems engineering. Cursor’s engineering team has shared technical details and challenges from building Composer and shipping their coding agent into production. This article breaks down those engineering challenges and how they solved them.

What is a Coding Agent?

To understand coding agents, we first need to look at how AI coding has evolved.

AI in software development has evolved in three waves. First, we treated general-purpose LLMs like a coding partner. You copied code, pasted it into ChatGPT, asked for a fix, and manually applied the changes. It was helpful, but disconnected.

In the second wave, tools like Copilot and Cursor Tab brought AI directly into the editor. To power these tools, specialized models were developed for fast, inline autocomplete. They helped developers type faster, but they were limited to the specific file being edited.

More recently, the focus has shifted to coding agents that handle tasks end-to-end. They don’t just suggest code: they can search your repo, edit multiple files, run terminal commands, and iterate on errors until the build and tests pass. We are currently living through this third wave.


A coding agent is not a single model. It is a system built around a model with tool access, an iterative execution loop, and mechanisms to retrieve relevant code. The model, often referred to as an agentic coding model, is a specialized LLM trained to reason over codebases, use tools, and work effectively inside an agentic system.

It is easy to confuse agentic coding models with coding agents. The agentic coding model is like the brain. It has the intelligence to reason, write code, and use tools. The coding agent is the body. It has the “hands” to execute tools, manage context, and ensure it reaches a working solution by iterating until the build and tests pass.

AI labs first train an agentic coding model, then wrap it in an agent system, also known as a harness, to create a coding agent. For example, OpenAI Codex is a coding agent environment powered by the GPT-5.2-Codex model, and Cursor’s coding agent can run on multiple frontier models, including its own agentic coding model, Composer. In the next section, we take a closer look at Cursor’s coding agent and Composer.


Web Search API for Your AI Applications (Sponsored)

LLMs are powerful—but without fresh, reliable information, they hallucinate, miss context, and go out of date fast. SerpApi gives your AI applications clean, structured web data from major search engines and marketplaces, so your agents can research, verify, and answer with confidence.

Access real-time data with a simple API.

Try for Free


System Architecture

A production-ready coding agent is a complex system composed of several critical components working in unison. While the model provides the intelligence, the surrounding infrastructure is what enables it to interact with files, run commands, and maintain safety. The next figure shows the key components of Cursor’s agent system.

Router

Cursor integrates multiple agentic models, including its own specialized Composer model. For efficiency, the system offers an “Auto” mode that acts as a router. It dynamically analyzes the complexity of each request to choose the best model for the job.

LLM (agentic coding model)

The heart of the system is the agentic coding model. In Cursor’s agent, that model can be Composer or any other frontier coding model picked by the router. Unlike a standard LLM trained just to predict the next token of text, this model is trained on trajectories: sequences of actions that show the model how and when to use available tools to solve a problem.

Creating this model is often the heaviest lift in building a coding agent. It requires massive data preparation, training, and testing to ensure the model doesn’t just write code, but understands the process of coding (e.g., “search first, then edit, then verify”). Once this model is ready and capable of reasoning, the rest of the work shifts to system engineering to provide the environment it needs to operate.

Tools

Composer is connected to a tool harness inside Cursor’s agent system, with more than ten tools available. These tools cover the core operations needed for coding such as searching the codebase, reading and writing files, applying edits, and running terminal commands.

Context Retrieval

Real codebases are too large to fit into a single prompt. The context retrieval system searches the codebase to pull in the most relevant code snippets, documentation, and definitions for the current step, so the model has what it needs without overflowing the context window.

Orchestrator

The orchestrator is the control loop that runs the agent. The model decides what to do next and which tool to use, and the orchestrator executes that tool call, collects the result such as search hits, file contents, or test output, rebuilds the working context with the new information, and sends it back to the model for the next step. This iterative loop is what turns the system from a chatbot into an agent.

One common way to implement this loop is the ReAct pattern, where the model alternates between reasoning steps and tool actions based on the observations it receives.
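
A stripped-down version of that control loop looks like the sketch below. The model client and tool registry are stand-ins with a hypothetical contract (the model returns either a tool call or a final answer); Cursor’s internal interfaces are not public.

import json

def run_agent(model, tools, user_request, max_steps=20):
    """Minimal orchestrator: the model picks an action, the harness executes it,
    the observation is appended to the context, and the loop repeats."""
    context = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        action = model(context)  # e.g. {"tool": "run_tests", "args": {...}} or {"final": "..."}
        if action.get("final"):
            return action["final"]
        tool = tools[action["tool"]]
        observation = tool(**action.get("args", {}))
        context.append({"role": "assistant", "content": json.dumps(action)})
        context.append({"role": "tool", "content": str(observation)})
    return "stopped: step budget exhausted"

# Example tool registry (illustrative implementations).
tools = {
    "read_file": lambda path: open(path).read(),
    "run_tests": lambda: "2 passed, 0 failed",
}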

Sandbox (execution environment)

Agents need to run builds, tests, linters, and scripts to verify their work. However, giving an AI unrestricted access to your terminal is a security risk. To solve this, tool calls are executed in a Sandbox. This secure and isolated environment uses strict guardrails to ensure that the user’s host machine remains safe even if the agent attempts to run a destructive command. Cursor offers the flexibility to run these sandboxes either locally or remotely on a cloud virtual machine.

Note that these are the core building blocks you will see in most coding agents. Different labs may add more components on top, such as long-term memory, policy and safety layers, specialized planning modules, or collaboration features, depending on the capabilities they want to support.

Production Challenges

On paper, tools, memory, orchestration, routing, and sandboxing look like a straightforward blueprint. In production, the constraints are harsher. A model that can write good code is still useless if edits do not apply cleanly, if the system is too slow to iterate, or if verification is unsafe or too expensive to run frequently.

Cursor’s experience highlights three engineering hurdles that general-purpose models do not solve out of the box: reliable editing, compounded latency, and sandboxing at scale.

Challenge 1: The “Diff Problem”

General-purpose models are trained primarily to generate text. They struggle significantly when asked to perform edits on existing files.

This is known as the “Diff Problem.” When a model is asked to edit code, it has to locate the right lines, preserve indentation, and output a rigid diff format. If it hallucinates line numbers or drifts in formatting, the patch fails even when the underlying logic is correct. Worse, a patch can apply incorrectly, which is harder to detect and more expensive to clean up. In production, incorrect edits are often worse than no edits because they reduce trust and increase cleanup time.

A common way to mitigate the diff problem is to train on edit trajectories. For example, you can structure training data as triples like (original_code, edit_command, final_code), which teaches the model how an edit instruction should change the file while preserving everything else.

Another critical step is teaching the model to use specific editing tools such as search and replace. Cursor emphasized that these two tools were significantly harder to teach than other tools. To solve this, they ensured their training data contained a high volume of trajectories specifically focused on search and replace tool usage, forcing the model to over-learn the mechanical constraints of these operations. Cursor utilized a cluster of tens of thousands of GPUs to train the Composer model, ensuring these precise editing behaviors were fundamentally baked into the weights.
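
To make the constraint concrete, here is a hypothetical applier for a search-and-replace edit. Cursor’s actual tool contract is not public, but a common safeguard is to reject edits whose search text does not match exactly once, which forces the model to quote enough surrounding context to pin down a single location.

def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply a search/replace edit only if it is unambiguous."""
    count = source.count(search)
    if count == 0:
        raise ValueError("edit rejected: search text not found (model drifted from the file)")
    if count > 1:
        raise ValueError(f"edit rejected: search text matches {count} locations, edit is ambiguous")
    return source.replace(search, replace, 1)

original = "def add(a, b):\n    return a - b\n"
patched = apply_search_replace(original, "return a - b", "return a + b")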

Challenge 2: Latency Compounds

In a chat interface, a user might tolerate a short pause. In an agent loop, latency compounds. A single task might require the agent to plan, search, edit, and test across many iterations. If each step takes a few seconds, the end-to-end time quickly becomes frustrating.

Cursor treats speed as a core product strategy. To make the coding agent fast, they employ three key techniques:

  • Mixture of Experts (MoE) architecture

  • Speculative decoding

  • Context compaction

MoE architecture: Composer is a MoE language model. MoE modifies the Transformer by making some feed-forward computation conditional. Instead of sending every token through the same dense MLP, the model routes each token to a small number of specialized MLP experts.

MoE can improve both capacity and efficiency by activating only a few experts per token, which can yield better quality at similar latency, or similar quality at lower latency, especially at deployment scale. However, MoE often introduces additional engineering challenges and complexity. If every token goes to the same expert, that expert becomes a bottleneck while others sit idle. This causes high tail latency.

Teams typically address this with a combination of techniques. During training they add load-balancing losses to encourage the router to spread traffic across experts. During serving, they enforce capacity limits and reroute overflow. At the infrastructure level, they reduce cross-GPU communication overhead by batching and routing work to keep data movement predictable.
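
To make the routing and capacity ideas concrete, here is a toy top-k router in pure Python. It is not a real MoE implementation: random scores stand in for the learned router, and the expert count, k, and capacity are arbitrary. The point is that each token goes to its best-scoring experts, and an expert that is already full rejects overflow so it cannot become a tail-latency hotspot.

import random

NUM_EXPERTS = 4
TOP_K = 2
CAPACITY = 3  # max tokens an expert accepts in this batch (illustrative)

def route_batch(tokens):
    """Assign each token to up to TOP_K experts, respecting per-expert capacity."""
    load = [0] * NUM_EXPERTS
    assignments = {}
    for token in tokens:
        scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in for the learned router
        ranked = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)
        chosen = []
        for expert in ranked:
            if len(chosen) == TOP_K:
                break
            if load[expert] < CAPACITY:  # enforce capacity; overflow falls to the next-best expert
                load[expert] += 1
                chosen.append(expert)
        assignments[token] = chosen  # may be shorter than TOP_K if everything is full
    return assignments, load

assignments, load = route_batch([f"tok{i}" for i in range(8)])
print(load)  # roughly balanced, and never above CAPACITY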

Speculative Decoding: Generation is sequential. Agents spend a lot of time producing plans, tool arguments, diffs, and explanations, and generating these token by token is slow. Speculative decoding reduces latency by using a smaller draft model to propose tokens that a larger model can verify quickly. When the draft is correct, the system accepts multiple tokens at once, reducing the number of expensive decoding steps.

Since code has a very predictable structure, such as imports, brackets, and standard syntax, waiting for a large model like Composer to generate every single character is inefficient. Cursor confirms they use speculative decoding and trained specialized “draft” models that predict the next few tokens rapidly. This allows Composer to generate code much faster than the standard token-by-token generation rate.
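
The accept/reject loop behind speculative decoding can be shown with toy deterministic “models”. This sketch uses greedy, exact-match acceptance only; production systems verify draft tokens against the target distribution in one batched forward pass, which is where the real speedup comes from.

def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Draft proposes k tokens; the target checks them and accepts the matching prefix."""
    out = list(prompt)
    while len(out) < len(prompt) + max_new:
        draft, ctx = [], list(out)
        for _ in range(k):               # cheap model proposes a run of tokens
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)
        accepted = 0
        for i, token in enumerate(draft):  # expensive model verifies them
            if target_next(out + draft[:i]) == token:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])
        out.append(target_next(out))       # always emit one target token to make progress
    return out[:len(prompt) + max_new]

def target_next_token(ctx):
    return f"t{len(ctx)}"

def draft_next_token(ctx):
    # The cheap draft agrees with the target except at every 5th position.
    return f"t{len(ctx)}" if len(ctx) % 5 else "guess"

print(speculative_decode(target_next_token, draft_next_token, ["import", "os"], max_new=6))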

Context Compaction: Agents also generate a lot of text that is useful once but costly to keep around, such as tool outputs, logs, stack traces, intermediate diffs, and repeated snippets. If the system keeps appending everything, prompts bloat and latency increases.

Context compaction addresses this by summarizing the working state and keeping only what is relevant for the next step. Instead of carrying full logs forward, the system retains stable signals like failing test names, error types, and key stack frames. It compresses or drops stale context, deduplicates repeated snippets, and keeps raw artifacts outside the prompt unless they are needed again. Many advanced coding agents like OpenAI’s Codex and Cursor rely on context compaction to stay fast and reliable when reaching the context window limit.

Context compaction improves both latency and quality. Fewer tokens reduce compute per call, and less noise reduces the chance the model drifts or latches onto outdated information.
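
A simple illustration of the idea (the heuristics and threshold are made up): keep small tool outputs verbatim, but reduce large logs to the stable signals the next step actually needs, such as failing test names and error lines.

import re

MAX_VERBATIM_CHARS = 2000  # illustrative threshold

def compact_tool_output(output: str) -> str:
    """Replace a large tool output with a short summary of the useful signals."""
    if len(output) <= MAX_VERBATIM_CHARS:
        return output
    keep = []
    for line in output.splitlines():
        # Retain lines that carry durable signal: failures, errors, stack frames.
        if re.search(r"(FAILED|Error|Traceback|assert)", line):
            keep.append(line.strip())
    summary = "\n".join(keep[:40]) or "no error lines found"
    return f"[compacted {len(output)} chars of tool output]\n{summary}"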

Put together, these three techniques target different sources of compounded latency. MoE reduces per-call serving cost, speculative decoding reduces generation time, and context compaction reduces repeated prompt processing.

Challenge 3: Sandboxing at Scale

Coding agents do not only generate text. They execute code. They run builds, tests, linters, formatters, and scripts as part of the core loop. That requires an execution environment that is isolated, resource-limited, and safe by default.

In Cursor’s flow, the agent provisions a sandboxed workspace from a specific repository snapshot, executes tool calls inside that workspace, and feeds results back into the model. At a small scale, sandboxing is mostly a safety feature. At large scale, it becomes a performance and infrastructure constraint.

Two major issues dominate when training the model:

  • Provisioning time becomes the bottleneck. The model may generate a solution in milliseconds, but creating a secure, isolated environment can take much longer. If sandbox startup dominates, the system cannot iterate quickly enough to feel usable.

  • Concurrency makes startup overhead a bottleneck at scale. Spinning up thousands of sandboxes at once is challenging, and even more so during training: teaching the model to call tools at scale requires running hundreds of thousands of concurrent sandboxed coding environments in the cloud.

These challenges pushed the Cursor team to build custom sandboxing infrastructure. They rewrote their VM scheduler to handle bursty demand, like when an agent needs to spin up thousands of sandboxes in a short time. Cursor treats sandboxes as core serving infrastructure, with an emphasis on fast provisioning and aggressive recycling so tool runs can start quickly and sandbox startup time does not dominate the time to a verified fix.

For safety, Cursor defaults to a restricted Sandbox Mode for agent terminal commands. Commands run in an isolated environment with network access blocked by default and filesystem access limited to the workspace and /tmp/. If a command fails because it needs broader access, the UI lets the user skip it or intentionally re-run it outside the sandbox.

The key takeaway is to not treat sandboxes as just containers. Treat them like a system that needs its own scheduler, capacity planning, and performance tuning.

Conclusion

Cursor shows that modern coding agents are not just better text generators. They are systems built to edit real repositories, run tools, and verify results. Cursor paired a specialized MoE model with a tool harness, latency-focused serving, and sandboxed execution so the agent can follow a practical loop: inspect the code, make a change, run checks, and iterate until the fix is verified.

Cursor’s experience shipping Composer to production points to three repeatable lessons that matter for most coding agents:

  1. Tool use must be baked into the model. Prompting alone is not enough for reliable tool calling inside long loops. The model needs to learn tool usage as a core behavior, especially for editing operations like search and replace where small mistakes can break the edit.

  2. Adoption is the ultimate metric. Offline benchmarks are useful, but a coding agent lives or dies by user trust. A single risky edit or broken build can stop users from relying on the tool, so evaluation has to reflect real usage and user acceptance.

  3. Speed is part of the product. Latency shapes daily usage. You do not need a frontier model for every step. Routing smaller steps to fast models while reserving larger models for harder planning turns responsiveness into a core feature, not just an infrastructure metric.

Coding agents are still evolving, but the trend is promising. With rapid advances in model training and system engineering, we are moving toward a future where they become much faster and more effective.

EP199: Behind the Scenes: What Happens When You Enter Google.com

2026-01-25 00:30:29

Debugging Next.js in Production (Sponsored)

Next.js makes it easy to ship fast, but once your app is in production it can be hard to tell where errors, slow requests, or hydration issues are really coming from.

Join Sentry’s hands-on workshop where Sergiy Dybskiy will dive into how these problems show up in real apps and how to connect what users experience with what’s happening under the hood. 🚀

Register for the workshop


This week’s system design refresher:

  • What Are AI Agents & How Do They Work? (YouTube video)

  • Behind the Scenes: What Happens When You Enter Google.com

  • Understanding the Linux Directory Structure

  • Symmetric vs. Asymmetric Encryption

  • Network Troubleshooting Test Flow

  • We’re hiring at ByteByteGo


What Are AI Agents & How Do They Work?


Behind the Scenes: What Happens When You Enter Google.com

Most of us hit Enter and expect the page to load instantly, but under the hood, a surprisingly intricate chain of events fires off in milliseconds.

Here’s a quick tour of what actually happens:

  1. The journey starts the moment you type “google.com” into the address bar.

  2. The browser checks everywhere for a cached IP: Before touching the network, your browser looks through multiple cache layers: browser cache, OS cache, router cache, and even your ISP’s DNS cache.
    A cache hit means an instant IP address. A miss kicks off the real journey.

  3. Recursive DNS resolution begins: Your DNS resolver digs through the global DNS hierarchy:
    - Root servers
    - TLD servers (.com)
    - Authoritative servers for google.com

  4. A TCP connection is established: Your machine and Google’s server complete the classic TCP 3-way handshake:
    - SYN → SYN/ACK → ACK
    Only after the connection is stable does the browser move on.

  5. The TLS handshake wraps everything in encryption: By the end of this handshake, a secure HTTPS tunnel is ready.

  6. The actual HTTP request finally goes out: Google processes the request and streams back HTML, CSS, JavaScript, and all the assets needed to build the page.

  7. The rendering pipeline kicks in:
    Your browser parses HTML into a DOM tree, CSS into a CSSOM tree, merges them into the Render Tree, and then:
    - Lays out elements
    - Loads and executes JavaScript
    - Repaints the screen

  8. The page is fully loaded.

Over to you: What part of this journey was most surprising the first time you learned how browsers work?


Understanding the Linux Directory Structure

  • The root directory “/” is the starting point of the entire filesystem. From there, Linux organizes everything into specialized folders.

  • “/boot” stores the bootloader and kernel files; without it, the system can’t start.

  • “/dev” holds device files that act as interfaces to hardware.

  • “/usr” contains system resources, libraries, and user-level applications.

  • “/bin” and “/sbin” store essential binaries and system commands needed during startup or recovery.

  • User-related data sits under “/home” for regular users and “/root” for the root account.

  • System libraries that support core binaries live in “/lib” and “/lib64”.

  • Temporary data is kept in “/tmp”, while “/var” tracks logs, caches, and frequently changing files.

  • Configuration files live in “/etc”, and runtime program data lives in “/run”.

  • Linux also exposes virtual filesystems through “/proc” and “/sys”, giving you insight into processes, kernel details, and device information.

  • For external storage, “/media” and “/mnt” handle removable devices and temporary mounts, and “/opt” is where optional third-party software installs itself.

Over to you: Which Linux directory took you the longest to fully understand, and what finally made it click for you?


Symmetric vs. Asymmetric Encryption

Symmetric and asymmetric encryption often get explained together, but they solve very different problems.

  • Symmetric encryption uses a single shared key. The same key encrypts and decrypts the data. It’s fast, efficient, and ideal for large amounts of data. That’s why it’s used for things like encrypting files, database records, and message payloads.

    The catch is key distribution, both parties must already have the secret, and sharing it securely is hard.

  • Asymmetric encryption uses a key pair. A public key that can be shared with anyone, and a private key that stays secret. Data encrypted with the public key can only be decrypted with the private key.

    This removes the need for secure key sharing upfront, but it comes at a cost. It’s slower and computationally expensive, which makes it impractical for encrypting large payloads.

    That’s why asymmetric encryption is usually used for identity, authentication, and key exchange, not bulk data.
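
The split shows up clearly in code. In the sketch below, a symmetric key (Fernet, built on AES) encrypts the bulk payload, while an RSA key pair protects only the small symmetric key, which is the usual hybrid pattern. This assumes the cryptography package is installed; the message and key sizes are illustrative.

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Symmetric: one shared key, fast, fine for bulk data.
sym_key = Fernet.generate_key()
ciphertext = Fernet(sym_key).encrypt(b"a large message payload ..." * 1000)

# Asymmetric: a key pair, slow, used here only to wrap the small symmetric key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = private_key.public_key().encrypt(sym_key, oaep)  # anyone with the public key can do this

# Receiver: unwrap the symmetric key with the private key, then decrypt the bulk data.
plaintext = Fernet(private_key.decrypt(wrapped_key, oaep)).decrypt(ciphertext)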

Over to you: What’s the most common misunderstanding you’ve seen about encryption in system design?


Network Troubleshooting Test Flow

Most network issues look complicated, but the troubleshooting process doesn’t have to be.

A reliable way to diagnose problems is to test the network layer by layer, starting from your own machine and moving outward until you find exactly where things break.

That’s exactly why we put together this flow: a structured, end-to-end checklist that mirrors how packets actually move through a system.


We’re Hiring

I am hiring for 2 roles: Technical Deep Dive Writer (System Design or AI Systems), and Lead Instructor (Building the World’s Most Useful AI Cohort). The full job descriptions are in the dedicated “We are hiring at ByteByteGo” post later in this issue.

We are hiring at ByteByteGo

2026-01-24 00:31:08

I am hiring for 2 roles: Technical Deep Dive Writer (System Design or AI Systems), and Lead Instructor (Building the World’s Most Useful AI Cohort). Job descriptions below.

1. Technical Deep Dive Writer

ByteByteGo started with a simple idea: explain system design clearly. Over time, it has grown into one of the largest technical education platforms for engineers, reaching millions of engineers every month. We believe it can become much bigger and more impactful.

This role is for someone exceptional who wants to help build that future by producing the highest-quality technical content on the internet.

You will work very closely with me to produce deep, accurate, and well structured technical content. The goal is not volume. The goal is to set the quality bar for how system design and modern AI systems are explained at scale.

The role is to turn technical knowledge into world class technical writing.

What you will do:

  • Turn complex systems into explanations that are precise, readable, and memorable.

  • Create clear technical diagrams that accurately represent system architecture and tradeoffs.

  • Collaborate directly with tech companies like Amazon, Shopify, Cursor, Yelp, etc.

  • Continuously raise the bar for clarity, correctness, and depth.

Who we are looking for

  • 5+ years of experience building large scale systems

  • Ability to explain complex ideas without oversimplifying

  • Strong ownership mindset and pride in craft

Role Type: Part time remote (10-20 hours per week), with the possibility of converting to full time

Compensation: Competitive

This is not just a writing role. It is a chance to help build the most trusted technical education brand in the industry.

How to apply: If you are interested, please send your resume and a previous writing sample to [email protected]


2. Lead Instructor, Building the World’s Most Useful AI Cohort

Cohort Course Name: Building Production AI Systems

This cohort is focused on one of the hardest problems in modern engineering: taking AI systems from impressive demos to reliable, secure, production-ready systems used by real users.

Our learners already understand generative AI concepts. What they want and need is engineering rigor. How to evaluate models properly. How to ship safely. How to scale without blowing up cost or reliability. How to operate AI systems in the real world.

This role is for someone exceptional who has done this work in production and wants to help shape the future of AI engineering education.

Role Type: Part time remote (20+ hours per week), with the possibility of converting to full time

Compensation: Very Competitive

Responsibilities

  • Develop and maintain a curriculum centered on AI production

  • Design labs/assignments. Keep them runnable with modern tooling

  • Teach live sessions (lecture + hands-on lab)

  • Run weekly office hours

  • Provide clear feedback on assignments and async questions

Required expertise

Production AI Engineering

  • You have shipped and maintained AI features used by thousands of users. You have “war stories” regarding outages, cost spikes, or quality regressions.

  • Deep understanding of the FastAPI + Pydantic + Celery/Redis stack for handling asynchronous AI tasks.

  • Can articulate the tradeoffs between latency and cost, and between reliability and velocity.

AI Evals

  • Experience evaluating AI systems such as LLMs, RAG pipelines, agents, or image/video generation models

  • Ability to explain standard metrics and design eval datasets (golden, adversarial, regression)

  • Ability to implement scoring (rules, rubrics, LLM-as-judge with guardrails)

  • Familiarity with industry eval patterns and frameworks (e.g., OpenAI Evals)

AI security and guardrails

  • Core risks: prompt injection, insecure output handling, model DoS, and supply chain risks

  • Threat modeling: Experience mapping threats to taxonomies like MITRE ATLAS

  • Guardrails: Implementing input sanitization and output validation to prevent prompt injection and model DoS

Deployment and optimization

  • Infrastructure: Comfort with Docker, Kubernetes, and hybrid routing (using a mix of self-hosted and managed APIs like Azure OpenAI or Bedrock).

  • Optimization: Hands-on experience with Quantization (FP8/INT8), Prompt Caching, Pruning, distillation, distributed inference, efficient attention variants, and batching strategies to maximize throughput.

  • Serving Engines like vLLM or NVIDIA TensorRT-LLM.

  • Comfortable with production deployment patterns (containerization, staged rollouts, canaries).

  • Backend: Python (Expert), FastAPI, Pydantic v2, and asynchronous programming.

Monitoring and observability

  • Ability to teach tracing and quality monitoring

Desirable Technical Stack Knowledge

  • Frameworks: LangGraph, CrewAI, or Haystack.

  • Databases: Vector DBs (Pinecone, Weaviate, Qdrant) and their indexing strategies (HNSW, IVFFlat).

  • Observability: LangSmith, Honeycomb, or Arize Phoenix.


This is not just a teaching role. It is a chance to help scale the most popular AI cohort and define how production grade AI engineering is taught.

Let’s build the most popular AI cohort together.

How to Apply

Send your resume and a short note on why you’re excited about this role to [email protected]

The Must-Know Fundamentals of Distributed Systems

2026-01-23 00:31:09

Every Google search, Netflix stream, and bank transfer relies on distributed systems where multiple computers work together to accomplish tasks impossible for a single machine. Understanding how these systems handle communication, failures, and coordination is becoming essential for modern software developers.

The fundamental challenge that makes distributed systems different is partial failure. In single-computer programs, everything typically crashes together. In distributed systems, some components can fail while others continue operating. For example, a database might crash while web servers keep running, or network connections might fail while both services remain healthy.

This creates ambiguity. When we send a request and receive no response, we cannot determine what happened.

  • Did the request never arrive?

  • Did the server process it, but crash before responding?

  • Did the response get lost?

Every concept in distributed systems addresses some aspect of this challenge.
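
The ambiguity is easy to reproduce in code. In the hedged sketch below (the endpoint and header names are hypothetical), a timeout tells the client nothing about whether the transfer happened, so the safe pattern is to retry with the same idempotency key rather than assume failure.

import uuid
import requests

def transfer(amount_cents: int, timeout_s: float = 2.0):
    # The same key on every retry lets the server deduplicate, so retrying after an
    # ambiguous timeout cannot execute the transfer twice.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(3):
        try:
            resp = requests.post(
                "https://bank.example.com/transfers",          # hypothetical endpoint
                json={"amount_cents": amount_cents},
                headers={"Idempotency-Key": idempotency_key},  # hypothetical header
                timeout=timeout_s,
            )
            return resp.json()
        except requests.Timeout:
            # We cannot tell which of the three cases happened; retrying with the
            # same idempotency key is safe, while assuming failure is not.
            continue
    raise RuntimeError("transfer outcome unknown after retries")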

In this article, we will look at five foundational topics around distributed systems: how computers communicate across networks, the protocols enabling reliable communication, how remote procedure calls abstract complexity, strategies for handling failures, and why time synchronization presents unique challenges.

How Computers Communicate

Read more

How Netflix Built a Real-Time Distributed Graph for Internet Scale

2026-01-22 00:31:00

2026 AI predictions for builders (Sponsored)

The AI landscape is changing fast—and the way you build AI systems in 2026 will look very different.

Join us live on January 28 as we unpack the first take from Redis’ 2026 predictions report: why AI apps won’t succeed without a unified context engine.

You’ll learn:

  • One architectural standard for AI across teams

  • Lower operational overhead via shared context infrastructure

  • Predictable, production-grade performance

  • Clear observability and governance for agent data access

  • Faster time to market for new AI features

Save your spot →

Read the full 2026 predictions report →


Netflix is no longer just a streaming service. The company has expanded into live events, mobile gaming, and ad-supported subscription plans. This evolution created an unexpected technical challenge.

To understand the challenge, consider a typical member journey. Assume that a user watches Stranger Things on their smartphone, continues on their smart TV, and then launches the Stranger Things mobile game on a tablet. These activities happen at different times on different devices and involve different platform services. Yet they all belong to the same member experience.

Disclaimer: This post is based on publicly shared details from the Netflix Engineering Team. Please comment if you notice any inaccuracies.

Understanding these cross-domain journeys became critical for creating personalized experiences. However, Netflix’s architecture made this difficult.

Netflix uses a microservices architecture with hundreds of services developed by separate teams. Each service can be developed, deployed, and scaled independently, and teams can choose the best data storage technology for their needs. However, when each service manages its own data, information can become siloed. Video streaming data lives in one database, gaming data in another, and authentication data separately. Traditional data warehouses collect this information, but the data lands in different tables and processes at different times.

Manually stitching together information from dozens of siloed databases became overwhelming. Therefore, the Netflix engineering team needed a different approach to process and store interconnected data while enabling fast queries. They chose a graph representation for the same due to the following reasons:

  • First, graphs enable fast relationship traversals without expensive database joins.

  • Second, graphs adapt easily when new connections emerge without significant schema changes.

  • Third, graphs naturally support pattern detection. Identifying hidden relationships and cycles is more efficient using graph traversals than siloed lookups.

This led Netflix to build the Real-Time Distributed Graph, or RDG. In this article, we will look at the architecture of RDG and the challenges Netflix faced while developing it.

Building the Data Pipeline

The RDG consists of three layers: ingestion and processing, storage, and serving. See the diagram below:

When a member performs any action in the Netflix app, such as logging in or starting to watch a show, the API Gateway writes these events as records to Apache Kafka topics.

Apache Kafka serves as the ingestion backbone, providing durable, replayable streams that downstream processors can consume in real time. Netflix chose Kafka because it offers exactly what they needed for building and updating the graph with minimal delay. Traditional batch processing systems and data warehouses could not provide the low latency required to maintain an up-to-date graph supporting real-time applications.

The scale of data flowing through these Kafka topics is significant. For reference, Netflix’s applications consume several different Kafka topics, each generating up to one million messages per second. Records use Apache Avro format for encoding, with schemas persisted in a centralized registry. To balance data availability against storage infrastructure costs, Netflix tailors retention policies for each topic based on throughput and record size. They also persist topic records to Apache Iceberg data warehouse tables, enabling backfills when older data expires from Kafka topics.

Apache Flink jobs ingest event records from the Kafka streams. Netflix chose Flink because of its strong capabilities around near-real-time event processing. There is also robust internal platform support for Flink within Netflix, which allows jobs to integrate with Kafka and various storage backends seamlessly.

A typical Flink job in the RDG pipeline follows a series of processing steps:

  • First, the job consumes event records from upstream Kafka topics.

  • Next, it applies filtering and projections to remove noise based on which fields are present or absent in the events.

  • Then it enriches events with additional metadata stored and accessed via side inputs.

The job then transforms events into graph primitives, creating nodes that represent entities like member accounts and show titles, plus edges that represent relationships or interactions between them.

After transformation, the job buffers, detects, and deduplicates overlapping updates to the same nodes and edges within a small configurable time window. This step reduces the data throughput published downstream and is implemented using stateful process functions and timers. Finally, the job publishes node and edge records to Data Mesh, an abstraction layer connecting data applications and storage systems.

See the diagram below:

For reference, Netflix writes more than five million total records per second to Data Mesh, which handles persisting the records to data stores that other internal services can query.
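
Here is a hedged sketch of the transformation and deduplication steps in plain Python (not actual Flink code): turn a playback event into node and edge records, then collapse overlapping updates to the same key within a buffer window. The field names, labels, and window behavior are illustrative.

def to_graph_primitives(event: dict) -> list[dict]:
    """Map one playback event to an account node, a title node, and an edge between them."""
    return [
        {"type": "node", "label": "account", "id": event["account_id"]},
        {"type": "node", "label": "title", "id": event["title_id"]},
        {"type": "edge", "label": "started_watching",
         "from": event["account_id"], "to": event["title_id"],
         "properties": {"last_watched_ts": event["ts"]}},
    ]

def dedup_window(records: list[dict]) -> list[dict]:
    """Collapse overlapping updates to the same node or edge, keeping the latest record per key."""
    latest = {}
    for rec in records:
        key = (rec["type"], rec["label"], rec.get("id") or (rec["from"], rec["to"]))
        latest[key] = rec  # later records overwrite earlier ones for the same key
    return list(latest.values())  # this reduced set is what gets published downstream

events = [
    {"account_id": "acct:1", "title_id": "title:stranger-things", "ts": 1000},
    {"account_id": "acct:1", "title_id": "title:stranger-things", "ts": 1500},  # overlapping update
]
records = [rec for e in events for rec in to_graph_primitives(e)]
print(len(dedup_window(records)))  # 3: one account node, one title node, one edge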

Learning Through Failure

Netflix initially tried one Flink job consuming all Kafka topics. Different topics have vastly different volumes and throughput patterns, making tuning impossible. They pivoted to one-to-one mapping from topic to job. While this added operational overhead, each job became simpler to maintain and tune.

Similarly, each node and edge type gets its own Kafka topic. Though this means more topics to manage, Netflix valued the ability to tune and scale each independently. They designed the graph data model to be flexible, making new node and edge types infrequent additions.

The Storage Challenge

After creating billions of nodes and edges from member interactions, Netflix faced the critical question of how to actually store them.

The RDG uses a property graph model. Nodes represent entities like member accounts, titles, devices, and games. Each node has a unique identifier and properties containing additional metadata. Edges represent relationships between nodes, such as started watching, logged in from, or plays. Edges also have unique identifiers and properties like timestamps.

See the diagram below:

When a member watches a particular show, the system might create an account node with properties like creation date and plan type, a title node with properties like title name and runtime, and a started watching edge connecting them with properties like the last watch timestamp.

This simple abstraction allows Netflix to represent incredibly complex member journeys across its ecosystem.

Why Traditional Graph Databases Failed

The Netflix engineering team evaluated traditional graph databases like Neo4j and AWS Neptune. While these systems provide feature-rich capabilities around native graph query support, they posed a mix of scalability, workload, and ecosystem challenges that made them unsuitable for Netflix’s needs.

  • Native graph databases struggle to scale horizontally for large, real-time datasets. Their performance typically degrades with increased node and edge count or query depth.

  • In early evaluations, Neo4j performed well for millions of records but became inefficient for hundreds of millions due to high memory requirements and limited distributed capabilities.

  • AWS Neptune has similar limitations due to its single-writer, multiple-reader architecture, which presents bottlenecks when ingesting large data volumes in real time, especially across multiple regions.

These systems are also not inherently designed for the continuous, high-throughput event streaming workloads critical to Netflix operations. They frequently struggle with query patterns involving full dataset scans, property-based filtering, and indexing.

Most importantly for Netflix, the company has extensive internal platform support for relational and document databases compared to graph databases. Non-graph databases are also easier for them to operate. Netflix found it simpler to emulate graph-like relationships in existing data storage systems rather than adopting specialized graph infrastructure.

The KVDAL Solution

The Netflix engineering team turned to KVDAL, the Key-Value Data Abstraction Layer from their internal Data Gateway Platform. Built on Apache Cassandra, KVDAL provides high availability, tunable consistency, and low latency without direct management of underlying storage.

See the diagram below:

KVDAL uses a two-level map architecture. Data is organized into records uniquely identified by a record ID. Each record contains sorted items, where an item is a key-value pair. To query KVDAL, you look up a record by ID, then optionally filter items by their keys. This gives both efficient point lookups and flexible retrieval of related data.

For nodes, the unique identifier becomes the record ID, with all properties stored as a single item. For edges, Netflix uses adjacency lists. The record ID represents the origin node, while items represent all destination nodes it connects to. If an account has watched multiple titles, the adjacency list contains one item per title with properties like timestamps.

This format is vital for graph traversals. To find all titles a member watched, Netflix retrieves the entire record with one KVDAL lookup. They can also filter by specific titles using key filtering without fetching unnecessary data.
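
A minimal in-memory model of the two-level map (record ID to sorted items) makes the layout easier to picture: nodes are stored as a single item, and edges are stored as adjacency lists keyed by the origin. The key shapes below are illustrative; KVDAL itself sits on Cassandra behind its own API.

# store[record_id] -> {item_key: item_value}, i.e. a two-level map.
store: dict[str, dict[str, dict]] = {}

def put_node(node_id: str, properties: dict):
    store.setdefault(f"node:{node_id}", {})["props"] = properties

def put_edge(origin: str, label: str, destination: str, properties: dict):
    # Adjacency list: the record is the origin node, items are its destinations.
    record_id = f"edges:{label}:{origin}"
    store.setdefault(record_id, {})[destination] = properties  # re-ingestion overwrites properties

def neighbors(origin: str, label: str, key_filter=None) -> dict:
    """One record lookup returns every destination; an optional key filter narrows it."""
    items = store.get(f"edges:{label}:{origin}", {})
    if key_filter is None:
        return items
    return {k: v for k, v in items.items() if k in key_filter}

put_node("acct:1", {"plan": "premium"})
put_edge("acct:1", "started_watching", "title:stranger-things", {"last_watched_ts": 1500})
put_edge("acct:1", "started_watching", "title:wednesday", {"last_watched_ts": 2000})

print(neighbors("acct:1", "started_watching"))                                  # all titles, one lookup
print(neighbors("acct:1", "started_watching", key_filter={"title:wednesday"}))  # key-filtered read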

Managing Data Lifecycle

As Netflix ingests real-time streams, KVDAL creates new records for new nodes or edges. When an edge exists with an existing origin but a new destination, it creates a new item in the existing record. When ingesting the same node or edge multiple times, KVDAL overwrites existing values, keeping properties like timestamps current. KVDAL can also automatically expire data on a per-namespace, per-record, or per-item basis, providing fine-grained control while limiting graph growth.

Namespaces Enable Massive Scale

Namespaces are the most powerful KVDAL feature Netflix leveraged. A namespace is like a database table, a logical grouping of records that defines physical storage while abstracting underlying system details.

You can start with all namespaces backed by one Cassandra cluster. If one namespace needs more storage or traffic capacity, you can move it to its own cluster for independent management. Different namespaces can use entirely different storage technologies. Low-latency data might use Cassandra with EVCache caching. High-throughput data might use dedicated clusters per namespace.

KVDAL can scale to trillions of records per namespace with single-digit millisecond latencies. Netflix provisions a separate namespace for every node type and edge type. While seemingly excessive, this enables independent scaling and tuning, flexible storage backends per namespace, and operational isolation where issues in one namespace do not impact others.

Conclusion

The numbers demonstrate real-world performance. Netflix’s graph has over eight billion nodes and more than 150 billion edges. The system sustains approximately two million reads per second and six million writes per second. This runs on a KVDAL cluster with roughly 27 namespaces, backed by around 12 Cassandra clusters across 2,400 EC2 instances.

These numbers are not limits. Every component scales linearly. As the graph grows, Netflix can add more namespaces, clusters, and instances.

Netflix’s RDG architecture offers important lessons.

  • Sometimes the right solution is not the obvious one. Netflix could have used purpose-built graph databases, but chose to emulate graph capabilities using key-value storage based on operational realities like internal expertise and existing platform support.

  • Scaling strategies evolve through experimentation. Netflix’s monolithic Flink job failed. Only through experience did they discover that one-to-one topic-to-job mapping worked better despite added complexity.

  • Isolation and independence matter at scale. Separating each node and edge type into its own namespace enabled independent tuning and reduced issue blast radius.

  • Building on proven infrastructure pays dividends. Rather than adopting new systems, Netflix leveraged battle-tested technologies like Kafka, Flink, and Cassandra, building abstractions to meet their needs while benefiting from maturity and operational expertise.

The RDG enables Netflix to analyze member interactions across its expanding ecosystem. As the business evolves with new offerings, this flexible architecture can adapt without requiring significant re-architecture.



SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].