2026-02-18 00:31:17
Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant code review comments and highlighting the potential impact of every pull request.
Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.
CodeRabbit reviews 1 million PRs every week across 3 million repositories and is used by 100 thousand Open-source projects.
CodeRabbit is free for all open-source repos.
Cloudflare has reduced cold start delays in its Workers platform by 10 times through a technique called worker sharding.
A cold start occurs when serverless code must initialize completely before handling a request. For Cloudflare Workers, this initialization involves four distinct phases:
Fetching the JavaScript source code from storage
Compiling that code into executable machine instructions
Executing any top-level initialization code
Finally, invoking the code to handle the incoming request
See the diagram below:
The improvement around cold starts means that 99.99% of requests now hit already-running code instances instead of waiting for code to start up.
The overall solution works by routing all requests for a specific application to the same server using a consistent hash ring, reducing the number of times code needs to be initialized from scratch.
In this article, we will look at how Cloudflare built this system and the challenges it faced.
Disclaimer: This post is based on publicly shared details from the Cloudflare Engineering Team. Please comment if you notice any inaccuracies.
In 2020, Cloudflare introduced a solution that masked cold starts by pre-warming Workers during TLS handshakes.
TLS is the security protocol that encrypts web traffic and makes HTTPS possible. Before any actual data flows between a browser and server, they perform a handshake to establish encryption. This handshake requires multiple round-trip messages across the network, which takes time.
The original technique worked because Cloudflare could identify which Worker to start from the Server Name Indication (SNI) field in the very first TLS message. While the rest of the handshake continued, they would initialize the Worker in the background. If the Worker finished starting up before the handshake completed, the user experienced zero visible delay.
See the diagram below:
This technique succeeded initially because cold starts took only 5 milliseconds while TLS 1.2 handshakes required three network round-trips. The handshake provided enough time to hide the cold start entirely.
The effectiveness of the TLS handshake technique depended on a specific timing relationship in which cold starts had to complete faster than TLS handshakes. Over the past five years, this relationship broke down for two reasons.
First, cold starts became longer. Cloudflare increased Worker script size limits from 1 megabyte to 10 megabytes for paying customers and to 3 megabytes for free users. They also increased the startup CPU time limit from 200 milliseconds to 400 milliseconds. These changes allowed developers to deploy much more complex applications on the Workers platform. Larger scripts require more time to transfer from storage and more time to compile. Longer CPU time limits mean initialization code can run for longer periods. Together, these changes pushed cold start times well beyond their original 5-millisecond duration.
Second, TLS handshakes became faster. TLS 1.3 reduced the handshake from three round-trips to just one round-trip. This improvement in security protocols meant less time to hide cold start operations in the background.
The combination of longer cold starts and shorter TLS handshakes meant that users increasingly experienced visible delays. The original solution no longer eliminated the problem.
Cloudflare realized that further optimizing cold start duration directly would be ineffective. Instead, they needed to reduce the absolute number of cold starts happening across their network.
The key insight involved understanding how requests were distributed across servers. Consider a Cloudflare data center with 300 servers. When a low-traffic application receives one request per minute, load balancing distributes these requests evenly across all servers. Each server receives one request approximately every five hours.
This distribution creates a problem. In busy data centers, five hours between requests is long enough that the Worker must be shut down to free memory for other applications. When the next request arrives at that server, it triggers a cold start. The result is a 100% cold start rate for low-traffic applications.
The solution involves routing all requests for a specific Worker to the same server within a data center. If all requests go to one server, that server receives one request per minute rather than one request every five hours. The Worker stays active in memory, and subsequent requests find it already running.
This approach provides multiple benefits. The application experiences mostly warm requests with only one initial cold start. Memory usage drops by over 99% because 299 servers no longer need to maintain copies of the Worker. This freed memory allows other Workers to stay active longer, creating improved performance across the entire system.
Cloudflare borrowed a technique from its HTTP caching system to implement worker sharding. The core data structure is called a consistent hash ring.
A naive approach to assigning Workers to servers would use a standard hash table. In this approach, each Worker identifier maps directly to a specific server address. This works fine until servers crash, get rebooted, or are added to the data center. When the number of servers changes, the entire hash table must be recalculated. Every Worker would get reassigned to a different server, causing universal cold starts.
A consistent hash ring solves this problem. Instead of directly mapping Workers to servers, both are mapped to positions on a number line that wraps around from end to beginning. Think of a clock face where positions range from 0 to 359 degrees.
The assignment process works as follows:
Hash each server address to a position on the ring
Hash each Worker identifier to a position on the ring
Assign each Worker to the first server encountered, moving clockwise from the Worker’s position
When a server disappears from the ring, only the Workers positioned immediately before it need reassignment. All other Workers remain with their current servers.
Similarly, when a new server joins, only Workers in a specific range move to the new server.
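To make the idea concrete, here is a minimal Python sketch of a consistent hash ring and the shard lookup it enables. The server names, Worker ID, and hash function are illustrative; Cloudflare's production implementation differs in its details.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2**32

def ring_position(key: str) -> int:
    # Hash a key (server address or Worker ID) to a position on the ring.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % RING_SIZE

class ConsistentHashRing:
    def __init__(self, servers):
        # The ring is just a sorted list of (position, server) pairs.
        self.ring = sorted((ring_position(s), s) for s in servers)

    def home_server(self, worker_id: str) -> str:
        # Move clockwise from the Worker's position to the first server.
        pos = ring_position(worker_id)
        positions = [p for p, _ in self.ring]
        idx = bisect_left(positions, pos) % len(self.ring)
        return self.ring[idx][1]

# A shard client checks whether it is the Worker's home server.
ring = ConsistentHashRing([f"server-{i}" for i in range(300)])
home = ring.home_server("worker-abc123")
this_server = "server-42"
if home == this_server:
    print("run the Worker locally")
else:
    print(f"forward the request to {home}")
```

Removing one server from the list only reassigns the Workers that previously mapped to it; every other Worker keeps its home server, which is exactly the stability property described above.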
This stability is crucial for maintaining warm Workers. If the system constantly reshuffled Worker assignments, the benefits of routing requests to the same server would disappear.
The sharding system introduces two server roles in request handling:
The shard client is the server that initially receives a request from the internet.
The shard server is the home server for that specific Worker according to the consistent hash ring.
When a request arrives, the shard client looks up the Worker’s home server using the hash ring. If the shard client happens to be the home server, it executes the Worker locally. Otherwise, it forwards the request to the appropriate shard server over the internal data center network.
Forwarding requests between servers adds latency. Each forwarded request must travel across the data center network, adding approximately one millisecond to the response time. However, this overhead is much less than a typical cold start, which can take hundreds of milliseconds. Forwarding a request to a warm Worker is always faster than starting a cold Worker locally.
Worker sharding can concentrate traffic onto fewer servers, which creates a new problem. Individual Workers can receive enough traffic to overload their home server. The system must handle this situation gracefully without serving errors to users.
Cloudflare evaluated two approaches for load shedding:
The first approach has the shard client ask permission before sending each request. The shard server responds with either approval or refusal. If refused, the shard client handles the request locally by starting a cold Worker. This permission-based approach introduces an additional latency of one network round-trip on every sharded request. The shard client must wait for approval before sending the actual request data.
The second approach sends the request optimistically without waiting for permission. If the shard server becomes overloaded, it forwards the request back to the shard client. This avoids the round-trip latency penalty when the shard server can handle the request, which is the common case.
See the diagram below that shows the pessimistic approach:
Cloudflare chose the optimistic approach for two reasons.
First, refusals are rare in practice. When a shard client receives a refusal, it starts a local Worker instance and serves all future requests locally. After one refusal, that shard client stops sharding requests for that Worker until traffic patterns change.
Second, Cloudflare developed a technique to minimize the cost of forwarding refused requests back to the client.
See the diagram below:
The Workers runtime uses Cap’n Proto RPC for communication between server instances.
Cap’n Proto provides a distributed object model that simplifies complex scenarios. When assembling a sharded request, the shard client includes a special handle called a capability. This capability represents a lazy Worker instance that exists on the shard client but has not been initialized yet. The lazy Worker has the same interface as any other Worker, but only starts when first invoked.
If the shard server must refuse the request due to overload, it does not send a simple rejection message. Instead, it returns the shard client’s own lazy capability as the response.
The shard client’s application code receives a Worker capability from the shard server. It attempts to invoke this capability to handle the request. The RPC system recognizes that this capability actually points back to a local lazy Worker. Once it realizes the request would loop back to the shard client, it stops sending additional request bytes to the shard server and handles everything locally.
This mechanism prevents wasted bandwidth. Without it, the shard client might send a large request body to the shard server, only to have the entire body forwarded back again. Cap’n Proto’s distributed object model automatically optimizes this pattern by recognizing local capabilities and short-circuiting the communication path.
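The following Python sketch is only an analogy for that control flow, not Cap'n Proto code: a lazy Worker object that initializes on first use, a shard server that hands the caller back its own capability when overloaded, and a client that notices the capability is local and runs the Worker itself.

```python
class LazyWorker:
    """Stand-in for the shard client's lazy Worker capability.
    It only initializes (i.e., pays the cold start) when first invoked."""

    def __init__(self, script_name: str):
        self.script_name = script_name
        self._instance = None

    def invoke(self, request: str) -> str:
        if self._instance is None:
            self._instance = f"instance-of-{self.script_name}"  # cold start happens here
        return f"{self._instance} handled {request}"

def shard_server_handle(request, client_lazy_worker, local_worker, overloaded):
    # Overloaded shard server: return the caller's own capability
    # instead of a result, so no request body needs to bounce back.
    if overloaded:
        return client_lazy_worker
    return local_worker.invoke(request)

# Shard client side.
lazy = LazyWorker("my-worker.js")
response = shard_server_handle("GET /", lazy, local_worker=None, overloaded=True)
if response is lazy:
    # The RPC layer would short-circuit here; we simply run the Worker locally.
    print(lazy.invoke("GET /"))
else:
    print(response)
```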
Many Cloudflare products involve Workers invoking other Workers.
Service Bindings allow one Worker to call another directly. Workers KV, despite appearing as a storage service, actually involves cross-Worker invocations. However, the most complex scenario involves Workers for Platforms.
Workers for Platforms enables customers to build their own serverless platforms on Cloudflare infrastructure. A typical request flow involves three or four different Workers.
First, a dynamic dispatch Worker receives the request and selects which user Worker should handle it.
The user Worker processes the request, potentially invoking an outbound Worker to intercept network calls.
Finally, a tail Worker might collect logs and traces from the entire request flow.
These Workers can run on different servers across the data center. Supporting sharding for nested Worker invocations requires passing the execution context between servers.
The execution context includes information like permission overrides, resource limits, feature flags, and logging configurations. When Workers ran on a single server, managing this context was straightforward. With sharding, the context must travel between servers as Workers invoke each other.
Cloudflare serializes the context stack into a Cap’n Proto message and includes it in sharded requests. The shard server deserializes the context and continues execution with the correct configuration.
The tail Worker scenario demonstrates Cap’n Proto’s power. A tail Worker must receive traces from potentially many servers that participated in handling a request. Rather than having each server know where to send traces, the system includes a callback capability in the execution context. Each server simply invokes this callback with its trace data. The RPC system automatically routes these calls back to the dynamic dispatch Worker’s home server, where all traces are collected together.
After deploying worker sharding globally, Cloudflare measured several key metrics:
Only 4% of total enterprise requests are sharded to a different server. This low percentage reflects that 96% of requests go to high-traffic Workers that run multiple instances across many servers.
Despite sharding only 4% of requests, the global Worker eviction rate decreased by 10 times. Eviction rate measures how often Workers are shut down to free memory. Fewer evictions indicate that memory is being used more efficiently across the system.
The fact that sharding only 4% of requests yields a 10 times efficiency improvement stems from the power-law distribution of internet traffic. A small number of Workers receive the vast majority of requests, and these high-traffic Workers already maintained warm instances before sharding. A large number of Workers receive relatively few requests, and these low-traffic Workers, which previously suffered frequent cold starts, are exactly the ones sharding helps.
The warm request rate for enterprise traffic increased from 99.9% to 99.99%. This improvement represents going from three nines to four nines of reliability. Equivalently, the cold start rate decreased from 0.1% to 0.01% of all requests. This is a 10 times reduction in how often users experience cold start delays.
The warm request rate also became less volatile throughout each day. Previous patterns showed significant variation as traffic levels changed. Sharding smoothed these variations by ensuring low-traffic Workers maintained warm instances even during off-peak hours.
Cloudflare’s worker sharding system demonstrates how distributed systems techniques can solve performance problems that direct optimization cannot address. Rather than making cold starts faster, they made cold starts less frequent. Rather than using more computing resources, they used existing resources more efficiently.
References:
2026-02-17 00:30:36
🤖Most AI coding tools only see your source code. Seer, Sentry’s AI debugging agent, uses everything Sentry knows about how your code has behaved in production to debug locally, in your PR, and in production.
🛠️How it works:
Seer scans & analyzes issues using all Sentry’s available context.
In development, Seer debugs alongside you as you build
In review, Seer alerts you to bugs that are likely to break production, not nits
In production, Seer can find a bug’s root cause, suggest a fix, open a PR automatically, or send the fix to your preferred IDE.
OpenAI scaled PostgreSQL to handle millions of queries per second for 800 million ChatGPT users. They did it with just a single primary writer supported by read replicas.
At first glance, this should sound impossible. The common wisdom suggests that beyond a certain scale, you must shard the database or risk failure. The conventional playbook recommends embracing the complexity of splitting the data across multiple independent databases.
OpenAI’s engineering team chose a different path. They decided to see just how far they could push PostgreSQL.
Over the past year, their database load grew by more than 10X. They experienced the familiar pattern of database-related incidents: cache layer failures causing sudden read spikes, expensive queries consuming CPU, and write storms from new features. Yet through systematic optimization across every layer of their stack, they achieved five-nines availability with low double-digit millisecond latency. But the road wasn’t easy.
In this article, we will look at the challenges OpenAI faced while scaling Postgres and how the team handled the various scenarios.
Disclaimer: This post is based on publicly shared details from the OpenAI Engineering Team. Please comment if you notice any inaccuracies.
A single-primary architecture means one database instance handles all writes, while multiple read replicas handle read queries.
See the diagram below:
This design creates an inherent bottleneck because writes cannot be distributed. However, for read-heavy workloads like ChatGPT, where users primarily fetch data rather than modify it, this architecture can scale effectively if properly optimized.
OpenAI avoided sharding its PostgreSQL deployment for pragmatic reasons. Sharding would require modifying hundreds of application endpoints and could take months or years to complete. Since their workload is primarily read-heavy and current optimizations provide sufficient capacity, sharding remains a future consideration rather than an immediate necessity.
So how did OpenAI go about scaling the read replicas? There were three main pillars to their overall strategy:
The primary database represents the system’s most critical bottleneck. OpenAI implemented multiple strategies to reduce pressure on this single writer:
Offloading Read Traffic: OpenAI routes most read queries to replicas rather than the primary. However, some read queries must remain on the primary because they occur within write transactions. For these queries, the team ensures maximum efficiency to avoid slow operations that could cascade into broader system failures.
Migrating Write-Heavy Workloads: The team migrated workloads that could be horizontally partitioned to sharded systems like Azure Cosmos DB. These shardable workloads can be split across multiple databases without complex coordination. Workloads that are harder to shard continue to use PostgreSQL but are being gradually migrated.
Application-Level Write Optimization: OpenAI fixed application bugs that caused redundant database writes. They implemented lazy writes where appropriate to smooth traffic spikes rather than hitting the database with sudden bursts. When backfilling table fields, they enforce strict rate limits even though the process can take over a week. This patience prevents write spikes that could impact production stability.
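The read-offloading strategy can be pictured with a tiny routing helper. This is a simplified sketch with made-up connection strings, not OpenAI's code: writes (and reads that happen inside write transactions) go to the primary, and everything else goes to a replica.

```python
import random
import psycopg2

# Hypothetical DSNs, for illustration only.
PRIMARY_DSN = "postgresql://app@primary.internal/chat"
REPLICA_DSNS = [
    "postgresql://app@replica-1.internal/chat",
    "postgresql://app@replica-2.internal/chat",
]

def get_connection(for_write: bool):
    # Reads that run inside a write transaction must also use the primary.
    dsn = PRIMARY_DSN if for_write else random.choice(REPLICA_DSNS)
    return psycopg2.connect(dsn)

# A plain read goes to a replica, keeping pressure off the single writer.
with get_connection(for_write=False) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT title FROM conversations WHERE user_id = %s", (42,))
        rows = cur.fetchall()
```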
First, OpenAI identified several expensive queries that consumed disproportionate CPU resources. One particularly problematic query joined 12 tables, and spikes in this query’s volume caused multiple high-severity incidents.
The team learned to avoid complex multi-table joins in their OLTP system. When joins are necessary, OpenAI breaks down complex queries and moves join logic to the application layer, where it can be distributed across multiple application servers.
Object-Relational Mapping frameworks, commonly known as ORMs, generate SQL automatically from code objects. While convenient for developers, ORMs can produce inefficient queries. OpenAI carefully reviews all ORM-generated SQL to ensure it performs as expected. They also configure timeouts like idle_in_transaction_session_timeout to prevent long-running idle queries from blocking autovacuum (PostgreSQL’s cleanup process).
Second, Azure PostgreSQL instances have a maximum connection limit of 5,000. OpenAI previously experienced incidents where connection storms exhausted all available connections, bringing down the service.
Connection pooling solves this problem by reusing database connections rather than creating new ones for each request. Think of it as carpooling. Instead of everyone driving their own car to work, people share vehicles to reduce traffic congestion.
OpenAI deployed PgBouncer as a proxy layer between applications and databases. PgBouncer runs in statement or transaction pooling mode, efficiently reusing connections and reducing the number of active client connections. In benchmarks, average connection time dropped from 50 milliseconds to just 5 milliseconds.
Each read replica has its own Kubernetes deployment running multiple PgBouncer pods. Multiple deployments sit behind a single Kubernetes Service that load-balances traffic across pods. OpenAI co-locates the proxy, application clients, and database replicas in the same geographic region to minimize network latency and connection overhead.
See the diagram below:
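As a rough client-side analogy, here is how a small connection pool behaves in Python with psycopg2; PgBouncer plays the same role as a shared proxy in front of the database, only at much larger scale. The DSN and pool sizes are illustrative.

```python
from psycopg2 import pool

db_pool = pool.SimpleConnectionPool(
    minconn=2,
    maxconn=20,  # stay far below the instance's hard connection limit
    dsn="postgresql://app:secret@pgbouncer.internal:6432/chat",  # hypothetical DSN
)

def fetch_user(user_id: int):
    conn = db_pool.getconn()       # reuse an open connection if one is idle
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT name FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        db_pool.putconn(conn)      # return it to the pool instead of closing it
```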
Today’s AI agents are mostly chatbots and copilots - reactive tools waiting for human input. But agents are moving into the backend: running autonomously, replacing brittle rule engines with reasoning, creating capabilities you couldn’t build with deterministic pipelines.
This changes everything about your architecture. Agent reasoning takes seconds, not milliseconds. You need identity beyond API keys. You need to know why an agent made every decision. And you need to scale from one prototype to thousands.
AgentField is the open-source infrastructure layer for autonomous AI agents in production.
OpenAI identified a recurring pattern in their incidents. To reduce read pressure on PostgreSQL, OpenAI uses a caching layer to serve most read traffic.
However, when cache hit rates drop unexpectedly, the burst of cache misses can push massive request volumes directly to PostgreSQL. In other words, an upstream issue causes a sudden spike in database load. This could be widespread cache misses from a caching layer failure, expensive multi-way joins saturating the CPU, or a write storm from a new feature launch.
As resource utilization climbs, query latency rises, and requests begin timing out. Applications then retry failed requests, which further amplifies the load. This creates a feedback loop that can degrade the entire service.
To prevent this situation, the OpenAI engineering team implemented a cache locking and leasing mechanism. When multiple requests miss on the same cache key, only one request acquires a lock and fetches data from PostgreSQL to repopulate the cache. All other requests wait for the cache update rather than simultaneously hitting the database.
See the diagram below:
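Here is a minimal sketch of the lock-and-lease idea using Redis as the cache; the key names, TTLs, and wait loop are illustrative rather than OpenAI's actual implementation.

```python
import time
import redis

r = redis.Redis()

def get_profile(user_id: str, db_fetch):
    cache_key = f"profile:{user_id}"
    cached = r.get(cache_key)
    if cached is not None:
        return cached

    lock_key = f"lock:{cache_key}"
    # SET with nx=True is atomic, so exactly one cache miss wins the lease.
    if r.set(lock_key, "1", nx=True, ex=5):
        try:
            value = db_fetch(user_id)        # only this request hits PostgreSQL
            r.set(cache_key, value, ex=300)
            return value
        finally:
            r.delete(lock_key)

    # Everyone else waits briefly for the winner to repopulate the cache.
    for _ in range(40):
        time.sleep(0.05)
        cached = r.get(cache_key)
        if cached is not None:
            return cached
    return db_fetch(user_id)                 # fallback if the winner failed
```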
Taking further precautions, OpenAI implemented rate limiting across the application, connection pooler, proxy, and query layers. This prevents sudden traffic spikes from overwhelming database instances and triggering cascading failures. They also avoid overly short retry intervals, which can trigger retry storms where failed requests multiply exponentially.
The team enhanced their ORM layer to support rate limiting and can fully block specific query patterns when necessary. This targeted load shedding enables rapid recovery from sudden surges of expensive queries.
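A token bucket is one common way to express this kind of limit. The sketch below is a generic illustration, not OpenAI's ORM-level implementation: each expensive query pattern gets its own bucket, and queries that exceed the budget are shed instead of reaching the database.

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens based on elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per risky query pattern (names are illustrative).
limiters = {"conversations_12_table_join": TokenBucket(rate_per_sec=10, burst=20)}

def run_query(pattern: str, execute):
    if not limiters[pattern].allow():
        raise RuntimeError(f"query pattern {pattern} is being shed")
    return execute()
```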
Despite all this, OpenAI encountered situations where certain requests consumed disproportionate resources on PostgreSQL instances, creating a problem known as the noisy neighbor effect. For example, a new feature launch might introduce inefficient queries that heavily consume CPU, slowing down other critical features.
To mitigate this, OpenAI also isolates workloads onto dedicated instances. They split requests into low-priority and high-priority tiers and route them to separate database instances. This ensures that low-priority workload spikes cannot degrade high-priority request performance. The same strategy applies across different products and services.
PostgreSQL uses Multi-Version Concurrency Control for managing concurrent transactions. When a query updates a tuple (database row) or even a single field, PostgreSQL copies the entire row to create a new version. This design allows multiple transactions to access different versions simultaneously without blocking each other.
However, MVCC creates challenges for write-heavy workloads. It causes write amplification because updating one field requires writing an entire row. It also causes read amplification because queries must scan through multiple tuple versions, called dead tuples, to retrieve the latest version. This leads to table bloat, index bloat, increased index maintenance overhead, and complex autovacuum tuning requirements.
OpenAI’s primary strategy for addressing MVCC limitations involves migrating write-heavy workloads to alternative systems and optimizing applications to minimize unnecessary writes. They also restrict schema changes to lightweight operations that do not trigger full table rewrites.
Another constraint with Postgres is related to schema changes. Even small schema changes like altering a column type can trigger a full table rewrite in PostgreSQL. During a table rewrite, PostgreSQL creates a new copy of the entire table with the change applied. For large tables, this can take hours and block access.
To handle this, OpenAI enforces strict rules around schema changes:
Only lightweight schema changes are permitted, such as adding or removing certain columns that do not trigger table rewrites.
All schema changes have a 5-second timeout.
Creating and dropping indexes must be done concurrently to avoid blocking.
Schema changes are restricted to existing tables only.
New features requiring additional tables must use alternative sharded systems like Azure Cosmos DB.
When backfilling a table field, OpenAI applies strict rate limits even though the process can take over a week. This ensures stability and prevents production impact.
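A backfill throttled this way might look like the sketch below, with hypothetical table and column names: update a small batch, pause, and repeat until nothing is left, so the writes never arrive as a spike.

```python
import time
import psycopg2

conn = psycopg2.connect("postgresql://app@primary.internal/chat")  # hypothetical DSN
conn.autocommit = True
BATCH = 1000

while True:
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE messages
               SET language = 'en'
             WHERE id IN (
                   SELECT id FROM messages
                    WHERE language IS NULL
                    LIMIT %s)
            """,
            (BATCH,),
        )
        if cur.rowcount == 0:
            break          # backfill complete
    time.sleep(1.0)        # throttle: a week-long backfill beats a write spike
```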
With a single primary database, the failure of that instance affects the entire service. OpenAI addressed this critical risk through multiple strategies.
First, they offloaded most critical read-only requests from the primary to replicas. If the primary fails, read operations continue functioning. While write operations would still fail, the impact is significantly reduced.
Second, OpenAI runs the primary in High Availability mode with a hot standby. A hot standby is a continuously synchronized replica that remains ready to take over immediately. If the primary fails or requires maintenance, OpenAI can quickly promote the standby to minimize downtime. The Azure PostgreSQL team has done significant work ensuring these failovers remain safe and reliable even under high load.
For read replica failures, OpenAI deploys multiple replicas in each region with sufficient capacity headroom. A single replica failure does not lead to a regional outage because traffic automatically routes to other replicas.
The primary database streams Write Ahead Log data to every read replica. WAL contains a record of all database changes, which replicas replay to stay synchronized. As the number of replicas increases, the primary must ship WAL to more instances, increasing pressure on network bandwidth and CPU. This causes higher and more unstable replica lag.
OpenAI currently operates nearly 50 read replicas across multiple geographic regions. While this scales well with large instance types and high network bandwidth, the team cannot add replicas indefinitely without eventually overloading the primary.
To address this future constraint, OpenAI is collaborating with the Azure PostgreSQL team on cascading replication. In this architecture, intermediate replicas relay WAL to downstream replicas rather than the primary streaming to every replica directly. This tree structure allows scaling to potentially over 100 replicas without overwhelming the primary. However, it introduces additional operational complexity, particularly around failover management. The feature remains in testing until the team ensures it can fail over safely.
See the diagram below:
OpenAI’s optimization efforts have delivered impressive results.
The system handles millions of queries per second while maintaining replication lag near zero. The architecture delivers low double-digit millisecond p99 latency, meaning 99 percent of requests complete within a few tens of milliseconds. The system achieves five-nines availability, equivalent to 99.999 percent uptime.
Over the past 12 months, OpenAI experienced only one SEV-0 PostgreSQL incident. This occurred during the viral launch of ChatGPT ImageGen when write traffic suddenly surged by more than 10x as over 100 million new users signed up within a week.
Looking ahead, OpenAI continues migrating remaining write-heavy workloads to sharded systems. The team is working with Azure to enable cascading replication for safely scaling to significantly more read replicas. They will continue exploring additional approaches, including sharded PostgreSQL or alternative distributed systems as infrastructure demands grow.
OpenAI’s experience shows that PostgreSQL can reliably support much larger read-heavy workloads than conventional wisdom suggests. However, achieving this scale requires rigorous optimization, careful monitoring, and operational discipline. The team’s success came not from adopting the latest distributed database technology but from deeply understanding their workload characteristics and eliminating bottlenecks.
References:
2026-02-15 00:30:21
Richard Socher and Bryan McCann are among the most-cited AI researchers in the world. They just released 35 predictions for 2026. Three that stand out:
The LLM revolution has been “mined out” and capital floods back to fundamental research
“Reward engineering” becomes a job; prompts can’t handle what’s coming next
Traditional coding will be gone by December; AI writes the code and humans manage it
This week’s system design refresher:
MCP vs RAG vs AI Agents
How ChatGPT Routes Prompts and Handles Modes
Agent Skills, Clearly Explained
12 Architectural Concepts Developers Should Know
How to Deploy Services
Everyone is talking about MCP, RAG, and AI Agents. Most people are still mixing them up. They’re not competing ideas. They solve very different problems at different layers of the stack.
MCP (Model Context Protocol) is about how LLMs use tools. Think of it as a standard interface between an LLM and external systems. Databases, file systems, GitHub, Slack, internal APIs.
Instead of every app inventing its own glue code, MCP defines a consistent way for models to discover tools, invoke them, and get structured results back. MCP doesn’t decide what to do. It standardizes how tools are exposed.
RAG (Retrieval-Augmented Generation) is about what the model knows at runtime. The model stays frozen. No retraining. When a user asks a question, a retriever fetches relevant documents (PDFs, code, vector DBs), and those are injected into the prompt.
RAG is great for:
Internal knowledge bases
Fresh or private data
Reducing hallucinations
But RAG doesn’t take actions. It only improves answers.
AI Agents are about doing things. An agent observes, reasons, decides, acts, and repeats. It can call tools, write code, browse the internet, store memory, delegate tasks, and operate with different levels of autonomy.
GPT-5 is not one model.
It is a unified system with multiple models, safeguards, and a real-time router.
This post and diagram are based on our understanding of the GPT-5 system card.
When you send a query, the mode determines which model to use and how much work the system does.
Instant mode sends the query directly to a fast, non-reasoning model named GPT-5-main. It optimizes for latency and is used for simple or low-risk tasks like short explanations or rewrites.
Thinking mode uses a reasoning model named GPT-5-thinking that runs multiple internal steps before producing the final answer. This improves correctness on complex tasks like math or planning.
Auto mode adds a real-time router. A lightweight classifier looks at the query and decides whether to send it to GPT-5-main or, when deeper reasoning is needed, to GPT-5-thinking.
Pro mode does not use a different model. It uses GPT-5-thinking but samples multiple reasoning attempts and selects the best one using a reward model.
Across all modes, safeguards run in parallel at various stages. A fast topic classifier determines whether the topic is high-risk, followed by a reasoning monitor that applies stricter checks to ensure unsafe responses are blocked.
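The auto-mode routing decision can be caricatured in a few lines. This toy classifier uses keyword rules purely for illustration; the real router described in the system card is a trained model, and the model names below are used only as labels.

```python
def route(query: str) -> str:
    # Toy stand-in for a learned router: crude keyword heuristics.
    reasoning_hints = ("prove", "plan", "debug", "step by step", "optimize")
    needs_reasoning = any(hint in query.lower() for hint in reasoning_hints)
    return "gpt-5-thinking" if needs_reasoning else "gpt-5-main"

print(route("Rewrite this sentence more formally"))     # -> gpt-5-main
print(route("Plan a database migration step by step"))  # -> gpt-5-thinking
```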
Over to you: What's your favorite AI chatbot?
Unblocked is the only AI code review tool that has deep understanding of your codebase, past decisions, and internal knowledge, giving you high-value feedback shaped by how your system actually works instead of flooding your PRs with stylistic nitpicks.
Why do we need Agent Skills? Long prompts hurt agent performance. Instead of one massive prompt, agents keep a small catalog of skills, reusable playbooks with clear instructions, loaded only when needed.
Here is what the Agent Skills workflow looks like:
User Query: A user submits a request like “Analyze data & draft report”.
Build Prompt + Skills Index: The agent runtime combines the query with Skills metadata, a lightweight list of available skills and their short descriptions.
Reason & Select Skill: The LLM processes the prompt, thinks, and decides: "I want Skill X."
Load Skill into Context: The agent runtime receives the specific skill request from the LLM. Then, it loads SKILL.md and adds it into the LLM's active context.
Final Output: The LLM follows SKILL.md, runs scripts, and generates the final report.
By dynamically loading skills only when needed, Agent Skills keep context small and the LLM’s behavior consistent.
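A bare-bones version of this workflow might look like the sketch below. The directory layout, file names, and skill name are hypothetical; the point is that only the short index is always in the prompt, while the full SKILL.md is loaded on demand.

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # hypothetical layout: skills/<name>/SKILL.md + description.txt

def build_skills_index() -> str:
    # Lightweight catalog: one line per skill, always in the prompt.
    lines = []
    for skill in sorted(SKILLS_DIR.iterdir()):
        desc = (skill / "description.txt").read_text().strip()
        lines.append(f"- {skill.name}: {desc}")
    return "Available skills:\n" + "\n".join(lines)

def load_skill(name: str) -> str:
    # Only now does the full playbook enter the model's context.
    return (SKILLS_DIR / name / "SKILL.md").read_text()

prompt = "Analyze data & draft report\n\n" + build_skills_index()
# ...the LLM replies that it wants the "report-writer" skill...
prompt += "\n\n" + load_skill("report-writer")
```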
Over to you: What skills would you find most useful in agents?
Load Balancing: Distributes incoming traffic across multiple servers to ensure no single node is overwhelmed.
Caching: Stores frequently accessed data in memory to reduce latency.
Content Delivery Network (CDN): Stores static assets across geographically distributed edge servers so users download content from the nearest location.
Message Queue: Decouples components by letting producers enqueue messages that consumers process asynchronously.
Publish-Subscribe: Enables multiple consumers to receive messages from a topic.
API Gateway: Acts as a single entry point for client requests, handling routing, authentication, rate limiting, and protocol translation.
Circuit Breaker: Monitors downstream service calls and stops attempts when failures exceed a threshold.
Service Discovery: Automatically tracks available service instances so components can locate and communicate with each other dynamically.
Sharding: Splits large datasets across multiple nodes based on a specific shard key.
Rate Limiting: Controls the number of requests a client can make in a given time window to protect services from overload.
Consistent Hashing: Distributes data across nodes in a way that minimizes reorganization when nodes join or leave.
Auto Scaling: Automatically adds or removes compute resources based on defined metrics.
Over to you: Which architectural concept will you add to the list?
Deploying or upgrading services is risky. In this post, we explore risk mitigation strategies.
The diagram below illustrates the common ones.
Multi-Service Deployment
In this model, we deploy new changes to multiple services simultaneously. This approach is easy to implement, but since all the services are upgraded at the same time, it is hard to manage and test dependencies. It is also hard to roll back safely.
Blue-Green Deployment
With blue-green deployment, we have two identical environments: one is staging (blue) and the other is production (green). The staging environment is one version ahead of production. Once testing is done in the staging environment, user traffic is switched to it, and staging becomes production. This strategy makes rollback simple, but maintaining two identical production-quality environments can be expensive.
Canary Deployment
A canary deployment upgrades services gradually, each time for a subset of users. It is cheaper than blue-green deployment and allows easy rollback. However, since there is no staging environment, we have to test in production. This process is more complicated because we need to monitor the canary while gradually migrating more and more users away from the old version.
A/B Test
With A/B testing, different versions of services run in production simultaneously, each serving an “experiment” to a subset of users. A/B testing is a cheap way to try new features in production, but the deployment process must be carefully controlled so that features are not pushed to users by accident.
Over to you - Which deployment strategy have you used? Did you witness any deployment-related outages in production and why did they happen?
2026-02-13 00:30:55
Software architecture patterns are reusable solutions to common problems that occur when designing software systems. Think of them as blueprints that have been tested and proven effective by countless developers over many years.
When we build applications, we often face similar challenges, such as how to organize code, how to scale systems, or how to handle communication between different parts of an application. Architecture patterns provide us with established approaches to solve these challenges.
Learning about architecture patterns offers several key benefits.
First, it increases our productivity because we do not need to invent solutions from scratch for every project.
Second, it improves our code quality by following proven approaches that make systems more maintainable and easier to understand.
Third, it enhances communication within development teams by providing a common vocabulary to discuss design decisions.
In this article, we will explore the essential architecture patterns that every software engineer should understand. We will look at how each pattern works, when to use it, what performance characteristics it has, and see practical examples of each.
2026-02-12 00:30:38
Join us for Sonar Summit on March 3rd, a global virtual event, bringing together the brightest minds in software development.
In a world increasingly shaped by AI, it’s more crucial than ever to cut through the noise and amplify the ideas and practices that lead to truly good code. We created Sonar Summit to help you navigate the future with the clarity and knowledge you need to build better software, faster.
OpenAI recently launched ChatGPT Atlas, a web browser where the LLM acts as your co-pilot across the internet. You can ask questions about any page, have ChatGPT complete tasks for you, or let it browse in Agent mode while you work on something else.
Delivering this experience wasn’t trivial. ChatGPT Atlas needed to start instantly and stay responsive even with hundreds of tabs open. To make development faster and avoid reinventing the wheel, the team built on top of Chromium, the engine that powers many other modern browsers.
However, Atlas is not just another Chromium-based browser with a different skin. Most Chromium-based browsers embed the web engine directly into their application, which creates tight coupling between the UI and the rendering engine. This architecture works fine for traditional browsing, but it makes certain capabilities extremely difficult to achieve.
Therefore, OpenAI’s solution was to build OWL (OpenAI’s Web Layer), an architectural layer that runs Chromium as a separate process, thereby unlocking capabilities that would have been nearly impossible otherwise.
In this article, we learn how the OpenAI Engineering Team built OWL and the technical challenges they faced around rendering and inter-process communication.
Disclaimer: This post is based on publicly shared details from the OpenAI Engineering Team. Please comment if you notice any inaccuracies.
Chromium was the natural choice as the web engine for Atlas. Chromium provides a state-of-the-art rendering engine with strong security, proven performance, and complete web compatibility. It powers many modern browsers, including Chrome, Edge, and Brave. Furthermore, Chromium benefits from continuous improvements by a global developer community. For any team building a browser today, Chromium is the logical starting point.
However, using Chromium comes with significant challenges. The OpenAI Engineering Team had ambitious goals that were difficult to achieve with Chromium’s default architecture:
First, they wanted instant startup times. Users should see the browser interface immediately, not after waiting for everything to load.
Second, they needed rich animations and visual effects for features like Agent mode, which meant using modern native frameworks like SwiftUI and Metal rather than Chromium’s built-in UI system.
Third, Atlas needed to support hundreds of open tabs without degrading performance.
Chromium has strong opinions about how browsers should work. It controls the boot sequence, the threading model, and how tabs are managed.
While OpenAI could have made extensive modifications to Chromium itself, this approach had problems. Making substantial changes to Chromium’s core would mean maintaining a large set of custom patches. Every time a new Chromium version was released, merging those changes would become increasingly difficult and time-consuming.
There was also a cultural consideration. OpenAI has an engineering principle called “shipping on day one,” where every new engineer makes and merges a code change on their first afternoon. This practice keeps development velocity high and helps new team members feel immediately productive. However, Chromium takes hours to download and build from source. Making this requirement work with traditional Chromium integration seemed nearly impossible.
OpenAI needed a different approach to integrate Chromium that would enable rapid experimentation, faster feature delivery, and maintain their engineering culture.
With the largest catalog of AI apps and agents in the industry, Microsoft Marketplace is a single source of cloud and AI needs. As a software company, Marketplace is how you connect your solution to millions of global buyers 24/7, helping reach new customers and sell with the power of Microsoft.
Publish your solution to the Microsoft Marketplace and grow pipeline with trials and product-led sales. Plus, you can simplify sales operations by streamlining terms, payouts, and billing.
Expand your product reach with Microsoft Marketplace
The answer was OWL, a new architectural layer that fundamentally changes how Chromium integrates with the browser application.
The key tenet behind the architecture is that instead of embedding Chromium inside the Atlas application, OpenAI runs Chromium’s browser process outside the main Atlas application process.
In this architecture, Atlas is the OWL Client, and the Chromium browser process is the OWL Host. These two components communicate through IPC using Mojo, which is Chromium’s own message-passing system. OpenAI wrote custom Swift and TypeScript bindings for Mojo, allowing their Swift-based Atlas application to call Chromium functions directly.
See the diagram below:
The OWL client library exposes a clean Swift API that abstracts several key concepts:
Session: Configures and controls the Chromium host globally
Profile: Manages browser state for a specific user profile (bookmarks, history, etc.)
WebView: Controls individual web pages, handling navigation, zoom, and input
WebContentRenderer: Forwards input events into Chromium and receives feedback
LayerHost/Client: Exchanges compositing information between Atlas UI and Chromium
Additionally, OWL provides service endpoints for managing high-level features like bookmarks, downloads, extensions, and autofill.
One of the most complex aspects of OWL is rendering.
How do you display web content that Chromium generates in one process within Atlas windows that exist in another process?
OpenAI solved this using a technique called layer hosting. Here is how it works:
On the Chromium side, web content is rendered to a CALayer, which is a macOS graphics primitive. This layer has a unique context ID.
On the Atlas side, an NSView (a window component) embeds this layer using the private CALayerHost API. The context ID tells Atlas which layer to display.
See the diagram below:
The result is that pixels rendered by Chromium in the OWL process appear seamlessly in Atlas windows. The GPU compositor handles this efficiently because both processes can share graphics memory. Multiple tabs can share a single compositing container. When you switch tabs, Atlas simply swaps which WebView is connected to the visible container.
This technique also works for special UI elements like dropdown menus from select elements or color pickers. These render in separate pop-up widgets in Chromium, each with its own rendering surface, but they follow the same delegated rendering model.
OpenAI also uses this approach selectively to project elements of Chromium’s native UI into Atlas. This is useful for quickly bootstrapping features like permission prompts without building complete replacements in SwiftUI. The technique borrows from Chromium’s existing infrastructure for installable web applications on macOS.
User input requires careful handling across the process boundary. Normally, Chromium’s UI layer translates platform events like mouse clicks or key presses from macOS NSEvents into Blink’s WebInputEvent format before forwarding them to web page renderers.
In the OWL architecture, Chromium runs without visible windows, so it never receives these platform events directly. Instead, the Atlas client library performs the translation from NSEvents to WebInputEvents and forwards the already-translated events to Chromium over IPC.
See the diagram below:
From there, events follow the same lifecycle they would normally follow for web content. If a web page indicates it did not handle an event, Chromium returns it to the Atlas client. When this happens, Atlas resynthesizes an NSEvent and gives the rest of the application a chance to handle the input. This allows browser-level keyboard shortcuts and gestures to work correctly even though the web engine is in a separate process.
Atlas includes an agentic browsing feature where ChatGPT can control the browser to complete tasks. This capability poses unique challenges for rendering, input handling, and data storage.
The computer use model that powers Agent mode expects a single screenshot of the browser as input. However, some UI elements, like dropdown menus, render outside the main tab bounds in separate windows. To solve this, Atlas composites these pop-up windows back into the main page image at their correct coordinates in Agent mode. This ensures the AI model sees the complete context in a single frame.
For input events, OpenAI applies a strict security principle. Agent-generated events route directly to the web page renderer and never pass through the privileged browser layer. This preserves the security sandbox even under automated control. The system prevents AI-generated events from synthesizing keyboard shortcuts that would make the browser perform actions unrelated to the displayed web content.
Agent mode also supports ephemeral browsing sessions. Instead of using the user’s existing Incognito profile, which could leak state between sessions, OpenAI uses Chromium’s StoragePartition infrastructure to create isolated, in-memory data stores. Each agent session starts completely fresh. When the session ends, all cookies and site data are discarded. You can run multiple logged-out agent sessions simultaneously, each in its own browser tab, with complete isolation between them.
The OWL architecture delivers several critical benefits that enable OpenAI’s product goals.
Atlas achieves fast startup because Chromium boots asynchronously in the background while the Atlas UI appears nearly instantly. Users see pixels on screen within milliseconds, even though the web engine may still be initializing.
The application is simpler to develop because Atlas is built almost entirely in SwiftUI and AppKit. This creates a unified codebase with one primary language and technology stack, making it easier for developers to work across the entire application.
Process isolation means that if Chromium’s main thread hangs, Atlas remains responsive. If Chromium crashes, Atlas stays running and can recover. This separation protects the user experience from issues in the web engine.
OpenAI maintains a much smaller diff against upstream Chromium because they are not modifying Chromium’s UI layer extensively. This makes it easier to integrate new Chromium versions as they are released.
Most importantly for developer productivity, most engineers never need to build Chromium locally. OWL ships internally as a prebuilt binary, so Atlas builds completely in minutes rather than hours.
Every architectural decision involves trade-offs:
Running two separate processes uses more memory than a monolithic architecture.
The IPC layer adds complexity that must be maintained.
Cross-process rendering could potentially add latency, although OpenAI mitigates this through efficient use of CALayerHost and GPU memory sharing.
However, OpenAI determined that these trade-offs were worthwhile. The benefits of stability, developer productivity, and architectural flexibility outweigh the costs. The clean separation between Atlas and Chromium creates a foundation that will support future innovation, particularly for agentic use cases.
OWL is not just about building a better browser today.
It creates infrastructure for the future of AI-powered web experiences. The architecture makes it easy to run multiple isolated agent sessions, add new AI capabilities, and experiment with novel interactions between users, AI, and web content. The built-in sandboxing for agent actions provides security by design rather than as an afterthought.
Building ChatGPT Atlas required rethinking fundamental assumptions about browser architecture. By running Chromium outside the main application process and creating the OWL integration layer, the OpenAI Engineering Team solved multiple challenges simultaneously. They achieved instant startup, maintained developer productivity, enabled rich UI capabilities, and built a strong foundation for agentic browsing.
References:
2026-02-11 00:30:21
Monster SCALE Summit is a virtual conference all about extreme-scale engineering and data-intensive applications. Engineers from Discord, Disney, LinkedIn, Uber, Pinterest, Rivian, ClickHouse, Redis, MongoDB, ScyllaDB + more will be sharing 50+ talks on topics like:
Distributed databases
Streaming and real-time processing
Intriguing system designs
Approaches to a massive scaling challenge
Methods for balancing latency/concurrency/throughput
Infrastructure built for unprecedented demands.
Don’t miss this chance to connect with 20K of your peers designing, implementing, and optimizing data-intensive applications – for free, from anywhere.
LinkedIn serves hundreds of millions of members worldwide, delivering fast experiences whether someone is loading their feed or sending a message. Behind the scenes, this seamless experience depends on thousands of software services working together. Service Discovery is the infrastructure system that makes this coordination possible.
Consider a modern application at scale. Instead of building one massive program, LinkedIn breaks functionality into tens of thousands of microservices. Each microservice handles a specific task like authentication, messaging, or feed generation. These services need to communicate with each other constantly, and they need to know where to find each other.
Service discovery solves this location problem. Instead of hardcoding addresses that can change as servers restart or scale, services use a directory that tracks where every service currently lives. This directory maintains IP addresses and port numbers for all active service instances.
At LinkedIn’s scale, with tens of thousands of microservices running across global data centers and handling billions of requests each day, service discovery becomes exceptionally challenging. The system must update in real time as servers scale up or down, remain highly reliable, and respond within milliseconds.
In this article, we learn how LinkedIn built and rolled out Next-Gen Service Discovery, a scalable control plane supporting app containers in multiple programming languages.
Disclaimer: This post is based on publicly shared details from the LinkedIn Engineering Team. Please comment if you notice any inaccuracies.
For the past decade, LinkedIn used Apache Zookeeper as the control plane for service discovery. Zookeeper is a coordination service that maintains a centralized registry of services.
In this architecture, Zookeeper allows server applications to register their endpoint addresses in a custom format called D2, which stands for Dynamic Discovery. The system stored the configuration about how RPC traffic should flow as D2 configs and served them to application clients. The application servers and clients formed the data plane, handling actual inbound and outbound RPC traffic using LinkedIn’s Rest.li framework, a RESTful communication system.
Here is how the system worked:
The Zookeeper client library ran on all application servers and clients.
The Zookeeper ensemble took direct write requests from application servers to register their endpoint addresses as ephemeral nodes called D2 URIs.
Ephemeral nodes are temporary entries that exist only while the connection remains active.
Zookeeper performed health checks on these connections to keep the ephemeral nodes alive.
Zookeeper also took direct read requests from application clients to set watchers on the server clusters they needed to call. When updates happened, clients would read the changed ephemeral nodes.
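For a sense of what this registration flow looked like in code, here is a small sketch using the kazoo client library. The znode paths and payload are made up; LinkedIn's D2 format is a custom schema.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
zk.start()

# Server side: register this instance as an ephemeral node.
# The node disappears automatically if the session's health checks fail.
zk.create(
    "/d2/uris/profile-service/instance-",
    b'{"host": "10.0.0.12", "port": 8443}',
    ephemeral=True,
    sequence=True,
    makepath=True,
)

# Client side: watch the cluster so endpoint changes are picked up.
@zk.ChildrenWatch("/d2/uris/profile-service")
def on_endpoints_changed(children):
    print("current endpoints:", children)
```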
Despite its simplicity, this architecture had critical problems in three areas: scalability, compatibility, and extensibility. Benchmark tests projected that the system would reach capacity in early 2025.
LLMs are powerful—but without fresh, reliable information, they hallucinate, miss context, and go out of date fast. SerpApi gives your AI applications clean, structured web data from major search engines and marketplaces, so your agents can research, verify, and answer with confidence.
Access real-time data with a simple API.
The key problems with Zookeeper are as follows:
The control plane operated as a flat structure handling requests for hundreds of thousands of application instances.
During deployments of large applications with many calling clients, the D2 URI ephemeral nodes changed frequently. This led to read storms with huge fanout from all the clients trying to read updates simultaneously, causing high latencies for both reads and writes.
Zookeeper is a strong consistency system, meaning it enforces strict ordering over availability. All reads, writes, and session health checks go through the same request queue. When the queue had a large backlog of read requests, write requests could not be processed. Even worse, all sessions would be dropped due to health check timeouts because the queue was too backed up. This caused ephemeral nodes to be removed, resulting in capacity loss of application servers and site unavailability.
The session health checks performed on all registered application servers became unscalable with fleet growth. As of July 2022, LinkedIn had about 2.5 years of capacity left with a 50 to 100 percent yearly growth rate in cluster size and number of watchers, even after increasing the number of Zookeeper hosts to 80.
Since D2 entities used LinkedIn’s custom schemas, they were incompatible with modern data plane technologies like gRPC and Envoy.
The read and write logic in application containers was implemented primarily in Java, with a partial and outdated implementation for Python applications. When onboarding applications in other languages, the entire logic needed to be rewritten from scratch.
The lack of an intermediary layer between the service registry and application instances prevented the development of modern centralized RPC management techniques like centralized load balancing.
It also created challenges for integrating with new service registries to replace Zookeeper, such as Etcd with Kubernetes, or any new storage system that might have better functionality or performance.
The LinkedIn Engineering Team designed the new architecture to address all these limitations. Unlike Zookeeper handling read and write requests together, Next-Gen Service Discovery consists of two separate paths: Kafka for writes and Service Discovery Observer for reads.
Kafka takes in application server writes and periodic heartbeats through Kafka events called Service Discovery URIs. Kafka is a distributed streaming platform capable of handling millions of messages per second. Each Service Discovery URI contains information about a service instance, including service name, IP address, port number, health status, and metadata.
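The write path can be sketched as a periodic Kafka event per instance. The topic name and payload below loosely follow the article's description and are not LinkedIn's actual schema.

```python
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",          # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode(),
)

uri_event = {
    "service": "profile-service",
    "ip": "10.0.0.12",
    "port": 8443,
    "healthy": True,
    "metadata": {"fabric": "ltx1"},
}

while True:
    # Registration and heartbeat are the same event, sent periodically.
    producer.send("service-discovery-uris", uri_event)
    producer.flush()
    time.sleep(30)
```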
Service Discovery Observer consumes the URIs from Kafka and caches them in memory. Application clients open bidirectional gRPC streams to the Observer, sending subscription requests using the xDS protocol. The Observer keeps these streams open to push data and all subsequent updates to application clients instantly.
The xDS protocol is an industry standard created by the Envoy project for service discovery. Instead of clients polling for updates, the Observer pushes changes as they happen. This streaming approach is far more efficient than the old polling model.
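To make the subscription model concrete, here is a minimal Go sketch of an xDS subscription over a bidirectional gRPC stream, using the open-source Envoy go-control-plane protobuf bindings. The Observer address, node ID, resource name, and type URL are illustrative; LinkedIn's Observer serves its own D2 resource types rather than the standard Envoy EDS type shown here.

```go
package main

import (
	"context"
	"log"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Dial the Observer (address is illustrative).
	conn, err := grpc.Dial("observer.example.internal:15010",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Open a bidirectional Aggregated Discovery Service (ADS) stream.
	client := discoveryv3.NewAggregatedDiscoveryServiceClient(conn)
	stream, err := client.StreamAggregatedResources(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	// Subscribe to endpoint data for one service.
	if err := stream.Send(&discoveryv3.DiscoveryRequest{
		Node:          &corev3.Node{Id: "client-host-123"},
		TypeUrl:       "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
		ResourceNames: []string{"profile-service"},
	}); err != nil {
		log.Fatal(err)
	}

	// The server keeps the stream open and pushes every subsequent update.
	for {
		resp, err := stream.Recv()
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("received %d resources (version %s)", len(resp.Resources), resp.VersionInfo)

		// ACK the update so the server knows this version was applied.
		_ = stream.Send(&discoveryv3.DiscoveryRequest{
			Node:          &corev3.Node{Id: "client-host-123"},
			TypeUrl:       resp.TypeUrl,
			ResourceNames: []string{"profile-service"},
			VersionInfo:   resp.VersionInfo,
			ResponseNonce: resp.Nonce,
		})
	}
}
```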
D2 configs remain stored in Zookeeper. Application owners run CLI commands that go through the Config Service to update the D2 configs and convert them into xDS entities.
The Observer consumes these configs from Zookeeper and distributes them to clients in the same way as the URIs.
The Observer is horizontally scalable and written in Go, chosen for its high concurrency capabilities.
It can process large volumes of client requests, dispatch data updates, and consume URIs for the entire LinkedIn fleet efficiently. As of today, one Observer can maintain 40,000 client streams while sending 10,000 updates per second and consuming 11,000 Kafka events per second.
With projections of fleet size growing to 3 million instances in the coming years, LinkedIn will need approximately 100 Observers.
Here are some key improvements that the new architecture provided in comparison to Zookeeper:
LinkedIn prioritized availability over consistency because service discovery data only needs to eventually converge. Some short-term inconsistency across servers is acceptable, but the data must be highly available to the huge fleet of clients. This represents a fundamental shift from Zookeeper’s strong consistency model.
Multiple Observer replicas reach eventual consistency after a Kafka event is consumed and processed on all replicas. Even when Kafka experiences significant lag or goes down, Observer continues serving client requests with its cached data, preventing cascading failures.
LinkedIn can further improve scalability by splitting Observers into dedicated roles. Some Observers can focus on consuming Kafka events, while others serve client requests. The serving Observers would subscribe to the consuming Observers for cache updates.
Next-Gen Service Discovery supports the gRPC framework natively and enables multi-language support.
Since the control plane uses the xDS protocol, it works with open-source gRPC and the Envoy proxy. Applications not using Envoy can use open-source gRPC libraries to subscribe to the Observer directly, while applications that onboard the Envoy proxy get multi-language support automatically.
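For example, open-source gRPC ships an xDS client that can talk to any xDS-compliant control plane. A minimal Go sketch, assuming a bootstrap file (pointed to by GRPC_XDS_BOOTSTRAP) that directs the client at the Observer; the target name is illustrative:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver and balancer
)

func main() {
	// The xDS bootstrap file tells the gRPC runtime which control plane to
	// contact. Dialing an xds:/// target makes gRPC subscribe to that control
	// plane for the service's endpoints instead of resolving DNS.
	conn, err := grpc.Dial("xds:///profile-service",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// conn can now be used with any generated gRPC client stub; endpoint
	// updates pushed by the control plane are applied transparently.
}
```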
Adding Next-Gen Service Discovery as a central control plane between the service registry and clients enables LinkedIn to extend to modern service mesh features. These include centralized load balancing, security policies, and transforming endpoint addresses between IPv4 and IPv6.
LinkedIn can also integrate the system with Kubernetes to leverage application readiness probes. This would collect the status and metadata of application servers, shifting servers from actively announcing themselves to passively receiving status probes, which is more reliable and easier to manage.
Next-Gen Service Discovery Observers run independently in each fabric. A fabric is a data center or isolated cluster. Application clients can be configured to connect to the Observer in a remote fabric and be served with the server applications in that fabric. This supports custom application needs or provides failover when the Observer in one fabric goes down, ensuring business traffic remains unaffected.
See the diagram below:
Application servers can also write to the control plane in multiple fabrics. Cross-fabric announcements are appended with a fabric name suffix to differentiate from local announcements. Application clients can then send requests to application servers in both local and remote fabrics based on preference.
See the diagram below:
Rolling out Next-Gen Service Discovery to hundreds of thousands of hosts without impacting current requests required careful planning.
LinkedIn needed the service discovery data served by the new control plane to exactly match the data on Zookeeper. They needed to equip all application servers and clients companywide with related mechanisms through just an infrastructure library version bump. They needed central control on the infrastructure side to switch Next-Gen Service Discovery read and write on and off by application. Finally, they needed good central observability across thousands of applications on all fabrics for migration readiness, results verification, and troubleshooting.
The three major challenges were as follows:
First, service discovery is mission-critical, and any error could lead to severe site-wide incidents. Since Zookeeper was approaching capacity limits, LinkedIn needed to migrate as many applications off Zookeeper as quickly as possible.
Second, application states were complex and unpredictable. Next-Gen Service Discovery Read required client applications to establish gRPC streams. However, Rest.li applications that had existed at the company for over a decade were in very different states regarding dependencies, gRPC SSL, and network access. Compatibility with the control plane for many applications was unpredictable without actually enabling the read.
Third, read and write migrations were coupled. If the write was not migrated, no data could be read on Next-Gen Service Discovery. If the read was not migrated, data was still read on Zookeeper, blocking the write migration. Since read path connectivity was vulnerable to application-specific states, the read migration had to start first. Even after client applications migrated for reads, LinkedIn needed to determine which server applications became ready for Next-Gen Service Discovery Write and prevent clients from regressing to read Zookeeper again.
LinkedIn implemented a dual mode strategy where applications run both old and new systems simultaneously, verifying the new flow behind the scenes.
To decouple read and write migration, the new control plane served a combined dataset of Kafka and Zookeeper URIs, with Kafka as the primary source and Zookeeper as backup. When no Kafka data existed, the control plane served Zookeeper data, mirroring what clients read directly from Zookeeper. This enabled read migration to start independently.
In Dual Read mode, an application client read data from both Next-Gen Service Discovery and Zookeeper, keeping Zookeeper as the source of truth for serving traffic. On an independent background thread, the client attempted to resolve traffic as if it were served by Next-Gen Service Discovery data and reported any errors.
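LinkedIn's client libraries are Java and the exact dual-read hooks are not public; the Go sketch below only illustrates the idea, with a hypothetical Resolver interface standing in for the two read paths.

```go
package discovery

import (
	"log"
	"reflect"
	"sort"
)

// Resolver is a hypothetical interface over a service-discovery read path.
type Resolver interface {
	Resolve(service string) ([]string, error) // returns instance addresses
}

// DualReader serves traffic from Zookeeper (the source of truth) while
// verifying the Next-Gen Service Discovery path in the background.
type DualReader struct {
	Zookeeper Resolver
	NextGen   Resolver
}

// Resolve returns Zookeeper's answer immediately and compares it with the
// Next-Gen result on a separate goroutine, so verification adds no latency.
func (d *DualReader) Resolve(service string) ([]string, error) {
	primary, err := d.Zookeeper.Resolve(service)
	if err != nil {
		return nil, err
	}

	go func() {
		shadow, err := d.NextGen.Resolve(service)
		if err != nil {
			log.Printf("dual-read error for %s: %v", service, err)
			return
		}
		// Compare as sets: sort copies so ordering differences are ignored.
		a := append([]string(nil), primary...)
		b := append([]string(nil), shadow...)
		sort.Strings(a)
		sort.Strings(b)
		if !reflect.DeepEqual(a, b) {
			log.Printf("dual-read mismatch for %s: zk=%v nextgen=%v", service, a, b)
		}
	}()

	return primary, nil
}
```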
LinkedIn built comprehensive metrics to verify connectivity, performance, and data correctness on both the client side and Observer side. On the client side, connectivity and latency metrics watched for connection status and data latencies from when the subscription request was sent to when data was received. Dual Read metrics compared data received from Zookeeper and Next-Gen Service Discovery to identify mismatches. Service Discovery request resolution metrics showed request status, identical to Zookeeper-based metrics, but with a Next-Gen Service Discovery prefix to identify whether requests were resolved by Next-Gen Service Discovery data and catch potential errors like missing critical data.
On the Observer side, connection and stream metrics watched for client connection types, counts, and capacity. These helped identify issues like imbalanced connections and unexpected connection losses during restart. Request processing latency metrics measured time from when the Observer received a request to when the requested data was queued for sending. The actual time spent sending data over the network was excluded since problematic client hosts could get stuck receiving data and distort the metric. Additional metrics tracked Observer resource utilization, including CPU, memory, and network bandwidth.
See the diagram below:
With all these metrics and alerts, before applications actually used Next-Gen Service Discovery data, LinkedIn caught and resolved numerous issues, including connectivity problems, reconnection storms, incorrect subscription handling logic, and data inconsistencies, avoiding many companywide incidents. After all verifications passed, applications were ramped to perform Next-Gen Service Discovery read-only.
In Dual Write mode, application servers reported to both Zookeeper and Next-Gen Service Discovery.
On the Observer side, Zookeeper-related metrics monitored potential outages, connection losses, or high latencies by watching connection status, watch status, data received counts, and lags. Kafka metrics monitored potential outages and high latencies by watching partition lags and event counts.
LinkedIn calculated a URI Similarity Score for each application cluster by comparing data received from Kafka and Zookeeper. A 100 percent match could only be reached if all URIs in the application cluster were identical, guaranteeing that Kafka announcements matched existing Zookeeper announcements.
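The exact formula is not published. A plausible Go sketch uses intersection over union, which reaches 100 percent only when the two URI sets are identical, matching the property described above:

```go
package discovery

// uriSimilarityScore compares the URI sets an application cluster reports
// through Kafka and through Zookeeper. LinkedIn's actual formula is not
// public; this sketch uses Jaccard similarity (intersection over union),
// which reaches 100 percent only when both sets are identical.
func uriSimilarityScore(kafkaURIs, zookeeperURIs []string) float64 {
	kafkaSet := make(map[string]bool, len(kafkaURIs))
	for _, u := range kafkaURIs {
		kafkaSet[u] = true
	}
	zkSet := make(map[string]bool, len(zookeeperURIs))
	for _, u := range zookeeperURIs {
		zkSet[u] = true
	}

	intersection := 0
	for u := range kafkaSet {
		if zkSet[u] {
			intersection++
		}
	}
	union := len(kafkaSet) + len(zkSet) - intersection
	if union == 0 {
		return 100.0
	}
	return 100.0 * float64(intersection) / float64(union)
}
```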
Cache propagation latency was measured as the time from when data was received on the Observer to when the Observer cache was updated.
Resource propagation latency was measured as the time from when the application server made the announcement to when the Observer cache was updated, representing the full end-to-end write latency.
On the application server side, a metric tracked the server announcement mode to accurately determine whether the server was announcing to Zookeeper only, dual write, or only Next-Gen Service Discovery. This allowed LinkedIn to understand if all instances of a server application had fully adopted a new stage.
See the diagram below:
LinkedIn also monitored end-to-end propagation latency, measuring the time from when an application server made an announcement to when a client host received the update. They built a dashboard to measure this across all client-server pairs daily, monitoring for P50 less than 1 second and P99 less than 5 seconds. P50 means that 50 percent of clients received the propagated data within that time, and P99 means 99 percent received it within that time.
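The dashboard implementation is not described in detail; the small Go sketch below simply shows how those P50 and P99 values could be computed from a day's propagation-latency samples using the nearest-rank method (function names are hypothetical).

```go
package metrics

import (
	"math"
	"sort"
	"time"
)

// percentile returns the latency value such that p percent of the samples
// are at or below it (nearest-rank method).
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

// breachesTargets flags a client-server pair whose propagation latency
// misses the targets from the text: P50 under 1 second, P99 under 5 seconds.
func breachesTargets(samples []time.Duration) bool {
	return percentile(samples, 50) >= time.Second ||
		percentile(samples, 99) >= 5*time.Second
}
```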
The safest approach for write migration would be waiting until all client applications are migrated to Next-Gen Service Discovery Read and all Zookeeper-reading code is cleaned up before stopping Zookeeper announcements. However, with limited Zookeeper capacity and the urgency to avoid outages, LinkedIn needed to begin write migration in parallel with client application migration.
LinkedIn built cron jobs to analyze Zookeeper watchers set on the Zookeeper data of each application and list the corresponding reader applications. A watcher is a mechanism where clients register interest in data changes. When data changes, Zookeeper notifies all watchers. These jobs generated snapshots of watcher status at short intervals, catching even short-lived readers like offline jobs. The snapshots were aggregated into daily and weekly reports.
These reports identified applications with no readers on Zookeeper in the past two weeks, which LinkedIn set as the criteria for applications becoming ready to start Next-Gen Service Discovery Write. The reports also showed top blockers, meaning reader applications blocking the most server hosts from migrating, and top applications being blocked, identifying the largest applications unable to migrate, and which readers were blocking them.
This information helped LinkedIn prioritize focus on the biggest blockers for migration to Next-Gen Service Discovery Read. Additionally, the job could catch any new client that started reading server applications already migrated to Next-Gen Service Discovery Write and send alerts, allowing prompt coordination with the reader application owner for migration or troubleshooting.
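The tooling behind these watcher snapshots is not described. One way to take such a snapshot is ZooKeeper's built-in wchp four-letter command, which lists watches grouped by znode path; a minimal Go sketch follows, with an illustrative host address and the caveat that the command must be whitelisted on the ensemble (4lw.commands.whitelist).

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// fetchWatchesByPath sends ZooKeeper's "wchp" four-letter command and returns
// the raw response, which lists watched paths and the sessions watching them.
func fetchWatchesByPath(addr string) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return "", err
	}
	defer conn.Close()

	if _, err := conn.Write([]byte("wchp")); err != nil {
		return "", err
	}
	out, err := io.ReadAll(conn)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// A snapshot job could run this periodically against each Zookeeper host
	// and aggregate which paths (applications) still have readers.
	snapshot, err := fetchWatchesByPath("zk1:2181")
	if err != nil {
		panic(err)
	}
	fmt.Println(snapshot)
}
```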
The Next-Gen Service Discovery system achieved significant improvements over the Zookeeper-based architecture.
The system now handles the company-wide fleet of hundreds of thousands of application instances in one data center with data propagation latency of P50 less than 1 second and P99 less than 5 seconds. The previous Zookeeper-based architecture frequently experienced high-latency and unavailability incidents, with data propagation latency of P50 less than 10 seconds and P99 less than 30 seconds.
This represents a tenfold improvement in median latency and a sixfold improvement in 99th percentile latency. The new system not only safeguards platform reliability at massive scale but also unlocks future innovations in centralized load balancing, service mesh integration, and cross-fabric resiliency.
Next-Gen Service Discovery marks a foundational transformation in LinkedIn’s infrastructure, changing how applications discover and communicate with each other across global data centers. By replacing the decade-old Zookeeper-based system with a Kafka and xDS-powered architecture, LinkedIn achieved near real-time data propagation, multi-language compatibility, and true horizontal scalability.
References: