2025-11-07 00:30:41
Modern software systems have grown more complex, and many organizations have moved from building large monolithic applications to building collections of smaller, independently deployable services. This shift offers several benefits, such as faster development, easier deployment, and better scalability. However, it also introduces new challenges, especially in how services handle and share data.
In a monolithic system, all components share the same database. Any part of the application can read or update data from a common source. This makes coordination simple but also creates tight coupling between different parts of the system. A small change in one area can affect the entire application.
In contrast, service-oriented architectures divide the system into smaller, independent services. Each service is responsible for a specific business function and should manage its own data. This principle is often referred to as service data ownership. It ensures that services can be developed, tested, and deployed independently without depending on the internal workings of other services.
However, even though services own their data, they still need to exchange information with each other. For example, an order service may need customer details from a customer service, or a payment service may need transaction information from an order service. Sharing data between services becomes essential for the system to work as a whole.
The main challenge, then, is how to share data without losing the independence that microservices aim to achieve.
Should multiple services connect to the same database?
Or should they communicate through APIs or messages?
How do we ensure consistency, performance, and fault tolerance while keeping services loosely coupled?
In this article, we explore these questions in depth. We investigate the difference between sharing a data source and sharing data, and examine the main strategies used to share data between services.
2025-11-06 00:30:42
One of AI’s biggest challenges today is memory: how agents retain and recall information over time. Without it, even the best models struggle with context loss, inconsistency, and limited scalability.
This new O’Reilly + Redis report breaks down why memory is the foundation of scalable AI systems and how real-time architectures make it possible.
Inside the report:
The role of short-term, long-term, and persistent memory in agent performance
Frameworks like LangGraph, Mem0, and Redis
Architectural patterns for faster, more reliable, context-aware systems
Disclaimer: The details in this post have been derived from the details shared online by the Databricks Engineering Team. All credit for the technical details goes to the Databricks Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Kubernetes has become the standard platform for running modern microservices. It simplifies how services talk to each other through built-in networking components like ClusterIP services, CoreDNS, and kube-proxy. These primitives work well for many workloads, but they start to show their limitations when traffic becomes high volume, persistent, and latency sensitive.
Databricks faced exactly this challenge. Many of their internal services rely on gRPC, which runs over HTTP/2 and keeps long-lived TCP connections between clients and servers. Under Kubernetes’ default model, this leads to uneven traffic distribution, unpredictable scaling behavior, and higher tail latencies.
By default, Kubernetes uses ClusterIP services, CoreDNS, and kube-proxy (iptables/IPVS/eBPF) to route traffic:
Clients resolve the service DNS (for example, my-service.default.svc.cluster.local) to a ClusterIP.
The packet goes to the ClusterIP.
kube-proxy selects a backend pod using round robin or random selection.
Since the selection happens only once per TCP connection, the same backend pod keeps receiving traffic for the lifetime of that connection. For short-lived HTTP/1 connections, this is usually fine. However, for persistent HTTP/2 connections, the result is traffic skew: a few pods get overloaded while others stay idle.
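To see why this matters, here is a small Python sketch (purely illustrative, not Databricks code) that compares per-connection pinning with per-request selection for a handful of long-lived clients. With few clients and persistent connections, the per-connection strategy skews heavily, while per-request selection stays roughly uniform:

```python
import random
from collections import Counter

# Hypothetical setup: 6 backend pods, 12 clients, 10,000 requests total.
PODS = [f"pod-{i}" for i in range(6)]
CLIENTS = 12
REQUESTS = 10_000

# Connection-level balancing (kube-proxy style): each client is pinned to one pod
# when its long-lived HTTP/2 connection is opened, and every request on that
# connection lands on the same pod.
pinned = {c: random.choice(PODS) for c in range(CLIENTS)}
conn_level = Counter(pinned[random.randrange(CLIENTS)] for _ in range(REQUESTS))

# Request-level balancing (client-side L7): each request picks a pod independently.
req_level = Counter(random.choice(PODS) for _ in range(REQUESTS))

print("per-connection:", dict(conn_level))   # typically skewed
print("per-request:  ", dict(req_level))     # roughly uniform
```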
For Databricks, this created several operational issues:
High tail latency: A few pods handled most of the load, which increased response time for users.
Poor resource utilization: Some pods were overwhelmed while others sat idle, leading to over-provisioning.
Limited strategies: Kubernetes’ built-in load balancing had no way to support error-aware or zone-aware routing.
The Databricks Engineering Team needed something smarter: a Layer 7, request-level load balancer that could react dynamically to real service conditions instead of relying on connection-level routing decisions.
In this article, we will learn how they built such a system and the challenges they faced along the way.
To overcome the limitations of the default Kubernetes routing, the Databricks Engineering Team shifted the load balancing responsibility from the infrastructure layer to the client itself. Instead of depending on kube-proxy and DNS to make connection-level routing decisions, they built a client-side load balancing system supported by a lightweight control plane that provides real-time service discovery.
This means the application client no longer waits for DNS to resolve a service or for kube-proxy to pick a backend pod. Instead, it already knows which pods are healthy and available. When a request is made, the client can choose the best backend at that moment based on up-to-date information.
Here’s a table that shows the difference between the default Kubernetes LB and Databricks client-side LB:
By removing DNS from the critical path, the system gives each client a direct and current view of available endpoints. This allows smarter, per-request routing decisions instead of static, per-connection routing. The result is more even traffic distribution, lower latency, and better use of resources across pods.
This approach also gives Databricks greater flexibility to fine-tune how traffic flows between services, something that is difficult to achieve with the default Kubernetes model.
A key part of the intelligent load balancing system is its custom control plane. This component is responsible for keeping an accurate, real-time view of the services running inside the Kubernetes cluster. Instead of depending on DNS lookups or static routing, the control plane continuously monitors the cluster and provides live endpoint information to clients.
See the diagram below:
Here is how it works:
Watching the Kubernetes API: The control plane keeps a close watch on Kubernetes resources like Services and EndpointSlices. EndpointSlices contain details about all the pods that belong to a service, including their IP addresses and health status. Whenever pods are added, removed, or their state changes, the control plane detects it almost immediately.
Building a real-time topology: It maintains an internal map of all backend pods for each service. This includes important metadata such as:
Zone information to know where the pod is running (useful for zone-affinity routing).
Readiness status to make sure only healthy pods receive traffic.
Shard labels or other identifying information to route traffic intelligently.
Translating data into xDS responses: The control plane converts this information into xDS responses. xDS is the family of Envoy discovery service APIs, a widely used protocol that lets clients and proxies receive dynamic configuration updates, including endpoint lists.
Streaming updates to clients and proxies: Instead of polling or re-resolving DNS, clients subscribe to the control plane and receive continuous streaming updates whenever endpoints change. If a pod goes down, the client learns about it almost immediately and routes requests elsewhere. If a new pod comes online, it can start receiving traffic right away.
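Here is a rough sketch of what such a watch-and-publish loop could look like using the official kubernetes Python client. The field names follow the discovery.k8s.io/v1 EndpointSlice API, but the publish stub, DELETED-event handling, resync, and the actual xDS wire format are all omitted; this is a mental model, not Databricks' control plane:

```python
from collections import defaultdict
from kubernetes import client, config, watch

config.load_incluster_config()            # assumes it runs inside the cluster
topology = defaultdict(dict)              # service name -> {pod_ip: metadata}

def publish(service):
    """Placeholder for streaming an EDS/xDS update to subscribed clients."""
    print(f"EDS update for {service}: {list(topology[service].values())}")

api = client.DiscoveryV1Api()
for event in watch.Watch().stream(api.list_endpoint_slice_for_all_namespaces):
    eps = event["object"]
    labels = eps.metadata.labels or {}
    service = labels.get("kubernetes.io/service-name", "unknown")
    for ep in eps.endpoints or []:
        ready = bool(ep.conditions and ep.conditions.ready)
        for ip in ep.addresses or []:
            if ready:
                topology[service][ip] = {"ip": ip, "zone": ep.zone}
            else:
                topology[service].pop(ip, None)   # drop pods that are not ready
    publish(service)
```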
This design has several benefits:
Clients don’t have to wait for DNS to expire or connections to break.
Routing decisions are based on the most recent view of the cluster.
This removes one of the common bottlenecks in large-scale Kubernetes systems.
For any load-balancing system to work at scale, it has to be easy for application teams to adopt. Databricks solved this by directly integrating the new load-balancing logic into their shared RPC client framework, which is used by most of their internal services.
Since many Databricks services are written in Scala, the engineering team was able to build this capability once and make it available to all services without extra effort from individual teams.
Here is how the integration works:
Subscribing to EDS updates: Each service uses a custom RPC client that automatically subscribes to the Endpoint Discovery Service (EDS) for any other service it depends on. This means the client is always aware of which pods are available and healthy.
Maintaining an in-memory list of endpoints: The client keeps a live, in-memory list of healthy endpoints, along with useful metadata like zone information, shard labels, and readiness status. This list updates automatically whenever the control plane sends new information.
Bypassing DNS and kube-proxy: Because the client already knows which endpoints are available, it doesn’t need to perform DNS lookups or rely on kube-proxy to pick a backend pod. It can select the right pod for each request directly.
Seamless organization-wide adoption: By embedding the logic inside the shared client library, Databricks made it possible for all teams to benefit from intelligent load balancing without changing their application code or deploying complex sidecars. This reduced operational overhead and made rollout much simpler.
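As a rough illustration, the client-side piece can be as simple as an in-memory cache that the EDS stream keeps fresh. The structure and field names below are assumptions for the sketch; Databricks' real RPC client is written in Scala and does far more:

```python
import threading

class EndpointCache:
    """Per-service view of healthy endpoints, refreshed by EDS pushes."""
    def __init__(self):
        self._lock = threading.Lock()
        self._endpoints = {}        # service -> list of endpoint dicts

    def apply_update(self, service, endpoints):
        # Called whenever the control plane streams a new endpoint set.
        with self._lock:
            self._endpoints[service] = [e for e in endpoints if e.get("ready", True)]

    def snapshot(self, service):
        # Request path: no DNS lookup, no kube-proxy, just an in-memory read.
        with self._lock:
            return list(self._endpoints.get(service, []))

cache = EndpointCache()
cache.apply_update("orders", [{"ip": "10.0.1.7", "zone": "us-west-2a", "ready": True}])
print(cache.snapshot("orders"))
```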
One of the biggest advantages of the client-side load balancing system at Databricks is its flexibility. Since the routing happens inside the client and is based on real-time data, the system can use more advanced strategies than the basic round-robin or random selection used by kube-proxy.
These strategies allow the client to make smarter routing decisions for every request, improving performance, reliability, and resource efficiency.
The Power of Two Choices algorithm is simple but powerful. When a request comes in, the client:
Randomly selects two healthy endpoints.
Checks their current load.
Sends the request to the less loaded of the two.
This approach avoids both random traffic spikes and overloaded pods. It balances traffic more evenly than round-robin while keeping the logic lightweight and fast. Databricks found that P2C works well for the majority of its services.
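Here is a minimal Python sketch of the idea, assuming the client tracks in-flight requests per endpoint as its load signal (the exact load signal Databricks uses is not specified here):

```python
import random

def pick_power_of_two(endpoints, inflight):
    """Power of Two Choices: sample two healthy endpoints, keep the less loaded one."""
    a, b = random.sample(endpoints, 2)
    return a if inflight.get(a["ip"], 0) <= inflight.get(b["ip"], 0) else b

endpoints = [{"ip": f"10.0.0.{i}"} for i in range(1, 6)]
inflight = {"10.0.0.1": 12, "10.0.0.2": 3, "10.0.0.3": 9, "10.0.0.4": 1, "10.0.0.5": 7}
print(pick_power_of_two(endpoints, inflight))   # usually one of the lightly loaded pods
```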
In large, distributed Kubernetes clusters, network latency can increase when traffic crosses zones.
To minimize this, the team uses zone-affinity routing:
The client prefers endpoints in the same zone as the caller to reduce latency and data transfer costs.
If that zone is overloaded or unhealthy, the client intelligently spills traffic to other zones with available capacity.
This helps maintain low latency while ensuring the system remains resilient to partial failures.
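A simplified sketch of this prefer-then-spill logic is shown below, using a made-up per-pod capacity threshold as the overload signal:

```python
import random

def pick_with_zone_affinity(endpoints, caller_zone, inflight, max_inflight_per_pod=100):
    """Prefer endpoints in the caller's zone; spill over when the local zone is saturated."""
    has_capacity = lambda e: inflight.get(e["ip"], 0) < max_inflight_per_pod
    local = [e for e in endpoints if e["zone"] == caller_zone and has_capacity(e)]
    candidates = local or [e for e in endpoints if has_capacity(e)]
    return random.choice(candidates) if candidates else None

endpoints = [{"ip": "10.0.1.5", "zone": "us-west-2a"},
             {"ip": "10.0.2.9", "zone": "us-west-2b"}]
# The only same-zone pod is at capacity, so the request spills to us-west-2b.
print(pick_with_zone_affinity(endpoints, "us-west-2a", inflight={"10.0.1.5": 100}))
```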
The architecture is designed to be extensible. The team can easily add new load-balancing strategies without changing the overall system. For example:
Weighted routing, where traffic is distributed based on custom weights (such as pod capacity or specialized hardware).
Application-specific routing, where strategies can be tuned for specific workloads like AI or analytics.
The Databricks Engineering Team didn’t limit its intelligent load balancing system to internal traffic. They also extended their Endpoint Discovery Service (EDS) control plane to work with Envoy, which manages external ingress traffic. This means that both internal service-to-service communication and traffic coming into the cluster from outside follow the same set of routing rules.
Here’s how this works:
Providing real-time endpoint data to Envoy: The control plane implements the Endpoint Discovery Service (EDS) protocol. This allows it to send Envoy up-to-date information about all backend clusters and their endpoints. As pods are added, removed, or change their status, Envoy receives immediate updates, ensuring that external traffic is always directed to healthy and available pods.
Consistent routing for internal and external traffic: Because both the internal clients and Envoy use the same control plane as their source of truth, routing decisions remain consistent across the entire platform. There’s no risk of external traffic being sent to stale endpoints or diverging from internal routing logic.
Unified service discovery: This design avoids maintaining multiple service discovery systems for different types of traffic. Instead, Databricks uses a single, centralized control plane to manage endpoint information for both internal RPC calls and ingress gateway routing.
The shift to client-side load balancing brought measurable benefits to Databricks’ infrastructure. After deploying the new system, the traffic distribution across pods became uniform, eliminating the issue of a few pods being overloaded while others sat idle.
This led to stable latency profiles, with P90 and tail latencies becoming much more predictable, and a 20 percent reduction in pod count across multiple services.

The improved balance meant Databricks could achieve better performance without over-provisioning resources.
The rollout also surfaced some important lessons:
Cold starts became more noticeable because new pods began receiving traffic immediately after coming online. To address this, the team introduced slow-start ramp-up and a mechanism to bias traffic away from pods with high error rates.
Another lesson came from experimenting with metrics-based routing. Relying on CPU and memory metrics turned out to be unreliable, as these signals often lag behind real-time conditions. The team shifted to using health-based signals, which provided more accurate and timely routing decisions.
Additionally, not all services could benefit from this system immediately, since not every language had a compatible client library, meaning some traffic still relied on traditional load balancing.
Looking ahead, Databricks is working on cross-cluster and cross-region load balancing to scale this system globally using flat L3 networking and multi-region EDS clusters. The team is also exploring advanced AI-aware strategies, including weighted load balancing for specialized backends. These future improvements are aimed at handling even larger workloads, supporting AI-heavy applications, and maintaining high reliability as their platform grows.
Through this architecture, Databricks has demonstrated a practical way to overcome the limitations of the default Kubernetes load balancing and build a flexible, efficient, and scalable traffic management system.
References:
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-11-04 23:31:09
Keeping your systems reliable shouldn’t come at the expense of your team. This practical guide from Datadog shows how to design sustainable on-call processes that reduce burnout and improve response.
Get step-by-step best practices so you can:
Cut alert noise with signal-based monitoring
Streamline response with clear roles and smarter escalations
Design rotations that support recovery and long-term sustainability
Disclaimer: The details in this post have been derived from the details shared online by the Datadog Engineering Team and the P99 Conference Organizers. All credit for the technical details goes to the Datadog Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
In the world of cloud monitoring, Datadog operates at a massive scale. The company’s platforms must ingest billions of data points every single second from millions of hosts around the globe.
This constant flow of information creates an immense engineering challenge: how do you store, manage, and query this data in a way that is not only lightning-fast but also cost-effective?
For the Datadog Engineering Team, the answer was to build their own solution from the ground up.
In this article, we will look at how the Datadog Engineering Team built Monocle, the custom time series storage engine that powers their real-time metrics platform, and analyze the technical decisions and clever optimizations behind the database.
AI is speeding things up, but all that new code creates a bottleneck — who’s verifying the quality and security? Don’t let new technical debt and security risks slip past. Sonar’s automated review gives you the trust you need in every line of code, human- or AI-written.
With SonarQube, your team can:
Use AI without fear: Get continuous analysis for quality and security.
Fix issues early: Detect and apply automated fixes before you merge.
Maintain your standards: Ensure all code meets your team’s quality bar, every time, for long-term code health.
Get started with SonarQube today to fuel AI-enabled development and build trust into all code.
Before diving into the custom database engine, it is important to understand where it fits.
Their custom engine, named Monocle, is just one specialized component within a much larger “Metrics Platform.” This platform is the entire system responsible for collecting, processing, storing, and serving all of its customer metrics.
The journey of a single data point begins at the “Metrics Edge.” This component acts as the front door, receiving the flood of data from millions of customer systems. From there, it is passed to a “Storage Router.” Just as the name suggests, this router’s main job is to process the incoming data and intelligently decide where it needs to be stored.
This is where Datadog’s first major design decision becomes clear.
The Datadog Engineering Team recognized that not all data queries are the same. An engineer asking for a performance report from last year has very different needs than an automated alert checking for a failure in the last 30 seconds. To serve both, they split their storage into two massive, specialized systems.
The first system is the Long-Term Metrics Storage. This system is like a vast historical archive. It is built for heavy-duty analytical queries, where an engineer might want to analyze a performance trend over many months or even years.
The second system is the Real-Time Metrics Storage, which is the primary focus of this article. This system is a high-speed, high-performance engine that holds only the most recent data, roughly the latest 24 hours. While that might not sound like a lot, this system handles the vast majority of the platform’s workload: 99% of all queries. These are the queries that power Datadog’s live dashboards and critical, automated monitors. In other words, when a user sees a graph on their screen update in real-time, this is the system at work.
A time series data point has two parts:
The Data: This is the actual timestamp and the value being measured. For example [12:00:01 PM, 85.5].
The Metadata: These are the labels, or “tags,” that describe what the data is, such as service:api, host:server-10, or region:us-east-1.
The Datadog Engineering Team made the critical decision to store these two parts in separate, specialized databases:
The Index Database stores only the tags (the metadata). Its only job is to be incredibly fast at finding data. When a user asks, “Show me the CPU for all services in us-east,” this index database instantly finds all the unique data streams that match those tags.
The Real-Time Timeseries Database (RTDB) stores only the timestamps and values. It is a highly optimized storage that, after getting the list of streams from the Index Database, just retrieves the raw numbers.
Perhaps the most unique architectural decision the Datadog Engineering Team made is how their database clusters are organized. In many traditional distributed databases, the server nodes (the individual computers in the cluster) constantly talk to each other. They “chatter” to coordinate who is doing what, to copy data between themselves (a process called replication), and to figure out what to do if one of them fails.
Datadog’s RTDB nodes do not do this.
Instead, the entire system is designed around Apache Kafka. Here, Kafka acts as a central place where all new data is written first before it even touches the database. This Kafka-centric design is the key to the cluster’s stability and speed.
See the diagram below that shows an RTDB cluster and the role of Kafka:
The Datadog Engineering Team uses Kafka to perform three critical functions that the database nodes would otherwise have to do themselves.
First is Data Distribution. The data in Kafka is organized into “topics” (like metrics), which are split into smaller logs called “partitions.” The team designed the system in such a way that each RTDB database node is assigned its own specific set of partitions to read from. This neatly divides the billions of incoming data points without the nodes ever needing to coordinate.
Second, Kafka acts as the Write-Ahead Log (WAL). This is a standard database concept for ensuring data safety. By using Kafka as the WAL, Datadog creates a “single source of truth.” If an RTDB node crashes due to a hardware failure, no data is lost. When the node restarts, it simply reconnects to Kafka, finds its last known reading position, and catches up on all the data it missed.
Third, Kafka handles Replication. To prevent data loss, you must have copies of your data in different physical locations. The team leverages Kafka’s built-in replication, which automatically copies the data partitions to different data centers (known as Availability Zones). This provides robust disaster recovery right out of the box.
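To make the Kafka-as-WAL idea concrete, here is a minimal sketch using the kafka-python client. The topic name, partition split, and checkpointed offsets are hypothetical, and the real ingestion path is far more involved, but it shows the shape of static partition assignment plus catch-up from a saved offset:

```python
from kafka import KafkaConsumer, TopicPartition

def ingest(payload: bytes):
    """Stand-in for handing the decoded data point to the storage engine."""
    print("ingested", payload[:32])

ASSIGNED = [0, 1, 2]                                   # this node's share of the topic
consumer = KafkaConsumer(bootstrap_servers="kafka:9092", enable_auto_commit=False)
partitions = [TopicPartition("metrics", p) for p in ASSIGNED]
consumer.assign(partitions)                            # static assignment: no coordination

# On restart, resume from the last offsets this node checkpointed (Kafka as the WAL).
last_checkpoint = {0: 1_042_117, 1: 998_004, 2: 1_100_562}
for tp in partitions:
    consumer.seek(tp, last_checkpoint[tp.partition])

for record in consumer:                                # replay the backlog, then stay live
    ingest(record.value)
```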
At the heart of each RTDB node is Monocle, Datadog’s custom-built storage engine. See the diagram below:
This is where the team’s pursuit of performance gets truly impressive. While earlier versions of the platform used RocksDB, a popular and powerful open-source database engine, the team ultimately decided to build its own. By creating Monocle from scratch, they could tailor every single decision to their specific needs, unlocking a new level of efficiency.
Monocle is written in Rust, a modern programming language famous for its safety guarantees and “C-like” performance. It is built on Tokio, a popular framework in the Rust ecosystem for writing high-speed, asynchronous applications that can handle many tasks at once without getting bogged down.
Monocle’s key innovation is its simple data model. As mentioned, a time series is defined by its tags, like “system.cpu.user”, “host:web-01”, and “env:prod”. This set of tags is what makes a series unique. However, these tag sets can be long and complex to search.
The Datadog Engineering Team simplified this dramatically. Instead of working with these complex strings, Monocle hashes the entire set of tags for a series, turning it into a single, unique number. The database then just stores data in a simple map:
(Organization, Metric Name, Tag Hash) -> (A list of [Timestamp, Value] pairs)
This design is incredibly fast because finding all the data for any given time series becomes a direct and efficient lookup using that single hash. The separate Index Database is responsible for the human-friendly part: it tells the system that a query for env:prod corresponds to a specific list of “Tag Hashes.”
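Here is a tiny Python sketch of that data model. The hashing scheme and field names are illustrative, not Monocle's actual implementation (which is written in Rust):

```python
import hashlib
from collections import defaultdict

def tag_hash(tags):
    # Hash the full, order-independent tag set down to a single 64-bit number.
    canonical = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return int.from_bytes(hashlib.sha1(canonical.encode()).digest()[:8], "big")

store = defaultdict(list)   # (org, metric, tag_hash) -> [(timestamp, value), ...]

tags = {"host": "web-01", "env": "prod"}
key = ("acme", "system.cpu.user", tag_hash(tags))
store[key].append((1731000001, 85.5))

# A query first asks the Index Database which tag hashes match (not shown here),
# then every series read is a direct lookup on this map.
print(store[("acme", "system.cpu.user", tag_hash({"env": "prod", "host": "web-01"}))])
```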
Monocle’s speed comes from two main areas: its concurrency model and its storage structure.
Monocle uses what is known as a “thread-per-core” or “shared-nothing” architecture. You can imagine each CPU core in the server has its own dedicated worker, which operates in total isolation. Each worker has its own data, its own cache, and its own memory. They do not share anything.
When new data comes in from Kafka, it is hashed. The system then sends that data to the specific queue for the one worker that “owns” that hash. Since each worker is the only one who can ever access their own data, there is no need for locks, no coordination, and no waiting. This eliminates a massive performance bottleneck common in traditional databases, where different threads often have to wait for each other to access the same piece of data.
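A stripped-down sketch of that ownership rule looks like this (illustrative only; Monocle is written in Rust and runs one worker per core):

```python
import queue

NUM_WORKERS = 8                                    # e.g. one worker per CPU core
work_queues = [queue.Queue() for _ in range(NUM_WORKERS)]

def owner_of(series_hash: int) -> int:
    # A series always maps to the same worker, so no shard is ever touched by
    # two workers and the hot path needs no locks.
    return series_hash % NUM_WORKERS

def route(point: dict):
    work_queues[owner_of(point["series_hash"])].put(point)

route({"series_hash": 0x9D3F21A7C0051B42, "ts": 1731000001, "value": 85.5})
```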
See the diagram below:
Monocle’s storage layout is a Log-Structured Merge-Tree (LSM-Tree). This is a design that is extremely efficient for write-heavy workloads like Datadog’s.
Here are the main concepts associated with LSM Trees:
Memtable: All new data is batched in memory in a structure called a Memtable.
Flushing: When the Memtable is full, it is “flushed” to disk as a complete, read-only file. The system never goes back to modify old files.
Compaction: Over time, a background process called compaction tidies up by merging these small files into larger, more organized ones.
The Datadog Engineering Team added two critical optimizations to this design:
Arena Allocator: Normally, when a Memtable is flushed, the database would have to free millions of tiny objects from memory, which is a slow process. Monocle uses a custom arena allocator. This means the Memtable gets one giant chunk of memory (an arena). When it is done, the entire chunk is freed all at once, which is vastly more efficient.
Time-Based File Pruning: Since 99% of queries are for recent data, Monocle takes advantage of this. Each file on disk has a known time range (for example, 10:00 AM - 10:10 AM). When a query comes in for the last hour, Monocle can instantly prune (ignore) all the files that are not from that time window. This reduces the number of files it has to read to find the answer.
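Here is an illustrative sketch of time-based file pruning, with made-up file names and timestamps: each on-disk file carries the time range it covers, so a query only opens the files whose range overlaps its window.

```python
from dataclasses import dataclass

@dataclass
class SSTable:
    path: str
    min_ts: int
    max_ts: int

files = [
    SSTable("sst-0001", 1731000000, 1731000600),
    SSTable("sst-0002", 1731000600, 1731001200),
    SSTable("sst-0003", 1731001200, 1731001800),
]

def files_for_query(files, query_start, query_end):
    # Keep only files whose [min_ts, max_ts] range overlaps the query window.
    return [f for f in files if f.max_ts >= query_start and f.min_ts <= query_end]

print(files_for_query(files, query_start=1731001300, query_end=1731001500))
```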
Handling so many queries with the “thread-per-core” design creates a unique challenge.
Since a query is fanned out to all workers, it is only as fast as its slowest worker. If one worker is busy performing a background task, like a heavy data compaction, it can stall the entire query for all the other workers. This is a classic computer science problem known as head-of-line blocking.
To solve this, the team built a two-layer system to manage the query load and stay responsive.
The first layer is Admission Control. This acts as a simple “gate” at the front door. If the system detects it is under extreme load (for example, it is falling behind on reading data from Kafka or is low on memory), this gate will simply stop accepting new queries. This protects the database from being overwhelmed.
The second is a smarter Cost-Based Scheduling system. This layer uses a well-known algorithm called CoDel (Controlled Delay) to actively manage latency. It can prioritize queries based on their cost (how much work they need to do) and ensure that even under heavy, unpredictable query bursts, the database remains responsive and does not grind to a halt.
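As a rough illustration, the admission-control gate can be as simple as the sketch below. The thresholds are invented for the example, and the CoDel-based cost scheduler is not shown:

```python
def admit(query, kafka_lag_seconds, memory_used_ratio,
          max_lag_seconds=30, max_memory_ratio=0.9):
    """Admission control: refuse new queries while the node is clearly unhealthy.
    Thresholds here are illustrative only."""
    if kafka_lag_seconds > max_lag_seconds:
        return False      # falling behind on ingestion: shed query load first
    if memory_used_ratio > max_memory_ratio:
        return False      # too close to the memory limit
    return True

print(admit("avg:system.cpu.user{env:prod}", kafka_lag_seconds=4, memory_used_ratio=0.62))
```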
The Datadog Engineering Team’s work on Monocle is far from over. They are already planning the next evolution of their platform, which involves two major changes.
Smarter Routing: The current system, while effective, is relatively static. The team is developing a dynamic, load-balancing system that will allow the cluster to move data around in real-time. This would let the database automatically adapt to query “hot spots,” for instance, when many engineers are suddenly querying the same small set of metrics during an incident.
Colocate Points and Tags: This is a fundamental shift. The team plans to merge their two separate databases (the Index Database and the RTDB) into a single, unified system. Instead of just storing tag hashes, the new database will store the full tag strings right alongside their corresponding timestamps and values.
To achieve this, the team will move to a columnar database format.
In a columnar database, data is stored by columns instead of rows. This means a query can read only the specific tags and values it needs, which is a massive speedup for analytics.
This is a complex undertaking that will likely require a complete redesign of their “thread-per-core” model, but it highlights Datadog’s drive to push the boundaries of performance.
References:
2025-11-04 00:31:09
Too often, agents write code that almost works, leaving developers debugging instead of shipping. Warp changes that.
With Warp you get:
#1 coding agent: Tops benchmarks, delivering more accurate code out of the box.
Tight feedback loop: Built-in code review and editing lets devs quickly spot issues, hand-edit, or reprompt.
1-2 hours saved per day with Warp: All thanks to Warp’s 97% diff acceptance rate.
See why Warp is trusted by over 700k developers.
Disclaimer: The details in this post have been derived from the details shared online by the Perplexity Engineering Team, Vespa Engineering Team, AWS, and NVIDIA. All credit for the technical details goes to the Perplexity Engineering Team, Vespa Engineering Team, NVIDIA, and AWS. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
At its core, Perplexity AI was built on a simple but powerful idea: to change online search from a list of a few blue links into a direct “answer engine”.
The goal was to create a tool that could read through web pages for you, pull out the most important information, and give you a single, clear answer.
Think of it as a combination of a traditional search engine and a smart AI chatbot. When you ask a question, Perplexity first scours the live internet for the most current and relevant information. Then, it uses a powerful AI to read and synthesize what it found into a straightforward summary. This approach is very different from AI models that rely only on the data they were trained on, which can be months or even years out of date.
This design directly tackles two of the biggest challenges with AI chatbots:
Their inability to access current events.
Their tendency to “hallucinate” or make up facts.
By basing every answer on real, verifiable web pages and providing citations for its sources, Perplexity aims to be a more trustworthy and reliable source of information.
Interestingly, the company didn’t start with this grand vision. Their initial project was a much more technical tool for translating plain English into database queries.
However, the launch of ChatGPT in late 2022 was a turning point. The team noticed that one of the main criticisms of ChatGPT was its lack of sources. They realized their own internal prototype already solved this problem. In a decisive move, they abandoned four months of work on their original project to focus entirely on the challenge of building a true answer engine for the web. This single decision shaped the entire technical direction of the company.
The backbone of Perplexity’s service is a meticulously implemented Retrieval-Augmented Generation (RAG) pipeline. Here’s what RAG looks like at a high level.
Behind the scenes of RAG at Perplexity is a multi-step process, which is executed for nearly every query to ensure that the generated answers are both relevant and factually grounded in current information.
The pipeline can be deconstructed into five distinct stages:
Query Intent Parsing: The process begins when a user submits a query. Instead of relying on simple keyword matching, the system first employs an LLM (which could be one of Perplexity’s own fine-tuned models or a third-party model such as GPT-4) to parse the user’s intent. This initial step moves beyond the lexical level to achieve a deeper semantic understanding of what the user is truly asking, interpreting context, nuance, and the underlying goal of the query.
Live Web Retrieval: Once the user’s intent is understood, the parsed query is dispatched to a powerful, real-time search index. This retrieval system scours the web for a set of relevant pages and documents that are likely to contain the answer. This live retrieval is a non-negotiable step in the process, ensuring that the information used for answer generation is always as fresh and up-to-date as possible.
Snippet Extraction and Contextualization: The system does not pass the full text of the retrieved web pages to the generative model. Instead, it utilizes algorithms to extract the most relevant snippets, paragraphs, or chunks of text from these sources. These concise snippets, which directly pertain to the user’s query, are then aggregated to form the “context” that will be provided to the LLM.
Synthesized Answer Generation with Citations: The curated context is then passed to the chosen generative LLM. The model’s task is to generate a natural-language, conversational response based only on the information contained within that provided context. This is a strict and defining principle of the architecture: “you are not supposed to say anything that you didn’t retrieve”. To enforce this principle and provide transparency, a crucial feature is the attachment of inline citations to the generated text. These citations link back to the source documents, allowing users to verify every piece of information and delve deeper into the source material if they choose.
Conversational Refinement: The Perplexity system is designed for dialogue, not single-shot queries. It maintains the context of the ongoing conversation, allowing users to ask follow-up questions. When a follow-up is asked, the system refines its answers through a combination of the existing conversational context and new, iterative web searches.
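To tie the five stages together, here is a compressed Python sketch of the flow. Every helper in it is a hypothetical stub standing in for Perplexity's internal systems, so treat it as a mental model rather than their implementation:

```python
def parse_intent(query, history):
    """Stage 1: use an LLM to turn the raw query plus conversation history
    into a structured search intent."""
    return {"query": query, "history": history}

def search_index(intent, k=20):
    """Stage 2: live web retrieval against the real-time index."""
    return [{"url": f"https://example.com/doc{i}", "text": "..."} for i in range(k)]

def extract_snippets(documents, intent, max_snippets=8):
    """Stage 3: keep only the chunks most relevant to the query."""
    return documents[:max_snippets]

def generate_with_citations(intent, snippets):
    """Stage 4: answer strictly from the provided context, citing sources."""
    sources = [s["url"] for s in snippets]
    return {"answer": "...answer grounded in the snippets...", "citations": sources}

def answer(query, history=()):
    intent = parse_intent(query, list(history))
    documents = search_index(intent)
    snippets = extract_snippets(documents, intent)
    return generate_with_citations(intent, snippets)   # Stage 5: history feeds follow-ups

print(answer("How does Perplexity ground its answers?"))
```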
The diagram below shows a general view of how RAG works in principle:
Perplexity’s core technical competency is not the development of a single, superior LLM but rather the orchestration layer that combines various LLMs with a high-performance search system to deliver fast, accurate, and cost-efficient answers. This is a complex challenge: it must balance the high computational cost of LLMs against the low-latency demands of a real-time search product.
To solve this, the architecture is explicitly designed to be model-agnostic.
It leverages a heterogeneous mix of models, including in-house fine-tuned models from the “Sonar” family and third-party frontier models from leading labs like OpenAI (GPT series) and Anthropic (Claude series).
This flexibility is managed by an intelligent routing system. This system uses small, efficient classifier models to first determine the user’s intent and the complexity of their query. Based on this classification, the request is then routed to the most appropriate and cost-effective model for the specific task. For instance, a simple request for a definition might be handled by a small, fast, in-house model, whereas a complex query requiring multi-step reasoning or agentic behavior would be routed to a more powerful and expensive model like GPT-5 or Claude Opus.
This dynamic decision-making process, guided by the principle of using “the smallest model that will still give the best possible user experience,” is a key architectural strategy for managing performance and cost at scale.
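Here is a toy sketch of that routing idea. The model names, thresholds, and the keyword-based classifier are all invented for illustration; the real system uses small, trained classifier models:

```python
def classify(query):
    """Hypothetical lightweight classifier: returns a complexity score between 0 and 1."""
    needs_reasoning = any(w in query.lower() for w in ("compare", "plan", "prove", "multi-step"))
    return {"complexity": 0.9 if needs_reasoning else 0.2}

def route_model(query):
    """Pick the smallest model that should still give a good answer.
    Model tiers and thresholds here are illustrative."""
    c = classify(query)["complexity"]
    if c < 0.3:
        return "sonar-small"        # cheap in-house model for simple lookups
    if c < 0.7:
        return "sonar-large"
    return "frontier-model"         # e.g. a GPT- or Claude-class model

print(route_model("define idempotency"), route_model("compare these two multi-step plans"))
```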
This model-agnostic design is more than just a technical optimization; it also serves as a key strategic defense. In an industry where the underlying large language models are rapidly advancing and at risk of becoming commoditized, a naive architecture built entirely on a single third-party API would create significant business risks, including vendor lock-in, unpredictable pricing changes, and dependency on another company’s roadmap.
The Perplexity architecture deliberately mitigates these risks. The goal is to create a system that can leverage different models (like both open source and closed source) and have them work together symbiotically, to the point where the end-user does not need to know or care which specific model is being used.
This architectural choice demonstrates a clear belief that the company’s “moat” is not any single LLM, but the proprietary orchestration system that manages the interaction with these models to provide the best results for the end user.
The “retrieval” component of Perplexity’s RAG pipeline is the foundation upon which the entire system’s accuracy and relevance are built. The quality of the information retrieved directly dictates the quality of the final answer generated.
Perplexity uses Vespa AI to power its massive and scalable RAG architecture. The selection of Vespa was driven by the need for a platform capable of delivering real-time, large-scale RAG with the high performance, low latency, and reliability demanded by a consumer-facing application serving millions of users.
Here’s a comparative chart of the query latencies Perplexity has been able to achieve:

A key advantage of Vespa is its unified nature. It integrates multiple critical search technologies, including vector search for semantic understanding, lexical search for precision, structured filtering, and machine-learned ranking, into a single, cohesive engine. This integrated approach eliminates the significant engineering overhead and complexity that would arise from attempting to stitch together multiple disparate systems, such as a standalone vector database combined with a separate keyword search engine like Elasticsearch.
The decision to build on Vespa was also driven by the need to focus limited engineering resources. Building a web-scale, real-time search engine from the ground up is an extraordinarily difficult and capital-intensive endeavor, a problem that companies like Google and Yahoo (where Vespa originated) have invested decades and billions of dollars to solve.
Perplexity’s core mission is not to reinvent search indexing but to build a novel answer engine on top of a search foundation. By strategically outsourcing the mature and largely “solved” problem of distributed, real-time search to a specialized platform like Vespa, Perplexity’s relatively small engineering team of around 38 people has been able to focus its efforts on the unique and differentiating parts of its technology stack, made up of the following pieces:
The RAG orchestration logic.
The fine-tuning of its proprietary Sonar models.
The hyper-optimization of its in-house ROSE inference engine.
This is a classic “build vs. buy” decision executed at the highest architectural level.
The infrastructure built on Vespa is designed to handle the unique demands of an AI-powered answer engine, prioritizing scale, freshness, and a deep understanding of content.
Here are the key aspects of the same:
The system operates on a massive index that covers hundreds of billions of webpages.
Perplexity’s crawling and indexing infrastructure tracks over 200 billion unique URLs, supported by fleets of tens of thousands of CPUs and a multi-tier storage system with over 400 petabytes in hot storage alone.
Vespa’s distributed architecture is fundamental to managing this scale. It automatically, transparently, and dynamically partitions and distributes content across a cluster of many nodes. Crucially, it co-locates all information, indexes, and the computational logic for a given piece of data on the same node, which distributes the query load effectively and avoids the network bandwidth bottlenecks that can cripple large-scale systems.
For an answer engine, information staleness is a critical failure mode. The system must reflect the world as it is, right now.
Perplexity’s infrastructure is engineered for this, processing tens of thousands of index update requests every second to ensure the index provides the freshest results available.
This is enabled by Vespa’s unique index technology, which is capable of cheaply and efficiently mutating index structures in real time, even while they are being actively read by serving queries. It allows for a continuous stream of updates without degrading query performance.
To manage this process, an ML model is trained to predict whether a candidate URL needs to be indexed and to schedule the indexing operation at the most useful time, calibrated to the URL’s importance and likely update frequency.
The system’s understanding of content goes much deeper than the document level.
Perplexity’s indexing infrastructure divides documents into “fine-grained units” or “chunks”. Rather than returning the entire content of a long article, Vespa’s layered ranking capabilities allow it to score these individual chunks by their relevance to the query.
This means the system can identify and return only the most relevant paragraphs or sentences from the most relevant documents, providing a much more focused and efficient context for the LLM to process.
Here’s a screenshot that shows how Perplexity typically presents its search results:

To contend with the unstructured and often inconsistent nature of the open web, Perplexity’s indexing operations utilize an AI-powered content understanding module.
This module dynamically generates and adapts parsing rulesets to extract semantically meaningful content from diverse websites.
This module is not static. It optimizes itself through an iterative AI self-improvement process. In this loop, frontier LLMs assess the performance of current rulesets on dimensions of completeness and quality. The system then uses these assessments to formulate, validate, and deploy proposed changes to address error classes, ensuring the module continuously evolves. This process is crucial for segmenting documents into the self-contained, atomic units of context that are ideal for LLMs.
The high quality of Perplexity’s answers is fundamentally limited by the quality of the information it retrieves. Therefore, its ranking algorithm acts as a critical quality gatekeeper for the entire RAG pipeline.
Perplexity leverages Vespa’s advanced capabilities to implement a multi-stage architecture that progressively refines results under a tight latency budget. Here are the key aspects:
Dense Retrieval (Vector Search): This technique allows the system to move beyond keywords to match content at a semantic or conceptual level. It uses vector embeddings to understand the meaning behind a user’s query and find contextually similar documents, even if they do not share the same words.
Sparse Retrieval (Lexical Search): While vector search is excellent for capturing broad meaning, it can lack precision. Sparse retrieval, which includes traditional keyword-based techniques and ranking functions like BM25, provides this necessary precision. It allows the system to perform exact matches on rare terms, product names, internal company monikers, and other specific identifiers where semantic ambiguity is undesirable.
Machine-Learned Ranking: The true power of the system lies in its ability to combine these different signals. Vespa enables Perplexity to implement advanced, multi-phase ranking models. In this process, an initial set of candidate documents might be retrieved using a combination of lexical and vector search. Then, a more sophisticated machine-learning model evaluates these candidates, using a rich set of features (such as lexical relevance scores, vector similarity, document authority, freshness, user engagement signals, and other metadata) to produce a final, highly accurate ranking.
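The sketch below shows the general idea of blending a dense (vector) signal with a sparse (lexical) signal into a single score. It is a toy stand-in: the weights and the term-overlap "lexical score" are placeholders, and Perplexity's actual ranking runs as multi-phase, machine-learned ranking inside Vespa.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lexical_score(query_terms, doc_terms):
    # Stand-in for a BM25-style score: fraction of query terms present in the doc.
    return len(set(query_terms) & set(doc_terms)) / len(set(query_terms))

def hybrid_rank(query, candidates, w_dense=0.6, w_sparse=0.4):
    scored = []
    for doc in candidates:
        dense = cosine(query["embedding"], doc["embedding"])
        sparse = lexical_score(query["terms"], doc["terms"])
        scored.append((w_dense * dense + w_sparse * sparse, doc["id"]))
    return sorted(scored, reverse=True)

query = {"embedding": [0.1, 0.7, 0.2], "terms": ["vespa", "ranking"]}
candidates = [
    {"id": "doc-a", "embedding": [0.1, 0.6, 0.3], "terms": ["vespa", "ranking", "bm25"]},
    {"id": "doc-b", "embedding": [0.9, 0.1, 0.0], "terms": ["unrelated"]},
]
print(hybrid_rank(query, candidates))
```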
The diagram below shows how Vector Search typically works in the context of a RAG system:
The ranking stack is co-designed with Perplexity’s user-facing products, allowing it to leverage rich, automated signals from millions of daily user requests to continuously enrich its training data.
After Perplexity finds the best information on the web, the next step is to turn it into a clear, easy-to-read answer.
This is handled by the “generation engine,” which is the AI brain that writes the response. To do this, Perplexity uses a clever, two-part strategy: it combines its own custom-built AI models with a selection of the most powerful models from other leading technology labs. This hybrid approach allows the company to perfectly balance cost, speed, and access to the most advanced AI capabilities.
The first part of this strategy is Perplexity’s own family of AI models, known as Sonar. These models are not built entirely from scratch, which would be incredibly expensive and time-consuming. Instead, Perplexity starts with powerful, publicly available open-source models. They then “fine-tune” them for their specific needs.
Perplexity trains these base models on its own vast collection of data, teaching them the special skills required to be a great answer engine. These skills include the ability to summarize information accurately, correctly add citations to sources, and stick strictly to the facts that were found during the web search. Every time a user interacts with the service, Perplexity gathers more data that helps it continuously improve its Sonar models, making them smarter and more helpful over time.
The second part of the strategy involves integrating the “best of the best” from the wider AI world. For its paying subscribers, Perplexity offers access to a curated selection of the most advanced models available, such as OpenAI’s GPT series and Anthropic’s Claude models. This gives users the option to use the absolute most powerful AI for tasks that require deep reasoning, creativity, or complex problem-solving.
To make this work smoothly, Perplexity uses a service called Amazon Bedrock, which acts like a universal adapter, allowing them to easily plug these different third-party models into their system without having to build a separate, custom integration for each one.
This “best of both worlds” approach is the key to Perplexity’s business model.
Having powerful AI models is one thing, but delivering fast and affordable answers to millions of people is a massive technical challenge.
Running AI models is incredibly expensive, so Perplexity has built a sophisticated, high-performance system to do it efficiently. This system, known as the “inference stack,” is the engine that makes the entire service possible.
At the heart of this system is a custom-built engine named ROSE. Perplexity created ROSE to do two things very well.
First, it needed to be flexible, allowing the engineering team to quickly test and adopt the latest AI models as they are released.
Second, it needed to be a platform for extreme optimization, squeezing every last bit of performance out of the models to make them run faster and cheaper.
ROSE is primarily built in Python and leverages PyTorch for its model definitions. This choice provides the flexibility and ease of development needed to adapt to new and varied model architectures. However, for components where performance is absolutely critical, such as the serving logic and batch scheduling algorithms, the team is migrating the codebase to Rust. This move leverages Rust’s performance, which is comparable to C++, along with its strong memory safety guarantees, making it an ideal choice for high-performance, reliable systems code.
The engine is architected around a core LLM engine that can load model weights and generate decoded tokens. It supports advanced decoding strategies, including speculative decoding and MTP (Multi-Token Prediction) decoders, which can improve latency.
This entire operation runs on the Amazon Web Services (AWS) cloud platform, using pods of state-of-the-art NVIDIA H100 GPUs. These GPUs are essentially super-powered computer chips designed specifically for AI use cases. To manage this fleet of powerful hardware, Perplexity uses industry-standard tools like Kubernetes to orchestrate all the moving parts and ensure the system runs smoothly and can handle huge amounts of traffic.
See the diagram below that shows how Perplexity runs LLM inference in production at massive scale on NVIDIA hardware.

The decision to build this complex system in-house instead of simply paying to use other AI models has a huge payoff. By controlling the entire stack, from the software engine down to the hardware, Perplexity can optimize everything for its specific needs. This control directly leads to faster response times for users and lower costs for the business.
The technical architecture of Perplexity AI reveals that its power as an “AI Google” does not stem from a single, magical Large Language Model.
Instead, its success is the result of the engineering of a complete, end-to-end system where each component is carefully chosen and deeply optimized to work in concert with the others.
First is the world-class retrieval engine, built upon the scalable and real-time foundation of Vespa.ai. This system provides high-quality, fresh, and relevant information that serves as the factual bedrock for every answer. It also has a sophisticated hybrid ranking algorithm acting as the critical gatekeeper.
Second is the flexible, model-agnostic orchestration layer. This core logic intelligently parses user intent and routes queries to the most appropriate generative model, whether it be a cost-effective, in-house Sonar model fine-tuned for specific tasks or a state-of-the-art frontier model from a third-party lab. This layer provides the economic and strategic flexibility necessary to compete in a rapidly evolving AI landscape.
Third is the hyper-optimized, in-house inference stack, centered around the ROSE engine. This custom-built system, running on state-of-the-art NVIDIA hardware within the AWS cloud, extracts every last drop of performance and cost-efficiency out of the models it serves.
References:
Spotlight: Perplexity AI Serves 400 Million Search Queries a Month Using NVIDIA Inference Stack
Deep Dive Read With Me: Perplexity CTO Denis Yarats on AI-Powered Search
Perplexity Builds Advanced Search Engine Using Anthropic’s Claude 3 in Amazon Bedrock
2025-11-01 23:30:28
Bugs sneak out when less than 80% of user flows are tested before shipping. However, getting that kind of coverage (and staying there) is hard and pricey for any team.
QA Wolf’s AI-native solution provides high-volume, high-speed test coverage for web and mobile apps, reducing your organization’s QA cycle to minutes.
They can get you:
80% automated E2E test coverage in weeks—not years
24-hour maintenance and on-demand test creation
Zero flakes, guaranteed
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of engineers achieved 4x more test cases and 86% faster QA cycles.
⭐ Rated 4.8/5 on G2
This week’s system design refresher:
10 Key Data Structures We Use Every Day
🚀 New Launch: Become an AI Engineer | Learn by Doing | Cohort 2!
IP Address Cheat Sheet Every Engineer Should Know
Which Protocols Run on TCP and UDP
Why is DeepSeek-OCR such a BIG DEAL?
SPONSOR US
list: keep your Twitter feeds
stack: support undo/redo in a word editor
queue: hold printer jobs, or send user actions in a game
hash table: caching systems
array: math operations
heap: task scheduling
tree: represent the HTML document, or AI decision-making
suffix tree: search for a string in a document
graph: track friendships, or find paths
r-tree: find the nearest neighbor
vertex buffer: send data to the GPU for rendering
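As a quick illustration of one item on this list, here is the classic stack-based undo/redo pattern in Python (a minimal sketch, with strings standing in for editor state):

```python
undo_stack, redo_stack = [], []

def do_edit(doc, edit):
    undo_stack.append(doc)      # remember the previous state
    redo_stack.clear()          # a fresh edit invalidates the redo history
    return doc + edit

def undo(doc):
    redo_stack.append(doc)
    return undo_stack.pop() if undo_stack else doc

doc = do_edit("", "Hello")
doc = do_edit(doc, ", world")
doc = undo(doc)                 # back to "Hello"
print(doc)
```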
Over to you: Which additional data structures have we overlooked?
After the incredible success of our first cohort (almost 500 people attended), I’m thrilled to announce the launch of Cohort 2 of Become an AI Engineer!
This is not just another course about AI frameworks and tools. Our goal is to help engineers build the foundation and end-to-end skill set needed to thrive as AI engineers.
Here’s what makes this cohort special:
Learn by doing: Build real-world AI applications instead of just watching videos.
Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.
Live feedback and mentorship: Get direct feedback from instructors and peers.
Community driven: Learning alone is hard. Learning with a community is easy!
We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.
If you missed Cohort 1, now’s your chance to join us for Cohort 2.
Every message sent over the internet has two layers of communication, one that carries the data (transport) and one that defines what the data means (application). TCP and UDP sit at the transport layer, but they serve completely different purposes.
TCP is connection-oriented. It guarantees delivery, maintains order, and handles retransmission when packets get lost.
HTTP runs on TCP. The browser opens a TCP connection, sends an HTTP request, waits for the HTTP response, and closes the connection (or keeps it alive for subsequent requests). Every web page you have ever loaded used this pattern.
HTTPS adds TLS over TCP. The TCP connection happens first. Then comes the TLS handshake with public key exchange, session key negotiation, and finally encrypted data transfer.
SMTP uses TCP for email. Messages flow from sender to SMTP server to receiver over TCP connections. Email can’t afford to lose data mid-transmission, so TCP’s reliability is essential.
UDP is connectionless. No handshake. No guaranteed delivery. No order preservation. Just fire data requests and responses into the network and hope they arrive. Sounds chaotic, but it’s fast.
HTTP/3 runs over QUIC, which uses UDP. This seems backwards until you realize QUIC reimplements the reliability features of TCP inside UDP, but with better performance. Multiple streams over one connection. Built-in TLS 1.3. Faster connection establishment. The numbered streams in the diagram show parallel data flows that don’t block each other.
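For a hands-on feel of the difference, here is a small Python sketch: TCP needs a handshake and a peer that accepts the connection (so the example spins up a local echo server), while UDP simply fires a datagram with no guarantee anyone is listening. Addresses and ports are arbitrary.

```python
import socket
import threading

# TCP: connection-oriented. Spin up a tiny local echo server so the example is
# self-contained, then connect (three-way handshake), send, and read the reply.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

def echo_once(listener):
    conn, _ = listener.accept()
    conn.sendall(conn.recv(1024))
    conn.close()

threading.Thread(target=echo_once, args=(server,), daemon=True).start()

tcp = socket.create_connection(server.getsockname())
tcp.sendall(b"hello over tcp")
print("TCP echo:", tcp.recv(1024))
tcp.close()
server.close()

# UDP: connectionless. No handshake and no delivery guarantee; the datagram is
# fired at the destination whether or not anyone is listening.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"hello over udp", ("127.0.0.1", 9999))
udp.close()
```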
Over to you: What tools do you use to analyze transport layer performance?
Existing LLMs struggle with long inputs because they can only handle a fixed number of tokens, known as the context window, and attention cost grows quickly as inputs get longer.
DeepSeek-OCR takes a new approach.
Instead of sending long context directly to an LLM, it turns it into an image, compresses that image into visual tokens, and then passes those tokens to the LLM.
Fewer tokens lead to lower computational cost from attention and a larger effective context window. This makes chatbots and document models more capable and efficient.
How is DeepSeek-OCR built? The system has two main parts:
Encoder: It processes an image of text, extracts the visual features, and compresses them into a small number of vision tokens.
Decoder: A Mixture of Experts language model that reads those tokens and generates text one token at a time, similar to a standard decoder-only transformer.
When to use it?
DeepSeek-OCR shows that text can be efficiently compressed using visual representations.
It is especially useful for handling very long documents that exceed standard context limits. You can use it for context compression, standard OCR tasks, or deep parsing, such as converting tables and complex layouts into text.
Over to you: What do you think about using visual tokens to handle long-context problems in LLMs? Could this become the next standard for large models?