
How LinkedIn Built a Next-Gen Service Discovery for 1000s of Services

2026-02-11 00:30:21

Your free ticket to Monster SCALE Summit is waiting — 50+ engineering talks on data-intensive applications (Sponsored)

Monster SCALE Summit is a virtual conference all about extreme-scale engineering and data-intensive applications. Engineers from Discord, Disney, LinkedIn, Uber, Pinterest, Rivian, ClickHouse, Redis, MongoDB, ScyllaDB + more will be sharing 50+ talks on topics like:

  • Distributed databases

  • Streaming and real-time processing

  • Intriguing system designs

  • Approaches to a massive scaling challenge

  • Methods for balancing latency/concurrency/throughput

  • Infrastructure built for unprecedented demands.

Don’t miss this chance to connect with 20K of your peers designing, implementing, and optimizing data-intensive applications – for free, from anywhere.

GET YOUR FREE TICKET


LinkedIn serves hundreds of millions of members worldwide, delivering fast experiences whether someone is loading their feed or sending a message. Behind the scenes, this seamless experience depends on thousands of software services working together. Service Discovery is the infrastructure system that makes this coordination possible.

Consider a modern application at scale. Instead of building one massive program, LinkedIn breaks functionality into tens of thousands of microservices. Each microservice handles a specific task like authentication, messaging, or feed generation. These services need to communicate with each other constantly, and they need to know where to find each other.

Service discovery solves this location problem. Instead of hardcoding addresses that can change as servers restart or scale, services use a directory that tracks where every service currently lives. This directory maintains IP addresses and port numbers for all active service instances.

At LinkedIn’s scale, with tens of thousands of microservices running across global data centers and handling billions of requests each day, service discovery becomes exceptionally challenging. The system must update in real time as servers scale up or down, remain highly reliable, and respond within milliseconds.

In this article, we will look at how LinkedIn built and rolled out Next-Gen Service Discovery, a scalable control plane that supports application containers written in multiple programming languages.

Disclaimer: This post is based on publicly shared details from the LinkedIn Engineering Team. Please comment if you notice any inaccuracies.

Zookeeper-Based Architecture

For the past decade, LinkedIn used Apache Zookeeper as the control plane for service discovery. Zookeeper is a coordination service that maintains a centralized registry of services.

In this architecture, Zookeeper allows server applications to register their endpoint addresses in a custom format called D2, which stands for Dynamic Discovery. The system stored the configuration about how RPC traffic should flow as D2 configs and served them to application clients. The application servers and clients formed the data plane, handling actual inbound and outbound RPC traffic using LinkedIn’s Rest.li framework, a RESTful communication system.

Here is how the system worked:

  • The Zookeeper client library ran on all application servers and clients.

  • The Zookeeper ensemble took direct write requests from application servers to register their endpoint addresses as ephemeral nodes called D2 URIs.

  • Ephemeral nodes are temporary entries that exist only while the connection remains active (a minimal registration sketch follows this list).

  • Zookeeper performed health checks on these connections to keep the ephemeral nodes alive.

  • Zookeeper also took direct read requests from application clients to set watchers on the server clusters they needed to call. When updates happened, clients would read the changed ephemeral nodes.
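For illustration, here is a minimal registration sketch in Python using the open-source kazoo client. LinkedIn's actual D2 client is part of the Java Rest.li stack, so the path layout, payload, and watch handling below are assumptions meant only to show the ephemeral-node mechanics described above.

```python
# Minimal sketch of ephemeral-node registration with kazoo. The /d2/uris
# path and payload format are illustrative, not LinkedIn's actual layout.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Announce this instance of the "feed" service as an ephemeral node.
# The node disappears automatically if the session (and its health
# checks) stops, which is how capacity drops out on crashes.
announcement = json.dumps({"host": "10.0.0.12", "port": 8443}).encode()
zk.create(
    "/d2/uris/feed/instance-",
    announcement,
    ephemeral=True,
    sequence=True,
    makepath=True,
)

# A client watches the cluster node and re-reads children on every change.
@zk.ChildrenWatch("/d2/uris/feed")
def on_uris_changed(children):
    print("feed now has", len(children), "announced instances")
```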

Despite its simplicity, this architecture had critical problems in three areas: scalability, compatibility, and extensibility. Benchmark tests conducted in the past projected that the system would reach capacity in early 2025.


Web Search API for your AI applications (Sponsored)

LLMs are powerful—but without fresh, reliable information, they hallucinate, miss context, and go out of date fast. SerpApi gives your AI applications clean, structured web data from major search engines and marketplaces, so your agents can research, verify, and answer with confidence.

Access real-time data with a simple API.

Try for Free


Critical Problems with Zookeeper

The key problems with Zookeeper are as follows:

1 - Scalability Issues

The control plane operated as a flat structure handling requests for hundreds of thousands of application instances.

During deployments of large applications with many calling clients, the D2 URI ephemeral nodes changed frequently. This led to read storms with huge fanout from all the clients trying to read updates simultaneously, causing high latencies for both reads and writes.

Zookeeper is a strongly consistent system, meaning it enforces strict ordering at the expense of availability. All reads, writes, and session health checks go through the same request queue. When the queue had a large backlog of read requests, write requests could not be processed. Even worse, when the queue was badly backed up, all sessions would be dropped due to health check timeouts. This caused ephemeral nodes to be removed, resulting in capacity loss of application servers and site unavailability.

The session health checks performed on all registered application servers became unscalable with fleet growth. As of July 2022, LinkedIn had about 2.5 years of capacity left with a 50 to 100 percent yearly growth rate in cluster size and number of watchers, even after increasing the number of Zookeeper hosts to 80.

2 - Compatibility Problems

Since D2 entities used LinkedIn’s custom schemas, they were incompatible with modern data plane technologies like gRPC and Envoy.

The read and write logic in application containers was implemented primarily in Java, with a partial and outdated implementation for Python applications. When onboarding applications in other languages, the entire logic needed to be rewritten from scratch.

3 - Extensibility Limitations

The lack of an intermediary layer between the service registry and application instances prevented the development of modern centralized RPC management techniques like centralized load balancing.

It also created challenges for integrating with new service registries to replace Zookeeper, such as Etcd with Kubernetes, or any new storage system that might have better functionality or performance.

The Next-Gen Service Discovery Architecture

The LinkedIn Engineering Team designed the new architecture to address all these limitations. Unlike Zookeeper, which handled read and write requests together, Next-Gen Service Discovery splits traffic into two separate paths: Kafka for writes and the Service Discovery Observer for reads.

1 - The Write Path

Kafka takes in application server writes and periodic heartbeats as events called Service Discovery URIs. Kafka is a distributed streaming platform capable of handling millions of messages per second. Each Service Discovery URI contains information about a service instance, including service name, IP address, port number, health status, and metadata.
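As a rough illustration of the write path, the sketch below publishes announcement and heartbeat events with the kafka-python client. The topic name, event fields, and heartbeat interval are assumptions; LinkedIn's actual Service Discovery URI schema and producer code are not public.

```python
# Minimal sketch of the write path using kafka-python. Topic, fields,
# and timings are illustrative assumptions.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def announce(service, host, port, healthy=True):
    """Publish a Service Discovery URI event for this instance."""
    event = {
        "service": service,
        "host": host,
        "port": port,
        "healthy": healthy,
        "timestamp": int(time.time() * 1000),
    }
    # Keying by service keeps all URIs for a cluster in one partition,
    # preserving per-service ordering for the Observer.
    producer.send("service-discovery-uris", key=service.encode(), value=event)

announce("feed", "10.0.0.12", 8443)   # initial registration
# A real server would heartbeat forever; a few iterations shown here.
for _ in range(3):
    time.sleep(30)
    announce("feed", "10.0.0.12", 8443)
producer.flush()
```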

2 - The Read Path

Service Discovery Observer consumes the URIs from Kafka and writes them into its main memory. Application clients open bidirectional gRPC streams to the Observer, sending subscription requests using the xDS protocol. The Observer keeps these streams open to push data and all subsequent updates to application clients instantly.

The xDS protocol is an industry standard created by the Envoy project for service discovery. Instead of clients polling for updates, the Observer pushes changes as they happen. This streaming approach is far more efficient than the old polling model.
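The sketch below shows roughly what an xDS subscription over a bidirectional gRPC stream looks like from the client side. It assumes Python stubs generated from the open-source Envoy xDS protos; the module paths, Observer address, and resource names are illustrative, and a real client would also send ACK/NACK requests as responses arrive.

```python
# Minimal sketch of the read path: the client opens a bidirectional gRPC
# stream to the Observer and subscribes via xDS. Generated-stub module
# paths below are assumed, not verified against a specific package.
import grpc
from envoy.service.discovery.v3 import discovery_pb2   # assumed generated module
from envoy.service.discovery.v3 import ads_pb2_grpc    # assumed generated module

def subscribe(observer_addr, resource_names, type_url):
    channel = grpc.insecure_channel(observer_addr)
    stub = ads_pb2_grpc.AggregatedDiscoveryServiceStub(channel)

    def requests():
        # Initial subscription: name the clusters this client calls.
        yield discovery_pb2.DiscoveryRequest(
            type_url=type_url, resource_names=resource_names)
        # ACK/NACK requests for each response are omitted to keep this short.

    # The Observer keeps this stream open and pushes every subsequent update.
    for response in stub.StreamAggregatedResources(requests()):
        print("received", len(response.resources),
              "resources at version", response.version_info)

subscribe("observer.local:15010",
          ["feed", "messaging"],
          "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment")
```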

3 - Configuration Management

D2 configs remain stored in Zookeeper. Application owners run CLI commands that use the Config Service to update the D2 configs and convert them into xDS entities.

Observer consumes the configs from Zookeeper and distributes them to clients the same way as the URIs.

4 - The Observer Component

The Observer is horizontally scalable and written in Go, chosen for its high concurrency capabilities.

It can process large volumes of client requests, dispatch data updates, and consume URIs for the entire LinkedIn fleet efficiently. As of today, one Observer can maintain 40,000 client streams while sending 10,000 updates per second and consuming 11,000 Kafka events per second.

With projections of fleet size growing to 3 million instances in the coming years, LinkedIn will need approximately 100 Observers.

Key Improvements Over Zookeeper

Here are some key improvements that the new architecture provided in comparison to Zookeeper:

Scalability and Availability

LinkedIn prioritized availability over consistency because service discovery data only needs to eventually converge. Some short-term inconsistency across servers is acceptable, but the data must be highly available to the huge fleet of clients. This represents a fundamental shift from Zookeeper’s strong consistency model.

Multiple Observer replicas reach eventual consistency after a Kafka event is consumed and processed on all replicas. Even when Kafka experiences significant lag or goes down, Observer continues serving client requests with its cached data, preventing cascading failures.

LinkedIn can further improve scalability by splitting Observers into dedicated roles. Some Observers can focus on consuming Kafka events as consumers, while other Observers serve client requests as servers. The server Observers would subscribe to the consumer Observers for cache updates.

Compatibility with Modern Tools

Next-Gen Service Discovery supports the gRPC framework natively and enables multi-language support.

Since the control plane uses the xDS protocol, it works with open-source gRPC and Envoy proxy. Applications not using Envoy can leverage open-source gRPC code to directly subscribe to the Observer. Applications onboarding the Envoy proxy get multi-language support automatically.

Extensibility for Future Features

Adding Next-Gen Service Discovery as a central control plane between the service registry and clients enables LinkedIn to extend to modern service mesh features. These include centralized load balancing, security policies, and transforming endpoint addresses between IPv4 and IPv6.

LinkedIn can also integrate the system with Kubernetes to leverage application readiness probes. This would collect the status and metadata of application servers, converting servers from actively making announcements to passively receiving status probes, which is more reliable and better managed.

Cross-Fabric Capabilities

Next-Gen Service Discovery Observers run independently in each fabric. A fabric is a data center or isolated cluster. Application clients can be configured to connect to the Observer in a remote fabric and be served the endpoints of server applications in that fabric. This supports custom application needs or provides failover when the Observer in one fabric goes down, ensuring business traffic remains unaffected.

See the diagram below:

Application servers can also write to the control plane in multiple fabrics. Cross-fabric announcements are appended with a fabric name suffix to differentiate from local announcements. Application clients can then send requests to application servers in both local and remote fabrics based on preference.

See the diagram below:

The Migration Challenge

Rolling out Next-Gen Service Discovery to hundreds of thousands of hosts without impacting current requests required careful planning.

To pull this off, LinkedIn needed:

  • The service discovery data served by the new control plane to exactly match the data on Zookeeper.

  • All application servers and clients companywide to be equipped with the related mechanisms through just an infrastructure library version bump.

  • Central control on the infrastructure side to switch Next-Gen Service Discovery read and write on and off by application.

  • Good central observability across thousands of applications on all fabrics for migration readiness, results verification, and troubleshooting.

The three major challenges were as follows:

  • First, service discovery is mission-critical, and any error could lead to severe site-wide incidents. Since Zookeeper was approaching capacity limits, LinkedIn needed to migrate as many applications off Zookeeper as quickly as possible.

  • Second, application states were complex and unpredictable. Next-Gen Service Discovery Read required client applications to establish gRPC streams. However, Rest.li applications that had existed at the company for over a decade were in very different states regarding dependencies, gRPC SSL, and network access. Compatibility with the control plane for many applications was unpredictable without actually enabling the read.

  • Third, read and write migrations were coupled. If the write was not migrated, no data could be read on Next-Gen Service Discovery. If the read was not migrated, data was still read on Zookeeper, blocking the write migration. Since read path connectivity was vulnerable to application-specific states, the read migration had to start first. Even after client applications migrated for reads, LinkedIn needed to determine which server applications became ready for Next-Gen Service Discovery Write and prevent clients from regressing to read Zookeeper again.

The Solution: Dual Mode Migration

LinkedIn implemented a dual mode strategy where applications run both old and new systems simultaneously, verifying the new flow behind the scenes.

To decouple read and write migration, the new control plane served a combined dataset of Kafka and Zookeeper URIs, with Kafka as the primary source and Zookeeper as backup. When no Kafka data existed, the control plane served Zookeeper data, mirroring what clients read directly from Zookeeper. This enabled read migration to start independently.
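Conceptually, the combined dataset behaves like the fallback sketch below: Kafka-sourced URIs win when they exist, and the mirrored Zookeeper data fills the gap otherwise. The function and field names are illustrative, not LinkedIn's actual code.

```python
# Sketch of the combined-dataset idea: serve Kafka-sourced URIs when they
# exist, otherwise fall back to mirrored Zookeeper data, so read migration
# can start before a service has migrated its writes.
def resolve_uris(service, kafka_cache, zookeeper_cache):
    kafka_uris = kafka_cache.get(service)
    if kafka_uris:                                # primary source once writes migrate
        return kafka_uris
    return zookeeper_cache.get(service, [])       # backup mirrors direct ZK reads

kafka_cache = {"feed": [{"host": "10.0.0.12", "port": 8443}]}
zookeeper_cache = {"feed": [{"host": "10.0.0.9", "port": 8443}],
                   "messaging": [{"host": "10.0.1.3", "port": 8443}]}

print(resolve_uris("feed", kafka_cache, zookeeper_cache))       # Kafka data wins
print(resolve_uris("messaging", kafka_cache, zookeeper_cache))  # falls back to ZK
```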

Dual Read Mode

In Dual Read mode, an application client reads data from both Next-Gen Service Discovery and Zookeeper, keeping Zookeeper as the source of truth for serving traffic. Using an independent background thread, the client tries to resolve traffic as if it were served by Next-Gen Service Discovery data and reports any errors.

LinkedIn built comprehensive metrics to verify connectivity, performance, and data correctness on both the client side and Observer side. On the client side, connectivity and latency metrics watched for connection status and data latencies from when the subscription request was sent to when data was received. Dual Read metrics compared data received from Zookeeper and Next-Gen Service Discovery to identify mismatches. Service Discovery request resolution metrics showed request status, identical to Zookeeper-based metrics, but with a Next-Gen Service Discovery prefix to identify whether requests were resolved by Next-Gen Service Discovery data and catch potential errors like missing critical data.

On the Observer side, connection and stream metrics watched for client connection types, counts, and capacity. These helped identify issues like imbalanced connections and unexpected connection losses during restart. Request processing latency metrics measured time from when the Observer received a request to when the requested data was queued for sending. The actual time spent sending data over the network was excluded since problematic client hosts could get stuck receiving data and distort the metric. Additional metrics tracked Observer resource utilization, including CPU, memory, and network bandwidth.

See the diagram below:

With all these metrics and alerts, before applications actually used Next-Gen Service Discovery data, LinkedIn caught and resolved numerous issues, including connectivity problems, reconnection storms, incorrect subscription handling logic, and data inconsistencies, avoiding many companywide incidents. After all verifications passed, applications were ramped to perform Next-Gen Service Discovery read-only.

Dual Write Mode

In Dual Write mode, application servers reported to both Zookeeper and Next-Gen Service Discovery.

On the Observer side, Zookeeper-related metrics monitored potential outages, connection losses, or high latencies by watching connection status, watch status, data received counts, and lags. Kafka metrics monitored potential outages and high latencies by watching partition lags and event counts.

LinkedIn calculated a URI Similarity Score for each application cluster by comparing the data received from Kafka and Zookeeper. A 100 percent score could be reached only if all URIs in the application cluster were identical across both sources, guaranteeing that Kafka announcements matched the existing Zookeeper announcements.
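The exact formula is not public, but a set-overlap score along these lines captures the idea; the Jaccard-style calculation below is an assumption.

```python
# Sketch of a URI Similarity Score for one application cluster: compare the
# sets of URIs observed via Kafka and via Zookeeper.
def uri_similarity(kafka_uris, zookeeper_uris):
    kafka_set, zk_set = set(kafka_uris), set(zookeeper_uris)
    if not kafka_set and not zk_set:
        return 100.0
    overlap = len(kafka_set & zk_set)
    return 100.0 * overlap / len(kafka_set | zk_set)

kafka_uris = ["10.0.0.12:8443", "10.0.0.13:8443"]
zk_uris = ["10.0.0.12:8443", "10.0.0.13:8443", "10.0.0.14:8443"]
print(round(uri_similarity(kafka_uris, zk_uris), 1))  # 66.7, not yet safe to migrate
```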

  • Cache propagation latency is measured as the time from when data was received on the Observer to when the Observer cache was updated.

  • Resource propagation latency is measured as the time from when the application server made the announcement to when the Observer cache was updated, representing the full end-to-end write latency.

On the application server side, a metric tracked the server announcement mode to accurately determine whether the server was announcing to Zookeeper only, dual write, or only Next-Gen Service Discovery. This allowed LinkedIn to understand if all instances of a server application had fully adopted a new stage.

See the diagram below:

LinkedIn also monitored end-to-end propagation latency, measuring the time from when an application server made an announcement to when a client host received the update. They built a dashboard to measure this across all client-server pairs daily, monitoring for P50 less than 1 second and P99 less than 5 seconds. P50 means that 50 percent of clients received the propagated data within that time, and P99 means 99 percent received it within that time.

Automated Dependency Analysis

The safest approach for write migration would be waiting until all client applications are migrated to Next-Gen Service Discovery Read and all Zookeeper-reading code is cleaned up before stopping Zookeeper announcements. However, with limited Zookeeper capacity and the urgency to avoid outages, LinkedIn needed to begin write migration in parallel with client application migration.

LinkedIn built cron jobs to analyze Zookeeper watchers set on the Zookeeper data of each application and list the corresponding reader applications. A watcher is a mechanism where clients register interest in data changes. When data changes, Zookeeper notifies all watchers. These jobs generated snapshots of watcher status at short intervals, catching even short-lived readers like offline jobs. The snapshots were aggregated into daily and weekly reports.

These reports identified applications with no readers on Zookeeper in the past two weeks, which LinkedIn set as the criteria for applications becoming ready to start Next-Gen Service Discovery Write. The reports also showed top blockers, meaning reader applications blocking the most server hosts from migrating, and top applications being blocked, identifying the largest applications unable to migrate, and which readers were blocking them.

This information helped LinkedIn prioritize focus on the biggest blockers for migration to Next-Gen Service Discovery Read. Additionally, the job could catch any new client that started reading server applications already migrated to Next-Gen Service Discovery Write and send alerts, allowing prompt coordination with the reader application owner for migration or troubleshooting.

Conclusion

The Next-Gen Service Discovery system achieved significant improvements over the Zookeeper-based architecture.

The system now handles the company-wide fleet of hundreds of thousands of application instances in one data center with data propagation latency of P50 less than 1 second and P99 less than 5 seconds. The previous Zookeeper-based architecture experienced high latency and unavailability incidents frequently, with data propagation latency of P50 less than 10 seconds and P99 less than 30 seconds.

This represents a tenfold improvement in median latency and a sixfold improvement in 99th percentile latency. The new system not only safeguards platform reliability at massive scale but also unlocks future innovations in centralized load balancing, service mesh integration, and cross-fabric resiliency.

Next-Gen Service Discovery marks a foundational transformation in LinkedIn’s infrastructure, changing how applications discover and communicate with each other across global data centers. By replacing the decade-old Zookeeper-based system with a Kafka and xDS-powered architecture, LinkedIn achieved near real-time data propagation, multi-language compatibility, and true horizontal scalability.


How Yelp Built “Yelp Assistant”

2026-02-10 00:31:03

How to stop bots from abusing free trials (Sponsored)

Free trials help AI apps grow, but bots and fake accounts exploit them. They steal tokens, burn compute, and disrupt real users.

Cursor, the fast-growing AI code assistant, uses WorkOS Radar to detect and stop abuse in real time. With device fingerprinting and behavioral signals, Radar blocks fraud before it reaches your app.

Protect your app today →


You open an app with one specific question in mind, but the answer is usually hidden in a sea of reviews, photos, and structured facts. Modern content platforms are information-rich, though surfacing direct answers can still be a challenge. A good example is Yelp business pages. Imagine you are deciding where to go and you ask “Is the patio heated?”. The page might contain the answer in a couple of reviews, a photo caption, or an attribute field, but you still have to scan multiple sections to piece it together.

A common way to solve this is to integrate an AI assistant inside the app. The assistant retrieves the right evidence and turns it into a single direct answer with citations to the supporting snippets.

Figure 1: AI Assistant in Yelp business pages

This article walks through what it takes to ship a production-ready AI assistant using Yelp Assistant on business pages as a concrete case study. We’ll cover the engineering challenges, architectural trade-offs, and practical lessons from the development of the Yelp Assistant.

Note: This article is written in collaboration with Yelp. Special thanks to the Yelp team for sharing details with us about their work and for reviewing the final article before publication.

High-Level System Design

To deliver answers that are both accurate and cited, we cannot rely on an LLM’s internal knowledge alone. Instead, we use Retrieval-Augmented Generation (RAG).

RAG decouples the problem into two distinct phases: retrieval and generation, supported by an offline indexing pipeline that prepares the knowledge store.

The development of a RAG system starts with an indexing pipeline, which builds a knowledge store from raw data offline. Upon receiving a user query, the retrieval system scans this store using both lexical search for keywords and semantic search for intent to locate the most relevant snippets. Finally, the generation phase feeds these snippets to the LLM with strict instructions to answer solely based on the provided evidence and to cite specific sources.

Citations are typically produced by having the model output citation markers that refer to specific snippets. For example, if the prompt includes snippets with IDs S1, S2, and S3, the model might generate “Yes, the patio is heated” and attach markers like [S1] and [S3]. A citation resolution step then maps those markers back to the original sources, such as a specific review excerpt, photo caption, or attribute field, and formats them for the UI. Finally, citations are verified to ensure every emitted citation maps to real retrievable content.
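A minimal citation-resolution step might look like the sketch below: it extracts marker strings such as [S1], maps them back to retrieved snippets, and strips any marker that does not correspond to real content. The marker format and snippet structure are assumptions for illustration.

```python
# Sketch of citation resolution: map "[S1]"-style markers back to retrieved
# snippets and drop markers that point at nothing retrievable.
import re

def resolve_citations(answer, snippets_by_id):
    citations, verified_answer = [], answer
    for marker in re.findall(r"\[(S\d+)\]", answer):
        snippet = snippets_by_id.get(marker)
        if snippet is None:
            # Hallucinated marker: strip it rather than show a dead citation.
            verified_answer = verified_answer.replace(f"[{marker}]", "")
        elif snippet not in citations:
            citations.append(snippet)
    return verified_answer.strip(), citations

snippets = {
    "S1": {"source": "review", "text": "They turned on the patio heaters at night."},
    "S3": {"source": "attribute", "text": "Outdoor seating: yes"},
}
answer = "Yes, the patio is heated [S1][S3][S9]."
print(resolve_citations(answer, snippets))
```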

While this system is enough for a prototype, a production system requires additional layers for reliability, safety, and performance. The rest of this article uses the Yelp Assistant as a case study to explore the real-world engineering challenges of building this at scale and the mitigations to solve them.


Live Webinar: Designing for Failure and Speed in Agentic Workflows with FeatureOps (Sponsored)

AI can write code in seconds. You’re the one who gets paged at 2am when production breaks.

As teams adopt agentic workflows, features change faster than humans can review them. When an AI-written change misbehaves, redeploying isn’t fast enough, rollbacks aren’t clean, and you’re left debugging decisions made by your AI overlord.

In this tech talk, we’ll show FeatureOps patterns to stay in control at runtime, stop bad releases instantly, limit blast radius, and separate deployment from exposure.

Led by Alex Casalboni, Developer Advocate at Unleash, who spent six years at AWS seeing the best and worst of running applications at scale.

Register now


Data Strategy: From Prototype to Production

A robust data strategy determines what content the assistant can retrieve and how quickly it stays up to date. The standard pipeline consists of three stages. It begins with data sourcing, where we select the necessary inputs, such as reviews or business hours, and define update contracts. Next is ingestion, which cleans and transforms these raw feeds into a trusted canonical format. Finally, indexing transforms these records into retrieval-ready documents using keyword or vector search signals so the system can filter to the right business scope.

Setting up a data pipeline for a demo is usually simple. For example, Yelp’s early prototype relied on ad hoc batch dumps loaded into a Redis snapshot which effectively treated each business as a static bundle of content.

In production, this approach collapses because content changes continuously and the corpus grows without bound. A stale answer regarding operating hours is worse than no answer at all, and a single generic index struggles to find specific needle-in-the-haystack facts as the data volume explodes. To meet the demands of high query volume and near real-time freshness, Yelp evolved their data strategy through four key architectural shifts.

1. Freshness

Treating every data source as real-time makes ingestion expensive to operate while treating everything as a weekly batch results in stale answers. Yelp set explicit freshness targets based on the content type. They implemented streaming ingestion for high-velocity data like reviews and business attributes to ensure updates appear within 10 minutes. Conversely, they used a weekly batch pipeline for slow-moving sources like menus and website text. This hybrid approach ensures a user asking “Is it open?” gets the latest status without wasting resources streaming static content.

2. Data Separation

Not all questions should be answered the same way. Some require searching through noisy text while others require a single precise fact. Treating everything as generic text makes retrieval unreliable; it allows anecdotes in reviews to override canonical fields like operating hours.

Yelp replaced the single prototype Redis snapshot with two distinct production stores. Unstructured content like reviews and photos serves through search indices to maximize relevance. Structured facts like amenities and hours live in a Cassandra database using an Entity-Attribute-Value layout.

This separation prevents hallucinated facts and makes schema evolution significantly simpler. Engineers can add new attributes such as EV charging availability without running migrations.

3. Hybrid Photo Retrieval

Photos can be retrieved using only captions, only image embeddings, or a combination of both. Caption-only retrieval fails when captions are missing, too short, or phrased differently than the user’s question. Embedding-only retrieval can miss literal constraints like exact menu item names or specific terms the user expects to match.

Yelp bridged this gap by implementing hybrid retrieval. The system ranks photos using both caption text matches and image embedding similarity. If a user asks about a heated patio, the system can retrieve relevant evidence whether the concept is explicitly written as “heaters” in the caption or simply visible as a heat lamp in the image itself.
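A simple way to picture this is a blended score, as in the sketch below: a lexical score from caption overlap combined with an embedding similarity score. The 50/50 weighting, the toy cosine similarity, and the term-overlap caption score are assumptions; Yelp's production ranker is more involved.

```python
# Sketch of hybrid photo retrieval: blend a lexical caption-match score
# with an image-embedding similarity score.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def caption_score(query, caption):
    query_terms = set(query.lower().split())
    caption_terms = set(caption.lower().split())
    return len(query_terms & caption_terms) / max(len(query_terms), 1)

def rank_photos(query, query_embedding, photos, alpha=0.5):
    scored = []
    for photo in photos:
        lexical = caption_score(query, photo["caption"])
        semantic = cosine(query_embedding, photo["embedding"])
        scored.append((alpha * lexical + (1 - alpha) * semantic, photo))
    return [p for _, p in sorted(scored, key=lambda s: s[0], reverse=True)]

photos = [
    {"caption": "patio heaters at dusk", "embedding": [0.9, 0.1]},
    {"caption": "", "embedding": [0.8, 0.2]},   # no caption, image can still match
]
print(rank_photos("heated patio", [0.85, 0.15], photos)[0]["caption"])
```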

4. Unified Serving

Splitting data across search indices and databases improves quality but can hurt latency. A single answer might require a read for hours, a query for reviews, and another query for photos. These separate network calls add up and force the assistant logic to manage complex data fetching.

Yelp solved this by placing a Content Fetching API in front of all retrieval stores. This abstraction handles the complexity of parallelizing backend reads and enforcing latency budgets. The result is a consistent response format that keeps the 95th percentile latency under 100 milliseconds and decouples the assistant logic from the underlying storage details. The following figure summarizes the data sources and any special handling for each one.
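As a rough sketch of the fan-out-with-budget idea, the asyncio snippet below fans out to two stores in parallel and enforces a latency budget so one slow backend cannot stall the whole answer. The store functions, their latencies, and the 100 millisecond budget are stand-ins, not Yelp's actual API.

```python
# Sketch of a Content Fetching API front-end: parallel backend reads with a
# per-call latency budget and graceful degradation on timeout.
import asyncio

# Hypothetical store clients; real ones would hit search and Cassandra backends.
async def search_reviews(business_id, query):
    await asyncio.sleep(0.02)
    return [{"id": "S1", "text": "They turn on the patio heaters."}]

async def lookup_facts(business_id):
    await asyncio.sleep(0.25)          # simulate a slow backend
    return {"outdoor_seating": True}

async def fetch_with_budget(coro, budget_s):
    try:
        return await asyncio.wait_for(coro, timeout=budget_s)
    except asyncio.TimeoutError:
        return None                    # degrade gracefully instead of stalling

async def fetch_content(business_id, query, budget_s=0.1):
    reviews, facts = await asyncio.gather(
        fetch_with_budget(search_reviews(business_id, query), budget_s),
        fetch_with_budget(lookup_facts(business_id), budget_s),
    )
    return {"reviews": reviews, "facts": facts}

print(asyncio.run(fetch_content("biz-123", "heated patio")))
```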

Inference Pipeline

Prototypes often prioritize simplicity by relying on a single large model for everything. The backend stuffs all available content such as menus and reviews into one massive prompt, forcing the model to act as a slow and expensive retrieval engine. Yelp followed this pattern in early demos. If a user asked, “Is the patio heated?”, the model had to read the entire business bundle to find a mention of heaters.

While this works for a demo, it collapses under real traffic. Excessive context leads to less relevant answers and high latency, while the lack of guardrails leaves the system vulnerable to adversarial attacks and out-of-scope questions that waste expensive compute.

To move from a brittle prototype to a robust production system, Yelp deconstructed the monolithic LLM into several specialized models to ensure safety and improve retrieval quality.

1. Retrieval

Yelp separated “finding evidence” from “writing the answer.” Instead of sending the entire business bundle to the model, the system queries near real-time indices to retrieve only the relevant snippets. For a question like “Is the patio heated?”, the system retrieves specific reviews mentioning “heaters” and the outdoor seating attribute. The LLM then generates a concise response based solely on that evidence, citing its sources.

2. Content Source Selection

Retrieval alone isn’t enough if you search every source by default. Searching menus for “ambiance” questions or searching reviews for “opening hours” introduces noise that confuses the model.

Yelp fixed this with a dedicated selector. A Content Source Selector analyzes the intent and outputs only the relevant stores. This enables the system to route inputs like “What are the hours?” to structured facts and “What is the vibe?” to reviews.

This routing also serves as conflict resolution if sources disagree. Yelp found it works best to default to authoritative sources like business attributes or website information for objective facts, and to rely on reviews for subjective, experience-based questions.

3. Keyword Generation

Users rarely use search-optimized keywords. They ask incomplete questions such as “vibe?” or “good for kids?” that fail against exact-match indices.

Yelp introduced a Keyword Generator, a fine-tuned GPT-4.1-nano model, that translates user queries into targeted search terms. For example, “vegan options” might generate keywords like “plant-based” or “dairy-free”. When the user’s prompt is broad, the Keyword Generator is trained to emit no keywords to avoid producing misleading keywords.

4. Input Guardrails

Before any retrieval happens, the system must decide if it should answer. Yelp uses two classifiers: Trust & Safety to block adversarial inputs and Inquiry Type to redirect out-of-scope questions like “Change my password” to the correct support channels.

Building this pipeline required a shift in training strategy. While prompt engineering a single large model works for prototypes, it proved too brittle for production traffic where user phrasing varies wildly. Yelp adopted a hybrid approach:

  • Fine-tuning for question analysis: They fine-tuned small and efficient models (GPT-4.1-nano) for the question analysis steps including Trust and Safety, Inquiry Type, and Source Selection. These small models achieved lower latency and higher consistency than prompting a large generic model.

  • Prompting for final generation: For the final answer where nuance and tone are critical, they stuck with a powerful generic model (GPT-4.1). Small fine-tuned models struggled to synthesize multiple evidence sources effectively, making the larger model necessary for the final output.

Serving Efficiency

Prototypes usually handle each request as one synchronous blocking call. The system fetches content, builds a prompt, waits for the full model completion, and then returns one response payload. This workflow is simple but generally not optimized for latency or cost. Consequently, it becomes slow and expensive at scale.

Yelp optimized serving to reduce latency from over 10 seconds in prototypes to under 3 seconds in production. Key techniques include:

  • Streaming: In a synchronous prototype, users stare at a blank screen until the full answer is ready. Yelp migrated to FastAPI to support Server-Sent Events (SSE), allowing the UI to render text token-by-token as it generates. This significantly reduced the perceived wait time (Time-To-First-Byte). A minimal SSE sketch follows this list.

  • Parallelism: Serial execution wastes time. Yelp built asynchronous clients to run independent tasks concurrently. Question analysis steps run in parallel, as do data fetches from different stores (Lucene for text, Cassandra for facts).

  • Early Stopping: If the Trust & Safety classifier flags a request, the system immediately cancels all downstream tasks. This prevents wasting compute and retrieval resources on blocked queries.

  • Tiered Models: Running a large model for every step is slow and expensive. By restricting the large model (GPT-4o) to the final generation step and using fast fine-tuned models for the analysis pipeline, Yelp reduced costs and improved inference speed by nearly 20%.
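Here is the minimal SSE sketch referenced in the streaming bullet above. The token generator stands in for a streaming model client, and the route name and event framing are illustrative.

```python
# Minimal sketch of token streaming over Server-Sent Events with FastAPI.
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(question: str):
    # Stand-in for a streaming LLM client; yields SSE-framed tokens.
    for token in ["Yes,", " the", " patio", " is", " heated."]:
        await asyncio.sleep(0.05)          # simulate per-token model latency
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

@app.get("/assistant")
async def assistant(question: str):
    return StreamingResponse(generate_tokens(question),
                             media_type="text/event-stream")
```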

Together, these techniques helped Yelp build a faster, more responsive system. At p50, the latency breakdown is:

  • Question analysis: ~1.4s

  • Retrieval: ~0.03s

  • Time to first byte: ~0.9s

  • Full answer generation: ~3.5s

Evaluation

In a prototype, evaluation is usually informal: developers try a handful of questions and tweak prompts until the result feels right. This approach is fragile because it only tests anticipated cases and often misses how real users phrase ambiguous queries. In production, failures show up as confident hallucinations or technically correct but unhelpful replies. Yelp observed this directly when their early prototype's voice swung between overly formal and overly casual depending on slight wording changes.

A robust evaluation system must separate quality into distinct dimensions that can be scored independently. Yelp defined six primary dimensions. They rely on an LLM-as-a-judge system where a specialized grader evaluates a single dimension using a strict rubric. For example, the Correctness grader reviews the answer against retrieved snippets and assigns a label like “Correct” or “Unverifiable”.

The key learning from Yelp is that subjective dimensions like Tone and Style are difficult to automate reliably. While logical metrics like Correctness are easy to judge against evidence, tone is an evolving contract between the brand and the user. Rather than forcing an unreliable automated judge early, Yelp tackled this by co-designing principles with their marketing team and enforcing them via curated few-shot examples in the prompt.
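A single-dimension grader can be sketched as below. The rubric wording, label set, judge model, and use of the OpenAI Python client are assumptions; the article only states that each grader scores one dimension against a strict rubric.

```python
# Sketch of a single-dimension LLM-as-a-judge grader for Correctness.
from openai import OpenAI

client = OpenAI()

CORRECTNESS_RUBRIC = """You are grading Correctness only.
Label the answer as one of: Correct, Partially Correct, Unverifiable, Incorrect.
Judge strictly against the retrieved snippets; outside knowledge does not count."""

def grade_correctness(question, answer, snippets):
    evidence = "\n".join(f"[{s['id']}] {s['text']}" for s in snippets)
    response = client.chat.completions.create(
        model="gpt-4.1-mini",   # placeholder judge model
        messages=[
            {"role": "system", "content": CORRECTNESS_RUBRIC},
            {"role": "user", "content":
                f"Question: {question}\nAnswer: {answer}\nSnippets:\n{evidence}\n"
                "Respond with the label only."},
        ],
    )
    return response.choices[0].message.content.strip()
```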

Unique Challenges, War Stories, and Learned Lessons

Most teams can get a grounded assistant to work for a demo. The difficult part is engineering a system that stays fresh, fast, safe, and efficient under real traffic. Below are the key lessons from the journey to production.

1. Retrieval is never done. Keyword retrieval is often the fastest path to a shippable product because it leverages existing search infrastructure. However, in production, new question types and wordings keep appearing. These will expose gaps in your initial retrieval logic. You must design retrieval so you can evolve it without rewriting the whole pipeline. You start with keywords for high-precision intents (brands, locations, technical terms, many constraints), then add embeddings for more exploratory questions, and keep tuning based on log failures.

2. Prompt bloat silently erases cost wins. As you fix edge cases regarding tone, refusals, and citation formatting, the system prompt inevitably grows. Even if you optimize your retrieved context, this prompt growth can overwrite those savings. Treat prompts as code. Version them, review them, and track token counts and cost impact. Prefer modular prompt chunks and assemble them dynamically at runtime. Maintain an example library and retrieve only the few-shot examples that match the current case. Do not keep every example in the static prompt. Yelp relies on dynamic prompt composition that includes only the relevant instructions and examples for the detected question type. This keeps the prompt lean and focused.

3. Build Modular Guardrails. After launch, users will push every boundary. They ask for things you did not design for, try to bypass instructions, and shift their behavior over time. This includes unsafe requests, out of scope questions, and adversarial prompts. Trying to catch all of this with a single “safety check” becomes impossible to maintain. Instead, split guardrails into small tasks. Each task should have a clear decision and label set. Run these checks in parallel and give them the authority to cancel downstream work. If a check fails, the system should return immediately with the right response without paying for retrieval or generation.
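A minimal version of that pattern, sketched with asyncio, might look like this. The two checks stand in for classifiers such as Trust & Safety and Inquiry Type, and the timings and labels are illustrative.

```python
# Sketch of modular guardrails run in parallel, with authority to cancel
# downstream work before retrieval or generation is paid for.
import asyncio

async def trust_and_safety(question):
    await asyncio.sleep(0.05)
    return "blocked" if "ignore previous instructions" in question.lower() else "ok"

async def inquiry_type(question):
    await asyncio.sleep(0.05)
    return "out_of_scope" if "password" in question.lower() else "business_question"

async def answer_pipeline(question):
    # Kick off retrieval optimistically; cancel it if a guardrail fires.
    retrieval = asyncio.create_task(asyncio.sleep(1, result="retrieved evidence"))
    safety, intent = await asyncio.gather(trust_and_safety(question),
                                          inquiry_type(question))
    if safety == "blocked" or intent == "out_of_scope":
        retrieval.cancel()                 # stop paying for downstream work
        return "Sorry, I can't help with that here."
    return await retrieval

print(asyncio.run(answer_pipeline("Change my password")))
```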

Conclusion

The Yelp Assistant on business pages is built as a multi-stage evidence-grounded system rather than a monolithic chatbot. The key takeaway is that the gap between a working prototype and a production assistant is substantial. Closing this gap requires more than just a powerful model. It requires a complete engineering system that ensures data stays fresh, answers remain grounded, and behavior stays safe.

Looking ahead, Yelp is focused on stronger context retention in longer multi-turn conversations, better business-to-business comparisons, and deeper use of visual language models to reason over photos more directly.

EP201: The Evolution of AI in Software Development

2026-02-08 00:30:23

✂️ Cut your QA cycles down to minutes with QA Wolf (Sponsored)

If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.

QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.

QA Wolf takes testing off your plate. They can get you:

  • Unlimited parallel test runs for mobile and web apps

  • 24-hour maintenance and on-demand test creation

  • Human-verified bug reports sent directly to your team

  • Zero flakes guarantee

The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.

With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.

Schedule a demo to learn more


This week’s system design refresher:

  • 9 AI Concepts Explained in 7 minutes (YouTube video)

  • The Evolution of AI in Software Development

  • Git pull vs. git fetch

  • Agentic Browsers Workflow

  • [Subscriber Exclusive] Become an AI Engineer - Cohort 4


9 AI Concepts Explained in 7 minutes


The Evolution of AI in Software Development

AI has fundamentally changed how engineers code. This shift can be described in three waves.

  1. General-purpose LLMs (chat assistants)
    Treating general-purpose LLMs like a coding partner: you copied code into ChatGPT, asked why it was wrong, and manually applied the fix. This helped engineers move faster, but the workflow was slow and manual.

  2. Coding LLMs (autocompletes)
    Tools like Copilot and Cursor Tab brought AI into the editor. As you type, a coding model suggests the next few tokens and you accept or reject. It speeds up typing, but it cannot handle repo-level tasks.

  3. Coding Agents
    Coding agents handle tasks end-to-end. You ask “refactor my code”, and the agent searches the repo, edits multiple files, and iterates until tests pass. This is where most capable tools such as Claude Code and OpenAI’s Codex focus today.

Over to you: What do you think will be the next wave?


Git pull vs. git fetch

If you have ever mixed up “git pull” and “git fetch”, you’re not alone; even experienced developers get these two commands wrong. They sound similar, but under the hood, they behave very differently.

Let’s see how each command updates your repository:

  • Initial state: Your local repo is slightly behind the remote. The remote has new commits (R3, R4, R5), while your local “main” still ends at L3.

  • What git fetch actually does: git fetch downloads the new commits without touching your working branch. It only updates “origin/main”.

    Think of it as: “Show me what changed, but don’t apply anything yet.”

  • What git pull actually does: git pull is a combination of “fetch + merge” commands. It downloads the new commits and immediately merges them into your local branch.

    This is the command that updates both “origin/main” and your local “main”.

    Think of it as: “Fetch updates and apply them now.”

Over to you: Which one do you use more often, “git pull” or “git fetch”?


How Agentic Browsers like OpenAI’s Atlas or Perplexity Comet Work at a High Level

Agentic browsers embed an agent that can read webpages and take actions in your browser.

Most agentic browsers have four major layers.

  1. Perception layer: Converts the current UI into model input. It starts with an accessibility tree snapshot. If the tree is incomplete or ambiguous, the agent takes a screenshot, sends it to a vision model (for example, Gemini Pro) to extract UI elements into a structured form, then uses that result to decide the next action.

  2. Reasoning layer: Uses specialized agents for read-only browsing, navigation, and data entry. Separating roles improves reliability and lets you apply safety rules per agent.

  3. Security layer: Enforces domain allowlisting and deterministic boundaries, such as restricted actions and confirmation steps, to reduce prompt injection risk.

  4. Execution layer: Runs browser tools (click, type, upload, navigate, screenshot, tab operations) and refreshes state after each step.

Over to you: Do you think agentic browsers are reliable enough to be used at scale?


[Subscriber Exclusive] Become an AI Engineer - Cohort 4

We are excited to announce Cohort 4 of Becoming an AI Engineer.

Because you’re part of this newsletter community, you get an exclusive discount not available anywhere else.

A one-time 40% discount. Code expires next Friday

Use code: BBGNL

Enroll Here

This is not just another course about AI frameworks and tools. Our goal is to help engineers build the foundation and end to end skill set needed to thrive as AI engineers.

Here’s what makes this cohort special:

  • Learn by doing: Build real world AI applications, not just by watching videos.

  • Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.

  • Live feedback and mentorship: Get direct feedback from instructors and peers.

  • Community driven: Learning alone is hard. Learning with a community is easy!

We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.

If you want to start learning AI from scratch, this is the perfect time to begin.

Dates: Feb 21 to March 29, 2026

Enroll Here



[Subscriber Exclusive] Become an AI Engineer - Cohort 4

2026-02-07 00:31:56

We are excited to announce Cohort 4 of Becoming an AI Engineer.

Because you’re part of this newsletter community, you get an exclusive discount not available anywhere else.

A one-time 40% discount. Code expires next Friday

Use code: BBGNL

Enroll Here

This is not just another course about AI frameworks and tools. Our goal is to help engineers build the foundation and end to end skill set needed to thrive as AI engineers.

Here’s what makes this cohort special:

  • Learn by doing: Build real world AI applications, not just by watching videos.

  • Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.

  • Live feedback and mentorship: Get direct feedback from instructors and peers.

  • Community driven: Learning alone is hard. Learning with a community is easy!

We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.

If you want to start learning AI from scratch, this is the perfect time to begin.

Dates: Feb 21 to March 29, 2026

Enroll Here

Top Authentication Techniques to Build Secure Applications

2026-02-06 00:30:40

Authentication serves as the first line of defense in ensuring the security of applications and the sensitive data they handle. Whether it’s a personal banking app, a corporate platform, or an e-commerce website, effective authentication mechanisms are needed to verify the identity of users and safeguard their access to resources.

Without proper authentication, applications are vulnerable to unauthorized access, data breaches, and malicious attacks, potentially resulting in significant financial loss, reputational damage, and privacy violations.

In addition to security, authentication plays a critical role in the user experience. By identifying users, applications can provide personalized services, remember user preferences, and enable functionalities like Single Sign-On (SSO) across platforms.

With evolving threats, implementing secure and efficient authentication is more challenging than ever. Developers must navigate between competing priorities such as security (ensuring protection against different attack types like session hijacking, token theft, and replay attacks), scalability (supporting millions of users without compromising performance), and user experience (maintaining ease of use while applying strong security measures).

To tackle these challenges, developers rely on various authentication techniques. In this article, we’ll explore multiple authentication techniques used in applications and understand their advantages and disadvantages.

Fundamentals of Authentication

Read more

A Guide to Effective Prompt Engineering

2026-02-05 00:32:00

Unblocked is the AI code review with judgment of your best engineer. (Sponsored)

Most AI code review tools analyze the diff. Sometimes the file, occasionally the repo.

Experienced engineers work differently. They remember that Slack thread that explains why this database pattern exists. They know David on the platform team has strong opinions about error handling. They’ve internalized dozens of unwritten conventions.

Unblocked is the only AI code review tool that uses deep insight of your codebase, docs, and discussions to give high-signal feedback based on how your system actually works – instead of flooding your PR with stylistic nitpicks.

“Unblocked has reversed my AI fatigue completely. The level of precision is wild.” - Senior developer, Clio

Try now for free


Prompt engineering is the process of crafting instructions that guide AI language models to generate desired outcomes. At first glance, it might seem straightforward. We simply tell the AI what we want, and it delivers. However, anyone who has worked with these models quickly discovers that writing effective prompts is more challenging than it appears.

The ease of getting started with prompt engineering can be misleading.

While anyone can write a prompt, not everyone can write one that consistently produces high-quality results. Think of it as the difference between being able to communicate and being able to communicate effectively. The fundamentals are accessible, but mastery requires practice, experimentation, and understanding how these models process information.

In this article, we will look at the core techniques and best practices for prompt engineering. We will explore different prompting approaches, from simple zero-shot instructions to advanced chain-of-thought reasoning.

What Makes a Good Prompt

A prompt typically consists of several components:

  • The task description explains what we want the model to do, including any role or persona we want it to adopt.

  • The context provides necessary background information.

  • Examples demonstrate the desired behavior or format.

  • Finally, the concrete task is the specific question to answer or action to perform.

Most model APIs allow us to split prompts into system prompts and user prompts.

System prompts typically contain task descriptions and role-playing instructions that shape how the model behaves throughout the conversation.

On the other hand, user prompts contain the actual task or question. For instance, if we are building a chatbot that helps buyers understand property disclosures, the system prompt might instruct the model to act as an experienced real estate agent, while the user prompt contains the specific question about a property.
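A minimal sketch of that split, using the OpenAI Python client, might look like this; the model name is a placeholder and the disclosure question is illustrative.

```python
# Sketch of the system/user prompt split with the OpenAI Python client.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        # System prompt: task description and persona, fixed for the session.
        {"role": "system", "content":
            "You are an experienced real estate agent. Help buyers understand "
            "property disclosures in plain language."},
        # User prompt: the concrete question for this turn.
        {"role": "user", "content":
            "The disclosure mentions 'prior foundation repair'. "
            "What should I ask the seller?"},
    ],
)
print(response.choices[0].message.content)
```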

See the diagram below:

Clarity is the key to effective prompting. Just as clear communication helps humans understand what we need, specific and unambiguous instructions help AI models generate appropriate responses. We should explain exactly what we want, define any scoring systems or formats we expect, and eliminate assumptions about what the model might already know.

Context is equally important. Providing relevant information helps models perform better and reduces hallucinations. If we want the model to answer questions about a research paper, including that paper in the context will significantly improve response quality. Without sufficient context, the model must rely on its internal knowledge, which may be outdated or incorrect.


GitHub Copilot: Innovate Faster with AI, Wherever You Code (Sponsored)

Meet GitHub Copilot. Accelerate software innovation on any platform or code repository with GitHub Copilot, the agentic AI software development tool that meets you where you are.

With GitHub Copilot your team can:

  • Plan, build, and deploy with transformed AI-powered workflows.

  • Use agentic capabilities to tackle hard tasks: spec-driven development, docs generation, testing, and app modernization/migration.

  • Integrate GitHub Copilot anywhere: your teams, your toolchain, with flexible plans for agentic workflows.

Start building with GitHub Copilot today


How Models Process Prompts

In-context learning is the fundamental mechanism that makes prompt engineering work.

This term refers to a model’s ability to learn new behaviors from examples provided in the prompt itself, without requiring any updates to the model’s weights. When we show a model examples of how to perform a task, it can adapt its responses to match the pattern we have demonstrated.

Models are typically better at understanding instructions at the beginning and end of prompts than in the middle. This phenomenon, often described as the “lost in the middle” problem and probed with “needle in a haystack” tests, means we should place the most important information at strategic positions in our prompts.

The number of examples needed depends on both the model’s capability and the task’s complexity. Stronger models generally require fewer examples to understand what we want. For simpler tasks, powerful models might not need any examples at all. For domain-specific applications or complex formatting requirements, providing several examples can make a significant difference.

Core Prompting Techniques

Let’s look at some key prompting techniques:

Technique 1: Zero-Shot Prompting

Zero-shot prompting means giving the model instructions without providing any examples. In this approach, we simply describe what we want, and the model attempts to fulfill the request based on its training.

This technique works best for straightforward tasks where the desired output is clear from the instructions alone. For example, “Translate the following text to French” or “Summarize this article in three sentences” are both effective zero-shot prompts.

The main advantage of zero-shot prompting is efficiency. It uses fewer tokens, which reduces costs and latency. The prompts are also simpler to write and maintain. However, zero-shot prompting has limitations. When we need specific formatting or behavior that differs from the model’s default responses, zero-shot prompts may not be sufficient.

Best practices for zero-shot prompting include being as explicit as possible about what we want, specifying the output format clearly, and stating any constraints or requirements upfront. If the model’s initial response is not what we expected, we should revise the prompt to add more detail rather than immediately jumping to few-shot examples.

Technique 2: Few-Shot Prompting

Few-shot prompting involves providing examples that demonstrate how we want the model to respond. One-shot prompting uses a single example, while few-shot typically means two to five or more examples.

This technique is valuable when we need specific formatting or when the desired behavior might be ambiguous from instructions alone. For instance, if we are building a bot to talk to young children and want it to respond to questions about fictional characters in a particular way, showing an example helps the model understand the expected tone and approach.

Consider this comparison. Without an example, if a child asks, “Will Santa bring me presents on Christmas?”, a model might explain that Santa Claus is fictional. However, if we provide an example like “Q: Is the tooth fairy real? A: Of course! Put your tooth under your pillow tonight,” the model learns to maintain the magical perspective appropriate for young children.

The number of examples matters. More examples generally lead to better performance, but we are limited by context length and cost considerations. For most applications, three to five examples strike a good balance. We should experiment to find the optimal number for our specific use case.

When formatting examples, we can save tokens by choosing efficient structures. For instance, “pizza -> edible” uses fewer tokens than “Input: pizza, Output: edible” while conveying the same information. These small optimizations add up, especially when working with multiple examples.
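A compact few-shot prompt using that input -> output format might be assembled like this; the classification task and model name are placeholders.

```python
# Sketch of a compact few-shot prompt using the "input -> output" format.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """Classify each item as edible or inedible.
pizza -> edible
rock -> inedible
apple -> edible"""

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": FEW_SHOT},
        {"role": "user", "content": "cardboard ->"},
    ],
)
print(response.choices[0].message.content)   # expected: inedible
```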

Technique 3: Chain-of-Thought Prompting

Chain-of-thought prompting, often abbreviated as CoT, involves explicitly asking the model to think step by step before providing an answer. This technique encourages systematic problem-solving and has been shown to significantly improve performance on complex reasoning tasks.

The simplest implementation is adding phrases like “think step by step” or “explain your reasoning” to our prompts. The model then works through the problem methodically, showing its reasoning process before arriving at a conclusion.

CoT often improves model performance across various benchmarks, particularly for mathematical problems, logic puzzles, and multi-step reasoning tasks. CoT also helps reduce hallucinations because the model must justify its answers with explicit reasoning steps.

We can implement CoT in several ways. Zero-shot CoT simply adds a reasoning instruction to our prompt. We can also specify the exact steps we want the model to follow, or provide examples that demonstrate the reasoning process. The variation depends on the specific application and how much control we need over the reasoning structure.

The trade-off with CoT is increased latency and cost. The model generates more tokens as it works through its reasoning, which takes more time and increases API costs. For complex tasks where accuracy is critical, this trade-off is usually worthwhile.

Technique 4: Role Prompting

Role prompting assigns a specific persona or area of expertise to the model. By telling the model to adopt a particular role, we influence the perspective and style of its responses.

For example, if we ask a model to score a simple essay like “Summer is the best season. The sun is warm. I go swimming. Ice cream tastes good in summer,” it might give a low score based on general writing standards. However, if we first instruct the model to adopt the persona of a first-grade teacher, it will evaluate the essay from that perspective and likely assign a higher, more appropriate score.

Role prompting is particularly effective for customer service applications, educational content, creative writing, and any scenario where the context or expertise level matters. The model can adjust its vocabulary, level of detail, and approach based on the assigned role.

When using role prompting, we should be specific about the role and any relevant characteristics. Rather than just saying “act as a teacher,” we might say “act as an encouraging first-grade teacher who focuses on effort and improvement.” The more specific we are, the better the model can embody that perspective.
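A common way to assign the role is through the system message, as in the sketch below. The persona wording mirrors the first-grade-teacher example above; the client, model name, and scoring scale are assumptions made for the sketch.

```python
from openai import OpenAI

client = OpenAI()

essay = (
    "Summer is the best season. The sun is warm. "
    "I go swimming. Ice cream tastes good in summer."
)

# Role prompting: the system message sets a specific, detailed persona
# that shapes how the model evaluates the essay.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "You are an encouraging first-grade teacher who focuses on "
                "effort and improvement. Score essays on a 1-to-5 scale for "
                "a six-year-old's writing level and explain the score kindly."
            ),
        },
        {"role": "user", "content": f"Please score this essay:\n\n{essay}"},
    ],
)
print(response.choices[0].message.content)
```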

Technique 5: Prompt Chaining and Decomposition

Prompt chaining involves breaking complex tasks into smaller, manageable subtasks, each with its own prompt. Instead of handling everything in one giant prompt, we create a series of simpler prompts and chain them together.

Consider a customer support chatbot. The process of responding to a customer request can be decomposed into two main steps:

  • Classify the intent of the request

  • Generate an appropriate response based on that intent

The first prompt focuses solely on determining whether the customer needs billing help, technical support, account management, or general information. Based on that classification, we then use a second, specialized prompt to generate the actual response.
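A sketch of this two-step chain is shown below. The intent labels, prompt wording, and call_llm helper are illustrative; in practice, the classification step could run on a smaller, cheaper model than the response step.

```python
from openai import OpenAI

client = OpenAI()

def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Illustrative helper; the model name is a placeholder."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def handle_request(customer_message: str) -> str:
    # Step 1: classify intent with a narrow, cheap prompt.
    intent = call_llm(
        "Classify the customer message into exactly one label: "
        "billing, technical, account, or general.\n\n"
        f"Message: {customer_message}\nLabel:"
    ).lower()

    # Step 2: generate the reply with a prompt specialized for that intent.
    return call_llm(
        f"You are a support agent handling a {intent} request. "
        "Write a short, helpful reply.\n\n"
        f"Customer message: {customer_message}"
    )

print(handle_request("I was charged twice for my subscription this month."))
```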

This approach offers several benefits. Each prompt is simpler to write and maintain. We can monitor and debug each step independently. We can use different models for different steps, perhaps using a faster, cheaper model for intent classification and a more powerful model for response generation. We can also execute independent steps in parallel when possible.

The main drawback is increased perceived latency for end users. With multiple steps, users wait longer to see the final output. However, for complex applications, the improved reliability and maintainability often outweigh this concern.

Best Practices for Writing Effective Prompts

Some best practices for effective prompting are as follows:

  • Be Clear and Specific: Ambiguity is the enemy of effective prompting. We should eliminate all uncertainty about what we want the model to do. If we want the model to score essays, we need to specify the scoring scale. Should it use 1 to 5 or 1 to 10? Are fractional scores allowed? What should the model do if it is uncertain about a score?

  • Provide Sufficient Context: Context helps models generate accurate, relevant responses. If we want the model to answer questions about a document, including that document in the prompt is essential. Without it, the model can only rely on its training data, which may lead to outdated or incorrect information.

  • Specify Output Format: We should explicitly state how we want the model to respond. Do we want a concise answer or a detailed explanation? Should the output be formatted as JSON, a bulleted list, or a paragraph? Should the model include preambles, or should it get straight to the point? A small JSON-output sketch follows this list.

  • Use Examples Strategically: Examples are powerful tools for reducing ambiguity, but they come with a cost in terms of tokens and context length. We should provide examples when the desired format or behavior is not obvious from instructions alone. For straightforward tasks, examples may not be necessary.

  • Iterate and Experiment: Prompt engineering is iterative. We rarely write the perfect prompt on the first try. We should start with a basic prompt, test it, observe the results, and refine based on what we learn.

  • Version Prompts: We should version our prompts so we can track changes over time. Using consistent evaluation data allows us to compare different prompt variations objectively. We should test prompts not just in isolation but in the context of the complete system to ensure that improvements in one area do not create problems elsewhere.
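As a concrete illustration of the output-format point above, the sketch below requests strict JSON and validates it before handing it to downstream code. The schema, field names, and review text are invented for the example.

```python
import json

from openai import OpenAI

client = OpenAI()

prompt = (
    "Extract the product name and sentiment from the review below.\n"
    "Respond with only a JSON object, no preamble, using exactly these keys:\n"
    '{"product": string, "sentiment": "positive" | "negative" | "neutral"}\n\n'
    "Review: The new NoiseAway headphones block sound well but feel flimsy."
)

raw = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

try:
    data = json.loads(raw)  # downstream code receives structured data
    print(data["product"], data["sentiment"])
except (json.JSONDecodeError, KeyError):
    print("Model did not return the requested format:", raw)
```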

Common Pitfalls to Avoid

Some common pitfalls that should be avoided when writing prompts are as follows:

  • Being Too Vague: One of the most common mistakes is assuming the model understands our intent without explicit explanation. Vague prompts like “write something about climate change” leave too much open to interpretation. Do we want a scientific explanation, a persuasive essay, a news article, or a social media post? What length? What perspective? The model will make its own choices, which may not align with what we actually want.

  • Overcomplicating Prompts: While clarity and detail are important, we can go too far in the other direction. Overly complex prompts with excessive instructions, too many examples, or convoluted logic can confuse the model rather than help it. We should aim for the simplest prompt that achieves our goal. If a zero-shot prompt works well, there is no need to add examples. If three examples are sufficient, five may not improve results.

  • Ignoring Output Format: Failing to specify the output format can cause problems, especially when model outputs feed into other systems. If we need structured data but do not request it explicitly, the model might generate unstructured text that requires additional parsing or cleaning. This adds complexity and potential points of failure to our application.

  • Not Testing Sufficiently: A single successful output does not mean the prompt is reliable. We should test prompts with various inputs, including edge cases and unusual scenarios. What works for typical cases might fail when inputs are slightly different or unexpected. Building a small evaluation dataset and testing systematically helps identify weaknesses before they become problems in production, as the sketch after this list illustrates.
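To make the testing point concrete, here is a minimal evaluation harness. The test cases, scoring rule, and model name are all illustrative; a real evaluation set would be larger and tied to the application's actual inputs.

```python
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "Classify the sentiment of the review as positive, negative, or neutral. "
    "Respond with one word only.\n\nReview: {review}"
)

# A tiny, hand-built evaluation set covering typical and edge cases.
eval_cases = [
    ("Absolutely loved it, would buy again.", "positive"),
    ("Broke after two days. Waste of money.", "negative"),
    ("It arrived on time.", "neutral"),
    ("Great packaging!!! ...except the product caught fire.", "negative"),  # tricky case
]

correct = 0
for review, expected in eval_cases:
    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(review=review)}],
    ).choices[0].message.content.strip().lower()
    correct += int(answer == expected)
    print(f"{expected:>8} | {answer:>8} | {review}")

print(f"Accuracy: {correct}/{len(eval_cases)}")
```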

Conclusion

Effective prompt engineering combines clear communication, strategic use of examples, and systematic experimentation.

The core techniques we have explored, including zero-shot prompting, few-shot prompting, chain-of-thought reasoning, role prompting, and prompt chaining, provide a solid foundation for working with AI language models.

The key principles remain consistent across different models and applications:

  • Be specific and clear in our instructions.

  • Provide sufficient context for the model to work with.

  • Specify the output format we need.

  • Use examples when they add value and iterate based on results.