2026-01-18 00:30:31
Learn strategies for low-latency feature stores and vector search
This masterclass demonstrates how to keep latency predictably low across common real-time AI use cases. We’ll dig into the challenges behind serving fresh features, handling rapidly evolving embeddings, and maintaining consistent tail latencies at scale. The discussion spans how to build pipelines that support real-time inference, how to model and store high-dimensional vectors efficiently, and how to optimize for throughput and latency under load.
You will learn how to:
Build end-to-end pipelines that keep both features and embeddings fresh for real-time inference
Design feature stores that deliver consistent low-latency access at extreme scale
Run vector search workloads with predictable performance—even with large datasets and continuous updates
This week’s system design refresher:
Best Resources to Learn AI in 2026
The Pragmatic Summit
Why Does Prompt Engineering Make a Big Difference in LLMs?
Modern Storage Systems
🚀 Become an AI Engineer Cohort 3 Starts Today!
These AI resources can be divided into the following types:
Foundational and Modern AI Books
Books like AI Engineering, Machine Learning System Design Interview, Generative AI System Design Interview, and Designing Machine Learning Systems cover both principles and practical system patterns.
Research and Engineering Blogs
Follow OpenAI Research, Anthropic Engineering, DeepMind Blog, and AI2 to stay current with new architectures and applied research.
Courses and YouTube Channels
Courses like Stanford CS229 and CS230 build solid ML foundations. YouTube channels such as Two Minute Papers and ByteByteAI offer concise, visual learning on cutting-edge topics.
AI Newsletters
Subscribe to The Batch (DeepLearning.AI), ByteByteGo, Rundown AI, and Ahead of AI to learn about major AI updates, model releases, and research highlights.
Influential Research Papers
Key papers include Attention Is All You Need, Scaling Laws for Neural Language Models, InstructGPT, BERT, and DDPM. Each represents a major shift in how modern AI systems are built and trained.
Over to you: Which other AI resources will you add to the list?
I’ll be talking with Sualeh Asif, the cofounder of Cursor, about lessons from building Cursor at the Pragmatic Summit.
If you’re attending, I’d love to connect while we’re there.
📅 February 11
📍 San Francisco, CA
LLMs are powerful, but their answers depend on how the question is asked. Prompt engineering adds clear instructions that set goals, rules, and style. This turns vague questions and tasks into clear, well-defined prompts.
What are the key prompt engineering techniques?
Few-shot Prompting: Include a few (input → output) example pairs in the prompt to teach the pattern.
Zero-shot Prompting: Give a precise instruction without examples to state the task clearly.
Chain-of-thought (CoT) Prompting: Ask for step-by-step reasoning before the final answer. This can be zero-shot, where we explicitly include “Think step by step” in the instruction, or few-shot, where we show some examples with step-by-step reasoning.
Role-specific Prompting: Assign a persona, like “You are a financial advisor,” to set context for the LLM.
Prompt Hierarchy: Define system, developer, and user instructions with different levels of authority. System prompts define high-level goals and set guardrails, while developer prompts define formatting rules and customize the LLM’s behavior.
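To make the techniques above concrete, here is a minimal sketch that combines a role, few-shot examples, and a chain-of-thought cue in a single request. It assumes the official openai Python SDK; the model name and example content are purely illustrative, not recommendations.

```python
# Minimal sketch: role + few-shot + chain-of-thought prompting.
# Assumes the openai Python SDK; model name and examples are illustrative.
from openai import OpenAI

client = OpenAI()

system_prompt = "You are a financial advisor. Answer clearly and state any assumptions."

few_shot_examples = [
    {"role": "user", "content": "Explain 'diversification' in one sentence."},
    {"role": "assistant", "content": "Spreading money across different assets so one bad investment can't sink the whole portfolio."},
]

question = {
    "role": "user",
    "content": "Explain 'compound interest' in one sentence. Think step by step, then give the final sentence.",
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat-completion model works here
    messages=[{"role": "system", "content": system_prompt}, *few_shot_examples, question],
)
print(response.choices[0].message.content)
```

The same structure extends naturally: add more example pairs for stronger few-shot guidance, or move formatting rules into a developer-level message if your provider supports a prompt hierarchy.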
Here are the key principles to keep in mind when engineering your prompts:
Begin simple, then refine.
Break a big task into smaller, more manageable subtasks.
Be specific about desired format, tone, and success criteria.
Provide just enough context to remove ambiguity.
Over to you: Which prompt engineering technique gave you the biggest jump in quality?
Every system you build, whether it's a mobile app, a database engine, or an AI pipeline, eventually hits the same bottleneck: storage. And the storage world today is far more diverse than "HDD vs. SSD."
Here's a breakdown of how today's storage stack actually looks:
Primary Storage (where speed matters most): This is memory that sits closest to the CPU.
L1/L2/L3 caches, SRAM, DRAM, and newer options like PMem/NVDIMM.
Blazing fast but volatile. The moment power drops, everything is gone.
Local Storage (your machine’s own hardware): HDDs, SSDs, USB drives, SD cards, optical media, even magnetic tape (still used for archival backups).
Networked Storage (shared over the network):
SAN for block-level access.
NAS for file-level access.
Object storage and distributed file systems for large-scale clusters.
This is what enterprises use for shared storage, centralized backups, and high availability setups.
Cloud Storage (scalable + managed):
Block storage like EBS, Azure Disks, GCP PD for virtual machines.
Object storage like S3, Azure Blob, and GCP Cloud Storage for massive unstructured data.
File storage like EFS, Azure Files, and GCP Filestore for distributed applications.
Cloud Databases (storage + compute + scalability baked in):
Relational engines like RDS, Azure SQL, Cloud SQL.
NoSQL systems like DynamoDB, Bigtable, Cosmos DB.
Over to you: If you had to choose one storage technology for a brand-new system, where would you start: block, file, object, or a database service?
Our third cohort of Becoming an AI Engineer starts today. This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.
2026-01-17 00:30:56
Our third cohort of Becoming an AI Engineer starts in one day. This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.
Here’s what makes this cohort special:
• Learn by doing: Build real-world AI applications instead of just watching videos.
• Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.
• Live feedback and mentorship: Get direct feedback from instructors and peers.
• Community driven: Learning alone is hard. Learning with a community is easy!
We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.
If you want to start learning AI from scratch, this is the perfect time to begin.
2026-01-16 00:31:05
When applications grow popular, they face a good problem: more users and, with them, exponentially more data. While this growth signals business success, it creates technical challenges that can cripple even well-designed systems. The database, often the heart of any application, becomes the bottleneck that threatens to slow everything down.
Unlike application servers, which can be easily scaled to handle more traffic, databases resist horizontal scaling. We cannot simply add more database servers and expect our problems to vanish. This is where sharding enters the picture as an important solution to one of the most persistent challenges in modern application architecture.
In this article, we will learn about database sharding in more detail. We will understand what it is, why it matters, how different approaches work, and what key considerations are important when implementing it.
2026-01-15 00:31:13
Writing code is no longer the bottleneck. Instead, engineering orgs spend 70%+ of their time investigating incidents and trying to debug the sh** out of prod.
Engineering teams at Coinbase, DoorDash, Salesforce, and Zscaler use Resolve AI’s AI SRE to help resolve incidents before on-call is out of bed and to optimize costs, team time, and new code created with production context.
Download the free buyer’s guide to learn more about the ROI of AI SRE, or join our online FinServ fireside chat on Jan 22 with eng leaders at MSCI and SoFi to hear how large-scale institutions are evaluating and implementing AI for prod in 2026.
Disclaimer: The details in this post have been derived from the details shared online by the Uber Engineering Team. All credit for the technical details goes to the Uber Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
When you open the Uber app to request a ride, check your trip history, or view driver details, you expect instant results. Behind that seamless experience lies a sophisticated caching system. Uber’s CacheFront serves over 150 million database reads per second while maintaining strong consistency guarantees.
In this article, we break down how Uber built this system, the challenges they faced, and the innovative solutions they developed.
Every time a user interacts with Uber’s platform, the system needs to fetch data like user profiles, trip details, driver locations, and pricing information. Reading directly from a database for every request introduces latency and creates a massive load on database servers. When you have millions of users making billions of requests per day, traditional databases cannot keep up.
Caching solves this by storing frequently accessed data in a faster storage system. Instead of querying the database every time, the application first checks the cache. If the data exists there (a cache hit), it returns immediately. If not (a cache miss), the system queries the database and stores the result in cache for future requests.
See the diagram below:
Uber uses Redis, an in-memory data store, as their cache. Redis can serve data in microseconds compared to milliseconds for database queries.
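As a rough illustration of the cache-aside read path described above (not Uber's actual code), a single-row lookup with redis-py might look like the following. The key scheme, the 5-minute TTL, and the db_fetch callable are assumptions for the sketch.

```python
# Illustrative cache-aside read, not Uber's implementation.
# Assumes redis-py; db_fetch is a caller-supplied function that reads MySQL.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 300  # 5-minute default TTL, as described later in the article


def read_row(table: str, row_key: str, db_fetch) -> dict:
    cache_key = f"{table}:{row_key}"        # assumed key scheme
    cached = cache.get(cache_key)
    if cached is not None:                  # cache hit: served straight from Redis
        return json.loads(cached)

    row = db_fetch(table, row_key)          # cache miss: go to MySQL
    cache.set(cache_key, json.dumps(row), ex=TTL_SECONDS)  # backfill for future reads
    return row
```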
Uber’s storage system, called Docstore, consists of three main components.
The Query Engine layer is stateless and handles all incoming requests from Uber’s services.
The Storage Engine layer is where data actually lives, using MySQL databases organized into multiple nodes.
CacheFront is the caching logic implemented within the Query Engine layer, sitting between application requests and the database.
When a read request comes in, CacheFront first checks Redis. If the data exists in Redis, it returns immediately to the client. Uber achieves cache hit rates above 99.9% for many use cases, meaning only a tiny fraction of requests need to touch the database.
If the data does not exist in Redis, CacheFront fetches it from MySQL, writes it to Redis, and returns the result to the client. The system can handle partial cache misses as well. For example, if a request asks for ten rows and seven exist in cache, it only fetches the missing three from the database.
Writes introduce significant complexity to any caching system. When data changes in the database, the cached copies of that data become stale. Serving stale data breaks application logic and creates poor user experiences. For example, imagine updating your destination in the Uber app, but the system keeps showing your old destination because it is reading from an outdated cache entry.
The challenge with refreshing the cache is determining which cache entries need to be invalidated when a write occurs. Uber supports two types of write operations, and they require different approaches.
Point writes are straightforward. These are INSERT, UPDATE, or DELETE queries where the exact rows being modified are specified in the query itself. For example, updating a specific user’s profile by their user ID. With point writes, you know exactly which cache entries to invalidate because the row keys are part of the query.
Conditional updates are far more complex. These are UPDATE or DELETE queries with WHERE clauses that filter based on conditions. For example, marking all trips longer than 60 minutes as completed. Before executing the query, you do not know which rows will match the condition, so you cannot tell which cache entries to invalidate.
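To make the distinction concrete, here are two hedged examples (the table and column names are made up). With the point write, the cache key is known from the statement itself; the conditional update gives no such information up front.

```python
# Illustrative only; table and column names are hypothetical.

# Point write: the primary key appears in the statement, so the cache entry
# to invalidate ("users:42") is known before the query even runs.
point_write = "UPDATE users SET city = 'Austin' WHERE user_id = 42"

# Conditional update: the WHERE clause filters on a non-key condition, so the
# set of affected rows (and therefore cache keys) is unknown until execution.
conditional_update = "UPDATE trips SET status = 'completed' WHERE duration_minutes > 60"
```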
This uncertainty meant that Uber initially could not invalidate the cache synchronously during writes. They had to rely on other mechanisms.
Uber’s initial approach used a system called Flux, which implements Change Data Capture.
Flux monitors the MySQL binary logs, which record every change made to the database. When a write commits, MySQL writes it to the binlog. Flux tails these logs, sees which rows changed, and then invalidates or updates the corresponding entries in Redis. See the diagram below:
This approach worked, but had a critical limitation. Flux operates asynchronously, meaning there is a delay between when data changes in the database and when the cache gets updated. This delay is usually sub-second but can stretch longer during system restarts, deployments, or when handling topology changes.
This asynchronous nature creates consistency problems. If a user writes data and immediately reads it back, they might get the old cached value because Flux has not yet processed the invalidation. This violates read-your-own-writes consistency, which is a fundamental expectation in most applications.
The system also relied on Time-To-Live (TTL) expiration. Every cache entry has a TTL that determines how long it stays in the cache before expiring. Uber's default recommendation is 5 minutes, though this can be adjusted based on application requirements. TTL expiration acts as a backstop, ensuring that even if invalidations fail, stale data eventually gets removed.
However, TTL-based expiration alone is insufficient for many use cases. Service owners wanted higher cache hit rates, which pushed them to increase TTL values, but longer TTLs mean data stays cached longer, improving hit rates while also widening the window for serving stale data.
As Uber scaled CacheFront, three main sources of inconsistency emerged.
Cache invalidation delays from Flux created read-your-own-writes violations where a write followed immediately by a read could return stale data.
Cache invalidation failures occurred when Redis nodes became temporarily unresponsive, leaving stale entries until TTL expiration.
Finally, cache refills from lagging MySQL follower nodes could introduce outdated data if the follower had not yet replicated recent writes from the leader.
There was another, subtler consistency issue: how stale cached data can actually become. Most engineers assume that if you set a TTL of 5 minutes, stale data will exist for at most 5 minutes. This assumption is incorrect.
Consider this scenario:
A row was written to the database one year ago and has not been accessed since.
At time T, which is today, a read request comes in. The cache does not have this row, so it is fetched from the database and cached. The cached entry now contains one-year-old data.
Moments later, a write request updates this row in the database. Flux attempts to invalidate the cache entry, but the invalidation fails due to a temporary Redis issue. Now, the cache still contains the one-year-old value while the database has the fresh value.
For the next hour, assuming a one-hour TTL, every read request will return the one-year-old cached data.
In other words, the staleness is not bounded by the TTL duration. Even though TTL is only 1 hour, the application may be serving data that’s actually 1 year out of date. The TTL only controls how long the cache entry lives, not how old the data inside it can be.
This problem becomes more severe with longer TTLs. Service owners wanting higher cache hit rates would increase TTL to 24 hours or more. If an invalidation failed, they could serve extremely outdated data for the entire duration.
The fundamental blocker for synchronous cache invalidation was not knowing which rows changed during conditional updates.
Uber made two critical changes to their storage engine to solve this:
First, they converted all deletes to soft deletes by setting a tombstone flag instead of removing rows.
Second, they implemented strictly monotonic timestamps at microsecond precision, making each transaction uniquely identifiable.
With these guarantees, the system can now determine which rows were modified. When rows get updated, their timestamp column is set to the transaction’s unique timestamp. Just before committing, the system executes a lightweight query to select all row keys modified within that transaction’s timestamp window. This query is fast because the data is already cached in MySQL’s storage engine, and the timestamp column is indexed.
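A rough sketch of that pre-commit step is shown below. This is our reconstruction, not Uber's code; the column names (a row key plus an indexed last_modified_ts column) and the microsecond timestamps are assumptions.

```python
# Reconstruction of the pre-commit "which rows did this transaction touch?" query.
# Schema, column names, and timestamp format are assumptions for illustration.

def modified_row_keys(cursor, table: str, txn_start_ts: int, txn_commit_ts: int) -> list:
    """Return primary keys of rows stamped within this transaction's timestamp window."""
    cursor.execute(
        f"SELECT row_key FROM {table} "
        "WHERE last_modified_ts >= %s AND last_modified_ts <= %s",
        (txn_start_ts, txn_commit_ts),
    )
    return [row[0] for row in cursor.fetchall()]
```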
With the ability to track all modified rows, Uber redesigned the write path.
When a write request comes into the Query Engine, it registers a callback that executes when the storage engine responds. The response includes the success status, the set of affected row keys, and the transaction’s commit timestamp.
The callback uses this information to invalidate corresponding cache entries in Redis. This invalidation can happen synchronously (within the request context, adding latency but providing the strongest consistency) or asynchronously (queued to run outside the request context, avoiding latency but with slightly weaker guarantees).
See the diagram below:
Critically, even if cache invalidation fails, the write request still succeeds. The system does not fail writes due to cache issues, preserving availability.
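A simplified sketch of that write path (our interpretation, with hypothetical helpers such as enqueue) could look like this. Note how invalidation failures are swallowed so the write still succeeds.

```python
# Simplified sketch of write-path invalidation; storage_engine, cache, and
# enqueue are hypothetical stand-ins, not Uber's actual interfaces.

def handle_write(write_request, storage_engine, cache, invalidate_async=False):
    # Execute the write; the storage engine reports which rows changed and
    # the transaction's commit timestamp, per the redesigned write path above.
    result = storage_engine.execute(write_request)

    def invalidate():
        try:
            for row_key in result.affected_row_keys:
                cache.delete(f"{write_request.table}:{row_key}")
        except Exception:
            # Cache problems must never fail the write; Flux and TTLs act as backstops.
            pass

    if invalidate_async:
        enqueue(invalidate)   # hypothetical background queue: no added latency, weaker guarantee
    else:
        invalidate()          # synchronous: adds latency, strongest consistency

    return result             # the write succeeds regardless of the cache outcome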
As you can see, Uber now runs three parallel mechanisms for keeping the cache consistent.
TTL expiration automatically removes entries after their configured lifetime (default 5 minutes).
Flux runs in the background, tailing MySQL binlogs and asynchronously invalidating cache entries.
Lastly, the new write-path invalidation provides immediate, synchronous cache updates when data changes.
Having three independent systems working together proved far more effective than relying on any single approach.
To validate improvements and measure cache consistency, Uber built Cache Inspector. This tool uses the same CDC pipeline as Flux but with a one-minute delay. Instead of invalidating cache, it compares binlog events with what is stored in Redis, tracking metrics like stale entries found and staleness duration.
The results were encouraging. For tables using 24-hour TTLs, Cache Inspector found essentially zero stale values over week-long periods while cache hit rates exceeded 99.9%. This measurement capability allowed Uber to confidently increase TTL values for appropriate use cases, dramatically improving performance without sacrificing consistency.
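Conceptually, the check Cache Inspector performs is simple. A toy version (ours, not Uber's; the event, cache, and metrics interfaces are assumed) compares a delayed binlog event against whatever Redis currently holds:

```python
# Toy version of the Cache Inspector comparison; event/cache/metrics shapes are assumed.
import json
import time


def inspect(event, cache, metrics):
    """Compare a (delayed) binlog event against the cached copy of that row."""
    cache_key = f"{event['table']}:{event['row_key']}"
    cached = cache.get(cache_key)
    if cached is None:
        return  # nothing cached, nothing to be stale

    if json.loads(cached) != event["row_after_write"]:
        metrics.increment("cache.stale_entries")
        metrics.timing("cache.staleness_seconds", time.time() - event["commit_time"])
```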

Beyond core invalidation improvements, Uber implemented numerous optimizations such as adaptive timeouts that adjust based on load, negative caching for non-existent data, pipelined reads to batch requests, circuit breakers for unhealthy nodes, connection rate limiters, and compression to reduce memory and bandwidth usage.
Today, CacheFront serves over 150 million rows per second during peak hours. Cache hit rates exceed 99.9% for many use cases. The system has scaled by nearly 4 times since the original implementation, while actually improving consistency guarantees.
By solving the cache invalidation problem with synchronous invalidation from the write path, combined with asynchronous CDC and TTL-based expiration, Uber achieved strong consistency with high performance at massive scale.
References:
After the amazing success of Cohorts 1 and 2 (with close to 1,000 engineers joining and building real AI skills), we are excited to announce the launch of Cohort 3 of Become an AI Engineer!
2026-01-14 00:30:57
If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guarantee
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
Disclaimer: The details in this post have been derived from the details shared online by the Lyft Engineering Team. All credit for the technical details goes to the Lyft Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
When you request a ride on Lyft, dozens of machine learning models spring into action behind the scenes. One model may calculate the price of your trip. Another determines which drivers should receive bonus incentives. A fraud detection model scans the transaction for suspicious activity. An ETA prediction model estimates your arrival time. All of this happens in milliseconds, and it happens millions of times every single day.
The engineering challenge of serving machine learning models at this scale is immense.
Lyft’s solution was to build a system called LyftLearn Serving that makes this task easy for developers. In this article, we will look at how Lyft built this platform and the architecture behind it.
Lyft identified that machine learning model serving is difficult because of the complexity on two different planes:
The first plane is the data plane. This encompasses everything that happens during steady-state operation when the system is actively processing requests. This includes network traffic, CPU, and memory consumption. Also, the model must load into memory and execute the inference tasks quickly. In other words, these are the runtime concerns that determine whether the system can handle production load.
The second plane is the control plane, which deals with everything that changes over time. For example, models need to be deployed and undeployed. They need to be retrained on fresh data and versioned properly. New models need to be tested through experiments before fully launching. Also, backward compatibility must be maintained so that changes don’t break existing functionality.
Lyft needed to excel at both simultaneously while supporting a diverse set of requirements across dozens of teams.
The diversity of requirements at Lyft made building a one-size-fits-all solution nearly impossible. Different teams cared about wildly different system characteristics, creating a vast operating environment.
For example, some teams required extremely tight latency limits. Their models needed to return predictions in single-digit milliseconds because any delay would degrade user experience. Other teams cared more about throughput, needing to handle over a million requests per second during peak hours. Some teams wanted to use niche machine learning libraries that weren’t widely supported. Others needed continual learning capabilities, where models update themselves in real-time based on new data.
See the diagram below:

To make matters worse, Lyft already had a legacy monolithic serving system in production. While this system worked for some use cases, its monolithic design created serious problems.
All teams shared the same codebase, which meant they had to agree on which versions of libraries to use. One team’s deployment could break another team’s models. During incidents, it was often unclear which team owned which part of the system. Teams frequently blocked each other from deploying changes.
Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant code review comments and highlighting the potential impact of every pull request.
Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.
CodeRabbit reviews 1 million PRs every week across 3 million repositories and is used by 100,000 open-source projects.
CodeRabbit is free for all open-source repos.
Lyft chose to build LyftLearn Serving using a microservices architecture, where small, independent services each handle specific responsibilities. This decision aligned with how most software systems at Lyft are built, allowing the team to leverage existing tooling for testing, networking, and operational management.
See the diagram below:
However, the Lyft team took the microservices concept further than typical implementations. Rather than building a single shared microservice that all teams use, they created a platform that generates completely independent microservices for each team.
When a team at Lyft wants to serve machine learning models, they use a configuration generator to create their own dedicated GitHub repository. This repository contains all the code and configuration for a microservice that belongs entirely to that team. For example, the Pricing team gets its own repository and microservice. The Fraud Detection team gets its own and so on.
These microservices share common underlying components, but they run independently. Each team controls its own deployment pipeline, choosing when to release changes to production. Each team can use whatever versions of TensorFlow, PyTorch, or other ML libraries they need without conflicts. Each team’s service runs in isolated containers with dedicated CPU and memory resources. If one team’s deployment breaks, only that team is affected.
This architecture solved the ownership problem that plagued the legacy system. Every repository clearly identifies which team owns it. On-call escalation paths are unambiguous. Library updates only affect one team at a time. Teams can move at their own pace without blocking each other.
See the diagram below:
Understanding what actually runs when a LyftLearn Serving microservice is deployed helps clarify how the system works. The runtime consists of several layers that work together.
At the outermost layer sits the HTTP serving infrastructure. Lyft uses Flask, a Python web framework, to handle incoming HTTP requests. Flask runs on top of Gunicorn, a web server that manages multiple worker processes to handle concurrent requests. In front of everything sits Envoy, a load balancer that distributes requests across multiple server instances. The Lyft team made custom optimizations to Flask to work better with Envoy and Gunicorn.
Beneath the HTTP layer sits the Core LyftLearn Serving Library, which contains the business logic that powers the platform. This library handles critical functionality like loading models into memory and unloading them when needed, managing multiple versions of the same model, processing inference requests, shadowing new models alongside production models for safe testing, monitoring model performance, and logging predictions for analysis.
The next layer is where teams inject their custom code. ML engineers write two key functions that get called by the platform.
The load function specifies how to deserialize their specific model from disk into memory.
The predict function defines how to preprocess input features and call their model’s prediction method.
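Lyft's write-up doesn't show the exact interface, but conceptually it is a small contract. A hypothetical version for a scikit-learn-style model pickled to disk might look like this (the class, feature names, and artifact format are our assumptions):

```python
# Hypothetical illustration of the two functions a team plugs into the platform;
# this is not Lyft's actual interface. Feature names and artifact format are assumed.
import pickle

import numpy as np


class IncentiveModel:
    def load(self, model_path: str) -> None:
        # Deserialize this team's specific model artifact from disk into memory.
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def predict(self, features: dict) -> float:
        # Preprocess raw request features into the shape the library expects,
        # then delegate to the underlying ML library's prediction method.
        row = np.array([[features["trips_last_week"], features["hours_online"]]])
        return float(self.model.predict(row)[0])
```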
At the deepest layer sit third-party ML libraries like TensorFlow, PyTorch, LightGBM, and XGBoost. The Lyft platform doesn’t restrict which libraries teams can use, as long as they have a Python interface. This flexibility was essential for supporting diverse team requirements.
Finally, the entire stack sits on top of Lyft’s infrastructure, which uses Kubernetes for container orchestration and Envoy for service mesh networking. The runtime implements interfaces for metrics, logging, tracing, and analytics that integrate with Lyft’s monitoring systems.
One of the most important aspects of LyftLearn Serving is how it solves the onboarding problem. Deploying a microservice at a company like Lyft requires extensive configuration across many systems, such as:
Kubernetes YAML files to define how containers run.
Terraform configuration for cloud infrastructure.
Envoy configs for networking.
Database connections, security credentials, monitoring setup, and deployment pipelines.
Expecting ML engineers to learn all of these systems would create a massive barrier to adoption. The Lyft team’s solution was to build a configuration generator using Yeoman, a project scaffolding tool.
See the diagram below:
The generator works through a simple question-and-answer flow. An ML engineer runs the generator and answers a handful of basic questions about their service name, team ownership, and a few other details. The generator then automatically creates a complete GitHub repository containing everything needed to run a LyftLearn Serving microservice.
The generated repository includes properly formatted configuration files for all infrastructure systems, working example code demonstrating how to implement model loading and prediction, unit test templates, a fully configured CI/CD deployment pipeline, and documentation on how to customize everything.
Most importantly, the generated repository is immediately deployable. An ML engineer can run the generator, merge the created code, deploy it, and have a functioning microservice serving models in production. This approach reduced the support burden on the ML Platform team. Teams could self-onboard without extensive help. With over 40 teams using LyftLearn Serving, this scalability was essential.
The Lyft team built a solution for ensuring models continue working correctly as the system evolves. They call this feature model self-tests.
ML engineers define test data directly in their model code. This test data consists of sample inputs and their expected outputs. For example, a neural network model might specify that input [1, 0, 0] should produce output close to [1]. This test data gets saved alongside the model binary itself.
The platform automatically runs these self-tests in two scenarios:
First, every time a model loads into memory, the system runs predictions on the test data and verifies the outputs match expectations. Results are logged and turned into metrics that ML engineers can monitor.
Second, whenever someone creates a pull request to change code, the continuous integration system tests all models in the repository against their saved test data.
This automated testing catches problems early. If a library upgrade breaks model compatibility, the tests fail before deployment. If container image changes affect model behavior, engineers know immediately. The tests provide confidence that models work correctly without requiring manual verification.
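As a rough sketch of the idea (not Lyft's actual format), the test data can live next to the model and be replayed on every load and in CI. The structure and tolerance below are assumptions.

```python
# Rough sketch of a model self-test; format and tolerance are assumptions,
# not Lyft's implementation.
SELF_TEST_CASES = [
    # (input features, expected output) pairs saved alongside the model binary.
    {"input": [1, 0, 0], "expected": 1.0},
    {"input": [0, 1, 0], "expected": 0.0},
]


def run_self_tests(model, tolerance: float = 1e-2) -> bool:
    """Run on every model load and in CI; results become logs and metrics."""
    for case in SELF_TEST_CASES:
        prediction = model.predict(case["input"])
        if abs(prediction - case["expected"]) > tolerance:
            print(f"self-test failed: {case['input']} -> {prediction}")
            return False
    return True
```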
Seeing how an actual prediction request moves through LyftLearn Serving helps tie all the pieces together. Consider a request to predict driver incentives.
The request arrives as an HTTP POST to the /infer endpoint, containing a model ID and input features in JSON format.
The Flask server receives the request and routes it to the appropriate handler function. This handler is provided by the Core LyftLearn Serving Library.
The platform code first retrieves the requested model from memory using the model ID. It validates that the input features match the expected schema. If model shadowing is configured, it may route the request to multiple model versions simultaneously for comparison.
Next, the platform calls the custom predict function that the ML engineer wrote. This function preprocesses the features as needed, then calls the underlying ML library’s prediction method.
Finally, more platform code executes. The system emits latency metrics and logs for debugging. It generates analytics events for monitoring model performance. The prediction is packaged into a JSON response and returned to the caller.
See the diagram below:
This entire flow typically completes in milliseconds. The clean separation between platform code and custom code allows Lyft to add new capabilities without teams needing to change their prediction logic.
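Stripped of the platform machinery, the shape of that handler is roughly the following. This is a sketch using Flask, with a stand-in model registry and stubbed observability; it is not Lyft's code.

```python
# Sketch of the /infer handler shape, not Lyft's actual code. MODEL_REGISTRY is a
# stand-in for the platform's model management; schema validation is omitted.
import time

from flask import Flask, jsonify, request

app = Flask(__name__)
MODEL_REGISTRY = {}  # model_id -> loaded model object (populated by the platform)


@app.route("/infer", methods=["POST"])
def infer():
    payload = request.get_json()
    model = MODEL_REGISTRY[payload["model_id"]]      # retrieve the model from memory

    start = time.perf_counter()
    prediction = model.predict(payload["features"])  # team-provided predict function
    latency_ms = (time.perf_counter() - start) * 1000

    # Platform-side observability would emit metrics and analytics events here.
    print(f"inference latency_ms={latency_ms:.2f}")

    return jsonify({"model_id": payload["model_id"], "prediction": prediction})
```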
The Lyft team recognized that different users need different interfaces. They provided two primary ways to work with LyftLearn Serving.
For software engineers comfortable with code, the model repository offers full control. Engineers can edit deployment pipelines, modify CI/CD configurations, and write custom inference logic. Everything is version-controlled and follows standard software development practices.
For data scientists who prefer visual interfaces, LyftLearn UI provides a web application. Users can deploy models with one click, monitor performance through dashboards, and manage training jobs without writing infrastructure code.
See the diagram below:
Documentation also received first-class treatment. The Lyft team organized docs using the Diátaxis framework, which defines four documentation types:
Tutorials provide step-by-step learning for beginners.
How-to guides give specific instructions for common tasks.
Technical references document APIs in detail.
Discussions explain concepts and design decisions.
The Lyft engineering team shared several important lessons from building LyftLearn Serving. These insights apply broadly to anyone building platform systems.
First, they emphasized the importance of defining terms carefully. The word “model” can mean many different things: source code, trained weights, files in cloud storage, serialized binaries, or objects in memory. Every conversation needs clarity about which meaning is intended.
Second, they learned that models serve production traffic indefinitely. Once a model goes live, it typically runs forever. This reality demands extreme stability and careful attention to backward compatibility.
Third, they found that great documentation is critical for platform products. Thorough, clear docs enable self-onboarding and reduce support burden. The investment in documentation pays continuous dividends.
Fourth, they accepted that hard trade-offs are inevitable. The team constantly balanced seamless user experience against power user flexibility, or custom workflows against enforced best practices.
Fifth, they learned to align their vision with power users. The most demanding customers often have the right priorities: stability, performance, and flexibility. Meeting their needs tends to benefit everyone.
Finally, they embraced boring technology. Rather than chasing the latest trends, they chose proven, stable tools like Flask, Kubernetes, and Python. These technologies have strong community support, make hiring easier, and cause fewer unexpected problems.
Lyft made LyftLearn Serving available internally in March 2022. The team then migrated all models from the legacy monolithic system to the new platform. Today, over 40 teams use LyftLearn Serving to power hundreds of millions of predictions daily.
References:
Powering Millions of Real-Time Decisions with LyftLearn Serving
LyftLearn: ML Model Training Infrastructure built on Kubernetes
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2026-01-13 00:30:42
Securely authorizing access to an MCP server is complex. You need PKCE, scopes, consent flows, and a way to revoke access when needed.
Learn from WorkOS how to implement OAuth 2.1 in a production-ready setup, with clear steps and examples.
Large Language Models (LLMs) have moved from research labs into production applications at a remarkable pace. Developers are using them for everything from customer support chatbots to code generation tools to content creation systems. However, this rapid adoption brings an important question: how do we know if our LLM is actually working well?
Unlike traditional software, where we can write unit tests that check for exact outputs, LLMs are probabilistic systems. Ask the same question twice, and the model might give different answers, both of which could be perfectly valid. This uncertainty makes evaluation challenging but absolutely necessary.
This is where “evals” come in. Short for evaluations, evals are the systematic methods we use to measure how well our LLM performs. Without proper evaluation, we’re essentially flying blind, unable to know whether our latest prompt change made things better or worse, whether our model is ready for production, or whether it’s handling edge cases correctly.
In this article, we’ll explore why LLM evaluation is challenging, the different types of evaluations available, key concepts to understand, and practical guidance on setting up an evaluation process.
AI is only as powerful as the data behind it — but most teams aren’t ready.
We surveyed 200 senior IT and data leaders to uncover how enterprises are really using streaming to power AI, and where the biggest gaps still exist.
Discover the biggest challenges in real-time data infrastructure, the top obstacles slowing down AI adoption, and what high-performing teams are doing differently in 2025.
Download the full report to see where your organisation stands.
If we’re used to testing traditional software, LLM evaluation will feel different in fundamental ways. In conventional programming, we write a function that takes an input and produces a deterministic output. Testing is straightforward. Given input X, we expect output Y. If we get Y, the test passes. If not, it fails.
LLMs break this model in several ways.
First, there’s the subjective nature of language itself. What makes a response “good”? One response might be concise while another is comprehensive. Both could be appropriate depending on context. Unlike checking if a function returns the number 42, judging the quality of a paragraph requires nuance.
Second, most questions or prompts have multiple valid answers. For example, if we ask an LLM to summarize an article, there are countless ways to do it correctly. An eval that checks for exact text matching would fail even when the model produces excellent summaries.
Third, language is deeply context-dependent. The same words can mean different things in different situations. Sarcasm, humor, cultural references, and implied meaning all add layers of complexity that simple pattern matching can’t capture.
Finally, there’s a significant gap between impressive demos and consistent production performance. An LLM might handle our carefully crafted test cases beautifully but stumble on the messy, unpredictable inputs that real users provide.
Traditional software testing approaches like unit tests and integration tests remain valuable for the code surrounding our LLM, but they don’t fully translate to evaluating the model’s language understanding and generation capabilities. We need different tools and frameworks for this new challenge.
When evaluating LLMs, we have several approaches available, each with different strengths and tradeoffs. Let’s explore the main categories.
Automatic evaluations are programmatic assessments that can run without human involvement.
The simplest form is exact matching, where we check if the model’s output exactly matches an expected string. This works well for structured outputs like JSON or when there’s genuinely only one correct answer, but it’s too rigid for most natural language tasks.
Keyword matching is slightly more flexible. We check whether the output contains certain required keywords or phrases without demanding exact matching. This catches some variation while still being deterministic and easy to implement.
Semantic similarity measures how close the model’s output is to a reference answer in meaning, even if the words differ. These often use embedding models to compare the semantic content rather than surface-level text.
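A small sketch of the first three approaches (keyword matching plus embedding-based similarity) is shown below, assuming the sentence-transformers package; the model name and threshold are illustrative.

```python
# Sketch of keyword matching and semantic similarity checks; model name and
# threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")


def keyword_match(output: str, required_keywords: list[str]) -> bool:
    # Deterministic check: does the output contain every required keyword?
    return all(kw.lower() in output.lower() for kw in required_keywords)


def semantically_similar(output: str, reference: str, threshold: float = 0.8) -> bool:
    # Compare meaning rather than surface text via embedding cosine similarity.
    emb = embedder.encode([output, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```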
One increasingly popular approach is model-based evaluation, where we use another LLM as a judge. In this approach, we can ask a powerful model like GPT-4 or Claude to rate our target model’s outputs on criteria like helpfulness, accuracy, or relevance. This approach can capture nuance that simpler metrics miss, though it introduces its own complexities.
See the diagram below:
Automatic evaluations shine when we need to catch obvious failures, run regression tests to ensure changes don’t break existing functionality, or quickly iterate on prompts. However, they can miss subtle issues that only human judgment would catch.
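A minimal LLM-as-judge sketch, assuming the openai SDK and an illustrative model and rubric, could look like this; in practice you would want a more careful rubric and more robust parsing of the judge's reply.

```python
# Minimal LLM-as-judge sketch; model name and rubric are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the ASSISTANT ANSWER from 1 (poor) to 5 (excellent) for
helpfulness and accuracy given the QUESTION. Reply with a single integer.

QUESTION: {question}
ASSISTANT ANSWER: {answer}"""


def judge(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```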
Despite advances in automated testing, human evaluation remains the gold standard for assessing nuanced aspects of LLM performance. Humans can judge subjective qualities like tone, appropriateness, helpfulness, and whether a response truly addresses the underlying intent of a question.
Human evaluations take several forms. In preference ranking, evaluators compare multiple outputs and select which they prefer. Likert scales ask raters to score outputs on numerical scales across different dimensions. Task completion evaluations test whether the output accomplishes a specific goal.
The main tradeoff with human evaluation is cost and speed versus accuracy. Getting high-quality human ratings is expensive and time-consuming, but for many applications, it’s the only way to truly validate performance. For reference, human evaluation becomes essential when we’re working on subjective tasks, dealing with safety-critical applications, or need to validate our automatic evals.
The ML research community has developed standardized benchmark datasets for evaluating LLMs. These include datasets like MMLU (Massive Multitask Language Understanding) for general knowledge, HellaSwag for common sense reasoning, and HumanEval for code generation.
The advantage of benchmarks is comparability. We can see how our model stacks up against others and track progress using established baselines. They also provide ready-made test sets covering diverse scenarios.
However, benchmarks have limitations. They might not reflect our specific use case. A model that scores highly on academic benchmarks might still perform poorly on our customer support application. Additionally, as benchmarks become widely known, there’s a risk of models being optimized specifically for them rather than for general capability.
To build effective evaluations, we need to understand several core concepts:
Different tasks require different metrics. For classification tasks, we might use accuracy or F1 score. For text generation, metrics like BLEU and ROUGE measure overlap with reference texts, though they have known limitations. For code generation, we can check if the code executes correctly and produces expected outputs.
Beyond task-specific metrics, we often care about quality dimensions that cut across tasks.
Is the output relevant to the input?
Is it coherent and well-structured?
Is it factually accurate?
Is it helpful to the user?
Does it avoid harmful content?
The quality of our evaluation depends heavily on the quality of our test dataset. We need test cases that are representative of real-world usage, diverse enough to cover different scenarios, and include edge cases where the model might fail.
A common pitfall is data contamination, where our test examples overlap with the model’s training data. This can make performance look better than it actually is on truly novel inputs. Using held-out data or creating new test cases helps avoid this issue.
Since LLM outputs can vary between runs (especially with higher temperature settings), we need to think statistically about evaluation. A single test run might not be representative. Sample size matters: testing on ten examples gives us much less confidence than testing on a thousand.
We also need to account for variance in the model’s outputs. Running the same prompt multiple times and averaging results can give us a more stable picture of performance. Understanding and controlling for parameters like temperature, top-p, and random seeds helps make our evals reproducible.
Here’s a practical approach to building an eval process.
Define success for our use case: What does “good” mean for our specific application? If we’re building a customer support bot, maybe “good” means answering the question accurately, maintaining a friendly tone, and escalating to humans when appropriate.
Create an initial eval set: Start with 50-100 diverse examples covering common cases, edge cases, and known failure modes. We don’t need thousands of examples to start getting value.
Choose our evaluation approach: If we have limited resources, start with automatic evals. If quality is paramount and we have the budget, incorporate human evaluation. Often, a hybrid approach works best: automatic evals for broad coverage and quick iteration, human evals for final validation.
Set up an iteration cycle: Run evals, identify where the model fails, make improvements (better prompts, different models, fine-tuning, etc.), and re-evaluate. This cycle is how we progressively improve performance.
Track performance over time: Keep a record of eval scores across different versions. This helps us understand whether changes are helping and preventing regression.
Version everything: Track which model version, which prompt version, and which eval dataset version produced each result. This makes debugging and reproduction much easier.
The key is to start simple and iterate. Don’t wait for the perfect eval setup. A basic eval running regularly is infinitely more valuable than a sophisticated eval that never gets implemented.
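Putting those steps together, a bare-bones harness only needs a versioned eval set, a scoring function, and a results log. The sketch below is ours, with an assumed JSONL file layout and caller-supplied generate and score functions.

```python
# Bare-bones eval harness sketch; file layout and scoring function are assumptions.
import json
from datetime import datetime, timezone


def run_evals(cases_path: str, generate, score, prompt_version: str, model_version: str):
    """Eval set: one JSON object per line with 'input' and 'expected' fields."""
    with open(cases_path) as f:
        cases = [json.loads(line) for line in f]

    scores = [score(generate(c["input"]), c["expected"]) for c in cases]
    result = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_version": prompt_version,   # version everything, as noted above
        "model_version": model_version,
        "eval_set": cases_path,
        "mean_score": sum(scores) / len(scores),
    }
    with open("eval_history.jsonl", "a") as f:  # track performance over time
        f.write(json.dumps(result) + "\n")
    return result
```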
While building an eval practice, it’s good to watch out for these common mistakes:
Overfitting to the eval set is a real risk. If we repeatedly optimize against the same test cases, we might improve scores without improving real-world performance. Regularly refreshing our eval set with new examples helps prevent this.
We can also fall into the trap of gaming the metrics. Just because a model scores well on a particular metric doesn’t mean it’s actually better. Always combine quantitative metrics with a qualitative review of actual outputs.
Many teams neglect edge cases and adversarial examples. Real users will find ways to break the system that we never anticipated. Actively seeking out and testing difficult cases makes our evals more robust.
On the flip side, relying solely on vibes and anecdotes is problematic. Human intuition is valuable but can be misleading. Systematic evaluation gives us data to make informed decisions.
Perhaps the biggest pitfall is not evaluating at all. In the rush to ship features, evaluation often gets deprioritized. But shipping without evals means we have no idea if we’re making things better or worse.
The best practice is to maintain a diverse, evolving eval suite that grows alongside our product. As we discover new failure modes or expand to new use cases, they can be added to the eval set.
LLM evaluation is a continuous practice that should be woven into our development workflow. Just as we wouldn’t ship traditional software without tests, we shouldn’t deploy LLM applications without proper evaluation.
Good evals give us the confidence to iterate quickly and deploy safely. They help us understand what’s working and what isn’t while providing objective measures for comparing different approaches. They catch regressions before users do.
The good news is that we don’t need a sophisticated setup to start. Begin with a small set of test cases and basic metrics. Run evals regularly. Pay attention to failures. Gradually expand and refine our eval suite as we learn more about our application’s requirements.
The field of LLM evaluation is still evolving, with new tools, frameworks, and best practices emerging regularly. But the fundamental principle remains constant: we can’t improve what we don’t measure. By making evaluation a core part of our LLM development process, we transform what might otherwise be guesswork into engineering.