2026-04-28 11:03:50
Most teams pick a search provider by running a few test queries and hoping for the best – a recipe for hallucinations and unpredictable failures. This technical guide from You.com gives you an exact framework for evaluating AI search and retrieval.
What you’ll get:
A four-phase framework for evaluating AI search
How to build a golden set of queries that predicts real-world performance
Metrics and code for measuring accuracy
Go from “looks good” to proven quality.
Every time you buy something online from a Stripe-powered business, a machine learning model evaluates over 1,000 signals about your transaction and decides in under 100 milliseconds whether to let it through.
Across billions of legitimate payments, it reaches the correct verdict 99.9% of the time. The system that delivers those numbers, however, looks entirely different from what Stripe originally built.
The architecture has been overhauled multiple times, and one of the most important upgrades required removing a component the team knew was actively improving accuracy, because keeping it was holding back everything else the team wanted to do.
For reference, online payment fraud occurs in roughly 1 out of every 1,000 transactions. That rarity makes fraud detection a difficult machine learning problem because the system has to surface a small number of fraudulent payments from a massive volume of legitimate ones, and it has to do this quickly and cheaply on every single transaction.
In this article, we will look at how Stripe’s Radar does this effectively and the architectural decisions the team made while building it.
Disclaimer: This post is based on publicly shared details from the Stripe Engineering Team. Please comment if you notice any inaccuracies.
Stripe began with relatively simple ML models like logistic regression (a statistical method that predicts the probability of an outcome based on input variables). Over time, as the Stripe network grew and ML technology advanced, they moved to more complex architectures. Each jump produced an equivalent leap in model performance.
The architecture preceding the current one was called Wide & Deep. It combined two models into an ensemble.
The “wide” component was XGBoost, a gradient-boosted decision tree model that combines many small decision trees into one powerful predictor. XGBoost excelled at memorization, meaning it was strong at recognizing specific patterns and feature correlations it had encountered in training data.
The “deep” component was a deep neural network (DNN) that excelled at generalization, meaning it could learn abstract concepts like “unusual payment velocity on a card” and apply them to entirely new situations it had never seen before.
Together, the two components worked well. But XGBoost was creating operational bottlenecks. It was hard to parallelize, which made retraining the combined model slow. It was incompatible with advanced ML techniques Stripe wanted to adopt, such as transfer learning (using knowledge gained from one task to improve performance on a different but related task) and embeddings. And it limited how quickly the many engineers working on Radar each day could experiment with new ideas.
Simply dropping XGBoost would have caused a 1.5% drop in recall, meaning 1.5% more fraud would go undetected. That was an unacceptably large regression in performance. The value XGBoost provided was real and measurable, so it had to be replicated within a new architecture rather than just discarded.
Stripe’s solution drew inspiration from a research architecture called ResNeXt.
The core idea, sometimes called “Network-in-Neuron,” splits computation into multiple distinct branches, where each branch functions as a small neural network on its own. The outputs from all branches are summed to produce a final result. This multi-branch approach enriches feature representation along a new dimension, and it achieves this more effectively than the brute-force approach of simply making a DNN wider or deeper, which risks overfitting (the model memorizing random noise rather than learning real patterns).
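To make the branch-and-sum idea concrete, here is a minimal PyTorch sketch of a multi-branch block. The branch count, widths, and residual connection are illustrative assumptions, not details of Stripe’s actual Shield NeXt architecture.

```python
import torch
import torch.nn as nn

class MultiBranchBlock(nn.Module):
    """Toy "Network-in-Neuron" block: parallel small MLPs whose outputs are summed."""

    def __init__(self, dim: int, num_branches: int = 8, branch_width: int = 16):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, branch_width),  # project into a narrow branch
                nn.ReLU(),
                nn.Linear(branch_width, dim),  # project back to the shared width
            )
            for _ in range(num_branches)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the branch outputs and keep a residual connection.
        return x + sum(branch(x) for branch in self.branches)

block = MultiBranchBlock(dim=64)
out = block(torch.randn(32, 64))  # a batch of 32 transactions, 64 features each
```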
The resulting architecture, internally called Shield NeXt, reduced training time by over 85%, bringing it to under two hours. Experiments that previously required overnight jobs could now run multiple times in a single working day. Stripe is now exploring techniques that this architectural shift made possible, including multi-task learning, where a single model is trained to handle several related objectives simultaneously.
Agents can generate code. Getting it right for your system, team conventions, and past decisions is the hard part. You end up babysitting the agent and watching the token costs climb.
More MCPs, rules, and bigger context windows give agents access to information, but not understanding. The teams pulling ahead have a context engine to give agents only what they need for the task at hand.
Join us live (FREE) on May 6 to see:
Where teams get stuck on the AI maturity curve and why common fixes fall short
How a context engine solves for quality, efficiency, and cost
Live demo: the same coding task with and without a context engine
The model architecture matters, but Radar’s biggest competitive advantage comes from the data flowing through the Stripe network. Stripe has engineered specific mechanisms to convert that scale into model performance.
90% of cards used on the Stripe network have been seen more than once across different merchants. A single business has visibility into only its own transactions. Radar, by contrast, sees patterns across millions of businesses and thousands of partner banks around the world.
There is also a structural advantage in how Radar gets its training labels, the data that tells the model which past transactions were actually fraudulent. Since Radar is built directly into Stripe’s payment flow, it receives these labels automatically when cardholders dispute charges. Most third-party fraud solutions require businesses to build separate data pipelines for sending payment labels back to the fraud provider, or to label payments manually, which is time-consuming and error-prone. Radar sidesteps all of this by ingesting ground truth data straight from the payment flow and card networks.
Stripe uses hundreds of features in its model, and most of them are aggregates computed across the entire network. As the network grows, each feature becomes more informative because the training data better represents the feature’s real-world distribution.
A “feature” in this context is a single signal the model uses to evaluate a transaction. Some are intuitive. For example, does the cardholder’s name match the provided email address? How many different cards have been associated with this IP address? A high count might indicate someone testing stolen cards. Other features are more surprising. The difference between the device’s local time and UTC, or the count of countries where a card has been successfully authorized, both turn out to be meaningful fraud signals.
Finding new features is part forensics and part experimentation. Stripe’s team reviews past fraud attacks in detail, building investigation reports that try to reconstruct how fraudsters operate. They look for patterns in throwaway email addresses used to set up multiple accounts quickly. They monitor dark web activity weekly. From this research, they build a prioritized list of candidate features, implement each one rapidly, and prototype them to measure model impact.
Sometimes the most promising ideas yield little. For example, Stripe once built a feature capturing whether a business was currently under a distributed fraud attack. It barely moved the model performance because the model was already learning that pattern implicitly.
One of the more powerful techniques Stripe uses is embeddings, which are learned numerical representations for categorical data. Things like merchant identity, issuing bank, user country, and day of the week have many possible values, and defining useful numerical representations for them is challenging.
Stripe trains its model to learn an embedding for each value, essentially a set of coordinates that position it relative to others based on transaction patterns. Uber and Lyft, for example, would end up with similar embedding coordinates because their transaction patterns resemble each other, while Slack would be positioned very differently.
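A hedged sketch of the idea in PyTorch: each categorical value gets a trainable vector, and the fraud-detection training objective pulls values with similar transaction patterns toward nearby coordinates. The vocabulary and dimension below are made up for illustration.

```python
import torch
import torch.nn as nn

merchant_ids = {"uber": 0, "lyft": 1, "slack": 2}  # hypothetical vocabulary

# One trainable 8-dimensional vector per merchant; gradients from the fraud
# objective move merchants with similar behavior toward similar coordinates.
embedding = nn.Embedding(num_embeddings=len(merchant_ids), embedding_dim=8)

uber, lyft, slack = embedding(torch.tensor([0, 1, 2]))
similarity = nn.functional.cosine_similarity
# After training, we would expect similarity(uber, lyft) > similarity(uber, slack).
print(similarity(uber, lyft, dim=0).item(), similarity(uber, slack, dim=0).item())
```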
Embeddings enable geographic transfer of fraud knowledge. If Stripe identifies a new fraud pattern in Brazil, the embeddings allow the system to recognize that same pattern in the US automatically, without retraining. The model essentially learns which merchants and regions behave similarly, then applies fraud knowledge across the entire network.
Stripe also found that scaling up training data continued to yield significant gains. A 10x increase in training transaction data still produced meaningful model improvements, and the team was working on a 100x version. This kind of scaling was only feasible because the DNN-only architecture could train fast enough to handle much larger datasets practically.

Having a great model and great data still leaves a fundamental question unanswered.
How much fraud should you actually block?
Every fraud detection system faces an inherent tension between two types of errors:
A false negative is when fraud slips through undetected, costing the business the product, a chargeback fee, and potential reputational damage with card networks.
A false positive is when a legitimate customer gets blocked, and the business loses the sale, along with potentially the customer forever. A survey found that 33% of consumers said they would stop shopping at a business after a single false decline.
These two errors exist on a curve.
Precision measures the fraction of blocked transactions that are actually fraudulent. Recall measures the fraction of all actual fraud that gets caught. As you raise the blocking threshold, requiring a higher fraud probability before blocking a payment, precision goes up because you become more selective about what you block. But recall goes down because more marginal fraud slips through. Lowering the threshold pushes things in the opposite direction.
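The threshold tradeoff is easy to see in a few lines of Python. This is a toy illustration with made-up scores and labels, not Stripe’s evaluation code.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision/recall when blocking all payments scored >= threshold.

    scores: model fraud probabilities; labels: 1 = actual fraud, 0 = legitimate.
    """
    blocked = [(s, y) for s, y in zip(scores, labels) if s >= threshold]
    true_pos = sum(y for _, y in blocked)
    total_fraud = sum(labels)
    precision = true_pos / len(blocked) if blocked else 1.0
    recall = true_pos / total_fraud if total_fraud else 0.0
    return precision, recall

# Toy data: raising the threshold lifts precision and eventually costs recall.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
labels = [1, 1, 1, 0, 1, 0]
for t in (0.5, 0.7, 0.85):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```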
Stripe frames this as two distinct problems:
The data science problem is about making the model better by adding predictive features, training on more data, and refining the architecture. A better model shifts the entire precision-recall curve upward, meaning that at any given threshold, you get better outcomes on both dimensions.
The business problem is about choosing where on that curve to operate, and the right answer depends entirely on the economics of each merchant.
For example, consider two businesses.
A food delivery company with thin margins might earn $2 in profit per order. Once you account for product cost and chargeback fees, a single fraudulent transaction can wipe out the profit from nearly 19 legitimate ones. For this business, aggressive blocking makes sense because the cost of missed fraud is devastating. On the other hand, a SaaS company with high margins faces the opposite calculation. The lifetime revenue lost by blocking a legitimate subscriber who would have paid $200 per month for years far outweighs the cost of an occasional fraudulent charge.
This is why Stripe built Radar to be configurable.
Merchants can adjust their risk threshold, and Radar for Fraud Teams lets them compose custom rules and set up manual review queues.
Stripe evaluates custom rules with the same precision-recall framework it uses for the model itself. When a merchant creates a rule, Stripe shows historical statistics on matching transactions that were actually disputed, refunded, or accepted, so the merchant can evaluate the impact before the rule goes live. Stripe also uses additional evaluation tools like ROC curves and AUC (area under the curve) scores to assess overall model quality, but the precision-recall framing captures the core tension most directly.
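Conceptually, backtesting a rule reduces to filtering history and counting outcomes. A minimal sketch, with hypothetical field names rather than Stripe’s actual schema:

```python
def backtest_rule(rule, history):
    """Score a candidate blocking rule against past transactions.

    `history` holds transaction dicts with an `outcome` field
    ('disputed', 'refunded', or 'accepted'); names are illustrative.
    """
    matched = [t for t in history if rule(t)]
    disputed = sum(t["outcome"] == "disputed" for t in matched)
    precision = disputed / len(matched) if matched else 0.0
    return {"matched": len(matched), "disputed": disputed, "precision": precision}

# Example rule: block payments over $500 where card country differs from IP country.
rule = lambda t: t["amount"] > 500 and t["card_country"] != t["ip_country"]
print(backtest_rule(rule, history=[
    {"amount": 900, "card_country": "US", "ip_country": "BR", "outcome": "disputed"},
    {"amount": 650, "card_country": "US", "ip_country": "US", "outcome": "accepted"},
]))
```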
Manual review adds yet another lever.
Sending borderline transactions to human reviewers instead of blocking them outright improves precision with minimal impact on recall; sending them to review instead of allowing them through improves recall with minimal impact on precision. The cost is human effort, but it gives merchants a way to reshape their own precision-recall curve using business knowledge the model cannot access on its own.
All machine learning models are opaque to some degree, and deep neural networks especially so. Stripe accepted this when they chose DNNs over simpler, more interpretable techniques. The predictions are better, but explaining why a specific transaction received a given score is harder.
Stripe’s response was to build layers of interpretability around the model.
In 2020, they launched risk insights, a feature that shows merchants which factors contributed to a transaction being declined. The interface displays the top fraud signals, like an address being associated with a previous early fraud warning or an unusually high number of names linked to a card. It includes a location map showing distances between the billing address, shipping address, and IP address. It shows customer metadata like email, cardholder name, and the authorization rate for transactions associated with that email.
Stripe also uses Elasticsearch, a search engine optimized for fast lookups across large datasets, to surface related transactions and help merchants put a specific decline in a broader context.
Internally, the team built a table view displaying the exact features contributing most to a transaction’s fraud score, which engineers use to debug support cases. Stripe is working on sharing more of these internal tools with merchants, closing the gap between what engineers can see and what users can see.
Explainability serves a practical purpose beyond building trust. When merchants understand why Radar scored a transaction the way it did, they can improve the data they send to Stripe for more accurate decisions. They can create custom rules that incorporate knowledge only they have about their own business. The explanation layer transforms Radar from a black box into something merchants can actively collaborate with.
Building a better model is half the challenge. Deploying it safely at Stripe’s scale is the other half, and it involves two hard engineering problems.
The first is real-time feature computation. Every feature the model uses during training must also be computable in production, because Radar needs to score every incoming payment as part of the Stripe API flow. For a feature like “the two most frequent IP addresses previously used with this card,” Stripe maintains an up-to-date state on every card ever seen on the network, and fetching or updating that state has to be fast. Stripe’s ML infrastructure team built systems that let engineers define features declaratively, with current values made available automatically in production at low latency.
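As a rough illustration of the state such a feature requires, here is an in-process sketch. Stripe’s real version is a low-latency distributed store updated on every payment, not a Python dict.

```python
from collections import Counter, defaultdict

# State behind a feature like "the two most frequent IP addresses
# previously used with this card"; structure and names are illustrative.
ip_counts: dict[str, Counter] = defaultdict(Counter)

def record_payment(card_id: str, ip: str) -> None:
    ip_counts[card_id][ip] += 1  # cheap incremental update on every payment

def top_two_ips(card_id: str) -> list[str]:
    return [ip for ip, _ in ip_counts[card_id].most_common(2)]

record_payment("card_123", "203.0.113.7")
record_payment("card_123", "203.0.113.7")
record_payment("card_123", "198.51.100.2")
print(top_two_ips("card_123"))  # ['203.0.113.7', '198.51.100.2']
```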
The second is ensuring that model improvements hold across the entire user base, all the way down to individual merchants. A model that performs better on aggregate metrics might still cause a spike in block rate for smaller businesses, which would be disruptive for those merchants and their customers. Before releasing any model, Stripe measures the change it would cause to the false positive rate, block rate, and authorization rate on both an aggregate and per-merchant basis. If a model would cause undesirable shifts for certain users, they adjust it for those segments before release. They also compare score distributions between old and new models, aiming to keep the proportion of transactions above each merchant’s blocking threshold stable.
Fraud patterns shift constantly, which means even a well-performing model degrades over time, a phenomenon called model drift. Stripe found that retraining the same model on more recent data, with identical features and architecture, improves recall by up to half a percentage point per month. That is a big gain from simply keeping the data fresh. By investing in automated training, tuning, and evaluation tooling, Stripe tripled their model release cadence. They continuously update performance dashboards after training but before release, so engineers can spot stale model candidates and proactively retrain them.
The fraud landscape itself keeps evolving. Patterns have shifted from primarily stolen credit card fraud to a growing mix of traditional card fraud and high-velocity card testing attacks, where automated scripts try large numbers of stolen card numbers against a merchant’s checkout flow. Stripe’s deployment infrastructure is built to support this kind of rapid adaptation.
Stripe also faces a subtle measurement challenge in production. Transactions that the model blocks have unknown true outcomes because the payment was never completed. Computing a full production precision-recall curve requires counterfactual analysis, meaning statistical methods that estimate what would have happened to payments Radar blocked. Stripe has developed proprietary techniques for this over the years.
Radar is a very different product from what it was at launch.
The models, the data pipelines, the explainability tools, and the way Stripe communicates fraud decisions to merchants have all been rebuilt. Fraud patterns have changed considerably in that time as well.
However, the core goal of the Radar team remains the same.
They are still working to create an environment where businesses and customers can transact with confidence, still optimizing that brief moment customers barely register, the instant between clicking “purchase” and seeing the transaction confirmed. Every architectural choice, every feature, every deployment safeguard exists to make that 100-millisecond window as accurate, fair, and trustworthy as possible.
2026-04-27 23:30:57
Most AI agents don’t fail because the model is bad. They fail because the model doesn’t have the proper infrastructure to reason well.
Simba Khadder, Head of Engineering at Redis, lays out a four-pillar framework for building context systems that hold up in production, plus an architectural self-audit checklist you can run against your stack today.
Search “shoes for pregnant women” on Amazon, and the best results you get might be slip-resistant shoes, even though the word “pregnant” appears nowhere in those product listings.
In other words, there is zero keyword overlap between the query and the product. The search engine has to reason that pregnant women need stability, that stability means slip-resistance, and that slip-resistant shoes are the right match.
Traditional recommendation systems match text to text and purchase history to purchase history. They handle keyword overlap quite well. However, when a shopper’s intent requires a reasoning step that lives entirely in human common sense, those systems hit a wall.
Amazon’s search team recognized this blind spot and built a commonsense knowledge graph called COSMO that teaches the recommendation engine to think the way a human shopper would.
In this article, we will look at how COSMO works and the challenges the engineering team faced.
Disclaimer: This post is based on publicly shared details from the Amazon Engineering Team. Please comment if you notice any inaccuracies.
Amazon already operates large-scale knowledge graphs that store factual product attributes like brand, color, material, and category. These graphs power a lot of what works well in product search today. However, they mainly try to encode what a product is, and they don’t explain why a human would want it.
This is the semantic gap problem.
For example, a query like “winter clothes” carries an implicit intent around warmth. The product catalog for a long-sleeve puffer coat describes its material, size options, and sleeve length, but it may say nothing about warmth directly. The gap between what the customer typed and what the product listing says requires a reasoning step that factual knowledge graphs were never designed to handle.
Amazon’s team surveyed the landscape of existing solutions.
Alibaba built AliCoCo (163K nodes, 91 relations) and AliCG (5M nodes), both extracted from search logs. These capture product concepts, but they stay focused on product attributes and categories, skipping user intent entirely.
General commonsense knowledge bases like ConceptNet (8M nodes, 21M edges) cover everyday reasoning but are built for general purposes, with little grounding in shopping behavior.
Amazon’s own earlier effort, FolkScope, demonstrated that commonsense knowledge could be extracted from shopping data, but it covered only 2 product categories and only co-purchase behavior.
The gap was clear. Though factual product knowledge and general commonsense knowledge existed, structured knowledge about why people buy things at e-commerce scale was missing.
The intuition behind Amazon’s approach was simple. Large language models encode enormous amounts of world knowledge in their parameters. Taking our earlier example, if you ask an LLM why a customer who searched “winter coat” bought a long-sleeve puffer coat, it can reason that puffer coats provide warmth, and warmth is what the customer wanted.
The team fed millions of user behavior pairs into OPT-175B and OPT-30B, large language models hosted internally on 16 A100 GPUs. The choice of OPT over GPT-4 was driven by a hard constraint around data privacy. Customer behavior data, meaning which queries led to which purchases, could only be processed on Amazon’s own infrastructure.
Two types of behavior data went into the system.
Query-purchase pairs capture the connection between a search query and the product a customer ultimately bought.
Co-purchase pairs capture products bought together in the same shopping session.
Across 18 product categories, the team sampled 3.14 million co-purchase pairs and 1.87 million query-purchase pairs.
The sampling strategy was itself a design decision.
For products, Amazon covered popular browse node categories and selected top-tier products with high interaction volume, also using product type labels (more than a thousand classes like “umbrella” or “chair”) for finer-grained selection.
For co-purchase pairs, the team cross-checked product types to remove random co-purchases and filtered out products that co-occurred with too many different product types (a signal of noise rather than intent).
For search-buy pairs, thresholds on both purchase rate and click rate determined which queries and products entered the sample.
Crucially, an in-house query specificity service helped prioritize broad or ambiguous queries, because those are exactly where the semantic gap is largest and commonsense knowledge adds the most value.
Prompt design mattered too. Rather than using simple text continuation, Amazon formatted each behavior pair as a question-answering task and instructed the LLM to generate a numbered list of candidates rather than a single response.
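A sketch of what such a prompt might look like. The exact production wording is not public, so the template below is an assumption.

```python
def qa_prompt(query: str, product_title: str, n: int = 5) -> str:
    """Format a query-purchase pair as a question-answering task.

    The wording is an illustrative guess at the style Amazon describes,
    not the actual production prompt.
    """
    return (
        f"A customer searched for '{query}' and then bought '{product_title}'.\n"
        f"Question: Why would this search lead to this purchase?\n"
        f"Give {n} numbered candidate explanations:\n1."
    )

print(qa_prompt("winter coat", "long-sleeve puffer coat"))
```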
The LLM generated millions of candidate explanations. However, only 35% of search-buy explanations met Amazon’s quality bar for typicality, meaning they were representative of genuine shopping intent. For co-purchase explanations, that number dropped to 9%. The rest were filler. The LLM produced circular rationales like “customers bought them together because they like them,” or trivially obvious statements like “customers bought an Apple Watch because it is a type of watch.”
The 9% vs. 35% gap reveals something about how LLMs reason. Explaining why a query led to a purchase is relatively constrained because the query provides clear context about intent. But explaining why two products were bought together requires identifying a shared reason across two different items, and LLMs tend to default to generic explanations for one item rather than reasoning about the pair.
Amazon also needed a way to categorize the relationships that the LLM was generating. The team started with 4 broad seed relations (usedFor, capableOf, isA, cause) that prior work had shown produce diverse outputs. From there, they mined finer-grained relation types directly from the LLM’s generated text by looking for recurring predicate patterns.
The most common pattern was “the product is capable of being used [preposition],” where different prepositions mapped to different semantic relationships. This data-driven process produced 15 relation types that capture distinct ways humans reason about products. These include used_for_function (“dry face”), used_for_event (“walk the dog”), used_for_audience (“daycare worker”), used_in_location (“bedroom”), used_in_body (“sensitive skin”), used_with (complementary products like “surface cover”), and person-centric relations like xIs_a (“pregnant women”) and xWant (“play tennis”). The ontology was shaped by what the LLM actually generated, then canonicalized and structured by Amazon’s researchers, rather than being designed top-down by a team of knowledge engineers.
The LLM produced a mountain of hypotheses, which were mostly noise. Amazon’s solution was a multi-stage refinement pipeline, where each stage catches a different type of failure.
Coarse-grained filtering tackled the most obvious problems first. Rule-based filters removed incomplete sentences by measuring sentence quality with a language model (GPT-2) and tuning a threshold. Generations that exactly matched the query text, the product type, or the product title (or fell within a small edit distance) were discarded. For generic statements like “used for the same reason” or “used with clothes,” Amazon identified these by combining frequency and entropy, since generic explanations tend to co-occur with many different products rather than specific ones.
Similarity filtering handled a subtler problem. Some LLM outputs looked different from the input on the surface but were semantically just paraphrases of the original query or product description.
Amazon used an in-house language model, pre-trained on e-commerce text including queries and product information, to compute embeddings for the generated knowledge, the query, and the product. When the vector similarity (measured by cosine distance) between the generated knowledge and the original context was too high, the candidate was filtered out. The team found that filtered generations were essentially syntactic transformations of the original input, rearranging the same meaning in slightly different words.
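A minimal sketch of that filter, substituting a public sentence-embedding model for Amazon’s in-house one; the 0.85 cutoff is an illustrative guess, not Amazon’s threshold.

```python
from sentence_transformers import SentenceTransformer, util

# Stand-in for Amazon's in-house e-commerce language model.
model = SentenceTransformer("all-MiniLM-L6-v2")
MAX_SIMILARITY = 0.85  # assumed cutoff; too similar means "just a paraphrase"

def keep_candidate(generated: str, context: str) -> bool:
    """Drop LLM output that merely restates the query/product context."""
    emb = model.encode([generated, context], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() < MAX_SIMILARITY

print(keep_candidate("provides warmth in cold weather", "winter coat"))  # likely True
print(keep_candidate("a coat for the winter season", "winter coat"))     # likely False
```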
Human-in-the-loop annotation came next. Amazon sampled 30,000 knowledge candidates for human review, with 15,000 from co-purchase behavior and 15,000 from search-buy behavior spread across 18 categories. Rather than picking candidates uniformly, the team used a weighted formula that combined the frequency of a piece of generated knowledge with the popularity of the associated product or query. Popular products produce common knowledge, so the weighting pushed toward diverse, less obvious knowledge that the classifier would later need to generalize.
Annotators evaluated each candidate on two dimensions:
Plausibility measures whether the posited relationship is reasonable.
Typicality measures whether the knowledge is representative of genuine shopping behavior.
As a concrete example, the more typical reason people buy Apple Watches is that they are smartwatches, rather than that they tell the time. Both statements are plausible, but only the first is typical.
To reduce cognitive burden and disagreement among annotators, Amazon decomposed these assessments into five yes/no questions covering completeness, relevance, informativeness, plausibility, and typicality. Two annotators labeled each question independently, with a third resolving disagreements. A pilot study of 2,000 examples showed this decomposition significantly reduced the disagreement rate, and internal auditing of 5% of all annotations showed over 90% accuracy. Due to data privacy requirements, Amazon employed a professional data annotation vendor company, followed by a strict internal auditing process.
Classifier generalization was the final step. Amazon fine-tuned DeBERTa-large (a high-performing language model for classification tasks) and an in-house language model on the 30,000 annotated samples to predict plausibility and typicality scores for all remaining candidates. Only candidates scoring above a 0.5 plausibility threshold survived.
The output of this pipeline is a set of structured knowledge triples. A triple connects two entities through a defined relationship. For example, the triple <co-purchase of camera case and screen protector, capableOf, protecting camera> captures the commonsense reasoning that these two products are bought together because they both serve the purpose of protecting a camera. Assembled, these triples form a knowledge graph of 6.3 million nodes and 29 million edges spanning 18 product categories. From 30,000 human judgments to 29 million edges.
The knowledge graph captures pre-computed commonsense relationships, but Amazon’s search engine encounters new queries and products constantly. Running the full pipeline (OPT-175B generation followed by classifier scoring) for every new behavior pair would be prohibitively expensive in production.
Amazon’s solution was instruction tuning.
The team used their 30,000 annotated samples to create instruction data and fine-tuned LLaMA 7B and 13B models. These base models offered the best balance between generation quality and inference cost for production serving, with far fewer parameters than OPT-175B while still producing high-quality outputs when trained on domain-specific data. The resulting model, COSMO-LM, was trained across 18 product domains, 15 relation types, and 5 distinct tasks.
Beyond commonsense generation, those tasks included plausibility prediction, typicality prediction, search relevance prediction, and co-purchase prediction. The multi-task training means COSMO-LM can both generate knowledge and evaluate its own output quality, effectively collapsing the “big LLM plus classifier” stack into a single, smaller model.
To make the model robust to different input formats, Amazon varied the templates during training. The same query-product pair might be prefixed with “search query,” “user input,” or “user searched” across different training examples. This prevents COSMO-LM from becoming brittle to prompt phrasing.
The result is two complementary artifacts in production. The static knowledge graph (29 million pre-computed edges) handles known product relationships. COSMO-LM generates fresh commonsense knowledge on the fly for new or unseen query-product pairs, with dramatically lower inference cost than the original OPT-175B pipeline. A demo of the system shows COSMO-LM generating knowledge for a query like “how to decorate a home,” producing a list of product types (wall art, decorative signage, sticker decal, decorative pillow cover, artificial plant, rug, home mirror, lamp), each accompanied by a commonsense explanation of its role in home decoration.
Having a model that generates useful knowledge is one challenge. Serving it at Amazon’s scale with acceptable latency is another.
Amazon’s deployment architecture centers on two components:
A Feature Store transforms COSMO-LM’s raw text outputs into structured features that downstream applications can consume directly. These features include product key-value pairs, semantic subcategory representations, and intent signals.
An Asynchronous Cache Store manages the serving layer through a two-tiered caching strategy.
The first tier pre-loads responses for queries that are frequent across the year, covering the majority of traffic. The second tier batch-processes daily requests for newer or less common queries and updates the cache.
When a user query arrives, the system checks the cache first. Hits get immediate responses. Misses go to batch processing, and the cache updates for future identical queries.
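In pseudocode terms, the serving flow is a cache check with a deferred fill. A simplified sketch, with illustrative names and stores:

```python
# Tier 1: preloaded responses for frequent queries (the hot path).
hot_cache = {"winter coat": ["warmth", "cold-weather outerwear"]}
# Tier 2: misses queued for the daily batch job.
batch_queue = []

def serve(query: str):
    if query in hot_cache:
        return hot_cache[query]   # cache hit: immediate response
    batch_queue.append(query)     # cache miss: defer to batch processing
    return None                   # caller falls back to non-COSMO ranking

def run_daily_batch(generate_knowledge):
    while batch_queue:
        query = batch_queue.pop()
        hot_cache[query] = generate_knowledge(query)  # future identical queries hit
```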
SageMaker manages model deployment and refresh, ingesting customer behavior session logs daily. The structured data from the cache feeds three downstream systems simultaneously: Search Relevance, Recommendation, and Navigation.
This architecture meets Amazon’s strict search latency requirements while keeping storage costs comparable to real-time serving for most traffic. But it comes with a tradeoff. COSMO updates daily, which means it cannot incorporate real-time events like flash sales that fluctuate within hours. Amazon explicitly acknowledges this limitation and identifies it as an area for future development.
Search relevance saw the most dramatic offline improvements.
On the public ESCI dataset from KDD Cup 2022, a cross-encoder (a model architecture that jointly processes query and product features together, rather than encoding them separately) augmented with COSMO triples achieved 73.48% Macro F1 and 90.78% Micro F1 with trainable encoders.
For context, Macro F1 averages performance across all product categories equally (so rare categories matter just as much as common ones), while Micro F1 measures overall accuracy regardless of category. That cross-encoder result surpassed the top-1 ensemble model on the KDD Cup leaderboard. With frozen encoders, where the only difference was whether COSMO triples were included as input, the improvement was 60% on Macro F1.
On private datasets spanning four markets (US, Canada, UK, and India), the COSMO-enhanced model consistently outperformed baselines in every locale, with the strongest gains in the India market, where the gap between query language and product catalog language tends to be larger.
Session-based recommendation benefited from COSMO knowledge as well.
Amazon built COSMO-GNN, extending a graph neural network model (a model that learns relationships between items by treating shopping sessions as connected graphs) for session-based recommendations with COSMO-generated intent knowledge. It outperformed all competitive baselines on Hits@10 and NDCG@10 in both clothing and electronics categories.
The improvement was larger for electronics (5.82% vs. 4.05% on Hits@10), where users revise their search queries more frequently (2.47 unique queries per session versus 1.36 for clothing). This pattern makes sense. When users are actively reformulating queries to narrow down what they want, commonsense knowledge about why they are searching becomes especially valuable.
Search navigation is where COSMO reached production and generated real business impact. COSMO powers a multi-turn navigation system that organizes intent hierarchically. A search for “camping” branches into fine-grained intents like “winter camping,” “beach camping,” or “lakeside camping.” These connect to product types like “air mattress” or “winter boots,” which are then further refined by attributes like “4 person.”
This hierarchical organization of knowledge allows the system to mirror a natural discovery process, helping customers progressively narrow their search through multiple rounds of refinement rather than requiring them to formulate the perfect query upfront.
Amazon ran A/B tests over several months, targeting approximately 10% of U.S. traffic. The results were significant. A 0.7% relative increase in product sales within the test segment translated to hundreds of millions of dollars in additional annual revenue.
An 8% increase in navigation engagement was observed in the same segment. These outcomes came from a single, relatively small feature on the search page with limited visibility. Amazon has projected that extending COSMO-LM across all traffic for navigation alone could produce revenue gains in the billions.
COSMO is Amazon’s first production system that uses instruction-tuned large language models to construct a knowledge graph and serve it to online applications. It marks a shift from factual product knowledge graphs toward intent-based commonsense knowledge graphs.
The most important number from this entire project may be the leverage ratio. Thirty thousand human annotations became 29 million knowledge graph edges across 18 product categories. That ratio was possible because Amazon invested heavily in sampling strategy, annotation design, classifier training, and instruction tuning rather than in brute-force labeling.
The system’s acknowledged limitations are worth keeping in mind as well.
COSMO’s daily refresh cycle means it cannot keep up with real-time dynamics. Its aggressive filtering (only candidates above 0.5 plausibility survive) means the knowledge graph has gaps in coverage, especially for long-tail products and unusual queries. These are genuine tradeoffs, and Amazon chose precision over recall because unreliable commonsense knowledge in production would be worse than missing knowledge.
2026-04-25 23:30:59
If slow QA processes are bottlenecking you or your software engineering team and you’re releasing slower because of it, you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guarantee
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
This week’s system design refresher:
Coding Agents Explained: How Claude Code, Codex & Cursor Actually Work (Youtube video)
Data Warehouse vs Data Lake vs Data Mesh
API Concepts Every Software Engineer Should Know
Polling vs Long Polling vs Webhooks vs SSE
SLA vs SLO vs SLI
Build with Claude Code — Course Direction Survey
Storing data is the easy part. Deciding where and how to organize it is the real challenge.
A data warehouse is the traditional approach. It cleans and structures data before storing it. Queries run fast, and reports stay consistent. But adding a new data source takes effort because everything has to fit the schema first.
A data lake takes the opposite approach. It stores everything raw, like databases, logs, images, and video. Process it when you need it. The flexibility is great, but if rules around naming, formatting, and ownership are not properly set, you end up with duplicate, outdated, and undocumented data that is hard to manage.
Data mesh shifts data ownership from a central team to individual departments. For example, sales publishes sales data, and finance publishes finance data. Shared standards keep things compatible across teams.
It works well in larger organizations. But it requires every team to have the right people and processes to manage their data quality, documentation, and access, which is a challenge.
In practice, many companies use more than one approach. They'll use a warehouse for dashboards and reporting and a lake for machine learning workloads, then start applying mesh principles as teams scale.
Most engineers use APIs every day. Sending a request and reading JSON is one thing. Designing an API that other people can rely on is where things get complicated.
A lot of problems begin with basic HTTP details that seem small at first. Methods, status codes, request formats, and response structure can make an API feel clear and predictable, or confusing and inconsistent.
Then there are the bigger design choices. REST, GraphQL, gRPC, webhooks, and WebSockets each make sense in different situations. The challenge is knowing what actually fits the system and the use case.
A lot of API problems also come from design decisions that do not get enough attention early on. Naming, pagination, versioning, error responses, and backward compatibility often decide whether an API is easy to work with or frustrating to maintain.
Security is another area where weak decisions can cause real problems. API keys, OAuth, JWTs, scopes, and permissions are easy to mention. Getting them right is harder, and mistakes here can be costly.
Reliability matters too. Timeouts, retries, idempotency, rate limits, and caching are often easy to ignore until the system is under pressure.
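For instance, retries are only safe when paired with idempotency. A hedged sketch of the client-side pattern; the header name and backoff policy vary by API:

```python
import time
import uuid
import requests

def post_with_retries(url: str, payload: dict, attempts: int = 3) -> requests.Response:
    """Retry a POST safely: reusing one idempotency key lets the server deduplicate."""
    headers = {"Idempotency-Key": str(uuid.uuid4())}  # same key on every retry
    for attempt in range(attempts):
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=5)
            if resp.status_code < 500:
                return resp          # success, or a client error not worth retrying
        except requests.RequestException:
            pass                     # network failure: retry
        time.sleep(2 ** attempt)     # exponential backoff: 1s, 2s, 4s
    raise RuntimeError(f"gave up after {attempts} attempts")
```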
And once an API starts growing, the supporting work matters too. Clear documentation, solid specs, observability, and contract testing make it much easier for teams to trust the API and use it without guessing how it works.
Over to you: What’s the most overlooked API concept in your experience?
Four ways to get updates from a server. Each one makes a different tradeoff between simplicity, efficiency, and real-time delivery.
Here's how they compare:
Polling: The client sends a request every few seconds asking "anything new?" The server responds immediately, whether or not there's new data. Most of those requests come back empty, wasting client and server resources. For use cases like an order status page where a small delay is acceptable, polling is the simplest option to implement.
Long Polling: The client sends a request, and the server keeps the HTTP connection open until new data is available or a timeout occurs. This means fewer empty responses compared to regular polling. Some chat applications used this pattern to deliver messages closer to real-time communication.
Server-Sent Events (SSE): The client opens a persistent HTTP connection, and the server streams events through it as they're generated. It is one-way, lightweight, and built on plain HTTP. Many AI responses that appear token by token are delivered through SSE, streaming each chunk over a single open connection.
Webhooks: Instead of the client asking for updates, the service sends an HTTP POST to a pre-registered callback URL whenever a specific event occurs. Stripe uses this for payment confirmations. GitHub uses it for push events. The client never polls or holds a connection open; it just waits for the server to call.
Many systems don't rely on a single pattern. You may use polling for order status, SSE for streaming AI responses, and webhooks for payment confirmations.
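The first two patterns differ only in how long the server holds the request. A client-side sketch against a hypothetical status endpoint:

```python
import time
import requests

URL = "https://api.example.com/orders/42/status"  # hypothetical endpoint

def poll():
    while True:
        update = requests.get(URL, timeout=5).json()  # returns immediately, often empty
        if update.get("new"):
            return update
        time.sleep(5)                                 # fixed interval: simple but laggy

def long_poll():
    while True:
        # Assumed server contract: hold the request open until data arrives
        # or ~30 seconds elapse, so most responses carry a real update.
        update = requests.get(URL, params={"wait": 30}, timeout=35).json()
        if update.get("new"):
            return update
```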
These three terms are related, but they mean different things. Knowing the difference helps you define what to measure, aim for, and promise your customers.
Here's how they actually connect:
SLI (Service Level Indicator): This is the metric you're measuring. For a login service, it could be the ratio of successful login requests to total valid requests. It tells you how your service is performing right now.
SLO (Service Level Objective): You take that SLI and define a target around it. Something like "login availability should stay above 99.9% over a rolling 28-day window." When you're missing your SLO, it’s a signal to find out what's failing before customers notice.
SLA (Service Level Agreement): This is what you promise your customers in a contract. It's usually set lower than the SLO, say 99.5% monthly availability. If you breach it, you owe service credits.
If your SLO and SLA are both set to 99.9%, then the moment your availability drops below 99.9%, you've already breached the agreement.
The SLI tells you where you stand. The SLO tells you where you should be. The SLA tells your customers what they can expect.
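The relationship is easy to express in code. A toy calculation with made-up request counts:

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: the metric itself, here a simple success ratio."""
    return successful / total if total else 1.0

SLO = 0.999  # internal target over a rolling 28-day window
SLA = 0.995  # contractual promise, deliberately looser than the SLO

sli = availability_sli(successful=2_995_000, total=3_000_000)
error_budget_used = (1 - sli) / (1 - SLO)  # above 1.0 means the budget is exhausted

print(f"SLI={sli:.4%}  SLO breached={sli < SLO}  SLA breached={sli < SLA}")
print(f"Error budget consumed: {error_budget_used:.0%}")
```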
Over to you: How do you decide what the right SLO target is when you're launching a new service?
We’re building a new course, Build with Claude Code, and we’d love your input before we finalize it.
If you’re an engineer or engineering leader, we’d appreciate 3 minutes of your time. Your answers will directly shape what we cover. Thank you so much!
2026-04-23 23:30:59
Every database has to solve the same basic problem.
Data lives on disk, and accessing disk is slow. Every read and every write eventually has to reach the disk, and how a database organizes data on that disk determines everything about its performance.
Over decades of research, two dominant approaches have emerged.
B-Trees keep data sorted on disk so reads are fast, but pay for it on every write.
LSM Trees buffer writes in memory and flush them to disk in bulk, making writes cheap but reads more expensive.
Neither approach is strictly better. They make different tradeoffs, and understanding the tradeoff between them is one of the most useful mental models in system design.
In this article, we will look at B-Trees and LSM trees in detail, along with the trade-offs associated with each of them.
2026-04-21 23:30:21
Skip the guesswork with this MongoDB cheatsheet from Datadog. You’ll get a quick, practical reference for monitoring performance and diagnosing issues in real systems.
Use it to:
Track key metrics like latency, throughput, and resource utilization
Monitor MongoDB and Atlas health with the right signals
Set up dashboards to quickly identify bottlenecks and performance issues
When DoorDash needed to launch Dasher onboarding in Puerto Rico, it took about a week. That wasn’t because they cut corners or threw a huge team at it. It took a week because almost no new code was needed. The steps that Puerto Rican Dashers would go through (identity checks, data collection, compliance validation) already existed as independent modules, battle-tested by thousands of Dashers in other countries. The team assembled them into a new workflow, made one minor customization, and shipped.
And it wasn’t just Puerto Rico. Australia’s migration was completed in under a month. Canada took two weeks, and New Zealand required almost no new development at all.
This speed came from an architectural decision the DoorDash engineering team made when they looked at their growing mess of country-specific if/else statements and decided to stop patching.
They rebuilt their onboarding system around a simple idea. Decompose the process into self-contained modules with standardized interfaces, then connect them through a deliberately simple orchestration layer.
In this article, we will look at how this architecture was designed and the challenges they faced.
Disclaimer: This post is based on publicly shared details from the DoorDash Engineering Team. Please comment if you notice any inaccuracies.
DoorDash’s Dasher onboarding started simple, with just a few steps serving a single country through straightforward logic. Then the company expanded internationally, and every new market meant new branches in the code.
At one point, three API versions ended up coexisting. V3, the newest, continued calling V2 handlers for backward compatibility and also continued writing to V2 database tables. The system literally couldn’t avoid its own history. All developers have probably seen something like this before, where nobody can fully explain which version handles what, and removing any piece feels dangerous because something else might depend on it.
The step sequences themselves were hard-coded, with country-specific logic spread throughout. Business logic started immediately after receiving a request, branching into deep if/else chains based on country, step type, or prior state. Adding a new market meant carefully threading new conditions through this maze.
Vendor integrations followed no consistent pattern either. Some onboarding steps used internal services, which called third-party vendors. Other steps called vendors directly. This inconsistent layering made testing and debugging unpredictable.
And then there was also the state management problem. Onboarding progress was tracked across multiple separate database tables. Flags like validation_complete = true or documents_uploaded = false lived in different systems. If a user dropped off mid-onboarding and came back later, reconstructing where they actually stood required querying several systems and inferring their state from scattered flags, which frequently led to errors.
The practical cost was that adding a new country took months of engineering effort across APIs, tables, and code branches. Every change carried the risk of breaking something in a market on the other side of the world.
DoorDash’s rebuild was organized around three distinct layers, each with a single responsibility. It’s easy to blur these layers together, but the separation between them is where the real power lives.
The Orchestrator sits at the top. It’s a lightweight routing layer that looks at context (which country and which market type) and decides which workflow definition should handle the request. That’s all it does. It doesn’t execute steps or manage state. It doesn’t contain business logic either. The main insight here is that the smartest thing about the orchestrator is how little it does. Developers tend to imagine the central controller as the brain of the system. However, in this architecture, the brain is distributed, and the orchestrator is just a traffic cop.
Workflow Definitions are the second layer. A workflow is simply an ordered list of steps for a specific market. The US workflow might look like Data Collection, followed by Identity Verification, followed by Compliance Check, followed by Additional Validation. Australia’s workflow skips one step and reorders another. Puerto Rico adds a regional customization. Each workflow is defined as a class with a list of step references, making it easy to see exactly what each market’s onboarding process looks like.
Think of it like a Lego set. Each brick has a standardized shape, studs on top, tubes on the bottom, and that standard interface lets you build anything. A workflow definition is like building instructions for a specific model.
Step Modules are the third layer, and this is where the actual work happens. Each step (data collection, identity verification, risk and compliance checking, document verification) is implemented as an independent and self-contained module. A step knows how to collect its data, validate it, call its external vendors, handle retries and failures, and report success or failure. What it doesn’t know is which workflow it belongs to, or what step comes before or after it. This isolation is what makes reuse possible.
The mechanism enabling this plug-and-play behavior is the interface contract. Every step implements the same standardized interface, with a method to process the step, a method to check if it’s complete, and a method to return its response data. As long as a new step honors this contract, it can slot into any workflow without the workflow knowing or caring about its internals.
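A minimal Python sketch of what such a contract and a workflow built from it might look like; the class and method names are illustrative assumptions, not DoorDash’s actual code.

```python
from abc import ABC, abstractmethod

class Step(ABC):
    """The standardized interface every step module implements."""

    @abstractmethod
    def process(self, ctx: dict) -> None:
        """Do the step's work: collect data, call vendors, handle retries."""

    @abstractmethod
    def is_step_completed(self, status_map: dict) -> bool:
        """Each step defines its own completion logic."""

    @abstractmethod
    def response_data(self, status_map: dict) -> dict:
        """Return this step's output for the caller."""

class IdentityVerification(Step):
    def process(self, ctx):
        # A real implementation would call a vendor; here we just record success.
        ctx["status_map"]["identity"] = {"status": "DONE"}

    def is_step_completed(self, status_map):
        return status_map.get("identity", {}).get("status") == "DONE"

    def response_data(self, status_map):
        return status_map.get("identity", {})

# A workflow definition is just an ordered list of step references.
US_WORKFLOW: list[Step] = [IdentityVerification()]  # ...plus data collection, compliance, etc.

def run(workflow: list[Step], ctx: dict) -> None:
    for step in workflow:
        if not step.is_step_completed(ctx["status_map"]):
            step.process(ctx)
```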
This contract also enables team autonomy. The identity verification step can be owned entirely by the security team. Payment setup can belong to the finance team. Each team iterates on their step independently, as long as they maintain the shared interface. In a way, the architecture mirrors the organizational structure, or more accurately, it lets the organizational structure work for the system instead of against it.
Two additional capabilities make the system even more flexible:
Composite steps group multiple granular steps into a single logical unit. One country might collect all personal information on a single screen. Another might split it across three screens. A composite step called “PersonalDetails” can wrap Profile, Additional Info, and Vehicle steps together, handling that variation without changing the individual step implementations underneath.
And steps can be dynamic and conditional. A Waitlist step might only appear in markets with specific supply conditions. The same step can even appear multiple times within a single workflow.
This flexibility goes beyond simple reordering and confirms that steps are truly stateless and workflow-agnostic.
The address collection step is the clearest proof that this works in practice. DoorDash built it once as a standalone module. When Australia needed address collection early in their flow for compliance checks, the team simply inserted the module before the compliance step in Australia’s workflow definition, without any special logic or branching. Canada later adopted the same step for validation and service-area mapping. It worked out of the box. The US team then experimented by enabling it in select regions, and again, with no new code.
This three-layer pattern isn’t specific to onboarding. Any multi-step process that varies across contexts (checkout flows, approval pipelines, content moderation queues) can be decomposed this way.
One important clarification here is that DoorDash’s step modules are not separate microservices. They are modules within a single service, which means the lesson here is about logical decomposition and interface design rather than strict deployment boundaries. Technically, we could apply this same pattern inside a monolith.
How does the system know where each applicant is in their journey?
Answering this question is essential to making modular steps work.
In the legacy system, this was a mess. Progress was tracked across multiple separate tables, each representing part of the workflow. Introducing a new onboarding step meant modifying several of these tables. Ensuring synchronization between them required close coordination across services, and it often broke down, leading to data mismatches and brittle integrations.
The new system introduced the status map, a single JSON object in the database where every step writes its own progress. It looks something like this:
```json
{
  "personal_info": { "status": "DONE", "metadata": { "name": "Jane" } },
  "address": { "status": "DONE", "metadata": { "address_id": "abc123" } },
  "validation": { "status": "IN_PROGRESS" },
  "compliance": { "status": "INIT" }
}
```

Each step is responsible for updating its own entry in the map. When a step starts, completes, fails, or gets skipped, it writes that transition directly to its entry. The workflow layer never writes to the status map. It just reads it.
Each step also exposes an isStepCompleted() method that defines its own completion logic based on the status map. One step might treat “SKIPPED” as a terminal state, while another might not. This flexibility lives at the step level, not the workflow level, which keeps the orchestration logic simple and stateless.
The practical benefit is immediate. A single query on the status map tells you exactly where any applicant stands in their onboarding journey. Partial updates are handled through atomic JSON key merges, meaning that when one step updates its status, it only touches its own entry without overwriting the rest of the map.
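The merge semantics look roughly like this. In production the merge happens as a single atomic JSON operation in the database, not an in-memory dict update; this sketch only illustrates the "touch one key, leave the rest intact" behavior.

```python
def merge_step_status(status_map: dict, step: str, patch: dict) -> dict:
    """Merge one step's patch into its own entry without touching other steps."""
    return {**status_map, step: {**status_map.get(step, {}), **patch}}

status_map = {"address": {"status": "DONE"}, "validation": {"status": "IN_PROGRESS"}}
status_map = merge_step_status(status_map, "validation", {"status": "DONE"})
# {'address': {'status': 'DONE'}, 'validation': {'status': 'DONE'}}
```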
The architecture is only half the story. Getting there without breaking a running system is where the real engineering difficulty lives.
DoorDash didn’t flip a switch. They designed the new platform to coexist with the existing V2 and V3 APIs, running old and new systems side by side. Applicants who had partially completed onboarding under the legacy system needed to continue seamlessly, so the team built temporary synchronization mechanisms that mirrored progress between systems until the migration was complete. This parallel operation was itself a temporary technical debt, built intentionally to be thrown away.
Other major initiatives were underway during the rebuild, sometimes conflicting with the new onboarding design. Rather than treating these as blockers, the team collaborated across those efforts and adapted the architecture where necessary.
The migration started with the US in January 2025, their largest and most complex market, as the proving ground. Then the compounding payoff kicked in. Australia was completed in under a month, needing only two localized steps. Canada followed in two weeks with a single new module. Puerto Rico took a week with a minor customization. New Zealand required almost no new development.
Every migration launched with zero regressions, no user-facing incidents, no onboarding downtime, and no unexpected drop-offs in completion rates. Each rollout got faster because more modules had already been battle-tested by thousands of Dashers in prior markets.
The architecture has also proven its value beyond adding countries. DoorDash is integrating its onboarding with another large, independently developed ecosystem that has its own mature onboarding flow. The modular design allowed them to build integration-specific workflows while reusing much of the existing logic, something that would have been extremely painful with the legacy system.
The tradeoffs are real, though. Modularity adds coordination overhead. For a single-market startup, this architecture can be considered overkill. A monolithic onboarding flow is completely fine until you hit the inflection point where country-specific branching becomes more expensive than decomposition.
Reusable modules work well when the underlying concept generalizes across markets. For example, addresses are conceptually similar everywhere, which is why the address step was reused so cleanly. However, compliance requirements can be fundamentally different between regulatory regimes.
The boundary between the platform team and domain teams also requires ongoing negotiation. DoorDash addresses this through published platform principles, versioned interface contracts, and joint KPIs that create shared accountability. Domain expert teams own their business logic (fraud detection, compliance, payments) while the platform enforces consistency. This is a human coordination challenge that architecture alone doesn’t solve.
Looking ahead, DoorDash’s roadmap includes dynamic configuration loading to enable workflows to go live through config rather than code, step versioning to allow multiple iterations of a step to coexist during experiments or rollouts, and enhanced operational tooling to give non-engineering teams the ability to manage workflows directly.
That said, DoorDash deliberately kept workflows code-defined rather than jumping straight to config-driven. While config-driven systems are powerful, they introduce their own complexity. They can be harder to debug and harder to test.
Ultimately, what DoorDash built is a general pattern for any system that needs to support multiple variants of a multi-step process. The core idea is three layers (a thin orchestrator, composable workflows, and self-contained steps behind standardized interfaces) connected by a single shared state structure.
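Continuing the earlier sketch (and reusing its assumed OnboardingStep and StatusMap types), the three layers might reduce to something like this:

// Hypothetical sketch of the three-layer pattern, continuing the earlier
// example. Not DoorDash's actual code.
interface Workflow {
  // A composable workflow is just an ordered list of self-contained steps.
  steps: OnboardingStep[];
}

// The thin orchestrator: stateless, reads the shared status map,
// and returns the next incomplete step (or null when onboarding is done).
function nextStep(workflow: Workflow, map: StatusMap): OnboardingStep | null {
  return workflow.steps.find((step) => !step.isStepCompleted(map)) ?? null;
}

// A new market composes mostly existing, battle-tested modules,
// plus whatever market-specific steps it needs.
const canadaOnboarding: Workflow = {
  steps: [addressStep, complianceStep], // plus market-specific modules
};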
2026-04-20 23:30:47
npx workos@latest launches an AI agent, powered by Claude, that reads your project, detects your framework, and writes a complete auth integration into your codebase. No signup required. It creates an environment, populates your keys, and you claim your account later when you're ready.
But the CLI goes way beyond installation. WorkOS Skills make your coding agent a WorkOS expert. workos seed defines your environment as code. workos doctor finds and fixes misconfigurations. And once you're authenticated, your agent can manage users, orgs, and environments directly from the terminal. No more ClickOps.
GitHub built an AI agent that can fix documentation, write tests, and refactor code while you sleep. Then they designed their entire security architecture around the assumption that this agent might try to steal your API keys, spam your repository with garbage, and leak your secrets to the internet.
That might sound like paranoia, but it's the only responsible way to put a non-deterministic system inside your CI/CD pipeline.
GitHub Agentic Workflows let you plug AI agents into GitHub Actions so they can triage issues, generate pull requests, and handle routine maintenance without human supervision. The appeal is clear, but so is the risk. These agents consume untrusted inputs, make decisions at runtime, and can be manipulated through prompt injection, where carefully crafted text tricks the agent into doing things it wasn’t supposed to do.
In this article, we will look at how GitHub built a security architecture that assumes the agent is already compromised. However, to understand their solution, you first need to understand why the problem is harder than it looks.
Disclaimer: This post is based on publicly shared details from the GitHub Engineering Team. Please comment if you notice any inaccuracies.
CI/CD pipelines are built on a simple assumption. The developers define the steps, the system runs them, and every execution is predictable. All the components in a pipeline share a single trust domain, meaning they can all see the same secrets, access the same files, and talk to the same network. That shared environment is actually a feature for traditional automation. When every component is a deterministic script, sharing a trust domain makes everything composable and fast.
Agents break that assumption completely because they don’t follow a fixed script. They reason over repository state, consume inputs they weren’t specifically designed for, and make decisions at runtime. A traditional CI step either does exactly what you coded it to do or fails. An agent might do something you never anticipated, especially if it processes an input designed to manipulate it.
GitHub’s threat model for Agentic Workflows is blunt.
They assume the agent will try to read and write state that it shouldn’t, communicate over unintended channels, and abuse legitimate channels to perform unwanted actions. For example, a prompt-injected agent with access to shell commands can read configuration files, SSH keys, and Linux /proc state to discover credentials. It can scan workflow logs for tokens. Once it has those secrets, it can encode them into a public-facing GitHub object like an issue comment or pull request for an attacker to retrieve later. The agent isn’t actively malicious; it’s simply following injected instructions it couldn’t distinguish from legitimate ones.
In a standard GitHub Actions setup, everything runs in the same trust domain on top of a runner virtual machine. A rogue agent could interfere with MCP servers (the tools that extend what an agent can do), access authentication secrets stored in environment variables, and make network requests to arbitrary hosts. A single compromised component gets access to everything. The problem isn’t that Actions are insecure. It’s that agents change the assumptions that made a shared trust domain safe in the first place.
Agents can generate code. Getting it right for your system, team conventions, and past decisions is the hard part. You end up babysitting the agent and watching the token costs climb.
More MCPs, rules, and bigger context windows give agents access to information, but not understanding. The teams pulling ahead have a context engine to give agents only what they need for the task at hand.
Our April webinar filled up, so we are bringing it back! Join us live (FREE) on May 6 to see:
Where teams get stuck on the AI maturity curve and why common fixes fall short
How a context engine solves for quality, efficiency, and cost
Live demo: the same coding task with and without a context engine
GitHub Agentic Workflows use a layered security architecture with three distinct levels.
Each layer limits the impact of failures in the layer above it by enforcing its own security properties independently.
The substrate layer sits at the bottom. It’s built on a GitHub Actions runner VM and several Docker containers, including a set of trusted containers that mediate privileged operations. This layer provides isolation between components, controls system calls, and enforces kernel-level communication boundaries. These protections hold even if an untrusted component is fully compromised and executes arbitrary code within its container. The substrate doesn’t rely on the agent behaving correctly, and even arbitrary code execution inside the agent’s container hits a wall at this level.
The configuration layer sits on top of the substrate layer. This is where the system’s structure gets defined. It includes declarative artifacts and the toolchains that interpret them to set up which components are loaded, how they’re connected, what communication channels are permitted, and what privileges are assigned. The most important piece in this layer is the compiler. GitHub doesn’t just run your workflow definition as-is, but compiles it into a GitHub Action with explicit constraints on permissions, outputs, auditability, and network access. The configuration layer also controls which secrets go into which containers. Externally minted tokens like agent API keys and GitHub access tokens are loaded only into the specific containers that need them, never into the agent’s container.
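To give a flavor of what such compiled constraints could look like, here is a hypothetical shape. The field names are assumptions for illustration, not gh-aw's actual schema:

// Hypothetical shape of the constraints a compiled workflow might carry.
// Field names are illustrative assumptions, not GitHub's actual schema.
interface CompiledWorkflowConstraints {
  permissions: { contents: "read"; issues: "read" | "write" };
  allowedNetworkHosts: string[];          // enforced by the firewall container
  safeOutputs: {
    allowedTypes: Array<"issue" | "issue_comment" | "pull_request">;
    maxPerRun: Record<string, number>;    // e.g. { pull_request: 3 }
  };
  // Secrets are bound to specific trusted containers, never to the agent.
  secretBindings: Record<string, "mcp-gateway" | "api-proxy" | "firewall">;
}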
The planning layer sits on top. While the configuration layer dictates which components exist and how they communicate, the planning layer dictates which components are active over time. Its job is to create staged workflows with explicit data exchanges between stages. The safe outputs subsystem, which we’ll get to shortly, is the most important instance of this. It ensures the agent’s work gets reviewed before it affects anything real.
These layers are independent. If the planning layer fails, the configuration layer still enforces its constraints. If the configuration layer has a bug, the substrate layer still provides isolation. No single failure compromises the whole system.
From the beginning, GitHub wanted workflow agents to have zero access to secrets.
In a standard GitHub Actions setup, sensitive material like agent authentication tokens and MCP server API keys sits in environment variables and configuration files visible to all processes on the runner VM. That’s fine when everything sharing the environment is trusted. It’s dangerous with agents because they’re susceptible to prompt injection. An attacker can hide malicious instructions in a web page, a repository issue, or a pull request comment, and trick the agent into extracting and leaking whatever it can find.
GitHub’s solution is a specific container topology where secrets are physically unreachable from the agent.
The agent runs in its own Docker container on a private network. All internet access goes through a dedicated firewall container (gh-aw-firewall). MCP tools are accessed through a separate trusted container called the MCP gateway (gh-aw-mcpg), which launches MCP servers and holds exclusive access to MCP authentication material like the GitHub Personal Access Token. LLM API calls go through an isolated API proxy that holds the model’s authentication token (like the Codex token). The agent communicates with the firewall over HTTP, the firewall connects to the MCP gateway over HTTP, and the gateway connects to MCP servers over stdio.
See the diagram below:
At no point in this chain does the agent touch a secret. It can make LLM calls because the proxy adds authentication on its behalf. It can use MCP tools because the gateway authenticates on its behalf. But the agent itself is completely secretless.
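The key mechanic is credential injection at a trusted hop. Here is a minimal sketch of the idea, not GitHub's actual proxy; the upstream URL and environment variable name are assumptions:

// Minimal sketch of an auth-injecting proxy: the secret lives only in this
// process. The agent talks plain HTTP to it and never sees the token.
// Hypothetical, not GitHub's gh-aw implementation.
import http from "node:http";

const MODEL_API = "https://api.example-llm.com"; // assumed upstream
const TOKEN = process.env.MODEL_API_KEY!;        // present only in the proxy container

http.createServer(async (req, res) => {
  const body = await new Promise<Buffer>((resolve) => {
    const chunks: Buffer[] = [];
    req.on("data", (c) => chunks.push(c));
    req.on("end", () => resolve(Buffer.concat(chunks)));
  });

  // Forward the agent's request upstream, adding the credential on its behalf.
  const upstream = await fetch(`${MODEL_API}${req.url}`, {
    method: req.method,
    headers: { "content-type": "application/json", authorization: `Bearer ${TOKEN}` },
    body: req.method === "POST" ? body : undefined,
  });

  res.writeHead(upstream.status, { "content-type": "application/json" });
  res.end(Buffer.from(await upstream.arrayBuffer()));
}).listen(8080); // the agent's only permitted route to the model API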
There’s a real tradeoff here, and GitHub names it explicitly. Coding workloads need broad access to compilers, interpreters, scripts, and repository state. But you can’t just install all those tools inside the agent’s container without expanding the set of network destinations allowed through the firewall and duplicating a lot of the host setup.
Their resolution is elegant: they mount the entire VM host filesystem read-only at /host inside the agent’s container. Then they overlay sensitive paths (the ones containing secrets and configuration) with empty tmpfs layers, which are temporary in-memory filesystems that effectively hide whatever was at those paths. Finally, they launch the agent in a chroot jail rooted at /host. The result is that the agent can access every compiler, interpreter, and tool on the host machine while being unable to discover any secret material. The host-side setup stays intact. The agent’s writable and discoverable surface is constrained to only what it needs. If you work with containers, this pattern of mounting a host filesystem read-only and masking sensitive paths with tmpfs is applicable well beyond the agentic context.
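In Docker terms, the pattern might reduce to flags like these. The masked paths are illustrative assumptions, not GitHub's actual list:

// Sketch of the read-only-host + tmpfs-masking pattern as docker run flags.
// The sensitive paths here are assumptions for illustration only.
const SENSITIVE_PATHS = ["/host/home/runner/.ssh", "/host/etc/agent-secrets"];

const dockerArgs = [
  "run", "--rm",
  // Bind the whole host filesystem read-only under /host...
  "--mount", "type=bind,source=/,target=/host,readonly",
  // ...then shadow each sensitive path with an empty in-memory tmpfs.
  ...SENSITIVE_PATHS.flatMap((p) => ["--tmpfs", p]),
  "agent-image",
  // Inside the container, launch the agent in a chroot jail at /host, so
  // every host tool is visible but the masked paths read as empty.
  "chroot", "/host", "/usr/bin/agent",
];

console.log(["docker", ...dockerArgs].join(" "));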
GitHub isn’t the only one that arrived at this zero-secret principle. OpenAI’s Codex takes a different path to the same destination. In the Codex cloud, secrets are available only during the setup phase and are removed before the agent phase starts, and internet access is disabled by default during execution. GitHub uses proxies and gateways. OpenAI uses a two-phase model. The fact that both teams independently converged on “agents should never touch secrets” validates the principle.
Prompt-injected agents can cause real harm even without access to secrets. A rogue agent could spam a repository with pointless issues and pull requests to overwhelm maintainers, or inject objectionable URLs and content into repository objects. The absence of secrets doesn’t mean the absence of risk.
GitHub’s answer is the safe outputs system, and it’s the most distinctive part of their architecture.
Here’s how it works:
The agent can read GitHub state through the GitHub MCP server, which operates in read-only mode. But it cannot write to GitHub directly. All write operations go through a separate safe output MCP server, which only buffers the agent’s intended changes without executing them.
Once the agent finishes its work and exits, everything it staged goes through a deterministic analysis pipeline.
This pipeline does three things.
First, it checks the type of each operation against an allowlist that the workflow author defined. If the author said the agent can create issues and comments but not pull requests, any staged PR gets dropped.
Second, it enforces quantity limits. An agent might be restricted to creating at most three pull requests per run.
Third, it runs content sanitization, scanning for secrets that might have leaked into the output text, removing URLs, and running content moderation checks.
Only outputs that survive the entire pipeline get committed to GitHub. Every stage’s side effects are explicit and vetted.
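A toy version of that three-stage vetting could look like the following. The types, the policy shape, and the regexes are simplified assumptions, not GitHub's actual pipeline:

// Toy sketch of the safe-outputs vetting pipeline: type allowlist, quantity
// limits, then content sanitization. Simplified assumptions throughout.
type OutputType = "issue" | "issue_comment" | "pull_request";

interface StagedOutput {
  type: OutputType;
  body: string;
}

interface SafeOutputPolicy {
  allowedTypes: OutputType[];
  maxPerRun: Partial<Record<OutputType, number>>;
}

const SECRET_PATTERN = /gh[pousr]_[A-Za-z0-9]{36}/g; // GitHub token shape
const URL_PATTERN = /https?:\/\/\S+/g;

function vetOutputs(staged: StagedOutput[], policy: SafeOutputPolicy): StagedOutput[] {
  const counts: Partial<Record<OutputType, number>> = {};
  const approved: StagedOutput[] = [];

  for (const out of staged) {
    // 1. Type allowlist: drop anything the workflow author didn't permit.
    if (!policy.allowedTypes.includes(out.type)) continue;

    // 2. Quantity limits: e.g. at most three pull requests per run.
    const used = (counts[out.type] ?? 0) + 1;
    if (used > (policy.maxPerRun[out.type] ?? Infinity)) continue;
    counts[out.type] = used;

    // 3. Content sanitization: redact leaked tokens and strip URLs.
    //    (Real moderation would be a further check here.)
    approved.push({
      ...out,
      body: out.body
        .replace(SECRET_PATTERN, "[redacted]")
        .replace(URL_PATTERN, "[url removed]"),
    });
  }
  return approved; // only these ever get committed to GitHub
}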
The compiler plays an important role here, too. When it decomposes a workflow into stages, it defines for each stage the active components and their permissions (read versus write), the data artifacts that stage can emit, and the admissible downstream consumers of those artifacts.
The workflow author declares upfront what the agent is allowed to produce, and the system enforces those declarations deterministically. Since the pipeline uses deterministic analysis, it can only catch patterns that GitHub anticipated. A truly novel attack vector might slip through, which is exactly why the other layers exist. No single layer is the complete answer.
Agents are determined to accomplish their tasks by any means and can have a surprisingly deep toolbox of tricks for doing so. When an agent behaves unexpectedly, you need full visibility into what happened.
Agentic Workflows make observability a first-class architectural property by logging at every trust boundary.
Network and destination-level activity gets recorded at the firewall.
Model request/response metadata and authenticated requests are captured by the API proxy.
Tool invocations are logged by the MCP gateway and MCP servers.
GitHub even adds internal instrumentation to the agent container to audit potentially sensitive actions like environment variable accesses.
Together, these logs support full forensic reconstruction, policy validation, and detection of anomalous behavior.
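Under stated assumptions, a unified audit event spanning those boundaries might be shaped like this. The field names are hypothetical:

// Hypothetical unified audit-event shape for the trust boundaries above.
// Field names are assumptions for illustration.
interface AuditEvent {
  timestamp: string;
  boundary: "firewall" | "api-proxy" | "mcp-gateway" | "agent-container";
  action: string;                  // e.g. "net.connect", "tool.invoke", "env.read"
  detail: Record<string, string>;  // destination host, tool name, variable name...
  verdict: "allowed" | "denied";   // observation today, mediation tomorrow
}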
But there’s a more important long-term play here. Every point where you can observe communication is also a point where you can mediate it. GitHub is building the observation infrastructure now with future control in mind. They already support a lockdown mode for the GitHub MCP server, and they plan to introduce controls that enforce policies across MCP servers based on whether repository objects are public or private, and based on who authored them.
Every security decision GitHub made came with a cost.
Security versus utility is the most obvious tension. Agents running inside GitHub’s architecture are more constrained than a developer working locally. The chroot approach gives agents access to host tools, but the firewall still limits network access, and the safe outputs pipeline still restricts what the agent can produce. In other words, more security means less flexibility.
Strict-by-default is a strong opinion. Most other coding agents make sandboxing opt-in. Claude Code and Gemini CLI both require you to turn on their sandbox features. GitHub Agentic Workflows run in strict security mode by default. That’s a deliberate choice to prioritize safety over developer convenience, and it won’t be the right tradeoff for every use case.
Prompt injection remains fundamentally unsolved. GitHub’s architecture is a damage containment strategy, not a prevention strategy. It limits the blast radius when an agent gets tricked, but it can’t prevent the issue itself. And the deterministic vetting in the safe outputs pipeline can only catch patterns that were anticipated. A novel attack vector might need a new pipeline stage.
The architecture is also complex, involving multiple containers, proxies, gateways, a compilation step, and a staged output pipeline. This is engineering overhead that makes sense at GitHub’s scale. Simpler setups may not need every piece.
As AI agents become standard in development tooling, the question will shift from whether to sandbox to how to build a complete security architecture.
GitHub’s four principles offer a transferable framework:
Defend in depth with independent layers.
Keep agents away from secrets by architecture, not policy.
Vet every output through deterministic analysis before it affects the real world.
Log everything at every trust boundary, because today’s observability is tomorrow’s control plane.