2025-05-31 23:30:19
If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guarantee
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
This week’s system design refresher:
AI Agent versus MCP
How does HTTPS work?
How to Learn Kubernetes?
Top 6 most commonly used Server Types
Amazon Key Architecture with Third Party Integration
Hiring Now: Top AI Startups and AI Roles
SPONSOR US
An AI agent is a software program that can interact with its environment, gather data, and use that data to achieve predetermined goals. AI agents can choose the best actions to perform to meet those goals.
Key characteristics of AI agents are as follows:
Agents can perform autonomous actions without constant human intervention, though they can also keep a human in the loop to maintain control.
Agents have memory to store individual preferences and knowledge, which enables personalization. An LLM typically handles the information processing and decision-making functions.
Agents must be able to perceive and process the information available from their environment.
Model Context Protocol (MCP) is a new system introduced by Anthropic to make AI models more powerful.
It is an open standard that allows AI models (like Claude) to connect to databases, APIs, file systems, and other tools without needing custom code for each new integration.
MCP follows a client-server model with 3 key components:
Host: AI applications like Claude
MCP Client: The component inside the host application that lets the AI model communicate with MCP servers
MCP Server: Middleman that connects an AI model to an external system
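To make the client-server model concrete, here is a minimal sketch of an MCP server written against the interface the official MCP Python SDK exposes (the FastMCP class). The server name and tool are hypothetical examples, not part of any real integration.

```python
# A minimal MCP server sketch, assuming the official MCP Python SDK
# (pip install mcp). The server name and tool are hypothetical examples.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-demo")

@mcp.tool()
def get_temperature(city: str) -> str:
    """Return a canned temperature reading (stand-in for a real API or database call)."""
    return f"The temperature in {city} is 21°C"

if __name__ == "__main__":
    # The host application (e.g., Claude Desktop) launches this process,
    # and its MCP client exchanges messages with it over stdio.
    mcp.run()
```

The host launches this script, and its built-in MCP client talks to it over stdio, so the model can call get_temperature without any custom integration code.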
Over to you: Have you used AI Agents or MCP?
This webinar, "Using Static Code Analysis Correctly", explores how static code analysis can be a powerful ally in maintaining code quality, preventing software erosion, and ensuring long-term maintainability.
Whether you're working in automotive, medical, embedded systems, or other safety-critical industries, you'll gain practical insights into spotting design drift, managing architectural consistency, and avoiding common pitfalls in C and C++ codebases.
In this 40-minute webinar, we break down real-world challenges and demonstrate how thoughtful use of static analysis can help you build more robust, reliable systems from the inside out.
Hypertext Transfer Protocol Secure (HTTPS) is an extension of the Hypertext Transfer Protocol (HTTP). HTTPS transmits encrypted data using Transport Layer Security (TLS). If the data is hijacked online, all the hijacker gets is encrypted binary data.
How is the data encrypted and decrypted?
Step 1 - The client (browser) and the server establish a TCP connection.
Step 2 - The client sends a “client hello” to the server. The message contains the set of encryption algorithms (cipher suites) it supports and the latest TLS version it can use. The server responds with a “server hello”, confirming the cipher suite and TLS version it has chosen from that list.
The server then sends the SSL certificate to the client. The certificate contains the public key, host name, expiry dates, etc. The client validates the certificate.
Step 3 - After validating the SSL certificate, the client generates a session key and encrypts it using the public key. The server receives the encrypted session key and decrypts it with the private key.
Step 4 - Now that both the client and the server hold the same session key (symmetric encryption), the encrypted data is transmitted in a secure bi-directional channel.
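To illustrate the hybrid-encryption idea behind steps 3 and 4, here is a minimal Python sketch using the third-party cryptography package. It is not real TLS: there is no certificate validation or handshake negotiation, just a session key encrypted with the server’s public key and then reused for symmetric encryption.

```python
# A minimal sketch (not real TLS) of the hybrid-encryption idea in steps 3 and 4:
# the client encrypts a random session key with the server's public key, then
# both sides use that session key for fast symmetric encryption.
# Requires the third-party "cryptography" package.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Server side: an RSA key pair (the public key ships inside the certificate).
server_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
server_public_key = server_private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Step 3 (client): generate a session key and encrypt it with the public key.
session_key = AESGCM.generate_key(bit_length=128)
encrypted_session_key = server_public_key.encrypt(session_key, oaep)

# Step 3 (server): decrypt the session key with the private key.
recovered_key = server_private_key.decrypt(encrypted_session_key, oaep)

# Step 4: both sides now share the same key and switch to symmetric encryption.
nonce = os.urandom(12)
ciphertext = AESGCM(session_key).encrypt(nonce, b"GET / HTTP/1.1", None)
assert AESGCM(recovered_key).decrypt(nonce, ciphertext, None) == b"GET / HTTP/1.1"
```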
Why does HTTPS switch to symmetric encryption during data transmission? There are two main reasons:
Security: Asymmetric encryption only works in one direction for a given key pair: data encrypted with the public key can only be decrypted with the private key. If the server encrypted its responses with the private key, anyone holding the widely distributed public key could decrypt them.
Server resources: The asymmetric encryption adds quite a lot of mathematical overhead. It is not suitable for data transmissions in long sessions.
Over to you: how much performance overhead does HTTPS add, compared to HTTP?
Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications.
Core Concepts and Architecture
This includes topics like what Kubernetes is, along with Cluster, Node, Pod, Control Plane, and Worker Node.
Workloads and Controllers
This includes topics like Pod, ReplicaSet, Deployment, StatefulSet, Job, CronJob, Labels, Selectors, and Autoscalers (HPA, VPA, and Cluster Autoscaler).
Networking and Service Management
This involves topics like Cluster Networking Model, Services (ClusterIP, NodePort, LoadBalancer), Ingress, and Network Policies.
Storage and Configuration
This includes topics like Volumes, PersistentVolume, PersistentVolumeClaim, Storage Classes, ConfigMap, Secret, and Stateful Applications.
Security & Access Control
This involves topics like RBAC, Service Accounts, Secrets Management, Admission Controllers, Pod Security, TLS, and API Access.
Tools, Observability & Ecosystem
This includes topics like kubectl, YAML files, Helm Charts, CI/CD Integration, GitOps, Logging, Monitoring, and EKS.
Over to you: What else will you add to the list for learning Kubernetes?
Web Server:
Hosts websites and delivers web content to clients over the internet
Mail Server:
Handles the sending, receiving, and routing of emails across networks
DNS Server:
Translates domain names (like bytebytego.com) into IP addresses, enabling users to access websites by their human-readable names.
Proxy Server:
An intermediary server that acts as a gateway between clients and other servers, providing additional security, performance optimization, and anonymity.
FTP Server:
Facilitates the transfer of files between clients and servers over a network
Origin Server:
Hosts central source of content that is cached and distributed to edge servers for faster delivery to end users.
Over to you: Which type of server do you find most crucial in your online experience?
Amazon Key is a REALLY convenient way to have your packages securely delivered right inside your garage or multifamily property.
We sat down with Kaushik Mani and Vijayakrishnan Nagarajan to learn how a small side project evolved into a global platform: now powering over 100 million secure door unlocks every year.
This story is packed with valuable lessons on:
Scaling IoT infrastructure across thousands of locations
Building resilient, field-ready hardware for real-world environments
Overcoming cold starts and connectivity challenges
Designing a microservices architecture for global expansion
Creating a secure, partner-ready platform to support third-party integrations
And many more
Dive into the full newsletter here.
This Week’s High-Impact Roles at Fast-Growing AI Startups
Software Engineer, AI Platform at Pano AI
Senior Software Engineer (AI Products) at Mutiny
Senior Software Engineer, Financial Infrastructure at Vercel
Senior Software Engineer at Otter.ai
Senior Software Engineer, ML Infrastructure, Predictive Planner at Waymo
Senior Full Stack Engineer at Gatik
Senior / Staff Software Engineer, Simulation Platform at Waabi
High Salary SWE Roles this week
Engineering Manager, ML Performance and Scaling at Anthropic
Director of Engineering, Data Platform at WorkAsPro
Engineering Manager - Trust and Safety at TikTok
Senior Full Stack Software Engineer at Netflix
VP, GTM Engineering at LinkedIn
VP, Sales Engineering, Northeast Division at Comcast
Senior Staff Software Engineer, iOS at Salesforce
Today’s latest ML positions - hiring now!
Machine Learning Engineer at Amazon
Senior Machine Learning Engineer at Yahoo
Principal Data Scientist - Machine Learning Engineering at Atlassian
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-05-29 23:30:46
HTTP is the backbone of modern internet communication, powering everything from browser requests to microservices inside Kubernetes clusters.
Every time a browser loads a page, every time an app fetches an API response, every time a backend service queries another, it’s almost always over HTTP. That’s true even when the underlying transport changes. gRPC, for example, wraps itself around HTTP/2. RESTful APIs, which dominate backend design, are just a convention built on top of HTTP verbs and status codes. CDNs, caches, proxies, and gateways optimize around HTTP behavior.
Understanding HTTP is less about memorizing status codes and more about internalizing the performance trade-offs baked into the protocol’s evolution. HTTP/1.0 opened the door. HTTP/1.1 made it usable at scale. HTTP/2 pushed efficiency by multiplexing streams over a single TCP connection. And HTTP/3, built on QUIC over UDP, finally breaks free from decades-old constraints.
In this article, we trace that journey. From the stateless, text-based simplicity of HTTP/1.1 to the encrypted, multiplexed, and mobile-optimized world of HTTP/3.
Along the way, we’ll look at various important aspects of HTTP:
How does TCP’s design influence HTTP performance, especially in high-latency environments?
Why does head-of-line (HOL) blocking become a problem at high scale?
How do features like header compression, server push, and connection reuse change the performance game?
And where does HTTP/3 shine, and where does it still face issues?
2025-05-28 23:30:31
Note: This article is written in collaboration with the engineering team of Amazon Key. Special thanks to Kaushik Mani (Director at Amazon) and Vijay Nagarajan (Engineering Leader) from Amazon Key for walking us through the architecture and challenges they faced while building this system. All credit for the technical details and diagrams shared in this article goes to the Amazon Key Engineering Team.
Picture a customer buzzing with excitement for their package, only to find a "delivery failed" slip because a locked gate stood in the way.
That’s what Amazon faced in 2018. At the time, delivery delays were frequently caused by delivery associates being unable to enter access-controlled areas in residential and commercial properties, including gated communities, leading to missed deliveries and poor customer experiences.
The access control systems in such areas were never designed to work with modern logistics. Moreover, the hardware systems were wildly fragmented, with no common standard. Many of these devices were hardwired into buildings in ways that made network access unreliable or downright impossible.
To bridge this gap, Amazon launched an initiative to address a common but impactful delivery challenge: how to provide drivers with secure access to gated residential and commercial communities and buildings.
The result was Amazon Key: a system that allows verified delivery associates to unlock gates and doors at the right time, with the right permissions, and only for the duration needed to complete a delivery. What started with a small internal team and a few device installations has now grown into a system that unlocks 100 million doors annually, across 10+ countries and 4 continents, with five unlocks every second.
We recently had the wonderful opportunity to sit with Kaushik Mani and Vijay Nagarajan from the Amazon Key team to learn how they built this system and the challenges they faced.
In this article, we are bringing you our findings and what we learned about Amazon’s culture of entrepreneurship that played a key role in making Amazon Key a reality.
The origins of Amazon Key can be traced back to 2016.
Upon moving to Seattle to work for Amazon, Kaushik noticed a number of underutilized parking spots in buildings. However, technology to open them up to users was lacking. Kaushik worked on his own initiative to solve this problem for a Seattle apartment building. The solution needed to address a highly fragmented category of garage door mechanisms.
Kaushik invented a cloud-connected universal key that worked on any electrically controlled lock. This later became known as Amazon Key. Kaushik deployed it at customer locations to validate the solution. He wrote the business plan around his invention and pitched it to several leaders across Amazon. After 6 months of pitching the idea and receiving several refusals, Kaushik was eventually funded by Marketplace.
He worked on the idea for a year, but unfortunately, no customers signed up. Building owners loved the idea of selling parking but worried about the trust and safety implications of allowing access to anyone. So, in 2018, he pivoted the business case to focus on building access for last-mile deliveries and launched the first version of Amazon Key with one firmware engineer and one hardware engineer.
From a stats point of view, Amazon delivery associates deliver over a billion packages annually to tens of millions of customers living in access-restricted apartment buildings in the US, EU, and Japan. Without Amazon Key, a driver facing an access control barrier must engage the customer, property manager, or other residents to gain entry, which means packages are often delivered late or not at all. One example highlighted by the team involved a customer expecting a cereal delivery at 4 a.m., timed precisely so breakfast could be prepared on schedule. If the delivery associate couldn’t get past a gate, that delivery would fail, turning what should be a seamless experience into a broken promise.
Not having on-demand access to these buildings was a core capability gap, resulting in repeated defects, the impact of which started becoming more pronounced as customer demand and delivery speeds increased. This shifted the idea from “interesting” to “strategic necessity”.
By 2018, Amazon Key entered early-stage operations. The rollout was intentionally conservative, targeting just 100 buildings to start, followed by another 100 shortly after. But even in this narrow scope, the system’s potential became clear.
Still, scaling was far from straightforward. As Kaushik Mani put it brilliantly: "It takes $5 to build the solution, but $95 to make it secure." That ratio became even more daunting in the context of Amazon’s global footprint, where every expansion introduced a new set of hardware, protocols, and deployment challenges.
And that’s where the Amazon Key engineering team came together to build a solution that made this scale possible.
Amazon started where most pragmatic teams start: build the simplest thing that works.
They created a small, Ethernet-connected device that could physically integrate with most ACS (Access-Control System) hardware. When a delivery associate showed up at a gate, they’d tap in the Amazon Flex App, which triggered a cloud command to the device via AWS IoT to open the gate.
See the diagram below that shows this setup:
The system stack looked something like this:
Hardware: A small, Ethernet-connected device installed on-site, physically wired to the building’s existing access control systems. It could trigger unlocks for 95% of gate types.
AWS IoT: Provided secure, two-way communication between the device and the cloud.
AWS Lambda (Java): Handled unlock requests, triggered by delivery associates using the mobile app.
DynamoDB: Stored metadata like device status, gate mappings, and property permissions.
Amazon Flex App: The application delivery associates used to initiate access requests when approaching a gate.
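As a rough illustration of how such an unlock request could be wired together with these AWS primitives, here is a hedged Python sketch of a Lambda-style handler. The table name, topic format, and payload fields are assumptions for illustration only; Amazon Key’s actual Lambdas were written in Java and their internals are not public.

```python
# A hedged sketch of the early unlock flow as a Lambda handler: look up the
# device for a gate in DynamoDB, then publish an unlock command via AWS IoT.
# Table, topic, and field names are illustrative assumptions.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
iot_data = boto3.client("iot-data")

def handler(event, context):
    gate_id = event["gate_id"]

    # Resolve the gate -> device mapping (hypothetical table and keys).
    table = dynamodb.Table("GateDeviceMappings")
    device = table.get_item(Key={"gate_id": gate_id})["Item"]

    # Fire the unlock command to the device's command topic via AWS IoT.
    iot_data.publish(
        topic=f"devices/{device['device_id']}/commands",
        qos=1,
        payload=json.dumps({"action": "unlock", "duration_seconds": 5}),
    )
    return {"status": "command_sent"}
```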
The early design was lean, minimal, and pragmatic. The goal was to give Amazon delivery associates a reliable way to unlock residential gates. This approach worked well for limited properties.
However, as Amazon Key expanded, hundreds of devices became thousands. A few cities became multiple countries. The rapid expansion created issues for the solution’s scalability in different areas:
When deploying access devices across thousands of properties, physical installation constraints quickly became a major factor in hardware design.
Space limitations were a frequent issue. Callboxes, where devices were often installed, have very limited internal space. Larger hardware simply could not fit inside these enclosures. Similarly, at common installation points like mail rooms, front doors, or outdoor callboxes, Ethernet connectivity was often unavailable, making it impractical to rely on wired networking as a standard installation requirement. Running new Ethernet cables required drilling through walls, digging up pathways, or convincing building managers to modify shared infrastructure.
Initially, the device supported two relay ports and two Wiegand ports, allowing it to control multiple access points and interact with various types of legacy systems. However, based on field experience, the design was streamlined to one relay and one Wiegand port. This change reduced hardware complexity and footprint, which made installations easier and more reliable.
Over time, the team also decided to stop using the Wiegand interface during installations. The main reason was operational: Wiegand credentials often change periodically, making long-term integration unreliable and harder to maintain without frequent updates.
Another problem the team faced was the high cold-start latency of Java-based Lambdas. That might be fine for a background task, but not when a delivery associate is standing at a gate holding a package to deliver.
Also, the initial backend design used a shared gateway across countries. This meant launching in new regions required coordination with multiple teams, environments, and deployment pipelines.
The unlock commands were fire-and-forget.
The delivery associates had no feedback. They’d tap “unlock” and hope for the best. If a device was offline, the system didn’t know. Also, the operations team couldn’t see device health or network quality remotely. Troubleshooting meant sending people into the field, which was slow, expensive, and frustrating.
The early Lambda-based design was fast to build, but faced issues in terms of scalability. Therefore, the Amazon Key engineering team decided to design the system for a global scale.
Two key changes they made were as follows:
No matter how smart the backend was, if the device couldn’t connect reliably, nothing else mattered. So the hardware team built something better.
They created a new, compact, cellular-enabled device, small enough to install discreetly and robust enough to operate in the field. It had multi-carrier support baked in, spanning 70+ countries, with failover if one network dropped. It removed the earlier dependency on local building infrastructure.
This change shifted the device from “hard to deploy” to “install and forget.” It also gave Amazon a repeatable, global deployment model, which was critical for international rollout.
The backend team knew Lambda wasn’t going to cut it anymore, not at this scale.
The delivery associates needed real-time unlocks. That meant persistent device connections, low tail latency, and guaranteed CPU availability. Lambda (especially with Java) wasn’t ideal for that. Therefore, they moved to ECS Fargate, Amazon’s container-native compute layer.
So, why was ECS Fargate chosen specifically?
Vijay Nagarajan from the Amazon Key engineering team gave a few important reasons for this choice:
Fargate allows fine-grained control over vCPU and memory settings, which helped the team optimize performance based on real-world load.
Deploying across multiple Availability Zones (AZs) allowed the system to balance cost and availability.
Fargate can scale based on concrete indicators like outstanding request count, memory pressure, or CPU usage.
Compared to running the same workload on EC2, Fargate proved to be more cost-efficient, especially since it avoided the operational burden of managing EC2 instances.
Unlike Lambda, which spins up and tears down execution contexts, Fargate tasks remain provisioned and continuously running, making them well-suited for latency-sensitive workflows like device unlocks.
In short, Fargate gave them the control and performance of EC2, without the ops overhead of managing VMs.
The hardest part of this evolution was surgically moving live traffic to the newer services. The team had to create feature flags to support both flows. This meant that every time they introduced a new service, they had to slowly migrate traffic to the newer service.
To make things clear, Lambda wasn’t completely abandoned. There are still a few use cases that rely on Lambda, such as:
One-time device JITR (just-in-time registration) that happens at the factory.
During installation, installers also take a photo of where the device is installed. The post-processing that scans the image for malware and checks its size and resolution is still a Lambda function.
Together, cellular hardware and containerized backend unlocked a new class of capabilities such as:
Faster unlock response times (<1.5s consistently).
Global expansion without localized network constraints.
The ability to handle edge cases gracefully, like intermittent connectivity, retries, and failovers.
It also laid the foundation for the next evolution: breaking the system into modular services, standardizing workflows, and monitoring what matters.
The move to ECS Fargate was also a chance to break apart the system into clean, well-scoped services. Amazon Key went from a collection of Lambda functions to a cohesive service-oriented architecture that could be evolved in the future.
See the diagram below for a high-level view of the system running on ECS Fargate:
This wasn’t microservices for microservices’ sake. Each service took care of a real need. Here are the details of the key services that were a part of the overall architecture:
This service was developed to simplify the installation process and support onboarding new properties into the Amazon Key backend.
This service handles requests originating from the Flex App when a delivery is scheduled to an Amazon Key-enabled property. It also enabled the system to support international launches by managing region-specific traffic routing.
This service maintains the relationships between gates, properties, and devices. It also manages mappings with Amazon Logistics and associated job workflows such as installation and maintenance.
Built on top of AWS IoT, this service provided a wrapper around commands sent to devices. It supported both synchronous and asynchronous APIs and streamed device performance metrics into Redshift for analysis.
This service listens to events from the Access Management Service and onboards properties into the Amazon Logistics system, allowing them to be used in routing and delivery workflows.
This service provides pipelines for deploying firmware updates to devices in the field. It supported two modes:
One-time OTA used to test firmware with a small cohort of devices.
Campaign-based OTA, which rolled out updates across device pools grouped by geographical location until all were up to date.
The Flex App is used by Amazon delivery associates to complete package deliveries. It integrates with Amazon Key to support access control during deliveries.
Metrics from different services are pushed into Redshift for analytics. Dashboards built using AWS QuickSight provide both aggregated summaries and device-level drilldowns.
Switching from Ethernet to cellular connectivity solved key deployment issues but introduced new challenges.
Devices were installed in a variety of environments, such as exposed call boxes outside communities or electrical rooms located deep within buildings, and faced varying connectivity conditions. Cellular performance was inconsistent and location-dependent, often changing throughout the day. These fluctuations impacted the reliability of time-sensitive operations like unlock requests, especially during delivery hours.
To address this, the team developed the Intelligent Connection Manager (ICM) to improve device behavior when cellular performance is inconsistent, achieving better availability, failover, and reconnection times.
ICM enables the system to respond to real-world variability in connectivity without manual intervention, helping maintain access availability during critical delivery windows.
Here’s what the ICM does in more detail:
Monitoring: ICM continuously monitors device performance using:
EventBridge to capture real-time events
Redshift for historical trend analysis
Step Functions to coordinate health checks and remediation workflows
S3 for storage and processing support
Analysis: The system identifies poorly performing devices by evaluating recent interaction history and applying defined rules.
Remediation: When issues are detected, ICM triggers automated corrective actions such as:
Rebooting the device
Rescanning the network
Switching cellular carriers
If these measures are not sufficient, the metrics help the operations team determine when manual servicing is required.
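As a simplified illustration of this monitor-analyze-remediate loop, here is a small Python sketch. The metric names, thresholds, and actions are assumptions for illustration, not ICM’s actual rules.

```python
# A simplified, hedged sketch of rule-based remediation in the spirit of ICM.
# Thresholds, metric names, and actions are illustrative assumptions only.
def choose_remediation(device_metrics: dict) -> str:
    failure_rate = device_metrics["unlock_failure_rate"]      # fraction of failed unlocks
    signal_strength = device_metrics["signal_strength_dbm"]   # cellular signal, in dBm

    if failure_rate < 0.05:
        return "none"                    # device is healthy
    if signal_strength < -110:
        return "switch_carrier"          # persistently weak signal -> try another carrier
    if device_metrics.get("hours_since_reboot", 0) > 24:
        return "reboot_device"           # stale connection state -> reboot
    if failure_rate > 0.5:
        return "flag_for_field_service"  # automation exhausted -> manual servicing
    return "rescan_network"

print(choose_remediation({"unlock_failure_rate": 0.2, "signal_strength_dbm": -115}))
```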
As Amazon Key matured, the team shifted its focus from solving Amazon's internal delivery challenges to building a general-purpose access platform. The system had already proven it could scale reliably. The next step was to extend that capability to external delivery providers.
In 2023, Amazon Key announced its integration with Grubhub. This marked the transition from a closed product to a more extensible platform. To support this, the team introduced a new architectural component: the Partner Gateway Service.
This service acts as a boundary between Amazon Key's internal systems and third-party applications. It exposes a clean, stable API that allows vetted external partners to request access to secure properties without revealing internal implementation details.
See the diagram below that shows the expanded architecture:
The Partner Gateway Service was designed to ensure that expansion did not compromise system integrity or performance. It provides features such as:
Partner onboarding workflows: Validates and registers new third-party delivery partners into the system.
Authentication and authorization: Ensures that only approved entities can request unlocks, and only under permitted conditions.
Rate limiting and security enforcement: Prevents abuse and isolates faults to protect core services.
Interface abstraction: Maintains a clear separation between partner-facing APIs and internal services to avoid tight coupling.
Amazon Key uses different authentication and authorization mechanisms for internal and third-party delivery systems, tailored to their integration models and security requirements.
For Amazon’s internal delivery flow, authentication is performed by AMZL (Amazon Logistics) services when a driver uses the Flex App. Once the delivery associate arrives at the property and it has been verified that the specific driver has an assigned delivery for that property at this time, Amazon Key’s backend issues a short-lived token to unlock the gate. This token is used to authorize access during the delivery attempt.
For third-party delivery providers, mutual TLS (mTLS) is used for authentication during the onboarding process. A profile is created per partner, and all communication is secured through mTLS to validate both client and server identities. The Partner Gateway Service handles authorization and request orchestration for these external systems.
Some key points shared by Vijay Nagarajan regarding access expiry and revocation are as follows:
Access is controlled through a time-bound token issued when the driver arrives and parks at the location.
The token is extended at 30-second intervals during the delivery window to maintain access while the delivery is in progress.
In failure scenarios, such as device unavailability due to power loss, the system notifies the driver through the Flex App that 1-click access is not available. In such cases, the drivers can continue accessing the property through standard access mechanisms (for example, through the lobby). One such example cited was power loss during a hurricane in Florida, which made certain devices unreachable.
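Here is a minimal Python sketch of the time-bound token pattern described above: issue a short-lived token on arrival, extend it in roughly 30-second increments while the delivery is in progress, and let it lapse afterward. Durations, field names, and the in-memory representation are illustrative assumptions, not Amazon Key’s actual implementation.

```python
# A hedged sketch of a time-bound access token that is periodically extended
# while a delivery is in progress. All names and durations are illustrative.
import time
import uuid

TOKEN_TTL_SECONDS = 30

def issue_access_token(driver_id: str, property_id: str) -> dict:
    """Issued once the driver has arrived and the delivery assignment is verified."""
    return {
        "token_id": str(uuid.uuid4()),
        "driver_id": driver_id,
        "property_id": property_id,
        "expires_at": time.time() + TOKEN_TTL_SECONDS,
    }

def extend_token(token: dict) -> None:
    """Called roughly every 30 seconds while the delivery is still in progress."""
    token["expires_at"] = time.time() + TOKEN_TTL_SECONDS

def is_token_valid(token: dict) -> bool:
    return time.time() < token["expires_at"]

token = issue_access_token("driver-42", "property-7")
assert is_token_valid(token)   # usable during the delivery window
extend_token(token)            # delivery still in progress, so access continues
```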
Operating so many services and features requires Amazon Key to ensure a clean separation of concerns. Here’s a quick look at the distribution of responsibility and the overall team composition.
The Gateway Service is responsible for performing authentication and authorization, and for orchestrating requests from client applications.
The Access Management Service stores and manages the relationships between devices, access points, installation jobs, and address mappings.
The Partner Gateway Service also handles authentication and authorization, using mutual TLS (mTLS), and orchestrates requests from external delivery partners.
The Device Service does not hold contextual information about where a device is located. Instead, it maintains knowledge about the type of device and the operations it supports.
The backend infrastructure team is responsible for managing all these services. The broader team is organized into three focus areas:
App Development
Front-End Development
Backend Common Infrastructure
By rethinking both hardware and software architecture, Amazon Key was able to evolve from a niche internal solution focused on delivery associates into a large-scale, extensible access control platform. The results reflect a system that is not only convenient, secure, reliable, and adaptable to operational realities, but also serves a wide variety of audiences, including property owners and managers, residents, and guests, across single-family, multifamily, and commercial properties.
Today, Amazon Key supports over 100 million successful unlocks annually with extremely high system availability and low end-to-end latency from app tap to physical unlock. As a result, first-attempt delivery success has improved, and defects per building have decreased.
These improvements directly contributed to more efficient deliveries, lower support costs, and a more seamless access control experience, at scale.
Here are some key learnings from Amazon Key’s experience building their system:
Evolve as You Scale: The initial serverless design was ideal for fast iteration, but couldn’t support the demands of a global system with strict latency and connectivity requirements. Transitioning to ECS Fargate enabled more consistent performance and stateful processing.
Measure What Matters: Instead of measuring generic uptime, the team focused on availability during delivery hours. This shifted the optimization focus to periods that affect end users, resulting in more actionable metrics and better system tuning.
Standardization Enables Speed: Using a consistent tech stack (Java across services, Infrastructure as Code, and AWS-native tooling) allowed the team to move faster without sacrificing maintainability. Reuse became a strength rather than a constraint.
Plan for Imperfect Environments: Many design decisions assume ideal conditions. In practice, field deployments introduced a range of variables: poor signal strength, hardware variation, and environmental impact. Designing with these constraints in mind was key to building resilience.
Operate Based on Data: By feeding all system metrics into a centralized analytics pipeline, the team could proactively identify issues, validate changes, and understand usage patterns. This led to faster incident response and continuous system improvement.
Use Tools Where They Fit: Lambda was not abandoned entirely. It remained useful for stateless, low-latency tasks. However, core workflows were moved to ECS for predictability and control. Tool choice became a matter of fit, not philosophy.
Design for Growth: The Partner Gateway Service shows the value of designing for extensibility. It enabled Amazon Key to expand from a single-use product into a multi-tenant platform, supporting external partners without disrupting core operations.
At the end, Vijay Nagarajan shared one key point regarding the journey: “It’s easy to say we would have arrived at the scaling architecture right away. But there are so many unknowns when we scale, especially in the business we are in. We have to inevitably grow through the learning phase. We could have potentially accelerated the learning phase by getting some of the basic metrics/telemetry from the device and prioritizing the OTA infrastructure. But for the rest, we are evolving in the right direction.”
2025-05-27 23:30:56
Obsessed with performance and low latency engineering? Discuss your optimizations and lessons learned with ~30K like-minded engineers… at P99 CONF 2025!
P99 CONF is a highly technical conference known for lively discussion. ScyllaDB makes it free and virtual, so it’s open to experts around the world. Core topics for this year include Rust, Zig, databases, event streaming architectures, measurement, compute/infrastructure, Linux, Kubernetes, and AI/ML.
If you’re selected to speak, you’ll be in amazing company. Past speakers include the creators of Postgres, Bun, Honeycomb, tokio, and Flask – plus engineers from today’s most impressive tech leaders.
Bonus: Early bird registrants get 30-day access to the complete O’Reilly library & learning platform, plus free digital books
Disclaimer: The details in this post have been derived from the articles/videos shared online by the Uber Eats engineering team. All credit for the technical details goes to the Uber Eats Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Uber Eats set out to increase the number of merchants available to users by a significant multiple. The team referred to it as nX growth. This wasn’t a simple matter of onboarding more restaurants. It meant expanding into new business lines like groceries, retail, and package delivery, each with its own scale and technical demands.
To accommodate this, the search functionality needed to support this growth across all discovery surfaces:
Home feed, where users browse curated carousels.
Search, covering restaurant names, cuisines, and individual dishes.
Suggestions, which include autocomplete and lookahead logic.
Ads, which plug into the same backend and share the same constraints.
The challenge wasn’t just to show more. It was to do so without increasing latency, without compromising ranking quality, and without introducing inconsistency across surfaces.
A few core problems made this difficult:
Vertical expansion: Grocery stores often include over 100,000 items per location. Retail and package delivery add their own indexing complexity.
Geographic expansion: The platform shifted from neighborhood-level search to intercity delivery zones.
Search scale: More merchants and more items meant exponential growth in the number of documents to index and retrieve.
Latency pressure: Every additional document increases compute costs in ranking and retrieval. Early attempts to scale selection caused 4x spikes in query latency.
To support nX merchant growth, the team had to rethink multiple layers of the search stack from ingestion and indexing to sharding, ranking, and query execution. In this article, we look at the breakdown of how Uber Eats rebuilt its search platform to handle this scale without degrading performance or relevance.
mabl’s 6th State of Testing in DevOps Report explores the impact of software testing, test automation, organizational growth, and DevOps maturity across the software development lifecycle.
Uber Eats search is structured as a multi-stage pipeline, built to balance large-scale retrieval with precise, context-aware ranking.
Each stage in the architecture has a specific focus, starting from document ingestion to final ranking. Scaling the system for millions of merchants and items means optimizing each layer without introducing bottlenecks downstream.
The system ingests documents from multiple verticals (restaurants, groceries, retail) and turns them into searchable entities.
There are two primary ingestion paths:
Batch Ingestion: Large-scale updates run through Apache Spark jobs. These jobs transform raw source-of-truth data into Lucene-compatible search documents, partition them into shards, and store the resulting indexes in an object store. This is the backbone for most index builds.
Streaming Ingestion: Real-time updates flow through Kafka as a write-ahead log. A dedicated ingestion service consumes these updates, maps them to the appropriate shard, and writes them into the live index.
Priority-Aware Ingestion: Not all updates are equal. The system supports priority queues so urgent updates, like price changes or store availability, are ingested ahead of less critical ones. This ensures high-priority content reflects quickly in search results.
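As a toy illustration of priority-aware ingestion, the sketch below uses an in-process priority queue. The real pipeline consumes from Kafka, and the priority classes shown here are assumptions.

```python
# A toy sketch of priority-aware ingestion: urgent updates (availability, price)
# jump ahead of bulk catalog edits. Priority classes and update shapes are
# illustrative; the real pipeline consumes these updates from Kafka.
import heapq
import itertools

PRIORITY = {"availability_change": 0, "price_change": 1, "catalog_edit": 2}
counter = itertools.count()  # tie-breaker keeps ordering stable within a priority
queue = []

def enqueue(update):
    heapq.heappush(queue, (PRIORITY[update["type"]], next(counter), update))

enqueue({"type": "catalog_edit", "store": "megamart"})
enqueue({"type": "availability_change", "store": "corner_grocer"})
enqueue({"type": "price_change", "store": "megamart", "item": "milk"})

while queue:
    _, _, update = heapq.heappop(queue)
    print("ingesting:", update["type"])  # availability first, then price, then catalog
```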
The retrieval layer acts as the front line of the search experience. Its job is to fetch a broad set of relevant candidates for downstream rankers to evaluate.
Recall-focused retrieval: The system fetches as many potentially relevant documents as possible, maximizing coverage. This includes stores, items, and metadata mapped to the user’s location.
Geo-aware matching: Given that most searches are tied to physical delivery, the retrieval process incorporates location constraints using geo-sharding and hex-based delivery zones. Queries are scoped to shards that map to the user’s region.
Once the initial candidate set is retrieved, a lightweight ranking phase begins.
Lexical matching: Uses direct term overlap between the user’s query and the indexed document fields.
Fast filtering: Filters out low-relevance or out-of-scope results quickly, keeping only candidates worth further evaluation.
Efficiency-focused: This stage runs directly on the data nodes to avoid unnecessary network fanout. It's designed for speed, not deep personalization.
Before documents reach the second-pass ranker, they go through a hydration phase.
Each document is populated with additional context: delivery ETAs, promotional offers, loyalty membership info, and store images. This ensures downstream components have all the information needed for ranking and display.
This is where the heavier computation happens, evaluating business signals and user behavior.
Personalized scoring: Models incorporate past orders, browsing patterns, time of day, and historical conversion rates to prioritize results that match the user’s intent.
Business metric optimization: Ranking is also shaped by conversion likelihood, engagement metrics, and performance of past campaigns, ensuring search results aren’t just relevant, but also effective for both user and platform.
Scaling search isn't just about fetching more documents. It's about knowing how far to push the system before performance breaks.
At Uber Eats, the first attempt to increase selection by doubling the number of matched candidates from 200 to 400 seemed like a low-risk change. In practice, it triggered a 4X spike in P50 query latency and exposed deeper architectural flaws.
The idea was straightforward: expand the candidate pool so that downstream rankers have more choices. More stores mean better recall. However, the cost wasn’t linear because of the following reasons:
Search radius grows quadratically: Extending delivery range from 5 km to 10 km doesn’t double the document count—it increases the search area by a factor of four. Every added kilometer pulls in a disproportionately larger set of stores.
Retrieval becomes I/O-bound: The number of documents per request ballooned. Queries that once matched a few thousand entries now had to sift through tens of thousands. The Lucene index, tuned for fast lookups, started choking during iteration.
The geo-sharding strategy, built around delivery zones using hexagons, wasn't prepared for expanded retrieval scopes.
As delivery radii increased, queries began touching more distant shards, many of which were optimized for different traffic patterns or data distributions. This led to inconsistent latencies and underutilized shards in low-traffic areas.
Ingestion and query layers weren’t fully aligned.
The ingestion service categorizes stores as “nearby” or “far” based on upstream heuristics. These classifications didn’t carry over cleanly into the retrieval logic. As a result, rankers treated distant and local stores the same, skewing relevance scoring and increasing CPU time.
Uber Eats search is inherently geospatial. Every query is grounded in a delivery address, and every result must answer a core question: Can this store deliver to this user quickly and reliably?
To handle this, the system uses H3, Uber’s open-source hexagonal spatial index, to model the delivery world.
Each merchant’s delivery area is mapped using H3 hexagons:
The world is divided into hexagonal tiles at a chosen resolution.
A store declares delivery availability to a specific set of hexes.
The index then builds a reverse mapping: for any hex, which stores deliver to it?
This structure makes location-based lookups efficient. Given a user’s location, the system finds their H3 hexagon and retrieves all matching stores with minimal fanout.
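A minimal sketch of that reverse mapping is shown below. The hex IDs are opaque strings standing in for real H3 cell IDs, and the store names are made up.

```python
# A minimal sketch of the reverse mapping: for each delivery hex, which stores
# deliver to it. Hex IDs here are opaque strings; in the real system they would
# be H3 cell IDs produced by Uber's H3 library.
from collections import defaultdict

# Each store declares the set of hexes it delivers to.
store_delivery_hexes = {
    "store_a": {"hex_1", "hex_2", "hex_3"},
    "store_b": {"hex_2", "hex_4"},
}

# Build the reverse index: hex -> stores that deliver there.
hex_to_stores = defaultdict(set)
for store, hexes in store_delivery_hexes.items():
    for h in hexes:
        hex_to_stores[h].add(store)

# Query time: find the user's hex, then look up candidate stores directly.
user_hex = "hex_2"
print(hex_to_stores[user_hex])  # -> {'store_a', 'store_b'}
```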
The problem wasn’t the mapping but the metadata.
Upstream services were responsible for labeling stores as “close” or “far” at ingestion time. This binary categorization was passed downstream without actual delivery time (ETA) information.
Once ingested, the ranking layer saw both close and far stores as equivalent. That broke relevance scoring in subtle but important ways.
Consider this:
Hexagon 7 might have two stores marked as “far.”
One is 5 minutes away, the other 30.
To the search system, they look the same.
That lack of granularity meant distant but high-converting stores would often outrank nearby ones. Users saw popular chains from across the city instead of the closer, faster options they expected.
Sharding determines how the system splits the global index across machines.
A good sharding strategy keeps queries fast, data well-balanced, and hotspots under control. A bad one leads to overloaded nodes, inconsistent performance, and painful debugging sessions.
Uber Eats search uses two primary sharding strategies: Latitude sharding and Hex sharding. Each has trade-offs depending on geography, query patterns, and document distribution.
Latitude sharding divides the world into horizontal bands. Each band corresponds to a range of latitudes, and each range maps to a shard. The idea is simple: group nearby regions based on their vertical position on the globe.
Shard assignment is computed offline using Spark. The process involves two steps:
Slice the map into thousands of narrow latitude stripes.
Group adjacent stripes into N roughly equal-sized shards, based on document count.
To avoid boundary misses, buffer zones are added. Any store that falls near the edge of a shard is indexed in both neighboring shards. The buffer width is based on the maximum expected search radius, converted from kilometers into degrees of latitude.
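The sketch below shows a greedy version of that two-step grouping: slice the map into stripes, then pack adjacent stripes into shards of roughly equal document count. The real job runs offline in Spark, and the stripe counts here are made up.

```python
# A simplified sketch of grouping latitude stripes into N shards with roughly
# equal document counts. The greedy, in-memory grouping below is an illustrative
# stand-in for the offline Spark job.
def group_stripes_into_shards(stripe_doc_counts: list[int], num_shards: int) -> list[list[int]]:
    total_docs = sum(stripe_doc_counts)
    target_per_shard = total_docs / num_shards

    shards, current, current_count = [], [], 0
    for stripe_index, count in enumerate(stripe_doc_counts):
        current.append(stripe_index)
        current_count += count
        # Close a shard once it reaches the target, keeping stripes contiguous.
        if current_count >= target_per_shard and len(shards) < num_shards - 1:
            shards.append(current)
            current, current_count = [], 0
    shards.append(current)  # remaining stripes go to the last shard
    return shards

# 8 latitude stripes with skewed document counts, grouped into 3 shards
print(group_stripes_into_shards([100, 40, 500, 80, 60, 300, 20, 10], 3))
```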
The benefits of this approach are as follows:
Time zone diversification: Shards include cities from different time zones (for example, the US and Europe). This naturally spreads out traffic peaks, since users in different zones don’t search at the same time.
Query locality: Many queries resolve within a single shard. That keeps fanout low and speeds up ranking.
The downsides are as follows:
Shard imbalance: Dense urban areas near the equator (for example, Southeast Asia) pack far more stores per degree of latitude than sparsely populated regions. Some shards grow much larger than others.
Slower index builds: Indexing time is gated by the largest shard. Skewed shard sizes lead to uneven performance and increased latency.
To address the limitations of latitude sharding, Uber Eats also uses Hex sharding, built directly on top of the H3 spatial index. Here’s how it works:
The world is tiled using H3 hexagons at a fixed resolution (typically level 2 or 3).
Each hex contains a portion of the indexed documents.
A bin-packing algorithm groups hexes into N shards with roughly equal document counts.
Buffer zones are handled similarly, but instead of latitude bands, buffer regions are defined as neighboring hexagons at a lower resolution. Any store near a hex boundary is indexed into multiple shards to avoid cutting off valid results.
The benefits are as follows:
Balanced shards: Bin-packing by document count leads to far more consistent shard sizes, regardless of geography.
Better cache locality: Queries scoped to hexes tend to access tightly grouped data. That improves memory access patterns and reduces retrieval cost.
Less indexing skew: Because hexes are spatially uniform, indexing overhead stays predictable across regions.
As a takeaway, latitude sharding works well when shard traffic needs to be spread across time zones, but it breaks down in high-density regions.
Hex sharding offers more control, better balance, and aligns naturally with the geospatial nature of delivery. Uber Eats uses both, but hex sharding has become the more scalable default, especially as selection grows and delivery radii expand.
When search systems slow down, it’s tempting to look at algorithms, infrastructure, or sharding. But often, the bottleneck hides in a quieter place: how the documents are laid out in the index itself.
At Uber Eats scale, index layout plays a critical role in both latency and system efficiency. The team optimized layouts differently for restaurant (Eats) and grocery verticals based on query patterns, item density, and retrieval behavior.
Restaurant queries typically involve users looking for either a known brand or food type within a city. For example, “McDonald’s,” “pizza,” or “Thai near me.” The document layout reflects that intent.
Documents are sorted as:
City
Restaurant
Items within each restaurant
This works for the following reasons:
Faster city filtering: Queries scoped to San Francisco don’t need to scan through documents for Tokyo or Boston. The search pointer skips irrelevant sections entirely.
Improved compression: Lucene uses delta encoding. Grouping items from the same store, where metadata like delivery fee or promo is often repeated, yields tighter compression.
Early termination: Documents are sorted by static rank (for example, popularity or rating). Once the system retrieves enough high-scoring results, it stops scanning further.
Grocery stores behave differently. A single store may list hundreds or thousands of items, and queries often target a specific product (“chicken,” “milk,” “pasta”) rather than a store.
Here, the layout is:
City
Store (sorted by offline conversion rate)
Items grouped tightly under each store
This matters for the following reasons:
Per-store budget enforcement: To avoid flooding results from one SKU-heavy store, the system imposes a per-store document budget. Once a store’s quota is met, the index skips to the next.
Diverse results: Instead of returning 100 versions of “chicken” from the same retailer, the layout ensures results are spread across stores.
Faster skip iteration: The tight grouping allows the system to jump across store boundaries efficiently, without scanning unnecessary items.
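The snippet below sketches the per-store budget idea during iteration over a sorted index segment. Store names, budgets, and documents are illustrative stand-ins; the real implementation operates on Lucene index segments.

```python
# A minimal sketch of per-store document budget enforcement: once a store's
# quota is met, skip ahead to the next store. The documents below stand in for
# index entries already sorted by city -> store -> item.
def collect_with_store_budget(sorted_docs, per_store_budget=2, limit=6):
    results, taken_for_store = [], {}
    for doc in sorted_docs:
        store = doc["store"]
        if taken_for_store.get(store, 0) >= per_store_budget:
            continue  # store quota exhausted; a real index would skip, not scan
        results.append(doc)
        taken_for_store[store] = taken_for_store.get(store, 0) + 1
        if len(results) >= limit:
            break
    return results

docs = [{"store": "megamart", "item": f"chicken #{i}"} for i in range(5)] + \
       [{"store": "corner_grocer", "item": "chicken breast"}]
for d in collect_with_store_budget(docs):
    print(d["store"], "-", d["item"])  # results spread across stores, not one retailer
```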
The improvements from these indexing strategies were substantial:
Retrieval latency dropped by 60%
P95 latency improved by 50%
Index size reduced by 20%, thanks to better compression
Delivery time matters. When users search on Uber Eats, they expect nearby options to show up first, not restaurants 30 minutes away that happen to rank higher for other reasons. But for a long time, the ranking layer couldn’t make that distinction. It knew which stores delivered to a given area but not how long delivery would take.
This is because the ingestion pipeline didn’t include ETA (Estimated Time of Delivery) information between stores and hexagons. That meant:
The system treated all deliverable stores as equal, whether they were 5 minutes or 40 minutes away.
Ranking logic had no signal to penalize faraway stores when closer alternatives existed.
Popular but distant stores would often dominate results, even if faster options were available.
This undermined both user expectations and conversion rates. A store that looks promising but takes too long to deliver creates a broken experience.
To fix this, Uber Eats introduced ETA-aware range indexing. Instead of treating delivery zones as flat lists, the system:
Binned stores into time-based ranges: Each store was indexed into one or more delivery buckets based on how long it takes to reach a given hex. For example:
Range 1: 0 to 10 minutes
Range 2: 10 to 20 minutes
Range 3: 20 to 30 minutes
Duplicated entries across ranges: A store that’s 12 minutes from one hex and 28 minutes from another would appear in both Range 2 and Range 3. This added some storage overhead, but improved retrieval precision.
Ran range-specific queries in parallel: When a user queried from a given location, the system launched multiple subqueries—one for each ETA bucket. Each subquery targeted its corresponding shard slice.
This approach works for the following reasons:
Recall improves: The system surfaces more candidates overall, across a wider range of delivery times without overloading a single query path.
Latency drops: By splitting the query into parallel, bounded range scans, each shard does less work, and total response time shrinks.
Relevance becomes proximity-aware: Rankers now see not just what a store offers, but how fast it can deliver, enabling better tradeoffs between popularity and speed.
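The following sketch ties the pieces together: bin stores into ETA buckets per hex at index time, then fan out one bounded subquery per bucket at query time. Bucket boundaries, store names, and the in-memory index are illustrative stand-ins for the real sharded Lucene indexes.

```python
# A simplified sketch of ETA-aware range indexing and parallel range queries.
# All data and bucket boundaries here are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

ETA_BUCKETS = [(0, 10), (10, 20), (20, 30)]  # minutes

def bucket_for(eta_minutes):
    for low, high in ETA_BUCKETS:
        if low <= eta_minutes < high:
            return (low, high)
    return None  # outside the serviceable window

# Indexing: per hex, place each store into the bucket for its ETA to that hex.
store_etas = {"hex_2": {"corner_grocer": 6, "megamart": 12, "big_chain": 28}}
index = {}
for hex_id, etas in store_etas.items():
    for store, eta in etas.items():
        bucket = bucket_for(eta)
        if bucket is not None:
            index.setdefault(hex_id, {}).setdefault(bucket, []).append(store)

# Query time: fan out one bounded subquery per ETA bucket in parallel, then merge.
def query_bucket(user_hex, bucket):
    return [(store, bucket) for store in index.get(user_hex, {}).get(bucket, [])]

with ThreadPoolExecutor(max_workers=len(ETA_BUCKETS)) as pool:
    futures = [pool.submit(query_bucket, "hex_2", bucket) for bucket in ETA_BUCKETS]
    results = [hit for f in futures for hit in f.result()]

print(results)  # rankers now see each store together with its ETA bucket
```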
Scaling Uber Eats search to support nX merchant growth wasn’t a single optimization. It was a system-wide redesign.
Latency issues, ranking mismatches, and capacity bottlenecks surfaced not because one layer failed, but because assumptions across indexing, sharding, and retrieval stopped holding under pressure.
This effort highlighted a few enduring lessons that apply to any high-scale search or recommendation system:
Documents must be organized around how queries behave. Misaligned layouts waste I/O, increase iteration cost, and cripple early termination logic.
A good sharding strategy accounts for document distribution, query density, and even time zone behavior to spread traffic and avoid synchronized load spikes.
When profiling shows document iteration taking milliseconds instead of microseconds, the problem isn’t ranking but traversal. Optimizing storage access patterns often yields bigger wins than tuning ranking models.
Storing the same store in multiple ETA buckets increases index size, but dramatically reduces compute at query time. Every gain in recall, speed, or freshness has to be weighed against storage, complexity, and ingestion cost.
Breaking queries into ETA-based subranges, separating fuzzy matches from exact ones, or running proximity buckets in parallel all help maintain latency while expanding recall.
In a system touched by dozens of teams and services, observability is a prerequisite. Latency regressions, ingestion mismatches, and ranking anomalies can't be fixed without precise telemetry and traceability.
References:
2025-05-24 23:30:30
Enhance visibility into your cloud architecture with expert insights from AWS + Datadog. In this ebook, AWS Solutions Architects Jason Mimick and James Wenzel guide you through best practices for creating professional and impactful diagrams.
This week’s system design refresher:
JWT Simply Explained
The 5 Pillars of API Design
How Computer Memory Works?
Top Kubernetes Scaling Strategies You Must Know
Hiring Now: Top AI Startups and AI Roles
SPONSOR US
JWT (JSON Web Token) is an open standard for securely transmitting information between two parties. JWTs are widely used for authentication and authorization.
A JWT consists of three main components:
Header
Every JWT carries a header specifying the algorithms for signing the JWT. It’s written in JSON format.
Payload
The payload consists of the claims and the user data. There are different types of claims such as registered, public, and private claims.
Signature
The signature is what makes the JWT secure. It is created by signing the encoded header and encoded payload with a secret key, using the algorithm specified in the header.
JWTs can be signed in two different ways:
Symmetric Signatures
It uses a single secret key for both signing the token and verifying it. The same key must be shared between the server that signs the JWT and the system that verifies it.
Asymmetric Signatures
In this case, a private key is used to sign the token, and a public key to verify it. The private key is kept secure on the server, while the public key can be distributed to anyone who needs to verify the token.
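Here is a small example of symmetric (HS256) signing and verification using the third-party PyJWT library; the claims and secret are placeholders.

```python
# A small sketch of symmetric (HS256) signing and verification with the
# third-party PyJWT library (pip install PyJWT). Claims and key are illustrative.
import jwt  # PyJWT

secret_key = "change-me"  # shared between the signer and the verifier

# Sign: header ({"alg": "HS256", "typ": "JWT"}) + payload + signature
token = jwt.encode(
    {"sub": "user-123", "role": "admin"},  # payload (claims)
    secret_key,
    algorithm="HS256",
)

# Verify: recompute the signature with the same secret and read back the claims
claims = jwt.decode(token, secret_key, algorithms=["HS256"])
print(claims["sub"])  # -> user-123
```

Switching to an asymmetric scheme such as RS256 only changes the keys involved: sign with the private key and pass the public key to jwt.decode.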
Over to you: Do you use JWTs for authentication?
APIs are the backbone of modern systems, but it is equally important to design them well.
Here are a few things a developer should consider while designing APIs:
The Interface
API Design is concerned with defining the inputs and outputs of an API. For example, defining how the CRUD operations may be exposed to the user or the client.
API Paradigms
APIs can be built following different paradigms, each with its own set of protocols and standards. Some options are REST, GraphQL, and gRPC.
Relationships in API
APIs often need to establish relationships between the various entities. For example, a user might have multiple orders related to their account. The API endpoints should reflect these relationships for a better client experience.
Versioning
When modifying API endpoints, proper versioning and backward compatibility are important.
Rate Limiting
Rate limiting is used to control the number of requests a user can make to an API within a certain timeframe. This is crucial for maintaining the reliability and availability of the API.
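To make the idea concrete, here is a minimal token-bucket rate limiter in Python. The capacity and refill rate are arbitrary, and a production API would typically enforce this at the gateway with a shared store rather than in process.

```python
# A minimal token-bucket rate limiter sketch to illustrate the idea.
# Capacity and refill rate are illustrative, not a production implementation.
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # max requests allowed in a burst
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(capacity=10, refill_rate=5)  # ~5 requests/second, bursts of 10
print(limiter.allow_request())  # True until the bucket empties
```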
Over to you: Which other API Design principle will you add to the list?
Here’s a simple breakdown that shows how data moves through a system from input to processing to storage.
Data enters through input sources like keyboard, mouse, camera, or remote systems.
Permanent storage holds your system files, apps, and media. This includes hard drives, USB drives, ROM/BIOS, and network-based storage.
RAM is the workspace of your computer. It includes physical memory and virtual memory, which temporarily store data and programs while you’re using them.
Cache memory sits closer to the CPU and is split into Level 1 and Level 2. It helps speed up access to frequently used data.
CPU registers are the fastest and smallest memory units. They’re used directly by the processor to execute instructions almost instantly.
The higher you go in the memory pyramid, the faster and smaller the storage.
Over to you: What else will you add to improve the understanding of a computer memory’s working?
Horizontal Pod Autoscaling or HPA
Horizontal Pod Autoscaler automatically scales the number of Pods in a Deployment, ReplicaSet, or StatefulSet based on observed CPU utilization, memory usage, or custom metrics.
Vertical Pod Autoscaling or VPA
Based on application requirements, VPA adjusts the resources allocated to individual pods, such as CPU and memory. This approach dynamically changes pod resource settings based on workload metrics.
Cluster Auto Scaling
The Cluster Autoscaler automatically adjusts the number of nodes in a Kubernetes cluster. It interacts with the cloud provider to add or remove nodes based on requirements. This is important to maintain a balanced cluster.
Predictive Auto Scaling
Predictive Autoscaling uses machine learning to forecast future resource requirements. It helps Kubernetes adjust resources by anticipating workload demands.
Over to you: Which other Kubernetes Scaling Strategy will you add to the list?
Top ML Roles Opened in the Last 12 Hours
Senior Machine Learning Engineer (Modeling) - Underwriting & Credit at Cash App
Principal Machine Learning Engineer – AI Core Infrastructure at General Motors
Senior Machine Learning Engineer at MD Anderson Cancer Center
Sr Principal Machine Learning Engineer (GenAI/LLM) at Palo Alto Networks
Senior Machine Learning Engineer at Adobe
AI/Machine Learning Analyst Lead at Peraton
High Impact Roles at High Growth AI Startups this week
Software Engineer - Data Infrastructure at Luma AI
Staff Software Engineer, Compute Services at CoreWeave
Senior Frontend Engineer at Typeface
Staff/Principal Software Engineer, Web Applications at Fireworks AI
Senior Software Engineer, iOS at Otter.ai
Software Engineer - Security at ClickHouse
Senior Full Stack Engineer at Invisible AI
High Salary General SWE Roles this week
Principal Platform Software Engineer - OpenBMC Platform Architect at NVIDIA
Director of Engineering and Site Lead at Harvey
Managing Director, Software Engineering at West Monroe
Tech Lead Backend Software Engineer at TikTok
Senior Manager, AI ML Applied Field Engineering at Snowflake
Sr. Sales Engineer at Tenstorrent
Software Engineering PMTS at Salesforce
Software Engineer, Model Context Protocol at Anthropic
2025-05-22 23:31:07
Modern software systems rarely live in isolation. Most applications today are stitched together from dozens, sometimes hundreds, of independently deployed services, each handling a piece of the puzzle. This helps create smaller units of responsibility and loose coupling. However, the flexibility comes with a new kind of complexity, especially around how these services communicate.
In a monolith, in-process function calls stitch components together. In a service-based world, everything talks over the network. Suddenly, concerns that were once handled inside the application, like retries, authentication, rate limiting, encryption, and observability, become distributed concerns. And distributed concerns are harder to get right.
To manage this complexity, engineering teams typically reach for one of two patterns: the API gateway or the service mesh.
Both aim to make communication between services more manageable, secure, and observable. But they do it in very different ways, and for different reasons. The confusion often starts when these tools are treated as interchangeable or when their roles are reduced to simple traffic direction: "API gateways are for north-south traffic, service meshes are for east-west." That shortcut oversimplifies both and sets teams up for misuse or unnecessary overhead.
In this article, we look at both API Gateways and Service Mesh in detail, along with their key differences and usage goals.