2025-11-25 00:30:48
AI is only as powerful as the data behind it — but most teams aren’t ready.
We surveyed 200 senior IT and data leaders to uncover how enterprises are really using streaming to power AI, and where the biggest gaps still exist.
Discover the biggest challenges in real-time data infrastructure, the top obstacles slowing down AI adoption, and what high-performing teams are doing differently in 2025.
Download the full report to see where your organisation stands.
Disclaimer: The details in this post have been derived from the details shared online by the Zalando Engineering Team. All credit for the technical details goes to the Zalando Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Zalando is one of Europe’s largest fashion and lifestyle platforms, connecting thousands of brands, retailers, and physical stores under one digital ecosystem.
As the company’s scale grew, so did the volume of commercial data it generated. This included information about product performance, sales patterns, pricing insights, and much more. This data was not just important for Zalando itself but also for its vast network of retail partners who relied on it to make critical business decisions.
However, sharing this data efficiently with external partners became increasingly complex.
Zalando’s Partner Tech division, responsible for data sharing and collaboration with partners, found itself managing a fragmented and inefficient process. Partners needed clear visibility into how their products were performing on the platform, but accessing that information was far from seamless. Data was scattered across multiple systems and shared through a patchwork of methods. Some partners received CSV files over SFTP, others pulled data via APIs, and many depended on self-service dashboards to manually export reports. Each method served a purpose, but together they created a tangled system where consistency and reliability were hard to maintain.
Many partners had to dedicate the equivalent of 1.5 full-time employees each month just to extract, clean, and consolidate the data they received. Instead of focusing on strategic analysis or market planning, skilled analysts spent valuable time performing repetitive manual work.
There was also a serious accessibility issue. The existing interfaces were not designed for heavy or large-scale data downloads. Historical data was often unavailable when partners needed it most, such as during key planning or forecasting cycles. As a result, even well-resourced partners struggled to build an accurate picture of their own performance.
This problem highlighted a critical gap in Zalando’s data strategy. Partners did not just want raw data or operational feeds. They wanted analytics-ready datasets that could be accessed programmatically and integrated directly into their internal analytics tools. In simple terms, they needed clean, governed, and easily retrievable data that fit naturally into their business workflows.
To address this challenge, the Zalando Engineering Team began a multi-year journey to rebuild its partner data sharing framework from the ground up. The result of this effort was Zalando’s adoption of Delta Sharing, an open protocol for secure data sharing across organizations. In this article, we will look at how Zalando built such a system and the challenges they faced.
To solve the problem of fragmented data sharing, the Zalando Engineering Team first needed to understand who their partners were and how they worked with data.
Zalando operates through three major business models:
Wholesale: Zalando purchases products from brands and resells them directly on its platform.
Partner Program: Brands list and sell products directly to consumers through Zalando’s marketplace.
Connected Retail: Physical retail stores connect their local inventory to an online platform, allowing customers to buy nearby and pick up in person.
Each of these models generates unique datasets, and the scale of those datasets varies dramatically. A small retailer may only deal with a few hundred products and generate a few megabytes of data each week. In contrast, a global brand might handle tens of thousands of products and need access to hundreds of terabytes of historical sales data for planning and forecasting.
In total, Zalando manages more than 200 datasets that support a business generating over €5 billion in gross merchandise value (GMV). These datasets are critical to helping partners analyze trends, adjust pricing strategies, manage inventory, and plan promotions. However, not all partners have the same level of technical sophistication or infrastructure to consume this data effectively.
Zalando’s partners generally fall into three categories based on their data maturity:
Large enterprise partners often have their own analytics teams, data engineers, and infrastructure. They expect secure, automated access to data that integrates directly into their internal systems. Medium-sized partners prefer flexible solutions that combine manual and automated options, such as regularly updated reports and dashboards. Smaller partners value simplicity above all else, often relying on spreadsheet-based workflows and direct downloads.
Zalando’s existing mix of data-sharing methods (such as APIs, S3 buckets, email transfers, and SFTP connections) worked in isolation but could not scale to meet all these varied needs consistently.
After understanding the different needs of its partner ecosystem, the Zalando Engineering Team began to look for a better, long-term solution. The goal was not only to make data sharing faster but also to make it more reliable, scalable, and secure for every partner, from small retailers to global brands.
The team realized that fixing the problem required more than improving existing systems. They needed to design an entirely new framework that could handle massive datasets, provide real-time access, and adapt to each partner’s technical capability without creating new complexity. To do that, Zalando created a clear list of evaluation criteria that would guide their decision.

First, the solution had to be cloud-agnostic. Zalando’s partners used a variety of technology stacks and cloud providers. Some worked with AWS, others used Google Cloud, Azure, or even on-premise systems. The new system needed to work seamlessly across all these environments without forcing partners to change their existing infrastructure.
Second, the platform had to be open and extensible. This meant avoiding dependence on a single vendor or proprietary technology. Zalando wanted an open-standard approach that could evolve and integrate with different tools, systems, and workflows.
Third, the solution needed strong performance and scalability. With over 200 datasets and some reaching hundreds of terabytes in size, performance could not be an afterthought. The system had to handle large-scale data transfers and queries efficiently while maintaining low latency and high reliability.
Security was another non-negotiable factor. The platform had to support granular security and auditing features. This included data encryption, access control at the table or dataset level, and comprehensive logging for compliance and traceability. Since partners would be accessing sensitive commercial data, robust governance mechanisms were essential to maintain trust.
The next requirement was flexibility in data access patterns. Partners used data in different ways, so the system had to support:
Real-time streaming for partners who needed up-to-the-minute insights
Batch and incremental updates for partners who preferred scheduled or partial data loads
Historical data access for partners who needed to analyze long-term trends
Finally, the solution had to be easy to integrate with the tools that partners were already using. Whether it was business intelligence dashboards, data warehouses, or analytics pipelines, the new system should fit naturally into existing workflows rather than force partners to rebuild them from scratch.
The search for such a system eventually led them to Delta Sharing, an open protocol specifically designed for secure data sharing across organizations. This discovery would go on to transform the way Zalando and its partners collaborate on data.
After months of evaluation and research, the Zalando Engineering Team found a technology that met nearly all of their requirements: Delta Sharing.
Delta Sharing is an open protocol designed specifically for secure, zero-copy data sharing across organizations. This means that partners can access live data directly from its original location without creating separate copies or transferring large files across systems.
The team immediately recognized how well this approach fit their goals. It offered the openness, scalability, and security they needed while being simple enough to integrate into partners’ existing tools and workflows. Key features of Delta Sharing are as follows:
Zero-copy access: Partners can query live datasets directly without needing to download or duplicate them. This eliminates data redundancy and ensures that everyone works with the most up-to-date information.
Open standard: Because Delta Sharing is based on open principles, it works seamlessly with a wide range of tools and platforms. Partners can connect through Pandas, Apache Spark, Tableau, or even Microsoft Excel, depending on their needs.
Granular access control: Data is shared securely using token-based authentication and credential files, which means each partner receives access tailored to their role and data permissions.
Scalable performance: The protocol efficiently handles very large datasets, even those that exceed terabytes in size, while maintaining high reliability and low latency.
Security by design: Features such as encryption, auditing, and logging are built into the system. This ensures that all data access is traceable and compliant with internal governance policies.
While Delta Sharing is available as an open-source protocol, Zalando decided to implement the Databricks Managed Delta Sharing service instead of hosting its own version. This choice was made for several practical reasons:
It integrates tightly with Unity Catalog, Databricks’ governance and metadata layer. This allowed Zalando to maintain a single source of truth for datasets and permissions.
It provides enterprise-grade security, compliance, and auditability, which are essential when dealing with sensitive commercial data from multiple organizations.
It removes the operational overhead of managing and maintaining sharing servers, tokens, and access logs internally.
By using the managed service, the Zalando Engineering Team could focus on delivering value to partners rather than spending time maintaining infrastructure.
Once the Zalando Engineering Team validated Delta Sharing as the right solution, the next challenge was designing a clean and efficient architecture that could be scaled across thousands of partners. Their approach was to keep the system simple, modular, and easy to manage while ensuring that security and governance remained central to every layer.
At its core, the new data-sharing framework relied on three main building blocks that defined how data would be organized, accessed, and distributed:
Delta Share: A logical container that groups related Delta Tables for distribution to external recipients.
Recipient: A digital identity representing each partner within the Delta Sharing system.
Activation Link: A secure URL that allows partners to download their authentication credentials and connect to shared datasets.
This architecture followed a clear, three-step data flow designed to keep operations transparent and efficient:
Data Preparation and Centralization: All partner datasets were first curated and stored in scalable storage systems as Delta Tables. These tables were then registered in Unity Catalog, which acted as the metadata and governance layer. Unity Catalog provided a single source of truth for data definitions, schema consistency, and lineage tracking, ensuring that every dataset was traceable and well-documented.
Access Configuration: Once datasets were ready, the engineering team created a Recipient entry for each partner and assigned appropriate permissions. Each recipient received an activation link, which allowed them to securely access their data credentials. This setup ensured that partners only saw the data they were authorized to access while maintaining strict access boundaries between different organizations.
Direct Partner Access: When a partner activated their link, they retrieved a credential file and authenticated through a secure HTTPS connection. They could then directly query live data without duplication or manual transfer. Since the data remained centralized in Zalando’s data lakehouse, there were no synchronization issues or redundant copies to maintain.
This architecture brought several benefits: partners now had real-time access to data, partner-specific credentials ensured granular security, and eliminating redundant copies simplified maintenance.
To implement this system in Databricks, Zalando followed a clear operational workflow:
Prepare the Delta Tables and register them in Unity Catalog.
Create a Share to group related datasets.
Add the relevant tables to that share.
Create a Recipient representing each partner.
Grant the appropriate permissions to the recipient.
Every step was guided by Databricks’ Delta Sharing API documentation, allowing the team to automate processes where possible and maintain strong governance controls.
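As a rough sketch (not Zalando’s actual code), this workflow can be expressed in a Databricks notebook with Spark SQL, following the publicly documented Delta Sharing commands. The share, table, and recipient names below are hypothetical placeholders, and the snippet relies on the notebook-provided SparkSession.

```python
# Hedged sketch of the share/recipient workflow in a Databricks notebook.
# All names (partner_sales_share, sales.partner_reports.weekly_sales,
# acme_retail) are hypothetical placeholders.

# 1. Create a share to group related datasets.
spark.sql("CREATE SHARE IF NOT EXISTS partner_sales_share")

# 2. Add a Unity Catalog table to the share.
spark.sql("ALTER SHARE partner_sales_share ADD TABLE sales.partner_reports.weekly_sales")

# 3. Create a recipient representing the partner.
spark.sql("CREATE RECIPIENT IF NOT EXISTS acme_retail")

# 4. Grant the recipient read access to the share.
spark.sql("GRANT SELECT ON SHARE partner_sales_share TO RECIPIENT acme_retail")

# The recipient's activation link can then be inspected with:
# spark.sql("DESCRIBE RECIPIENT acme_retail")
```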
Once the new data-sharing architecture was in place, the Zalando Engineering Team understood that technology alone would not guarantee success. For the system to work, partners needed to be able to use it confidently and easily. Usability became just as important as performance or scalability.
To make the onboarding process smooth, Zalando created a range of partner-facing resources. These included step-by-step user guides that explained how to connect to Delta Sharing using tools familiar to most data teams, such as Pandas, Apache Spark, and common business intelligence (BI) platforms. Each guide walked partners through the entire process—from receiving their activation link to successfully accessing and querying their first dataset.
The team also built detailed troubleshooting documentation. This helped partners solve common issues such as expired credentials, connection errors, or authentication problems without needing to contact support. By empowering partners to self-diagnose and fix minor issues, Zalando reduced delays and improved overall efficiency.
In addition, they developed prebuilt connector snippets—small code templates that partners could plug directly into their existing data pipelines. These snippets made it possible to integrate Zalando’s data into existing workflows within minutes, regardless of whether a partner used Python scripts, Spark jobs, or visualization tools.
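To give a flavour of what such a snippet can look like, here is a minimal example built on the open-source delta-sharing Python client; the profile path and the share, schema, and table names are hypothetical, not Zalando’s actual identifiers.

```python
# Minimal partner-side connector sketch using the open-source
# `delta-sharing` Python package (pip install delta-sharing).
# The profile path and share/schema/table names are hypothetical.
import delta_sharing

# Credential (profile) file downloaded through the activation link.
profile_path = "config.share"

# A shared table is addressed as <profile>#<share>.<schema>.<table>.
table_url = f"{profile_path}#partner_sales_share.partner_reports.weekly_sales"

# Load the live table directly into a pandas DataFrame -- no export, no copy.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())

# Larger tables can be loaded as a Spark DataFrame instead:
# df = delta_sharing.load_as_spark(table_url)
```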
Together, these efforts dramatically reduced onboarding friction. Instead of days of setup and testing, partners could access and analyze data in a matter of minutes. This ease of use quickly became one of the platform’s strongest selling points.
The success of the Partner Tech pilot did not go unnoticed within Zalando. Other teams soon realized that they faced similar challenges when sharing data with internal or external stakeholders. Rather than allowing every department to build its own version of the solution, Zalando decided to expand the Delta Sharing setup into a company-wide platform for secure and scalable data distribution.
This new platform came with several key capabilities:
Unified recipient management: Centralized control of who receives what data, ensuring consistent governance.
Built-in best practices: Guidelines for preparing datasets before sharing, helping teams maintain high data quality.
Standardized security and governance policies: Every department followed the same data-sharing rules, simplifying compliance.
Cross-team documentation and automation: Shared tools and documentation made it easier for new teams to adopt the platform without starting from scratch.
Looking ahead, Zalando plans to introduce OIDC Federation, a feature that allows partners to authenticate using their own identity systems. This will remove the need for token-based authentication and make access even more secure and seamless.
Zalando’s journey to modernize partner data sharing was both a technical and organizational transformation. By focusing on real partner challenges, the Zalando Engineering Team built a system that balanced openness, governance, and usability—creating long-term value for both the company and its ecosystem.
The key lessons were as follows:
Start with partner needs, not technology. Deep research into partner workflows helped Zalando design a solution that solved real pain points rather than adding complexity.
Design for diversity. A single rigid model could not serve everyone, so the platform was built to support different partner sizes, tools, and technical skills.
Cross-team collaboration is essential. Close cooperation between the Data Foundation, AppSec, and IAM teams ensured consistency, security, and compliance from day one.
Manual processes are acceptable for pilots but not for scale. Early manual steps were valuable for testing ideas, but later became automation goals as the platform grew.
Internal adoption validates external value. When other Zalando teams began using Delta Sharing, it confirmed the platform’s effectiveness beyond its original use case.
Security must be embedded from the start. Integrating encryption, access control, and auditing early prevented rework and established long-term trust.
Documentation is a product feature. Clear guides, troubleshooting steps, and code examples made onboarding fast and self-service for partners.
Managed is better than self-managed. Relying on Databricks’ managed Delta Sharing service gave Zalando operational stability and freed engineers to focus on partner success.
Delta Sharing has fundamentally changed how Zalando exchanges data with its partners. The company moved from fragmented exports to a unified, real-time, and governed data-sharing model. This shift has produced the following impact:
Reduced manual data handling and partner friction.
Enabled faster, data-driven decision-making through consistent access.
Created a scalable foundation for cross-partner analytics and collaboration.
Established a reusable enterprise framework for secure data exchange.
References:
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-11-23 00:30:44
Failed checkouts, dropped jobs, timeouts that don’t even throw errors ➡️ these are the issues that slow your team down.
With Sentry Logs, you can trace every step, from the user’s request through your code to the log line. And because Logs connect automatically with your errors, traces, and even replays, all your debugging context lives in one place.
TLDR: Sentry has Logs. More context, fewer tabs, faster fixes, more time shipping.
This week’s system design refresher:
Cloudflare vs. AWS vs. Azure
Popular Backend Tech Stack
HTTP vs. HTTPS
Forward Proxy versus Reverse Proxy
Concurrency is NOT Parallelism
SPONSOR US
Cloudflare is much more than just a CDN and DDoS protection service. Let’s do a quick comparison of Cloudflare, AWS, and Azure.
Cloudflare has rapidly expanded beyond its traditional CDN roots, launching a suite of modern developer-first services like Workers, R2, D1, and so on. These offerings position it as a serious edge-native alternative to other cloud providers.
Here are the key cloud capabilities that Cloudflare supports:
Edge Compute and Serverless
Object and Blob Storage
Relational Databases
Containers
Sandboxes
Workflows
AI Agents SDK
Vector and AI search
Data Connectivity
AI Infrastructure
Content Delivery Network
DNS
Load Balancing
Over to you: Have you used Cloudflare’s new offerings? What are your thoughts on them?
When you open a website, the difference between HTTP and HTTPS decides whether your data travels safely or in plain sight. Here’s what actually happens under the hood:
HTTP:
Sends data in plain text, so anyone on the network can intercept it.
The client and server perform a simple TCP handshake: SYN, SYN-ACK, ACK.
Fast but completely insecure. Passwords, tokens, and forms can all be read in transit.
HTTPS (SSL/TLS):
Step 1: TCP Handshake: Standard connection setup.
Step 2: Certificate Check: Client says hello. Server responds with hello and its SSL/TLS certificate. That certificate contains the server’s public key and is signed by a trusted Certificate Authority.
Your browser verifies this certificate is legitimate, not expired, and actually belongs to the domain you’re trying to reach. This proves you’re talking to the real server, not some attacker pretending to be it.
Step 3: Key Exchange: Here’s where asymmetric encryption happens. The server has a public key and a private key. Client generates a session key, encrypts it with the server’s public key, and sends it over. Only the server can decrypt this with its private key.
Both sides now have the same session key that nobody else could have intercepted. This becomes the symmetric encryption key for the rest of the session.
Step 4: Data Transmission: Now every request and response gets encrypted with that session key using symmetric encryption.
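As an illustration of what this looks like from the client side, the short Python sketch below opens a TLS connection with the standard library and inspects the negotiated session and server certificate; example.com is just a placeholder host.

```python
# Sketch: open a TLS connection and inspect the server certificate
# using only the Python standard library. The hostname is a placeholder.
import socket
import ssl

hostname = "example.com"
context = ssl.create_default_context()  # verifies the certificate chain and hostname

with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        print("TLS version:", tls.version())          # e.g. TLSv1.3
        print("Cipher suite:", tls.cipher())          # negotiated symmetric cipher
        cert = tls.getpeercert()
        print("Issuer:", cert.get("issuer"))          # the signing Certificate Authority
        print("Valid until:", cert.get("notAfter"))   # expiry is checked automatically
```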
Over to you: What’s your go-to tool for debugging TLS issues, openssl, curl -v, or something else?
A forward proxy sits between clients (users) and the internet. It acts on behalf of users, hiding their identity or filtering traffic before it reaches the external web.
Some applications of a forward proxy are:
Protects users while browsing the internet.
Helps organizations restrict access to certain websites.
Speeds up web browsing by caching frequently accessed content.
A reverse proxy sits between the internet (clients) and backend servers. It acts on behalf of servers, handling incoming traffic.
Some applications of a reverse proxy are:
Distributes traffic across multiple servers to ensure no single server is overwhelmed.
Handles SSL encryption/decryption so backend servers don’t have to.
Helps protect backend servers from DDoS attacks.
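To make the idea concrete, here is a minimal reverse proxy sketch in Python using only the standard library; it forwards every GET request to a single hypothetical backend and leaves out everything a real proxy needs (connection pooling, retries, header handling).

```python
# Minimal reverse proxy sketch using only the Python standard library.
# It forwards incoming GET requests to one hypothetical backend and
# returns the backend's response to the client. Not production-ready.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

BACKEND = "http://localhost:9000"  # hypothetical backend server

class ReverseProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the request path to the backend and relay the response.
        with urlopen(BACKEND + self.path) as upstream:
            body = upstream.read()
            self.send_response(upstream.status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    # Clients talk to port 8080; they never see the backend directly.
    HTTPServer(("0.0.0.0", 8080), ReverseProxy).serve_forever()
```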
Over to you: What else would you add to explain forward proxies and reverse proxies?
In system design, it is important to understand the difference between concurrency and parallelism.
As Rob Pike (one of the creators of Go) stated: “Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once.” This distinction emphasizes that concurrency is more about the design of a program, while parallelism is about the execution.
Concurrency is about dealing with multiple things at once. It involves structuring a program to handle multiple tasks simultaneously, where the tasks can start, run, and complete in overlapping time periods, but not necessarily at the same instant.
Concurrency is about the composition of independently executing processes and describes a program’s ability to manage multiple tasks by making progress on them without necessarily completing one before it starts another.
Parallelism, on the other hand, refers to the simultaneous execution of multiple computations. It is the technique of running two or more tasks or computations at the same time, utilizing multiple processors or cores within a computer to perform several operations concurrently. Parallelism requires hardware with multiple processing units, and its primary goal is to increase the throughput and computational speed of a system.
In practical terms, concurrency enables a program to remain responsive to input, perform background tasks, and handle multiple operations in a seemingly simultaneous manner, even on a single-core processor. It’s particularly useful in I/O-bound and high-latency operations where programs need to wait for external events, such as file, network, or user interactions.
Parallelism, with its ability to perform multiple operations at the same time, is crucial in CPU-bound tasks where computational speed and throughput are the bottlenecks. Applications that require heavy mathematical computations, data analysis, image processing, and real-time processing can significantly benefit from parallel execution.
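A small Python sketch makes the distinction concrete: the first function overlaps many I/O waits on a single core with asyncio (concurrency), while the second spreads a CPU-bound computation across several processes (parallelism). The workloads are toy placeholders.

```python
# Toy illustration of concurrency vs. parallelism in Python.
import asyncio
import math
from multiprocessing import Pool

# Concurrency: one thread interleaves many I/O-bound tasks.
async def fetch(i: int) -> str:
    await asyncio.sleep(1)          # stands in for a network call
    return f"response {i}"

async def run_concurrently() -> list[str]:
    # Ten 1-second waits overlap, so this takes ~1 second, not ~10.
    return await asyncio.gather(*(fetch(i) for i in range(10)))

# Parallelism: CPU-bound work runs simultaneously on multiple cores.
def heavy(n: int) -> float:
    return sum(math.sqrt(x) for x in range(n))

if __name__ == "__main__":
    print(asyncio.run(run_concurrently()))
    with Pool(processes=4) as pool:            # four worker processes
        print(pool.map(heavy, [10_000_000] * 4))
```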
2025-11-21 00:30:40
Modern applications must stay online around the clock.
When a banking app goes down during business hours or an e-commerce site crashes on Black Friday, the consequences extend far beyond frustrated users. Revenue evaporates, customer trust erodes, and competitors gain ground.
High availability has transformed from a luxury feature into a baseline expectation.
Building systems that remain operational despite failures requires more than just buying expensive hardware or running extra servers. It demands a combination of architectural patterns, redundancy strategies, and operational discipline. In other words, high availability emerges from understanding how systems fail and designing defenses at multiple layers.
In this article, we will understand what availability means and look at some of the most popular strategies to achieve high availability.
2025-11-19 00:31:27
Today’s systems are getting more complex, more distributed, and harder to manage. If you’re scaling fast, your observability strategy needs to keep up. This eBook introduces Datadog’s Observability Maturity Framework to help you reduce incident response time, automate repetitive tasks, and build resilience at scale.
You’ll learn:
How to unify fragmented data and reduce manual triage
The importance of moving from tool sprawl to platform-level observability
What it takes to go from reactive monitoring to proactive ops
Disclaimer: The details in this post have been derived from the details shared online by the Disney+ Hotstar (now JioHotstar) Engineering Team. All credit for the technical details goes to the Disney+ Hotstar (now JioHotstar) Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
In 2023, Disney+ Hotstar faced one of the most ambitious engineering challenges in the history of online streaming. The goal was to support 50 to 60 million concurrent live streams during the Asia Cup and Cricket World Cup, events that attract some of the largest online audiences in the world. For perspective, before this, Hotstar had handled about 25 million concurrent users on two self-managed Kubernetes clusters.
To make things even more challenging, the company introduced a “Free on Mobile” initiative, which allowed millions of users to stream live matches without a subscription. This move significantly expanded the expected load on the platform, creating the need to rethink its infrastructure completely.
Hotstar’s engineers knew that simply adding more servers would not be enough. The platform’s architecture needed to evolve to handle higher traffic while maintaining reliability, speed, and efficiency. This led to the migration to a new “X architecture,” a server-driven design that emphasized flexibility, scalability, and cost-effectiveness at a global scale.
The journey that followed involved a series of deep technical overhauls. From redesigning the network and API gateways to migrating to managed Kubernetes (EKS) and introducing an innovative concept called “Data Center Abstraction,” Hotstar’s engineering teams tackled multiple layers of complexity. Each step focused on ensuring that millions of cricket fans could enjoy uninterrupted live streams, no matter how many joined at once.
In this article, we look at how the Disney+ Hotstar engineering team achieved that scale and the challenges they faced.
To scale your company, you need compliance. And by investing in compliance early, you protect sensitive data and simplify the process of meeting industry standards—ensuring long-term trust and security.
Vanta helps growing companies achieve compliance quickly and painlessly by automating 35+ frameworks—including SOC 2, ISO 27001, HIPAA, GDPR, and more.
And with Vanta continuously monitoring your security posture, your team can focus on growth, stay ahead of evolving regulations, and close deals in a fraction of the time.
Start with Vanta’s Compliance for Startups Bundle, with key resources to accelerate your journey.
Step-by-step compliance checklists
Case studies from fast-growing startups
On-demand videos with industry leaders
Disney+ Hotstar serves its users across multiple platforms such as mobile apps (Android and iOS), web browsers, and smart TVs.
No matter which device a viewer uses, every request follows a structured path through the system.
When a user opens the app or starts a video, their request first goes to an external API gateway, which is managed through Content Delivery Networks (CDNs).
The CDN layer performs initial checks for security, filters unwanted traffic, and routes the request to the internal API gateway.
This internal gateway is protected by a fleet of Application Load Balancers (ALBs), which distribute incoming traffic across multiple backend services.
These backend services handle specific features such as video playback, scorecards, chat, or user profiles, and store or retrieve data from databases, which can be either managed (cloud-based) or self-hosted systems.
Each of these layers (from CDN to database) must be fine-tuned and scaled correctly. If even one layer becomes overloaded, it can slow down or interrupt the streaming experience for millions of viewers.
At a large scale, the CDN nodes were not just caching content like images or video segments. They were also acting as API gateways, responsible for tasks such as verifying security tokens, applying rate limits, and processing each incoming request. These extra responsibilities put a significant strain on their computing resources. The system began to hit limits on how many requests it could process per second.
To make matters more complex, Hotstar was migrating to a new server-driven architecture, meaning the way requests flowed and data was fetched had changed. This made it difficult to predict exactly how much traffic the CDN layer would face during a peak event.
To get a clear picture, engineers analyzed traffic data from earlier tournaments in early 2023. They identified the top ten APIs that generated the most load during live streams. What they found was that not all API requests were equal. Some could be cached and reused, while others had to be freshly computed each time.
This insight led to one of the most important optimizations: separating cacheable APIs from non-cacheable ones.
Cacheable APIs included data that did not change every second, such as live scorecards, match summaries, or key highlights. These could safely be stored and reused for a short period.
Non-cacheable APIs handled personalized or time-sensitive data, such as user sessions or recommendations, which had to be processed freshly for each request.

By splitting these two categories, the team could optimize how requests were handled. The team created a new CDN domain dedicated to serving cacheable APIs with lighter security rules and faster routing. This reduced unnecessary checks and freed up computing capacity on the edge servers. The result was a much higher throughput at the gateway level, meaning more users could be served simultaneously without adding more infrastructure.
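As a hedged sketch of the idea (not Hotstar’s actual code), the snippet below shows how a backend can mark a scorecard-style endpoint as cacheable at the CDN edge while keeping a personalized endpoint uncacheable; the framework choice and TTL value are illustrative.

```python
# Illustrative sketch (not Hotstar's actual code): marking a scorecard-style
# endpoint as edge-cacheable while keeping a personalized endpoint uncacheable.
# Flask and the 10-second TTL are arbitrary choices for the example.
from flask import Flask, jsonify, make_response

app = Flask(__name__)

@app.route("/api/scorecard/<match_id>")
def scorecard(match_id):
    # Safe to reuse for a short window: the CDN may serve it to many viewers.
    resp = make_response(jsonify({"match": match_id, "score": "231/4"}))
    resp.headers["Cache-Control"] = "public, max-age=10"   # 10 s edge cache
    return resp

@app.route("/api/recommendations/<user_id>")
def recommendations(user_id):
    # Personalized: must be computed fresh for every request.
    resp = make_response(jsonify({"user": user_id, "items": []}))
    resp.headers["Cache-Control"] = "private, no-store"
    return resp
```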
Hotstar also looked at how frequently different features refreshed their data during live matches. For instance, the scorecard or “watch more” suggestions did not need to update every second. By slightly reducing the refresh rate for such features, they cut down the total network traffic without affecting the viewer experience.
Finally, engineers simplified complex security and routing configurations at the CDN layer. Each extra rule increases processing time, and by removing unnecessary ones, the platform was able to save additional compute resources.
Once the gateway layer was optimized, the team turned its attention to the deeper layers of the infrastructure that handled network traffic and computation.
Two key parts of the system required significant tuning: the NAT gateways that managed external network connections, and the Kubernetes worker nodes that hosted application pods.
Every cloud-based application depends on network gateways to handle outgoing traffic.
In Hotstar’s setup, NAT (Network Address Translation) Gateways were responsible for managing traffic from the internal Kubernetes clusters to the outside world. These gateways act as translators, allowing private resources in a Virtual Private Cloud (VPC) to connect to the internet securely.
During pre-event testing, engineers collected detailed data using VPC Flow Logs and found a major imbalance: one Kubernetes cluster was using 50 percent of the total NAT bandwidth even though the system was running at only 10 percent of the expected peak load. This meant that if traffic increased five times during the live matches, the gateways would have become a serious bottleneck.
On deeper investigation, the team discovered that several services inside the same cluster were generating unusually high levels of external traffic. Since all traffic from an Availability Zone (AZ) was being routed through a single NAT Gateway, that gateway was being overloaded.
To fix this, engineers changed the architecture from one NAT Gateway per AZ to one per subnet. In simpler terms, instead of a few large gateways serving everything, they deployed multiple smaller ones distributed across subnets. This allowed network load to spread more evenly.
The next challenge appeared at the Kubernetes worker node level. These nodes are the machines that actually run the containers for different services. Each node has limits on how much CPU, memory, and network bandwidth it can handle.
The team discovered that bandwidth-heavy services, particularly the internal API Gateway, were consuming 8 to 9 gigabits per second on individual nodes. In some cases, multiple gateway pods were running on the same node, creating contention for network resources. This could lead to unpredictable performance during peak streaming hours.
The solution was twofold:
First, the team switched to high-throughput nodes capable of handling at least 10 Gbps of network traffic.
Second, they used topology spread constraints, a Kubernetes feature that controls how pods are distributed across nodes. They configured it so that only one gateway pod could run on each node.
This ensured that no single node was overloaded and that network usage remained balanced across the cluster. As a result, even during the highest traffic peaks, each node operated efficiently, maintaining a steady 2 to 3 Gbps of throughput per node.
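A hedged sketch of what such a constraint can look like is shown below, expressed as a Python dictionary that mirrors the Kubernetes pod spec fields; the app: api-gateway label is a hypothetical placeholder.

```python
# Hedged sketch of a topology spread constraint that keeps gateway pods
# spread across nodes (at most a difference of one pod between any two nodes).
# Expressed as a Python dict mirroring the Kubernetes pod spec fields;
# the "app: api-gateway" label is a hypothetical placeholder.
import json

topology_spread = {
    "topologySpreadConstraints": [
        {
            "maxSkew": 1,                             # allow at most 1 pod of skew between nodes
            "topologyKey": "kubernetes.io/hostname",  # spread across individual nodes
            "whenUnsatisfiable": "DoNotSchedule",     # hard constraint, not best-effort
            "labelSelector": {"matchLabels": {"app": "api-gateway"}},
        }
    ]
}

# This block would sit under spec.template.spec of the gateway Deployment manifest.
print(json.dumps(topology_spread, indent=2))
```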
Even after the improvements made to networking and node distribution, the team’s earlier setup had a limitation.
The company was running two self-managed Kubernetes clusters, and these clusters could not scale reliably beyond about 25 million concurrent users. Managing the Kubernetes control plane (which handles how workloads are scheduled and scaled) had become increasingly complex and fragile at high loads.
To overcome this, the engineering team decided to migrate to Amazon Elastic Kubernetes Service (EKS), a managed Kubernetes offering from AWS.
This change meant that AWS would handle the most sensitive and failure-prone part of the system (which is the control plane) while the team could focus on managing the workloads, configurations, and optimizations of the data plane.
Once the migration was complete, the team conducted extensive benchmarking tests to verify the stability and scalability of the new setup. The EKS clusters performed very well during tests that simulated more than 400 worker nodes being scheduled and scaled simultaneously. The control plane remained responsive and stable during most of the tests.
However, at scales beyond 400 nodes, engineers observed API server throttling. In simpler terms, the Kubernetes API server, which coordinates all communication within the cluster, began slowing down and temporarily limiting the rate at which new nodes and pods could be created. This did not cause downtime but introduced small delays in how quickly new capacity was added during heavy scaling.
To fix this, the team optimized the scaling configuration. Instead of scaling hundreds of nodes at once, they adopted a stepwise scaling approach. The automation system was configured to add 100 to 300 nodes per step, allowing the control plane to keep up without triggering throttling.
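The sketch below is purely illustrative of that stepwise pattern, not Hotstar’s automation; request_nodes and wait_until_ready are hypothetical stand-ins for the real node-provisioning calls.

```python
# Purely illustrative sketch of stepwise scaling (not Hotstar's automation).
# Capacity is requested in bounded steps so the Kubernetes API server is
# never asked to absorb hundreds of new nodes at once.
import time

def request_nodes(count: int) -> None:
    # Hypothetical stand-in for the real call to the node group / autoscaler.
    print(f"requesting scale-out to {count} nodes")

def wait_until_ready(count: int) -> None:
    # Hypothetical stand-in for polling until the new nodes report Ready.
    print(f"waiting until {count} nodes are Ready")

def scale_in_steps(current: int, target: int, step_size: int = 200,
                   settle_seconds: int = 60) -> None:
    while current < target:
        step = min(step_size, target - current)
        request_nodes(current + step)
        wait_until_ready(current + step)
        current += step
        time.sleep(settle_seconds)   # let the control plane settle between steps

# Example: grow from 100 to 700 nodes in steps of 200.
# scale_in_steps(current=100, target=700)
```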
After migrating to Amazon EKS and stabilizing the control plane, the team had achieved reliable scalability for around 25–30 million concurrent users.
However, as the 2023 Cricket World Cup approached, it became clear that the existing setup still would not be enough to handle the projected 50 million-plus users. The infrastructure was technically stronger but still complex to operate, difficult to extend, and costly to maintain at scale.
Hotstar managed its workloads on two large, self-managed Kubernetes clusters built using KOPS. Across these clusters, there were more than 800 microservices, each responsible for different features such as video playback, personalization, chat, analytics, and more.
Every microservice had its own AWS Application Load Balancer (ALB), which used NodePort services to route traffic to pods. In practice, when a user request arrived, the flow looked like this:
Client → CDN (external API gateway) → ALB → NodePort → kube-proxy → Pod
Although this architecture worked, it had several built-in constraints that became increasingly problematic as the platform grew.
Port Exhaustion: Kubernetes’ NodePort service type exposes each service on a specific port within a fixed range (typically 30000 to 32767). With more than 800 services deployed, Hotstar was fast running out of available ports. This meant new services or replicas could not be easily added without changing network configurations.
Hardware and Kubernetes Version Constraints: The KOPS clusters were running on an older Kubernetes version (v1.17) and using previous-generation EC2 instances. These versions did not support modern instance families such as Graviton, C5i, or C6i, which offer significantly better performance and efficiency. Additionally, because of the older version, the platform could not take advantage of newer scaling tools like Karpenter, which automates node provisioning and helps optimize costs by shutting down underused instances quickly.
IP Address Exhaustion: Each deployed service used multiple IP addresses — one for the pod, one for the service, and additional ones for the load balancer. As the number of services increased, Hotstar’s VPC subnets began running out of IP addresses, creating scaling bottlenecks. Adding new nodes or services often meant cleaning up existing ones first, which slowed down development and deployments.
Operational Overhead: Every time a major cricket tournament or live event was about to begin, the operations team had to manually pre-warm hundreds of load balancers to ensure they could handle sudden spikes in traffic. This was a time-consuming, error-prone process that required coordination across multiple teams.
Cost Inefficiency: Finally, the older cluster autoscaler used in the legacy setup was not fast enough to consolidate or release nodes efficiently.
To overcome the limitations of the old setup, Disney+ Hotstar introduced a new architectural model known as Datacenter Abstraction. In this model, a “data center” does not refer to a physical building but a logical grouping of multiple Kubernetes clusters within a specific region. Together, these clusters behave like a single large compute unit for deployment and operations.
Each application team is given a single logical namespace within a data center. This means teams no longer need to worry about which cluster their application is running on. Deployments become cluster-agnostic, and traffic routing is automatically handled by the platform.
This abstraction made the entire infrastructure much simpler to manage. Instead of dealing with individual clusters, teams could now operate at the data center level, which brought several major benefits:
Simplified failover and recovery: If one cluster faced an issue, workloads could shift to another cluster without changing configurations.
Uniform scaling and observability: Resources across clusters could be managed and monitored as one system.
Centralized routing and security: Rate limiting, authentication, and routing rules could all be handled from a common platform layer.
Reduced management overhead: Engineering teams could focus on applications instead of constantly maintaining infrastructure details.
Some key architectural innovations the team made are as follows:
At the core of Datacenter Abstraction was a new central proxy layer, built using Envoy, a modern proxy and load balancing technology. This layer acted as the single point of control for all internal traffic routing.
Before this change, each service had its own Application Load Balancer (ALB), meaning more than 200 load balancers had to be managed and scaled separately. The new Envoy-based gateway replaced all of them with a single, shared fleet of proxy servers.
This gateway handled several critical functions:
Traffic routing: Directing requests to the right service, regardless of which cluster it was running on.
Authentication and rate limiting: Ensuring that all internal and external requests were secure and properly controlled.
Load shedding and service discovery: Managing temporary overloads and finding the correct service endpoints automatically.
By placing this gateway layer within each cluster and letting it handle routing centrally, the complexity was hidden from developers. Application teams no longer needed to know where their services were running; the platform managed everything behind the scenes.
The Datacenter Abstraction model was built entirely on Amazon Elastic Kubernetes Service (EKS), marking a complete move away from self-managed KOPS clusters.
This transition gave Hotstar several important advantages:
Managed control plane: AWS handled critical control components, reducing maintenance effort and improving reliability.
Access to newer EC2 generations: The team could now use the latest high-performance and cost-efficient instance types, such as Graviton and C6i.
Rapid provisioning: New clusters could be created quickly to meet growing demand.
Unified orchestration: Multiple EKS clusters could now operate together as part of one logical data center, simplifying management across environments.
To simplify how services communicate, the team introduced a unified endpoint structure across all environments. Previously, different teams created their own URLs for internal and external access, which led to confusion and configuration errors.
Under the new system, every service followed a clear and consistent pattern:
Intra-DC (within the same data center): <service>.internal.<domain>
Inter-DC (between data centers): <service>.internal.<dc>.<domain>
External (public access): <service>.<public-domain>
This made service discovery much easier and allowed engineers to move a service between clusters without changing its endpoints. It also improved traffic routing and reduced operational friction.
In the earlier architecture, deploying an application required maintaining five or six separate Kubernetes manifest files for different environments, such as staging, testing, and production. This caused duplication and made updates cumbersome.
To solve this, the team introduced a single unified manifest template. Each service now defines its configuration in one base file and applies small overrides only when needed, for example, to adjust memory or CPU limits.
Infrastructure details such as load balancer configurations, DNS endpoints, and security settings were abstracted into the platform itself. Engineers no longer had to manage these manually.
This approach provided several benefits:
Reduced duplication of configuration files.
Faster and safer deployments across environments.
Consistent standards across all teams.
Each manifest includes essential parameters like ports, health checks, resource limits, and logging settings, ensuring every service follows a uniform structure.
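A hedged sketch of the base-plus-overrides idea (not the actual platform tooling) is shown below; all field names and values are illustrative.

```python
# Hedged sketch of a "base manifest + small overrides" pattern (not the
# actual platform tooling). Per-environment tweaks are applied on top of one
# shared base definition; all field names and values are illustrative.
import copy

BASE_MANIFEST = {
    "ports": [{"name": "http", "containerPort": 8080}],
    "healthCheck": {"path": "/healthz", "periodSeconds": 10},
    "resources": {"cpu": "500m", "memory": "512Mi"},
    "logging": {"level": "info"},
}

def render_manifest(overrides: dict) -> dict:
    manifest = copy.deepcopy(BASE_MANIFEST)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(manifest.get(key), dict):
            manifest[key].update(value)   # merge within a section
        else:
            manifest[key] = value         # replace the section outright
    return manifest

# Production needs more resources; everything else stays identical to the base.
production = render_manifest({"resources": {"cpu": "2", "memory": "4Gi"}})
print(production["resources"])   # {'cpu': '2', 'memory': '4Gi'}
```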
One of the biggest technical improvements came from replacing NodePort services with ClusterIP services using the AWS ALB Ingress Controller.
In simple terms, NodePort requires assigning a unique port to each service within a fixed range. This created a hard limit on how many services could be exposed simultaneously. By adopting ClusterIP, services were directly connected to their pod IPs, removing the need for reserved port ranges.
This change made the traffic flow more direct and simplified the overall network configuration. It also allowed the system to scale far beyond the earlier port limitations, paving the way for uninterrupted operation even at massive traffic volumes.
By the time the team completed its transformation into the Datacenter Abstraction model, its infrastructure had evolved into one of the most sophisticated and resilient cloud architectures in the world of streaming. The final stage of this evolution involved moving toward a multi-cluster deployment strategy, where decisions were driven entirely by real-time data and service telemetry.
For every service, engineers analyzed CPU, memory, and bandwidth usage to understand its performance characteristics. This data helped determine which services should run together and which needed isolation. For instance, compute-intensive workloads such as advertising systems and personalization engines were placed in their own dedicated clusters to prevent them from affecting latency-sensitive operations like live video delivery.
Each cluster was carefully designed to host only one P0 (critical) service stack, ensuring that high-priority workloads always had enough headroom to handle unexpected spikes in demand. In total, the production environment was reorganized into six well-balanced EKS clusters, each tuned for different types of workloads and network patterns.
The results of this multi-year effort were remarkable.
Over 200 microservices were migrated into the new Datacenter Abstraction framework, operating through a unified routing and endpoint system. The platform replaced hundreds of individual load balancers with a single centralized Envoy API Gateway, dramatically simplifying traffic management and observability.
When the 2023 Cricket World Cup arrived, the impact of these changes was clear. The system successfully handled over 61 million concurrent users, setting new records for online live streaming without major incidents or service interruptions.
References:
Scaling Infrastructure for Millions: From Challenges to Triumphs (Part 1)
Scaling Infrastructure for Millions: Datacenter Abstraction (Part 2)
2025-11-18 00:30:53
Sentry’s AI Code Review uses Sentry’s deep context to catch bugs you’ll actually care about before you ship them – without all the noise.
In the last month it has caught more than 30,000 bugs that would have impacted users, saving developers time and reducing rollbacks. Plus it’s now 50% faster and offers agent prompts that help you fix what’s wrong fast.
Ship confidently with Sentry context built into every review.
Disclaimer: The details in this post have been derived from the details shared online by the Grab Engineering Team. All credit for the technical details goes to the Grab Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Grab operates one of the most complex and data-rich platforms in Southeast Asia. Over the years, it has expanded far beyond ride-hailing into multiple verticals, including food delivery, groceries, mobility, and financial services. This expansion has generated a massive amount of user interaction data that reflects how millions of people engage with the platform every day.
Until recently, personalizing the user experience relied heavily on manually engineered features. Teams built features such as order frequency, ride history, or spending patterns for specific products and services. These features were often siloed, difficult to maintain, and expensive to scale. More importantly, they were not very good at capturing the evolving, time-based nature of user behavior.
To address these limitations, the Grab Engineering Team has adopted a foundation model approach. Instead of relying on task-specific features, this approach learns directly from two core data sources: tabular data, such as user profiles and transaction history, and sequential data, such as clickstream interactions. By learning patterns directly from these signals, the model generates shared embeddings for users, merchants, and drivers. These embeddings provide a single, generalized understanding of how people use the app.
This shift allows Grab to move away from fragmented personalization pipelines and toward a unified system that powers multiple downstream applications, including recommendations, fraud detection, churn prediction, and ad optimization. In this article, we look at how Grab developed this foundational model and the challenges it faced.
The heart of Grab’s foundation model lies in its ability to work with a wide variety of data.
Unlike platforms that focus on a single use case, Grab operates a superapp that brings together food delivery, mobility, courier services, and financial products. Each of these services generates different kinds of signals about user activity. To build a single, reliable foundation model, the Grab Engineering Team had to find a way to bring all of this information together in a structured way.
The data used to train the model falls into two main categories.
The first is tabular data, which captures a user’s long-term profile and habits. This includes demographic attributes, saved addresses, spending trends, and other behavioral statistics, such as how often a person orders food or takes a ride in a month. These attributes tend to remain stable over time and give the model a sense of the user’s overall identity and preferences.
The second is time-series clickstream data, which captures the user’s short-term, real-time behavior on the app. This is the sequence of actions a user takes in a session (what they view, click, search for, and eventually purchase). It also includes timing information, such as how long someone hesitates before completing an action, which can reveal their level of decisiveness or interest. This type of data provides a dynamic and up-to-date view of what the user is trying to accomplish at any given moment.
To make this information usable by a machine learning model, Grab groups the data into multiple modalities, each with its own characteristics:
Text: Search queries, merchant names, and user-generated reviews contain rich signals about intent and preferences.
Numerical: Numbers such as delivery prices, ride fares, travel distances, and wait times help quantify behavior and context.
Categorical IDs: Unique identifiers such as user_id, merchant_id, and driver_id allow the system to link individual entities to their history and attributes.
Location: Geographic coordinates and geohashes represent places with real-world meaning. For example, a specific geohash might correspond to a shopping mall, a busy transport hub, or a residential neighborhood.
This diversity makes Grab’s data ecosystem very different from platforms with a single primary data source, like videos or text posts. A user on Grab might book a ride to a mall, then search for Japanese food, browse a few restaurants, and place an order. Each of those actions involves a mix of text, location, IDs, and numerical values.
The challenge is not only to handle these different formats but to preserve their structure and relationships when combining them. A ride drop-off location, for example, is not just a coordinate, but often a signal that influences what the user does next. By integrating these heterogeneous data types effectively, the foundation model can build a much richer understanding of context, intent, and behavior.
Building a foundation model for a superapp like Grab comes with a unique set of technical challenges.
The Grab Engineering Team identified four major challenges that shaped the design of the system.
The first challenge is that tabular and time-series data behave very differently.
Tabular data, such as a user’s profile attributes or aggregated spending history, does not depend on any specific order. Whether “age” appears before or after “average spending” makes no difference to its meaning.
Time-series data, on the other hand, is entirely order-dependent. The sequence of clicks, searches, and transactions tells a story about what the user is trying to do in real time.
Traditional models struggle to combine these two types of data because they treat them in the same way or process them separately, losing important context in the process. Grab needed a model that could handle both forms of data natively and make use of their distinct characteristics without forcing one to behave like the other.
The second challenge is the variety of modalities.
Text, numbers, IDs, and locations each carry different types of information and require specialized processing. Text might need language models, numerical data might require normalization, and location data needs geospatial understanding.
Simply treating all of these as the same kind of input would blur their meaning. The model needed to process each modality with techniques suited to it while still combining them into a unified representation that downstream applications could use.
Most machine learning models are built for one purpose at a time, like recommending a movie or ranking ads.
Grab’s foundation model had to support many different use cases at once, from ad targeting and food recommendations to fraud detection and churn prediction. A model trained too narrowly on one vertical would produce embeddings biased toward that use case, making it less useful elsewhere.
In other words, the architecture needed to learn general patterns that could be transferred effectively to many different tasks.
The final major challenge was scale.
Grab’s platform involves hundreds of millions of unique users, merchants, and drivers. Each of these entities has an ID that needs to be represented in the model. A naive approach that tries to predict or classify over all of these IDs directly would require an output layer with billions of parameters, making the model slow and expensive to train.
The design had to find a way to work efficiently at this massive scale without sacrificing accuracy or flexibility.
To bring together so many different types of data and make them useful in a single model, the Grab Engineering Team built its foundation model on top of a transformer architecture.
Transformers have become a standard building block in modern machine learning because they can handle sequences of data and learn complex relationships between tokens. But in Grab’s case, the challenge was unique: the model had to learn jointly from both tabular and time-series data, which behave in very different ways.
The key innovation lies in how the data is tokenized and represented before it enters the transformer.
Instead of feeding raw tables or sequences directly into the model, Grab converts every piece of information into a key:value token. Here’s how it handles tabular and time-series data:
For tabular data, the key is the column name (for example, online_hours) and the value is the user’s attribute (for example, 4).
For time-series data, the key is the event type (for example, view_merchant) and the value is the entity involved (for example, merchant_id_114).
This method gives the model a unified language to describe every kind of signal, whether it is a static attribute or a sequence of actions. A “token” here can be thought of as a small building block of information that the transformer processes one by one.
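As a rough illustration of this idea (our sketch, not Grab’s code; the field names and helper function are hypothetical), a profile row and an event stream could be flattened into key:value tokens like this:

```python
# Hypothetical illustration of key:value tokenization; field names are made up.

def tokenize_user(profile: dict, events: list[dict]) -> list[str]:
    """Flatten a tabular profile and a time-series of events into key:value tokens."""
    tokens = []

    # Tabular data: key is the column name, value is the attribute.
    for column, value in profile.items():
        tokens.append(f"{column}:{value}")

    # Time-series data: key is the event type, value is the entity involved.
    for event in events:
        tokens.append(f"{event['type']}:{event['value']}")

    return tokens


profile = {"online_hours": 4, "preferred_cuisine": "japanese"}
events = [
    {"type": "book_ride", "value": "geohash_w21z7"},
    {"type": "view_merchant", "value": "merchant_id_114"},
]

print(tokenize_user(profile, events))
# ['online_hours:4', 'preferred_cuisine:japanese',
#  'book_ride:geohash_w21z7', 'view_merchant:merchant_id_114']
```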
Transformers are sensitive to the order of tokens because order can change meaning. For example, in a time series, a ride booking followed by a food order has a different meaning than the reverse. But tabular data does not have a natural order.
To solve this, Grab applies different positional rules depending on the data type:
Tabular tokens are treated as an unordered set. The model does not try to find meaning in the order of columns.
Time-series tokens are treated as an ordered sequence. The model uses positional embeddings to understand which event came first, second, third, and so on.
This is where attention masks play an important role. In simple terms, an attention mask tells the transformer which tokens should be related to each other and how. For tabular data, the mask ensures that the model does not infer any fake ordering. For time-series data, it ensures that the model respects the actual sequence of actions.
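Here is a minimal PyTorch sketch of how that could look, assuming a toy input of three tabular tokens followed by four time-series tokens; the exact positional scheme and masking rules are our assumptions, not Grab’s published implementation:

```python
import torch

# Hypothetical sketch: 3 tabular tokens followed by 4 time-series tokens.
num_tabular, num_sequence = 3, 4
total = num_tabular + num_sequence

# Tabular tokens share position 0 (no order); time-series tokens are numbered 1..n.
position_ids = torch.cat([
    torch.zeros(num_tabular, dtype=torch.long),
    torch.arange(1, num_sequence + 1),
])

# Positional embeddings only carry meaning for the ordered part of the input.
pos_embedding = torch.nn.Embedding(num_embeddings=total + 1, embedding_dim=16)
pos_vectors = pos_embedding(position_ids)          # shape: (7, 16)

# Attention mask: tabular tokens attend freely (no implied order), while a causal
# mask restricts time-series tokens to events that actually came before them.
causal = torch.tril(torch.ones(num_sequence, num_sequence, dtype=torch.bool))
mask = torch.ones(total, total, dtype=torch.bool)
mask[num_tabular:, num_tabular:] = causal          # order matters only within the sequence

print(position_ids.tolist())   # [0, 0, 0, 1, 2, 3, 4]
print(mask.int())
```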
This combination of key:value tokenization, different positional treatments, and attention masks allows the model to process both structured profile information and sequential behavior at the same time.
In traditional systems, these two types of data are often handled by separate models and then stitched together. Grab’s approach lets the transformer learn directly from both simultaneously, which leads to richer and more accurate user representations.
This hybrid backbone forms the core of the foundation model. It ensures that no matter whether the signal comes from a stable profile attribute or a rapidly changing interaction pattern, the model can interpret it in a consistent and meaningful way.
Once the transformer backbone is in place, the next major design challenge is figuring out how to handle different data types in a way that preserves their meaning.
A user’s interaction with Grab can involve text, numerical values, location data, and large sets of IDs. Each of these requires different processing techniques before the model can bring them together into a shared representation. If all data were treated the same way, important nuances would be lost. For example, a text search for “chicken rice” should not be processed in the same manner as a pair of latitude and longitude coordinates or a user ID.
To solve this, the Grab Engineering Team uses a modular adapter-based design.
Adapters act as specialized mini-models for each modality. Their role is to encode raw data into a high-dimensional vector representation that captures its meaning and structure before it reaches the main transformer.
Here’s how different modalities are handled:
Text adapters: Textual data, such as search queries and reviews, is processed through encoders initialized with pre-trained language models. This lets the system capture linguistic patterns and semantic meaning effectively, without having to train a language model from scratch.
ID adapters: Categorical identifiers like user_id, merchant_id, and driver_id are handled through dedicated embedding layers. Each unique ID gets its own learnable vector representation, allowing the model to recognize specific users or entities and their historical patterns.
Location and numerical adapters: Data such as coordinates, distances, and prices do not fit neatly into existing language or ID embedding spaces. For these, the team builds custom encoders designed to preserve their numerical and spatial structure. This ensures that the model understands how two nearby locations might relate more closely than two distant ones, or how price differences can affect behavior.
Once each adapter processes its input, the resulting vectors are passed through an alignment layer. This step ensures that all the different modality vectors are projected into the same latent representation space. This makes it possible for the transformer to compare and combine them meaningfully, such as linking a text query to a location or a specific merchant.
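A simplified sketch of the adapter-plus-alignment idea might look like the following; the class names, dimensions, and the choice of a small MLP for numerical data are illustrative assumptions, and a text adapter built on a pre-trained language model is omitted for brevity:

```python
import torch
import torch.nn as nn

class IdAdapter(nn.Module):
    """Embedding table for categorical IDs (users, merchants, drivers)."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, ids):                 # ids: (batch,)
        return self.embed(ids)

class NumericAdapter(nn.Module):
    """Small MLP that lifts scalars (price, distance, lat/lon) into vector space."""
    def __init__(self, in_features: int, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_features, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                   # x: (batch, in_features)
        return self.mlp(x)

class Alignment(nn.Module):
    """Project every modality into the same latent space for the transformer."""
    def __init__(self, dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, latent_dim)

    def forward(self, x):
        return self.proj(x)

dim, latent = 32, 64
id_adapter, num_adapter, align = IdAdapter(1000, dim), NumericAdapter(2, dim), Alignment(dim, latent)

merchant_vec = align(id_adapter(torch.tensor([114])))               # e.g. merchant_id_114
location_vec = align(num_adapter(torch.tensor([[1.29, 103.85]])))   # e.g. lat/lon pair
tokens = torch.stack([merchant_vec, location_vec], dim=1)           # (batch, seq, latent) for the transformer
print(tokens.shape)    # torch.Size([1, 2, 64])
```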
One of the most important decisions in building Grab’s foundation model was how to train it.
Many machine learning systems are trained for one specific task, such as predicting whether a user will click on an ad or what item they might buy next. While this approach works well for focused problems, it can lead to biased embeddings that perform poorly outside that single use case.
Since Grab’s model needs to support a wide range of applications across multiple verticals, the Grab Engineering Team decided to use an unsupervised pre-training strategy.
In supervised learning, the model is optimized to solve one particular labeled task. However, Grab’s ecosystem involves food orders, rides, grocery deliveries, financial transactions, and more. A model trained only on one vertical (say, food ordering) would end up favoring signals relevant to that domain while ignoring others.
By contrast, unsupervised pre-training lets the model learn general patterns across all types of data and interactions without being tied to a single label or task. Once trained, the same model can be adapted or fine-tuned for many downstream applications like recommendations, churn prediction, or fraud detection.
The first core technique is masked language modeling (MLM). This approach, inspired by methods used in large language models, involves hiding (or “masking”) some of the tokens in the input and asking the model to predict the missing pieces.
For example, if a token represents “view_merchant: merchant_id_114”, the model might see only view_merchant:[MASK] and must learn to infer which merchant ID fits best based on the rest of the user’s activity. This forces the model to build a deeper understanding of how user actions relate to each other.
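As a rough illustration (our sketch, not Grab’s code), masking can be applied to the value part of a key:value token; the 15% masking rate is borrowed from BERT-style MLM and is an assumption here:

```python
import random

# Hypothetical sketch of masked-token pre-training on key:value tokens.
tokens = ["online_hours:4", "book_ride:geohash_w21z7", "view_merchant:merchant_id_114"]

masked, targets = [], {}
for i, tok in enumerate(tokens):
    key, value = tok.split(":", 1)
    if random.random() < 0.15:              # mask ~15% of values, as in BERT-style MLM
        masked.append(f"{key}:[MASK]")
        targets[i] = value                  # the model must reconstruct this value
    else:
        masked.append(tok)

print(masked)   # e.g. ['online_hours:4', 'book_ride:geohash_w21z7', 'view_merchant:[MASK]']
print(targets)  # e.g. {2: 'merchant_id_114'}
```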
The second key technique is next action prediction, which aligns perfectly with how user behavior unfolds in a superapp. A user might finish a ride and then search for a restaurant, or browse groceries, and then send a package. The model needs to predict what kind of action comes next and what specific value is involved.
This happens in two steps:
Action type prediction: Predicting what kind of interaction will happen next (for example, click_restaurant, book_ride, or search_mart).
Action value prediction: Predicting the entity or content tied to that action, such as the specific merchant ID, destination coordinates, or text query.
This dual-prediction structure mirrors the model’s key:value token format and helps it learn complex behavioral patterns across different modalities.
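A minimal sketch of this dual-head setup, with made-up dimensions and vocabulary sizes, could look like this:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the two-step next-action objective:
# one head predicts the action type, another predicts the action value.
hidden_dim, num_action_types, num_value_classes = 64, 50, 10_000

action_type_head = nn.Linear(hidden_dim, num_action_types)    # e.g. click_restaurant, book_ride
action_value_head = nn.Linear(hidden_dim, num_value_classes)  # e.g. which merchant ID

user_state = torch.randn(1, hidden_dim)        # transformer output for the current position
type_logits = action_type_head(user_state)
value_logits = action_value_head(user_state)

# Toy targets: action type 3 and value class 114 stand in for real labels.
loss = nn.functional.cross_entropy(type_logits, torch.tensor([3])) \
     + nn.functional.cross_entropy(value_logits, torch.tensor([114]))
print(loss.item())
```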
Different data types require different ways of measuring how well the model is learning. To handle this, Grab uses modality-specific reconstruction heads, which are specialized output layers tailored to each data type:
Categorical IDs: Use cross-entropy loss, which is well-suited for classification tasks involving large vocabularies.
Numerical values: Use mean squared error (MSE) loss, which works better for continuous numbers like prices or distances.
Other modalities: Use loss functions best matched to their specific characteristics.
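In code, this amounts to dispatching on modality when computing the training loss; the sketch below is illustrative and uses toy tensors:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: pick a reconstruction loss per modality.
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

def reconstruction_loss(modality: str, prediction, target):
    if modality == "categorical_id":
        return ce(prediction, target)        # classification over an ID vocabulary
    if modality == "numerical":
        return mse(prediction, target)       # continuous values such as price or distance
    raise ValueError(f"unsupported modality: {modality}")

id_loss = reconstruction_loss("categorical_id", torch.randn(1, 10_000), torch.tensor([114]))
price_loss = reconstruction_loss("numerical", torch.tensor([3.2]), torch.tensor([3.5]))
print(id_loss.item(), price_loss.item())
```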
One of the biggest scaling challenges Grab faced while building its foundation model was dealing with massive ID vocabularies. The platform serves hundreds of millions of users, merchants, and drivers. Each of these entities has a unique ID that the model needs to understand and, in some cases, predict.
If the model were to predict directly from a single list containing all these IDs, the output layer would have to handle hundreds of millions of possibilities at once. This would require tens of billions of parameters, making the system extremely expensive to train and slow to run. It would also be prone to instability during training because predicting across such a huge space is computationally difficult.
To address this, the Grab Engineering Team adopted a hierarchical classification strategy. Instead of treating all IDs as one flat list, the model breaks the prediction task into two steps:
The model first predicts a high-level category that the target ID belongs to. For example, if it is trying to predict a restaurant, it might first predict the city or the cuisine type.
Once the coarse group is identified, the model then predicts the specific ID within that smaller group. For example, after predicting the “Japanese cuisine” group, it might choose a particular restaurant ID from within that set.
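The sketch below illustrates the two-step idea with toy numbers (100 groups of 1,000 IDs standing in for hundreds of millions); the head design is our assumption, not Grab’s exact architecture:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of two-step hierarchical ID prediction.
hidden_dim, num_groups, ids_per_group = 64, 100, 1_000   # toy stand-in for a huge ID space

group_head = nn.Linear(hidden_dim, num_groups)                   # step 1: coarse group (e.g. cuisine or city)
id_table = nn.Embedding(num_groups * ids_per_group, hidden_dim)  # step 2: score IDs inside the chosen group

state = torch.randn(1, hidden_dim)                     # transformer output for the current user
group = group_head(state).argmax(dim=-1)               # e.g. the "Japanese cuisine" group

# Only the candidate IDs in that group are scored, keeping the output layer small.
candidate_ids = torch.arange(ids_per_group) + group * ids_per_group
id_logits = id_table(candidate_ids) @ state.squeeze(0)  # (ids_per_group,)
predicted_id = candidate_ids[id_logits.argmax()]
print(group.item(), predicted_id.item())
```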
Once Grab’s foundation model has been pre-trained on massive amounts of user interaction data, the next question is how to use it effectively for real business problems.
The Grab Engineering Team relies on two main approaches to make the most of the model’s capabilities: fine-tuning and embedding extraction.
Fine-tuning means taking the entire pre-trained model and training it further on a labeled dataset for a specific task. For example, the model can be fine-tuned to predict fraud risk, estimate churn probability, or optimize ad targeting.
The advantage of fine-tuning is that the model retains all the general knowledge it learned during pre-training but adapts it to the needs of one particular problem. This approach usually gives the highest performance when you need a specialized solution because the model learns to focus on the signals most relevant to that specific task.
The second, more flexible option is embedding extraction. Instead of retraining the entire model, Grab uses it to generate embeddings, which are high-dimensional numerical vectors that represent users, merchants, or drivers. These embeddings can then be fed into other downstream models (such as gradient boosting machines or neural networks) without modifying the foundation model itself.
This approach makes the foundation model a feature generator. It gives other teams across the company the ability to build specialized applications quickly, using embeddings as input features, without having to train a large model from scratch. This saves both time and computational resources.
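Conceptually, the workflow looks like the sketch below, where random vectors stand in for embeddings exported by the foundation model and scikit-learn’s gradient boosting classifier plays the role of a downstream consumer:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical sketch: the frozen foundation model acts as a feature generator,
# and a lightweight downstream model consumes its embeddings.
rng = np.random.default_rng(0)

user_embeddings = rng.normal(size=(1_000, 64))   # stand-in for exported user embeddings
churn_labels = rng.integers(0, 2, size=1_000)    # stand-in labels for a churn-prediction task

downstream = GradientBoostingClassifier()
downstream.fit(user_embeddings, churn_labels)    # no fine-tuning of the foundation model required
print(downstream.predict(user_embeddings[:5]))
```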
To fully capture user behavior, Grab uses a dual-embedding strategy, generating two types of embeddings for each user:
Long-term embedding: This comes from the User ID adapter. It reflects stable behavioral patterns built up over time, such as spending habits, preferred locations, and service usage frequency. Think of it as a long-term profile or “memory” of the user.
Short-term embedding: This is derived from a user’s recent interaction sequence, processed through the model’s adapters and transformer backbone. A Sequence Aggregation Module then compresses the sequence output into a single vector that captures the user’s current intent or “what they’re trying to do right now.”
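A toy sketch of how the two embeddings could be produced and combined; mean pooling stands in for the Sequence Aggregation Module, whose internals are not publicly specified:

```python
import torch

# Hypothetical sketch of the dual-embedding idea: a stable long-term vector from
# the user-ID adapter plus a short-term vector aggregated from recent events.
dim = 64
user_id_table = torch.nn.Embedding(1_000_000, dim)

long_term = user_id_table(torch.tensor([42]))          # "memory" of user 42's stable behavior

recent_event_vectors = torch.randn(1, 12, dim)         # transformer outputs for the last 12 interactions
short_term = recent_event_vectors.mean(dim=1)          # mean pooling as a stand-in for the aggregation module

user_representation = torch.cat([long_term, short_term], dim=-1)   # (1, 128) vector for downstream models
print(user_representation.shape)
```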
Grab’s foundation model represents a major step forward in how large-scale, multi-service platforms can use AI to understand user behavior.
By unifying tabular and time-series data through a carefully designed transformer architecture, the Grab Engineering Team has created a system that delivers cross-modal representations of users, merchants, and drivers.
The impact of this shift is already visible. Personalization has become faster and more flexible, allowing teams to build models more quickly and achieve better performance. These embeddings are currently powering critical use cases, including ad optimization, dual app prediction, fraud detection, and churn modeling. Since the foundation model is trained once but can be reused in many places, it provides a consistent understanding of user behavior across Grab’s ecosystem.
Looking ahead, Grab aims to take this a step further with its vision of “Embeddings as a Product.” The goal is to provide a centralized embedding service that covers not only users, merchants, and drivers but also locations, bookings, and marketplace items. By making embeddings a core platform capability, Grab can give teams across the company immediate access to high-quality behavioral representations without needing to train their own models from scratch.
To support this vision, the roadmap includes three key priorities:
Unifying data streams to create cleaner and lower-noise signals for training.
Evolving the model architecture to learn from richer and more complex data sources.
Scaling the infrastructure to handle growing traffic volumes and new data modalities as the platform expands.
References:
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-11-16 00:30:49
If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guaranteed
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
This week’s system design refresher:
System Design: Why is Kafka Popular? (YouTube video)
How to Design Good APIs
Big Data Pipeline Cheatsheet for AWS, Azure, and Google Cloud
How to Learn AWS?
The AI Agent Tech Stack
How to Build a Basic RAG Application on AWS?
Types of Virtualization
SPONSOR US
A well-designed API feels invisible; it just works. Behind that simplicity lies a set of consistent design principles that make APIs predictable, secure, and scalable.
Here’s what separates good APIs from terrible ones:
Idempotency: GET, HEAD, PUT, and DELETE should be idempotent. Send the same request twice, get the same result, with no unintended side effects. POST is not idempotent, and PATCH is not guaranteed to be: each call may create a new resource or modify state differently.
Use idempotency keys stored in Redis or your database. The client sends the same key with each retry; the server recognizes it and returns the original response instead of processing the request again (see the sketch after this list).
Versioning: Version your API explicitly (for example, /api/v1/products or a version header) so you can evolve endpoints without breaking existing clients.
Noun-based resource names: Resources should be nouns, not verbs. “/api/products”, not “/api/getProducts”.
Security: Secure every endpoint with proper authentication. Bearer tokens (like JWTs) include a header, payload, and signature to validate requests. Always use HTTPS and verify tokens on every call.
Pagination: When returning large datasets, use pagination parameters like “?limit=10&offset=20” to keep responses efficient and consistent.
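Here is a minimal sketch of the idempotency-key pattern mentioned above, assuming a local Redis instance; the key prefix, TTL, and payment handler are hypothetical:

```python
import json
import redis  # assumes a Redis server is running locally

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def handle_payment(idempotency_key: str, payload: dict) -> dict:
    """Process a request once; replay the stored response for retries with the same key."""
    cached = r.get(f"idem:{idempotency_key}")
    if cached is not None:
        return json.loads(cached)                      # retry detected: return the original response

    response = {"status": "created", "amount": payload["amount"]}   # placeholder business logic

    # Store the response for 24 hours; NX ensures only the first writer wins.
    r.set(f"idem:{idempotency_key}", json.dumps(response), nx=True, ex=86_400)
    return response

print(handle_payment("key-123", {"amount": 50}))
print(handle_payment("key-123", {"amount": 50}))   # same key, same response, no double charge
```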
Over to you: What’s the most common API design mistake you’ve seen, and how would you fix it?
Each platform offers a comprehensive suite of services that cover the entire lifecycle:
Ingestion: Collecting data from various sources
Data Lake: Storing raw data
Computation: Processing and analyzing data
Data Warehouse: Storing structured data
Presentation: Visualizing and reporting insights
AWS uses services like Kinesis for data streaming, S3 for storage, EMR for processing, Redshift for warehousing, and QuickSight for visualization.
Azure’s pipeline includes Event Hubs for ingestion, Data Lake Store for storage, Databricks for processing, Cosmos DB for warehousing, and Power BI for presentation.
GCP offers Pub/Sub for data streaming, Cloud Storage for data lakes, Dataproc and Dataflow for processing, BigQuery for warehousing, and Data Studio for visualization.
Over to you: What else would you add to the pipeline?
AWS is one of the most popular cloud platforms. When AWS goes down, a large part of the Internet goes down.
Here’s a learning map that can help you master AWS:
AWS Fundamentals
This includes topics like “What is AWS?”, Global Infrastructure, AWS Billing, Management, and IAM basics.
Core Compute, Storage & Networking
This includes compute services like EC2, Lambda, ECS, EKS, Storage Services (such as S3, EBS, EFS, Glacier), and Networking Services (such as VPC, ELB, Route 53).
Databases and Data Services
This includes topics like Relational Databases (RDS MySQL and PostgreSQL), NoSQL, and In-Memory Databases like ElastiCache (Redis and Memcached).
Security, Identity & Compliance
Consists of topics like IAM Deep Dive, Encryption (KMS, S3 SSE), Security Tools, VPC Security Groups, and Compliance-related tools for HIPAA, SOC, and GDPR.
DevOps, Monitoring & Automation
This includes topics like DevOps Tools (CodeCommit, CodeBuild, CodePipeline), Infrastructure as Code, CI/CD Pipelines, Monitoring Tools (CloudWatch, CloudTrail), and Cost Management and the Billing Dashboard.
Learning Paths and Certifications
Consists of topics like AWS Learning Resources, such as Skill Builder and documentation, and certification paths such as Cloud Practitioner, Solutions Architect Associate, Developer Associate, SysOps, and DevOps Engineer.
Over to you: What else will you add to the list for learning AWS?
Foundation Models: Large-scale pre-trained language models that serve as the “brains” of AI agents, enabling capabilities like reasoning, text generation, coding, and question answering.
Data Storage: This layer handles vector databases and memory storage systems used by AI agents to store and retrieve context, embeddings, or documents.
Agent Development Frameworks: These frameworks help developers build, orchestrate, and manage multi-step AI agents and their workflows.
Observability: This category enables monitoring, debugging, and logging of AI agent behavior and performance in real-time.
Tool Execution: These platforms allow AI agents to interface with real-world tools (for example, APIs, browsers, external systems) to complete complex tasks.
Memory Management: These systems manage long-term and short-term memory for agents, helping them retain useful context and learn from past interactions.
Over to you: What else will you add to the list?
RAG is an AI pattern that combines a search step with text generation. It retrieves relevant information from a knowledge source (like a vector database) and then uses an LLM to generate accurate, context-aware responses.
Ingestion Stage
All raw documents (PDFs, text, etc.) are first stored in Amazon S3.
When a file is added, AWS Lambda runs an ingestion function. This function cleans and splits the document into smaller chunks.
Each chunk is sent to Amazon Bedrock’s Titan embeddings model, which converts it into vector representations.
These embeddings, along with metadata, are stored in a vector database such as Amazon OpenSearch Serverless or DynamoDB.
Querying Stage:
A user sends a question through the app frontend, which goes to API Gateway and then a Lambda query function.
The question is converted to an embedding using Amazon Bedrock Titan Embeddings.
This embedding is compared against the stored document embeddings in the vector database to find the most relevant chunks.
The relevant chunks and the user’s question are sent to an LLM (like Claude or OpenAI on Bedrock) to generate an answer.
The generated response is sent back to the user through the same API.
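A minimal sketch of the querying stage using boto3, assuming Bedrock access and credentials are configured; the Titan model ID, sample chunks, and in-memory cosine search are stand-ins for the real vector database and the full chain:

```python
import json
import boto3
import numpy as np

# Minimal sketch of the querying stage; the index and similarity search below
# are simplified stand-ins for OpenSearch and a production RAG pipeline.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> np.ndarray:
    """Convert text into a vector using a Titan embeddings model on Bedrock."""
    body = json.dumps({"inputText": text})
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=body)
    return np.array(json.loads(resp["body"].read())["embedding"])

# Pretend these chunks were embedded and stored during the ingestion stage.
chunks = ["Refunds are processed within 5 days.", "Shipping is free above $50."]
index = [(chunk, embed(chunk)) for chunk in chunks]

question = "How long do refunds take?"
q_vec = embed(question)

# Cosine similarity in place of a real vector-database query.
best_chunk = max(index, key=lambda item: np.dot(q_vec, item[1]) /
                 (np.linalg.norm(q_vec) * np.linalg.norm(item[1])))[0]

prompt = f"Answer using this context:\n{best_chunk}\n\nQuestion: {question}"
# The prompt would then go to an LLM on Bedrock (for example, a Claude model)
# through another invoke_model call to generate the final answer.
print(prompt)
```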
Over to you: Which other AWS service will you use to build a RAG app on AWS?
Virtualization didn’t just make servers more efficient; it changed how we build, scale, and deploy everything. Here’s a quick breakdown of the four major types of virtualization you’ll find in modern systems:
Traditional (Bare Metal): Applications run directly on the operating system. No virtualization layer, no isolation between processes. All applications share the same OS kernel, libraries, and resources.
Virtualized (VM-based): Each VM runs its own complete operating system. The hypervisor sits on physical hardware and emulates entire machines for each guest OS. Each VM thinks it has dedicated hardware even though it’s sharing the same physical server.
Containerized: Containers share the host operating system’s kernel but get isolated runtime environments. Each container has its own filesystem, but they’re all using the same underlying OS. The container engine (Docker, containerd, Podman) manages lifecycle, networking, and isolation without needing separate operating systems for each application.
Containers are lightweight and fast: they start in milliseconds because you’re not booting an OS, and resource usage is dramatically lower than with VMs.
Containers on VMs: This is what actually runs in production cloud environments. Containers run inside VMs, combining the benefits of both. Each VM runs its own guest OS with a container engine inside. The hypervisor provides hardware-level isolation between VMs, while the container engine provides lightweight application isolation within each VM.
This is the architecture behind Kubernetes clusters on AWS, Azure, and GCP. Your pods are containers, but they’re running inside VMs you never directly see or manage.
Over to you: In your experience, which setup strikes the best balance between performance and flexibility?