2025-10-08 23:31:24
AI app builders can scaffold a UI from a prompt. But connecting it to your data, deploying it to your preferred environment, and securing it by default? That’s where most tools break down.
Retool takes you all the way—combining AI app generation with your live data, shared components, and security rules to build full-stack apps you can ship on day one.
Generate apps on top of your data, visually edit in context, and get enterprise-grade RBAC, SSO, and audit logs automatically built in.
Disclaimer: The details in this post have been derived from the details shared online by the Facebook/Meta Engineering Team. All credit for the technical details goes to the Facebook/Meta Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Modern large-scale systems often need to process enormous volumes of work asynchronously.
For example, a social network like Facebook handles many different kinds of background jobs. Some tasks, such as sending a notification, must happen quickly. Others, like translating a large number of posts into multiple languages, can be delayed or processed in parallel. To manage this variety of workloads efficiently, the Facebook engineering team built a service called Facebook Ordered Queueing Service (FOQS).
FOQS is a fully managed, horizontally scalable, multi-tenant distributed priority queue built on sharded MySQL.
In simpler terms, it is a central system that can reliably store and deliver tasks to many different consumers, while respecting their priorities and timing requirements. It acts as a decoupling layer between services, allowing one service to enqueue work and another to process it later. This design keeps systems more resilient and helps engineers control throughput, retry logic, and ordering without building complex custom queues themselves.
The “distributed” part means FOQS runs on many servers at once, and it automatically divides data across multiple MySQL database shards to handle extremely high volumes of tasks. The “priority queue” part means that items can be assigned different importance levels, so the most critical tasks are delivered first. FOQS also supports delayed delivery, letting engineers schedule work for the future rather than immediately.
FOQS plays a key role in some of Facebook’s largest production workflows:
The Async platform uses FOQS to defer non-urgent computation and free up resources for real-time operations.
Video encoding systems use it to fan out a single upload into many parallel encoding jobs that need to be processed efficiently.
Language translation pipelines rely on it to distribute large amounts of parallelizable, compute-heavy translation tasks.
At Facebook’s scale, these background workflows involve trillions of queue operations per day. In this article, we look at how FOQS is structured, how it processes enqueues and dequeues, and how it maintains reliability.
Before looking at how FOQS works internally, it is important to understand the core building blocks that make up the system. Each of these plays a distinct role in how FOQS organizes and delivers tasks at scale.
A namespace is the basic unit of multi-tenancy and capacity management in FOQS. Each team or application that uses FOQS gets its own namespace. This separation ensures that one tenant’s workload does not overwhelm others and allows the system to enforce clear performance and quota guarantees.
Every namespace is mapped to exactly one tier. A tier consists of a fleet of FOQS hosts and a set of MySQL shards. You can think of a tier as a self-contained slice of the FOQS infrastructure. By assigning a namespace to a specific tier, Facebook ensures predictable capacity and isolation between different workloads.
Each namespace is also assigned a guaranteed capacity, measured in enqueues per minute. This is the number of items that can be added to the queue per minute without being throttled. These quotas help protect the underlying storage and prevent sudden spikes in one workload from affecting others.
Overall, this design allows FOQS to support many different teams inside Facebook simultaneously, each with its own usage pattern.
Within a namespace, work is further organized into topics.
A topic acts as a logical priority queue, identified by a simple string name. Topics are designed to be lightweight and dynamic. A new topic is created automatically when the first item is enqueued to it, and it is automatically cleaned up when it becomes empty. There is no need for manual provisioning or configuration.
To help consumers discover available topics, FOQS provides an API called GetActiveTopics. This returns a list of currently active topics in a namespace, meaning topics that have at least one item waiting to be processed. This feature allows consumers to easily find which queues have pending work, even in systems with a large and changing set of topics.
Dynamic topics make FOQS flexible. For example, a video processing system might create a new topic for each uploaded video to process its encoding tasks in isolation. Once the encoding finishes and the queue is empty, the topic disappears automatically.
The item is the most important unit in FOQS because it represents a single task waiting to be processed. Internally, each item is stored as one row in a MySQL table, which allows FOQS to leverage the reliability and indexing capabilities of MySQL.
Each item contains several fields:
Namespace and topic identify which logical queue the item belongs to.
Priority is a 32-bit integer where a lower value means higher priority. This determines the order in which items are delivered.
Payload contains the actual work data. This is an immutable binary blob, up to around 10 KB in size. For example, it could include information about which video to encode or which translation task to perform.
Metadata is a mutable field containing a few hundred bytes. This is often used to store intermediate results, retry counts, or backoff information during the item’s lifecycle.
Deliver_after is a timestamp that specifies when the item becomes eligible for dequeue. This enables delayed delivery, which is useful for scheduling tasks in the future or applying backoff policies.
Lease_duration defines how long a consumer has to acknowledge or reject (ack/nack) the item after dequeuing it. If this time expires without a response, FOQS applies the namespace’s redelivery policy.
TTL (time-to-live) specifies when the item should expire and be removed automatically, even if it has not been processed.
FOQS ID is a globally unique identifier that encodes the shard ID and a 64-bit primary key. This ID allows FOQS to quickly locate the item’s shard and ensure correct routing for acknowledgments and retries.
Together, these fields give FOQS the ability to control when items become available, how they are prioritized, how long they live, and how they are tracked reliably in a distributed system. By storing each item as a single MySQL row, FOQS benefits from strong consistency, efficient indexing, and mature operational tools, which are crucial at Facebook’s scale.
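To make the item model easier to picture, here is a minimal Python sketch of such a record. The field names mirror the descriptions above, but the class itself, the types, and the defaults are illustrative assumptions rather than FOQS's actual schema.

```python
from dataclasses import dataclass, field
import time


@dataclass
class QueueItem:
    """Illustrative shape of a FOQS item (one MySQL row). The field names
    follow the description above; the types and defaults are assumptions."""
    namespace: str                 # which tenant's logical queue this belongs to
    topic: str                     # the priority queue inside that namespace
    priority: int                  # 32-bit integer; lower value = delivered first
    payload: bytes                 # immutable work data, up to ~10 KB
    metadata: bytes = b""          # mutable: retry counts, backoff state, etc.
    deliver_after: float = field(default_factory=time.time)  # earliest dequeue time
    lease_duration_s: int = 30     # time the consumer has to ack/nack after dequeue
    ttl_s: int = 86_400            # expire the item after this even if unprocessed
    foqs_id: str = ""              # encodes the shard ID plus a 64-bit primary key
```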
The enqueue path in FOQS is responsible for adding new items into the queue reliably and efficiently.
Since FOQS processes trillions of enqueue operations per day, this part of the system must be extremely well optimized for write throughput while also being careful not to overload the underlying MySQL shards. Facebook designed the enqueue pipeline to use buffering, batching, and protective mechanisms to maintain stability under heavy load.
When a client wants to add a new task, it sends an Enqueue request to the appropriate FOQS host. Instead of immediately writing this item to the database, the host first places the request into an in-memory buffer. This approach allows FOQS to batch multiple enqueues together for each shard, which is much more efficient than inserting them one by one. As soon as the request is accepted into the buffer, the client receives a promise that the enqueue operation will be processed shortly.
See the diagram below:
In the background, per-shard worker threads continuously drain these buffers. Each shard of the MySQL database has its own set of workers that take the enqueued items from memory and perform insert operations into the shard’s MySQL table. This batching significantly reduces the overhead on MySQL and enables FOQS to sustain massive write rates across many shards simultaneously.
Once the database operation is completed, FOQS fulfills the original promise and sends the result back to the client. If the insert was successful, the client receives the FOQS ID of the newly enqueued item, which uniquely identifies its location. If there was an error, the client is informed accordingly.
An important part of this pipeline is FOQS’s circuit breaker logic, which helps protect both the service and the database from cascading failures. The circuit breaker continuously monitors the health of each MySQL shard. If it detects sustained slow queries or a spike in error rates, it temporarily marks that shard as unhealthy. When a shard is marked unhealthy, FOQS stops sending new enqueue requests to it until it recovers. This prevents a situation where a struggling shard receives more and more traffic, making its performance even worse. By backing off from unhealthy shards, FOQS avoids the classic “thundering herd” problem where too many clients keep retrying against a slow or failing component, causing further instability.
This careful combination of buffering, batching, and protective measures allows FOQS to handle extremely high write volumes without overwhelming its storage backend. It ensures that enqueues remain fast and reliable, even during periods of peak activity or partial database failures.
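The following is a simplified Python sketch of that enqueue flow: per-shard in-memory buffers, an immediately returned promise, a batched drain step, and a crude circuit breaker. It illustrates the pattern under assumed names rather than Facebook's implementation, and the actual storage call is left as a callback.

```python
import collections
from concurrent.futures import Future


class EnqueueBuffer:
    """Sketch of the enqueue path: accept items into per-shard in-memory
    buffers, hand back a promise immediately, and let a background drain
    step batch-insert into MySQL. Storage and shard selection are stubbed."""

    def __init__(self, shard_ids):
        self.buffers = collections.defaultdict(list)          # shard -> [(item, promise)]
        self.healthy = {shard: True for shard in shard_ids}   # circuit-breaker state

    def enqueue(self, item, shard_id):
        promise = Future()
        if not self.healthy[shard_id]:
            # Circuit breaker open: back off instead of piling load on a sick shard.
            promise.set_exception(RuntimeError(f"shard {shard_id} marked unhealthy"))
            return promise
        self.buffers[shard_id].append((item, promise))         # accepted, not yet durable
        return promise

    def drain_shard(self, shard_id, insert_batch):
        """Run by a per-shard worker; insert_batch stands in for one batched
        MySQL INSERT and is assumed to return the generated FOQS IDs."""
        batch, self.buffers[shard_id] = self.buffers[shard_id], []
        try:
            foqs_ids = insert_batch([item for item, _ in batch])
            for (_, promise), foqs_id in zip(batch, foqs_ids):
                promise.set_result(foqs_id)                    # fulfill the original promise
        except Exception as exc:
            self.healthy[shard_id] = False                     # trip the circuit breaker
            for _, promise in batch:
                promise.set_exception(exc)
```

In the real service, the drain step runs continuously on per-shard worker threads, and the circuit breaker is driven by sustained slow queries and error rates rather than a single failed batch.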
Once tasks have been enqueued, FOQS must deliver them to consumers efficiently and in the correct order.
At Facebook’s scale, the dequeue path needs to support extremely high read throughput while respecting task priorities and scheduled delivery times. To achieve this, FOQS uses a clever combination of in-memory indexes, prefetching, and demand-aware buffering. This design allows the system to serve dequeue requests quickly without hitting the MySQL databases for every single read.
Each MySQL shard maintains an in-memory index of items that are ready to be delivered. This index contains the primary keys of items that can be dequeued immediately, sorted first by priority (with lower numbers meaning higher priority) and then by their deliver_after timestamps. By keeping this index in memory, FOQS avoids repeatedly scanning large database tables just to find the next item to deliver. This is critical for maintaining low latency and high throughput when millions of dequeue operations happen every second.
On top of these per-shard indexes, each FOQS host runs a component called the Prefetch Buffer. This buffer continuously performs a k-way merge across the indexes of all shards that the host is responsible for.
For reference, a k-way merge is a standard algorithmic technique used to combine multiple sorted lists into one sorted list efficiently. In this case, it helps FOQS select the overall best items to deliver next, based on their priority and deliver_after time, from many shards at once. As the prefetcher pulls items from the shards, it marks those items as “delivered” in MySQL. This step prevents the same item from being handed out to multiple consumers simultaneously, ensuring correct delivery semantics. The selected items are then stored in the Prefetch Buffer in memory.
When a client issues a Dequeue request, FOQS simply drains items from the Prefetch Buffer instead of going to the database. This makes dequeue operations very fast, since they are served entirely from memory and benefit from the pre-sorted order of the buffer. The Prefetch Buffer is constantly replenished in the background, so there is usually a pool of ready-to-deliver items available at any moment.
The prefetcher is also demand-aware, meaning it adapts its behavior based on actual consumption patterns. FOQS tracks dequeue rates for each topic and uses this information to refill the Prefetch Buffer proportionally to the demand. Topics that are being consumed heavily receive more aggressive prefetching, which keeps them “warm” in memory and ensures that high-traffic topics can sustain their dequeue rates without delay. This adaptive strategy allows FOQS to balance efficiency across a large number of topics with very different workloads.
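A rough way to picture the prefetcher is as a k-way merge over per-shard ready indexes, which Python's standard library can express directly with heapq.merge. The function below is a hedged sketch: the index format, the mark_delivered callback, and the buffer limit are assumptions.

```python
import heapq


def refill_prefetch_buffer(shard_indexes, mark_delivered, limit=1000):
    """Sketch of the prefetcher's k-way merge. Each element of shard_indexes
    is assumed to be an iterable of (priority, deliver_after, primary_key)
    tuples, already sorted, so heapq.merge yields the globally best items.
    mark_delivered stands in for the MySQL update that flags the row as
    delivered so no other consumer receives it."""
    buffer = []
    for priority, deliver_after, primary_key in heapq.merge(*shard_indexes):
        mark_delivered(primary_key)        # prevent double delivery
        buffer.append((priority, deliver_after, primary_key))
        if len(buffer) >= limit:
            break
    return buffer
```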
Once an item is dequeued, its lease period begins. A lease defines how long the consumer has to either acknowledge (ack) or reject (nack) the item. If the lease expires without receiving either response, FOQS applies the namespace’s delivery policy.
There are two possible behaviors:
At-least-once delivery: The item is returned to the queue and redelivered later. This ensures no tasks are lost, but consumers must handle potential duplicates.
At-most-once delivery: The item is deleted after the lease expires. This avoids duplicates but risks losing tasks if the consumer crashes before processing.
This lease and retry mechanism allows FOQS to handle consumer failures gracefully. If a consumer crashes or becomes unresponsive, FOQS can safely redeliver the work to another consumer (or discard it if at-most-once is chosen).
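A small sketch of that lease-expiry decision, with assumed names standing in for the real storage operations, might look like this:

```python
import time


def on_lease_expired(item, policy, requeue, delete, backoff_s=60):
    """Sketch of the lease-expiry decision. requeue and delete stand in for
    the real storage operations; the backoff value is an assumption."""
    if policy == "at-least-once":
        # Make the item eligible again; consumers must tolerate duplicates.
        item["deliver_after"] = time.time() + backoff_s
        requeue(item)
    elif policy == "at-most-once":
        # Drop the item; no duplicates, but the work may be lost.
        delete(item["foqs_id"])
```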
Once a consumer finishes processing an item, it must inform FOQS about the result. This is done through acknowledgment (ack) or negative acknowledgment (nack) operations.
Every item in FOQS has a FOQS ID that encodes the shard ID and a unique primary key. When a client wants to acknowledge or reject an item, it uses this shard ID to route the request to the correct FOQS host. This step is crucial because only the host that currently owns the shard can modify the corresponding MySQL rows safely. By routing directly to the right place, FOQS avoids unnecessary network hops and ensures that updates are applied quickly and consistently.
When the FOQS host receives an ack or nack request, it does not immediately write to the database. Instead, it appends the request to an in-memory buffer that is maintained per shard. This buffering is similar to what happens during the enqueue path. By batching multiple ack and nack operations together, FOQS can apply them to the database more efficiently, reducing write overhead and improving overall throughput. See the diagram below:
Worker threads on each shard continuously drain these buffers and apply the necessary changes to the MySQL database:
For ack operations, the worker simply deletes the row associated with the item from the shard’s MySQL table. This signals that the task has been successfully completed and permanently removes it from the queue.
For nack operations, the worker updates the item’s deliver_after timestamp and metadata fields. This allows the item to be redelivered later after the specified delay. Updating metadata is useful for tracking retry counts, recording partial progress, or implementing backoff strategies before the next attempt.
The ack and nack operations are idempotent, which means they can be retried safely without causing inconsistent states.
For example, if an ack request is sent twice by mistake or due to a network retry, deleting the same row again has no harmful effect. Similarly, applying the same nack update multiple times leads to the same final state. Idempotency is essential in distributed systems, where messages may be delayed, duplicated, or retried because of transient failures.
If an ack or nack operation fails due to a network issue or a host crash, FOQS does not lose track of the item. When the item’s lease expires, FOQS automatically applies the namespace’s redelivery policy. This ensures that unacknowledged work is either retried (for at-least-once delivery) or cleaned up (for at-most-once delivery), maintaining forward progress without requiring manual intervention.
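As a rough illustration of the drain step described above, here is a generic Python DB-API sketch. The table name, column names, and parameter format are assumptions, but it shows why both operations are naturally idempotent: deleting an already-deleted row and re-applying the same UPDATE leave the database in the same state.

```python
def drain_ack_nack_buffer(conn, shard_table, acks, nacks):
    """Sketch of the per-shard worker that applies buffered acks and nacks
    through a generic DB-API connection. Table and column names are assumed.

    acks  -- list of item IDs to delete
    nacks -- list of (item_id, retry_at, metadata) tuples to reschedule
    """
    with conn.cursor() as cur:
        if acks:
            # Ack: the task finished, so the row is simply deleted.
            cur.executemany(
                f"DELETE FROM {shard_table} WHERE id = %s",
                [(item_id,) for item_id in acks],
            )
        for item_id, retry_at, metadata in nacks:
            # Nack: push deliver_after into the future and record retry state.
            cur.execute(
                f"UPDATE {shard_table} SET deliver_after = %s, metadata = %s WHERE id = %s",
                (retry_at, metadata, item_id),
            )
    conn.commit()
```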
One of the key design decisions in FOQS is its use of a pull-based model for delivering work to consumers.
In a pull model, consumers actively request new items when they are ready to process them, rather than the system pushing items to consumers automatically. Facebook chose this approach because it provides better control, flexibility, and scalability across many different types of workloads.
Workloads inside Facebook vary widely. Some require low latency and high throughput, while others involve slower, scheduled processing. A push model would require FOQS to track each consumer’s capacity and flow control in real time to avoid overwhelming slower workers. This becomes complicated and error-prone at Facebook’s scale, where consumers can number in the thousands and have very different performance characteristics.
The pull model simplifies this problem. Each consumer can control its own processing rate by deciding when and how much to dequeue. This prevents bottlenecks caused by overloaded consumers and makes the system more resilient to sudden slowdowns. It also allows consumers to handle regional affinity and load balancing intelligently, since they can choose where to pull work from based on their location and capacity.
However, the main drawback of pull systems is that consumers need a way to discover available work efficiently. FOQS addresses this with its routing layer and topic discovery API, which help consumers find active topics and shards without scanning the entire system.
FOQS is designed to handle massive workloads that would overwhelm traditional queueing systems.
See the diagram below that shows the distributed architecture for FOQS:
At Facebook, the service processes roughly one trillion items every day. This scale includes not only enqueuing and dequeuing tasks but also managing retries, delays, expirations, and acknowledgments across many regions.
Large distributed systems frequently experience temporary slowdowns or downstream outages. During these events, FOQS may accumulate backlogs of hundreds of billions of items. Instead of treating this as an exception, the system is built to function normally under backlog conditions. Its sharded MySQL storage, prefetching strategy, and routing logic ensure that tasks continue to flow without collapsing under the load.
A key aspect of this scalability is FOQS’s MySQL-centric design. Rather than relying on specialized storage systems, the Facebook engineering team optimized MySQL with careful indexing, in-memory ready queues, and checkpointed scans.
By combining sharding, batching, and resilient queue management, FOQS sustains enormous traffic volumes while maintaining reliability and predictable performance.
Facebook Ordered Queueing Service (FOQS) shows how a priority queue can support diverse workloads at a massive scale.
By building on sharded MySQL and combining techniques like buffering, prefetching, adaptive routing, and leases, FOQS achieves both high performance and operational resilience. Its pull-based model gives consumers control over their processing rates, while its abstractions of namespaces, topics, and items make it flexible enough to support many teams and use cases across the company.
A crucial part of FOQS’s reliability is its disaster readiness strategy. Each shard is replicated across multiple regions, and binlogs are stored both locally and asynchronously across regions. During maintenance or regional failures, Facebook can promote replicas and shift traffic to healthy regions with minimal disruption. This ensures the queue remains functional even when large parts of the infrastructure are affected.
Looking ahead, the Facebook engineering team continues to evolve FOQS to handle more complex failure modes and scaling challenges. Areas of active work include improving multi-region load balancing, refining discoverability as data spreads, and expanding workflow features such as timers and stricter ordering guarantees. These improvements aim to keep FOQS reliable as Facebook’s workloads continue to grow and diversify.
References:
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-10-07 23:30:42
P99 CONF is the technical conference for anyone who obsesses over high-performance, low-latency applications. Naturally, Rust is a core topic.
How is Rust being applied to solve today’s low-latency challenges, and where could it be heading next? That’s what experts from Clickhouse, Prime Video, Neon, Datadog, and more will be exploring.
Join 20K of your peers for an unprecedented opportunity to learn from engineers at Pinterest, Gemini, Arm, Rivian and VW Group Technology, Meta, Wayfair, Disney, Uber, NVIDIA, and more – for free, from anywhere.
Bonus: Registrants can win 500 free swag packs and get 30-day access to the complete O’Reilly library.
Disclaimer: The details in this post have been derived from the official documentation shared online by the Flipkart Engineering Team. All credit for the technical details goes to the Flipkart Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Flipkart is one of India’s largest e-commerce platforms, with over 500 million users, roughly 150-200 million of whom are active daily. It handles extreme surges in traffic during events like its Big Billion Days sale. To keep the business running smoothly, the company relies on thousands of microservices that cover every part of its operations, from order management to logistics and supply chain systems.
Among these systems, the most critical transactional domains depend on MySQL because it provides the durability and ACID guarantees that e-commerce workloads demand. However, managing MySQL at Flipkart’s scale presented serious challenges. Each engineering team often operated its own database clusters, resulting in uneven practices, duplicated effort, and a high operational burden. This complexity was most visible during peak shopping periods, when even small inefficiencies could cascade into major disruptions.
To solve this, the Flipkart engineering team built Altair, an internally managed service designed to offer MySQL with high availability (HA) as a standard feature.
Altair’s purpose is to ensure that the company’s most important databases remain consistently available for writes, while also reducing the manual work required by teams to keep them healthy. In practice, this means that Flipkart engineers can focus more on building services while relying on Altair to handle the heavy lifting of database failover, recovery, and availability management.
In this article, we will look at how Altair works under the hood, the technical decisions Flipkart made to balance availability and consistency, and the engineering trade-offs that come with running relational databases at a massive scale.
TL;DR: Take this 2-minute survey so I can learn more about who you are, what you do, and how I can improve ByteByteGo.
Flipkart’s Altair system uses a primary–replica setup to keep MySQL highly available.
In this model, there is always one primary database that accepts all the writes. This primary may also handle some reads. Alongside it are one or more replicas. These replicas continuously copy data from the primary in an asynchronous manner, which means there can be a small delay before changes appear on them. Replicas usually handle most of the read traffic, while the primary focuses on writes.
The main goal of this setup is simple: if the primary fails, the system should quickly promote a healthy replica to take its place as the new primary. This ensures that write operations remain available with minimal disruption.
Flipkart’s availability target for Altair is very high, close to what is known as “five nines.” That means the system is expected to stay up and running more than 99.999 percent of the time. Of course, no complex system can ever promise perfect uptime, but the goal is to keep downtime as close to zero as possible.
To make failover reliable, the Flipkart engineering team considered several important factors:
Data-loss tolerance: Making sure as little data as possible is lost during failover.
Fault detection reliability: Ensuring the system can accurately tell when the primary has truly failed.
Failover workflow strength: Designing a robust process to handle the switch without errors.
Network partition handling: Making the system resilient when parts of the network cannot talk to each other.
Fencing the old primary: Preventing the failed primary from accidentally accepting new writes.
Split-brain prevention: Avoiding a situation where two nodes think they are the primary at the same time.
Automation: Reducing the need for human intervention, so failover can happen quickly and consistently.
By combining these elements, Altair is designed to keep MySQL highly available even under failure conditions.
When a primary database fails, Altair follows a well-defined sequence of steps to recover and make sure applications can continue writing data. This process is called a failover workflow, and it involves multiple components working together.
The workflow has five main stages:
Failure detection: The system continuously monitors all MySQL nodes to check if something goes wrong with the primary. If it looks unhealthy or unreachable, a failure is suspected.
False-positive screening: Before taking any big action, Altair double-checks whether the failure is real. Sometimes a node might look down because of a temporary glitch, but it is still fine. This step ensures that the system does not promote a new primary unnecessarily.
Failover tasks: If the primary is truly down, the system begins the recovery job. This includes stopping or fencing the old primary, choosing the best replica, and promoting it to primary.
Service discovery update: Once a new primary is ready, Altair updates the DNS record so that applications connecting to the database automatically point to the new primary. This means applications usually do not need to restart.
Fencing the old primary: To avoid two databases acting as primary at the same time, Altair tries to mark the old primary as read-only or completely stop it. This step is critical for preventing split-brain, where two nodes could accept writes independently.
Altair uses a three-layered monitoring system to detect failures:
Agent: A lightweight program runs on each database node. It checks the MySQL process, replication status, replication lag, and disk usage. It reports this health information every 10 seconds.
Monitor: A Go-based microservice collects these health reports. It writes the status to ZooKeeper, also every 10 seconds. The monitor compares old and new health states, checks if thresholds are breached, and flags possible issues. If a failure is suspected, it alerts the orchestrator. Multiple monitors can run in parallel, making the system scalable as Flipkart grows.
Orchestrator: This is the brain of the workflow. It verifies if the failure is real or just a false alarm. If the problem is confirmed, it initiates the failover process.
See the diagram below:
Instead of simply relying on a few missed signals, Altair performs deeper checks:
It verifies whether the virtual machine running the database is healthy.
It checks if replicas can still connect to the primary.
It also tests whether the orchestrator itself can reach the primary.
The rule is straightforward: as long as either the orchestrator or at least one replica can connect to the primary, the primary is considered alive. If both fail, the system proceeds with failover.
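That liveness rule is simple enough to capture in a couple of lines; the function below is just an illustration of the decision, with the connectivity results passed in as booleans.

```python
def primary_is_alive(orchestrator_can_connect, replica_can_connect):
    """The primary is treated as alive if the orchestrator OR at least one
    replica can still reach it; failover proceeds only when both checks fail."""
    return orchestrator_can_connect or any(replica_can_connect)


# Orchestrator times out, but one replica still sees the primary -> no failover.
print(primary_is_alive(False, [False, True, False]))   # True
```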
When the orchestrator decides to failover, Altair runs these tasks in order:
Temporarily stop monitoring the affected node.
Allow replicas to catch up by applying all pending relay logs. This reduces data loss.
Set the old primary to read-only if it is still reachable.
Stop the old primary completely if possible.
Promote the best replica to the primary.
Update DNS so that applications automatically connect to the new primary.
See the diagram below:
This structured approach ensures that failovers are smooth, data loss is minimized, and applications reconnect to the new primary without manual intervention in most cases.
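Expressed as code, the ordered sequence might look like the sketch below. Every object and method here is a stand-in for the real Altair operation described above, not an actual API.

```python
def run_failover(old_primary, cluster):
    """Sketch of the ordered failover tasks; all calls are stand-ins."""
    cluster.pause_monitoring(old_primary)        # 1. stop alerting on the affected node
    replica = cluster.pick_best_replica()
    replica.apply_pending_relay_logs()           # 2. catch up to minimize data loss
    if old_primary.is_reachable():
        old_primary.set_read_only()              # 3. fence: refuse further writes
        old_primary.stop_mysql()                 # 4. stop it entirely if possible
    replica.promote_to_primary()                 # 5. promote the chosen replica
    cluster.update_dns(replica.ip_address)       # 6. point the fixed DNS name at it
```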
Once a failover is complete, applications need to know where the new primary database is located. Altair solves this using DNS-based service discovery.
Here’s how it works:
Applications connect to the database using a fixed DNS name (for example, orders-db-primary.flipkart.com).
Behind the scenes, this DNS name points to the IP address of the current primary.
When a failover happens, Altair updates the DNS record so that the name now points to the new primary’s IP address.
This design means that most applications do not need any manual updates or restarts. As soon as they make a fresh connection, they automatically reach the new primary.
The only exceptions are unusual situations where DNS changes are not picked up or where the network is partitioned in a way that requires manual intervention. In those rare cases, Flipkart’s engineering team coordinates with client teams to restart applications and ensure traffic points to the right place.
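Here is a minimal sketch of what DNS-based discovery looks like from a client's point of view, using the example hostname mentioned earlier. Resolving the name on each fresh connection is what lets a failover take effect, subject to DNS TTLs and resolver caching.

```python
import socket


def connect_to_primary(dns_name="orders-db-primary.flipkart.com", port=3306):
    """Resolve the fixed DNS name on every fresh connection so a failover
    (which repoints the record) is picked up automatically. The hostname is
    the illustrative example used above, not a real endpoint."""
    primary_ip = socket.gethostbyname(dns_name)   # whichever IP is primary right now
    return socket.create_connection((primary_ip, port), timeout=5)
```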
One of the biggest risks in any high-availability setup is something called split-brain. This happens when two different nodes both think they are the primary at the same time. If both accept writes, the data can diverge and become inconsistent across the cluster. Fixing this later requires painful reconciliation.
Split-brain usually occurs during network partitions. Imagine the primary is healthy, but because of a network issue, the rest of the system cannot reach it. From their perspective, it looks dead. A replica is then promoted to primary, while the original primary continues accepting writes. Now there are two primaries.
See the diagram below:
With MySQL’s asynchronous replication, this problem is even harder to solve because the system cannot guarantee both strong consistency and availability during a network split. Flipkart chooses to prioritize availability, but adds safeguards to prevent split-brain.
If split-brain happens, the effects can be serious. Orders might be split across two different databases, leading to confusion for both customers and sellers. Reconciling this data later is time-consuming and costly. Flipkart cites GitHub’s 2018 incident as an example, where a short connectivity problem led to nearly 24 hours of reconciliation work.
Altair includes multiple safeguards to defend against a split-brain scenario:
During failover, it tries to stop the old primary so it cannot accept new writes.
In planned failovers, it first sets the old primary to read-only, ensuring no further writes are accepted before switching roles.
If the old primary cannot be stopped (for example, because of a severe partition), Altair may still promote a replica to keep the system available.
In uncertain situations where the control plane cannot determine the exact state of the primary, Altair follows a careful procedure:
Pause the failover job.
Notify client teams to stop applications from writing.
Resume failover, promote the replica, and update DNS.
Ask clients to restart applications so they connect to the correct new primary.
See the diagram below:
This process ensures that the risk of having two primaries is avoided, even if it requires a brief pause in availability.
Databases can fail in many different ways, and each type of failure needs to be handled carefully.
Altair is designed to detect different failure scenarios and react appropriately so that downtime is minimized and data remains safe. Let’s go through the major cases and see how Altair deals with each one.
Sometimes the entire machine (virtual machine or physical host) running the primary database can go down.
In such a case, the local agent running on that machine stops sending health updates. When the monitor does not receive three consecutive 10-second updates (about 30 seconds of silence), it marks the node as unhealthy and alerts the orchestrator.
The orchestrator verifies that the node is really unreachable and then triggers a failover, promoting a replica to become the new primary.
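The silence check itself is tiny; a hedged sketch (the exact thresholds are assumptions based on the numbers above) could be:

```python
HEARTBEAT_INTERVAL_S = 10        # agents report every 10 seconds
MISSED_REPORTS_BEFORE_ALERT = 3  # roughly 30 seconds of silence


def node_looks_down(last_report_ts, now):
    """The monitor suspects a node only after several consecutive missed
    heartbeats, not after a single late report."""
    return (now - last_report_ts) >= HEARTBEAT_INTERVAL_S * MISSED_REPORTS_BEFORE_ALERT
```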
Even if the host machine is fine, the MySQL process itself may crash. Here’s what happens in this case:
The agent reports that MySQL is down, but it can still confirm that the host is healthy.
The monitor notices the mismatch between host health and MySQL process health.
The orchestrator double-checks this situation and, once confirmed, starts the failover process.
Sometimes the primary and replicas cannot talk to each other because of a network issue, even though both are still alive. In other words, replicas lose connectivity to the primary.
This alone is not enough reason to trigger a failover. The system avoids acting on replica-only signals because the primary may still be healthy and reachable by clients.
This is more complex because Altair’s control plane (monitor and orchestrator) might lose communication with the primary while the primary itself is still running. Altair has to carefully analyze the situation to avoid false failovers.
There are three sub-cases:
Orchestrator cannot reach the primary, but the monitor still can. In this case:
The monitor continues to get health updates from the agent.
If the monitor notices a failure (like a MySQL crash), it alerts the orchestrator.
Even though the orchestrator’s own pings fail, it trusts the monitor’s updates.
Failover may still happen if replicas are available, but Altair also tries to fence the primary before promoting another node.
The monitor cannot reach the primary, but the orchestrator still can. In this case:
The monitor misses health updates and suspects the primary is down.
The orchestrator, however, can still confirm that the primary is alive and MySQL is running.
In this case, Altair treats it as a false alarm and avoids unnecessary failover.
Both the monitor and the orchestrator cannot reach the primary. In this case:
The situation looks the same as a total primary failure.
Altair proceeds with the failover process, fencing the old primary if possible before promoting a replica.
Building a system like Altair means balancing several competing goals. The Flipkart engineering team had to make careful choices about what to prioritize and how to design the system so that it worked reliably at scale. Here are the key highlights and trade-offs.
MySQL in Altair uses asynchronous replication. This means that replicas copy data from the primary with a slight delay. Because of this, there is always a trade-off:
If you want strong consistency, you must wait for every replica to confirm each write, but that slows things down and can hurt availability during failures.
If you want high availability, you accept that a small amount of data might be lost during a failover, because the replicas may not have received the very latest writes.
Flipkart chose to prioritize availability.
In practice, this means that during failover, some of the last few transactions on the old primary might not make it to the new primary. Altair reduces this risk by letting replicas catch up on relay logs whenever possible and by making planned failovers read-only before switching roles. But in unplanned crashes, a small amount of data loss is possible.
One of Altair’s biggest strengths is how it avoids false alarms.
Instead of just checking whether a few signals are missed, it uses multiple sources of truth:
The state of the virtual machine,
Replica connectivity to the primary, and
Direct connectivity between the orchestrator and the primary.
This layered approach prevents unnecessary failovers that could disrupt the system when the primary is actually fine.
Altair uses DNS indirection to make failovers smooth.
By updating the DNS record of the primary after promotion, applications automatically connect to the new primary without needing to change code or restart in most cases. This keeps the system simpler for developers who build on top of it.
Altair’s monitoring system is designed to scale as Flipkart grows:
The agent collects health data from every node.
The monitor processes the data and stores it in ZooKeeper. Multiple monitors can run in parallel, so the system can supervise many clusters at once.
The orchestrator makes final decisions and triggers failover when needed.
This separation of responsibilities ensures both reliability and scalability.
Altair represents Flipkart’s answer to the difficult problem of keeping relational databases highly available at a massive scale.
By standardizing on a primary–replica setup with asynchronous replication, the engineering team ensured that MySQL could continue to serve as the backbone for critical transactional systems. The system emphasizes write availability, while carefully minimizing data loss through relay log catch-up and planned read-only failovers.
Altair’s layered monitoring design (combining agents, monitors, ZooKeeper, and an orchestrator) allows reliable detection of failures without triggering false positives. Service discovery through DNS updates keeps application integration simple, while fencing mechanisms and procedural safeguards protect against the dangerous risk of split-brain. The system also scales horizontally, supervising thousands of clusters across Flipkart’s microservices.
The key trade-off is accepting the possibility of minor data loss in exchange for fast, automated recovery.
By doing so, Altair balances consistency, availability, and operational simplicity in a way that matches Flipkart’s business needs. In practice, this design has reduced operational overhead and delivered dependable high availability during peak events, making MySQL a reliable foundation for Flipkart’s e-commerce platform.
References:
2025-10-06 23:30:46
AI coding tools become more reliable when they understand the “why” behind your code.
With Unblocked’s MCP server, tools like Cursor and Claude now leverage your team’s historical knowledge across tools like GitHub, Slack, Confluence, and Jira, so the code they generate actually makes sense in your system.
“With Claude Code + Unblocked MCP, I’ve finally found the holy grail of engineering productivity: context-aware coding. It’s not hallucinating. It’s pulling insight from everything I’ve ever worked on.” — Staff Engineer @ Nava Benefits
Disclaimer: The details in this post have been derived from the official documentation shared online by the OpenAI and Confluent Engineering Team. All credit for the technical details goes to OpenAI and the Confluent Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
AI systems like the ones developed at OpenAI rely on vast amounts of data. The quality, freshness, and availability of this data directly influence how well the models perform. In the early days, most organizations processed data in batch mode.
Batch processing means you collect data over hours or days and then process it all at once. This approach works well for certain use cases, but it comes with an obvious downside: by the time the data is ready, it may already be stale. For fast-moving AI systems, where user interactions, experiments, and new content are being generated constantly, stale data slows everything down.
This is where stream processing comes in.
In streaming systems, data is processed as it arrives, almost in real time. Instead of waiting for a daily or hourly batch job, the system can quickly transform, clean, and route data to wherever it is needed. For an AI research organization, this means two very important things.
First, fresher training data can be delivered to models, strengthening what the OpenAI Engineering Team calls the “data flywheel.” The more quickly models can learn from new information, the faster they improve.
Second, experimentation becomes faster. Running experiments on models is a daily activity at OpenAI, and the ability to ingest and preprocess logs in near real time means researchers can test ideas, see results, and adjust without long delays.
Recognizing these benefits, the OpenAI Engineering Team set out to design a stream processing platform that could handle their unique requirements. They wanted a platform that treats Python as its standard language while also being scalable, reliable, and fault-tolerant. The platform had to integrate with their existing infrastructure, particularly Kafka, which serves as their backbone for event streaming. Most importantly, it needed to remain highly available even when parts of the system or the cloud provider had issues.
The result was a platform centered on PyFlink running on Kubernetes, reinforced by custom engineering around high availability, state management, and Kafka resilience. In this article, we will understand how OpenAI built such a system and the challenges they faced.
When the OpenAI Engineering Team began designing its stream processing platform, three major challenges stood out.
The first was the dominance of Python in AI development. Almost every researcher and engineer at OpenAI works primarily in Python. While Apache Flink is powerful, its strongest APIs were originally written in Java and Scala. For a team that wanted data pipelines to feel natural to machine learning practitioners, it was essential to provide a Python-first experience. This meant adopting and extending PyFlink, the Python API for Flink, even though it came with limitations that required additional work.
The second challenge came from cloud capacity and scalability constraints. Cloud providers impose limits on resources like compute, storage, and networking. At the scale that OpenAI operates, these limits can create bottlenecks when running streaming jobs. The platform needed to be resilient to these constraints, ensuring that pipelines could continue to run even if resource availability shifted unexpectedly.
The third and perhaps most complex challenge was the multi-primary Kafka setup. OpenAI runs Kafka in a high-availability configuration where there are multiple primary clusters. This design improves reliability but makes things more complicated for applications like Flink. Standard Flink connectors often assume a single Kafka cluster. In a multi-primary setup, if one cluster becomes unavailable, the default behavior is for Flink to treat this as a fatal error and bring down the entire pipeline. That is unacceptable for mission-critical AI workloads.
The stream processing platform at OpenAI is built around Apache Flink, accessed through its Python API called PyFlink. Flink is the engine that actually runs the streaming computations, but by itself, it is not enough.
To make it reliable and usable at OpenAI’s scale, the engineering team added several layers:
A control plane that manages jobs and coordinates failover across multiple Flink clusters.
A Kubernetes-based setup that runs Flink reliably, with isolation between different teams and use cases.
Watchdog services that monitor Kafka and react to changes so pipelines stay stable.
State and storage management that decouples pipeline state from individual clusters, ensuring that jobs can survive outages and move seamlessly between environments.
These layers work together to provide a resilient and developer-friendly system. Let’s look at each one in more detail.
The control plane is the part of the system responsible for job management. In practical terms, this means that when a new streaming job is created, updated, or restarted, the control plane is the service that keeps track of what should be running and where.
See the diagram below that shows the central role of the control plane.
At OpenAI, jobs need to survive failures at the cluster level, not just at the task level. Cloud provider issues that affect an entire Kubernetes cluster are not rare, and without a higher-level manager, an outage could bring down critical pipelines. The control plane addresses this by supporting multi-cluster failover. If one cluster becomes unhealthy, the control plane can move the job to another cluster while ensuring that the job’s state is preserved.
Another important detail is that the control plane integrates with OpenAI’s existing service deployment infrastructure. This means that developers do not need to learn a new system to manage their streaming jobs. Submitting, upgrading, or rolling back a job fits into the same deployment workflows they already use for other services. This integration reduces friction and helps standardize operations across the organization.
Flink itself does not run directly on bare machines. Instead, it is deployed on Kubernetes, the container orchestration system widely used across the industry.
OpenAI chose to use the Flink Kubernetes Operator, which automates the lifecycle of Flink deployments on Kubernetes. The operator makes it easier to launch Flink jobs, monitor them, and recover from failures without manual intervention.
One of the key design choices here is per-namespace isolation. In Kubernetes, namespaces are a way to partition resources. See the diagram below:
By giving each team or project its own namespace, OpenAI ensures that pipelines are isolated from each other. This improves both reliability and security. If something goes wrong in one namespace, it does not automatically affect pipelines running elsewhere. Similarly, teams only have access to their own storage accounts and resources, reducing the chance of accidental interference.
Streaming pipelines are tightly connected to Kafka, the messaging system that provides the streams of data. However, Kafka itself is a dynamic system: topics may change, partitions may shift, or clusters may fail over. If a Flink job does not react to these changes, it can become unstable or even crash.
To address this, the OpenAI Engineering Team built cluster-local watchdog services. These watchdogs monitor Kafka’s topology and automatically adjust Flink pipelines when changes occur. For example, if a Kafka topic gains new partitions, the watchdog ensures that the Flink job scales appropriately to read from them. If a cluster fails, the watchdog helps the job adapt without requiring a manual restart.
See the diagram below:
This automation is critical for keeping jobs running smoothly in a production environment where both Kafka and the underlying infrastructure may change at any time.
One of the hardest problems in stream processing is managing state. State refers to the memory that a job keeps as it processes data, such as counts, windows, or intermediate results. If a job fails and restarts without its state, it may produce incorrect results.
OpenAI uses RocksDB, an embedded key-value database, to store local operator state within Flink. RocksDB is designed to handle large amounts of data efficiently and is widely used in streaming systems for this purpose.
However, local state is not enough. To make jobs resilient across clusters, OpenAI designed per-namespace blob storage accounts with high availability. These storage accounts are used to checkpoint and back up the state in a durable manner. Since they are separate from any single Kubernetes cluster, a Flink pipeline can move to a new cluster and recover its state from storage.
See the diagram below:
Finally, to improve security and reliability, the team upgraded hadoop-azure to version 3.4.1. This upgrade enables Azure workload identity authentication, which simplifies access to blob storage and avoids the need for managing long-lived credentials inside the cluster. In practice, this means jobs can securely authenticate to storage services without extra complexity for developers.
At the heart of OpenAI’s streaming platform is PyFlink, the Python interface for Apache Flink. Since Python has become the de facto language for AI and machine learning, making streaming accessible to Python developers was a priority for the OpenAI Engineering Team. With PyFlink, developers can write pipelines using familiar tools rather than learning a new language like Java or Scala.
PyFlink offers two major APIs as follows:
The DataStream API is designed for detailed control of streaming operations, where developers can write step-by-step instructions for how data should be transformed.
The Table/SQL API is more declarative. Developers can write SQL-like queries to process streams in a way that feels closer to working with a database.
Both APIs integrate seamlessly with OpenAI’s existing Python monorepo, which means researchers can use their favorite Python libraries alongside streaming jobs without friction.
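As a concrete taste of the developer experience, here is a minimal, self-contained PyFlink DataStream job. It has nothing to do with OpenAI's actual pipelines; the data and transformation are toy stand-ins.

```python
from pyflink.datastream import StreamExecutionEnvironment


def main():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    # Toy input standing in for a real stream of log events.
    events = env.from_collection(["log-a", "log-b", "log-c"])

    # A trivial transformation in plain Python, as a researcher would write it.
    cleaned = events.map(lambda event: event.upper())
    cleaned.print()

    env.execute("pyflink_sketch")


if __name__ == "__main__":
    main()
```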
There are two ways PyFlink can run Python operators: process mode and thread mode.
In process mode, Python code runs in its own process, which provides isolation but introduces extra communication overhead between the Java Virtual Machine (JVM) and Python. This can sometimes cause timeouts.
In thread mode, Python runs within the same process as Java, reducing overhead but also reducing isolation. Each mode involves trade-offs between efficiency and safety.
Despite its usefulness, PyFlink still has limitations. Some performance-critical functions, such as certain operators and source/sink connectors, often need to be written in Java and wrapped for use in Python. Features like asynchronous I/O and streaming joins, which are common in advanced streaming use cases, are not yet supported in PyFlink’s DataStream API. These gaps remain an active area of development, both inside OpenAI and in the open-source Flink community.
By embracing PyFlink despite these limitations, OpenAI ensures that streaming feels natural for its Python-first research teams while still delivering the power of Flink underneath.
A critical part of OpenAI’s streaming platform is its integration with Kafka, the event streaming system that delivers continuous flows of data.
Kafka is used across OpenAI for logs, training data, and experiment results, so making Flink and Kafka work reliably together was essential. However, the OpenAI Engineering Team faced a unique complication: their Kafka deployment is multi-primary. Instead of a single primary Kafka cluster, they run several primaries for high availability.
This design improves resilience because if one cluster goes down, others remain available. Unfortunately, it also creates a problem for Flink. By default, the Flink Kafka connector assumes there is only one primary cluster. If one of OpenAI’s clusters becomes unavailable, Flink interprets this as a fatal error and fails the entire pipeline. For mission-critical workloads, this behavior is unacceptable.
To handle this, the Engineering Team designed custom connectors with two key ideas.
First, for reading, they built a union of Kafka streams, allowing a job to consume data from multiple primaries at the same time.
Second, for writing, they introduced the Prism Sink, which can write data back into Kafka. One important limitation here is that the Prism Sink does not yet provide end-to-end exactly-once guarantees, meaning that in rare cases, duplicate or missing events can occur.
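To illustrate the read-side idea, the sketch below unions two streams in PyFlink. The sources here are in-memory stand-ins; in a real job each stream would be built from one primary Kafka cluster's connector configuration (OpenAI's connectors are custom and not public), but the union pattern is the same.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Stand-ins for per-cluster Kafka sources; a real job would build one source
# per primary cluster from that cluster's own connector configuration.
stream_from_primary_a = env.from_collection(["a1", "a2"])
stream_from_primary_b = env.from_collection(["b1", "b2"])

# Union the per-primary streams so the pipeline keeps consuming from whichever
# clusters are currently reachable.
combined = stream_from_primary_a.union(stream_from_primary_b)
combined.print()

env.execute("multi_primary_union_sketch")
```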
They also improved connector resilience through open-source contributions.
With FLINK-37366, the Kafka connector gained the ability to retry topic metadata fetches instead of failing immediately. They also built a dynamic Kafka source connector that can adjust at runtime, further improving reliability.
In large-scale cloud environments, failures are not rare events. Entire clusters can go down because of outages at the cloud provider level, and these disruptions can affect critical services. The OpenAI Engineering Team knew that their streaming platform needed to keep running even when such failures happened. This requirement shaped the way they designed high availability (HA) and failover.
The responsibility for handling these situations lies with the control plane.
If a Kubernetes cluster hosting Flink becomes unavailable, the control plane steps in to trigger a job failover. Rather than leaving the pipeline offline, it automatically restarts the job on another healthy cluster. This ensures continuity of service without requiring engineers to manually intervene during an outage. See the diagram below:
A key enabler of this design is decoupled state and HA storage. The state of a Flink job (the memory of what it has processed so far) is not tied to any single cluster. Instead, it is stored in separate, highly available blob storage accounts. Because of this separation, a pipeline can recover its state and continue processing even if it has to move between clusters.
Additionally, the state and HA storage accounts themselves can fail over independently. This means the platform is resilient at multiple levels: both the compute clusters and the storage layers can withstand outages without permanently breaking a pipeline.
By combining automatic failover with decoupled storage, OpenAI ensures that its most important streaming pipelines remain reliable, even in the face of inevitable cloud-wide failures.
Building a stream processing platform at the scale required by OpenAI is not just about running Apache Flink.
It is about carefully addressing the unique needs of AI research: Python-first workflows, resilience to cloud-wide failures, and seamless integration with Kafka. By layering PyFlink, Kubernetes orchestration, custom watchdog services, and decoupled state management, the OpenAI Engineering Team created a system that is both powerful and reliable.
The key takeaway is that streaming is no longer optional for cutting-edge AI development. Fresh training data and fast experiment feedback loops directly translate to better models and faster innovation. A platform that cannot recover gracefully from outages or adapt to changing infrastructure would quickly become a bottleneck.
At the same time, the system is still evolving. OpenAI has already contributed improvements back to the open-source community, such as fixes in PyFlink and enhancements to the Kafka connector. The roadmap points toward an even smoother experience: a dynamic control plane that automates failover, self-service streaming SQL for easier adoption, and core PyFlink enhancements like async I/O and streaming joins.
In short, OpenAI’s work shows how streaming can be made reliable and developer-friendly, paving the way for more resilient AI systems in the future.
References:
2025-10-04 23:30:45
Download our comprehensive eBook on optimizing Kubernetes performance. This guide delves into crucial cluster state, resource, and control plane metrics, highlighting 15 of the most essential metrics your DevOps team should be tracking. Learn how to gain complete visibility into your containerized environments and optimize Kubernetes performance with Datadog.
This week’s system design refresher:
MCP vs API: what’s the difference?
Help us Make ByteByteGo Newsletter Better
TCP vs UDP
AI and Machine Learning (ML)
How Python Works
SPONSOR US
APIs have been the backbone of software-to-software communication for decades. Now, a new player — the Model Context Protocol (MCP) — is emerging as an AI-native protocol designed for agents, IDEs, and LLMs.
API (Application Programming Interface):
Purpose: Enables software-to-software communication.
Discovery: Requires documentation.
Standardization: Varies — REST, GraphQL, gRPC, etc.
MCP (Model Context Protocol):
Purpose: Enables AI-native communication between clients (agents, IDEs, LLMs) and servers.
Discovery: Self-describing (no external docs needed).
Standardization: One uniform protocol for resources, tools, and prompts.
Over to you: Do you think MCP will complement APIs or eventually replace them in AI-driven systems?
Every time data moves across the internet, it chooses between accuracy and speed. That’s TCP vs UDP.
TCP: Connection-oriented and reliable. It ensures ordered, duplicate-free delivery with flow and congestion control, making it ideal for web browsing, email, and file transfers.
UDP: Connectionless and lightweight. It sends packets without guarantees of delivery or order, but with minimal overhead. It is perfect for gaming, streaming, and real-time communication.
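The difference shows up directly at the socket level. The short Python sketch below is illustrative only; the hosts and ports are placeholders.

```python
import socket

# TCP: connection-oriented; the OS handles ordering, retransmission, and
# congestion control once the connection is established.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect(("example.com", 80))
tcp.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
print(tcp.recv(1024))
tcp.close()

# UDP: connectionless; each datagram is sent on its own, with no delivery or
# ordering guarantees and almost no setup cost.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"ping", ("127.0.0.1", 9999))
udp.close()
```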
What is AI? What is ML? Are they the same thing? We will clear up this common confusion in this post.
AI and ML are often treated as if they are the same thing. They are not. AI is the bigger field. It is about creating programs that can sense, reason, act, and adapt. Any system that shows intelligent behavior can fall under AI.
ML is a subset of AI. It focuses on algorithms that learn from data and improve with experience. This is where most of the progress in recent years has happened. Some common use cases of ML are recommendation engines, fraud detection, and image recognition. Most of what we interact with daily.
The biggest breakthrough in recent years came from ML, and when the media talks about the AI revolution, they are mostly talking about advances in ML, particularly deep learning.
Ever wondered what happens behind the scenes when you run a Python script? Let’s find out:
Python (CPython Runtime):
Python source code (.py) is compiled into bytecode automatically in memory.
Bytecode can also be cached in .pyc files, making re-runs faster by using the cached version.
The Import System loads modules and dependencies.
The Python Virtual Machine (PVM) interprets the bytecode one instruction at a time, which keeps Python flexible but comparatively slow. The short example below shows both steps.
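Here is a small, self-contained example of those steps using only the standard library; the module name and file path are made up for illustration.

```python
import dis
import pathlib
import py_compile


def greet(name):
    return f"Hello, {name}!"


# 1. Inspect the bytecode CPython compiled greet() into (held in memory).
dis.dis(greet)

# 2. Demonstrate the on-disk cache: write a tiny module and byte-compile it,
#    which is what CPython does automatically the first time it is imported.
pathlib.Path("example_module.py").write_text("VALUE = 42\n")
cached_path = py_compile.compile("example_module.py")
print("bytecode cached at:", cached_path)   # e.g. __pycache__/example_module.cpython-312.pyc
```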
Over to you: For performance-critical work, do you stick with Python or reach for another language?
2025-10-03 22:03:00
Our very first cohort of Becoming an AI Engineer starts in one day. This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.
Here’s what makes this cohort special:
• Learn by doing: Build real world AI applications, not just by watching videos.
• Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.
• Live feedback and mentorship: Get direct feedback from instructors and peers.
• Community driven: Learning alone is hard. Learning with a community is easy!
We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.
If you want to start learning AI from scratch, this is the perfect time to begin.
2025-10-02 23:30:47
Consider how a food delivery application works when someone places an order.
The app needs to know which restaurants are currently open, where they’re located, whether they’re accepting orders, and how busy they are. If a restaurant closes early or temporarily stops accepting orders due to high demand, the app needs to know immediately. This information changes constantly throughout the day, and having outdated information leads to failed orders and frustrated customers.
Modern software applications face remarkably similar challenges.
Instead of restaurants, we have services. Instead of checking if a restaurant is open, we need to know if a payment service is healthy and responding. Instead of finding the nearest restaurant location, we need to discover which server instance can handle our request with the lowest latency. Just as the food delivery app would fail if it sent orders to closed restaurants, our applications fail when they try to communicate with services at outdated addresses or send requests to unhealthy instances.
In today’s cloud environments, applications are commonly broken down into dozens or even hundreds of services. These services run across a dynamic infrastructure where servers appear and disappear based on traffic, containers restart in different locations, and entire service instances can move between machines for load balancing.
The fundamental question becomes: how do all these services find and communicate with each other in this constantly shifting landscape?
In this article, we look at service discovery, the critical system that answers this question. We’ll examine why traditional approaches break down at scale, understand the core concepts, and patterns of service discovery.