
Top Strategies to Improve Reliability in Distributed Systems

2025-08-21 23:31:07

Distributed systems fail all the time: a node crashes, a network link goes down, a dependency times out. It’s hard to predict when these failures will happen. 

The job of system design isn’t to eliminate these failures. It’s to build systems that absorb them, recover from them, and keep working through them. This is the reason why reliability becomes such an important system quality. 

In a distributed setup, reliability doesn’t come from individual components. A highly available database, a fast API server, or a global load balancer on its own doesn’t guarantee uptime. Reliability emerges when all of these components interact in a way that tolerates failure, isolates blast radius, and maintains service guarantees under pressure.

There’s no single universal solution. What may work for a video streaming platform might not be suitable for a financial transaction system. However, some building blocks keep showing up again and again, irrespective of the domain. Here are a few examples:

  • Fault tolerance enables systems to remain functional even when components fail or behave inconsistently.

  • Load balancing distributes traffic evenly to avoid overloading any single node or region.

  • Rate limiting guards against abuse and overload by controlling the flow of incoming requests.

  • Service discovery enables services to locate each other dynamically in environments where nodes are added and removed frequently.

  • Consistent hashing keeps distributed data placements stable and scalable under churn.

None of these solves reliability alone. But when combined thoughtfully, they form the foundation of resilient architecture. In this article, we will take a closer look at strategies that can help improve the reliability of a system.
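
To make one of these building blocks concrete, here is a minimal consistent-hashing sketch in Python. It is an illustrative implementation, not a library recommendation; the node names and virtual-node count are made up for the example. The point is that each key keeps mapping to the same node as nodes join or leave, which limits data movement under churn.

    import bisect
    import hashlib

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes=100):
            # Map many virtual points per node onto a hash ring to smooth the distribution.
            self.ring = sorted(
                (self._hash(f"{node}#{i}"), node)
                for node in nodes
                for i in range(vnodes)
            )
            self.keys = [h for h, _ in self.ring]

        @staticmethod
        def _hash(value: str) -> int:
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def node_for(self, key: str) -> str:
            # Walk clockwise to the first virtual node at or after the key's hash.
            idx = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
            return self.ring[idx][1]

    ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
    print(ring.node_for("user:42"))  # the same key maps to the same node, with minimal remapping on churn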

Reliability and Fault Tolerance

Read more

How Reddit Delivers Notifications to Tens of Millions of Users

2025-08-19 23:30:33

Building Low Latency Apps Virtual Masterclass (Sponsored)

Get battle-tested tips from Pekka Enberg, author of Latency and co-founder of Turso

Latency lurks in every layer of a distributed system. In this masterclass you’ll learn proven strategies to eliminate bottlenecks across the stack — from the application layer down to the database.

After this free 2-hour masterclass, you will know how to:

  • Spot, understand, and respond to latency, across your whole stack

  • Tackle often-overlooked causes of database latency

  • Evaluate when (and when not) to use caching for latency reduction

All attendees will get early access to Pekka’s new Latency book.

Designed for teams building performance-critical applications, this event offers practical methods for battling P99 latency. Attendees will walk away with a holistic approach, including techniques that intersect many areas of software engineering.

Register for Free


Disclaimer: The details in this post have been derived from the official documentation shared online by the Reddit Engineering Team. All credit for the technical details goes to the Reddit Engineering Team.  The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Push notifications carry a double-edged sword in the design of any product. Done well, these notifications reconnect users with content they care about. Done poorly, they turn into noise, leading users to mute them entirely or uninstall the app. Striking the right balance requires a precise and scalable system that understands what matters to each user and when it makes sense to interrupt them.

Reddit’s notification recommender system handles this problem at scale. 

It evaluates millions of new posts daily and decides which ones should be sent as personalized notifications to tens of millions of users. Behind each decision is a pipeline that combines causal modeling, real-time retrieval, deep learning, and product-driven reranking.

In this article, we look at how that pipeline works, walking through its key components (budgeting, retrieval, ranking, and reranking) and highlighting the trade-offs at each stage. 

Some key features of the system are as follows:

  • Operates as a close-to-real-time pipeline.

  • Runs on asynchronous workers and queues.

  • Shares core components with other ML and ranking systems at Reddit.

  • Aims for low latency and high freshness of recommendations.

The system has evolved significantly, but the core goal remains unchanged: deliver timely, relevant notifications that drive engagement without overwhelming the user.

The Overall Architecture

The notification pipeline processes millions of posts daily to decide which ones to deliver as push notifications. 

It’s structured as a series of focused stages, each responsible for narrowing and refining the candidate set.

  • Budgeting sets the daily limit for how many notifications to send to each user, balancing engagement with fatigue.

  • Retrieval pulls a shortlist of potentially interesting posts using fast, lightweight methods.

  • Ranking scores these candidates using a deep learning model trained on user interactions like clicks, upvotes, and comments.

  • Reranking adjusts the final order based on business goals—boosting certain content types or enforcing diversity.

The pipeline runs on a queue-based asynchronous infrastructure to ensure timely delivery, even at massive scale.

Budgeter

The first decision the system makes each day is how many notifications a user should receive. 

Notification fatigue isn’t just a UX nuisance; it can mean a permanent loss of reach. Once a user disables notifications, there’s rarely a path back. The system treats this as a high-cost failure. The goal is to maximize engagement without annoying the user.

Not every user gets the same treatment. The budgeter estimates how each additional notification might affect user behavior. Push too hard, and users disable notifications or churn. Hold back too much, and the system misses opportunities to re-engage them. The balance between the two is modeled using causal inference and adaptive scoring.

The system uses a causal modeling approach to weigh outcomes:

  • Positive outcomes include increased activity such as clicking the notification, browsing the app, and interacting with content.

  • Negative outcomes include signs of churn (long gaps in usage) or outright disabling of push notifications.

Often, basic correlation (for example, “this user got 5 notifications and stayed active”) doesn’t reveal the whole picture. Instead, the system uses past user behavior to estimate how different notification volumes affect outcomes such as staying active versus dropping off. This approach, known as causal modeling, helps avoid overfitting to noisy engagement data. It doesn’t just look at what happened, but tries to estimate what would have happened under different conditions.

To do this effectively, the team builds unbiased datasets by intentionally varying notification volumes across different user groups. These variations are used to estimate treatment effects and how different budgets affect long-term engagement patterns.

At the start of each day, a multi-model ensemble estimates several candidate budgets for a user. Each model simulates outcomes under different conditions: some more conservative, some more aggressive.

The system then selects the budget that optimizes a final engagement score. That score reflects both expected gains (clicks, sessions) and expected risks (disablement, drop-off). If the models indicate that an extra push notification (PN) would yield meaningful value, the budget is increased up to that point. If not, the system holds the line. The result is a dynamic, per-user push strategy that reflects actual behavioral data.
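
Reddit has not published the budgeter’s code, but conceptually its final selection step looks something like the sketch below. The candidate budgets, weights, and model outputs are hypothetical; the idea is simply to pick the budget whose expected gains best outweigh its expected risks.

    def pick_budget(candidates, gain_weight=1.0, risk_weight=2.0):
        """Pick the per-user daily notification budget with the best expected value.

        Each candidate carries an estimated engagement gain (clicks, sessions) and an
        estimated risk (disablement, churn) produced by one model in the ensemble.
        """
        def score(c):
            return gain_weight * c["expected_gain"] - risk_weight * c["expected_risk"]
        return max(candidates, key=score)["budget"]

    # Hypothetical ensemble output for one user: more notifications, more gain, more risk.
    candidates = [
        {"budget": 1, "expected_gain": 0.30, "expected_risk": 0.01},
        {"budget": 2, "expected_gain": 0.45, "expected_risk": 0.04},
        {"budget": 3, "expected_gain": 0.50, "expected_risk": 0.15},
    ]
    print(pick_budget(candidates))  # -> 2: the extra gain at budget 3 doesn't cover the added risk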

See the diagram below:

Retrieval

Before a notification can be ranked or sent, the system needs a shortlist of candidate posts worth considering. That’s the job of the retrieval stage. 

This stage scans Reddit’s firehose of daily content and narrows it down to a few hundred posts that might interest a specific user.

This step has to be fast and efficient. Ranking every new post with a heavy ML model would be ideal in theory, but in practice, it’s computationally infeasible. Reddit sees millions of posts per day, and latency budgets for notification generation are tight.

To stay within those constraints, the system relies on a mix of rule-based and model-based retrieval methods. These techniques are lightweight by design, offering high recall without doing deep computation. The goal is to cast a wide net and keep promising candidates without overloading the pipeline.

The simplest signal of user interest is subreddit subscriptions. If someone subscribes to “r/AskHistorians” and “r/MechanicalKeyboards”, there’s a good chance they want to hear about new posts from those communities.

Here’s how this method works:

  • List the user’s subscribed subreddits.

  • Apply subreddit-level filters, such as removing NSFW communities that don’t belong in notifications.

  • Pick the top X subreddits based on engagement or recency of interaction.

  • For each subreddit, select the top Y posts from the last few days using a heuristic score that blends upvotes, downvotes, and post age.

  • Filter out posts the user has already seen.

  • Use round-robin selection to pull from multiple subreddits, ensuring that one active community doesn’t dominate the list.

This rule-based method is fast and transparent. But it’s also limited. Subscriptions don’t capture every user’s evolving interests, and not all posts in a subreddit are equally relevant.
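
A minimal sketch of this rule-based retrieval is shown below. The scoring formula, field names, and limits are assumptions for illustration; Reddit’s actual heuristics are not public.

    from itertools import zip_longest
    import math
    import time

    def heuristic_score(post):
        # Blend votes with post age in hours; the exact formula Reddit uses isn't public.
        age_hours = (time.time() - post["created_at"]) / 3600
        return (post["upvotes"] - post["downvotes"]) / math.pow(age_hours + 2, 1.5)

    def retrieve_candidates(subscriptions, posts_by_subreddit, seen_ids, top_subs=3, per_sub=2):
        # Assume `subscriptions` is already sorted by engagement/recency; keep only the top few.
        ranked_lists = []
        for subreddit in subscriptions[:top_subs]:
            posts = [p for p in posts_by_subreddit.get(subreddit, []) if p["id"] not in seen_ids]
            posts.sort(key=heuristic_score, reverse=True)
            ranked_lists.append(posts[:per_sub])
        # Round-robin across subreddits so one very active community doesn't dominate the shortlist.
        shortlist = []
        for batch in zip_longest(*ranked_lists):
            shortlist.extend(p for p in batch if p is not None)
        return shortlist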

Model-Based Retrieval: Two-Tower Architecture

To go beyond simple heuristics, the system uses two-tower models: a standard technique in large-scale recommendation systems. 

One tower learns to represent users while the other learns to represent posts. Both towers output embeddings: fixed-length vectors that capture user and content characteristics.

The model is trained using historical PN click data. If a user clicked on a post in the past, the model learns to place that user and post closer together in the embedding space.

At runtime, the process looks like this:

  • Post embeddings are precomputed and indexed. This saves time during notification generation since posts are expensive to encode.

  • User embedding is computed on the fly using their recent behavior and metadata.

  • The system performs a nearest-neighbor search between the user vector and the precomputed post vectors to find the closest matches. These are the posts the model believes the user is most likely to click.

After this step, the candidate list undergoes a final pass of filtering to remove stale or already-viewed content. The result is a highly personalized set of posts that’s both fresh and computationally cheap to obtain.
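
At serving time, the retrieval step therefore boils down to a nearest-neighbor search between one user vector and many precomputed post vectors. The sketch below uses exact cosine similarity over random embeddings for clarity; a production system would use an approximate nearest-neighbor index instead.

    import numpy as np

    def top_k_posts(user_vec, post_matrix, post_ids, k=3):
        """Return the ids of the k posts whose embeddings are closest to the user embedding."""
        user = user_vec / np.linalg.norm(user_vec)
        posts = post_matrix / np.linalg.norm(post_matrix, axis=1, keepdims=True)
        scores = posts @ user                      # cosine similarity against every candidate post
        best = np.argsort(scores)[::-1][:k]
        return [post_ids[i] for i in best]

    rng = np.random.default_rng(0)
    post_matrix = rng.normal(size=(1000, 64))      # precomputed and indexed ahead of time
    user_vec = rng.normal(size=64)                 # computed on the fly from recent behavior
    print(top_k_posts(user_vec, post_matrix, post_ids=list(range(1000))))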

Ranking

Once the retrieval stage returns a set of potentially interesting posts, the ranking model steps in to decide which ones are worth sending as a push notification. 

This is the most compute-heavy part of the pipeline and predicts how likely a user is to engage with each candidate.

The ranking system uses a deep neural network (DNN) that takes in hundreds of signals and outputs a predicted likelihood of user engagement. But engagement means different things: clicking a notification, upvoting a post, or leaving a comment. 

See the diagram below:

To handle this, the model uses a structure called multi-task learning (MTL). In MTL, the model doesn’t just optimize for a single outcome. It learns to predict multiple behaviors at once:

  • P(click): probability that the user clicks the notification

  • P(upvote): probability the user upvotes the post

  • P(comment): probability that the user leaves a comment

These predictions feed into a final score using a weighted formula:

Final Score = Wclick * P(click) + Wupvote * P(upvote) + ... + Wdownvote * P(downvote)

Source: Reddit Engineering Blog

The weights (Wclick, Wupvote, etc.) let the team tune the model’s priorities. For example, some users might care more about discussion-heavy posts (comments), while others are more likely to interact with high-quality content (upvotes). Adjusting these weights allows the system to steer the output based on what matters most for engagement.
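
The weighted combination itself is simple to express in code. The weights and predicted probabilities below are illustrative values, not Reddit’s production settings.

    # Hypothetical per-candidate predictions from the multi-task model.
    predictions = {"click": 0.12, "upvote": 0.05, "comment": 0.02, "downvote": 0.01}

    # Tunable weights encode which kinds of engagement the product wants to favor (or penalize).
    weights = {"click": 1.0, "upvote": 0.8, "comment": 1.5, "downvote": -2.0}

    final_score = sum(weights[task] * p for task, p in predictions.items())
    print(round(final_score, 4))  # 0.12 + 0.04 + 0.03 - 0.02 = 0.17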

The neural network is structured in two parts:

  • Shared layers at the beginning that process all the input features (user profile, post metadata, time of day, etc.) into a common representation.

  • Task-specific heads at the end that specialize in predicting each target (click, comment, upvote, etc.).

This architecture allows the model to generalize well across behaviors while still optimizing for the nuances of each interaction type.

One of the subtler challenges in large-scale ML systems is train-serve skew. This is the mismatch between how data looks during training and what the model sees in production. 

Reddit handles this by using prediction logs: the system records all the features passed into the model at the moment it served a notification, along with the actual outcome (clicked, ignored, etc.).

This approach brings several benefits:

  • Accurate feedback for training, grounded in real serving conditions.

  • Faster iteration on new features. No need to wait weeks to collect data.

  • Improved observability into how the model behaves in production.

Reranking

Even after the ranking model has scored each post (candidate), the job isn’t quite done. The top-ranked result might be statistically relevant, but that doesn’t always mean it’s the right choice to send. 

That’s where reranking comes in. It is a final layer that adjusts the ranked list based on product strategy, UX goals, and business logic.

Machine learning models optimize for historical patterns. However, product goals often evolve faster than models can retrain. For instance, a model might consistently surface high-engagement posts from one very active subreddit because that’s what the user clicked on last week. But sending similar notifications every day creates fatigue or makes the system feel one-dimensional.

This is where reranking helps. Think of reranking as the system’s way of applying editorial judgment on top of raw model output. It doesn’t override the model entirely, but it nudges the final result in ways that reflect what Reddit wants the user experience to look and feel like.

Some strategies that are used in this phase are as follows:

  • Boost subscribed content over generic posts. Even if a non-subscribed post scores slightly higher, the system often favors personalized content rooted in the user’s chosen communities.

  • Enforce diversity. Prevent multiple posts from the same subreddit or similar topics from crowding out the experience. This avoids repetition and keeps recommendations feeling fresh.

  • Personalize content-type emphasis. If a user has a history of engaging with discussion threads, the system can boost posts with active comment sections—even if those posts weren’t top-ranked originally.
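
In code, these strategies reduce to score boosts and diversity constraints applied on top of the model’s output. The boost value and per-subreddit cap below are made-up examples.

    def rerank(candidates, subscriptions, boost=0.1, max_per_subreddit=1):
        """Apply product rules on top of model scores: boost subscribed content, enforce diversity."""
        adjusted = []
        for post in candidates:
            score = post["score"] + (boost if post["subreddit"] in subscriptions else 0.0)
            adjusted.append((score, post))
        adjusted.sort(key=lambda pair: pair[0], reverse=True)

        picked, per_sub = [], {}
        for score, post in adjusted:
            count = per_sub.get(post["subreddit"], 0)
            if count < max_per_subreddit:          # keep one community from crowding the list
                picked.append(post)
                per_sub[post["subreddit"]] = count + 1
        return picked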

The team is also experimenting with dynamic weight adjustment, where boosts and priorities adapt in real time. These adjustments draw from UX research and behavior modeling. For example:

  • If a user rarely upvotes but frequently comments, it might be a good idea to prioritize conversation starters over like-heavy posts.

  • If a user is exploring new topics outside their usual subscriptions, slightly boost non-subscribed but semantically relevant posts.

This dynamic reranking approach allows the system to respond to intent, not just history, and improve relevance.

Conclusion

Building a high-quality notifications system isn’t just about pushing notifications. It also requires understanding when to stay silent. 

Reddit’s notification pipeline handles this balance through a carefully staged system: budgeters control volume, retrieval systems narrow the field, ranking models score intent, and reranking layers apply product nuance. Each stage works in real time, under heavy load, across millions of users and posts.

But the system is far from static. 

Future work focuses on making recommendations more responsive to changing user habits, improving experiences for low-signal users, and integrating signals across Reddit surfaces like Search and Feed. Dynamic reranking and real-time learning will play a bigger role in steering relevance without relying on hardcoded rules.

References:


ByteByteGo Technical Interview Prep Kit

Launching the All-in-one interview prep. We’re making all the books available on the ByteByteGo website.

What's included:

  • System Design Interview

  • Coding Interview Patterns

  • Object-Oriented Design Interview

  • How to Write a Good Resume

  • Behavioral Interview (coming soon)

  • Machine Learning System Design Interview

  • Generative AI System Design Interview

  • Mobile System Design Interview

  • And more to come

Launch sale: 50% off


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].







How LLMs See Images, Audio, and More

2025-08-18 23:30:33

Data Streaming + AI: Shaping the Future Together (Sponsored)

Join us at the Data Streaming + AI Summit on Sept 30, 2025 at the Grand Hyatt at SFO — the most open data streaming conference bringing together leaders from OpenAI, Netflix, Uber, Google, Salesforce, LinkedIn, Blueshift, Confluent, Ververica, StreamNative, and more. Hear keynote insights from Reynold Xin (Co-founder & Chief Architect, Databricks), learn how OpenAI powers AI with real-time streaming, and explore 30+ technical sessions across four tracks. On September 29, level up your skills with Data Streaming Fundamentals Training (for beginners) or dive deeper with the Hands-On Streaming Lakehouse Workshop (advanced, in partnership with AWS).

🎟️ Use code BBG50 for 50% off conference tickets.

Register for Summit + Workshop


Language models are no longer limited to text. Today's AI systems can understand images, process audio, and even handle video, but they still need to convert everything into tokens first. Just like how "Hello world!" becomes discrete tokens for text processing, a photograph gets chopped into image patches, and a song becomes a sequence of audio codes.

This is Part 2 of the ByteByteGo LLM Tokenization Series. You can view Part 1 here.

Each type of media requires completely different tokenization strategies, each with unique trade-offs between quality, efficiency, and semantic understanding.

As models extend beyond pure text, multimodal tokenization lets them ingest images, audio, and video alongside natural language. While the core idea of mapping raw input into a discrete sequence remains the same, each modality uses a different strategy.

Image Tokenization

Image tokenization transforms visual data into discrete tokens that can be processed by LLMs. Unlike text, which naturally exists as discrete symbols, images are continuous pixel arrays that require conversion into token sequences.

Main Approaches to Image Tokenization

1. Patch Embeddings (Vision Transformer Style)

Patch embeddings divide images into fixed-size, non-overlapping “patches”. Each patch is treated as a single token, similar to how words function in text sequences. This allows the image to be processed into understandable units, just like text tokens.

Technical Process:

  • An input image (e.g., 224×224 pixels) is divided into patches of size P×P (commonly 16×16 or 14×14)

  • Each patch contains P²×C values (where C is the number of channels, typically 3 for RGB)

  • These pixel values are flattened and linearly projected through a learned embedding matrix to create a D-dimensional vector

  • Positional encodings are added to preserve spatial relationships between patches

  • The result is a sequence of N tokens, where N = (H×W)/(P²)

What to know:

  • Advantages: Simple, preserves spatial structure, works well with standard transformer architectures

  • Limitations: Fixed patch size may not align with natural object boundaries, can be inefficient for high-resolution images

  • Typical Applications: Image classification, object detection, general computer vision tasks

Example: A 224×224 image with 16×16 patches yields 196 tokens (14×14 grid).
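
The patch split and token count can be reproduced in a few lines of NumPy. The patch size and embedding dimension below follow the example above; the projection matrix is random here, whereas a real model learns it.

    import numpy as np

    H = W = 224; P = 16; C = 3; D = 512          # image size, patch size, channels, embedding dim
    image = np.random.rand(H, W, C)

    # Split into non-overlapping P x P patches and flatten each one into a P*P*C vector.
    patches = image.reshape(H // P, P, W // P, P, C).swapaxes(1, 2).reshape(-1, P * P * C)
    print(patches.shape)                          # (196, 768): N = (224*224)/(16*16) = 196 tokens

    # A learned linear projection (random here) maps each flattened patch to a D-dimensional token.
    projection = np.random.rand(P * P * C, D)
    tokens = patches @ projection                 # (196, 512) token embeddings, before positional encodings
    print(tokens.shape)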

2. Discrete VAE and Vector Quantization

Vector quantization is like creating a visual dictionary. It converts continuous values into “buckets” so that the model can pattern-match image parts against a fixed visual vocabulary. 

Technical Process:

  • An encoder network compresses the input image into a smaller, more manageable representation

  • The system then matches each part of this compressed image to the closest pattern in its learned "visual vocabulary"

  • This vocabulary typically contains 512-8,192 different visual patterns (like textures, shapes, or color combinations)

  • The result is a grid of pattern numbers - for example, a 32×32 grid of codes that represents a 256×256 pixel image

  • A decoder network can recreate the original image by looking up each code and reconstructing the corresponding visual pattern
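
The quantization step itself, matching each latent vector to its nearest codebook entry, can be sketched as follows. The codebook size and latent grid are scaled down for the example; real systems use the larger sizes mentioned above.

    import numpy as np

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(512, 32))     # 512 learned "visual patterns" (real codebooks: 512-8,192)
    latents = rng.normal(size=(16 * 16, 32))  # encoder output for one image: a 16x16 grid of vectors

    # Quantize: for each latent vector, pick the index of the closest codebook entry (L2 distance).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1).reshape(16, 16)   # the image is now a 16x16 grid of integer codes
    print(codes.shape, int(codes.min()), int(codes.max()))

    # A decoder would look each code back up to reconstruct an approximation of the original image.
    reconstructed_latents = codebook[codes]        # shape (16, 16, 32)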

Variants:

  • VQ-VAE: Basic vector quantization with straight-through gradient estimation

  • VQ-GAN: Adds adversarial training for better reconstruction quality

  • RQ-VAE: Uses multiple quantization layers for hierarchical representations

What to know:

  • Advantages: True discrete tokens, excellent compression ratios

  • Limitations: Requires pre-training the codebook, potential information loss during quantization

  • Typical Applications: Image generation, image compression, multimodal language models

3. CLIP-Style Contrastive Embeddings

Contrastive learning approaches like CLIP create embeddings in a shared vision-language space, which can then be discretized into pseudo-tokens for downstream tasks.

Technical Process:

  • Images and text are embedded into a shared high-dimensional space through contrastive learning

  • The continuous embeddings capture semantic relationships between visual and textual concepts

  • Discretization techniques (k-means clustering, product quantization, or learned codebooks) convert continuous embeddings into discrete tokens

  • These tokens maintain semantic meaning while being compatible with discrete sequence models

What to know:

  • Advantages: Semantically meaningful tokens, natural vision-language alignment, transferable across tasks

  • Limitations: Requires large-scale contrastive pre-training, discretization may lose fine-grained details

  • Typical Applications: Multimodal retrieval, vision-language understanding, cross-modal generation
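
As a rough sketch of the discretization step, the snippet below clusters continuous embeddings with k-means and uses the cluster ids as pseudo-tokens. The embeddings are random stand-ins for actual CLIP outputs, and the cluster count is arbitrary.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(2000, 64))    # stand-ins for CLIP-style image embeddings

    # Learn a codebook of centroids, then map every embedding to the id of its nearest centroid.
    kmeans = KMeans(n_clusters=256, n_init=1, random_state=0).fit(embeddings)
    pseudo_tokens = kmeans.predict(embeddings)  # one discrete token id per embedding

    print(pseudo_tokens[:8])                    # integer ids that a discrete sequence model can consume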

Trade-offs to Know

Each tokenization approach has important design decisions to be aware of that impact performance, efficiency, and applicability:

Computational Efficiency:

  • Patch size directly affects sequence length and computational cost

  • Smaller patches (8×8) preserve more detail but create longer sequences

  • Larger patches (32×32) are more efficient but may lose fine-grained information

  • Token count impacts transformer attention complexity (quadratic scaling)

Information Preservation vs Compression:

  • Higher compression ratios reduce token count but may lose visual details

  • VQ-based methods offer excellent compression but require careful codebook design

  • Patch embeddings preserve all information but with minimal compression

  • The choice depends on whether perfect reconstruction is necessary for your task

Audio Tokenization

Audio tokenization converts continuous audio waveforms into discrete tokens that can be processed by sequence models. Unlike text or images, audio has unique challenges due to its temporal nature, high sampling rates, and the need to preserve both semantic content and acoustic properties like prosody, timbre, and speaker characteristics.

Main Approaches to Audio Tokenization

1. Codec Tokens (Neural Audio Codecs)

Neural audio codecs like EnCodec and SoundStream compress audio into token sequences while preserving quality. They map audio features to a finite set of learned codes using something called “vector quantization.”

Technical Process:

  • Audio is segmented into overlapping windows (typically 10-25 ms)

  • An encoder network converts each window into a continuous feature representation

  • Vector quantization maps features to the nearest codebook entry (typically 1024-4096 codes)

  • Multiple quantization layers create hierarchical representations

  • The result is a sequence of discrete tokens at ~50-75 Hz (much lower than the original audio sampling rate)
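
A quick back-of-the-envelope calculation shows why codec tokens are so much cheaper than raw samples. The sample rate, number of quantizer levels, and codebook size below are illustrative values in line with the ranges above, not the settings of any specific codec.

    sample_rate = 24_000          # raw audio samples per second (24 kHz input)
    token_rate = 75               # codec tokens per second, per quantizer level
    quantizer_levels = 8          # residual VQ layers stacked per frame
    bits_per_code = 10            # a 1024-entry codebook needs 10 bits per token

    raw_bits_per_sec = sample_rate * 16                                   # 16-bit PCM: 384,000 bits/s
    codec_bits_per_sec = token_rate * quantizer_levels * bits_per_code    # 6,000 bits/s

    print(raw_bits_per_sec / codec_bits_per_sec)                          # -> 64x smaller than raw PCM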

Models:

  • EnCodec: Facebook's codec with residual vector quantization, achieving high compression ratios

  • SoundStream: Google's approach with efficient temporal modeling

What to know:

  • Advantages: Preserves acoustic properties, enables high-quality reconstruction, suitable for music and speech

  • Limitations: Requires pre-trained codec models, may introduce artifacts at very low bitrates

  • Applications: Audio generation, compression, music synthesis, speech synthesis

2. Phoneme/Character Tokens (ASR-based)

Automatic Speech Recognition converts spoken audio into text representations, creating tokens that capture semantic content but discard acoustic properties.

Technical Process:

  • Audio is processed through an ASR system (e.g., Whisper, Wav2Vec2)

  • Speech is converted to phonemes (sound units) or characters/words

  • Tokens represent linguistic content rather than acoustic features

  • Can include timing information and confidence scores

  • Often combined with speaker identification or language detection

Token Types:

  • Phonemes: Fine-grained linguistic units (typically 40-50 phonemes per language)

  • Characters: Direct transcription to text characters

  • Words/Subwords: Higher-level semantic units using BPE or similar tokenization

What to know:

  • Advantages: Compact representation, language-model compatible, captures semantic meaning

  • Limitations: Complete loss of prosody, speaker identity, emotional content, and non-speech audio

  • Applications: Speech-to-text, voice assistants, content indexing, cross-modal retrieval

3. Multi-Scale Token Stacks (Hierarchical Representations)

Hierarchical approaches use multiple token sequences at different temporal resolutions to capture both global structure and fine-grained details, similar to multi-scale image or video representations.

Technical Process:

  • Coarse tokens: Capture global audio structure at low temporal resolution (e.g., 12.5 Hz)

  • Fine tokens: Represent detailed acoustic information at higher resolution (e.g., 50 Hz)

  • Multiple quantization layers create a pyramid of representations

  • Tokens can be processed jointly or in stages (coarse-to-fine generation)

  • Cross-attention mechanisms often link different scales

Implementation Examples:

  • Residual Vector Quantization: Multiple codebooks applied sequentially to capture different levels of detail

  • Temporal Hierarchies: Different encoding rates for different time scales

  • Frequency Hierarchies: Separate tokenization for different frequency bands

What to know:

  • Advantages: Balances efficiency with quality, enables progressive generation, captures multi-scale patterns

  • Limitations: Increased model complexity, requires careful design of hierarchy levels

  • Applications: High-quality audio generation, music modeling, efficient compression

Tradeoffs to Remember

Temporal Resolution vs Efficiency:

  • Higher token rates preserve more temporal detail but increase sequence length

  • Codec tokens typically achieve 50-100x compression over raw audio

  • ASR tokens provide extreme compression but lose all acoustic information

Semantic vs Acoustic Information:

  • Codec tokens preserve acoustic properties (pitch, timbre, speaker identity)

  • ASR tokens capture semantic content but discard rhythm and intonation information

  • Multi-scale approaches attempt to balance both types of information

Quality vs Compression:

  • More codebook entries improve reconstruction quality but increase model size

  • Hierarchical approaches can achieve better quality-compression trade-offs

Domain Usage:

  • Speech-optimized codecs may not work well for music and vice versa

  • Universal codecs like EnCodec aim to handle diverse audio types

  • ASR-based approaches are limited to speech and may not handle accents or languages well

The best tokenization approach depends on your specific use case: codec tokens for high-quality audio generation, ASR tokens for semantic understanding, and multi-scale approaches when you need both efficiency and quality.

Video Tokenization

The most common way videos are tokenized today is to sample frames and feed the video in as a sequence of images, with the audio attached. For example, Gemini currently treats video as sequential images interleaved with text and audio.

The Future of Tokenization

Researchers are exploring ways to improve tokenization in LLMs:

  • Tokenization-Free Models: Some cutting-edge research proposes eliminating tokenization entirely, allowing models to operate directly on raw bytes or Unicode code points. This approach avoids the need for a predefined vocabulary, letting models natively handle any language, symbol, or character without relying on heuristics or subword units. However, training such models is more computationally intensive.

  • Adaptive Tokenization: Instead of using a fixed token vocabulary, adaptive tokenization allows the model to determine the optimal token split dynamically based on the input. This could mean breaking down complex or rare words into smaller units while preserving common phrases or numbers as whole tokens. The goal is to strike a balance between efficiency (fewer tokens) and semantic clarity (keeping meaningful units intact).

Conclusion

Multimodal tokenization extends the concept of text tokens to images, audio, and video. Images get converted through patch embeddings (splitting into grid squares), vector quantization (learning visual codebooks), or contrastive embeddings (CLIP-style). Audio uses neural codecs (preserving sound quality), ASR transcription (converting to text), or hierarchical approaches (multiple resolution levels).

Each method has tradeoffs, and these tokenization choices directly impact what AI systems can understand and generate.



EP176: How Does SSO Work?

2025-08-16 23:30:31

Guide to Monitoring Modern Infrastructure (Sponsored)

Build resilient systems, reduce downtime, and gain full-stack visibility at scale. This in-depth eBook shows you how to evolve your monitoring strategy for today’s dynamic, cloud-native environments with:

  • Key principles of effective observability in modern architectures

  • Techniques for tracking performance across ephemeral infrastructure

  • Real-world examples of metrics, logs, and traces working together

Download the ebook


This week’s system design refresher:

  • How Does SSO Work?

  • Best Practices in API Design

  • Key Terms in Domain-Driven Design

  • Top AI Agent Frameworks You Should Know

  • How OpenAI’s GPT-OSS 120B and 20B Models Work

  • ByteByteGo Technical Interview Prep Kit

  • SPONSOR US


How Does SSO Work?

Single Sign-On (SSO) is an authentication scheme. It allows a user to log in to different systems using a single ID.

Let’s walk through a typical SSO login flow:

Step 1: A user accesses a protected resource on an application like Gmail, which is a Service Provider (SP).

Step 2: The Gmail server detects that the user is not logged in and redirects the browser to the company’s Identity Provider (IdP) with an authentication request.

Step 3: The browser sends the user to the IdP.

Step 4: The IdP shows the login page where the user enters their login credentials.

Step 5: The IdP creates a secure token and returns it to the browser. The IdP also creates a session for future access. The browser forwards the token to Gmail.

Step 6: Gmail validates the token to ensure it comes from the IdP.

Step 7: Gmail returns the protected resource to the browser based on what the user is allowed to access.

This completes the basic SSO login flow. Let’s see what happens when the user navigates to another SSO-integrated application, like Slack.

Step 8-9: The user accesses Slack, and the Slack server detects that the user is not logged in. It redirects the browser to the IdP with a new authentication request.

Step 10: The browser sends the user back to the IdP.

Step 11-13: Since the user has already logged in with the IdP, it skips the login process and instead creates a new token for Slack. The new token is sent to the browser, which forwards it to Slack.

Step 14-15: Slack validates the token and grants the user access accordingly.
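
Real deployments implement this flow with SAML assertions or OIDC ID tokens. The sketch below only illustrates the core of steps 5 and 6, issuing a token and validating it at the service provider, using a shared HMAC secret as a stand-in; production IdPs sign tokens with a private key and SPs verify them with the corresponding public key.

    import base64, hashlib, hmac, json, time

    IDP_SECRET = b"shared-secret-for-demo-only"   # stand-in; real IdPs sign with a private key

    def idp_issue_token(user_id: str) -> str:
        # Step 5: the IdP creates a token asserting who the user is and when it expires.
        payload = json.dumps({"sub": user_id, "exp": time.time() + 300}).encode()
        sig = hmac.new(IDP_SECRET, payload, hashlib.sha256).hexdigest()
        return base64.urlsafe_b64encode(payload).decode() + "." + sig

    def sp_validate_token(token: str) -> dict:
        # Step 6: the service provider checks the signature and expiry before granting access.
        payload_b64, sig = token.rsplit(".", 1)
        payload = base64.urlsafe_b64decode(payload_b64)
        expected = hmac.new(IDP_SECRET, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            raise ValueError("token was not issued by the IdP")
        claims = json.loads(payload)
        if claims["exp"] < time.time():
            raise ValueError("token expired")
        return claims

    token = idp_issue_token("alice")
    print(sp_validate_token(token))   # {'sub': 'alice', 'exp': ...}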

Over to you: Would you like to see an example flow for another application?


Best Practices in API Design

APIs are the backbone of communication over the Internet. Well-designed APIs behave consistently, are predictable, and grow without friction. Some best practices to keep in mind are as follows:

  1. Use Clear Naming: When building an API, choose straightforward and logical names. Be consistent and stick with intuitive URLs that denote collections.

  2. Idempotency: Design endpoints to be idempotent, so that repeating the same request produces the same result and retries are safe. This matters especially for POST operations, which are not idempotent by default.

  3. Pagination: APIs should support pagination to prevent performance bottlenecks and payload bloat. Some common pagination strategies are offset-based and cursor-based.

  4. Sorting and Filtering: Query strings are an effective way to allow sorting and filtering of API responses. This makes it easy for developers to see what filters and sort orders are applied.

  5. Cross Resource References: Use clear linking between connected resources. Avoid excessively long query strings that make the API harder to understand.

  6. Rate Limiting: Rate limiting is used to control the number of requests a user can make to an API within a certain timeframe. This is crucial for maintaining the reliability and availability of the API.

  7. Versioning: When modifying API endpoints, proper versioning to support backward compatibility is important.

  8. Security: API security is mandatory for well-designed APIs. Use proper authentication and authorization with APIs using API Keys, JWTs, OAuth2, and other mechanisms.
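
To illustrate point 3, here is a minimal cursor-based pagination sketch. The opaque cursor encoding, field names, and page size are assumptions for the example, not a prescribed API shape.

    import base64, json

    ITEMS = [{"id": i, "name": f"item-{i}"} for i in range(1, 101)]   # stand-in database, sorted by id

    def list_items(cursor=None, limit=10):
        # Decode the opaque cursor into "last id seen"; start from the beginning if absent.
        last_id = json.loads(base64.urlsafe_b64decode(cursor))["last_id"] if cursor else 0
        page = [item for item in ITEMS if item["id"] > last_id][:limit]
        next_cursor = None
        if page and page[-1]["id"] < ITEMS[-1]["id"]:
            next_cursor = base64.urlsafe_b64encode(
                json.dumps({"last_id": page[-1]["id"]}).encode()
            ).decode()
        return {"data": page, "next_cursor": next_cursor}

    first = list_items(limit=10)
    second = list_items(cursor=first["next_cursor"], limit=10)
    print(second["data"][0]["id"])   # 11: the cursor resumes exactly where the last page ended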

Over to you: did we miss anything important?


Out Ship, Out Deliver, Out Perform. (Sponsored)

DevStats helps engineering leaders unpack metrics, experience flow, and ship faster so every release drives real business impact.

✅ Spot bottlenecks before they stall delivery

✅ Tie dev work to business goals

✅ Ship more, miss less, prove your impact

It’s time to ship more and make your impact impossible to ignore.

👉 Try DevStats free for 14 days


Key Terms in Domain-Driven Design

Have you heard of Domain-Driven Design (DDD), a major software design approach?

DDD was introduced in Eric Evans’ classic book “Domain-Driven Design: Tackling Complexity in the Heart of Software”. It explains a methodology for modeling a complex business. The book covers a lot of ground, so I'll summarize the basics here.

The composition of domain objects:

  • Entity: a domain object that has an ID and a life cycle.

  • Value Object: a domain object without an ID. It is used to describe the properties of an Entity.

  • Aggregate: a collection of Entities bound together by an Aggregate Root (which is itself an Entity). It is the unit of storage.

The life cycle of domain objects:

  • Repository: stores and loads Aggregates.

  • Factory: handles the creation of Aggregates.

Behavior of domain objects:

  • Domain Service: orchestrates multiple Aggregates.

  • Domain Event: a description of what has happened to an Aggregate. It is published so that other components can consume it and reconstruct state from it.
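
To make the Entity and Value Object distinction concrete, here is a minimal Python sketch (the class names are illustrative): a Value Object compares by its field values, while an Entity compares by its ID.

    from dataclasses import dataclass

    @dataclass(frozen=True)          # Value Object: no identity, compared field by field
    class Money:
        amount: int
        currency: str

    class Order:                     # Entity: identity and life cycle, compared by ID only
        def __init__(self, order_id: str, total: Money):
            self.order_id = order_id
            self.total = total

        def __eq__(self, other):
            return isinstance(other, Order) and self.order_id == other.order_id

        def __hash__(self):
            return hash(self.order_id)

    print(Money(10, "USD") == Money(10, "USD"))                            # True: same values
    print(Order("A1", Money(10, "USD")) == Order("A1", Money(99, "USD")))  # True: same identity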

Congratulations on getting this far. Now you know the basics of DDD. If you want to learn more, I highly recommend the book. It might help to simplify the complexity of software modeling.

Over to you: do you know how to check the equality of two Value Objects? How about two Entities?


Top AI Agent Frameworks You Should Know

Building smart, independent AI systems is easier with agent frameworks that combine large language models (LLMs) with tools like APIs, web access, or code execution.

  • LangChain
    LangChain makes it simple to connect LLMs with external tools like APIs and vector databases. It allows developers to create chains for sequential task execution and document loaders for context-aware responses.

  • AutoGen
    AutoGen allows you to develop AI agents that can chat with each other or involve humans in the loop. It is like a collaborative workspace where agents can run code, pull in data from tools, or get human feedback to complete a task.

  • CrewAI
    As the name suggests, CrewAI is all about teamwork. It orchestrates teams of AI agents with roles like writers and editors, processing tasks in a structured workflow. It utilizes LLMs and tools (APIs, Internet, code, etc) to efficiently manage complex task execution and data flow.

  • LlamaIndex
    This framework indexes and queries data from documents, APIs, and vector stores to enhance agent responses. It parses and loads data, enabling LLMs to provide context-aware answers, making it ideal for enterprise document search systems and intelligent assistants that access private knowledge bases.

  • Semantic Kernel
    Semantic Kernel connects AI services (OpenAI, Claude, Hugging Face models, etc.) with a plugin-based architecture that supports skills, templates, and API integrations for flexible workflows. It supports text search and custom workflows for applications.

Over to you: Which AI agent framework have you explored or plan to use?


How OpenAI’s GPT-OSS 120B and 20B Models Work

OpenAI has recently released two LLMs, GPT-OSS-120B (120 billion parameters) and GPT-OSS-20B (20 billion parameters). These are fully open-source models and are provided under an Apache 2.0 license.

These models aim to deliver strong real-world performance at low cost. Here’s how they work:

  1. The user provides some input, such as a question or a task. For example, “Explain quantum mechanics in a simple manner”.

  2. The raw text is converted into numerical tokens using Byte-Pair Encoding (BPE). BPE splits the text into frequently occurring subword units. Since it operates at the byte level, it can handle any input, including text, code, emojis, and more.

  3. Each token is mapped to a vector (a list of numbers) using a learned embedding table. This vectorized form is what the model understands and processes.

  4. Transformer layers are where the real computation happens. The 20B Model has 24 Transformer layers, and the 120B Model has 36 Transformer layers. Each layer includes a self-attention module, router, and experts (MoE).

  5. The self-attention module lets the model understand relationships between words across the entire input.

  6. The LLM uses a Mixture-of-Experts (MoE) architecture. Instead of activating all model weights as in traditional dense models, a router chooses the best experts for each token: 4 out of 128 experts in the 120B model and 4 out of 32 experts in the 20B model. Each expert is a small feedforward network trained to specialize in certain types of inputs. Only a few experts are activated per token, thereby saving compute while improving quality.

  7. After passing through all layers, the model projects the internal representation back into token probabilities, predicting the next word or phrase.

  8. To make the raw model safe and helpful, it undergoes supervised fine-tuning and reinforcement learning.

  9. Finally, the model generates a response based on the predicted tokens, returning coherent output to the user based on the context.
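
The routing in step 6 can be sketched in a few lines of NumPy. The weights below are random stand-ins (a real router is trained jointly with the experts), and the sizes correspond to a 20B-style layer with 32 experts and 4 active per token.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, num_experts, top_k = 512, 32, 4          # 20B-style layer: 32 experts, 4 active

    token = rng.normal(size=d_model)                  # hidden state of one token
    router_w = rng.normal(size=(d_model, num_experts))

    logits = token @ router_w                         # one routing score per expert
    chosen = np.argsort(logits)[-top_k:]              # keep only the top-k experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                              # softmax over the chosen experts only

    # Each chosen expert is a small feed-forward network; blend their outputs by gate weight.
    experts = rng.normal(size=(num_experts, d_model, d_model))
    output = sum(g * (token @ experts[e]) for g, e in zip(gates, chosen))
    print(chosen, output.shape)                       # 4 active experts, (512,) output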

Over to you: Have you used OpenAI’s open-source models?

Reference: Introducing gpt-oss | OpenAI



A Guide to Top Caching Strategies

2025-08-14 23:30:24

Every time a user opens an app, browses a product page, or scrolls a social feed, there’s a system somewhere trying to serve that request fast. More than raw speed, the goal is to serve the request predictably fast: under load, across geographies, and during traffic spikes. 

This is where caching comes in.

Caching is a core architectural strategy that reduces latency, takes load off source systems, and creates breathing room for slower, more expensive components like databases and remote services. 

If done correctly, caching can deliver great gains in performance and scalability. However, if implemented incorrectly, it can also cause bugs, stale data, or even outages.

Most modern systems rely on some form of caching: local memory caches to avoid repeat computations, distributed caches to offload backend services, and content delivery networks (CDNs) to push assets closer to users around the world. 

However, caching only works if the right data is stored, invalidated, and evicted at the right time.

In this article, we will explore the critical caching strategies that enable fast, reliable systems. We will cover cache write policies like write-through, cache-aside, and write-back that decide what happens when data changes. Each one optimizes for different trade-offs in latency, consistency, and durability. We will also look at other distributed caching concerns around cache consistency and cache eviction strategies.
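
To make one of these policies concrete before diving in, here is a minimal cache-aside sketch. The in-memory dictionary stands in for a real cache such as Redis, and the database call and TTL are placeholders.

    import time

    cache = {}                     # stand-in for a local or distributed cache (e.g., Redis)
    TTL_SECONDS = 60

    def db_get_user(user_id):      # placeholder for the slow, expensive source of truth
        return {"id": user_id, "name": f"user-{user_id}"}

    def get_user(user_id):
        entry = cache.get(user_id)
        if entry and entry["expires_at"] > time.time():
            return entry["value"]                      # cache hit: skip the database entirely
        value = db_get_user(user_id)                   # cache miss: read from the database...
        cache[user_id] = {"value": value, "expires_at": time.time() + TTL_SECONDS}
        return value                                   # ...and populate the cache for next time

    def update_user(user_id, value):
        # Cache-aside writes go to the database and invalidate the cached entry, avoiding stale reads.
        cache.pop(user_id, None)
        # the database update would go here

    print(get_user(42)); print(get_user(42))           # the second call is served from the cache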

Caching Write Strategies

Read more

How Atlassian Migrated 4 Million Jira Databases to AWS Aurora

2025-08-12 23:30:34

✂️ Cut your QA cycles down to minutes with QA Wolf (Sponsored)

If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.

QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.

QA Wolf takes testing off your plate. They can get you:

  • Unlimited parallel test runs for mobile and web apps

  • 24-hour maintenance and on-demand test creation

  • Human-verified bug reports sent directly to your team

  • Zero flakes guarantee

The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.

With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.

Schedule a demo to learn more


Disclaimer: The details in this post have been derived from the official documentation shared online by the Atlassian Engineering Team. All credit for the technical details goes to the Atlassian Engineering Team.  The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Most companies struggle to migrate a handful of databases without causing downtime. Atlassian routinely migrates about 1,000 databases every day as part of its regular operations, and its users never notice. 

However, in 2024, they tackled something far more ambitious: migrating 4 million databases with minimal user impact.

The numbers alone are staggering. Atlassian's Jira platform uses a one-database-per-tenant approach, spreading 4 million PostgreSQL databases across 3,000 server instances in 13 AWS regions worldwide. Each database contains everything for a Jira tenant: issues, projects, workflows, and custom fields.

Their goal was to move everything from AWS RDS PostgreSQL to AWS Aurora PostgreSQL. The benefits were compelling: 

  • Better cost efficiency

  • Improved reliability (upgrading from 99.95% to 99.99% uptime)

  • Enhanced performance through Aurora's ability to automatically scale up to 15 reader instances during peak loads.

The constraints were equally challenging. They needed to keep downtime under 3 minutes per tenant, control infrastructure costs during the migration, and complete the entire project within months rather than years.

In this article, we will look at how Atlassian carried out this migration and the challenges they faced.

Migration Strategy

Atlassian's team chose what seemed like a textbook approach for migrating from RDS to Aurora. This process is also known as “conversion”.

See the diagram below:

Here's how the process was designed to work:

  • Step 1 (Create a Read Replica): Instead of trying to copy databases while they're actively being used, they created read replicas. Think of a read replica as a live, constantly-updated copy of the original database. AWS RDS has a built-in feature that lets you create an Aurora read replica from an existing RDS PostgreSQL instance. This replica automatically synchronizes all the data to a new Aurora cluster volume in the background while the original database continues serving users normally.

  • Step 2 (The Coordinated Cutover): Here's where things get complex. Each database cluster at Atlassian hosts up to 4,000 individual tenant databases. That means when it's time to switch from the old RDS instance to the new Aurora cluster, they need to coordinate the cutover for all 4,000 tenants simultaneously. The cutover process worked like this:

    • Lock out all users on the source RDS database (preventing any new writes)

    • Promote the read replica to become a standalone Aurora cluster

    • Update all application connections to point to the new Aurora endpoints

    • Unlock users only on the new Aurora destination

  • Step 3 (Orchestration with AWS Step Functions): To manage this complex process, they used AWS Step Functions, which is like a sophisticated workflow engine that can coordinate multiple AWS services. The Step Functions handled all the safety checks before, during, and after the conversion. If anything went wrong at any stage, it could automatically roll back to the previous state.

See the diagram below for a detailed look at the conversion process.

Lastly, they also used feature flags that let them instantly redirect database connections. Instead of waiting for applications to slowly discover the new database endpoints (which could take time), feature flags allowed them to override the tenant database endpoints immediately across all application servers.

The most impressive part was that they managed to keep the actual cutover time to less than 3 minutes, even for their largest instances. This was possible because the heavy lifting (data synchronization) happened ahead of time through the read replica, and the feature flags enabled instant endpoint switching.
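
Atlassian orchestrated this with AWS Step Functions and internal tooling, none of which is public. The sketch below only mirrors the ordering of the cutover steps described above; every helper is a placeholder, not an actual API.

    def lock_writes(tenants, db):    print(f"lock {len(tenants)} tenants on {db}")
    def unlock_writes(tenants, db):  print(f"unlock {len(tenants)} tenants on {db}")
    def promote_replica(db):         print(f"promote {db} to a standalone Aurora cluster")
    def switch_endpoints(tenants, endpoint): print(f"feature-flag override -> {endpoint}")

    def cutover(tenants, source_rds, aurora_cluster, aurora_endpoint):
        """Ordering of one coordinated cutover; every helper above is a stand-in."""
        lock_writes(tenants, source_rds)                # 1. lock out users on the source RDS
        try:
            promote_replica(aurora_cluster)             # 2. promote the Aurora read replica
            switch_endpoints(tenants, aurora_endpoint)  # 3. feature flags repoint all app servers
            unlock_writes(tenants, aurora_cluster)      # 4. unlock users only on the destination
        except Exception:
            switch_endpoints(tenants, source_rds)       # roll back to the source on any failure
            unlock_writes(tenants, source_rds)
            raise

    cutover([f"tenant-{i}" for i in range(4000)], "rds-17", "aurora-17", "aurora-17.cluster-writer")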

The Advantages of Aurora

The key advantages of using Aurora were as follows:

  • The RDS Limitation Problem: With their existing RDS setup, Atlassian could only use one database instance (the primary) at a time. Even though RDS technically supports read replicas, their configuration didn't allow them to effectively use both the writer and reader instances simultaneously for production workloads.

  • Aurora's Dual-Instance Advantage: Aurora changed this completely. With Aurora, they could access both the writer instance and the reader instances (multiple readers if necessary) at the same time. 

  • Smart Instance Downsizing: This dual-instance capability enabled them to downsize their instances from m5.4xlarge (16 vCPUs, 64 GB RAM) to r6.2xlarge (8 vCPUs, 64 GB RAM). Normally, halving your CPU would hurt performance, but since Aurora let them use both writer and reader instances, they effectively had the same total CPU power available (or more during scale-out events). It's like trading one large server for two smaller servers that cost less to run.

  • Auto-Scaling Magic: During peak business hours, Aurora can automatically spin up additional reader instances, up to 15 total. During off-peak hours, it scales back down, dramatically reducing the instance footprint and costs.

  • Reliability Upgrade: The move also upgraded their Service Level Agreement (SLA) from 99.95% (RDS) to 99.99% (Aurora). That might seem like a tiny difference, but it means Aurora can have 80% less downtime per year, dropping from about 4.4 hours of allowable downtime annually to just 52 minutes.

  • The Economics: While they ended up with nearly triple the number of Aurora cluster instances compared to their original RDS instances, the combination of smaller instance sizes, auto-scaling behavior, and dual-instance utilization meant significantly better cost efficiency. 

The File Count Limit Problem

Just when everything seemed to be going smoothly, Atlassian hit a wall that nobody saw coming. 

During their testing phase, AWS support contacted them with concerning news: one of their large test RDS instances had successfully synchronized all its data to Aurora, but the new Aurora cluster had failed to start up properly.

From Atlassian's perspective, everything looked fine in the AWS console. The replica appeared healthy and still replicating. However, AWS's internal monitoring had detected that the Aurora instance's startup process had timed out and failed, even though the surface indicators suggested everything was working normally.

To understand what went wrong, we need to know how PostgreSQL stores data on disk. In PostgreSQL, every high-level database object gets stored as at least one file on the server's hard drive:

  • Every table = at least 1 file

  • Every index = at least 1 file

  • Every sequence (used for auto-incrementing IDs) = at least 1 file

  • Plus, various other database objects

However, Jira applications have particularly complex database schemas because they need to handle:

  • Issues and their custom fields

  • Projects and workflows

  • User permissions and configurations

  • Search indexes for fast queries

  • Audit logs and history tracking

All this complexity means that a single Jira database needs approximately 5,000 files on disk to store all its tables, indexes, and sequences. For 4000 tenants per cluster, this means 20 million files per Aurora cluster.

When an Aurora instance starts up, it performs various status checks to ensure everything is healthy. One of these checks involves enumerating (counting and cataloging) all the files in the cluster volume. This process is normally quick and invisible to users. 

However, when there are 20+ million files, this enumeration process takes a very long time. Aurora has internal timeout thresholds for how long startup processes are allowed to run. With Atlassian's massive file counts, the enumeration was taking longer than Aurora's startup timeout threshold, causing the instance to fail during boot-up.

To handle this problem, the team had only two options to reduce file counts:

  • Reduce files per database (not practical since Jira needs its complex schema to function).

  • Reduce the number of databases per cluster (the only viable path).

The solution they developed would become known as "draining".

The Draining Solution

Faced with the file count limitation, Atlassian developed an elegant solution that turned the problem into an opportunity. Instead of fighting the constraints, they embraced a new workflow that would improve their infrastructure efficiency.

See the diagram below:

Step 1: Drain

"Draining" means moving tenant databases off an RDS instance until it reaches a safe file count threshold that Aurora can handle during startup. Instead of 4,000 tenants per instance (20 million files), they needed to reduce this to a much smaller number.

The draining process used their existing database migration tools, but at a completely different scale. They prioritized moving the smallest tenants with the least usage first because:

  • Smaller databases mean less data to transfer, which results in faster migrations.

  • Lower usage means less impact during the brief cutover windows.

  • Higher throughput overall since they could complete more migrations per day.

Step 2: Convert

Once an RDS instance was drained to safe file count levels, they could proceed with the standard RDS to Aurora conversion process:

  • Create the Aurora read replica

  • Let it synchronize

  • Perform the coordinated cutover

  • Clean up the old RDS instance after a safety period

With fewer tenants, this conversion process became much more reliable and predictable.

Step 3: Refill

Instead of leaving the new Aurora clusters half-empty, they immediately started using them as destinations for draining other RDS instances.

AWS had confirmed that once an RDS instance was successfully converted to Aurora, they could re-populate it with tenant databases up to sensible limits. This meant they didn't need to permanently reduce their tenant density. They just needed to reduce it temporarily during the conversion process.

The Daisy Chain Approach

This three-step process created what Atlassian called a "daisy chain" of conversions:

  • Instance A gets drained and converted to Aurora to become a refill target.

  • Instance B gets drained (tenants go to converted Instance A) and gets converted to become the next refill target

  • Instance C gets drained (tenants go to converted Instance B) and gets converted

  • And so on...

This approach had a huge advantage: minimal infrastructure overhead. 

Instead of needing to provision massive amounts of temporary infrastructure to hold displaced tenants, they only needed enough extra capacity to accommodate the tenants from one or two instances being converted at any given time.

Through careful optimization and tooling improvements, they achieved remarkable scaling:

  • Average draining performance: 38,000 daily migrations (38× normal capacity).

  • Peak draining performance: 90,000 daily migrations (90× normal capacity).

  • Maintained high reliability targets throughout the entire scale-up.

Conclusion

In the end, Atlassian had accomplished something remarkable in the world of database migrations. The final statistics paint a picture of an operation that succeeded on every measure:

Here’s a quick look at the migration scale:

  • 2403 RDS instances successfully converted to Aurora clusters.

  • 2.6 million databases were migrated during the conversion processes.

  • 1.8 million additional databases were moved during the draining operations.

  • 27.4 billion individual files processed across the entire project.

The final Aurora infrastructure looked dramatically different from where they started, but in all the right ways:

  • They ended up with nearly triple the number of Aurora cluster instances compared to their original RDS setup. This resulted in 6482 Aurora instances active during peak business hours. At first glance, this might seem like a step backward, but the economics told a different story. 

  • Through Aurora's auto-scaling capabilities, these instances automatically scale down during off-peak hours, dramatically reducing their infrastructure footprint when demand is low. During busy periods, they have more capacity available than ever before. During quiet periods, they're paying for a fraction of what they used to maintain.

  • The instance downsizing strategy (m5.4xlarge to r6.2xlarge) proved to be a masterstroke. By halving the instance size while gaining access to both writer and reader instances, they achieved the same or better performance at significantly lower cost.

Atlassian's Aurora migration represents more than just a successful database platform upgrade. Their collaboration with AWS throughout this process led to a better understanding of Aurora's operational limits and likely influenced future improvements to the platform.

The project delivered on all its original objectives: significant cost savings through better resource utilization, improved reliability through Aurora's superior SLA, and enhanced performance through auto-scaling capabilities. But perhaps most importantly, they accomplished all of this while maintaining their commitment to minimal user impact.

References:

