Blog of ByteByteGo

How Datadog Redefined Data Replication

2026-04-01 23:31:07

Your cache isn’t the problem. How you’re using it is. (Sponsored)

If your cache only speeds up a few endpoints, your cache strategy is too narrow.

That model doesn’t scale. It creates stale data, extra complexity, and more load on your database than you think.

Modern systems treat cache differently. It’s seen as a real-time data layer that’s structured, queryable, and always in sync with source data.

This guide walks through how teams make that shift—from basic key-value storage to a cache that can actually carry production workloads.

Inside, you’ll learn:

  • How to cut database pressure without adding more infrastructure

  • How to serve more queries at sub-millisecond latency

  • How to keep data fresh without stitching together brittle pipelines

If you’re running into performance or cost limits, this guide is for you.

Download now


Datadog’s Metrics Summary page had a problem. For one customer, every time someone loaded the page, the database had to join a table of 82,000 active metrics with 817,000 metric configurations. The p90 latency hit 7 seconds. Every time a user clicked a filter, it triggered another expensive join.

The team tried the usual fixes, such as query optimization, indexing, and tuning. However, the problem wasn’t the query. They were asking a database designed for transactions to do the job of a search engine. Fixing that one page set off a chain of architectural decisions that didn’t just solve the performance issue. It led Datadog to fundamentally redefine how data replication works across its entire infrastructure.

In this article, we will look at how Datadog implemented the changes and the challenges they faced.

Disclaimer: This post is based on publicly shared details from the Datadog Engineering Team. Please comment if you notice any inaccuracies.

The Database Was Simply Doing the Wrong Job

Datadog operates thousands of services, many of them backed by a shared Postgres database. For a long time, that shared database was the right call. Postgres is reliable, well-understood, and cost-effective at small to medium scale. However, as data volumes grew, the cracks started to show, and the Metrics Summary page was just the most visible symptom.

The team’s first instinct was to optimize the database. They tried adjusting join order, adding multi-column indexes, and using query heuristics based on table size. None of it held up, and several issues compounded the problem:

  • Disk and index bloat slowed inserts and updates.

  • VACUUM and ANALYZE operations added maintenance overhead.

  • Memory pressure drove up I/O wait times.

Monitoring with Datadog’s own APM confirmed that these queries were consuming a disproportionate share of system resources and getting worse as the data grew. By the time multiple organizations crossed the 50,000-metrics-per-org threshold, the warning signs were everywhere: slow page loads, unreliable filters, and mounting operational overhead.

See the diagram below:

Postgres was being asked to do two fundamentally different jobs at once. OLTP workloads are what relational databases are designed for. However, real-time search with filtering across massive denormalized datasets is a completely different workload, one that search engines like Elasticsearch are purpose-built to handle.

Therefore, instead of making Postgres better at searching, Datadog stopped making it search at all. They replicated data from Postgres into a dedicated search platform, flattening the relational structure into denormalized documents along the way. The mechanism behind this is Change Data Capture, or CDC.

Postgres already records every change (every insert, update, and delete) in its Write-Ahead Log, or WAL. This log exists primarily for crash recovery, but it can also be read by external tools. Datadog used Debezium, an open-source CDC tool, to read that log and stream changes into Kafka, a durable message broker. From Kafka, sink connectors pushed the data into the search platform.
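The flattening step in this pipeline can be sketched as a small pure function. The event shape below mimics Debezium's change-event envelope (`op`, `before`, `after`), but the field names inside the rows are illustrative, not Datadog's actual schema:

```python
def flatten_change_event(event):
    """Flatten a Debezium-style change event into a denormalized
    search document. Row field names are illustrative."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete
    if op == "d":
        # Deletes carry the old row in "before"; emit a tombstone
        # so the sink connector can remove the document.
        return {"id": event["before"]["id"], "_deleted": True}
    row = event["after"]
    # Join-time lookups become index-time denormalization: the
    # configuration fields are embedded directly in the document.
    return {
        "id": row["id"],
        "metric_name": row["name"],
        "config_tags": row.get("config_tags", []),
        "_deleted": False,
    }
```

The key design point is that this transformation runs once per change, at write time, so the search platform never has to join anything at query time.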

The advantage of this approach is that the application itself didn’t need to change how it wrote data. It still writes to Postgres as before. However, search queries now hit the search platform instead, which was purpose-built for exactly that workload.

See the diagram below:

Why Async?

Before scaling that pattern, the team faced a key design choice: synchronous or asynchronous replication.

In synchronous replication, the primary database doesn’t confirm a write to the application until every replica has acknowledged receiving it. This guarantees strong consistency, meaning every system has the same data at all times. But it’s slow. If one replica is across the network or temporarily unhealthy, the entire write pipeline stalls waiting for confirmation. One slow consumer becomes a bottleneck for everything.

Asynchronous replication flips this. The primary database confirms the write immediately, and replicas catch up afterward. The application never waits for downstream systems. This is faster and more resilient, but it introduces a window where the replica is behind the source. This gap is what’s known as replication lag. The data will get there, but not instantly.

See the diagram below:

Datadog chose async. At their scale, with thousands of services spread across multiple data centers, synchronous replication would have coupled their application’s performance to the network latency and health of every downstream consumer. That was a non-starter.

One factor that made the decision easier was that the cost was concrete. For a brief window after a write, the search platform might show slightly stale results. If a user adds a new metric configuration and immediately searches for it, the search platform might not have the update yet. For Datadog’s use cases (search, filtering, analytics dashboards), a few hundred milliseconds of lag was a perfectly acceptable tradeoff compared to 7-second page loads.

Debezium captures changes from the WAL, and Kafka acts as a durable buffer between the source and all consumers. Since Kafka persists messages to disk and supports replay, changes aren’t lost even if a consumer goes down temporarily. The consumer just picks up where it left off.
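The replay behavior described above boils down to offset tracking: the log is durable, and the consumer only advances a committed offset after processing. A minimal in-memory sketch of that at-least-once pattern (class and method names are invented, not Kafka's API):

```python
class ReplayableConsumer:
    """Sketch of Kafka-style consumption: messages live in a durable,
    append-only log, and a consumer resumes from its last committed
    offset after a crash or restart."""

    def __init__(self, log):
        self.log = log       # the durable message log
        self.committed = 0   # last committed offset

    def poll(self, handler):
        # Process everything past the committed offset, committing as we go.
        for offset in range(self.committed, len(self.log)):
            handler(self.log[offset])
            self.committed = offset + 1


log = ["event-1", "event-2", "event-3"]
seen = []
consumer = ReplayableConsumer(log)
consumer.poll(seen.append)   # processes all three events
log.append("event-4")        # producer keeps writing while consumer is away
consumer.poll(seen.append)   # picks up exactly where it left off
```

Because the committed offset, not the message itself, is the consumer's state, a consumer that goes down simply resumes from its checkpoint with no data loss.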

This tradeoff between consistency and availability shows up everywhere in distributed systems. It’s a practical instance of the CAP theorem, which describes the fundamental tension between consistency, availability, and partition tolerance.

Async replication solved the performance problem. However, it introduced a new challenge. What happens when the shape of your data changes?

The Problem With Schema Evolution

In a normal application, changing a database schema is between you and your database. You run a migration, add a column, change a type, and move on. With CDC, every schema change propagates to every downstream consumer, and if those consumers aren’t ready for the change, the pipeline breaks.

Let us consider a concrete example. A team adds a required region field to a table using ALTER TABLE ... ALTER COLUMN ... SET NOT NULL. Debezium starts producing messages that include this field. But messages already sitting in Kafka were written under the old schema and don’t have it. Consumers expecting every message to have a non-null region field start failing, and the pipeline goes down.

Datadog built a two-part defense against this.

The first line of defense is an automated validation system that analyzes schema migration SQL before it’s applied to the database. It catches risky changes, like adding NOT NULL constraints, and blocks them from being deployed without coordination. Most migrations pass through automatically; the ones that get flagged require the team to work directly with the platform team to coordinate a safe rollout.
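A validation system like this can be sketched as a linter over the migration SQL. A production version would parse the SQL properly; this illustration uses regexes, and the pattern list is invented, not Datadog's rule set:

```python
import re

# Changes that can break consumers still reading old messages.
# The list is illustrative, not exhaustive.
RISKY_PATTERNS = [
    r"SET\s+NOT\s+NULL",              # old messages lack the field entirely
    r"DROP\s+COLUMN",                 # consumers may still expect the field
    r"ALTER\s+COLUMN\s+\w+\s+TYPE",   # type changes break deserialization
]

def validate_migration(sql):
    """Return the risky patterns found in a migration, empty if safe."""
    return [p for p in RISKY_PATTERNS if re.search(p, sql, re.IGNORECASE)]
```

A safe additive migration such as `ADD COLUMN region TEXT` sails through, while `ALTER COLUMN region SET NOT NULL` gets flagged for coordinated rollout.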

The second line of defense is a multi-tenant Kafka Schema Registry configured for backward compatibility. This means a consumer using the new schema must still be able to read data written under the old schema. In practice, this restricts schema changes to safe operations, such as adding optional fields or removing existing ones. When Debezium captures an updated schema, it serializes the data in Avro format and pushes both the data and the schema update to the Registry.

The Registry checks the new schema against the existing one and either accepts or rejects it based on the backward compatibility rules. Datadog uses Avro serialization specifically because it supports this kind of schema negotiation natively. The Confluent Schema Registry documentation covers the mechanics of backward, forward, and full compatibility modes.
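The backward-compatibility rule can be illustrated with a toy schema model. Here a schema is just a dict of field name to default value (`None` meaning required); this is a simplification of Avro's actual resolution rules, not the Registry's implementation:

```python
def backward_compatible(old_fields, new_fields):
    """Can a reader using the new schema still read records written
    under the old schema? Simplified model of Avro backward compatibility:
    schemas are dicts of field name -> default (None = required)."""
    for name, default in new_fields.items():
        if name not in old_fields and default is None:
            # A new required field has no value in old records: reject.
            return False
    # Fields removed in the new schema are fine: the new reader
    # simply ignores them when decoding old records.
    return True
```

Under this rule, adding an optional `region` field or deleting a field passes, while adding a required `region` field is rejected, which is exactly the failure mode from the earlier `SET NOT NULL` example.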

Together, these two systems mean that most schema changes flow through automatically, breaking changes get caught early, and downstream consumers don’t wake up to broken pipelines.

With async replication running and schema evolution under control, Datadog had a working pipeline. However, setting up each new pipeline still required an engineer to manually configure seven or more components across multiple systems. That’s where automation changed the game.

From One Pipeline to a Platform

A single CDC pipeline involves a number of moving parts. You need to enable logical replication on Postgres by setting wal_level to logical, create Postgres users with the right permissions, establish replication slots and publications, deploy Debezium instances, create Kafka topics with correct mappings, set up heartbeat tables for monitoring, and configure sink connectors to push data into the destination.

Doing all of that manually for one pipeline is tedious. But doing it across many pipelines and multiple data centers means the operational burden compounds quickly.

See the diagram below:

Datadog made automation a foundational principle. Using Temporal, a workflow orchestration engine, they broke the provisioning process into modular, reliable tasks and stitched them into higher-level workflows. If a step fails, the workflow retries or rolls back cleanly. Teams don’t touch infrastructure directly. They request a pipeline through the platform, and the automation handles everything end-to-end.
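The retry-and-rollback shape of such a workflow can be sketched in plain Python. This is not the Temporal SDK, just an illustration of the orchestration pattern: each provisioning step carries a compensating undo, failed steps are retried, and on permanent failure the completed steps are unwound in reverse:

```python
def run_workflow(steps, max_retries=3):
    """Run provisioning steps in order. Each step is a (do, undo) pair.
    Transient failures are retried; on permanent failure, completed
    steps are rolled back in reverse order and the error is re-raised."""
    done = []  # undo handlers for steps that have succeeded
    for do, undo in steps:
        for attempt in range(max_retries):
            try:
                do()
                done.append(undo)
                break
            except Exception:
                if attempt == max_retries - 1:
                    for u in reversed(done):  # roll back cleanly
                        u()
                    raise
    return True
```

In the real platform, steps would be things like "create replication slot" or "deploy Debezium connector", and the engine (Temporal, in Datadog's case) additionally persists workflow state so retries survive process crashes.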

This is what turned a single fix into a company-wide capability. What started as “replicate this one Postgres table to a search engine” expanded to Postgres-to-Postgres replication for unwinding their large shared monolithic database, Postgres-to-Iceberg pipelines for event-driven analytics, Cassandra replication to support non-SQL data sources, and cross-region Kafka replication to improve data locality for products like Datadog On-Call.

See the diagram below:

Conclusion

Every architectural choice Datadog made came with a cost:

  • Asynchronous replication means downstream systems are always slightly behind the source.

  • Schema evolution constraints mean you can’t freely change your database without considering the pipeline.

  • The infrastructure itself (Debezium, Kafka, Schema Registry, Temporal) represents a lot of moving parts to operate, monitor, and maintain.

  • Lastly, building all of this into a platform requires a dedicated team to own it.

This approach makes sense when you have workloads that genuinely don’t belong in your primary database, when multiple teams need the same data in different shapes, and when your scale makes manual pipeline management untenable. Datadog checked all three boxes. However, if you have a handful of simple data flows, the overhead isn’t justified. A periodic batch sync or a straightforward read replica might be all you need. Not every problem requires a platform.


How Meta Turned Debugging Into a Product

2026-03-31 23:31:44

How to Test Non-Deterministic AI Agents (Sponsored)

Same input. Same prompt. Different output. That's the reality of testing AI agents that write code, and most teams are shipping without solving it.

Nick Nisi from WorkOS tackled this by building eval systems for two AI tools:

The post covers how to test against real project structures, score output that's different every time, and catch when your agent makes up methods that don't exist.

Learn more about evals →


Every growing engineering organization eventually discovers the same problem. When something breaks in production, engineers debug it. When it breaks again, they debug it again. Hundreds of teams, thousands of incidents, each one investigated mostly from scratch.

The experienced engineer knows where to look and what patterns to check, but that knowledge lives in their head, not in the system. Over time, runbooks go stale, and scripts that one person wrote become tribal knowledge.

Meta hit this wall years ago. Their answer was DrP, a platform that lets engineers turn investigation expertise into actual code. It is a software component that runs automatically, gets tested through code review, and improves over time. It now runs across 300 teams and executes 50,000 automated analyses daily.

While Meta’s specific tool is interesting to learn about, even more insightful is the underlying principle that debugging itself can be engineered. In this article, we will look at how DrP works on a high level and the design choices Meta made while building it.

Disclaimer: This post is based on publicly shared details from the Meta Engineering Team. Please comment if you notice any inaccuracies.

Why Manual Investigation Breaks Down

The way most teams investigate incidents has a predictable failure mode, and writing better documentation doesn’t fix it.

Knowledge is trapped in people. Your best debugger carries mental models nobody else has, such as which services are flaky, which metrics actually matter, and which dashboards lie under certain conditions. When that person is asleep, on vacation, or leaves the company, the knowledge is gone. If you’ve ever been paged at 2 AM and wished someone had already figured this out last time it happened, you’ve felt this problem.

Systems change frequently, sometimes dozens of times a day. The runbook that was accurate last month now references a dashboard that was renamed and a service that was refactored. Modern software moves too fast for static documentation to keep up.

Teams often write one-off scripts to automate their own checks, and that’s a good thing to have. But those scripts can’t cross service boundaries, and they aren’t tested systematically. Ultimately, they become their own form of tribal knowledge: useful to the author and opaque to everyone else.

These problems aren’t unique to Meta. Every organization at a certain scale hits the same wall. The industry has approached it in different ways. Some companies focus on coordinating people better during incidents (Netflix built and open-sourced Dispatch for exactly this), others focus on automating the investigation itself (Meta’s approach), and still others are leaning into AI-driven diagnostics.

There are at least three distinct layers to incident response:

  • Coordination (getting the right people together)

  • Investigation (figuring out what went wrong)

  • Remediation (fixing it)

Meta invested deepest in the investigation layer, and their approach is worth studying because it has been in production for over five years at a massive scale.

You can also think about this as a maturity progression from tribal knowledge to wiki runbooks to ad-hoc scripts to testable analyzers to a composable platform. Most teams are stuck somewhere around step two or three, whereas DrP represents step five.


4 engineering workflows where AI agents have more to offer (Sponsored)

AI has changed how engineers write code. But 70% of engineering time isn’t spent writing code, it’s spent running it.

Most teams are still operating manually to triage alerts, investigate incidents, debug systems, and ship with full production context.

A new wave of engineering orgs including those at Coinbase, Zscaler, and DoorDash are deploying AI agents specifically in their production systems.

This practical guide covers the four workflows where leading teams are seeing measurable impact, and what the human-agent handoff actually looks like in each one.

Download the free guide‭→


Treating Investigation as Software

DrP’s core philosophy is that investigation workflows should be written as code, go through code review, have CI/CD, and be tested, just like any other software the team ships.

The core unit is the “analyzer,” a programmatic investigation workflow. Engineers use DrP’s SDK to codify their debugging steps, such as which data to pull, which anomalies to look for, and which decision trees to follow. The output is structured findings (both human-readable and machine-readable) and not just a wiki page that someone might skim.

What makes this different from “just writing a script” is the engineering rigor around it.

Analyzers go through code review. They have automated backtesting integrated into the review process, so you can verify an analyzer would have caught last month’s incidents before you deploy it. They ship through a CI/CD process. When the underlying system changes, the analyzer gets updated through the same development workflow as any other code.

See the diagram below that shows the authoring workflow for the same:

The SDK also provides shared libraries for common investigation patterns, such as anomaly detection (spotting unusual behavior in metrics), time series correlation (checking whether a metric spike lines up with a deploy or config change), and dimension analysis (automatically slicing metrics by region, device type, or other attributes to isolate where a problem is concentrated). Engineers don’t reinvent these from scratch for each analyzer.
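Dimension analysis, the last of those patterns, is simple to sketch: slice the error events by one attribute and check whether the failures concentrate in a single value. The function below is an illustration of the pattern, not Meta's SDK:

```python
from collections import defaultdict

def dimension_analysis(errors, dimension, threshold=0.8):
    """Slice error events by one dimension (region, device type, ...)
    and report the value the problem concentrates in, if any."""
    counts = defaultdict(int)
    for event in errors:
        counts[event[dimension]] += 1
    total = sum(counts.values())
    value, count = max(counts.items(), key=lambda kv: kv[1])
    if count / total >= threshold:
        return value   # concentrated: a strong localization signal
    return None        # spread out: this dimension doesn't isolate it


errors = [{"region": "us-east"}] * 9 + [{"region": "eu-west"}]
```

An analyzer would run this across several dimensions and report the ones that isolate the problem, e.g. "90% of errors are in us-east".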

The mental shift matters as much as the technology. DrP has users (on-call engineers), a platform team, deployment pipelines, and usage metrics. It is maintained and improved continuously, not abandoned after the initial author moves teams.

Where the Platform Beats the Script

Writing analyzers as code is the foundation. But the real leverage comes from what happens when you connect them.

In a microservices architecture, the root cause is rarely in the service showing symptoms. Your API error rate spiked, but the actual cause may be a config change in a storage service three layers down. With standalone scripts, each team can only investigate its own domain. With DrP, analyzers chain across service boundaries. The API analyzer discovers the issue is downstream, invokes the Storage Service analyzer, passes along the context it has already gathered, and gets back a confirmed root cause. This happens automatically, without anyone pinging a Slack channel.
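The chaining mechanism can be illustrated with a small analyzer registry. Everything here (the registry, the analyzer names, the findings) is invented for illustration; the point is that the upstream analyzer forwards its gathered context and receives a structured finding back:

```python
# Registry of analyzers keyed by service name (illustrative).
ANALYZERS = {}

def analyzer(service):
    def register(fn):
        ANALYZERS[service] = fn
        return fn
    return register

@analyzer("storage")
def storage_analyzer(context):
    # The downstream analyzer uses the forwarded context (e.g. region)
    # to confirm a root cause in its own domain.
    return {"root_cause": "config change raised latency past timeout",
            "region": context["region"]}

@analyzer("api")
def api_analyzer(context):
    # The API analyzer finds only the symptom, then chains into its
    # dependency's analyzer, passing along what it has already learned.
    context["region"] = "us-east"  # result of dimension analysis
    downstream = ANALYZERS["storage"](context)
    return {"symptom": "error rate spike", **downstream}

finding = ANALYZERS["api"]({})
```

The structured return value is what lets the platform annotate the alert and drive post-processing, rather than leaving the conclusion in a chat thread.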

DrP also integrates directly into the alert lifecycle. Analyzers trigger automatically when an alert fires. Results annotate the alert itself, so the on-call engineer sees the diagnosis alongside the page, before they have opened a single dashboard.

After the investigation, a post-processing system can close the loop: create a revert task, file a bug, or trigger a mitigation step. And there’s a broader feedback loop too.

DrP Insights periodically analyzes outputs across all investigations to identify and rank the most common alert causes across the organization. Individual investigations become organizational learning.

An Investigation: Start to Finish

To make this concrete, here’s what a DrP-powered investigation looks like in practice. This is a possible scenario based on DrP’s documented capabilities, not a specific Meta incident.

Let’s consider that the error rates spike on an API service. The alert fires and auto-triggers the relevant analyzer. From there, the investigation unfolds in a series of automated steps:

  • The analyzer pulls error rate data and runs dimension analysis, slicing by region, device type, and data center. It isolates the problem to one region.

  • It runs time series correlation, comparing the error spike against other signals like latency, deploy events, and config changes. It finds a strong match with a recent config change on a downstream storage service.

  • Since the storage service is a separate dependency, the analyzer chains into the Storage Service analyzer, passing along the regional context. That analyzer confirms the config change pushed response latency past the API’s timeout threshold.

  • The findings surface to the on-call engineer as a structured summary: affected region, root cause, correlated timestamp, and a link to the offending change.

  • The post-processing system creates a revert task assigned to the storage team.

  • The engineer reviews the analysis, approves the revert, and the incident is resolved. The investigation that might have taken 45 minutes of manual work across multiple tools and teams happened in the background before the engineer finished reading the alert.

Conclusion

DrP has reduced mean time to resolve incidents by 20-80% across Meta’s teams, with over 2,000 analyzers in production. Those numbers are compelling, but the lasting takeaway isn’t about a specific tool at a specific company.

Investigation knowledge is too valuable to live in people’s heads or in documents that go stale. It can be codified into testable, composable software.

However, this doesn’t eliminate the need for human judgment. DrP deliberately keeps engineers in the loop, presenting findings for review rather than auto-remediating. And it’s not free; analyzers are code, and code needs maintenance. You’re trading unpredictable on-call toil for more manageable engineering work. But wherever you work, the question is worth asking: is your team’s debugging knowledge engineered into your system, or is it waiting to walk out the door?


How Roblox Uses AI to Translate 16 Languages in 100 Milliseconds

2026-03-30 23:33:10

OpenClaw You Can Trust (Sponsored)

When your AI agent holds your API keys, reads your email, and runs shell commands, security isn’t optional.

KiloClaw is a fully managed OpenClaw: a one-click deploy that gives you a 24/7 AI agent, without buying a Mac Mini.

Every instance runs in a dedicated Firecracker micro-VM, not a shared container, with five independent isolation layers protecting your data. An independent security assessment found zero cross-tenant vulnerabilities (read the full white paper).

Built on the same infrastructure serving 1.5M+ Kilo Code developers, with access to 500+ AI models through Kilo Gateway.

Try KiloClaw Free


Translating between 16 languages means supporting 256 possible pairs, such as Korean to English, French to Thai, Portuguese to Japanese, and so on. One solution is to build a separate model for each pair. However, Roblox decided to build just one.

Roblox is a global platform where more than 70 million people play, create, and socialize every day across more than 15 million active experiences. Users span 180 countries and communicate constantly through in-experience text chat. And a single unified model now handles real-time chat translation across all of those users, at roughly 100 milliseconds per translation and over 5,000 chats per second.

However, Roblox’s real engineering challenge wasn’t building a model that could translate. It was building a system that could translate at the speed of a conversation without breaking the user experience. In this article, we will look at what Roblox built and the trade-offs they made.

Disclaimer: This post is based on publicly shared details from the Roblox Engineering Team. Please comment if you notice any inaccuracies.

One Model Versus Many

Building a separate model for every language pair is the obvious starting point. One model for English to Korean, another for Korean to French, another for French to Thai, and so on.

With 16 languages, that’s 16 times 16, or 256 individual models. Each one needs its own training data, its own infrastructure, its own maintenance. And when Roblox adds a 17th language, they don’t need a new model. They need 32. The approach grows quadratically, and it collapses under its own weight long before you reach production.

Roblox went a different direction.

They built a single, unified transformer-based translation model that handles all 256 language directions. The key to making this work is an architecture called Mixture of Experts, or MoE. Instead of every translation request passing through every parameter in the model, a routing mechanism activates only a subset of specialized “expert” subnetworks depending on the input.

Different experts specialize in groups of similar languages. Given a source sentence and a target language, the system activates the relevant expert (or combination of experts) to generate the translation. Think of it as a team of specialist translators sitting behind a single reception desk. A request comes in, the routing layer sends it to the right specialist, and only that specialist does the work. The full team has broad expertise, but any single translation only activates a fraction of it.
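The routing idea can be shown with a toy sketch. The language groupings and expert functions below are invented for illustration; the point is that only one expert runs per request while the full set stays available:

```python
# Toy "experts": each handles a group of related languages.
EXPERTS = {
    "romance": lambda text, tgt: f"[romance-expert] {text} -> {tgt}",
    "cjk":     lambda text, tgt: f"[cjk-expert] {text} -> {tgt}",
}

# Routing table: detected source language -> expert group (illustrative).
LANGUAGE_GROUP = {"es": "romance", "pt": "romance", "fr": "romance",
                  "ko": "cjk", "ja": "cjk", "zh": "cjk"}

def translate(text, src_lang, tgt_lang):
    """Activate only the expert for the source language's group;
    the rest of the 'model' stays idle for this request."""
    expert = EXPERTS[LANGUAGE_GROUP[src_lang]]
    return expert(text, tgt_lang)
```

In the real model the "router" is a learned layer and the "experts" are subnetworks sharing a common backbone, but the compute-saving shape is the same: per-request cost scales with the activated experts, not the whole parameter count.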

This unified approach creates some great benefits. When all languages are trained together, similar languages actually help each other. For example, Spanish and Portuguese share enough structure that training them in the same model improves translation quality for both. The model also learns enough about each language’s patterns that it can auto-detect the source language, even when the language setting is wrong or missing. It can even handle mixed-language input, where someone types in two languages within the same message, and still produce a reasonable translation into the target language.

However, there’s a cost to consolidation. One model now carries the weight of all 256 directions. To handle that diversity with acceptable quality, Roblox’s model ended up with roughly 1 billion parameters. Running inference through a model that large is too slow and too expensive for real-time chat at scale. The architectural problem was solved, but the serving problem was just getting started.


Unblocked: Context that saves you time and tokens (Sponsored)

Stop babysitting your coding agents. Unblocked gives them the organizational knowledge to generate mergeable code without the back and forth. It pulls context from across your engineering stack, resolves conflicts, and cuts the rework cycle by delivering only what agents need for the task at hand.

See how Unblocked works


Making a Billion Parameters Fast Enough for a Conversation

A 1-billion-parameter model produces good translations. It does not produce them fast enough for two people having a real-time conversation.

At 5,000+ chats per second, with a latency ceiling of roughly 100 milliseconds, Roblox needed to close a significant gap to make things production-ready. They did it with two moves. First, they made the model smaller. Then, they wrapped it in infrastructure that squeezes out every remaining millisecond.

Roblox used a technique called knowledge distillation, sometimes described as a teacher-student approach. The idea is straightforward. You train a large, high-quality model first (the teacher). Then you train a smaller model (the student) to mimic the teacher’s outputs. The key detail is what the student actually learns. It doesn’t just learn the teacher’s final answers. It learns the teacher’s probability distributions, in other words, the teacher’s confidence levels across all possible translations for a given input.

Through this process, Roblox compressed the model from roughly 1 billion parameters to fewer than 650 million. Alongside distillation, they also applied quantization (reducing the numerical precision of model weights) and model compilation (optimizing the computation graph for specific hardware). These are additional layers of compression stacked on top of distillation.
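The "learning the teacher's distribution" idea can be made concrete with a tiny numeric example. This is a minimal sketch of the soft-target loss (cross-entropy against the teacher's probabilities); real setups add temperature scaling and mix in a hard-label term:

```python
import math

def distillation_loss(teacher_probs, student_probs):
    """Cross-entropy of the student against the teacher's full
    probability distribution (the soft targets), not just the
    teacher's single top answer."""
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs))

# Three candidate translations; the teacher is confident but not certain.
teacher     = [0.70, 0.20, 0.10]
close       = [0.65, 0.25, 0.10]  # student mimics the whole distribution
argmax_only = [0.98, 0.01, 0.01]  # student only copies the top answer
```

A student that matches the teacher's uncertainty (`close`) scores a lower loss than one that merely gets the argmax right, which is exactly why soft targets carry more signal than hard labels.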

However, the serving infrastructure is the other half of the story. Even a distilled model doesn’t hit 100ms on its own at this scale. The model is just one component in a longer pipeline, and most of the latency optimization happens outside the model itself.

When a chat message needs translation, it first passes through RCC (Roblox’s backend service). Before the model is ever involved, the system checks a translation cache. If this exact source-text-to-target-language translation has been done before, the cached result is returned immediately, and no model inference is needed.

If the cache misses, the request goes to the backend, where a dynamic batcher groups multiple translation requests together. Batching is critical because GPUs are far more efficient at processing many inputs at once than handling them one at a time. The batched requests then flow through the model’s encoder stack, which converts the source text into a numerical representation.

Here’s where a second, different cache comes in. Roblox added an embedding cache between the encoder and decoder. This matters for a specific scenario that happens constantly on the platform. Imagine a Korean speaker sends a message on a server with English, German, and French speakers. That single Korean message needs three separate translations. Without the embedding cache, the encoder would process the same Korean message three times. With it, the encoding happens once, the intermediate representation is cached, and the decoder generates all three translations from that single encoding. At the scale of Roblox’s chat traffic, this optimization is significant.
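The two caches can be sketched together. The class below is illustrative (names and the stand-in encoder/decoder are invented), but it shows the key property: one source message fanned out to several target languages costs one encoding, not three:

```python
class TranslationPipeline:
    """Sketch of the two caches in the serving path: a full-translation
    cache keyed by (source text, target language), and an embedding
    cache so one encoding serves several target languages."""

    def __init__(self):
        self.translation_cache = {}
        self.embedding_cache = {}
        self.encode_calls = 0  # instrumented to show the savings

    def encode(self, text):
        self.encode_calls += 1
        return f"enc({text})"  # stand-in for the encoder stack

    def translate(self, text, tgt_lang):
        key = (text, tgt_lang)
        if key in self.translation_cache:      # cache 1: whole result
            return self.translation_cache[key]
        if text not in self.embedding_cache:   # cache 2: encoder output
            self.embedding_cache[text] = self.encode(text)
        result = f"decode({self.embedding_cache[text]}, {tgt_lang})"
        self.translation_cache[key] = result
        return result


p = TranslationPipeline()
for lang in ("en", "de", "fr"):  # one Korean message, three recipients
    p.translate("안녕하세요", lang)
```

After the loop, the encoder has run exactly once despite producing three translations, which is the scenario from the Korean-speaker example above.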

Finally, the decoded translation passes through Roblox’s trust and safety systems, the same scrutiny applied to all text on the platform, to catch anything that violates their policies. The translated message and the original are both sent to the recipient’s device, allowing users to toggle between the translation and the sender’s actual words.

Measuring Quality

A translation model is only as good as two things: the data it was trained on, and your ability to measure whether it’s working. For 256 language directions at Roblox’s scale, both are hard problems, and Roblox had to build custom solutions for each.

Standard translation quality metrics work by comparing the model’s output against a “correct” human translation, called a reference. But producing reference translations for all 256 language directions, across the volume and variety of Roblox chat messages, is impossible. You’d need human translators producing ground truth for every combination, continuously. Roblox solved this by building its own quality estimation model that scores translations using only the source text and the machine translation output. No reference translation required.

This quality model evaluates the translations along multiple dimensions.

It checks accuracy (are there additions, omissions, or mistranslations?), fluency (grammar, spelling, punctuation), and reference consistency (does the translation make sense in context with the rest of the conversation?).

Errors are classified by severity into critical, major, and minor categories. The model operates at word-level granularity, not just sentence-level. It doesn’t just flag a translation as “bad.” It pinpoints which words are wrong and how severely.

To build this, Roblox trained an ML model on human-labeled error types and scores, then fine-tuned a multilingual language model to predict these word-level errors and compute scores across their multidimensional criteria.

Common language pairs like English-Spanish have abundant parallel training data. Billions of translated sentence pairs exist on the web. However, there are also rare pairs like French-Thai. It’s difficult to train a good translator between two languages if you don’t have examples of translations between them.

Roblox addressed this with iterative back-translation. Take a French text, translate it into Thai using the current model, then translate that Thai text back into French. Compare the round-trip result to the original French. If it matches closely, the intermediate French-Thai pair is probably a good synthetic training example.

The critical word to note here is “iterative.” Roblox didn’t do this once. They repeated the process across multiple rounds, using a mix of this synthetic back-translated data and human-labeled supervised data to progressively expand the training set. The ratio between synthetic and real data matters: too much synthetic data degrades quality because the model starts learning from its own mistakes.
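The round-trip filtering step can be sketched as follows. The `translate` callable, the surface-similarity measure, and the 0.85 threshold are all hypothetical stand-ins for whatever model checkpoint and filtering criteria are actually used:

```python
import difflib

def mine_synthetic_pair(source_text, translate, threshold=0.85):
    """Back-translate source_text and keep the pair only if the
    round trip roughly reproduces the original.

    translate: callable(text, src, tgt) -> text, the current model.
    """
    # Forward pass: French -> Thai using the current model
    target_text = translate(source_text, src="fr", tgt="th")
    # Backward pass: Thai -> French
    round_trip = translate(target_text, src="th", tgt="fr")
    # Cheap surface similarity as a stand-in for a real filtering metric
    similarity = difflib.SequenceMatcher(None, source_text, round_trip).ratio()
    if similarity >= threshold:
        return (source_text, target_text)  # keep as synthetic training pair
    return None  # round trip drifted too far; discard
```

Each training round produces a better model, which produces better synthetic pairs, which feed the next round.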

General translation data doesn’t include words like “obby” (a Roblox obstacle course) or platform-specific slang and abbreviations. Roblox brought in human evaluators to translate popular and trending terms for each of the 16 languages, then fed those translations into the training data. This is an ongoing process because slang evolves faster than any model retraining.

Conclusion

Roblox’s unified translation model came with real costs, and understanding those costs matters as much as understanding the architecture.

  • Quality vs. latency is a permanent tension. The distilled student model is inherently less accurate than the teacher. Every time Roblox improves the teacher, they face the question of whether those gains survive compression. And 100 milliseconds is a hard ceiling that limits how large and accurate the serving model can get.

  • Low-resource pairs are still the weak link. Back-translation helps, but French-to-Thai will never be as good as English-to-Spanish. The model can handle mixed-language input, but accuracy drops. Unified doesn’t mean uniform quality.

  • The maintenance burden is real. Building a custom translation model means owning the entire stack. Training, evaluation, serving, slang updates, safety integration, all of it. Using a commercial translation API means someone else handles that complexity. Roblox’s choice made sense because they needed domain-specific accuracy (their model outperforms commercial APIs on Roblox content, by their own metrics) and extreme latency at massive scale. Most companies should use off-the-shelf translation and spend their engineering effort elsewhere.

  • The reference-free quality estimation model, while clever, has an inherent limitation. It could have systematic biases that overlap with the translation model’s own weaknesses. It’s a pragmatic solution, not a perfect one.

References:

EP208: Load Balancer vs API Gateway

2026-03-28 23:31:28

This week’s system design refresher:

  • LAST CALL FOR ENROLLMENT: Become an AI Engineer - Cohort 5

  • 12 Claude Code Features Every Engineer Should Know (YouTube video)

  • Load Balancer vs API Gateway

  • What is MCP?

  • REST vs gRPC

  • Session-Based vs JWT-Based Authentication

  • A Cheat Sheet on The Most-Used Linux Commands


LAST CALL FOR ENROLLMENT: Become an AI Engineer - Cohort 5

Our 5th cohort of Becoming an AI Engineer starts today, March 28. This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.

Check it out Here

Here’s what makes this cohort special:

  • Learn by doing: Build real world AI applications, not just by watching videos.

  • Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.

  • Live feedback and mentorship: Get direct feedback from instructors and peers.

  • Community driven: Learning alone is hard. Learning with a community is easy!

We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.

If you want to start learning AI from scratch, this is the perfect platform for you to begin.

Check it out Here


12 Claude Code Features Every Engineer Should Know


Load Balancer vs API Gateway

Load balancers and API gateways both sit between your clients and backend servers. But they do very different things, and mixing them up causes real problems in your architecture.

A load balancer has one job: distribute traffic. Clients send HTTP(S) requests from web, mobile, or IoT apps, and the load balancer spreads those requests across multiple server instances so no single server takes all the load.

It handles:

  • Traffic distribution

  • Health checks to detect downed servers

  • Failover when something breaks

  • L4/L7 balancing depending on whether you're routing by IP or by actual HTTP content.
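The distribute-plus-health-check behavior above can be sketched in a few lines. This is a toy in-process model for illustration, not how a production L4/L7 balancer is implemented:

```python
import itertools

class RoundRobinBalancer:
    """Minimal sketch: rotate across healthy backends only."""

    def __init__(self, backends):
        self.backends = backends
        self.healthy = set(backends)       # assume all up at start
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        """Health check failed: stop routing to this backend."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Health check recovered: resume routing."""
        self.healthy.add(backend)

    def next_backend(self):
        # Skip unhealthy backends; give up after one full rotation
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")
```

A periodic health-check loop would call `mark_down` and `mark_up`, which is exactly the failover behavior described above: a downed server silently drops out of rotation and rejoins once it recovers.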

An API gateway does a lot more than that. It also receives HTTP(S) requests from the same types of clients, but instead of just forwarding traffic, it controls what gets through and how.

  • Rate limiting to prevent abuse.

  • API aggregation so your client doesn't need to call five different services.

  • Observability for logging and monitoring.

  • Authentication and authorization before a request even touches your backend.

  • Request and response transformation to reshape payloads between client and service formats.

In most production setups, the load balancer and API gateway sit together. The API gateway handles the smart work up front: rate limits, auth, and routing to the right microservice. Then the load balancer behind it distributes traffic across instances of that service.

They're not competing tools. They work best when used together.
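Of the gateway responsibilities listed above, rate limiting is the easiest to make concrete. A common approach is a per-client token bucket; the rate and capacity values here are arbitrary examples:

```python
import time

class TokenBucket:
    """Sketch of per-client rate limiting at the gateway (token bucket)."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Return True if the request may proceed, consuming one token."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would typically respond with HTTP 429
```

A gateway would keep one bucket per API key or client IP and reject requests with HTTP 429 whenever `allow()` returns False.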


What is MCP?

Model Context Protocol (MCP) is a new system introduced by Anthropic to make AI models more powerful.

It is an open standard (also being run as an open-source project) that allows AI models (like Claude) to connect to databases, APIs, file systems, and other tools without needing custom code for each new integration.

MCP follows a client-server model with 3 key components:

  1. Host: AI applications like Claude that provide the environment for AI interactions so that different tools and data sources can be accessed. The host runs the MCP Client.

  2. MCP Client: The MCP client is the component inside an AI model (like Claude) that allows it to communicate with MCP servers. For example, if the AI model wants data from PostgreSQL, the MCP client formats the request into a structured message to send to the MCP Server.

  3. MCP Server: This is the middleman that connects an AI model to an external system like PostgreSQL, Google Drive, or an API. For example, if Claude analyzes sales data from PostgreSQL, the MCP Server for PostgreSQL acts as the connector between Claude and the database.

MCP has five core building blocks (also known as primitives). They are divided between the client and server.

  1. For the clients, the building blocks are Roots (secure file access) and Sampling (ask the AI for help with a task such as generating a DB query).

  2. For the servers, there are Prompts (instructions to guide the AI), Resources (Data Objects that the AI can reference) and Tools (functions that the AI can call such as running a DB query).
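MCP messages travel as JSON-RPC 2.0. As an illustration, here is roughly what a tool-call request from client to server could look like; the tool name "query" and its arguments are hypothetical examples, not part of any particular MCP server:

```python
import json

# Illustrative MCP client -> server message (JSON-RPC 2.0), asking a
# hypothetical PostgreSQL MCP server to run its "query" tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query",
        "arguments": {
            "sql": "SELECT region, SUM(amount) FROM sales GROUP BY region"
        },
    },
}

print(json.dumps(request, indent=2))
```

The server executes the tool and returns a JSON-RPC response with the same `id`, which the client hands back to the model as context.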


REST vs gRPC

Choosing between REST and gRPC seems simple at first, but it ends up affecting how your services communicate, scale, and even break.

Both are trying to solve the same problem: how services talk to each other. But the way they approach it is different.

  1. Data format

    • REST usually uses JSON. It’s human-readable, easy to debug, and works everywhere.

    • gRPC uses Protocol Buffers (Protobuf). It’s binary, smaller in size, and faster to process.

You start noticing this difference in performance-heavy systems. JSON is convenient, but Protobuf is built for efficiency.

  2. API style

    • REST is resource-based: /users/101 with GET, POST, PUT, DELETE.

    • gRPC is method-based: GetUser(), CreateUser(), UpdateUser().

REST fits nicely for public APIs. gRPC, on the other hand, feels more like calling a function on another service.

  3. Communication model

    • REST is simple request/response. One request, one response.

    • gRPC supports more patterns: unary, server streaming, client streaming, and bidirectional streaming.

Streaming becomes really useful when you need real-time updates or long-lived connections.

  4. API contract & type safety

    • REST contracts are usually defined separately (OpenAPI/Swagger), and mismatches can still happen.

    • gRPC uses a shared .proto file with strict types and code generation.

With gRPC, both client and server come from the same definition, so you run into fewer issues during integration.

  5. Caching & browser support

    • REST works well with HTTP caching, CDNs, and browsers.

    • gRPC has limited browser support (usually via gRPC-Web) and doesn’t naturally fit with HTTP caching.


Session-Based vs JWT-Based Authentication

Every web app needs authentication. But how you manage it after login matters more than most developers realize.

There are two dominant approaches: session-based and JWT-based. They solve the same problem differently.

Session-Based Authentication: The user logs in, and the server creates a session and stores it in a session store. The client gets a session_id cookie. On every subsequent request, the browser sends that cookie, and the server looks up the session to validate it.

The state lives on the server. That's the key tradeoff. It's simple and easy to revoke, but now your backend has to manage that session store.

JWT-Based Authentication: The user logs in, and the server validates credentials, then creates and signs a token using a secret or private key. That token is sent back to the client. On every subsequent request, the client sends it as a Bearer token in the Authorization header. The server verifies the signature and reads the claims. No session store needed.

The state lives in the token itself. The server stays stateless, which makes horizontal scaling straightforward.
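The sign-then-verify flow can be sketched with nothing but the standard library. This is an HS256-style toy for illustration only; a real service should use a vetted JWT library and also validate claims such as expiry:

```python
import base64
import hashlib
import hmac
import json

def b64url(data):
    """Base64url-encode bytes without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims, secret):
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_jwt(token, secret):
    """Return the claims if the signature checks out, else None."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or signed with a different key
    padded = payload + "=" * (-len(payload) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))
```

Because verification needs only the token and the secret, any server instance can authenticate the request, which is exactly why JWTs make horizontal scaling straightforward.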

Over to you: what’s your go-to approach for auth in microservices?


A Cheat Sheet on The Most-Used Linux Commands

Linux has thousands of commands. Most engineers use roughly 20 of them every day, not because Linux is limited, but because that core set handles the bulk of actual work: navigating files, inspecting logs, debugging processes, checking system health, and fixing things under pressure.

This cheat sheet maps out the most-used Linux commands by category:

  • File management basics like ls, cd, cp, mv, and rm that you touch constantly without thinking.

  • File viewing and editing with cat, less, head, tail, nano, and vim when logs are huge and time is short.

  • Text processing with grep, awk, sort, and diff to turn raw logs into answers.

  • Permissions with chmod and chown, because something always breaks due to access issues.

  • Networking commands like ssh, scp, curl, ping, ss, and ip for debugging remote systems.

  • Process and system inspection using ps, top, htop, df, free, and uname to see what the machine is really doing.

  • Archiving, package management, system control, and help commands that glue everything together.
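As a small taste of how these commands compose during an incident, here is a typical one-liner combining grep, awk, sort, and uniq. The log file and its format are fabricated for the example:

```shell
# Fabricated sample log, then: which ERROR type appears most often?
printf 'ERROR timeout\nWARN slow\nERROR auth\nERROR timeout\n' > /tmp/app.log

# Keep ERROR lines, extract the second field, count occurrences,
# and sort by frequency (highest first).
grep ERROR /tmp/app.log | awk '{print $2}' | sort | uniq -c | sort -rn
```

Pipelines like this are why the text-processing commands punch far above their weight: each tool does one thing, and the shell glues them into an answer.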

Over to you: Which Linux command do you end up using the most during real incidents?

LAST CALL FOR ENROLLMENT: Become an AI Engineer - Cohort 5

2026-03-27 23:30:20

Our 5th cohort of Becoming an AI Engineer starts tomorrow, March 28. This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.

Check it out Here

Here’s what makes this cohort special:

  • Learn by doing: Build real world AI applications, not just by watching videos.

  • Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.

  • Live feedback and mentorship: Get direct feedback from instructors and peers.

  • Community driven: Learning alone is hard. Learning with a community is easy!

We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.

If you want to start learning AI from scratch, this is the perfect platform for you to begin.

Check it out Here

How to Implement API Security

2026-03-26 23:31:02

Most APIs that ship to production have some security in place. Most of the time, HTTPS is enabled, an API key is required, and maybe there’s even a quick code review before deployment.

By most measures, the box is checked. However, a checked box and a secure API are not the same thing. A common and costly example is an API that validates credentials correctly on every request, but never checks whether those credentials grant access to the specific resource being requested. In other words, authentication works, but there’s no proper authorization.

Such an API is not secure, yet nothing in the happy path reveals the issue until someone finds the gap.

This is what makes API security genuinely tricky. The strategies may be well-documented, but understanding when to use a particular strategy can be confusing. In this article, we will look at various API security strategies and try to understand which strategy works in which scenario.

Understanding the Threats

Read more