
RSS preview of Blog of ByteByteGo

How Netflix Built a Real-Time Distributed Graph for Internet Scale

2026-01-22 00:31:00

2026 AI predictions for builders (Sponsored)

The AI landscape is changing fast—and the way you build AI systems in 2026 will look very different.

Join us live on January 28 as we unpack the first take from Redis’ 2026 predictions report: why AI apps won’t succeed without a unified context engine.

You’ll learn:

  • One architectural standard for AI across teams

  • Lower operational overhead via shared context infrastructure

  • Predictable, production-grade performance

  • Clear observability and governance for agent data access

  • Faster time to market for new AI features

Save your spot →

Read the full 2026 predictions report →


Netflix is no longer just a streaming service. The company has expanded into live events, mobile gaming, and ad-supported subscription plans. This evolution created an unexpected technical challenge.

To understand the challenge, consider a typical member journey. Assume that a user watches Stranger Things on their smartphone, continues on their smart TV, and then launches the Stranger Things mobile game on a tablet. These activities happen at different times on different devices and involve different platform services. Yet they all belong to the same member experience.

Disclaimer: This post is based on publicly shared details from the Netflix Engineering Team. Please comment if you notice any inaccuracies.

Understanding these cross-domain journeys became critical for creating personalized experiences. However, Netflix’s architecture made this difficult.

Netflix uses a microservices architecture with hundreds of services developed by separate teams. Each service can be developed, deployed, and scaled independently, and teams can choose the best data storage technology for their needs. However, when each service manages its own data, information can become siloed. Video streaming data lives in one database, gaming data in another, and authentication data separately. Traditional data warehouses collect this information, but the data lands in different tables and processes at different times.

Manually stitching together information from dozens of siloed databases became overwhelming. Therefore, the Netflix engineering team needed a different approach to process and store interconnected data while enabling fast queries. They chose a graph representation for the following reasons:

  • First, graphs enable fast relationship traversals without expensive database joins.

  • Second, graphs adapt easily when new connections emerge without significant schema changes.

  • Third, graphs naturally support pattern detection. Identifying hidden relationships and cycles is more efficient using graph traversals than siloed lookups.

This led Netflix to build the Real-Time Distributed Graph, or RDG. In this article, we will look at the architecture of RDG and the challenges Netflix faced while developing it.

Building the Data Pipeline

The RDG consists of three layers: ingestion and processing, storage, and serving. See the diagram below:

When a member performs any action in the Netflix app, such as logging in or starting to watch a show, the API Gateway writes these events as records to Apache Kafka topics.

Apache Kafka serves as the ingestion backbone, providing durable, replayable streams that downstream processors can consume in real time. Netflix chose Kafka because it offers exactly what they needed for building and updating the graph with minimal delay. Traditional batch processing systems and data warehouses could not provide the low latency required to maintain an up-to-date graph supporting real-time applications.

The scale of data flowing through these Kafka topics is significant. For reference, Netflix’s applications consume several different Kafka topics, each generating up to one million messages per second. Records use Apache Avro format for encoding, with schemas persisted in a centralized registry. To balance data availability against storage infrastructure costs, Netflix tailors retention policies for each topic based on throughput and record size. They also persist topic records to Apache Iceberg data warehouse tables, enabling backfills when older data expires from Kafka topics.
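To make the ingestion step concrete, here is a minimal, hypothetical sketch of a playback event being published to a topic. The topic name, field names, and publish helper are illustrative stand-ins, not Netflix’s actual schema; in production the record would be Avro-encoded against a schema in the central registry.

```python
import json
import time
from collections import defaultdict

# Hypothetical stand-in for a Kafka producer: topics map to append-only lists.
TOPICS = defaultdict(list)

def publish(topic: str, key: str, value: dict) -> None:
    """Append a record to an in-memory 'topic' (illustrative only)."""
    TOPICS[topic].append((key, json.dumps(value)))

# Example playback event; field names are invented for illustration.
event = {
    "account_id": "acct-123",
    "title_id": "title-stranger-things",
    "device_type": "SMART_TV",
    "event_type": "PLAYBACK_STARTED",
    "timestamp_ms": int(time.time() * 1000),
}
publish("playback-events", key=event["account_id"], value=event)
```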

Apache Flink jobs ingest event records from the Kafka streams. Netflix chose Flink because of its strong capabilities around near-real-time event processing. There is also robust internal platform support for Flink within Netflix, which allows jobs to integrate with Kafka and various storage backends seamlessly.

A typical Flink job in the RDG pipeline follows a series of processing steps:

  • First, the job consumes event records from upstream Kafka topics.

  • Next, it applies filtering and projections to remove noise based on which fields are present or absent in the events.

  • Then it enriches events with additional metadata stored and accessed via side inputs.

The job then transforms events into graph primitives, creating nodes that represent entities like member accounts and show titles, plus edges that represent relationships or interactions between them.

After transformation, the job buffers, detects, and deduplicates overlapping updates to the same nodes and edges within a small configurable time window. This step reduces the data throughput published downstream and is implemented using stateful process functions and timers. Finally, the job publishes node and edge records to Data Mesh, an abstraction layer connecting data applications and storage systems.

See the diagram below:

For reference, Netflix writes more than five million total records per second to Data Mesh, which handles persisting the records to data stores that other internal services can query.
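To illustrate the buffering and deduplication step described above, here is a rough, conceptual sketch in plain Python. It assumes each update carries a kind ("node" or "edge"), an id, a timestamp, and some properties; the real pipeline implements this with Flink stateful process functions and timers, so this only models the logic, not the implementation.

```python
from typing import Iterable

def dedupe_window(updates: Iterable[dict], window_ms: int = 5_000) -> list[dict]:
    """Collapse overlapping updates to the same node/edge within a time window,
    keeping the latest properties seen for each key (conceptual model only)."""
    buffered: dict[tuple, dict] = {}
    window_start = None
    out: list[dict] = []

    for update in sorted(updates, key=lambda u: u["timestamp_ms"]):
        ts = update["timestamp_ms"]
        if window_start is None:
            window_start = ts
        if ts - window_start >= window_ms:
            # The "timer fires": flush the buffer downstream and start a new window.
            out.extend(buffered.values())
            buffered.clear()
            window_start = ts
        key = (update["kind"], update["id"])   # e.g. ("edge", "acct-123|title-42")
        buffered[key] = {**buffered.get(key, {}), **update}  # later update wins

    out.extend(buffered.values())
    return out
```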

Learning Through Failure

Netflix initially tried one Flink job consuming all Kafka topics. Different topics have vastly different volumes and throughput patterns, which made tuning a single job impossible. They pivoted to a one-to-one mapping from topic to job. While this added operational overhead, each job became simpler to maintain and tune.

Similarly, each node and edge type gets its own Kafka topic. Though this means more topics to manage, Netflix valued the ability to tune and scale each independently. They designed the graph data model to be flexible, making new node and edge types infrequent additions.

The Storage Challenge

After creating billions of nodes and edges from member interactions, Netflix faced the critical question of how to actually store them.

The RDG uses a property graph model. Nodes represent entities like member accounts, titles, devices, and games. Each node has a unique identifier and properties containing additional metadata. Edges represent relationships between nodes, such as started watching, logged in from, or plays. Edges also have unique identifiers and properties like timestamps.

See the diagram below:

When a member watches a particular show, the system might create an account node with properties like creation date and plan type, a title node with properties like title name and runtime, and a started watching edge connecting them with properties like the last watch timestamp.

This simple abstraction allows Netflix to represent incredibly complex member journeys across its ecosystem.
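As an illustration of the property graph model, here is a minimal sketch of the node and edge primitives. The field names and example values are assumptions for illustration; Netflix’s exact schema is not public.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str                     # unique identifier, e.g. "account:123"
    node_type: str                   # "account", "title", "device", "game", ...
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    edge_id: str
    edge_type: str                   # "started_watching", "logged_in_from", "plays"
    origin_id: str                   # node_id of the source node
    destination_id: str              # node_id of the target node
    properties: dict = field(default_factory=dict)

account = Node("account:123", "account", {"plan": "ads", "created": "2024-06-01"})
title = Node("title:stranger-things", "title", {"runtime_min": 51})
watch = Edge("e:1", "started_watching", account.node_id, title.node_id,
             {"last_watched_ms": 1737500000000})
```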

Why Traditional Graph Databases Failed

The Netflix engineering team evaluated traditional graph databases like Neo4j and AWS Neptune. While these systems provide feature-rich capabilities around native graph query support, they posed a mix of scalability, workload, and ecosystem challenges that made them unsuitable for Netflix’s needs.

  • Native graph databases struggle to scale horizontally for large, real-time datasets. Their performance typically degrades with increased node and edge count or query depth.

  • In early evaluations, Neo4j performed well for millions of records but became inefficient for hundreds of millions due to high memory requirements and limited distributed capabilities.

  • AWS Neptune has similar limitations due to its single-writer, multiple-reader architecture, which presents bottlenecks when ingesting large data volumes in real time, especially across multiple regions.

These systems are also not inherently designed for the continuous, high-throughput event streaming workloads critical to Netflix operations. They frequently struggle with query patterns involving full dataset scans, property-based filtering, and indexing.

Most importantly for Netflix, the company has extensive internal platform support for relational and document databases compared to graph databases. Non-graph databases are also easier for them to operate. Netflix found it simpler to emulate graph-like relationships in existing data storage systems rather than adopting specialized graph infrastructure.

The KVDAL Solution

The Netflix engineering team turned to KVDAL, the Key-Value Data Abstraction Layer from their internal Data Gateway Platform. Built on Apache Cassandra, KVDAL provides high availability, tunable consistency, and low latency without direct management of underlying storage.

See the diagram below:

KVDAL uses a two-level map architecture. Data is organized into records uniquely identified by a record ID. Each record contains sorted items, where an item is a key-value pair. To query KVDAL, you look up a record by ID, then optionally filter items by their keys. This gives both efficient point lookups and flexible retrieval of related data.

For nodes, the unique identifier becomes the record ID, with all properties stored as a single item. For edges, Netflix uses adjacency lists. The record ID represents the origin node, while items represent all destination nodes it connects to. If an account has watched multiple titles, the adjacency list contains one item per title with properties like timestamps.

This format is vital for graph traversals. To find all titles a member watched, Netflix retrieves the entire record with one KVDAL lookup. They can also filter by specific titles using key filtering without fetching unnecessary data.
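The following toy model sketches the two-level map and the adjacency-list layout using plain Python dictionaries. It is only a conceptual stand-in: the real KVDAL runs on Cassandra and exposes a much richer API.

```python
from typing import Optional

class KeyValueNamespace:
    """Toy model of a KVDAL-style namespace: record_id -> sorted {item_key: value}."""

    def __init__(self) -> None:
        self.records: dict[str, dict[str, dict]] = {}

    def put_item(self, record_id: str, item_key: str, value: dict) -> None:
        # Writing the same key again overwrites it, keeping properties current.
        self.records.setdefault(record_id, {})[item_key] = value

    def get_record(self, record_id: str,
                   key_prefix: Optional[str] = None) -> dict[str, dict]:
        items = self.records.get(record_id, {})
        if key_prefix is not None:                # key filtering avoids extra data
            items = {k: v for k, v in items.items() if k.startswith(key_prefix)}
        return dict(sorted(items.items()))        # items are kept in key order

# Edges as adjacency lists: the record is the origin node, items are destinations.
watched = KeyValueNamespace()
watched.put_item("account:123", "title:stranger-things",
                 {"last_watched_ms": 1737500000000})
watched.put_item("account:123", "title:wednesday",
                 {"last_watched_ms": 1737300000000})

# One lookup returns every title this account watched.
print(watched.get_record("account:123"))
```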

Managing Data Lifecycle

As Netflix ingests real-time streams, KVDAL creates new records for new nodes or edges. When an edge exists with an existing origin but a new destination, it creates a new item in the existing record. When ingesting the same node or edge multiple times, KVDAL overwrites existing values, keeping properties like timestamps current. KVDAL can also automatically expire data on a per-namespace, per-record, or per-item basis, providing fine-grained control while limiting graph growth.

Namespaces Enable Massive Scale

Namespaces are the most powerful KVDAL feature Netflix leveraged. A namespace is like a database table, a logical grouping of records that defines physical storage while abstracting underlying system details.

You can start with all namespaces backed by one Cassandra cluster. If one namespace needs more storage or traffic capacity, you can move it to its own cluster for independent management. Different namespaces can use entirely different storage technologies. Low-latency data might use Cassandra with EVCache caching. High-throughput data might use dedicated clusters per namespace.

KVDAL can scale to trillions of records per namespace with single-digit millisecond latencies. Netflix provisions a separate namespace for every node type and edge type. While seemingly excessive, this enables independent scaling and tuning, flexible storage backends per namespace, and operational isolation where issues in one namespace do not impact others.

Conclusion

The numbers demonstrate real-world performance. Netflix’s graph has over eight billion nodes and more than 150 billion edges. The system sustains approximately two million reads per second and six million writes per second. This runs on a KVDAL cluster with roughly 27 namespaces, backed by around 12 Cassandra clusters across 2,400 EC2 instances.

These numbers are not limits. Every component scales linearly. As the graph grows, Netflix can add more namespaces, clusters, and instances.

Netflix’s RDG architecture offers important lessons.

  • Sometimes the right solution is not the obvious one. Netflix could have used purpose-built graph databases, but chose to emulate graph capabilities using key-value storage based on operational realities like internal expertise and existing platform support.

  • Scaling strategies evolve through experimentation. Netflix’s monolithic Flink job failed. Only through experience did they discover that one-to-one topic-to-job mapping worked better despite added complexity.

  • Isolation and independence matter at scale. Separating each node and edge type into its own namespace enabled independent tuning and reduced issue blast radius.

  • Building on proven infrastructure pays dividends. Rather than adopting new systems, Netflix leveraged battle-tested technologies like Kafka, Flink, and Cassandra, building abstractions to meet their needs while benefiting from maturity and operational expertise.

The RDG enables Netflix to analyze member interactions across its expanding ecosystem. As the business evolves with new offerings, this flexible architecture can adapt without requiring significant re-architecture.


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

How Pinterest Built An Async Compute Platform for Billions of Task Executions

2026-01-21 00:31:20

The real benefits of end-to-end observability (Sponsored)

How does full-stack observability impact engineering speed, incident response, and cost control?

In this eBook from Datadog, you’ll learn how real teams across industries are using observability to:

  • Reduce mean time to resolution (MTTR)

  • Cut tooling costs and improve team efficiency

  • Align business and engineering KPIs

See how unifying your stack leads to faster troubleshooting and long-term operational gains.

Download the eBook


When Pinterest’s engineering team built their asynchronous job processing platform called Pinlater a few years ago, it seemed like a solid solution for handling background tasks at scale. The platform was processing billions of jobs every day, powering everything from saving Pins to sending notifications to processing images and videos.

However, as Pinterest continued to grow, the cracks in Pinlater’s foundation became impossible to ignore.

Ultimately, Pinterest had to perform a complete architectural overhaul of its job processing system. The new version is known as Pacer. In this article, we will look at how Pacer was built and how it works.

Disclaimer: This post is based on publicly shared details from the Pinterest Engineering Team. Please comment if you notice any inaccuracies.

What Asynchronous Job Processing Does

Before examining what went wrong with Pinlater and how Pacer fixed it, we need to understand what these systems actually do.

When you save a Pin on Pinterest, several things need to happen. The Pin needs to be added to your board, other users may need to be notified, the image might need processing, and analytics need updating. Not all of these tasks need to happen instantly. Some can wait a few seconds or even minutes without anyone noticing.

This is where asynchronous job processing comes in. When you click save, Pinterest immediately confirms the action, but the actual work gets added to a queue to be processed later. This approach keeps the user interface responsive while ensuring the work eventually gets done. See the diagram below that shows an async processing approach on a high level:

The job queue system needs to store these tasks reliably, distribute them to worker machines for execution, and handle failures gracefully. At Pinterest’s scale, this means managing billions of jobs flowing through the system every single day.
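Here is a minimal, generic sketch of that enqueue-then-process pattern using Python’s standard library. It is not Pinterest’s API; the job types and handler are invented for illustration.

```python
import queue
import threading
import time

job_queue: "queue.Queue[dict]" = queue.Queue()

def save_pin(user_id: str, pin_id: str) -> str:
    """Confirm the user-facing action immediately; defer the follow-up work."""
    job_queue.put({"type": "notify_followers", "user_id": user_id, "pin_id": pin_id})
    job_queue.put({"type": "update_analytics", "pin_id": pin_id})
    return "saved"                      # returned to the user right away

def worker() -> None:
    while True:
        job = job_queue.get()           # blocks until a job is available
        try:
            time.sleep(0.01)            # stand-in for the real work
            print("processed", job["type"])
        finally:
            job_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
save_pin("user-1", "pin-42")
job_queue.join()                        # wait for background work in this demo
```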

Why Pinlater Started Struggling

The Pinterest engineering team built Pinlater around three main components.

  • A stateless Thrift service acted as the front door, accepting job submissions and coordinating their retrieval.

  • A backend datastore (likely MySQL, based on the available context) persisted all the job data.

  • Worker pools continuously pulled jobs from the system, executed them, and reported back whether they succeeded or failed.

See the diagram below:

This architecture worked well initially, but several problems emerged as Pinterest’s traffic grew. The most critical issue was lock contention in the database. Pinterest had sharded their database across multiple servers to handle the data volume. When a job queue was created, Pinlater created a partition for that queue in every single database shard. This meant that if you had ten database shards, every queue had ten partitions scattered across them.

When workers needed jobs to execute, the Thrift service had to scan all the shards simultaneously because it did not know in advance which shards contained ready jobs. This scanning happened from multiple Thrift servers running in parallel to handle Pinterest’s request volume. The result was dozens of threads from different Thrift servers all trying to read from the same database partitions at the same time.

Databases handle concurrent access using locks. When multiple threads try to read the same data, the database coordinates this access to prevent corruption. One thread gets the lock and proceeds while others wait in line. At Pinterest’s scale, the database was spending more resources managing these locks and coordinating access than actually retrieving data. As Pinterest added more Thrift servers to handle growing traffic, the lock contention worsened.

The second major issue was the complete lack of isolation between different job queues. Multiple queues with vastly different characteristics all ran on the same worker machines. A queue processing CPU-intensive image transformations shared resources with a queue sending simple notification emails. If one queue had a bug that crashed the worker process, it took down all the other queues running on that machine. Performance tuning was nearly impossible because different workloads needed different hardware configurations.

The third problem was that different operations in the system had very different reliability requirements, but they all shared the same infrastructure. Enqueueing jobs was part of critical user-facing flows. If the enqueue operations failed, users would notice immediately. Dequeue operations, on the other hand, just determined how quickly jobs got processed. A brief delay in dequeuing meant jobs took a few extra seconds to run, which was usually acceptable. However, both operations competed for resources on the same Thrift servers, meaning less critical operations could impact critical ones.

Finally, the way Pinlater partitioned data across shards was wasteful. Even tiny queues with minimal traffic got partitions in every database shard. The metrics showed that more than half of all database queries returned empty results because they were scanning partitions that held no data. The system also could not support FIFO ordering across an entire queue because jobs were processed from multiple shards simultaneously, with no way to maintain global ordering.

The Pacer Solution

Rather than trying to optimize around these problems, the Pinterest engineering team decided to rebuild the system from scratch. Pacer introduced several new components and fundamentally changed how data flowed through the system.

The biggest change was the introduction of a dedicated dequeue broker service. This stateful service layer sits between the workers and the database, and it changes how jobs are retrieved. Instead of many Thrift servers all competing to read from the database, each partition of each queue is assigned to exactly one dequeue broker instance. This assignment is managed by Helix, a cluster management framework that Pinterest integrated into the system.

See the diagram below:

Here is how the assignment process works:

  • When a queue partition is created or modified, the configuration is stored in Zookeeper, a distributed coordination service.

  • The Helix controller monitors Zookeeper and detects new or changed partitions.

  • Helix computes which dequeue broker should own that partition based on the current cluster state.

  • The assignment is written back to the Zookeeper instance.

  • The designated broker receives notification and begins managing that partition.

  • If a broker fails, Helix detects this and reassigns its partitions to healthy brokers.

See the diagram below:

The assignment approach eliminates the lock contention problem. Since only one broker ever reads from a given partition, there is no competition at the database level.

However, the dequeue broker does more than just eliminate contention. It also improves performance through caching. Each broker proactively fetches jobs from its assigned partitions and stores them in in-memory buffers. These buffers are implemented as thread-safe queues. When workers need jobs, they request them from the broker’s memory rather than querying the database. Memory access is orders of magnitude faster than database queries, and Pinterest’s metrics show dequeue latency dropped to under one millisecond.
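Here is a simplified sketch of the dequeue-broker idea: each partition is owned by exactly one broker, which prefetches jobs from the datastore into an in-memory, thread-safe buffer that workers drain. The class and the fake datastore below are illustrative assumptions, not Pinterest’s implementation, which relies on Helix for ownership and a sharded database behind the broker.

```python
import queue
from typing import Callable, Optional

class DequeueBroker:
    """One broker owns a set of partitions and buffers their jobs in memory."""

    def __init__(self, fetch: Callable[[str, int], list], buffer_size: int = 1000) -> None:
        self.fetch = fetch
        self.buffers: dict[str, queue.Queue] = {}
        self.buffer_size = buffer_size

    def assign_partition(self, partition: str) -> None:
        # In the real system, Helix decides which broker owns which partition.
        self.buffers[partition] = queue.Queue(maxsize=self.buffer_size)

    def prefetch(self, partition: str) -> None:
        # Only this broker ever reads the partition, so there is no lock contention.
        buf = self.buffers[partition]
        for job in self.fetch(partition, buf.maxsize - buf.qsize()):
            buf.put(job)

    def dequeue(self, partition: str) -> Optional[dict]:
        # Workers are served from memory instead of querying the database.
        try:
            return self.buffers[partition].get_nowait()
        except queue.Empty:
            return None

# Illustrative use with a fake in-memory datastore.
fake_db = {"queue_a.p0": [{"job_id": i} for i in range(5)]}

def fetch_from_db(partition: str, limit: int) -> list:
    pending = fake_db.get(partition, [])
    batch, fake_db[partition] = pending[:limit], pending[limit:]
    return batch

broker = DequeueBroker(fetch_from_db)
broker.assign_partition("queue_a.p0")
broker.prefetch("queue_a.p0")
print(broker.dequeue("queue_a.p0"))   # {'job_id': 0}
```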

See the diagram below:

Pinterest also completely rethought how queues are partitioned across database shards. In Pinlater, every queue had partitions in every shard, regardless of its size or traffic. Pacer takes a different approach.

  • Small queues with light traffic might get just a single partition in a single shard.

  • Large, high-traffic queues get multiple partitions distributed across shards based on their actual needs.

This adaptive sharding eliminated the resource waste that was a problem in Pinlater. The percentage of empty query results dropped from over 50% to nearly zero.

See the diagram below:

This flexible partitioning also enabled new features. FIFO ordering, which was impossible in Pinlater, became possible in Pacer. A queue that needs strict ordering can be configured with a single partition, guaranteeing that jobs are processed in the exact order they were submitted.
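A small sketch of the adaptive-partitioning idea follows: the partition count scales with a queue’s traffic, and a queue that needs strict FIFO is pinned to a single partition. The sizing formula and capacity figure are invented for illustration; Pacer’s actual heuristics are not public.

```python
import math

def partition_count(enqueue_rate_per_sec: float,
                    per_partition_capacity: float = 500.0,
                    fifo_required: bool = False) -> int:
    """Illustrative sizing rule: enough partitions for the traffic, or exactly
    one partition when global FIFO ordering must be preserved."""
    if fifo_required:
        return 1
    return max(1, math.ceil(enqueue_rate_per_sec / per_partition_capacity))

print(partition_count(50))                           # small queue -> 1 partition
print(partition_count(25_000))                       # busy queue  -> 50 partitions
print(partition_count(25_000, fifo_required=True))   # FIFO queue  -> 1 partition
```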

Also, for job execution, the Pinterest engineering team moved from shared worker pools to dedicated pods running on Kubernetes. Each queue now gets its own isolated worker environment with custom resource allocations. For example, an image processing queue can be configured with high CPU and moderate memory. A notification queue can use low CPU and memory but high concurrency settings. One queue’s problems cannot affect others, and each queue can be tuned for optimal performance on hardware matched to its specific needs.

The separation of concerns extends to the request path as well. In Pacer, the Thrift service handles only job submission. This critical, user-facing path is completely isolated from dequeue operations. Even if the dequeue brokers experience problems, users can still submit jobs without noticing any issues. The jobs might take longer to process, but the submission itself remains fast and reliable.

Conclusion

The migration from Pinlater to Pacer delivered measurable improvements across multiple dimensions.

  • Lock contention, which had been a growing problem in the database, completely disappeared.

  • Hardware efficiency improved significantly for both database servers and worker machines.

  • Jobs execute faster due to the isolated and customizable runtime environments.

  • The system can now scale linearly by adding more brokers as partition counts increase, without hitting the scalability ceiling that limited Pinlater.

From a system design perspective, the Pacer architecture demonstrates several important principles.

  • First, introducing specialized components can solve multiple problems simultaneously. The dequeue broker was added specifically to eliminate lock contention, but it also improved latency, enabled better caching, and allowed for isolation of the critical enqueue path.

  • Second, stateful services have their place in distributed systems despite the general preference for stateless designs. The dequeue brokers are stateful by necessity because they maintain memory buffers and have affinity to specific partitions. This statefulness, properly managed by Helix and Zookeeper, is what makes the architecture work.

  • Third, caching at the right layer can provide enormous performance benefits. Rather than trying to cache at the database level or in the Thrift service, Pinterest put the cache in the component that serves jobs to workers.

  • Fourth, isolation prevents cascading failures and enables optimization. By giving each queue its own resources, Pinterest eliminated an entire class of cross-queue impact problems and made performance tuning tractable.

  • Finally, not all data needs to be partitioned the same way. Adaptive sharding based on actual usage patterns is more efficient than one-size-fits-all approaches.


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

Why AI Needs GPUs and TPUs: The Hardware Behind LLMs

2026-01-20 00:30:29

From quick sync to quick ship (Sponsored)

Most AI notetakers just transcribe.

Granola is the AI notepad that helps you stay focused in meetings, then turns your conversations into real progress.

Engineers use Granola to:

  • Draft Linear/Jira tickets from standup notes

  • Paste code into Granola Chat and instantly receive a pass/fail verdict against the requirements that were discussed in your meeting.

  • Search across weeks of conversations in project folders

  • Create custom prompts for recurring meetings (TL;DR, decision log, action items) with Recipes

Granola works with your device audio, so no bots join your meetings. Use with any meeting app, like Zoom, Meet or Teams, and in-person with Granola for iPhone.

Try for free

1 month off any paid plan with the code BYTEBYTEGO


When we write a few lines of prompts to interact with a large language model, we receive sonnets, debugging suggestions, or complex analyses almost instantly.

This software-centric view can obscure a fundamental reality: artificial intelligence is not just a software problem. It is a physics problem involving the movement of electrons through silicon and the challenge of moving massive amounts of data between memory and compute units.

Sophisticated AI tools like LLMs cannot be built using just CPUs. The CPU was designed for logic, branching decisions, and serial execution, while deep learning requires linear algebra, massive parallelism, and probabilistic operations.

In this article, we will look at how GPUs and TPUs help build modern LLMs and the architecture behind them.

AI is a Math Factory

At its core, every neural network performs one fundamental operation billions of times: matrix multiplication.

When we ask an LLM a question, our words are converted into numbers that flow through hundreds of billions of multiply-add operations. A single forward pass through a 70-billion-parameter model requires over 140 trillion floating-point operations.

The mathematical structure is straightforward. Each layer performs Y = W * X + B, where X represents input data, W contains learned parameters, B is a bias vector, and Y is the output. When we scale this to billions of parameters, we are performing trillions of simple multiplications and additions.
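Here is that layer computation in a few lines of NumPy, with tiny toy sizes standing in for the billions of parameters in a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, batch = 1024, 4096, 8          # toy sizes; real layers are far larger
X = rng.standard_normal((batch, d_in))      # input activations
W = rng.standard_normal((d_in, d_out))      # learned weights
B = rng.standard_normal(d_out)              # bias vector

Y = X @ W + B                               # the Y = W * X + B step for one layer

# Rough operation count: one multiply and one add per weight per input row.
flops = 2 * batch * d_in * d_out
print(Y.shape, f"{flops:,} floating-point operations")
```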

What makes this workload special is its reliance on parallel computation. Each multiplication in a matrix operation is completely independent. Computing row 1 multiplied by column 1 does not require waiting for row 2 multiplied by column 2. We can split the work across thousands of processors with zero communication overhead during computation.

The Transformer architecture amplifies this parallelism. The self-attention mechanism calculates relationship scores between every token and every other token. As a reference, for a 4,096-token context window, this creates over 16 million attention pairs. Each transformer layer performs several major matrix multiplications, and a 70-billion-parameter model can execute millions of these operations per forward pass.

Why CPUs Fall Short

The CPU excels at tasks requiring complex logic and branching decisions. Modern CPUs contain sophisticated mechanisms designed for unpredictable code paths, but neural networks do not need these features.

Branch prediction machinery achieves 93-97% accuracy in guessing conditional statement outcomes, consuming significant silicon area. Neural networks, however, have almost no branching. They execute the same operations billions of times with predictable patterns.

Out-of-order execution reorders instructions to keep processors busy while waiting for data. Matrix multiplication has perfectly predictable access patterns that do not benefit from this complexity. Large cache hierarchies (L1, L2, L3) hide memory latency for random access, but neural network data flows sequentially through memory.

This means that only a small percentage of a CPU die is dedicated to arithmetic. Most transistor budget goes to control units managing out-of-order execution, branch prediction, and cache coherency. When running an LLM, these billions of transistors sit idle, consuming power and occupying space that could have been used for arithmetic units.

Beyond the computational inefficiency, CPUs face an even more fundamental limitation: the Memory Wall. This term describes the widening gap between processor speed and memory access speed. Large language models are massive. A 70-billion parameter model stored in 16-bit precision occupies roughly 140 gigabytes of memory. To generate a single token, the processor must read every single one of those parameters from memory to perform the necessary matrix multiplications.

Traditional computers follow the Von Neumann architecture, where a processor and memory communicate through a shared bus. To perform any calculation, the CPU must fetch an instruction, retrieve data from memory, execute the operation, and write results back. This constant transfer of information between the processor and memory creates what computer scientists call the Von Neumann bottleneck.

No amount of increasing core count or clock speed can solve this problem. The bottleneck is not the arithmetic operations but the rate at which data can be delivered to the processor. This is why memory bandwidth, not compute power, often determines LLM performance.

How GPUs Solve the Problem

The Graphics Processing Unit was originally designed to render video games. The mathematical requirements of rendering millions of pixels are remarkably similar to deep learning since both demand massive parallelism and high-throughput floating-point arithmetic.

NVIDIA’s GPU architecture uses SIMT (Single Instruction, Multiple Threads). The fundamental unit is a group of 32 threads called a warp. All threads in a warp share a single instruction decoder, executing the same instruction simultaneously. This shared control unit saves massive silicon area, which is filled with thousands of arithmetic units instead.

While modern CPUs have 16 to 64 complex cores, the NVIDIA H100 contains nearly 17,000 simpler cores. These run at lower clock speeds (1-2 GHz versus 3-6 GHz), but massive parallelism compensates for slower individual operations.

Standard GPU cores execute operations on single numbers, one at a time per thread. Recognizing that AI workloads are dominated by matrix operations, NVIDIA introduced Tensor Cores starting with their Volta architecture. A Tensor Core is a specialized hardware unit that performs an entire small matrix multiply-accumulate operation in a single clock cycle. While a standard core completes one multiply-add per cycle, a Tensor Core executes a 4×4 matrix multiply-accumulate, which amounts to 64 fused multiply-add operations, in a single cycle. This represents a roughly 64-fold improvement in throughput for matrix operations.

Tensor Cores also support mixed-precision arithmetic, which is crucial for practical AI deployment. They can accept inputs in lower precision formats like FP16 or BF16 (using half the memory of FP32) while accumulating results in higher precision FP32 to preserve numerical accuracy. This combination increases throughput and reduces memory requirements without sacrificing the precision needed for stable model training and accurate inference.
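The snippet below mirrors the mixed-precision pattern in NumPy: inputs stored in FP16, multiply-accumulate carried out in FP32. This runs on a CPU and only illustrates the numerics and memory savings, not Tensor Core hardware.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs stored in half precision (FP16): half the memory of FP32.
A = rng.standard_normal((256, 256)).astype(np.float16)
B = rng.standard_normal((256, 256)).astype(np.float16)

# Mixed precision: keep the compact FP16 storage, but perform the
# multiply-accumulate in FP32 to preserve numerical accuracy.
C = A.astype(np.float32) @ B.astype(np.float32)

print(A.nbytes, "bytes per input matrix in FP16")   # 256*256*2 = 131,072
print(C.dtype, C.shape)                             # float32 result
```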

To feed these thousands of compute units, GPUs use High Bandwidth Memory (HBM). Unlike DDR memory that sits on separate modules plugged into the motherboard, HBM consists of DRAM dies stacked vertically on top of each other using through-silicon vias (microscopic vertical wires). These stacks are placed on a silicon interposer directly adjacent to the GPU die, minimizing the physical distance data must travel.

Such an architecture allows GPUs to achieve memory bandwidths exceeding 3,350 GB/s on the H100, more than 20 times faster than CPUs. With this bandwidth, an H100 can load a 140 GB model in roughly 0.04 seconds, enabling token generation speeds of 20 or more tokens per second. This is the difference between a stilted, frustrating interaction and a natural conversational pace.
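A quick back-of-the-envelope check of these numbers: generating one token requires streaming every weight through the chip once, so the token rate is roughly memory bandwidth divided by model size. The CPU bandwidth figure below is a rough, assumed value used only for comparison.

```python
model_bytes = 70e9 * 2            # 70B parameters at 2 bytes each (FP16) ~ 140 GB
hbm_bandwidth = 3.35e12           # ~3,350 GB/s on an H100
ddr_bandwidth = 0.1e12            # ~100 GB/s for a typical CPU socket (rough figure)

for name, bw in [("HBM (H100)", hbm_bandwidth), ("CPU DDR", ddr_bandwidth)]:
    seconds_per_token = model_bytes / bw   # every weight is read once per token
    print(f"{name}: {seconds_per_token:.3f} s/token, "
          f"{1 / seconds_per_token:.1f} tokens/s")
```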

The combination of massive parallel computing and extreme memory bandwidth makes GPUs the dominant platform for AI workloads.

TPUs: Google’s Specialized Approach

In 2013, Google calculated that if every user utilized voice search for just three minutes daily, they would need to double their datacenter capacity using CPUs. This led to the Tensor Processing Unit.

See the diagram below:

The defining feature is the systolic array, a 256 × 256 grid of interconnected arithmetic units (65,536 in total). Weights are loaded into the array and remain fixed while input data flows horizontally. Each unit multiplies its stored weight by the incoming data, adds the product to a running sum flowing vertically, and passes both values to its neighbor.

This design means intermediate values never touch main memory. Reading from DRAM consumes roughly 200 times more energy than multiplication. By keeping results flowing between adjacent processors, systolic arrays eliminate most memory access overhead, achieving 30 to 80 times better performance per watt than CPUs.
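Here is a tiny, un-pipelined model of that weight-stationary dataflow: weights sit fixed in the grid, inputs stream across the rows, and partial sums accumulate down the columns without ever returning to main memory. A real systolic array pipelines this across clock cycles; the sketch captures only the data movement.

```python
def systolic_matvec(weights: list[list[float]], x: list[float]) -> list[float]:
    """Compute y = x . W with a weight-stationary grid: each cell multiplies its
    fixed weight by the input flowing past and adds it to the partial sum
    flowing down its column, so intermediate values never leave the array."""
    n_cols = len(weights[0])
    partial_sums = [0.0] * n_cols              # one running sum per column
    for row, x_i in enumerate(x):              # input element streams across row `row`
        for col in range(n_cols):
            partial_sums[col] += weights[row][col] * x_i
    return partial_sums

W = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
print(systolic_matvec(W, [1.0, 0.5, 2.0]))     # [12.5, 16.0]
```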

Google’s TPU has no caches, branch prediction, out-of-order execution, or speculative prefetching. This extreme specialization means TPUs cannot run general code, but for matrix operations, the efficiency gains are substantial. Google also introduced bfloat16, which uses 8 bits for exponent (matching FP32 range) and 7 bits for mantissa. Neural networks tolerate low precision but require a wide range, making this format ideal.

Conclusion

Understanding hardware differences has direct practical implications.

Training and inference have fundamentally different requirements.

  • Training stores parameters, gradients, and optimizer states. With mixed-precision training and an Adam-style optimizer, total memory commonly reaches roughly 16 to 20 bytes per parameter (see the arithmetic sketch after this list). For example, training LLaMA 3.1 with 405 billion parameters required 16,000 H100 GPUs with 80 GB each.

  • Inference is more forgiving. It skips backpropagation, requiring fewer operations. This is why we can run 7-billion parameter models on consumer GPUs inadequate for training.
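As a rough arithmetic sketch, the estimates below assume mixed-precision training with an Adam-style optimizer (about 16 bytes per parameter: 2 for FP16 weights, 2 for gradients, and 12 for FP32 master weights and optimizer moments) and FP16 weights only for inference. Activations and KV caches are ignored, so treat these as lower bounds.

```python
def training_memory_gb(params: float, bytes_per_param: float = 16) -> float:
    """Approximate training footprint, excluding activations."""
    return params * bytes_per_param / 1e9

def inference_memory_gb(params: float, bytes_per_param: float = 2) -> float:
    """Approximate inference footprint for FP16 weights only."""
    return params * bytes_per_param / 1e9

for n in (7e9, 70e9, 405e9):
    print(f"{n / 1e9:.0f}B params: ~{training_memory_gb(n):,.0f} GB to train, "
          f"~{inference_memory_gb(n):,.0f} GB of weights to serve")
```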

Batch processing matters for efficiency. GPUs achieve peak performance by processing multiple inputs simultaneously. Each additional input amortizes the cost of loading weights. Single-request inference often underutilizes parallel hardware.

The transition from CPU to GPU and TPU represents a fundamental shift in computing philosophy. The CPU embodies an era of logic and sequential operations, optimized for low latency. GPUs and TPUs represent an era of data transformation through probabilistic operations. These are specialized engines of linear algebra that achieve results through overwhelming parallel arithmetic.


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

EP198: Best Resources to Learn AI in 2026

2026-01-18 00:30:31

Real-Time AI at Scale Masterclass: Virtual Masterclass (Sponsored)

Learn strategies for low-latency feature stores and vector search

This masterclass demonstrates how to keep latency predictably low across common real-time AI use cases. We’ll dig into the challenges behind serving fresh features, handling rapidly evolving embeddings, and maintaining consistent tail latencies at scale. The discussion spans how to build pipelines that support real-time inference, how to model and store high-dimensional vectors efficiently, and how to optimize for throughput and latency under load.

You will learn how to:

  • Build end-to-end pipelines that keep both features and embeddings fresh for real-time inference

  • Design feature stores that deliver consistent low-latency access at extreme scale

  • Run vector search workloads with predictable performance—even with large datasets and continuous updates

Register for Free


This week’s system design refresher:

  • Best Resources to Learn AI in 2026

  • The Pragmatic Summit

  • Why Prompt Engineering Makes a Big Difference in LLMs?

  • Modern Storage Systems

  • 🚀 Become an AI Engineer Cohort 3 Starts Today!


Best Resources to Learn AI in 2026

AI learning resources can be divided into several types:

  1. Foundational and Modern AI Books
    Books like AI Engineering, Machine Learning System Design Interview, Generative AI System Design Interview, and Designing Machine Learning Systems cover both principles and practical system patterns.

  2. Research and Engineering Blogs
    Follow OpenAI Research, Anthropic Engineering, DeepMind Blog, and AI2 to stay current with new architectures and applied research.

  3. Courses and YouTube Channels
    Courses like Stanford CS229 and CS230 build solid ML foundations. YouTube channels such as Two Minute Papers and ByteByteAI offer concise, visual learning on cutting-edge topics.

  4. AI Newsletters
    Subscribe to The Batch (DeepLearning.AI), ByteByteGo, Rundown AI, and Ahead of AI to learn about major AI updates, model releases, and research highlights.

  5. Influential Research Papers
    Key papers include Attention Is All You Need, Scaling Laws for Neural Language Models, InstructGPT, BERT, and DDPM. Each represents a major shift in how modern AI systems are built and trained.

Over to you: Which other AI resources will you add to the list?


The Pragmatic Summit

I’ll be talking with Sualeh Asif, the cofounder of Cursor, about lessons from building Cursor at the Pragmatic Summit.

If you’re attending, I’d love to connect while we’re there.

📅 February 11
📍 San Francisco, CA

Check it out here


Why Prompt Engineering Makes a Big Difference in LLMs?

LLMs are powerful, but their answers depend on how the question is asked. Prompt engineering adds clear instructions that set goals, rules, and style. This turns vague questions and tasks into clear, well-defined prompts.

What are the key prompt engineering techniques?

  1. Few-shot Prompting: Include a few (input → output) example pairs in the prompt to teach the pattern.

  2. Zero-shot Prompting: Give a precise instruction without examples to state the task clearly.

  3. Chain-of-thought (CoT) Prompting: Ask for step-by-step reasoning before the final answer. This can be zero-shot, where we explicitly include “Think step by step” in the instruction, or few-shot, where we show some examples with step-by-step reasoning.

  4. Role-specific Prompting: Assign a persona, like “You are a financial advisor,” to set context for the LLM.

  5. Prompt Hierarchy: Define system, developer, and user instructions with different levels of authority. System prompts define high-level goals and set guardrails, while developer prompts define formatting rules and customize the LLM’s behavior. A short sketch of a few of these techniques follows this list.
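Here is a small sketch of how a few of these techniques look as actual prompt messages. The message format is the common system/user chat structure; no specific model or API is assumed.

```python
# Role-specific system prompt plus a zero-shot instruction.
zero_shot = [
    {"role": "system", "content": "You are a financial advisor. Be concise and state your assumptions."},
    {"role": "user", "content": "Classify this expense as 'need' or 'want': gym membership."},
]

# Few-shot: show (input -> output) example pairs so the model learns the pattern.
few_shot = [
    {"role": "user", "content": (
        "Classify the sentiment as positive or negative.\n"
        "Review: 'Loved it' -> positive\n"
        "Review: 'Total waste of money' -> negative\n"
        "Review: 'Exceeded my expectations' ->"
    )},
]

# Chain-of-thought: explicitly ask for step-by-step reasoning before the answer.
chain_of_thought = [
    {"role": "user", "content": (
        "A train travels 120 km in 1.5 hours. What is its average speed?\n"
        "Think step by step, then give the final answer on the last line."
    )},
]

for messages in (zero_shot, few_shot, chain_of_thought):
    print(messages, "\n")
```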

Here are the key principles to keep in mind when engineering your prompts:

  • Begin simple, then refine.

  • Break a big task into smaller, more manageable subtasks.

  • Be specific about desired format, tone, and success criteria.

  • Provide just enough context to remove ambiguity.

Over to you: Which prompt engineering technique gave you the biggest jump in quality?


Modern Storage Systems

Every system you build, whether it's a mobile app, a database engine, or an AI pipeline, eventually hits the same bottleneck: storage. And the storage world today is far more diverse than “HDD vs SSD.”

Here’s a breakdown of how today’s storage stack actually looks:

Primary Storage (where speed matters most): This is memory that sits closest to the CPU.

  • L1/L2/L3 caches, SRAM, DRAM, and newer options like PMem/NVDIMM.

    Blazing fast but volatile. The moment power drops, everything is gone.

Local Storage (your machine’s own hardware):

  • HDDs, SSDs, USB drives, SD cards, optical media, even magnetic tape (still used for archival backups).

Networked Storage (shared over the network):

  • SAN for block-level access.

  • NAS for file-level access.

  • Object storage and distributed file systems for large-scale clusters.

    This is what enterprises use for shared storage, centralized backups, and high availability setups.

Cloud Storage (scalable + managed):

  • Block storage like EBS, Azure Disks, GCP PD for virtual machines.

  • Object storage like S3, Azure Blob, and GCP Cloud Storage for massive unstructured data.

  • File storage like EFS, Azure Files, and GCP Filestore for distributed applications.

Cloud Databases (storage + compute + scalability baked in):

  • Relational engines like RDS, Azure SQL, Cloud SQL.

  • NoSQL systems like DynamoDB, Bigtable, Cosmos DB.

Over to you: If you had to choose one storage technology for a brand-new system, where would you start: block, file, object, or a database service?


🚀 Become an AI Engineer Cohort 3 Starts Today!

Our third cohort of Becoming an AI Engineer starts today. This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.

Check it out Here

Last Call: Enrollment for the AI Engineering Cohort 3 Ends Today

2026-01-17 00:30:56

Our third cohort of Becoming an AI Engineer starts in one day. This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.

Check it out Here

Here’s what makes this cohort special:

• Learn by doing: Build real world AI applications, not just by watching videos.

• Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.

• Live feedback and mentorship: Get direct feedback from instructors and peers.

• Community driven: Learning alone is hard. Learning with a community is easy!

We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.

If you want to start learning AI from scratch, this is the perfect time to begin.

Check it out Here

A Guide to Database Sharding

2026-01-16 00:31:05

When applications grow popular, they often face a good problem: more users, and exponentially more data. While this growth signals business success, it creates technical challenges that can cripple even well-designed systems. The database, often the heart of any application, becomes the bottleneck that threatens to slow everything down.

Unlike application servers, which can be easily scaled to handle more traffic, databases resist horizontal scaling. We cannot simply add more database servers and expect our problems to vanish. This is where sharding enters the picture as an important solution to one of the most persistent challenges in modern application architecture.

In this article, we will learn about database sharding in more detail. We will understand what it is, why it matters, how different approaches work, and what key considerations are important when implementing it.

What Is Database Sharding?

Read more