
RSS preview of the ByteByteGo blog

EP197: 12 Architectural Concepts Developers Should Know

2026-01-11 00:30:58

How Sentry Built Production-Informed AI Code Reviews (Sponsored)

To have code reviews reflect real system behavior, you can integrate production signals into developer workflows.

Sentry built Seer to connect production failure signals to incoming code changes before merge.

Under the hood is a multi-stage bug prediction pipeline ⚙️

  • Filtering: large change sets are scoped down to files with the strongest historical failure signals

  • Prediction: models generate and cross-check bug hypotheses using code context and production telemetry

  • Prioritization: findings are ranked by estimated risk so only high-confidence issues surface during review

See how this works in practice on the Sentry blog.

Check out the blog


This week’s system design refresher:


12 Architectural Concepts Developers Should Know

  1. Load Balancing: Distributes incoming traffic across multiple servers to ensure no single node is overwhelmed.

  2. Caching: Stores frequently accessed data in memory to reduce latency.

  3. Content Delivery Network (CDN): Stores static assets across geographically distributed edge servers so users download content from the nearest location.

  4. Message Queue: Decouples components by letting producers enqueue messages that consumers process asynchronously.

  5. Publish-Subscribe: Enables multiple consumers to receive messages from a topic.

  6. API Gateway: Acts as a single entry point for client requests, handling routing, authentication, rate limiting, and protocol translation.

  7. Circuit Breaker: Monitors downstream service calls and stops attempts when failures exceed a threshold.

  8. Service Discovery: Automatically tracks available service instances so components can locate and communicate with each other dynamically.

  9. Sharding: Splits large datasets across multiple nodes based on a specific shard key.

  10. Rate Limiting: Controls the number of requests a client can make in a given time window to protect services from overload.

  11. Consistent Hashing: Distributes data across nodes in a way that minimizes reorganization when nodes join or leave (see the sketch after this list).

  12. Auto Scaling: Automatically adds or removes compute resources based on defined metrics.
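
To make concept 11 concrete, here is a minimal consistent-hash ring sketch in Python. It is illustrative only: it hashes node names and keys with MD5, places a configurable number of virtual nodes per server, and uses bisect for lookups; real deployments layer replication and weighting on top of this.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect_right(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))   # maps the key to one node
ring.add("node-d")              # only keys near node-d's ring points move to it
print(ring.lookup("user:42"))   # most keys keep their original owner
```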

Over to you: Which architectural concept will you add to the list?


Top Developer Tools You Can Use in 2026

  1. Code Editors & IDEs: These tools help developers write, edit, and debug code with greater efficiency. Examples are Visual Studio Code, IntelliJ IDEA, PyCharm, Cursor, Eclipse, etc.

  2. Version Control Systems: Track code changes over time and enable collaboration between team members. Examples are Git, GitHub, GitLab, Bitbucket, AWS CodeCommit, etc.

  3. Testing Tools: Help ensure that code behaves as expected by identifying bugs before they reach production. Examples are JUnit, Selenium, Playwright, Cypress, etc.

  4. CI/CD Tools: They help automate the process of building and deploying code to speed up delivery. Examples are Jenkins, GitHub Actions, CircleCI, Travis CI, and AWS CodePipeline.

  5. Containerization and Orchestration: Help package applications and their dependencies into containers so they run consistently across environments. Examples are Docker, Kubernetes, Podman, containerd, Rancher, etc.

  6. Project Management Tools: Help development teams plan, organize, and track development tasks. Examples are JIRA, Asana, Trello, ClickUp, Notion, etc.

  7. API Testing Tools: They help test and validate APIs to ensure stable communication between services and with external consumers. Examples are Postman, Swagger, Hoppscotch, Insomnia, etc.

  8. AI-Powered Developer Tools: They are mainly used to boost developer productivity with code suggestions, error detection, and automated code generation. Examples are ChatGPT, Claude Code, Cursor, Copilot, Qodo, etc.

Over to you: Which other tools have you used?


5 Rate Limiting Strategies To Protect the System

Rate limiting protects services from overload or abuse by shaping traffic to match capacity and by enforcing policy as soon as a request arrives. A good rate limiter optimizes for accuracy, predictability, fairness, and low overhead, and real systems trade these properties off against one another.

  1. Fixed Window Counter: Involves counting requests in the current discrete time bucket and rejecting once the threshold is reached.

  2. Sliding Window Log: Stores the exact timestamps of incoming requests and admits a new request only if the number of requests in the last T seconds stays under the limit.

  3. Sliding Window Counter: Instead of keeping a log of request timestamps, it calculates the weighted counter for the previous time window. When a new request arrives, the counter is adjusted based on the weight, and the request is allowed if the total is below the limit.

  4. Token Bucket: Works by adding “tokens” to a bucket at a steady rate. Each request consumes one token. If tokens are available, the request passes immediately. If the bucket is empty, requests are rejected or delayed (see the sketch after this list).

  5. Leaky Bucket: Queues requests and lets them out at a fixed drain rate.
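
To make strategy 4 concrete, here is a minimal single-process token bucket in Python. It is only a sketch: it is not thread-safe and keeps its state in local memory rather than a shared store such as Redis, which a distributed limiter would need.

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: refill at a steady rate, spend one token per request."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # add tokens for the time elapsed since the last check, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1              # consume one token for this request
            return True
        return False                      # bucket empty: reject (or delay) the request

bucket = TokenBucket(capacity=10, refill_rate=5)   # bursts of 10, 5 requests/second sustained
print(bucket.allow())                               # True while tokens remain
```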

Over to you: Which other rate-limiting strategy will you add to the list?


How Live Streaming Works

Live streaming works using a few key protocols such as RTMP, HLS (created by Apple), and DASH (an open standard widely used on non-Apple devices).

Here’s a step-by-step process:

  • Step 1: The camera and microphone record video and audio. The raw data is sent to a server.

  • Step 2: The video is compressed by removing redundant data (for example, a static background does not need to be re-encoded in every frame). Then it’s encoded into a standard format like H.264, which makes it easier to send over the internet.

  • Step 3: The video is chopped into small parts, usually a few seconds long. This helps it load faster during streaming.

  • Step 4: To make sure the video plays smoothly on all kinds of devices and internet speeds, multiple versions of the video are created at different quality levels. This is called Adaptive Bitrate Streaming.

  • Step 5: The video is then sent to nearby servers (edge servers) using a CDN. This reduces delay and helps millions of people watch the video at the same time.

  • Step 6: The viewer’s phone, tablet, or computer receives the video, turns it back into full video and sound, and plays it in a video player.

  • Steps 7 and 8: If the video needs to be watched again later, it's saved on a storage server. Viewers can replay it whenever they want.

Over to you: What else will you add to understand the live streaming process?


5 Leader Election Algorithms Powering Modern Databases

Leader Election Algorithms are important in distributed systems to manage tasks, maintain consistency, and make decisions.

  1. Bully Algorithm: Nodes have unique numeric IDs, and the highest-ID node that is still alive takes over as leader after notifying the others (see the sketch after this list).

  2. Ring Algorithm: Nodes are arranged in a logical ring and pass messages containing their IDs. The highest ID node wins and becomes the leader.

  3. Paxos Algorithm: A quorum-based consensus method where proposers suggest values, acceptors vote, and a learner recognizes the chosen leader.

  4. Raft Algorithm: Nodes start as followers and become candidates if no leader is detected. The first to secure a majority of votes becomes the leader.

  5. Zookeeper Atomic Broadcast: Uses ephemeral sequential znodes to elect the leader, ensuring that the lowest-numbered znode holder is the leader.
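
As a rough illustration of algorithm 1, the sketch below collapses the Bully algorithm's message exchange into a single function: the initiator challenges every higher-ID node it believes is alive, and the highest live ID wins. Real implementations add election messages, timeouts, and a coordinator announcement.

```python
def bully_election(node_ids, alive, initiator):
    """Simplified Bully election: the highest live ID becomes leader.

    node_ids: all known node IDs; alive: dict of id -> bool; initiator: node starting the election.
    """
    # challenge every higher-ID node that is still alive
    higher_alive = [n for n in node_ids if n > initiator and alive.get(n, False)]
    if not higher_alive:
        return initiator          # nobody higher answered: the initiator becomes leader
    return max(higher_alive)      # otherwise the highest live ID takes over and notifies the others

nodes = [1, 2, 3, 4, 5]
alive = {1: True, 2: True, 3: True, 4: False, 5: False}   # current leader 5 has crashed
print(bully_election(nodes, alive, initiator=2))            # -> 3
```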

Over to you: Which other Leader Election Algorithm have you seen?


🚀 Learn AI in the New Year: Become an AI Engineer Cohort 3 Now Open

After the amazing success of Cohorts 1 and 2 (with close to 1,000 engineers joining and building real AI skills), we are excited to announce the launch of Cohort 3 of Become an AI Engineer!

Check it out Here


Must-Know Message Broker Patterns

2026-01-09 00:30:49

Modern distributed systems rely on message brokers to enable communication between independent services.

However, using message brokers effectively requires understanding common architectural patterns that solve recurring challenges. This article introduces seven essential patterns that help developers build reliable, scalable, and maintainable systems using message brokers.

These patterns address three core categories of problems:

  • Ensuring data consistency across services

  • Managing workload efficiently

  • Gaining visibility into the messaging infrastructure.

Whether we’re building an e-commerce platform, a banking system, or any distributed application, these patterns provide proven solutions to common challenges.

In this article, we will look at each of these patterns in detail and understand the scenarios where they help the most.

Ensuring Data Consistency

Read more

How AI Transformed Database Debugging at Databricks

2026-01-07 00:31:08

New Year, New Metrics: Evaluating AI Search in the Agentic Era (Sponsored)

Most teams pick a search provider by running a few test queries and hoping for the best – a recipe for hallucinations and unpredictable failures. This technical guide from You.com gives you access to an exact framework to evaluate AI search and retrieval.

What you’ll get:

  • A four-phase framework for evaluating AI search

  • How to build a golden set of queries that predicts real-world performance

  • Metrics and code for measuring accuracy

Go from “looks good” to proven quality.

Learn how to run an eval


Disclaimer: The details in this post have been derived from the details shared online by the Databricks Engineering Team. All credit for the technical details goes to the Databricks Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Databricks is a cloud platform that helps companies manage all their data in one place. It combines the best features of data warehouses and data lakes into a lakehouse architecture, which means you can store and work with any type of data.

Recently, Databricks built an internal AI-powered agentic platform that reduced database debugging time by up to 90% across thousands of OLTP instances spanning hundreds of regions on multiple cloud platforms.

The AI agent interprets questions, executes debugging steps, and diagnoses issues by retrieving key metrics and logs and automatically correlating signals. This makes life easier for Databricks engineers: they can now ask questions about the health of their services in natural language without needing to reach out to on-call engineers in storage teams.

Notably, this platform evolved from a hackathon project into a company-wide tool that unifies metrics, tooling, and expertise for managing databases at scale. In this article, we will look at how the Databricks engineering team built this platform and the challenges faced along the way.

The Pre-AI Workflow and Pain Points

In the pre-AI workflow, Databricks engineers had to manually jump between multiple tools whenever they had to debug a database problem. Here’s how the workflow ran:

  • Engineers would first open Grafana to examine performance metrics and charts that showed how the database was behaving over time.

  • Next, they would switch to Databricks’ internal dashboards to understand which client applications were running and how much workload they were generating on the database.

  • Engineers would then run command-line interface commands to inspect InnoDB status, which provides a detailed snapshot of MySQL’s internal state, including active transactions, I/O operations, and any deadlocks.

  • Finally, engineers would log into their cloud provider’s console to download slow query logs that revealed which database queries were taking an unusually long time to execute.

The first attempt to alleviate this problem was made during a company-wide hackathon, during which developers built a simple prototype that unified a few core database metrics and dashboards into a single view. The results were promising. However, before writing more code, Databricks took a research-driven approach by actually observing on-call engineers during real debugging sessions and conducting interviews to understand their challenges firsthand.

The first major problem was fragmented tooling, where each debugging tool worked in complete isolation without any integration or ability to share information with other tools. This lack of integration meant engineers had to manually piece together information from multiple disconnected sources, which made the entire debugging process slow and prone to human error.

The second major problem was that engineers spent most of their incident response time gathering context rather than actually fixing the problem. Context gathering involved figuring out what had recently changed in the system, determining what “normal” baseline behavior looked like, and tracking down other engineers who might have relevant knowledge.

The third major problem was that engineers lacked clear guidance during incidents about which mitigation actions were safe to take and which would actually be effective. Without clear runbooks or automated guidance, engineers would either spend a lot of time investigating to ensure they fully understood the situation or they would wait for senior experts to become available and tell them what to do.

Evolution Through Iteration

Databricks didn’t build its AI debugging platform in one shot; the team went through multiple versions.

The first version they built was a static agentic workflow that simply followed a pre-written debugging Standard Operating Procedure, which is essentially a step-by-step checklist of what to do. This first version failed because engineers didn’t want to follow a manual checklist, but wanted the system to automatically analyze their situation and give them a diagnostic report with immediate insights about what was wrong.

Learning from this failure, Databricks built a second version focused on anomaly detection, which could automatically identify unusual patterns or behaviors in the database metrics. However, while the anomaly detection system successfully surfaced relevant problems, it still fell short because it only told engineers “here’s what’s wrong” without providing clear guidance on what to do next to fix those problems.

The breakthrough came with the third version, which was an interactive chat assistant that fundamentally changed how engineers could debug their databases. This chat assistant codifies expert debugging knowledge, meaning it captures the wisdom and experience of senior database engineers and makes it available to everyone through conversation. Unlike the previous versions, the chat assistant can answer follow-up questions, allowing engineers to have a back-and-forth dialogue rather than just receiving a one-time report.

This interactive nature transforms debugging from a series of isolated manual steps into a continuous, conversational process where the AI guides engineers through the entire investigation.

See the evolution journey in the diagram below:

Platform Foundation Architecture

Before the Databricks engineering team could effectively add AI to its debugging platform, it realized that it needed to build a solid architectural foundation that would make the AI integration meaningful. This was because any agent would need to handle region and cloud-specific logic.

This was a difficult problem since Databricks operates thousands of database instances across hundreds of regions, eight regulatory domains, and three clouds. The team recognized that without building this strong architectural foundation first, trying to add AI capabilities would run into unavoidable roadblocks. Some of the problems were as follows:

  • The first problem that would occur without this foundation is context fragmentation, where all the debugging data would be scattered across different locations, making it impossible for an AI agent to get a complete picture of what’s happening.

  • The second problem would be unclear governance boundaries, meaning it would be extremely difficult to ensure that the AI agent and human engineers stay within their proper permissions and don’t accidentally access or modify things they shouldn’t.

  • The third problem would be slow iteration loops, where inconsistent ways of doing things across different clouds and regions would make it very hard to test and improve the AI agent’s behavior.

To support this complexity, the platform is built on three core architectural principles that work together to create a unified, secure, and scalable system.

Global Storex Instance

The first principle is a central-first sharded architecture, which means there’s one central “brain” (called Storex) that coordinates many regional pieces of the system.

This global Storex instance acts like a traffic controller, providing engineers with a single unified interface to access all their databases, no matter where those databases are physically located. Even though engineers interact with one central system, the actual sensitive data stays local in each region, which is crucial for meeting privacy and regulatory requirements.

This architecture ensures compliance across eight different regulatory domains, which are different legal jurisdictions that have their own rules about where data can be stored and who can access it.

Fine-Grained Access Control

The second principle is fine-grained access control, which means the platform has very precise and detailed rules about who can do what. Access permissions are enforced at multiple levels, such as:

  • The Team Level: Determines which teams can access what.

  • The Resource Level: Determines which specific databases or systems.

  • The RPC Level: Determines which specific operations or function calls.

This multi-layered permission system ensures that both human engineers and AI agents only perform actions they’re authorized to do, preventing accidental or unauthorized changes.

Unified Orchestration

The third principle is unified orchestration, which means the platform brings together all the existing infrastructure services under one cohesive system.

This orchestration creates consistent abstractions, which means engineers can work with databases the same way whether they’re on AWS in Virginia, Azure in Europe, or Google Cloud in Asia. By providing these consistent abstractions, the platform eliminates the need for engineers to learn and handle cloud-specific or region-specific differences in how things work.

AI Agent Implementation

The Databricks engineering team built a lightweight framework for their AI agent, inspired by two existing technologies: MLflow’s prompt optimization tools and DSPy.

The key innovation of this framework is that it decouples prompting from tool implementation, meaning engineers can change what the AI says without having to rewrite how the underlying tools work.

Engineers define tools by writing simple Scala classes with function signatures that describe what each tool does, rather than writing complex instructions for the AI. Each tool needs only a short docstring description, and the large language model can automatically figure out three things: what format of input the tool needs, what structure the output will have, and how to interpret the results.

See the diagram below:

This design enables rapid iteration, meaning engineers can quickly experiment with different prompts and swap tools in and out without having to modify the underlying infrastructure that handles parsing data, connecting to the LLM, or managing the conversation state.

Agent Decision Loop

The AI agent operates in a continuous decision loop that determines what actions to take based on the user’s needs.

  • First, the user’s input goes to the Storex Router, which is like a switchboard that directs the request to the right place.

  • Second, the LLM Endpoint (the large language model) generates a response based on what the user asked and the current context of the conversation.

  • Third, if the LLM determines it needs more information, it executes a Tool Call to retrieve data like database metrics, logs, or configuration details.

  • Fourth, the LLM Response processes the output from the tool, interpreting what the data means in the context of the user’s question.

  • Fifth, the system either loops back to step 2 to gather more information with additional tool calls or it produces a final User Response if it has everything needed to answer the question.
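
The loop above is essentially a standard tool-calling agent loop. The Python sketch below is our own illustration of that pattern, not Databricks’ implementation (theirs is built on Scala tool classes); call_llm and the entries in TOOLS are hypothetical stand-ins.

```python
# Illustrative tool-calling loop; call_llm and the TOOLS entries are hypothetical stand-ins.
TOOLS = {
    "get_db_metrics": lambda args: {"cpu": 0.92, "iops": 14000},  # stand-in for a real metrics fetch
}

def call_llm(messages):
    """Stand-in for the LLM endpoint: asks for metrics once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_db_metrics", "args": {}}   # model decides it needs more data
    return {"content": "CPU is saturated at 92%; inspect the top queries on this instance."}

def agent_loop(user_input: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": user_input}]   # step 1: request routed to the agent
    for _ in range(max_steps):
        reply = call_llm(messages)                          # step 2: LLM generates a response
        if "tool" in reply:                                 # step 3: LLM requests a tool call
            result = TOOLS[reply["tool"]](reply["args"])
            messages.append({"role": "tool", "content": str(result)})  # step 4: interpret tool output
            continue                                        # step 5: loop back with new context
        return reply["content"]                             # step 5: final user response
    return "Investigation incomplete: step budget exhausted."

print(agent_loop("Why is my database slow?"))
```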

Validation Framework

Databricks built a validation framework to ensure that as they improve the AI agent, they don’t accidentally make it worse or introduce bugs (called “regressions”).

The framework captures snapshots of production state: frozen moments in time that record what the databases looked like, what problems existed, and what the correct diagnosis should be. Each snapshot includes database schemas (the structure of the data), physical database info (hardware and configuration details), metrics like CPU usage and IOPS (input/output operations per second), and the expected diagnostic outputs that represent the “correct answer”.

These snapshots are then replayed through the agent, meaning the system feeds old problems to the new version of the AI to see how it handles them. A separate “judge” LLM scores the agent’s responses on two key criteria: accuracy (did it identify the problem correctly?) and helpfulness (did it provide useful guidance to the engineer?).

See the diagram below:

All of these test results are stored in Databricks tables so the team can analyze trends over time and understand whether their changes are actually improving the agent.
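
In outline, the replay-and-judge loop described above can be sketched as follows. This is only our schematic of the idea, not Databricks’ code; the Snapshot fields, run_agent, and judge are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """Hypothetical frozen production state plus the expected 'correct answer'."""
    schema: dict
    metrics: dict
    expected_diagnosis: str

def run_agent(snapshot: Snapshot) -> str:
    """Placeholder: replay the snapshot through the agent version under test."""
    return "High IOPS caused by an unindexed table scan"

def judge(answer: str, expected: str) -> dict:
    """Placeholder judge LLM: scores accuracy and helpfulness on a 0-1 scale."""
    accuracy = 1.0 if expected.lower() in answer.lower() else 0.0
    return {"accuracy": accuracy, "helpfulness": 0.8}

def evaluate(snapshots: list) -> list:
    results = []
    for snap in snapshots:
        answer = run_agent(snap)                      # feed an old problem to the new agent
        results.append(judge(answer, snap.expected_diagnosis))
    return results                                     # persisted to tables to track trends over time
```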

Multi-Agent Specialization

Rather than building one giant AI agent that tries to do everything, Databricks’ framework enables them to create specialized agents that each focus on different domains or areas of expertise.

They have a system and database issues agent that specializes in low-level technical problems with the database software and hardware. They have a client-side traffic patterns agent that specializes in understanding how applications are using the database and whether unusual workload patterns are causing problems.

The framework allows them to easily create additional domain-specific agents as they identify new areas where specialized knowledge would be helpful. Each agent builds deep expertise in its particular area by having prompts, tools, and context specifically tailored to that domain, rather than being a generalist.

These specialized agents can collaborate with each other to provide complete root cause analysis, where one agent might identify a traffic spike and another might correlate it with a specific database configuration issue.

Conclusion

The results of Databricks’ AI-assisted debugging platform have been transformative across multiple dimensions.

The platform achieved up to a 90% reduction in debugging time, turning what were once hours-long investigations into tasks that can be completed in minutes. Perhaps most remarkably, new engineers with zero context can now jump-start a database investigation in under 5 minutes, something that was previously nearly impossible without significant training and experience. The platform has achieved company-wide adoption across all engineering teams, demonstrating its value well beyond the database specialists who originally needed it.

The user feedback has been quite positive, with engineers pointing out that they no longer need to remember where various query dashboards are located or spend time figuring out where to find specific information. Multiple engineers described the platform as a big change in developer experience.

Looking forward, the platform lays the foundation for AI-assisted production operations, including automated database restores, production query optimization, and configuration updates. The architecture is designed to extend beyond databases to other infrastructure components, promising to transform how Databricks operates its entire cloud infrastructure at scale.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

How Google’s Tensor Processing Unit (TPU) Works

2026-01-06 00:31:12

4 Key Insights for Scaling LLM Applications (Sponsored)

LLM workflows can be complex, opaque, and difficult to secure. Get the latest ebook from Datadog for practical strategies to monitor, troubleshoot, and protect your LLM applications in production. You’ll get key insights into how to overcome the challenges of deploying LLMs securely and at scale, from debugging multi-step workflows to detecting prompt injection attacks.

Download the eBook


Disclaimer: The details in this post have been derived from the details shared online by the Google Engineering Team. All credit for the technical details goes to the Google Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them

When DeepMind’s AlphaGo defeated Go world champion Lee Sedol in March 2016, the world witnessed a big moment in artificial intelligence. The match was powered by hardware Google had been running in production for over a year, but had never publicly acknowledged.

The Tensor Processing Unit, or TPU, represented something more profound than just another fast chip. It marked a fundamental shift in computing philosophy: sometimes doing less means achieving more.

Since then, Google’s TPU family has evolved through seven generations, scaling from single chips serving image recognition queries to 9,216-chip supercomputers training the largest language models in existence. In this article, we look at why Google built custom silicon and how it works, and examine the physical constraints and engineering trade-offs behind the design.

The Need for TPU

In 2013, Google’s infrastructure team ran a calculation. If Android users adopted voice search at the scale Google anticipated, using it for just three minutes per day, the computational demand would require doubling the company’s entire global data center footprint.

This was a problem with no obvious solution at the time. Building more data centers filled with traditional processors was economically unfeasible. More critically, Moore’s Law had been slowing for years. For decades, the semiconductor industry had relied on the observation that transistor density doubles roughly every two years, delivering regular performance improvements without architectural changes. However, by 2013, this trend was weakening. Google couldn’t simply wait for Intel’s next generation of CPUs to solve its problem.

The root cause of this situation was architectural. Traditional computers follow the Von Neumann architecture, where a processor and memory communicate through a shared bus. To perform any calculation, the CPU must fetch an instruction, retrieve data from memory, execute the operation, and write results back. This constant transfer of information between the processor and memory creates what computer scientists call the Von Neumann bottleneck.

The energy cost of moving data across this bus often exceeds the energy cost of the computation itself. For example, imagine a chef preparing a meal but having to walk to a distant pantry for each ingredient. The cooking takes seconds, but the walking consumes hours. For general-purpose computing tasks like word processing or web browsing, this design makes sense because workloads are unpredictable. However, neural networks are different.

Deep learning models perform one operation overwhelmingly: matrix multiplication. A neural network processes information by multiplying input data by learned weight matrices, adding bias values, and applying activation functions. This happens billions of times for a single prediction. Modern language models with hundreds of billions of parameters require hundreds of billions of multiply-add operations per query. Critically, these operations are predictable, parallel, and deterministic.

CPUs devote significant processing power to features like branch prediction and out-of-order execution, designed to handle unpredictable code. Graphics Processing Units, or GPUs, improved matters with thousands of cores working in parallel, but they still carried architectural overhead from their graphics heritage. Google’s insight was to build silicon that does only what neural networks need and strip away everything else.

The Systolic Array: A Different Way to Compute

The heart of the TPU is an architecture called a systolic array. The name originates from the Greek word for heartbeat, referencing how data pulses rhythmically through the chip. To understand why this matters, consider how different processors approach the same task.

  • A CPU operates like a single worker running back and forth between a water well and a fire, filling one bucket at a time.

  • A GPU deploys thousands of workers making the same trips simultaneously. Throughput increases, but the traffic between the well and the fire becomes chaotic and energy-intensive.

  • A systolic array takes a fundamentally different approach. The workers form a line and pass buckets hand to hand. Water flows through the chain without anyone returning to the source until the job is complete.

In a TPU, the workers are simple multiply-accumulate units arranged in a dense grid. The first-generation TPU used a 256 by 256 array, meaning 65,536 calculators operating simultaneously. Here’s how computation proceeds:

  • Neural network weights are loaded into each calculator from above and remain stationary.

  • Input data flows in from the left, one row at a time.

  • As data passes through each calculator, it is multiplied by the resident weight.

  • The product adds to a running sum, then passes rightward to the next calculator.

  • Partial results accumulate and flow downward.

  • Final results emerge from the bottom after all calculations are complete.

See the diagram below:

This design means data is read from memory once but used thousands of times as it traverses the array. Traditional processors must access memory for nearly every operation. The systolic array eliminates this bottleneck. Data moves only between spatially adjacent calculators over short wires, dramatically reducing energy consumption.
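
The data-reuse pattern can be mimicked functionally in a few lines of Python. This is a sketch that ignores clock-cycle timing and input skewing; it only shows the reuse: each activation enters its row once and is reused by every calculator in that row, while partial sums accumulate down the columns.

```python
import numpy as np

def weight_stationary_matmul(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Functional sketch of a weight-stationary pass: PE (i, j) holds W[i, j]."""
    rows, cols = W.shape
    partial = np.zeros(cols)
    for i in range(rows):                 # activation x[i] streams across row i, read from memory once
        for j in range(cols):             # ...and is reused by every PE in that row
            partial[j] += x[i] * W[i, j]  # partial sums accumulate down column j
    return partial                        # results emerge from the bottom of the array

x, W = np.random.rand(4), np.random.rand(4, 4)
assert np.allclose(weight_stationary_matmul(x, W), x @ W)  # same answer as an ordinary matmul
```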

The numbers make a strong case for this approach.

  • TPU v1’s 256 by 256 array could perform 65,536 multiply-accumulate operations per clock cycle. Running at 700 MHz, this delivered 92 trillion 8-bit operations per second while consuming just 40 watts (the arithmetic is worked out after this list).

  • A contemporary GPU might perform tens of thousands of operations per cycle, while the TPU performed hundreds of thousands.

  • More than 90 percent of the silicon performed useful computation, compared to roughly 30 percent in a GPU.
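
The 92-TOPS figure in the first bullet follows directly from the array size and clock rate, counting each multiply-accumulate as two operations:

```python
macs_per_cycle = 256 * 256                      # 65,536 multiply-accumulate units
ops_per_mac = 2                                 # one multiply plus one add
clock_hz = 700e6                                # 700 MHz
print(macs_per_cycle * ops_per_mac * clock_hz)  # ~9.2e13, i.e. roughly 92 trillion ops per second
```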

The trade-off here is absolute specialization. A systolic array can only perform matrix multiplications efficiently. It cannot render graphics, browse the web, or run a spreadsheet. Google accepted this limitation because neural network inference is fundamentally matrix multiplication repeated many times.

The Supporting Architecture

The systolic array requires carefully orchestrated support components to achieve its performance. Each piece solves a specific bottleneck in the pipeline from raw data to AI predictions.

Let’s look at the most important components:

The Matrix Multiply Unit

The Matrix Multiply Unit, or MXU, is the systolic array itself.

TPU v1 used a single 256 by 256 array operating on 8-bit integers. Later versions shifted to 128 by 128 arrays using Google’s BFloat16 format for training workloads, then returned to 256 by 256 arrays in v6 for quadrupled throughput. The weight-stationary design minimizes data movement, which is the primary consumer of energy in computing.

Unified Buffer

The Unified Buffer provides 24 megabytes of on-chip SRAM, serving as a high-speed staging area between slow external memory and the hungry MXU.

This buffer stores input activations arriving from the host computer, intermediate results between neural network layers, and final outputs before transmission. Since this memory sits directly on the chip, it operates at a higher bandwidth than external memory. This difference is critical for keeping the MXU continuously fed with data rather than sitting idle waiting for memory access.

Vector Processing Unit

The Vector Processing Unit handles operations that the MXU cannot. This includes activation functions like ReLU, sigmoid, and tanh.

Neural networks require non-linearity to learn complex patterns. Without it, multiple layers would collapse mathematically into a single linear transformation. Rather than implementing these functions in software, the TPU has dedicated hardware circuits that compute activations in a single cycle. Data typically flows from the MXU to the VPU for activation processing before moving to the next layer.

Accumulators

Accumulators collect the 32-bit results flowing from the MXU.

When multiplying 8-bit inputs, products are 16-bit, but accumulated sums grow larger through repeated addition. Using 32-bit accumulators prevents overflow during the many additions a matrix multiplication requires. The accumulator memory totals 4 megabytes across 4,096 vectors of 256 elements each.

Weight FIFO Buffer

The Weight FIFO buffer stages weights between external memory and the MXU using a technique called double-buffering.

The MXU holds two sets of weight tiles: one actively computing while the other loads from memory. This overlap completely hides memory latency, ensuring the computational units never wait for data.

High Bandwidth Memory

High Bandwidth Memory evolved across TPU generations.

The original v1 used DDR3 memory delivering 34 gigabytes per second. Modern Ironwood TPUs achieve 7.4 terabytes per second, a 217-fold improvement. HBM accomplishes this by stacking multiple DRAM dies vertically with thousands of connections between them, enabling bandwidth impossible with traditional memory packaging.

The Precision Advantage

TPUs gain significant efficiency through quantization, using lower-precision numbers than traditional floating-point arithmetic. This choice has big hardware implications that cascade through the entire design.

Scientific computing typically demands high precision. Calculating pi to ten decimal places requires careful representation of very small differences. However, neural networks operate differently. They compute probabilities and patterns. For example, whether a model predicts an image is 85 percent likely to be a cat versus 85.3472 percent likely makes no practical difference to the classification.

A multiplier circuit’s silicon area grows roughly with the square of its operand width. An 8-bit integer multiplier requires roughly 64 units of silicon area, whereas a 32-bit floating-point multiplier, which must multiply 24-bit mantissas, requires about 576 units. This relationship explains why TPU v1 could pack 65,536 multiply-accumulate units into a modest chip while a GPU contains far fewer floating-point units. More multipliers mean more parallel operations per cycle.

The first TPU used 8-bit integers for inference, reducing memory requirements by four times compared to 32-bit floats. A 91-megabyte model becomes 23 megabytes when quantized. Research demonstrated that inference rarely needs 32-bit precision. The extra decimal places don’t meaningfully affect predictions.

Training requires more precision because small gradient updates accumulate over millions of iterations. Google addressed this by inventing BFloat16, or Brain Floating-Point 16. This format maintains the same 8-bit exponent as a 32-bit float but uses only 7 bits for the mantissa. The key insight was that neural networks are far more sensitive to dynamic range, controlled by the exponent, than to precision, controlled by the mantissa. BFloat16 provides the wide dynamic range of a 32-bit float with half the bits, enabling efficient training without the overflow problems that plagued alternative 16-bit formats.

See the diagram below:
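
One way to see the trade-off is to truncate a float32 to BFloat16 by keeping only its top 16 bits (sign, the full 8-bit exponent, and 7 mantissa bits). The NumPy sketch below does exactly that; real hardware typically rounds rather than truncates.

```python
import numpy as np

def to_bfloat16(x: float) -> float:
    """Truncate a float32 to BFloat16 precision by zeroing its low 16 bits."""
    bits = np.array([x], dtype=np.float32).view(np.uint32)  # reinterpret the raw float bits
    bits &= np.uint32(0xFFFF0000)                            # keep sign + 8-bit exponent + 7 mantissa bits
    return float(bits.view(np.float32)[0])

print(to_bfloat16(3.14159265))   # 3.140625: precision drops, but the value barely moves
print(to_bfloat16(1.0e30))       # still representable: BFloat16 keeps float32's exponent range
```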

Modern TPUs support multiple precision modes.

  • BFloat16 for training.

  • INT8 for inference, which runs twice as fast on TPU v5e.

  • The newest FP8 format.

Ironwood is the first TPU with native FP8 support, avoiding the emulation overhead of earlier generations.

Evolution Journey

TPU development follows a clear trajectory.

Each generation increased performance while improving energy efficiency. The evolution reveals how AI hardware requirements shifted as models scaled.

  • TPU v1 launched secretly in 2015, focusing exclusively on inference. Built on 28-nanometer process technology and consuming just 40 watts, it delivered 92 trillion 8-bit operations per second. The chip connected via PCIe to standard servers and began powering Google Search, Photos, Translate, and YouTube before anyone outside Google knew it existed. In March 2016, TPU v1 powered AlphaGo’s victory over Lee Sedol, proving that application-specific chips could beat general-purpose GPUs by factors of 15 to 30 times in speed and 30 to 80 times in power efficiency.

  • TPU v2 arrived in 2017 with fundamental architecture changes to support training. Replacing the 256 by 256 8-bit array with two 128 by 128 BFloat16 arrays enabled the floating-point precision training requires. Adding High Bandwidth Memory, 16 gigabytes at 600 gigabytes per second, eliminated the memory bottleneck that limited v1. Most importantly, v2 introduced the Inter-Chip Interconnect, custom high-speed links connecting TPUs directly to each other. This enabled TPU Pods where 256 chips operate as a single accelerator delivering 11.5 petaflops.

  • TPU v3 in 2018 doubled performance to 420 teraflops per chip and introduced liquid cooling to handle increased power density. Pod size expanded to 1,024 chips, exceeding 100 petaflops, enough to train the largest AI models of that era in reasonable timeframes.

  • TPU v4 in 2021 brought multiple innovations. SparseCores accelerated embedding operations critical for recommendation systems and language models by five to seven times using only 5 percent of the chip area. Optical Circuit Switches enabled dynamic network topology reconfiguration. Instead of fixed electrical cables, robotic mirrors steer beams of light between fibers. This allows the interconnect to route around failures and scale to 4,096-chip Pods approaching one exaflop. The 3D torus topology, with each chip connected to six neighbors instead of four, reduced communication latency for distributed training.

  • Ironwood, or TPU v7, launched in 2025 and represents the most significant leap. Designed specifically for the age of inference, where deploying AI at scale matters more than training, each chip delivers 4,614 teraflops with 192 gigabytes of HBM at 7.4 terabytes per second bandwidth.

Conclusion

TPU deployments demonstrate practical impact across diverse applications.

For reference, a single TPU processes over 100 million Google Photos per day. AlphaFold’s solution to the 50-year protein folding problem, earning the 2024 Nobel Prize in Chemistry, ran on TPUs. Training PaLM, a 540-billion-parameter language model, across 6,144 TPU v4 chips achieved 57.8 percent hardware utilization over 50 days, remarkable efficiency for distributed training at that scale. Beyond Google, TPUs power Anthropic’s Claude assistant, Midjourney’s image generation models, and numerous research breakthroughs.

However, TPUs aren’t universally superior. They excel at large-scale language model training and inference, CNNs and Transformers with heavy matrix operations, high-throughput batch processing, and workloads prioritizing energy efficiency. On the other hand, GPUs remain better choices for PyTorch-native development, since running PyTorch on TPUs requires the PyTorch/XLA bridge and adds some friction. Small batch sizes, mixed AI and graphics workloads, multi-cloud deployments, and rapid prototyping also often favor GPUs.

TPUs represent a broader industry shift toward domain-specific accelerators.

The general-purpose computing model, where CPUs run any program reasonably well, hits physical limits when workloads scale to trillions of operations per query. Purpose-built silicon that sacrifices flexibility for efficiency delivers order-of-magnitude improvements that no amount of general-purpose processor optimization can match.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

EP196: Cloud Load Balancer Cheat Sheet

2026-01-04 00:31:02

Cut Code Review Time & Bugs in Half (Sponsored)

Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant code review comments and highlighting the potential impact of every pull request.

Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.

CodeRabbit has so far reviewed more than 10 million PRs, is installed on 2 million repositories, and is used by 100 thousand open-source projects. CodeRabbit is free for all open-source repos.

Get Started Today


This week’s system design refresher:


Cloud Load Balancer Cheat Sheet

Efficient load balancing is vital for optimizing the performance and availability of your applications in the cloud.

However, managing load balancers can be overwhelming, given the various types and configuration options available.

In today's multi-cloud landscape, mastering load balancing is essential to ensure seamless user experiences and maximize resource utilization, especially when orchestrating applications across multiple cloud providers. Having the right knowledge is key to overcoming these challenges and achieving consistent, reliable application delivery.

In selecting the appropriate load balancer type, it's essential to consider factors such as application traffic patterns, scalability requirements, and security considerations. By carefully evaluating your specific use case, you can make informed decisions that enhance your cloud infrastructure's efficiency and reliability.

This Cloud Load Balancer cheat sheet simplifies the decision-making process and helps you implement the most effective load-balancing strategy for your cloud-based applications.

Over to you: What factors do you believe are most crucial in choosing the right load balancer type for your applications?


How CQRS Works

CQRS (Command Query Responsibility Segregation) separates write (Command) and read (Query) operations for better scalability and maintainability.

Here’s how it works:

  1. The client sends a command to update the system state. A Command Handler validates and executes logic using the Domain Model.

  2. Changes are saved in the Write Database and can also be saved to an Event Store. Events are emitted to update the Read Model asynchronously.

  3. The projections are stored in the Read Database. This database is eventually consistent with the Write Database.

  4. On the query side, the client sends a query to retrieve data.

  5. A Query Handler fetches data from the Read Database, which contains precomputed projections.

  6. Results are returned to the client without hitting the write model or the write database.
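
A toy, in-memory version of the flow above might look like the following sketch. It is illustrative only: real systems use separate databases, an event store or broker, and asynchronous projection workers instead of the in-process calls shown here.

```python
# Toy in-memory CQRS sketch: separate write model, event log, and read-side projection.
write_db = {}      # write database: current state keyed by order id
event_store = []   # append-only event log
read_db = {}       # read database: precomputed projections for queries

def handle_place_order(order_id: str, item: str, qty: int):
    """Command handler: validate, update the write model, and emit an event."""
    if qty <= 0:
        raise ValueError("quantity must be positive")
    write_db[order_id] = {"item": item, "qty": qty, "status": "PLACED"}
    event = {"type": "OrderPlaced", "order_id": order_id, "item": item, "qty": qty}
    event_store.append(event)
    project(event)                      # in production this happens asynchronously

def project(event):
    """Projection: update the read model from the emitted event (eventually consistent)."""
    if event["type"] == "OrderPlaced":
        read_db[event["order_id"]] = {"summary": f'{event["qty"]} x {event["item"]}', "status": "PLACED"}

def handle_get_order(order_id: str):
    """Query handler: serve reads from the read database only."""
    return read_db.get(order_id)

handle_place_order("o-1", "keyboard", 2)
print(handle_get_order("o-1"))   # {'summary': '2 x keyboard', 'status': 'PLACED'}
```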

Over to you: What else will you add to understand CQRS?


How Does Docker Work?

Docker’s architecture is built around three main components that work together to build, distribute, and run containers.

  1. Docker Client
    This is the interface through which users interact with Docker. It sends commands (such as build, pull, run, push) to the Docker Daemon using the Docker API.

  2. Docker Host
    This is where the Docker Daemon runs. It manages images, containers, networks, and volumes, and is responsible for building and running applications.

  3. Docker Registry
    The storage system for Docker images. Public registries like Docker Hub or private registries allow pulling and pushing images.
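
To see the client-to-daemon interaction from code, the Docker SDK for Python (the docker package) speaks the same Docker API that the CLI uses. The snippet below assumes a running Docker daemon, the docker package installed, and access to Docker Hub; it pulls an image from the registry and runs it as a container.

```python
import docker

client = docker.from_env()                     # Docker Client: connects to the local Docker Daemon

image = client.images.pull("alpine:latest")    # daemon pulls the image from the registry (Docker Hub)
output = client.containers.run(image, "echo hello from a container", remove=True)
print(output.decode())                         # the daemon created, ran, and removed the container
```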

Over to you: Do you use Docker in your projects?


6 Practical AWS Lambda Application Patterns You Must Know

AWS Lambda pioneered the serverless paradigm, allowing developers to run code without provisioning, managing, or scaling servers. Let’s look at a few practical application patterns you can implement using Lambda.

  1. On-Demand Media Transformation
    Whenever a user requests an image from S3 in a format that isn’t available, an on-demand transformation can be done using AWS Lambda (see the sketch after this list).

  2. Multiple Data Format from Single Source
    AWS Lambda can work with SNS to create a layer where data is processed into the required format before being sent to the storage layer.

  3. Real-time Data Processing
    Create a Kinesis stream and corresponding Lambda function to process different types of data (clickstream, logs, location tracking, or transactions) from your application.

  4. Change Data Capture
    Amazon DynamoDB can be integrated with AWS Lambda to respond to database events (inserts, updates, and deletes) captured in DynamoDB Streams.

  5. Serverless Image Processing
    Process and recognize images in a serverless manner using AWS Lambda. Integrate with AWS Step Functions for better workflow management.

  6. Automated Stored Procedure
    Invoke Lambda as a stored procedure to trigger functionality before/after some operations are performed on a particular database table.
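
As a flavor of patterns 1 and 5, here is a minimal Python handler for an S3 object-created event. The output bucket name and the transform step are hypothetical placeholders; the standard lambda_handler(event, context) entry point and an execution role with S3 access are assumed.

```python
import boto3

s3 = boto3.client("s3")   # created outside the handler so the client is reused across invocations

def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event: fetch the object and write a transformed copy."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        data = obj["Body"].read()

        transformed = transform(data)                  # placeholder: resize or convert the format here
        s3.put_object(Bucket="my-output-bucket",       # hypothetical destination bucket
                      Key=f"processed/{key}",
                      Body=transformed)
    return {"statusCode": 200}

def transform(data: bytes) -> bytes:
    """Placeholder transformation; a real function would use an image library such as Pillow."""
    return data
```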

Over to you: Have you used AWS Lambda in your project?


Containerization Explained: From Build to Runtime

“Build once, run anywhere.” That’s the promise of containerization, and here’s how it actually works:

Build Flow: Everything starts with a Dockerfile, which defines how your app should be built. When you run docker build, it creates a Docker Image containing:

  • Your code

  • The required dependencies

  • Necessary libraries

This image is portable. You can move it across environments, and it’ll behave the same way, whether on your local machine, a CI server, or in the cloud.

Runtime Architecture: When you run the image, it becomes a Container, an isolated environment that executes the application. Multiple containers can run on the same host, each with its own filesystem, process space, and network stack.

The Container Engine (like Docker, containerd, CRI-O, or Podman) manages:

  • The container lifecycle

  • Networking and isolation

  • Resource allocation

All containers share the Host OS kernel, sitting on top of the hardware. That’s how containerization achieves both consistency and efficiency, light like processes, but isolated like VMs.

Over to you: When deploying apps, do you prefer Docker, containerd, or Podman, and why?


🚀 Learn AI in the New Year: Become an AI Engineer Cohort 3 Now Open

After the amazing success of Cohorts 1 and 2 (with close to 1,000 engineers joining and building real AI skills), we are excited to announce the launch of Cohort 3 of Become an AI Engineer!

Check it out Here



Message Brokers 101: Storage, Replication, and Delivery Guarantees

2026-01-02 00:33:32

A message broker is a middleware system that facilitates asynchronous communication between applications and services using messages.

At its core, a broker decouples producers of information from consumers, allowing them to operate independently without direct knowledge of each other. This decoupling is foundational to modern distributed architectures, where services communicate through the broker rather than directly with one another, enabling them to evolve independently without tight coupling.

To understand this in practice, consider an order-processing service that places an “Order Placed” message on a broker. Downstream services such as inventory, billing, and shipping will get that message from the broker when they are ready to process it, rather than the order service calling each one synchronously. This approach eliminates the need for the order service to know about or wait for these downstream systems.

Message brokers are not merely pipes for data transmission. They are sophisticated distributed databases specialized for functionalities such as stream processing and task distribution. The fundamental value proposition of a message broker lies in its ability to introduce a temporal buffer between distinct systems. By allowing a producer to emit a message without waiting for a consumer to process it, the broker facilitates temporal decoupling. This ensures that a spike in traffic at the ingress point does not immediately overwhelm downstream services.
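
To see temporal decoupling in miniature, here is an in-process sketch using Python's queue and threading modules. It is a stand-in for a real broker such as Kafka or RabbitMQ, with no persistence or delivery guarantees; it only shows the producer emitting and moving on while the consumer processes at its own pace.

```python
import queue
import threading
import time

broker = queue.Queue()   # in-process stand-in for a broker queue/topic

def order_service():
    broker.put({"event": "OrderPlaced", "order_id": "o-1"})   # producer emits and moves on
    print("order service: event published, not waiting for consumers")

def inventory_service():
    msg = broker.get()                    # consumer pulls the message when it is ready
    time.sleep(0.1)                       # simulate slower downstream processing
    print(f"inventory service: processed {msg['event']} for {msg['order_id']}")

threading.Thread(target=inventory_service).start()
order_service()
```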

In this article, we will look at how message brokers work in detail and explore the various patterns they enable in distributed system design.

Fundamental Terms

Read more