2025-12-05 00:31:08
The way software is built has undergone significant changes over the last decade.
The industry has moved toward building systems composed of many small, independent services. While this shift to distributed services solved many problems related to development speed and scalability, it created a whole new set of challenges. Suddenly, the communication between different parts of our application became much more difficult to manage.
One solution to this problem of managing communication between services is the service mesh architectural pattern.
At a high level, a service mesh is a dedicated layer that we add to our applications. Its primary job is to handle communication between services. Instead of the application code having to worry about how to connect to another service, how to retry a failed request, or how to encrypt data, the service mesh handles all of this automatically.
This technology matters because it addresses three critical gaps in modern microservices architecture: reliability, security, and observability. In this article, we will explore why service meshes exist, how they work, and how to determine whether one is the right tool for our specific needs.
2025-12-04 00:23:28
Bugs sneak out when less than 80% of user flows are tested before shipping. However, getting that kind of coverage (and staying there) is hard and pricey for any team.
QA Wolf’s AI-native solution provides high-volume, high-speed test coverage for web and mobile apps, reducing your organization’s QA cycle to minutes.
They can get you:
80% automated E2E test coverage in weeks—not years
24-hour maintenance and on-demand test creation
Zero flakes, guaranteed
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of engineers achieved 4x more test cases and 86% faster QA cycles.
⭐ Rated 4.8/5 on G2
Disclaimer: The details in this post have been derived from the details shared online by the Meta Engineering Team. All credit for the technical details goes to the Meta Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Meta has one of the largest data warehouses in the world, supporting analytics, machine learning, and AI workloads across many teams. Every business decision, experiment, and product improvement relies on quick, secure access to this data.
To organize such a vast system, Meta built its data warehouse as a hierarchy. At the top are teams and organizations, followed by datasets, tables, and finally dashboards that visualize insights. Each level connects to the next, forming a structure where every piece of data can be traced back to its origin.
Access to these data assets has traditionally been managed through role-based access control (RBAC). This means access permissions are granted based on job roles. A marketing analyst, for example, can view marketing performance data, while an infrastructure engineer can view server performance logs. When someone needed additional data, they would manually request it from the data owner, who would approve or deny access based on company policies.
This manual process worked well in the early stages. However, as Meta’s operations and AI systems expanded, this model began to strain under its own weight. Managing who could access what became a complex and time-consuming process.
Three major problems began to emerge:
The data graph became massive. Each table, dashboard, and data pipeline connects to others, forming a web of relationships. Understanding dependencies and granting permissions safely across this web became difficult.
Access decisions became slower and required multiple approvals. Different teams had to coordinate across departments to manage security.
AI systems changed how data was used. Earlier, each team mainly worked within its own data domain. Now, AI models often need to analyze data from multiple domains at once. The traditional human-managed access system could not keep up with these cross-domain patterns.
To keep innovation moving while maintaining security, Meta had to find a better way to handle data access at scale. The Meta engineering team discovered that the answer lay in AI agents. These agents are intelligent software systems capable of understanding requests, evaluating risks, and making decisions autonomously within predefined boundaries.
In this article, we look at how Meta redesigned their data warehouse architecture to work with both humans and agents.
To overcome the growing complexity of data access, the Meta engineering team developed what they call a multi-agent system.
In simple terms, it is a setup where different AI agents work together, each handling specific parts of the data-access workflow. This design allows Meta to make data access both faster and safer by letting agents take over the repetitive and procedural tasks that humans once did manually.
At the heart of this system are two key types of agents that interact with each other:
Data-user agents, which act on behalf of employees or systems that need access to data.
Data-owner agents, which act on behalf of the people or teams responsible for managing and protecting data.
See the diagram below:
The data-user agent is not one single program. Instead, it is a group of smaller, specialized agents that work together. These sub-agents are coordinated by a triage layer, which acts like a manager that decides which sub-agent should handle each part of the task.
See the diagram below:
There are three main sub-agents inside this structure:
The first sub-agent helps users find safer or less restricted ways to access the information they need. For example, if someone requests access to a sensitive data table, the agent might recommend another table that contains similar but non-sensitive data. It can even help rewrite queries to use only unrestricted columns or public data sources.
The sub-agent relies on large language models (LLMs) to reason about relationships between datasets. Traditionally, this kind of knowledge existed only as “tribal knowledge”, meaning it was known informally by a few experienced engineers. Now, the agent can synthesize that hidden information and offer intelligent recommendations automatically.
The second is a low-risk exploration sub-agent. Most users do not need full access to an entire dataset when they are still exploring. Often, they just need to look at a small portion to understand its structure or content.
This sub-agent provides temporary or partial access to small data samples so that users can explore safely. It ensures that this kind of low-risk exploration does not expose sensitive information.
The third is the access-negotiation sub-agent. When full access is required, it prepares the formal permission request and communicates directly with the data-owner agent to request access based on business needs and data policies.
At the moment, Meta keeps a human in the loop to supervise these interactions, meaning a person reviews or confirms the agent’s actions. However, the engineering team expects that, over time, this sub-agent will be able to operate more autonomously as the system matures and safety mechanisms improve.
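To make the triage idea concrete, here is a minimal, hypothetical sketch of how such a layer might route a request to the three sub-agents described above. The class and method names are our own illustration, not Meta's implementation.

```python
# Hypothetical sketch of a triage layer routing requests to sub-agents.
# Class and method names are illustrative, not Meta's actual implementation.

class TriageLayer:
    """Decides which specialized sub-agent should handle a data-access task."""

    def __init__(self, recommender, explorer, negotiator):
        self.recommender = recommender    # suggests safer or less restricted alternatives
        self.explorer = explorer          # grants low-risk access to small samples
        self.negotiator = negotiator      # prepares formal access requests

    def route(self, request):
        if request.get("wants_full_access"):
            return self.negotiator.prepare_request(request)
        if request.get("is_exploratory"):
            return self.explorer.grant_sample(request)
        # Default: look for non-sensitive alternatives to the requested table.
        return self.recommender.suggest_alternatives(request)
```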
On the other side of the workflow is the data-owner agent, which represents the data managers or teams that control sensitive information.
It also consists of specialized components, each focusing on a different responsibility. See the diagram below:
Let’s look at the two main components of the data owner agent:
The first sub-agent functions like a junior security engineer. It follows Standard Operating Procedures (SOPs) written by data owners and applies them to incoming access requests.
When a data-user agent sends a request, this sub-agent checks it against the established rules and risk policies. It ensures that the request follows security protocols and that only legitimate users with valid purposes are granted access.
The second sub-agent goes beyond handling individual requests and takes a proactive role in shaping and maintaining access policies. It evolves the older “role-mining” process, where engineers manually examined user roles and permissions, into a smarter, automated system.
Using metadata, data semantics, and historical access patterns, it continuously refines and optimizes who should have access to which resources. This helps Meta reduce the manual overhead of managing permissions while still keeping the data warehouse secure.
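As a rough illustration of what following SOPs like a junior security engineer could look like in code, here is a hedged sketch with invented rule fields and thresholds; Meta's actual policies and evaluation logic are far richer and LLM-assisted.

```python
# Illustrative sketch of applying data-owner SOP rules to an incoming request.
# The rule structure and field names are assumptions for the example.

SOP_RULES = [
    {
        "resource": "ads_revenue_daily",
        "allowed_purposes": {"ads_analysis", "finance_reporting"},
        "max_rows_without_review": 10_000,
    },
]

def review_request(request):
    """Approve, deny, or escalate a request based on SOP-style rules."""
    for rule in SOP_RULES:
        if rule["resource"] == request["resource"]:
            if request["purpose"] not in rule["allowed_purposes"]:
                return "deny", "purpose not covered by the SOP"
            if request["estimated_rows"] > rule["max_rows_without_review"]:
                return "escalate", "large scan requires human review"
            return "approve", "request matches SOP"
    return "escalate", "no SOP rule found for this resource"
```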
The next challenge for the Meta engineering team was to make the data warehouse usable not only by humans but also by AI agents.
Unlike people, agents interact through text-based interfaces. This means they cannot browse graphical dashboards or manually navigate folders. They need information presented in a structured, text-readable format that they can process and reason about.
To achieve this, Meta redesigned the data warehouse into what can be described as a text-navigable hierarchy, similar to how folders and files are organized on a computer. In this setup, each element in the warehouse (such as a table, a dashboard, or a policy) is treated as a resource. Agents can read these resources and understand how they relate to one another. The system turns complex warehouse objects into text summaries that describe what each resource represents and how it can be used.
In addition, important materials like SOPs, internal documentation, and even historical access rules are also represented as text. This approach allows LLMs powering the agents to analyze these textual resources just as they would analyze written information in a document.
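A toy sketch of this idea, with invented paths and fields, might expose each warehouse object as a text resource an agent can read:

```python
# A rough sketch of how warehouse objects could be exposed as text resources
# that an LLM-backed agent can read and navigate. Paths and fields are invented.

RESOURCES = {
    "/org/ads/tables/ad_spend_daily": {
        "type": "table",
        "summary": "Daily ad spend by campaign. Columns: date, campaign_id, spend_usd.",
        "sensitivity": "restricted",
        "related": ["/org/ads/tables/ad_spend_public_agg"],
    },
    "/org/ads/policies/sop_ad_data_access.md": {
        "type": "sop",
        "summary": "Standard operating procedure for granting access to ads data.",
    },
}

def describe(path):
    """Return a plain-text description the agent can reason over."""
    r = RESOURCES[path]
    return f"{path}\ntype: {r['type']}\nsummary: {r['summary']}"
```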
To make good decisions about data access, an AI agent must understand the full situation around a request. The Meta engineering team calls this context and intention management. Together, these two concepts help the agent figure out who is asking for data, what they are trying to access, and why they need it.
Let’s start with context management. Context gives the agent the background information it needs before acting. Meta defines three main types of context:
Automatic context: When someone tries to open a dataset or run a query, the system already knows who they are and what resource they are attempting to access. This information is collected automatically from internal tools and user identities.
Static context: Sometimes, a user wants to focus on a specific project or dataset category. They can define that scope manually. For example, an engineer could choose to work within the “Ad Metrics” project area to limit search results to relevant tables.
Dynamic context: Agents can refine context further by analyzing metadata or performing similarity searches. For instance, if a user is studying ad-spend data, the agent can automatically find other tables related to ad budgets or campaign performance.
Once context is clear, the next step is intention management, which identifies the reason behind a user’s request. Meta approaches this in two ways:
Explicit intention is when a user clearly states their purpose. For example, they might indicate that they are “investigating ad performance for Q3”. The system can then match this role or goal with appropriate data-access policies.
Implicit intention is when the system infers purpose from user behavior. If an engineer suddenly starts accessing error logs at midnight, the system can reasonably assume they are responding to an outage and temporarily grant limited diagnostic access.
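Putting the two ideas together, a simplified sketch (with assumed field names, not Meta's internal schema) could represent context and intention like this:

```python
# Simplified sketch of assembling context and intention for an access request.
# Field names and signals are assumptions, not Meta's internal schema.

from dataclasses import dataclass, field

@dataclass
class AccessContext:
    user: str                     # automatic: who is asking
    resource: str                 # automatic: what they are touching
    project_scope: str = ""       # static: user-declared working area
    related_resources: list = field(default_factory=list)  # dynamic: similarity search

@dataclass
class Intention:
    explicit: str = ""            # e.g. "investigating ad performance for Q3"
    inferred: str = ""            # e.g. "likely incident response"

def build_request(user, resource, stated_purpose=None, recent_activity=None):
    ctx = AccessContext(user=user, resource=resource)
    intent = Intention(explicit=stated_purpose or "")
    if recent_activity and "oncall_alert" in recent_activity:
        intent.inferred = "incident response"   # implicit intention from behavior
    return ctx, intent
```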
To understand how Meta’s agentic data-access system actually works, let’s look at an end-to-end example.
The process begins when a data scientist wants to look at new data for analysis. Instead of immediately granting full access to an entire dataset, the low-risk exploration sub-agent steps in first. It allows the user to view a small, limited sample of the data so they can understand its structure and decide if it is relevant to their task. At this stage, context-aware controls ensure that only non-sensitive parts of the dataset are visible.
If the user later needs deeper or broader access, the access-negotiation sub-agent automatically prepares a formal permission request and contacts the data-owner agent for review. This workflow not only speeds up exploration but also keeps security intact by enforcing layers of protection at every step.
The entire system operates through four major capabilities:
Context analysis: The agent understands what the user is trying to do and matches it with business rules and policies.
Query-level access control: Each query is examined to see how much data it touches and whether it performs aggregations or random sampling. This helps the system judge the potential risk of exposure.
Data-access budgets: Every employee has a daily quota of how much data they can access. This budget resets automatically every day and acts as a safeguard against accidental overexposure (a simple version of this check is sketched after the list).
Rule-based risk management: The system continuously monitors agent behavior through analytical risk rules, catching anything unusual or potentially unsafe.
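Here is that toy version of the budget and query-level checks, with made-up thresholds and weights purely for illustration:

```python
# Toy sketch of a daily data-access budget check combined with a query-level
# risk score. Thresholds and weights are invented for illustration.

DAILY_BUDGET_ROWS = 1_000_000

def risk_score(query_stats):
    """Rough risk estimate: raw row reads are riskier than aggregations."""
    score = query_stats["rows_scanned"]
    if query_stats.get("is_aggregation"):
        score *= 0.1   # aggregated output exposes less raw data
    if query_stats.get("is_random_sample"):
        score *= 0.2
    return score

def check_access(user_usage_today, query_stats):
    if user_usage_today + query_stats["rows_scanned"] > DAILY_BUDGET_ROWS:
        return False, "daily data-access budget exceeded"
    if risk_score(query_stats) > 500_000:
        return False, "query risk too high; route to human review"
    return True, "allowed"
```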
See the diagram below:
Behind the scenes, a complex architecture powers this workflow.
The data-user agent serves as the entry point. It collects signals from several internal tools:
User-activity logs, which include actions like editing code (diffs), viewing dashboards, completing tasks, or handling service events.
User-profile information, such as team, role, and current project details.
Using this information, the agent builds an intention model, a structured understanding of why the user is making the request and what they need to accomplish. This model is combined with the shape of the query (for example, whether it is reading a few rows, aggregating data, or joining large tables) to form a complete picture of the situation.
Once this intention is formed, the data-user agent hands control to the data-owner agent. This second agent retrieves metadata about the requested resources, including table summaries, column descriptions, and SOPs. It then uses a large language model (LLM) to reason about whether access should be granted or denied. The LLM’s reasoning is checked by a set of guardrails that apply rule-based risk calculations to make sure the outcome aligns with security policies.
See the diagram below:
Every action, decision, and result is logged securely for future auditing and analysis. This makes it possible to trace exactly how and why each access decision was made.
The Meta engineering team has made significant progress toward transforming how data is accessed and secured across its massive warehouse systems. However, the journey toward a fully agent-ready infrastructure is still ongoing. The long-term vision is to create a system where both humans and AI agents can work side by side, safely and efficiently, without adding complexity or risk.
The first area of continued focus is agent collaboration. Meta is increasingly seeing scenarios where agents act on behalf of users without direct human input. In the future, these agents may communicate and negotiate with each other automatically. To support this, Meta needs to refine how agents interact, ensuring that every exchange remains transparent, auditable, and aligned with company policies.
Next, the infrastructure itself must evolve. Many of Meta’s warehouse tools, APIs, and interfaces were originally built for human use. To fully enable machine-to-machine workflows, these systems must be reengineered to accommodate automated reasoning, contextual understanding, and safe delegation between agents.
Finally, Meta is investing heavily in benchmarking and evaluation. For agents to operate safely, the company must continuously measure performance, accuracy, and compliance. This involves defining clear metrics and running regular evaluations to detect errors or regressions. The feedback loop created by human review and automated assessment ensures that the system learns and improves over time.
In summary, Meta’s data warehouse now integrates AI agents that not only request but also approve access in a controlled manner. The combination of LLM-based reasoning with rule-based guardrails ensures that productivity and security remain balanced.
References:
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-12-03 00:30:47
Extreme Scale Engineering | Online | March 11-12
Your free ticket to Monster SCALE Summit is waiting — 30+ engineering talks on data-intensive applications
Monster SCALE Summit is a virtual conference that’s all about extreme-scale engineering and data-intensive applications. Engineers from Discord, Disney, LinkedIn, Pinterest, Rivian, American Express, Google, ScyllaDB, and more will be sharing 30+ talks on topics like:
Distributed databases
Streaming and real-time processing
Intriguing system designs
Massive scaling challenges
Don’t miss this chance to connect with 20K of your peers designing, implementing, and optimizing data-intensive applications – for free, from anywhere.
Register now to save your seat, and become eligible for an early bird swag pack!
Disclaimer: The details in this post have been derived from the details shared online by the Netflix Engineering Team. All credit for the technical details goes to the Netflix Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Netflix processes an enormous amount of data every second. Each time a user plays a show, rates a movie, or receives a recommendation, multiple databases and microservices work together behind the scenes. This functionality is supported by hundreds of independent systems that must stay consistent with each other. When something goes wrong in one system, it can quickly create a ripple effect across the platform.
Netflix’s engineering team faced several recurring issues that threatened the reliability of their data. Some of these included accidental data corruption after schema changes, inconsistent updates between storage systems such as Apache Cassandra and Elasticsearch, and message delivery failures during transient outages. At times, bulk operations like large delete jobs even caused key-value database nodes to run out of memory. On top of that, some databases lacked built-in replication, which meant that regional failures could lead to permanent data loss.
Each engineering team tried to handle these issues differently. One team would build custom retry systems, another would design its own backup strategy, and yet another would use Kafka directly for message delivery. While these solutions worked individually, they created complexity and inconsistent guarantees across Netflix’s ecosystem. Over time, this patchwork approach increased maintenance costs and made debugging more difficult.
To fix this, Netflix built a Write-Ahead Log system to act as a single, resilient foundation for data reliability. The WAL standardizes how data changes are recorded, stored, and replayed across services. In simple terms, it captures every change before it is applied to the database, so that even if something fails midway, no information is lost.
In this article, we will look at how Netflix built this WAL and the challenges it faced.
At its core, a Write-Ahead Log is a simple but powerful idea. It is a system that keeps a record of every change made to data before those changes are applied to the actual database. You can think of it like keeping a journal of all the actions you plan to take. Even if something goes wrong during the process, you still have that journal to remind you exactly what you were doing, so you can pick up right where you left off.
In practical terms, when an application wants to update or delete information in a database, it first writes that intention to the WAL. Only after the entry has been safely recorded does the database proceed with the operation. This means that if a server crashes or a network connection drops, Netflix can replay the operations from the WAL and restore everything to the correct state. Nothing is lost, and the data remains consistent across systems.
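The core idea fits in a few lines. The toy sketch below logs each intended change before applying it and replays the log on recovery; a production WAL like Netflix's would also fsync, track what has already been applied, and run as a distributed service.

```python
# Toy write-ahead log: record the intended change first, apply it, and replay
# unapplied entries after a crash. A real WAL also fsyncs, deduplicates
# replayed entries, and is distributed across servers.

import json

def write_to_log(log_path, entry):
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")   # record intent before touching the database

def apply(entry, database):
    database[entry["key"]] = entry["value"]

def recover(log_path, database):
    """Replay logged operations that may not have reached the database."""
    with open(log_path) as f:
        for line in f:
            apply(json.loads(line), database)
```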
Netflix’s version of WAL is not tied to a single database or service.
It is distributed, meaning it runs across multiple servers to handle massive volumes of data. It is also pluggable, allowing it to connect easily to various technologies, such as Kafka, Amazon SQS, Apache Cassandra, and EVCache. This flexibility allowed the Netflix engineering team to use the same reliability framework for different types of workloads, whether it’s storing cached video metadata, user preferences, or system logs.
See the diagram below:
The WAL provides several key benefits that make Netflix’s data platform more resilient:
Durability: Every change is logged first, so even if a database goes offline, no data is permanently lost.
Retry and Delay Support: If a message fails to process due to an outage or network issue, the WAL can automatically retry it later, with custom delays.
Cross-Region Replication: Data can be copied across regions, ensuring the same information exists in multiple data centers for disaster recovery.
Multi-Partition Consistency: For complex updates involving multiple tables or partitions, WAL ensures that all changes are coordinated and eventually consistent.
Netflix’s Write-Ahead Log system provides a simple interface for the developers. Despite the complexity of what happens behind the scenes, the API that developers interact with contains only one main operation called WriteToLog.
This API acts as the entry point for any application that wants to record a change. The structure looks something like this:
rpc WriteToLog (WriteToLogRequest) returns (WriteToLogResponse);
Even though this may look technical, the idea is straightforward. A service sends a request to WAL describing what it wants to write and where that data should go. WAL then processes the request, stores it safely, and responds with information about whether the operation was successful.
The request contains four main parts:
Namespace: This identifies which logical group or application the data belongs to. Think of it as a label that helps WAL organize and isolate data from different teams or services.
Lifecycle: This specifies timing details, such as whether the message should be delayed or how long WAL should keep it.
Payload: This is the actual content or data being written to the log.
Target: This tells WAL where to send the data after it has been safely recorded, such as a Kafka topic, a database, or a cache.
The response from WAL is equally simple:
Durable: Indicates whether the request was successfully stored and made reliable.
Message: Provides details if something went wrong, like an error message or reason for failure. The request and response shapes are sketched below.
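Here is that sketch as plain data; any field names beyond the four request parts and two response fields described above are assumptions, not the actual API.

```python
# Hedged sketch of what a WriteToLog request and response might carry.
# Only the four request parts and two response fields come from the article;
# the nested field names are assumptions.

write_request = {
    "namespace": "pds_realtime_updates",                 # which app or team the data belongs to
    "lifecycle": {"delay_seconds": 60, "ttl_seconds": 86_400},
    "payload": b"serialized change event",
    "target": {"type": "kafka", "topic": "pds-updates"}, # where to send it after it is recorded
}

write_response = {
    "durable": True,   # the request was safely stored
    "message": "",     # error details if something went wrong
}
```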
Each namespace in WAL has its own configuration that defines how it behaves. For example, one namespace may be set up to use Kafka for high-speed streaming, while another might rely on Amazon SQS for delayed message delivery. The team can adjust settings like retry counts, backoff times, and delay intervals depending on what each application needs.
Netflix designed the WAL system to be flexible enough to support many different situations, which they refer to as personas. Each persona represents a unique way that WAL is used within the company’s data ecosystem.
Let’s look at a few of the main ones to understand how this system adapts to different needs.
This use case comes from the Product Data Systems (PDS) team, which handles a lot of real-time data updates.
In large-scale systems like Netflix, failures are inevitable. Sometimes, a downstream service such as Kafka or a database might be temporarily unavailable due to network issues or maintenance.
Instead of losing messages or forcing engineers to manually retry failed operations, WAL automatically steps in. When a system failure occurs, WAL uses Amazon SQS (Simple Queue Service) to delay messages and retry them later.
See the diagram below for backoff and delayed retries for clients producing to Kafka:
Here’s how it works in simple terms:
If a message fails to be delivered, WAL stores it in a queue and waits for a certain amount of time before trying again. The delay can be configured based on how long the system is expected to recover.
Once the downstream service is back online, WAL automatically retries the messages, ensuring nothing is lost and no manual intervention is needed.
The diagram below shows the backoff and delayed retries for clients consuming from Kafka:
This approach saves engineers a lot of time and prevents cascading failures that might otherwise spread across the platform.
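A stripped-down sketch of this delayed-retry behavior is shown below. The queue interface is a stand-in, not the actual Amazon SQS SDK, and the delay values are invented.

```python
# Simplified sketch of delayed retries using a queue with per-message delays.
# The queue object is a placeholder, not the SQS client.

RETRY_DELAYS = [30, 120, 600]   # seconds: back off further on each attempt

def deliver_with_retries(message, send_fn, queue):
    try:
        send_fn(message)                      # e.g. produce to Kafka
    except Exception:
        attempt = message.get("attempt", 0)
        if attempt < len(RETRY_DELAYS):
            message["attempt"] = attempt + 1
            queue.enqueue(message, delay_seconds=RETRY_DELAYS[attempt])
        else:
            queue.dead_letter(message)        # give up; park the message for inspection
```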
Another major use case is data replication across Netflix’s global regions. The company’s caching system, EVCache, stores frequently accessed data to make streaming fast and reliable. However, since Netflix operates worldwide, the same data needs to exist in multiple regions.
WAL makes this replication seamless by using Kafka under the hood. Whenever data is written or deleted in one region, WAL captures that event and sends it to other regions. The consumers in each region then replay the same operations locally, ensuring that all copies of the data stay synchronized.
See the diagram below:
In simpler terms, WAL acts like a reliable postman, making sure every region receives the same “letters” (data updates), even if network disruptions occur. This system keeps Netflix consistent around the world. Users in India, Europe, or the US all see the same data at nearly the same time.
The final example involves Netflix’s Key-Value data service, which stores information in systems like Apache Cassandra. Sometimes, a single operation might need to update data spread across multiple partitions or tables. Handling these multi-part changes is tricky, especially in distributed systems, because a failure in one partition can leave others out of sync.
WAL solves this problem by ensuring atomicity, meaning that either all the changes succeed or all are retried until they do. To achieve this, Netflix’s WAL combines Kafka for message delivery with durable storage for reliability. This setup functions similarly to a two-phase commit, a well-known database technique that guarantees data consistency across multiple locations.
In short, WAL coordinates complex updates so that Netflix’s data remains correct, even when multiple systems are involved.
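Conceptually, the flow resembles the sketch below: log the whole batch first, then keep applying each partition's mutation until every one has succeeded. This is a simplification of the two-phase-commit-like protocol, with placeholder interfaces.

```python
# Rough sketch of WAL-backed multi-partition writes: every mutation is logged
# first, then applied, and failed partitions are retried until they succeed.
# Real systems add backoff, timeouts, and idempotent writes; this is only the idea.

def apply_multi_partition(mutations, wal, partitions):
    batch_id = wal.append({"status": "pending", "mutations": mutations})
    pending = list(mutations)
    while pending:
        still_failing = []
        for m in pending:
            try:
                partitions[m["partition"]].write(m["key"], m["value"])
            except Exception:
                still_failing.append(m)       # retry on the next pass
        pending = still_failing
    wal.mark_committed(batch_id)
```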
To understand how Netflix’s Write-Ahead Log (WAL) works behind the scenes, it helps to break it down into its main building blocks.
See the diagram below:
The system is made up of several key components that work together to move data safely from one place to another while keeping everything flexible and resilient.
Producers: The producer is the first part of the system. It accepts messages or data change requests from various Netflix applications and writes them into a queue. You can think of producers as the “entry doors” of WAL. Whenever an app wants to log an update, it hands the data to a producer, which makes sure it gets safely added to the right queue.
Consumers: Consumers are the “exit doors” of the system. Their job is to read messages from the queue and send them to the correct destination, such as a database, cache, or another service. Since consumers run separately from producers, they can process messages at their own pace without slowing down the rest of the system.
Message Queues: The message queue is the middle layer that connects producers and consumers. Netflix primarily uses Kafka or Amazon SQS for this purpose. Each namespace in WAL (which represents a specific use case or service) has its own dedicated queue. This ensures isolation between applications so that a heavy workload from one service does not affect another. Every namespace also includes a Dead Letter Queue (DLQ). The DLQ is a special backup queue that stores messages that repeatedly fail to process. This gives engineers a chance to inspect and fix the problematic data later without losing it. A simplified consumer loop that falls back to the DLQ is sketched after this list.
Control Plane: The control plane is like the central command center for WAL. It allows Netflix engineers to change settings, such as which queue type to use, how many retries should occur, or what the delay between retries should be. The key advantage here is that teams can modify these settings without having to change their application code. This makes the system highly adaptable and easy to maintain.
Targets: Finally, the targets are the destinations where WAL sends the data. A target can be a database like Cassandra, a cache like EVCache, or even another message queue. The flexibility of defining targets through configuration means that the same WAL architecture can support many different workloads across Netflix.
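Here is that simplified consumer loop tying these pieces together; the queue, target, and DLQ interfaces are placeholders rather than Netflix's actual APIs.

```python
# Illustrative consumer loop: read from a namespace's queue, deliver to the
# configured target, and move repeatedly failing messages to the DLQ.
# The queue, target, and dlq objects are placeholders.

MAX_ATTEMPTS = 5

def consume(queue, target, dlq):
    while True:
        msg = queue.poll()
        if msg is None:
            break
        try:
            target.write(msg["payload"])      # e.g. Cassandra, EVCache, or another queue
            queue.ack(msg)
        except Exception:
            if msg["attempts"] + 1 >= MAX_ATTEMPTS:
                dlq.put(msg)                  # keep the message for later inspection
                queue.ack(msg)
            else:
                queue.retry_later(msg)
```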
The way Netflix deploys its Write-Ahead Log (WAL) system is just as important as how it works internally.
To handle billions of data operations across many teams and services, Netflix needed a platform that could scale easily, stay secure, and run reliably across regions. To achieve this, WAL is deployed on top of Netflix’s Data Gateway Infrastructure.
This infrastructure acts as a foundation that gives WAL several built-in advantages right out of the box:
mTLS for security: All communication between services is encrypted and authenticated using mutual Transport Layer Security (mTLS). This ensures that only trusted Netflix services can talk to each other, keeping sensitive data safe.
Connection management: The platform automatically manages network connections, making sure requests are routed efficiently and that no single component gets overloaded.
Auto-scaling and load shedding: WAL uses adaptive scaling to adjust the number of active instances based on demand. If CPU or network usage gets too high, the system automatically adds more capacity. In extreme cases, it can also shed low-priority requests to protect the stability of the service.
Netflix organizes WAL deployments into shards. A shard is an independent deployment that serves a specific group of applications or use cases. For example, one shard might handle the Ads service, another might handle Gaming data, and so on. This separation prevents the “noisy neighbor” problem, where one busy service could slow down others running on the same system.
Inside each shard, there can be multiple namespaces, each with its own configuration and purpose. These configurations are stored in a globally replicated SQL database, ensuring they are always available and consistent, even if a region goes offline.
See the diagram below for the deployment model of WAL at Netflix:
Several key design principles shaped the success of WAL. The first is its pluggable architecture, which allows Netflix to switch between different technologies, such as Kafka or Amazon SQS, without changing application code. This flexibility ensures that teams can choose the most suitable underlying system for their specific use cases while relying on the same core framework.
Another principle is the reuse of existing infrastructure. Instead of building everything from scratch, Netflix built WAL on top of its already established systems, like the Data Gateway platform and Key-Value abstractions. This approach saved development time and allowed the new system to fit naturally into the company’s broader data architecture.
Equally important is the separation of concerns between producers and consumers. Because these components scale independently, Netflix can adjust each one based on traffic patterns or system load. This independence allows WAL to handle massive spikes in demand without service degradation.
Finally, Netflix recognizes that even a system designed for reliability must consider its own limits. The team continuously evaluates trade-offs, such as dealing with slow consumers or managing backpressure during heavy traffic. Techniques like partitioning and controlled retries are essential to keeping the system stable.
Looking ahead, Netflix plans to enhance WAL further. Future improvements include adding secondary indices to the Key-Value service, which will make data retrieval faster and more efficient, and supporting multi-target writes, allowing a single operation to send data to multiple destinations, such as a database and a backup system at the same time.
References:
2025-12-02 00:30:24
One of AI’s biggest challenges today is memory—how agents retain, recall, and remember over time. Without it, even the best models struggle with context loss, inconsistency, and limited scalability.
This new O’Reilly + Redis report breaks down why memory is the foundation of scalable AI systems and how real-time architectures make it possible.
Inside the report:
The role of short-term, long-term, and persistent memory in agent performance
Frameworks like LangGraph, Mem0, and Redis
Architectural patterns for faster, more reliable, context-aware systems
The first time most people interact with a modern AI assistant like ChatGPT or Claude, there’s often a moment of genuine surprise. The system doesn’t just spit out canned responses or perform simple keyword matching. It writes essays, debugs code, explains complex concepts, and engages in conversations that feel remarkably natural.
The immediate question becomes: how does this actually work? What’s happening under the hood that enables a computer program to understand and generate human-like text?
The answer lies in a training process that transforms vast quantities of internet text into something called a Large Language Model, or LLM. Despite the almost magical appearance of their capabilities, these models don’t think, reason, or understand like human beings. Instead, they’re extraordinarily sophisticated pattern recognition systems that have learned the statistical structure of human language by processing billions of examples.
In this article, we will walk through the complete journey of how LLMs are trained, from the initial collection of raw data to the final conversational assistant. We’ll explore how these models learn, what their architecture looks like, the mathematical processes that drive their training, and the challenges involved in ensuring they learn appropriately rather than simply memorizing their training data.
LLMs don’t work like search engines or databases, looking up stored facts when asked questions.
Everything an LLM knows is encoded in its parameters, which are billions of numerical values that determine how the model processes and generates text. These parameters are essentially adjustable weights that get tuned during training. When someone asks an LLM about a historical event or a programming concept, the model isn’t retrieving a stored fact. Instead, it’s generating a response based on patterns it learned by processing enormous amounts of text during training.
Think about how humans learn a new language by reading extensively. After reading thousands of books and articles, we develop an intuitive sense of how the language works. We learn that certain words tend to appear together, that sentences follow particular structures, and that context helps determine meaning. We don’t memorize every sentence we’ve ever read, but we internalize the patterns.
LLMs do something conceptually similar, except they do it through mathematical processes rather than conscious learning, and at a scale that far exceeds human reading capacity. In other words, the core learning task for an LLM is simple: predict the next token.
A token is roughly equivalent to a word or a piece of a word. Common words like “the” or “computer” might be single tokens, while less common words might be split into multiple tokens. For instance, “unhappiness” might become “un” and “happiness” as separate tokens. During training, the model sees billions of text sequences and learns to predict what token comes next at each position. If it sees “The capital of France is”, it learns to predict “Paris” as a likely continuation.
What makes this remarkable is that by learning to predict the next token, the model inadvertently learns far more. It learns grammar because grammatically correct text is more common in training data. It learns facts because factual statements appear frequently. It even learns some reasoning patterns because logical sequences are prevalent in the text it processes.
However, this learning mechanism also explains why LLMs sometimes “hallucinate” or confidently state incorrect information. The model generates plausible-sounding text based on learned patterns that may not have been verified against a trusted database.
Training an LLM begins long before any actual learning takes place.
The first major undertaking is collecting training data, and the scale involved is staggering. Organizations building these models gather hundreds of terabytes of text from diverse sources across the internet: websites, digitized books, academic papers, code repositories, forums, social media, and more. Web crawlers systematically browse and download content, similar to how search engines index the web. Some organizations also license datasets from specific sources to ensure quality and legal rights. The goal is to assemble a dataset that represents the breadth of human knowledge and language use across different domains, styles, and perspectives.
However, the raw internet is messy. It contains duplicate content, broken HTML fragments, garbled encoding, spam, malicious content, and vast amounts of low-quality material. This is why extensive data cleaning and preprocessing become essential before training can begin.
The first major cleaning step is deduplication. When the same text appears repeatedly in the training data, the model is far more likely to memorize it verbatim rather than learn general patterns from it. If a particular news article was copied across a hundred different websites, the model doesn’t need to see it a hundred times.
Quality filtering comes next. Not all text on the internet is equally valuable for training. Automated systems evaluate each piece of text using various criteria: grammatical correctness, coherence, information density, and whether it matches patterns of high-quality content.
Content filtering for safety and legal compliance is another sensitive challenge. Automated systems scan for personally identifiable information like email addresses, phone numbers, and social security numbers, which are then removed or anonymized to protect privacy. Filters identify and try to reduce the prevalence of toxic content, hate speech, and explicit material, though perfect filtering proves impossible at this scale. There’s also filtering for copyrighted content or material from sources that have requested exclusion, though this remains both technically complex and legally evolving.
The final preprocessing step is tokenization, which transforms human-readable text into a format the model can process.
See the diagram below:
Rather than working with whole words, which would require handling hundreds of thousands of different vocabulary items, tokenization breaks text into smaller units called tokens based on common patterns. A frequent word like “cat” might be a single token, while a rarer word like “unhappiness” might split into “un” and “happiness.” These tokens are then represented as numbers, so “Hello world” might become something like [5431, 892]. This approach, often using methods like Byte Pair Encoding, allows the model to work with a fixed vocabulary of perhaps 50,000 to 100,000 tokens that can represent essentially any text.
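To make tokenization tangible, here is a toy greedy subword tokenizer with an invented vocabulary and token IDs. Real systems use learned merge rules (Byte Pair Encoding), so actual token boundaries and IDs will differ.

```python
# Toy subword tokenizer to illustrate the idea (not real Byte Pair Encoding).
# The vocabulary and token IDs are invented for the example.

VOCAB = {"un": 11, "happiness": 12, "hello": 13, "world": 14, "the": 15, "cat": 16}

def tokenize(text):
    tokens = []
    for word in text.lower().split():
        if word in VOCAB:
            tokens.append(VOCAB[word])
            continue
        # Greedily match the longest known prefix into subword pieces.
        while word:
            for length in range(len(word), 0, -1):
                piece = word[:length]
                if piece in VOCAB:
                    tokens.append(VOCAB[piece])
                    word = word[length:]
                    break
            else:
                word = word[1:]   # skip unknown characters in this toy version

    return tokens

print(tokenize("hello world"))   # [13, 14]
print(tokenize("unhappiness"))   # [11, 12]
```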
All of this preprocessing work establishes a fundamental principle: the quality and diversity of training data directly shape what the model will be capable of. A model trained predominantly on scientific papers will excel at technical language but struggle with casual conversation. A model trained on diverse, high-quality data from many domains will develop broader capabilities.
Before training begins, an LLM starts in a state of complete ignorance. Its billions of parameters are set to small random values, carefully chosen from specific statistical distributions but essentially meaningless. If we fed text to this untrained model and asked it to predict the next token, it would produce complete gibberish. The entire purpose of training is to adjust these random parameters into a configuration that encodes useful patterns about language and knowledge.
The training process follows a continuous loop that repeats billions of times.
First, the model receives batches of text sequences from the training data.
These sequences might be chunks of articles, books, or web pages, typically a few thousand tokens long.
The model processes these sequences and generates predictions for what token should come next at each position. For every position, it produces a probability distribution across all possible tokens in its vocabulary. As an example, it might assign 15% probability to one token, 8% to another, and smaller probabilities to thousands of other options.
These predictions are then compared against the actual next tokens that appeared in the training data. This comparison produces a loss value, which is a numerical score measuring how wrong the model’s predictions were. If the model assigned a high probability to the correct tokens, the loss is low. If it assigns low probability to correct tokens and high probability to incorrect ones, the loss is high. This single number becomes the signal that drives all learning.
The challenge now is figuring out how to adjust billions of parameters to reduce this loss. This is where gradient descent comes in.
Imagine standing in a foggy, hilly landscape where the goal is to reach the lowest valley, but visibility is limited to just a few feet. The strategy would be to feel which direction slopes downward at the current position, take a step in that direction, reassess, and repeat.
Gradient descent works similarly in an abstract mathematical space. The “landscape” represents how wrong the model’s predictions are across all possible parameter configurations, and the algorithm determines which direction in this space leads downward toward better predictions.
Through a process called backpropagation, the training system efficiently calculates exactly how each of the model’s billions of parameters contributed to the error. Should parameter number 47,293,816 be increased slightly or decreased slightly to reduce the loss? Backpropagation works backward through the model’s layers, calculating gradients that indicate the direction and magnitude each parameter should change. All parameters are then adjusted simultaneously by tiny amounts, perhaps changing a value by 0.00001. No single adjustment is meaningful on its own, but across trillions of these microscopic changes, the model gradually improves.
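The whole loop of prediction, loss, backpropagation, and parameter update fits in a few lines of PyTorch. The sketch below uses a tiny stand-in model rather than a real Transformer, but the mechanics are the same.

```python
# Minimal next-token training step in PyTorch, shown for intuition only.
# The model here is a tiny stand-in, not a real Transformer.

import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 129))       # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]       # learn to predict the next token

logits = model(inputs)                                # (batch, seq, vocab) prediction scores
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                       # backpropagation: compute gradients
optimizer.step()                                      # tiny adjustment to every parameter
optimizer.zero_grad()
```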
This process repeats continuously over weeks or even months of training on massive clusters of specialized processors.
Modern LLM training might use thousands of GPUs or TPUs working in parallel, consuming megawatts of electricity and costing tens of millions of dollars in computational resources. The training data is processed multiple times, with the model making billions of predictions, calculating billions of loss values, and performing trillions of parameter adjustments.
What emerges from this process is genuinely remarkable.
No individual parameter adjustment teaches the model anything specific. There’s no moment where we explicitly program in grammar rules or facts about the world. Instead, sophisticated capabilities emerge from the collective effect of countless tiny optimizations. The model learns low-level patterns like how adjectives typically precede nouns, mid-level patterns like how questions relate to answers, and high-level patterns like how scientific discussions differ from casual conversation. All of this arises naturally from the single objective of predicting the next token accurately.
By the end of pretraining, the model has become extraordinarily good at its task. It can predict what comes next in text sequences with high accuracy, demonstrating knowledge across countless domains and the ability to generate coherent, contextually appropriate text.
However, it’s still fundamentally an autocomplete system. If given a prompt that starts with a question, it might continue with more questions rather than providing an answer. The model understands patterns but hasn’t yet learned to be helpful, harmless, and honest in the way users expect from a conversational assistant. That transformation requires additional training steps we’ll explore in later sections.
The training process explains how LLMs learn, but the model’s structure determines what it’s capable of learning.
The architecture underlying modern LLMs is called the Transformer, introduced in a 2017 research paper with the fitting title “Attention Is All You Need.” This architectural breakthrough made today’s sophisticated language models possible.
Before Transformers, earlier neural networks processed text sequentially, reading one word at a time, much like a human reads a sentence from left to right. This sequential processing was slow and created difficulties when the model needed to connect information that appeared far apart in the text. If important context appeared at the beginning of a long paragraph, the model might struggle to remember it when processing the end.
Transformers revolutionized this by processing entire sequences of text simultaneously and using a mechanism called attention to let the model focus on relevant parts of the input regardless of where they appear.
The attention mechanism is best understood through an example.
Consider the sentence: “The animal didn’t cross the street because it was too tired.” When a human reads this, we instantly understand that “it” refers to “the animal” rather than “the street.” We do this by paying attention to context and meaning.
The attention mechanism in Transformers does something mathematically analogous. For each word the model processes, it calculates attention scores that determine how much that word should consider every other word in the sequence. These attention scores are learned during training. For example, the model learns that pronouns should pay high attention to their antecedents, that words at the end of sentences should consider the beginning for context, and countless other patterns that help interpret language correctly.
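At its core, this computation is scaled dot-product attention: softmax(QK^T / sqrt(d)) V. A minimal self-attention sketch (omitting the learned projection matrices and multiple heads that real Transformers add) looks like this:

```python
# Scaled dot-product attention, the core computation behind the mechanism
# described above: softmax(Q @ K^T / sqrt(d)) @ V.

import torch

def attention(query, key, value):
    d = query.shape[-1]
    scores = query @ key.transpose(-2, -1) / d ** 0.5   # how much each token attends to the others
    weights = torch.softmax(scores, dim=-1)              # attention weights sum to 1 per token
    return weights @ value

# One sequence of 6 tokens, each represented by a 16-dimensional vector.
x = torch.randn(1, 6, 16)
out = attention(x, x, x)          # self-attention: tokens attend to the same sequence
print(out.shape)                  # torch.Size([1, 6, 16])
```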
Transformer models are organized in layers, typically dozens of them stacked on top of each other. Each layer contains attention mechanisms along with other components, and information flows through these layers sequentially. The interesting aspect is that different layers learn to extract different kinds of patterns.
Early layers tend to recognize basic syntactic structures and simple word relationships.
Middle layers identify semantic patterns and understand how concepts relate to each other.
Later layers capture more abstract patterns, including complex reasoning and nuanced language understanding.
The information flowing through these layers takes the form of vectors, which are essentially lists of numbers that encode the meaning and context of each token position.
At each layer, these vectors get transformed based on the model’s parameters. Think of it as the model continuously refining its understanding of the text. The raw tokens enter at the bottom, and by the time information reaches the top layers, the model has developed a rich, multi-faceted representation that captures syntax, semantics, context, and relationships within the text.
This architecture provides several crucial advantages, which are as follows:
The ability to process sequences in parallel rather than sequentially means training can happen much faster, especially when distributed across thousands of processors.
The attention mechanism’s capacity to relate any part of the text to any other part, regardless of distance, enables the model to maintain context across long conversations or documents. Modern LLMs can handle contexts spanning thousands or even tens of thousands of tokens precisely because the Transformer architecture can efficiently connect information across these long spans.
The layered structure allows the model to build up an increasingly sophisticated understanding, starting from basic patterns and culminating in the complex language capabilities that make these systems so useful.
After pretraining, an LLM is excellent at predicting what comes next in text sequences, but this doesn’t make it a helpful conversational assistant.
If given a prompt that starts with a question, the pretrained model might continue with more questions rather than providing an answer. It simply completes text in statistically likely ways based on the patterns it has learned. Transforming this autocomplete system into the helpful assistants we interact with requires additional training phases.
Supervised fine-tuning addresses this gap by training the model on carefully curated examples of good behavior.
Instead of learning from general text, the model now trains on prompt-response pairs that demonstrate how to follow instructions, answer questions directly, and maintain a helpful persona. These examples might include questions paired with clear answers, instructions paired with appropriate completions, and conversations demonstrating polite and informative dialogue.
This dataset is much smaller than pretraining data, perhaps tens of thousands to hundreds of thousands of examples rather than billions, but each example is precisely constructed to teach desired behaviors to the LLM.
The training process remains the same: predict the next token, calculate loss, adjust parameters. However, now the model learns to predict tokens in these ideal responses rather than arbitrary internet text.
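In practice, the fine-tuning loss is usually computed only on the response tokens, so the model learns to produce the answer rather than to repeat the prompt. The sketch below illustrates that masking with made-up token IDs and random logits standing in for model output (it also skips the usual shift-by-one between inputs and labels).

```python
# Sketch of supervised fine-tuning on a prompt-response pair, with the loss
# masked so only response tokens contribute. Token IDs are made up.

import torch
import torch.nn.functional as F

prompt_ids   = torch.tensor([101, 234, 567])        # tokenized prompt (IDs invented)
response_ids = torch.tensor([890, 321, 654, 102])   # tokenized ideal response

input_ids = torch.cat([prompt_ids, response_ids])
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100    # -100 is ignored by cross_entropy

# Logits would come from the model; random values stand in here.
logits = torch.randn(len(input_ids), 50_000)
loss = F.cross_entropy(logits, labels, ignore_index=-100)
```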
Supervised fine-tuning provides significant improvement, but it has limitations. Writing explicit examples for every possible scenario the model might encounter is impractical. This is where reinforcement learning from human feedback (RLHF) provides further refinement. The process begins with the model generating multiple responses to various prompts. Human raters then rank these responses based on quality, helpfulness, and safety. These rankings train a separate reward model that learns to predict scores human raters would assign to any response.
See the diagram below:
Once the reward model exists, it guides further training of the language model. The language model generates responses, the reward model scores them, and the language model updates to produce higher-scoring responses.
There’s a careful balance here: the model should improve according to human preferences while not deviating so far from its pretrained version that it loses core knowledge and capabilities. This entire process can iterate multiple times, with improved models generating new responses for human evaluation.
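One common way to train the reward model from these rankings is a pairwise preference loss: the preferred response should receive a higher score than the rejected one. A minimal sketch:

```python
# Bradley-Terry style preference loss for training a reward model from human
# rankings: minimized when the chosen response outscores the rejected one.

import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # -log(sigmoid(chosen - rejected)), averaged over the batch of comparisons.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

chosen = torch.tensor([1.8, 0.4])      # reward scores for the preferred responses
rejected = torch.tensor([0.2, 0.9])    # scores for the responses raters ranked lower
print(preference_loss(chosen, rejected))
```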
Once training completes, the model undergoes a comprehensive evaluation before deployment.
Developers test it on various benchmarks that measure different capabilities such as language understanding, reasoning, mathematical ability, coding skills, and factual knowledge. Safety testing runs in parallel, examining the model’s tendency to generate harmful content, its susceptibility to adversarial prompts, and potential biases in its outputs.
The model also undergoes optimization for deployment. Training prioritizes learning capability over efficiency, but deployed models must respond quickly to user requests while managing computational costs. Techniques like quantization reduce the precision of parameters, using fewer bits to represent each number. This decreases memory requirements and speeds up computation while typically preserving most of the model’s capability. Other optimizations might involve distilling knowledge into smaller, faster models or implementing efficient serving infrastructure.
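As a concrete example of quantization, the toy snippet below maps float32 weights to int8 with a single per-tensor scale; production systems use more sophisticated schemes with per-channel scales, calibration data, and outlier handling.

```python
# Toy illustration of post-training quantization: map float32 weights to int8
# with a per-tensor scale, then reconstruct approximate values at runtime.

import torch

weights = torch.randn(4, 4)                       # pretend these are model parameters
scale = weights.abs().max() / 127.0               # one scale for the whole tensor
q = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)
dequantized = q.float() * scale                   # close to the original, 4x smaller to store

print((weights - dequantized).abs().max())        # small reconstruction error
```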
Deployment isn’t an endpoint but rather the beginning of a continuous cycle. Organizations monitor how users interact with deployed models, collect feedback on response quality, and identify edge cases where the model fails or behaves unexpectedly. This information feeds directly into the next training iteration.
When someone uses an LLM today, they’re interacting with the culmination of this entire process, from data collection through optimization.
The journey from raw internet data to conversational AI represents a remarkable achievement at the intersection of data engineering, mathematical optimization, massive-scale computation, and careful alignment with human values.
What begins as terabytes of text transforms through preprocessing, tokenization, and billions of parameter adjustments into systems capable of generating coherent text, answering questions, writing code, and engaging in sophisticated dialogue.
Understanding this training process reveals both the impressive capabilities and fundamental limitations of LLMs. For software engineers working with these systems, understanding the training process provides crucial context for making informed decisions about when and how to deploy them.
2025-12-01 00:45:29
Yearly Black Friday sale Ends Today! Use code BF2025 at checkout to get 30% off the all-in-one interview prep online courses.
To take advantage of this limited time offer, subscribe before 11:59 pm PST on Monday, December 1.
2025-11-30 00:30:39
Reliability shouldn’t cost extra—and Verizon proves it. Their customer-first design, featuring myPlan, myHome, and an industry-first 3-year Value Guarantee, delivers premium network quality without premium pricing.
Unwrap unbeatable deals:
Get the iPhone 17 Pro Max on Verizon with a new line on any myPlan. Also, get an Apple Watch Series 11 and iPad (A16), all on us with a connected device plan ($1,830 in value).
Galaxy S25 Ultra, Galaxy Watch8, and Galaxy Tab S10 FE 5G, all on us with any myPlan ($1,800 value).
Switch to select Verizon Home Internet plans and choose a Samsung 43” TV, Samsung Galaxy Tab S10 FE 5G, Marshall Kilburn III, Stream TV Soundbar, Samsung 32” Smart Monitor or $200 Target GiftCard, on Verizon.
Everyone gets a better deal—flexibility, savings, and support with no extra cost.
This week’s system design refresher:
⏳ LIMITED TIME OFFER: All in One Interview Prep Black Friday Sale
Virtualization vs. Containerization
5 REST API Authentication Methods
How do AirTags work?
What is a Firewall?
Modem vs. Router
SPONSOR US
Yearly Black Friday sale is now live! Use code BF2025 at checkout to get 30% off the all-in-one interview prep online courses.
To take advantage of this limited time offer, subscribe before 11:59 pm PST on Monday, December 1.
Before containers simplified deployment, virtualization changed how we used hardware. Both isolate workloads, but they do it differently.
Virtualization (Hardware-level isolation): Each virtual machine runs a complete operating system, such as Windows, Fedora, or Ubuntu, with its own kernel, drivers, and libraries. The hypervisor (VMware ESXi, Hyper-V, KVM) sits directly on hardware and emulates physical machines for each guest OS.
This makes VMs heavy but isolated. Need Windows and Linux on the same box? VMs handle it easily. Startup time for a typical VM is in minutes because you’re booting an entire operating system from scratch.
Containerization (OS-level isolation): Containers share the host operating system’s kernel. No separate OS per container. Just isolated processes with their own filesystem and dependencies.
The container engine (Docker, containerd, CRI-O, Podman) manages lifecycle, networking, and isolation, but it all runs on top of a single shared kernel. Lightweight and fast. Containers start in milliseconds because you’re not booting an OS, just launching a process.
But here’s the catch: all containers on a host must be compatible with that host’s kernel. Can’t run Windows containers on a Linux host (without nested virtualization tricks).
Over to you: What’s your go-to setup: containers in VMs, bare metal containers, or something else?
Basic Authentication: Clients include a Base64-encoded username and password in every request header, which is simple but insecure since Base64 is trivially decoded, effectively exposing credentials in plaintext. Useful in quick prototypes or internal services over secure networks. A short code sketch of the Basic and token checks appears after this list.
Session Authentication: After login, the server creates a session record and issues a cookie. Subsequent requests send that cookie so the server can validate user state. Used in traditional web-apps.
Token Authentication: Clients authenticate once to receive a signed token, then present the token on each request for stateless authentication. Used in single-page applications and modern APIs that require scalable, stateless authentication.
OAuth-Based Authentication: Clients obtain an access token via an authorization grant from an OAuth provider, then use that token to call resource servers on the user’s behalf. Used in cases of third-party integrations or apps that need delegated access to user data.
API Key Authentication: Clients present a predefined key (often in headers or query strings) with each request. The server verifies the key to authorize access. Used in service-to-service or machine-to-machine APIs where simple credential checks are sufficient.
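Here is that short sketch: constructing a Basic auth header and validating a bearer token. The token store is a placeholder; real services typically verify signed tokens such as JWTs instead of a hard-coded set.

```python
# Tiny illustration of two of the methods above: building a Basic auth header
# and checking a bearer token. Values are placeholders.

import base64

def basic_auth_header(username, password):
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {credentials}"}     # trivially decoded without TLS

VALID_TOKENS = {"abc123"}    # placeholder; in practice, verify a signed JWT

def is_authorized(headers):
    auth = headers.get("Authorization", "")
    return auth.startswith("Bearer ") and auth.removeprefix("Bearer ") in VALID_TOKENS

print(basic_auth_header("alice", "s3cret"))
print(is_authorized({"Authorization": "Bearer abc123"}))   # True
```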
Over to you: Which other API Authentication method have you seen?
AirTags work by leveraging a combination of Bluetooth technology and the vast network of Apple devices to help you locate your lost items.
Here’s a breakdown of how they function:
Bluetooth Signal: Each AirTag emits a secure Bluetooth signal that can be detected by nearby Apple devices (iPhones, iPads, etc.) within the Find My network.
Find My Network: When an AirTag comes within range of an Apple device in the Find My network, that device anonymously and securely relays the AirTag’s location information to iCloud.
Location Tracking: You can then use the Find My app on your own Apple device to see the approximate location of your AirTag on a map.
Limitations:
Please note that AirTags rely on Bluetooth technology and the presence of Apple devices within the Find My network. If your AirTag is in an area with few Apple devices, its location may not be updated as frequently or accurately.
Every time you connect to the Internet, a firewall quietly decides what can come in and what must stay out. A firewall is your network’s first line of defense. It filters traffic based on rules you define, by IP address, protocol, port, program, or even keywords. Every packet that tries to enter or leave your network passes through this checkpoint.
There are two main types:
Network Firewall: Sits at the network edge between your infrastructure and the internet. Can be physical hardware, virtualized software, or cloud-deployed service. Operates at Layer 3-4 of the OSI model. Filters traffic based on IP addresses, protocols, and ports before it ever reaches your internal network.
Protects the entire network at once. This is your first line of defense. Internet traffic hits the network firewall before it reaches your router, before it touches any internal systems.
Host-based Firewall: This runs as software on individual devices, like your laptop or a server. It works at Layer 3–7, inspecting packets more deeply and protecting only that specific device.
Your desktop has its own host firewall. Your server has its own. Each one is configured independently. It’s your last layer of defense in case something slips past the network firewall.
Together, they form a layered shield, keeping unwanted traffic out while letting legitimate communication flow freely.
Over to you: Have you ever had to troubleshoot a misconfigured firewall rule that accidentally blocked something critical? What was it, and how long did it take to find?
Most people think their WiFi router gives them internet. It doesn’t. Your router is just managing traffic inside your home. The actual internet connection comes from the modem.
Here’s what each one actually does:
Modem: The modem connects you to your Internet Service Provider (ISP). It translates signals between your ISP’s network and your home network. Depending on the service type, the digital link may use coaxial cable, fiber optic, or cellular connections. The modem converts those signals into data your devices can understand.
It provides one public IP address, meaning one connection to the internet. If you plug a single device directly into a modem via Ethernet, that device gets internet access with a public IP.
Router: The router creates a private network inside your home. It takes that single public IP from the modem and shares it across multiple devices using Network Address Translation (NAT). Every device on your network gets a private IP address, usually something like 192.168.1.x. The router keeps track of which device requested what data and routes responses back to the right device.
DHCP assigns those private IPs automatically. Your phone connects to WiFi, the router gives it an IP address, and suddenly it can talk to the internet through the router.
Modern devices often combine both functions, a modem-router combo, but understanding the distinction helps when you’re troubleshooting slow speeds or network drops.
Over to you: What’s your go-to trick to quickly diagnose whether the modem or router is to blame for slow Internet?