2025-12-02 00:30:24
One of AI’s biggest challenges today is memory—how agents retain and recall information over time. Without it, even the best models struggle with context loss, inconsistency, and limited scalability.
This new O’Reilly + Redis report breaks down why memory is the foundation of scalable AI systems and how real-time architectures make it possible.
Inside the report:
The role of short-term, long-term, and persistent memory in agent performance
Frameworks like LangGraph, Mem0, and Redis
Architectural patterns for faster, more reliable, context-aware systems
The first time most people interact with a modern AI assistant like ChatGPT or Claude, there’s often a moment of genuine surprise. The system doesn’t just spit out canned responses or perform simple keyword matching. It writes essays, debugs code, explains complex concepts, and engages in conversations that feel remarkably natural.
The immediate question becomes: how does this actually work? What’s happening under the hood that enables a computer program to understand and generate human-like text?
The answer lies in a training process that transforms vast quantities of internet text into something called a Large Language Model, or LLM. Despite the almost magical appearance of their capabilities, these models don’t think, reason, or understand like human beings. Instead, they’re extraordinarily sophisticated pattern recognition systems that have learned the statistical structure of human language by processing billions of examples.
In this article, we will walk through the complete journey of how LLMs are trained, from the initial collection of raw data to the final conversational assistant. We’ll explore how these models learn, what their architecture looks like, the mathematical processes that drive their training, and the challenges involved in ensuring they learn appropriately rather than simply memorizing their training data.
LLMs don’t work like search engines or databases, which look up stored facts when asked questions.
Everything an LLM knows is encoded in its parameters, which are billions of numerical values that determine how the model processes and generates text. These parameters are essentially adjustable weights that get tuned during training. When someone asks an LLM about a historical event or a programming concept, the model isn’t retrieving a stored fact. Instead, it’s generating a response based on patterns it learned by processing enormous amounts of text during training.
Think about how humans learn a new language by reading extensively. After reading thousands of books and articles, we develop an intuitive sense of how the language works. We learn that certain words tend to appear together, that sentences follow particular structures, and that context helps determine meaning. We don’t memorize every sentence we’ve ever read, but we internalize the patterns.
LLMs do something conceptually similar, except they do it through mathematical processes rather than conscious learning, and at a scale that far exceeds human reading capacity. In other words, the core learning task for an LLM is simple: predict the next token.
A token is roughly equivalent to a word or a piece of a word. Common words like “the” or “computer” might be single tokens, while less common words might be split into multiple tokens. For instance, “unhappiness” might become “un” and “happiness” as separate tokens. During training, the model sees billions of text sequences and learns to predict what token comes next at each position. If it sees “The capital of France is”, it learns to predict “Paris” as a likely continuation.
What makes this remarkable is that by learning to predict the next token, the model inadvertently learns far more. It learns grammar because grammatically correct text is more common in training data. It learns facts because factual statements appear frequently. It even learns some reasoning patterns because logical sequences are prevalent in the text it processes.
However, this learning mechanism also explains why LLMs sometimes “hallucinate” or confidently state incorrect information. The model generates plausible-sounding text based on learned patterns that may not have been verified against a trusted database.
Training an LLM begins long before any actual learning takes place.
The first major undertaking is collecting training data, and the scale involved is staggering. Organizations building these models gather hundreds of terabytes of text from diverse sources across the internet: websites, digitized books, academic papers, code repositories, forums, social media, and more. Web crawlers systematically browse and download content, similar to how search engines index the web. Some organizations also license datasets from specific sources to ensure quality and legal rights. The goal is to assemble a dataset that represents the breadth of human knowledge and language use across different domains, styles, and perspectives.
However, the raw internet is messy. It contains duplicate content, broken HTML fragments, garbled encoding, spam, malicious content, and vast amounts of low-quality material. This is why extensive data cleaning and preprocessing become essential before training can begin.
The first major cleaning step is deduplication. When the same text appears repeatedly in the training data, the model is far more likely to memorize it verbatim rather than learn general patterns from it. If a particular news article was copied across a hundred different websites, the model doesn’t need to see it a hundred times.
Quality filtering comes next. Not all text on the internet is equally valuable for training. Automated systems evaluate each piece of text using various criteria: grammatical correctness, coherence, information density, and whether it matches patterns of high-quality content.
Content filtering for safety and legal compliance is another sensitive challenge. Automated systems scan for personally identifiable information like email addresses, phone numbers, and social security numbers, which are then removed or anonymized to protect privacy. Filters identify and try to reduce the prevalence of toxic content, hate speech, and explicit material, though perfect filtering proves impossible at this scale. There’s also filtering for copyrighted content or material from sources that have requested exclusion, though this remains both technically complex and legally evolving.
The final preprocessing step is tokenization, which transforms human-readable text into a format the model can process.
See the diagram below:
Rather than working with whole words, which would require handling hundreds of thousands of different vocabulary items, tokenization breaks text into smaller units called tokens based on common patterns. A frequent word like “cat” might be a single token, while a rarer word like “unhappiness” might split into “un” and “happiness.” These tokens are then represented as numbers, so “Hello world” might become something like [5431, 892]. This approach, often using methods like Byte Pair Encoding, allows the model to work with a fixed vocabulary of perhaps 50,000 to 100,000 tokens that can represent essentially any text.
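To make this concrete, here is a toy sketch of greedy subword matching in Python. The vocabulary, token IDs, and matching strategy are invented purely for illustration; real tokenizers are trained on the corpus itself using algorithms like Byte Pair Encoding.

```python
# Toy illustration of subword tokenization (not a real BPE implementation).
# The vocabulary and token IDs below are made up for demonstration only.
vocab = {"un": 5001, "happiness": 5002, "hello": 5431, "world": 892, "cat": 17}

def tokenize(word, vocab):
    """Greedy longest-match split of a word into known subword tokens."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible substring first, then shrink.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append((piece, vocab[piece]))
                i = j
                break
        else:
            # Unknown character: real tokenizers fall back to byte-level tokens.
            tokens.append((word[i], -1))
            i += 1
    return tokens

print(tokenize("unhappiness", vocab))  # [('un', 5001), ('happiness', 5002)]
print(tokenize("hello", vocab))        # [('hello', 5431)]
```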
All of this preprocessing work establishes a fundamental principle: the quality and diversity of training data directly shape what the model will be capable of. A model trained predominantly on scientific papers will excel at technical language but struggle with casual conversation. A model trained on diverse, high-quality data from many domains will develop broader capabilities.
Before training begins, an LLM starts in a state of complete ignorance. Its billions of parameters are set to small random values, carefully chosen from specific statistical distributions but essentially meaningless. If we fed text to this untrained model and asked it to predict the next token, it would produce complete gibberish. The entire purpose of training is to adjust these random parameters into a configuration that encodes useful patterns about language and knowledge.
The training process follows a continuous loop that repeats billions of times.
First, the model receives batches of text sequences from the training data.
These sequences might be chunks of articles, books, or web pages, typically a few thousand tokens long.
The model processes these sequences and generates predictions for what token should come next at each position. For every position, it produces a probability distribution across all possible tokens in its vocabulary. As an example, it might assign 15% probability to one token, 8% to another, and smaller probabilities to thousands of other options.
These predictions are then compared against the actual next tokens that appeared in the training data. This comparison produces a loss value, which is a numerical score measuring how wrong the model’s predictions were. If the model assigned a high probability to the correct tokens, the loss is low. If it assigns low probability to correct tokens and high probability to incorrect ones, the loss is high. This single number becomes the signal that drives all learning.
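To give a rough sense of the arithmetic, the sketch below scores a single prediction: a softmax turns the model’s raw scores into probabilities, and the cross-entropy loss penalizes low probability on the token that actually appeared. The vocabulary size and values are invented for illustration; real models do this at every position of every sequence in a batch.

```python
import numpy as np

# A minimal sketch of how one next-token prediction is scored.
vocab_size = 10
logits = np.random.randn(vocab_size)             # raw scores from the model for one position
probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax: convert scores to probabilities

correct_token_id = 3                             # the token that actually appeared next
loss = -np.log(probs[correct_token_id])          # cross-entropy loss for this position

# High probability on the correct token -> loss near 0; low probability -> large loss.
print(f"P(correct token) = {probs[correct_token_id]:.3f}, loss = {loss:.3f}")
```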
The challenge now is figuring out how to adjust billions of parameters to reduce this loss. This is where gradient descent comes in.
Imagine standing in a foggy, hilly landscape where the goal is to reach the lowest valley, but visibility is limited to just a few feet. The strategy would be to feel which direction slopes downward at the current position, take a step in that direction, reassess, and repeat.
Gradient descent works similarly in an abstract mathematical space. The “landscape” represents how wrong the model’s predictions are across all possible parameter configurations, and the algorithm determines which direction in this space leads downward toward better predictions.
Through a process called backpropagation, the training system efficiently calculates exactly how each of the model’s billions of parameters contributed to the error. Should parameter number 47,293,816 be increased slightly or decreased slightly to reduce the loss? Backpropagation works backward through the model’s layers, calculating gradients that indicate the direction and magnitude each parameter should change. All parameters are then adjusted simultaneously by tiny amounts, perhaps changing a value by 0.00001. No single adjustment is meaningful on its own, but across trillions of these microscopic changes, the model gradually improves.
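Here is a deliberately tiny illustration of the same downhill-stepping idea on a one-parameter toy model. The data and learning rate are made up, and real training computes gradients for billions of parameters via backpropagation rather than this closed-form toy, but the loop structure is the same: predict, measure the loss, step the parameter in the direction that reduces it.

```python
import numpy as np

# Minimal gradient descent on a one-parameter toy "model": predict y = w * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # the true relationship is y = 2x

w = 0.0                          # start from an uninformed parameter value
learning_rate = 0.05

for step in range(200):
    predictions = w * x
    loss = np.mean((predictions - y) ** 2)          # how wrong are we?
    gradient = np.mean(2 * (predictions - y) * x)   # direction of steepest increase in loss
    w -= learning_rate * gradient                   # step downhill by a small amount

final_loss = np.mean((w * x - y) ** 2)
print(f"learned w = {w:.4f}, final loss = {final_loss:.6f}")  # w approaches 2.0
```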
This process repeats continuously over weeks or even months of training on massive clusters of specialized processors.
Modern LLM training might use thousands of GPUs or TPUs working in parallel, consuming megawatts of electricity and costing tens of millions of dollars in computational resources. The training data is processed multiple times, with the model making billions of predictions, calculating billions of loss values, and performing trillions of parameter adjustments.
What emerges from this process is genuinely remarkable.
No individual parameter adjustment teaches the model anything specific. There’s no moment where we explicitly program in grammar rules or facts about the world. Instead, sophisticated capabilities emerge from the collective effect of countless tiny optimizations. The model learns low-level patterns like how adjectives typically precede nouns, mid-level patterns like how questions relate to answers, and high-level patterns like how scientific discussions differ from casual conversation. All of this arises naturally from the single objective of predicting the next token accurately.
By the end of pretraining, the model has become extraordinarily good at its task. It can predict what comes next in text sequences with high accuracy, demonstrating knowledge across countless domains and the ability to generate coherent, contextually appropriate text.
However, it’s still fundamentally an autocomplete system. If given a prompt that starts with a question, it might continue with more questions rather than providing an answer. The model understands patterns but hasn’t yet learned to be helpful, harmless, and honest in the way users expect from a conversational assistant. That transformation requires additional training steps we’ll explore in later sections.
The training process explains how LLMs learn, but the model’s structure determines what it’s capable of learning.
The architecture underlying modern LLMs is called the Transformer, introduced in a 2017 research paper with the fitting title “Attention Is All You Need.” This architectural breakthrough made today’s sophisticated language models possible.
Before Transformers, earlier neural networks processed text sequentially, reading one word at a time, much like a human reads a sentence from left to right. This sequential processing was slow and created difficulties when the model needed to connect information that appeared far apart in the text. If important context appeared at the beginning of a long paragraph, the model might struggle to remember it when processing the end.
Transformers revolutionized this by processing entire sequences of text simultaneously and using a mechanism called attention to let the model focus on relevant parts of the input regardless of where they appear.
The attention mechanism is best understood through an example.
Consider the sentence: “The animal didn’t cross the street because it was too tired.” When a human reads this, we instantly understand that “it” refers to “the animal” rather than “the street.” We do this by paying attention to context and meaning.
The attention mechanism in Transformers does something mathematically analogous. For each word the model processes, it calculates attention scores that determine how much that word should consider every other word in the sequence. These attention scores are learned during training. For example, the model learns that pronouns should pay high attention to their antecedents, that words at the end of sentences should consider the beginning for context, and countless other patterns that help interpret language correctly.
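The sketch below shows the core attention computation (scaled dot-product attention) for a handful of invented token vectors. In a real Transformer, the queries, keys, and values come from learned linear projections and each layer has many attention heads; this single-head version just illustrates how each token ends up with a weighted mix of every other token’s information.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Bare-bones sketch of the attention computation (single head, no masking)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # how much each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V, weights                                # weighted mix of value vectors

# Hypothetical 4-token sequence with 8-dimensional vectors, values invented for illustration.
np.random.seed(0)
tokens = np.random.randn(4, 8)
output, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(attn.round(2))   # each row sums to 1: the attention a token pays to every token
```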
Transformer models are organized in layers, typically dozens of them stacked on top of each other. Each layer contains attention mechanisms along with other components, and information flows through these layers sequentially. The interesting aspect is that different layers learn to extract different kinds of patterns.
Early layers tend to recognize basic syntactic structures and simple word relationships.
Middle layers identify semantic patterns and understand how concepts relate to each other.
Later layers capture more abstract patterns, including complex reasoning and nuanced language understanding.
The information flowing through these layers takes the form of vectors, which are essentially lists of numbers that encode the meaning and context of each token position.
At each layer, these vectors get transformed based on the model’s parameters. Think of it as the model continuously refining its understanding of the text. The raw tokens enter at the bottom, and by the time information reaches the top layers, the model has developed a rich, multi-faceted representation that captures syntax, semantics, context, and relationships within the text.
This architecture provides several crucial advantages:
The ability to process sequences in parallel rather than sequentially means training can happen much faster, especially when distributed across thousands of processors.
The attention mechanism’s capacity to relate any part of the text to any other part, regardless of distance, enables the model to maintain context across long conversations or documents. Modern LLMs can handle contexts spanning thousands or even tens of thousands of tokens precisely because the Transformer architecture can efficiently connect information across these long spans.
The layered structure allows the model to build up an increasingly sophisticated understanding, starting from basic patterns and culminating in the complex language capabilities that make these systems so useful.
After pretraining, an LLM is excellent at predicting what comes next in text sequences, but this doesn’t make it a helpful conversational assistant.
If given a prompt that starts with a question, the pretrained model might continue with more questions rather than providing an answer. It simply completes text in statistically likely ways based on the patterns it has learned. Transforming this autocomplete system into the helpful assistants we interact with requires additional training phases.
Supervised fine-tuning addresses this gap by training the model on carefully curated examples of good behavior.
Instead of learning from general text, the model now trains on prompt-response pairs that demonstrate how to follow instructions, answer questions directly, and maintain a helpful persona. These examples might include questions paired with clear answers, instructions paired with appropriate completions, and conversations demonstrating polite and informative dialogue.
This dataset is much smaller than pretraining data, perhaps tens of thousands to hundreds of thousands of examples rather than billions, but each example is precisely constructed to teach desired behaviors to the LLM.
The training process remains the same: predict the next token, calculate loss, adjust parameters. However, now the model learns to predict tokens in these ideal responses rather than arbitrary internet text.
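As a rough illustration, fine-tuning data might look something like the hypothetical pairs below, flattened into a single token sequence per example. The template and examples are invented; production datasets are carefully curated, and the loss is commonly computed only on the response tokens so the model learns to answer rather than restate prompts.

```python
# A hypothetical sketch of what supervised fine-tuning data looks like.
sft_examples = [
    {"prompt": "Explain what a mutex is in one sentence.",
     "response": "A mutex is a lock that ensures only one thread accesses a shared resource at a time."},
    {"prompt": "Write a Python one-liner that reverses a string s.",
     "response": "reversed_s = s[::-1]"},
]

for ex in sft_examples:
    # The pair is flattened into one training sequence using a chat-style template.
    training_text = f"User: {ex['prompt']}\nAssistant: {ex['response']}"
    print(training_text, end="\n\n")
```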
Supervised fine-tuning provides significant improvement, but it has limitations. Writing explicit examples for every possible scenario the model might encounter is impractical. This is where reinforcement learning from human feedback (RLHF) provides further refinement. The process begins with the model generating multiple responses to various prompts. Human raters then rank these responses based on quality, helpfulness, and safety. These rankings train a separate reward model that learns to predict scores human raters would assign to any response.
See the diagram below:
Once the reward model exists, it guides further training of the language model. The language model generates responses, the reward model scores them, and the language model updates to produce higher-scoring responses.
There’s a careful balance here: the model should improve according to human preferences while not deviating so far from its pretrained version that it loses core knowledge and capabilities. This entire process can iterate multiple times, with improved models generating new responses for human evaluation.
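One way to picture the reward model’s training objective is the pairwise preference loss sketched below: the model is rewarded for scoring the human-preferred response above the rejected one. The scores are invented for illustration; in practice they come from a neural network that reads and scores full responses.

```python
import numpy as np

# Sketch of a pairwise preference loss (Bradley-Terry style) often used for reward models.
def preference_loss(reward_chosen, reward_rejected):
    """Loss is small when the preferred response scores higher than the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

print(preference_loss(reward_chosen=2.0, reward_rejected=-1.0))  # ~0.049: ranking respected
print(preference_loss(reward_chosen=-1.0, reward_rejected=2.0))  # ~3.049: ranking violated
```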
Once training completes, the model undergoes a comprehensive evaluation before deployment.
Developers test it on various benchmarks that measure different capabilities such as language understanding, reasoning, mathematical ability, coding skills, and factual knowledge. Safety testing runs in parallel, examining the model’s tendency to generate harmful content, its susceptibility to adversarial prompts, and potential biases in its outputs.
The model also undergoes optimization for deployment. Training prioritizes learning capability over efficiency, but deployed models must respond quickly to user requests while managing computational costs. Techniques like quantization reduce the precision of parameters, using fewer bits to represent each number. This decreases memory requirements and speeds up computation while typically preserving most of the model’s capability. Other optimizations might involve distilling knowledge into smaller, faster models or implementing efficient serving infrastructure.
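As a rough illustration of the idea, the sketch below quantizes a few invented float32 weights to int8 with a single shared scale. Production systems use more sophisticated schemes (per-channel scales, calibration data, and so on), but the memory saving comes from the same basic trade: one byte per value instead of four, at the cost of a small approximation error.

```python
import numpy as np

# Minimal sketch of symmetric int8 quantization; weights are invented for illustration.
weights = np.random.randn(6).astype(np.float32)

scale = np.abs(weights).max() / 127.0                     # map the largest magnitude to the int8 range
quantized = np.round(weights / scale).astype(np.int8)     # store 1 byte per value instead of 4
dequantized = quantized.astype(np.float32) * scale        # approximate recovery at inference time

print("original:   ", weights.round(4))
print("dequantized:", dequantized.round(4))
print("max error:  ", np.abs(weights - dequantized).max())
```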
Deployment isn’t an endpoint but rather the beginning of a continuous cycle. Organizations monitor how users interact with deployed models, collect feedback on response quality, and identify edge cases where the model fails or behaves unexpectedly. This information feeds directly into the next training iteration.
When someone uses an LLM today, they’re interacting with the culmination of this entire process, from data collection through optimization.
The journey from raw internet data to conversational AI represents a remarkable achievement at the intersection of data engineering, mathematical optimization, massive-scale computation, and careful alignment with human values.
What begins as terabytes of text transforms through preprocessing, tokenization, and billions of parameter adjustments into systems capable of generating coherent text, answering questions, writing code, and engaging in sophisticated dialogue.
Understanding this training process reveals both the impressive capabilities and fundamental limitations of LLMs. For software engineers working with these systems, understanding the training process provides crucial context for making informed decisions about when and how to deploy them.
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-12-01 00:45:29
Yearly Black Friday sale ends today! Use code BF2025 at checkout to get 30% off the all-in-one interview prep online courses.
To take advantage of this limited time offer, subscribe before 11:59 pm PST on Monday, December 1.
2025-11-30 00:30:39
Reliability shouldn’t cost extra—and Verizon proves it. Their customer-first design, featuring myPlan, myHome, and an industry-first 3-year Value Guarantee, delivers premium network quality without premium pricing.
Unwrap unbeatable deals:
Get the iPhone 17 Pro Max on Verizon with a new line on any myPlan. Also, get an Apple Watch Series 11 and iPad (A16), all on us with a connected device plan ($1,830 in value).
Galaxy S25 Ultra, Galaxy Watch8, and Galaxy Tab S10 FE 5G, all on us with any myPlan ($1,800 value).
Switch to select Verizon Home Internet plans and choose a Samsung 43” TV, Samsung Galaxy Tab S10 FE 5G, Marshall Kilburn III, Stream TV Soundbar, Samsung 32” Smart Monitor or $200 Target GiftCard, on Verizon.
Everyone gets a better deal—flexibility, savings, and support with no extra cost.
This week’s system design refresher:
⏳ LIMITED TIME OFFER: All in One Interview Prep Black Friday Sale
Virtualization vs. Containerization
5 REST API Authentication Methods
How do AirTags work?
What is a Firewall?
Modem vs. Router
SPONSOR US
Yearly Black Friday sale is now live! Use code BF2025 at checkout to get 30% off the all-in-one interview prep online courses.
To take advantage of this limited time offer, subscribe before 11:59 pm PST on Monday, December 1.
Before containers simplified deployment, virtualization changed how we used hardware. Both isolate workloads, but they do it differently.
Virtualization (Hardware-level isolation): Each virtual machine runs a complete operating system (Windows, Fedora, or Ubuntu, for example) with its own kernel, drivers, and libraries. The hypervisor (VMware ESXi, Hyper-V, KVM) sits directly on hardware and emulates physical machines for each guest OS.
This makes VMs heavy but isolated. Need Windows and Linux on the same box? VMs handle it easily. Startup time for a typical VM is in minutes because you’re booting an entire operating system from scratch.
Containerization (OS-level isolation): Containers share the host operating system’s kernel. No separate OS per container. Just isolated processes with their own filesystem and dependencies.
The container engine (Docker, containerd, CRI-O, Podman) manages lifecycle, networking, and isolation, but it all runs on top of a single shared kernel. Lightweight and fast. Containers start in milliseconds because you’re not booting an OS, just launching a process.
But here’s the catch: all containers on a host must be compatible with that host’s kernel. Can’t run Windows containers on a Linux host (without nested virtualization tricks).
Over to you: What’s your go-to setup: containers in VMs, bare metal containers, or something else?
Basic Authentication: Clients include a Base64-encoded username and password in every request header, which is simple but insecure since Base64 is trivially decoded, so credentials are effectively exposed unless the connection uses TLS. Useful in quick prototypes or internal services over secure networks (see the request-level sketch after this list).
Session Authentication: After login, the server creates a session record and issues a cookie. Subsequent requests send that cookie so the server can validate user state. Used in traditional web-apps.
Token Authentication: Clients authenticate once to receive a signed token, then present that token on each request so the server doesn’t need to store per-user session state. Used in single-page applications and modern APIs that require scalable, stateless authentication.
OAuth-Based Authentication: Clients obtain an access token via an authorization grant from an OAuth provider, then use that token to call resource servers on the user’s behalf. Used in cases of third-party integrations or apps that need delegated access to user data.
API Key Authentication: Clients present a predefined key (often in headers or query strings) with each request. The server verifies the key to authorize access. Used in service-to-service or machine-to-machine APIs where simple credential checks are sufficient.
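For a feel of how several of these look at the request level, here is a hypothetical sketch using Python’s requests library against a placeholder endpoint. The URL, credentials, and the API key header name are invented for illustration.

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical endpoint for illustration

# Basic Authentication: username/password sent (Base64-encoded) with every request.
requests.get(f"{BASE_URL}/reports", auth=("alice", "s3cret"))

# Token Authentication: a signed token presented in the Authorization header.
requests.get(f"{BASE_URL}/reports",
             headers={"Authorization": "Bearer eyJhbGciOi...truncated"})

# API Key Authentication: a predefined key sent in a custom header (or query string).
requests.get(f"{BASE_URL}/reports", headers={"X-API-Key": "my-service-key"})
```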
Over to you: Which other API Authentication method have you seen?
AirTags work by leveraging a combination of Bluetooth technology and the vast network of Apple devices to help you locate your lost items.
Here’s a breakdown of how they function:
Bluetooth Signal: Each AirTag emits a secure Bluetooth signal that can be detected by nearby Apple devices (iPhones, iPads, etc.) within the Find My network.
Find My Network: When an AirTag comes within range of an Apple device in the Find My network, that device anonymously and securely relays the AirTag’s location information to iCloud.
Location Tracking: You can then use the Find My app on your own Apple device to see the approximate location of your AirTag on a map.
Limitations:
Please note that AirTags rely on Bluetooth technology and the presence of Apple devices within the Find My network. If your AirTag is in an area with few Apple devices, its location may not be updated as frequently or accurately.
Every time you connect to the Internet, a firewall quietly decides what can come in and what must stay out. A firewall is your network’s first line of defense. It filters traffic based on rules you define, by IP address, protocol, port, program, or even keywords. Every packet that tries to enter or leave your network passes through this checkpoint.
There are two main types:
Network Firewall: Sits at the network edge between your infrastructure and the internet. Can be physical hardware, virtualized software, or cloud-deployed service. Operates at Layer 3-4 of the OSI model. Filters traffic based on IP addresses, protocols, and ports before it ever reaches your internal network.
Protects the entire network at once. This is your first line of defense. Internet traffic hits the network firewall before it reaches your router, before it touches any internal systems.
Host-based Firewall: This runs as software on individual devices, like your laptop or a server. It works at Layer 3–7, inspecting packets more deeply and protecting only that specific device.
Your desktop has its own host firewall. Your server has its own. Each one is configured independently. It’s your last layer of defense in case something slips past the network firewall.
Together, they form a layered shield, keeping unwanted traffic out while letting legitimate communication flow freely.
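To make the rule-matching idea concrete, here is a toy packet filter in Python. The rules, addresses, and first-match-wins behavior are invented for illustration; real firewalls such as iptables, nftables, or cloud security groups are far more capable.

```python
import ipaddress

# Toy sketch of rule-based packet filtering: first matching rule wins.
RULES = [
    {"action": "allow", "protocol": "tcp", "port": 443},                          # HTTPS in
    {"action": "allow", "protocol": "tcp", "port": 22, "source": "10.0.0.0/8"},   # SSH from internal only
    {"action": "deny",  "protocol": "any", "port": None},                         # default deny
]

def evaluate(packet):
    """Return the action of the first rule that matches the packet."""
    for rule in RULES:
        if rule["protocol"] not in ("any", packet["protocol"]):
            continue
        if rule["port"] not in (None, packet["port"]):
            continue
        if "source" in rule and ipaddress.ip_address(packet["src"]) not in ipaddress.ip_network(rule["source"]):
            continue
        return rule["action"]
    return "deny"

print(evaluate({"protocol": "tcp", "port": 443, "src": "203.0.113.7"}))  # allow
print(evaluate({"protocol": "tcp", "port": 22,  "src": "203.0.113.7"}))  # deny (not internal)
```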
Over to you: Have you ever had to troubleshoot a misconfigured firewall rule that accidentally blocked something critical? What was it, and how long did it take to find?
Most people think their WiFi router gives them internet. It doesn’t. Your router is just managing traffic inside your home. The actual internet connection comes from the modem.
Here’s what each one actually does:
Modem: The modem connects you to your Internet Service Provider (ISP). It translates signals between your ISP’s network and your home network. Depending on the service type, the digital link may use coaxial cable, fiber optic, or cellular connections. The modem converts those signals into data your devices can understand.
It provides one public IP address, meaning one connection to the internet. If you plug a single device directly into a modem via Ethernet, that device gets internet access with a public IP.
Router: The router creates a private network inside your home. It takes that single public IP from the modem and shares it across multiple devices using Network Address Translation (NAT). Every device on your network gets a private IP address, usually something like 192.168.1.x. The router keeps track of which device requested what data and routes responses back to the right device.
DHCP assigns those private IPs automatically. Your phone connects to WiFi, the router gives it an IP address, and suddenly it can talk to the internet through the router.
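Here is a toy sketch of the bookkeeping a router does for NAT: outgoing connections are rewritten to the public IP on a fresh port, and replies arriving on that port are mapped back to the originating device. Addresses and ports are invented for illustration.

```python
# Toy sketch of NAT: many private devices share one public IP.
PUBLIC_IP = "203.0.113.10"
nat_table = {}          # public_port -> (private_ip, private_port)
next_public_port = 40000

def outbound(private_ip, private_port):
    """Rewrite an outgoing connection to use the router's public IP and a fresh port."""
    global next_public_port
    public_port = next_public_port
    next_public_port += 1
    nat_table[public_port] = (private_ip, private_port)
    return PUBLIC_IP, public_port

def inbound(public_port):
    """Route a response arriving on a public port back to the right device."""
    return nat_table[public_port]

ip, port = outbound("192.168.1.23", 51512)   # a phone opens a connection
print(ip, port)                              # traffic leaves as 203.0.113.10:40000
print(inbound(port))                         # the reply maps back to ('192.168.1.23', 51512)
```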
Modern devices often combine both functions, a modem-router combo, but understanding the distinction helps when you’re troubleshooting slow speeds or network drops.
Over to you: What’s your go-to trick to quickly diagnose whether the modem or router is to blame for slow Internet?
2025-11-29 00:32:57
Yearly Black Friday sale is now live! Use code BF2025 at checkout to get 30% off the all-in-one interview prep online courses.
To take advantage of this limited time offer, subscribe before 11:59 pm PST on Monday, December 1.
2025-11-28 00:30:51
Every modern application needs to handle transactions. These are operations that must either succeed completely or fail without leaving any partial changes behind.
In a monolithic system, this process is usually straightforward. The application talks to a single database, and all the data needed for a business operation lives in one place. Developers can use built-in database transactions or frameworks that automatically manage them to ensure that the system remains consistent even when something goes wrong.
For example, when you make an online payment, the application might deduct money from your account and record the transaction in a ledger. Both actions happen within a single database transaction. If one action fails, the database automatically rolls everything back so that no partial updates are left behind. This behavior is part of the ACID properties (Atomicity, Consistency, Isolation, and Durability) that guarantee reliable and predictable outcomes.
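A minimal sketch of that all-or-nothing behavior, using Python’s built-in sqlite3 (table names and amounts are invented): a simulated failure mid-transaction rolls back both writes, leaving no partial update.

```python
import sqlite3

# Single-database atomicity in miniature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE ledger (entry TEXT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE id = 'alice'")
        conn.execute("INSERT INTO ledger VALUES ('alice paid 40')")
        raise RuntimeError("payment provider timed out")  # simulate a failure mid-transaction
except RuntimeError:
    pass

# Both statements were rolled back together: no partial update remains.
print(conn.execute("SELECT balance FROM accounts WHERE id = 'alice'").fetchone())  # (100,)
print(conn.execute("SELECT COUNT(*) FROM ledger").fetchone())                      # (0,)
```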
However, as systems evolve and grow larger, many organizations adopt a services-based or microservices architecture. In such architectures, a business process often involves multiple services, each managing its own database. For instance, an e-commerce system might have separate services for orders, payments, shipping, and inventory. Each of these services owns its own data store and operates independently.
Now imagine a business transaction that spans all these services. Placing an order might require updating the order database, reserving stock in the inventory database, and recording payment details in another database. If one of these steps fails, the system must find a way to keep all services consistent. This is where the problem begins.
See the diagram below:
This challenge is known as the problem of distributed transactions. Traditional techniques like two-phase commit (2PC) attempt to coordinate commits across multiple databases, but they can reduce performance, limit availability, and add significant complexity. As applications become more distributed and use different types of databases or message brokers, these traditional methods become less practical.
To overcome these limitations, modern architectures rely on alternative patterns that provide consistency without strict coupling or blocking behavior. One of the most effective of these is the Saga pattern.
In this article, we will look at how the Saga pattern works and the pros and cons of various approaches to implement this pattern.
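To preview the core idea, here is a minimal, orchestration-style sketch: each step has a compensating action, and a failure triggers the compensations in reverse order. The service calls are hypothetical stand-ins for real network calls to the order, inventory, and payment services.

```python
# Minimal sketch of an orchestration-style saga with compensating actions.
def create_order():        print("order created")
def cancel_order():        print("order cancelled (compensation)")
def reserve_stock():       print("stock reserved")
def release_stock():       print("stock released (compensation)")
def charge_payment():      raise RuntimeError("payment service unavailable")
def refund_payment():      print("payment refunded (compensation)")

saga_steps = [
    (create_order,   cancel_order),
    (reserve_stock,  release_stock),
    (charge_payment, refund_payment),
]

def run_saga(steps):
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception as err:
        print(f"step failed: {err}; compensating in reverse order")
        for compensation in reversed(completed):
            compensation()

run_saga(saga_steps)
# Output: order created, stock reserved, step failed..., stock released, order cancelled
```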
2025-11-25 00:30:48
AI is only as powerful as the data behind it — but most teams aren’t ready.
We surveyed 200 senior IT and data leaders to uncover how enterprises are really using streaming to power AI, and where the biggest gaps still exist.
Discover the biggest challenges in real-time data infrastructure, the top obstacles slowing down AI adoption, and what high-performing teams are doing differently in 2025.
Download the full report to see where your organisation stands.
Disclaimer: The details in this post have been derived from the details shared online by the Zalando Engineering Team. All credit for the technical details goes to the Zalando Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Zalando is one of Europe’s largest fashion and lifestyle platforms, connecting thousands of brands, retailers, and physical stores under one digital ecosystem.
As the company’s scale grew, so did the volume of commercial data it generated. This included information about product performance, sales patterns, pricing insights, and much more. This data was not just important for Zalando itself but also for its vast network of retail partners who relied on it to make critical business decisions.
However, sharing this data efficiently with external partners became increasingly complex.
Zalando’s Partner Tech division, responsible for data sharing and collaboration with partners, found itself managing a fragmented and inefficient process. Partners needed clear visibility into how their products were performing on the platform, but accessing that information was far from seamless. Data was scattered across multiple systems and shared through a patchwork of methods. Some partners received CSV files over SFTP, others pulled data via APIs, and many depended on self-service dashboards to manually export reports. Each method served a purpose, but together they created a tangled system where consistency and reliability were hard to maintain. Many partners had to dedicate the equivalent of 1.5 full-time employees each month just to extract, clean, and consolidate the data they received. Instead of focusing on strategic analysis or market planning, skilled analysts spent valuable time performing repetitive manual work.
There was also a serious accessibility issue. The existing interfaces were not designed for heavy or large-scale data downloads. Historical data was often unavailable when partners needed it most, such as during key planning or forecasting cycles. As a result, even well-resourced partners struggled to build an accurate picture of their own performance.
This problem highlighted a critical gap in Zalando’s data strategy. Partners did not just want raw data or operational feeds. They wanted analytical-ready datasets that could be accessed programmatically and integrated directly into their internal analytics tools. In simple terms, they needed clean, governed, and easily retrievable data that fit naturally into their business workflows.
To address this challenge, the Zalando Engineering Team began a multi-year journey to rebuild its partner data sharing framework from the ground up. The result of this effort was Zalando’s adoption of Delta Sharing, an open protocol for secure data sharing across organizations. In this article, we will look at how Zalando built such a system and the challenges they faced.
To solve the problem of fragmented data sharing, the Zalando Engineering Team first needed to understand who their partners were and how they worked with data.
Zalando operates through three major business models:
Wholesale: Zalando purchases products from brands and resells them directly on its platform.
Partner Program: Brands list and sell products directly to consumers through Zalando’s marketplace.
Connected Retail: Physical retail stores connect their local inventory to an online platform, allowing customers to buy nearby and pick up in person.
Each of these models generates unique datasets, and the scale of those datasets varies dramatically. A small retailer may only deal with a few hundred products and generate a few megabytes of data each week. In contrast, a global brand might handle tens of thousands of products and need access to hundreds of terabytes of historical sales data for planning and forecasting.
In total, Zalando manages more than 200 datasets that support a business generating over €5 billion in gross merchandise value (GMV). These datasets are critical to helping partners analyze trends, adjust pricing strategies, manage inventory, and plan promotions. However, not all partners have the same level of technical sophistication or infrastructure to consume this data effectively.
Zalando’s partners generally fall into three categories based on their data maturity. See the table below:
Large enterprise partners often have their own analytics teams, data engineers, and infrastructure. They expect secure, automated access to data that integrates directly into their internal systems. Medium-sized partners prefer flexible solutions that combine manual and automated options, such as regularly updated reports and dashboards. Smaller partners value simplicity above all else, often relying on spreadsheet-based workflows and direct downloads.
Zalando’s existing mix of data-sharing methods (such as APIs, S3 buckets, email transfers, and SFTP connections) worked in isolation but could not scale to meet all these varied needs consistently.
After understanding the different needs of its partner ecosystem, the Zalando Engineering Team began to look for a better, long-term solution. The goal was not only to make data sharing faster but also to make it more reliable, scalable, and secure for every partner, from small retailers to global brands.
The team realized that fixing the problem required more than improving existing systems. They needed to design an entirely new framework that could handle massive datasets, provide real-time access, and adapt to each partner’s technical capability without creating new complexity. To do that, Zalando created a clear list of evaluation criteria that would guide their decision.

First, the solution had to be cloud-agnostic. Zalando’s partners used a variety of technology stacks and cloud providers. Some worked with AWS, others used Google Cloud, Azure, or even on-premise systems. The new system needed to work seamlessly across all these environments without forcing partners to change their existing infrastructure.
Second, the platform had to be open and extensible. This meant avoiding dependence on a single vendor or proprietary technology. Zalando wanted an open-standard approach that could evolve and integrate with different tools, systems, and workflows.
Third, the solution needed strong performance and scalability. With over 200 datasets and some reaching hundreds of terabytes in size, performance could not be an afterthought. The system had to handle large-scale data transfers and queries efficiently while maintaining low latency and high reliability.
Security was another non-negotiable factor. The platform had to support granular security and auditing features. This included data encryption, access control at the table or dataset level, and comprehensive logging for compliance and traceability. Since partners would be accessing sensitive commercial data, robust governance mechanisms were essential to maintain trust.
The next requirement was flexibility in data access patterns. Partners used data in different ways, so the system had to support:
Real-time streaming for partners who need up-to-the-minute insights
Batch and incremental updates for partners who preferred scheduled or partial data loads
Historical data access for partners who needed to analyze long-term trends
Finally, the solution had to be easy to integrate with the tools that partners were already using. Whether it was business intelligence dashboards, data warehouses, or analytics pipelines, the new system should fit naturally into existing workflows rather than force partners to rebuild them from scratch.
The search for such a system eventually led them to Delta Sharing, an open protocol specifically designed for secure data sharing across organizations. This discovery would go on to transform the way Zalando and its partners collaborate on data.
After months of evaluation and research, the Zalando Engineering Team found a technology that met nearly all of their requirements: Delta Sharing.
Delta Sharing is an open protocol designed specifically for secure, zero-copy data sharing across organizations. This means that partners can access live data directly from its original location without creating separate copies or transferring large files across systems.
The team immediately recognized how well this approach fit their goals. It offered the openness, scalability, and security they needed while being simple enough to integrate into partners’ existing tools and workflows. Key features of Delta Sharing are as follows:
Zero-copy access: Partners can query live datasets directly without needing to download or duplicate them. This eliminates data redundancy and ensures that everyone works with the most up-to-date information.
Open standard: Because Delta Sharing is based on open principles, it works seamlessly with a wide range of tools and platforms. Partners can connect through Pandas, Apache Spark, Tableau, or even Microsoft Excel, depending on their needs.
Granular access control: Data is shared securely using token-based authentication and credential files, which means each partner receives access tailored to their role and data permissions.
Scalable performance: The protocol efficiently handles very large datasets, even those that exceed terabytes in size, while maintaining high reliability and low latency.
Security by design: Features such as encryption, auditing, and logging are built into the system. This ensures that all data access is traceable and compliant with internal governance policies.
While Delta Sharing is available as an open-source protocol, Zalando decided to implement the Databricks Managed Delta Sharing service instead of hosting its own version. This choice was made for several practical reasons:
It integrates tightly with Unity Catalog, Databricks’ governance and metadata layer. This allowed Zalando to maintain a single source of truth for datasets and permissions.
It provides enterprise-grade security, compliance, and auditability, which are essential when dealing with sensitive commercial data from multiple organizations.
It removes the operational overhead of managing and maintaining sharing servers, tokens, and access logs internally.
By using the managed service, the Zalando Engineering Team could focus on delivering value to partners rather than spending time maintaining infrastructure.
Once the Zalando Engineering Team validated Delta Sharing as the right solution, the next challenge was designing a clean and efficient architecture that could be scaled across thousands of partners. Their approach was to keep the system simple, modular, and easy to manage while ensuring that security and governance remained central to every layer.
At its core, the new data-sharing framework relied on three main building blocks that defined how data would be organized, accessed, and distributed:
Delta Share: A logical container that groups related Delta Tables for distribution to external recipients.
Recipient: A digital identity representing each partner within the Delta Sharing system.
Activation Link: A secure URL that allows partners to download their authentication credentials and connect to shared datasets.
This architecture followed a clear, three-step data flow designed to keep operations transparent and efficient:
Data Preparation and Centralization: All partner datasets were first curated and stored in scalable storage systems as Delta Tables. These tables were then registered in Unity Catalog, which acted as the metadata and governance layer. Unity Catalog provided a single source of truth for data definitions, schema consistency, and lineage tracking, ensuring that every dataset was traceable and well-documented.
Access Configuration: Once datasets were ready, the engineering team created a Recipient entry for each partner and assigned appropriate permissions. Each recipient received an activation link, which allowed them to securely access their data credentials. This setup ensured that partners only saw the data they were authorized to access while maintaining strict access boundaries between different organizations.
Direct Partner Access: When a partner activated their link, they retrieved a credential file and authenticated through a secure HTTPS connection. They could then directly query live data without duplication or manual transfer. Since the data remained centralized in Zalando’s data lakehouse, there were no synchronization issues or redundant copies to maintain.
This architecture brought several benefits. Partners now had real-time access to data, partner-specific credentials ensured granular security, and no redundant storage simplified maintenance.
To implement this system in Databricks, Zalando followed a clear operational workflow:
Prepare the Delta Tables and register them in Unity Catalog.
Create a Share to group related datasets.
Add the relevant tables to that share.
Create a Recipient representing each partner.
Grant the appropriate permissions to the recipient.
See the diagram below:
Every step was guided by Databricks’ Delta Sharing API documentation, allowing the team to automate processes where possible and maintain strong governance controls.
Once the new data-sharing architecture was in place, the Zalando Engineering Team understood that technology alone would not guarantee success. For the system to work, partners needed to be able to use it confidently and easily. Usability became just as important as performance or scalability.
To make the onboarding process smooth, Zalando created a range of partner-facing resources. These included step-by-step user guides that explained how to connect to Delta Sharing using tools familiar to most data teams, such as Pandas, Apache Spark, and common business intelligence (BI) platforms. Each guide walked partners through the entire process—from receiving their activation link to successfully accessing and querying their first dataset.
The team also built detailed troubleshooting documentation. This helped partners solve common issues such as expired credentials, connection errors, or authentication problems without needing to contact support. By empowering partners to self-diagnose and fix minor issues, Zalando reduced delays and improved overall efficiency.
In addition, they developed prebuilt connector snippets—small code templates that partners could plug directly into their existing data pipelines. These snippets made it possible to integrate Zalando’s data into existing workflows within minutes, regardless of whether a partner used Python scripts, Spark jobs, or visualization tools.
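For a sense of what such a snippet might look like, here is a hypothetical example using the open-source delta-sharing Python client (pip install delta-sharing). The profile path and the share, schema, and table names are placeholders, not Zalando’s actual ones.

```python
import delta_sharing

profile = "/path/to/partner-credentials.share"   # credential file downloaded via the activation link

# Discover which tables this credential grants access to.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table straight into a pandas DataFrame for analysis.
table_url = f"{profile}#sales_share.reporting.weekly_product_performance"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```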
Together, these efforts dramatically reduced onboarding friction. Instead of days of setup and testing, partners could access and analyze data in a matter of minutes. This ease of use quickly became one of the platform’s strongest selling points.
The success of the Partner Tech pilot did not go unnoticed within Zalando. Other teams soon realized that they faced similar challenges when sharing data with internal or external stakeholders. Rather than allowing every department to build its own version of the solution, Zalando decided to expand the Delta Sharing setup into a company-wide platform for secure and scalable data distribution.
This new platform came with several key capabilities:
Unified recipient management: Centralized control of who receives what data, ensuring consistent governance.
Built-in best practices: Guidelines for preparing datasets before sharing, helping teams maintain high data quality.
Standardized security and governance policies: Every department followed the same data-sharing rules, simplifying compliance.
Cross-team documentation and automation: Shared tools and documentation made it easier for new teams to adopt the platform without starting from scratch.
Looking ahead, Zalando plans to introduce OIDC Federation, a feature that allows partners to authenticate using their own identity systems. This will remove the need for token-based authentication and make access even more secure and seamless.
Zalando’s journey to modernize partner data sharing was both a technical and organizational transformation. By focusing on real partner challenges, the Zalando Engineering Team built a system that balanced openness, governance, and usability—creating long-term value for both the company and its ecosystem.
The key lessons were as follows:
Start with partner needs, not technology. Deep research into partner workflows helped Zalando design a solution that solved real pain points rather than adding complexity.
Design for diversity. A single rigid model could not serve everyone, so the platform was built to support different partner sizes, tools, and technical skills.
Cross-team collaboration is essential. Close cooperation between the Data Foundation, AppSec, and IAM teams ensured consistency, security, and compliance from day one.
Manual processes are acceptable for pilots but not for scale. Early manual steps were valuable for testing ideas, but later became automation goals as the platform grew.
Internal adoption validates external value. When other Zalando teams began using Delta Sharing, it confirmed the platform’s effectiveness beyond its original use case.
Security must be embedded from the start. Integrating encryption, access control, and auditing early prevented rework and established long-term trust.
Documentation is a product feature. Clear guides, troubleshooting steps, and code examples made onboarding fast and self-service for partners.
Managed is better than self-managed. Relying on Databricks’ managed Delta Sharing service gave Zalando operational stability and freed engineers to focus on partner success.
Delta Sharing has fundamentally changed how Zalando exchanges data with its partners. The company moved from fragmented exports to a unified, real-time, and governed data-sharing model. This shift has produced the following impact:
Reduced manual data handling and partner friction.
Enabled faster, data-driven decision-making through consistent access.
Created a scalable foundation for cross-partner analytics and collaboration.
Established a reusable enterprise framework for secure data exchange.
References: