2025-07-31 23:30:39
Distributed database systems rely on coordination to work properly. When multiple nodes replicate data and process requests across regions or zones, a particular node has to take charge of write operations. This node is typically called the leader: a single node responsible for ordering updates, committing changes, and ensuring the system remains consistent even under failure.
Leader election exists to answer a simple but critical question: Which node is currently in charge?
The answer can’t rely on assumptions, static configs, or manual intervention. It has to hold up under real-world pressure with crashed processes, network delays, partitions, restarts, and unpredictable message loss.
When the leader fails, the system must detect it, agree on a replacement, and continue operating without corrupting data or processing the same request twice. This is a fault-tolerance and consensus problem, and it sits at the heart of distributed database design.
Leader-based architectures simplify the hard parts of distributed state management in the following ways:
They streamline write serialization across replicas.
They coordinate quorum writes so that a majority of nodes agree on each change.
They prevent conflicting operations from being applied in inconsistent orders across replicas.
They reduce the complexity of recovery when something inevitably goes wrong.
However, this simplicity on the surface relies on a robust election mechanism underneath. A database needs to be sure about who the leader is at any given time, that the leader is sufficiently up to date, and that a new leader can be chosen quickly and safely when necessary.
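To make that requirement concrete, here is a minimal sketch of lease-based election built on an atomic compare-and-set store. The store, lease duration, and node IDs are illustrative assumptions; production systems layer fencing tokens and consensus protocols on top of this basic idea.

```python
import time
import uuid

LEASE_SECONDS = 5  # hypothetical lease duration

class LeaseStore:
    """Stand-in for an atomic compare-and-set store (the role etcd or ZooKeeper usually play)."""
    def __init__(self):
        self._leader = None  # (node_id, expires_at)

    def try_acquire(self, node_id, now):
        # Acquire the lease if it is free or expired; renew it if this node already holds it.
        if self._leader is None or self._leader[1] <= now or self._leader[0] == node_id:
            self._leader = (node_id, now + LEASE_SECONDS)
            return True
        return False

class Node:
    def __init__(self, store):
        self.id = str(uuid.uuid4())[:8]
        self.store = store

    def is_leader(self):
        # A node only treats itself as leader while its lease is current.
        return self.store.try_acquire(self.id, time.monotonic())

store = LeaseStore()
nodes = [Node(store) for _ in range(3)]
leaders = [n.id for n in nodes if n.is_leader()]
print("current leader:", leaders)  # exactly one node wins the lease
```

The key property is that at most one node holds an unexpired lease at any moment, which is exactly the guarantee the election approaches below aim to provide under far messier failure conditions.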
In this article, we will look at five major approaches to leader election, each with its assumptions, strengths, and trade-offs.
2025-07-29 23:30:27
Bugs sneak out when less than 80% of user flows are tested before shipping. However, getting that kind of coverage (and staying there) is hard and pricey for any team.
QA Wolf’s AI-native service provides high-volume, high-speed test coverage for web and mobile apps, reducing your organization’s QA cycle to less than 15 minutes.
They can get you:
24-hour maintenance and on-demand test creation
Zero flakes, guaranteed
Engineering teams move faster, releases stay on track, and testing happens automatically—so developers can focus on building, not debugging.
The result? Drata achieved 4x more test cases and 86% faster QA cycles.
⭐ Rated 4.8/5 on G2
Disclaimer: The details in this post have been derived from the official documentation shared online by the Cursor (Anysphere) Engineering Team. All credit for the technical details goes to the Cursor (Anysphere) Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Cursor is an AI-powered code editor (IDE) that has quickly become a standout tool for developers since its initial release in March 2023 by the startup Anysphere.
It has experienced remarkable growth and adoption, reaching a point where it is being used in a large number of Fortune 500 companies. This rapid rise in popularity is also evident in surveys, which identify Cursor as an extremely popular AI coding IDE among engineers.
The core reason for Cursor's success lies in its AI-first approach, which tightly integrates cutting-edge AI models into a familiar coding environment. Built as a fork of Visual Studio Code (VS Code), Cursor provides developers with a stable interface, familiar keybindings, and compatibility with the existing VS Code extension ecosystem. This minimizes friction for users while allowing Cursor’s engineering team to focus intensely on developing powerful AI capabilities rather than building an IDE from scratch.
Cursor's intelligence comes from its use of state-of-the-art large language models, including OpenAI's GPT-4 variants and Anthropic's Claude, and even its own fine-tuned models. Its backend is designed for immense scale, handling over 1 million transactions per second at peak and serving billions of AI code completions daily to ensure a responsive and seamless experience. Cursor also functions as an effective AI pair programmer that can understand entire codebases, recall project-wide details, suggest complex edits across multiple files, and even execute tasks on demand.
In this article, we will take a look at the key features of Cursor, how those features work, and the infrastructure stack that powers it.
The key features of Cursor, along with the technical details behind them, are as follows:
One of Cursor’s most important features is its AI-driven code completion, which significantly accelerates coding by suggesting code as the user types. Developers can accept these predictions, often displayed as light grey text, by pressing the Tab key. This capability extends beyond single lines, offering smarter suggestions for refactors and multi-file edits.
The responsiveness of Cursor’s autocomplete is a major engineering feat. Here’s how it works:
When a developer types, the Cursor client (the editor on the developer machine) collects a small snippet of the current code context, then encrypts this snippet locally before sending it over the network to Cursor’s cloud servers.
On the server, the snippet is securely decrypted, and Cursor’s in-house code Large Language Model (LLM) quickly generates a completion suggestion.
The predicted code is then returned to the client and displayed inline.
This entire process is engineered for ultra-low latency, ideally under a second, to feel instantaneous to the user. Crucially, Cursor does not persistently store the code from these autocomplete requests. The encrypted code is used on-the-fly for inference and then discarded, prioritizing user privacy.
The backend data layer sees over 1 million queries per second (QPS), primarily due to these tiny autocomplete requests.
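Cursor has not published its wire protocol, but the described flow (encrypt locally, infer server-side, discard) can be sketched roughly as follows. The shared Fernet key, endpoint names, and placeholder model below are illustrative assumptions, not Cursor’s actual implementation.

```python
from cryptography.fernet import Fernet

# Hypothetical shared key; in practice key handling would be part of the client/server handshake.
key = Fernet.generate_key()

def client_prepare_request(code_context: str) -> bytes:
    """Client side: encrypt the local code snippet before it leaves the machine."""
    return Fernet(key).encrypt(code_context.encode())

def fake_llm_complete(snippet: str) -> str:
    # Placeholder for inference by the in-house code LLM.
    return "    return a + b"

def server_complete(payload: bytes) -> str:
    """Server side: decrypt, run the completion model, and discard the plaintext."""
    snippet = Fernet(key).decrypt(payload).decode()
    completion = fake_llm_complete(snippet)
    del snippet  # nothing is persisted after inference
    return completion

request = client_prepare_request("def add(a, b):")
print(server_complete(request))
```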
Beyond inline suggestions, Cursor provides a powerful AI chat assistant.
It operates as an "agentic" AI capable of handling larger, more complex tasks across an entire project. Users can interact with it through a dedicated chat panel within the IDE, providing instructions in natural language.
This feature leverages Cursor’s codebase indexing system to understand the entire project.
When asked to implement a feature, fix a bug, or refactor code, the chat agent can generate or modify code across multiple files, making coordinated edits based on higher-level instructions. This ability to operate on multiple files sets it apart from many other AI coding tools.
The chat agent can also access relevant context through special commands, such as @Web, which searches the web to gather up-to-date information, feeding the results into the conversation.
For quick, targeted changes, Cursor offers an Inline Edit mode.
Developers can simply select a block of code within the editor, issue an instruction, and the Cursor AI will directly apply the requested changes within the selected area.
See the screenshot below for reference:
Cursor 1.0 introduced BugBot, an AI-powered code review assistant specifically designed for GitHub pull requests (PRs). Setting up BugBot involves connecting Cursor to the GitHub repository via a GitHub App installation.
BugBot automatically analyzes code changes using the same powerful AI models that drive Cursor’s chat agent. It works by examining PRs to catch potential bugs, errors, or stylistic issues that human reviewers might overlook. It then leaves comments directly on the PR with detailed explanations and suggested fixes.
BugBot can run in automatic mode (re-running on every PR update) or be triggered manually by commenting "bugbot run" on a PR. Each comment includes a convenient "Fix in Cursor" link, allowing developers to jump directly into the Cursor editor with the relevant context loaded and apply the suggested fix immediately, tightening the iteration loop.
A standout feature for handling complex or long-running coding tasks is Cursor’s Background Agents.
These are essentially "AI pair programmers in the cloud" that can work concurrently with a developer’s local editing session. They allow developers to offload tasks that might require executing code, running tests, or making broad changes, without tying up the local machine. Background agents typically utilize Cursor's more advanced "Max" models due to their extensive context needs.
When we launch a Background Agent, the code is executed on a remote machine in Cursor’s cloud infrastructure. Specifically, Background Agents run on isolated Ubuntu-based virtual machines (VMs) in Cursor’s AWS infrastructure. This ensures the agent’s operations (like running tests or making code changes) are sandboxed away from the user’s local environment.
To overcome the limitation of AI models losing context between sessions, Cursor implements two features for persistent project knowledge: Rules and Memories. These enable the AI to maintain a long-term understanding of the project and adhere to specific guidelines.
Rules are explicit, system-level instructions that developers can create in special markdown files (often stored in a .cursor/rules directory in the repository) or in global settings. They can dictate coding style, architectural conventions, or any custom instruction that the AI should consistently follow when generating or editing code. When active, these rule contents are injected into the AI's context for every operation, ensuring consistent behavior. Rules can be project-specific (version-controlled) or user-specific (global). They allow teams to encode best practices that the AI "knows" without repeated prompting.
Memories have been launched in beta with Cursor 1.0. They allow the AI to automatically remember key details and decisions from past conversations in the chat, carrying that knowledge across sessions.
A "sidecar model" observes the chat and suggests potential memories to save, which developers can approve or reject. This means if a developer explains a tricky function or a design choice in one session, the AI can recall that context later, avoiding redundant explanations and acting as if it knows the project's nuances over time. Memories essentially become auto-generated rules managed in the settings.
To effectively assist with large projects, Cursor performs codebase indexing in the background, enabling its AI to "understand" and answer questions about the entire codebase.
Here’s how it works:
Initial Indexing: When we open a project, Cursor analyzes the files, splitting them into smaller chunks (like functions). These chunks are then encrypted locally and sent to Cursor’s server with obfuscated file identifiers (even file names are not sent in plaintext). The server decrypts each chunk, computes a numerical "embedding" (a vector representation capturing the code's meaning) using an AI model (such as OpenAI's embedding model), and immediately discards the actual file content and names. Only these embedding vectors are stored in a specialized vector database (Turbopuffer). This means that the server has no human-readable code stored persistently.
Semantic Search: When the developer asks a question in Chat (or uses Cmd+K), Cursor turns the query into a vector and performs a vector similarity search against the stored embeddings. This finds relevant code sections based on meaning, not just exact keywords, without the server initially seeing the actual code.
Fetching Relevant Code: If actual code content is needed to answer a question, the server requests those specific chunks (identified by obfuscated IDs) from the Cursor client. The client then sends the corresponding source code, presumably still encrypted, which the server decrypts on the fly for the immediate query, then discards. This design prioritizes privacy, as embeddings are generally one-way, meaning original code cannot be reconstructed from them. Cursor also respects .gitignore and a dedicated .cursorignore file to prevent indexing sensitive files, and heuristically scans for secrets before sending chunks.
Index Synchronization: To keep the index up-to-date as developers edit code, Cursor uses a Merkle tree synchronization mechanism. A Merkle tree is a hash-based data structure that allows for efficient detection of changes in large datasets. The client computes a Merkle tree of the project's files, and the server maintains its own. By comparing these trees every few minutes, Cursor can pinpoint exactly which files have been modified and only send those changed parts for re-indexing, minimizing bandwidth and latency.
See the diagram below for a sample Merkle Tree for visualizing the project code files.
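A simplified version of that comparison, assuming per-file SHA-256 hashes and a single root computed over the sorted leaves, might look like this (the real client walks a full tree of directory nodes):

```python
import hashlib

def file_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def merkle_root(file_hashes: dict) -> str:
    """Hash over the sorted (path, hash) pairs; a stand-in for a full tree of directory nodes."""
    combined = "".join(f"{path}:{h}" for path, h in sorted(file_hashes.items()))
    return hashlib.sha256(combined.encode()).hexdigest()

def changed_files(client_hashes: dict, server_hashes: dict) -> list:
    # If the roots match, nothing needs to be re-indexed.
    if merkle_root(client_hashes) == merkle_root(server_hashes):
        return []
    # Otherwise, compare the leaves and re-index only files that are new or whose hashes differ.
    return [p for p, h in client_hashes.items() if server_hashes.get(p) != h]

server = {"a.py": file_hash(b"print('hi')"), "b.py": file_hash(b"x = 1")}
client = {"a.py": file_hash(b"print('hi')"), "b.py": file_hash(b"x = 2")}  # b.py edited locally
print(changed_files(client, server))  # ['b.py']
```

Only b.py would be re-chunked, re-embedded, and re-uploaded in this example.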
Cursor’s AI-powered features rely on a combination of cloud infrastructure providers, model hosts, indexing engines, and analytics tools. These services are integrated with careful attention to privacy, latency, and security.
Below is a breakdown of how each provider fits into Cursor’s stack:
AWS is the backbone of Cursor’s infrastructure. The majority of backend services, including API servers, job queues, and real-time components, are hosted in AWS data centers. Most servers are located in the United States, with additional latency-optimized deployments in Tokyo and London to improve response times globally.
Cloudflare serves as a reverse proxy in front of many Cursor services. It improves network performance, handles TLS termination, and offers an additional layer of DDoS protection.
Microsoft Azure hosts parts of Cursor’s secondary infrastructure. These services also reside in the United States and participate in processing AI-related requests.
Google Cloud Platform (GCP) is used for a smaller set of backend systems, also hosted in U.S. regions.
Fireworks hosts Cursor’s fine-tuned, proprietary AI models. These are the models used for low-latency code completion and other real-time features. Fireworks operates across regions including the U.S., Europe, and Asia.
OpenAI provides Cursor with access to GPT-based models for various tasks, including chat responses, summarization, and completions. Even if a user selects another model (like Claude or Gemini), some requests (particularly for background summarization) may still be routed to OpenAI.
Anthropic supplies Claude models, which Cursor uses for general-purpose chat and code reasoning. Similar to OpenAI, Anthropic may receive code data for inference even when not explicitly selected, depending on task routing.
Google Cloud Vertex AI is used to access Gemini models. These models may be selected explicitly by the user or invoked for specific tasks like large-context summarization.
xAI offers Grok models, which are integrated into Cursor’s model-switching infrastructure.
Turbopuffer is Cursor’s vector database and index store. It stores obfuscated representations of the codebase, specifically numerical embeddings generated from code chunks and metadata like file hashes.
Exa and SerpApi are used to power Cursor’s @web feature, which lets the AI assistant query the internet to supplement answers.
MongoDB is used for analytics and feature usage tracking. It stores aggregate usage information, such as frequency of feature use or session metrics.
Datadog handles performance monitoring and log aggregation. Logs related to AI requests from Privacy Mode users are stripped of any code content.
Databricks (MosaicML), Foundry, and Voltage Park are used in the training and fine-tuning of Cursor’s proprietary models.
Slack and Google Workspace are used for internal team collaboration and debugging.
Sentry is used for error tracking. While traces may include user context, code content is never explicitly logged.
Mistral is used only to parse public documents, such as internet-accessible PDFs.
Pinecone stores vector embeddings of public documentation and is used by Cursor’s AI to understand common libraries and APIs when assisting with coding tasks.
Amplitude supports high-level analytics, such as tracking how often a feature is used or how long sessions last.
HashiCorp Terraform is used to manage Cursor’s cloud infrastructure via infrastructure-as-code.
Stripe handles billing and payment infrastructure.
Vercel hosts Cursor’s public-facing website.
WorkOS provides single sign-on (SSO) and authentication infrastructure.
Cursor stands out as a pioneering AI-first code editor that seamlessly blends a familiar development environment with cutting-edge artificial intelligence. By forking Visual Studio Code, Cursor provides developers with a stable and intuitive interface while enabling a rapid focus on deep AI integration.
At its core, Cursor's architecture is designed to deliver intelligent assistance without compromising speed or privacy. Many of its features, like its real-time AI code autocomplete, are powered by in-house models running on cloud servers, sending only encrypted code snippets to ensure low latency and data security.
This sophisticated cloud-backed system, handling billions of AI completions daily, redefines the coding experience by deeply embedding AI into every workflow, boosting developer productivity and changing how code is written and managed.
References:
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-07-26 23:31:05
Launching the All-in-one interview prep. We’re making all the books available on the ByteByteGo website.
What's included:
System Design Interview
Coding Interview Patterns
Object-Oriented Design Interview
How to Write a Good Resume
Behavioral Interview (coming soon)
Machine Learning System Design Interview
Generative AI System Design Interview
Mobile System Design Interview
And more to come
This week’s system design refresher:
System Design Interview – BIGGEST Mistakes to Avoid (YouTube video)
12 MCP Servers You Can Use in 2025
MCP Versus A2A Protocol
How can Cache Systems go wrong?
8 System Design Concepts Explained in 1 Diagram
SPONSOR US
MCP (Model Context Protocol) is an open standard that simplifies how AI models, particularly LLMs, interact with external data sources, tools, and services. An MCP server acts as a bridge between these AI models and external tools. Here are the top MCP servers:
File System MCP Server
Allows the LLM to directly access the local file system to read, write, and create directories.
GitHub MCP Server
Connects Claude to GitHub repos and allows file updates and code searching.
Slack MCP Server
MCP Server for Slack API, enabling Claude to interact with Slack workspaces.
Google Maps MCP Server
MCP Server for Google Maps API.
Docker MCP Server
Integrate with Docker to manage containers, images, volumes, and networks.
Brave MCP Server
Web and local search using Brave’s Search API.
PostgreSQL MCP Server
An MCP server that enables an LLM to inspect database schemas and execute read-only queries.
Google Drive MCP Server
An MCP server that integrates with Google Drive to allow reading and searching over files.
Redis MCP Server
MCP Server that provides access to Redis databases.
Notion MCP Server
This project implements an MCP server for the Notion API.
Stripe MCP Server
MCP Server to interact with the Stripe API.
Perplexity MCP Server
An MCP Server that connects to Perplexity’s Sonar API for real-time search.
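To make the "bridge" role of an MCP server concrete, here is a toy, in-process sketch: a registry of tools that a model can call by name with structured requests. It deliberately uses a made-up interface rather than the real MCP SDK or JSON-RPC transport.

```python
import json

class ToyToolServer:
    """Toy stand-in for an MCP server: a registry of tools the model can invoke by name."""
    def __init__(self):
        self.tools = {}

    def tool(self, name):
        def register(fn):
            self.tools[name] = fn
            return fn
        return register

    def handle(self, request_json: str) -> str:
        req = json.loads(request_json)            # {"tool": ..., "args": {...}}
        result = self.tools[req["tool"]](**req["args"])
        return json.dumps({"result": result})

server = ToyToolServer()

@server.tool("read_file")
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

# An LLM client would send something like this over the MCP transport:
request = json.dumps({"tool": "read_file", "args": {"path": __file__}})
print(server.handle(request)[:80])
```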
Over to you: Which other MCP Server will you add to the list?
The Model Context Protocol (MCP) connects AI agents to external data sources, such as databases, APIs, and files, via an MCP server, thereby enriching their responses with real-world context.
Google’s Agent-to-Agent (A2A) Protocol enables AI agents to communicate and collaborate, allowing them to delegate tasks, share results, and enhance each other’s capabilities.
MCP and A2A can be combined into a holistic architecture. Here’s how it can work:
Each AI agent (using tools like Langchain with GPT or Claude) connects to external tools via MCP servers for data access.
The external tools can comprise cloud APIs, local files, web search, or communication platforms like Slack.
Simultaneously, the AI agents can communicate with one another using the A2A protocol to coordinate actions, share intermediate outputs, and solve complex tasks collectively.
This architecture enables both rich external context (via MCP) and decentralized agent collaboration (via A2A).
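As a toy illustration of the combined architecture, the sketch below reduces MCP tool access and A2A delegation to plain function calls; real deployments use the actual protocol transports and message formats.

```python
class ToyAgent:
    """Each agent has its own tool access (the MCP side) plus peer links (the A2A side)."""
    def __init__(self, name, tools):
        self.name = name
        self.tools = tools   # stand-in for tools reached via an MCP server
        self.peers = {}      # stand-in for agents reachable via A2A

    def connect(self, other):
        self.peers[other.name] = other

    def handle(self, task: str) -> str:
        if task in self.tools:                    # use external context via "MCP"
            return self.tools[task]()
        for peer in self.peers.values():          # otherwise delegate via "A2A"
            if task in peer.tools:
                return f"{self.name} delegated to {peer.name}: {peer.handle(task)}"
        return "no agent could handle the task"

researcher = ToyAgent("researcher", {"web_search": lambda: "latest docs found"})
coder = ToyAgent("coder", {"run_tests": lambda: "42 tests passed"})
researcher.connect(coder)
print(researcher.handle("run_tests"))
```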
Over to you: What else will you add to understand the MCP vs A2A Protocol capabilities?
The diagram below shows 4 typical cases where caches can go wrong and their solutions.
Thundering herd problem
This happens when a large number of keys in the cache expire at the same time. The query requests then hit the database directly, overloading it.
There are two ways to mitigate this issue: one is to avoid setting the same expiry time for the keys by adding a random jitter to each expiration; the other is to allow only core business data to hit the database and prevent non-core data from reaching the database until the cache is back up.
Cache penetration
This happens when the key doesn’t exist in the cache or the database. The application cannot retrieve relevant data from the database to update the cache. This problem creates a lot of pressure on both the cache and the database.
To solve this, there are two suggestions. One is to cache a null value for non-existent keys, avoiding repeated hits on the database. The other is to use a bloom filter to check key existence first; if the key doesn’t exist, we can avoid hitting the database (a small sketch of both ideas appears below).
Cache breakdown
This is similar to the thundering herd problem. It happens when a hot key expires and a large number of requests hit the database at once.
Since hot keys often account for the bulk of the queries (for example, 80%), one mitigation is not to set an expiration time for them at all.
Cache crash
This happens when the cache is down and all the requests go to the database.
There are two ways to solve this problem. One is to set up a circuit breaker so that, when the cache is down, application services fail fast instead of overwhelming the database. The other is to run the cache as a cluster to improve its availability.
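Here is a minimal sketch of the two cache-penetration mitigations mentioned above. The dict stands in for Redis, the set stands in for a real Bloom filter, and TTL handling is omitted for brevity.

```python
cache = {}                           # stand-in for Redis
maybe_exists = {"user:1", "user:2"}  # stand-in for a Bloom filter of keys that may exist

def db_lookup(key):
    return None                      # pretend the key is missing from the database too

def get(key):
    # 1. Bloom-filter check: keys that definitely don't exist never reach the database.
    if key not in maybe_exists:
        return None
    # 2. Cache hit, including cached "not found" markers.
    if key in cache:
        cached = cache[key]
        return None if cached == "__NULL__" else cached
    # 3. Cache miss: query the database once, then cache a null marker (with a short TTL in practice).
    value = db_lookup(key)
    cache[key] = value if value is not None else "__NULL__"
    return value

print(get("user:999"))  # filtered out by the Bloom filter, database never queried
print(get("user:1"))    # one database miss; the null marker is now cached
print(get("user:1"))    # served from the cache, no database hit
```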
Over to you: Have you met any of these issues in production?
Non-functional requirements define the quality attributes of a system that ensure it performs reliably under real-world conditions. Some key NFRs, along with the implementation approach, are as follows:
Availability with Load Balancers
Availability ensures that a system remains operational and accessible to users at all times. Using load balancers distributes traffic across multiple service instances to eliminate single points of failure.
Latency with CDNs
Latency refers to the time delay experienced in a system’s response to a user request. CDNs reduce latency by caching content closer to users.
Scalability with Replication
Scalability is the system’s ability to handle increased load by adding resources. Replication distributes data across multiple nodes, enabling higher throughput and spreading the workload.
Durability with Transaction Log
Durability guarantees that once data is committed, it remains safe even in the event of failure. Transaction logs persist all operations, allowing the system to reconstruct the state after a crash.
Consistency with Eventual Consistency
Consistency means that all users see the same data. Eventual consistency allows temporary differences, but synchronizes replicas over time to a consistent state.
Modularity with Loose Coupling and High Cohesion
Modularity promotes building systems with well-separated and self-contained components. Loose coupling and high cohesion help achieve this.
Configurability with Configuration-as-Code
Configurability allows a system to be easily adjusted or modified without altering core logic. Configuration-as-Code manages infra and app settings via version-controlled scripts.
Resiliency with Message Queues
Resiliency is a system’s ability to recover from failures and continue operating smoothly. Message queues decouple components and buffer tasks, enabling retries.
Over to you: Which other NFR or strategy will you add to the list?
2025-07-24 23:30:33
Modern databases don’t run on a single box anymore. They span regions, replicate data across nodes, and serve millions of queries in parallel.
However, every time a database tries to be fast, available, and correct at once, something has to give. As systems scale, the promise of fault tolerance collides with the need for correctness. For example, a checkout service can’t afford to double-charge a user just because a node dropped off the network. But halting the system every time a replica lags can break the illusion of availability. Latency, replica lag, and network partitions are not edge cases; they are everyday operating conditions.
Distributed databases have to manage these trade-offs constantly. For example,
A write request might succeed in one region but not another.
A read might return stale data unless explicitly told to wait.
Some systems optimize for uptime and accept inconsistencies. Others block until replicas agree, sacrificing speed to maintain correctness.
Two models help make sense of this: the CAP theorem and the PACELC theorem. CAP explains why databases must choose between staying available and staying consistent in the presence of network partitions. PACELC extends that reasoning to the normal case: even without failure, databases still trade latency for consistency.
In this article, we will look at these two models as they apply to real-world database design and understand the various trade-offs involved.
2025-07-22 23:30:44
Get practical experience with the strategies used at Discord, Disney, Zillow, Tripadvisor & other gamechangers
July 30, 2025
Whether you’re just getting started with NoSQL or looking to optimize your NoSQL performance, this event is a fast way to learn more and get your questions answered by experts.
You can choose from two tracks:
Essentials: NoSQL vs SQL architectures, data modeling fundamentals, and building a sample high-performance application with ScyllaDB.
Advanced: Deep dives into application development practices, advanced data modeling, optimizing your database topology, monitoring for performance, and more.
This is live instructor-led training, so bring your toughest questions. You can interact with speakers and connect with fellow attendees throughout the event.
Bonus: Registrants get the ScyllaDB in Action ebook by Discord Staff Engineer Bo Ingram.
Disclaimer: The details in this post have been derived from the articles shared online by the Nubank Engineering Team. All credit for the technical details goes to the Nubank Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Understanding customer behavior at scale is one of the core challenges facing modern financial institutions. With millions of users generating billions of transactions, the ability to interpret and act on this data is critical for offering relevant products, detecting fraud, assessing risk, and improving user experience.
Historically, the financial industry has relied on traditional machine learning techniques built around tabular data. In these systems, raw transaction data is manually transformed into structured features (such as income levels, spending categories, or transaction counts) that serve as inputs to predictive models.
While this approach has been effective, it suffers from two major limitations:
Manual feature engineering is time-intensive and brittle, often requiring domain expertise and extensive trial-and-error.
Generalization is limited. Features designed for one task (for example, credit scoring) may not be useful for others (for example, recommendation or fraud detection), leading to duplicated effort across teams.
To address these constraints, Nubank adopted a foundation model-based approach, which has already been transforming domains like natural language processing and computer vision. Instead of relying on hand-crafted features, foundation models are trained directly on raw transaction data using self-supervised learning. This allows them to automatically learn general-purpose embeddings that represent user behavior in a compact and expressive form.
The objective is ambitious: to process trillions of transactions and extract universal user representations that can power a wide range of downstream tasks such as credit modeling, personalization, anomaly detection, and more. By doing so, Nubank aims to unify its modeling efforts, reduce repetitive feature work, and improve predictive performance across the board.
In this article, we look at Nubank’s system design for building and deploying these foundation models. We will trace the complete lifecycle from data representation and model architecture to pretraining, fine-tuning, and integration with traditional tabular systems.
Nubank’s foundation model system is designed to handle massive volumes of financial data and extract general-purpose user representations from it. These representations, also known as embeddings, are later used across many business applications such as credit scoring, product recommendation, and fraud detection.
The architecture is built around a transformer-based foundation model and is structured into several key stages, each with a specific purpose.
The system starts by collecting raw transaction data for each customer. This includes information like the transaction amount, timestamp, and merchant description. The volume is enormous, covering trillions of transactions across more than 100 million users.
Each user has a time-ordered sequence of transactions, which is essential for modeling spending behavior over time. In addition to transactions, other user interaction data, such as app events, can also be included.
Before transactions can be fed into a transformer model, they need to be converted into a format the model can understand. This is done through a specialized encoding strategy. Rather than converting entire transactions into text, Nubank uses a hybrid method that treats each transaction as a structured sequence of tokens.
Each transaction is broken into smaller elements:
Amount Sign (positive or negative) is represented as a categorical token.
Amount Bucket is a quantized version of the transaction amount (to reduce numeric variance).
Date tokens such as month, weekday, and day of month are also included.
Merchant Description is tokenized using standard text tokenizers like Byte Pair Encoding.
See the diagram below:
This tokenized sequence preserves both the structure and semantics of the original data, and it keeps the input length compact, which is important because attention computation in transformers scales with the square of the input length.
Once transactions are tokenized, they are passed into a transformer model. Nubank uses several transformer variants for experimentation and performance optimization.
The models are trained using self-supervised learning, which means they do not require labeled data. Instead, they are trained to solve tasks like:
Masked Language Modeling (MLM), where parts of the transaction sequence are hidden and the model must predict them.
Next Token Prediction (NTP), where the model learns to predict the next transaction in the sequence.
More on these in later sections. The output of the transformer is a fixed-length user embedding, usually taken from the final token's hidden state.
The model is trained on large-scale, unlabeled transaction data using self-supervised learning objectives. Since no manual labeling is required, the system can leverage the full transaction history for each user. The models learn useful patterns about financial behavior, such as spending cycles, recurring payments, and anomalies, by simply trying to predict missing or future parts of a user’s transaction sequence. As a simplified example, the model might see “Coffee, then Lunch, then…”. It tries to guess “Dinner”.
The size of the training data and model parameters plays a key role. As the model scales in size and the context window increases, performance improves significantly. By constantly guessing and correcting itself across billions of transactions, the model starts to notice patterns in how people spend money.
For instance, switching from a basic MLM model to a large causal transformer with optimized attention layers resulted in a performance gain of over 7 percentage points in downstream tasks.
After the foundation model is pre-trained, it can be fine-tuned for specific tasks. This involves adding a prediction head on top of the transformer and training it with labeled data. For example, a credit default prediction task would use known labels (whether a customer defaulted) to fine-tune the model.
To integrate with existing systems, the user embedding is combined with manually engineered tabular features. This fusion is done in two ways:
Late Fusion uses models like LightGBM to combine embeddings and tabular data, but the two are trained separately.
Joint Fusion trains the transformer and the tabular model together in an end-to-end fashion using a deep neural network, specifically the DCNv2 architecture.
More on these in a later section.
To make this architecture usable across the company, Nubank has built a centralized AI platform.
This platform stores pretrained foundation models and provides standardized pipelines for fine-tuning them. Internal teams can access these models, combine them with their features, and deploy fine-tuned versions for their specific use cases without needing to retrain everything from scratch.
This centralization accelerates development, reduces redundancy, and ensures that all teams benefit from improvements made to the core models.
To train foundation models on transaction data, it is critical to convert each transaction into a format that transformer models can process.
There are two main challenges when trying to represent transactions for use in transformer models:
Mixed data types: A single transaction includes structured fields (like amount and date) and textual fields (like the merchant name). This makes it harder to represent consistently using either a pure text or a pure structured approach.
High cardinality and cold-start issues: Transactions can be extremely diverse. Even simple changes like different merchant names, locations, or amounts result in new combinations. If each unique transaction is treated as an individual token, the number of tokens becomes enormous. This leads to two problems:
The embedding table grows too large to train efficiently.
The model cannot generalize to new or rare transactions it has not seen before, which is known as the cold-start problem.
To address these challenges, multiple strategies were explored for turning a transaction into a sequence of tokens that a transformer can understand.
In the first approach, each unique transaction is assigned a numerical ID, similar to techniques used in sequential recommendation models. These IDs are then converted into embeddings using a lookup table.
While this approach is simple and efficient in terms of token length, it has two major weaknesses:
The total number of unique transaction combinations is extremely large, making the ID space impractical to manage.
If a transaction was not seen during training, the model cannot process it effectively. This makes it poorly suited for generalization and fails under cold-start conditions.
The second approach treats each transaction as a piece of natural language text. The transaction fields are converted into strings by joining the attribute name and its value, such as "description=NETFLIX amount=32.40 date=2023-05-12", and then tokenized using a standard NLP tokenizer.
This representation can handle arbitrary transaction formats and unseen data, which makes it highly generalizable.
However, it comes at a high computational cost.
Transformers process data using self-attention, and the computational cost of attention increases with the square of the input length. Turning structured fields into long text sequences causes unnecessary token inflation, making training slower and less scalable.
To balance generalization and efficiency, Nubank developed a hybrid encoding strategy that preserves the structure of each transaction without converting everything into full text.
Each transaction is tokenized into a compact set of discrete fields:
Amount Sign Token: One token represents whether the transaction is positive (like a deposit) or negative (like a purchase).
Amount Bucket Token: The absolute value of the amount is placed into a quantized bucket, and each bucket is assigned its own token. This reduces the range of numeric values into manageable categories.
Date Tokens: Separate tokens are used for the month, day of the month, and weekday of the transaction.
Description Tokens: The merchant description is tokenized using a standard subword tokenizer like Byte Pair Encoding (BPE), which breaks the text into common fragments. This allows the model to recognize patterns in merchant names or transaction types.
By combining these elements, a transaction is transformed into a short and meaningful sequence of tokens. See the diagram below:
This hybrid approach retains the key structured information in a compact format while allowing generalization to new inputs. It also avoids long text sequences, which keeps attention computation efficient.
Once each transaction is tokenized in this way, the full transaction history of a user can be concatenated into a sequence and used as input to the transformer. Separator tokens are inserted between transactions to preserve boundaries, and the sequence is truncated at a fixed context length to stay within computational limits.
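A simplified version of this hybrid tokenization, with made-up bucket boundaries and whitespace splitting standing in for BPE, could look like this:

```python
from datetime import date

BUCKETS = [1, 5, 10, 50, 100, 500, 1000]  # illustrative amount buckets, not Nubank's actual ones

def amount_bucket(amount: float) -> str:
    for i, edge in enumerate(BUCKETS):
        if abs(amount) <= edge:
            return f"<bucket_{i}>"
    return f"<bucket_{len(BUCKETS)}>"

def tokenize_transaction(amount: float, when: date, description: str) -> list:
    tokens = ["<pos>" if amount >= 0 else "<neg>", amount_bucket(amount)]
    tokens += [f"<month_{when.month}>", f"<day_{when.day}>", f"<weekday_{when.weekday()}>"]
    tokens += description.lower().split()  # stand-in for BPE subword tokenization
    return tokens

history = [
    tokenize_transaction(-32.40, date(2023, 5, 12), "NETFLIX.COM subscription"),
    tokenize_transaction(1500.00, date(2023, 5, 15), "PAYROLL ACME LTDA"),
]
sequence = []
for tx in history:
    sequence += tx + ["<sep>"]  # separator tokens preserve transaction boundaries
print(sequence)
```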
Once transactions are converted into sequences of tokens, the next step is to train a transformer model that can learn patterns from this data.
As mentioned, the Nubank engineering team uses self-supervised learning to achieve this, which means the model learns directly from the transaction sequences without needing any manual labels. This approach allows the system to take full advantage of the enormous volume of historical transaction data available across millions of users.
Two main objectives are used:
Next Token Prediction (NTP): The model is trained to predict the next token in a transaction sequence based on the tokens that came before it. This teaches the model to understand the flow and structure of transaction behavior over time, similar to how language models predict the next word in a sentence.
Masked Language Modeling (MLM): In this method, some tokens in the sequence are randomly hidden or “masked,” and the model is trained to guess the missing tokens. This forces the model to understand the surrounding context and learn meaningful relationships between tokens, such as the connection between the day of the week and spending type or between merchant names and transaction amounts.
See the diagram below:
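As a small illustration of the MLM objective on such a token sequence (the 15% masking rate is the commonly used default, not a published Nubank figure; real training operates on batched tensors fed to a transformer):

```python
import random

random.seed(1)
MASK_RATE = 0.15  # commonly used masking ratio

tokens = ["<neg>", "<bucket_3>", "<month_5>", "<weekday_4>", "netflix.com", "subscription", "<sep>"]

def mask_sequence(tokens):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < MASK_RATE:
            inputs.append("<mask>")    # the model must reconstruct this position
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append("<ignore>")  # no loss is computed at this position
    return inputs, labels

inputs, labels = mask_sequence(tokens)
print(inputs)  # the same sequence with some positions replaced by <mask>
print(labels)  # original tokens at masked positions, <ignore> everywhere else
```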
While foundation models trained on transaction sequences can capture complex behavioral patterns, many financial systems still rely on structured tabular data for critical features.
For example, information from credit bureaus, user profiles, or application forms often comes in tabular format. To make full use of both sources (sequential embeddings from the transformer and existing tabular features), it is important to combine them in a way that maximizes predictive performance.
This process of combining different data modalities is known as fusion.
Nubank explored two main fusion strategies: late fusion, which is easier to implement but limited in effectiveness, and joint fusion, which is more powerful and trains all components together in a unified system.
See the diagram below:
In the late fusion setup, tabular features are combined with the frozen embeddings produced by a pretrained foundation model. Think of it as taking the frozen representation learned by the pre-trained model (the embeddings) and combining it with a checklist of facts such as age, credit score, and profile details.
These combined inputs are then passed into a traditional machine learning model, such as LightGBM or XGBoost. While this method is simple and leverages well-established tools, it has an important limitation.
Since the foundation model is frozen and trained separately, the embeddings do not adapt to the specific downstream task or interact meaningfully with the tabular data during training. As a result, there is no synergy between the two input sources, and the overall model cannot fully optimize performance.
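A late-fusion pipeline along these lines, assuming the user embeddings are already computed and frozen (the data and hyperparameters below are synthetic):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
n_users, emb_dim, n_tabular = 1000, 16, 4

embeddings = rng.normal(size=(n_users, emb_dim))  # frozen foundation-model embeddings
tabular = rng.normal(size=(n_users, n_tabular))   # e.g. age, bureau score, tenure, income
labels = rng.integers(0, 2, size=n_users)         # e.g. defaulted / did not default

# Late fusion: concatenate the two feature blocks and train a gradient-boosted model on top.
features = np.concatenate([embeddings, tabular], axis=1)
model = lgb.LGBMClassifier(n_estimators=100)
model.fit(features, labels)
print(model.predict_proba(features[:3])[:, 1])    # default probabilities for three users
```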
To overcome this limitation, Nubank developed a joint fusion architecture.
This approach trains the transformer and the tabular model together in a single end-to-end system. By doing this, the model can learn to extract information from transaction sequences that complements the structured tabular data, and both components are optimized for the same prediction task.
To implement this, Nubank selected DCNv2 (Deep and Cross Network v2) as the architecture for processing tabular features. DCNv2 is a deep neural network specifically designed to handle structured inputs. It combines deep layers with cross layers that capture interactions between features efficiently.
See the diagram below:
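A compact sketch of the joint-fusion idea is shown below. A tiny GRU encoder and a small MLP stand in for the transformer and the DCNv2 tower; the point is that gradients flow through both branches during training.

```python
import torch
import torch.nn as nn

class JointFusionModel(nn.Module):
    """Sketch of joint fusion: a sequence encoder and a tabular tower trained end to end."""
    def __init__(self, vocab_size=1000, emb_dim=32, n_tabular=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, emb_dim, batch_first=True)  # stand-in for the transformer
        self.head = nn.Sequential(                                 # stand-in for DCNv2
            nn.Linear(emb_dim + n_tabular, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, token_ids, tabular):
        _, h = self.encoder(self.token_emb(token_ids))  # encode the transaction sequence
        fused = torch.cat([h[-1], tabular], dim=1)      # fuse with tabular features
        return self.head(fused).squeeze(-1)             # one logit per user

model = JointFusionModel()
token_ids = torch.randint(0, 1000, (8, 50))             # 8 users, 50 transaction tokens each
tabular = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,)).float()

loss = nn.BCEWithLogitsLoss()(model(token_ids, tabular), labels)
loss.backward()                                          # gradients flow through both towers
print(float(loss))
```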
Nubank’s initiative to use foundation models represents a significant leap forward in how financial institutions can understand and serve their customers. By moving away from manually engineered features and embracing self-supervised learning on raw transaction data, Nubank has built a modeling system that is both scalable and expressive.
A key part of this success lies in how the system has been integrated into Nubank’s broader AI infrastructure. Rather than building isolated models for each use case, Nubank has developed a centralized AI platform where teams can access pretrained foundation models. These models are trained on massive volumes of user transaction data and are stored in a shared model repository.
Teams across the company can choose between two types of models based on their needs:
Embedding-only models, which use the user representation generated from transaction sequences.
Blended models, which combine these embeddings with structured tabular features using the joint fusion architecture.
This flexibility is critical.
Some teams may already have strong tabular models in place and can plug in the user embeddings with minimal changes. Others may prefer to rely entirely on the transformer-based sequence model, especially for new tasks where historical tabular features are not yet defined.
The architecture is also forward-compatible with new data sources. While the current models are primarily trained on transactions, the design allows for the inclusion of other user interaction data, such as app usage patterns, customer support chats, or browsing behavior.
In short, Nubank’s system is not just a technical proof-of-concept. It is a production-ready solution that delivers measurable gains across core financial prediction tasks.
References:
Understanding Our Customer’s Finances Through Foundation Models
Defining an Interface between Transaction Data and Foundational Models
Still pretending your delivery issues are a mystery? They’re not. You’re just not looking in the right place.
DevStats gives engineering leaders brutal clarity on where delivery breaks down, so you can fix the process instead of pointing fingers.
✅ Track DORA and flow metrics like a grown-up
✅ Spot stuck work, burnout risks, and aging issues
✅ Cut cycle time without cutting corners
✅ Ship faster. With fewer surprises.
More AI tools won’t fix your delivery. More Clarity will.
2025-07-19 23:30:39
Even the most experienced Airflow users will inevitably encounter task failures and DAG errors. Join Astronomer’s webinar on August 6 to learn how to troubleshoot Airflow issues like a pro, before they hit production. You’ll learn:
Common DAG and task issues and how to debug them
How to write DAG unit tests
How to automate tests as part of a CI/CD workflow
This week’s system design refresher:
20 System Design Concepts You Must Know - Final Part (YouTube video)
Top 5 common ways to improve API performance
REST API Vs. GraphQL
ByteByteGo Technical Interview Prep Kit
Tokens vs API Keys
The AWS Tech Stack
5 Data Structures That Make DB Queries Super Fast
SPONSOR US
Result Pagination:
This method optimizes large result sets by breaking them into pages and returning them to the client incrementally, improving service responsiveness and user experience.
Asynchronous Logging:
This approach involves sending logs to a lock-free buffer and returning immediately, rather than writing to disk on every call. Logs are periodically flushed to disk, significantly reducing I/O overhead.
Data Caching:
Frequently accessed data can be stored in a cache to speed up retrieval. Clients check the cache before querying the database, with data storage solutions like Redis offering faster access due to in-memory storage.
Payload Compression:
To reduce data transmission time, requests and responses can be compressed (e.g., using gzip), making the upload and download processes quicker.
Connection Pooling:
This technique involves using a pool of open connections to manage database interaction, which reduces the overhead associated with opening and closing connections each time data needs to be loaded. The pool manages the lifecycle of connections for efficient resource use.
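A minimal connection pool, using sqlite3 only to have a concrete connection object; production services would normally rely on the pooling built into their database driver or ORM.

```python
import sqlite3
from queue import Queue

class ConnectionPool:
    """Keep a fixed set of open connections and hand them out on demand."""
    def __init__(self, db_path: str, size: int = 5):
        self._pool = Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(sqlite3.connect(db_path, check_same_thread=False))

    def acquire(self):
        return self._pool.get()   # blocks if every connection is in use

    def release(self, conn):
        self._pool.put(conn)      # return the connection instead of closing it

pool = ConnectionPool(":memory:", size=2)
conn = pool.acquire()
conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
pool.release(conn)                # the open connection is reused by the next caller
```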
Over to you: What other ways do you use to improve API performance?
When it comes to API design, REST and GraphQL each have their own strengths and weaknesses.
REST
Uses standard HTTP methods like GET, POST, PUT, DELETE for CRUD operations.
Works well when you need simple, uniform interfaces between separate services/applications.
Caching strategies are straightforward to implement.
The downside is it may require multiple roundtrips to assemble related data from separate endpoints.
GraphQL
Provides a single endpoint for clients to query for precisely the data they need.
Clients specify the exact fields required in nested queries, and the server returns optimized payloads containing just those fields.
Supports Mutations for modifying data and Subscriptions for real-time notifications.
Great for aggregating data from multiple sources and works well with rapidly evolving frontend requirements.
However, it shifts complexity to the client side and can allow abusive queries if not properly safeguarded.
Caching strategies can be more complicated than REST.
The best choice between REST and GraphQL depends on the specific requirements of the application and development team. GraphQL is a good fit for complex or frequently changing frontend needs, while REST suits applications where simple and consistent contracts are preferred.
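To make the round-trip difference concrete, here is how the same "user with recent orders" read might be shaped against hypothetical endpoints (the URLs and field names are illustrative):

```python
# REST: two round trips to separate endpoints to assemble related data.
rest_calls = [
    "GET https://api.example.com/users/42",
    "GET https://api.example.com/users/42/orders?limit=3",
]

# GraphQL: a single POST to one endpoint, asking for exactly the fields needed.
graphql_request = {
    "url": "POST https://api.example.com/graphql",
    "query": """
      query {
        user(id: 42) {
          name
          orders(last: 3) { id total }
        }
      }
    """,
}

print(len(rest_calls), "REST round trips vs 1 GraphQL request")
```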
Launching the All-in-one interview prep. We’re making all the books available on the ByteByteGo website.
What's included:
System Design Interview
Coding Interview Patterns
Object-Oriented Design Interview
How to Write a Good Resume
Behavioral Interview (coming soon)
Machine Learning System Design Interview
Generative AI System Design Interview
Mobile System Design Interview
And more to come
Both tokens (such as JWT) and API keys are used for authentication and authorization, but they serve different purposes. Let’s understand the simplified flow for both.
The Token Flow
End user logs into the frontend web application.
Login credentials are sent to the Identity Service.
On successful authentication, a JWT is issued and returned.
The frontend makes API calls with the JWT in the Authorization header.
API Gateway intercepts the request and validates the JWT (signature, expiry, and claims).
If valid, the gateway sends a validation response.
The validated request is forwarded to the user-authenticated service.
The service processes the request and interacts with the database to return results.
The API Key Flow
A 3rd party developer registers on the Developer Portal.
The portal issues an API Key.
The key is also stored in a secure key store for later verification.
The developer app sends future API requests with the API Key in the header.
The API Gateway intercepts the request and sends the key to the API Key Validation service.
The validation service verifies the key from the key store and responds.
For valid API keys, the gateway forwards the request to the public API service.
The service processes it and accesses the database as needed.
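Here is a minimal gateway-side check for each flow, using the PyJWT library for the token path and a plain dictionary standing in for the key store; the secret, claims, and keys are illustrative.

```python
import time
import jwt  # PyJWT

SECRET = "demo-signing-secret"                       # illustrative; real gateways use managed keys
API_KEY_STORE = {"ak_live_123": "third-party-app"}   # illustrative key store

def validate_jwt(token):
    # Verifies the signature and expiry; raises jwt.InvalidTokenError if anything is off.
    return jwt.decode(token, SECRET, algorithms=["HS256"])

def validate_api_key(key):
    return API_KEY_STORE.get(key)                    # None means "reject the request"

token = jwt.encode({"sub": "user-42", "exp": time.time() + 3600}, SECRET, algorithm="HS256")
print(validate_jwt(token)["sub"])       # user-42
print(validate_api_key("ak_live_123"))  # third-party-app
```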
Over to you: What else will you add to the explanation?
Frontend
Static websites are hosted on S3 and served globally via CloudFront for low latency. Other services that support the frontend development include Amplify, Cognito, and Device Farm.
API Layer
API Gateway and AppSync expose REST and GraphQL APIs with built-in security and throttling. Other services that work in this area are Lambda, ELB, and CloudFront.
Application Layer
This layer hosts business logic. Some services that are important in this layer are Fargate, EKS, Lambda, EventBridge, Step Functions, SNS, and SQS.
Media and File Handling
Media is uploaded to S3, transcoded via Elastic Transcoder, and analyzed using Rekognition for moderation. CloudFront signed URLs ensure secure delivery of videos and files to authenticated users.
Data Layer
The primary services for this layer are Aurora, DynamoDB, ElastiCache, Neptune, and OpenSearch.
Security and Identity
Some AWS services that help in this layer of the stack are IAM, Cognito, WAF, KMS, Secrets Manager, and CloudTrail.
Observability and Monitoring
CloudWatch monitors logs, metrics, and alarms. X-Ray provides tracing of request paths. CloudTrail captures API calls. Config ensures compliance, and GuardDuty detects security threats.
CI/CD and DevOps
The key services used in this layer are CodeCommit, CodeBuild, CodeDeploy, CodePipeline, CloudFormation, ECR, and SSM.
Multi-Region Networking
Route 53 and Global Accelerator ensure fast DNS and global routing. VPC segments the network while NAT and Transit Gateways handle secure traffic flow. AWS Backup provides disaster recovery across regions.
Over to you: Which other service will you add to the list?
Data structures are crucial for database indexes because they determine how efficiently data can be searched, inserted, and retrieved, directly impacting query performance.
B-Tree Index
B-Tree indexes use a balanced tree structure where keys and data pointers exist in internal and leaf nodes. They support efficient range and point queries through ordered traversal.
B+ Tree Index
B+ Tree indexes store all data pointers in the leaf nodes, while internal nodes hold only keys to guide the search. Leaf nodes are linked for fast range queries via sequential access.
Hash Index
Hash indexes apply a hash function to a search key to directly locate a bucket with pointers to data rows. They are optimized for equality searches but not for range queries.
Bitmap Index
Bitmap indexes represent column values using bit arrays for each possible value, allowing fast filtering through bitwise operations. They’re ideal for low-cardinality categorical data.
Inverted Index
Inverted indexes map each unique term to a list of row IDs containing that term, enabling fast full-text search.
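A tiny in-memory inverted index shows why term-to-row-ID lookups make full-text search fast; the whitespace tokenizer here stands in for a real analyzer.

```python
from collections import defaultdict

rows = {
    1: "database index structures explained",
    2: "hash index for equality search",
    3: "full text search with inverted index",
}

# Build: map each term to the set of row IDs that contain it.
inverted = defaultdict(set)
for row_id, text in rows.items():
    for term in text.lower().split():
        inverted[term].add(row_id)

def search(query: str) -> set:
    # Intersect the posting lists of all query terms.
    terms = query.lower().split()
    result = inverted[terms[0]].copy()
    for term in terms[1:]:
        result &= inverted[term]
    return result

print(search("inverted index"))  # {3}
```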
Over to you: Which other data structure will you add to the list?