2026-04-21 23:30:21
Skip the guesswork with this MongoDB cheatsheet from Datadog. You’ll get a quick, practical reference for monitoring performance and diagnosing issues in real systems.
Use it to:
Track key metrics like latency, throughput, and resource utilization
Monitor MongoDB and Atlas health with the right signals
Set up dashboards to quickly identify bottlenecks and performance issues
When DoorDash needed to launch Dasher onboarding in Puerto Rico, it took about a week. That wasn’t because they cut corners or threw a huge team at it. It took a week because almost no new code was needed. The steps that Puerto Rican Dashers would go through (identity checks, data collection, compliance validation) already existed as independent modules, battle-tested by thousands of Dashers in other countries. The team assembled them into a new workflow, made one minor customization, and shipped.
And it wasn’t just Puerto Rico. Australia’s migration was completed in under a month. Canada took two weeks, and New Zealand required almost no new development at all.
This speed came from an architectural decision the DoorDash engineering team made when they looked at their growing mess of country-specific if/else statements and decided to stop patching.
They rebuilt their onboarding system around a simple idea. Decompose the process into self-contained modules with standardized interfaces, then connect them through a deliberately simple orchestration layer.
In this article, we will look at how this architecture was designed and the challenges they faced.
Disclaimer: This post is based on publicly shared details from the DoorDash Engineering Team. Please comment if you notice any inaccuracies.
DoorDash’s Dasher onboarding started simple, with just a few steps serving a single country through straightforward logic. Then the company expanded internationally, and every new market meant new branches in the code.
At one point, three API versions ended up coexisting. V3, the newest, continued calling V2 handlers for backward compatibility and also continued writing to V2 database tables. The system literally couldn’t avoid its own history. All developers have probably seen something like this before, where nobody can fully explain which version handles what, and removing any piece feels dangerous because something else might depend on it.
See the diagram below that shows the legacy system view:
The step sequences themselves were hard-coded, with country-specific logic spread throughout. Business logic started immediately after receiving a request, branching into deep if/else chains based on country, step type, or prior state. Adding a new market meant carefully threading new conditions through the maze.
Vendor integrations followed no consistent pattern either. Some onboarding steps used internal services, which called third-party vendors. Other steps called vendors directly. This inconsistent layering made testing and debugging unpredictable.
And then there was also the state management problem. Onboarding progress was tracked across multiple separate database tables. Flags like validation_complete = true or documents_uploaded = false lived in different systems. If a user dropped off mid-onboarding and came back later, reconstructing where they actually stood required querying several systems and inferring logic. This frequently led to errors.
The practical cost was that adding a new country took months of engineering effort across APIs, tables, and code branches. Every change carried the risk of breaking something in a market on the other side of the world.
DoorDash’s rebuild was organized around three distinct layers, each with a single responsibility. It’s easy to blur these layers together, but the separation between them is where the real power lives.
The Orchestrator sits at the top. It’s a lightweight routing layer that looks at context (which country and which market type) and decides which workflow definition should handle the request. That’s all it does. It doesn’t execute steps or manage state. It doesn’t contain business logic either. The main insight here is that the smartest thing about the orchestrator is how little it does. Developers tend to imagine the central controller as the brain of the system. However, in this architecture, the brain is distributed, and the orchestrator is just a traffic cop.
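A minimal Python sketch of this routing-only orchestrator might look like the following (all names are illustrative, not DoorDash's actual code): it maps context to a workflow definition and nothing more.

```python
# A "traffic cop" orchestrator: it selects a workflow definition based
# on context and contains no business logic and no state of its own.
WORKFLOW_REGISTRY = {
    ("US", "dasher"): "UsDasherWorkflow",
    ("AU", "dasher"): "AustraliaDasherWorkflow",
    ("PR", "dasher"): "PuertoRicoDasherWorkflow",
}

def route(country: str, market_type: str) -> str:
    """Pick the workflow definition for the request context. That's all it does."""
    try:
        return WORKFLOW_REGISTRY[(country, market_type)]
    except KeyError:
        raise ValueError(f"No workflow registered for {country}/{market_type}")
```

Keeping this layer to a dictionary lookup is the point: adding a market means registering a new workflow, not threading conditions through a controller.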
Workflow Definitions are the second layer. A workflow is simply an ordered list of steps for a specific market. The US workflow might look like Data Collection, followed by Identity Verification, followed by Compliance Check, followed by Additional Validation. Australia’s workflow skips one step and reorders another. Puerto Rico adds a regional customization. Each workflow is defined as a class with a list of step references, making it easy to see exactly what each market’s onboarding process looks like.
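In code, a workflow definition can be as simple as a class holding an ordered list of step references. The sketch below uses hypothetical step names to mirror the description above; it is not DoorDash's actual code.

```python
# Each workflow is just an ordered list of step references for one market.
class UsWorkflow:
    steps = ["data_collection", "identity_verification",
             "compliance_check", "additional_validation"]

class AustraliaWorkflow:
    # Skips one step and reorders another relative to the US flow.
    steps = ["data_collection", "compliance_check", "identity_verification"]

class PuertoRicoWorkflow:
    # The US flow plus one regional customization.
    steps = UsWorkflow.steps + ["regional_customization"]
```

Reading any market's onboarding process then takes one glance at one class.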
Think of it like a Lego set. Each brick has a standardized shape, studs on top, tubes on the bottom, and that standard interface lets you build anything. A workflow definition is like building instructions for a specific model.
Step Modules are the third layer, and this is where the actual work happens. Each step (data collection, identity verification, risk and compliance checking, document verification) is implemented as an independent and self-contained module. A step knows how to collect its data, validate it, call its external vendors, handle retries and failures, and report success or failure. What it doesn’t know is which workflow it belongs to, or what step comes before or after it. This isolation is what makes reuse possible.
The mechanism enabling this plug-and-play behavior is the interface contract. Every step implements the same standardized interface, with a method to process the step, a method to check if it’s complete, and a method to return its response data. As long as a new step honors this contract, it can slot into any workflow without the workflow knowing or caring about its internals.
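A sketch of such a contract in Python might look like this (method names are Pythonized and illustrative; the article only describes the three capabilities, not the exact signatures):

```python
from abc import ABC, abstractmethod

class OnboardingStep(ABC):
    """The interface contract: any class honoring it can slot into any workflow."""

    @abstractmethod
    def process(self, context: dict) -> None:
        """Do the step's work: collect data, call vendors, handle retries."""

    @abstractmethod
    def is_completed(self, status_map: dict) -> bool:
        """Report whether this step considers itself done."""

    @abstractmethod
    def response_data(self) -> dict:
        """Return this step's response data."""

class AddressStep(OnboardingStep):
    """Example step: knows nothing about which workflow it belongs to."""

    def __init__(self):
        self._address = None

    def process(self, context: dict) -> None:
        self._address = context.get("address")

    def is_completed(self, status_map: dict) -> bool:
        return status_map.get("address", {}).get("status") == "DONE"

    def response_data(self) -> dict:
        return {"address": self._address}
```
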
This contract also enables team autonomy. The identity verification step can be owned entirely by the security team. Payment setup can belong to the finance team. Each team iterates on their step independently, as long as they maintain the shared interface. In a way, the architecture mirrors the organizational structure, or more accurately, it lets the organizational structure work for the system instead of against it.
Two additional capabilities make the system even more flexible:
Composite steps group multiple granular steps into a single logical unit. One country might collect all personal information on a single screen. Another might split it across three screens. A composite step called “PersonalDetails” can wrap Profile, Additional Info, and Vehicle steps together, handling that variation without changing the individual step implementations underneath.
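A composite step can be sketched as a thin wrapper that runs its children in order and is complete only when they all are (names and completion rules here are illustrative assumptions):

```python
class StubStep:
    """Stand-in for a granular step like Profile or Vehicle."""
    def __init__(self, key):
        self.key = key

    def process(self, context):
        context[self.key] = True  # pretend to do the step's work

    def is_completed(self, status_map):
        return status_map.get(self.key, {}).get("status") == "DONE"

class CompositeStep:
    """Groups granular steps into one logical unit, e.g. 'PersonalDetails'."""
    def __init__(self, name, children):
        self.name = name
        self.children = children  # granular steps, run in order

    def process(self, context):
        for child in self.children:
            child.process(context)

    def is_completed(self, status_map):
        # Complete only when every wrapped step is complete.
        return all(c.is_completed(status_map) for c in self.children)
```

One market can present the composite on a single screen while another splits it across three, without touching the individual step implementations.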
And steps can be dynamic and conditional. A Waitlist step might only appear in markets with specific supply conditions. The same step can even appear multiple times within a single workflow.
This flexibility goes beyond simple reordering and confirms that steps are truly stateless and workflow-agnostic.
The address collection step is the clearest proof that this works in practice. DoorDash built it once as a standalone module. When Australia needed address collection early in their flow for compliance checks, the team simply inserted the module before the compliance step in Australia’s workflow definition, without any special logic or branching. Canada later adopted the same step for validation and service-area mapping. It worked out of the box. The US team then experimented by enabling it in select regions, and again, with no new code.
This three-layer pattern isn’t specific to onboarding. Any multi-step process that varies across contexts (checkout flows, approval pipelines, content moderation queues) can be decomposed this way.
One important clarification here is that DoorDash’s step modules are not separate microservices. They are modules within a single service, which means the lesson here is about logical decomposition and interface design rather than strict deployment boundaries. Technically, we could apply this same pattern inside a monolith.
How does the system know where each applicant is in their journey?
Answering this question is essential to making modular steps work.
In the legacy system, this was a mess. Progress was tracked across multiple separate tables, each representing part of the workflow. Introducing a new onboarding step meant modifying several of these tables. Ensuring synchronization between them required close coordination across services, and it often broke down, leading to data mismatches and brittle integrations.
The new system introduced the status map, a single JSON object in the database where every step writes its own progress. It looks something like this:
{
  "personal_info": { "status": "DONE", "metadata": { "name": "Jane" } },
  "address": { "status": "DONE", "metadata": { "address_id": "abc123" } },
  "validation": { "status": "IN_PROGRESS" },
  "compliance": { "status": "INIT" }
}

Each step is responsible for updating its own entry in the map. When a step starts, completes, fails, or gets skipped, it writes that transition directly to its entry. The workflow layer never writes to the status map. It just reads it.
See the diagram below:

Each step also exposes an isStepCompleted() method that defines its own completion logic based on the status map. One step might treat “SKIPPED” as a terminal state, while another might not. This flexibility lives at the step level, not the workflow level, which keeps the orchestration logic simple and stateless.
The practical benefit is immediate. A single query on the status map tells you exactly where any applicant stands in their onboarding journey. Partial updates are handled through atomic JSON key merges, meaning that when one step updates its status, it only touches its own entry without overwriting the rest of the map.
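The per-key merge discipline can be illustrated with a small Python sketch (standing in for the database-level atomic JSON key merge; the function name is hypothetical):

```python
import copy

def update_step_status(status_map, step, status, metadata=None):
    """Merge one step's entry into the status map without touching
    any other step's entry. Returns a new map, leaving the input intact."""
    updated = copy.deepcopy(status_map)
    entry = {"status": status}
    if metadata:
        entry["metadata"] = metadata
    updated[step] = entry  # only this step's key changes
    return updated
```

In production this merge would happen at the database layer so concurrent steps never overwrite each other's entries.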
The architecture is only half the story. Getting there without breaking a running system is where the real engineering difficulty lives.
DoorDash didn’t flip a switch. They designed the new platform to coexist with the existing V2 and V3 APIs, running old and new systems side by side. Applicants who had partially completed onboarding under the legacy system needed to continue seamlessly, so the team built temporary synchronization mechanisms that mirrored progress between systems until the migration was complete. This parallel operation was itself a temporary technical debt, built intentionally to be thrown away.
Other major initiatives were underway during the rebuild, sometimes conflicting with the new onboarding design. Rather than treating these as blockers, the team collaborated across those efforts and adapted the architecture where necessary.
The migration started in January 2025 with the US, their largest and most complex market, as the proving ground. Then the compounding payoff kicked in. Australia was completed in under a month, needing only two localized steps. Canada followed in two weeks with a single new module. Puerto Rico took a week with a minor customization. New Zealand required almost no new development.
Every migration launched with zero regressions, no user-facing incidents, no onboarding downtime, and no unexpected drop-offs in completion rates. Each rollout got faster because more modules had already been battle-tested by thousands of Dashers in prior markets.
The architecture has also proven its value beyond adding countries. DoorDash is integrating its onboarding with another large, independently developed ecosystem that has its own mature onboarding flow. The modular design allowed them to build integration-specific workflows while reusing much of the existing logic, something that would have been extremely painful with the legacy system.
The tradeoffs are real, though. Modularity adds coordination overhead. For a single-market startup, this architecture can be considered overkill. A monolithic onboarding flow is completely fine until you hit the inflection point where country-specific branching becomes more expensive than decomposition.
Reusable modules work well when the underlying concept generalizes across markets. For example, addresses are conceptually similar everywhere, which is why the address step was reused so cleanly. However, compliance requirements can be fundamentally different between regulatory regimes.
The boundary between the platform team and domain teams also requires ongoing negotiation. DoorDash addresses this through published platform principles, versioned interface contracts, and joint KPIs that create shared accountability. Domain expert teams own their business logic (fraud detection, compliance, payments) while the platform enforces consistency. This is a human coordination challenge that architecture alone doesn’t solve.
Looking ahead, DoorDash’s roadmap includes dynamic configuration loading to enable workflows to go live through config rather than code, step versioning to allow multiple iterations of a step to coexist during experiments or rollouts, and enhanced operational tooling to give non-engineering teams the ability to manage workflows directly.
That said, DoorDash deliberately kept workflows code-defined rather than jumping straight to config-driven. While config-driven systems are powerful, they introduce their own complexity. They can be harder to debug and harder to test.
Ultimately, what DoorDash built is a sort of pattern for any system that needs to support multiple variants of a multi-step process. The core idea is three layers (a thin orchestrator, composable workflows, and self-contained steps behind standardized interfaces) connected by a single shared state structure.
2026-04-20 23:30:47
npx workos@latest launches an AI agent, powered by Claude, that reads your project, detects your framework, and writes a complete auth integration into your codebase. No signup required. It creates an environment, populates your keys, and you claim your account later when you're ready.
But the CLI goes way beyond installation. WorkOS Skills make your coding agent a WorkOS expert. workos seed defines your environment as code. workos doctor finds and fixes misconfigurations. And once you're authenticated, your agent can manage users, orgs, and environments directly from the terminal. No more ClickOps.
GitHub built an AI agent that can fix documentation, write tests, and refactor code while you sleep. Then they designed their entire security architecture around the assumption that this agent might try to steal your API keys, spam your repository with garbage, and leak your secrets to the internet.
That might sound like paranoia, but it’s the only responsible way to put a non-deterministic system inside your CI/CD pipeline.
GitHub Agentic Workflows let you plug AI agents into GitHub Actions so they can triage issues, generate pull requests, and handle routine maintenance without human supervision. The appeal is clear, but so is the risk. These agents consume untrusted inputs, make decisions at runtime, and can be manipulated through prompt injection, where carefully crafted text tricks the agent into doing things it wasn’t supposed to do.
In this article, we will look at how GitHub built a security architecture that assumes the agent is already compromised. However, to understand their solution, you first need to understand why the problem is harder than it looks.
Disclaimer: This post is based on publicly shared details from the GitHub Engineering Team. Please comment if you notice any inaccuracies.
CI/CD pipelines are built on a simple assumption. The developers define the steps, the system runs them, and every execution is predictable. All the components in a pipeline share a single trust domain, meaning they can all see the same secrets, access the same files, and talk to the same network. That shared environment is actually a feature for traditional automation. When every component is a deterministic script, sharing a trust domain makes everything composable and fast.
Agents break that assumption completely because they don’t follow a fixed script. They reason over repository state, consume inputs they weren’t specifically designed for, and make decisions at runtime. A traditional CI step either does exactly what you coded it to do or fails. An agent might do something you never anticipated, especially if it processes an input designed to manipulate it.
GitHub’s threat model for Agentic Workflows is blunt.
They assume the agent will try to read and write state that it shouldn’t, communicate over unintended channels, and abuse legitimate channels to perform unwanted actions. For example, a prompt-injected agent with access to shell commands can read configuration files, SSH keys, and Linux /proc state to discover credentials. It can scan workflow logs for tokens. Once it has those secrets, it can encode them into a public-facing GitHub object like an issue comment or pull request for an attacker to retrieve later. The agent isn’t actively malicious; it’s simply following injected instructions it couldn’t distinguish from legitimate ones.
In a standard GitHub Actions setup, everything runs in the same trust domain on top of a runner virtual machine. A rogue agent could interfere with MCP servers (the tools that extend what an agent can do), access authentication secrets stored in environment variables, and make network requests to arbitrary hosts. A single compromised component gets access to everything. The problem isn’t that Actions are insecure. It’s that agents change the assumptions that made a shared trust domain safe in the first place.
Agents can generate code. Getting it right for your system, team conventions, and past decisions is the hard part. You end up babysitting the agent and watching the token costs climb.
More MCPs, rules, and bigger context windows give agents access to information, but not understanding. The teams pulling ahead have a context engine to give agents only what they need for the task at hand.
Our April webinar filled up, so we are bringing it back! Join us live (FREE) on May 6 to see:
Where teams get stuck on the AI maturity curve and why common fixes fall short
How a context engine solves for quality, efficiency, and cost
Live demo: the same coding task with and without a context engine
GitHub Agentic Workflows use a layered security architecture with three distinct levels.
Each layer limits the impact of failures in the layer above it by enforcing its own security properties independently.
The substrate layer sits at the bottom. It’s built on a GitHub Actions runner VM and several Docker containers, including a set of trusted containers that mediate privileged operations. This layer provides isolation between components, controls system calls, and enforces kernel-level communication boundaries. These protections hold even if an untrusted component is fully compromised and executes arbitrary code within its container. The substrate doesn’t rely on the agent behaving correctly, and even arbitrary code execution inside the agent’s container hits a wall at this level.
The configuration layer sits on top of the substrate layer. This is where the system’s structure gets defined. It includes declarative artifacts and the toolchains that interpret them to set up which components are loaded, how they’re connected, what communication channels are permitted, and what privileges are assigned. The most important piece in this layer is the compiler. GitHub doesn’t just run your workflow definition as-is, but compiles it into a GitHub Action with explicit constraints on permissions, outputs, auditability, and network access. The configuration layer also controls which secrets go into which containers. Externally minted tokens like agent API keys and GitHub access tokens are loaded only into the specific containers that need them, never into the agent’s container.
The planning layer sits on top. While the configuration layer dictates which components exist and how they communicate, the planning layer dictates which components are active over time. Its job is to create staged workflows with explicit data exchanges between stages. The safe outputs subsystem, which we’ll get to shortly, is the most important instance of this. It ensures the agent’s work gets reviewed before it affects anything real.
These layers are independent. If the planning layer fails, the configuration layer still enforces its constraints. If the configuration layer has a bug, the substrate layer still provides isolation. No single failure compromises the whole system.
From the beginning, GitHub wanted workflow agents to have zero access to secrets.
In a standard GitHub Actions setup, sensitive material like agent authentication tokens and MCP server API keys sits in environment variables and configuration files visible to all processes on the runner VM. That’s fine when everything sharing the environment is trusted. It’s dangerous with agents because they’re susceptible to prompt injection. An attacker can hide malicious instructions in a web page, a repository issue, or a pull request comment, and trick the agent into extracting and leaking whatever it can find.
GitHub’s solution is a specific container topology where secrets are physically unreachable from the agent.
The agent runs in its own Docker container on a private network. All internet access goes through a dedicated firewall container (gh-aw-firewall). MCP tools are accessed through a separate trusted container called the MCP gateway (gh-aw-mcpg), which launches MCP servers and holds exclusive access to MCP authentication material like the GitHub Personal Access Token. LLM API calls go through an isolated API proxy that holds the model’s authentication token (like the Codex token). The agent communicates with the firewall over HTTP, the firewall connects to the MCP gateway over HTTP, and the gateway connects to MCP servers over stdio.
See the diagram below:
At no point in this chain does the agent touch a secret. It can make LLM calls because the proxy adds authentication on its behalf. It can use MCP tools because the gateway authenticates on its behalf. But the agent itself is completely secretless.
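The secretless pattern in this chain can be sketched in a few lines of Python (the names and shapes below are illustrative, not GitHub's actual implementation): the agent builds a request with no credentials, and a trusted proxy injects the token it alone holds before forwarding.

```python
def agent_build_request(prompt):
    """The agent's container holds no token, so it cannot include one."""
    return {"url": "https://llm.internal/v1/complete",
            "headers": {},
            "body": {"prompt": prompt}}

def proxy_forward(request, token):
    """Only the proxy container holds the model token; it authenticates
    on the agent's behalf without ever exposing the secret upstream."""
    return {**request,
            "headers": {**request["headers"],
                        "Authorization": f"Bearer {token}"}}
```

Even a fully prompt-injected agent has nothing to exfiltrate here: the credential never exists inside its container.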
There’s a real tradeoff here, and GitHub names it explicitly. Coding workloads need broad access to compilers, interpreters, scripts, and repository state. But you can’t just install all those tools inside the agent’s container without expanding the set of network destinations allowed through the firewall and duplicating a lot of the host setup.
Their resolution is elegant: they mount the entire VM host filesystem read-only at /host inside the agent’s container. Then they overlay sensitive paths (the ones containing secrets and configuration) with empty tmpfs layers, which are temporary in-memory filesystems that effectively hide whatever was at those paths. Finally, they launch the agent in a chroot jail rooted at /host. The result is that the agent can access every compiler, interpreter, and tool on the host machine while being unable to discover any secret material. The host-side setup stays intact. The agent’s writable and discoverable surface is constrained to only what it needs. If you work with containers, this pattern of mounting a host filesystem read-only and masking sensitive paths with tmpfs is applicable well beyond the agentic context.
GitHub isn’t the only one that arrived at this zero-secret principle. OpenAI’s Codex takes a different path to the same destination. In the Codex cloud, secrets are available only during the setup phase and are removed before the agent phase starts, and internet access is disabled by default during execution. GitHub uses proxies and gateways. OpenAI uses a two-phase model. The fact that both teams independently converged on “agents should never touch secrets” validates the principle.
Prompt-injected agents can cause real harm even without access to secrets. A rogue agent could spam a repository with pointless issues and pull requests to overwhelm maintainers, or inject objectionable URLs and content into repository objects. The absence of secrets doesn’t mean the absence of risk.
GitHub’s answer is the safe outputs system, and it’s the most distinctive part of their architecture.
Here’s how it works:
The agent can read GitHub state through the GitHub MCP server, which operates in read-only mode. But it cannot write to GitHub directly. All write operations go through a separate safe output MCP server, which only buffers the agent’s intended changes without executing them.
Once the agent finishes its work and exits, everything it staged goes through a deterministic analysis pipeline.
This pipeline does three things.
First, it checks the type of each operation against an allowlist that the workflow author defined. If the author said the agent can create issues and comments but not pull requests, any staged PR gets dropped.
Second, it enforces quantity limits. An agent might be restricted to creating at most three pull requests per run.
Third, it runs content sanitization, scanning for secrets that might have leaked into the output text, removing URLs, and running content moderation checks.
Only outputs that survive the entire pipeline get committed to GitHub. Every stage’s side effects are explicit and vetted.
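The three checks can be sketched as a deterministic pipeline in Python (the rules, limits, and secret pattern below are illustrative assumptions, not GitHub's actual rules):

```python
import re

ALLOWED_TYPES = {"issue", "comment"}           # author-declared allowlist
MAX_PER_TYPE = 3                               # quantity limit per run
SECRET_PATTERN = re.compile(r"ghp_[A-Za-z0-9]+")  # e.g. a GitHub PAT shape

def vet(staged_outputs):
    """Deterministically filter the agent's buffered outputs."""
    committed, counts = [], {}
    for out in staged_outputs:
        kind = out["type"]
        if kind not in ALLOWED_TYPES:
            continue  # 1. drop anything outside the allowlist
        if counts.get(kind, 0) >= MAX_PER_TYPE:
            continue  # 2. enforce quantity limits
        body = SECRET_PATTERN.sub("[REDACTED]", out["body"])  # 3. sanitize
        committed.append({"type": kind, "body": body})
        counts[kind] = counts.get(kind, 0) + 1
    return committed
```

Because every rule is an explicit, deterministic check, the pipeline is auditable, but it also only catches what its authors anticipated.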
The compiler plays an important role here, too. When it decomposes a workflow into stages, it defines for each stage the active components and their permissions (read versus write), the data artifacts that stage can emit, and the admissible downstream consumers of those artifacts.
The workflow author declares upfront what the agent is allowed to produce, and the system enforces those declarations deterministically. Since the pipeline uses deterministic analysis, it can only catch patterns that GitHub anticipated. A truly novel attack vector might slip through, which is exactly why the other layers exist. No single layer is the complete answer.
Agents are determined to accomplish their tasks by any means and can have a surprisingly deep toolbox of tricks for doing so. When an agent behaves unexpectedly, you need full visibility into what happened.
Agentic Workflows make observability a first-class architectural property by logging at every trust boundary.
Network and destination-level activity gets recorded at the firewall.
Model request/response metadata and authenticated requests are captured by the API proxy.
Tool invocations are logged by the MCP gateway and MCP servers.
GitHub even adds internal instrumentation to the agent container to audit potentially sensitive actions like environment variable accesses.
Together, these logs support full forensic reconstruction, policy validation, and detection of anomalous behavior.
But there’s a more important long-term play here. Every point where you can observe communication is also a point where you can mediate it. GitHub is building the observation infrastructure now with future control in mind. They already support a lockdown mode for the GitHub MCP server, and they plan to introduce controls that enforce policies across MCP servers based on whether repository objects are public or private, and based on who authored them.
Every security decision GitHub made came with a cost.
Security versus utility is the most obvious tension. Agents running inside GitHub’s architecture are more constrained than a developer working locally. The chroot approach gives agents access to host tools, but the firewall still limits network access, and the safe outputs pipeline still restricts what the agent can produce. In other words, more security means less flexibility.
Strict-by-default is a strong opinion. Most other coding agents make sandboxing opt-in. Claude Code and Gemini CLI both require you to turn on their sandbox features. GitHub Agentic Workflows run in strict security mode by default. That’s a deliberate choice to prioritize safety over developer convenience, and it won’t be the right tradeoff for every use case.
Prompt injection remains fundamentally unsolved. GitHub’s architecture is a damage containment strategy, not a prevention strategy. It limits the blast radius when an agent gets tricked, but it can’t prevent the issue itself. And the deterministic vetting in the safe outputs pipeline can only catch patterns that were anticipated. A novel attack vector might need a new pipeline stage.
The architecture is also complex, involving multiple containers, proxies, gateways, a compilation step, and a staged output pipeline. This is engineering overhead that makes sense at GitHub’s scale. For simpler use cases, we might not need every piece.
As AI agents become standard in development tooling, the question will shift from whether to sandbox to how to build a complete security architecture.
GitHub’s four principles offer a transferable framework:
Defend in depth with independent layers.
Keep agents away from secrets by architecture, not policy.
Vet every output through deterministic analysis before it affects the real world.
Log everything at every trust boundary, because today’s observability is tomorrow’s control plane.
2026-04-18 23:30:39
This week’s system design refresher:
AI for Engineering Leaders: Course Direction Survey
What is a Data Lakehouse? (YouTube video)
How the JVM Works
Figma Design to Code, Code to Design: Clearly Explained
12 AI Papers that Changed Everything
How Load Balancers Work
Optimistic locking vs pessimistic locking
We are working on a course, AI for Engineering Leaders, and would appreciate your help with a quick survey.
Before we build it, we want to get it right, so we’re asking the people who would actually take it. If you’re an EM, Tech Lead, Director, or VP of Engineering, I’d love 3 minutes of your time. This quick survey covers questions like: how do you evaluate engineers when AI writes most of the code? What metrics still matter? Where do AI tools actually help versus just add noise?
Your answers will directly shape what we cover. Thank you!
We compile, run, and debug Java code all the time. But what exactly does the JVM do between compile and run?
Here's the flow:
Build: javac compiles your source code into platform-independent bytecode, stored as .class files, JARs, or modules.
Load: The class loader subsystem brings in classes as needed using parent delegation. Bootstrap handles core JDK classes, Platform covers extensions, and System loads your application code.
Link: The Verify step checks bytecode safety. Prepare allocates static fields with default values, and Resolve turns symbolic references into direct memory addresses.
Initialize: Static variables are assigned their actual values, and static initializer blocks execute. This happens only the first time the class is used.
Memory: Heap and Method Area are shared across threads. The JVM stack, PC register, and native method stack are created per thread. The garbage collector reclaims unused heap memory.
Execute: The interpreter runs bytecode directly. When a method gets called multiple times, the JIT compiler converts it to native machine code and stores it in the code cache. Native calls go through JNI to reach C/C++ libraries.
Run: Your program runs on a mix of interpreted and JIT-compiled code. Fast startup, peak performance over time.
We spoke with the Figma team behind these releases to better understand the details and engineering challenges. This article covers how Figma’s design-to-code and code-to-design workflows actually work, starting with why the obvious approaches fail, how MCP solves them, and the engineering challenges that remain.
At the high level:
Design to Code:
Step 1: Once the user provides a Figma link and prompt, the coding agent requests the list of available tools from Figma’s MCP server.
Step 2: The server returns its tools: get_design_context, get_metadata, and more.
Step 3: The agent calls get_design_context with the file key and node ID parsed from the URL.
Step 4: The MCP server returns a structured representation including layout and styles. The agent then generates working code (React, Vue, Swift, etc.) using that structured context.
Code to Design:
Step 1: Once the user provides the desired UI code, the agent discovers available tools from the MCP server.
Step 2: The agent calls generate_figma_design with the current UI code.
Step 3: The MCP tool opens the running UI in a browser and injects a capture script.
Step 4: The user selects the desired component, and the script sends the selected DOM data to the server.
Step 5: The server maps the DOM to native Figma layers: frames, auto-layout groups, and editable text layers. The result is fully editable Figma layers shown to the user.
Read the full newsletter here.
A handful of research papers shaped the entire AI landscape we see today.
The diagram below highlights 12 that we consider especially influential.
AlexNet (2012): Showed deep neural nets can see. Ignited the deep learning era.
GANs (2014): Generate realistic images by having two networks compete.
Transformer (2017): Google's "Attention Is All You Need." The architecture behind everything.
GPT-3 (2020): OpenAI showed scale unlocks emergent abilities.
InstructGPT (2022): OpenAI introduced RLHF. Turned raw LLMs into useful assistants.
Scaling Laws (2020): Loss follows a clean power law in model size, data, and compute.
ViT (2020): Split images into patches and use a Transformer for vision tasks.
Latent Diffusion (2021): Denoising in compressed latent space. The design behind Stable Diffusion.
DDPM (2020): Add noise, then learn to reverse it. The foundation of diffusion models.
CLIP (2021): OpenAI connected images and text in one shared embedding space.
Chain-of-Thought (2022): A simple prompt that unlocked complex reasoning.
RAG (2020): Retrieve real documents, then generate. Grounded LLMs in facts.
Over to you: What paper is missing from this list?
A load balancer is a system that distributes incoming traffic across multiple servers to ensure no single server gets overloaded. Here’s how it works under the hood:
The client sends a request to the load balancer.
A listener receives it on the right port/protocol (HTTP/HTTPS, TCP).
The load balancer parses the packet to understand headers and intent.
It checks recent health checks to know which backend servers are up.
It looks in the connection table to reuse any existing client-to-server mapping.
Using its rules, it picks a healthy target server for this request.
It rewrites addresses so traffic can reach that chosen server.
It completes the TCP handshake to open a reliable connection.
If HTTPS is used, it decrypts (or passes through) via SSL/TLS as configured.
The request is forwarded to the selected backend server.
The backend processes it and sends a response back to the load balancer.
The load balancer may tweak headers, then forwards the response to the client.
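Steps 4 through 7 above can be sketched in a few lines. This is a minimal in-memory model, assuming round-robin selection and a simple client-stickiness table; real load balancers add weighting, connection draining, and timeouts:

```python
from itertools import cycle

class LoadBalancer:
    """Minimal sketch of steps 4-6: health state, connection table, target pick."""
    def __init__(self, backends):
        self.backends = backends
        self.healthy = set(backends)   # updated by periodic health checks
        self.conn_table = {}           # client -> backend stickiness
        self._rr = cycle(backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def pick(self, client):
        # reuse an existing client-to-server mapping if it is still healthy
        existing = self.conn_table.get(client)
        if existing in self.healthy:
            return existing
        # otherwise round-robin over the healthy targets
        for _ in range(len(self.backends)):
            candidate = next(self._rr)
            if candidate in self.healthy:
                self.conn_table[client] = candidate
                return candidate
        raise RuntimeError("no healthy backends")

lb = LoadBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
a = lb.pick("client-A")   # assigned via round-robin
lb.mark_down(a)
b = lb.pick("client-A")   # remapped because the first backend went unhealthy
```

Subsequent requests from `client-A` keep hitting `b` via the connection table until that backend also fails a health check.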
Over to you: Which other step will you add to the working of a load balancer?
Imagine two developers updating the same database row at the same time. Their changes conflict, so the database cannot simply let both writes succeed. How should the system handle this?
There are two common approaches.
Optimistic locking assumes conflicts are rare. Both users read the data without acquiring any lock. Each record carries a version number. When a user attempts to write, the database checks: does the version in your update match the current version in the database? If another transaction already incremented the version from 1 to 2, your update still references version 1. The write is rejected.
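The version check can be sketched with a toy in-memory store. In a real database the same effect comes from something like `UPDATE accounts SET balance = 150, version = version + 1 WHERE id = 1 AND version = 1` and checking the affected-row count; the class below is purely illustrative:

```python
class VersionedStore:
    """Toy optimistic locking: a compare-and-swap on a version column."""
    def __init__(self, value):
        self.value, self.version = value, 1

    def read(self):
        return self.value, self.version

    def write(self, new_value, read_version):
        if read_version != self.version:   # someone else wrote in between
            return False                   # caller must re-read and retry
        self.value = new_value
        self.version += 1
        return True

row = VersionedStore({"balance": 100})
_, v_a = row.read()                          # user A reads version 1
_, v_b = row.read()                          # user B also reads version 1
assert row.write({"balance": 150}, v_a)      # A's write succeeds, version -> 2
assert not row.write({"balance": 90}, v_b)   # B's write is rejected: stale version
```

User B's rejection is the conflict surfacing at write time instead of being blocked up front, which is exactly the trade optimistic locking makes.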
Pessimistic locking takes the opposite approach. It assumes conflicts are likely, so it blocks them before they happen. The first transaction locks the row, and every other transaction waits until that lock is released. No version checks needed.
If your system is read-heavy with occasional writes, optimistic locking is usually the better option. When concurrent writes are frequent and the cost of a conflict is high, pessimistic locking is the safer choice.
Over to you: Have you ever run into a deadlock in production because of a locking strategy? How did you fix it?
2026-04-16 23:31:26
The hardest part of relational database design is not the SQL. The syntax for creating tables, defining keys, and writing joins can be learnt and mastered over time. The difficult part is developing the thinking that comes before any code gets written, and answering questions like:
Which pieces of information deserve their own table?
How should tables reference each other?
How much redundancy is too much?
These are design decisions, and getting them right means our data stays consistent, our queries stay fast, and changes are painless. Getting them wrong means spending months patching problems that were baked into the structure from day one.
In this article, we cover the core concepts that inform those decisions. We’ll look at tables, keys, relationships, normalization, and joins, with each concept building on the last.
2026-04-14 23:31:25
AI budgets are under the microscope, and when asked whether the spend is working, most engineering teams can only cite time savings from code generation.
The real impact is in production, where teams spend 70% of engineering time investigating incidents, jumping between tools, and losing time that could go toward shipping product.
That operational load only grows with every line of AI-generated code that hits prod.
Learn how engineering teams at Coinbase, Zscaler, and Salesforce are seeing AI impact across the full engineering lifecycle. Plus, get a practical worksheet for modeling AI ROI with your own operational data.
Turning a design into working code is one of the most common tasks in frontend development, and one of the hardest to automate. The design lives in Figma. The code lives in a repository. Bridging the two has traditionally required a developer to manually interpret layouts, colors, spacing, and component structure from a visual reference. AI coding agents promise to close that gap, but the naive approaches fall short in important ways.
Figma launched its MCP server in June 2025 to bring design context into code. This year, they released two new workflows: the ability to generate designs from coding tools like Claude Code and Codex, and the ability for agents to write directly to the Figma canvas.
We spoke with Emil Sjölander, Aditya Muttur, and Shannon Toliver from the Figma team behind these releases to understand the details and engineering challenges. This article covers how Figma’s design-to-code and code-to-design workflows actually work, starting with why the obvious approaches fail, how MCP solves them, and the engineering challenges that remain.
Before diving into how Figma’s MCP server works, it helps to understand the approaches that came before it, and why each one hits a wall. There are two natural ways to give an LLM access to a design: show it a picture, or hand it the raw data. Both have fundamental limitations that motivated a different approach.
The most obvious way to turn a design into code with an LLM is to take a screenshot of your Figma file and paste it into a coding agent. The LLM sees the image, interprets the layout, and generates code.
This works for simple UIs. But it breaks down for anything complex. The LLM has to guess values based on pixels. It doesn’t know the exact color or that the spacing between cards is 24px, not 20px. The output may look close, but not identical.
So screenshots give the LLM a visual reference but no precise values. The next natural step is to go in the opposite direction: give it all the data.
Figma exposes a REST API that returns a file’s entire structure as JSON. Every node, property, and style is included. Now the LLM has real data instead of pixels.
But having all the data introduces its own problem: there is far too much of it. A single Figma page can produce thousands of lines of JSON, filled with pixel coordinates, visual effects, internal layout rules, and other metadata that are not useful for code generation. Dumping all of this into a prompt can exceed the context window. Even when it fits, the LLM has to wade through pixel coordinates, blend modes, export settings, and other visual metadata that have nothing to do with building a UI, which degrades the output quality.
Neither approach works on its own. Screenshots lack precision. Raw API data has precision but drowns the LLM in noise. What you actually need is something in between: structured design data that preserves exact values like colors, spacing, and component names, but strips out the noise that is not needed for code generation.
That is what Figma’s MCP server does. MCP stands for Model Context Protocol, a standard that defines how AI agents discover and call external tools. Figma’s MCP server takes the raw design data from Figma’s REST API, filters out the noise, and transforms what remains into a clean, structured representation. Pixel positions become layout relationships like “centered inside its parent.” Raw hex colors become design token references. Deeply nested layers get flattened to match what a developer would actually build. The result is a compact, token-efficient context that an LLM can act on directly.
With that context, let’s look at how the two main workflows, design to code and code to design, actually work under the hood.
The design-to-code workflow starts when a developer selects a frame in Figma, copies its URL, and pastes it into a coding agent like Claude Code or Codex with a prompt like “Implement this design.” The agent then produces working code that matches the design. Here is what happens behind the scenes.
The coding agent and Figma’s MCP server work together through four steps. The first two are generic MCP mechanics: tool discovery and tool calling. The last two are where Figma’s engineering makes the difference.
Step 1. The agent discovers available tools
When you first connect the Figma MCP server, the agent receives a list of available tools. These include get_design_context, get_screenshot, get_metadata, and more. Each tool comes with a name, description, and parameter schema.
The agent does not know how Figma works internally. It reads these descriptions the same way a developer reads API documentation, then decides which tool to call based on the user’s prompt.
Step 2. The agent prepares the arguments and calls the tool
The agent prepares the arguments to call the selected tool. In this case, since the selected tool is get_design_context, it needs a file key and a node ID. So it parses both from the Figma URL you pasted and calls the tool.
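The URL parsing in this step might look like the sketch below, assuming the common `figma.com/design/<fileKey>/<name>?node-id=<id>` URL shape (node IDs appear with a dash in URLs and a colon in the API):

```python
from urllib.parse import urlparse, parse_qs

def parse_figma_url(url):
    """Extract (file_key, node_id) from a Figma URL.

    Assumes the https://www.figma.com/design/<fileKey>/<name>?node-id=<id>
    shape; node IDs are written "42-7" in URLs but "42:7" in the API.
    """
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    file_key = segments[1]                        # ["design", "<fileKey>", "<name>"]
    raw_node = parse_qs(parts.query)["node-id"][0]
    return file_key, raw_node.replace("-", ":")

url = "https://www.figma.com/design/AbC123/My-App?node-id=42-7"
file_key, node_id = parse_figma_url(url)   # -> "AbC123", "42:7"
```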
Step 3. The request hits Figma’s backend
The tool call is sent over the network to Figma’s MCP server at mcp.figma.com/mcp over Streamable HTTP. The MCP server handles authentication, then calls Figma’s internal services to read the design data such as node trees, component properties, styles, and variable definitions.
Step 4. Transform raw design data into LLM-friendly context
This is where the most important engineering happens. The MCP server transforms the raw JSON from Figma’s REST API into a structured representation that maps to how a developer thinks about building a UI. Pixel positions become layout relationships like “this element is centered inside its parent.” Color values become references to design tokens like brand-blue instead of raw color codes. Deeply nested layers get simplified to reflect what the user actually sees. And components get enriched with code mappings. For example, when a Figma button component is mapped to src/components/ui/Button.tsx through Code Connect, that reference appears in the output. The LLM reuses the existing component instead of recreating it from scratch.
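A toy version of that transformation might look like the sketch below. The field names and data shapes are invented (the real server's internal representation is not public in this form); the point is the pattern of raw values becoming semantic references:

```python
def to_design_context(node, tokens):
    """Hypothetical sketch: raw node JSON -> compact, LLM-friendly context."""
    ctx = {"name": node["name"]}
    # raw hex colors become design-token references when one matches
    fill = node.get("fill")
    if fill:
        ctx["fill"] = tokens.get(fill, fill)
    # absolute pixel positions become layout relationships
    parent_w = node.get("parentWidth")
    if parent_w and node["x"] * 2 + node["width"] == parent_w:
        ctx["layout"] = "centered-in-parent"
    else:
        ctx["layout"] = {"x": node["x"], "width": node["width"]}
    # Code Connect mapping, if one exists, so the agent reuses the component
    if node.get("codeConnect"):
        ctx["component"] = node["codeConnect"]
    return ctx

raw = {"name": "CTA Button", "fill": "#0B5FFF", "x": 120, "width": 160,
       "parentWidth": 400, "codeConnect": "src/components/ui/Button.tsx"}
ctx = to_design_context(raw, {"#0B5FFF": "brand-blue"})
```

The output drops everything a developer would not type by hand: no blend modes, no export settings, no absolute coordinates once a layout relationship explains them.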
The output defaults to a React + Tailwind framing because that is the most common frontend stack. But it is a structured representation of the design, not generated code. The LLM takes this representation and generates actual code in whatever framework the developer specifies.
Design to code is only half the story. In practice, the code often evolves faster than the design files. A developer ships a feature, tweaks the layout based on user feedback, adds a new section, and now the Figma file no longer matches what is actually running in production. Code to design closes that gap. A developer opens Claude Code, types “send this to Figma,” and a few seconds later the live UI appears in Figma as fully editable layers. Not a flat screenshot, but real frames with auto-layout, editable text, and separate components.
This is powered by one key tool in the MCP server: generate_figma_design. Here is what happens under the hood.
Step 1: The Figma tool launches the capture tool
When the developer prompts “send this to Figma,” the agent calls the MCP server’s generate_figma_design tool.
The tool opens the target URL in a browser and injects a JavaScript capture script. For a local dev server, it connects directly. For production or staging URLs, it uses a browser automation tool like Playwright to open the page and inject the script programmatically.
When the browser window opens, two things appear: the running UI and a capture toolbar overlay. An initial capture happens automatically when the page loads. From there, the developer can capture the entire screen or select specific elements.
Step 2: The script reads the DOM
When the user selects the desired UI from the live capture, the injected script does not take a screenshot. It reads the live DOM.
It walks the DOM tree and extracts computed styles, layout properties, text content, and image sources for every visible element. It also preserves the parent-child hierarchy. A flex container with three children stays structured as a container with three children, not a flat collection of boxes.
This is what makes the output editable in Figma. A screenshot captures pixels. The DOM walk captures structure and relationships between elements.
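A Python stand-in for the injected JavaScript’s walk (the node shape here is invented for illustration) shows why hierarchy survives the capture:

```python
def walk(node):
    """Sketch of the capture script's DOM walk: keep styles, text, hierarchy."""
    out = {"tag": node["tag"],
           "styles": {k: node[k] for k in ("display", "gap") if k in node}}
    if "text" in node:
        out["text"] = node["text"]
    children = [walk(c) for c in node.get("children", [])]
    if children:
        out["children"] = children   # structure survives, unlike a screenshot
    return out

# a flex container with three children stays a container with three children
dom = {"tag": "div", "display": "flex", "gap": "24px",
       "children": [{"tag": "span", "text": "One"},
                    {"tag": "span", "text": "Two"},
                    {"tag": "span", "text": "Three"}]}
tree = walk(dom)
```

Because `tree` keeps the parent-child relationships and computed styles, the backend can later map the container to an auto-layout frame and each child to an editable text layer.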
Step 3: DOM data becomes Figma layers
The captured DOM data gets sent to Figma’s backend, where it is reconstructed as native Figma design layers. Each HTML element maps to a Figma frame or shape. CSS flexbox and grid layouts become Figma auto-layout groups. Text nodes become editable Figma text layers with the correct font, size, weight, and color. Images get extracted and embedded as image fills.
That covers the two core workflows. But making them work reliably in production, across millions of Figma files, multiple coding agents, and real design systems, introduces a different set of problems.
Here are some of the most important challenges Figma’s team faced, and how they addressed them.
LLMs have fixed context windows, so token count is a hard constraint. The design data for a complex Figma page can be enormous, far more than what a coding agent can handle in a single call. Claude Code, for example, defaults to a 25,000-token limit for MCP tool responses. If you call get_design_context on an entire page instead of a specific node, the response can easily exceed that limit and get truncated. This challenge is not unique to Figma. Any MCP server that exposes large structured data like a codebase, a document store, or a design file, has to solve the same problem: how to give the LLM enough context to be useful without exceeding what it can process.
To mitigate this, Figma developed the get_metadata tool. Instead of the full styled representation, it returns a sparse XML outline. A developer can call get_metadata on an entire page to see the structure, identify the specific nodes they need, and then call get_design_context only on those nodes. It is a two-step pattern: scan first, then zoom in.
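The scan-then-zoom pattern might look like this from the agent’s side. The tool names mirror the real ones, but the outline shape below is invented for illustration:

```python
def pick_nodes(metadata_outline, keyword):
    """Scan a sparse get_metadata-style outline for matching nodes, so the
    agent can fetch full design context only for those node IDs."""
    return [n["id"] for n in metadata_outline if keyword in n["name"].lower()]

outline = [  # hypothetical simplified outline of an entire page
    {"id": "1:2", "name": "Header"},
    {"id": "1:5", "name": "Pricing Card"},
    {"id": "1:9", "name": "Pricing Table"},
    {"id": "1:12", "name": "Footer"},
]
targets = pick_nodes(outline, "pricing")
# next step: call get_design_context only on these IDs, staying under the
# tool-response token limit instead of pulling the whole page
```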
By default, the coding agent has no way to know which Figma components map to which code components. Without that mapping, the agent will spend time searching the codebase to find the right component. If it does not find a match, it will create a new one from scratch. Multiply that across every reusable component in a design system, and the generated code diverges from the codebase fast.
Figma mitigates this with Code Connect, which lets teams create explicit mappings between Figma node IDs and code file paths. Once set up, the MCP server includes these mappings in its response, and the agent reuses the actual component instead of guessing.
Setting up Code Connect requires manual effort. Someone has to create and maintain those mappings. Figma has been working to reduce this friction with tools like get_code_connect_suggestions, which automatically detects and proposes mappings. But the quality of the generated code is still directly tied to how much the team has invested in connecting their design system to their codebase.
The bidirectional loop sounds seamless, but each handoff loses information. When a design goes from Figma to code, the structured context captures layout, styles, and component references, but not business logic, event handlers, state management, or API calls. The agent fills those in when generating code.
When that code gets captured back to Figma through generate_figma_design, the DOM walk captures visual structure and styles but strips out everything that is not visible: the React state, the API integration, the route handling.
The result is that each roundtrip requires re-inference. When a designer modifies a captured UI in Figma and a developer pulls it back into code with get_design_context, the agent is translating visual decisions into implementation from scratch. It does not have access to the previous version of the code. Code Connect mappings help here by preserving the link between design components and their code implementations across roundtrips, but the non-visual logic still has to be re-added each time.
Figma’s MCP server does not serve a single client. It serves Claude Code, Cursor, Codex, and any other MCP-compatible tool. Each agent has different context window sizes, different tool-calling behaviors, and different levels of sophistication in how it sequences multiple tool calls. A workflow that works well in one agent may not work the same way in another.
The generate_figma_design tool, for instance, is available in a growing but still limited set of coding tools, including Claude Code and Codex, because code-to-design requires tighter integration with the browser (script injection, the capture toolbar, multi-screen state) than most agents currently support.
Building an MCP server that works well across a growing ecosystem of agents with varying capabilities is one of the harder ongoing challenges in this space.
The recent opening of the Figma canvas to agents marks an important evolution in this workflow. Agents can now not only read and understand design context, but actively modify and create designs using the use_figma MCP tool. This tool complements the design-to-code workflow by enabling agents to edit designs directly on the Figma canvas and create new assets using your components and variables.
The hardest part of building an MCP server is not implementing the protocol. It is making the design decisions that Figma’s team had to work through: what context to include, what to leave out, how to structure it so LLMs can reason about it, and how to stay within token budgets. Those decisions are what separate a useful MCP server from one that just wraps an existing API.
Figma’s server is a useful reference point not because of the design tool specifics, but because the design decisions behind it (what to include, how to structure it, and how to handle token budgets) are well-documented and applicable to anyone building an MCP server for a complex domain.
2026-04-13 23:31:28
Agents can generate code. Getting it right for your system, team conventions, and past decisions is the hard part. You end up babysitting the agent and watching the token costs climb.
More MCPs, rules, and bigger context windows give agents access to information, but not understanding. The teams pulling ahead have a context engine to give agents only what they need for the task at hand.
Join us for a FREE webinar on April 23 to see:
Where teams get stuck on the AI maturity curve and why common fixes fall short
How a context engine solves for quality, efficiency, and cost
Live demo: the same coding task with and without a context engine
If you want to maximize the value you get from AI agents, this one is worth your time.
LinkedIn used to run five separate systems just to decide which posts to show you. One tracked trending content. Another did collaborative filtering. A third handled embedding-based retrieval.
Each had its own infrastructure, its own dedicated team, and its own optimization logic. The setup worked, but when the Feed team wanted to improve one part, they’d break another. So they made a radical bet: rip out all five systems and replace them with a single LLM-powered retrieval model. That solved the complexity problem, but it raised new questions, such as:
How do you teach an LLM to understand structured profile data?
How do you make a transformer serve predictions in under 50 milliseconds for 1.3 billion users?
How do you train the model when most of the data is noise?
In this article, we will look at how the LinkedIn engineering team rebuilt the Feed and the challenges they faced.
Disclaimer: This post is based on publicly shared details from the LinkedIn Engineering Team. Please comment if you notice any inaccuracies.
For years, LinkedIn’s Feed retrieval relied on what engineers call a heterogeneous architecture. When you opened the Feed, content came from multiple specialized sources running in parallel.
A chronological index of network activity.
Trending posts by geography.
Collaborative filtering based on similar members.
Industry-specific pipelines.
Several embedding-based retrieval systems.
Each maintained its own infrastructure, index structure, and optimization strategy.
See the diagram below:
This architecture surfaced diverse, relevant content. But optimizing one retrieval source could degrade another, and no team could tune across all sources simultaneously. Holistic improvement was nearly impossible.
So the Feed team asked a simple question. What if they replaced all of these sources with a single system powered by LLM-generated embeddings?
Under the hood, this works through a dual encoder architecture. A shared LLM converts both members and posts into vectors in the same mathematical space. The training process pushes member and post representations close together when there’s genuine engagement, and pulls them apart when there isn’t. When you open your Feed, the system fetches your member embedding and runs a nearest-neighbor search against an index of post embeddings, retrieving the most relevant candidates in under 50 milliseconds.
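The serving path described above can be sketched with a linear scan. A real system would use an approximate nearest-neighbor index rather than scanning, and the vectors and post names below are made up:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve(member_vec, post_index, k=2):
    """Nearest-neighbor search of a member embedding against post embeddings."""
    scored = sorted(post_index.items(),
                    key=lambda kv: cosine(member_vec, kv[1]), reverse=True)
    return [post_id for post_id, _ in scored[:k]]

member = [0.9, 0.1, 0.0]                  # produced by the member encoder
posts = {                                  # produced by the same shared encoder
    "grid-optimization": [0.8, 0.2, 0.1],
    "cooking-tips":      [0.0, 0.1, 0.9],
    "nuclear-smrs":      [0.7, 0.3, 0.0],
}
top = retrieve(member, posts)   # the two posts closest to this member's vector
```

Training pushes genuinely engaged member-post pairs close in this space and pulls non-engaged pairs apart, so nearness at serving time is a learned proxy for engagement.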
However, the real power comes from what the LLM brings to those embeddings. Traditional keyword-based systems rely on surface-level text overlap. If your profile says “electrical engineering” and a post is about “small modular reactors,” a keyword system misses the connection.
An LLM-based system understands that these topics are related because the model carries world knowledge from pretraining. It knows that electrical engineers often work on power grid optimization and nuclear infrastructure. This is especially powerful for cold-start scenarios, when a new member joins with just a profile headline. The LLM can infer likely interests without waiting for engagement history to accumulate.
The benefits compounded downstream. Instead of receiving candidates from disparate sources with different biases, the ranking layer now receives a coherent candidate set selected by the same semantic similarity measure. Ranking became easier, and each optimization to the ranking model became more effective.
But replacing five systems with one LLM created a new problem. LLMs expect text, and recommendation systems run on structured data and numbers.
To feed structured data into an LLM, LinkedIn built a “prompt library” that transforms structured features into templated text sequences. For posts, it includes author information, engagement counts, and post text. For members, it incorporates profile information, skills, work history, and a chronologically ordered sequence of posts they’ve previously engaged with. Think of it as prompt engineering for recommendation systems.
The most striking example is what happened with numerical features. Initially, LinkedIn passed raw engagement counts directly into prompts. For example, a post with 12,345 views would appear as “views:12345” in the text. The model treated those digits like any other text tokens. When the team measured the correlation between item popularity counts and embedding similarity scores, they found it was essentially zero (-0.004). Popularity is one of the strongest relevance signals in recommendation. And the model was completely ignoring it.
The problem is fundamental. LLMs don’t understand magnitude. They process “12345” as a sequence of digit tokens, not as a quantity.
The fix was quite simple. Instead of passing raw counts, LinkedIn converted them into percentile buckets wrapped in special tokens. This meant that “Views:12345” became <view_percentile>71</view_percentile>, indicating this post sits in the 71st percentile of view counts. Most values between 1 and 100 get processed by the LLM as a single unit rather than a multi-digit sequence, giving the model a stable, learnable vocabulary for quantity. The model can learn that anything above 90 means “very popular” without trying to parse arbitrary digit sequences.
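The bucketing fix can be sketched in a few lines. The stand-in distribution and the resulting percentile below are illustrative, not LinkedIn’s numbers:

```python
from bisect import bisect_right

def percentile_token(value, sorted_population):
    """Map a raw count to a percentile bucket wrapped in a special token,
    giving the LLM a small, stable vocabulary instead of digit strings."""
    rank = bisect_right(sorted_population, value)
    pct = round(100 * rank / len(sorted_population))
    return f"<view_percentile>{pct}</view_percentile>"

# stand-in distribution of view counts across the corpus
population = sorted(range(0, 20000, 100))
tok = percentile_token(12345, population)
# "views:12345" in the prompt becomes a single percentile bucket token
```

The same wrapping applies to any magnitude feature: engagement rates, recency, affinity scores; only the special token name changes.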
The correlation between popularity features and embedding similarity jumped 30x. Recall@10, which measures whether the top 10 retrieved posts are actually relevant, improved by 15%. LinkedIn applied the same strategy to engagement rates, recency signals, and affinity scores.
When building the member’s interaction history for training, LinkedIn initially included everything. Every post that was shown to a member went into the sequence, whether they engaged with it or scrolled past. The idea was that more data should mean a better model.
However, this didn’t turn out to be the case. Including scrolled-past posts not only made model performance worse, but it also made training significantly more expensive. GPU compute for transformer models scales quadratically with context length.
When the team filtered to include only positively-engaged posts, the results improved across every dimension.
Memory footprint per sequence dropped by 37%.
The system could process 40% more training sequences per batch.
Training iterations ran 2.6x faster.
The reason comes down to signal clarity. A scrolled-past post is ambiguous. Maybe the post was irrelevant. Maybe the member was busy. Maybe the headline was mildly interesting, but not enough to stop for. Posts you actively chose to engage with are a much cleaner learning target.
The gains from this change compounded. Better signal quality meant faster training. Faster training meant more experimentation. More experimentation meant better hyperparameter tuning. When a single change improves both quality and efficiency, the benefits multiply through the entire development cycle.
The training strategy had one more clever element. LinkedIn used two types of negative examples:
Easy negatives were randomly sampled posts never shown to a member, providing a broad contrastive signal.
Hard negatives were posts actually shown but not engaged with: the almost-right cases where the model must learn the nuanced distinction between merely relevant and genuinely valuable.
The difficulty of the negative examples shapes what the model learns. Easy negatives teach broad distinction, whereas hard negatives teach the fine-grained ones. Using both together is a common and effective pattern across retrieval systems, and at LinkedIn, adding just two hard negatives per member improved recall by 3.6%.
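The two kinds of negatives can be sketched as a sampling step when constructing a training example. Counts, seed, and post IDs below are illustrative:

```python
import random

def build_training_example(member, engaged, shown_not_engaged, all_posts,
                           n_easy=4, n_hard=2, seed=0):
    """Easy negatives: random posts never shown to the member.
    Hard negatives: posts shown but not engaged with."""
    rng = random.Random(seed)
    never_shown = [p for p in all_posts
                   if p not in engaged and p not in shown_not_engaged]
    easy = rng.sample(never_shown, n_easy)
    hard = rng.sample(shown_not_engaged, n_hard)
    return {"member": member, "positives": engaged, "negatives": easy + hard}

all_posts = [f"post-{i}" for i in range(20)]
ex = build_training_example(
    member="member-1",
    engaged=["post-1", "post-4"],
    shown_not_engaged=["post-2", "post-7", "post-9"],
    all_posts=all_posts,
)
```

The contrastive loss then pushes the member embedding toward the positives and away from both negative sets, with the hard negatives forcing the fine-grained distinctions.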
With retrieval producing high-quality candidates, the next question was how to rank them. LinkedIn’s answer was to stop treating each post as an isolated decision.
Traditional ranking models evaluate each member-post pair independently. This works, but it misses something fundamental about how professionals consume content.
LinkedIn built a Generative Recommender (GR) model that treats your Feed interaction history as a sequence. Instead of scoring each post in isolation, GR processes more than a thousand of a user’s historical interactions to understand temporal patterns and long-term interests.
The practical difference matters. If the user engages with machine learning content on Monday, distributed systems on Tuesday, and opens LinkedIn again on Wednesday, a sequential model understands these aren’t random events. They’re the continuation of a learning trajectory. A traditional pointwise model sees three independent decisions, whereas the sequential model sees the story.
The GR model uses a transformer with causal attention, meaning each position in the history can only attend to previous positions, mirroring how you actually experienced content over time. Recent posts might matter more for predicting immediate interests, but a post from two weeks ago might suddenly become relevant if recent activity suggests renewed interest.
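Causal attention comes down to a triangular mask, sketched here in plain Python (real implementations build this as a tensor and add it to attention scores):

```python
def causal_mask(seq_len):
    """Causal attention mask: position i may attend only to positions <= i,
    mirroring how the member experienced posts over time. 1 = allowed."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

mask = causal_mask(4)
# the first interaction sees only itself; the latest sees the whole history
```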
See the diagram below that shows the transformer architecture:
One of the most practical architectural decisions is what LinkedIn calls late fusion. Not every feature benefits from full self-attention. Count features and affinity signals carry a strong independent signal, and running them through the transformer would inflate computational cost quadratically without clear benefit. Instead, these features are concatenated with the transformer output after sequence processing. This results in rich sequential understanding from the transformer, plus contextual signals that drive relevance, without the cost of processing them through self-attention.
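Late fusion itself is just a concatenation after sequence processing, which a small sketch makes concrete (the values are illustrative):

```python
def late_fusion(transformer_out, count_features):
    """Concatenate sequence features with count/affinity features AFTER the
    transformer, so the cheap signals never pay the quadratic attention cost."""
    return transformer_out + count_features   # list concatenation, not addition

seq_repr = [0.12, -0.40, 0.88]   # pooled transformer output (illustrative)
counts = [0.71, 0.05]             # e.g. popularity percentile, affinity score
fused = late_fusion(seq_repr, counts)   # feeds the prediction head
```

Had the count features been injected as sequence tokens instead, every extra feature would lengthen the sequence and inflate the attention cost quadratically.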
The serving challenge is equally important. Processing 1,000+ historical interactions through multiple transformer layers for every ranking request is expensive. LinkedIn’s solution is shared context batching. The system computes the user’s history representation once, then scores all candidates in parallel using custom attention masks.
On top of the transformer, a Multi-gate Mixture-of-Experts (MMoE) prediction head routes different engagement predictions like clicks, likes, comments, and shares through specialized gates while sharing the same sequential representations underneath.
See the diagram below that shows a typical Mixture-of-Experts architecture.
This lets the model handle multiple prediction tasks without duplicating the expensive transformer computation. Together, shared context batching and the MMoE head are what make the sequential model viable at production scale.
Even the best model is useless without the infrastructure to serve it. LinkedIn’s historical ranking models ran on CPUs. Transformers are fundamentally different, with self-attention scaling quadratically with sequence length and massive parameter counts requiring GPU memory. At LinkedIn’s scale, cost-per-inference determines whether sophisticated AI models can serve every member, or only high-engagement users.
The team invested heavily in custom infrastructure on both sides. For training, a custom C++ data loader eliminates Python’s multiprocessing overhead, custom GPU routines reduced metric computation from a bottleneck to negligible overhead, and parallelized evaluation across all checkpoints cut pipeline time substantially. For serving, a disaggregated architecture separates CPU-bound feature processing from GPU-heavy model inference, and a custom Flash Attention variant called GRMIS delivered an additional 2x speedup over PyTorch’s standard implementation.
See the diagram below that shows the GR Infrastructure Stack
Freshness required its own solution.
Three continuously running background pipelines keep the system current, capturing platform activity, generating updated embeddings through LLM inference servers, and ingesting them into a GPU-accelerated index.
Each pipeline optimizes independently, while the end-to-end system stays fresh within minutes. LinkedIn’s models are also regularly audited to ensure posts from different creators compete on an equal footing, with ranking relying on professional signals and engagement patterns, never demographic attributes.
A few takeaways stand out:
Replacing five retrieval systems with one trades resilience for simplicity.
LLM-based embeddings are richer but more expensive than lightweight alternatives.
The bottleneck is rarely the model architecture. It’s everything around it.
The infrastructure investment represents an effort most teams can’t replicate. And this approach leans on LinkedIn’s rich text data. For primarily visual platforms, the calculus would be different.
The next time you open LinkedIn and see a post from someone you don’t follow, on a topic you didn’t search for, but it’s exactly what you needed to read, that’s all of this working together under the hood.
References: