2025-06-21 23:30:15
If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guarantee
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
This week’s system design refresher:
AI Vs Machine Learning Vs Deep Learning Vs Generative AI
How an SQL Query Executes in a Database
Top 20 AI Agent Concepts You Should Know
How RabbitMQ Works
Hiring Now
SPONSOR US
Artificial Intelligence (AI)
It is the overarching field focused on creating machines or systems that can perform tasks typically requiring human intelligence, such as reasoning, learning, problem-solving, and language understanding. AI consists of various subfields, including ML, NLP, Robotics, and Computer Vision.
Machine Learning (ML)
It is a subset of AI that focuses on developing algorithms that enable computers to learn from and make decisions based on data.
Instead of being explicitly programmed for every task, ML systems improve their performance as they are exposed to more data. Common applications include spam detection, recommendation systems, and predictive analytics.
Deep Learning
It is a specialized subset of ML that utilizes artificial neural networks with multiple layers to model complex patterns in data.
Neural networks are computational models inspired by the human brain’s network of neurons. Deep neural networks can automatically discover the representations needed for feature detection. Use cases include image and speech recognition, NLP, and autonomous vehicles.
Generative AI
It refers to AI systems capable of generating new content, such as text, images, music, or code, that resembles the data they were trained on. They rely on the Transformer Architecture.
Notable generative AI models include GPT for text generation and DALL-E for image creation.
Over to you: What else will you add to understand these concepts better?
Your API workflow is changing whether you like it or not. Postman just dropped features built by devs like you to help you stay ahead of the game.
Postman’s POST/CON 25 product reveals include real-time production visibility with Insights, tighter spec workflows with Spec Hub + GitHub Sync, and AI-assisted debugging that actually works.
Think native integrations that plug directly into your stack—VS Code, GitHub, Slack—plus workflow orchestration without infrastructure headaches.
Get the full technical breakdown and see what your API development could look like.
STEP 1
The query string first reaches the Transport Subsystem of the database. This subsystem manages the connection with the client and performs authentication and authorization checks; if everything looks fine, it lets the query proceed to the next step.
STEP 2
The query now reaches the Query Processor subsystem, which has two parts: Query Parser and Query Optimizer.
The Query Parser breaks down the query into sub-parts (such as SELECT, FROM, WHERE). It checks for any syntax errors and creates a parse tree.
Then, the Query Optimizer goes through the parse tree, checks for semantic errors (for example, whether the “users” table actually exists), and determines the most efficient way to execute the query.
The output of this step is the execution plan.
STEP 3
The execution plan goes to the Execution Engine. This plan is made up of all the steps needed to execute the query.
The Execution Engine takes this plan and coordinates the execution of each step by calling the Storage Engine. It also collects the results from each step and returns a combined or unified response to the upper layer.
STEP 4
The Execution Engine sends low-level read and write requests to the Storage Engine based on the execution plan.
This is handled by the various components of the Storage Engine, such as the transaction manager (for transaction management), lock manager (acquires necessary locks), buffer manager (checks if data pages are in memory), and recovery manager (for rollback or recovery).
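To make these steps concrete, here is a minimal sketch using Python's built-in sqlite3 module (an illustrative choice; the flow above is database-agnostic). EXPLAIN QUERY PLAN surfaces the optimizer's output from Step 2, and the second call shows the Execution Engine running that plan:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Output of the Query Processor: the optimizer's chosen execution plan
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?",
    ("alice@example.com",),
).fetchall()
print(plan)  # e.g. a SEARCH step that uses the covering index idx_users_email

# The Execution Engine then runs the plan (via the storage layer) and returns rows
rows = conn.execute(
    "SELECT id FROM users WHERE email = ?", ("alice@example.com",)
).fetchall()
print(rows)  # [] (no rows were inserted in this toy example)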
Over to you: What else will you add to understand the execution of an SQL Query?
Agent: An autonomous entity that perceives, reasons, and acts in an environment to achieve goals.
Environment: The surrounding context or sandbox in which the agent operates and interacts.
Perception: The process of interpreting sensory or environmental data to build situational awareness.
State: The agent’s current internal condition or representation of the world.
Memory: Storage of recent or historical information for continuity and learning.
Large Language Models: Foundation models powering language understanding and generation.
Reflex Agent: A simple type of agent that makes decisions based on predefined “condition-action” rules.
Knowledge Base: Structured or unstructured data repository used by agents to inform decisions.
CoT (Chain of Thought): A reasoning method where agents articulate intermediate steps for complex tasks.
ReACT: A framework that combines step-by-step reasoning with direct environmental actions.
Tools: APIs or external systems that agents use to augment their capabilities.
Action: Any task or behavior executed by the agent as a result of its reasoning.
Planning: Devising a sequence of actions to reach a specific goal.
Orchestration: Coordinating multiple steps, tools, or agents to fulfill a task pipeline.
Handoffs: The transfer of responsibilities or tasks between different agents.
Multi-Agent System: A framework where multiple agents operate and collaborate in the same environment.
Swarm: Emergent intelligent behavior from many agents following local rules without central control.
Agent Debate: A mechanism where agents argue opposing views to refine or improve outcomes.
Evaluation: Measuring the effectiveness or success of an agent’s actions and outcomes.
Learning Loop: The cycle where agents improve performance by continuously learning from feedback or outcomes.
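To see how several of these concepts fit together (perception, memory, tools, action, and ReACT-style reasoning), here is a minimal Python sketch. It assumes a hypothetical llm callable that returns a dict with thought, action, and input keys; everything here is illustrative rather than any specific framework's API:

def calculator(expression: str) -> str:
    # A toy "tool" the agent can call.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def run_agent(goal: str, llm, max_steps: int = 5) -> str:
    history = [f"Goal: {goal}"]          # memory across steps
    for _ in range(max_steps):
        # Perception + reasoning: the model sees the history and proposes the next step.
        step = llm("\n".join(history))   # hypothetical: {"thought": ..., "action": ..., "input": ...}
        history.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["input"]         # final answer
        observation = TOOLS[step["action"]](step["input"])  # act on the environment
        history.append(f"Observation: {observation}")       # feed the result back as memory
    return "Gave up after max_steps"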
Over to you: Which other AI agent concept will you add to the list?
RabbitMQ is a message broker that enables applications to communicate by sending and receiving messages through queues. It helps decouple services, improve scalability, and handle asynchronous processing efficiently.
Here’s how it works:
A producer (usually an application or service) sends messages to the RabbitMQ broker, which manages message routing and delivery.
Within the broker, messages are sent to an exchange, which determines how they should be routed based on the type of exchange: Direct, Topic, or Fanout.
Bindings connect exchanges to queues using a binding key, which defines the rules for routing messages (for example, exact match or pattern-based)
Direct exchanges route messages to queues whose binding key matches the routing key exactly.
Topic exchanges use patterns to route messages to matching queues.
Fanout exchanges broadcast messages to all bound queues, regardless of routing keys.
Finally, messages are pulled from the queues by a consumer, which processes them and can pass the results to other systems.
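As a small illustration of this flow, here is a sketch using the Python pika client against a hypothetical local broker; the exchange, queue, and routing-key names are made up:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Producer side: declare a direct exchange, a queue, and bind them with a routing key.
channel.exchange_declare(exchange="orders", exchange_type="direct")
channel.queue_declare(queue="order_created")
channel.queue_bind(queue="order_created", exchange="orders", routing_key="order.created")

channel.basic_publish(exchange="orders", routing_key="order.created", body=b'{"order_id": 42}')

# Consumer side: pull messages from the queue, process them, and acknowledge.
def handle(ch, method, properties, body):
    print("processing", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="order_created", on_message_callback=handle)
channel.start_consuming()  # blocks and keeps consuming

Swapping exchange_type to "topic" or "fanout" changes only the routing behavior described above; the producer and consumer code stays the same.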
Over to you: What else will you add to the RabbitMQ process flow?
We collaborate with Jobright.ai (an AI job search copilot trusted by 500K+ tech professionals) to curate this job list.
This Week’s High-Impact Roles at Fast-Growing AI Startups
Senior / Staff Software Engineer, Data Platform at Waabi (California, USA)
Yearly: $155,000 - $240,000
Waabi is an artificial intelligence company that develops autonomous driving technology for the transportation sector.
Senior Full Stack Engineer at Proton.ai (US)
Yearly: $60,000 - $90,000
Proton.ai is an AI-powered sales platform for distributors to gain millions of revenue and reclaim market share.
Software Engineer - Frontend UI at Luma AI (Palo Alto, CA)
Yearly: $220,000 - $280,000
Luma AI is a generative AI startup that enables users to transform text descriptions into corresponding 3D models.
High Salary SWE Roles this week
Principal Engine Engineer, Avatar Systems at Roblox (San Mateo, CA)
Yearly: $289,460 - $338,270
Director of Engineering - AWS Applied AI Solutions at Amazon Web Services (Seattle, WA)
Yearly: $264,100 - $350,000
Manager, Software Engineering - Interactive Foundations at Figma (New York, NY)
Yearly: $250,000 - $350,000
Today’s latest ML positions
Applied Machine Learning Engineer - Causal Inference Recommendation at DoorDash (Sunnyvale, CA)
Yearly: $137,100 - $201,600
Lead Machine Learning Engineer at Adobe (New York, NY)
Yearly: $162,000 - $301,200
Lead Machine Learning Engineer at ESPN (New York, NY)
Yearly: $175,800 - $235,700
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-06-19 23:30:36
Modern applications don’t operate in a vacuum. Every time a ride is booked, an item is purchased, or a balance is updated, the backend juggles multiple operations (reads, writes, validations) often across different tables or services. These operations must either succeed together or fail as a unit.
That’s where transactions step in.
A database transaction wraps a series of actions into an all-or-nothing unit. Either the entire thing commits and becomes visible to the world, or none of it does. In other words, the goal is to have no half-finished orders, no inconsistent account balances, and no phantom bookings.
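As a minimal illustration of that all-or-nothing behavior, here is a sketch using Python's sqlite3 module with a made-up accounts table; the failed transfer leaves no partial update behind:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 120 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 120 WHERE id = 2")
        # Enforce an invariant; violating it aborts the whole unit of work.
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass  # neither UPDATE is visible: the transfer failed as a unit

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 100), (2, 0)]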
However, maintaining correctness gets harder when concurrency enters the picture.
This is because transactions don’t run in isolation. Real systems deal with dozens, hundreds, or thousands of simultaneous users. And every one of them expects their operation to be successful. Behind the scenes, the database has to balance isolation, performance, and consistency without grinding the system to a halt.
This balancing act isn’t trivial. Here are a few cases:
One transaction might read data that another is about to update.
Two users might try to reserve the same inventory slot.
A background job might lock a record moments before a customer clicks "Confirm."
Such scenarios can result in conflicts, race conditions, and deadlocks that stall the system entirely.
In this article, we break down the key building blocks that make transactional systems reliable in the face of concurrency. We will start with the fundamentals: what a transaction is, and why the ACID properties matter. We will then dig deeper into the mechanics of concurrency control (pessimistic and optimistic) and understand the trade-offs related to them.
2025-06-17 23:30:25
Datadog analyzed data from tens of thousands of orgs to uncover 7 key insights on modern DevSecOps practices and application security risks.
Highlights:
Why smaller container images reduce severe vulns
How runtime context helps you prioritize critical CVEs
The link between deploy frequency and outdated dependencies
Plus, learn proven strategies to implement infrastructure as code, automated cloud deploys, and short-lived CI/CD credentials.
Disclaimer: The details in this post have been derived from the details shared online by the Google Engineering Team. All credit for the technical details goes to the Google Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
On June 12, 2025, a significant portion of the internet experienced a sudden outage. What started as intermittent failures on Gmail and Spotify soon escalated into a global infrastructure meltdown. For millions of users and hundreds of companies, critical apps simply stopped working.
At the heart of it all was a widespread outage in Google Cloud Platform (GCP), which serves as the backend for a vast ecosystem of digital services. The disruption began at 10:51 AM PDT, and within minutes, API requests across dozens of regions were failing with 503 errors. Over a few hours, the ripple effects became undeniable.
Among consumer platforms, the outage took down:
Spotify (approximately 46,000 user reports on Downdetector).
Snapchat, Discord, Twitch, and Fitbit: users were unable to stream, chat, or sync their data.
Google Workspace apps (including Gmail, Calendar, Meet, and Docs). These apps power daily workflows for hundreds of millions of users.
The failure was just as acute for enterprise and developer tools:
GitLab, Replit, Shopify, Elastic, LangChain, and other platforms relying on GCP services saw degraded performance, timeouts, or complete shutdowns.
Thousands of CI/CD pipelines, model serving endpoints, and API backends stalled or failed outright.
Vertex AI, BigQuery, Cloud Functions, and Google Cloud Storage were all affected, halting data processing and AI operations.
In total, more than 50 distinct Google Cloud services across over 40 regions worldwide were affected.
Perhaps the most significant impact came from Cloudflare, a company often viewed as a pillar of internet reliability. While its core content delivery network (CDN) remained operational, Cloudflare's authentication systems, reliant on Google Cloud, failed. This led to issues with session validation, login workflows, and API protections for many of its customers.
The financial markets also felt the impact of this outage. Alphabet (Google’s parent) saw its stock fall by nearly 1 percent. The logical question that arose from this incident is as follows: How did a platform built for global scale suffer such a cascading collapse?
Let’s understand more about it.
Your education is expiring faster than ever. What you learned in college won’t help you lead in the age of AI.
That's why Maven specializes in live courses with practitioners who have actually done the work and shipped innovative products:
Shreyas Doshi (Product leader at Stripe, Twitter, Google) teaching Product Sense
Hamel Husain (renowned ML engineer, Github) teaching AI evals
Aish Naresh Reganti (AI scientist at AWS) teaching Agentic AI
Hamza Farooq (Researcher at Google) teaching RAG
This week only: Save 20% on Maven’s most popular courses in AI, product, engineering, and leadership to accelerate your career.
To understand how such a massive outage occurred, we need to look under the hood at a critical system deep inside Google Cloud’s infrastructure. It’s called Service Control.
Service Control is one of the foundational components of Google Cloud's API infrastructure.
Every time a user, application, or service makes an API request to a Google Cloud product, Service Control sits between the client and the backend. It is responsible for several tasks such as:
Verifying if the API request is authorized.
Enforcing quota limits (how many requests can be made).
Checking various policy rules (such as organizational restrictions).
Logging, metering, and auditing requests for monitoring and billing.
The diagram below shows how Service Control works at a high level:
In short, Service Control acts as the gatekeeper for nearly all Google Cloud API traffic. If it fails, most of Google Cloud fails with it.
On May 29, 2025, Google introduced a new feature into the Service Control system. This feature added support for more advanced quota policy checks, allowing finer-grained control over how quota limits are applied.
The feature was rolled out across regions in a staged manner. However, it contained a bug that introduced a null pointer vulnerability in a new code path that was never exercised during rollout. The feature relied on a specific type of policy input to activate. Because that input had not yet been introduced during testing, the bug went undetected.
Critically, this new logic was also not protected by a feature flag, which would have allowed Google to safely activate it in a controlled way. Instead, the feature was present and active in the binary, silently waiting for the right (or in this case, wrong) conditions to be triggered.
Those conditions arrived on June 12, 2025, at approximately 10:45 AM PDT, when a new policy update was inserted into Google Cloud’s regional Spanner databases. This update contained blank or missing fields that were unexpected by the new quota checking logic.
As Service Control read this malformed policy, the new code path was activated. The result was a null pointer error, which caused the Service Control binary to crash in that region.
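Google has not published the Service Control code, so the following Python sketch is only a generic illustration of the two safeguards that were missing here: a flag guarding the new code path and a graceful fallback when a policy field is blank. All names are hypothetical:

FLAGS = {"quota_policy_v2": False}   # staged rollout switch (hypothetical)

def check_quota(request: dict, policy: dict) -> bool:
    if FLAGS["quota_policy_v2"]:
        limit = policy.get("per_region_limit")   # may be missing or blank
        if limit is None:                        # the absent null check
            # Fall back instead of crashing the whole binary on bad data.
            return legacy_quota_check(request, policy)
        return request["count"] <= limit
    return legacy_quota_check(request, policy)

def legacy_quota_check(request: dict, policy: dict) -> bool:
    return request["count"] <= policy.get("limit", float("inf"))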
Because Google Cloud’s policy and quota metadata is designed to replicate globally in near real time (a key feature of Spanner), the corrupted policy data propagated to every region within seconds.
Here’s a representative diagram of how replication works in Google Spanner:
As each regional Service Control instance attempted to process the same bad data, it crashed in the same way. This created a global failure of Service Control.
Since this system is essential for processing API requests, nearly all API traffic across Google Cloud began to fail, returning HTTP 503 Service Unavailable errors.
The speed and scale of the failure were staggering. One malformed update, combined with an unprotected code path and global replication of metadata, brought one of the most robust cloud platforms in the world to a standstill within minutes.
Once the outage began to unfold, Google’s engineering teams responded with speed and precision. Within two minutes of the first crashes being observed in Service Control, Google’s Site Reliability Engineering (SRE) team was actively handling the situation.
Here is the sequence of events that followed:
Fortunately, the team that introduced the new quota checking feature had built in a safeguard: an internal “red-button” switch. This kill switch was designed to immediately disable the specific code path responsible for serving the new quota policy logic.
While not a complete fix, it offered a quick way to bypass the broken logic and stop the crash loop.
The red-button mechanism was activated within 10 minutes of identifying the root cause. By 40 minutes after the incident began, the red-button change had been rolled out across all regions, and systems began to stabilize. Smaller and less complex regions recovered first, as they required less infrastructure coordination.
This kill switch was essential in halting the worst of the disruption. However, because the feature had not been protected by a traditional feature flag, the issue had already been triggered in production globally before the red button could be deployed.
Most regions began to recover relatively quickly after the red button was applied. However, one region (us-central1), located in Iowa, took significantly longer to stabilize.
The reason for this delay was a classic case of the “herd effect.”
As Service Control tasks attempted to restart en masse, they all hit the same underlying infrastructure: the regional Spanner database that held policy metadata. Without any form of randomized exponential backoff, the system became overwhelmed by a flood of simultaneous requests. Rather than easing into recovery, it created a new performance bottleneck.
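The standard remedy for this pattern is randomized exponential backoff ("jitter"), so restarting tasks spread their retries out instead of arriving in lockstep. A generic sketch (not Google's implementation):

import random
import time

def restart_with_backoff(task, max_attempts: int = 8, base: float = 0.5, cap: float = 60.0):
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            # "Full jitter": sleep a random amount up to an exponentially growing cap,
            # so thousands of restarting tasks don't hit the backend at the same instant.
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)
    raise RuntimeError("task failed after retries")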
Google engineers had to carefully throttle task restarts in us-central1 and reroute some of the load to multi-regional Spanner databases to alleviate pressure. This process took time. Full recovery in us-central1 was not achieved until approximately 2 hours and 40 minutes after the initial failure, well after other regions had already stabilized.
While the technical team worked to restore service, communication with customers proved to be another challenge.
Because the Cloud Service Health dashboard itself was hosted on the same infrastructure affected by the outage, Google was unable to immediately post incident updates. The first public acknowledgment of the problem did not appear until nearly one hour after the outage began. During that period, many customers had no clear visibility into what was happening or which services were affected.
To make matters worse, some customers relied on Google Cloud monitoring tools, such as Cloud Monitoring and Cloud Logging, that were themselves unavailable due to the same root cause. This left entire operations teams effectively blind, unable to assess system health or respond appropriately to failing services.
The breakdown in visibility highlighted a deeper vulnerability: when a cloud provider's observability and communication tools are hosted on the same systems they are meant to monitor, customers are left without reliable status updates in times of crisis.
The Google Cloud outage was not the result of a single mistake, but a series of engineering oversights that compounded one another. Each failure point, though small in isolation, played a role in turning a bug into a global disruption.
Here are the key failures that contributed to the entire issue:
The first and most critical lapse was the absence of a feature flag. The new quota-checking logic was released in an active state across all regions, without the ability to gradually enable or disable it during rollout. Feature flags are a standard safeguard in large-scale systems, allowing new code paths to be activated in controlled stages. Without one, the bug went live in every environment from the start.
Second, the code failed to include a basic null check. When a policy with blank fields was introduced, the system did not handle the missing values gracefully. Instead, it encountered a null pointer exception, which crashed the Service Control binary in every region that processed the data.
Third, Google’s metadata replication system functioned exactly as designed. The faulty policy data propagated across all regions almost instantly, triggering the crash everywhere. The global replication process had no built-in delay or validation checkpoint to catch malformed data before it reached production.
Fourth, the recovery effort in the “us-central1” region revealed another problem. As Service Control instances attempted to restart, they all hit the backend infrastructure at once, creating a “herd effect” that overwhelmed the regional Spanner database. Because the system lacked appropriate randomized exponential backoff, the recovery process generated new stress rather than alleviating it.
Finally, the monitoring and status infrastructure failed alongside the core systems. Google’s own Cloud Service Health dashboard went down during the outage, and many customers could not access logs, alerts, or observability tools that would normally guide their response. This created a critical visibility gap during the peak of the crisis.
In the end, it was a simple software bug that brought down one of the most sophisticated cloud platforms in the world.
What might have been a minor error in an isolated system escalated into a global failure that disrupted consumer apps, developer tools, authentication systems, and business operations across multiple continents. This outage is a sharp reminder that cloud infrastructure, despite its scale and automation, is not infallible.
Google acknowledged the severity of the failure and issued a formal apology to customers. In its public statement, the company committed to making improvements to ensure such an outage does not happen again. The key actions Google has promised are as follows:
Prevent the API management system from crashing in the presence of invalid or corrupted data.
Introduce safeguards to stop metadata from being instantly replicated across the globe without proper testing and monitoring.
Improve error handling in core systems and expand testing to ensure invalid data is caught before it can cause failure.
2025-06-14 23:30:22
AI agents can trigger tools, call APIs, and access sensitive data.
Failing to control access creates real risk.
Learn how to:
Limit what agents can do with scoped tokens
Define roles and restrict permissions
Log activity for debugging and review
Secure credentials and secrets
Detect and respond to suspicious behavior
Real teams are applying these practices to keep agent workflows safe, auditable, and aligned with least-privilege principles.
This week’s system design refresher:
Top 20 AI Concepts You Should Know
The AI Application Stack for Building RAG Apps
Shopify Tech Stacks and Tools
Our new book, Mobile System Design Interview, is available on Amazon!
Featured Job
Other Jobs
SPONSOR US
Machine Learning: Core algorithms, statistics, and model training techniques.
Deep Learning: Hierarchical neural networks learning complex representations automatically.
Neural Networks: Layered architectures that model complex, nonlinear relationships.
NLP: Techniques to process and understand natural language text.
Computer Vision: Algorithms for interpreting and analyzing visual data.
Reinforcement Learning: Agents learn behavior through trial and error, guided by rewards from the environment.
Generative Models: Create new data samples from learned data distributions.
LLM: Generates human-like text, trained on massive text corpora.
Transformers: Self-attention-based architecture powering modern AI models.
Feature Engineering: Designing informative features to improve model performance significantly.
Supervised Learning: Learns to map inputs to outputs from labeled examples.
Bayesian Learning: Incorporates uncertainty using probabilistic modeling approaches.
Prompt Engineering: Crafting effective inputs to guide generative model outputs.
AI Agents: Autonomous systems that perceive, decide, and act.
Fine-Tuning Models: Customizes pre-trained models for domain-specific tasks.
Multimodal Models: Process and generate across multiple data types, such as images, video, and text.
Embeddings: Transform inputs into machine-readable vector representations.
Vector Search: Finds similar items using dense vector embeddings.
Model Evaluation: Assessing predictive performance using validation techniques.
AI Infrastructure: Deploying scalable systems to support AI operations.
Over to you: Which other AI concept will you add to the list?
Large Language Models
These are the core engines behind Retrieval-Augmented Generation (RAG), responsible for understanding queries and generating coherent and contextual responses. Some common LLM options are OpenAI GPT models, Llama, Claude, Gemini, Mistral, DeepSeek, Qwen 2.5, Gemma, etc.
Frameworks and Model Access
These tools simplify the integration of LLMs into your applications by handling prompt orchestration, model switching, memory, chaining, and routing. Common tools are Langchain, LlamaIndex, Haystack, Ollama, Hugging Face, and OpenRouter.
Databases
RAG applications rely on storing and retrieving relevant information. Vector databases are optimized for similarity search, while relational options like Postgres offer structured storage. Tools are Postgres, FAISS, Milvus, pgVector, Weaviate, Pinecone, Chroma, etc.
Data Extraction
To populate your knowledge base, these tools help extract structured information from unstructured sources like PDFs, websites, and APIs. Some common tools are Llamaparse, Docling, Megaparser, Firecrawl, ScrapeGraph AI, Document AI, and Claude API.
Text Embeddings
Embeddings convert text into high-dimensional vectors that enable semantic similarity search, which is a critical step for connecting queries with relevant context in RAG. Common tools are Nomic, OpenAI, Cognita, Gemini, LLMWare, Cohere, JinaAI, and Ollama.
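Putting the last two layers together, here is a minimal retrieval sketch: embed the documents and the query, then rank by cosine similarity. The embed function below is a random-vector stand-in; in practice you would call one of the embedding providers listed above:

import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Hypothetical stand-in for a real embedding model (OpenAI, Cohere, Nomic, Ollama, etc.).
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.normal(size=(len(texts), 384))

docs = ["Refund policy: 30 days", "Shipping takes 3-5 business days", "We support Apple Pay"]
doc_vecs = embed(docs)

query_vec = embed(["how long does delivery take?"])[0]
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
top_k = [docs[i] for i in np.argsort(-scores)[:2]]  # context passed into the LLM prompt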
Over to you: What else will you add to the list to build RAG apps?
Shopify handles scale that would break most systems.
On a single day (Black Friday 2024), the platform processed 173 billion requests, peaked at 284 million requests per minute, and pushed 12 terabytes of traffic every minute through its edge.
These numbers aren’t anomalies. They’re sustained targets that Shopify strives to meet. Behind this scale is a stack that looks deceptively simple from the outside: Ruby on Rails, React, MySQL, and Kafka.
But that simplicity hides sharp architectural decisions, years of refactoring, and thousands of deliberate trade-offs.
In this newsletter, we map the tech stack powering Shopify from:
the modular monolith that still runs the business,
to the pods that isolate failure domains,
to the deployment pipelines that ship hundreds of changes a day.
It covers the tools, programming languages, and patterns Shopify uses to stay fast, resilient, and developer-friendly at incredible scale.
A huge thank you to Shopify’s world-class engineering team for sharing their insights and for collaborating with us on this deep technical exploration.
🔗 Dive into the full newsletter here.
Book author: Manuel Vicente Vivo
What’s inside?
An insider's take on what interviewers really look for and why.
A 5-step framework for solving any mobile system design interview question.
7 real mobile system design interview questions with detailed solutions.
24 deep dives into complex technical concepts and implementation strategies.
175 topics covering the full spectrum of mobile system design principles.
Table Of Contents
Chapter 1: Introduction
Chapter 2: A Framework for Mobile System Design Interviews
Chapter 3: Design a News Feed App
Chapter 4: Design a Chat App
Chapter 5: Design a Stock Trading App
Chapter 6: Design a Pagination Library
Chapter 7: Design a Hotel Reservation App
Chapter 8: Design the Google Drive App
Chapter 9: Design the YouTube app
Chapter 10: Mobile System Design Building Blocks
Quick Reference Cheat Sheet for MSD Interview
Founding Engineer @dbdasher.ai
Location: Remote (India)
Role Type: Full-time
Compensation: Highly Competitive
Experience Level: 2+ years preferred
About dbdasher.ai: dbdasher.ai is a well-funded, high-ambition AI startup on a mission to revolutionize how large enterprises interact with data. We use cutting-edge language models to help businesses query complex datasets with natural language. We’re already working with two pilot customers - a publicly listed company and a billion-dollar private enterprise - and we’re just getting started.
We’re building something new from the ground up. If you love solving hard problems and want to shape the future of enterprise AI tools, this is the place for you.
About the Role: We’re hiring a Founding Engineer to join our early team and help build powerful, user-friendly AI-driven products from scratch. You’ll work directly with the founders to bring ideas to life, ship fast, and scale systems that power real-world business decisions.
If you are interested, apply here or email Rishabh at [email protected]
We collaborate with Jobright.ai (an AI job search copilot trusted by 500K+ tech professionals) to curate this job list.
This Week’s High-Impact Roles at Fast-Growing AI Startups
Senior Software Engineer, Search Evaluations at OpenAI (San Francisco, CA)
Yearly: $245,000 - $465,000
OpenAI creates artificial intelligence technologies to assist with tasks and provide support for human activities.
Staff Software Engineer, ML Engineering at SmarterDx (United States)
Yearly: $220,000 - $270,000
SmarterDx is a clinical AI company that develops automated pre-bill review technology to assist hospitals in analyzing patient discharges.
Software Engineering Manager, Core Platform at Standard Bots (New York, NY)
Yearly: $220,000 - $240,000
Standard Bots offers advanced automation solutions, including the RO1 robot, to help businesses streamline their operations.
High Salary SWE Roles this week
Web UI Engineer (L4) at Netflix (Los Gatos, CA)
Yearly: $100,000 - $720,000
Principal Switch Engineering Architect at NVIDIA (Westford, MA)
Yearly: $272,000 - $425,500
Staff iOS Engineer, Banking Mobile at Square (United States)
Yearly: $263,600 - $395,400
Today’s latest ML positions - hiring now!
Senior/Principal Machine Learning Engineer at Red Hat (Raleigh, NC)
Yearly: $170,770 - $312,730
Applied Machine Learning Engineer at Jobot (Roseville, CA)
Yearly: $200,000 - $270,000
Machine Learning Engineer at Docusign (Seattle, WA)
Yearly: $157,500 - $254,350
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected]
2025-06-12 23:30:25
Database schema design plays a crucial role in determining how quickly queries run, how easily features are implemented, and how well things perform at scale. Schema design is never static. What works at 10K users might collapse at 10 million. The best architects revisit schema choices, adapting structure to scale, shape, and current system goals.
Done right, schema design can become a great asset for the system. It accelerates product velocity, reduces data duplication debt, and shields teams from late-stage refactors. Done wrong, it bottlenecks everything: performance, evolution, and sometimes entire features.
Every engineering team hits the same fork in the road: normalize the schema for clean structure and consistency, or denormalize for speed and simplicity. The wrong choice doesn’t necessarily cause immediate issues. However, problems creep in through slow queries, fragile migrations, and data bugs that surface months later during a traffic spike or product pivot.
In truth, normalization and denormalization aren't rival approaches, but just tools to get the job done. Each solves a different kind of problem. Normalization focuses on data integrity, minimal redundancy, and long-term maintainability. Denormalization prioritizes read efficiency, simplicity of access, and performance under load.
In this article, we’ll look into both of them in detail. We’ll start with the foundations: normal forms and how they shape normalized schemas. We will then explore denormalization and the common strategies for implementing it. From there, we will map the trade-offs between normalization and denormalization.
The goal isn't to declare one approach as the winner. It's to understand their mechanics, consequences, and ideal use cases.
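As a tiny, made-up illustration of the two shapes being compared, here is the same order data modeled both ways (sketched with Python's sqlite3; all table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: customer data lives in one place; orders reference it by key.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(id),
                        total_cents INTEGER);
""")

# Denormalized read model: duplicate the fields the hot read path needs,
# trading extra update complexity for a join-free query.
conn.executescript("""
CREATE TABLE order_summaries (order_id INTEGER PRIMARY KEY,
                              customer_name TEXT,
                              customer_email TEXT,
                              total_cents INTEGER);
""")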
2025-06-11 23:30:35
Note: This article is written in collaboration with the Shopify engineering team. Special thanks to the Shopify engineering team for sharing details with us about their tech stack and also for reviewing the final article before publication. All credit for the technical details and diagrams shared in this article goes to the Shopify Engineering Team.
Shopify handles scale that would break most systems.
On a single day (Black Friday 2024), the platform processed 173 billion requests, peaked at 284 million requests per minute, and pushed 12 terabytes of traffic every minute through its edge.
These numbers aren’t anomalies. They’re sustained targets that Shopify strives to meet. Behind this scale is a stack that looks deceptively simple from the outside: Ruby on Rails, React, MySQL, and Kafka.
But that simplicity hides sharp architectural decisions, years of refactoring, and thousands of deliberate trade-offs.
In this article, we map the tech stack powering Shopify from the modular monolith that still runs the business, to the pods that isolate failure domains, to the deployment pipelines that ship hundreds of changes a day. It covers the tools, programming languages, and patterns Shopify uses to stay fast, resilient, and developer-friendly at incredible scale.
Shopify’s backend runs on Ruby on Rails. The original codebase, written in the early 2000s, still forms the heart of the system. Rails offers fast development, convention over configuration, and strong patterns for database-backed web applications. Shopify also uses Rust as its systems programming language.
While most startups eventually rewrite their early frameworks, Shopify doubled down, treating Ruby and Rails as 100-year tools that will continue to merit a place in its toolchain of choice. Instead of moving on to another framework, Shopify pushed Rails further and invested in:
YJIT, a Just-in-Time compiler for Ruby built on Rust that improves runtime performance without changing developer ergonomics.
Sorbet, a static type checker built specifically for Ruby. Shopify contributed heavily to Sorbet and made it a first-class part of the stack.
Rails Engines, a built-in Rails feature repurposed as a modularity mechanism. Each engine behaves like a mini-application, allowing isolation, ownership, and eventual extraction if needed.
The result is one of the largest and longest-running Rails applications in production.
Shopify runs a modular monolith. That phrase gets thrown around a lot, but in Shopify’s case, it means this: the entire codebase lives in one repository, runs in a single process, but is split into independently deployable components with strict boundaries.
Each component defines a public interface, with contracts enforced via Sorbet.
These interfaces aren’t optional. They’re a way to prevent tight coupling, allow safe refactoring, and make the system feel smaller than it is. Developers don’t need to understand millions of lines of code. They need to know the contracts their component depends on and trust those contracts will hold.
To manage complexity, components are organized into logical layers:
Platform: Foundational services like identity, shop state, and database abstractions
Supporting: Business domains like inventory, shipping, or merchandising
Frontend-facing: External interfaces like the online store or GraphQL APIs
This layering prevents cyclic dependencies and encourages clean flow across domains.
To support this at scale, Shopify maintains a comprehensive system of static analysis tools, exception monitoring dashboards, and differentiated application/business metrics to track component health across the company.
This modular structure doesn’t make development effortless. It introduces boundaries, which can feel like friction. However, it keeps teams aligned, reduces accidental coupling, and lets Shopify evolve without losing control of its core.
Shopify’s frontend has gone through multiple architectural shifts, each one reflecting changes in the broader web ecosystem and lessons learned under scale.
The early days used standard patterns: server-rendered HTML templates, enhanced with jQuery and prototype.js. As frontend complexity grew, Shopify built Batman.js, its single-page application (SPA) framework. It offered reactivity and routing, but like most in-house frameworks, it came with long-term maintenance overhead.
Eventually, Shopify shifted back to simpler patterns: statically rendered HTML and vanilla JavaScript. However, that also had limits. Once the broader ecosystem matured, particularly around React and TypeScript, the team made a clean move forward.
Today, the Shopify Admin interface runs on React with React Router (by Remix), is written in TypeScript, and is driven entirely by GraphQL. It follows a strict separation: no business logic in the client, no shared state across views. The Admin is one of Shopify’s biggest apps; built on Remix, it behaves as a stateless GraphQL client. Each page fetches exactly the data it needs, when it needs it.
This discipline enforces consistency across platforms. Mobile apps and web admin screens speak the same language (GraphQL), reducing duplication and misalignment between surfaces.
Mobile development at Shopify follows a similar philosophy: reuse where possible, specialize where needed.
Every major app now runs on React Native. The goal of using a single framework is to share code, reduce drift between platforms, and improve developer velocity across Android and iOS.
Shared libraries power common concerns like authentication, error tracking, and performance monitoring. When apps need to drop into native for camera access, payment hardware, or long-running background tasks, they do so through well-defined native modules.
Shopify teams also contribute directly to React Native ecosystem projects like Mobile Bridge (for enabling web to trigger native UI elements), Skia (for fast 2D rendering), WebGPU (that enables modern GPU APIs and enables general-purpose GPU computation for AI/ML), and Reanimated (for performant animations). In some cases, Shopify engineers co-captain React Native releases.
Shopify’s language choices reflect its commitment to developer productivity and operational resilience.
Ruby remains the backbone of Shopify’s backend. It powers the monolith, the engines, and most of the internal services.
Sorbet, a static type checker for Ruby, fills the safety gap traditionally left open in dynamically typed systems. It enables early feedback on interface violations and contract boundaries.
TypeScript is a first-class language on the frontend. Paired with React, it provides predictable behavior across the web and mobile surfaces.
JavaScript still appears in shared libraries and older assets, but most modern development favors TypeScript for its tooling and clarity.
Lua is used for custom scripting inside OpenResty, Shopify’s edge-layer HTTP server built on Nginx. This enables high-performance, scriptable load balancing.
GraphQL is served from the Ruby backend and used across all major clients, such as web, mobile, and third-party apps.
Kubernetes YAML defines infrastructure deployments, service configurations, and environment scaling parameters.
Remix is a full stack web framework used across various aspects of the platform — Shopify Admin Interface, marketing websites, and Hydrogen, Shopify's headless commerce framework for building custom storefronts.
A large monolith doesn’t stay healthy without support. Shopify has developed an ecosystem of internal and open-source tools to enforce structure, automate safety checks, and reduce operational toil.
Packwerk enforces dependency boundaries between components in the monolith. It flags violations early, before they cause architectural drift.
Tapioca automates the generation of Sorbet RBI (Ruby Interface) files, keeping static type definitions in sync with actual code.
Bootsnap improves startup times for Ruby applications by caching expensive computations like YAML parsing and gem loading.
Maintenance Tasks standardize background job execution. They make recurring tasks idempotent, safe to rerun, and easy to observe.
Toxiproxy simulates unreliable network conditions such as latency, dropped packets, or timeouts, allowing services to test their behavior under stress.
TruffleRuby is a high-performance Ruby implementation developed by Oracle. Shopify contributes to this as part of its broader effort to push Ruby further.
Semian is a circuit breaker library for Ruby, protecting critical resources like Redis or MySQL from cascading failures during partial outages.
Roast is a convention-oriented framework for creating structured AI workflows, maintained and used internally by the Augmented Engineering team at Shopify.
A much more exhaustive list of open-source software supported by Shopify is also present here.
There are two main categories here: the relational database layer and the caching and asynchronous-work layer.
Shopify uses MySQL as its primary relational database, and has done so since the platform's early days. However, as merchant volume and transactional throughput grew, the limits of a single instance became unavoidable.
In 2014, Shopify introduced sharding. Each shard holds a partition of the overall data, and merchants are distributed across those shards based on deterministic rules. This works well in commerce, where tenant isolation is natural. One merchant’s orders don’t need to query another merchant’s inventory.
Over time, Shopify replaced the flat shard model with Pods. A pod is a fully isolated slice of Shopify, containing its own MySQL instance, Redis node, and Memcached cluster. Each pod can run independently, and each one can be deployed in a separate geographic region.
This model solves two problems:
It removes single points of failure. An issue in one pod won't cascade across the fleet.
It allows Shopify to scale horizontally by adding more pods instead of vertically scaling the database.
By pushing isolation to the infrastructure level, Shopify contains failure domains and simplifies operational recovery.
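As a generic sketch of the idea (not Shopify's actual routing code, which would more likely use a mutable lookup table so shops can be rebalanced between pods), deterministic tenant-to-pod routing can look like this:

import hashlib

PODS = [f"pod-{i}" for i in range(100)]   # each pod = its own MySQL/Redis/Memcached stack

def pod_for_shop(shop_id: int) -> str:
    digest = hashlib.sha256(str(shop_id).encode()).hexdigest()
    return PODS[int(digest, 16) % len(PODS)]

# Every request for a given shop lands on the same isolated slice of infrastructure.
assert pod_for_shop(42) == pod_for_shop(42)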
Shopify relies on two core systems for caching and asynchronous work: Memcached and Redis.
Memcached handles key-value caching. It speeds up frequently accessed reads, like product metadata or user session info, without burdening the database.
Redis powers queues and background job processing. It supports Shopify’s asynchronous workflows: webhook delivery, email sends, payment retries, and inventory syncs.
But Redis wasn’t always scoped cleanly. At one point, all database shards shared a single Redis instance. A failure in that central Redis brought down the entire platform. Internally, the incident is still known as “Redismageddon.”
The lesson Shopify took from this incident was clear: never centralize a system that’s supposed to isolate work. Afterward, Redis was restructured to match the pod model, giving each pod its own Redis node. Since then, outages have been localized, and the platform has avoided global failures tied to shared infrastructure.
There are two main categories here as well: asynchronous messaging and synchronous service-to-service communication.
Shopify uses Kafka as the backbone for messaging and event distribution. It forms the spine of the platform’s internal communication layer, decoupling producers from consumers, buffering high-volume traffic, and supporting real-time pipelines that feed search, analytics, and business workflows.
At peak, Kafka at Shopify has handled 66 million messages per second, a throughput level that few systems encounter outside large-scale financial or streaming platforms.
This messaging layer serves several use cases:
Emitting domain events when core objects change (for example, order created, product updated)
Driving ML inference workflows with near real-time updates
Powering search indexing, inventory tracking, and customer notifications
By relying on Kafka, Shopify avoids tight coupling between services. Producers don't wait for consumers. Consumers process at their own pace. And when something goes wrong, like a downstream service crashing, the event stream holds the data until the system recovers.
That’s a practical way to build resilience into a fast-moving platform.
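As a generic illustration of the producer side of such a pipeline, here is a sketch using the kafka-python client; the topic, key, and event payload are made up rather than Shopify's actual schema:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)

# Emit a domain event when a core object changes; consumers (search indexing,
# analytics, notifications) read it later, at their own pace.
producer.send(
    "shop.orders",
    key="shop-42",   # keeps one shop's events ordered within a partition
    value={"event": "order_created", "order_id": 1001, "total_cents": 15900},
)
producer.flush()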
For synchronous interactions, Shopify services communicate over HTTP, using a mix of REST and GraphQL.
REST APIs still power much of the internal communication, especially between older services and support tools.
GraphQL is the preferred interface for frontend and mobile clients. It allows precise data queries, reduces over-fetching, and aligns with Shopify’s philosophy of pushing complexity to the server.
However, as the number of services grows, this model starts to strain. Synchronous calls introduce tight coupling and hidden failure paths, especially when one service transitively depends on five others.
To address this, Shopify is actively exploring RPC standardization and service mesh architectures. The goal is to build a communication layer that’s:
Observably reliable
Easy to reason about
Standardized across all environments
The ML infrastructure at Shopify could be divided into two main parts:
Shopify’s storefront search doesn’t rely on traditional keyword matching. It uses semantic search powered by text and image embeddings: vector representations of product metadata and visual features that enable more relevant, contextual search results.
This system runs at production scale. Shopify processes around 2,500 embeddings per second, translating to over 216 million per day. These embeddings cover multiple modalities, including:
Product titles and descriptions (text)
Images and thumbnails (visual content)
Each embedding is generated in near real time and immediately published to downstream consumers that use them to update search indices and personalize results.
The embedding system also performs intelligent deduplication. For example, visually identical images are grouped to avoid unnecessary inference. This optimization alone reduced image embedding memory usage from 104 GB to under 40 GB, freeing up GPU resources and cutting costs across the pipeline.
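A simplified version of that idea, deduplicating byte-identical images by content hash so inference runs only once per unique input (Shopify's real pipeline likely goes further, for example with perceptual similarity), might look like this; the cache and model here are hypothetical:

import hashlib

embedding_cache: dict[str, list[float]] = {}

def embed_image(image_bytes: bytes, model) -> list[float]:
    key = hashlib.sha256(image_bytes).hexdigest()
    if key in embedding_cache:        # identical content: reuse the stored vector
        return embedding_cache[key]
    vector = model(image_bytes)       # GPU inference only for unseen content
    embedding_cache[key] = vector
    return vector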
Under the hood, Shopify runs its ML pipelines on Apache Beam, executed through Google Cloud Dataflow. This setup supports:
Streaming inference at scale.
GPU acceleration through custom ModelHandler components.
Efficient pipeline parallelism using optimized thread pools.
Inference jobs are structured to process embeddings as quickly and cheaply as possible. The pipeline uses a reduced number of concurrent threads (down from 192 to 64) to prevent memory contention, ensuring that inference performance remains predictable under load.
Shopify trades off between latency, throughput, and infrastructure cost. The current configuration strikes that balance carefully:
Embeddings are generated fast enough for near-real-time updates
GPU memory is used efficiently
Redundant computation is avoided through smart caching and pre-filtering
For offline analytics, Shopify stores embeddings in BigQuery, allowing large-scale querying, trend analysis, and model performance evaluation without affecting live systems.
This area can be divided into the following parts:
Shopify deploys infrastructure using Kubernetes, running on Google Kubernetes Engine (GKE). Each Shopify pod, an isolated unit containing its own MySQL, Redis, and Memcached stack, is defined declaratively through Kubernetes YAML, making it easy to replicate, scale, and isolate across regions.
The runtime environment uses Docker containers for packaging applications and OpenResty, built on Nginx with embedded Lua scripting, for custom load balancing at the edge. These Lua scripts give Shopify fine-grained control over HTTP behavior, enabling smart routing decisions and performance optimizations closer to the user.
Before Kubernetes, deployment was managed through Chef, a configuration management tool better suited for static environments. As the platform evolved, so did the need for a more dynamic, container-based architecture. The move to Kubernetes replaced slow, manual provisioning with fast, declarative infrastructure-as-code.
Shopify’s monolith contains over 400,000 unit tests, many of which exercise complex ORM behaviors. Running all of them serially would take hours, maybe days. To stay fast, Shopify relies on Buildkite as its CI orchestrator. Buildkite coordinates test runs across hundreds of parallel workers, slashing feedback time and keeping builds within a 15–20 minute window.
Once the build passes, Shopify's internal deployment tool, ShipIt, takes over and offers visibility into who's deploying what, and where.
Deployments don’t go straight to production. Instead, ShipIt uses a Merge Queue to control rollout. At peak hours, only 5–10 commits are merged and deployed at a time. This throttling makes issues easier to trace and minimizes the blast radius when something breaks.
Notably, Shopify doesn’t rely on staging environments or canary deploys. Instead, they use feature flags to control exposure and fast rollback mechanisms to undo bad changes quickly. If a feature misbehaves, it can be turned off without redeploying the code.
This area can be divided into multiple parts, such as:
Shopify takes a structured, service-aware approach to observability. At the center of this is ServicesDB, an internal service registry that tracks:
Service ownership and team accountability
Runtime logs and exception reports
Uptime and operational health
Gem versions and security patch status
Dependency graphs across applications
ServicesDB catalogs metadata and enforces good practices. When a service falls out of compliance (for example, due to outdated gems or missing logs), it automatically opens GitHub issues and tags the responsible team. This creates continuous pressure to maintain service quality across the board.
Incident response isn’t siloed into a single ops team. Shopify uses a lateral escalation model: all engineers share responsibility for uptime, and escalation happens based on domain expertise, not job title. This encourages shared ownership and reduces handoff delays during critical outages.
For fault tolerance, Shopify leans on two key tools:
Semian, a circuit breaker library for Ruby, helps protect core services like Redis and MySQL from cascading failures during degradation.
Toxiproxy lets engineers simulate bad network conditions (latency spikes, dropped packets, service flaps) before those issues appear in production. It’s used in test environments to validate resilience assumptions early.
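Semian is a Ruby library, so the following is only a minimal Python illustration of the circuit-breaker pattern it implements: after repeated failures, calls fail fast for a cool-down period instead of piling onto a degraded dependency. Thresholds and names are made up:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 10.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast instead of piling on")
            self.opened_at = None   # half-open: allow one trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # stop hammering the degraded resource
            raise
        self.failures = 0
        return result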
Security isn’t an afterthought in Shopify’s stack, but part of the ecosystem investment. Since the company relies heavily on Ruby, it also works actively to secure the Ruby community at large.
Key efforts include:
Ongoing contributions to Bundler and RubyGems, focusing on dependency integrity and package security.
A close partnership with Ruby Central, the non-profit that oversees Ruby infrastructure.
A $500,000 commitment to fund academic research and performance improvements in the Ruby ecosystem.
The goal isn’t just to secure Shopify’s stack, but to strengthen the foundation shared by thousands of developers who depend on the same tools.
Shopify's architecture isn’t theoretical. It’s built to withstand real-world pressure—Black Friday flash sales, celebrity product drops, and continuous developer activity across a global platform. These numbers put that scale in context.
$5 billion in Gross Merchandise Volume (GMV) processed on Black Friday.
284 million requests per minute at the edge during peak load.
173 billion total requests handled in a single 24-hour period.
12 terabytes of traffic egress per minute across Shopify’s edge network.
45 million database queries per second at peak read load.
7.6 million database writes per second during transactional bursts.
66 million Kafka messages per second, sustaining Shopify’s real-time event pipelines.
100,000+ unit tests executed in CI on every monolith build.
216 million embeddings processed per day through ML inference pipelines.
>99.9% crash-free session rate across React Native mobile apps.
2.8 million lines of Ruby code in the monolith, with over 500,000 commits in version control.
100+ isolated pods, each containing its own stack (MySQL, Redis, Memcached).
100+ internal Rails apps, maintained alongside the monolith using shared standards.
References:
Open-source tools that we use