Workspace-First Development: Why It Changes Everything

2026-02-12 04:39:31

The Problem Nobody Talks About

You're building a modern backend.

Not one API. Several.

  • An authentication service
  • A user management API
  • A payment gateway
  • A notification service
  • Maybe a few more

Each service is small. Each service is focused.

That's microservices done right, right?

Then why does it feel so painful?

The Hidden Tax of Multiple Services

Here's what actually happens:

Day 1: Create the first service

  • 30 minutes to set up the project
  • 1 hour to configure Docker and CI
  • 2 hours to wire up auth, database, logging
  • Finally start coding features

Day 3: Need a second service

  • Copy the first project as a template
  • Change names, ports, configs
  • Fix inconsistencies from the copy-paste
  • Realize you're already diverging from the first service

Week 2: Now you have 4 services

  • Each has a slightly different structure
  • Each has its own Python virtualenv (600MB each)
  • Each has its own dependencies (and version conflicts)
  • Each has its own way of doing logging, error handling, and config

Month 2: A new developer joins

  • "Which service should I look at as reference?"
  • "Why do these three services structure models differently?"
  • "Do we use Pydantic v1 or v2?"
  • "Which .env file is the source of truth?"

This Is What "Consistency at Scale" Really Means

It's not about code style.
It's not about linting rules.

It's about:

  • Structure: Every service follows the same architecture
  • Tooling: One environment, one CLI, one workflow
  • Dependencies: Shared Python environment across all services
  • Discovery: Every service is automatically tracked and accessible

This is what RapidKit Workspaces solve.

What Is a Workspace?

A workspace is a shared environment for multiple backend projects.

Think of it like a monorepo, but:

  • Each project is independent (no forced sharing)
  • Each project can use FastAPI or NestJS (or both)
  • One Python environment powers all Python projects
  • All projects are auto-tracked in a registry

In practice:

my-workspace/
├── .rapidkit-workspace        # Workspace marker
├── .venv/                     # Shared Python environment
├── poetry.lock                # Locked dependencies
├── auth-service/              # FastAPI project
├── user-api/                  # FastAPI project
├── payment-gateway/           # NestJS project
└── notification-service/      # FastAPI project

One command to create it:

npx rapidkit my-workspace
cd my-workspace

One command to add projects:

rapidkit create project fastapi.standard auth-service
rapidkit create project nestjs.standard payment-gateway

Result:

  • All FastAPI projects share the same .venv (~150MB once, not 600MB four times)
  • All projects are tracked in ~/.rapidkit/workspaces.json
  • VS Code extension auto-discovers everything
  • RapidKit Core version is consistent across all projects

Why This Matters: Real Numbers

Scenario: You're building 5 microservices for a SaaS app.

Without Workspaces (Traditional Approach)

  • 5 separate virtualenvs: ~750MB of duplicated Python packages
  • 5 independent setups: ~30 minutes × 5 = 2.5 hours
  • 5 different dependency versions: Debugging nightmare
  • No central tracking: Manual documentation required
  • Onboarding time: 3-4 days for a new developer

With Workspaces (RapidKit Approach)

  • 1 shared virtualenv: ~150MB total
  • 1 workspace setup: ~2 minutes
  • Consistent dependencies: One poetry.lock for all Python projects
  • Auto-tracked registry: VS Code shows all projects instantly
  • Onboarding time: 1 day (open workspace, see everything)

Savings:

  • 600MB disk space
  • 2+ hours of setup time
  • Countless hours of debugging version conflicts
  • 2-3 days of onboarding per developer

How It Works: The Magic of Shared Environments

When you create a workspace, RapidKit:

  1. Creates a Poetry-managed Python environment
   .venv/
   ├── bin/
   │   ├── python3
   │   ├── rapidkit   # CLI available globally in workspace
   │   └── ...
   ├── lib/
   │   └── python3.11/site-packages/
   │       └── rapidkit_core/   # Shared engine
  2. Installs RapidKit Core once
   poetry add rapidkit-core
  3. Provides activation scripts
   source "$(poetry env info --path)/bin/activate"
  4. Tracks everything in a registry
   // ~/.rapidkit/workspaces.json
   {
     "workspaces": [
       {
         "name": "my-workspace",
         "path": "/home/user/my-workspace",
         "projects": [
           {"name": "auth-service", "framework": "fastapi"},
           {"name": "user-api", "framework": "fastapi"},
           {"name": "payment-gateway", "framework": "nestjs"}
         ]
       }
     ]
   }

Every tool reads this registry:

  • VS Code extension shows your workspaces in the sidebar
  • CLI knows which projects exist
  • Commands auto-complete project names
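
As a rough illustration, a small script can read that registry with nothing more than the standard library. This is a sketch based on the JSON shown above; treat the exact field names as assumptions rather than a documented schema.

import json
from pathlib import Path

# Load the registry RapidKit maintains in the home directory
registry = json.loads((Path.home() / ".rapidkit" / "workspaces.json").read_text())

for workspace in registry.get("workspaces", []):
    print(f"{workspace['name']}  ->  {workspace['path']}")
    for project in workspace.get("projects", []):
        print(f"  - {project['name']} ({project['framework']})")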

Under the Hood: How Workspace Environments Work

Understanding workspace mechanics helps you make the right choice.

The Isolation Model

When you run npx rapidkit my-workspace:

my-workspace/
├── .venv/                    # Workspace virtualenv (Poetry-managed)
│   ├── bin/
│   │   ├── python3          # Python interpreter
│   │   ├── poetry           # Poetry CLI
│   │   └── rapidkit         # RapidKit Core CLI
│   └── lib/
│       └── python3.10/site-packages/
│           └── rapidkit_core/   # Core engine installed here
├── pyproject.toml            # Workspace Poetry config
├── poetry.lock               # Locked Core version
└── README.md

# After creating projects:
my-workspace/
├── .venv/                    # Workspace env (shared Core)
├── auth-service/
│   └── .venv/               # Project env (project deps only)
├── user-service/
│   └── .venv/               # Project env (project deps only)
└── payment-service/
    └── .venv/               # Project env (project deps only)

Key Benefits of This Architecture

1. Command Resolution:

# Inside workspace (any subdirectory)
$ rapidkit create project fastapi.standard api
# ↳ Uses workspace .venv/bin/rapidkit (shared Core)

# Outside workspace
$ npx rapidkit create project fastapi.standard api
# ↳ Downloads and uses temporary npm bridge

2. Disk Efficiency:

  • Workspace .venv/: ~150MB (rapidkit-core + Poetry deps)
  • Each project .venv/: ~100MB (FastAPI, SQLAlchemy, etc.)
  • Total for 4 projects: 150MB + (4 × 100MB) = 550MB

Without workspace:

  • Each project: ~150MB (rapidkit-core) + 100MB (deps) = 250MB
  • Total for 4 projects: 4 × 250MB = 1000MB
  • Savings: 450MB (45%)

3. Version Consistency:

# All projects use same Core version
$ cd auth-service && rapidkit --version
RapidKit Core 0.3.0

$ cd ../user-service && rapidkit --version  
RapidKit Core 0.3.0  # Same version!

4. Upgrade Scenarios:

# Upgrade Core for entire workspace
cd my-workspace
poetry update rapidkit-core
# ↳ All projects instantly use new version

# Upgrade individual project dependencies
cd auth-service
poetry update fastapi
# ↳ Only affects this project

Global Install Alternative

Instead of workspace, you can install Core globally:

# With pipx (recommended - isolated)
pipx install rapidkit-core

# Or with pip (system-wide)
pip install rapidkit-core

# Now rapidkit is available everywhere
$ rapidkit --version
RapidKit Core 0.3.0

$ rapidkit create project fastapi.standard api
# ↳ Uses global installation

Trade-offs:

  • ✅ No workspace needed for single projects
  • ✅ rapidkit available system-wide
  • ⚠️ All projects use same global Core version
  • ⚠️ Upgrading affects all projects simultaneously
  • ⚠️ No per-workspace version pinning

Three Ways to Start: Which One Should You Choose?

RapidKit gives you flexibility. Here's how each approach compares:

1️⃣ Workspace Creation (Recommended)

npx rapidkit my-workspace
cd my-workspace
rapidkit create project fastapi.standard api
cd api
rapidkit init && rapidkit dev

Best for:

  • Multiple services (microservices)
  • Team environments
  • Production projects
  • Long-term maintenance

Advantages:

  • ✅ Prerequisites auto-installed (Python, Poetry, RapidKit Core)
  • ✅ Shared rapidkit-core in workspace virtualenv
  • ✅ Each project gets its own .venv/ but shares Core
  • ✅ rapidkit command uses workspace environment (faster)
  • ✅ VS Code extension auto-discovers workspace
  • ✅ Consistent Core version across all projects

Setup time: 2 minutes (first time), instant for additional projects

Disk: Workspace .venv/ ~150MB (once) + each project .venv/ ~100MB

Complexity: Low (everything automated)

2️⃣ Direct Project Creation

# Outside any workspace
npx rapidkit create project fastapi.standard api
cd api
npx rapidkit init
npx rapidkit dev

Best for:

  • Single service
  • Quick prototypes
  • Learning RapidKit
  • Temporary experiments

Trade-offs:

  • ⚠️ Requires Python 3.10+ and Poetry pre-installed
  • ⚠️ Each project gets its own virtualenv (~150MB each)
  • ⚠️ No automatic dependency synchronization
  • ✅ Faster initial setup (no workspace overhead)
  • ✅ Good for experimentation

Setup time: 5 minutes (if prerequisites exist)

Disk per project: ~150MB (separate virtualenv)

Complexity: Medium (manual prerequisite management)

3️⃣ Python Core Direct

# Install globally
pipx install rapidkit-core
# Or: pip install rapidkit-core

# Create project
rapidkit create project fastapi.standard api
cd api
rapidkit init
rapidkit dev

Best for:

  • Advanced users
  • CI/CD pipelines
  • Custom tooling
  • Docker environments

Trade-offs:

  • ⚠️ Full manual Python environment management
  • ⚠️ No workspace features
  • ⚠️ No npm bridge benefits
  • ✅ Maximum control over environment
  • ✅ Good for automation scripts
  • ✅ Minimal dependencies (Python only)

Setup time: Varies (depends on your environment)

Disk per project: ~150MB (separate virtualenv)

Complexity: High (manual everything)

Visual Approach: VS Code Extension

All three approaches integrate with the VS Code extension:

  1. Install RapidKit extension
  2. Use Command Palette:
    • "RapidKit: Create Workspace" (Approach 1)
    • "RapidKit: Create Project" (Approach 2)
  3. Browse modules visually
  4. Check setup status

Perfect for:

  • Visual learners
  • GUI preference
  • Team onboarding
  • Non-terminal workflows

Comparison Matrix

Feature               | Workspace    | Direct      | Python Core | VS Code
----------------------|--------------|-------------|-------------|---------------
Auto Prerequisites    | ✅ Yes       | ⚠️ Partial  | ❌ No       | ✅ Yes
Shared Virtualenv     | ✅ Yes       | ❌ No       | ❌ No       | ✅ (workspace)
Multi-Project Support | ✅ Excellent | ⚠️ Manual   | ⚠️ Manual   | ✅ Excellent
Disk Efficiency       | ✅ High      | ⚠️ Medium   | ⚠️ Medium   | ✅ High
Setup Complexity      | 🟢 Low       | 🟡 Medium   | 🔴 High     | 🟢 Low
Team-Friendly         | ✅ Yes       | ⚠️ Moderate | ❌ No       | ✅ Yes
CI/CD Ready           | ✅ Yes       | ✅ Yes      | ✅ Yes      | ⚠️ GUI only
Learning Curve        | 🟢 Easy      | 🟡 Medium   | 🔴 Steep    | 🟢 Easiest

Real Scenario: Building a SaaS Backend

Task: Create 4 microservices (auth, users, payments, notifications)

Approach 1: Workspace

npx rapidkit my-saas
cd my-saas
rapidkit create project fastapi.standard auth-service
rapidkit create project fastapi.standard user-service
rapidkit create project fastapi.standard payment-service
rapidkit create project fastapi.standard notification-service

Result:

  • Workspace .venv/ with rapidkit-core (~150MB)
  • Each project has own .venv/ (~100MB each = 400MB total)
  • Total: ~550MB (vs 600MB+ without workspace)
  • Setup time: 5 minutes
  • All services use same RapidKit Core version
  • rapidkit command (no npx) uses workspace environment

Approach 2: Direct (Individual Projects)

npx rapidkit create project fastapi.standard auth-service
cd auth-service && npx rapidkit init && cd ..

npx rapidkit create project fastapi.standard user-service
cd user-service && npx rapidkit init && cd ..

npx rapidkit create project fastapi.standard payment-service
cd payment-service && npx rapidkit init && cd ..

npx rapidkit create project fastapi.standard notification-service
cd notification-service && npx rapidkit init

Result:

  • Four separate virtualenvs (~600MB total)
  • Each has its own rapidkit-core installation
  • Setup time: 20+ minutes
  • Potential Core version conflicts between projects
  • Must use npx rapidkit for every command

Approach 3: Python Core

pip install rapidkit-core
rapidkit create project fastapi.standard auth-service
cd auth-service && poetry install && cd ..

rapidkit create project fastapi.standard user-service
cd user-service && poetry install && cd ..
# ...repeat for remaining services

Result:

  • Four virtualenvs (~600MB total)
  • Each project manages its own environment
  • Setup time: 25+ minutes
  • Full manual control (good for experts)
  • Requires global rapidkit-core installation
  • No npm bridge benefits (no npx workflow)

Our Recommendation: Start with Workspace

For 90% of use cases, workspace approach wins:

  1. Handles prerequisites automatically
  2. Scales effortlessly (add projects instantly)
  3. Team-friendly (consistent environments)
  4. Disk-efficient (shared virtualenv)
  5. Production-ready (used by real companies)

Switch to Direct/Core when:

  • Single prototype project
  • Experimenting with RapidKit
  • Custom CI/CD requirements
  • Need maximum control


Rule of thumb:

  • Planning to build 2+ related projects? → Workspace
  • Just trying RapidKit? → Standalone
  • Not sure? → Workspace (you can always ignore it)

Real-World Example: SaaS Backend

Let's build a complete SaaS backend in a workspace.

Step 1: Create workspace

npx rapidkit saas-backend
cd saas-backend

Step 2: Create core services

# Authentication & user management
rapidkit create project fastapi.ddd auth-service
rapidkit add module auth --project auth-service
rapidkit add module users.core --project auth-service

# Main API gateway
rapidkit create project nestjs.standard api-gateway

# Background jobs service
rapidkit create project fastapi.standard jobs-service
rapidkit add module celery --project jobs-service

# Notification service
rapidkit create project fastapi.standard notifications
rapidkit add module email --project notifications
rapidkit add module notifications.unified --project notifications

Step 3: Open in VS Code

code .

Result:

VS Code sidebar now shows:

📁 SaaS Backend (Workspace)
 ├── 🐍 auth-service (FastAPI)
 ├── 🟦 api-gateway (NestJS)
 ├── 🐍 jobs-service (FastAPI)
 └── 🐍 notifications (FastAPI)

One click to:

  • Start any service's dev server
  • Run tests across all services
  • Install modules visually
  • Check workspace health

Developer Experience Benefits

1. Instant Discoverability

New developer joins your team:

git clone company/backend-workspace
cd backend-workspace
code .

VS Code opens. They see:

  • All 8 microservices in the sidebar
  • Project structure is identical across services
  • Modules are documented in each project's README
  • Docker Compose is already configured

No documentation needed.

2. Consistent Commands

Every project, same commands:

cd auth-service
rapidkit dev      # Start dev server

cd ../api-gateway
rapidkit dev      # Start dev server

cd ../jobs-service
rapidkit test     # Run tests

No learning curve between services.

3. Shared Module Updates

You add a new logging module to one service:

cd auth-service
rapidkit add module logging

RapidKit Core is shared across the workspace.

Now every developer on every project can use the same module:

cd ../api-gateway
rapidkit add module logging   # Same version, instant install

No "which version did we use?" questions.

Advanced: Workspace Strategies

Strategy 1: Monolithic Workspace

All company projects in one workspace:

company-backend/
├── auth-service/
├── user-api/
├── payment-api/
├── notification-service/
├── analytics-api/
└── admin-api/

Pros: Maximum consistency
Cons: Large workspace

Strategy 2: Domain-Separated Workspaces

One workspace per domain:

company-auth/
├── auth-service/
└── user-api/

company-payments/
├── payment-api/
└── billing-service/

company-operations/
├── notification-service/
└── analytics-api/

Pros: Smaller workspaces, clear boundaries
Cons: Requires more setup

Strategy 3: Hybrid (Recommended)

Core services in a workspace, utilities standalone:

core-platform/       # Workspace
├── auth-service/
├── user-api/
└── payment-api/

tools/
├── migration-script/    # Standalone
└── data-import/         # Standalone

Pros: Best of both worlds
Cons: Requires discipline

Workspace Anti-Patterns to Avoid

❌ Anti-Pattern 1: Mixing Unrelated Projects

Don't put your company backend and personal blog in the same workspace.

Workspaces are for related projects, not all projects.

❌ Anti-Pattern 2: Never Using Standalone

Workspaces are powerful, but not always necessary.

Quick prototype? Use standalone:

npx rapidkit create project fastapi.standard quick-api

❌ Anti-Pattern 3: Ignoring the Registry

RapidKit auto-tracks workspaces in ~/.rapidkit/workspaces.json.

Don't manually edit this file — let RapidKit manage it.

The Hidden Superpower: Team Onboarding

This is where workspaces really shine.

Traditional onboarding:

  1. Clone repo
  2. Read 10-page setup guide
  3. Install Python 3.11 (or was it 3.10?)
  4. Create virtualenv
  5. Install dependencies (pray no conflicts)
  6. Set up .env files (which template?)
  7. Install Docker
  8. Configure IDE
  9. Finally run the project (if lucky)

Time: 4-6 hours (or 2 days if unlucky)

Workspace onboarding:

  1. Clone workspace repo
  2. Run ./bootstrap.sh (installs Poetry, activates env)
  3. Open VS Code: code .

Time: 5 minutes

Everything else is automatic:

  • Python environment is ready
  • Dependencies are locked
  • All projects are visible
  • Structure is consistent
  • Commands work immediately

Try It Yourself

Create a workspace and see the difference:

# Create workspace
npx rapidkit demo-workspace
cd demo-workspace

# Add two projects
rapidkit create project fastapi.standard api-one
rapidkit create project nestjs.standard api-two

# Open in VS Code
code .

Now try:

  • Browse projects in the sidebar
  • Check workspace health (pulse icon)
  • Install a module to one project
  • See how everything is tracked

What's Next?

In the next article, we'll build a production API from scratch in 5 minutes — no shortcuts, no hand-waving.

You'll see exactly what you get, why the structure matters, and how to deploy it.


Choosing Between Array and Map in JavaScript

2026-02-12 04:36:52

It is often the case that when writing code, we don't spend much time thinking about the performance implications of seemingly trivial operations. We reach for familiar patterns, like using arrays to store collections of items, without considering whether they're the best tool for the job.

The Hidden Cost of Array Lookups

Let's consider a scenario in which we have a list of users and our task is to identify a specific user by their ID, update their data, or remove them from the list.

const users = [
  { id: '1', name: 'John Smith', age: 28 },
  { id: '2', name: 'Jane Doe', age: 34 },
  { id: '3', name: 'Bob Johnson', age: 42 }
];

The first thought that often comes to mind is to use methods available on arrays such as find(), filter(), or splice(). All of those work, but they come with some cost.

// Looking up a user
const user = users.find(u => u.id === '2');

// Removing a user from the list
const updatedUsers = users.filter(u => u.id !== '2');

Why This Hurts Performance

Every time you use the find() method, JavaScript has to iterate through the elements one by one until it finds a match. For small arrays this is unnoticeable, but as your data grows, these O(n) operations start to add up, especially when performed repeatedly.

The O(1) Fix: Map

This is where Map comes in handy, designed specifically for key-based operations.

// Transform array to map with ID as a key
const usersMap = new Map(users.map(u => [u.id, u]));

console.log(usersMap); 
// Map(3) {
//   '1' => { id: '1', name: 'John Smith', age: 28 },
//   '2' => { id: '2', name: 'Jane Doe', age: 34 },
//   '3' => { id: '3', name: 'Bob Johnson', age: 42 }
// }

Looking up, adding, or removing items by key happens in O(1) constant time, regardless of how many items you have stored.

// Verifying user existence
const userExists = usersMap.has('2');

// Getting user by ID
const user = usersMap.get('2');

// Updating user data
usersMap.set('2', { id: '2', name: 'Jane Doe Updated', age: 35 });

// Removing user
usersMap.delete('1');

When Arrays Still Make Sense

That said, arrays aren't obsolete. In many scenarios, especially when order matters, or when you’re performing operations like mapping, filtering, or sorting, arrays are still the right choice. Maps shine when your primary interaction is lookup by key.

For example:

  • If you just need to list items, filter them, or render them in UI order, an array is simpler.
  • If you often search by ID, toggle specific items, or sync updates efficiently, Map will pay off in performance and clarity.

Data-driven decision making using Power BI.

2026-02-12 04:22:27

If you are here, you are probably a data scientist or data engineer, looking for a powerful, easy-to-use business intelligence tool for analysis and visualisation, or a business owner looking to understand your business data and make insightful business decisions.

What are its advantages? How does it give you a competitive edge as a professional in the Big Data industry?

Well, let me tell you everything I know about this powerful tool!

Power BI Architecture:

Flow of data from a messy raw data flat table to a one-page interactive dashboard.

Power BI comes in four components that let you use it locally or as a service in the cloud:
Power BI Desktop, Power BI Service (cloud), Power BI Mobile, and Report Server.

1. Data Sources:

Power BI is very popular because of its ability to pull data from numerous sources, making it compatible with most systems that store raw data. These sources include, but are not limited to, Excel, SQL Server, and web applications.

2. Data Preparation & Transformation:

Data transformation is a very important step in data analysis. It is impossible to model or make sense of data that is full of formatting errors, blanks, and duplicates.

Power Query

  • Power Query is a built-in tool in Power BI that uses the ETL (Extract, Transform, Load) process to clean and transform large data sets before loading them for analysis.

This process entails removing duplicates, changing incorrect data types, unifying blanks or null values, and trimming extra text characters before loading the data for analysis and visualisation.

3. Data Modelling & Analysis:

  • Relationships, Joins and Schemas

Relationships are how Power BI connects tables so that data can flow correctly between them.

Power BI allows you to easily create and manage relationships between fact and dimension tables, and to arrange cleaned data in schemas that structure it for analysis, easy updates, and retrieval.

Joins are used in Power BI to physically combine data from two tables into a single table. They are performed in Power Query during the data preparation stage, before the data is loaded into the Power BI model.

Designing a clear and well-structured model using fact tables, dimension tables, and an appropriate schema is essential for effective and scalable analysis.

  • DAX (Data Analysis Expressions) is a formula language designed specifically for analytical and business intelligence calculations, such as totals, averages, percentages, rankings, and comparisons.

Dax Function

DAX is used to build measures, calculated columns, and calculated tables that help transform raw data into meaningful insights.

4. Visualisation and Reporting:

Sample Dashboard

Power BI's visualisation and filter panes allow you to easily create captivating visuals such as charts, graphs, reports, and interactive dashboards.

A dashboard is a one-page interactive interface that displays the key insights a business needs to make informed decisions: a single-screen visual summary of the most important metrics and trends needed to monitor performance.

A dashboard is not just a collection of random charts and visuals, but a carefully thought-out display of the information that answers key business questions. It should be precise and accurate, and give stakeholders the ability to filter data using filters and slicers.

There, now you know everything I know!

Well, almost everything :) .....

Signed
Jules.

5 Redis Patterns Every Developer Should Know

2026-02-12 04:21:15

Redis is more than just a cache - it's a powerful data structure server. Here are 5 patterns that will level up your Redis game.

1. Rate Limiting with Sliding Window

Perfect for API rate limiting:

import time

import redis

# Shared client assumed by all snippets in this post; rebinding the name keeps
# the examples below unchanged.
redis = redis.Redis()

def is_rate_limited(user_id: str, limit: int = 100, window: int = 60) -> bool:
    key = f"rate:{user_id}"
    now = time.time()

    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, 0, now - window)
    pipe.zadd(key, {str(now): now})
    pipe.zcard(key)
    pipe.expire(key, window)

    _, _, count, _ = pipe.execute()
    return count > limit

2. Distributed Locks

Prevent race conditions across services:

def acquire_lock(name: str, timeout: int = 10) -> bool:
    return redis.set(f"lock:{name}", "1", nx=True, ex=timeout)

def release_lock(name: str):
    redis.delete(f"lock:{name}")
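
A minimal usage sketch built on the helpers above (the resource name is made up). Note that a production-grade release would usually store a unique token and verify it before deleting, so one worker cannot release another worker's lock:

def refresh_report(report_id: str):
    lock_name = f"report:{report_id}"          # hypothetical resource name
    if not acquire_lock(lock_name, timeout=10):
        return  # another worker holds the lock; skip or retry later
    try:
        pass    # do the critical-section work here
    finally:
        release_lock(lock_name)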

3. Pub/Sub for Real-time Events

Great for notifications and live updates:

# Publisher
redis.publish("events", json.dumps({"type": "new_message", "data": {...}}))

# Subscriber
pubsub = redis.pubsub()
pubsub.subscribe("events")
for message in pubsub.listen():
    if message["type"] == "message":  # skip the initial subscribe confirmation
        handle_event(message)

4. Leaderboards with Sorted Sets

Perfect for gaming and ranking:

# Add score
redis.zadd("leaderboard", {"player1": 1500, "player2": 1200})

# Get top 10
redis.zrevrange("leaderboard", 0, 9, withscores=True)
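
If you also need a single player's position, sorted sets cover that too. zrevrank and zscore are standard Redis commands, shown here with the same implied client as the snippets above:

# 0-based rank with the highest score first, plus the player's current score
rank = redis.zrevrank("leaderboard", "player1")
score = redis.zscore("leaderboard", "player1")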

5. Session Storage

Fast session management:

def save_session(session_id: str, data: dict, ttl: int = 3600):
    redis.setex(f"session:{session_id}", ttl, json.dumps(data))
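
A matching read helper is not part of the original snippet, but a minimal sketch could look like this (returns None when the session has expired or never existed):

def get_session(session_id: str) -> dict | None:
    raw = redis.get(f"session:{session_id}")
    return json.loads(raw) if raw else None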

Which pattern will you try first? Let me know in the comments!

Optimizing the MongoDB Java Driver: How minor optimizations led to macro gains

2026-02-12 04:21:08

This tutorial was written by Slab Babanin & Nasir Qureshi.

Donald Knuth, widely recognized as the ‘father of the analysis of algorithms,’ warned against premature optimization—spending effort on code that appears inefficient but is not on the critical path. He observed that programmers often focus on the wrong 97% of the codebase. Real performance gains come from identifying and optimizing the critical 3%. But, how can you identify the critical 3%? Well, that’s where the philosophy of ‘never guess, always measure’ comes in.

In this blog, we share how the Java developer experience team optimized the MongoDB Java Driver by strictly adhering to this principle. We discovered that performance issues were rarely where we thought they were. This post explains how we achieved throughput improvements ranging from 20% to over 90% in specific workloads. We'll cover specific techniques, including using SWAR (SIMD Within A Register) for null-terminator detection, caching BSON array indexes, and eliminating redundant invariant checks.

These are the lessons we learned turning micro-optimizations into macro-gains. Our findings might surprise you — they certainly surprised us — so we encourage you to read until the end. 

Getting the metrics right

Development teams often assume they know where bottlenecks are, but intuition is rarely dependable. During this exercise, the MongoDB Java team discovered that performance problems were often not where the team expected them to be.

Donald Knuth emphasizes this concept in his paper on Premature Optimization:

Programmers waste enormous amounts of time thinking about the speed of noncritical parts of their programs, and these attempts to improve efficiency have a strong negative impact on debugging and maintenance. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

To avoid ‘premature optimization’—that is, improving code that appears slow but isn't on the critical path—we follow a strict rule: never guess, always measure.

We applied the Pareto principle (also known as the 80/20 rule) to target the specific code paths responsible for the majority of execution time. For this analysis, we used async-profiler. Its low-overhead, sampling-based approach allowed us to capture actionable CPU and memory profiles with negligible performance impact.

How we measured performance

We standardized performance tests based on throughput (MB/s), simplifying comparisons across all scenarios. Our methodology focused on minimizing the influence of external variables and ensuring practical relevance.

Testing Methodology:

  • Tested representative workloads: Testing focused on representative driver operations (for example, small, large, and bulk document inserts) executed via the MongoClient API, not isolated method microbenchmarks.

  • Isolated the testing environment: We conducted performance tests across multiple isolated cloud machines to minimize external variability and prevent performance degradation from resource contention (i.e., 'noisy neighbors'). Each test was run multiple times on each machine, and the median throughput score was used as the final result for that machine.

  • Statistical verification: Next, we aggregated the median results and directly compared the optimized branch mean throughput with the stable mainline mean. We verified statistical significance through percentage improvement and z-score analysis.

From this exercise, we realized that what truly mattered was that the performance improvements appear at the MongoClient API level. Internal microbenchmarks may show significant gains, but since users interact solely through the API, any gains that do not manifest at the API level will not translate into noticeable improvements in application performance.

Refer to the Performance Benchmarking drivers specification for a more detailed description of these tests.

Below, we will explain the six techniques we used to optimize the MongoDB Java driver, while staying true to our guiding principle: ‘never guess, always measure’.

1. Caching BSON array indexes

In BSON, array indexes are not merely integers; they are encoded as null-terminated strings, or CStrings. For example, the index 0 becomes the UTF-8 byte sequence for '0' (U+0030) followed by the UTF-8 byte sequence for NULL (U+0000). Encoding an array involves converting each numeric index to a string, encoding it into UTF-8 bytes, and then appending a null terminator:

for (int i = 0; i < arraySize; i++) {
    encodeCString(Integer.toString(i));
}

Calling toString() and encoding the result for every index was clearly suboptimal, because it repeats the same conversion work for the same indexes over and over again: each call rebuilds the String representation of i and then turns that String into a byte[]. This involves unnecessary copying, even though the result remains the same.

Our first improvement was to precompute and cache these immutable byte arrays for reuse in tight loops.

private static final byte[][] PRE_ENCODED_INDICES = new byte[1000][];

static {
    for (int i = 0; i < 1000; i++) {
        PRE_ENCODED_INDICES[i] = (Integer.toString(i) + '\u0000').getBytes(StandardCharsets.UTF_8);
    }
}

for (int i = 0; i < arraySize; i++) {
    if (i < PRE_ENCODED_INDICES.length) {
        buffer.put(PRE_ENCODED_INDICES[i]);
    } else {
        encodeCString(Integer.toString(i));
    }
}

This caching step was already effective. We also changed the cache layout from a jagged byte[][] to a flat byte[] representation.

private static final byte[] PRE_ENCODED_INDICES;

The flat byte[] layout remains the preferred choice because it uses fewer heap objects and scales more efficiently as the cache grows, thanks to spatial locality. Our benchmarks showed no significant throughput difference compared to jagged byte[][] structures for smaller caches; this parity stems from HotSpot's Thread-Local Allocation Buffer (TLAB) allocator, which places small rows close together in memory. Even with garbage collection (GC) settings that forced frequent promotion into the old generation, we often observed the same effect. Because that behaviour is JVM- and GC-specific rather than guaranteed, we use the flat array as the more robust solution.

To quantify the impact of caching itself, we adapted the "Small Doc insertOne" workload from our performance specification for array-heavy documents. Instead of the original shape, each document contained A arrays of length B (that is, 100×100 and 100×1000), so the total number of encoded array indexes per document was A × B. The larger the arrays per document, the bigger the difference, because array encoding accounts for a larger fraction of the "insertOne" operation.

Figure 1. The figure below shows the before-and-after results of performance tests on 100x100 and 100x1000 array documents. The larger arrays saw the greatest improvement in performance. 


2. Java Virtual Machine (JVM) intrinsics

As Java developers, we write code with many abstractions, which makes the code easier to maintain and understand; however, these abstractions can also cause significant performance issues. What is easy for humans to read isn’t always what the machine prefers to execute, which is why having mechanical sympathy may be beneficial.

For example, in BSON, numbers must be encoded with little-endian order. In our original code, when encoding an int to BSON, we wrote each byte to a ByteBuffer separately, doing manual shifting and masking to produce little-endian byte order.

 

write(value >> 0);
write(value >> 8);
write(value >> 16);
write(value >> 24);

However, this approach wasn't efficient. It required individual bounds checks and manual byte shuffling for every byte written, which showed up as a hotspot in profiles. We chose to adopt ByteBuffer's methods, such as putInt, putLong, and putDouble. A single putInt call collapses four separate byte writes into one operation that handles byte order automatically.

Under the hood, modern JITs (e.g., in HotSpot) can compile these methods using intrinsics—such as Unsafe.putInt and Integer.reverseBytes—often mapping them to efficient machine instructions. For more context, see Intrinsic functions.

The JVM can replace these helper methods with a very small sequence of machine instructions, sometimes even a single instruction. For example, on x86, the Integer.reverseBytes(int i) method may use the BSWAP instruction; on ARM, the JVM might use the REV instruction.

Bonus tip: Repeated invariant checks in the hot path are computationally expensive. In the original code of the BsonOutput serializer, every single-byte write also re-checks invariants, such as “is this BsonOutput still open?” If you’ve already validated invariants elsewhere, you can safely skip repeated invariant checks inside the loop.

After implementing our changes and testing them, we realized that a simple change affecting less than 0.03% of the codebase resulted in a nearly 39% improvement in throughput for large document bulk inserts. 

Figure 2. The figure below shows throughput improvements for each insert command. ‘Large Doc Bulk Insert’ saw the most significant gain because processing larger payloads maximizes the impact of optimizing the hottest execution paths.


3. Check and check again

As mentioned earlier, size checks on ByteBuffer in the critical path are expensive. However, we also performed similar checks on invariants in many other places. When encoding BSON, we retrieved the current buffer from a list by index on every write:

ByteBuffer currentBuffer = bufferList.get(currentBufferIndex);

//other code
currentBuffer.put(value);

This get() call is fast, but performing it many times adds up—especially since each call includes range checks and method indirection (the call path is deep enough that the JVM might not always inline it).

Aha moment: If the same buffer will be used for thousands of operations before switching, why should we keep re-checking?

By caching the buffer in a field and updating it only when switching buffers, we eliminated at least three redundant range checks. Here is how:

private ByteBuffer currentByteBuffer;  // Only updated when switching buffers

currentByteBuffer.put(value);

This minor change led to a 16% increase in throughput for bulk inserts. This wasn’t the only area where redundant checks could be eliminated; when we tested removing similar invariant checks from other abstractions, we observed an additional 15% improvement in throughput for bulk inserts.

The lesson: Remove unnecessary checks from the hottest paths. Because these checks run so frequently, they quickly become bottlenecks that drag down performance. 

4. BSON null terminator detection with SWAR

Every BSON document is structured as a list of triplets: a type byte, a field name, and a value. Crucially, each field name is a null-terminated string—CString—not a length-prefixed string. While this design saves four bytes per field, it introduces a performance trade-off: extracting a CString now requires a linear scan rather than a constant-time lookup.

Our original implementation processed the buffer byte-by-byte, searching for the terminating zero:

boolean checkNext = true;

while (checkNext) {
    if (!buffer.hasRemaining()) {
        throw new BsonSerializationException("Found a BSON string that is not null-terminated");
    }
    checkNext = buffer.get() != 0;
}

The primary issue with this approach is that it requires more comparisons for the same amount of work. For large documents, the process calls buffer.get() billions of times. Processing each byte individually requires a load, check, and conditional jump each time, which rapidly increases the total instruction count.

To improve performance, we used a classic optimization technique: SWAR (SIMD Within A Register Vectorization). Instead of checking each byte separately, SWAR lets us examine eight bytes simultaneously with a single 64-bit load and some bitwise operations. Here’s what that looks like:

long chunk = buffer.getLong(pos);
long mask = chunk - 0x0101010101010101L;
mask &= ~chunk;
mask &= 0x8080808080808080L;
if (mask != 0) {
    int offset = Long.numberOfTrailingZeros(mask) >>> 3;
    return (pos - prevPos) + offset + 1;
}

These 'magic numbers' aren’t arbitrary: 0x0101010101010101L repeats the byte 1, while 0x8080808080808080L repeats the byte 128. By subtracting 1 from each byte, ANDing with the inverted value, and applying the high-bit mask, you can instantly detect if a zero exists. Then, simply counting the trailing zeros allows you to calculate the precise byte offset. This method is highly effective with CPU pipelining.

Let’s take a simple example. Suppose we use an int (4 bytes) for simplicity. The value: 0x7a000aFF contains a zero byte. We will demonstrate how the SWAR technique detects it:

Step                 | Value (Hex) | Value (Binary, per byte)
---------------------|-------------|-------------------------------------
chunk                | 0x7a000aFF  | 01111010 00000000 00001010 11111111
ones                 | 0x01010101  | 00000001 00000001 00000001 00000001
mask (high bit mask) | 0x80808080  | 10000000 10000000 10000000 10000000

Subtraction:

chunk      = 01111010 00000000 00001010 11111111
- ones     = 00000001 00000001 00000001 00000001
------------------------------------------------
             01111000 11111111 00001001 11111110
                          ↑
                  underflow (0 - 1 = FF)

Bitwise AND with inverted chunk:

prevResult = 01111000 11111111 00001001 11111110
& ~chunk   = 10000101 11111111 11110101 00000000
------------------------------------------------
             00000000 11111111 00000001 00000000

Bitwise AND with mask (high bits):

prevResult = 00000000 11111111 00000001 00000000
& mask     = 10000000 10000000 10000000 10000000
------------------------------------------------
             00000000 10000000 00000000 00000000
The final result:

  • The result has a high bit set (10000000) in Byte 2, showing there’s a zero byte at that position.

  • After isolating one byte, we can use Integer.numberOfTrailingZeros(mask) >>> 3 to get the offset of the zero byte. This method is typically a JVM intrinsic, compiling down to a single efficient instruction.
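
If you want to sanity-check the walkthrough yourself, the same trick can be reproduced in a few lines of standalone Python (purely illustrative; the driver's real implementation is the Java shown above):

MASK64 = 0xFFFFFFFFFFFFFFFF
ONES = 0x0101010101010101
HIGHS = 0x8080808080808080

def first_zero_byte(chunk: int) -> int:
    """Offset of the first zero byte (0 = least significant byte), or -1 if none."""
    m = ((chunk - ONES) & MASK64) & (~chunk & MASK64) & HIGHS
    if m == 0:
        return -1
    trailing_zeros = (m & -m).bit_length() - 1  # index of the lowest set bit
    return trailing_zeros >> 3                  # divide by 8 to get the byte offset

# The 32-bit value from the walkthrough, zero-extended to 64 bits:
print(first_zero_byte(0x7A000AFF))  # -> 2: the zero byte sits two bytes above the LSB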

Because the loop body now consists of a small, predictable set of arithmetic instructions, it integrates seamlessly with modern CPU pipelines. The efficiency of SWAR stems from its reduced instruction count, the absence of per-byte branches, and one memory load for every eight bytes.

5. Avoiding redundant copies and allocations

While optimizing CString detection with SWAR, we also identified an opportunity to reduce allocations and copies on the string decoding path.

Earlier versions of the driver wrapped the underlying ByteBuffer in a read-only view to guard against accidental writes, but that choice forced every CString decode to perform two copies:

  • From the buffer into a temporary ‘byte[]’.
  • From that ‘byte[]’ into the internal ‘byte[]’ that backs the ‘String’.

By verifying that the buffer contents remain immutable during decoding, we were able to safely remove the restrictive read-only wrapper. This allows us to access the underlying array directly and decode the string without intermediate copies.

if (buffer.isBackedByArray()) {
    int position = buffer.position();
    int arrayOffset = buffer.arrayOffset();
    return new String(array, arrayOffset + position, bsonStringSize - 1, StandardCharsets.UTF_8);
}

For direct buffers (which are not backed by a Java heap array), we cannot hand a backing array to the String constructor. We still need to copy bytes from the buffer, but we can avoid allocating a new byte[] for every string.

To achieve this, the decoder maintains a reusable byte[] buffer. The first call allocates it (or grows it if a larger string is encountered), and subsequent calls reuse the same memory region. That has two benefits:

  • Fewer allocations, less GC pressure, and memory zeroing: We no longer create a fresh temporary byte[] for every CString, which reduces the amount of work the allocator and garbage collector must do per document.

  • Better cache behavior: The JVM repeatedly reads and writes the same small piece of memory, which tends to remain hot in the CPU cache. We examined CPU cache behavior on our “FindMany and empty cursor” workload using async-profiler’s cache-misses event. Async-profiler samples hardware performance counters exposed by the CPU’s Performance Monitoring Unit (PMU), the hardware block that tracks events such as cache misses, branch misses, and cycles. For readString(), cache-miss samples dropped by roughly 13–28% between the old and new implementation, as we touch fewer cache lines per CString. We still treat the PMU data as directional rather than definitive — counters and sampling semantics vary by CPU and kernel — so the primary signal remains the end-to-end throughput gains (MB/s) that users actually observe.

On our “FindMany and empty cursor” workload, eliminating the redundant intermediate copy in readString improved throughput by approximately 22.5%. Introducing the reusable buffer contributed a ~5% improvement in cases where the internal array is not available.

6. String encoding: removing method indirection and redundant checks

While benchmarking our code, we observed that encoding Strings to UTF-8, the format used by BSON, consumed a significant amount of time. BSON documents contain many strings, including attribute names as CStrings and various string values of different lengths and Unicode code points. The process of encoding strings to UTF-8 was identified as a hot path, prompting us to investigate it and suggest potential improvements. Our initial implementation used custom UTF-8 encoding to avoid creating additional arrays with the standard JDK libraries.

But inside the loop, every character involved several inner checks: 

  • Verifying ByteBuffer capacity 
  • Branching for different Unicode ranges
  • Repeatedly calling abstractions and methods 

If the buffer was full, we'd fetch another one from a buffer pool (we pool ByteBuffers to reduce garbage collection (GC) pressure):

for (int i = 0; i < len;) {
    int c = Character.codePointAt(str, i);
    if (checkForNullCharacters && c == 0x0) {
        throw new BsonSerializationException(...);
    }

    if (c < 0x80) {
        // check if ByteBuffer has capacity and write one byte
    } else if (c < 0x800) {
        // check if ByteBuffer has capacity and write two bytes
    } else if (c < 0x10000) {
        // check if ByteBuffer has capacity and write three bytes
    } else {
        // check if ByteBuffer has capacity and write four bytes
    }
    i += Character.charCount(c);
}

In practice, modern JVMs can unroll tight, counted loops, reducing back branches and enhancing instruction pipeline efficiency under suitable conditions. However, when examining the assembly generated by the JIT for this method, we observed that loop unrolling did not occur in this instance. This underscores the importance of keeping the hot path as straight as possible, minimizing branches, checks, and method indirection, especially for large workloads.

Our first optimization was based on the hypothesis that most workloads mainly use ASCII characters. Using this assumption, we developed a new loop that was much more JIT-friendly.

for (; sp < str.length(); sp++, pos++) {
    char c = str.charAt(sp);
    if (checkNullTermination && c == 0) {
        //throw exception
    }

    if (c >= 0x80) {
        break;
    }
    dst[pos] = (byte) c;
}

Before entering the loop, we verified that String.length() is less than the ByteBuffer's capacity and obtained the underlying array from the ByteBuffer (our ByteBuffer is a wrapper over the JDK or Netty buffers).

By verifying this invariant upfront, we eliminated the need for repeated capacity checks or method indirection inside the loop. Instead, we worked directly with the internal array of a ByteBuffer.

We also added a safeguard: if the character to encode is 0x80 or greater, we've encountered a non-ASCII character and must fall back to a slower, more general loop with additional branching.

With this setup, the JIT usually unrolls the loop body, replacing it with several consecutive copies. This modification decreases the number of back branches and improves pipeline performance efficiency. If we zoom in on the compiled assembly, we can see that C2 has unrolled the loop by a factor of 4. Instead of processing one character per iteration, the hot path processes four consecutive characters (sp, sp+1, sp+2, sp+3) and then increments sp by 4. For example, HotSpot C2 on AArch64 might produce the following assembly, with some bookkeeping removed and only 2 of the 4 unrolled copies for brevity:

loop body:

    ; ----- char 0: value[sp], dst[pos] -----
    ldrsb   w5,  [x12,#16]          ; load value[sp]
    and     w11, w5, #0xff          ; w11 = (unsigned) c0
    cbz     w11, null_trap          ; if (c0 == 0) -> slow path (null terminator check)
    cmp     w11, #0x80              ; if (c0 >= 0x80) -> slow path (non-ASCII)
    b.ge    non_ascii1_path
    strb    w5,  [x10,#16]          ; dst[pos] = (byte)c0

    ; ----- char 1: value[sp+1], dst[pos+1] -----
    ldrsb   w4,  [x12,#17]          ; load value[sp+1]
    and     w11, w4, #0xff          ; w11 = (unsigned) c1
    cbz     w11, null_trap          ; if (c1 == 0) -> slow path (null terminator check)
    cmp     w11, #0x80              ; if (c1 >= 0x80) -> slow path (non-ASCII)
    b.ge    non_ascii1_path
    strb    w4,  [x10,#17]          ; dst[pos+1] = (byte)c1

What we did:

  • Removed internal method indirection (like write() wrappers) that introduced extra bound checks.

  • When writing ASCII, we wrote directly to the underlying ByteBuffer array if the capacity allowed, skipping extra bounds and range checks.

  • For multi-byte code points, we avoided repeated calls to ensureOpen(), hasRemaining(), and related checks by caching position and limit values inside hot paths.

This optimization improved insert throughput across all related benchmarks. For example:

  • Bulk write insert throughput for UTF-8 multi-byte characters improved by nearly 31%.

  • Bulk write insert throughput for ASCII improved by nearly 33%.

You can see the particular test conditions in the Performance Benchmarking specification.

Lessons learned

  • Cache immutable data on the hot path: In our case, pre-encoding BSON index CStrings once into a compact flat byte[] removed repeated int to byte[] conversions and cut heap overhead from thousands of tiny byte[] objects.

  • The JVM can surprise you: Use intrinsics and hardware features whenever possible. After implementing our changes and testing, we found that a simple modification affecting less than 0.03% of the codebase increased throughput for large document bulk inserts by nearly 39%.

  • Thoroughly profile your code: Optimize where it matters. Small, smart changes in the hot path can yield more benefits than rewriting cold code. By caching the buffer in a field and updating it only when switching to a new buffer, we improved bulk insert throughput by 16%.

  • Even cheap checks can add up: Bounds checks and branches in the hot path can be surprisingly costly - multiply a few cycles by billions, and you’ll notice a real impact. Move checks outside inner loops where possible, and don’t repeat invariants that have already been verified.

  • Vectorization (SIMD): Rewriting critical code paths with vectorized methods (e.g., SWAR) can significantly increase throughput by enabling you to process multiple data elements simultaneously per instruction.

  • Removing method indirection and redundant checks: Optimizing low-level operations required removing write(...)/put(...) wrappers to eliminate an extra layer of method indirection and the repeated invariant checks. By writing directly to the underlying ByteBuffer array for ASCII and caching position values in hot paths for multi-byte code points, we bypassed repeated validation calls, such as ensureOpen() and hasRemaining(). This resulted in a 33% improvement in bulk write insert throughput for ASCII.

Figure 3. The figure below shows the final results for throughput improvements (measured in MB/s) after optimizing the driver, as explained above, based on this performance benchmarking specification. The ‘Large doc Collection BulkWrite insert’ saw the highest performance improvement, at +96.46%.


Check out our developer blog to learn how we are solving different engineering problems, or consider joining our engineering team.

Building a High-Performance SMS Gateway with Python: A Complete Guide

2026-02-12 04:20:15

In this comprehensive guide, I'll walk you through building a production-ready SMS gateway using Python and FastAPI. We'll cover everything from basic setup to advanced features like rate limiting, multi-provider support, and delivery tracking.

Why Build Your Own SMS Gateway?

While services like Twilio and Telnyx offer excellent APIs, building your own gateway gives you:

  • Cost control: Route messages through the cheapest provider
  • Redundancy: Automatic failover between providers
  • Customization: Tailor features to your exact needs
  • Analytics: Deep insights into your messaging patterns

Project Architecture

sms-gateway/
├── app/
│   ├── main.py           # FastAPI application
│   ├── providers/        # SMS provider adapters
│   │   ├── twilio.py
│   │   ├── telnyx.py
│   │   └── vonage.py
│   ├── services/
│   │   ├── router.py     # Message routing logic
│   │   └── rate_limiter.py
│   └── models/
│       └── message.py
├── tests/
└── requirements.txt

Getting Started

First, install the required dependencies:

pip install fastapi uvicorn httpx redis

Core Implementation

Here's the main FastAPI application:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional

app = FastAPI(title="SMS Gateway")

class SMSRequest(BaseModel):
    to: str
    message: str
    provider: Optional[str] = None

@app.post("/api/v1/send")
async def send_sms(request: SMSRequest):
    # Validate phone number
    if not validate_phone(request.to):
        raise HTTPException(400, "Invalid phone number")

    # Route to appropriate provider
    provider = select_provider(request.provider)

    # Send message
    result = await provider.send(request.to, request.message)

    return {"status": "sent", "message_id": result.id}

Rate Limiting with Redis

To prevent abuse and respect provider limits:

import redis
from functools import wraps

class RateLimiter:
    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)

    def check_limit(self, key: str, limit: int, window: int) -> bool:
        current = self.redis.incr(key)
        if current == 1:
            self.redis.expire(key, window)
        return current <= limit
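
To wire the limiter into the endpoint, you might add a guard at the top of send_sms. The Redis URL and the 100-per-60-seconds quota below are arbitrary examples:

limiter = RateLimiter("redis://localhost:6379/0")  # assumed local Redis instance

# Inside send_sms, before selecting a provider:
if not limiter.check_limit(f"sms:{request.to}", limit=100, window=60):
    raise HTTPException(429, "Rate limit exceeded for this destination")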

Multi-Provider Support

The key to a robust gateway is supporting multiple providers:

class ProviderRouter:
    def __init__(self):
        self.providers = {
            'twilio': TwilioProvider(),
            'telnyx': TelnyxProvider(),
            'vonage': VonageProvider()
        }

    async def send(self, to: str, message: str) -> dict:
        # Try providers in order of priority
        for name, provider in self.providers.items():
            try:
                return await provider.send(to, message)
            except ProviderError:
                continue
        raise AllProvidersFailedError()
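
The provider adapters and exceptions used above are not defined in the snippet. A bare-bones sketch of their shape could look like the following; the send body is a stub, and a real adapter would call the provider's HTTP API (for example with httpx) and map failures to ProviderError:

class ProviderError(Exception):
    """Raised when a single provider fails to deliver a message."""

class AllProvidersFailedError(Exception):
    """Raised when every configured provider has failed."""

class TwilioProvider:
    async def send(self, to: str, message: str) -> dict:
        # Stubbed response; a real adapter would POST to the provider's API here
        return {"id": "stub-message-id", "provider": "twilio"}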

Conclusion

Building your own SMS gateway gives you complete control over your messaging infrastructure. The code shown here is a starting point - you can extend it with features like:

  • Webhook callbacks for delivery status
  • Message queuing with Celery
  • Geographic routing
  • A/B testing for message content

Check out my full implementation on GitHub: cloud-sms-gateway

Found this helpful? Follow me for more Python and API development content!