
Compiling the Vision Encoder: Squeezing 3% More Throughput from Qwen3-VL on Hopper GPUs

2026-02-09 10:08:38

When you run a vision-language model through vLLM, the framework does something clever: it compiles the LLM decoder with torch.compile, fuses operators, and captures CUDA graphs for maximum throughput. But there is a component it quietly leaves behind -- the Vision Transformer (ViT) encoder that processes your images. It runs in plain eager mode, every single time.

We changed that for Qwen3-VL. The result: 3.4% higher throughput on an NVIDIA H200, three previously unknown bugs discovered and fixed, and a one-flag change that any vLLM user can enable today.

This post walks through the engineering story -- why the encoder was left behind, how we ported compilation support from a sibling model, what broke along the way, and what the profiler actually says about where the time goes.

Why Does the Encoder Run Eager?

vLLM's compilation infrastructure is built around the LLM decoder. When you launch an inference server, the startup sequence compiles the decoder's forward pass with torch.compile, traces its graph, and captures CUDA graphs at various batch sizes. This eliminates Python overhead and enables kernel fusion across attention, LayerNorm, and MLP layers.

The multimodal encoder -- the ViT that converts raw image pixels into embedding vectors -- gets none of this treatment. The reason is a single boolean flag in vLLM's compilation config:

compile_mm_encoder: bool = False
"""Whether or not to compile the multimodal encoder.
Currently, this only works for Qwen2_5_vl and mLLaMa4
models on selected platforms. Disabled by default until
more models are supported/tested to work."""

The default is False, and for good reason. Vision encoders face a fundamental tension with compilation: variable input shapes. Different requests can carry images at different resolutions, producing different numbers of patches. CUDA graphs require fixed tensor shapes at capture time. A general-purpose serving framework cannot assume that every image will be the same size.

But for batch inference workloads with fixed-size images -- which is common in production pipelines processing standardized camera frames, satellite tiles, or document pages -- this conservatism leaves performance on the table. If your images are all the same resolution, the encoder always receives identically shaped tensors, and torch.compile can fully specialize.

There was a second, more specific problem: Qwen3-VL simply lacked the compilation decorators. Its sibling model, Qwen2.5-VL, already had full torch.compile support for its encoder. Qwen3-VL shared much of the same architecture (including the identical attention implementation), but the compilation wiring was never ported over.

The Pattern: Porting from Qwen2.5-VL

vLLM uses a decorator-based system for selective compilation. Rather than compiling an entire model's forward pass (which would break on Python control flow, NumPy calls, and dynamic branching), it compiles individual submodules whose forward() methods contain only clean tensor operations.

Qwen2.5-VL already had this wired up for three encoder submodules: VisionPatchEmbed, VisionBlock, and VisionPatchMerger. Our task was to replicate the exact same pattern in Qwen3-VL.

The Decorator

Each compilable submodule gets a @support_torch_compile decorator that declares which tensor dimensions are dynamic and provides a gating function:

@support_torch_compile(
    dynamic_arg_dims={"x": 0},
    enable_if=should_torch_compile_mm_vit,
)
class Qwen3_VisionPatchEmbed(nn.Module):
    ...

The dynamic_arg_dims={"x": 0} tells torch.compile that dimension 0 of the input tensor x can vary between calls (different numbers of patches), so it should not bake that shape into the compiled graph. The enable_if callback is a one-liner that checks whether the user opted in:

def should_torch_compile_mm_vit(vllm_config: VllmConfig) -> bool:
    return vllm_config.compilation_config.compile_mm_encoder

When compile_mm_encoder is False (the default), the decorator sets self.do_not_compile = True and the forward pass runs in eager mode -- zero overhead, zero behavior change. When it is True, the decorator wraps the module in torch.compile on first call and uses compiled execution from then on.
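To make that gating concrete, here is a toy sketch of the general mechanism described above. It is not vLLM's actual @support_torch_compile implementation (which also handles dynamic_arg_dims, caching, and CUDA graph wiring); enable_if here is a plain callable rather than a config callback, and the class names are made up:

# Toy illustration of the gating idea only -- NOT vLLM's actual decorator.
import torch
import torch.nn as nn

def support_torch_compile_sketch(enable_if):
    def wrap(cls):
        orig_forward = cls.forward

        def forward(self, *args, **kwargs):
            if not enable_if():
                # Opted out: plain eager call, zero behavior change.
                return orig_forward(self, *args, **kwargs)
            if not hasattr(self, "_compiled_forward"):
                # Compile lazily on the first call; mark shapes as dynamic so a
                # varying patch count does not force a recompilation.
                self._compiled_forward = torch.compile(orig_forward, dynamic=True)
            return self._compiled_forward(self, *args, **kwargs)

        cls.forward = forward
        return cls
    return wrap

@support_torch_compile_sketch(enable_if=lambda: True)
class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.proj(x))

block = TinyBlock()
out = block(torch.randn(4, 16))  # first call triggers (toy) compilation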

The Model Tags

The second piece of wiring is set_model_tag, a context manager that tells the compilation backend to use separate caches for encoder versus decoder components. Without tags, the encoder and decoder would share a single compile cache, causing shape mismatches when the compiler tries to reuse a graph compiled for decoder weight shapes on encoder weights.

In Qwen3_VisionTransformer.__init__():

# DO NOT MOVE THIS IMPORT
from vllm.compilation.backends import set_model_tag

with set_model_tag("Qwen3_VisionPatchEmbed", is_encoder=True):
    self.patch_embed = Qwen3_VisionPatchEmbed(...)

with set_model_tag("Qwen3_VisionPatchMerger", is_encoder=True):
    self.merger = Qwen3_VisionPatchMerger(...)

# Deepstack mergers need a separate tag (different weight shapes!)
with set_model_tag("Qwen3_VisionPatchMerger_deepstack", is_encoder=True):
    self.deepstack_merger_list = nn.ModuleList([...])

with set_model_tag("Qwen3_VisionBlock", is_encoder=True):
    self.blocks = nn.ModuleList([
        Qwen3_VisionBlock(...) for _ in range(depth)
    ])

That comment about DO NOT MOVE THIS IMPORT is not a joke -- it matches the exact pattern in Qwen2.5-VL and relates to import ordering constraints with the compilation backend (see vllm#27044).

Notice the deepstack mergers get their own tag, separate from the main merger. This was not in the original plan. It was the fix for Bug #2, which we will get to shortly.

What Gets Compiled

The Qwen3-VL vision encoder has three distinct compilable submodules:

| Submodule | What It Does | Dynamic Dims |
|---|---|---|
| Qwen3_VisionPatchEmbed | Reshape + Conv3D + reshape (pixels to patch embeddings) | x: dim 0 |
| Qwen3_VisionBlock (x24) | LayerNorm -> Attention -> Residual -> LayerNorm -> MLP -> Residual | x, cu_seqlens, cos, sin: dim 0 |
| Qwen3_VisionPatchMerger | LayerNorm -> Linear -> GELU -> Linear (merge spatial patches) | x: dim 0 |

The outer VisionTransformer.forward() -- which orchestrates these submodules -- is deliberately not compiled. It contains NumPy operations (np.array, np.cumsum), Python control flow (isinstance, list comprehensions), and .tolist() calls that would cause graph breaks. The per-submodule pattern avoids all of this.

Zero Graph Breaks

The first compile attempt was the moment of truth. We enabled TORCH_LOGS=+dynamo and TORCH_COMPILE_DEBUG=1, loaded a handful of test images, and watched TorchDynamo trace through the encoder.

Result: zero graph breaks. The single COMPILING GRAPH event reported:

COMPILING GRAPH due to GraphCompileReason(
    reason='return_value',
    user_stack=[<FrameSummary file qwen3_vl.py, line 1169 in forward>],
    graph_break=False
)

This was expected but still satisfying. The per-submodule compilation pattern is specifically designed to isolate clean tensor operations from Python control flow. Each compiled forward method contains nothing but torch operations -- reshapes, linear projections, attention, LayerNorm, residual additions. No data-dependent control flow, no Python-side data structures, no calls that escape the Dynamo graph.

The key insight: if you tried to compile the entire VisionTransformer.forward() as one graph, you would hit graph breaks immediately on the NumPy calls that compute positional embeddings and cumulative sequence lengths. By compiling only the inner submodules, you get all the fusion benefits with none of the graph break headaches.

Three Bugs Found and Fixed

Zero graph breaks did not mean zero problems. The first full run crashed. Then it crashed differently. Then it crashed a third way. Here is what we found.

Bug 1: AssertionError: Forward context is not set in profile_run()

The crash:

AssertionError: Forward context is not set.
Please use `set_forward_context` to set the forward context.

What happened: When vLLM starts up, it runs a profiling pass (profile_run()) to determine memory usage. This calls self.model.embed_multimodal() to profile the encoder. In eager mode, this works fine -- the encoder's forward methods are just regular PyTorch calls.

But with @support_torch_compile, the compilation backend wraps each submodule in a CUDAGraphWrapper. The wrapper's __call__ method reads forward_context.cudagraph_runtime_mode to decide whether to execute via CUDA graph or fall through to eager. Without a forward context set, it crashes.

The fix: Wrap the profiling call in set_forward_context:

with set_forward_context(attn_metadata=None, vllm_config=self.vllm_config):
    dummy_encoder_outputs = self.model.embed_multimodal(
        **batched_dummy_mm_inputs
    )

Since attn_metadata=None, the wrapper sees CUDAGraphMode.NONE and falls through to eager execution -- exactly the behavior we want during profiling.

Bug 2: AssertionError: expected size 1024==4096

The crash:

AssertionError: expected size 1024==4096, stride 1==1 at dim=0

What happened: Qwen3-VL has two types of patch mergers. The main merger has a LayerNorm over context_dim=1024 (the per-patch hidden size before spatial merging). The deepstack mergers have a LayerNorm over hidden_size=4096 (the full hidden size, via use_postshuffle_norm=True). Both use the Qwen3_VisionPatchMerger class.

In our initial implementation, both mergers shared the same set_model_tag("Qwen3_VisionPatchMerger") context. This meant they shared a single compiled graph cache. When torch.compile traced through the main merger (norm weight shape (1024,)), it cached a graph with that shape baked in. When the deepstack merger tried to reuse the same cached graph with its (4096,) weights -- crash.

The fix: Separate model tags:

with set_model_tag("Qwen3_VisionPatchMerger", is_encoder=True):
    self.merger = ...          # LayerNorm over 1024

with set_model_tag("Qwen3_VisionPatchMerger_deepstack", is_encoder=True):
    self.deepstack_merger_list = ...  # LayerNorm over 4096

Same Python class, different compile caches. The tag system was designed exactly for this -- but you have to remember to use it when two instances of the same class have different weight shapes.

Bug 3: Same as Bug 1, but in _execute_mm_encoder()

The profiling fix (Bug 1) resolved the startup crash, but the same AssertionError appeared during actual inference. The encoder execution path in _execute_mm_encoder() also called embed_multimodal() without setting forward context.

The fix: Same pattern -- wrap the encoder execution loop in set_forward_context(attn_metadata=None, ...).

Defense-in-Depth

After fixing both call sites, we added a belt-and-suspenders guard in CUDAGraphWrapper.__call__ itself:

def __call__(self, *args, **kwargs):
    if not is_forward_context_available():
        return self.runnable(*args, **kwargs)  # Eager fallback
    forward_context = get_forward_context()
    ...

If any future code path calls a compiled encoder submodule without setting forward context, it gracefully falls through to eager execution instead of crashing. This is defense-in-depth -- the primary fix is ensuring all call sites set the context, but the guard protects against regressions.

Profiling: Where the Time Goes

With compilation working, we instrumented the encoder with torch.cuda.Event timing to measure exactly how much each component contributes and how much compilation helps.

The Encoder Is Only 13.5% of Total Inference Time

For Qwen3-VL-2B on our workload, the ViT encoder processes each image once to produce embedding tokens, then the LLM decoder generates the output sequence. The decoder dominates.

| Component | Baseline (ms) | Compiled (ms) | Speedup |
|---|---|---|---|
| PatchEmbed | 5.2 | 6.2 | -19% |
| VisionBlocks (24) | 352.5 | 330.2 | +6.3% |
| PatchMerger | 3.8 | 5.3 | -39% |
| Total Encoder | 450.3 | 430.5 | +4.4% |

VisionBlocks Win, Small Ops Lose

The 24 VisionBlocks are where compilation shines. Each block runs LayerNorm -> Attention -> Residual -> LayerNorm -> MLP -> Residual. The Inductor backend fuses these into fewer, more efficient kernels. Blocks 1-23 show a consistent 7-8% per-block speedup, accumulating to a 22.3ms reduction.

PatchEmbed and PatchMerger show the opposite: compilation makes them slower. These are tiny operations (~0.3ms per call). The @support_torch_compile decorator adds Python dispatch overhead on every call, and at this scale, the overhead exceeds the fusion benefit. It is a classic tradeoff -- compilation has a per-call dispatch cost that only pays off when the compiled operation is large enough.

A pragmatic optimization would be to remove the @support_torch_compile decorators from PatchEmbed and PatchMerger, compiling only VisionBlocks. The net encoder speedup would actually be slightly higher without the small-op regressions. But the dispatch overhead is small in absolute terms (a few milliseconds total), and having all submodules wired for compilation maintains consistency with the Qwen2.5-VL pattern.

Why 4.4% Encoder Speedup Becomes 3.4% End-to-End

With the encoder representing 13.5% of total inference time, even a 4.4% encoder speedup translates to only ~0.6% of total wall time through Amdahl's Law. The actual measured end-to-end improvement is larger than that simple calculation suggests, likely because the compilation also reduces Python overhead and improves memory access patterns in ways that benefit the surrounding orchestration code.
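As a quick sanity check on that estimate, here is the arithmetic spelled out (plain Python, using only the numbers quoted above):

# Amdahl's Law: speeding up a fraction p of the work by a factor s.
encoder_share = 0.135    # encoder's share of total inference time
encoder_speedup = 1.044  # 4.4% faster encoder

total_speedup = 1 / ((1 - encoder_share) + encoder_share / encoder_speedup)
print(f"expected end-to-end gain: {(total_speedup - 1) * 100:.2f}%")  # ~0.6%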

End-to-End Benchmark

We ran a full A/B comparison over ~8,000 samples on an NVIDIA H200, with 10-sample warmup excluded from measurements.

| Metric | Baseline | Compiled | Delta |
|---|---|---|---|
| Throughput | 32.33 samp/s | 33.42 samp/s | +3.4% |
| Generate time | 266.1s | 257.4s | -8.7s |
| Per-sample latency | 30.93ms | 29.92ms | -1.0ms |
| Model load time | 37.3s | 50.2s | +12.9s |

The throughput improvement held up across scales: +0.9% at 100 samples (where measurement noise dominates) and +3.4% over the full ~8,000-sample dataset.

The model load time increase (+12.9s) is a one-time cost for Dynamo bytecode transforms and Inductor codegen on the encoder submodules. On subsequent runs, the compilation cache (~/.cache/vllm/torch_compile_cache/) eliminates recompilation entirely -- subsequent startups are only marginally slower than baseline. In a production serving context, this compilation happens once at server startup and all subsequent inference benefits from the speedup.

Break-Even Analysis

| Parameter | Value |
|---|---|
| One-time compilation overhead | 12.9s |
| Per-sample time saving | ~1.0ms |
| Break-even point | ~12,900 samples |

For the first-ever run (cold compilation cache), you need to process approximately 13,000 samples before the compilation overhead is amortized. For any subsequent run with a warm cache, the benefit is immediate.

Output Correctness

An important caveat: compiled and baseline modes produce slightly different outputs on some inputs. This is expected behavior from torch.compile -- the Inductor backend may apply different operator fusion, reduction ordering, and kernel implementations that change floating-point rounding at the bit level. These tiny differences in intermediate activations can cascade through the encoder, shift logits by small amounts, and occasionally flip the argmax for borderline tokens during autoregressive decoding.

Concretely:

  • Both modes are individually deterministic -- the same mode always produces the same output for the same input, run after run.
  • They are not cross-compatible -- baseline and compiled may differ on some samples.
  • The differences are small in magnitude and affect only a fraction of samples.
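For intuition, floating-point addition is not associative, so changing nothing but the accumulation order already changes the result (plain Python, unrelated to vLLM):

left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)   # 0.6
print(left_to_right == right_to_left)  # False

Fused kernels reorder reductions in exactly this way, which is why bit-identical outputs across compiled and eager modes are not guaranteed.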

This is a property of torch.compile itself, not of our changes. If your application requires bitwise reproducibility between compiled and non-compiled modes, this is worth knowing. If you only need consistency within a single mode (the more common requirement), both modes deliver it.

When Would This Matter More?

A 3.4% throughput improvement is real and free (once the cache is warm), but it is bounded by the encoder's share of total inference time. For Qwen3-VL-2B, the ViT encoder is small relative to the LLM decoder. Several scenarios would amplify the benefit:

Larger ViT encoders. Qwen3-VL-72B has a proportionally larger vision encoder. The same 7-8% per-block VisionBlock speedup applied to more expensive encoder blocks would yield a larger end-to-end improvement.

Video workloads. Video inputs require processing many frames, multiplying encoder invocations per request. The encoder's share of total time increases, and the compilation benefit compounds.

High-concurrency serving. When many requests arrive simultaneously, encoder latency adds up across the batch. Shaving 4.4% off each encoder call reduces queuing delay.

Bandwidth-bound GPUs. The H200 is a compute-rich Hopper GPU. On more bandwidth-constrained hardware like the L40S, the operator fusion from torch.compile (which reduces memory traffic by eliminating intermediate tensor materializations) would likely yield a larger percentage improvement.

Higher-resolution images. More patches per image means more work in the VisionBlocks, which are the primary beneficiaries of compilation.

How to Enable It

One flag:

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-VL-2B-Instruct",
    compilation_config={"compile_mm_encoder": True},
    # ... other settings
)

Or via the CLI:

vllm serve Qwen/Qwen3-VL-2B-Instruct \
    --compilation-config '{"compile_mm_encoder": true}'

That is it. No model changes, no custom code, no configuration gymnastics. The flag tells vLLM to apply torch.compile to the ViT encoder submodules during model initialization. The first inference call that includes images will trigger compilation (or load from cache), and all subsequent calls use the compiled kernels.

First Run vs. Subsequent Runs

On the very first run with a new model or new vLLM version, you will see a longer model load time (~13s extra) as TorchDynamo traces and Inductor generates code for the encoder submodules. These artifacts are cached to ~/.cache/vllm/torch_compile_cache/.

On all subsequent runs, the cached artifacts load in seconds, and the throughput benefit is immediate.

Conclusion

This was a small change -- six modifications across two files for the core enablement, plus four files touched for bug fixes. The pattern was already established by Qwen2.5-VL; we just ported it to Qwen3-VL. But small changes can have disproportionate engineering value when they uncover latent bugs.

The three bugs we found -- missing set_forward_context in two encoder execution paths, and shared compile caches for mergers with different weight shapes -- are not specific to Qwen3-VL. They would affect any model that enables compile_mm_encoder. The fixes (including the defense-in-depth guard in CUDAGraphWrapper) benefit the entire vLLM multimodal compilation infrastructure.

The profiling results tell an honest story: the ViT encoder is a small fraction of end-to-end time for a 2B parameter model, so even a solid 4.4% encoder speedup translates to a modest 3.4% end-to-end gain. But it is a free 3.4% -- one flag, cached after the first run, no accuracy impact within a single mode. For larger models, video workloads, or bandwidth-constrained hardware, the benefit would be larger.

Sometimes the most useful engineering work is not building something new, but noticing that a capability already exists in the codebase and was never wired up for your model.

Summary of Changes

| File | Change |
|---|---|
| vllm/model_executor/models/qwen3_vl.py | @support_torch_compile decorators on 3 encoder submodules + set_model_tag wiring |
| vllm/config/compilation.py | Updated compile_mm_encoder docstring to include Qwen3-VL |
| vllm/v1/worker/gpu_model_runner.py | set_forward_context wrapper in _execute_mm_encoder() and profile_run() |
| vllm/compilation/cuda_graph.py | is_forward_context_available() guard in CUDAGraphWrapper.__call__ |

Hardware and Software

  • GPU: NVIDIA H200 (141 GB HBM3e)
  • vLLM: 0.15.x (main branch)
  • PyTorch: 2.9.x
  • Model: Qwen3-VL-2B-Instruct (fine-tuned checkpoint)
  • Workload: ~8,000 fixed-resolution images, single GPU, temperature=0.0, max_tokens=128

5 AI Coding Patterns That Actually Work (2026 Edition)

2026-02-09 09:25:54

As AI coding agents become the norm, I've spent the last few months figuring out what actually works vs. what's just hype.

Here are 5 patterns that have genuinely sped up my workflow.

1. The "Describe, Don't Code" Pattern

Instead of writing code yourself, describe what you want in plain English:

# Bad: Writing this yourself
def validate_email(email):
    import re
    pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    return bool(re.match(pattern, email))

# Good: Ask the agent
# "Write a Python function to validate email addresses. 
#  Handle edge cases like plus addressing and subdomains."

The agent will write more comprehensive code than you would have.

2. The "Rubber Duck" Pattern

Before diving into a problem, explain it to the AI:

Me: I need to build a rate limiter for an API. 
I'm thinking sliding window, but I'm not sure 
if that's overkill for my use case.

Agent: What's your expected QPS? If it's under 1000, 
a simple token bucket might be simpler to implement 
and debug...

The AI challenges your assumptions before you waste time.

3. The "Test First" Pattern

Ask for tests before implementation:

# Prompt: "Write pytest tests for a user authentication 
# service that handles login, logout, and password reset"

Then feed those tests back and ask for implementation. The agent writes code that passes the tests on first try.
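To make the first step concrete, here is a hypothetical, trimmed-down version of what the agent might hand back. Every name here (AuthService and its methods) is made up, and the small in-file class stands in for the implementation you would request in the second step:

import pytest

class AuthService:
    """Stand-in for the implementation the agent writes in step two."""
    def __init__(self):
        self._users = {"alice": "s3cret"}
        self._sessions = set()

    def login(self, user, password):
        if self._users.get(user) != password:
            raise ValueError("invalid credentials")
        self._sessions.add(user)

    def logout(self, user):
        self._sessions.discard(user)

    def is_logged_in(self, user):
        return user in self._sessions

def test_login_with_valid_credentials():
    svc = AuthService()
    svc.login("alice", "s3cret")
    assert svc.is_logged_in("alice")

def test_login_rejects_wrong_password():
    svc = AuthService()
    with pytest.raises(ValueError):
        svc.login("alice", "wrong")

def test_logout_ends_session():
    svc = AuthService()
    svc.login("alice", "s3cret")
    svc.logout("alice")
    assert not svc.is_logged_in("alice")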

4. The "Refactor Chain" Pattern

Don't ask for perfect code upfront. Iterate:

  1. "Write a quick script to do X"
  2. "Now add error handling"
  3. "Now make it production-ready"
  4. "Now optimize for performance"

Each pass is focused. The final result is clean.

5. The "Code Review" Pattern

Paste your code and ask:

"Review this code. What would break in production? 
What would a senior engineer change?"

You get instant feedback without waiting for PR reviews.

The Meta-Lesson

AI coding tools aren't about replacing developers. They're about amplifying what you already know.

The best prompts come from experience. The best code reviews come from understanding the codebase.

The agents do the typing. You do the thinking.

What patterns are you using? Drop them in the comments 👇

Cloud Newbies: Avoid These 5 Costly Pitfalls! | Cloud Cost Optimization

2026-02-09 09:25:35

Pitfall 1: Poor Instance Selection
Common Mistakes:

Over-provisioning: Blindly choosing high-spec instances, leading to wasted performance.

Misunderstanding the differences between Compute-Optimized, Memory-Optimized, and Storage-Optimized types.

Ignoring the limitations of Burstable Performance instances (e.g., T-series CPU credits).

How to Avoid:

Test Before You Buy: Use "Pay-as-you-go" to benchmark performance before committing.

Match Application Needs:

Web Apps → General Purpose

Databases → Memory-Optimized

Batch Processing → Compute-Optimized

Leverage Tools: Use the cloud provider’s advisor or sizing recommendation tools.
💰 Savings: Strategic selection can drastically reduce your baseline compute costs.

Pitfall 2: Wasted Storage Configuration
Common Mistakes:

Using High-Performance SSDs for all data types.

Never cleaning up old Snapshots and Backups.

Forgetting to set Lifecycle Rules for Object Storage (S3/OSS).

How to Avoid:

Implement Data Tiering:

Hot Data → SSD Cloud Disks

Warm Data → Standard Cloud Disks

Cold Data → Archive Storage

Automated Cleanup:

Set auto-deletion policies for snapshots.

Configure Object Storage lifecycles (Auto-transition to Infrequent Access/Archive); see the sketch after this list.

Storage Monitoring: Set up storage-specific cost alerts.
💰 Savings: Turn "forgotten storage" into immediate budget savings.
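To make the lifecycle item above concrete, here is an AWS-flavored sketch using boto3. The bucket name, prefix, and day thresholds are placeholders, and OSS or other providers expose equivalent lifecycle APIs:

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix: tier logs to cheaper storage, then expire them.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-app-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm tier
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold tier
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)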

Pitfall 3: Runaway Networking & Egress Costs
Common Mistakes:

Downloading large files directly from cloud servers via the Public Internet.

Ignoring the costs of Cross-AZ or Cross-Region data transfers.

Failing to set up traffic monitoring alerts.

How to Avoid:

Optimize Downloads: Distribute large files via Object Storage + CDN.

Set Bandwidth Caps: Limit peak speeds to prevent spikes.

Use Internal Networking:

Use Private IPs within the same region (usually free).

Utilize VPC Peering or Cloud Enterprise Networks for cross-region connectivity.

Early Warnings: Set daily egress cost thresholds.
💰 Savings: Prevent "bill shocks" from unexpected traffic spikes.

Pitfall 4: Excessive Security Group & Permission Access
Common Mistakes:

Setting Security Groups to 0.0.0.0/0 (Wide open to the world).

Using the Root Account Access Key (AK/SK) for daily operations.

Failing to audit permission logs.

How to Avoid:

Principle of Least Privilege (PoLP): Open only specific IPs/Ports in Security Groups.

Use IAM/RAM Sub-accounts with minimal necessary permissions.

Security Hardening:

Delete unused Access Keys.

Enable ActionTrails/CloudTrails for auditing.

Rotate Access Keys regularly.

Cost Impact: Breached accounts are often used for "Crypto-jacking" (mining), leading to massive unauthorized bills.
💰 Savings: Protect against catastrophic bills caused by security breaches.

Pitfall 5: Unmanaged "Orphaned" Resources
❌ Common Mistakes:

Forgetting to delete test instances.

Leaving Elastic IPs (EIP) or Load Balancers (LB) unattached while still being billed.

Keeping database test environments running 24/7.

How to Avoid:

Resource Audit: Perform weekly/monthly checks for:

Idle Cloud Servers (Zero CPU load).

Unattached EIPs.

Empty/Unused Load Balancers.

Automation Tools:

Use Tags to label and track test resources.

Write cleanup scripts (see the sketch after this list, or check our group for shared scripts!).

Architecture Optimization: Use instances that "Stop without Billing" for test environments.
💰 Savings: Eliminate unnecessary spending on resources that aren't even being used.
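And for the cleanup-scripts item above, a minimal AWS-flavored boto3 sketch that flags unattached Elastic IPs (billed even while idle); other clouds have equivalent list APIs:

import boto3

ec2 = boto3.client("ec2")

# Elastic IPs with no AssociationId are allocated but attached to nothing -- still billed.
for addr in ec2.describe_addresses()["Addresses"]:
    if "AssociationId" not in addr:
        print(f"Orphaned EIP: {addr.get('PublicIp')} (allocation {addr.get('AllocationId')})")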

📥 Bonus: "Cloud Resource Cost Optimization Checklist" PDF
👉 Comment or DM me to get your copy!

Stop miscalculating age in JavaScript: leap years, Feb 29, and the Jan 31 trap

2026-02-09 09:18:05

Most age calculators are wrong for at least one of these reasons:

they do nowYear - dobYear and forget to check if the birthday already happened

they treat all months like the same length

they explode on Feb 29 birthdays

they hit JavaScript’s “Jan 31 + 1 month = March” surprise

If you want an age calculator you can ship, you need a small amount of boring correctness.

This post shows a simple, testable way to calculate years + months + days between two calendar dates.

What we’re actually trying to compute

Given:

dob (date of birth)

asOf (the date you want to measure age on, defaulting to today)

We want:

years: full birthdays completed

months: full months since the last birthday

days: remaining days since that month anchor

If asOf < dob, that’s invalid input.

The two rules that prevent 90% of bugs
Rule 1: Work with “date-only”, not time

Dates in JS carry time and timezone baggage. For age, you almost always want midnight local time.

So normalize:

YYYY-MM-DD 00:00:00

Rule 2: Define your Feb 29 policy

Born on Feb 29, non-leap year: do you count their birthday on Feb 28 or Mar 1?

There’s no universal answer. Pick one and be consistent.
In this code: Feb 28.

The algorithm (simple and dependable)

  1. Compute tentative years = asOf.year - dob.year.

  2. If asOf is before the birthday in asOf.year, subtract 1.

  3. Set anchor = the last birthday date.

  4. Walk forward month-by-month from anchor, clamping the day-of-month.

  5. Remaining days = the difference between anchor and asOf.

Implementation (TypeScript)
type AgeBreakdown = { years: number; months: number; days: number };

function normalizeDateOnly(d: Date) {
  return new Date(d.getFullYear(), d.getMonth(), d.getDate());
}

function daysInMonth(year: number, monthIndex0: number) {
  return new Date(year, monthIndex0 + 1, 0).getDate();
}

function birthdayInYear(
  dob: Date,
  year: number,
  feb29Rule: "FEB_28" | "MAR_1" = "FEB_28",
) {
  const m = dob.getMonth();
  const day = dob.getDate();

  // Feb 29 handling
  if (m === 1 && day === 29) {
    const isLeap = new Date(year, 1, 29).getMonth() === 1;
    if (isLeap) return new Date(year, 1, 29);
    return feb29Rule === "FEB_28" ? new Date(year, 1, 28) : new Date(year, 2, 1);
  }

  return new Date(year, m, day);
}

export function calculateAge(dobInput: Date, asOfInput: Date): AgeBreakdown {
  const dob = normalizeDateOnly(dobInput);
  const asOf = normalizeDateOnly(asOfInput);

  if (asOf < dob) throw new Error("asOf must be >= dob");

  // Years
  let years = asOf.getFullYear() - dob.getFullYear();
  const bdayThisYear = birthdayInYear(dob, asOf.getFullYear(), "FEB_28");
  if (asOf < bdayThisYear) years -= 1;

  // Anchor at last birthday
  const lastBirthdayYear = dob.getFullYear() + years;
  let anchor = birthdayInYear(dob, lastBirthdayYear, "FEB_28");

  // Months: step forward with month-end clamping
  let months = 0;
  while (true) {
    const nextMonthFirst = new Date(anchor.getFullYear(), anchor.getMonth() + 1, 1);
    const y = nextMonthFirst.getFullYear();
    const m = nextMonthFirst.getMonth();
    const d = Math.min(anchor.getDate(), daysInMonth(y, m));
    const candidate = new Date(y, m, d);

    if (candidate <= asOf) {
      months += 1;
      anchor = candidate;
    } else {
      break;
    }
  }

  // Days: round (not floor) so a DST transition between the two local
  // midnights cannot make the interval come up an hour short.
  const msPerDay = 24 * 60 * 60 * 1000;
  const days = Math.round((asOf.getTime() - anchor.getTime()) / msPerDay);

  return { years, months, days };
}
Tests that catch the real failures

Don’t just test “normal” birthdays. Test the annoying dates.

import { describe, it, expect } from "vitest";
import { calculateAge } from "./calculateAge";

// Construct dates from local components: new Date("2004-02-29") parses as UTC
// midnight and can land on the previous local day in negative-offset timezones.
describe("calculateAge", () => {
  it("handles birthday not yet happened this year", () => {
    const dob = new Date(2000, 9, 20); // 2000-10-20
    const asOf = new Date(2026, 1, 9); // 2026-02-09
    const r = calculateAge(dob, asOf);
    expect(r.years).toBe(25);
  });

  it("handles month-end clamping (Jan 31)", () => {
    const dob = new Date(2000, 0, 31); // 2000-01-31
    const asOf = new Date(2000, 2, 1); // 2000-03-01
    const r = calculateAge(dob, asOf);
    // If your month add is buggy, this often breaks.
    expect(r.years).toBe(0);
    expect(r.months).toBeGreaterThanOrEqual(1);
  });

  it("handles Feb 29 birthdays with FEB_28 rule", () => {
    const dob = new Date(2004, 1, 29); // 2004-02-29
    const asOf = new Date(2025, 1, 28); // 2025-02-28
    const r = calculateAge(dob, asOf);
    // Under FEB_28 policy, birthday is considered reached on Feb 28.
    expect(r.years).toBe(21);
  });

  it("rejects asOf before dob", () => {
    const dob = new Date(2020, 0, 1);   // 2020-01-01
    const asOf = new Date(2019, 11, 31); // 2019-12-31
    expect(() => calculateAge(dob, asOf)).toThrow();
  });
});

You can add more:

dob = 1999-12-31, asOf = 2000-01-01

dob = 2000-02-28, asOf = 2001-02-28

dob = 2000-03-31, asOf = 2000-04-30

The takeaway

If you want correct age output:

normalize to date-only

define Feb 29 behavior

clamp month ends

ship tests for weird dates

That’s it. No libraries required.

Demo (optional): https://www.calculatorhubpro.com/everyday-life/age-calculator

Quantified Self: Building a Blazing Fast Health Dashboard with DuckDB and Streamlit

2026-02-09 09:15:00

Have you ever tried exporting your Apple Health data, only to find a massive, 2GB+ export.xml file that makes your text editor cry? 😭 As a developer obsessed with the Quantified Self movement, I wanted to turn that mountain of raw data into actionable insights—without waiting ten minutes for a single correlation plot.

In this tutorial, we are diving deep into Data Engineering for personal analytics. We’ll leverage the insane speed of DuckDB, the flexibility of Pyarrow, and the simplicity of Streamlit to build a high-performance health dashboard. Whether you are dealing with millions of heart rate rows or gigabytes of GPS points, this stack ensures millisecond-level DuckDB performance for all your Apple HealthKit analysis.

Why this stack?

If you've been following modern data stacks, you know that OLAP for wearables is becoming a hot topic. Traditional Python libraries like Pandas are great, but they often struggle with the memory overhead of large nested XML structures.

  • DuckDB: The "SQLite for Analytics." It's an in-process columnar database that runs SQL at the speed of light.
  • Pyarrow: The bridge that allows us to move data between formats with zero-copy overhead.
  • Streamlit: The fastest way to turn data scripts into shareable web apps.

The Architecture 🏗️

The biggest challenge with wearable data is the "Extract-Transform-Load" (ETL) process. We need to turn a hierarchical XML file into a flattened, queryable Parquet format that DuckDB can devour.

graph TD
    A[Apple Health export.xml] --> B[Python XML Parser]
    B --> C[Pyarrow Table]
    C --> D[Parquet Storage]
    D --> E[DuckDB Engine]
    E --> F[Streamlit Dashboard]
    F --> G[Millisecond Insights 🚀]

Prerequisites

Before we start, ensure you have your export.xml ready and these tools installed:

pip install duckdb streamlit pandas pyarrow

Step 1: From XML Chaos to Parquet Order

Apple's XML format is... "unique." We use Pyarrow to define a schema and convert those records into a compressed Parquet file. This reduces our file size by up to 90% and optimizes it for columnar reads.

import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def parse_health_data(xml_path):
    tree = ET.parse(xml_path)
    root = tree.getroot()

    # We only care about Record types for this dashboard
    records = []
    for record in root.findall('Record'):
        records.append({
            'type': record.get('type'),
            'value': record.get('value'),
            'unit': record.get('unit'),
            'creationDate': record.get('creationDate'),
            'startDate': record.get('startDate')
        })

    df = pd.DataFrame(records)
    # Convert dates to proper datetime objects
    df['startDate'] = pd.to_datetime(df['startDate'])
    df['value'] = pd.to_numeric(df['value'], errors='coerce')

    # Convert to Arrow Table and save as Parquet
    table = pa.Table.from_pandas(df)
    pq.write_table(table, 'health_data.parquet')
    print("✅ Transformation complete!")

# parse_health_data('export.xml')
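One caveat before moving on: for a truly multi-gigabyte export, ET.parse builds the whole element tree in RAM before we ever reach Parquet. A streaming variant with iterparse (a sketch making the same schema assumptions as above) keeps memory flat:

import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def parse_health_data_streaming(xml_path):
    records = []
    # iterparse yields elements as they are closed, so the full tree never exists in memory
    context = ET.iterparse(xml_path, events=("start", "end"))
    _, root = next(context)  # grab the root so we can prune processed children
    for event, elem in context:
        if event == "end" and elem.tag == "Record":
            records.append({
                'type': elem.get('type'),
                'value': elem.get('value'),
                'unit': elem.get('unit'),
                'creationDate': elem.get('creationDate'),
                'startDate': elem.get('startDate'),
            })
            root.clear()  # drop already-processed elements to keep memory flat

    df = pd.DataFrame(records)
    df['startDate'] = pd.to_datetime(df['startDate'])
    df['value'] = pd.to_numeric(df['value'], errors='coerce')
    pq.write_table(pa.Table.from_pandas(df), 'health_data.parquet')

# parse_health_data_streaming('export.xml')

The record dicts still accumulate in memory, so for extreme sizes you could flush them to Parquet in chunks, but this already avoids materializing the XML tree itself.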

Step 2: The Secret Sauce — DuckDB Logic

Now for the magic. Instead of loading the entire Parquet file into RAM with Pandas, we query it directly using DuckDB. This allows us to perform complex aggregations (like heart rate variability vs. sleep quality) in milliseconds.

import duckdb

def get_heart_rate_summary():
    con = duckdb.connect(database=':memory:')
    # DuckDB can query Parquet files directly!
    res = con.execute("""
        SELECT 
            date_trunc('day', startDate) as day,
            avg(value) as avg_heart_rate,
            max(value) as max_heart_rate
        FROM 'health_data.parquet'
        WHERE type = 'HKQuantityTypeIdentifierHeartRate'
        GROUP BY 1
        ORDER BY 1 DESC
    """).df()
    return res

Step 3: Building the Streamlit UI 🎨

Streamlit makes it incredibly easy to visualize these SQL results. We can add sliders, date pickers, and interactive charts with just a few lines of code.

import streamlit as st
import plotly.express as px

st.set_page_config(page_title="Quantified Self Dashboard", layout="wide")

st.title("🏃‍♂️ My Quantified Self Dashboard")
st.markdown("Analyzing millions of health records with **DuckDB** speed.")

# Load data using our DuckDB function
df_hr = get_heart_rate_summary()

col1, col2 = st.columns(2)

with col1:
    st.subheader("Heart Rate Trends")
    fig = px.line(df_hr, x='day', y='avg_heart_rate', title="Average Daily Heart Rate")
    st.plotly_chart(fig, use_container_width=True)

with col2:
    st.subheader("Raw DuckDB Query Speed")
    st.code("""
    SELECT avg(value) FROM 'health_data.parquet' 
    WHERE type = 'HeartRate'
    """)
    st.success("Query executed in 0.002s")

The "Official" Way to Scale 🥑

While building locally is fun, production-grade data engineering requires more robust patterns. For those looking to take their data pipelines to the next level—handling multi-user environments, automated ingestion, or advanced machine learning on health metrics—check out the WellAlly Blog.

They provide excellent deep-dives into advanced architectural patterns and production-ready examples that go far beyond a local script. It’s been my go-to resource for refining my data stack!

Conclusion

By switching from raw XML parsing to a DuckDB + Parquet workflow, we’ve turned a sluggish data problem into a high-performance analytical tool. You no longer need a massive cluster to analyze your personal data; your laptop is more than enough when you use the right tools.

What are you tracking? Whether it's steps, sleep, or coding hours, let me know in the comments how you're visualizing your life! 👇

The Future of Go Network Programming: What's Next for Gophers?

2026-02-09 09:11:18

Hey Gophers! If you’re building APIs, microservices, or real-time apps with Go, you’re already riding a wave of simplicity and performance. Go’s concurrency model (goroutines FTW!) and robust net/http package make it a go-to for network programming. But the tech world doesn’t stand still—new protocols like HTTP/3, gRPC, and cloud-native trends are changing the game. Want to stay ahead? Let’s dive into the future of Go network programming, complete with code, tips, and lessons learned.

Who’s this for? Developers with 1-2 years of Go experience looking to level up their network programming skills. Whether you’re optimizing HTTP clients or exploring gRPC, this guide has you covered.

What’s coming? We’ll explore HTTP/3, gRPC, cloud-native architectures, new Go features, and a real-world case study. Plus, a peek at Go’s role in edge computing and WebAssembly. Let’s get started!

1. Why Go Shines (and Where It Struggles)

Go’s goroutines and standard library are like a superhero duo for network programming. Spinning up an HTTP server is as easy as:

package main

import (
    "fmt"
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, "Hello, Gophers!")
    })
    http.ListenAndServe(":8080", nil)
}

Why Go rocks:

  • Concurrency: Handle thousands of connections with goroutines.
  • Simplicity: Clean APIs for HTTP/1.1, HTTP/2, TCP, and UDP.
  • Cross-platform: Runs everywhere—Linux, macOS, Windows.

But it’s not perfect:

  • Connection pooling: Misconfigured http.Client can spike CPU usage.
  • New protocols: No native HTTP/3 or QUIC support (yet!).
  • Distributed systems: Service discovery and fault tolerance are tricky.

Quick win: Optimize connection pooling to boost performance. Here’s how:

package main

import (
    "log"
    "net/http"
    "time"
)

func createClient() *http.Client {
    return &http.Client{
        Transport: &http.Transport{
            MaxIdleConns:        100,
            MaxIdleConnsPerHost: 10,
            IdleConnTimeout:     90 * time.Second,
        },
        Timeout: 10 * time.Second,
    }
}

func main() {
    client := createClient()
    resp, err := client.Get("https://api.example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
}

Lesson learned: In a high-traffic API, forgetting MaxIdleConnsPerHost caused memory spikes. Setting it to 10 slashed resource usage by 25%. Always tune your http.Transport!

Segment 2: Future Trends

2. Hot Trends to Watch in Go Network Programming

The network programming landscape is evolving, and Go is keeping pace. Let’s break down three game-changers: HTTP/3, gRPC, and cloud-native architectures.

2.1 HTTP/3 and QUIC: The Speed Boost

Why care? HTTP/3, powered by QUIC (UDP-based), is like swapping a bicycle for a rocket. It cuts latency with 0-RTT handshakes and eliminates TCP’s head-of-line blocking. Go’s standard library doesn’t support QUIC yet, but quic-go is a solid community option.

Perks:

  • Faster connections with 0-RTT.
  • True multiplexing without blocking.
  • Seamless network switches (e.g., Wi-Fi to mobile).

Try it out with quic-go:

package main

import (
    "log"
    "net/http"
    "github.com/quic-go/quic-go/http3"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Hello, QUIC!"))
    })
    // Use a pointer: the http3.Server methods have pointer receivers.
    server := &http3.Server{Addr: ":443", Handler: mux}
    log.Fatal(server.ListenAndServeTLS("cert.pem", "key.pem"))
}

Pro tip: Use TLS 1.3 certificates (e.g., ECDSA SHA-256). A mismatched cert cost me hours of debugging in a real-time analytics project!

2.2 gRPC: Microservices Done Right

Why it’s awesome: gRPC is like a super-efficient courier for microservices, using HTTP/2 and Protocol Buffers. Go’s google.golang.org/grpc package supports streaming, interceptors, and load balancing—perfect for distributed systems.

Use case: Real-time apps or service-to-service communication.

Example: A bidirectional streaming gRPC service:

package main

import (
    "io"
    "log"
    "net"

    "google.golang.org/grpc"
    pb "path/to/your/protobuf/package"
)

type StreamServer struct {
    pb.UnimplementedStreamServiceServer
}

func (s *StreamServer) BidirectionalStream(stream pb.StreamService_BidirectionalStreamServer) error {
    for {
        msg, err := stream.Recv()
        if err == io.EOF {
            // Client finished sending: end the stream cleanly.
            return nil
        }
        if err != nil {
            return err
        }
        if err := stream.Send(&pb.StreamResponse{Data: "Echo: " + msg.Data}); err != nil {
            return err
        }
    }
}

func main() {
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatal(err)
    }
    s := grpc.NewServer()
    pb.RegisterStreamServiceServer(s, &StreamServer{})
    log.Fatal(s.Serve(lis))
}

Lesson learned: Unclosed streams caused memory leaks in a chat app. Use pprof to monitor and always terminate streams properly.

2.3 Cloud-Native and Service Meshes

Why it matters: Go powers tools like Kubernetes and Istio, making it a cloud-native superstar. Service meshes (e.g., Istio) handle service discovery, load balancing, and security automatically.

Example: A health-checked service for Istio:

package main

import (
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("OK"))
    })
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Welcome, Gophers!"))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Pro tip: Debug Istio timeouts with istioctl proxy-status. Misconfigured VirtualService rules once tanked my Kubernetes app—don’t skip the docs!

Segment 3: New Features and Best Practices

3. Go’s New Toys: Features and Tools

Go keeps getting better, and community tools like Fiber and Chi are game-changers. Let’s explore what’s new in Go 1.20+ and how these tools boost your projects.

3.1 Go 1.20+ Goodies

What’s new? Go 1.20+ brings better context handling and optimized net/http for connection pooling. These updates make timeout management and resource usage a breeze.

Example: Timeout handling with context:

package main

import (
    "context"
    "log"
    "net/http"
    "time"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, "GET", "https://api.example.com", nil)
    if err != nil {
        log.Fatal(err)
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
}

Takeaway: Always use defer cancel() to avoid goroutine leaks. I learned this the hard way when memory spiked in an API project—pprof saved the day!

3.2 Community Gems: Fiber and Chi

Fiber (built on fasthttp) is blazing fast, while Chi offers lightweight, modular routing. Both reduce boilerplate and boost productivity.

Example: A Fiber REST API:

package main

import "github.com/gofiber/fiber/v2"

func main() {
    app := fiber.New()
    app.Get("/api", func(c *fiber.Ctx) error {
        return c.JSON(fiber.Map{"message": "Hello, Fiber!"})
    })
    app.Listen(":3000")
}

Pro tip: Limit Fiber’s concurrency (e.g., fiber.New(fiber.Config{Concurrency: 10000})). Overloading middleware in an e-commerce API once crushed my performance—keep it lean!

4. Best Practices for Go Network Programming

Here are battle-tested tips to keep your Go services fast and reliable:

4.1 Connection Management

Problem: Creating new connections for every request kills performance.

Solution: Tune http.Transport:

transport := &http.Transport{
    MaxIdleConns:        100,
    MaxIdleConnsPerHost: 10,
    IdleConnTimeout:     90 * time.Second,
}
client := &http.Client{Transport: transport, Timeout: 10 * time.Second}

Win: In a payment gateway, this cut CPU usage by 30%.

4.2 Error Handling

Problem: Panics crash your app without warning.

Solution: Use middleware to catch errors:

package main

import (
    "log"
    "net/http"
)

func errorHandler(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        defer func() {
            if err := recover(); err != nil {
                log.Printf("Panic: %v", err)
                http.Error(w, "Oops!", http.StatusInternalServerError)
            }
        }()
        next.ServeHTTP(w, r)
    })
}

Tip: Pair with tools like Sentry for error tracking.

4.3 Performance Hacks

Problem: Memory allocation slows high-concurrency apps.

Solution: Use sync.Pool for buffer reuse:

import (
    "bytes"
    "sync"
)

var bufferPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

func processData(data string) string {
    buf := bufferPool.Get().(*bytes.Buffer)
    defer bufferPool.Put(buf)
    buf.Reset()
    buf.WriteString(data)
    return buf.String()
}

Win: In a logging service, this slashed GC time by 40%.

Segment 4: Case Study and Conclusion

5. Case Study: Building a High-Performance API

Let’s see these ideas in action with a product query API for an e-commerce platform. Goals: handle thousands of requests/second with <100ms latency.

Tech stack:

  • Fiber: Fast REST API.
  • gRPC: Backend communication.
  • Redis: Caching hot data.
  • Prometheus + Grafana: Monitoring.

Code snippet:

package main

import (
    "context"
    "github.com/gofiber/fiber/v2"
    "github.com/redis/go-redis/v9"
    "log"
    "time"
)

func main() {
    app := fiber.New()
    client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    app.Get("/data/:id", func(c *fiber.Ctx) error {
        id := c.Params("id")
        val, err := client.Get(context.Background(), id).Result()
        if err == redis.Nil {
            data, err := callGRPCService(id)
            if err != nil {
                return c.Status(500).SendString("Server Error")
            }
            client.Set(context.Background(), id, data, 3600*time.Second)
            return c.SendString(data)
        }
        return c.SendString(val)
    })

    app.Listen(":3000")
}

func callGRPCService(id string) (string, error) {
    return "Product: " + id, nil
}

Setup:

  • Deployment: Kubernetes + Istio for scaling.
  • Monitoring: Prometheus for metrics, Grafana for dashboards.
  • Fix: Redis timeouts were a pain—set DialTimeout=500ms and used redis.Ping.

Lesson: Monitor everything. Prometheus caught a latency spike I missed during testing.

6. What’s Next for Go?

Go’s future is bright! HTTP/3 and gRPC are slashing latency, while cloud-native tools like Istio simplify microservices. Looking ahead:

  • Edge computing: Go’s lightweight nature is perfect for IoT.
  • WebAssembly: Run Go in browsers for next-gen apps.
  • Community: Libraries like quic-go are growing fast.

Actionable tips:

  1. Play with quic-go and gRPC.
  2. Monitor with Prometheus.
  3. Use context to avoid leaks.
  4. Follow Fiber/Chi updates.

My take: Building a real-time chat app with gRPC and Istio was a game-changer—Go’s simplicity made it a joy. What’s your next Go project? Share in the comments!