
🧠 Hydration, Selective Hydration & Progressive Hydration Explained (React vs Vue/Nuxt vs Others)

2025-11-17 01:07:41

Hydration is one of the most important concepts in modern frontend frameworks, yet also one of the most misunderstood.

In 2025, frameworks like React 18+, Vue/Nuxt 3, SvelteKit, SolidStart, Qwik, and Astro all approach hydration differently — some progressively, some selectively, some partially, some never fully.

This post breaks it all down with real examples, SSR + SSG scenarios, and an extended example so the concept stays in your mind forever.

🌊 What Is Hydration?

When a page is server-side rendered (SSR) or statically generated (SSG), the browser receives plain HTML:

<div id="app">
  <button>Click me</button>
  <ul>
    <li>Item 1</li>
    <li>Item 2</li>
  </ul>
</div>

But HTML alone does nothing — no JavaScript, no event listeners.

Hydration is when the framework:

  • Loads JS bundles
  • Recreates the virtual component tree
  • Attaches event listeners
  • Activates reactivity
  • Makes the page interactive

Another way to say it:

SSR/SSG gives HTML.
Hydration gives life.

🟥 Old Hydration Model (Pre-React 18 / Classic Vue SSR)

Traditionally, hydration happened linearly, top-to-bottom:

1. Hydrate App()
2. Hydrate Header()
3. Hydrate Navbar()
4. Hydrate Sections()
5. Hydrate Footer()

Linear hydration has three big problems:

❌ 1. Slow Time-To-Interactive

The page can't be safely interacted with until hydration finishes.

❌ 2. Big JS bundle runs all at once

Running the entire bundle in one go blocks the main thread.

❌ 3. Offscreen components hydrate early

Even components the user hasn't scrolled to yet get hydrated.

This model works, but it’s outdated for large apps.

🟩 React 18+: Selective Hydration (Smarter, Not Linear)

React 18 replaced “hydrate everything in order” with:

Hydrate what’s needed, when it’s needed.
Skip or delay the rest.

React 18 automatically evaluates multiple signals:

  • Which components are visible
  • Which components are interactive
  • Which have pending Suspense data
  • Which are above-the-fold
  • Which the user interacts with first
  • Which are heavier or lower priority

Hydration priority:

1️⃣ Visible + interactive components
2️⃣ User-clicked or user-focused components
3️⃣ Above-the-fold content
4️⃣ Suspense boundaries that resolved
5️⃣ Below-the-fold or non-interactive UI
6️⃣ Everything else, when browser is idle

This is selective hydration — only hydrate the parts users need immediately.

🟦 How Suspense Affects Hydration

Consider:

<Suspense fallback={<LoadingUser />}>
  <UserProfile />
</Suspense>

React can stream HTML in chunks:

  • Server streams <LoadingUser /> first
  • Later streams <UserProfile /> when data resolves
  • Client hydrates these boundaries independently
  • Hydration can happen in the middle, out of order, or scattered

Suspense creates natural hydration islands.
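
Here's a minimal sketch of the streaming setup that enables this, assuming an Express-style server; the App component and bundle path are placeholders:

// server.tsx — stream HTML, flushing Suspense fallbacks first
// (assumes an Express-style `app` and `res` are in scope)
import { renderToPipeableStream } from "react-dom/server";
import App from "./App";

app.get("/", (req, res) => {
  const { pipe } = renderToPipeableStream(<App />, {
    bootstrapScripts: ["/main.js"], // client bundle that hydrates
    onShellReady() {
      // The shell (everything outside pending Suspense boundaries) is ready
      res.setHeader("Content-Type", "text/html");
      pipe(res);
    },
  });
});

// client.tsx — hydrateRoot enables selective hydration of streamed chunks
import { hydrateRoot } from "react-dom/client";
import App from "./App";

hydrateRoot(document.getElementById("root")!, <App />);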

🔥 BIG Real-World Example (SSR + Suspense + Complex Layout)

Imagine a dashboard:

+----------------------+
| Navbar               |
+----------------------+
| Sidebar | Chart      |
|         | (Suspense) |
|         |            |
+----------------------+

Server Output (HTML)

  • Navbar rendered immediately
  • Sidebar rendered immediately
  • Chart is slow → Suspense fallback is sent

Client sees:

Navbar (interactive instantly)
Sidebar (interactive instantly)
Chart loading skeleton…

Hydration (React 18):

1️⃣ User clicks sidebar → React hydrates sidebar immediately
2️⃣ Navbar is visible → hydrate early
3️⃣ Chart fallback hydrates next (lightweight)
4️⃣ Actual Chart component hydrates only when its data resolves
5️⃣ Footer hydrates when browser is idle

This is selective, scattered, priority-driven hydration.
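
A sketch of that dashboard, assuming a slow data source behind the Chart component (Navbar, Sidebar, ChartSkeleton, and Footer are illustrative components):

import { Suspense, lazy } from "react";

const Chart = lazy(() => import("./Chart")); // heavy, data-driven

export default function Dashboard() {
  return (
    <>
      <Navbar />   {/* streamed immediately, hydrated early */}
      <Sidebar />  {/* hydrates first if the user clicks it */}
      <Suspense fallback={<ChartSkeleton />}>
        <Chart />  {/* streams + hydrates when its data resolves */}
      </Suspense>
      <Footer />   {/* hydrates when the browser is idle */}
    </>
  );
}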

🟨 Is Progressive Hydration Needed in React?

React 18 already does a form of progressive hydration:

✔ Selective hydration
✔ Suspense boundary hydration
✔ Hydrate-on-interaction
✔ Hydrate-visible-first
✔ Hydrate background content later
✔ Hydrate streaming chunks as they arrive

React does not call it “progressive hydration,” but practically, it achieves that behavior.

🟩 SSG + Hydration — Does Hydration Still Happen?

Yes.

SSG also needs hydration because:

  • HTML is static
  • But JS must still attach event handlers
  • Framework re-runs components on client
  • Event listeners get connected
  • State becomes reactive

Example:
A static blog page still hydrates its "Like" button, comments, and other interactive parts.

SSG = static HTML + hydration on client.
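
As a sketch, hydrating just a like button inside otherwise static HTML (the element id and component are placeholders):

// The static HTML already contains: <div id="like-button">…server markup…</div>
import { hydrateRoot } from "react-dom/client";
import LikeButton from "./LikeButton";

// Recreates the component tree in memory and attaches event listeners
// to the existing DOM instead of rebuilding the markup from scratch.
hydrateRoot(document.getElementById("like-button")!, <LikeButton />);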

🔰 How Other Frameworks Handle Progressive Hydration

Now let's compare how different frameworks hydrate:

🟦 Vue 3 (Classic) — Linear Hydration

Vue's default hydration is still linear unless using advanced tooling.

Hydration flow:

1. Hydrate root instance
2. Hydrate children in DOM order
3. Hydrate nested components
4. Finally hydrate leaf nodes

Downsides:

  • Slower interactivity for large pages
  • No Suspense-driven hydration
  • No hydrate-on-interaction by default
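
For reference, classic Vue SSR hydration is triggered simply by mounting a createSSRApp instance over the server-rendered markup (a minimal sketch):

// entry-client.ts
import { createSSRApp } from "vue";
import App from "./App.vue";

// createSSRApp (instead of createApp) tells Vue the HTML inside #app was
// server-rendered, so mount() hydrates it in DOM order rather than
// replacing it.
const app = createSSRApp(App);
app.mount("#app");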

But this changes with Nuxt 3

🟩 Nuxt 3 — “Progressive, Island-Style” Hydration

Nuxt 3 introduced partial hydration capabilities, similar to islands architecture:

✔ <client-only>

The wrapped component is skipped during SSR and rendered only on the client:

<client-only>
  <FancyChart />
</client-only>

✔ Async Components + Suspense

Nuxt 3 supports Suspense-like behavior:

<Suspense>
  <UserProfile />
</Suspense>

This allows:

  • above-the-fold hydration first
  • async component hydration later
  • streamed HTML + partial hydration

✔ Nuxt + Nitro + Island Optimizations

Nuxt aggressively optimizes hydration for:

  • per-component hydration
  • hydration skipping for static parts
  • “hybrid island architecture”

While it's not as automatic as React's selective hydration,
Nuxt 3 can achieve progressive hydration with explicit patterns.
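
A sketch of those explicit patterns in a Nuxt 3 page: the Lazy prefix code-splits an auto-imported component, and <ClientOnly> keeps it out of SSR (FancyChart is a placeholder component):

<template>
  <!-- Server-rendered and hydrated normally -->
  <AppHeader />

  <!-- Skipped during SSR; the Lazy prefix makes Nuxt load it async -->
  <ClientOnly>
    <LazyFancyChart />
  </ClientOnly>
</template>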

🟧 SvelteKit — Progressive Hydration by Default

SvelteKit hydrates:

  • visible sections first
  • interactive sections first
  • lazy components when needed

Svelte also supports component-level hydration:

  • preload only what's visible
  • lazy load everything else
  • hydrate only when scrolled into view
  • hydrate on interaction

Very efficient because Svelte compiles away most framework runtime.
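
SvelteKit also exposes per-route control over hydration via page options; a minimal sketch for a purely static route:

// +page.js for a static, non-interactive route
// csr = false ships no JS for this page, so it is never hydrated.
export const csr = false;

// ssr stays on (the default): the page is still rendered to HTML on the server.
export const ssr = true;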

🟦 SolidStart — Fine-Grained Hydration

Solid uses reactive primitives and compiles away components.
Hydration happens:

  • per DOM node
  • per reactive signal
  • not per component

This makes hydration extremely granular.

SolidStart naturally does progressive + selective hydration.

🟣 Qwik — No Hydration (Resumes Instead)

Qwik doesn’t hydrate at all.

It resumes the app from HTML with embedded state.
Event listeners are attached lazily.

Progressive hydration is built-in because Qwik doesn't hydrate entire trees — it loads behavior on demand.
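
A minimal Qwik counter shows the model: the $ suffix marks lazy-loadable boundaries, so the handler's code is only fetched on first interaction:

import { component$, useSignal } from "@builder.io/qwik";

export const Counter = component$(() => {
  const count = useSignal(0);

  // onClick$ serializes a reference into the HTML; the handler's chunk
  // is downloaded and executed only when the user actually clicks.
  return (
    <button onClick$={() => count.value++}>
      Count: {count.value}
    </button>
  );
});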

🌑 Astro — Island Architecture (Hydrate Only What You Ask)

Astro hydrates components only when instructed:

<Counter client:load />
<Chart client:visible />
<Sidebar client:idle />

Modes:

  • client:load
  • client:visible
  • client:idle
  • client:media
  • client:hover

A pure island architecture.
Astro ships minimal JS for non-interactive content, making it best-in-class for performance.

🟩 Summary: How Frameworks Hydrate

Framework | Hydration Style | Notes
--- | --- | ---
React 18+ | Selective, priority-based, Suspense-aware | Automatic
Vue 3 (classic) | Linear | Simple but slower
Nuxt 3 | Partial / client-only / islands | Supports progressive hydration
SvelteKit | Progressive | Hydrates visible/interacted UI first
SolidStart | Fine-grained | Tiny hydration chunks
Qwik | No hydration (resumes) | Most advanced
Astro | Islands | Hydrate only what you choose

🎉 Final Thoughts

In modern SSR/SSG frameworks:

  • Hydration is the cost of interactivity
  • Linear hydration is slow
  • Progressive + selective hydration is the future
  • React 18+ achieves it automatically
  • Nuxt 3, SvelteKit, Qwik, Astro each do it differently
  • Suspense brings async-shaped hydration
  • Streaming SSR improves TTFB dramatically

If you understand hydration deeply,
you understand the foundation of modern web frameworks in 2025.

How to Embed a HubSpot Form in Next.js and React

2025-11-17 01:05:29

HubSpot forms are widely used for lead generation, newsletters, onboarding workflows, and marketing funnels.

But when developers try to embed a HubSpot form inside a Next.js project, the default embed snippet usually fails.

[Image: embedding a HubSpot form in Next.js and React]

Forms don’t load, scripts break, and SSR errors begin to appear.

Before implementing anything, it’s important to understand why this happens in a Next.js environment.

This article covers theory only — the actual working solution is linked at the end.

Why HubSpot Forms Don’t Work Directly in Next.js

HubSpot provides an embed script that works perfectly in standard HTML sites.

But Next.js renders pages on both the server and the client, causing conflicts.

Here’s why the embed fails:

🔹 1. HubSpot expects window

During SSR, window doesn’t exist — so HubSpot script breaks.

🔹 2. HubSpot injects HTML dynamically

It needs a target div in the DOM, which only exists after hydration.

🔹 3. Third-party scripts must run on the client side

If Next.js tries to execute the script on the server, errors appear.

In short:

HubSpot = browser code

Next.js = server + browser

→ The script must run only on the client.

The Actual Logic Behind the Fix (Theory Only)

Every working solution follows this exact formula:

✔ Load HubSpot’s script client-side only

Avoid SSR conflicts.

✔ Create a target container like <div id="hubspotForm"></div>

HubSpot needs a mount point.

✔ Initialize the form after the script loads

Timing matters because HubSpot injects content dynamically.

Once you understand this, embedding HubSpot in Next.js becomes straightforward.
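
As a theory-level sketch of that formula (not the full guide's code), assuming HubSpot's standard v2 embed loader; the portal and form IDs are placeholders:

"use client"; // App Router: make sure this component runs client-side only

import { useEffect } from "react";

export default function HubspotFormSketch() {
  useEffect(() => {
    // 1. Load HubSpot's script client-side only (window exists here)
    const script = document.createElement("script");
    script.src = "https://js.hsforms.net/forms/embed/v2.js";
    script.async = true;

    // 3. Initialize the form only after the script has loaded
    script.onload = () => {
      (window as any).hbspt?.forms.create({
        portalId: "YOUR_PORTAL_ID", // placeholder
        formId: "YOUR_FORM_ID",     // placeholder
        target: "#hubspotForm",
      });
    };

    document.body.appendChild(script);
    return () => script.remove();
  }, []);

  // 2. The target container HubSpot injects the form into
  return <div id="hubspotForm" />;
}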

Want the Full Working Code?

For developers looking for the full solution, including:

  • reusable <HubspotForm /> component
  • safe client-side script loading
  • error-free initialization
  • proper Next.js setup
  • complete copy-paste example

the full guide is here:

How to Embed HubSpot Form in Next.js — Full Working Code


Final Thoughts

Embedding a HubSpot form in Next.js is easy once you understand one simple rule:

The embed script must run only in the browser — never during SSR.

To get the exact working code and final implementation, read the full guide:

👉 Read the Complete Guide

Copy → Paste → Your HubSpot form works instantly.

I Got My First Dev Job After 6 Months of Rejections - Here's What Actually Worked

2025-11-17 01:03:30


Hey,

Just accepted my first junior dev position and wanted to share what actually got me there, because it wasn't what everyone on Reddit tells you.

My "Perfect" Resume That Nobody Wanted

CS degree, internships, active GitHub, LeetCode grind. Had all the boxes checked. Still got ghosted by 95% of companies for months.

The breaking point was people telling me "AI will replace you anyway" while I'm literally shipping code every day. Felt insane.

What Actually Changed

Honestly, I didn't suddenly get better at coding. Three things shifted:

Stopped spray-and-pray applications. Used to send 20+ apps/week with the same resume. Started doing 5/week but actually researched each company - matched my resume to their tech stack, referenced their products in cover letters. Quality over desperation.

This is why I built Woberry initially - needed something to help me track applications and auto-generate tailored cover letters without losing the personal touch. It's at ~$2K MRR now.

Built stuff I actually needed. Everyone says "do side projects" but here's what matters - solve your own problems. In interviews, I wasn't talking about tutorial apps. I was explaining real users, revenue, bugs I'd fixed. That hits different.

Right now I'm building a fitness app because every tracking app out there feels bloated or has a terrible UX. Just want something simple that works. In interviews, employers loved hearing about this kind of thinking - spotting gaps and just building the solution.

Gave up trying to impress people. The interview I got hired from? Went in expecting nothing, just talked normal. No rehearsed answers, just honest conversation. Apparently juniors who can communicate clearly are rarer than you'd think.

The Weird Part

Starting in November but keeping Woberry running. Part of me wonders if I should've gone full indie, but honestly the stability means I can build without financial panic. Plus I'll get Spring Boot production experience which I can't replicate solo.

If You're Still In It

Market's brutal and it's not your fault. But volume won't save you - quality and being real will.

Build something, even if small. Your side project might become your backup plan. Mine did.

If you're drowning in tracking applications and writing cover letters, that's literally why Woberry exists - built it because I needed it. Also made ResumeFast when I just needed a quick resume without all the extra features.

You're probably closer than you think.

--

https://www.woberry.com/ and https://www.resumefast.io/ if you want to check them out

🌐 How to Choose the Best Domain

2025-11-17 00:54:06

A domain is more than just an address — it’s your brand, identity, and the first impression users get when they find you online.

Choosing the right one can feel stressful, but with a bit of strategy, it's totally doable.

Here’s how to pick a domain that looks professional, sounds clean, and works long-term.

🔎 1. Keep it short and memorable

Short domains are easier to type, share, and remember.

Good examples:

  • stripe.com
  • github.com
  • sx.dev

Avoid long phrases, hyphens, or weird spellings.

If someone can’t repeat your domain after hearing it once, it’s too complicated.

🧠 2. Make it meaningful

Your domain should reflect your project, team, or brand idea.

Ask yourself:

  • What does my product do?
  • What feeling or concept should the domain communicate?
  • Would a stranger understand something from the name alone?

Meaning increases trust.

🌍 3. Choose the right TLD (.com, .dev, .io, etc.)

The TLD (Top-Level Domain) matters more than people think.

Popular options:

  • .com — universal, trusted, business-friendly
  • .dev — perfect for developers, tech, APIs
  • .io — trendy, startup vibe
  • .ai — ideal for AI/ML projects
  • .app — for applications and mobile products

Choose the one that matches your project’s identity and audience.

🛡️ 4. Check trademarks and legal issues

You don’t want to build a brand and then get a legal notice.

Before buying:

  • search the name on Google
  • check trademark databases
  • make sure no big brand is already using it

If the domain is too similar to an existing company, avoid it.

🔒 5. Make sure social media handles are available

Consistency across platforms = strong branding.

Check availability on:

  • GitHub
  • TikTok
  • Instagram
  • Twitter (X)
  • LinkedIn

If you can secure the same name everywhere, perfect.

⚡ 6. Prioritize SEO-friendly naming

Google loves clarity.

A good domain:

  • hints at what the site is about
  • uses real words or simple combinations
  • avoids complex numbers or symbols

Example:

codeacademy.dev is better than xzy-tools1337.net

🏷️ 7. Avoid hard-to-spell names

If users often misspell it, you’ll lose traffic.

Test it:

Say the name to a friend and ask them to write it down. If they struggle, rethink it.

💡 8. Think long-term

Choose a domain that will still make sense in:

  • 1 year
  • 5 years
  • 10 years

Avoid names tied to a single feature or trend if you plan to grow your project.

🚀 Final Thoughts

Choosing a domain isn’t just picking a name — it’s shaping your digital identity.

Aim for something simple, meaningful, scalable, and consistent across the web.

The perfect domain:

  • looks clean
  • sounds good
  • is easy to remember
  • fits the project
  • and is legally safe

A strong domain is a long-term investment in your brand and your future.

Toon: A Lightweight Data Format That Helps Cut LLM Token Costs

2025-11-17 00:53:20

When working with LLMs, even small details—like how you format your data—can add up to noticeable differences in cost and performance. One of the new formats people have been experimenting with is called Toon (Token-Oriented Object Notation), and it's gaining attention because of how compact it is. It conveys structured information like JSON or XML, but with fewer characters, which usually means fewer tokens.

This doesn’t replace established formats. JSON and XML are excellent for APIs, external integrations, and strict data handling. Toon simply offers a lighter alternative specifically for situations where data is being processed inside an LLM prompt or response, where size matters more than strict formal structure.

Below, I’ll walk through what Toon looks like, how to write it, how to create arrays and lists, how nesting works, and how it compares to JSON and XML—using straightforward language so the concepts click easily.

What Is Toon, in Simple Terms?

Toon is a compact way of writing structured information.
If JSON and XML aim for full clarity and standardization, Toon aims for minimal overhead. It focuses on giving the model the data it needs without the extra symbols that traditional formats include for compatibility with programming languages and parsers.

A basic Toon object looks like this:

name:Luna;age:3;color:silver

No quotes, no commas, no braces around the whole thing.
Still understandable, still structured—just lighter.

How to Write Toon Data

Here’s a breakdown of the different building blocks.

1. Basic Toon “objects”

A Toon object is simply a sequence of key:value pairs separated by semicolons:

name:Luna;age:3;color:silver

If a value contains spaces, wrap it in parentheses so it stays together:

title:(Chief Snack Manager)

That’s all you need for a standard object.

2. Toon Arrays

Arrays in Toon use square brackets and separate items using the pipe symbol |:

pets:[cat|dog|ferret]

More complex items can also be placed inside an array:

tasks:[name:clean;time:10 | name:feed;time:5]

Each item can be its own structured object.

3. Toon Lists

Toon also supports lists, which behave like arrays but preserve order more explicitly and allow repeated values without any ambiguity.

Lists use angle brackets:

shopping:<milk|eggs|eggs|bread>

Use lists when the exact sequence matters or when duplicates are intentional.

4. Nested Toon Structures

Toon allows nesting using curly braces {}:

user:{name:Luna;stats:{speed:9;stealth:10}}

This keeps nested relationships clear while still avoiding most of the bulk found in JSON or XML.

Toon vs JSON vs XML: What’s the Difference?

All three formats serve a purpose, but they’re shaped by different goals.

XML

XML is very explicit.
It prioritizes structure, clarity, and machine-verified consistency. That’s why it uses opening and closing tags:

<cat>
  <name>Luna</name>
</cat>

Great for document-like data and environments that require strict validation.

JSON

JSON is lighter than XML and is widely used in web APIs:

{ "name": "Luna", "age": 3 }

It’s familiar, readable, and supported everywhere—but it still includes quotes, commas, and braces that add up in token-based contexts.

Toon

Toon takes a different approach. It focuses on reducing the number of characters used to express the same information:

name:Luna;age:3

It keeps things understandable while minimizing overhead.
This makes it practical when your main target is an LLM rather than an external system or parser.

Simple Comparison

Feature | XML | JSON | Toon
--- | --- | --- | ---
Typical size | Largest | Medium | Smallest
Human-readable | Yes, but verbose | Yes | Yes
Best use case | Document standards, external systems | Web APIs, broad app support | LLM prompts and responses
Token usage | High | Medium | Low

Each has strengths; Toon is simply optimized for a different environment.

Why Toon Reduces Token Costs (Clear Example)

Let’s compare the same data in JSON and Toon.

JSON version:

{
  "name": "Luna",
  "age": 3,
  "color": "silver"
}

This includes:

  • 2 curly braces
  • 10 quotation marks
  • 2 commas
  • Extra whitespace
  • Keys wrapped in quotes

These all become individual tokens.
A short object like this often lands around 24–28 tokens.

Toon version:

name:Luna;age:3;color:silver

Far fewer symbols: no quotes, no commas, no braces.
This usually ends up around 10–12 tokens.

Scaling the Example

If you had 100 objects of this shape:

  • JSON: ~25 tokens × 100 = 2500 tokens
  • Toon: ~11 tokens × 100 = 1100 tokens

You save about 1400 tokens just by changing the format.

For large prompt-based systems, tool outputs, or inline metadata inside LLM workflows, this can noticeably reduce costs over time.

When Toon Makes Sense (and When It Doesn’t)

Use Toon When:

  • You’re passing structured info into an LLM via a prompt.
  • You need consistent data with minimal token count.
  • You’re building classification, extraction, or reasoning workflows where structure matters but full syntactic formality doesn't.
  • You want to shrink big chunks of repeated data.

Avoid Toon When:

  • The data is part of a public API.
  • You need schema validation or strict typing.
  • You’re sharing the data with external systems that expect JSON or XML.
  • Programmers need long-term maintainability outside LLM-based tools.

Toon isn’t trying to replace the established formats—it’s just optimised for a different environment.

Comparing Token Usage: JSON vs TOON

To understand how much TOON can help reduce LLM prompt costs, we can run a simple token-comparison test using Node.js, TypeScript, and OpenAI’s tiktoken tokenizer.

TOON doesn’t try to replace JSON — JSON is still the best for APIs and data interchange — but inside LLM prompts, the extra characters in JSON (quotes, braces, commas, whitespace) add up quickly.
TOON removes most of that, which makes token usage noticeably smaller.

Below is a working script to compare token usage and calculate efficiency.

Token Comparison Script (Node.js + TypeScript)

This script:

  • Converts JSON → TOON
  • Counts tokens for both
  • Prints the percentage savings
  • Shows you exactly how compact TOON is

compareTokens.ts

import { encoding_for_model } from "tiktoken";

const encoder = encoding_for_model("gpt-4o-mini");

// --- Convert JSON → TOON ---
function jsonToToon(obj: any): string {
  if (Array.isArray(obj)) {
    return `[${obj.map(jsonToToon).join("|")}]`;
  }

  if (typeof obj === "object" && obj !== null) {
    return Object.entries(obj)
      .map(([key, value]) => {
        if (typeof value === "string" && value.includes(" ")) {
          return `${key}:(${value})`;
        } else if (Array.isArray(value)) {
          // Arrays bring their own brackets: pets:[cat|dog]
          return `${key}:${jsonToToon(value)}`;
        } else if (typeof value === "object" && value !== null) {
          // Nested objects are wrapped in braces: stats:{speed:9;stealth:10}
          return `${key}:{${jsonToToon(value)}}`;
        }
        return `${key}:${value}`;
      })
      .join(";");
  }

  return String(obj);
}

// --- Count tokens ---
function countTokens(text: string): number {
  return encoder.encode(text).length;
}

// Example JSON
const data = {
  name: "Luna",
  age: 3,
  color: "silver",
  stats: { speed: 9, stealth: 10 },
  pets: ["cat", "dog"]
};

const jsonStr = JSON.stringify(data);
const toonStr = jsonToToon(data);

const jsonTokens = countTokens(jsonStr);
const toonTokens = countTokens(toonStr);

const savings = jsonTokens - toonTokens;
const percentage = ((savings / jsonTokens) * 100).toFixed(2);

console.log("JSON:", jsonStr);
console.log("TOON:", toonStr);

console.log("\nJSON Tokens:", jsonTokens);
console.log("TOON Tokens:", toonTokens);

console.log("\nToken Savings:", savings);
console.log("Efficiency:", percentage + "%");

// Release the tokenizer's WASM memory when done
encoder.free();

Example Output (Based on the Script)

Here’s what results typically look like when comparing the same dataset:

Format | Token Count | Notes
--- | --- | ---
JSON | 26 tokens | Includes braces, commas, quotes
TOON | 11 tokens | Much smaller, minimal syntax
Savings | 15 tokens | Fewer characters used
Efficiency | 57.6% reduction | Under half the original cost

Interpretation

This means TOON uses ~58% fewer tokens than JSON for the same information.
Depending on your LLM pricing, a savings like this accumulates dramatically when you’re working with:

  • RAG datasets
  • Repeating metadata
  • Tool outputs
  • Multi-step reasoning prompts
  • Bulk classification tasks

Even a difference of ~15 tokens per object becomes thousands of saved tokens across large inputs.

JSON → TOON Conversion Function (Standalone Version)

You may want the converter separately:

function jsonToToon(data) {
  if (Array.isArray(data)) {
    return `[${data.map(jsonToToon).join("|")}]`;
  }

  if (typeof data === "object" && data !== null) {
    return Object.entries(data)
      .map(([key, value]) => {
        if (typeof value === "string" && value.includes(" ")) {
          return `${key}:(${value})`;
        } else if (Array.isArray(value)) {
          // Arrays bring their own brackets, so no extra braces
          return `${key}:${jsonToToon(value)}`;
        } else if (typeof value === "object" && value !== null) {
          return `${key}:{${jsonToToon(value)}}`;
        }
        return `${key}:${value}`;
      })
      .join(";");
  }

  return String(data);
}

Measuring Efficiency Across Larger Datasets

You can also test average efficiency across multiple objects:

// Reuses countTokens() and jsonToToon() from the script above
function compareDataset(dataset: any[]) {
  let totalJSON = 0;
  let totalTOON = 0;

  for (const item of dataset) {
    totalJSON += countTokens(JSON.stringify(item));
    totalTOON += countTokens(jsonToToon(item));
  }

  return {
    totalJSON,
    totalTOON,
    savings: totalJSON - totalTOON,
    efficiency: ((totalJSON - totalTOON) / totalJSON * 100).toFixed(2) + "%"
  };
}

Use this to benchmark real-world data and see consistent savings.
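
For example, a quick synthetic benchmark (the object shape and dataset size here are arbitrary):

// Hypothetical usage: 1,000 objects of the same shape
const dataset = Array.from({ length: 1000 }, (_, i) => ({
  name: `user_${i}`,
  age: i % 80,
  color: "silver",
}));

console.log(compareDataset(dataset));
// → { totalJSON: ..., totalTOON: ..., savings: ..., efficiency: "...%" }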

Final Thoughts

Toon is a lightweight, practical format that fits well into LLM-focused pipelines. It keeps structure clear but trims away most of the characters that increase token count. JSON and XML still dominate traditional software systems, and they should—they’re reliable and standardised.
But when your goal is to communicate structured data inside an LLM prompt as efficiently as possible, Toon offers a noticeably smaller, cleaner alternative.

[Apache Iceberg] Iceberg Performance: The Hidden Cost of NULLS FIRST

2025-11-17 00:47:45

Introduction

Apache Iceberg is a widely used table format in Data Lakehouse architectures. It provides flexibility in how data is written, with two key optimizations: partitioning, which splits data into segments, and sorting, which reorders data within those segments. These optimizations can significantly reduce the amount of data scanned by query engines, ultimately boosting query performance.

When querying data with high-cardinality columns (e.g., IDs or serial numbers), quickly filtering out unnecessary values is crucial. Sorting becomes particularly valuable in these scenarios. The rationale is simple: if data is written in order, query engines can rapidly locate the needed data rather than performing a full table scan and discarding irrelevant rows.

When configuring Iceberg table sort properties, engineers can specify ascending or descending order and, crucially, where null values are placed (NULLS FIRST is the default for ascending sorts). While reading about this configuration, a question came to mind: is there any performance difference between placing nulls first and placing them last? If so, which one performs better, and why? To answer these questions, I designed an experiment to find out.
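
For context, this is how a table's write sort order (including null placement) is typically set in Spark SQL with the Iceberg extensions enabled; the table and column names below are placeholders matching the experiment:

-- Spark SQL with the Iceberg SQL extensions enabled
ALTER TABLE local.db.test_nulls_first
WRITE ORDERED BY value ASC NULLS FIRST;

ALTER TABLE local.db.test_nulls_last
WRITE ORDERED BY value ASC NULLS LAST;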

Experiment

Detailed code and performance analysis can be found in my repo: https://github.com/CuteChuanChuan/Dive-Into-Iceberg

Testing Materials

  • Generated 1,000,000 rows with 30% null values

  • Created two identically configured Iceberg tables with different null sorting orders (i.e., NULLS FIRST vs. NULLS LAST)

Queries Executed to Evaluate Performance

  • select count(*) from table where value is not null

  • select sum(value) from table where value is not null

  • select avg(value) from table where value is not null

  • select count(*) from table where value is null

  • select count(*) from table

Performance Evaluation Metrics

  • Query plan: Whether different sorting orders generate different execution plans

  • Execution time with statistical analysis: Overall query time comparison

  • CPU profiling: Detailed CPU usage analysis

Findings

To obtain a complete picture, I planned to conduct three types of analysis. First, I compared query plans to see whether different null placements generate different plans, which might influence query performance. Second, I conducted statistical analysis on execution times for rigorous examination. Since query time differences are the observable outcome, we need to identify the root cause if significant differences exist. Therefore, if statistical significance is found, CPU profiling will be conducted in the final phase.

Query Plan

Details

select count(*) from table where value is not null

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1557]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnotnull(value#508)
               +- BatchScan local.db.test_nulls_first[value#508] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1574]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnotnull(value#521)
               +- BatchScan local.db.test_nulls_last[value#521] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

select sum(value) from table where value is not null

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#886)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3045]
      +- HashAggregate(keys=[], functions=[partial_sum(value#886)])
         +- Filter isnotnull(value#886)
            +- BatchScan local.db.test_nulls_first[value#886] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#899)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3064]
      +- HashAggregate(keys=[], functions=[partial_sum(value#899)])
         +- Filter isnotnull(value#899)
            +- BatchScan local.db.test_nulls_last[value#899] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

select avg(value) from table where value is not null

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1264)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4535]
      +- HashAggregate(keys=[], functions=[partial_avg(value#1264)])
         +- Filter isnotnull(value#1264)
            +- BatchScan local.db.test_nulls_first[value#1264] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1279)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4554]
      +- HashAggregate(keys=[], functions=[partial_avg(value#1279)])
         +- Filter isnotnull(value#1279)
            +- BatchScan local.db.test_nulls_last[value#1279] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

select count(*) from table where value is null

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6023]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnull(value#1646)
               +- BatchScan local.db.test_nulls_first[value#1646] local.db.test_nulls_first (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6040]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnull(value#1659)
               +- BatchScan local.db.test_nulls_last[value#1659] local.db.test_nulls_last (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []

select count(*) from table

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1895L)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7045]
      +- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1895L)])
         +- Project [count(*)#1896L AS agg_func_0#1895L]
            +- LocalTableScan [count(*)#1896L]

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1904L)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7060]
      +- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1904L)])
         +- Project [count(*)#1905L AS agg_func_0#1904L]
            +- LocalTableScan [count(*)#1905L]

Conclusion

For both tables, the execution plans for all queries are identical.

File-Level Statistics Analysis

Although the query plans are the same, a deeper look at the Parquet file statistics reveals important differences in how data is physically organized.

Partition Statistics Comparison

Below are the min/max statistics for each partition in both configurations:

Partition | NULLS FIRST | NULLS LAST | Min Value Difference
--- | --- | --- | ---
cat_0-2 | All nulls | All nulls | N/A
cat_3 | min=103, max=993 | min=103, max=993 | Same
cat_4 | min=4, max=994 | min=4, max=994 | Same
cat_5 | min=405, max=995 | min=355, max=995 | -50
cat_6 | min=106, max=996 | min=6, max=996 | -100
cat_7 | min=517, max=997 | min=487, max=997 | -30
cat_8 | min=228, max=998 | min=208, max=998 | -20
cat_9 | min=619, max=999 | min=609, max=999 | -10

Why Are Statistics Different?

The different min/max values reveal that physical data layout differs between the two configurations:

  1. Different File Boundaries: When sorting with NULLS FIRST vs. NULLS LAST, Spark writes data in different orders, causing file splits to occur at different points. Even though both tables contain identical data, the way rows are distributed across files differs.

  2. File Organization Pattern:

  • NULLS FIRST: Files begin with null values, followed by non-null values. The minimum non-null value appears after skipping nulls within each file.

  • NULLS LAST: Files begin with non-null values immediately. The minimum value is at or near the start of the file.

  3. Metadata Quality: NULLS LAST produces "better" statistics for non-null queries:

  • In NULLS FIRST (e.g., cat_6): min=106 means the file starts with nulls, and 106 is the first non-null value encountered.

  • In NULLS LAST (e.g., cat_6): min=6 means the file immediately starts with value 6, providing more accurate bounds.

Impact on Query Execution

For queries with WHERE value IS NOT NULL:

NULLS FIRST:

  • Files contain nulls at the beginning, causing mixed value distribution

  • Query engine must scan through null values before reaching non-null data

  • Statistics indicate the presence of non-null values, but they're not immediately accessible

NULLS LAST:

  • Files with non-null data have those values at the beginning

  • Query engine can immediately start processing valid values

  • Better sequential access pattern for counting non-null values

This file-level organization difference, combined with CPU microarchitecture optimizations, explains why NULLS LAST performs better for counting non-null values even though logical query plans are identical.

Execution Time Analysis

Data Collection

  • 5 queries, each executed 100 times

Statistical Methods

  • T-test: Compare whether query times are statistically different

  • Cohen's d: Calculate the effect size of null ordering settings

Details

select count(*) from table where value is not null: Null Last performs better

Descriptive Statistics:
  NULLS FIRST: mean=41.46ms, sd=8.38ms
  NULLS LAST:  mean=31.55ms, sd=2.40ms

Paired t-test:
  t-statistic = 11.9367 
  p-value = 0.000000 
  95% CI: [8.26, 11.55] ms
  Result: *** HIGHLY SIGNIFICANT (p < 0.001)

Effect Size (Cohen's d):
  d = 1.1937 
  Interpretation: Large 

Summary:
  Mean difference: 9.91 ms
  Percentage difference: 23.90 %
  Winner: NULLS LAST

select sum(value) from table where value is not null: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=34.14ms, sd=5.12ms
  NULLS LAST:  mean=33.40ms, sd=6.43ms

Paired t-test:
  t-statistic = 0.8759 
  p-value = 0.383195 
  95% CI: [-0.94, 2.43] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

select avg(value) from table where value is not null: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=28.84ms, sd=3.42ms
  NULLS LAST:  mean=27.95ms, sd=3.26ms

Paired t-test:
  t-statistic = 1.9654 
  p-value = 0.052165 
  95% CI: [-0.01, 1.80] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

select count(*) from table where value is null: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=24.00ms, sd=4.64ms
  NULLS LAST:  mean=23.16ms, sd=3.43ms

Paired t-test:
  t-statistic = 1.3804 
  p-value = 0.170582 
  95% CI: [-0.37, 2.05] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

select count(*) from table: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=14.95ms, sd=2.41ms
  NULLS LAST:  mean=14.39ms, sd=2.45ms

Paired t-test:
  t-statistic = 1.6356 
  p-value = 0.105090 
  95% CI: [-0.12, 1.25] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

Conclusion

NULLS LAST is significantly faster than NULLS FIRST when counting non-null values.

CPU Profiling: Analyzing Count Non-Null Values Query

Details

Please refer to the flame graphs in my repo.

The performance difference observed in execution time analysis can be attributed to both file-level organization and CPU microarchitecture optimizations:

  1. File-Level Organization Impact: As shown in the file statistics analysis, NULLS LAST creates files where non-null values are positioned at the beginning. This layout means when the query engine scans data with WHERE value IS NOT NULL, it immediately encounters a continuous block of valid values rather than having to skip over nulls first. This reduces unnecessary I/O operations and deserialization overhead.

  2. CPU Microarchitecture Optimizations:

    1. SIMD (Single Instruction, Multiple Data): Modern CPUs can process multiple data elements simultaneously using SIMD instructions. When counting non-null values with NULLS LAST, the query engine encounters a continuous block of non-null values at the start of each file. This layout allows SIMD instructions to efficiently process multiple valid values in parallel. For example, when checking isnotnull(value) on 8 consecutive values that are all non-null, a single SIMD instruction can validate and count them in one operation.
    2. Branch Prediction: Modern CPUs use branch predictors to anticipate the outcome of conditional statements (like if (value != null)). With NULLS LAST, the query engine scans data following a highly predictable pattern: a long sequence of non-null values followed by nulls. This consistency allows the branch predictor to achieve high accuracy, keeping the CPU pipeline running smoothly. In contrast, NULLS FIRST presents a less predictable pattern at file boundaries where nulls transition to non-nulls, potentially causing pipeline stalls.

The CPU profiling data supports these optimizations: NULLS LAST (2,238 samples) uses approximately 11.7% less CPU time than NULLS FIRST (2,536 samples). This reduction results from the combined effects of better file organization, improved SIMD vectorization, and enhanced branch prediction accuracy.

Conclusion

NULLS LAST occupies less CPU time due to a combination of better file-level data organization and CPU microarchitecture optimizations.

Conclusion and Future Exploration

This exploration reveals that while different null value placements do not create different query plans, they significantly impact query performance through physical data organization.

Key Findings:

  1. File-Level Statistics Matter: NULLS LAST produces better min/max statistics, with non-null values positioned at file beginnings. This creates more favorable data layouts for queries filtering on non-null values.

  2. CPU Microarchitecture Synergy: The continuous blocks of non-null values in NULLS LAST enable CPU optimizations including SIMD vectorization and improved branch prediction, resulting in ~11.7% less CPU time.

  3. Significant Performance Impact: For SELECT COUNT(*) WHERE value IS NOT NULL, NULLS LAST achieves 23.90% faster execution time—a substantial improvement for such a common OLAP operation.

Practical Recommendations:

If counting non-null values is a frequent operation in your workload—which is common in OLAP scenarios—configuring Iceberg tables with NULLS LAST can provide measurable performance improvements. The benefits stem from both better file organization and CPU-level optimizations working in tandem.

Future Exploration:

This experiment tested 5 queries on a 1-million-row dataset with 30% null values. Future investigations could explore:

  • Various query patterns frequently used in OLAP scenarios (e.g., window functions like LAG, complex aggregations)

  • Larger datasets with multiple files per partition to amplify metadata pruning effects

  • Different null percentage distributions (10%, 50%, 70%) to understand the threshold where NULLS LAST benefits diminish

  • Impact on different data types (strings, decimals) and column cardinalities

  • Performance with Iceberg's metadata-based filtering in more complex predicates

These investigations would provide a more complete understanding of optimal Iceberg table sorting configurations across diverse workloads.