2025-11-17 01:07:41
Hydration is one of the most important concepts in modern frontend frameworks, yet also one of the most misunderstood.
In 2025, frameworks like React 18+, Vue/Nuxt 3, SvelteKit, SolidStart, Qwik, and Astro all approach hydration differently — some progressively, some selectively, some partially, some never fully.
This post breaks it all down with real examples, SSR + SSG scenarios, and an extended example so the concept stays in your mind forever.
When a page is SSR or SSG, the browser gets plain HTML:
<div id="app">
<button>Click me</button>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</div>
But HTML alone does nothing — no JavaScript, no event listeners.
Hydration is when the framework takes that static HTML and brings it to life: it attaches event listeners, rebuilds its internal component tree, and binds client-side state to the DOM the server already rendered.
Another way to say it:
SSR/SSG gives HTML.
Hydration gives life.
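In React terms, that "bringing to life" step is a single client-side call. Here's a minimal sketch (the App component and file layout are assumptions; App must render the same markup the server sent):

// client.tsx: minimal hydration sketch; App is assumed to match the server HTML
import { hydrateRoot } from "react-dom/client";
import { App } from "./App";

// Reuse the server-rendered DOM inside #app instead of re-creating it,
// attaching event listeners and component state as React walks the tree.
hydrateRoot(document.getElementById("app")!, <App />);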
Traditionally, hydration happened linearly, top-to-bottom:
1. Hydrate App()
2. Hydrate Header()
3. Hydrate Navbar()
4. Hydrate Sections()
5. Hydrate Footer()
Linear hydration has three big problems:
1. Slow time-to-interactive, because hydration must finish before the user can interact safely.
2. It blocks the main thread.
3. It hydrates everything, even parts not visible yet.
This model works, but it’s outdated for large apps.
React 18 replaced “hydrate everything in order” with:
Hydrate what’s needed, when it’s needed.
Skip or delay the rest.
React evaluates multiple signals:
1️⃣ Visible + interactive components
2️⃣ User-clicked or user-focused components
3️⃣ Above-the-fold content
4️⃣ Suspense boundaries that resolved
5️⃣ Below-the-fold or non-interactive UI
6️⃣ Everything else, when browser is idle
This is selective hydration — only hydrate the parts users need immediately.
Consider:
<Suspense fallback={<LoadingUser />}>
<UserProfile />
</Suspense>
React can stream HTML in chunks:
- <LoadingUser /> first
- <UserProfile /> when its data resolves

Suspense creates natural hydration islands.
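On the server side, a minimal sketch of this streaming setup could use React 18's renderToPipeableStream (the Express wiring and file names are assumptions, not part of the original example):

// server.tsx: streaming-SSR sketch; App is assumed to contain the Suspense boundary above
import express from "express";
import { renderToPipeableStream } from "react-dom/server";
import { App } from "./App";

const app = express();

app.get("/", (_req, res) => {
  const { pipe } = renderToPipeableStream(<App />, {
    // Fires once everything outside Suspense boundaries is ready:
    // the <LoadingUser /> fallback streams now, <UserProfile /> streams later.
    onShellReady() {
      res.setHeader("Content-Type", "text/html");
      pipe(res);
    },
  });
});

app.listen(3000);

On the client, hydrateRoot then attaches to the streamed HTML and hydrates each boundary as its chunk arrives.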
Imagine a dashboard:
+----------------------+
|        Navbar        |
+----------------------+
| Sidebar |   Chart    |
|         | (Suspense) |
|         |            |
+----------------------+
Client sees:
Navbar (interactive instantly)
Sidebar (interactive instantly)
Chart loading skeleton…
1️⃣ User clicks sidebar → React hydrates sidebar immediately
2️⃣ Navbar is visible → hydrate early
3️⃣ Chart fallback hydrates next (lightweight)
4️⃣ Actual Chart component hydrates only when its data resolves
5️⃣ Footer hydrates when browser is idle
This is selective, scattered, priority-driven hydration.
React 18 already does a form of progressive hydration:
✔ Selective hydration
✔ Suspense boundary hydration
✔ Hydrate-on-interaction
✔ Hydrate-visible-first
✔ Hydrate background content later
✔ Hydrate streaming chunks as they arrive
React does not call it “progressive hydration,” but practically, it achieves that behavior.
Does SSG need hydration too? Yes.
SSG also needs hydration, because static HTML ships without event listeners; any interactive widget still needs client-side JavaScript.
Example:
A static blog page still hydrates “Like Button”, comments, interactive parts.
SSG = static HTML + hydration on client.
Now let's compare how different frameworks hydrate:
Vue's default hydration is still linear unless using advanced tooling.
1. Hydrate root instance
2. Hydrate children in DOM order
3. Hydrate nested components
4. Finally hydrate leaf nodes
But this changes with Nuxt 3…
Nuxt 3 introduced partial hydration capabilities, similar to islands architecture:
<client-only>
The component inside is hydrated only on the client:
<client-only>
<FancyChart />
</client-only>
Nuxt 3 supports Suspense-like behavior:
<Suspense>
<UserProfile />
</Suspense>
This allows fallback content to render while async components resolve their data.
Nuxt aggressively optimizes hydration as well.
While it's not as automatic as React's selective hydration,
Nuxt 3 can achieve progressive hydration with explicit patterns.
SvelteKit hydrates the whole page by default, starting at the root and working down.
Svelte also supports component-level hydration, so individual components can be mounted against existing server-rendered markup.
Very efficient, because Svelte compiles away most framework runtime.
Solid uses reactive primitives and compiles away components.
Hydration happens at the level of individual reactive computations rather than whole component trees.
This makes hydration extremely granular.
SolidStart naturally does progressive + selective hydration.
Qwik doesn’t hydrate at all.
It resumes the app from HTML with embedded state.
Event listeners are attached lazily.
Progressive hydration is built-in because Qwik doesn't hydrate entire trees — it loads behavior on demand.
Astro hydrates components only when instructed:
<Counter client:load />
<Chart client:visible />
<Sidebar client:idle />
Modes:
- client:load
- client:visible
- client:idle
- client:media
- client:hover

A pure island architecture.
Astro minimizes JS for non-interactive content, best-in-class for performance.
| Framework | Hydration Style | Notes |
|---|---|---|
| React 18+ | Selective, priority-based, Suspense-aware | Automatic |
| Vue 3 classic | Linear | Simple but slower |
| Nuxt 3 | Partial/Client-only/Islands | Supports progressive hydration |
| SvelteKit | Progressive | Hydrates visible/interacted UI first |
| SolidStart | Fine-grained | Tiny hydration chunks |
| Qwik | No hydration → Resume | Most advanced |
| Astro | Islands | Hydrate only what you choose |
In modern SSR/SSG frameworks, hydration is no longer one top-to-bottom pass: it can be selective, progressive, island-based, or skipped entirely.
If you understand hydration deeply,
you understand the foundation of modern web frameworks in 2025.
2025-11-17 01:05:29
HubSpot forms are widely used for lead generation, newsletters, onboarding workflows, and marketing funnels.
But when developers try to embed a HubSpot form inside a Next.js project, the default embed snippet usually fails.
Forms don’t load, scripts break, and SSR errors begin to appear.
Before implementing anything, it’s important to understand why this happens in a Next.js environment.
This article covers theory only — the actual working solution is linked at the end.
HubSpot provides an embed script that works perfectly in standard HTML sites.
But Next.js renders pages on both the server and the client, causing conflicts.
Here’s why the embed fails:
The embed script depends on window.
During SSR, window doesn’t exist, so the HubSpot script breaks.
It needs a target div in the DOM, which only exists after hydration.
If Next.js tries to execute the script on the server, errors appear.
In short:
HubSpot = browser code
Next.js = server + browser
→ The script must run only on the client.
Every working solution follows this exact formula:
1. Run the embed script only on the client, to avoid SSR conflicts.
2. Render a mount point, because HubSpot needs a target div: <div id="hubspotForm"></div>
3. Create the form only after the script loads, because HubSpot injects its content dynamically.
Once you understand this, embedding HubSpot in Next.js becomes straightforward.
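As a rough sketch of that formula in practice (this is not the full guide's code; the portal and form IDs are placeholders, and hbspt.forms.create options vary by account):

"use client";
// components/HubspotForm.tsx: client-only sketch; the IDs below are placeholders

import { useEffect } from "react";

declare global {
  interface Window {
    hbspt?: { forms: { create: (opts: Record<string, string>) => void } };
  }
}

export function HubspotForm() {
  useEffect(() => {
    // Runs only in the browser, after hydration, so window is guaranteed to exist.
    const script = document.createElement("script");
    script.src = "//js.hsforms.net/forms/v2.js";
    script.async = true;
    script.onload = () => {
      window.hbspt?.forms.create({
        portalId: "YOUR_PORTAL_ID", // placeholder
        formId: "YOUR_FORM_ID",     // placeholder
        target: "#hubspotForm",     // the mount point from step 2
      });
    };
    document.body.appendChild(script);
  }, []);

  return <div id="hubspotForm" />;
}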
For developers looking for the full solution, including a ready-made <HubspotForm /> component,
the full guide is here:
How to Embed HubSpot Form in Next.js — Full Working Code
Developers usually handle images, PDFs, SEO metadata, and text formatting while building landing pages.
Here are some tools you may find useful — all in dev-friendly markdown format.
Embedding a HubSpot form in Next.js is easy once you understand one simple rule:
The embed script must run only in the browser — never during SSR.
To get the exact working code and final implementation, read the full guide:
Copy → Paste → Your HubSpot form works instantly.
2025-11-17 01:03:30
Hey,
Just accepted my first junior dev position and wanted to share what actually got me there, because it wasn't what everyone on Reddit tells you.
My "Perfect" Resume That Nobody Wanted
CS degree, internships, active GitHub, LeetCode grind. Had all the boxes checked. Still got ghosted by 95% of companies for months.
The breaking point was people telling me "AI will replace you anyway" while I'm literally shipping code every day. Felt insane.
What Actually Changed
Honestly, I didn't suddenly get better at coding. Three things shifted:
Stopped spray-and-pray applications. Used to send 20+ apps/week with the same resume. Started doing 5/week but actually researched each company - matched my resume to their tech stack, referenced their products in cover letters. Quality over desperation.
This is why I built Woberry initially - needed something to help me track applications and auto-generate tailored cover letters without losing the personal touch. It's at ~$2K MRR now.
Built stuff I actually needed. Everyone says "do side projects" but here's what matters - solve your own problems. In interviews, I wasn't talking about tutorial apps. I was explaining real users, revenue, bugs I'd fixed. That hits different.
Right now I'm building a fitness app because every tracking app out there feels bloated or has a terrible UX. Just want something simple that works. In interviews, employers loved hearing about this kind of thinking - spotting gaps and just building the solution.
Gave up trying to impress people. The interview I got hired from? Went in expecting nothing, just talked normal. No rehearsed answers, just honest conversation. Apparently juniors who can communicate clearly are rarer than you'd think.
The Weird Part
Starting in November but keeping Woberry running. Part of me wonders if I should've gone full indie, but honestly the stability means I can build without financial panic. Plus I'll get Spring Boot production experience which I can't replicate solo.
If You're Still In It
Market's brutal and it's not your fault. But volume won't save you - quality and being real will.
Build something, even if small. Your side project might become your backup plan. Mine did.
If you're drowning in tracking applications and writing cover letters, that's literally why Woberry exists - built it because I needed it. Also made ResumeFast when I just needed a quick resume without all the extra features.
You're probably closer than you think.
--
https://www.woberry.com/ and https://www.resumefast.io/ if you want to check them out
2025-11-17 00:54:06
A domain is more than just an address — it’s your brand, identity, and the first impression users get when they find you online.
Choosing the right one can feel stressful, but with a bit of strategy, it's totally doable.
Here’s how to pick a domain that looks professional, sounds clean, and works long-term.
Short domains are easier to type, share, and remember.
Good examples:
- stripe.com
- github.com
- sx.dev

Avoid long phrases, hyphens, or weird spellings.
If someone can’t repeat your domain after hearing it once, it’s too complicated.
Your domain should reflect your project, team, or brand idea.
Ask yourself: would someone guess what the project is about from the name alone?
Meaning increases trust.
The TLD (Top-Level Domain) matters more than people think.
Choose the one that matches your project’s identity and audience.
You don’t want to build a brand and then get a legal notice.
Before buying, search trademark databases and look for existing companies with similar names.
If the domain is too similar to an existing company, avoid it.
Consistency across platforms = strong branding.
Check that the same name is available on the social platforms and developer sites you plan to use.
If you can secure the same name everywhere, perfect.
Google loves clarity.
A good domain is clear, readable, and free of spammy patterns.
Example: codeacademy.dev is better than xzy-tools1337.net
If users often misspell it, you’ll lose traffic.
Test it:
say the name to a friend and ask them to write it. If they struggle — rethink.
Choose a domain that will still make sense in five or ten years.
Avoid names tied to a single feature or trend if you plan to grow your project.
Choosing a domain isn’t just picking a name — it’s shaping your digital identity.
Aim for something simple, meaningful, scalable, and consistent across the web.
The perfect domain is short, memorable, meaningful, and consistent everywhere your brand appears.
A strong domain is a long-term investment in your brand and your future.
2025-11-17 00:53:20
When working with LLMs, even small details—like how you format your data—can add up to noticeable differences in cost and performance. One of the new formats people have been experimenting with is called Toon (Token-Oriented Object Notation), and it’s gaining attention because of how compact it is. It conveys structured information like JSON or XML, but with fewer characters, which usually means fewer tokens.
This doesn’t replace established formats. JSON and XML are excellent for APIs, external integrations, and strict data handling. Toon simply offers a lighter alternative specifically for situations where data is being processed inside an LLM prompt or response, where size matters more than strict formal structure.
Below, I’ll walk through what Toon looks like, how to write it, how to create arrays and lists, how nesting works, and how it compares to JSON and XML—using straightforward language so the concepts click easily.
Toon is a compact way of writing structured information.
If JSON and XML aim for full clarity and standardization, Toon aims for minimal overhead. It focuses on giving the model the data it needs without the extra symbols that traditional formats include for compatibility with programming languages and parsers.
A basic Toon object looks like this:
name:Luna;age:3;color:silver
No quotes, no commas, no braces around the whole thing.
Still understandable, still structured—just lighter.
Here’s a breakdown of the different building blocks.
A Toon object is simply a sequence of key:value pairs separated by semicolons:
name:Luna;age:3;color:silver
If a value contains spaces, wrap it in parentheses so it stays together:
title:(Chief Snack Manager)
That’s all you need for a standard object.
Arrays in Toon use square brackets and separate items using the pipe symbol |:
pets:[cat|dog|ferret]
More complex items can also be placed inside an array:
tasks:[name:clean;time:10 | name:feed;time:5]
Each item can be its own structured object.
Toon also supports lists, which behave like arrays but preserve order more explicitly and allow repeated values without any ambiguity.
Lists use angle brackets:
shopping:<milk|eggs|eggs|bread>
Use lists when the exact sequence matters or when duplicates are intentional.
Toon allows nesting using curly braces {}:
user:{name:Luna;stats:{speed:9;stealth:10}}
This keeps nested relationships clear while still avoiding most of the bulk found in JSON or XML.
All three formats serve a purpose, but they’re shaped by different goals.
XML is very explicit.
It prioritizes structure, clarity, and machine-verified consistency. That’s why it uses opening and closing tags:
<cat>
<name>Luna</name>
</cat>
Great for document-like data and environments that require strict validation.
JSON is lighter than XML and is widely used in web APIs:
{ "name": "Luna", "age": 3 }
It’s familiar, readable, and supported everywhere—but it still includes quotes, commas, and braces that add up in token-based contexts.
Toon takes a different approach. It focuses on reducing the number of characters used to express the same information:
name:Luna;age:3
It keeps things understandable while minimizing overhead.
This makes it practical when your main target is an LLM rather than an external system or parser.
| Feature | XML | JSON | Toon |
|---|---|---|---|
| Typical size | Largest | Medium | Smallest |
| Human-readable | Yes, but verbose | Yes | Yes |
| Best use case | Document standards, external systems | Web APIs, broad app support | LLM prompts and responses |
| Token usage | High | Medium | Low |
Each has strengths; Toon is simply optimized for a different environment.
Let’s compare the same data in JSON and Toon.
{
"name": "Luna",
"age": 3,
"color": "silver"
}
This includes braces, quotes around every key and string value, commas, and colons.
These all become individual tokens.
A short object like this often lands around 24–28 tokens.
name:Luna;age:3;color:silver
Much fewer symbols, no quotes, no commas, no braces.
This usually ends up around 10–12 tokens.
If you had 100 objects of this shape, you'd save roughly 1,400 tokens just by changing the format.
For large prompt-based systems, tool outputs, or inline metadata inside LLM workflows, this can noticeably reduce costs over time.
Toon isn’t trying to replace the established formats—it’s just optimized for a different environment.
To understand how much TOON can help reduce LLM prompt costs, we can run a simple token-comparison test using Node.js, TypeScript, and OpenAI’s tiktoken tokenizer.
TOON doesn’t try to replace JSON — JSON is still the best for APIs and data interchange — but inside LLM prompts, the extra characters in JSON (quotes, braces, commas, whitespace) add up quickly.
TOON removes most of that, which makes token usage noticeably smaller.
Below is a working script to compare token usage and calculate efficiency.
This script:
import { encoding_for_model } from "tiktoken";
const encoder = encoding_for_model("gpt-4o-mini");
// --- Convert JSON → TOON ---
function jsonToToon(obj: any): string {
  if (Array.isArray(obj)) {
    // Arrays: items joined with | inside square brackets
    return `[${obj.map(jsonToToon).join("|")}]`;
  }
  if (typeof obj === "object" && obj !== null) {
    return Object.entries(obj)
      .map(([key, value]) => {
        if (typeof value === "string" && value.includes(" ")) {
          // Multi-word strings stay together inside parentheses
          return `${key}:(${value})`;
        }
        if (Array.isArray(value)) {
          // Arrays already carry their own brackets, so no extra braces
          return `${key}:${jsonToToon(value)}`;
        }
        if (typeof value === "object" && value !== null) {
          // Nested objects are wrapped in curly braces
          return `${key}:{${jsonToToon(value)}}`;
        }
        return `${key}:${value}`;
      })
      .join(";");
  }
  return String(obj);
}
// --- Count tokens ---
function countTokens(text: string): number {
return encoder.encode(text).length;
}
// Example JSON
const data = {
name: "Luna",
age: 3,
color: "silver",
stats: { speed: 9, stealth: 10 },
pets: ["cat", "dog"]
};
const jsonStr = JSON.stringify(data);
const toonStr = jsonToToon(data);
const jsonTokens = countTokens(jsonStr);
const toonTokens = countTokens(toonStr);
const savings = jsonTokens - toonTokens;
const percentage = ((savings / jsonTokens) * 100).toFixed(2);
console.log("JSON:", jsonStr);
console.log("TOON:", toonStr);
console.log("\nJSON Tokens:", jsonTokens);
console.log("TOON Tokens:", toonTokens);
console.log("\nToken Savings:", savings);
console.log("Efficiency:", percentage + "%");
Here’s what results typically look like when comparing the same dataset:
| Format | Token Count | Notes |
|---|---|---|
| JSON | 26 tokens | Includes braces, commas, quotes |
| TOON | 11 tokens | Much smaller, minimal syntax |
| Savings | 15 tokens | Fewer characters used |
| Efficiency | 57.6% reduction | Over half the tokens eliminated |
This means TOON uses ~58% fewer tokens than JSON for the same information.
Depending on your LLM pricing, a savings like this accumulates dramatically when you’re working with:
Even a difference of ~15 tokens per object becomes thousands of saved tokens across large inputs.
You may want the converter separately:
function jsonToToon(data) {
  if (Array.isArray(data)) {
    return `[${data.map(jsonToToon).join("|")}]`;
  }
  if (typeof data === "object" && data !== null) {
    return Object.entries(data)
      .map(([key, value]) => {
        if (typeof value === "string" && value.includes(" ")) {
          return `${key}:(${value})`;
        }
        if (Array.isArray(value)) {
          return `${key}:${jsonToToon(value)}`;
        }
        if (typeof value === "object" && value !== null) {
          return `${key}:{${jsonToToon(value)}}`;
        }
        return `${key}:${value}`;
      })
      .join(";");
  }
  return String(data);
}
You can also test average efficiency across multiple objects:
function compareDataset(dataset: any[]) {
let totalJSON = 0;
let totalTOON = 0;
for (const item of dataset) {
totalJSON += countTokens(JSON.stringify(item));
totalTOON += countTokens(jsonToToon(item));
}
return {
totalJSON,
totalTOON,
savings: totalJSON - totalTOON,
efficiency: ((totalJSON - totalTOON) / totalJSON * 100).toFixed(2) + "%"
};
}
Use this to benchmark real-world data and see consistent savings.
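For example, a quick run over a small made-up dataset (these records are placeholders, not benchmark data):

const sampleDataset = [
  { name: "Luna", age: 3, color: "silver" },
  { name: "Milo", age: 5, color: "black" },
  { name: "Cleo", age: 2, color: "white" },
];

// Logs totals plus savings, e.g. { totalJSON: ..., totalTOON: ..., savings: ..., efficiency: "...%" }
console.log(compareDataset(sampleDataset));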
Toon is a lightweight, practical format that fits well into LLM-focused pipelines. It keeps structure clear but trims away most of the characters that increase token count. JSON and XML still dominate traditional software systems, and they should—they’re reliable and standardized.
But when your goal is to communicate structured data inside an LLM prompt as efficiently as possible, Toon offers a noticeably smaller, cleaner alternative.
2025-11-17 00:47:45
Apache Iceberg is a widely used table format in Data Lakehouse architectures. It provides flexibility in how data is written, with two key optimizations: partitioning, which splits data into segments, and sorting, which reorders data within those segments. These optimizations can significantly reduce the amount of data scanned by query engines, ultimately boosting query performance.
When querying data with high-cardinality columns (e.g., IDs or serial numbers), quickly filtering out unnecessary values is crucial. Sorting becomes particularly valuable in these scenarios. The rationale is simple: if data is written in order, query engines can rapidly locate the needed data rather than performing a full table scan and discarding irrelevant rows.
When configuring Iceberg table sort properties, engineers can specify whether sorting follows ascending or descending order—with ascending as the default. While reading about this configuration, a question came to mind: Is there any performance difference between these two ordering approaches? If so, which one performs better, and why? To answer these questions, I designed an experiment to find out.
Detailed code and performance analysis can be found in my repo: https://github.com/CuteChuanChuan/Dive-Into-Iceberg
Generated 1,000,000 rows with 30% null values
Created two identically configured Iceberg tables with different null sorting orders (i.e., NULLS FIRST vs. NULLS LAST)
select count(*) from table where value is not null
select sum(value) from table where value is not null
select avg(value) from table where value is not null
select count(*) from table where value is null
select count(*) from table
Query plan: Whether different sorting orders generate different execution plans
Execution time with statistical analysis: Overall query time comparison
CPU profiling: Detailed CPU usage analysis
To obtain a complete picture, I planned to conduct three types of analysis. First, I compared query plans to see whether different null placements generate different plans, which might influence query performance. Second, I conducted statistical analysis on execution times for rigorous examination. Since query time differences are the observable outcome, we need to identify the root cause if significant differences exist. Therefore, if statistical significance is found, CPU profiling will be conducted in the final phase.
select count(*) from table where value is not null
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1557]
+- HashAggregate(keys=[], functions=[partial_count(1)])
+- Project
+- Filter isnotnull(value#508)
+- BatchScan local.db.test_nulls_first[value#508] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1574]
+- HashAggregate(keys=[], functions=[partial_count(1)])
+- Project
+- Filter isnotnull(value#521)
+- BatchScan local.db.test_nulls_last[value#521] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
select sum(value) from table where value is not null
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#886)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3045]
+- HashAggregate(keys=[], functions=[partial_sum(value#886)])
+- Filter isnotnull(value#886)
+- BatchScan local.db.test_nulls_first[value#886] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#899)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3064]
+- HashAggregate(keys=[], functions=[partial_sum(value#899)])
+- Filter isnotnull(value#899)
+- BatchScan local.db.test_nulls_last[value#899] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
select avg(value) from table where value is not null
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1264)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4535]
+- HashAggregate(keys=[], functions=[partial_avg(value#1264)])
+- Filter isnotnull(value#1264)
+- BatchScan local.db.test_nulls_first[value#1264] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1279)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4554]
+- HashAggregate(keys=[], functions=[partial_avg(value#1279)])
+- Filter isnotnull(value#1279)
+- BatchScan local.db.test_nulls_last[value#1279] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
select count(*) from table where value is null
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6023]
+- HashAggregate(keys=[], functions=[partial_count(1)])
+- Project
+- Filter isnull(value#1646)
+- BatchScan local.db.test_nulls_first[value#1646] local.db.test_nulls_first (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6040]
+- HashAggregate(keys=[], functions=[partial_count(1)])
+- Project
+- Filter isnull(value#1659)
+- BatchScan local.db.test_nulls_last[value#1659] local.db.test_nulls_last (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []
select count(*) from table
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1895L)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7045]
+- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1895L)])
+- Project [count(*)#1896L AS agg_func_0#1895L]
+- LocalTableScan [count(*)#1896L]
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1904L)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7060]
+- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1904L)])
+- Project [count(*)#1905L AS agg_func_0#1904L]
+- LocalTableScan [count(*)#1905L]
For both tables, the execution plans for all queries are identical.
Although the query plans are the same, a deeper look at the Parquet file statistics reveals important differences in how data is physically organized.
Below are the min/max statistics for each partition in both configurations:
| Partition | NULLS FIRST | NULLS LAST | Min Value Difference |
|---|---|---|---|
| cat_0-2 | All nulls | All nulls | N/A |
| cat_3 | min=103, max=993 | min=103, max=993 | Same |
| cat_4 | min=4, max=994 | min=4, max=994 | Same |
| cat_5 | min=405, max=995 | min=355, max=995 | -50 |
| cat_6 | min=106, max=996 | min=6, max=996 | -100 |
| cat_7 | min=517, max=997 | min=487, max=997 | -30 |
| cat_8 | min=228, max=998 | min=208, max=998 | -20 |
| cat_9 | min=619, max=999 | min=609, max=999 | -10 |
The different min/max values reveal that physical data layout differs between the two configurations:
Different File Boundaries: When sorting with NULLS FIRST vs. NULLS LAST, Spark writes data in different orders, causing file splits to occur at different points. Even though both tables contain identical data, the way rows are distributed across files differs.
File Organization Pattern:
NULLS FIRST: Files begin with null values, followed by non-null values. The minimum non-null value appears after skipping nulls within each file.
NULLS LAST: Files begin with non-null values immediately. The minimum value is at or near the start of the file.
In NULLS FIRST (e.g., cat_6): min=106 means the file starts with nulls, and 106 is the first non-null value encountered.
In NULLS LAST (e.g., cat_6): min=6 means the file immediately starts with value 6, providing more accurate bounds.
For queries with WHERE value IS NOT NULL:
NULLS FIRST:
Files contain nulls at the beginning, causing mixed value distribution
Query engine must scan through null values before reaching non-null data
Statistics indicate the presence of non-null values, but they're not immediately accessible
NULLS LAST:
Files with non-null data have those values at the beginning
Query engine can immediately start processing valid values
Better sequential access pattern for counting non-null values
This file-level organization difference, combined with CPU microarchitecture optimizations, explains why NULLS LAST performs better for counting non-null values even though logical query plans are identical.
T-test: Compare whether query times are statistically different
Cohen's d: Calculate the effect size of null ordering settings
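For reference, with $n$ paired runs and per-run time differences $d_i$ (NULLS FIRST minus NULLS LAST), the two statistics are:

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad d_{\text{Cohen}} = \frac{\bar{d}}{s_d}
$$

where $\bar{d}$ is the mean difference and $s_d$ its standard deviation. The figures reported below are mutually consistent: $t = d_{\text{Cohen}}\sqrt{n}$ gives $11.94 \approx 1.19 \times \sqrt{100}$, implying roughly 100 paired runs.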
select count(*) from table where value is not null: Null Last performs better
Descriptive Statistics:
NULLS FIRST: mean=41.46ms, sd=8.38ms
NULLS LAST: mean=31.55ms, sd=2.40ms
Paired t-test:
t-statistic = 11.9367
p-value = 0.000000
95% CI: [8.26, 11.55] ms
Result: *** HIGHLY SIGNIFICANT (p < 0.001)
Effect Size (Cohen's d):
d = 1.1937
Interpretation: Large
Summary:
Mean difference: 9.91 ms
Percentage difference: 23.90 %
Winner: NULLS LAST
select sum(value) from table where value is not null: Not significantly different
Descriptive Statistics:
NULLS FIRST: mean=34.14ms, sd=5.12ms
NULLS LAST: mean=33.40ms, sd=6.43ms
Paired t-test:
t-statistic = 0.8759
p-value = 0.383195
95% CI: [-0.94, 2.43] ms
Result: NOT SIGNIFICANT (p >= 0.05)
select avg(value) from table where value is not null: Not significantly different
Descriptive Statistics:
NULLS FIRST: mean=28.84ms, sd=3.42ms
NULLS LAST: mean=27.95ms, sd=3.26ms
Paired t-test:
t-statistic = 1.9654
p-value = 0.052165
95% CI: [-0.01, 1.80] ms
Result: NOT SIGNIFICANT (p >= 0.05)
select count(*) from table where value is null: Not significantly different
Descriptive Statistics:
NULLS FIRST: mean=24.00ms, sd=4.64ms
NULLS LAST: mean=23.16ms, sd=3.43ms
Paired t-test:
t-statistic = 1.3804
p-value = 0.170582
95% CI: [-0.37, 2.05] ms
Result: NOT SIGNIFICANT (p >= 0.05)
select count(*) from table: Not significantly different
Descriptive Statistics:
NULLS FIRST: mean=14.95ms, sd=2.41ms
NULLS LAST: mean=14.39ms, sd=2.45ms
Paired t-test:
t-statistic = 1.6356
p-value = 0.105090
95% CI: [-0.12, 1.25] ms
Result: NOT SIGNIFICANT (p >= 0.05)
NULLS LAST is significantly faster than NULLS FIRST when counting non-null values.
Please refer to the flame graphs in my repo.
The performance difference observed in execution time analysis can be attributed to both file-level organization and CPU microarchitecture optimizations:
File-Level Organization Impact: As shown in the file statistics analysis, NULLS LAST creates files where non-null values are positioned at the beginning. This layout means when the query engine scans data with WHERE value IS NOT NULL, it immediately encounters a continuous block of valid values rather than having to skip over nulls first. This reduces unnecessary I/O operations and deserialization overhead.
CPU Microarchitecture Optimizations:
SIMD vectorization: when evaluating isnotnull(value) on 8 consecutive values that are all non-null, a single SIMD instruction can validate and count them in one operation.
Branch prediction: modern CPUs speculate on branches such as if (value != null). With NULLS LAST, the query engine scans data following a highly predictable pattern: a long sequence of non-null values followed by nulls. This consistency allows the branch predictor to achieve high accuracy, keeping the CPU pipeline running smoothly. In contrast, NULLS FIRST presents a less predictable pattern at file boundaries where nulls transition to non-nulls, potentially causing pipeline stalls.
The CPU profiling data supports these optimizations: NULLS LAST (2,238 samples) uses approximately 11.7% less CPU time than NULLS FIRST (2,536 samples). This reduction results from the combined effects of better file organization, improved SIMD vectorization, and enhanced branch prediction accuracy.
NULLS LAST occupies less CPU time due to a combination of better file-level data organization and CPU microarchitecture optimizations.
This exploration reveals that while different null value placements do not create different query plans, they significantly impact query performance through physical data organization.
Key Findings:
File-Level Statistics Matter: NULLS LAST produces better min/max statistics, with non-null values positioned at file beginnings. This creates more favorable data layouts for queries filtering on non-null values.
CPU Microarchitecture Synergy: The continuous blocks of non-null values in NULLS LAST enable CPU optimizations including SIMD vectorization and improved branch prediction, resulting in ~11.7% less CPU time.
Significant Performance Impact: For SELECT COUNT(*) WHERE value IS NOT NULL, NULLS LAST achieves 23.90% faster execution time—a substantial improvement for such a common OLAP operation.
Practical Recommendations:
If counting non-null values is a frequent operation in your workload—which is common in OLAP scenarios—configuring Iceberg tables with NULLS LAST can provide measurable performance improvements. The benefits stem from both better file organization and CPU-level optimizations working in tandem.
Future Exploration:
This experiment tested 5 queries on a 1-million-row dataset with 30% null values. Future investigations could explore:
Various query patterns frequently used in OLAP scenarios (e.g., window functions like LAG, complex aggregations)
Larger datasets with multiple files per partition to amplify metadata pruning effects
Different null percentage distributions (10%, 50%, 70%) to understand the threshold where NULLS LAST benefits diminish
Impact on different data types (strings, decimals) and column cardinalities
Performance with Iceberg's metadata-based filtering in more complex predicates
These investigations would provide a more complete understanding of optimal Iceberg table sorting configurations across diverse workloads.