2026-04-12 13:43:09
The story of migrating our governance agent from hardcoded skills to dynamic Registry loading — the wins, the gotchas, and what we learned along the way.
Our AWS governance agent has 16 domain skills — security analysis, cost optimization, network intelligence, the works. Every single one of them was baked into the system prompt on every request. Ask about a single S3 bucket? Here's all 16 skill descriptions anyway.
That's a lot of tokens doing nothing useful.
The goal was simple: move skill definitions to AgentCore Registry so the agent loads only what it needs per request. Less prompt bloat, smarter skill selection, and the foundation for per-thread skill loading down the road.
The key constraint: Registry acts as a catalog only. Tool implementations stay in the agent code where they belong (lowest latency). The Registry just stores metadata — name, description, instructions, which tools a skill can use.
And because we're not reckless, we added a USE_REGISTRY env var toggle. Flip it off, and everything goes back to the old local files. Zero drama.
We used Claude Code with agentic planning to break it into 7 phases: SDK upgrades, plugin creation, upload scripts, IAM permissions, tests, this journal, and documentation. The whole thing — plan through production deployment — took about an hour. A few decisions we locked in early:
As of April 2026, AgentCore Registry has no Terraform provider and no CloudFormation support. So we did everything through the AWS Console and CLI — which honestly worked fine for a one-time setup, but worth knowing if you're planning to IaC everything from day one.
Created the registry in the console. We already had a working JWT auth setup with Azure Entra ID on our main AgentCore agent, so we matched that same configuration. Straightforward.
Then we needed the Registry ID.
You'd think there'd be a labeled field with a copy button, like every other AWS service. There isn't. The ID is buried inside the ARN — you have to extract it yourself, or notice it tucked into sample CLI commands at the bottom of the page.
Not a showstopper, but the kind of thing that makes you squint at the screen for five minutes wondering what you're missing.
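Since the ID is just the resource segment of the ARN, a few lines of string splitting recover it. A hedged sketch: the service name and exact ARN shape below are assumptions based on the usual `arn:aws:<service>:<region>:<account>:registry/<id>` pattern, not copied from the console.

```python
def registry_id_from_arn(arn: str) -> str:
    """Extract the registry ID from the ARN's resource part.

    Assumes the usual AWS shape: the resource is the last colon-separated
    field, e.g. "registry/ABC123XYZ".
    """
    resource = arn.split(":")[-1]      # "registry/ABC123XYZ"
    return resource.split("/", 1)[-1]  # "ABC123XYZ"

# Hypothetical ARN for illustration only
print(registry_id_from_arn(
    "arn:aws:bedrock-agentcore:eu-west-1:123456789012:registry/ABC123XYZ"
))  # → ABC123XYZ
```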
Dear AWS: A "Registry ID" field with a copy button would save everyone some confusion.
Bumped our SDK versions (strands-agents and boto3), added config fields, and built the new plugin — a drop-in replacement for the old AgentSkills loader.
One thing worth knowing: match your Registry auth to your agent's auth. If your agent uses IAM, create an IAM-auth registry. If your agent uses OIDC, create a JWT registry. We use OIDC on our agent, so we went with JWT — and since the plugin runs with the agent's IAM execution role, we use the control plane APIs (list_registry_records + get_registry_record) which accept IAM regardless of registry type.
We wrote scripts to push all 16 skill files to the Registry. Two things bit us:
The frontmatter problem. Our SDK's Skill.instructions field helpfully strips the YAML frontmatter from skill files. The Registry's upload API helpfully requires it. We switched to reading raw file content.
The async creation dance. create_registry_record returns immediately with an ARN but no record ID field. The record is in CREATING state. You have to: extract the ID from the ARN, wait for it to reach DRAFT, then submit it for approval as a separate step. None of this is one API call.
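The waiting part of that dance reduces to a generic poller. In the real flow, `get_status` would wrap a `get_registry_record` call and read the record's status field; here a canned sequence of states stands in so the sketch runs anywhere.

```python
import time

def wait_for_status(get_status, target, timeout_s=120.0, interval_s=2.0):
    """Poll get_status() until it returns `target`; raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == target:
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"record never reached {target!r}")

# Canned CREATING → DRAFT sequence standing in for real API responses
states = iter(["CREATING", "CREATING", "DRAFT"])
print(wait_for_status(lambda: next(states), "DRAFT", interval_s=0))  # → DRAFT
```

The same helper covers the second wait (submit for approval, then poll for APPROVED), which is why it's worth extracting instead of inlining two loops.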
IAM was uneventful — added registry permissions to the execution role, scoped to our specific registry ARN.
Tests were thorough: 13 unit tests for the plugin (including XML injection protection and caching behavior), 6 for the upload scripts. All green.
Deployed with USE_REGISTRY=true. Opened Slack. Asked Schwarzi a question.
It worked.
The logs told the story: a ~1.2-second catalog fetch at session start. It happens once, then individual skill fetches (~200ms each) are cached for the rest of the session. An acceptable trade-off for dynamic loading.
This was our first question too. We'd just added API calls and latency — surely we were also burning more tokens?
No. The LLM sees exactly the same text either way.
Both plugins inject an <available_skills> XML block into the system prompt before every turn — 16 skill names and descriptions, same format, same size. When the agent activates a skill, both return the full SKILL.md content as a tool result. Same content, same tokens. The Registry is a different storage backend, not a different prompt.
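A minimal sketch of that injection step, with escaping so a skill description can't smuggle its own tags into the prompt (the tag and attribute names here are illustrative, not the plugin's actual format):

```python
from xml.sax.saxutils import escape

def render_available_skills(skills):
    """Build an <available_skills> prompt block from (name, description)
    pairs, escaping untrusted text so it can't inject markup."""
    lines = ["<available_skills>"]
    for name, description in skills:
        safe_name = escape(name, {'"': "&quot;"})  # attribute-safe
        lines.append(f'  <skill name="{safe_name}">{escape(description)}</skill>')
    lines.append("</available_skills>")
    return "\n".join(lines)

block = render_available_skills([
    ("security-analysis", "Analyze IAM & S3 exposure"),
    ("cost-optimization", "Find idle <spend> fast"),
])
print(block)
```

Whether skills come from local files or the Registry, this function's output is byte-identical, which is exactly why the token count doesn't change.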
The actual cost delta is purely operational: ~1.2 seconds of latency at session start, ~200ms per uncached skill fetch, and 3–4 AWS API calls per session that didn't exist before. Those are real, but they're measured in milliseconds and pennies, not tokens.
If anything, the Registry sets up future token savings. Right now we load the full catalog into the prompt every turn. With per-thread skill loading (the next phase), we'd inject only the skills relevant to the current conversation. That's a prompt reduction, not an increase — but it requires the dynamic loading infrastructure we just built.
For the detail-oriented, here's everything that went sideways:
| # | What Happened | What We Did |
|---|---|---|
| 1 | Registry ID not labeled in console | Extracted from ARN |
| 2 | Registry APIs missing in boto3 < 1.42.88 | Upgraded boto3 |
| 3 | Upload API requires YAML frontmatter | Sent raw file content |
| 4 | Record creation is async, no ID in response | Parsed ID from ARN, polled for DRAFT status |
| 5 | Data-plane search requires matching auth type | Used control-plane APIs (accept IAM regardless) |
| 6 | Registry takes ~45s to become READY | Added polling before uploads |
Match your Registry auth to your agent. IAM agent → IAM registry. OIDC agent → JWT registry. We matched our existing Azure Entra ID config and it worked on the first try.
Control plane is your friend. The control-plane APIs (list_registry_records + get_registry_record) accept IAM regardless of registry auth type. Use them for catalog fetches from your plugin.
Send the raw file, frontmatter and all. The SDK strips it; the Registry needs it. Read the file from disk, not from the parsed object.
Record creation is a multi-step process. Create → wait for DRAFT → submit for approval → wait for APPROVED. Budget for polling and retries.
Check your boto3 version. Registry APIs landed in 1.42.88. Older versions have the clients but not the operations — a confusing failure mode.
Cache at the session level. The catalog fetch is the slow part (~1.2s). Individual skill lookups are fast (~200ms) and cache well. Don't re-fetch what hasn't changed.
Use an env var toggle for migrations. USE_REGISTRY=true/false with both code paths intact means you can roll back in seconds. No database flags, no deployment. Just flip the switch.
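The session-level cache described above can be sketched as a small wrapper. The Registry calls are replaced by injected stand-ins here so the caching behavior is visible; names and timings are placeholders, not the plugin's real API.

```python
class SkillCatalog:
    """Session-scoped cache: one catalog fetch, then per-skill fetches
    memoized for the rest of the session."""

    def __init__(self, fetch_catalog, fetch_skill):
        self._fetch_catalog = fetch_catalog
        self._fetch_skill = fetch_skill
        self._catalog = None
        self._skills = {}

    def catalog(self):
        if self._catalog is None:        # the slow (~1.2s) call, once
            self._catalog = self._fetch_catalog()
        return self._catalog

    def skill(self, name):
        if name not in self._skills:     # ~200ms call, cached per skill
            self._skills[name] = self._fetch_skill(name)
        return self._skills[name]

# Count how often the "network" is actually hit
calls = {"catalog": 0, "skill": 0}

def fake_catalog():
    calls["catalog"] += 1
    return ["security-analysis"]

def fake_skill(name):
    calls["skill"] += 1
    return f"SKILL.md for {name}"

cache = SkillCatalog(fake_catalog, fake_skill)
cache.catalog(); cache.catalog()
cache.skill("security-analysis"); cache.skill("security-analysis")
print(calls)  # → {'catalog': 1, 'skill': 1}
```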
2026-04-12 13:42:25
Most shell habits fail because they are too clever to repeat. A useful workflow is boring on purpose: same file, same commands, same output shape, every day. My default Bash check-in routine is built around one pattern: filter signal, count it, and store the result. It takes less than a minute, and it gives me a concrete number I can discuss with the team instead of vague statements like "logs looked noisy."
The core idea is simple: treat your shell like a small data pipeline, not a command graveyard.
Pipes let each command do one job well. This composability is core to Bash productivity and keeps complex workflows understandable.
First, I look at warnings directly. No script yet, no abstractions:
cat events.log | grep warn
This is intentionally explicit. I want to see whether I am matching real operational noise or just stale assumptions. If the output is empty, that tells me something. If it is long and repetitive, that tells me something else. Either way, I learn before I automate.
A common mistake is jumping straight to counting. Counting is useful, but counts without context can mislead. If all warnings come from one noisy subsystem, that is very different from ten independent warning types. Start with eyes-on output, then move to metrics.
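One cheap way to get that context before counting is to group identical warning lines. Here is a sketch against a throwaway sample file; the log format is invented for the demo, so adjust the match to your own logs.

```shell
# Hypothetical sample log, just for this demo
cat > /tmp/events-demo.log <<'EOF'
2026-04-12 warn disk-io latency above threshold
2026-04-12 warn disk-io latency above threshold
2026-04-12 warn auth token refresh slow
2026-04-12 info startup complete
EOF

# Group identical warning lines before counting: two copies of one
# disk-io warning is a different story than three distinct warnings.
grep warn /tmp/events-demo.log | sort | uniq -c | sort -rn
```

The count is still 3 either way, but the grouped view tells you whether you have one noisy subsystem or several independent problems.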
Once the warning lines look relevant, I count them:
cat events.log | grep warn | wc -l
Now we have a daily number. This is the moment the workflow becomes operationally useful. You can compare today vs. yesterday, detect sudden jumps, and prioritize cleanup work by trend instead of gut feel.
I like to keep this as a dedicated step rather than folding everything into a giant one-liner with formatting, timestamps, and side effects. Why? Because this command is easy to explain to a teammate in 10 seconds. If a workflow cannot be explained quickly, it will not survive handoffs.
After the quick read and count, store the value:
warn_count=$(cat events.log | grep warn | wc -l)
printf "warn_count=%s\n" "$warn_count"
Then append it to a daily report file:
printf "%s warn_count=%s\n" "$(date +%F)" "$warn_count" >> daily-warn-report.txt
This turns an ad hoc check into a time series. After a week, you can spot trend direction. After a month, you can correlate spikes with releases, migrations, or config changes.
If you only do one thing from this article, do this: write the count down every day. Teams underestimate how valuable a tiny, consistent metric can be.
Here is the routine exactly as I run it during morning triage:
cat events.log | grep warn
cat events.log | grep warn | wc -l
warn_count=$(cat events.log | grep warn | wc -l)
printf "%s warn_count=%s\n" "$(date +%F)" "$warn_count" >> daily-warn-report.txt
That is enough to keep your log hygiene honest. You can always add breakdowns later (by module, by host, by error code), but this baseline already improves decision quality.
A real failure case made this workflow even more valuable for me.
One morning, my count suddenly dropped to zero:
cat events.log | grep warn | wc -l
# 0
At first glance, that looked like great news. It was not. The application dashboard still showed noisy behavior, and developers were reporting degraded responses. The symptom was a contradiction: the warning count said 0, but production behavior clearly disagreed.
The fix was straightforward once I looked at raw lines. The logger had switched severity text from warn to WARN during a dependency update. My filter was case-sensitive, so it silently missed everything.
Immediate fix:
cat events.log | grep -i warn | wc -l
Longer-term fix: we standardized log level casing in the app and documented filter assumptions in our team runbook. The lesson is practical: if a metric improves too perfectly overnight, assume your measurement might be broken before you celebrate.
Yes, you can do similar log filtering in PowerShell or Windows Command Prompt, and those tools absolutely have their place. For this specific daily routine, I still prefer Bash because pipeline composition stays lightweight and readable even as the workflow grows from two commands to ten.
If you want to drill this pattern until it is muscle memory, use the Bash training track here:
https://windows-cli.arnost.org/en/bash
For the exact lesson behind this workflow, read:
https://windows-cli.arnost.org/en/blog/bash-pipe-stack
A related drill that pairs well with this one:
https://windows-cli.arnost.org/en/blog/bash-copy-move-drill
My advice: run these commands yourself, daily, on a real log file. Repetition beats inspiration in shell work.
2026-04-12 13:31:58
Last time, we tore apart the core mechanics of RFC 6749 (Authorization Code Grant).
Hopefully, those fundamentals clicked. But here’s the thing: the second you try writing your own OAuth client or start poking around IdP dashboards, you almost inevitably smash into this weird, lingering mystery.
"So... what exactly does this 'PKCE' thing protect against? Isn't the state parameter enough?"
You'll see that "Require PKCE" toggle sitting right there in the console. An alarming number of developers blindly flip it to "ON" just because it sounds like a sensible security upgrade.
But if you can't instantly explain why you MUST use S256, or how exactly your access tokens would get hijacked without PKCE, then congrats. You're essentially just praying to a magic internet spell.
The true identity of this "feature everyone checks but nobody actually understands" is a clever cryptographic trick purposely built to shield authorization codes from the completely unhinged routing behavior of mobile OSes. It perfectly kills interception attacks. It is RFC 7636.
Consider this article the sequel to our RFC 6749 deep dive.
We’re going to rip through the original RFC 7636 spec, uncover the beautifully simple cryptography behind PKCE, and expose the "don't ever do this" configurations that could ruin your app.
Where exactly does RFC 7636 fit in the sprawling OAuth 2.0 universe?
While RFC 6749 tells you "how to get a token," RFC 7636 tells you "how to stop people from stealing it while you're trying to get it." We'll assume you already know how the standard Authorization Code Grant works.
If you've spent your career in backend web dev, you probably think, "As long as the IdP strictly validates the redirect_uri, we're bulletproof." And for server-side apps redirecting to a dedicated, TLS-secured https://myapp.example.com/callback, you'd be right.
But the minute you step into native mobile apps (iOS / Android), everything falls apart. To receive the callback (the auth code) from the browser, native apps rely on custom URI schemes like this:
myapp://oauth/callback?code=AUTH_CODE
There’s a fatal, gaping vulnerability here. The moment the auth server tells the browser to redirect to myapp://, it completely surrenders control over which app on that phone actually catches the URL. The ball gets thrown out of your safe, locked-down server environment and straight into the lawless wasteland of the mobile OS.
And guess what? Android and iOS do not guarantee custom URI schemes are unique.
A malicious app doesn’t have to pull off an Ocean's Eleven heist on your auth server. It just sits there. When it gets installed, it casually tells the OS (via a few lines of text in Info.plist or AndroidManifest.xml): "Hey, I can open myapp:// too."
When the browser fires that redirect with your precious authorization code, the OS can't tell which app is the "real" one. The routing decision is a complete coin toss. If the OS capriciously hands it to the malicious app, your auth code is gone.
RFC 7636 §1 sketches this out vividly.
You might think, "Does this actually happen in the wild, or is it just academic paranoia?" RFC 7636 §1 delivers a cold reality check:
> While this is a long list of pre-conditions, the described attack has been observed in the wild.
I know what you're thinking: "Why not just authenticate with a client_secret?"
Because Public Clients (like mobile apps and SPAs) cannot keep a secret. If you ship a hardcoded secret in a mobile binary, it will be extracted before you finish your morning coffee.
Since the attacker and the real app both show up at the Token Endpoint holding the exact same credentials (code + client_id), the auth server is blind. It has absolutely no way to tell the good guys from the bad guys.
The fix for this mess is so simple it almost feels anticlimactic.
The attacker doesn't know the original string. It only lives in the RAM of the legitimate app. You can steal the code, but without the secret string, the auth server will slam the door in your face. That is PKCE.
We can step right through the actual protocol (RFC 7636 §4) to see how this works.
The code_verifier is a cryptographically secure random string that the client generates freshly for every single request. Per the spec it is 43 to 128 characters long, drawn from the unreserved set:

[A-Z], [a-z], [0-9], -, ., _, ~
import secrets
import base64
# Generate 32 bytes (256 bits) of pure randomness
random_bytes = secrets.token_bytes(32)
# base64url encode (remove padding)
code_verifier = base64.urlsafe_b64encode(random_bytes).rstrip(b'=').decode('ascii')
print(code_verifier)
# Example output: dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk
⚠️ Never use predictable pseudo-random number generators like Math.random(). Just don't.
Next up, we derive the code_challenge from that code_verifier. The industry standard is S256.
code_challenge = BASE64URL-ENCODE(SHA256(ASCII(code_verifier)))
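That one-liner translates directly into Python. The verifier/challenge pair below is the RFC 7636 Appendix B test vector, the same values that appear in the curl examples in this article:

```python
import base64
import hashlib

def s256_challenge(code_verifier: str) -> str:
    """code_challenge = BASE64URL-ENCODE(SHA256(ASCII(code_verifier)))"""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    # base64url without padding, as the spec requires
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

# RFC 7636 Appendix B test vector
print(s256_challenge("dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"))
# → E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM
```

If your own implementation doesn't reproduce this exact pair, check the padding stripping first; trailing `=` characters are the most common mistake.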
The spec says you MAY use plain (i.e., no hashing) if some technical constraint absolutely prevents S256. To put it bluntly: In the modern era, there is zero excuse to use plain. Ever.
Staring at Mermaid diagrams gets boring fast. Let’s simulate the raw PKCE flow using your terminal and curl.
① Authorization Request (via browser)
The client includes code_challenge and code_challenge_method as parameters and redirects the user.
# The URL actually hit in the browser's address bar
curl -i "https://auth.example.com/authorize?\
response_type=code&\
client_id=my-native-app&\
redirect_uri=myapp://callback&\
state=xyz123&\
code_challenge=E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM&\
code_challenge_method=S256"
Behind the scenes, the server ties E9Melhoa... to the issued code, stores it, and returns a redirect back to the app.
② Token Request
The app, having caught the incoming authorization code, sends the code_verifier (which it kept safely tucked away in its own memory) straight to the Token Endpoint.
curl -X POST "https://auth.example.com/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "grant_type=authorization_code" \
-d "client_id=my-native-app" \
-d "redirect_uri=myapp://callback" \
-d "code=AUTH_CODE_RECEIVED" \
-d "code_verifier=dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
The server hashes the received code_verifier with SHA-256 and checks if it matches the stored code_challenge. If they match, congratulations—you get the access token.
If an attacker intercepted the code and sends it here, their code_verifier will be empty or garbage, and the server will mercilessly kick them out with an invalid_grant.
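The server-side check is the mirror image: hash what the client presents and compare it against the challenge stored with the code. A sketch (the constant-time compare is a habit, not a PKCE requirement):

```python
import base64
import hashlib
import hmac

def verify_s256(code_verifier: str, stored_challenge: str) -> bool:
    """Token-endpoint side: recompute the challenge from the presented
    verifier and compare against the one stored with the auth code."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    computed = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return hmac.compare_digest(computed, stored_challenge)

stored = "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM"
print(verify_s256("dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk", stored))  # → True
print(verify_s256("attacker-guess", stored))                               # → False
```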
Why am I relentlessly telling you to avoid plain?
Because plain literally means code_challenge = code_verifier. You are passing your secret completely unhashed.
If the smartphone's HTTP history leaks, or if the auth request gets intercepted over a compromised TLS proxy, the attacker instantly sees your code_challenge. If you're using plain, they just got your code_verifier for free. Once they snatch the auth code, they have all the pieces, and your entire PKCE defensive wall collapses.
With S256, even if they intercept every single auth request and steal the code_challenge, trying to reverse-engineer the original code_verifier out of that SHA-256 hash requires a pre-image attack. Good luck with that. They'll be crunching hashes until the sun burns out.
Using plain is essentially opting out of security while pretending you didn't.
"Wait, doesn't the state parameter prevent CSRF? Why do I need both?"
Simply put: They block attacks coming from totally opposite directions.
| Feature | state (RFC 6749) | PKCE (RFC 7636) |
|---|---|---|
| What it stops | CSRF (Auth code injection) | Auth code interception |
| Attack direction | Attacker → Victim (Forces the attacker's code into your session) | Victim → Attacker (Steals your code for the attacker's session) |
| Where it's verified | Client-side (Your app checks the callback) | Server-side (Token Endpoint validates the verifier) |
state ensures nobody shoves a sketchy auth code into your app. PKCE ensures nobody takes your auth code and uses it elsewhere. They aren't mutually exclusive. You absolutely must use both.
If you know anything about passwords, you know you need a salt. So why doesn't PKCE use one for its hashes?
Because the base entropy is already at lethal levels.
Passwords need salts because human-readable passwords have pathetically low entropy, leaving them wide open to pre-computation (rainbow table) attacks. A code_verifier, however, is 256 bits of pure cryptographic randomness. Brute-forcing it is physically impossible in our universe. Adding a salt achieves absolutely nothing except making your code messier.
RFC 7636 has a strict rule:
> Clients MUST NOT downgrade to "plain" after trying the "S256" method.
If you send S256 and the auth server throws an error, do not try to be helpful and resend the request using plain. That’s exactly how a Man-In-The-Middle attacker strips away your security (a downgrade attack). If S256 fails, the flow is compromised. Kill it immediately.
RFC 7636 is a breezy 20-page specification, but it fundamentally fixed one of the scariest vulnerabilities in the OAuth 2.0 ecosystem: delivering the authorization code.
To sum it up: authorization code interception on mobile is real, the verifier/challenge pair defeats it, and S256 is the only challenge method worth using. plain is reckless.

PKCE was initially pitched as an "optional extension for Public Clients." Today, under the current OAuth 2.0 Security Best Current Practice (BCP), it is strictly mandatory for everyone, even Confidential Clients sitting on secure backend servers.
Skipping PKCE in a modern system isn't an "option." It's just plain wrong.
2026-04-12 13:31:38
The Apache Iceberg community is discussing "secondary indexes." This topic is far more complex than it appears on the surface. Adding an index is not a simple engineering problem of "how to store an index file." It requires deep thinking about how an open format should design an entire indexing ecosystem.
If you are short on time, here are the core takeaways:

- Iceberg's existing pruning falls short in three scenarios: high-cardinality point lookups, vector similarity search, and Merge-on-Read tables with many delete files.
- The community's direction is a unified, snapshot-aware index framework, advanced in parallel with a concrete Bloom-filter skipping index as its first proof point.
- Richer index types (covering, full-text, vector, delete acceleration) are parallel tracks at different maturity levels, with asynchronous maintenance as the default.

Apache Iceberg already excels at data pruning. The format provides several mechanisms to skip irrelevant data:

- Partition pruning against the table's partition spec.
- Manifest-level partition summaries that let planners skip whole manifests.
- File-level min/max column statistics recorded in manifest entries.

Together, these mechanisms help query engines eliminate massive amounts of irrelevant data before they even open a data file.
So, why is the Iceberg community still seriously discussing secondary indexing?
The answer is simple: In certain specific scenarios, the limitations of current mechanisms are painful enough for users.
Imagine you have an orders table holding 700 million records distributed across 700 Parquet files. You want to query all orders for a specific user:
SELECT * FROM orders WHERE user_id = 987654321;
Iceberg can filter out some files using partition pruning and file-level statistics. But if user_id is not a partition column, and the user_id ranges across files overlap heavily (since a user's orders might land in files from different time periods), the query engine might still need to open 658 files to find the single file that actually contains the target user.
This creates the classic Needle in a Haystack problem.
As RAG (Retrieval-Augmented Generation) and multimodal search workloads enter the data lake, queries looking for "the K most similar records to a given vector" are becoming increasingly common. Existing min/max statistics offer almost no help for vector retrieval.
Iceberg supports the Merge-on-Read (MOR) mode for handling updates and deletions. Deletions do not modify the data files directly. Instead, engines write them to separate delete files. When a table accumulates many equality delete files, the read path must reconcile every file to figure out which rows were deleted. This causes severe read amplification.
These three pain points reveal that Iceberg is not just looking for another optimization trick. The community is addressing a more fundamental issue:
In an open table format, how do we express and maintain multiple "alternative access paths" in a snapshot-aware, engine-agnostic way?
This is the essence of the "index" concept.
In traditional databases like MySQL or PostgreSQL, adding an index to a table is entirely natural. Why does introducing indexes to Iceberg spark endless discussions and require countless design trade-offs?
The core reason is this: Iceberg is an open format, not a closed storage engine.
In a traditional database, the database itself maintains the indexes. You write the data, the database updates the indexes synchronously, and the user never needs to worry about the underlying details.
However, Iceberg relies on completely different engines (like Spark, Flink, Trino, and Dremio) to write data. If the Iceberg specification forced engines to synchronously update indexes during writes, several problems would arise: every write path gets slower, every engine must implement every index type before it can write safely, and the spec balloons far beyond the responsibilities of a table format.
Therefore, the first fundamental disagreement the community faces is: Should index maintenance be synchronous or asynchronous?
The case for asynchronous maintenance:
The system does not enforce index updates during writes. Instead, background jobs (similar to compaction) rebuild or merge indexes periodically. This approach imposes zero intrusion on writers and offers more flexibility.
The case for synchronous maintenance:
Some scenarios demand immediate index availability. For instance, high-concurrency point lookups need to work right after real-time writes. If the index always lags behind the data, the system introduces too much uncertainty for users.
Current community consensus: We prioritize asynchronous maintenance and treat synchronous updates as an optional enhancement.
This is another core issue that has stalled conversations.
Iceberg's TableMetadata currently records schema details, partition rules, and the snapshot chain. Stuffing index definitions and state into this file would make the table metadata increasingly bloated. But if developers isolate the index metadata completely, query engines must issue an extra network request to read index information during query planning, which increases planning latency. That "entry cost" could completely offset the performance gains the index provides.

The current compromise direction is to keep only a lightweight summary of index existence and state in the table metadata, while storing the full index definitions and data in separate files that engines read only when they decide to use an index.
This challenge is unique to Iceberg. Iceberg datasets exist as versioned snapshots. You can query yesterday's snapshot just as easily as you query the current one.
This raises a tricky question: If you build an index on snapshot_v5, is it valid for a query targeting snapshot_v3?
The answer is not a simple "yes" or "no." It depends entirely on what happened between those two snapshots. If the system only appended data, the index might still be valid (just incomplete). If compaction or schema changes occurred, the index likely became invalid.
This requires the indexing system to track its own "freshness" and actively verify its validity during use.
There is an even more subtle problem: Indexes do not only need updates when data files change.
Consider a scenario where you add a new column to a table and specify a default value. This operation does not rewrite the old data files. However, during a query, those old records logically "possess" this new default value. If you happen to have an index on this column, its contents are now stale, even though the underlying files remain completely unchanged byte-for-byte.
This means the index maintenance logic must detect more than just data file modifications. It also needs to sense schema changes, default value updates, and other metadata-only operations.
Facing all these complexities, the community realized something crucial: Before implementing any specific index, we must first build a unified indexing framework.
This framework needs to answer the following questions: how an index is declared in table metadata, which snapshots it is valid for, who maintains it and when, and how a query engine discovers whether using it is safe.
Building the framework is the correct step, but it is also the most time-consuming. You must secure broad consensus before all engines can implement features according to a unified standard.
Faced with the dilemma of "standardize the framework first" versus "build a concrete feature first," the Iceberg community adopted a pragmatic strategy: Advance both simultaneously and let them validate each other.
This strategy is very practical. The framework gets a concrete implementation to keep it grounded. Meanwhile, the concrete implementation uses the framework to ensure it is not a short-sighted, one-off solution.
Let us examine the various index proposals surfacing in community discussions. They are not competing against each other. Instead, they represent parallel paths aimed at different problems and sitting at different maturity levels.
What problem it solves: This targets the "needle in a haystack" scenario by eliminating irrelevant files before the engine even attempts to open them.
How it works:
Developers build a Bloom filter for a target column (like user_id) on a per-data-file basis. They store this Bloom data in a Puffin file (Iceberg's container format for statistics and index data).
When a query arrives:
query: SELECT * FROM orders WHERE user_id = 987654321
1. The system performs standard partition and statistical pruning -> leaving 658 candidate files.
2. The planner reads the Bloom data from the Puffin file (without opening any data files).
3. The Bloom filter determines that out of 658 files, only 1 "possibly contains" this user_id.
4. The query engine ultimately scans only this 1 file.
We must establish a critical understanding here: A Bloom filter will only tell you "possibly yes" or "definitely no." It never gives a false negative, but it can give a false positive. Therefore, it functions as a Skipping Index rather than a traditional index that points directly to specific rows.
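To make the "possibly yes / definitely no" behavior concrete, here is a toy Bloom filter. Real implementations tune the bit-array size and hash count to a target false-positive rate; this sketch just shows the mechanics.

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: answers 'definitely not' or 'possibly yes'."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions from salted SHA-256 digests
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # A single unset bit proves absence; all-set only suggests presence
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# One filter per data file; the planner keeps only files that might match
bloom_for_file = TinyBloom()
for user_id in ("987654321", "111222333"):
    bloom_for_file.add(user_id)

print(bloom_for_file.might_contain("987654321"))  # → True (never a false negative)
```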
Best use cases:

- Equality lookups on high-cardinality columns (a user_id point query is the canonical case).
- IN (...) set queries.

Its limitations:

- It offers nothing for range predicates, and little for low-cardinality columns (think WHERE status = 'active', which hits many files anyway).

Why the community expects it to land first:
The use case is clear. The correctness semantics are simple to define (it affects performance, not query results). Engineers can build it asynchronously, and it does not require changes to the write path. As of March 2026, the corresponding Proof of Concept (PR #15311) already shows significant results, successfully dropping candidate file counts from 658 down to 1 in specific test scenarios.
What problem it solves: The goal moves beyond skipping files. The system attempts to "use the index to directly answer the query."
How it works:
The engine designates a column or group of columns as a key and stores other frequently queried columns directly inside the index structure (this is the definition of "covering"). When querying:
SELECT user_id, last_name, first_name FROM persons WHERE nationality = 'CN';
If a covering index exists with nationality as the key and includes user_id, last_name, first_name:
1. The engine locates nationality = 'CN' within the index.
2. It reads user_id, last_name, first_name directly from the index.
3. It bypasses the main data files entirely.
Unlike Bloom filters, this type of index does more work. It does not just tell the planner where to look; it actively participates in answering the query.
In community discussions, developers will likely back this type of index using a Materialized View. Essentially, this means maintaining an extra Iceberg table optimized and sorted by the index key.
Why it is not the top priority yet:
The implementation complexity for B-Tree/Covering indexes far exceeds Bloom filters. It involves redundant data storage, update maintenance, and query rewrite capabilities inside the engine itself. If an engine cannot recognize and rewrite a query to use the index path, the universal value of this index across different engines drops significantly.
What problem it solves: This supports text search, allowing users to execute inverted retrievals directly within the data lake.
How it works:
You can intuitively understand a full-text index as an "inverted table":
Original data:
doc1: "iceberg secondary index"
doc2: "iceberg metadata"
doc3: "vector index"
Term Index (Inverted structure):
iceberg -> [doc1, doc2]
secondary -> [doc1]
index -> [doc1, doc3]
metadata -> [doc2]
vector -> [doc3]
When querying "iceberg secondary index", the system consults the inverted table to find candidate documents, then fetches the full data from the main table as needed.
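The inverted table above can be built and queried in a few lines, using the same three documents from the example:

```python
from collections import defaultdict

docs = {
    "doc1": "iceberg secondary index",
    "doc2": "iceberg metadata",
    "doc3": "vector index",
}

# Build the inverted table: term -> set of doc ids containing it
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

def candidates(query: str):
    """Docs containing every query term (AND semantics)."""
    sets = [inverted[t] for t in query.split()]
    return sorted(set.intersection(*sets)) if sets else []

print(sorted(inverted["iceberg"]))            # → ['doc1', 'doc2']
print(candidates("iceberg secondary index"))  # → ['doc1']
```

This naive sketch sidesteps exactly the hard parts the community is debating: tokenization, scoring, phrase matching, and positions.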
Its challenges:
Building a true full-text search system requires much more than maintaining a simple postings list. Tokenization rules, scoring models, phrase matching, and positional information will all trigger fierce standardization debates. The central controversy is this: How deep should the Iceberg core specification define the standard for "full-text retrieval"? If defined too shallowly, every engine creates an inconsistent implementation. If defined too deeply, the overhead becomes immense, far exceeding the responsibilities of a table format.
Current status: The community has incorporated it into the general index framework discussions, but it remains in the early conceptual stages.
What problem it solves: This supports approximate nearest neighbor (ANN) search, helping Iceberg natively handle AI retrieval workloads.
How it works:
The core logic of a vector index (like IVF, or Inverted File Index) operates like this:
Input: An embedding vector
1. The system locates the closest "cluster centroid" in the index.
2. The engine scans only the candidate vectors belonging to that cluster.
3. It calculates exact distances and returns the top-k results.
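Here is a toy Python sketch of those three steps, with invented centroids and vectors; a real IVF index would train its centroids with k-means and typically store quantized vectors:

```python
import math

# Hypothetical tiny IVF index: each cluster centroid owns its member
# (id, vector) pairs.
ivf = {
    (0.0, 0.0): [("a", (0.1, 0.2)), ("b", (0.3, 0.1))],
    (5.0, 5.0): [("c", (4.9, 5.2)), ("d", (5.3, 4.8))],
}

def ann_search(query, k=1):
    # 1. Locate the closest cluster centroid.
    centroid = min(ivf, key=lambda c: math.dist(c, query))
    # 2. Scan only the candidate vectors belonging to that cluster.
    # 3. Compute exact distances and return the top-k ids.
    ranked = sorted(ivf[centroid], key=lambda item: math.dist(item[1], query))
    return [doc_id for doc_id, _ in ranked[:k]]

print(ann_search((5.0, 5.0), k=1))
```

Note the recall tradeoff baked into step 2: vectors in a neighboring cluster are never examined, which is why the choice of index family (IVF vs. HNSW vs. DiskANN) matters so much.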
Why the community is prioritizing this direction:
AI workloads like RAG, multimodal search, and semantic retrieval are flooding into data infrastructure. If Iceberg can natively manage vector data and its retrieval indexes, engineering teams gain unified version control and governance across their training data, vector data, and indexes.
Why it is difficult:
Vector indexes come in an overwhelming variety (IVF-PQ, HNSW, DiskANN, etc.). They differ drastically in distance functions, recall rates, latency, and quantization methods. Standardization is an order of magnitude harder than with Bloom filters. The current community split focuses on this question: Should we start with a simpler MV-backed approach, or leap straight into designing dedicated native vector index structures?
Current status: The open-source community has dedicated issues and proposals reviewing this, but no unified implementation is close to deployment.
What problem it solves: This accelerates the Merge-on-Read path when dealing with massive numbers of delete files.
This is a relatively "internal" index optimization. It does not expose itself directly to SQL users; instead, execution engines quietly leverage it during scanning:
1. The engine begins reading data file A.
2. It internally queries the delete acceleration index.
3. If the index shows no deleted records overlap this file -> the system skips the delete reconciliation step entirely.
4. If there is a hit -> the system retrieves the precise row positions and processes only the affected rows.
This index carries massive business value for "write-heavy, delete-heavy" upsert scenarios. It directly slashes the row-by-row validation costs during the MOR read path.
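A minimal Python sketch of that skip-or-reconcile decision (the index layout here is invented for illustration):

```python
# Hypothetical delete acceleration index: maps a data file to the set of
# deleted row positions; files absent from the map have no overlapping deletes.
delete_index = {
    "data-A.parquet": {3, 17},
}

def read_file(path, rows):
    deleted = delete_index.get(path)
    if not deleted:
        # No deletes overlap this file: skip reconciliation entirely.
        return rows
    # Hit: drop only the affected row positions.
    return [r for i, r in enumerate(rows) if i not in deleted]

print(read_file("data-B.parquet", ["r0", "r1", "r2"]))  # untouched, no reconciliation
```

The fast path is the whole point: for most files in a large table, the index answers "no overlapping deletes" and the per-row merge work disappears.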
Current status: Developers proposed this in a previous issue that was later marked "not planned." However, it reappeared in the 2026 index synchronization agenda. This indicates the community has not abandoned the need, but simply deprioritized it from the current mainline plan.
We can arrange these five proposals to clarify their core positioning:

| Index | Core Goal | Directly Answers Queries | Storage | Maturity |
| --- | --- | --- | --- | --- |
| Bloom Skipping Index | Reduces invalid file scans | No, only skips files | Puffin | Most advanced POC |
| B-Tree / Covering Index | Supports direct positioning and avoids table access | Potentially yes | MV-backed Iceberg table | Proposal stage |
| Full-text / Term Index | Supports inverted retrieval | Subject to design | Postings table / MV | Early concepts |
| Vector Index (IVF/ANN) | Supports approximate nearest neighbor search | Subject to design | MV-backed or native structures | Early discussions |
| Delete/MOR Acceleration Index | Optimizes read paths in deletion scenarios | No, internal engine optimization | Specialized metadata | Not a main priority |
In short: these five indexes solve completely different problems and target entirely different workloads. This explains why establishing a universal indexing framework is so critical. Without a unified object model and lifecycle management system, these index types would easily fragment into isolated implementations across different engines.
While developers have reached a preliminary consensus on the broad technical direction, several major disagreements remain unresolved:
Is a Bloom filter a true index, or merely enhanced file statistics? This is not a boring semantic game. The conclusion dictates whether developers attach Bloom under the unified IndexCatalog or place it closer to the "enhanced file statistics" side of the codebase.
One faction argues: Bloom has independent metadata, separate maintenance workflows, and actively alters the query access path. Therefore, it is definitely an index.
The opposing faction insists: Bloom itself never directly returns rows; it merely helps the planner exclude some files. It acts more like a massive upgrade to file-level stats. It does not qualify as a "true secondary index."
This theoretical divide impacts the next design decision: Does our universal index framework need to prioritize index types that align more closely with "traditional database definitions"?
The community repeatedly surfaces this idea: Some indexes do not require inventing brand new file formats. Reusing an additional Iceberg table to host them is perfectly sufficient.
The benefits are obvious. We reuse mature existing capabilities, making implementation extremely straightforward. The core controversy, however, centers on boundaries. If this "index table" requires the query engine to perform query rewriting just to use it, is it truly a unified indexing capability at the Iceberg level? Or is it simply an internal optimization trick living inside the engine?
If the community eventually allows users to declare a specific index as "mandatory" (must be maintained), this move triggers a cascade of difficult secondary problems: chiefly, whether every writer in the ecosystem must then be able to maintain that index, and what happens to a write when the index update fails.
Currently, the mainstream discussion leans this way: Do not demand that all writers natively support every complex index type. Doing so would immediately raise the barrier to entry, a move that contradicts Iceberg's fundamental identity as an open data format.
If you are a developer interested in index design, these extensive Iceberg discussions are not just internal open-source trivialities. They offer a rare, complete, and public walkthrough of a massive systemic problem: "How to design an index architecture for an open format."
We can extract a few essential insights for system design:
1. "Index" is never a singular technical concept.
Bloom filters, B-Trees, full-text indexes, and vector indexes target completely different problem categories. They require drastically different underlying storage structures and daily maintenance mechanisms. Therefore, "supporting indexes" never means doing just one thing; it means handling an entire category of related tasks.
2. Designing indexes for open formats is much harder than for closed engines.
In the closed world of traditional databases, the engine possesses uncompromising, total control over its storage media. This allows the engine to flawlessly bind index updates to the core write process. Iceberg, however, must define precise semantics, full-lifecycle management, and data consistency guarantees while facing the harsh reality of uncontrollable, diverse writers. This difficulty stems from the architectural essence, not just code-level implementation.
3. Comprehensive metadata design is the true deep-water challenge.
Understanding "how to save a Bloom filter file to a hard drive" is not a real system challenge. That is a basic engineering task. The truly difficult bottlenecks are these: Which exact data snapshot does this newly generated index tie to? Under what specific conditions does it expire? How does a query engine efficiently and cheaply check if a valid index is available right now? More often than not, these complex design questions hold far more life-or-death importance than the physical format of the index file itself.
4. There is an irreconcilable tension between absolute consistency and peak performance.
Choosing between safe synchronous updates and fast asynchronous ones, or deciding between strict mandatory indexes and best-effort advisory ones — every architectural path represents a painful, calculated tradeoff between strict data consistency and the quest for ideal performance. There is no universally correct, ultimate answer. There are only temporary compromises that fit specific workload demands better at a given time.
5. Building the foundational framework first is far more important than rushing out a single feature.
Looking past the surface, we see that the Iceberg community resolutely chose a steady path: "Build a universal indexing framework first, then use Bloom specifically to validate it." This choice proves they clearly understand a painful systems lesson. If they allow every different index to invent its own metadata format and isolated lifecycle rules, they might save effort today, but they guarantee utter architectural chaos in the long run.
As of late March 2026, the Iceberg community has not yet merged an official, complete secondary index specification. However, two core Proof of Concept (POC) proposals remain highly active and open for serious review and debate.
Overall, the mainstream consensus envisions this final form:
The Iceberg core specification standardizes the index object models, snapshot binding relationships, Catalog APIs, and lifecycle semantics. Meanwhile, it allows different index types to adopt the storage mediums that suit them best: Bloom uses Puffin; B-Tree/Covering uses materialized views; and vector indexes might rely on dedicated native structures.
This design philosophy avoids forcing a single format onto all indexes. It aligns perfectly with Iceberg's historical trajectory: remaining open, extensible, and fiercely independent of any single query engine.
Iceberg's extensive discussion surrounding indexes delivers a vivid, public masterclass in architecture design. It serves as a strong reminder to all software developers and architects: "Adding an index" to an open format system is never as easy as merely saving a file.
To introduce an index into a foundational data format like Iceberg — one hosting a massive ecosystem and multiple engine integrations — you must navigate deep waters. You have to resolve metadata binding mechanics, lifecycle management, strict consistency guarantees, and compatibility strategies across varying snapshots. Every problem you solve requires a difficult, precise tradeoff between consistency and performance.
If you enjoy exploring these backend logic subjects and architectural evolutions, I strongly recommend reading the original discussions in the community mailing lists and PRs. That is where you will find the freshest, most vibrant engineering design wisdom on the front lines.
2026-04-12 13:30:54
This week I was focused on building a solid foundation in SQL by working with student, subject, and exam result data. Along the way, I explored key concepts like DDL, DML, filtering using WHERE, and transforming data using CASE WHEN. Here’s a summary of what I learned.
SQL commands are broadly categorized into two main types:
DDL is used to define and manage the structure of a database.
Examples include:
- CREATE – used to create tables
- ALTER – used to modify table structure
- DROP – used to delete tables

Example:

```sql
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    gender CHAR(1),
    date_of_birth DATE,
    class VARCHAR(10),
    city VARCHAR(50)
);
```
In simple terms, DDL defines how the database looks.
### 2. DML (Data Manipulation Language)
DML is used to manage and manipulate the data inside the tables.
Examples include:
- INSERT – add new data
- UPDATE – modify existing data
- DELETE – remove data
Example:
```sql
INSERT INTO students (student_id, first_name, last_name)
VALUES (1, 'Amina', 'Wanjiku');
```
**DML focuses on the data itself.**
**Using CREATE, INSERT, UPDATE, and DELETE**
I used CREATE to define tables like students, subjects, and exam_results.
I used INSERT to populate tables with records extracted from provided data.
Although not heavily used, UPDATE can be applied to change values, for example:

```sql
UPDATE students
SET city = 'Nairobi'
WHERE student_id = 1;
```

DELETE removes records:

```sql
DELETE FROM students
WHERE student_id = 10;
```
These commands helped me understand how databases are built and maintained step by step.
**Filtering Data Using WHERE**
The WHERE clause is used to filter records based on conditions.
These are some of the operators used:
- = → exact match
- > → greater than
- BETWEEN → range filtering
- IN → multiple values
- LIKE → pattern matching
Examples:

```sql
-- Exact match
SELECT * FROM students WHERE class = 'Form 4';

-- Range
SELECT * FROM exam_results WHERE marks BETWEEN 50 AND 80;

-- Multiple values
SELECT * FROM students WHERE city IN ('Nairobi', 'Mombasa', 'Kisumu');

-- Pattern matching
SELECT * FROM students WHERE first_name LIKE 'A%';
```
This is one of the most powerful parts of SQL because it allows precise data retrieval.
**Transforming Data Using CASE WHEN**
CASE WHEN allows us to create new columns based on conditions. It works like an "if-else" statement in programming.
Example:
```sql
SELECT
    marks,
    CASE
        WHEN marks >= 80 THEN 'Distinction'
        WHEN marks >= 60 THEN 'Merit'
        WHEN marks >= 40 THEN 'Pass'
        ELSE 'Fail'
    END AS performance
FROM exam_results;
```
This helped convert raw numeric data into meaningful categories.
CASE WHEN is very useful for reporting and data interpretation.
**Reflection**
This week was both challenging and interesting.
What I found challenging:
- Understanding when to use operators like IN vs BETWEEN
- Writing clean, error-free queries; most of my queries had to be corrected before they would run

On the other hand, seeing how raw data can be transformed into meaningful insights was very interesting.
**Conclusion**
This week helped me understand the basics of working with databases:
- DDL defines structure
- DML manipulates data
- WHERE filters results
- CASE WHEN transforms data
These are essential building blocks for any developer working with databases. I’m looking forward to learning more advanced SQL concepts like joins and aggregations next, and I will share it all here.