
The Practical Developer

A constructive and inclusive social network for software developers.

RSS preview of the blog of The Practical Developer

Hello World

2026-04-23 02:27:20

Hey everyone! 👋

I'm a self-taught vibe coder and this is my first post here. I'm pretty new to the dev.to community.

I like building AI-powered tools that speed up the game development process and make it cheaper and less painful along the way.

My first project is AutoGameVisionTester — a background tool that smart-captures gameplay screenshots and uses Grok Vision to run QA/analysis reports.

Just wanted to say hi to the community and hope everyone is doing well!

What projects are you currently working on? I'd love to hear about them :D

If you want to check out AutoGameVisionTester, here's the repo:

Repo: https://github.com/Sqeakzz/AutoGameVisionTester

I shipped a DevSecOps tool in 2026 with zero LLM calls. On purpose. I think determinism still wins.

2026-04-23 02:24:48

Let me get one thing out of the way before the comments section catches fire.

I am not a luddite. I use AI all day. This article was outlined in Cursor, the tool I am about to describe was built with heavy help from Claude Code, and my IDE autocomplete is so spicy these days that I sometimes forget I am the one supposed to be writing the code. That is normal in 2026. Pretending otherwise would be ridiculous.

The point of this article is something else.

The point is: the shipped product itself does not call a single LLM at runtime. No OpenAI key. No Anthropic key. No RAG layer. No agentic loop. No "AI Mode" toggle. No vector database, no embedding step, no model card. Just plain old deterministic code that does the same thing every time.

In the current climate I think that genuinely needs an explanation, so here it is.

What the tool actually does

I built ArchiteX, a free open source GitHub Action that runs on every pull request that touches *.tf files. It parses the Terraform on both sides of the PR, builds an architecture graph for each, computes the delta, runs a set of weighted risk rules, and posts a sticky comment on the PR with:

  • a 0 to 10 risk score
  • a short plain English summary of what changed
  • a small Mermaid diagram of just the changed nodes plus one layer of context
  • an optional CI gate to fail the build above a threshold

You can see exactly what the comment looks like here, no install needed: live sample report.

It supports AWS and Azure today. MIT licensed. Single Go binary. Free forever, no paid tier ever.

This is, in 2026 vocabulary, the most boring possible tool. No agents. No reasoning loop. No "ask the diagram a question". I think that is a feature, not a bug, and I want to explain why.

Why I deliberately did not put an LLM in the hot path

In my opinion, the moment you put an LLM in the runtime of a tool that grades pull requests, you lose three things at the same time. And once you lose them, you cannot get them back without rewriting the trust contract you have with your users.

1. You lose determinism. And determinism is the entire product.

A reviewer trusts an automated PR comment for one reason and one reason only: because re-running it cannot quietly change the score. If I open the same PR twice and the first run says "9.0 / HIGH" and the second run says "6.5 / MEDIUM", I will never trust that tool again. Not for security. Not for anything.

Every LLM I know is non-deterministic by default. You can pin temperature to zero, you can fix the seed, you can do all the rituals. The provider can still ship a new model checkpoint next Tuesday and your scores drift overnight, silently, with no audit trail. I think that is unacceptable for a tool whose entire job is to be trustworthy at the moment of code review.

ArchiteX has a golden test suite that re-runs the full pipeline against checked-in fixtures and asserts the rendered Mermaid, the score JSON, and the egress JSON are byte identical to a stored expected output. If anyone ever changes a map iteration order, a sort comparator, or a JSON marshaller, the build fails on the next push. That guarantee is impossible if there is a model call anywhere in the path.
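ArchiteX is a Go binary, and Go makes the map-iteration hazard concrete: ranging over a map yields a different order on every run, so any renderer that iterates a map directly cannot be byte-identical. A minimal sketch of the sort-before-render discipline those golden tests guard — the function and node names here are mine, not ArchiteX's actual code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderNodes emits a graph's nodes in a fixed order so repeated runs
// produce byte-identical output. Iterating the map directly would yield
// a different order each run and break any golden test.
func renderNodes(nodes map[string]string) string {
	keys := make([]string, 0, len(nodes))
	for k := range nodes {
		keys = append(keys, k)
	}
	sort.Strings(keys) // the deterministic order is the load-bearing step
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s[%s]\n", k, nodes[k])
	}
	return b.String()
}

func main() {
	nodes := map[string]string{
		"aws_lambda_function.api": "compute",
		"aws_iam_role.exec":       "iam",
	}
	a := renderNodes(nodes)
	b := renderNodes(nodes)
	fmt.Println(a == b) // true: same input, same bytes, every run
}
```

A golden test then only has to assert that the rendered bytes equal a checked-in expected file; any change to the sort comparator fails the build.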

2. You lose the trust model.

ArchiteX never runs terraform plan. Never calls AWS or Azure. Never downloads provider plugins. Never touches state. The only network call in the entire tool is the GitHub REST API call at the very end to post the comment. The Terraform code never leaves the runner. There is no SaaS, no signup, no telemetry, no opt-out flag, because there is nothing to opt out of.

The moment I add a model call, I have to send something to a third party. Even a sanitized summary. Even just the rule IDs. The trust conversation immediately changes from "this runs entirely on your CI runner" to "well, mostly". I think for a tool aimed at regulated tenants, financial services, healthcare, government, that is the difference between adoption and a polite no thank you.

3. You lose air-gap and fork-PR support.

A real bonus consequence I did not appreciate until I was deep into the design: because the tool needs zero credentials and zero API keys, it works on PRs from forks. PRs from forks are exactly where most supply-chain-style attacks land, and most CI tooling refuses to run on them precisely because it needs secrets. ArchiteX runs there fine, because it has nothing to lose.

Same for air-gapped CI. Banks, defense contractors, hospitals. Add an LLM call and you cannot ship there at all.

What I gave up by saying no to AI in the runtime

I am not pretending this came for free. The tradeoffs are real and I think being honest about them is part of why this post is worth writing.

The plain English summaries are template-based, which means they are correct, deterministic, and a bit dry. An LLM would write nicer prose. I know. I have prototyped exactly that. The prose was indeed nicer. The score also drifted between runs, and a colleague asked me, with a completely straight face, "did the AI just decide my PR was fine today?". That was the end of that prototype.

There is no smart cross-resource correlation. If your PR adds an unauthenticated Lambda URL and attaches AdministratorAccess to a role in the same diff, ArchiteX scores both rules and sums them. It does not say "hey, those two together are materially worse than the sum of their parts". An LLM walking the graph could probably spot that. A deterministic rule engine cannot, unless I write the specific compound rule by hand. I am writing them by hand. It is slow. I think that is fine.

Rule curation is manual. Adding a new resource type means writing the parser support, the abstract type mapping, the literal attribute extraction, the edge inference, the risk rules, and the tests. There is no "let me ask the model to suggest 5 dangerous patterns for this resource". That tradeoff is the price of getting reproducibility as a load-bearing property.

A few technical decisions that follow directly from "no LLM"

For the people who care about how this looks in code, the no-LLM constraint shaped a bunch of design choices that I think are interesting on their own merits.

The HCL parser is hashicorp/hcl/v2 walked generically through hclsyntax, not decoded with gohcl (which would have needed a Go struct per resource type). Attribute values are evaluated with expr.Value(nil) and any failure to resolve a literal is recorded as nil rather than guessed at. Variable-driven attributes never trigger rules even when they should. The engine never invents values. Reproducibility wins.

The trust model is enforced structurally with a CI grep rule: ! grep -rE "net/http|architex/github" parser graph delta risk interpreter models. The build fails if any analysis package ever imports networking. The only place HTTP is allowed is the github REST client, which is only ever called by main.go in one specific subcommand. Code review can be fooled. A CI grep cannot.

The Mermaid renderer has a deterministic byte-budget cap. mermaid-js stops rendering above 50,000 characters (the GitHub PR comment failure mode for big diagrams). The renderer keeps nodes by status priority, then abstract type priority, then ID alphabetical, until the byte budget is hit, then drops the rest and emits a visible truncation marker so reviewers always know it happened. Found this empirically with a synthetic stress probe, not by asking a model.
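A sketch of what such a deterministic byte-budget cap can look like in Go. The struct shape, priority encoding, and function name are illustrative assumptions, not ArchiteX's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical simplified shape: lower priority values render first.
type node struct {
	id             string
	statusPriority int    // e.g. changed=0, context=1
	typePriority   int    // priority of the abstract type
	rendered       string // the Mermaid line for this node
}

// keepWithinBudget sorts by (status, type, id) and keeps nodes until the
// byte budget would be exceeded, then emits a visible truncation marker.
func keepWithinBudget(nodes []node, budget int) []string {
	sort.Slice(nodes, func(i, j int) bool {
		a, b := nodes[i], nodes[j]
		if a.statusPriority != b.statusPriority {
			return a.statusPriority < b.statusPriority
		}
		if a.typePriority != b.typePriority {
			return a.typePriority < b.typePriority
		}
		return a.id < b.id // alphabetical tie-break keeps output deterministic
	})
	var out []string
	used := 0
	for i, n := range nodes {
		if used+len(n.rendered) > budget {
			// %% is a Mermaid comment, so the marker survives rendering
			out = append(out, fmt.Sprintf("%%%% truncated: %d node(s) omitted", len(nodes)-i))
			break
		}
		out = append(out, n.rendered)
		used += len(n.rendered)
	}
	return out
}

func main() {
	nodes := []node{
		{"b", 1, 0, "b[context]"},
		{"a", 0, 0, "a[changed]"},
		{"c", 1, 0, "c[context]"},
	}
	for _, line := range keepWithinBudget(nodes, 20) {
		fmt.Println(line)
	}
	// a[changed]
	// b[context]
	// %% truncated: 1 node(s) omitted
}
```

The same input always drops the same nodes, so a truncated diagram is still golden-testable.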

There is a subcommand called architex baseline that snapshots the "shape" of your repo (kinds of resources, abstract types, edge pairs ever seen) into a small JSON file. Three baseline rules then surface novelties (a brand new abstract type, a brand new resource kind, a brand new edge pair) as low-weight signals. This is the closest the tool gets to "anomaly detection", and I deliberately built it as a deterministic snapshot diff and not as embeddings. Same fixture in, same novelty out, every time.
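The baseline comparison reduces to a plain set difference, which is what keeps it deterministic. A minimal sketch under assumed names — the real snapshot schema is not shown in the post:

```go
package main

import "fmt"

// baseline is a hypothetical simplified snapshot: the set of resource
// kinds (and, in the real tool, abstract types and edge pairs) ever seen.
type baseline struct {
	kinds map[string]bool
}

// novelties returns kinds present in the PR but absent from the baseline.
// A plain set difference: same fixture in, same novelty out, every time.
func novelties(b baseline, prKinds []string) []string {
	var out []string
	for _, k := range prKinds {
		if !b.kinds[k] {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	b := baseline{kinds: map[string]bool{
		"aws_s3_bucket": true,
		"aws_iam_role":  true,
	}}
	fmt.Println(novelties(b, []string{"aws_s3_bucket", "aws_lambda_function_url"}))
	// [aws_lambda_function_url]
}
```

No embeddings, no similarity threshold to tune: a kind is either in the snapshot or it is not.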

When I think AI absolutely does belong in a tool like this

Just so I do not come off as a hater. Here are the places I would happily use an LLM, and possibly will, just not in the runtime hot path:

  • Generating the first draft of a new risk rule from a CVE writeup or an incident postmortem. Then a human edits it, tests it, locks the weight, and ships it. The human is the trust boundary, not the model.
  • Translating the deterministic plain-English summary into other languages, offline, at release time. The translation is part of the build artifact, not a runtime call.
  • Helping users author suppressions. "I have this finding, write me a suppression block for it". This runs on the user's machine, not in the analysis pipeline, so it cannot influence the score.

The rule I keep coming back to: the model can write the rules, but the rules have to run without the model. I think that is the right line for security tooling specifically. Maybe not for chatbots. Maybe not for code generation. Definitely for anything that grades a pull request.

So if you are also building developer tools right now, I think this is worth asking

In 2026 it feels like every product launch needs an "AI" in the title to even get clicks. I get it. I literally led with the no-LLM angle to get you to read this article, so I am as guilty as anyone.

But I think there is a real, durable category of tools where adding an LLM in the runtime is a strict downgrade. Not because the model is bad, but because the contract between the tool and its users requires reproducibility, locality, and low trust surface. PR review tooling is one. CI gates are another. Anything that fires on every commit, anything that produces a number people will argue over, anything that has to work in an air-gapped tenant.

For those, I think the boring deterministic answer is still the right one. And I think that has to be defended on purpose right now, because the default is the other way.

If you want to look at it:

I would genuinely love your honest feedback in the comments, especially:

  • If you have built a similar tool with an LLM in the runtime, what did you do about reproducibility?
  • If you work in a regulated tenant, is the "no SaaS, no telemetry, runs entirely on your runner" property actually the deal-breaker I think it is, or am I overestimating it?
  • If you are on the other side of this debate, where would you put the LLM in a PR-review tool, and why?

I will reply to every comment.

Breaking Down Linux File-System

2026-04-23 02:22:06

In Linux, everything is a file—not just text documents or executables, but also devices, network sockets, and even running processes. This idea turns the whole operating system into a giant, interconnected file‑system tree that you can “hunt” through like a detective. In this blog I’ll walk you through 10 meaningful discoveries I made while exploring the Linux file system, focusing on what those files and directories actually do, why they exist, and what interesting insights they reveal.

The brain of the system: /etc

The /etc directory acts as Linux’s configuration brain. It stores almost all system‑wide configuration files, from user accounts (/etc/passwd, /etc/shadow) to service settings, network parameters, and shell preferences.

Why does it exist?

Linux is designed to be modular and customizable. Instead of baking settings into the kernel, the system keeps them in human‑readable files under /etc, so administrators can tweak behavior without recompiling the OS.

What problem it solves:

  • Centralized configuration: tools like SSH, NTP, and cron all read their config from here, so you can control the whole machine from one place.
  • Portability: take /etc from one server and some settings can be reused on another, as long as the distribution is similar.

Interesting insight:

Opening /etc/environment shows where global environment variables are defined; every user and many system services inherit these values. This means a single file can subtly shape how every process behaves across the system.

DNS and name resolution: /etc/resolv.conf and /etc/hosts

When you type google.com into a browser, Linux must translate that into an IP address. The main files that drive this are /etc/resolv.conf and /etc/hosts.

What they do:

  • /etc/resolv.conf: tells the resolver which DNS servers to query and sets search domains.
  • /etc/hosts: maps hostnames to IPs locally, bypassing DNS entirely.

Why they exist:

Without DNS, you’d have to memorise IP addresses for every service. These files let the system know where to ask for DNS answers and allow quick local overrides for testing or debugging.

Example:

$ cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 1.1.1.1
search mycompany.local

What problem it solves:

  • Fault isolation: you can point a test server to a different DNS server by changing just /etc/resolv.conf.
  • Local testing: /etc/hosts lets you temporarily route api.staging.example to 127.0.0.1 without touching DNS infrastructure.
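Pulling the configured DNS servers out of the file is a one-liner. Using the sample contents from above so the output is reproducible; point awk at the real /etc/resolv.conf on a live system:

```shell
# Write the sample resolv.conf from the article to a temp file,
# then extract just the nameserver IPs from it.
cat > /tmp/resolv.conf.sample <<'EOF'
nameserver 8.8.8.8
nameserver 1.1.1.1
search mycompany.local
EOF
awk '/^nameserver/ {print $2}' /tmp/resolv.conf.sample
# 8.8.8.8
# 1.1.1.1
```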

Interesting insight:

On some systems /etc/resolv.conf is just a symlink to /run/systemd/resolve/resolv.conf because systemd-resolved manages DNS and caches queries. This reveals how multiple layers of abstraction (service ↔ config ↔ resolver) all converge into a single file that applications read.

Routing tables on disk: /proc/net/route and /etc/iproute2

Routing is the process of deciding which network interface and gateway to use for each packet. Linux exposes its routing table through virtual files under /proc and configures higher‑level rules via /etc/iproute2.

What /proc/net/route does:

It’s a text representation of the kernel’s routing table, showing destination networks, gateways, and interfaces in hexadecimal.

Example:

$ cat /proc/net/route
Iface   Destination     Gateway         Flags   RefCnt  Use     Metric  MTU     Window  IRTT
eth0    0000FEA9        00000000        0001    0       0       0       0       0       0
eth0    0200A8C0        00000000        0001    0       0       0       0       0       0

Why /proc/net/route exists:

The kernel maintains routing data in memory, but user-space tools need access. /proc exposes this as a file-like interface so commands like ip route and netstat -r can read and display it.

What problem it solves:

  • Troubleshooting: if a server can’t reach a remote subnet, inspecting /proc/net/route (or ip route) can show missing or wrong routes.
  • Scripting: scripts can parse this file to detect routing changes or validate network config.

Interesting insight:

Seeing the table in hexadecimal initially looks cryptic, but once you decode the destination and gateway fields (e.g., 0200A8C0 → 192.168.0.2), you realise it’s literally the kernel’s raw routing logic laid bare as a text file.
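The hex fields are little-endian, so the bytes read right-to-left. You can decode one with nothing but cut and printf:

```shell
# Decode the little-endian hex address 0200A8C0 from /proc/net/route:
# bytes reversed give C0 A8 00 02, i.e. 192.168.0.2.
hex=0200A8C0
b1=$(echo "$hex" | cut -c7-8)  # C0
b2=$(echo "$hex" | cut -c5-6)  # A8
b3=$(echo "$hex" | cut -c3-4)  # 00
b4=$(echo "$hex" | cut -c1-2)  # 02
printf '%d.%d.%d.%d\n' 0x$b1 0x$b2 0x$b3 0x$b4
# 192.168.0.2
```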

Networking interface configuration

Network interfaces are controlled by config files and runtime data exposed under the file system. The exact paths differ by distro and network manager, but the idea is the same: configuration files live under /etc and running‑state info lives under /sys or /proc.

On many modern systems, NetworkManager writes connection‑specific files such as:

$ ls /etc/NetworkManager/system-connections/
wifi-home.nmconnection
server-wired.nmconnection

Why these files exist:

They persist the settings (SSID, IP mode, gateway, DNS) for each interface so the system doesn’t need to re‑ask for configuration every reboot.

What problem they solve:

  • Automatic re‑connection: after a reboot, the system uses these configs to bring interfaces up correctly.
  • Multi‑environment: a laptop can have separate configs for home, office, and mobile hotspot.

Interesting insight:

Exploring /sys/class/net/eth0 shows symbolic links and device‑specific files that expose the interface’s driver, speed, and MAC address. This reveals how even hardware is modelled as a file‑system tree, making low‑level inspection surprisingly scriptable.
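You can try this on any Linux machine: the loopback interface lo always exists, so the paths below work even where there is no eth0 (substitute a name from `ls /sys/class/net` for real hardware):

```shell
# Interface attributes are plain files. The loopback MAC is all zeros
# by definition, so this output is the same on every Linux system.
cat /sys/class/net/lo/address
# 00:00:00:00:00:00
cat /sys/class/net/lo/operstate
```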

Logs and digital breadcrumbs: /var/log

System logs are one of the most powerful “forensic” tools in Linux. The /var/log directory houses logs from the kernel, services, and applications, each stored as ordinary text files.

Typical files:

$ ls /var/log/
auth.log    syslog      kern.log    nginx/      apache2/

What they do:

  • auth.log (or secure on some distros): records SSH logins, sudo attempts, and other auth events.
  • syslog / messages: general system‑wide messages from daemons and the kernel.
  • Service‑specific logs (e.g., nginx/*, apache2/*): store web‑server access and errors.

Why they exist:

Logs answer the “what happened and when?” question. They help debug crashes, track performance issues, and detect security incidents.

What problems they solve:

  • Incident response: after a suspected breach, you can comb through /var/log/auth.log to see if there were brute‑force SSH attempts.
  • Automation: log‑parsing tools read these files and trigger alerts or dashboards.

Interesting insight:

Reading /var/log/kern.log can show exact timestamps of hardware events, such as when a USB device was plugged in or when a disk driver threw an error. This makes /var/log feel like a chronological diary of the machine’s entire life.
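A classic first move in incident response is counting failed SSH attempts per source IP. Sample log lines are inlined below so the pipeline is reproducible; on a real system you would run it against /var/log/auth.log (the field positions match the standard sshd log format):

```shell
# Three representative auth.log lines (two failures, one success).
cat > /tmp/auth.sample <<'EOF'
Apr 22 02:11:01 host sshd[912]: Failed password for root from 203.0.113.7 port 50022 ssh2
Apr 22 02:11:04 host sshd[913]: Failed password for root from 203.0.113.7 port 50023 ssh2
Apr 22 02:12:30 host sshd[920]: Accepted password for ritam from 198.51.100.4 port 51000 ssh2
EOF
# Count failed attempts per source IP (the IP is 4th field from the end).
grep 'Failed password' /tmp/auth.sample | awk '{print $(NF-3)}' | sort | uniq -c
# prints the count and IP: 2 203.0.113.7
```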

Users and accounts: /etc/passwd, /etc/shadow, /etc/group

User management is surprisingly file‑based. The main files are:

  • /etc/passwd: user names, UIDs, home directories, and default shells (no passwords).
  • /etc/shadow: password hashes, expiry dates, and login restrictions.
  • /etc/group: group definitions and which users belong to them.

Example:

$ head -2 /etc/passwd
root:x:0:0:root:/root:/bin/bash
ritam:x:1000:1000:Ritam,,,:/home/ritam:/bin/zsh

Why they exist:

Linux follows the principle of “everything is a file.” Instead of a hidden database, user and group data live in plain(ish) text files with strict permissions.

What problem they solve:

  • Fast, portable access: any program that needs UID‑to‑name mapping can open /etc/passwd without a database daemon.
  • Security separation: sensitive password hashes are stored in /etc/shadow, which is readable only by root, preventing casual inspection.

Interesting insight:

Noticing the x in /etc/passwd’s password field and then finding the actual hash in /etc/shadow reveals a clever separation of concerns: one file for public info, another for secret data, all still under /etc.
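You can query these databases the same way the C library does, via getent, which also works when accounts come from LDAP or other NSS sources rather than the flat files:

```shell
# Name, UID, and shell for root. The exact shell varies by distro,
# so only the first two fields are predictable everywhere.
getent passwd root | cut -d: -f1,3,7
```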

Permissions and the role of /etc/sudoers

File permissions already give Linux a rich security model, but /etc/sudoers adds another layer on top. It defines which users can run which commands as root or other users.

Example:

$ grep 'ritam' /etc/sudoers
ritam ALL=(ALL:ALL) NOPASSWD:ALL

What it does:

Each line in /etc/sudoers grants a user (or group) permission to run specific commands on specific hosts, sometimes without a password.

Why it exists:

Direct root logins are risky; sudo allows fine‑grained privilege delegation. An admin can give a developer temporary root‑like access to a web server without handing over the root password.

What problem it solves:

  • Auditability: sudo logs every command to /var/log/auth.log, so you can track who did what.
  • Least privilege: a backup user can run only backup‑related commands, not arbitrary system changes.

Interesting insight:

Hunting through /etc/sudoers and its included snippets (often under /etc/sudoers.d/) reveals that even privilege‑escalation logic is ultimately driven by human‑editable text files, not a hidden binary policy engine.
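Because it is just a text file, a syntax error in /etc/sudoers can lock everyone out of sudo, which is why edits go through visudo. Its -c flag checks a candidate file without installing it (the rule below is an illustrative least-privilege example, not from the post):

```shell
# A least-privilege snippet: the backup user may run exactly one command.
cat > /tmp/backup-ops <<'EOF'
backup ALL=(root) NOPASSWD: /usr/bin/rsync
EOF
# Syntax-check it before it ever touches /etc/sudoers.d/
visudo -cf /tmp/backup-ops 2>/dev/null || echo "visudo not available here"
```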

Processes as files: /proc

The /proc filesystem is one of the most fascinating parts of the Linux file system. Each running process gets a directory under /proc/<PID>, containing status files, memory maps, and environment variables.

Example:

$ ls /proc/1/
cwd  environ  exe  fd/  mem  mounts  status  cmdline

What it does:

  • status: basic metadata like memory usage, state, and user ID.
  • cmdline: the command line that started the process.
  • fd/: symbolic links to open file descriptors.
  • environ: environment variables as null‑separated strings.

Why /proc exists:

The kernel exposes process‑level information in a file‑like interface so tools like ps, top, and lsof can read it without needing special syscalls for every piece of data.

What problem it solves:

  • Process inspection: you can literally cat /proc/$PID/environ to see which environment variables a service sees.
  • Forensics: if a process opens suspicious files or sockets, you can inspect /proc/$PID/fd and /proc/$PID/maps to trace them.

Interesting insight:

Looking at /proc/self (a symlink to the current process’s own directory) makes it feel like each process “lives” in the file system, reading and writing information about itself as if it were any other service talking to a config file.
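A one-line demonstration: /proc/self resolves to whichever process opens it, so when cat reads its own process name from the kernel, the answer is cat:

```shell
# comm holds the process's executable name; /proc/self is the reader itself.
cat /proc/self/comm
# cat
```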

Device files and /dev

In Linux, even hardware devices are represented as files under /dev. These “device files” let you read from or write to physical devices using the same file‑I/O operations a text program would use.

Example:

$ ls -l /dev/sda*
brw-rw---- 1 root disk 8, 0 Apr 22 2026 /dev/sda
brw-rw---- 1 root disk 8, 1 Apr 22 2026 /dev/sda1

What they do:

  • /dev/sda, /dev/sda1, etc.: block device files for disk and its partitions.
  • /dev/null: a sink that discards all data written to it.
  • /dev/random, /dev/urandom: provide entropy for random data.

Why they exist:

The kernel abstracts hardware into a file‑like interface so applications can treat storage, terminals, and other devices uniformly.

What problem it solves:

  • Abstraction: tools like dd can back up a disk by literally copying /dev/sda to a file, without caring what controller it is.
  • Security: restrictive permissions on device files prevent random users from directly accessing hardware.

Interesting insight:

Running ls -l /dev and seeing that /dev/sda is a “block special” file (brw-rw----) shows that even raw disk access is just another file, governed by the same permission model used for regular files and directories.
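The uniformity is easy to demonstrate without touching a real disk: /dev/zero and /dev/null behave exactly like files under ordinary read/write:

```shell
# Read 1 KiB of zero bytes from the /dev/zero device into a regular file,
# discarding dd's status chatter via /dev/null.
dd if=/dev/zero of=/tmp/zeroes.bin bs=512 count=2 2>/dev/null
wc -c < /tmp/zeroes.bin
# 1024
```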

Boot‑time configuration under /boot and /etc/default/grub

Booting is orchestrated by files under /boot and configuration under /etc/default/grub and /etc/grub.d/. These files define which kernel to load, which initramfs to use, and what kernel parameters to pass.

Example:

$ ls /boot
vmlinuz-6.8.0-10-generic
initrd.img-6.8.0-10-generic
grub/

What they do:

  • vmlinuz-*: compressed kernel images.
  • initrd.img-*: initial RAM disk used early in boot.
  • /boot/grub/grub.cfg (or /etc/grub.d/*): boot‑menu configuration.

Why they exist:

The bootloader needs to know which kernel and initramfs to load, and the admin needs a way to tweak kernel parameters (e.g., nomodeset, console=ttyS0).

What problem it solves:

  • Predictable boot: the system can always find the correct kernel and initramfs in /boot.
  • Flexibility: changing /etc/default/grub and running update-grub generates a new grub.cfg, allowing you to experiment with different boot options.

Interesting insight:

Opening /boot/grub/grub.cfg and seeing kernel‑parameter lines like linux /boot/vmlinuz-6.8.0-10-generic root=UUID=... ro makes it clear that booting is just another script‑like configuration, defined by plain text that you can read and edit.
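A handy cross-check after editing boot configuration: the parameters the bootloader actually handed to the running kernel are exposed back through /proc, so you can verify a change took effect after reboot:

```shell
# The kernel command line as passed at boot (root=, ro, console=..., etc.).
cat /proc/cmdline
```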

Environment and system‑wide configuration

Beyond /etc/environment, many environment variables and shell settings are defined in files such as /etc/profile, /etc/profile.d/*.sh, /etc/bash.bashrc, and user‑specific ~/.bashrc, ~/.profile.

Example:

$ cat /etc/profile.d/java.sh
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
export PATH=$JAVA_HOME/bin:$PATH

What they do:

These scripts set environment variables and shell aliases that are inherited by every login shell (and, via /etc/bash.bashrc and ~/.bashrc, by interactive shells too).

Why they exist:

To avoid hard‑coding paths and settings into each user’s shell config, distribution maintainers centralize them in /etc.

What problem it solves:

  • Consistency: every user on the system gets the same JAVA_HOME unless explicitly overridden.
  • Maintenance: updating one file under /etc can propagate a change to all future logins.

Interesting insight:

Hunting through /etc/profile.d/*.sh and comparing them with your own ~/.bashrc reveals that Linux effectively “assembles” your environment from multiple layers of configuration files, like a stack of configuration cards.
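The layering is easy to simulate: each later file can extend what an earlier one set, exactly as a /etc/profile.d script extends the PATH established by /etc/profile (the Java path here is the hypothetical one from the java.sh example above):

```shell
# Layer 1: a profile.d-style script sets the base variable.
JAVA_HOME=/usr/lib/jvm/java-17-openjdk
# Layer 2: a later layer prepends to PATH, building on layer 1.
PATH=$JAVA_HOME/bin:$PATH
# The first PATH entry now comes from the layered configuration.
echo "$PATH" | cut -d: -f1
# /usr/lib/jvm/java-17-openjdk/bin
```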

Wrapping up: thinking like a system investigator

Exploring the Linux file system beyond basic commands turns you into a kind of system detective. Instead of seeing /etc, /proc, /dev, and /var/log as abstract directories, you begin to recognise them as the living, file‑based control plane of the entire OS.

Each meaningful discovery—whether it’s decoding a routing table from /proc/net/route, reading environment variables from /etc/environment, or tracing a suspicious process via /proc/$PID—tells you why Linux behaves the way it does and how you can safely change or debug that behaviour. This “hunting” mindset doesn’t just help with assignments; it trains you to think like the OS itself, which is exactly what you want when you step into roles as a developer, DevOps engineer, or security analyst.

PostgreSQL Row Level Security: A Complete Guide

2026-04-23 02:20:31

Your application code knows which tenant owns which row. Your ORM always filters by WHERE tenant_id = $1. Your team has reviewed the queries and they look fine.

Then someone forgets the WHERE clause. Or a bulk operation skips the filter. Or a new developer writes a raw query without knowing the convention. Suddenly one tenant can read another tenant's data, and you find out from a support ticket two weeks later.

Row Level Security (RLS) moves the tenant isolation logic inside PostgreSQL itself. The database enforces the policy automatically on every access, regardless of how the query was written.

What Row Level Security Does

Enable RLS on a table:

ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

Without any policies, no rows are visible to non-superusers. The safe default is deny, not permit. Then create a policy:

CREATE POLICY documents_tenant_isolation
  ON documents FOR ALL
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

Setting the Tenant Context

Always use SET LOCAL (not SET) with connection poolers. SET LOCAL resets when the transaction ends, so pooled connections do not carry the wrong tenant context into the next request:

BEGIN;
SET LOCAL app.tenant_id = '550e8400-e29b-41d4-a716-446655440000';
-- your queries here
COMMIT;

FORCE ROW LEVEL SECURITY

Table owners bypass RLS by default. Close this gap:

ALTER TABLE documents FORCE ROW LEVEL SECURITY;

Without it, an application connecting as the table owner silently ignores all policies. This is the most common RLS gotcha.

Permissive vs Restrictive Policies

Multiple policies on the same operation combine with OR by default (permissive). For rules that must always apply, use AS RESTRICTIVE. Restrictive policies combine with AND against all other policies.
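As a sketch (the soft-delete column is illustrative, not from this guide), a restrictive policy that must hold no matter what any permissive tenant policy allows:

```sql
-- ANDed against every other policy: soft-deleted rows are invisible
-- to everyone, even when the permissive tenant policy would match.
CREATE POLICY documents_not_deleted
  ON documents
  AS RESTRICTIVE
  FOR ALL
  USING (deleted_at IS NULL);
```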

Performance

Add an index on the tenant_id column:

CREATE INDEX idx_documents_tenant_id ON documents (tenant_id);

Without it, every query with an RLS filter becomes a full table scan.

Common Mistakes

  • Not using FORCE ROW LEVEL SECURITY when the app connects as the table owner
  • Using SET instead of SET LOCAL with PgBouncer in transaction mode (tenant context leaks between clients)
  • Missing the index on the tenant_id column
  • Not testing cross-tenant access explicitly in your test suite
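That last point can be a two-line assertion in your test suite. A sketch, with placeholder UUIDs you would substitute for real fixture tenants:

```sql
-- With tenant A's context set, tenant B's rows must be invisible.
BEGIN;
SET LOCAL app.tenant_id = '<tenant-a-uuid>';
SELECT count(*) = 0 AS isolated
FROM documents
WHERE tenant_id = '<tenant-b-uuid>'::uuid;  -- expect isolated = true
ROLLBACK;
```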

For the full guide with multi-tenant schema setup, testing patterns, EXPLAIN output, and inspecting existing policies, read the full post at rivestack.io.

Originally published at rivestack.io

Apache Data Lakehouse Weekly: April 16–22, 2026

2026-04-23 02:19:22

Two weeks past the Iceberg Summit, the San Francisco in-person alignments are now translating into formal proposals and code on the dev lists. Iceberg's V4 design work continued consolidating, Polaris kept moving toward its 1.4.0 milestone, Parquet's Geospatial spec picked up a cleanup commit from a new contributor, and Arrow's release engineering and Java modernization discussions stayed active.

Apache Iceberg

The post-summit V4 design work continued as the defining thread on the Iceberg dev list this week. The V4 metadata.json optionality discussion that Anton Okolnychyi, Yufei Gu, Shawn Chang, and Steven Wu drove through March kept narrowing on practical design questions. The concrete direction emerging from the summit is to treat catalog-managed metadata as a first-class supported mode while preserving static-table portability through explicit opt-in semantics, rather than the current implicit assumption that the root JSON file is always present.

Russell Spitzer and Amogh Jahagirdar's one-file commits design moved toward a formal spec write-up this week. The approach replaces manifest lists with root manifests and introduces manifest delete vectors, enabling single-file commits that cut metadata write overhead dramatically for high-frequency writers. The in-person sessions at the summit cleared the last design disagreements about inline versus external manifest delete vectors, and the community is now aligning on the implementation plan.

Péter Váry's efficient column updates proposal for AI and ML workloads drew steady engagement. The design lets Iceberg write only the columns that change on each write for wide feature tables, then stitch the result at read time. For teams managing petabyte-scale feature stores with embedding vectors and model scores, the I/O savings are meaningful. Anurag Mantripragada and Gábor Herman are working alongside Péter on POC benchmarks to support the formal proposal.

The AI contribution policy that Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun pushed through March is moving toward published guidance. The summit provided the in-person alignment that async debate rarely produces, and a working policy covering disclosure requirements and code provenance standards for AI-generated contributions is expected on the dev list in the next couple of weeks. Polaris is navigating the same question in parallel, and the two communities are likely to converge on a shared approach given their overlapping contributor base.

Apache Polaris

The Polaris 1.4.0 release is in active scope finalization as the project's first release since graduating to top-level status on February 18. Credential vending for Azure and Google Cloud Storage is the headline feature, alongside catalog federation that lets one Polaris instance front multiple catalog backends across clouds. The schedule-driven release model calls for a release intent email to the dev list about a week before the RC cut, so watch the list for that thread shortly.

The Apache Ranger authorization RFC from Selvamohan Neethiraj remained the most active governance discussion. The plugin lets organizations running Ranger with Hive, Spark, and Trino manage Polaris security within the same policy framework, eliminating the policy duplication that arises when teams bolt separate authorization onto each engine. It is opt-in and backward compatible with Polaris's internal authorization layer, which lowers the enterprise adoption barrier considerably.

On the community side, Polaris's blog continued its post-graduation cadence with an April 4 post on building a fully integrated, locally-running open data lakehouse in under 30 minutes using k3d, Apache Ozone, Polaris, and Trino. The Polaris PMC also shipped a March 29 post covering automated entity management for catalogs, principals, and roles. With incubator overhead behind it, release velocity has picked up noticeably since the 1.3.0 release on January 16.

Apache Arrow

Arrow's release calendar shows arrow-rs 58.2.0 landing this month, following 58.1.0 in March, which shipped with no breaking API changes. The cadence has held at roughly one minor version per month, with 59.0.0 already scheduled for May as a major release that may include breaking changes. The Rust implementation has become one of the most actively maintained segments of the Arrow ecosystem, with its DataFusion integration drawing in engines that want Arrow without a JVM dependency.

Jean-Baptiste Onofré's JDK 17 minimum proposal for Arrow Java 20.0.0 continued drawing input from Micah Kornfield and Antoine Pitrou. The practical rationale is coordination: setting JDK 17 as Arrow's Java baseline aligns with Iceberg's own upgrade timeline and effectively raises the minimum across the entire lakehouse stack in a single coordinated move. The decision is expected before the 20.0.0 release cycle formally opens.

Nic Crane's thread on using LLMs for Arrow project maintenance continued generating discussion. The framing — AI as a resource for maintainers, not just contributors — is distinct from how Iceberg and Polaris are approaching their AI policies. Arrow's angle is practical: a lean maintainer group managing a growing issue backlog needs help triaging, and LLMs can do that work without introducing the code-provenance concerns that matter for contributions. Google Summer of Code 2026 student proposals that landed in early April are being sorted this week, with interest concentrated in compute kernels and Go and Swift language bindings.

Apache Parquet

Parquet's week centered on hardening the Geospatial spec that was adopted earlier this year. Milan Stefanovic merged PR #560 on April 20, clarifying the Geospatial spec wording for coordinate reference systems. The change documents existing CRS usage practice for the default OGC:CRS84 system and removes ambiguity caught during implementation reviews. Small spec-hardening commits like this are how a new type goes from "shipped" to "production-reliable" across engines.
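
The substance of the clarification is small but matters for interoperability: when a geospatial column carries no explicit CRS, readers treat it as the default OGC:CRS84. A minimal sketch of that defaulting rule — the helper and dict-based metadata are hypothetical, since real readers pull this from Parquet logical-type metadata:

```python
# The spec default: a geospatial column with no explicit CRS is
# interpreted as OGC:CRS84 (longitude/latitude on WGS 84).
# Illustrative helper only — not an actual Parquet reader API.

DEFAULT_CRS = "OGC:CRS84"

def effective_crs(column_metadata):
    """Return the CRS a reader should assume for a geospatial column."""
    crs = column_metadata.get("crs")
    return crs if crs else DEFAULT_CRS

print(effective_crs({}))                      # OGC:CRS84
print(effective_crs({"crs": "EPSG:3857"}))    # EPSG:3857
```

Pinning down exactly this kind of defaulting behavior is what the spec-wording pass is for: two engines that disagree on the implicit CRS would silently misplace every coordinate.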

The community blog effort continued alongside the spec work. The Native Geospatial Types blog that Jia Yu and Dewey Dunnington published on February 13 remains the community's reference explainer, and Andrew Lamb has been coordinating with Aihua Xu on the companion Variant blog post. Spotlighting recent additions through the Parquet blog is part of a deliberate push to give the project the same kind of voice that DataFusion and Arrow have built.

The ALP encoding that cleared its acceptance vote in the prior week moved into implementation discussion. Engine teams across Spark, Trino, Dremio, and DataFusion are comparing notes on how to integrate ALP into their Parquet readers, with compression gains for float-heavy ML feature stores as the immediate benefit. The File logical type proposal for unstructured data (images, PDFs, audio) also kept advancing in community discussion, extending Parquet's scope beyond pure analytics.
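
The intuition behind ALP compresses well into a sketch: many real-world doubles (prices, model scores) are decimals in disguise, so scale by a power of ten, store small integers, and keep the rare values that don't round-trip as exceptions. This stdlib-only version shows only that core idea — real ALP picks exponents adaptively per vector and adds frame-of-reference plus bit-packing on top:

```python
# Simplified sketch of the ALP core: represent each double as a scaled
# integer when the round-trip is exact, otherwise record an exception.
# Real ALP chooses exponents adaptively and bit-packs the integers;
# this is the idea, not the spec's encoding.

def alp_encode(values, exponent):
    scale = 10 ** exponent
    codes, exceptions = [], {}
    for i, v in enumerate(values):
        d = round(v * scale)
        if d / scale == v:          # exact round-trip -> integer code
            codes.append(d)
        else:                       # keep the raw double as an exception
            codes.append(0)
            exceptions[i] = v
    return codes, exceptions

def alp_decode(codes, exceptions, exponent):
    scale = 10 ** exponent
    return [exceptions.get(i, d / scale) for i, d in enumerate(codes)]

vals = [0.25, 1.75, 3.141592653589793]
codes, exc = alp_encode(vals, 2)
print(codes)                               # [25, 175, 0] — pi becomes an exception
print(alp_decode(codes, exc, 2) == vals)   # True: lossless by construction
```

Small integers like these compress far better than raw IEEE-754 bit patterns, which is why float-heavy ML feature stores are the immediate beneficiary.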

Cross-Project Themes

The summit's downstream effect is now visible across every dev list. Iceberg's V4 work, Polaris's 1.4.0 scope, Arrow's JDK 17 decision, and Parquet's Geospatial cleanup are running in parallel, and the cross-project coordination on shared questions like AI contribution policy and Java baselines has intensified. The JDK 17 alignment is the clearest case: moving Arrow Java 20.0.0, Iceberg's next major, and downstream engines to the same floor in a single window removes years of compatibility friction.

The second pattern is the steady expansion of format scope to meet AI workloads. Iceberg's efficient column updates, Parquet's File logical type, the Geospatial spec hardening, and Polaris's multi-cloud federation all respond to the same pressure: the lakehouse stack is being asked to power AI pipelines, not just analytical queries. Each project is making changes that only make sense if you assume the next decade's workloads look different from the last.

Looking Ahead

Watch for the V4 single-file commits formal spec write-up and the metadata optionality vote on the Iceberg dev list, along with a published AI contribution policy. The Polaris 1.4.0 release intent email should land in the coming days. Arrow's JDK 17 baseline decision for Java 20.0.0 is close to a vote, and arrow-rs 58.2.0 should ship before the end of the month. Iceberg Summit 2026 session recordings are also rolling out on the project's YouTube channel.

Resources & Further Learning

Get Started with Dremio

Free Downloads

Books by Alex Merced
