The Practical Developer

A constructive and inclusive social network for software developers.

Can AI Replace Data Engineers? We Tried It.

2026-04-28 07:16:07

We had a slightly reckless idea: what if we let AI do most of our data engineering work?

Not "help with a query here and there," but actually build real pipelines.

Azure, Databricks, Delta Lake, the whole thing.

Real enterprise data, messy schemas, and stakeholders who will definitely shout if numbers look wrong.

I'm a Senior Data Engineer working on this stack every day, and I still wanted to see how far we could push AI into my own job.

This is what happened when we tried.

The Experiment: Letting AI Touch Real Pipelines

The setup will look familiar to a lot of people:

  • Azure as the platform
  • Databricks as the main compute environment
  • Delta Lake as the storage layer, with a Bronze, Silver, Gold medallion layout
  • Unity Catalog for governance and access control

Most of the transformation work lives in PySpark, with SQL on top for reporting and BI layers.

The experiment was simple to describe and painful to watch:

  • Give an LLM and Copilot-style tools the job of:
    • Writing PySpark transformations for a new Silver layer
    • Generating SQL for aggregation and reporting tables
    • Suggesting schemas and data models for a new feature set
    • Proposing fixes for slow or failing jobs

We fed it:

  • Plain language descriptions of the business logic
  • Table schemas copied from DESCRIBE and SHOW COLUMNS
  • A few existing notebooks as "examples of our style"

All of this happened in a safe Databricks workspace with test data and separate storage. There was no chance of breaking production, but our question was serious: could we realistically replace most of the day-to-day data engineering work on a new pipeline?

Where AI Actually Helped

To be fair, AI did a few things well enough that I now use it on purpose.

Boilerplate PySpark

Whenever I needed yet another "read, filter, transform, write" notebook, the model saved a bit of time:

  • Reading from Delta tables
  • Simple filters and column selections
  • Casting and basic feature engineering
  • Writing back to Delta with a reasonable partition strategy

For example, I asked it something close to:

"Read from bronze.orders, filter cancelled orders, cast order_ts to timestamp, add order_date, then write to silver.orders_clean as Delta, partitioned by order_date."

The generated PySpark looked like this:

from pyspark.sql import functions as F

df = (
    spark.read.table("bronze.orders")
    .filter(F.col("status") != "CANCELLED")
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("order_date", F.to_date("order_ts"))
)

(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("order_date")
   .saveAsTable("silver.orders_clean"))

Could I have written this faster by hand? On a good day, yes.

But over dozens of similar notebooks, the time saved adds up.

Quick SQL starting points

For straightforward reporting queries, Copilot in a SQL editor was handy.

  • It auto-completed SELECT lists once it saw the schema
  • It filled in GROUP BY and ORDER BY clauses correctly most of the time
  • It often proposed reasonable aggregates to start from

I still had to adjust conditions and add proper filters, but I was no longer staring at a blank editor. That alone reduces friction.

Documentation and "glue text"

The part that surprised me most was how useful AI was for the boring bits:

  • Drafting docstrings and short comments
  • Converting bullet points into a light design doc
  • Writing a high level description of a pipeline for internal docs

None of this replaces real architectural decisions, but it lets me stay in "technical thinking" mode while an assistant fills in the prose.

Where AI Failed, In Ways That Matter

Now for the part that actually matters, especially if you work in a production environment.

Joins that compile but lie

We tried a customer-360-style pipeline that combined:

  • bronze.customers
  • bronze.orders
  • bronze.events for clickstream data
  • bronze.subscriptions

We told the model something like:

"Join these to build a customer centric table with basic attributes, last activity date, and current subscription status."

It produced:

  • A join from customers to orders on customer_id, which was fine
  • A join from orders to subscriptions on customer_id, which was wrong: in our world, the real join key for subscriptions is account_id
  • An aggregation of events using MAX(event_ts) per customer_id, ignoring the fact that some event types should not count as "activity"

The result:

  • Subscription states merged incorrectly
  • Trial vs paid blurred together
  • "Last seen" dates inflated by internal or noise events

All of this ran without any schema error. Nothing crashed. The table "looked" fine at a glance.

But the logic was off in exactly the way that breaks trust with downstream users.
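One cheap guard that would have caught the wrong subscription join key is to measure key overlap before trusting a join. A minimal pure-Python sketch (a PySpark version would compare DataFrame counts the same way; the data below is illustrative, not ours):

```python
def join_key_coverage(left_keys, right_keys):
    """Fraction of left-side keys that find a match on the right.

    Near-zero coverage is a strong hint the join key is wrong,
    e.g. joining subscriptions on customer_id instead of account_id.
    """
    left = set(left_keys)
    if not left:
        return 0.0
    return len(left & set(right_keys)) / len(left)

# Illustrative: subscriptions are keyed by account_id, so joining
# them to orders on customer_id matches nothing.
orders_customer_ids = ["C1", "C2", "C3", "C4"]
subs_customer_ids = []  # subscriptions carry no customer_id values

assert join_key_coverage(orders_customer_ids, subs_customer_ids) == 0.0
```

A check like this runs in seconds and fails loudly, which is exactly what the silent wrong join did not do.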

Columns and tables that never existed

When our prompt was slightly vague, the model started inventing things:

  • New boolean columns like is_active that were not in any table
  • Table names that looked plausible, for example orders_clean, which only existed in someone's head, not in our catalog
  • Column names that were close to reality, but not exact, such as customer_email instead of email_address

This is a known issue with LLM generated code in general, often called hallucination.

In a chat window it looks clever. In a notebook connected to real data, it is just a bug factory.

You either fix the code and bend it back toward reality, or you start renaming actual tables and columns to match the hallucination. I saw both instincts on the team.
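A pre-flight check against the catalog catches this class of hallucination before anything runs. A small sketch with hypothetical column names; in practice the catalog set would come from DESCRIBE or Unity Catalog:

```python
def unknown_identifiers(referenced, catalog):
    """Return identifiers the generated code uses that the catalog doesn't know."""
    return sorted(set(referenced) - set(catalog))

catalog_columns = {"customer_id", "email_address", "status"}        # from DESCRIBE
generated_columns = ["customer_id", "customer_email", "is_active"]  # what the model used

# Both hallucinated names surface immediately, before the notebook runs.
assert unknown_identifiers(generated_columns, catalog_columns) == ["customer_email", "is_active"]
```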

Performance "help" that makes things worse

We pointed AI at a slow query on a large Delta table and asked for tuning suggestions.

It happily suggested changes that:

  • Removed filters that were actually highly selective
  • Rewrote predicates in ways that broke partition pruning
  • Introduced joins that would obviously cause huge shuffles

In practice, I spent more time validating each suggestion with EXPLAIN and the Databricks query profile than it would have taken to reason through the original plan myself.
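One way to make that validation cheaper is an equivalence harness: run the original and the rewritten query and refuse the "optimization" unless the results match. A sketch, with a stubbed executor standing in for something like spark.sql plus collect:

```python
def same_results(run_query, original_sql, rewritten_sql):
    """Accept an AI rewrite only if it returns the same rows.

    run_query is whatever executes SQL in your environment;
    it is deliberately left abstract here.
    """
    return sorted(run_query(original_sql)) == sorted(run_query(rewritten_sql))

# Stub executor for illustration: pretend the rewrite dropped a selective filter.
fake_results = {
    "SELECT ... WHERE region = 'EU'": [("EU", 10)],
    "SELECT ...": [("EU", 10), ("US", 90)],
}
run = fake_results.__getitem__

assert not same_results(run, "SELECT ... WHERE region = 'EU'", "SELECT ...")
```

This does not replace reading the EXPLAIN plan, but it turns "looks plausible" into "provably returns the same data" on a test dataset.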

Edge cases and dirty data

We set up some non-ideal data on purpose:

  • Null keys in join columns
  • Late arriving events
  • Dirty reference data with conflicting keys
  • Out of order timestamps

The model did not:

  • Add defensive joins or explicit null handling
  • Propose data quality checks
  • Think about slowly changing dimensions or history
  • Distinguish between "missing because it never existed" and "missing because it will arrive later"

The code it generated assumed clean, static, relational textbook data.

That world does not exist in any real enterprise I have worked in.
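For contrast, here is the kind of defensive handling we expected and did not get, sketched in plain Python (a PySpark version would be a null filter plus a window dedupe; the field names are illustrative):

```python
def latest_per_key(rows, key="customer_id", ts="event_ts"):
    """Drop null-keyed rows and keep only the latest record per key.

    Pure-Python stand-in for the defensive PySpark pattern:
    filter out null join keys, then dedupe with row_number over a window.
    """
    best = {}
    for row in rows:
        k = row.get(key)
        if k is None:  # null keys silently explode or vanish in joins
            continue
        if k not in best or row[ts] > best[k][ts]:
            best[k] = row
    return list(best.values())

rows = [
    {"customer_id": "C1", "event_ts": 2},
    {"customer_id": "C1", "event_ts": 5},  # late-arriving, newer
    {"customer_id": None, "event_ts": 9},  # dirty: no key
]
assert latest_per_key(rows) == [{"customer_id": "C1", "event_ts": 5}]
```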

Why It Fails: Context, Lineage and Ownership

This is not only about wrong code. There is a deeper gap that explains most of the failures we saw.

No real sense of lineage

The model only sees what we paste into the prompt or make available through a narrow integration. It does not naturally see:

  • The end to end flow from Bronze through to Gold
  • Which downstream reports and ML models depend on a field
  • Where a column definition came from originally

Lineage is technical and social. It lives partly in tools, partly in tribal knowledge, and partly in old Slack threads. AI only sees a slice of that picture unless you build a very deliberate context layer around it.

Business rules live outside the schema

Here is a real kind of rule you will not get from a table definition:

"A customer is active if they had a paid transaction in the last 90 days, except in region X and contract type Y, where the window is 180 days."

Pieces of that rule live in:

  • Product requirements
  • Email threads
  • A teammate's memory
  • Old reconciliation docs in some forgotten folder

If you ask AI to "mark active customers," it will give you a clean definition that fits a generic pattern. That pattern is almost guaranteed to differ from your actual rulebook.
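Once someone has dug that rule out of the email threads, encoding it is trivial; the point is that only a human can find it. A sketch using the article's placeholder region X and contract type Y:

```python
def is_active(days_since_paid_txn, region, contract_type):
    """Encode the real rule, exceptions included: a 90-day window by
    default, 180 days for region X with contract type Y. The region and
    contract names are the article's placeholders, not real values."""
    window = 180 if (region == "X" and contract_type == "Y") else 90
    return days_since_paid_txn is not None and days_since_paid_txn <= window

assert is_active(120, "X", "Y") is True    # inside the 180-day exception
assert is_active(120, "Z", "Y") is False   # generic 90-day rule applies
assert is_active(None, "X", "Y") is False  # never paid, never active
```

The generic pattern an LLM produces is the version of this function without the `window` branch, and nothing in the schema tells it the branch exists.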

No accountability

When I ship a pipeline:

  • My name is on the PR
  • I get pinged if a CFO dashboard looks wrong
  • I have to answer questions from auditors or risk teams

AI has no skin in the game. It can be wrong with confidence and nothing bad happens to it. That changes how much you trust it, and it should change how you design your review and test processes.

What Actually Worked: AI As a Copilot

After a few weeks, we stopped trying to make AI "do" data engineering and started treating it like an extra pair of very fast hands.

Acceleration for well defined tasks

We now use AI to:

  • Turn a clear description into a first draft of PySpark or SQL
  • Suggest alternative ways to express the same logic
  • Refactor slightly messy code into something cleaner

We still own the logic. We still write tests. The model just gets us to the first draft faster.
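In practice that means very small, very boring tests around AI-drafted transformations. An illustrative example (the helper name and the ISO-format assumption are mine):

```python
from datetime import datetime

def to_order_date(order_ts_str):
    """The kind of tiny transformation we still unit-test even when the
    first draft came from an LLM. Assumes ISO 8601 timestamps."""
    return datetime.fromisoformat(order_ts_str).date().isoformat()

# Review gate: the AI wrote the draft, the test is ours.
assert to_order_date("2026-04-28T07:16:07") == "2026-04-28"
```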

Debugging assistant, not performance engineer

AI is useful when:

  • You paste in a stack trace and ask "what is this error really telling me"
  • You show an EXPLAIN plan and want a plain language description
  • You forget the exact syntax for some obscure Spark function

It is not our primary performance tuner, but it makes the feedback loop a bit shorter.

Glue work and writing

We let AI start the boring writing:

  • Basic README files
  • First drafts of design docs
  • Short explanations for internal wikis

Engineers review and correct the details, but they are not starting from an empty page.

Real Enterprise Scenarios

Two concrete scenarios from the kind of environment many of us work in:

  • Marketing attribution pipeline

    • Good: generate window functions, sketch event aggregations, build starter queries
    • Bad: handle the messy attribution rules that change per region and per campaign type
  • Finance reconciliation layer

    • Good: boilerplate mapping logic, standard transformations, basic QC checks
    • Bad: interpret accounting rules, satisfy audit requirements, reason about exceptions

In both cases, AI makes small pieces of the job faster. The responsibility and the judgement stay with the human engineers.

Final Verdict

After trying quite hard to let AI handle a big chunk of my data engineering work on Azure Databricks, my view is pretty clear.

  • AI will not replace solid data engineers any time soon
  • AI will change how those engineers work and what they spend time on

The engineers who learn to:

  • Use AI for drafts and helpers, not as an oracle
  • Wrap AI generated code in tests, monitoring, and review
  • Keep ownership of business rules, data quality and performance

will move faster than those who either ignore these tools or trust them blindly.

AI will not replace data engineers.
Data engineers who use AI well will replace the ones who do not.

Surviving Midnight SDK: a 700-line cure for the silent failure problem

2026-04-28 07:09:46

Why I built midnight-doctor — a pre-flight check that catches the 16 hours of debugging you don't know you're about to spend

"The information you need to align your stack already exists. It's just not executable."

After five months building four applications on Midnight Network, I've concluded the protocol isn't the problem. The protocol is genuinely good — Compact compiles ZK circuits in three seconds without a trusted setup ceremony, and the shielded/unshielded/dust split is the cleanest answer to compliance-grade privacy I've seen.

The problem is everything that surrounds it. Specifically, the unforgiving math of: a fast-moving SDK + multiple environment tracks + a single npm namespace + zero error messages when versions misalign.

This post is the story of one specific 6-hour debugging session, the pattern I extracted from it, and the ~700-line tool I built so the next person doesn't pay the same tax.

The 6-hour silent failure

DPO2U Wallet, March 2026. I bumped @midnight-ntwrk/wallet-sdk-facade to 2.0.0 because npm marked it as latest. My local stack ran midnightntwrk/midnight-node:0.21.0 (preprod-targeted). I wrote ~50 lines of wallet bootstrap, started the dev loop, and watched:

const wallet = await WalletFacade.init({
  configuration,
  shielded,
  unshielded,
  dust,
});

await wallet.waitForSyncedState();
console.log('synced!');

The synced! line never printed.

No error. No timeout. No log. The wallet just sat in syncing state forever. I added more logging — got more "syncing" lines. Restarted Docker. Wiped state. Tried a fresh seed. Re-cloned the repo. Asked Discord. Read the SDK changelog (which didn't exist for that version). Stared at the indexer logs for 40 minutes.

Six hours in, almost by accident, I noticed the symptom: subscribeRuntimeVersion was firing once, returning, and never firing again. The standalone node closes that subscription early. The WalletFacade.init({...}) constructor in 2.x wires sync to that subscription. Result: a single missed event silently kills the entire sync loop, with zero surface.

The fix was to downgrade to facade 1.0.0 (preprod track). Forty seconds of work, after six hours of debugging. The bug was real but the cost was information asymmetry: nothing in the documentation, npm metadata, or runtime told me that "this SDK version is incompatible with this node version."

That asymmetry is the real bug. Not the runtime subscription closing. Not the constructor wiring. The fact that the system has the information needed to diagnose itself, but doesn't bother.

The pattern: silent failures from missing cross-references

After that session, I started cataloguing other times I'd been bitten the same way. Within a week I had a list:

| Symptom | Root cause | Time spent |
|---|---|---|
| waitForSyncedState() hangs forever | facade 2.x + node 0.21.0 mismatch | ~6h |
| npm install returns ENOTFOUND | community tutorial said .npmrc should point at npm.midnight.network (domain doesn't exist) | ~2h |
| Transactions silently fail to submit | two versions of @midnight-ntwrk/ledger-v7 in node_modules (transitive bump) | ~3h |
| Indexer crash-loops on startup | indexer-standalone:4.0.0-rc.4 requires a subscription: block in indexer.yml, undocumented | ~1.5h |
| WalletFacade behaves differently on standalone vs preprod | same SDK, different node behavior, no warning | ~4h |

Across five incidents: ~16 hours of debugging silent failures. Each one had a root cause that was, with the right cross-reference, detectable in seconds:

  • package.json says facade is 2.0
  • docker ps says node is 0.21.0
  • A lookup table says those two are incompatible
  • → Print a clear error.

The system already has all three pieces of data. There's just no glue.
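The glue really is small. A Python sketch of the idea, purely for illustration (midnight-doctor itself is Node, and the real table covers far more packages):

```python
# Hypothetical compatibility table: facade major -> node tag it works with.
# 2.x is deliberately absent: it is known-bad on standalone nodes.
COMPAT = {"1": "0.21.0", "4": "0.21.0"}

def check(facade_version, node_tag):
    """Cross-reference the two facts that took six hours to connect by hand."""
    major = facade_version.split(".")[0]
    expected = COMPAT.get(major)
    if expected is None:
        return f"error: facade {facade_version} has no known-good node version"
    if expected != node_tag:
        return f"error: facade {facade_version} expects node {expected}, found {node_tag}"
    return "ok"

assert check("2.0.0", "0.21.0").startswith("error")
assert check("1.0.0", "0.21.0") == "ok"
```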

The tool: midnight-doctor

The fix isn't a new feature in the SDK. It isn't better docs (which decay). It's an executable cross-reference — a script that reads your project, your Docker stack, and your config files; matches them against a curated compatibility table; and tells you what's wrong.

I built midnight-doctor over a weekend. ~700 lines of Node, zero runtime dependencies, single-binary install:

npx midnight-doctor

Run against my own legacy midnight-hello-world repo, it produces:

── midnight-doctor ──
project: /root/midnight-hello-world

⚠ SDK track: Preprod 3.x (legacy, OK for existing apps)
   Detected from wallet-sdk-facade@2.0.0.
⚠ Track is deprecated
   SDK 3.2 / facade 2.0 is 2 majors behind. Current is wallet-sdk@1.0.0 /
   wallet-sdk-facade@4.0.0 (released 2026-04-23). The WalletFacade.init({...}) constructor
   used in 2.0 has been removed; 4.0 reverted to `new WalletFacade(s, u, d) +
   .start()`. Plan a migration.
⚠ WalletFacade.init() in 2.x stalls on standalone dev nodes
   In SDK 2.x, WalletFacade.init() subscribes to runtime version events that the
   standalone node closes early. The wallet hangs in 'syncing' state forever
   with no error.
   → fix: Either (a) develop against preprod once past hello-world, or
     (b) upgrade to wallet-sdk-facade@4.0.0 which reverted to constructor +
      .start() pattern.
✓ node: midnightntwrk/midnight-node:0.21.0
✓ indexer: midnightntwrk/indexer-standalone:4.0.0-rc.4
✓ proof-server: midnightntwrk/proof-server:7.0.0
✓ midnight-node:0.21.0 matches SDK track

summary: 4 ok  3 warn  0 error  0 info
Status: workable, but warnings deserve a look.

The 6-hour incident from March is now a 30-second output.

Architecture

The tool is structurally trivial. Three scanners, one diagnosis engine, one report formatter. Total surface:

midnight-doctor/
├── bin/midnight-doctor.js   # CLI entry, arg parsing, exit codes
├── lib/
│   ├── scan-package.js      # reads package.json + walks node_modules
│   ├── scan-docker.js       # runs `docker ps`, parses image tags
│   ├── scan-config.js       # reads .npmrc, indexer.yml
│   ├── diagnose.js          # cross-references findings with matrix
│   ├── report.js            # ANSI-colored pretty output
│   └── index.js             # public API
├── data/compatibility-matrix.json   # the source of truth
└── test/diagnose.test.js    # node:test, no framework

The compatibility matrix is the product

The ~700 lines of code are scaffolding. The actual value is the JSON document at data/compatibility-matrix.json. It encodes three things:

  1. Tracks — known-good combinations of SDK + Docker images, labeled (current, preprod-3x, preprod-1x)
  2. Known issues — specific bugs keyed by package version range or config pattern
  3. Cross-cutting checks — rules that fire when two scanners disagree (e.g., node tag vs SDK track)

A snippet:

{
  "tracks": {
    "current": {
      "label": "Current (latest npm + preprod compatible)",
      "node": "0.21.0",
      "indexer": "4.0.0-rc.4",
      "proofServer": "7.0.0",
      "packages": {
        "@midnight-ntwrk/wallet-sdk": "1.0.0",
        "@midnight-ntwrk/wallet-sdk-facade": "4.0.0",
        "@midnight-ntwrk/compact-runtime": "0.15.0",
        ...
      }
    }
  },
  "knownIssues": [
    {
      "id": "facade-2x-init-bug",
      "severity": "warn",
      "match": { "type": "package-version",
                 "package": "@midnight-ntwrk/wallet-sdk-facade",
                 "range": "2.x" },
      "title": "WalletFacade.init() in 2.x stalls on standalone dev nodes",
      "fix": "Either develop against preprod, or upgrade to facade 4.0.0..."
    }
  ]
}

This file is the same information that lives, scattered, across Discord pinned messages and individual developers' heads. Centralizing it in a JSON document accomplishes three things:

  • Source of truth — there's now a place to point people instead of "search the channel"
  • Versioned — the file has a verifiedAt: "2026-04-27" field; staleness is detectable
  • Executable — code can act on it; humans can read it

The scanners are deliberately dumb

Each scanner is one file, one responsibility, ~50 lines.

// lib/scan-docker.js (excerpt)
export async function scanDocker() {
  const findings = { dockerAvailable: false, containers: {} };
  try {
    await exec('docker', ['version', '--format', '{{.Server.Version}}']);
    findings.dockerAvailable = true;
  } catch { return findings; }

  const { stdout } = await exec('docker', [
    'ps', '--format', '{{.Image}}|{{.Names}}|{{.Status}}',
  ]);

  for (const line of stdout.split('\n').filter(Boolean)) {
    const [image, name, status] = line.split('|');
    const [imageName, tag = 'latest'] = image.split(':');
    if (KNOWN_IMAGES[imageName]) {
      findings.containers[KNOWN_IMAGES[imageName]] = { image: imageName, tag, name, status };
    }
  }
  return findings;
}

No Docker SDK dependency, no parsing library — just child_process.execFile and String.prototype.split. If Docker isn't installed, the scanner returns { dockerAvailable: false } and the diagnosis engine emits an info diagnostic instead of crashing.

The package scanner is similar — walks node_modules recursively (capped at 5000 directories so monorepos don't hang), records every @midnight-ntwrk/* it finds and what version. Duplicates fall out as a side effect: if the same package name has more than one version in the array, it's a duplicate.

const allInstances = await findAllInstances(projectDir, MIDNIGHT_NS);
for (const [name, versions] of Object.entries(allInstances)) {
  const unique = [...new Set(versions)];
  if (unique.length > 1) findings.duplicates[name] = unique;
}

Three lines. That's the entire duplicate-detection logic. The remaining ~50 lines of the scanner are filesystem traversal — boring, mechanical, but it works on every Node project regardless of package manager (npm, pnpm, yarn, bun all create node_modules directories with the same shape).

The diagnosis engine is glorified table joins

lib/diagnose.js takes the three scan results plus the matrix and emits diagnostics. The pattern repeats per check:

function diagnoseCrossCutting(pkg, docker, matrix) {
  const facadeVersion = pkg.installed['@midnight-ntwrk/wallet-sdk-facade'];
  const nodeContainer = docker.containers.node;
  if (!facadeVersion || !nodeContainer) return [];

  const track = matrix.tracks[detectTrack(facadeVersion, matrix)];
  if (track.node && track.node !== nodeContainer.tag) {
    return [{
      id: 'node-track-mismatch',
      severity: 'error',
      title: `midnight-node:${nodeContainer.tag} doesn't match SDK track (expects ${track.node})`,
      detail: `Mixing causes silent sync failures with no error.`,
      fix: `Either update the node container to ${track.node}, or align SDK to a track that matches your node version.`,
    }];
  }
  return [{ id: 'node-track-match', severity: 'ok',
            title: `midnight-node:${nodeContainer.tag} matches SDK track` }];
}

That's the entire body of the cross-cut check that would have saved my 6-hour March incident. Twelve lines. The information was there the whole time.

The full diagnosis file is ~200 lines. Most of it is the same shape: pick two facts from the scans, compare against the matrix, emit a diagnostic. It's deliberately boring code.
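Even the version-range matching behind knownIssues entries like "2.x" can stay boring. A Python sketch of the idea, for illustration only (the actual matcher in diagnose.js may well differ):

```python
def matches_range(version, range_spec):
    """Tiny matcher for ranges like '2.x' or exact versions like '2.0.0'.
    A sketch of the shape of check the diagnosis engine performs."""
    if range_spec.endswith(".x"):
        return version.split(".")[0] == range_spec[:-2]
    return version == range_spec

assert matches_range("2.0.0", "2.x") is True
assert matches_range("4.0.0", "2.x") is False
```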

What this isn't

Scope discipline is half the value of a tool like this. Things midnight-doctor deliberately doesn't do:

  • It doesn't auto-fix. No --fix flag yet. The cost of a wrong auto-fix in this domain (bricked node_modules, lost work) outweighs the convenience.
  • It doesn't curl health endpoints. It checks Docker container presence, not health. A separate health-check script (midnight-health-check.sh) covers that already in my infra repo.
  • It doesn't probe the chain. No getBlockNumber(), no balance lookup. Those depend on a working SDK, which is what doctor is meant to validate before you try to use it.
  • It isn't a package manager. It won't run npm dedupe for you. It tells you to run it.
  • It doesn't write code. It doesn't generate scaffolds, codemods, or migrations. That's a different tool (create-midnight-app, eventually).

Each non-feature is a deliberate choice to keep the surface small enough that the tool stays correct.

What I'd build next, if there's appetite

midnight-doctor is the smallest useful version. The compatibility matrix as data unlocks several adjacent tools:

  1. create-midnight-app — scaffold a project where midnight-doctor is wired into npm install lifecycle. The matrix becomes the default versions.
  2. A codemod for the 2.x → 4.x facade migration. The API change is mechanical: WalletFacade.init({...}) → new WalletFacade(...) + .start(...). AST transform.
  3. Compactc compiler version check. Today it's via shell-out. With a parser, doctor could read your .compact source and say "this uses syntax X, requires compactc ≥ Y."
  4. Remote-fetched matrix. Right now the matrix is bundled. A nightly job at https://compatibility.midnight.network/matrix.json would let users not need to bump the doctor itself.
  5. Editor integration. A VSCode extension that runs doctor on save and surfaces diagnostics in the problems panel. The CLI is for CI; the editor is for dev loop.

The first one is the highest leverage by far — it short-circuits the entire onboarding problem. But it's a much bigger project. Doctor is a Trojan horse for the matrix; once the matrix exists and people accept it as canonical, the rest is plumbing.

The broader argument: tools beat docs

Documentation rots. A README written today describes a system that, six weeks from now, has moved two majors. Anyone who builds in a fast-moving ecosystem has lived this: the official doc says "use SDK 1.0", the npm latest tag says 4.0, the GitHub README says 3.0, and the Discord pinned message says "we know, sorry, the docs are stale." None of those sources is wrong. They were all true at some point. The problem is they don't update each other.

Executable knowledge updates itself. A JSON file with a verifiedAt date that's three weeks old is visibly stale. A doc that's three weeks old looks identical to a doc that's three days old. Tools force the question "is this still true?" in a way that prose doesn't.
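The staleness check the verifiedAt field enables is one line of date arithmetic. A Python sketch (the field name comes from the article; the warning threshold is whatever you choose):

```python
from datetime import date

def staleness_days(verified_at, today):
    """How old the matrix is; callers can warn past some threshold.
    verified_at is an ISO date string like the matrix's verifiedAt field."""
    return (today - date.fromisoformat(verified_at)).days

assert staleness_days("2026-04-27", date(2026, 5, 18)) == 21
```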

This isn't unique to Midnight. Every fast-moving stack reaches the point where the gap between "what the docs claim" and "what actually works" is wide enough to swallow new developers. The fix is the same: take the lookup table out of human heads, encode it as data, ship it as a tool that runs in seconds.

The Midnight protocol team doesn't need to build midnight-doctor. They could. Anyone could. The information already exists — in pinned messages, in CHANGELOG.md files, in the heads of the half-dozen people on Discord who answer the same question every week. The work isn't producing the information. It's transcribing it once, into a format machines can act on, and committing to keeping it current.

That's what this ~700-line tool does. It's not clever. It's not hard. It's just nobody had done it.

If you've spent hours on a Midnight silent failure, please run npx midnight-doctor against your project before your next debugging session. If you find a bug doctor missed, open an issue — the matrix is the artifact, the code is just glue.

Repo: github.com/fredericosanntana/midnight-doctor
Install: npm install -g midnight-doctor or npx midnight-doctor
License: MIT

If this was useful, the people who actually maintain Midnight (@MidnightNtwrk) deserve the visibility — half this matrix came from their Discord answers. The other half came from getting burned. Both contributions are essential.

Appendix: the full check list

For reference, every check midnight-doctor currently runs:

Package scanner:

  • ✓ Detect SDK track (current, preprod-3x, preprod-1x) from wallet-sdk-facade major
  • ✓ Flag deprecated tracks with migration guidance
  • ✓ Detect major-version mismatch across wallet-sdk-* subpackages
  • ✓ Detect duplicate @midnight-ntwrk/ledger-v7 in node_modules
  • ✓ Detect duplicate @midnight-ntwrk/compact-runtime in node_modules
  • ✓ Flag wallet-sdk-facade@2.x standalone init bug

Docker scanner:

  • ✓ Detect running midnight-node, indexer-standalone, proof-server
  • ✓ Cross-reference node tag with SDK track

Config scanner:

  • ✓ Flag .npmrc with bogus npm.midnight.network registry
  • ✓ Flag indexer.yml missing subscription: block

Cross-cutting:

  • ✓ Node tag ↔ SDK track consistency

Roadmap:

  • ☐ Compact compiler version vs runtime
  • ☐ Health probes (RPC, GraphQL, proof endpoints)
  • ☐ Monorepo workspace iteration
  • ☐ --fix mode for safe auto-corrections
  • ☐ Remote-fetched matrix

Total checks today: 11. Adding more is one PR each.

Bridging 'I Want to Build' and 'I Want to Publish Safely' for Non-Engineers — Sandbox MCP

2026-04-28 07:04:57

Hi, I'm Ryan, CTO at airCloset.

In my previous posts, I've introduced our internal MCP servers: an MCP server for natural-language search across all our databases, the full picture of our 17 internal MCP servers, and a custom Graph RAG that lets AI answer "Did that initiative actually work?".

This time I'm covering something a bit different: Sandbox MCP — a platform that lets non-engineer employees deploy apps they built with AI to a safe, internal-only URL with a single command.

The pitch is simple: "If Claude Code can build an app, why not publish it directly?" The hard part is making "directly" mean safely.

The Problem: Building Got Easy. Publishing Safely Did Not.

The arrival of Claude Code and other AI coding agents is reshaping how work happens inside our company.

"Building an app" used to be an engineer's job. You had to do requirements, design, frontend, backend, database, CI/CD, production deploy — all in one head.

Now PMs, designers, and customer-success folks are talking to Claude Code with "build me a screen that does X" and getting working mockups on the spot. Inside airCloset we're seeing more and more:

  • Mockups for new project proposals
  • Interactive reports that visualize research findings
  • KPI dashboards used only by a single team
  • Small tools for everyday operational improvements

These non-engineer outputs are growing fast. People are even saying "let's just run with this in production for a bit."

That's where the wall hits.

Easy to Build. Hard to Publish Safely.

Anyone can build something that runs locally now. Spin up python -m http.server 8000, view it on your Mac — five minutes max.

But the moment it becomes "I want my team to see this" or "I want others to actually use it," the difficulty curve goes vertical.

  • Where do you run it? Cloud means GCP/AWS accounts, IAM, billing.
  • What URL? Domain registration, DNS, SSL certificates, Cloudflare.
  • What about auth? If it touches confidential info, you need employees-only. OAuth implementation, domain restriction.
  • And the data? Is localStorage enough, or do you need a real DB? If a DB, who manages the password?
  • How do you deploy? Can you write a Dockerfile? Cloud Run config, env vars, service accounts, IAM.
  • What about security? What if the AI-written code has a vulnerability? An auth bypass?

You could "let the AI write all of it." But the result is left to the AI. Cloudflare misconfigured and exposed to the world. Auth bypassed. A service account with production database write access slipped into the code. The more code AI writes, the higher the risk of these accidents.

When a non-engineer says "I want to try building this," we need to clearly separate what the builder is responsible for from what the platform must guarantee by default.

There's also a quieter problem.

UI Inconsistency and Data Sprawl

When non-engineers build apps independently:

  • One person uses React, another Vue, another raw HTML
  • Buttons look and behave differently
  • Some store data in localStorage, some in Google Sheets, some in Firebase

After 10 or 20 such apps, internal tooling becomes chaos. Users wonder "wait, who built this one?" and "why does this button work differently?"

Even for internal tools, you need a baseline of consistency — both in design and in where data lives.

Sandbox MCP — Standing Between "Build" and "Publish"

That's why we built Sandbox MCP.

A non-engineer just says "build this" to Claude Code, and:

  1. An app is generated using a unified UI Kit
  2. They can verify it works locally
  3. A single command deploys it to https://sbx-{nickname}--{app-name}.example.com/
  4. Cloudflare's Google SSO enforces internal-only access
  5. Data is stored, isolated, in a dedicated Firestore database

— all of this completes within a single chat session with the AI.

The builder is only responsible for functionality. Security, data isolation, domain & SSL, authentication are all handled by the Sandbox MCP platform by default.
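The URL convention is simple enough to express as a pure function. A Python sketch; the lowercase-and-slug sanitization here is my assumption, not documented platform behavior:

```python
import re

def sandbox_url(nickname, app_name, domain="example.com"):
    """Build the sbx-{nickname}--{app-name} hostname described above.
    The slug rules (lowercase, non-alphanumerics to hyphens) are assumed."""
    def slug(s):
        return re.sub(r"[^a-z0-9-]", "-", s.lower()).strip("-")
    return f"https://sbx-{slug(nickname)}--{slug(app_name)}.{domain}/"

assert sandbox_url("Ryan", "kpi_dashboard") == "https://sbx-ryan--kpi-dashboard.example.com/"
```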

System Overview

Scale

| Resource | Details |
|---|---|
| MCP tools | 10 (publish, status, schedule, list, delete, write_file, read_file, list_files, init_repo, unschedule) |
| Supported runtimes | Python (Flask + gunicorn), Node.js, static HTML/SPA, custom Dockerfile |
| URL | sbx-{nickname}--{app-name}.example.com (covered by Universal SSL, no ACM) |
| Authentication | Cloudflare Zero Trust Access (Google Workspace) |
| Data | Firestore named DB sandbox, namespaced per nickname × app |
| Infrastructure | Self-hosted Git server (GCE) + Cloud Run + Cloudflare Worker + KV |
| Deploy time | Typically 2–5 minutes (git push to public URL) |

Let's walk through the internals.

What It Does — Web, API, DB, and Cron

Sandbox MCP supports four app shapes so it can cover almost any "I want to ship something internally" use case.

| Type | Detected by | Use cases |
| --- | --- | --- |
| Python | .py files present | Flask + gunicorn for APIs, analysis tools with a UI |
| Node.js | package.json present | Express APIs + UI; Bun also works |
| Static HTML/SPA | only .html files (no Python/Node) | nginx-served, React/Vue dist supported |
| Custom | includes a Dockerfile | Any runtime — Go, Rust, Bun, anything |

Pick any of these and sandbox_publish deploys it with no extra config.

There's also sandbox_schedule for scheduled batch apps via Cloud Scheduler. Things like "post a risk summary to Slack at 9 AM every morning" become one-line cron setups.

sandbox_schedule(
  app_name: "risk-alert",
  schedule: "0 9 * * *",
  path: "/api/cron",
  timezone: "Asia/Tokyo"
)

Cloud Scheduler now hits the app's /api/cron every morning at 9. No need to open the scheduler UI or translate cron syntax into IaC.
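On the app side, all a scheduled app needs is a handler for that path. Here's a minimal sketch of what that might look like — this is our assumption of the shape, not Sandbox MCP code, and `buildRiskSummary` is a placeholder for the app's own logic:

```typescript
// Sketch only: the handler Cloud Scheduler would hit every morning.
// `buildRiskSummary` and the Slack posting step are placeholders.
function buildRiskSummary(today: Date): string {
  // A real app would compute something from its own data here.
  return `Risk summary for ${today.toISOString().slice(0, 10)}`;
}

function handleCron(pathname: string, today: Date = new Date()): { status: number; body: string } {
  if (pathname !== '/api/cron') return { status: 404, body: 'not found' };
  const summary = buildRiskSummary(today);
  // A real app would POST `summary` to a Slack incoming webhook here.
  return { status: 200, body: summary };
}
```

Wrapped in any HTTP server listening on PORT, this is the whole contract between the app and the scheduler.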

Frontend — Unified Design via sandbox-ui-kit

Even apps built by non-engineers should feel consistent as a tool family. That's the job of the sandbox-ui-kit repo.

It lives on mcp-sandbox.example.com/git and provides:

| File | Contents |
| --- | --- |
| sandbox-ui.css | Design tokens + glass-morphism component styles (dark/light) |
| sandbox-ui.js | Theme switcher, modals, toasts, generic JS utilities |
| sandbox-db.js | SandboxDB client SDK (more below) |
| index.html | Storybook-style component catalog |
| README.md | Full API documentation |

The key: it's designed for AI to read and use.

The sandbox_publish tool description literally says:

When building an app, first read README.md with read_file and use the UI Kit.

When Claude Code builds a new app, it read_files this README, learns which CSS/JS to load and which component names to use, then generates code accordingly. Instead of a human walking the AI through UI guidelines, we centralized the "how to use" in one place targeted at the AI.

The result: apps built by anyone (with AI) end up with consistent buttons, modals, and forms.

Backend — Auto-Generated Dockerfile + Cloud Run

"I don't want to write Docker." "I don't want to think about runtime configuration." Classic non-engineer requests.

Sandbox MCP inspects the source files and generates a Dockerfile automatically.

// apps/mcp/git-server/src/sandbox/tools.ts
if (hasPy) {
  dockerfile = generatePythonDockerfile(hasRequirements);
  // Auto-create requirements.txt if missing
  if (!hasRequirements) {
    await writeFile('requirements.txt', 'flask\ngunicorn\n');
  }
} else if (hasPackageJson) {
  dockerfile = generateNodeDockerfile(true);
} else if (hasHtml) {
  dockerfile = generateStaticDockerfile();
}

For example, a Python app gets:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=8080
CMD python -u $(ls *.py | head -1)

If requirements.txt is missing, flask + gunicorn get added automatically. AI can write from flask import Flask and the dependencies will resolve — no missing-package surprises.

Deployment uses gcloud run deploy --source, with Cloud Build handling the image build. App authors can write a Dockerfile, but they don't have to: without one you get the standard build, with one you can customize — friendly to both non-engineers and engineers.
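Based on the example output above, generatePythonDockerfile might be sketched roughly like this. This is a reconstruction for illustration, not the actual Sandbox MCP source, simplified to skip the hasRequirements branch since requirements.txt is auto-created when missing:

```typescript
// Reconstruction from the example Dockerfile above — illustrative only.
// requirements.txt is auto-created by the deploy tooling when absent,
// so the COPY/RUN pair is always safe to emit.
function generatePythonDockerfile(): string {
  return [
    'FROM python:3.12-slim',
    'WORKDIR /app',
    'COPY requirements.txt .',
    'RUN pip install --no-cache-dir -r requirements.txt',
    'COPY . .',
    'ENV PORT=8080',
    // Shell form, so $(...) is expanded at container start;
    // exec form would pass the string through literally.
    'CMD python -u $(ls *.py | head -1)',
  ].join('\n');
}
```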

Deploy Flow

Database — Transparent Fallback Between localStorage and Firestore

"I want to save data. I don't want to set up a database."

The SandboxDB SDK handles that. The same code uses localStorage locally and Firestore once deployed.

<script src="https://mcp-sandbox.example.com/api/db/sdk.js"></script>
<script type="module">
  const db = new SandboxDB({ token: googleOAuthAccessToken });

  // Save (storage location auto-detected from hostname)
  const { id } = await db.collection('items').add({ name: 'test' });

  // List
  const items = await db.collection('items').get();

  // Get / update / delete
  await db.collection('items').doc(id).update({ name: 'updated' });
  await db.collection('items').doc(id).delete();
</script>

The SDK internals:

this._isLocal = location.hostname === 'localhost'
              || location.hostname === '127.0.0.1';

async add(data) {
  if (this._db._isLocal) return this._localAdd(data);  // localStorage
  return this._req('', 'POST', data);                  // Firestore REST API
}

When running on localhost, it uses localStorage. The moment it's deployed under sbx-*.example.com, it switches to Firestore. No code changes required.

This dramatically improves the experience of building apps with AI:

  • Local: no network, no auth, all features work
  • Deployed: same code runs, data is properly persisted
  • Development data never leaks into systems outside Sandbox (it physically can't reach them)

Firestore Namespace Isolation

Once deployed, data paths are strictly isolated:

sandbox_data/{nickname}--{app}/{collection}/{docId}
  • nickname: user identifier resolved via OAuth
  • app: Sandbox app name
  • _createdAt / _updatedAt: auto-attached by the SDK

Data from different apps is physically unreachable from each other. Even apps built by the same person live in different paths.

The most important point: we use a dedicated sandbox named database. It's a completely separate Firestore database from the (default) DB used by other internal systems. No matter how badly an app's code misbehaves, it can never touch data outside Sandbox.

Infrastructure — Wildcard DNS + Cloudflare Worker + Self-Hosted Git Server

Now for the infrastructure highlights.

How URLs Are Determined

The public URL takes the form:

https://sbx-{nickname}--{app-name}.example.com/

nickname is automatically pulled from the MCP OAuth session. When a user logs into Sandbox MCP via Google, the email is looked up in a Firestore users collection to resolve the nickname. Users never have to repeat "I am ryan" each time.

[email protected] → users[[email protected]].nickname → "ryan"
                                                       ↓
                                  sbx-ryan--todo-app.example.com

Note: The users collection is kept in sync from a separate internal pipeline (a daily batch that pulls from our HR system and Google Workspace directory). Sandbox MCP just reads from it — no need to maintain its own employee master.

The benefit: you can tell whose app it is just by reading the URL. When someone says "go look at ryan's todo-app," reading the URL aloud naturally communicates ownership.

Instant Publishing via Cloudflare Worker

Normally, publishing a new subdomain requires:

  1. Adding A/CNAME DNS records
  2. Issuing an SSL certificate (15–30 minute wait with ACM or Let's Encrypt)
  3. Configuring a load balancer or DomainMapping

Sandbox MCP skips all of this with a Cloudflare Edge Router Worker.

URL Routing

DNS is fixed as *.example.com wildcard + Cloudflare proxy, with Universal SSL automatically covering every subdomain. The Cloudflare Worker receives all *.example.com/* traffic and routes by subdomain.

The logic is three-tier:

// apps/worker/edge-router/src/index.ts
export async function handleRequest(request, env) {
  const url = new URL(request.url);

  // ① sbx-* prefix → Sandbox routing
  const sandboxSub = extractSandboxSubdomain(url.hostname);
  if (sandboxSub !== null) {
    return handleSandboxRequest(request, url, sandboxSub, env);
  }

  // ② KV route:{subdomain} registered → Cloud Run proxy
  const subdomain = extractSubdomain(url.hostname);
  if (subdomain) {
    const proxyResponse = await handleCloudRunProxy(request, url, subdomain, env);
    if (proxyResponse) return proxyResponse;
  }

  // ③ Otherwise → fetch(request) passthrough
  return fetch(request);
}

When sandbox_publish finishes, all it does is write a route:{nickname}/{app} key into Cloudflare KV. That single write makes the new subdomain routable instantly.

await kvPut(`route:${nickname}/${appName}`, serviceUrl);

No DNS setup. No waiting for SSL issuance. No IaC deploy. Everything completes within the MCP tool execution.
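The extractSandboxSubdomain helper referenced in the router isn't shown in the post; under the sbx-{nickname}--{app-name} convention, a minimal sketch could look like this (a hypothetical reconstruction, not the real Worker code):

```typescript
// Hypothetical reconstruction — the actual helper isn't shown in the post.
// Returns "{nickname}--{app-name}" for sbx-* hosts, null for everything else.
function extractSandboxSubdomain(hostname: string): string | null {
  const match = hostname.match(/^sbx-([a-z0-9-]+--[a-z0-9-]+)\.example\.com$/);
  return match ? match[1] : null;
}
```

The returned value doubles as the KV lookup key suffix, which is why the route:{nickname}/{app} write is all publishing needs.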

Self-Hosted Git Server for Larger Apps

This setup actually started out without git at all.

Since the primary users were going to be PMs and CS folks, we figured "git concepts are too high a bar — let's keep everything inside MCP tools." Write files via sandbox_write_file, deploy via sandbox_publish. That should be enough, we thought.

The approach hit two walls quickly.

Wall 1: Constant chunking

MCP tool calls travel over HTTP, with a payload size limit. React/Vue build bundles, SPAs with images, business tools with dozens of files — they don't fit in a single call. We added an append mode to sandbox_write_file for chunking, but every "first half of file A → second half of file A → first half of file B → ..." sequence triggered error recovery and retries. Deployments became flaky.

Wall 2: Massive token consumption

This was the real killer. When you tell the AI "deploy this app," it sends the entire source as MCP tool arguments. The file contents land in the conversation context, and a few-thousand-line app burns through tokens fast. A single deploy easily consumed tens of thousands of tokens, and Claude Code sessions hit compaction quickly.

Worse, the AI tends to "verify after sending" — re-reading the same file via sandbox_read_file. Write → read → write loops, with tokens going up in flames.

So we pivoted to using git push as well. With git push:

  • No file size limit
  • Differential transfer — second-time pushes are fast
  • Source code stays out of the MCP conversation context (no AI tokens consumed)

We never expected business-side employees to run git push by hand. But if Claude Code runs git commands in the background, it's not a barrier. The user just says "build this and publish it" — the AI runs git init && git push on its own when needed.

Why a Self-Hosted Git Server?

Once we adopted git push, the next question was: where do we host the repos? We considered using GitHub Organizations but ruled it out.

Issuing and managing GitHub accounts for every employee — including non-engineers — wasn't worth the cost or the operational overhead. Paying for a GitHub seat just to ship one app is overkill.

Fortunately, we already operated a self-hosted Git Server on GCE for a different purpose: hosting an internal "read-only Git MCP for code investigation." A VM with repositories cloned under /mnt/repos/.

We just added a Git Smart HTTP Protocol endpoint and one new repo (sandbox-apps) to it. The VM was already running, so the marginal cost was near zero. Authentication piggybacks on the existing Google OAuth setup. Repository management is just OS directory operations. Borrowing space on the existing internal Git Server was vastly simpler than spinning up new infrastructure.

Actual Usage Flow

# 1. Get the git URL from the MCP tool (nickname is automatic)
sandbox_init_repo(app_name: "my-app")
# → https://mcp-sandbox.example.com/git/sandbox/ryan/my-app.git

# 2. Local commit (the AI does this in the background)
cd ~/my-app/
git init && git add . && git commit -m "init"
git remote add sandbox <returned URL>

# 3. Push
git push sandbox main
# Username: oauth2accesstoken
# Password: $(gcloud auth print-access-token)

# 4. Deploy
sandbox_publish(app_name: "my-app", description: "...")

Auth uses a Google OAuth token as the Basic Auth password (same pattern as GCP Source Repos). Only @air-closet.com accounts pass. No GitHub account required — any employee can push.

The remote repo is configured with receive.denyCurrentBranch=updateInstead, so the working tree updates server-side on push. Cloud Run uses that directory as --source, so there's no extra step between push and publish.

For small apps (a few files, hundreds of lines each), sandbox_write_file still works fine. Switch between MCP-only and git push depending on app size.

Security — Four Independent Gates

That covered the "convenient to build" side. Now the "safe to publish" side.

As I noted at the start, exposing AI-generated code in front of users is risky. So Sandbox MCP layers four independent safety mechanisms that don't depend on the app's own implementation.

Security Layers

① Public-Facing Gate — Cloudflare ZeroTrust Access

sbx-*.example.com sits behind Cloudflare ZeroTrust Access. When someone visits, Cloudflare's edge first redirects them to Google Workspace SSO; without an @air-closet.com account, they never reach Cloud Run.

This is independent of the app's implementation. Even if the AI didn't write a single line of auth code, Cloudflare stops the request first. "Accidentally public" is physically impossible.

② Deploy Gate — MCP OAuth

Operations like sandbox_publish and sandbox_delete enforce Google OAuth on the MCP server side. Sandbox MCP implements RFC 8414 (/.well-known/oauth-authorization-server), so Claude Code runs the OAuth flow automatically on first connection.

The strongest guarantee is "you can't accidentally update or delete someone else's app."

When multiple people share a Sandbox MCP, an AI accident like "wait, I overwrote a coworker's app while updating mine" would be devastating. To prevent that, the AI doesn't get to decide whose app is being touched. The server injects nickname automatically from the OAuth session.

// Strip the `nickname` property from the MCP tool schema and have
// the server force-inject the logged-in user's nickname.
function injectNickname(tool: McpTool, userNickname?: string): McpTool {
  const { nickname: _, ...restProperties } = tool.schema.inputSchema.properties;
  return {
    schema: { ...tool.schema, inputSchema: { ...tool.schema.inputSchema, properties: restProperties } },
    execute: (args, ctx) => tool.execute({ ...args, nickname: userNickname }, ctx),
  };
}

From the AI's perspective, the nickname input doesn't exist. Even with a prompt injection like "delete ryan's app," there's no mechanism to do so. "You can only touch your own apps" is enforced at the API spec level.

On top of that, inputs are validated strictly against /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/, rejecting shell-injection and path-traversal patterns (.., /).
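That validation can be expressed as a small predicate — a sketch of the described check, not the actual server code:

```typescript
// Sketch of the input validation described above — illustrative only.
// Lowercase alphanumerics with inner hyphens, max 63 chars,
// no leading or trailing '-'.
const NAME_RE = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

function isSafeName(value: string): boolean {
  // The character class excludes '.', '/', and shell metacharacters entirely,
  // so path traversal ("..") and injection payloads can never match.
  return NAME_RE.test(value);
}
```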

③ Data Gate — SandboxDB Namespace Isolation

As mentioned earlier, data lives at:

sandbox_data/{nickname}--{app}/...

Per request, the SandboxDB API resolves the path server-side:

  • Browser (OAuth): resolve email → users → nickname, take app from the Origin header
  • Backend (SA token): take nickname/app from the X-Sandbox-App header (required — missing returns 400)

The client cannot spoof the path.

We deliberately do not use the K-Service header (the Cloud Run-injected service name). That's a client-spoofable header, and another implementation that relied on it had a "read another app's data" vulnerability disclosed. Requiring X-Sandbox-App keeps the only valid route through an explicitly server-validated path.
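A sketch of that server-side resolution for backend calls — this is our assumption of the shape, not the actual implementation:

```typescript
// Sketch only. Backend callers must send X-Sandbox-App ("{nickname}--{app}");
// missing or malformed values are rejected with 400 before any path is built,
// so the client never controls the namespace.
type Namespace = { nickname: string; app: string };

function resolveNamespace(
  headers: Record<string, string | undefined>
): { ok: true; ns: Namespace } | { ok: false; status: 400 } {
  const raw = headers['x-sandbox-app'];
  if (!raw) return { ok: false, status: 400 };
  const sep = raw.indexOf('--');
  if (sep <= 0 || sep + 2 >= raw.length) return { ok: false, status: 400 };
  return { ok: true, ns: { nickname: raw.slice(0, sep), app: raw.slice(sep + 2) } };
}
```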

The clincher: a dedicated named database for Sandbox. Instead of the (default) DB (which contains data from other systems), we use an independent Firestore database called sandbox, and the Cloud Run SA gets an IAM Condition that allows access only to the sandbox DB.

// From infra/mcp/git-server/index.ts
// IAM Condition on roles/datastore.user:
//   resource.name == "projects/.../databases/sandbox" ||
//   resource.name.startsWith("projects/.../databases/sandbox/")

No matter how badly the AI-written code goes wrong, it physically cannot reach data outside Sandbox.

④ Execution Gate — Cloud Run SA + IAM

All sandbox-* Cloud Run services run under a single shared SA (e.g. sandbox-run). The permissions on that SA are minimal.

  • roles/logging.logWriter (write its own logs)
  • roles/bigquery.jobUser + bigquery.dataViewer scoped to the sandbox_logs dataset only (its own access logs, nothing else)
  • roles/datastore.user (IAM Condition limiting to sandbox DB)

What it does not have:

  • Access to the (default) Firestore that holds data from other systems
  • Access to BigQuery datasets used by other internal systems
  • Direct access to Secret Manager
  • Permission to manage other Cloud Run services

In other words, even if a Sandbox app goes completely rogue, the blast radius is limited to sandbox_data and sandbox_logs. Nothing outside Sandbox is affected.

Logging — Apps Can Query Their Own Access Logs

Sandbox apps eventually want to look at logs too. "How many views did this page get?" "Who hit that error?"

We forward Cloud Run request logs to BigQuery via a Logging Sink:

// From infra/mcp/git-server/index.ts
const sandboxLogSink = new gcp.logging.ProjectSink('sandbox-logs-sink', {
  destination: `bigquery.googleapis.com/projects/${projectId}/datasets/sandbox_logs`,
  filter: [
    'resource.type="cloud_run_revision"',
    'resource.labels.service_name:"sandbox-"',
    'logName:"run.googleapis.com%2Frequests"',
  ].join(' AND '),
  bigqueryOptions: { usePartitionedTables: true },
});

The sandbox_logs dataset is locked down with project-owner-only ACLs (it contains PII like remoteIp and User-Agent), and the Sandbox SA gets a tightly scoped bigquery.dataViewer to it.

This lets apps query their own access logs from BigQuery. "Post last week's user count for this app to Slack" can be done entirely inside Sandbox.
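As an illustration of the kind of self-service query this enables — the table and column names here follow the standard Cloud Logging-to-BigQuery export schema for Cloud Run request logs and are assumptions, so adjust them to what your sink actually produces:

```typescript
// Illustrative only: daily request counts for one sandbox app, last 7 days.
// `run_googleapis_com_requests` is the table name a Logging sink typically
// creates for Cloud Run request logs (an assumption — verify in your dataset).
function weeklyRequestCountQuery(projectId: string, serviceName: string): string {
  return `
    SELECT DATE(timestamp) AS day, COUNT(*) AS requests
    FROM \`${projectId}.sandbox_logs.run_googleapis_com_requests\`
    WHERE resource.labels.service_name = '${serviceName}'
      AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY day
    ORDER BY day`;
}
```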

Tool Design — Making AI Use Tools Correctly

Let me close with a note on tool definitions. I personally think this is where MCP design really makes or breaks.

Sandbox MCP exposes 10 tools:

| Tool | Purpose |
| --- | --- |
| sandbox_publish | Start deploy (async) |
| sandbox_deploy_status | Check deploy status |
| sandbox_init_repo | Initialize git push repo |
| sandbox_write_file | Write file (overwrite/append) |
| sandbox_list | List apps |
| sandbox_delete | Delete app |
| sandbox_schedule | Configure Cloud Scheduler |
| sandbox_unschedule | Remove Cloud Scheduler |
| sandbox_read_file | Read source code |
| sandbox_list_files | List files |

Whether the AI picks the right tool at the right moment is almost entirely determined by what's written in the tool description.

For example, the description for sandbox_publish covers not just functionality but also:

  • Supported app types and required files (Python / Node.js / static HTML / custom)
  • Startup command and PORT requirement per type
  • When to use write_file vs git push
  • How to use SandboxDB (with SDK code samples)
  • How to use the UI Kit (explicit instruction to fetch README.md via read_file)

With this in place, the AI can autonomously do:

  1. User says "build me a tool that displays Slack emoji scores"
  2. → Reads sandbox_publish description and sees "first read the UI Kit README"
  3. → Calls read_file on sandbox-ui-kit/README.md
  4. → Generates HTML/CSS/JS following the guidelines
  5. → Sees the SandboxDB SDK usage in the description and integrates persistence
  6. → Calls sandbox_publish

— without asking the user a single follow-up question. Writing not just "what it does" but "what to do with it" into the tool definition is the secret to AI-friendly design.

If you write tool definitions tersely, the AI keeps coming back asking "what should I do next?" The description is less of a human-facing doc and more of an AI-facing runbook. That framing helps a lot.

Wrap-Up

Sandbox MCP exists to answer two challenges of building internal tools in the AI era:

  • Building is now possible for anyone, thanks to AI
  • Publishing safely remains hard

To close that gap, we:

  • Standardized every layer on the platform side: frontend / backend / DB / infra / auth / domain / SSL
  • Embedded a runbook into tool descriptions so the AI naturally uses things correctly
  • Layered four access gates (ZeroTrust / OAuth / namespace isolation / IAM) so safety doesn't depend on the implementation being correct

Building this, what struck me again is that the role of platforms in an AI-powered development era is shifting. Platforms used to optimize for "easy for humans." Now they also need to optimize for "used correctly by AI." Tool descriptions are AI-facing docs, and safety must be designed assuming AI will write incorrect code.

At the same time, by limiting what the builder is responsible for, we drastically lower the barrier to "let me just try something." That's the entry point that turns a non-engineer's "I want to build this" into actual operational improvements.

I hope this is useful for anyone designing internal platforms.

At airCloset, we're looking for engineers who want to build a new development experience together with AI. If you're interested, please check out our careers page at airCloset Quest.

What Structured Data Your Product Page Needs in 2026

2026-04-28 07:04:57


The rules for product schema used to be pretty simple. Add Product, Offer, and maybe AggregateRating, and you were done. Google was happy. Your rich snippets looked nice. You moved on.

Those days are over. AI shopping agents are reading structured data differently than Google's crawler does, and they need a much fuller set of fields to actually surface your products confidently. Here's the complete 2026 list for any ecommerce store that wants to stay visible.

The Core Five

Product. Must include: name, description (minimum 150 chars), image (at least one, ideally multiple), brand (as a nested Brand object with name), sku, gtin, mpn (if applicable), category, weight, color, material, size. The old "just name + image" version won't cut it anymore.

Offer. Must include: price, priceCurrency, availability (InStock/OutOfStock/PreOrder/etc), priceValidUntil, itemCondition (NewCondition/UsedCondition), url (to the product page), hasMerchantReturnPolicy, shippingDetails.

AggregateRating if you have reviews. ratingValue, reviewCount, bestRating, worstRating. If you don't have reviews yet, consider adding them with an honest count (even if small).

Review array. Include your latest 3-5 reviews as nested Review objects with author, datePublished, reviewBody, and reviewRating. Agents use these for social proof.

BreadcrumbList. Category path from homepage to product. This is easy but a surprising number of stores don't have it.

The Fields Agents Actually Care About in 2026

Here's the thing most SEO guides haven't caught up on yet. These are the fields that are becoming hard differentiators in AI agent retrieval:

hasMerchantReturnPolicy. This is the biggest one. ChatGPT's shopping answers now strongly prefer products from stores that explicitly declare return policies in schema. 94% of stores I've scanned are missing this. If you add it, you get a meaningful visibility boost.

shippingDetails. Include shippingRate, shippingDestination, shippingLabel, and deliveryTime. Agents read this to answer "can I get this by Thursday" type queries. Missing = you're out of the consideration set.

additionalProperty array. This is a generic key-value bag for attributes that don't fit the standard schema. Material composition, certifications, specifications, size charts. Agents love this because they can use it to filter and compare.

itemCondition. Sounds obvious but a lot of stores omit it. Especially important for used/refurbished/open-box items where agents need to know before recommending.

offers.priceValidUntil. If you run sales, this tells agents when the price expires. Without it, agents can't confidently recommend your sale price because they don't know if it's still valid.
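Putting the core five and the 2026 fields together, a product's JSON-LD might look like this. All values are illustrative placeholders — trim or extend to match your catalog:

```json
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Trail Runner 2",
  "description": "Lightweight trail running shoe with a recycled mesh upper, rock plate, and 6mm drop. Built for technical terrain and daily training alike, with vegan materials throughout.",
  "image": ["https://example.com/img/trail-runner-2-main.jpg"],
  "brand": { "@type": "Brand", "name": "ExampleBrand" },
  "sku": "TR2-BLK-42",
  "gtin": "00012345600012",
  "color": "Black",
  "material": "Recycled mesh",
  "additionalProperty": [
    { "@type": "PropertyValue", "name": "Drop", "value": "6mm" },
    { "@type": "PropertyValue", "name": "Certification", "value": "Bluesign approved" }
  ],
  "aggregateRating": { "@type": "AggregateRating", "ratingValue": "4.6", "reviewCount": "128" },
  "offers": {
    "@type": "Offer",
    "url": "https://example.com/products/trail-runner-2",
    "price": "89.00",
    "priceCurrency": "USD",
    "priceValidUntil": "2026-06-30",
    "availability": "https://schema.org/InStock",
    "itemCondition": "https://schema.org/NewCondition",
    "hasMerchantReturnPolicy": {
      "@type": "MerchantReturnPolicy",
      "applicableCountry": "US",
      "returnPolicyCategory": "https://schema.org/MerchantReturnFiniteReturnWindow",
      "merchantReturnDays": 30,
      "returnMethod": "https://schema.org/ReturnByMail",
      "returnFees": "https://schema.org/FreeReturn"
    },
    "shippingDetails": {
      "@type": "OfferShippingDetails",
      "shippingRate": { "@type": "MonetaryAmount", "value": "4.99", "currency": "USD" },
      "shippingDestination": { "@type": "DefinedRegion", "addressCountry": "US" },
      "deliveryTime": {
        "@type": "ShippingDeliveryTime",
        "handlingTime": { "@type": "QuantitativeValue", "minValue": 0, "maxValue": 1, "unitCode": "DAY" },
        "transitTime": { "@type": "QuantitativeValue", "minValue": 2, "maxValue": 4, "unitCode": "DAY" }
      }
    }
  }
}
```

Note that the Offer, return policy, and shipping details are nested inside the Product — one JSON-LD block, not several.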

Common Mistakes

A few things I see over and over.

Inconsistency between schema and on-page content. Your schema says the product is $29 but the page shows $35. Google's crawler forgives this, AI agents don't. They downweight you significantly for inconsistency.

Wrong availability values. Use the exact schema.org enumeration URLs: https://schema.org/InStock, https://schema.org/OutOfStock, etc. Not "available" or "yes" or your own strings.

Shallow brand object. "brand": "Nike" is not valid. It should be {"@type": "Brand", "name": "Nike"}. Most themes get this wrong out of the box.

Missing mainEntityOfPage. This is a subtle one. It tells crawlers which URL is the canonical location for the schema. Helps with deduplication when you have multiple URLs for the same product.

Nested schema that isn't actually nested. If you have a product with offers, the offers should be inside the product schema, not in a separate JSON-LD block. Agents parse the whole document and sometimes miss stuff that's structured weirdly.

How to Validate

Google's Rich Results Test is still the gold standard for validation. https://search.google.com/test/rich-results. Run it on 5-10 of your highest-traffic product pages and fix any errors.

For more advanced validation, the Schema.org validator at https://validator.schema.org/ is more strict and catches things Google misses. Worth running periodically.

For AI-specific validation, there's no good public tool yet. The best approach is to query ChatGPT directly with a product search and see if your store shows up. If not, you probably have gaps somewhere in this list.

The Implementation Path

For Shopify stores, the fastest path to a compliant product schema is:

  1. Export 5 product pages, view source, find the JSON-LD block.
  2. Paste into https://jsonld.com/json-ld-visualizer/ or similar to see the structure.
  3. Compare against the full field list above. Note what's missing.
  4. Fix in your theme's product.liquid template, or use an app that adds the missing fields.
  5. Re-test with Rich Results Test until clean.

For custom stores (headless Shopify, Next.js, whatever), you'd do the same thing but in your product component. Next.js in particular makes this easy with a script tag outputting JSON-LD.

The Takeaway

Product schema in 2026 is richer and stricter than it was in 2022. The baseline for "visible to AI agents" has moved up significantly and most stores haven't caught up. If you invest a few hours updating your schema now, you'll be ahead of 80% of your competitors for a year or two. After that, it'll be table stakes and you'll wish you'd done it sooner.

Run the Rich Results Test on your store this week. You'll probably find gaps. Fix them. It's one of the highest-ROI things you can do for your store's visibility in both Google and AI channels right now.

Part 1: One Spec To Rule Them All

2026-04-28 07:02:27

One Spec to rule them all, one Spec to find them, one Spec to bring them all, and in the darkness bind them.

This is the first post in a series about spec-driven development. Not which tool to use or how to get started, but what I learned after living with a spec long enough to hit the problems that nobody writes about yet. I do not have all the answers and I am not trying to be a guru. I am sharing what worked, what did not, and what I am still figuring out. If you have been down this road too, I would love to hear your experience in the comments.

Spec-driven development is having a moment. Microsoft shipped a spec-kit and wrote about it on their developer blog. JetBrains published a dedicated series on using a spec-driven approach with AI coding tools. Tools like Kiro and CodeSpeak are building entire development models around the idea that specs, not code, are the primary artefact. Martin Fowler's blog has a detailed breakdown comparing SDD tools. The term is everywhere.

Most of this content is useful. It explains what SDD is, compares approaches, and helps teams get started. But almost all of it is written from the outside looking in, by people who adopted the practice recently or are building tools around it. Very little comes from engineers who have been doing it long enough to hit the problems that only show up later.

I have been writing and maintaining a spec across nine SDKs for three years. I started before SDD had a name, before LLMs made it a topic of conversation, and before any of the current tooling existed. I have a lot of thoughts about what makes a spec genuinely useful over time and what makes it quietly fall apart. This series is my attempt to think through that in public, and hopefully start a conversation with people who are navigating the same problems.

What a Spec Actually Is (And What It Is Not)

A lot of the current SDD conversation frames the spec as a disposable implementation plan for an LLM. You write it, the agent consumes it, code comes out, job done. Some tools are explicitly built around this model. The spec is an intermediate artifact, a way of communicating intent to an AI before it disappears into the generated code.

That framing is not wrong for certain use cases. But it is a narrow way to think about something with much broader value.

The definition I find more useful, and the one this series is built around, is this:

A spec is a contract between implementations.

It does not describe code. It defines behavior: what a feature should do, how it should respond to edge cases, what a developer can rely on regardless of which language or platform they are using. The moment you have more than one implementation of the same thing, you need something that sits above all of them and answers the question: what does correct actually mean here?

Tests do not answer this. Tests verify that your code behaves the way you wrote it. They say nothing about whether the behavior you wrote was the right one. A test can pass in nine SDKs while each of them does something subtly different, and nothing in your CI pipeline will flag it.

Code reviews do not answer it either. A reviewer working inside a single codebase has no way to know whether this implementation matches what the mobile client does, or what the desktop client does, or what a developer reading your docs will reasonably expect.

The spec is the only artifact that exists at the level of behavior rather than implementation and it can be the referee when two implementations disagree.

Tips for Keeping Your Spec Useful Over Time

Writing a spec is the easy part. Keeping it honest with reality over months and years is where most teams quietly struggle. Before closing this first post I want to share some practical foundations that have helped us keep our spec useful over time.

Keep your spec in version control, close to the code

A spec that lives in a wiki or a shared document will drift. It needs to be versioned alongside the code it describes, treated with the same discipline as a codebase. If a spec change does not go through a pull request, it will quietly stop reflecting reality. Nobody intends this to happen. It happens anyway, gradually, and by the time you notice, the spec has become historical fiction.

Give every behavior a unique stable ID

This is the single most practical thing I can pass on. Every spec entry describing a distinct behavior should have a unique identifier. At Ably we use abbreviations of feature areas combined with alternating numbers and letters for nesting levels. So RTP is Realtime Presence, RSP is REST Presence, and a deeply nested entry might look like RTP2a5c6b7. These IDs let you reference spec entries directly from tests, from code comments, from pull requests, from conversations. Instead of describing a behavior in prose every time, you point to the ID. Anyone reading the code can trace it back to the contract it implements. This traceability is what separates a spec that is actually used from one that exists only to be consulted.
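Because the IDs are stable, tooling can mechanically check coverage. A toy sketch of the kind of check this enables — the IDs and function name are illustrative, not from any real Ably tooling:

```typescript
// Toy sketch: given the IDs defined in the spec and the IDs referenced from
// tests (e.g. collected from "@spec RTP2a" comments), list behaviours that
// have no test at all.
function untestedSpecIds(specIds: string[], testedIds: string[]): string[] {
  const tested = new Set(testedIds);
  return specIds.filter((id) => !tested.has(id));
}
```

Run against a spec of ['RTP2a', 'RSP3b'] with only 'RTP2a' referenced from tests, it flags 'RSP3b' — exactly the untested-contract gap that tests alone never reveal.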

Use cross-references instead of repeating logic

Duplication in a spec is as dangerous as duplication in code. When the same behaviour is described in two places, they will eventually diverge, and you will have two sources of truth instead of one. The solution is the same as in code: do not repeat yourself. When one spec entry depends on or extends another, reference it by ID rather than restating the logic. This keeps each behaviour defined exactly once, makes the spec easier to maintain, and means that when something changes you update it in one place and the rest of the spec stays coherent.

Generate new IDs when behaviour changes

If you change what a behaviour does and keep the same ID, you silently break the traceability chain. Tests referencing the old ID now verify the wrong thing, code comments become misleading, and the history becomes unreadable. The discipline of generating a new ID when behaviour changes forces an explicit acknowledgement: this is not a correction, it is a new contract. Old entries get replaced rather than deleted, leaving a paper trail. For example, "This entry has been superseded by RSC25 as of specification version 4.0.0".
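One way to keep that paper trail machine-checkable is a supersession table that tooling can consult. In the sketch below, the RSC25 / 4.0.0 pair comes from the example in the text; the replaced entry RSC7b is purely hypothetical, as is the data structure itself.

```python
import warnings

# Hypothetical supersession table: old ID -> (new ID, spec version of change).
# RSC7b is an invented example of a replaced entry.
SUPERSEDED = {
    "RSC7b": ("RSC25", "4.0.0"),
}

def resolve(spec_id: str) -> str:
    """Follow supersession pointers to the currently valid ID, warning on
    each hop so stale references in tests and comments get surfaced."""
    seen = set()
    while spec_id in SUPERSEDED:
        if spec_id in seen:
            raise ValueError(f"supersession cycle at {spec_id}")
        seen.add(spec_id)
        new_id, version = SUPERSEDED[spec_id]
        warnings.warn(f"{spec_id} was superseded by {new_id} in spec version {version}")
        spec_id = new_id
    return spec_id
```

A test harness that resolves every `spec_id` it encounters will then surface any test still pinned to a replaced contract, instead of silently verifying the wrong thing.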

Use RFC 2119 requirement language

Ambiguous language in a spec is a slow poison. Words like "should", "must" and "may" mean different things to different people, and when an LLM or a new engineer reads your spec, those differences matter. RFC 2119 solves this cleanly: MUST means mandatory, SHOULD means recommended, MAY means optional. Adopting this convention costs nothing and eliminates an entire category of misinterpretation. When someone asks "is this behaviour required or just a suggestion", the spec answers the question without needing a conversation.
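The convention is also easy to lint. The sketch below flags lowercase requirement words, on the assumption that the document reserves the uppercase RFC 2119 forms (MUST, SHOULD, MAY, and so on) for normative statements; lowercase hits are candidates for review, to be either promoted to uppercase or reworded.

```python
import re

# RFC 2119 keywords; the regex is case-sensitive, so uppercase
# normative uses (MUST, SHOULD, ...) pass the lint untouched.
KEYWORDS = r"must(?: not)?|shall(?: not)?|should(?: not)?|may|required|recommended|optional"
LOWERCASE = re.compile(rf"\b(?:{KEYWORDS})\b")

def lint_requirement_words(spec_text: str) -> list[tuple[int, str]]:
    """Return (line number, offending word) pairs for lowercase keywords."""
    hits = []
    for n, line in enumerate(spec_text.splitlines(), start=1):
        hits.extend((n, word) for word in LOWERCASE.findall(line))
    return hits
```

Not every lowercase "may" is a hidden requirement, so this works better as a review prompt than a hard failure.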

Further Reading

If you want to explore the current state of spec-driven development, here are the resources mentioned in this post:

Follow or subscribe so you do not miss Part 2. And if any of this resonates with your own experience, or if you think I am getting something wrong, I would genuinely love to hear it in the comments.

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

2026-04-28 07:00:51

The AI landscape is experiencing unprecedented growth and transformation. This post delves into the key developments shaping the future of artificial intelligence, from massive industry investments to critical safety considerations and integration into core development processes.

Key Areas Explored:

  • Record-Breaking Investments: Major tech firms are committing billions to AI infrastructure, signaling a significant acceleration in the field.
  • AI in Software Development: We examine how companies are leveraging AI for code generation and the implications for engineering workflows.
  • Safety and Responsibility: The increasing focus on ethical AI development and protecting vulnerable users, particularly minors.
  • Market Dynamics: How AI is influencing stock performance, cloud computing strategies, and global market trends.
  • Global AI Strategies: Companies are adapting AI development for specific regional markets.

This deep dive aims to provide developers, tech leaders, and enthusiasts with a comprehensive overview of the current state and future trajectory of AI.

#AI #ArtificialIntelligence #TechTrends #SoftwareEngineering #MachineLearning #CloudComputing #FutureOfTech #AISafety