2026-04-28 07:16:07
We had a slightly reckless idea: what if we let AI do most of our data engineering work?
Not "help with a query here and there," but actually build real pipelines.
Azure, Databricks, Delta Lake, the whole thing.
Real enterprise data, messy schemas, and stakeholders who will definitely shout if numbers look wrong.
I'm a Senior Data Engineer, I work on this stack every day, and I still wanted to see how far we could push AI into my own job.
This is what happened when we tried.
The setup will look familiar to a lot of people:
Most of the transformation work lives in PySpark, with SQL on top for reporting and BI layers.
The experiment was simple to describe and painful to watch:
We fed it the table schemas, taken straight from DESCRIBE and SHOW COLUMNS output.
All of this happened in a safe Databricks workspace with test data and separate storage. No chance of breaking production, but our question was serious. Could we realistically replace most of the day-to-day data engineering work on a new pipeline?
To be fair, AI did a few things well enough that I now use it on purpose.
Whenever I needed yet another "read, filter, transform, write" notebook, the model saved a bit of time:
For example, I asked it something close to:
"Read from
bronze.orders, filter cancelled orders, castorder_tsto timestamp, addorder_date, then write tosilver.orders_cleanas Delta, partitioned byorder_date."
The generated PySpark looked like this:
from pyspark.sql import functions as F
df = (
spark.read.table("bronze.orders")
.filter(F.col("status") != "CANCELLED")
.withColumn("order_ts", F.to_timestamp("order_ts"))
.withColumn("order_date", F.to_date("order_ts"))
)
(df.write
.format("delta")
.mode("overwrite")
.partitionBy("order_date")
.saveAsTable("silver.orders_clean"))
Could I have written this faster by hand? On a good day, yes.
But over dozens of similar notebooks, the time saved adds up.
For straightforward reporting queries, Copilot in a SQL editor was handy.
It drafted SELECT lists once it saw the schema, and got GROUP BY and ORDER BY clauses right most of the time. I still had to adjust conditions and add proper filters, but I was no longer staring at a blank editor. That alone reduces friction.
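For a sense of what those drafts looked like, here is the kind of straightforward aggregate it would produce. The table name comes from the earlier example; the revenue column is illustrative, not our real schema:
-- Daily order volume and revenue (column names illustrative)
SELECT
  order_date,
  COUNT(*)         AS orders,
  SUM(order_total) AS revenue
FROM silver.orders_clean
WHERE status != 'CANCELLED'
GROUP BY order_date
ORDER BY order_date;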
The part that surprised me most was how useful AI was for the boring bits:
None of this replaces real architectural decisions, but it lets me stay in "technical thinking" mode while an assistant fills in the prose.
Now for the part that actually matters, especially if you work in a production environment.
We tried a customer 360-style pipeline that combined:
- bronze.customers
- bronze.orders
- bronze.events for clickstream data
- bronze.subscriptions

We told the model something like:
"Join these to build a customer centric table with basic attributes, last activity date, and current subscription status."
It produced:
- A join from customers to orders on customer_id, which was fine
- A join from orders to subscriptions on customer_id, which was wrong: in our world, the real join key for subscriptions is account_id
- Last activity computed as MAX(event_ts) per customer_id, ignoring the fact that some event types should not count as "activity"

The result:
All of this ran without any schema error. Nothing crashed. The table "looked" fine at a glance.
But the logic was off in exactly the way that breaks trust with downstream users.
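For contrast, here is a simplified sketch of what the corrected logic looks like in our world, assuming customers carries both customer_id and account_id. The subscriptions join uses account_id, and the list of event types that count as activity is a placeholder:
from pyspark.sql import functions as F

customers     = spark.read.table("bronze.customers")
orders        = spark.read.table("bronze.orders")
subscriptions = spark.read.table("bronze.subscriptions")
events        = spark.read.table("bronze.events")

# Orders really do join on customer_id.
last_order = orders.groupBy("customer_id").agg(F.max("order_ts").alias("last_order_ts"))

# Subscriptions join on account_id, not customer_id.
current_subs = subscriptions.select("account_id", F.col("status").alias("subscription_status"))

# Only some event types count as "activity" (this list is illustrative).
last_activity = (
    events
    .filter(F.col("event_type").isin("login", "page_view", "purchase"))
    .groupBy("customer_id")
    .agg(F.max("event_ts").alias("last_activity_ts"))
)

customer_360 = (
    customers
    .join(last_order, "customer_id", "left")
    .join(last_activity, "customer_id", "left")
    .join(current_subs, "account_id", "left")
)
The point is not this exact code. It is that account_id and the activity filter are exactly the kind of detail only someone who knows the domain would add.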
When our prompt was slightly vague, the model started inventing things:
- Columns like is_active that were not in any table
- A table called orders_clean, which only existed in someone's head, not in our catalog
- customer_email instead of email_address
This is a known issue with LLM generated code in general, often called hallucination.
In a chat window it looks clever. In a notebook connected to real data, it is just a bug factory.
You either fix the code and bend it back toward reality, or you start renaming actual tables and columns to match the hallucination. I saw both instincts on the team.
We pointed AI at a slow query on a large Delta table and asked for tuning suggestions.
It happily suggested a list of changes, each sounding perfectly plausible.
In practice, I spent more time validating each suggestion with EXPLAIN and the Databricks query profile than it would have taken to reason through the original plan myself.
We set up some deliberately non-ideal data.
The model did not account for any of it.
The code it generated assumed clean, static, relational textbook data.
That world does not exist in any real enterprise I have worked in.
This is not only about wrong code. There is a deeper gap that explains most of the failures we saw.
The model only sees what we paste into the prompt or make available through a narrow integration. It does not naturally see:
Lineage is technical and social. It lives partly in tools, partly in tribal knowledge, and partly in old Slack threads. AI only sees a slice of that picture unless you build a very deliberate context layer around it.
Here is a real kind of rule you will not get from a table definition:
"A customer is active if they had a paid transaction in the last 90 days, except in region X and contract type Y, where the window is 180 days."
Pieces of that rule live in:
If you ask AI to "mark active customers," it will give you a clean definition that fits a generic pattern. That pattern is almost guaranteed to differ from your actual rulebook.
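To make that concrete, here is roughly what our actual rule looks like once encoded. The source table, column names, and the region and contract values are placeholders:
from pyspark.sql import functions as F

customers = spark.read.table("silver.customers")  # illustrative source table

# 90 days by default, 180 days for region X with contract type Y.
active_window_days = (
    F.when((F.col("region") == "X") & (F.col("contract_type") == "Y"), F.lit(180))
     .otherwise(F.lit(90))
)

customers = customers.withColumn(
    "is_active",
    F.datediff(F.current_date(), F.col("last_paid_transaction_date")) <= active_window_days,
)
A generic "mark active customers" prompt gives you the 90-day half of this and silently drops the exception.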
When I ship a pipeline, I am the one who answers for it when the numbers look wrong.
AI has no skin in the game. It can be wrong with confidence and nothing bad happens to it. That changes how much you trust it, and it should change how you design your review and test processes.
After a few weeks, we stopped trying to make AI "do" data engineering and started treating it like an extra pair of very fast hands.
We now use AI to:
We still own the logic. We still write tests. The model just gets us to the first draft faster.
AI is useful when we paste an EXPLAIN plan and want a plain-language description of what it is doing.
It is not our primary performance tuner, but it makes the feedback loop a bit shorter.
We let AI start the boring writing:
Engineers review and correct the details, but they are not starting from an empty page.
Two concrete scenarios from the kind of environment many of us work in:
Marketing attribution pipeline
Finance reconciliation layer
In both cases, AI makes small pieces of the job faster. The responsibility and the judgement stay with the human engineers.
After trying quite hard to let AI handle a big chunk of my data engineering work on Azure Databricks, my view is pretty clear.
The engineers who learn to use these tools deliberately, verify their output, and keep ownership of the logic will move faster than those who either ignore these tools or trust them blindly.
AI will not replace data engineers.
Data engineers who use AI well will replace the ones who do not.
2026-04-28 07:09:46
Why I built midnight-doctor — a pre-flight check that catches the 16 hours of debugging you don't know you're about to spend
"The information you need to align your stack already exists. It's just not executable."*
After five months building four applications on Midnight Network, I've concluded the protocol isn't the problem. The protocol is genuinely good — Compact compiles ZK circuits in three seconds without a trusted setup ceremony, and the shielded/unshielded/dust split is the cleanest answer to compliance-grade privacy I've seen.
The problem is everything that surrounds it. Specifically, the unforgiving math of: a fast-moving SDK + multiple environment tracks + a single npm namespace + zero error messages when versions misalign.
This post is the story of one specific 6-hour debugging session, the pattern I extracted from it, and the ~700-line tool I built so the next person doesn't pay the same tax.
DPO2U Wallet, March 2026. I bumped @midnight-ntwrk/wallet-sdk-facade to 2.0.0 because npm marked it as latest. My local stack ran midnightntwrk/midnight-node:0.21.0 (preprod-targeted). I wrote ~50 lines of wallet bootstrap, started the dev loop, and watched:
const wallet = await WalletFacade.init({
configuration,
shielded,
unshielded,
dust,
});
await wallet.waitForSyncedState();
console.log('synced!');
The synced! line never printed.
No error. No timeout. No log. The wallet just sat in syncing state forever. I added more logging — got more "syncing" lines. Restarted Docker. Wiped state. Tried a fresh seed. Re-cloned the repo. Asked Discord. Read the SDK changelog (which didn't exist for that version). Stared at the indexer logs for 40 minutes.
Six hours in, almost by accident, I noticed the symptom: subscribeRuntimeVersion was firing once, returning, and never firing again. The standalone node closes that subscription early. The WalletFacade.init({...}) constructor in 2.x wires sync to that subscription. Result: a single missed event silently kills the entire sync loop, with zero surface.
The fix was to downgrade to facade 1.0.0 (preprod track). Forty seconds of work, after six hours of debugging. The bug was real but the cost was information asymmetry: nothing in the documentation, npm metadata, or runtime told me that "this SDK version is incompatible with this node version."
That asymmetry is the real bug. Not the runtime subscription closing. Not the constructor wiring. The fact that the system has the information needed to diagnose itself, but doesn't bother.
After that session, I started cataloguing other times I'd been bitten the same way. Within a week I had a list:
| Symptom | Root cause | Time spent |
|---|---|---|
| waitForSyncedState() hangs forever | facade 2.x + node 0.21.0 mismatch | ~6h |
| npm install returns ENOTFOUND | community tutorial said .npmrc should point at npm.midnight.network (domain doesn't exist) | ~2h |
| Transactions silently fail to submit | Two versions of @midnight-ntwrk/ledger-v7 in node_modules (transitive bump) | ~3h |
| Indexer crash-loops on startup | indexer-standalone:4.0.0-rc.4 requires a subscription: block in indexer.yml, undocumented | ~1.5h |
| WalletFacade behaves differently on standalone vs preprod | Same SDK, different node behavior, no warning | ~4h |
Across five incidents: ~16 hours of debugging silent failures. Each one had a root cause that was, with the right cross-reference, detectable in seconds:
- package.json says facade is 2.0
- docker ps says node is 0.21.0
- the fact that those two don't work together is already written down in changelogs and Discord pins

The system already has all three pieces of data. There's just no glue.
midnight-doctor
The fix isn't a new feature in the SDK. It isn't better docs (which decay). It's an executable cross-reference — a script that reads your project, your Docker stack, and your config files; matches them against a curated compatibility table; and tells you what's wrong.
I built midnight-doctor over a weekend. ~700 lines of Node, zero runtime dependencies, single-binary install:
npx midnight-doctor
Run against my own legacy midnight-hello-world repo, it produces:
── midnight-doctor ──
project: /root/midnight-hello-world
⚠ SDK track: Preprod 3.x (legacy, OK for existing apps)
Detected from [email protected].
⚠ Track is deprecated
SDK 3.2 / facade 2.0 is 2 majors behind. Current is [email protected] /
[email protected] (released 2026-04-23). The WalletFacade.init({...}) constructor
used in 2.0 has been removed; 4.0 reverted to `new WalletFacade(s, u, d) +
.start()`. Plan a migration.
⚠ WalletFacade.init() in 2.x stalls on standalone dev nodes
In SDK 2.x, WalletFacade.init() subscribes to runtime version events that the
standalone node closes early. The wallet hangs in 'syncing' state forever
with no error.
→ fix: Either (a) develop against preprod once past hello-world, or
(b) upgrade to [email protected] which reverted to constructor +
.start() pattern.
✓ node: midnightntwrk/midnight-node:0.21.0
✓ indexer: midnightntwrk/indexer-standalone:4.0.0-rc.4
✓ proof-server: midnightntwrk/proof-server:7.0.0
✓ midnight-node:0.21.0 matches SDK track
summary: 4 ok 3 warn 0 error 0 info
Status: workable, but warnings deserve a look.
The 6-hour incident from March is now a 30-second output.
The tool is structurally trivial. Three scanners, one diagnosis engine, one report formatter. Total surface:
midnight-doctor/
├── bin/midnight-doctor.js # CLI entry, arg parsing, exit codes
├── lib/
│ ├── scan-package.js # reads package.json + walks node_modules
│ ├── scan-docker.js # runs `docker ps`, parses image tags
│ ├── scan-config.js # reads .npmrc, indexer.yml
│ ├── diagnose.js # cross-references findings with matrix
│ ├── report.js # ANSI-colored pretty output
│ └── index.js # public API
├── data/compatibility-matrix.json # the source of truth
└── test/diagnose.test.js # node:test, no framework
The ~700 lines of code are scaffolding. The actual value is the JSON document at data/compatibility-matrix.json. It encodes the tracks (current, preprod-3x, preprod-1x), the package and container versions each track expects, and the known issues with their fixes.
A snippet:
{
"tracks": {
"current": {
"label": "Current (latest npm + preprod compatible)",
"node": "0.21.0",
"indexer": "4.0.0-rc.4",
"proofServer": "7.0.0",
"packages": {
"@midnight-ntwrk/wallet-sdk": "1.0.0",
"@midnight-ntwrk/wallet-sdk-facade": "4.0.0",
"@midnight-ntwrk/compact-runtime": "0.15.0",
...
}
}
},
"knownIssues": [
{
"id": "facade-2x-init-bug",
"severity": "warn",
"match": { "type": "package-version",
"package": "@midnight-ntwrk/wallet-sdk-facade",
"range": "2.x" },
"title": "WalletFacade.init() in 2.x stalls on standalone dev nodes",
"fix": "Either develop against preprod, or upgrade to facade 4.0.0..."
}
]
}
This file is the same information that lives, scattered, across Discord pinned messages and individual developers' heads. Centralizing it in a JSON document makes it versionable, machine-readable, and datable: each entry carries a verifiedAt: "2026-04-27" field, so staleness is detectable.

Each scanner is one file, one responsibility, ~50 lines.
// lib/scan-docker.js (excerpt)
export async function scanDocker() {
const findings = { dockerAvailable: false, containers: {} };
try {
await exec('docker', ['version', '--format', '{{.Server.Version}}']);
findings.dockerAvailable = true;
} catch { return findings; }
const { stdout } = await exec('docker', [
'ps', '--format', '{{.Image}}|{{.Names}}|{{.Status}}',
]);
for (const line of stdout.split('\n').filter(Boolean)) {
const [image, name, status] = line.split('|');
const [imageName, tag = 'latest'] = image.split(':');
if (KNOWN_IMAGES[imageName]) {
findings.containers[KNOWN_IMAGES[imageName]] = { image: imageName, tag, name, status };
}
}
return findings;
}
No Docker SDK dependency, no parsing library — just child_process.execFile and String.prototype.split. If Docker isn't installed, the scanner returns { dockerAvailable: false } and the diagnosis engine emits an info diagnostic instead of crashing.
The package scanner is similar — walks node_modules recursively (capped at 5000 directories so monorepos don't hang), records every @midnight-ntwrk/* it finds and what version. Duplicates fall out as a side effect: if the same package name has more than one version in the array, it's a duplicate.
const allInstances = await findAllInstances(projectDir, MIDNIGHT_NS);
for (const [name, versions] of Object.entries(allInstances)) {
const unique = [...new Set(versions)];
if (unique.length > 1) findings.duplicates[name] = unique;
}
Three lines. That's the entire duplicate-detection logic. The remaining ~50 lines of the scanner are filesystem traversal — boring, mechanical, but it works on every Node project regardless of package manager (npm, pnpm, yarn, bun all create node_modules directories with the same shape).
lib/diagnose.js takes the three scan results plus the matrix and emits diagnostics. The pattern repeats per check:
function diagnoseCrossCutting(pkg, docker, matrix) {
const facadeVersion = pkg.installed['@midnight-ntwrk/wallet-sdk-facade'];
const nodeContainer = docker.containers.node;
if (!facadeVersion || !nodeContainer) return [];
const track = matrix.tracks[detectTrack(facadeVersion, matrix)];
if (track.node && track.node !== nodeContainer.tag) {
return [{
id: 'node-track-mismatch',
severity: 'error',
title: `midnight-node:${nodeContainer.tag} doesn't match SDK track (expects ${track.node})`,
detail: `Mixing causes silent sync failures with no error.`,
fix: `Either update the node container to ${track.node}, or align SDK to a track that matches your node version.`,
}];
}
return [{ id: 'node-track-match', severity: 'ok',
title: `midnight-node:${nodeContainer.tag} matches SDK track` }];
}
That's the entire body of the cross-cut check that would have saved my 6-hour March incident. Twelve lines. The information was there the whole time.
The full diagnosis file is ~200 lines. Most of it is the same shape: pick two facts from the scans, compare against the matrix, emit a diagnostic. It's deliberately boring code.
Scope discipline is half the value of a tool like this. Things midnight-doctor deliberately doesn't do:
- No --fix flag yet. The cost of a wrong auto-fix in this domain (bricked node_modules, lost work) outweighs the convenience.
- No live runtime health checks. A shell script (midnight-health-check.sh) covers that already in my infra repo.
- No calls against the running node: no getBlockNumber(), no balance lookup. Those depend on a working SDK, which is what doctor is meant to validate before you try to use it.
- It doesn't run npm dedupe for you. It tells you to run it.
- It doesn't scaffold projects (that's create-midnight-app, eventually).

Each non-feature is a deliberate choice to keep the surface small enough that the tool stays correct.
midnight-doctor is the smallest useful version. The compatibility matrix as data unlocks several adjacent tools:
- create-midnight-app — scaffold a project where midnight-doctor is wired into the npm install lifecycle. The matrix becomes the default versions.
- A codemod for the facade migration: WalletFacade.init({...}) → new WalletFacade(...) + .start(...). AST transform.
- A Compact checker that reads .compact source and says "this uses syntax X, requires compactc ≥ Y."
- A hosted matrix: https://compatibility.midnight.network/matrix.json would let users not need to bump the doctor itself.

The first one is the highest leverage by far — it short-circuits the entire onboarding problem. But it's a much bigger project. Doctor is a Trojan horse for the matrix; once the matrix exists and people accept it as canonical, the rest is plumbing.
Documentation rots. A README written today describes a system that, six weeks from now, has moved two majors. Anyone who builds in a fast-moving ecosystem has lived this: the official doc says "use SDK 1.0", the npm latest tag says 4.0, the GitHub README says 3.0, and the Discord pinned message says "we know, sorry, the docs are stale." None of those sources is wrong. They were all true at some point. The problem is they don't update each other.
Executable knowledge updates itself. A JSON file with a verifiedAt date that's three weeks old is visibly stale. A doc that's three weeks old looks identical to a doc that's three days old. Tools force the question "is this still true?" in a way that prose doesn't.
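As a small sketch of what "executable" means here, a check like the following can warn when the matrix itself has gone stale. It assumes a top-level verifiedAt field, whereas the real matrix may attach the date per entry:
// staleness-check.mjs: illustrative only, not midnight-doctor's actual code
import { readFile } from 'node:fs/promises';

const matrix = JSON.parse(await readFile('./data/compatibility-matrix.json', 'utf8'));
const ageDays = (Date.now() - Date.parse(matrix.verifiedAt)) / 86_400_000;

if (Number.isNaN(ageDays)) {
  console.warn('matrix has no verifiedAt date; treat it as unverified');
} else if (ageDays > 21) {
  console.warn(`matrix last verified ${Math.round(ageDays)} days ago; results may be stale`);
}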
This isn't unique to Midnight. Every fast-moving stack reaches the point where the gap between "what the docs claim" and "what actually works" is wide enough to swallow new developers. The fix is the same: take the lookup table out of human heads, encode it as data, ship it as a tool that runs in seconds.
The Midnight protocol team doesn't need to build midnight-doctor. They could. Anyone could. The information already exists — in pinned messages, in CHANGELOG.md files, in the heads of the half-dozen people on Discord who answer the same question every week. The work isn't producing the information. It's transcribing it once, into a format machines can act on, and committing to keeping it current.
That's what this ~700-line tool does. It's not clever. It's not hard. It's just nobody had done it.
If you've spent hours on a Midnight silent failure, please run npx midnight-doctor against your project before your next debugging session. If you find a bug doctor missed, open an issue — the matrix is the artifact, the code is just glue.
Repo: github.com/fredericosanntana/midnight-doctor
Install: npm install -g midnight-doctor or npx midnight-doctor
License: MIT
If this was useful, the people who actually maintain Midnight (@MidnightNtwrk) deserve the visibility — half this matrix came from their Discord answers. The other half came from getting burned. Both contributions are essential.
For reference, every check midnight-doctor currently runs:
Package scanner:
- SDK track detection (current, preprod-3x, preprod-1x) from the wallet-sdk-facade major
- Version consistency across the wallet-sdk-* subpackages
- Duplicate copies of @midnight-ntwrk/ledger-v7 in node_modules
- Duplicate copies of @midnight-ntwrk/compact-runtime in node_modules
- The facade 2.x standalone init bug

Docker scanner:
- Image tags for midnight-node, indexer-standalone, proof-server

Config scanner:
- .npmrc with the bogus npm.midnight.network registry
- indexer.yml missing its subscription: block

Cross-cutting:
- SDK track vs midnight-node container tag mismatch

Roadmap:
- --fix mode for safe auto-corrections

Total checks today: 11. Adding more is one PR each.
2026-04-28 07:04:57
Hi, I'm Ryan, CTO at airCloset.
In my previous posts, I've introduced our internal MCP servers: an MCP server for natural-language search across all our databases, the full picture of our 17 internal MCP servers, and a custom Graph RAG that lets AI answer "Did that initiative actually work?".
This time I'm covering something a bit different: Sandbox MCP — a platform that lets non-engineer employees deploy apps they built with AI to a safe, internal-only URL with a single command.
The pitch is simple: "If Claude Code can build an app, why not publish it directly?" The hard part is making "directly" mean safely.
The arrival of Claude Code and other AI coding agents is reshaping how work happens inside our company.
"Building an app" used to be an engineer's job. You had to do requirements, design, frontend, backend, database, CI/CD, production deploy — all in one head.
Now PMs, designers, and customer-success folks are talking to Claude Code with "build me a screen that does X" and getting working mockups on the spot. Inside airCloset we're seeing more and more:
These non-engineer outputs are growing fast. People are even saying "let's just run with this in production for a bit."
That's where the wall hits.
Anyone can build something that runs locally now. Spin up python -m http.server 8000, view it on your Mac — five minutes max.
But the moment it becomes "I want my team to see this" or "I want others to actually use it," the difficulty curve goes vertical.
You could "let the AI write all of it." But the result is left to the AI. Cloudflare misconfigured and exposed to the world. Auth bypassed. A service account with production database write access slipped into the code. The more code AI writes, the higher the risk of these accidents.
When a non-engineer says "I want to try building this," we need to clearly separate what the builder is responsible for from what the platform must guarantee by default.
There's also a quieter problem.
When non-engineers build apps independently, every app ends up with its own look and feel, its own conventions, and its own place to put data.
After 10 or 20 such apps, internal tooling becomes chaos. Users wonder "wait, who built this one?" and "why does this button work differently?"
Even for internal tools, you need a baseline of consistency — both in design and in where data lives.
That's why we built Sandbox MCP.
A non-engineer just says "build this" to Claude Code, and the finished app goes live at an internal-only URL of the form:
https://sbx-{nickname}--{app-name}.example.com/
— all of this completes within a single chat session with the AI.
The builder is only responsible for functionality. Security, data isolation, domain & SSL, authentication are all handled by the Sandbox MCP platform by default.
| Resource | Details |
|---|---|
| MCP tools | 10 (publish, status, schedule, list, delete, write_file, read_file, list_files, init_repo, unschedule) |
| Supported runtimes | Python (Flask + gunicorn), Node.js, static HTML/SPA, custom Dockerfile |
| URL | sbx-{nickname}--{app-name}.example.com (covered by Universal SSL, no ACM) |
| Authentication | Cloudflare ZeroTrust Access (Google Workspace) |
| Data | Firestore named DB sandbox, namespaced per nickname × app |
| Infrastructure | Self-hosted Git Server (GCE) + Cloud Run + Cloudflare Worker + KV |
| Deploy time | Typically 2–5 minutes (git push to public URL) |
Let's walk through the internals.
Sandbox MCP supports four app shapes so it can cover almost any "I want to ship something internally" use case.
| Type | Detected by | Use cases |
|---|---|---|
| Python | .py files present | Flask + gunicorn for APIs, analysis tools with a UI |
| Node.js | package.json present | Express APIs + UI; Bun also works |
| Static HTML/SPA | only .html files (no Python/Node) | nginx-served, React/Vue dist supported |
| Custom | includes a Dockerfile | Any runtime — Go, Rust, Bun, anything |
Pick any of these and sandbox_publish deploys it with no extra config.
There's also sandbox_schedule for scheduled batch apps via Cloud Scheduler. Things like "post a risk summary to Slack at 9 AM every morning" become one-line cron setups.
sandbox_schedule(
app_name: "risk-alert",
schedule: "0 9 * * *",
path: "/api/cron",
timezone: "Asia/Tokyo"
)
Cloud Scheduler now hits the app's /api/cron every morning at 9. No need to open the scheduler UI or translate cron syntax into IaC.
Even apps built by non-engineers should feel consistent as a tool family. That's the job of the sandbox-ui-kit repo.
It lives on mcp-sandbox.example.com/git and provides:
| File | Contents |
|---|---|
| sandbox-ui.css | Design tokens + glass-morphism component styles (dark/light) |
| sandbox-ui.js | Theme switcher, modals, toasts, generic JS utilities |
| sandbox-db.js | SandboxDB client SDK (more below) |
| index.html | Storybook-style component catalog |
| README.md | Full API documentation |
The key: it's designed for AI to read and use.
The sandbox_publish tool description literally says:
When building an app, first read README.md with read_file and use the UI Kit.
When Claude Code builds a new app, it read_files this README, learns which CSS/JS to load and which component names to use, then generates code accordingly. Instead of a human walking the AI through UI guidelines, we centralized the "how to use" in one place targeted at the AI.
The result: apps built by anyone (with AI) end up with consistent buttons, modals, and forms.
"I don't want to write Docker." "I don't want to think about runtime configuration." Classic non-engineer requests.
Sandbox MCP inspects the source files and generates a Dockerfile automatically.
// apps/mcp/git-server/src/sandbox/tools.ts
if (hasPy) {
dockerfile = generatePythonDockerfile(hasRequirements);
// Auto-create requirements.txt if missing
if (!hasRequirements) {
await writeFile('requirements.txt', 'flask\ngunicorn\n');
}
} else if (hasPackageJson) {
dockerfile = generateNodeDockerfile(true);
} else if (hasHtml) {
dockerfile = generateStaticDockerfile();
}
For example, a Python app gets:
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=8080
CMD ["python", "-u", "$(ls *.py | head -1)"]
If requirements.txt is missing, flask + gunicorn get added automatically. AI can write from flask import Flask and the dependencies will resolve — no missing-package surprises.
Deployment uses gcloud run deploy --source, with Cloud Build handling the image build. App authors can write a Dockerfile, but they don't have to: with no Dockerfile you get the standard one, and with your own you can customize. Friendly to both non-engineers and engineers.
"I want to save data. I don't want to set up a database."
The SandboxDB SDK handles that. The same code uses localStorage locally and Firestore once deployed.
<script src="https://mcp-sandbox.example.com/api/db/sdk.js"></script>
<script type="module">
const db = new SandboxDB({ token: googleOAuthAccessToken });
// Save (storage location auto-detected from hostname)
const { id } = await db.collection('items').add({ name: 'test' });
// List
const items = await db.collection('items').get();
// Get / update / delete
await db.collection('items').doc(id).update({ name: 'updated' });
await db.collection('items').doc(id).delete();
</script>
The SDK internals:
this._isLocal = location.hostname === 'localhost'
|| location.hostname === '127.0.0.1';
async add(data) {
if (this._db._isLocal) return this._localAdd(data); // localStorage
return this._req('', 'POST', data); // Firestore REST API
}
When running on localhost, it uses localStorage. The moment it's deployed under sbx-*.example.com, it switches to Firestore. No code changes required.
This dramatically improves the experience of building apps with AI: you can build and verify everything locally against localStorage, then deploy the same code and have it write to Firestore without touching the data layer.
Once deployed, data paths are strictly isolated:
sandbox_data/{nickname}--{app}/{collection}/{docId}
- nickname: user identifier resolved via OAuth
- app: Sandbox app name
- _createdAt / _updatedAt: auto-attached by the SDK

Data from different apps is physically unreachable from each other. Even apps built by the same person live in different paths.
The most important point: we use a dedicated sandbox named database. It's a completely separate Firestore database from the (default) DB used by other internal systems. No matter how badly an app's code misbehaves, it can never touch data outside Sandbox.
Now for the infrastructure highlights.
The public URL takes the form:
https://sbx-{nickname}--{app-name}.example.com/
nickname is automatically pulled from the MCP OAuth session. When a user logs into Sandbox MCP via Google, the email is looked up in a Firestore users collection to resolve the nickname. Users never have to repeat "I am ryan" each time.
ryan@air-closet.com → users[ryan@air-closet.com].nickname → "ryan"
↓
sbx-ryan--todo-app.example.com
Note: The
userscollection is kept in sync from a separate internal pipeline (a daily batch that pulls from our HR system and Google Workspace directory). Sandbox MCP just reads from it — no need to maintain its own employee master.
The benefit: you can tell whose app it is just by reading the URL. When someone says "go look at ryan's todo-app," reading the URL aloud naturally communicates ownership.
Normally, publishing a new subdomain requires DNS records, SSL certificate issuance, and an infrastructure deploy.
Sandbox MCP skips all of this with a Cloudflare Edge Router Worker.
DNS is fixed as *.example.com wildcard + Cloudflare proxy, with Universal SSL automatically covering every subdomain. The Cloudflare Worker receives all *.example.com/* traffic and routes by subdomain.
The logic is three-tier:
// apps/worker/edge-router/src/index.ts
export async function handleRequest(request, env) {
const url = new URL(request.url);
// ① sbx-* prefix → Sandbox routing
const sandboxSub = extractSandboxSubdomain(url.hostname);
if (sandboxSub !== null) {
return handleSandboxRequest(request, url, sandboxSub, env);
}
// ② KV route:{subdomain} registered → Cloud Run proxy
const subdomain = extractSubdomain(url.hostname);
if (subdomain) {
const proxyResponse = await handleCloudRunProxy(request, url, subdomain, env);
if (proxyResponse) return proxyResponse;
}
// ③ Otherwise → fetch(request) passthrough
return fetch(request);
}
When sandbox_publish finishes, all it does is write a route:{nickname}/{app} key into Cloudflare KV. That single write makes the new subdomain routable instantly.
await kvPut(`route:${nickname}/${appName}`, serviceUrl);
No DNS setup. No waiting for SSL issuance. No IaC deploy. Everything completes within the MCP tool execution.
This setup actually started out without git at all.
Since the primary users were going to be PMs and CS folks, we figured "git concepts are too high a bar — let's keep everything inside MCP tools." Write files via sandbox_write_file, deploy via sandbox_publish. That should be enough, we thought.
The approach hit two walls quickly.
Wall 1: Constant chunking
MCP tool calls travel over HTTP, with a payload size limit. React/Vue build bundles, SPAs with images, business tools with dozens of files — they don't fit in a single call. We added an append mode to sandbox_write_file for chunking, but every "first half of file A → second half of file A → first half of file B → ..." sequence triggered error recovery and retries. Deployments became flaky.
Wall 2: Massive token consumption
This was the real killer. When you tell the AI "deploy this app," it sends the entire source as MCP tool arguments. The file contents land in the conversation context, and a few-thousand-line app burns through tokens fast. A single deploy easily consumed tens of thousands of tokens, and Claude Code sessions hit compaction quickly.
Worse, the AI tends to "verify after sending" — re-reading the same file via sandbox_read_file. Write → read → write loops, with tokens going up in flames.
So we pivoted to using git push as well. With git push:
We never expected business-side employees to run git push by hand. But if Claude Code runs git commands in the background, it's not a barrier. The user just says "build this and publish it" — the AI runs git init && git push on its own when needed.
Once we adopted git push, the next question was: where do we host the repos? We considered using GitHub Organizations but ruled it out.
Issuing and managing GitHub accounts for every employee — including non-engineers — wasn't worth the cost or the operational overhead. Paying for a GitHub seat just to ship one app is overkill.
Fortunately, we already operated a self-hosted Git Server on GCE for a different purpose: hosting an internal "read-only Git MCP for code investigation." A VM with repositories cloned under /mnt/repos/.
We just added a Git Smart HTTP Protocol endpoint and one new repo (sandbox-apps) to it. The VM was already running, so the marginal cost was near zero. Authentication piggybacks on the existing Google OAuth setup. Repository management is just OS directory operations. Borrowing space on the existing internal Git Server was vastly simpler than spinning up new infrastructure.
# 1. Get the git URL from the MCP tool (nickname is automatic)
sandbox_init_repo(app_name: "my-app")
# → https://mcp-sandbox.example.com/git/sandbox/ryan/my-app.git
# 2. Local commit (the AI does this in the background)
cd ~/my-app/
git init && git add . && git commit -m "init"
git remote add sandbox <returned URL>
# 3. Push
git push sandbox main
# Username: oauth2accesstoken
# Password: $(gcloud auth print-access-token)
# 4. Deploy
sandbox_publish(app_name: "my-app", description: "...")
Auth uses a Google OAuth token as the Basic Auth password (same pattern as GCP Source Repos). Only @air-closet.com accounts pass. No GitHub account required — any employee can push.
The remote repo is configured with receive.denyCurrentBranch=updateInstead, so the working tree updates server-side on push. Cloud Run uses that directory as --source, so there's no extra step between push and publish.
For small apps (a few files, hundreds of lines each), sandbox_write_file still works fine. Switch between MCP-only and git push depending on app size.
That covered the "convenient to build" side. Now the "safe to publish" side.
As I noted at the start, exposing AI-generated code in front of users is risky. So Sandbox MCP layers four independent safety mechanisms that don't depend on the app's own implementation.
sbx-*.example.com sits behind Cloudflare ZeroTrust Access. When someone visits, Cloudflare's edge first redirects them to Google Workspace SSO; without an @air-closet.com account, they never reach Cloud Run.
This is independent of the app's implementation. Even if the AI didn't write a single line of auth code, Cloudflare stops the request first. "Accidentally public" is physically impossible.
Operations like sandbox_publish and sandbox_delete enforce Google OAuth on the MCP server side. Sandbox MCP implements RFC 8414 (/.well-known/oauth-authorization-server), so Claude Code runs the OAuth flow automatically on first connection.
The strongest guarantee is "you can't accidentally update or delete someone else's app."
When multiple people share a Sandbox MCP, an AI accident like "wait, I overwrote a coworker's app while updating mine" would be devastating. To prevent that, the AI doesn't get to decide whose app is being touched. The server injects nickname automatically from the OAuth session.
// Strip the `nickname` property from the MCP tool schema and have
// the server force-inject the logged-in user's nickname.
function injectNickname(tool: McpTool, userNickname?: string): McpTool {
const { nickname: _, ...restProperties } = tool.schema.inputSchema.properties;
return {
schema: { ...tool.schema, inputSchema: { ...tool.schema.inputSchema, properties: restProperties } },
execute: (args, ctx) => tool.execute({ ...args, nickname: userNickname }, ctx),
};
}
From the AI's perspective, the nickname input doesn't exist. Even with a prompt injection like "delete ryan's app," there's no mechanism to do so. "You can only touch your own apps" is enforced at the API spec level.
On top of that, inputs are validated strictly against /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/, rejecting shell-injection and path-traversal patterns (.., /).
As mentioned earlier, data lives at:
sandbox_data/{nickname}--{app}/...
Per request, the SandboxDB API resolves the path server-side:
- email → users → nickname, with app taken from the Origin header
- nickname/app from the X-Sandbox-App header (required — missing returns 400)

The client cannot spoof the path.
We deliberately do not use the K-Service header (the Cloud Run-injected service name). That's a client-spoofable header, and another implementation that relied on it had a "read another app's data" vulnerability disclosed. Requiring X-Sandbox-App keeps the only valid route through an explicitly server-validated path.
The clincher: a dedicated named database for Sandbox. Instead of the (default) DB (which contains data from other systems), we use an independent Firestore database called sandbox, and the Cloud Run SA gets an IAM Condition that allows access only to the sandbox DB.
// From infra/mcp/git-server/index.ts
// IAM Condition on roles/datastore.user:
// resource.name == "projects/.../databases/sandbox" ||
// resource.name.startsWith("projects/.../databases/sandbox/")
No matter how badly the AI-written code goes wrong, it physically cannot reach data outside Sandbox.
All sandbox-* Cloud Run services run under a single shared SA (e.g. sandbox-run). The permissions on that SA are minimal.
- roles/logging.logWriter (write its own logs)
- roles/bigquery.jobUser + bigquery.dataViewer scoped to the sandbox_logs dataset only (its own access logs, nothing else)
- roles/datastore.user (IAM Condition limiting it to the sandbox DB)

What it does not have:
- Any access to the (default) Firestore that holds data from other systems

In other words, even if a Sandbox app goes completely rogue, the blast radius is limited to sandbox_data and sandbox_logs. Nothing outside Sandbox is affected.
Sandbox apps eventually want to look at logs too. "How many views did this page get?" "Who hit that error?"
We forward Cloud Run request logs to BigQuery via a Logging Sink:
// From infra/mcp/git-server/index.ts
const sandboxLogSink = new gcp.logging.ProjectSink('sandbox-logs-sink', {
destination: `bigquery.googleapis.com/projects/${projectId}/datasets/sandbox_logs`,
filter: [
'resource.type="cloud_run_revision"',
'resource.labels.service_name:"sandbox-"',
'logName:"run.googleapis.com%2Frequests"',
].join(' AND '),
bigqueryOptions: { usePartitionedTables: true },
});
The sandbox_logs dataset is locked down with project-owner-only ACLs (it contains PII like remoteIp and User-Agent), and the Sandbox SA gets a tightly scoped bigquery.dataViewer to it.
This lets apps query their own access logs from BigQuery. "Post last week's user count for this app to Slack" can be done entirely inside Sandbox.
Let me close with a note on tool definitions. I personally think this is where MCP design really makes or breaks.
Sandbox MCP exposes 10 tools:
| Tool | Purpose |
|---|---|
| sandbox_publish | Start deploy (async) |
| sandbox_deploy_status | Check deploy status |
| sandbox_init_repo | Initialize git push repo |
| sandbox_write_file | Write file (overwrite/append) |
| sandbox_list | List apps |
| sandbox_delete | Delete app |
| sandbox_schedule | Configure Cloud Scheduler |
| sandbox_unschedule | Remove Cloud Scheduler |
| sandbox_read_file | Read source code |
| sandbox_list_files | List files |
Whether the AI picks the right tool at the right moment is almost entirely determined by what's written in the tool description.
For example, the description for sandbox_publish covers not just functionality but also:
- when to use sandbox_write_file vs git push
- the instruction to read the UI Kit README first (via read_file)

With this in place, the AI can autonomously:
1. read the sandbox_publish description and see "first read the UI Kit README"
2. run read_file on sandbox-ui-kit/README.md
3. build the app with the UI Kit and call sandbox_publish
— without asking the user a single follow-up question. Writing not just "what it does" but "what to do with it" into the tool definition is the secret to AI-friendly design.
If you write tool definitions tersely, the AI keeps coming back asking "what should I do next?" The description is less of a human-facing doc and more of an AI-facing runbook. That framing helps a lot.
Sandbox MCP exists to answer two challenges of building internal tools in the AI era: anyone can now build an app locally, but publishing it safely is still hard, and independently built apps drift apart in design and in where their data lives.
To close that gap, we made security, data isolation, domains, SSL, and authentication platform defaults, and centralized the UI and data conventions in a kit written for the AI to read.
Building this, what struck me again is that the role of platforms in an AI-powered development era is shifting. Platforms used to optimize for "easy for humans." Now they also need to optimize for "used correctly by AI." Tool descriptions are AI-facing docs, and safety must be designed assuming AI will write incorrect code.
At the same time, by limiting what the builder is responsible for, we drastically lower the barrier to "let me just try something." That's the entry point that turns a non-engineer's "I want to build this" into actual operational improvements.
I hope this is useful for anyone designing internal platforms.
At airCloset, we're looking for engineers who want to build a new development experience together with AI. If you're interested, please check out our careers page at airCloset Quest.
2026-04-28 07:04:57
The rules for product schema used to be pretty simple. Add Product, Offer, and maybe AggregateRating, and you were done. Google was happy. Your rich snippets looked nice. You moved on.
Those days are over. AI shopping agents are reading structured data differently than Google's crawler does, and they need a much fuller set of fields to actually surface your products confidently. Here's the complete 2026 list for any ecommerce store that wants to stay visible.
Product. Must include: name, description (minimum 150 chars), image (at least one, ideally multiple), brand (as a nested Brand object with name), sku, gtin, mpn (if applicable), category, weight, color, material, size. The old "just name + image" version won't cut it anymore.
Offer. Must include: price, priceCurrency, availability (InStock/OutOfStock/PreOrder/etc), priceValidUntil, itemCondition (NewCondition/UsedCondition), url (to the product page), hasMerchantReturnPolicy, shippingDetails.
AggregateRating if you have reviews. ratingValue, reviewCount, bestRating, worstRating. If you don't have reviews yet, consider adding them with an honest count (even if small).
Review array. Include your latest 3-5 reviews as nested Review objects with author, datePublished, reviewBody, and reviewRating. Agents use these for social proof.
BreadcrumbList. Category path from homepage to product. This is easy but a surprising number of stores don't have it.
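Put together, the baseline looks roughly like the block below. Every value is illustrative and some fields (weight, mpn, the Review array, BreadcrumbList) are trimmed for length; treat it as a template, not a copy-paste answer:
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Trail Running Shoe",
  "description": "Lightweight trail running shoe with a grippy outsole, breathable mesh upper, and a cushioned midsole built for long runs on mixed terrain and daily training.",
  "image": [
    "https://example.com/images/shoe-front.jpg",
    "https://example.com/images/shoe-side.jpg"
  ],
  "brand": { "@type": "Brand", "name": "Example Brand" },
  "sku": "SHOE-123",
  "gtin": "00012345678905",
  "category": "Footwear > Running",
  "color": "Blue",
  "material": "Mesh",
  "mainEntityOfPage": "https://example.com/products/trail-running-shoe",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "18",
    "bestRating": "5",
    "worstRating": "1"
  },
  "offers": {
    "@type": "Offer",
    "url": "https://example.com/products/trail-running-shoe",
    "price": "89.00",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock",
    "itemCondition": "https://schema.org/NewCondition",
    "priceValidUntil": "2026-06-30"
  }
}
The schema.org definitions and Google's merchant documentation remain the source of truth for exact property names; the point here is the breadth of fields now expected.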
Here's the thing most SEO guides haven't caught up on yet. These are the fields that are becoming hard differentiators in AI agent retrieval:
hasMerchantReturnPolicy. This is the biggest one. ChatGPT's shopping answers now strongly prefer products from stores that explicitly declare return policies in schema. 94% of stores I've scanned are missing this. If you add it, you get a meaningful visibility boost.
shippingDetails. Include shippingRate, shippingDestination, shippingLabel, and deliveryTime. Agents read this to answer "can I get this by Thursday" type queries. Missing = you're out of the consideration set.
additionalProperty array. This is a generic key-value bag for attributes that don't fit the standard schema. Material composition, certifications, specifications, size charts. Agents love this because they can use it to filter and compare.
itemCondition. Sounds obvious but a lot of stores omit it. Especially important for used/refurbished/open-box items where agents need to know before recommending.
offers.priceValidUntil. If you run sales, this tells agents when the price expires. Without it, agents can't confidently recommend your sale price because they don't know if it's still valid.
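A hedged example of how those differentiators sit inside the Offer, with additionalProperty on the Product itself. Values are illustrative and worth checking against the current schema.org definitions for your catalog; this is a fragment of the Product block above, not a standalone document:
"offers": {
  "@type": "Offer",
  "price": "89.00",
  "priceCurrency": "USD",
  "availability": "https://schema.org/InStock",
  "itemCondition": "https://schema.org/NewCondition",
  "priceValidUntil": "2026-06-30",
  "hasMerchantReturnPolicy": {
    "@type": "MerchantReturnPolicy",
    "returnPolicyCategory": "https://schema.org/MerchantReturnFiniteReturnWindow",
    "merchantReturnDays": 30,
    "returnMethod": "https://schema.org/ReturnByMail",
    "returnFees": "https://schema.org/FreeReturn"
  },
  "shippingDetails": {
    "@type": "OfferShippingDetails",
    "shippingLabel": "Standard",
    "shippingRate": { "@type": "MonetaryAmount", "value": "4.95", "currency": "USD" },
    "shippingDestination": { "@type": "DefinedRegion", "addressCountry": "US" },
    "deliveryTime": {
      "@type": "ShippingDeliveryTime",
      "handlingTime": { "@type": "QuantitativeValue", "minValue": 0, "maxValue": 1, "unitCode": "DAY" },
      "transitTime": { "@type": "QuantitativeValue", "minValue": 2, "maxValue": 4, "unitCode": "DAY" }
    }
  }
},
"additionalProperty": [
  { "@type": "PropertyValue", "name": "Material composition", "value": "60% recycled mesh, 40% synthetic" },
  { "@type": "PropertyValue", "name": "Certification", "value": "Vegan" }
]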
A few things I see over and over.
Inconsistency between schema and on-page content. Your schema says the product is $29 but the page shows $35. Google's crawler forgives this, AI agents don't. They downweight you significantly for inconsistency.
Wrong availability values. Use the exact ISO values: https://schema.org/InStock, https://schema.org/OutOfStock, etc. Not "available" or "yes" or your own strings.
Shallow brand object. "brand": "Nike" is not valid. It should be {"@type": "Brand", "name": "Nike"}. Most themes get this wrong out of the box.
Missing mainEntityOfPage. This is a subtle one. It tells crawlers which URL is the canonical location for the schema. Helps with deduplication when you have multiple URLs for the same product.
Nested schema that isn't actually nested. If you have a product with offers, the offers should be inside the product schema, not in a separate JSON-LD block. Agents parse the whole document and sometimes miss stuff that's structured weirdly.
Google's Rich Results Test is still the gold standard for validation. https://search.google.com/test/rich-results. Run it on 5-10 of your highest-traffic product pages and fix any errors.
For more advanced validation, the Schema.org validator at https://validator.schema.org/ is more strict and catches things Google misses. Worth running periodically.
For AI-specific validation, there's no good public tool yet. The best approach is to query ChatGPT directly with a product search and see if your store shows up. If not, you probably have gaps somewhere in this list.
For Shopify stores, the fastest path to a compliant product schema is to extend the JSON-LD your theme already outputs on product pages so it includes the fields above.
For custom stores (headless Shopify, Next.js, whatever), you'd do the same thing but in your product component. Next.js in particular makes this easy with a script tag outputting JSON-LD.

The Takeaway

Product schema in 2026 is richer and stricter than it was in 2022. The baseline for "visible to AI agents" has moved up significantly and most stores haven't caught up. If you invest a few hours updating your schema now, you'll be ahead of 80% of your competitors for a year or two. After that, it'll be table stakes and you'll wish you'd done it sooner.

Run the Rich Results Test on your store this week. You'll probably find gaps. Fix them. It's one of the highest-ROI things you can do for your store's visibility in both Google and AI channels right now.
2026-04-28 07:02:27
One Spec to rule them all, one Spec to find them, one Spec to bring them all, and in the darkness bind them.
This is the first post in a series about spec-driven development. Not which tool to use or how to get started, but what I learned after living with a spec long enough to hit the problems that nobody writes about yet. I do not have all the answers and I am not trying to be a guru. I am sharing what worked, what did not, and what I am still figuring out. If you have been down this road too, I would love to hear your experience in the comments.
Spec-driven development is having a moment. Microsoft shipped a spec-kit and wrote about it on their developer blog. JetBrains published a dedicated series on using a spec-driven approach with AI coding tools. Tools like Kiro and CodeSpeak are building entire development models around the idea that specs, not code, are the primary artefact. Martin Fowler's blog has a detailed breakdown comparing SDD tools. The term is everywhere.
Most of this content is useful. It explains what SDD is, compares approaches, and helps teams get started. But almost all of it is written from the outside looking in, by people who adopted the practice recently or are building tools around it. Very little comes from engineers who have been doing it long enough to hit the problems that only show up later.
I have been writing and maintaining a spec across nine SDKs for three years. I started before SDD had a name, before LLMs made it a topic of conversation, and before any of the current tooling existed. I have a lot of thoughts about what makes a spec genuinely useful over time and what makes it quietly fall apart. This series is my attempt to think through that in public, and hopefully start a conversation with people who are navigating the same problems.
A lot of the current SDD conversation frames the spec as a disposable implementation plan for an LLM. You write it, the agent consumes it, code comes out, job done. Some tools are explicitly built around this model. The spec is an intermediate artifact, a way of communicating intent to an AI before it disappears into the generated code.
That framing is not wrong for certain use cases. But it is a narrow way to think about something with much broader value.
The definition I find more useful, and the one this series is built around, is this:
A spec is a contract between implementations.
It does not describe code. It defines behavior: what a feature should do, how it should respond to edge cases, what a developer can rely on regardless of which language or platform they are using. The moment you have more than one implementation of the same thing, you need something that sits above all of them and answers the question: what does correct actually mean here?
Tests do not answer this. Tests verify that your code behaves the way you wrote it. They say nothing about whether the behavior you wrote was the right one. A test can pass in nine SDKs while each of them does something subtly different, and nothing in your CI pipeline will flag it.
Code reviews do not answer it either. A reviewer working inside a single codebase has no way to know whether this implementation matches what the mobile client does, or what the desktop client does, or what a developer reading your docs will reasonably expect.
The spec is the only artifact that exists at the level of behavior rather than implementation and it can be the referee when two implementations disagree.
Writing a spec is the easy part. Keeping it honest with reality over months and years is where most teams quietly struggle. Before closing this first post I want to share some practical foundations that have helped us keep our spec useful over time.
A spec that lives in a wiki or a shared document will drift. It needs to be versioned alongside the code it describes, treated with the same discipline as a codebase. If a spec change does not go through a pull request, it will quietly stop reflecting reality. Nobody intends this to happen. It happens anyway, gradually, and by the time you notice, the spec has become historical fiction.
This is the single most practical thing I can pass on. Every spec entry describing a distinct behavior should have a unique identifier. At Ably we use abbreviations of feature areas combined with alternating numbers and letters for nesting levels. So RTP is Realtime Presence, RSP is REST Presence, and a deeply nested entry might look like RTP2a5c6b7. These IDs let you reference spec entries directly from tests, from code comments, from pull requests, from conversations. Instead of describing a behavior in prose every time, you point to the ID. Anyone reading the code can trace it back to the contract it implements. This traceability is what separates a spec that is actually used from one that exists only to be consulted.
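As a hypothetical illustration (the ID, class, and test below are made up, not Ably's actual entries), the ID travels with everything that implements or verifies the entry:
# presence.py (illustrative)
class Presence:
    def __init__(self) -> None:
        self.members: dict[str, str] = {}

    def enter(self, client_id: str) -> None:
        # RTP2a5: entering presence adds the member to the local presence map
        self.members[client_id] = "present"

# test_presence.py (illustrative)
def test_rtp2a5_enter_adds_member():
    presence = Presence()
    presence.enter("client-1")
    assert "client-1" in presence.members
A reviewer who sees RTP2a5 in a diff can open the spec entry directly instead of reconstructing intent from the code.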
Duplication in a spec is as dangerous as duplication in code. When the same behaviour is described in two places, they will eventually diverge, and you will have two sources of truth instead of one. The solution is the same as in code: do not repeat yourself. When one spec entry depends on or extends another, reference it by ID rather than restating the logic. This keeps each behaviour defined exactly once, makes the spec easier to maintain, and means that when something changes you update it in one place and the rest of the spec stays coherent.
If you change what a behaviour does and keep the same ID, you silently break the traceability chain. Tests referencing the old ID now verify the wrong thing, code comments become misleading, and the history becomes unreadable. The discipline of generating a new ID when behaviour changes forces an explicit acknowledgement: this is not a correction, it is a new contract. Old entries get replaced rather than deleted, leaving a paper trail. For example, "This entry has been superseded by RSC25 as of specification version 4.0.0".
Ambiguous language in a spec is a slow poison. Words like "should", "must" and "may" mean different things to different people, and when an LLM or a new engineer reads your spec, those differences matter. RFC 2119 solves this cleanly: MUST means mandatory, SHOULD means recommended, MAY means optional. Adopting this convention costs nothing and eliminates an entire category of misinterpretation. When someone asks "is this behaviour required or just a suggestion", the spec answers the question without needing a conversation.
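To make that concrete, a hypothetical entry that combines the ID convention with RFC 2119 keywords might read:
RTP2a5: When a client enters presence, the library MUST add the member to the local presence map before emitting the enter event. If the connection is suspended, the library SHOULD re-enter automatically on resume, following the retry behaviour defined in RTP17b. The library MAY expose a synchronous snapshot of the map.
One sentence, three different levels of obligation, and a cross-reference by ID instead of a restated rule.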
If you want to explore the current state of spec-driven development, the tools and posts mentioned earlier (Microsoft's spec-kit, the JetBrains series, Kiro, CodeSpeak, and Martin Fowler's SDD comparison) are good starting points.
Follow or subscribe so you do not miss Part 2. And if any of this resonates with your own experience, or if you think I am getting something wrong, I would genuinely love to hear it in the comments.
2026-04-28 07:00:51
The AI landscape is experiencing unprecedented growth and transformation. This post delves into the key developments shaping the future of artificial intelligence, from massive industry investments to critical safety considerations and integration into core development processes.
Key areas explored: massive industry investments, critical safety considerations, and the integration of AI into core development processes.
This deep dive aims to provide developers, tech leaders, and enthusiasts with a comprehensive overview of the current state and future trajectory of AI.