2026-03-10 22:16:11
The most important interface shift in a decade is happening right now, and most teams are sleepwalking through it.
AI agents are the fastest-growing consumer of developer tooling. They don’t click buttons. They don’t read man pages. They invoke commands, parse output, and move on. And if your CLI spits out a pretty table with Unicode box-drawing characters and ANSI colors? Congratulations — you’ve built something an agent has to hallucinate its way through.
The post that crystallized the moment landed on Hacker News recently: Justin Poehnelt’s “You Need to Rewrite Your CLI for AI Agents,” written from the experience of building Google’s new Workspace CLI — agents-first from day one. It hit the front page and the comments exploded. Everyone felt the pain it describes.
The thesis is simple: the primary consumer of your CLI is no longer a human. Act accordingly or get wrapped, forked, or replaced by someone who does.
Here’s what “human-first” CLI design looks like:
my-cli spreadsheet create \
  --title "Q1 Budget" \
  --locale "en_US" \
  --sheet-title "January" \
  --frozen-rows 1 \
  --frozen-cols 2

Ten flags. Flat namespace. Can't express nesting without inventing bespoke flag hierarchies. A human can tab-complete their way through it. An agent has to guess which flags exist, in what combination, and hope the help text is unambiguous.
Now the agent-first version:
gws sheets spreadsheets create --json '{
  "properties": {"title": "Q1 Budget", "locale": "en_US"},
  "sheets": [{"properties": {"title": "January",
    "gridProperties": {"frozenRowCount": 1, "frozenColumnCount": 2}}}]
}'

One flag. The full API payload. An LLM generates this trivially because it maps directly to the schema. Zero translation loss.
This isn’t about abandoning human ergonomics. It’s about making the raw-payload path a first-class citizen alongside your convenience flags. The practical minimum: --output json, an OUTPUT_FORMAT=json env var, or — better yet — NDJSON by default when stdout isn’t a TTY.
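The TTY check is a few lines in any language. A minimal Python sketch of the idea (the `emit` helper and the sample records are hypothetical, not from any real tool):

```python
import json
import sys

def emit(records):
    """Readable lines for humans at a TTY, NDJSON for pipes: decided automatically."""
    if sys.stdout.isatty():
        # A human is watching: print readable key=value lines (stand-in for a table).
        for r in records:
            print("  ".join(f"{k}={v}" for k, v in r.items()))
    else:
        # Piped into an agent or another tool: one JSON object per line.
        for r in records:
            print(json.dumps(r))

emit([{"id": "f1", "name": "Q1 Budget"}, {"id": "f2", "name": "Roadmap"}])
```

Run it interactively and you get the human view; pipe it through `| cat` and you get clean NDJSON, no flag required.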
Agents can’t google your documentation without blowing their token budget. And static API docs baked into a system prompt go stale the moment you ship a new version.
The Google Workspace CLI solved this with runtime schema introspection:
gws schema drive.files.list
gws schema sheets.spreadsheets.create

Each call dumps the full method signature — params, request body, response types, required OAuth scopes — as machine-readable JSON. The agent self-serves. No pre-stuffed documentation. No 50-page system prompt.
This is the pattern that matters: make the CLI itself the documentation, queryable at runtime. Your tool should be able to answer “what do you accept?” and “what will you return?” without the agent ever leaving the terminal.
The gh CLI already does a version of this. docker does it. The tools that don’t are the ones getting wrapped by shim layers — and every shim is a maintenance liability waiting to happen.
Here’s a number that should scare you: a single Gmail API response can consume a meaningful chunk of an agent’s context window. Humans scroll past irrelevant fields. Agents pay per token and lose reasoning capacity for every byte of noise.
Two mechanisms matter:
Field masks limit what the API returns. gws drive files list --params '{"fields": "files(id,name,mimeType)"}' — only get what you need.
NDJSON pagination emits one JSON object per line, stream-processable without buffering an entire response into memory. The agent processes page by page instead of choking on a 200KB blob.
This is context window discipline, and it’s non-negotiable. If your CLI dumps everything and expects the consumer to filter, you’re burning tokens that could be spent on reasoning.
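The pagination pattern is easy to sketch in Python; `fetch_page` below is a stub standing in for a real paged API:

```python
import json

def fetch_page(page_token):
    # Hypothetical paged-API stub: returns (page, next_token) for two pages.
    pages = {
        None: ({"files": [{"id": "f1"}, {"id": "f2"}]}, "p2"),
        "p2": ({"files": [{"id": "f3"}]}, None),
    }
    return pages[page_token]

def stream_files():
    """Yield one JSON line per record, page by page, never buffering the full set."""
    token = None
    while True:
        page, token = fetch_page(token)
        for f in page["files"]:
            yield json.dumps(f)
        if token is None:
            break

for line in stream_files():
    print(line)
```

The consumer can stop reading after any line; nothing downstream ever holds the whole result set in memory.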
“But what about MCP?” Fair question. Anthropic’s Model Context Protocol was supposed to be the universal connector — a clean, structured protocol for agents to talk to any tool. And it works. But there’s a cost nobody talks about.
Jannik Reinhard ran the numbers in a real-world comparison. A compliance-checking task against Microsoft Graph:
MCP approach: ~145,000 tokens (28K just for schema injection before asking a single question)
CLI approach: ~4,150 tokens
That’s a 35x reduction. A typical MCP server ships dozens or hundreds of tool definitions, all of which get dumped into the agent’s context whether it needs them or not. Stack a few MCP servers for a real enterprise workflow — GitHub, a database, Microsoft Graph, Jira — and you’re burning 150K+ tokens on plumbing alone.
MCP isn’t wrong. But it’s an abstraction layer, and abstraction layers have tax. For many workflows, a well-designed CLI with --json output and schema introspection is faster, cheaper, and more reliable than routing through a protocol server. The CLI is the tool call.
If you maintain a CLI and you’re not thinking about agent consumers, here’s the minimum viable checklist. It’s not long:
--json flag everywhere. Structured output to stdout, human messages to stderr.
Meaningful exit codes. Not just 0/1. Agents need to branch on failure modes.
Idempotent operations. Agents retry. Your tool should handle that gracefully.
Schema introspection. mytool schema <command> should return what the command accepts and returns.
NDJSON pagination. Stream large result sets. Don’t buffer.
Noun-verb command structure. mytool resource action — it turns discovery into a tree search instead of a guessing game.
TTY detection. Pretty output for humans, JSON for pipes. Automatically.
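A couple of these items combined in one minimal Python sketch (the exit-code values and the `run` helper are hypothetical, not from any real tool):

```python
import json
import sys

# Hypothetical exit codes an agent can branch on: richer than plain 0/1.
EXIT_OK, EXIT_NOT_FOUND, EXIT_RATE_LIMITED = 0, 3, 5

def run(resource_id):
    if resource_id == "missing":
        # Human-readable diagnostics go to stderr...
        print(f"error: resource {resource_id!r} not found", file=sys.stderr)
        return EXIT_NOT_FOUND
    # ...machine-readable output goes to stdout.
    print(json.dumps({"id": resource_id, "status": "ok"}))
    return EXIT_OK

# In a real CLI you'd wire this to the process exit status:
#   sys.exit(run(sys.argv[1]))
```

An agent can now branch on the exit code, parse stdout as JSON, and surface stderr to a human only when something goes wrong.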
None of this is exotic. Most of it is just good Unix hygiene that we’ve been lazy about for years. The difference is that now there’s a consumer — a very fast-growing, very demanding consumer — that will route around your tool if you don’t provide it.
RTK, a Show HN from last month, wraps existing CLI commands to strip human-oriented formatting before it hits an agent's context. It saved 60-90% of tokens. That tool exists because your CLI doesn't output clean data by default.
Google just shipped a Workspace CLI built agents-first. CLIWatch is building benchmarks that score tools on agent-readiness — pass rates, token efficiency, turn counts — with badges for your README.
The migration is happening. The question isn’t whether your CLI needs an agent-friendly interface. It’s whether you build it yourself or someone else builds a wrapper that makes you a dependency they’d rather not have.
Your CLI’s next power user doesn’t read your README. It reads your --help output, introspects your schema, and parses your JSON. Design for that user, or watch them move on to someone who did.
Originally published on The Undercurrent
2026-03-10 22:15:42
If you've ever tried collecting data from a modern website and ended up with empty HTML containers instead of real content, you're not alone.
Many developers run into this issue when working with websites built using frameworks like React, Vue, or Angular. Instead of delivering fully rendered HTML, these sites load content dynamically using JavaScript after the page loads.
So when you use a basic HTTP request to fetch the page, the data you're looking for often isn't there yet.
This is where Selenium becomes extremely useful.
Selenium allows you to automate a real browser session. That means the page loads exactly as it would for a human visitor, JavaScript included. Once everything renders, you can access the fully populated page and extract the information you need.
Let’s walk through how this works.
When you fetch a page using a library like requests in Python, you receive the initial HTML response from the server.
However, many modern websites work differently:

1. The server returns a minimal HTML shell with empty containers.
2. The browser downloads and executes JavaScript.
3. The JavaScript fetches the real data and renders it into the page.

Your script only sees step one.
This is why you might open a page in your browser and see dozens of products or listings, but your script only finds empty <div> elements.
Selenium solves this problem by actually running the browser and executing the JavaScript before extracting data.
First, install Selenium using pip:
pip install selenium
Next, download the appropriate browser driver.
Common options include:

- ChromeDriver (for Google Chrome)
- GeckoDriver (for Firefox)
- msedgedriver (for Microsoft Edge)
Make sure the driver version matches your installed browser version.
Here’s a minimal Selenium script using Python:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://example.com")
print(driver.title)
driver.quit()
This script:

- Launches a Chrome browser session
- Navigates to https://example.com
- Prints the page title
- Closes the browser
By the time Selenium retrieves the page content, the browser has already executed any JavaScript needed to render the page.
Once the page loads, you can locate elements using Selenium selectors.
Example:
from selenium.webdriver.common.by import By
products = driver.find_elements(By.CSS_SELECTOR, ".product-card")
for product in products:
    print(product.text)
Selenium supports several ways to locate elements:

- By.CSS_SELECTOR
- By.XPATH
- By.ID
- By.CLASS_NAME
- By.TAG_NAME
Most developers prefer CSS selectors because they are easier to maintain and usually more readable.
Dynamic pages often load content asynchronously, so the elements you're looking for might not appear immediately.
Instead of using fixed delays with time.sleep(), Selenium provides explicit waits.
Example:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "product-card"))
)
This tells Selenium to wait until the elements appear before continuing.
Explicit waits make automation scripts significantly more reliable.
Many websites load additional content when the user scrolls down the page.
You can simulate this behavior with Selenium by executing JavaScript.
Example:
driver.execute_script(
    "window.scrollTo(0, document.body.scrollHeight);"
)
If you're collecting multiple batches of content, you can repeat this action in a loop:
import time
for _ in range(5):
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
    )
    time.sleep(2)
Each scroll triggers the website to load more entries.
When running automation on servers or cloud environments, you typically don't want a visible browser window.
Selenium supports headless mode, which runs the browser without a graphical interface.
Example:
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
Headless mode reduces resource usage and makes automation easier to deploy in backend systems.
When collecting large amounts of data, repeatedly accessing a website from the same IP address can trigger rate limits or temporary blocks.
To avoid this, many developers add proxy infrastructure to their automation stack. Developers often integrate providers of high-quality residential proxies like Squid Proxies when running workflows that require stable IP rotation and consistent connections.
Using proxies alongside Selenium can significantly improve reliability when running larger automation tasks.
For static websites, lightweight HTTP libraries are usually faster. But for modern dynamic applications, Selenium is often the simplest and most reliable solution.
Dynamic websites are now the standard across much of the web. Because so many platforms rely on JavaScript to render content, traditional request-based methods often fail to retrieve the data you need.
Selenium solves this problem by automating a real browser environment, allowing developers to render JavaScript-heavy pages and interact with them just like a user would.
When combined with proxy infrastructure and thoughtful automation design, Selenium becomes a powerful tool for building reliable data collection pipelines and automation workflows.
2026-03-10 22:13:00
I've been writing in Markdown for years — documentation, blog posts, notes, project specs. I tried dozens of editors. Some were fast but ugly. Some were beautiful but Electron-heavy. Some locked my files in proprietary formats or demanded a cloud subscription for basic features.
So I built my own.
CrabPad is a desktop Markdown editor built with Tauri (Rust backend + React/TypeScript frontend). It's local-first, keyboard-driven, free, and runs on macOS, Windows, and Linux.
In this post I want to share the technical decisions, trade-offs, and lessons learned from building it.
The decision was easy. Tauri gives you:

- A native system webview instead of a bundled Chromium, so binaries stay small
- A Rust backend for anything performance-critical
- Far lower memory usage and faster startup than an Electron equivalent
The trade-off? You're at the mercy of each platform's webview quirks. Safari-based WebKit on macOS behaves differently than WebView2 on Windows. CSS rendering, font smoothing, keyboard events — all slightly different. But for a text editor, the savings in memory and startup time are absolutely worth it.
┌──────────────────────────────┐
│ React + TypeScript │ ← UI, editor, preview, tabs
│ TailwindCSS │
├──────────────────────────────┤
│ Tauri IPC (invoke) │ ← commands bridge
├──────────────────────────────┤
│ Rust backend │ ← Markdown parsing, file I/O,
│ pulldown-cmark │ search, settings persistence
└──────────────────────────────┘
The frontend handles all UI state — open files, tabs, preview, keybindings. The backend handles everything that touches the filesystem or needs performance: Markdown parsing, file read/write, workspace search, and user settings persistence.
Communication happens through Tauri's invoke() — essentially async IPC calls that feel like calling a local function.
I started with pulldown-cmark on the Rust side for the core GFM parsing (tables, footnotes, strikethrough, task lists). But modern Markdown needs way more than that.
I ended up building a preprocessing pipeline in Rust that runs before pulldown-cmark:
- Emoji shortcodes (:-) → 😊) via the gh-emoji crate
- Superscript and subscript (^sup^, ~sub~) via regex
- Inserted and highlighted text (++inserted++, ==highlighted==)
- Abbreviations (*[HTML]: HyperText Markup Language)
- Inline footnotes (^[This is an inline footnote])
- Definition lists (: definition)
- GitHub-style alerts (> [!NOTE], > [!WARNING])
- Custom containers (::: warning ... :::)

Then pulldown-cmark parses the result. On the frontend, KaTeX handles math and Mermaid renders diagrams from fenced code blocks.
Lesson learned: Don't try to extend a Markdown parser at the AST level if all you need is text preprocessing. Regex-based passes before parsing are pragmatic and fast enough for real-time preview.
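The idea in miniature, ported to Python for illustration (CrabPad's actual passes are Rust regexes; these two rules are representative, not the full pipeline):

```python
import re

def preprocess(markdown: str) -> str:
    """Text-level passes that run before the real Markdown parser."""
    # ==highlighted== becomes <mark>...</mark>
    markdown = re.sub(r"==(.+?)==", r"<mark>\1</mark>", markdown)
    # ++inserted++ becomes <ins>...</ins>
    markdown = re.sub(r"\+\+(.+?)\+\+", r"<ins>\1</ins>", markdown)
    return markdown

print(preprocess("This is ==important== and ++new++."))
```

Each pass is a plain string transform, so the downstream parser never needs to know these extensions exist.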
Every feature in CrabPad is accessible without a mouse. The core shortcuts:
| Action | Shortcut |
|---|---|
| Command Palette | CMD+Shift+P |
| Toggle Preview | CMD+P |
| Toggle Outline | CMD+Shift+E |
| Zen Mode | CMD+Alt+Z |
| Global Search | CMD+Shift+F |
| Settings | CMD+, |
Building the Command Palette was straightforward — a fuzzy-search input over a list of registered commands. The hard part was keyboard event handling on macOS.
e.key is unreliable with modifier keys. Press CMD+Alt+Z on macOS and e.key becomes Ω, not Z. The fix: fall back to e.code (physical key position) when e.key doesn't match:
function checkShortcut(e: KeyboardEvent, shortcut: string): boolean {
  const parts = shortcut.toLowerCase().split('+');
  const key = parts[parts.length - 1];
  const keyMatch = e.key.toLowerCase() === key
    || e.code.toLowerCase() === `key${key}`;
  return keyMatch
    && parts.includes('cmd') === (e.metaKey || e.ctrlKey)
    && parts.includes('shift') === e.shiftKey
    && parts.includes('alt') === e.altKey;
}
Another gotcha: if you have a shortcut recorder (for custom keybindings) and a global keyboard handler, they fight each other. My solution was a global flag window.__crabpadRecordingShortcut — the recorder sets it to true, the global handler checks it and bails out. Not elegant, but battle-tested.
Zen Mode hides everything — tabs, sidebar, status bar — leaving only the editor centered on screen. It sounds simple, but the devil is in the details:
Centering the editor column alone takes a max-width wrapper plus justify-center. The state persistence is handled by saving all UI preferences to a settings.json in Tauri's app data directory. This file survives app updates — unlike localStorage, which lives inside the WebView storage.
Tauri has a built-in updater plugin. You publish a signed JSON manifest (update.json) at a public URL, and the app checks it on startup.
My first implementation used native OS dialogs for the update flow. It worked, but the UX was terrible — click "Download", see a system alert saying "please wait", then... nothing. The download happened in the background with progress only visible in console.log.
I replaced it with a custom React modal that shows every stage:
- Download progress with real byte counts (3.2 MB of 12.8 MB)
- Install and relaunch (@tauri-apps/plugin-process for relaunch)

The CI/CD pipeline (GitHub Actions) builds for all 3 platforms, signs the bundles with minisign, generates update.json, deploys binaries to a VPS, and updates the Homebrew Cask — all automatically on tag push.
CrabPad has no backend, no accounts, no sync. Files are plain .md on disk.
This isn't just a privacy feature — it's an architectural simplification. No auth flows, no conflict resolution, no server costs, no GDPR headaches. The app works offline, files load instantly, and if CrabPad disappears tomorrow, your documents are still standard Markdown.
The only network request the app makes is the optional update check to crabpad.app/update.json.
Getting a desktop app to users in 2026 is harder than it should be:
- macOS: Homebrew Cask (brew install --cask LukaszOleniuk/tap/crabpad), which strips the quarantine attribute automatically.
- Linux: a .deb for Debian/Ubuntu. No Flatpak or Snap yet.

Lesson learned: Homebrew Cask is the single best distribution channel for indie macOS apps without Apple notarization. Zero friction for the user.
Website: crabpad.app
Homebrew (macOS):
brew install --cask LukaszOleniuk/tap/crabpad
Changelog: crabpad.app/changelog
I'd genuinely love feedback — what works, what doesn't, what's missing. You can reach me via the contact form or find me on LinkedIn.
Thanks for reading.
2026-03-10 22:12:42
Your LangGraph agent works great in demos. But in production, every node's output needs to be validated before the next node acts on it. Here's how to add a validation step without writing custom checking logic.
LangGraph gives you fine-grained control over your agent's execution graph — you define nodes, edges, and conditional routing. But one thing that's missing from most LangGraph tutorials is what happens when a node produces bad data. The next node just receives it and either crashes or propagates the error downstream.
I ran into this when building an order processing pipeline with LangGraph. The extraction node would occasionally produce negative amounts, invalid currencies, or missing fields. The downstream nodes — pricing, invoicing, fulfillment — would silently process the bad data. By the time someone noticed, the damage was already in the database.
The typical fix is writing validation logic inside each node. That works, but it means every node carries its own schema checks, the validation rules are scattered across your codebase, and there's no central place to see what's failing and why.
So I hooked up Rynko Flow as an external validation step in the graph. The agent extracts data, Flow validates it against a schema and business rules, and only if it passes does the pipeline continue. If it fails, the agent gets structured errors it can use to self-correct.
A LangGraph agent with three nodes:
The graph looks like this:
extract → validate → process
              ↓ (if failed)
           extract (retry with error context)
pip install langgraph langchain-openai httpx
You'll also need:

- An OpenAI API key (for the extraction model)
- A Rynko Flow account and API key
- The ID of a published gate (created in the next step)
Create a gate in the Flow dashboard with this schema:
| Field | Type | Constraints |
|---|---|---|
| vendor | string | required, min 1 char |
| amount | number | required, >= 0 |
| currency | string | required, one of: USD, EUR, GBP, INR |
| po_number | string | optional |
Add a business rule: amount >= 10 with error message "Order amount must be at least $10."
If you already have a Pydantic model, you can import the schema directly — run YourModel.model_json_schema() and paste the output into the gate's Import Schema dialog. There's a tutorial for that.
Save and publish the gate. Note the gate ID — you'll need it in the code.
First, a small wrapper around the Flow API. This is what the validate node will call:
import httpx
import os
RYNKO_BASE_URL = os.environ.get("RYNKO_BASE_URL", "https://api.rynko.dev/api")
RYNKO_API_KEY = os.environ["RYNKO_API_KEY"]
def validate_with_flow(gate_id: str, payload: dict) -> dict:
    """Submit a payload to a Flow gate and return the result."""
    response = httpx.post(
        f"{RYNKO_BASE_URL}/flow/gates/{gate_id}/runs",
        json={"payload": payload},
        headers={
            "Authorization": f"Bearer {RYNKO_API_KEY}",
            "Content-Type": "application/json",
        },
        timeout=30,
    )
    return response.json()
This returns the full validation result — status, errors, validation ID, the works. The important fields are status (either "validated" or "validation_failed") and errors (an array of specific field-level issues when validation fails).
LangGraph uses a typed state that flows between nodes. Ours tracks the user request, extracted data, validation result, and retry count:
from typing import TypedDict, Optional
class OrderState(TypedDict):
    user_request: str
    extracted_data: Optional[dict]
    validation_result: Optional[dict]
    validation_errors: Optional[str]
    retry_count: int
    final_result: Optional[str]
The LLM extracts structured order data from the user's natural language request. If there were previous validation errors, they're included in the prompt so the LLM can correct its output:
from langchain_openai import ChatOpenAI
import json
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
GATE_ID = os.environ["FLOW_GATE_ID"] # Your gate ID
def extract_order(state: OrderState) -> dict:
    error_context = ""
    if state.get("validation_errors"):
        error_context = (
            f"\n\nYour previous extraction had validation errors:\n"
            f"{state['validation_errors']}\n"
            f"Fix these issues in your new extraction."
        )
    response = llm.invoke(
        f"Extract order data from this request as JSON with fields: "
        f"vendor (string), amount (number), currency (string, one of USD/EUR/GBP/INR), "
        f"po_number (string, optional).\n\n"
        f"Request: {state['user_request']}"
        f"{error_context}\n\n"
        f"Respond with ONLY valid JSON, no markdown."
    )
    try:
        extracted = json.loads(response.content)
    except json.JSONDecodeError:
        extracted = {"vendor": "", "amount": 0, "currency": ""}
    return {"extracted_data": extracted}
This is the Flow integration — submit the extracted data to the gate and capture the result:
def validate_order(state: OrderState) -> dict:
    result = validate_with_flow(GATE_ID, state["extracted_data"])
    if result.get("status") == "validation_failed":
        errors = result.get("errors") or result.get("error", {}).get("details", [])
        error_text = "\n".join(
            f"- {e.get('field', e.get('rule_id', 'unknown'))}: {e.get('message', 'invalid')}"
            for e in errors
        )
        return {
            "validation_result": result,
            "validation_errors": error_text,
            "retry_count": state.get("retry_count", 0) + 1,
        }
    return {
        "validation_result": result,
        "validation_errors": None,
    }
If validation passed, the order moves forward. In a real system this would write to your database, trigger fulfillment, or call another API:
def process_order(state: OrderState) -> dict:
    validation_id = state["validation_result"].get("validation_id", "")
    return {
        "final_result": (
            f"Order processed successfully.\n"
            f"Vendor: {state['extracted_data']['vendor']}\n"
            f"Amount: {state['extracted_data']['amount']} {state['extracted_data']['currency']}\n"
            f"Validation ID: {validation_id}"
        )
    }
The validation_id is a tamper-proof token from Flow — your downstream systems can verify that the data passed validation and hasn't been modified since.
Now connect the nodes with conditional routing. If validation fails and we haven't exhausted retries, route back to the extract node with the error context:
from langgraph.graph import StateGraph, END
def should_retry(state: OrderState) -> str:
    if state.get("validation_errors") and state.get("retry_count", 0) < 3:
        return "retry"
    elif state.get("validation_errors"):
        return "give_up"
    return "proceed"
# Build the graph
graph = StateGraph(OrderState)
graph.add_node("extract", extract_order)
graph.add_node("validate", validate_order)
graph.add_node("process", process_order)
graph.set_entry_point("extract")
graph.add_edge("extract", "validate")
graph.add_conditional_edges(
    "validate",
    should_retry,
    {
        "retry": "extract",    # Back to extraction with error context
        "proceed": "process",  # Validation passed
        "give_up": END,        # Max retries reached
    },
)
graph.add_edge("process", END)
app = graph.compile()
result = app.invoke({
    "user_request": "Process an order from Globex Corp for twelve thousand five hundred dollars USD, PO number PO-2026-042",
    "retry_count": 0,
})
print(result["final_result"])
Output:
Order processed successfully.
Vendor: Globex Corp
Amount: 12500.0 USD
Validation ID: v_abc123...
The interesting part is what happens when the LLM makes a mistake. Say it extracts currency: "Dollars" instead of "USD". Flow returns:
{
  "status": "validation_failed",
  "errors": [
    {"field": "currency", "message": "must be one of: USD, EUR, GBP, INR"}
  ]
}
The graph routes back to the extract node, which now includes the error in its prompt. The LLM reads "currency must be one of: USD, EUR, GBP, INR", fixes its extraction to "USD", and the second attempt passes validation.
This happens automatically — no human intervention, no hardcoded fixes. The LLM uses the structured error feedback from Flow to correct itself.
In our testing, most validation issues resolve in one retry. The retry_count cap of 3 is a safety net — if the agent can't fix it in three attempts, something is fundamentally wrong with the input and it's better to fail explicitly.
You could validate with Pydantic directly in the extract node. For a single agent, that works fine. But Flow gives you a few things Pydantic doesn't:
Business rules that cross fields. Pydantic validates field types and constraints, but expressions like endDate > startDate or quantity * price == total need custom validators. Flow evaluates these as expressions — you configure them in the dashboard, no code changes needed.
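For comparison, here is roughly what one cross-field rule costs in Pydantic: a hand-written validator per rule, per model (a sketch against Pydantic v2; the Order fields mirror this tutorial's schema):

```python
from pydantic import BaseModel, ValidationError, model_validator

class Order(BaseModel):
    quantity: int
    price: float
    total: float

    @model_validator(mode="after")
    def check_total(self):
        # The same rule Flow expresses declaratively: quantity * price == total
        if abs(self.quantity * self.price - self.total) > 1e-9:
            raise ValueError("total must equal quantity * price")
        return self

Order(quantity=2, price=5.0, total=10.0)       # passes
try:
    Order(quantity=2, price=5.0, total=11.0)   # violates the cross-field rule
except ValidationError as e:
    print("rejected:", e.error_count(), "error")
```

Every new rule means another validator, another deploy. The point of the gate is that these expressions live in configuration instead.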
Centralized validation across agents. If you have five different LangGraph pipelines submitting orders, they all validate against the same gate. Change a rule once, it applies everywhere. With Pydantic, you'd need to update the model in every repo.
Observability. Flow's analytics dashboard shows you which fields fail most often, which business rules trigger, and which agents (by session) are producing the most errors. When you're debugging why Agent C keeps submitting bad currencies, this is where you look.
Approval workflows. For high-value orders, add a human approval step on the gate. The pipeline pauses, a reviewer approves or rejects, and the graph resumes. You can't do this with a Pydantic validator.
If you want the LLM to call Flow tools directly (instead of going through a hardcoded REST call), you can use LangChain's MCP tool integration. Flow's MCP endpoint at https://api.rynko.dev/api/flow/mcp auto-generates a validate_{gate_slug} tool for each active gate in your workspace.
This means the LLM can discover available gates and submit payloads through tool calling, which is useful when the agent needs to decide which gate to validate against based on the input.
To set up a local LangGraph development environment:
# Create a project directory
mkdir langgraph-flow-demo && cd langgraph-flow-demo
# Set up a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install langgraph langchain-openai httpx python-dotenv
# Create .env file
cat > .env << 'EOF'
OPENAI_API_KEY=sk-...
RYNKO_API_KEY=your_api_key_here
FLOW_GATE_ID=your_gate_id_here
EOF
Create a main.py with the code from this tutorial, add from dotenv import load_dotenv; load_dotenv() at the top, and run it with python main.py.
For iterative development, LangGraph has a built-in visualization tool:
# Print the graph structure
app.get_graph().print_ascii()
# Or save as PNG (requires pygraphviz)
app.get_graph().draw_png("graph.png")
This shows you the nodes, edges, and conditional routing at a glance — useful for verifying the self-correction loop is wired correctly.
The complete code for this tutorial — including the graph, Flow client, .env.example, and two test scenarios — is in our developer resources repo. Clone it, add your API keys, and run python src/main.py.
Resources:
2026-03-10 22:12:03
Most organizations chasing AI transformation are looking in the wrong direction. The highest-value data isn't in the shiny new tool; it's buried in the systems you've been running for years.
Every few months, I sit across from a technical leader who tells me some version of the same story: "We've got GPT wired up for internal chat, but nobody's using it." Or: "We built a chatbot, but it just makes things up." Or my personal favorite: "We tried RAG, but the results were garbage."
And almost every time, the problem isn't the model. It's the plumbing.
I run a company called Sprinklenet AI, where we build and deploy multi-LLM platforms (primarily Knowledge Spaces, our RAG-based system) for government agencies and enterprise clients. Over the past two years, I've watched the industry fixate on model selection (GPT-4 vs. Claude vs. Gemini) while almost completely ignoring the far harder, far more valuable problem: getting AI systems reliably connected to the raw operational data that actually drives decisions.
The gold isn't in the model. It's in the basement: in the ERP logs, the CRM records, the SharePoint graveyards, the PostgreSQL tables that nobody has touched since 2019. And the organizations that figure out how to connect AI to that data, securely and at scale, are the ones that will win.
Retrieval-Augmented Generation has become the default architecture for enterprise AI, and for good reason. Instead of fine-tuning a model on your data (expensive, brittle, and a governance nightmare), you retrieve relevant documents at query time and inject them into the prompt context. The model generates answers grounded in your actual information.
In theory, this is elegant. In practice, most RAG implementations fail at the retrieval step, not the generation step.
Here's what I mean. A typical proof-of-concept goes like this: someone uploads a few PDFs into a vector store, builds a simple semantic search pipeline, and demos it to leadership. "Look, it can answer questions about our procurement policy!" Everyone applauds. Budget gets approved.
Then reality hits. The production system needs to pull from Salesforce, a legacy SQL database, an internal wiki, and six different file shares with overlapping and contradictory versions of the same document. The PDFs that worked great in the demo turn out to represent maybe 3% of the organization's actual knowledge. The other 97% lives in structured databases, transactional systems, and formats that don't neatly convert to text chunks.
This is the RAG gap: the distance between "we can do semantic search on a folder of documents" and "we can give our people AI-powered access to everything they need to make decisions." It's enormous, and closing it is mostly an engineering problem, not an AI problem.
When we architect Knowledge Spaces deployments, we spend roughly 60% of our integration time on data connectors: the unglamorous middleware that pulls information from source systems, normalizes it, chunks it appropriately, generates vector embeddings, and keeps everything in sync.
We've built connectors for more than 15 different source systems: Salesforce, PostgreSQL, REST APIs with OAuth flows, file systems, cloud storage. Each one has its own authentication model, rate limits, data schema, and update patterns. And each one requires a different chunking strategy to produce embeddings that actually return relevant results during retrieval.
This is the part that doesn't make it into conference talks. Nobody gives a keynote about spending three weeks tuning chunk sizes for a PostgreSQL connector so that semantic search over transactional records returns meaningful results. But that's the work that separates a demo from a system people actually rely on.
A few hard-won lessons:
Chunk size matters more than model choice. I've seen teams agonize over whether to use GPT-4o or Claude 3.5 while their retrieval pipeline is returning irrelevant context because they're splitting documents at arbitrary 500-token boundaries. For structured data from relational databases, we typically chunk by logical record (one row or one transaction per chunk), with schema metadata preserved. For long-form documents, overlapping chunks of 800-1200 tokens with section-header context prepended tend to outperform naive splitting. The right strategy depends entirely on how your users actually query the data.
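The overlapping-chunk strategy sketched in Python (word counts stand in for token counts; `chunk_with_overlap` is illustrative, not our production code):

```python
def chunk_with_overlap(tokens, size=1000, overlap=200, header=""):
    """Overlapping chunks with a section-header prefix prepended to each."""
    prefix = header + "\n" if header else ""
    chunks = []
    step = size - overlap  # each chunk shares `overlap` tokens with the next
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(prefix + " ".join(tokens[start:start + size]))
    return chunks

doc = [f"w{i}" for i in range(2500)]
chunks = chunk_with_overlap(doc, size=1000, overlap=200, header="Section 3: Procurement")
print(len(chunks))  # 3 overlapping chunks covering the whole document
```

The overlap means a fact that straddles a boundary still appears whole in at least one chunk, and the header prefix keeps each chunk interpretable on its own at retrieval time.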
Embeddings are not one-size-fits-all. Different embedding models perform differently depending on the domain and the nature of the queries. We've found that for government and defense use cases, where terminology is highly specific and acronym-dense, general-purpose embedding models underperform unless you prepend definitional context to chunks. Running a small evaluation set before committing to an embedding strategy saves weeks of debugging bad retrieval later.
Freshness is a first-class concern. Static RAG (upload once, query forever) works for reference documents. It falls apart for operational data. If your sales team is asking the AI about pipeline status and your Salesforce connector last synced three days ago, trust evaporates immediately. We run incremental sync jobs on configurable schedules (hourly for transactional data, daily for documents) and surface last-sync timestamps in the UI so users know what they're working with.
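The incremental-sync pattern reduces to a small loop: pull everything modified since the last cursor, upsert it, advance the cursor, and expose the cursor to the UI. A minimal sketch, assuming a hypothetical connector interface (the real connectors handle auth, rate limits, and deletion tombstones on top of this):

```python
import datetime


class IncrementalSync:
    """Pull-only sketch of an incremental connector sync job."""

    def __init__(self, fetch_since, upsert):
        self.fetch_since = fetch_since  # callable: cursor -> (records, new_cursor)
        self.upsert = upsert            # callable: records -> None
        self.last_sync = None           # surfaced to users as "last synced at"

    def run(self) -> int:
        cursor = self.last_sync or datetime.datetime.min
        records, new_cursor = self.fetch_since(cursor)
        self.upsert(records)            # re-embed and write to the vector store
        self.last_sync = new_cursor
        return len(records)
```

Scheduling this hourly versus daily is then just a matter of how often `run()` is invoked per source.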
Here's where enterprise RAG diverges most sharply from the open-source tutorials.
In a real deployment, especially in government, you can't just dump all your documents into a single vector store and let everyone query everything. That's a data spill waiting to happen. The AI system has to respect the same access controls that govern the source systems.
In Knowledge Spaces, we implement this through a four-tier RBAC hierarchy (Organization Owner, Admin, Contributor, Viewer) that controls not just who can query, but what data each query can retrieve against. When a user asks a question, the retrieval step filters the vector search results by that user's permissions before anything reaches the LLM. The model never sees data the user isn't authorized to access.
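The crucial property is that the permission filter runs before prompt assembly, not as a post-hoc redaction step. A simplified sketch, assuming a hypothetical chunk structure with per-chunk space and minimum-role metadata (role names mirror the four-tier hierarchy):

```python
ROLE_RANK = {"viewer": 0, "contributor": 1, "admin": 2, "org_owner": 3}


def authorized_chunks(candidates: list[dict], user_role: str,
                      user_spaces: set[str]) -> list[dict]:
    """Drop any retrieved chunk the caller may not read, pre-prompt."""
    rank = ROLE_RANK[user_role]
    return [
        c for c in candidates
        if c["space"] in user_spaces and ROLE_RANK[c["min_role"]] <= rank
    ]
```

In practice you push this filter down into the vector database's metadata filtering so unauthorized chunks never even enter the candidate set, but the invariant is the same: the model only ever sees what the user is cleared to see.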
We also enforce SAML 2.0 SSO and support CAC/PKI authentication for defense clients , because if your AI platform has a separate login from everything else, your security team will (rightly) shut it down.
And then there's audit logging. We capture 64+ event types: every query, every retrieval, every model invocation, every document access. Not because we love logging, but because our government clients need to answer the question: "Who asked what, and what data informed the answer?" If you can't answer that question, you don't have a governed AI system. You have a liability.
One more pattern I want to surface, because I think it's underappreciated: in production, you almost certainly need more than one model.
We currently orchestrate across models from OpenAI, Anthropic, Google, Groq, and xAI: 16+ foundation models with support for tool calling, streaming, and structured JSON output. Different models excel at different tasks. Some are better at precise factual extraction. Others handle nuanced summarization more gracefully. Some are fast and cheap enough for high-volume classification tasks. Others are worth the latency for complex analytical queries.
The point isn't to have options for the sake of options. It's that when you're connecting AI to diverse enterprise data sources, the queries that hit your system are diverse too. A procurement analyst asking "What were the top three cost overruns on Program X last quarter?" needs a different model behavior than a policy researcher asking "How does this draft regulation compare to FAR Part 15?" Routing queries to the right model, with guardrails that catch PII leakage, prompt injection, and off-topic responses regardless of which model is active, is table stakes for production deployment.
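At its simplest, routing is a lookup from task category to model configuration with a safe default. The table below is purely illustrative: the task labels, model names, and temperatures are assumptions for the sketch, not our actual routing policy.

```python
# Hypothetical routing table: task category -> model configuration.
ROUTES = {
    "extraction":     {"model": "precise-model", "temperature": 0.0},
    "summarization":  {"model": "nuanced-model", "temperature": 0.3},
    "classification": {"model": "fast-model",    "temperature": 0.0},
}


def route(task: str) -> dict:
    # Unrecognized tasks fall back to a general-purpose default.
    return ROUTES.get(task, {"model": "default-model", "temperature": 0.2})
```

Production routers classify the incoming query first (often with a small, cheap model) and layer the guardrail checks around whichever model the route selects.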
If I could give one piece of advice to a technical leader starting an enterprise AI initiative, it would be this: before you evaluate a single model, before you pick a vector database, before you write a line of prompt engineering, go inventory your data.
Map every system that holds information your people need to make decisions. Understand the access controls on each one. Document the update frequency. Figure out what's structured versus unstructured. Identify which sources overlap, which contradict each other, and which are authoritative.
Then build your RAG architecture around that map. Let the data topology drive the system design, not the other way around.
The organizations that get this right don't just get a better chatbot. They get something much more valuable: a single, governed, intelligent interface to their institutional knowledge. An interface that respects security boundaries, stays current with source systems, and gets smarter as more data flows through it.
The gold has been in the basement all along. You just need to build the stairs.
Jamie Thompson is the Founder and CEO of Sprinklenet AI, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at newsletter.sprinklenet.com.
By Jamie Thompson, Founder & CEO, Sprinklenet AI
If you have spent any time building with large language models over the past two years, you have almost certainly encountered the term RAG. Retrieval-Augmented Generation has become one of the most important architectural patterns in applied AI, and for good reason. It solves a fundamental problem that every team hits the moment they try to make LLMs useful in production: the model does not know your data.
I have been building RAG systems since before the term was trendy. At Sprinklenet, our flagship platform Knowledge Spaces is a multi-LLM RAG system that serves enterprise and government clients across sensitive, high-stakes environments. What follows is not theory. It is what we have learned building, deploying, and operating these systems in production.
RAG is a design pattern where you augment an LLM's generation capabilities by first retrieving relevant context from an external knowledge base. Instead of relying solely on what the model learned during training, you give it fresh, specific, verified information at inference time.
The concept is straightforward. A user asks a question. Your system searches a curated knowledge base for the most relevant documents or passages. Those passages get injected into the prompt as context. The LLM then generates a response grounded in that retrieved information.
This is fundamentally different from fine-tuning. Fine-tuning changes the model's weights. RAG changes the model's context window. That distinction matters enormously in practice because it means you can update your knowledge base without retraining anything, you can control exactly what information the model has access to, and you can cite specific sources in every response.
LLMs are remarkable at language understanding, reasoning, and generation. They are terrible at knowing facts about your organization, your documents, your policies, or anything that happened after their training cutoff.
Without RAG, you are stuck with a model that confidently generates plausible answers that may be completely wrong. In enterprise settings, that is not just annoying. It is dangerous. An analyst acting on hallucinated intelligence, a contractor citing a regulation that does not exist, a compliance officer relying on outdated policy guidance: these are real failure modes with real consequences.
RAG addresses this by grounding every response in retrievable, verifiable source material. When done well, the system can tell you not just what it thinks, but exactly which documents it consulted and which passages informed its answer. That traceability is what makes RAG production-ready for serious applications.
Every RAG system has three core phases: embedding, retrieval, and generation. Getting each one right matters, and the interactions between them matter even more.
Before you can retrieve anything, you need to transform your documents into a format that supports semantic search. This means converting text into vector embeddings, which are dense numerical representations that capture meaning rather than just keywords.
The ingestion pipeline typically looks like this: documents come in through connectors (file uploads, API integrations, database pulls), get parsed and cleaned, get split into chunks, and then get embedded and stored in a vector database.
Chunking strategy is one of the most consequential decisions you will make. Chunk too large and your retrieval loses precision. Too small and you lose context. In our experience building Knowledge Spaces, we have found that the optimal chunk size varies significantly by use case. Dense regulatory text (like the Federal Acquisition Regulation, which powers our FARbot product) benefits from smaller, paragraph-level chunks with overlapping windows. Narrative documents like reports and memos work better with larger chunks that preserve the author's reasoning flow.
Overlap between chunks is critical and often overlooked. If your chunks do not overlap, you will inevitably split important information across chunk boundaries, and the retrieval system will miss it. We typically use 10 to 20 percent overlap, though this is tunable.
Retrieval is where your system searches the vector database for chunks that are semantically similar to the user's query. The user's question gets embedded using the same model that embedded your documents, and then a similarity search (typically cosine similarity or dot product) finds the closest matches.
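The core comparison is easy to show from scratch. This is a linear scan for clarity; real systems use an approximate-nearest-neighbor index rather than scoring every chunk, but the similarity math is the same.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]],
          k: int = 5) -> list[str]:
    """index holds (chunk_id, vector) pairs embedded with the same model."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in index]
    return [cid for _, cid in sorted(scored, reverse=True)[:k]]
```

Note the requirement baked into the comment: the query must be embedded with the same model as the documents, or the similarity scores are meaningless.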
This sounds simple, but there are several layers of complexity in production systems.
Hybrid search combines vector similarity with traditional keyword matching. Pure semantic search can miss exact terms that matter (like specific regulation numbers or product names), while pure keyword search misses conceptual relevance. The best production systems use both and merge the results.
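One common way to merge the two result lists is reciprocal rank fusion (RRF): each list contributes a score of 1/(k + rank) per document, and documents that appear in both lists accumulate score from each. A minimal sketch (the k=60 constant is the conventional default from the RRF literature):

```python
def rrf_merge(vector_hits: list[str], keyword_hits: list[str],
              k: int = 60) -> list[str]:
    """Merge two ranked result lists with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked second in both lists will typically beat a document ranked first in only one, which is exactly the behavior you want when semantic and keyword signals disagree.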
Metadata filtering lets you scope retrieval to specific document sets, time ranges, access levels, or categories before the similarity search even runs. In multi-tenant systems like Knowledge Spaces, this is essential. You cannot have one client's documents leaking into another client's results.
Reranking takes the initial retrieval results and applies a second, more computationally expensive model to reorder them by relevance. The initial vector search is fast but approximate. A cross-encoder reranker is slower but significantly more accurate. In practice, you retrieve a larger candidate set (say 20 to 50 chunks) and then rerank down to the top 5 to 10 that actually get passed to the LLM.
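The retrieve-then-rerank pattern is a two-stage funnel: over-fetch with the cheap index, reorder with the expensive scorer, keep the top few. A sketch, with `ann_search` and `cross_encoder_score` as stand-ins for a real vector index and a real cross-encoder model:

```python
def retrieve_and_rerank(query: str, ann_search, cross_encoder_score,
                        fetch_k: int = 50, final_k: int = 8) -> list:
    """Over-fetch candidates cheaply, then reorder with a precise scorer."""
    candidates = ann_search(query, k=fetch_k)    # fast, approximate
    reranked = sorted(candidates,
                      key=lambda c: cross_encoder_score(query, c),
                      reverse=True)              # slow, accurate
    return reranked[:final_k]
```

The cost model is the point: the cross-encoder scores only `fetch_k` pairs per query instead of the whole corpus, so you can afford its accuracy where it matters.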
The retrieved chunks get assembled into a prompt alongside the user's question and any system instructions. The LLM then generates a response grounded in that context.
Prompt engineering for RAG is its own discipline. You need to instruct the model to use the provided context, cite its sources, and clearly indicate when it does not have enough information to answer. You also need to handle the case where the retrieved context is irrelevant to the question, because the retrieval system will always return something, even if nothing in the knowledge base actually answers the query.
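A grounded prompt along these lines can be assembled from numbered passages, with the instructions covering citation and the "I don't know" case explicitly. The wording here is illustrative, not a production prompt:

```python
SYSTEM = (
    "Answer using ONLY the numbered context passages below. "
    "Cite passages as [1], [2], and so on. If the context does not "
    "contain the answer, say so explicitly instead of guessing."
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble system instructions, numbered context, and the question."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Numbering the passages is what makes per-claim citation possible downstream: the model's `[2]` maps back to a specific chunk, which maps back to a specific document.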
Source attribution is non-negotiable in production RAG. Every claim in the response should trace back to a specific chunk from a specific document. This is what separates a useful enterprise tool from a liability. In Knowledge Spaces, we log retrieval results alongside every generated response so that administrators can audit not just what the system said, but what it consulted.
The vector database is the backbone of your RAG system. It stores your embeddings and handles similarity search at scale. The major options include Pinecone, Weaviate, Qdrant, Milvus, Chroma, and pgvector (for teams already running PostgreSQL).
We use Pinecone for Knowledge Spaces, and it has served us well at scale. But the choice depends on your constraints. If you need on-premises deployment for security requirements, Qdrant or Milvus give you that control. If you want to minimize infrastructure, Pinecone's managed service is hard to beat. If you are prototyping and want minimal setup, Chroma works fine locally but think carefully before taking it to production.
Key factors to evaluate: query latency at your expected scale, filtering capabilities (metadata filtering performance varies dramatically between solutions), managed versus self-hosted options, and cost at your projected data volume.
After building RAG systems for several years, I have seen the same mistakes repeatedly. Here are the ones that cause the most pain.
Ignoring chunk quality. Garbage in, garbage out applies doubly to RAG. If your ingestion pipeline produces poorly parsed, badly chunked documents, no amount of retrieval sophistication will save you. Invest heavily in document parsing and chunk quality. Parse tables correctly. Handle headers and footers. Strip boilerplate. This unsexy work is often the difference between a system that works and one that hallucinates.
Skipping evaluation. Most teams build a RAG pipeline, try a few queries manually, and call it done. You need systematic evaluation: a test set of questions with known correct answers, automated retrieval quality metrics (precision, recall, MRR), and end-to-end answer quality assessment. Without this, you are flying blind every time you change a parameter.
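One of the retrieval metrics named above, mean reciprocal rank, is simple enough to compute by hand over a labeled test set. A sketch, assuming one known-correct chunk per query:

```python
def mrr(results: list[list[str]], relevant: list[str]) -> float:
    """Mean reciprocal rank: results[i] is the ranked retrieval list for
    query i; relevant[i] is that query's known-correct chunk id."""
    total = 0.0
    for ranked, gold in zip(results, relevant):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(results)
```

Tracking a number like this across pipeline changes is what turns "the answers feel worse" into a regression you can actually bisect.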
Overloading the context window. Retrieving too many chunks and stuffing them all into the prompt is counterproductive. LLMs have finite attention. Research consistently shows that models perform worse when given excessive context, particularly in the middle of long prompts (the "lost in the middle" phenomenon). Be selective. Five highly relevant chunks will outperform twenty mediocre ones.
Neglecting access control. In any multi-user or multi-tenant system, retrieval must respect authorization boundaries. A user should never receive information from documents they do not have permission to access. This sounds obvious, but implementing it correctly requires thinking about access control at the vector database level, not just at the application layer. In Knowledge Spaces, we enforce role-based access control with a four-tier hierarchy and 64+ auditable event types precisely because this is a hard problem that demands rigorous engineering.
Treating RAG as a one-time build. A production RAG system is a living system. Documents change. New sources get added. Embedding models improve. User needs evolve. You need operational infrastructure for re-ingestion, monitoring retrieval quality over time, and updating your pipeline as the underlying models and data shift.
RAG is powerful, but it is not the right pattern for every problem. If your task requires real-time computation, complex multi-step reasoning over structured data, or actions in external systems, you likely need agentic architectures, tool use, or traditional software engineering rather than (or in addition to) retrieval-augmented generation.
RAG excels when the core task is: "Answer questions using information from a specific, curated knowledge base." The further you drift from that pattern, the more you should consider other approaches.
RAG is maturing rapidly. The next generation of production systems will incorporate more sophisticated retrieval strategies (graph-based retrieval, hypothetical document embeddings, multi-hop reasoning), tighter integration with structured data sources, and better evaluation frameworks.
But the fundamentals remain the same. Ingest your data carefully. Retrieve with precision. Generate with grounding. Audit everything. If you get those four things right, you are ahead of most teams building in this space.
At Sprinklenet, we have distilled these lessons into Knowledge Spaces, a platform that handles multi-LLM orchestration across 16+ foundation models, enterprise-grade access control, and comprehensive audit logging out of the box. We built it because we got tired of solving the same hard infrastructure problems on every engagement. If you are building RAG systems seriously, whether for government or commercial use, the infrastructure layer matters as much as the AI layer.