2026-03-22 06:59:08
An OpenClaw agent deleted 200+ emails from Meta's AI alignment director's inbox while ignoring her commands to stop. She had to run to her Mac to kill the process. Context window compaction dropped the safety constraint that said "ask before acting."
No network boundary. No kill switch beyond reaching the machine. No recording of what the agent saw or why it ignored the stop command.
I built openclaw-in-a-box to make that scenario impossible. It runs OpenClaw inside a stereOS VM with Tapes as the flight recorder.
Last month I wrote about running OpenClaw on exe.dev with Discord. That post was about getting started safely on someone else's ephemeral VM. This project takes it further: you own the sandbox, you own the telemetry, and the whole thing is declarative and version-controlled.
stereOS locks the agent down. Network egress allowlist means the agent can only reach APIs you explicitly permit. Gmail, Anthropic, npm. Nothing else. If the agent tries to curl somewhere unexpected, the network layer blocks it. This isn't application-level filtering. It's at the VM's network stack.
Secrets live in tmpfs, never written to disk, gone the moment the VM stops. Auto-teardown after 2 hours. If you walk away, credentials don't linger. The agent doesn't keep running overnight.
One jcard.toml file defines the entire sandbox. Resources, network policy, secrets, timeout. Reproducible, auditable, version-controlled.
mixtape = "opencode-mixtape:latest"
name = "openclaw-in-a-box"
[network]
mode = "nat"
egress_allow = [
"api.anthropic.com", "openclaw.ai",
"gmail.googleapis.com", "oauth2.googleapis.com",
"registry.npmjs.org",
]
[timeout]
duration = "2h"
[secrets]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
Tapes sits between the agent and the LLM as a transparent proxy, capturing every request and response to SQLite with hash chains. No instrumentation. No SDK. It records at the network layer.
When the Meta incident happened, there was no way to replay the agent's reasoning. Why did it start deleting? What did the compacted context look like? At what point did it lose the safety constraint? Without a recording, all you have is the outcome: 200 emails gone.
With Tapes you get the full prompt, the full response, token counts, timestamps. Content-addressed so the sequence is tamper-evident. If the agent miscategorizes an email, you replay the tape. Every prompt, every response, every decision.
The difference between "200 emails gone, no idea why" and a complete forensic replay.
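As a sketch of what tamper evidence buys you: a hash chain can be verified by replaying it. The schema below (a `nodes` table with `content`, `prev_hash`, and `hash` columns) is an assumption for illustration, not Tapes' actual layout.

```python
import hashlib
import sqlite3

def verify_chain(db_path: str) -> bool:
    """Walk recorded nodes in order and check that each row's hash covers
    its content plus the previous row's hash (a simple hash chain).
    Column names here are illustrative, not the real Tapes schema."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT content, prev_hash, hash FROM nodes ORDER BY created_at"
    ).fetchall()
    conn.close()
    prev = ""
    for content, prev_hash, digest in rows:
        if prev_hash != prev:
            return False  # chain broken: a row was inserted, removed, or reordered
        expected = hashlib.sha256((prev + content).encode()).hexdigest()
        if digest != expected:
            return False  # content was modified after recording
        prev = digest
    return True
```

If any row is edited or dropped after the fact, every later hash stops matching, which is what makes the recording tamper-evident rather than merely append-only.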
┌─────────────────────────────────────────────┐
│ stereOS VM (NixOS · 2 CPU · 4 GiB) │
│ │
│ tapes proxy (:8080) │
│ ▲ intercepts all LLM traffic │
│ ▼ │
│ openclaw gateway (:18789) │
│ ├── Claude API (via Tapes proxy) │
│ ├── gog CLI → Gmail API │
│ └── skills/gmail-triage/SKILL.md │
│ │
│ egress: anthropic, gmail, npm only │
│ secrets: tmpfs (never on disk) │
│ timeout: 2h auto-teardown │
└─────────────────────────────────────────────┘
│
│ shared mount (persists across restarts)
▼
.mb/tapes/tapes.sqlite agent black box
.openclaw/ agent config
output/ INBOX_REPORT.md
The triage logic is a Markdown file. No code. The agent reads it, understands the rules, and executes them using the gog CLI for Gmail access.
It classifies messages into four categories: newsletter, receipt, action needed, FYI. Newsletters get archived. Receipts get labeled and archived. Action items get starred. FYI messages get marked as read.
Safety constraints baked into the skill: never delete messages, never send replies. If a message can't be confidently classified, leave it unread. Change the categories, add new ones, tighten the constraints. It's all prose.
git clone https://github.com/papercomputeco/openclaw-in-a-box
cd openclaw-in-a-box
mb up
mb ssh openclaw-in-a-box
bash /workspace/scripts/install.sh
bash /workspace/scripts/start.sh
The install script handles Node.js, OpenClaw CLI, Tapes CLI, and the gog CLI for Gmail access. OAuth setup for Gmail is a one-time step on the host. After that, mb up and start.sh is all you need between sessions.
Query the black box directly to see the agent's reasoning:
sqlite3 .mb/tapes/tapes.sqlite \
"SELECT role, substr(content, 1, 200) FROM nodes ORDER BY created_at DESC LIMIT 4"
assistant | Here's your inbox triage for the last 2 days (20 threads):
## Needs Attention
- State DMV: Complete your application
- Team standup invite: Tuesday 9am PDT...
user | [tool_result]
assistant | [tool_input: gog gmail messages list ...]
user | /gmail-triage
Read bottom to top. The user invoked /gmail-triage, the agent called gog to list messages, received the results, then produced the classification. Every step is captured.
Over time the recordings become training data. Analyze 100 triage sessions to find where the skill definition falls short. Which email categories does the agent struggle with? Which prompts produce better classification? The black box isn't just for incident response. It's how agents get better between runs.
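A rough sketch of that kind of analysis, assuming the `nodes(role, content)` schema from the query above. A real pass would parse structured output rather than substring-match, but the shape is the same: read the tape, aggregate, look for weak spots.

```python
import sqlite3
from collections import Counter

CATEGORIES = ("newsletter", "receipt", "action", "fyi")

def category_counts(db_path: str) -> Counter:
    """Tally how often each triage category appears in assistant turns
    across recorded sessions. Schema is assumed, not Tapes' documented one."""
    conn = sqlite3.connect(db_path)
    counts: Counter = Counter()
    for (content,) in conn.execute(
        "SELECT content FROM nodes WHERE role = 'assistant'"
    ):
        lowered = content.lower()
        for cat in CATEGORIES:
            counts[cat] += lowered.count(cat)
    conn.close()
    return counts
```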
When you're done, mb destroy openclaw-in-a-box. Secrets gone. VM gone. The only thing that survives is the tape.
Go try it.
Run OpenClaw in a stereOS VM with Tapes telemetry.
Paste this into Claude Code, OpenCode, or any coding harness:
Set up openclaw-in-a-box from https://github.com/papercomputeco/openclaw-in-a-box — clone the repo and follow SKILL.md to get me running with a secure OpenClaw setup
The agent clones the repo, checks your environment, asks which integrations you want, and walks you through setup.
Manual setup. Prerequisites: Master Blaster (mb CLI) and ANTHROPIC_API_KEY exported.
git clone https://github.com/papercomputeco/openclaw-in-a-box
cd openclaw-in-a-box
export ANTHROPIC_API_KEY="sk-ant-..."
mb up
mb ssh openclaw-in-a-box
bash /workspace/scripts/install.sh # first time
bash /workspace/scripts/start.sh
The VM comes pre-configured for three integrations. Set up whichever ones you need -- the agent loads all available skills at startup.
| Integration | Setup Guide | What It Does |
|---|---|---|
| Gmail Triage | Google OAuth + gog CLI | Archive newsletters, label receipts, flag action items |
| GitHub Org Triage | GH_TOKEN + gh CLI | Flag stale PRs, blocked issues, release |
2026-03-22 06:58:38
You have a CSV with duplicate records. Maybe it's customer data exported from two CRMs, a product catalog merged from multiple vendors, or academic papers from different databases. You need to find the duplicates, decide which to merge, and produce a clean dataset.
Here's how to do it in one command:
pip install goldenmatch
goldenmatch dedupe your_data.csv
That's the zero-config path. GoldenMatch auto-detects your column types (name, email, phone, zip, address), picks appropriate matching algorithms, chooses a blocking strategy, and launches an interactive TUI where you review the results.
But let's go deeper. I'll walk through what happens under the hood and how to tune it for better results.
goldenmatch dedupe
1. Column Classification
GoldenMatch profiles your data and classifies each column:
| Detected Type | Scorer | Why |
|---|---|---|
| Name | Ensemble (best of Jaro-Winkler, token sort, soundex) | Handles misspellings, nicknames, word order |
| Email | Exact (after normalization) | Emails are structured identifiers |
| Phone | Exact (digits only) | Strip formatting, compare digits |
| Zip | Exact | High-cardinality blocking key |
| Address | Token sort | Word order varies ("123 Main St" vs "Main Street 123") |
| Free text | Record embedding | Semantic similarity via sentence-transformers |
2. Blocking
Comparing every record against every other record is O(n^2). For 100,000 records, that's 5 billion comparisons. Blocking reduces this to manageable chunks by grouping records that share a key (same zip code, same first 3 characters of name, same Soundex code).
GoldenMatch has 8 blocking strategies. The most interesting new one is learned blocking -- it samples your data, scores pairs, and automatically discovers which predicates give the best recall/reduction tradeoff:
blocking:
strategy: learned
learned_sample_size: 5000
learned_min_recall: 0.95
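The core idea behind single-key blocking can be sketched in a few lines (this is an illustration of the technique, not GoldenMatch's implementation):

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key):
    """Single-key blocking: only records sharing the blocking key are
    ever compared, cutting the O(n^2) all-pairs cost."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    for group in blocks.values():
        # Candidate pairs are generated within each block only
        yield from combinations(group, 2)
```

With 6 records spread evenly over 3 zip codes, this yields 3 candidate pairs instead of the 15 an all-pairs pass would make, and the reduction grows quadratically with dataset size.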
3. Scoring
Within each block, every pair is scored using vectorized NxN comparison via rapidfuzz.process.cdist. This releases the GIL, so blocks are scored in parallel via a thread pool.
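To picture that step, here is a naive stdlib version of the same NxN matrix; the real pipeline swaps this Python loop for rapidfuzz.process.cdist, which vectorizes the comparison and releases the GIL so a thread pool can score blocks in parallel.

```python
from difflib import SequenceMatcher

def score_block(values):
    """Naive NxN similarity matrix for one block (stdlib stand-in for
    the vectorized rapidfuzz version)."""
    n = len(values)
    # Diagonal is 1.0: every value matches itself
    scores = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = SequenceMatcher(None, values[i], values[j]).ratio()
            scores[i][j] = scores[j][i] = s  # symmetric
    return scores
```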
For hard cases (product matching), you can add LLM scoring:
llm_scorer:
enabled: true
model: gpt-4o-mini
budget:
max_cost_usd: 0.10
This sends borderline pairs (score 0.75-0.95) to GPT-4o-mini for a yes/no decision. On the Abt-Buy product benchmark, this boosts precision from 35% to 95% for $0.04.
4. Clustering
Scored pairs are clustered using iterative Union-Find with path compression. Each cluster gets a confidence score (weighted combination of minimum edge, average edge, and connectivity) and a bottleneck pair (the weakest link).
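A sketch of the Union-Find core, leaving out the confidence and bottleneck bookkeeping described above:

```python
def cluster(pairs, threshold=0.85):
    """Union-Find with path compression over scored (a, b, score) pairs:
    any pair at or above the threshold links its records into one cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b, score in pairs:
        if score >= threshold:
            union(a, b)

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```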
5. Golden Records
For each cluster, GoldenMatch creates a golden record using one of 5 merge strategies: most_complete, majority_vote, source_priority, most_recent, or first_non_null.
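As an illustration, most_complete can be read as "rank records by completeness, then take the first non-null value per field." This is a sketch of that strategy under assumed dict-shaped records, not GoldenMatch's actual code; the other strategies swap out the ordering or the per-field pick.

```python
def most_complete(cluster):
    """Merge one cluster of dict records into a golden record by
    preferring values from the most complete record."""
    # Fewest missing (None) fields first
    ranked = sorted(cluster, key=lambda r: sum(v is None for v in r.values()))
    fields = {k for r in cluster for k in r}
    return {
        f: next((r.get(f) for r in ranked if r.get(f) is not None), None)
        for f in fields
    }
```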
| Records | Time | Throughput |
|---|---|---|
| 1,000 | 0.15s | 6,667 rec/s |
| 10,000 | 1.67s | 5,975 rec/s |
| 100,000 | 12.78s | 7,823 rec/s |
Bottleneck is fuzzy scoring (49% of pipeline time), followed by golden record generation (30%).
For full control, use a YAML config:
matchkeys:
- name: exact_email
type: exact
fields:
- field: email
transforms: [lowercase, strip]
- name: fuzzy_name_address
type: weighted
threshold: 0.85
fields:
- field: name
scorer: ensemble
weight: 1.0
transforms: [lowercase, strip]
- field: zip
scorer: exact
weight: 0.5
- field: phone
scorer: exact
weight: 0.3
transforms: [digits_only]
blocking:
keys:
- fields: [zip]
strategy: adaptive
max_block_size: 500
golden_rules:
default_strategy: most_complete
pip install goldenmatch
goldenmatch dedupe your_data.csv --output-all --output-dir results/
GitHub: https://github.com/benzsevern/goldenmatch
PyPI: https://pypi.org/project/goldenmatch/
792 tests, MIT license. Contributions welcome.
2026-03-22 06:52:38
An LLM router is a piece of software that directs prompts to different models. Instead of always using the same model for every request, the router sends each query to the model best suited to it. LLM routing is mostly used for 3 different purposes:
LLM routers do the routing automatically, so the user experience stays smooth and frictionless. They differ from LLM gateways, which are more about managing traffic, observability, and governance than about choosing the right model for the job.
LLM routers fall into two approaches: rule-based routers, which are programmatic, and AI-powered routers, which use a model to make the routing decision.
Rule-based routers, like Manifest, are the simplest to implement: they are fast, deterministic, and can run anywhere (client or server). AI-powered routers are more flexible and adapt to more use cases, but they cost inference and add latency to every query.
Hybrid approaches can be quite effective, but they inherit all the downsides of AI-powered routers (non-determinism, cost, latency, infrastructure), even if minimized.
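A toy rule-based router fits in a few lines. The model names and rules below are purely illustrative, not Manifest's actual tiers:

```python
def route(prompt: str) -> str:
    """Deterministic routing: cheap checks map each prompt to a model tier.
    Names and keywords are made up for illustration."""
    text = prompt.lower()
    # Keywords that signal demanding reasoning work
    if any(k in text for k in ("prove", "refactor", "architecture")):
        return "top-tier-model"
    # Short, non-question prompts: heartbeats, acks, status pings
    if len(prompt) < 80 and "?" not in text:
        return "small-local-model"
    return "mid-tier-model"
```

Because every branch is a plain string check, the decision costs microseconds, is fully reproducible, and can run client-side with no extra inference.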
By default, OpenClaw (and other autonomous agents) connects to a single model. That model handles every request, from heartbeats to genuinely demanding tasks. This makes a winning trade-off hard to reach: users either pay for a top-tier model on every request or accept lower-quality output from a cheap one.
Manifest solves that problem by giving you four tiers of complexity, each of which can be assigned a different model. It is an effective way to use routing to cut costs while preserving the quality of your OpenClaw setup. It is open source, with both a cloud and a 100% local version.
2026-03-22 06:48:11
The 2025 Decline in U.S. Solar Installations – What It Means for the Future
In early 2025, industry analysts reported a noticeable decline in U.S. solar
installations, marking the first year‑over‑year drop since the rapid expansion
of photovoltaic (PV) capacity began a decade ago. The dip coincided with a
series of public statements from former President Donald Trump that questioned
the economic viability of clean energy and suggested that federal support for
renewables should be re‑examined. While multiple factors contributed to the
slowdown, the political rhetoric amplified uncertainty among investors,
utilities, and state regulators, creating a ripple effect that slowed project
pipelines across the country.
According to the Solar Energy Industries Association (SEIA) and Wood
Mackenzie, the United States added approximately 15 gigawatts (GW) of new
solar capacity in 2025, down from 22 GW in 2024—a reduction of roughly 32
percent. Residential rooftop solar saw the steepest decline, falling from 7.5
GW to 4.8 GW, while utility‑scale projects slipped from 12.4 GW to 9.2 GW.
Commercial and industrial (C&I) installations held relatively steady,
decreasing only slightly from 2.1 GW to 1.9 GW.
These figures represent more than a statistical blip; they translate into
thousands of delayed jobs, postponed tax‑credit filings, and a temporary halt
in the momentum that had propelled solar to become the cheapest source of new
electricity in many regions.
During a series of rallies and televised interviews in late 2024 and early
2025, Trump repeatedly characterized federal clean‑energy subsidies as
"wasteful spending" and argued that the grid could rely more heavily on
natural gas and nuclear power. He specifically targeted the Investment Tax
Credit (ITC) for solar, claiming it distorted market prices and benefited
"special interests" at the expense of taxpayers.
Although the administration had already left office, the lingering influence
of his messaging resonated with certain congressional factions and state‑level
policymakers who were already skeptical of renewable mandates. In several
states, legislators introduced bills to scale back renewable portfolio
standards (RPS) or to impose additional permitting hurdles for large‑scale
solar farms.
The solar sector relies heavily on predictable policy frameworks to attract
long‑term capital. When political signals become ambiguous, lenders and equity
investors often demand higher risk premiums, which translates into more
expensive financing or, in some cases, a reluctance to fund new projects.
These financing headwinds were amplified by rising interest rates, which made
the upfront cost of solar installations less attractive relative to
fossil‑fuel alternatives.
While federal tax credits remained unchanged in 2025, the battle over clean
energy played out differently across states.
The divergent outcomes underscore the importance of state‑level policy
resilience when federal signals become volatile.
One might assume that a drop in installations would be linked to rising costs
or declining technology performance. However, data from the National Renewable
Energy Laboratory (NREL) shows that the levelized cost of electricity (LCOE)
for utility‑scale solar continued to fall, reaching an average of $0.028 per
kilowatt‑hour (kWh) in 2025—down from $0.032/kWh in 2024. Module prices
remained stable at around $0.20 per watt, and balance‑of‑system costs
benefited from supply‑chain improvements.
Therefore, the slowdown was not driven by technology or cost disadvantages but
largely by external policy and financing dynamics.
The United States has experienced occasional fluctuations in solar growth
before, most notably:
Historically, when policy uncertainty has been resolved—through either
legislative clarification or market adaptation—solar installations have
resumed their upward trajectory.
Several factors could stimulate a rebound in U.S. solar installations in 2026
and beyond:
If these levers are activated, the solar industry could return to its pre‑2025
growth rates, potentially exceeding 25 GW of annual additions by 2027.
The decline in U.S. solar installations in 2025 serves as a reminder that even
the most competitive clean‑energy technologies are not immune to the effects
of political discourse. While former President Trump’s criticisms did not
alter the underlying economics of solar power, they contributed to an
environment of uncertainty that slowed financing, delayed permitting, and
tempered enthusiasm across several states.
Nonetheless, the fundamental drivers of solar adoption—declining costs,
technological advances, and strong public support—remain intact. By
strengthening state‑level policies, leveraging corporate demand, and fostering
technological integration with storage and other renewables, the United States
can mitigate the impact of political volatility and reestablish a robust
trajectory for solar growth.
For investors, policymakers, and industry stakeholders, the 2025 experience
offers a valuable lesson: sustained advocacy and clear, predictable policy
frameworks are essential to ensure that the nation’s solar potential is fully
realized, regardless of the shifting tides of political rhetoric.
2026-03-22 06:45:49
Redis is great. But it has problems I could not ignore:
So I built MnemeCache — named after Mnemosyne,
the Greek goddess of memory.
Two types of nodes:
Core (God Node) — holds everything in RAM, serves all requests, never touches disk
Keepers — save data to disk via WAL + snapshots, push data back when Core restarts
TLS always on — auto-generated, no configuration needed
Per-request consistency:
EVENTUAL → fastest
QUORUM → majority must confirm (default)
ALL → every node must confirm
Real RBAC — admin, readwrite, readonly roles with per-database restrictions
Not production ready yet. No published benchmarks. Linux only. Custom protocol so Redis clients do not work.
I am sharing this for feedback from people who use cache systems daily.
GitHub → github.com/vusalrahimov/mnemecache
Thoughts? Leave a comment below.
2026-03-22 06:45:46
Everyone building multi-agent systems is focused on making agents smarter. Nobody talks about what happens when your agents are smart enough but your state files are three days stale.
I run 39 agents daily. The system that breaks isn't the one with dumb agents. It's the one where nobody can tell what the agents were looking at when they made their decisions. You built the agents, you defined their roles, you wired the routing. But when the system produces a result, can you trace the reasoning chain? Can you tell what Agent 3 decided, what context it received, what it chose to ignore?
Probably not. And that invisible middle is where your worst bugs live.
The first instinct is to add logging. Log every agent invocation, every tool call, every response. Some frameworks do this by default. You end up with thousands of lines per task, and the signal-to-noise ratio approaches zero.
Logging tells you what happened. Observability tells you why it happened. The difference matters because in a multi-agent system, the "what" is usually obvious. Agent A called Agent B. Agent B produced a summary. Agent C made a decision based on that summary. The "why" is where things get interesting.
Why did Agent B summarize the document that way? What context did it receive? Was there information it should have seen but didn't? When Agent C made its decision, was it responding to the actual document or to Agent B's interpretation of the document?
These questions can't be answered with log lines.
The worst failure mode in a multi-agent system isn't a crash. Crashes are loud. You notice them. You fix them. The worst failure mode is the confident wrong answer.
Agent A retrieves the right documents. Agent B summarizes them but subtly mischaracterizes one key point. Agent C makes a decision based on that summary. Agent D formats the output beautifully. The final result looks correct, reads professionally, and is wrong in a way that nobody catches until a human notices the downstream damage days later.
This failure mode exists because each agent in the chain operated correctly within its own scope. Agent B didn't fail. It summarized. The summarization just lost a critical nuance. And since nobody is watching the intermediate representations, the error propagates silently through the system.
# What most systems track
{
"agent": "summarizer",
"input_tokens": 4200,
"output_tokens": 380,
"latency_ms": 1240,
"status": "success"
}
# What observability actually requires
{
"agent": "summarizer",
"task_id": "review-q1-financials",
"input_context": {
"documents": ["q1-report.pdf", "budget-variance.csv"],
"scoped_to": ["financial_data"],
"excluded": ["employee_records"]
},
"reasoning_trace": {
"key_points_extracted": 7,
"points_included_in_summary": 5,
"points_omitted": [
"Q1 variance exceeded threshold by 12%",
"Vendor contract renewal pending"
],
"omission_reason": "below relevance threshold (0.6)"
},
"downstream_consumers": ["decision_agent", "audit_trail"],
"confidence": 0.82
}
The first record tells you the agent ran. The second tells you what it thought it was doing. That difference is the entire gap between debugging and guessing.
I've been running a 39-agent system for a few months now. Three observability layers consistently matter:
1. Context tracing
For every agent invocation, capture what context the agent received, not just what it produced. This includes scoped documents, upstream agent outputs, and any system state it had access to. When something goes wrong, the first question is always "what did this agent actually see?" Without context tracing, you're reconstructing the answer from logs and hope.
2. Decision boundaries
Agents make decisions. Summarizers decide what to include and what to omit. Routers decide which agent handles a task. Reviewers decide whether work passes or fails. For each decision point, capture the inputs to the decision, the decision itself, and the threshold or reasoning that produced it. This turns opaque agent behavior into auditable decision records.
3. Propagation tracking
When Agent B's output becomes Agent C's input, track that lineage explicitly. Not just "B ran before C," but "C's context included B's output, specifically these fields." When a confident wrong answer emerges at the end of a chain, propagation tracking lets you walk backward through the chain and find exactly where the signal degraded.
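Those two layers can be sketched as a decision record that carries its own upstream lineage, so a bad final answer can be walked backward through the chain. The field names echo the JSON sketch earlier in the post; this is an illustrative shape, not a standard schema.

```python
import time
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    """One decision-boundary event plus explicit lineage."""
    agent: str
    decision: str
    inputs: dict
    threshold: float
    upstream: list = field(default_factory=list)  # whose outputs fed this agent
    ts: float = field(default_factory=time.time)

def lineage(records, agent):
    """Walk backward from one agent through its first upstream link,
    reconstructing the chain a wrong answer traveled."""
    by_agent = {r.agent: r for r in records}
    chain, cur = [], by_agent.get(agent)
    while cur is not None:
        chain.append(cur.agent)
        cur = by_agent.get(cur.upstream[0]) if cur.upstream else None
    return chain
```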
The practical concern is always performance. Adding observability shouldn't double your latency or token costs. Three approaches that keep overhead minimal:
Structured metadata, not full traces. You don't need to capture every token. Capture the decision-relevant metadata: what context was scoped, what was included vs. excluded, what threshold was applied. This is typically 5-10% of the full trace size.
Sampling for healthy paths. Trace 100% of failures and anomalies. Sample 10-20% of successful paths. You'll catch degradation patterns without drowning in data.
Async emission. Don't block agent execution to write observability data. Emit events asynchronously to a separate store. The agent keeps working. The trace data arrives slightly behind, which is fine because you're not reading it in real time anyway. You're reading it when something goes wrong.
Before you add another agent to your system, try answering these questions about the agents you already have:
If you can't answer these, you're operating a black box. The fact that you built the box doesn't mean you can see inside it.
The pattern holds across every complex system. Capability without observability is a liability. If you can't watch your agents think, you're just waiting for the confident wrong answer to find its way to production.
I build and operate multi-agent systems daily. Writing about what breaks and what works at The Alignment Layer.
Sigil (cryptographic audit trails for AI agents): github.com/sly-the-fox/sigil