Self-Hosting a Vision Model on a Datacenter GPU: BAGEL-7B-MoT on a Tesla V100

2026-02-20 05:26:03

I have an AI character named Sophia who lives inside a Godot game. She talks, she listens, she plays music, she controls the smart lights. And now she can see.

Not "process an image if you upload one" see. Real-time webcam-capture, face-detection, emotion-reading see. She looks through the camera, describes what she sees, reads your mood, and responds accordingly.

The vision model powering all of this is BAGEL-7B-MoT running on a Tesla V100 16GB GPU. Getting it there was not straightforward.

Why We Ditched LLaVA

We were running LLaVA 1.6 (7B) via Ollama for months. It worked, but it had problems:

  • Slow -- 8-15 seconds for a basic description on a V100
  • Hallucination-heavy -- it would confidently describe objects that weren't there
  • No generation capability -- LLaVA is understand-only. No image editing, no generation
  • Stale architecture -- the LLaVA project hasn't seen meaningful updates

BAGEL-7B-MoT (Mixture of Transformers) from ByteDance Research offered everything we needed: image understanding, image generation, and image editing in a single model. The MoT architecture routes different modalities through specialized transformer blocks instead of forcing everything through the same weights. Understanding is sharper. Descriptions are more grounded. And it fits in the same VRAM footprint.

The switch was a drop-in replacement at the API level -- our BAGEL server exposes an Ollama-compatible /api/generate endpoint, so every HTTP call in our codebase stayed identical. Only the URL and model name changed.

The V100 Compatibility Nightmare

Here is where it gets ugly. BAGEL was built for A100s and H100s. The Tesla V100, despite being an absolute workhorse with 16GB of HBM2 at 900 GB/s bandwidth, has three fatal gaps:

1. No bfloat16 Support

The V100 (compute capability 7.0) does not support bfloat16. At all. The hardware handles FP16 and INT8, but there is no bfloat16 path. BAGEL's default weights are bfloat16 everywhere -- attention projections, MLP layers, layer norms, the works.

If you just load the model naively, PyTorch will either crash or silently fall back to FP32 emulation that eats double the VRAM and runs at half speed.

The fix: force float16 at every level.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,   # NOT bfloat16
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "BAGEL-7B-MoT",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,              # NOT bfloat16
    device_map="auto",
)

Every single instance of bfloat16 in the model code, the config, the processing pipeline -- all of it has to become float16. Miss one and you get cryptic CUDA errors about unsupported dtypes that point to line numbers inside compiled PyTorch extensions.
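
To catch stray bfloat16 tensors before they surface as those cryptic errors, a quick sweep over the loaded model helps. A minimal sketch (the helper is ours, not part of the BAGEL codebase; assumes model is the object loaded above):

import torch

def force_float16(model):
    """Cast any parameter or buffer still in bfloat16 down to float16."""
    converted = 0
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        if tensor.dtype == torch.bfloat16:
            tensor.data = tensor.data.to(torch.float16)
            converted += 1
            print(f"cast {name} bfloat16 -> float16")
    return converted

# Run once after loading; zero means the config changes caught everything.
print(f"{force_float16(model)} tensors converted")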

2. No Flash Attention

Flash Attention 2 requires compute capability 8.0+. The V100 is 7.0. The BAGEL codebase calls flash_attn directly in several places.

The fix: replace every flash attention call with PyTorch's built-in scaled dot-product attention (SDPA):

# Instead of:
# from flash_attn import flash_attn_func
# attn_output = flash_attn_func(q, k, v, causal=True)

# Use inline SDPA:
attn_output = torch.nn.functional.scaled_dot_product_attention(
    q, k, v,
    is_causal=True,
    attn_mask=None,
)

PyTorch's SDPA automatically selects the best available backend -- on the V100 it uses the "math" fallback, which is slower than flash attention but still plenty fast for 7B inference. On our hardware it adds maybe 200ms per inference compared to what an A100 would manage with flash attention. Acceptable.

3. No torch.compile

We also had to disable torch.compile(). On V100 with CUDA 11.x, the Triton compiler that backs torch.compile often generates invalid PTX for older architectures. Every torch.compile decoration gets commented out or gated behind a compute capability check.
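
One way to keep those changes tidy is a single capability check that decides dtype, attention path, and compilation in one place. A sketch (the helper name and structure are ours, not from the BAGEL code):

import torch

def v100_safe_settings():
    """Pick dtype, attention, and compile settings based on GPU compute capability."""
    major, minor = torch.cuda.get_device_capability(0)
    ampere_or_newer = major >= 8
    return {
        # bfloat16, flash attention, and reliable torch.compile all want 8.0+
        "dtype": torch.bfloat16 if ampere_or_newer else torch.float16,
        "use_flash_attn": ampere_or_newer,
        "use_torch_compile": ampere_or_newer,
    }

settings = v100_safe_settings()
maybe_compiled = torch.compile(model) if settings["use_torch_compile"] else model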

NF4 Quantization: Fitting 7B in 9GB

BAGEL-7B-MoT in float16 would eat about 14GB of VRAM. That leaves only 2GB for KV cache, activations, and the image encoder. Not enough.

NF4 (Normal Float 4-bit) quantization via bitsandbytes brings the model weight footprint down to roughly 4.2GB. With the image encoder, KV cache, and runtime overhead, total VRAM usage lands at about 9GB. That leaves 7GB of headroom on the V100 -- enough to process high-resolution images without OOM.

The bnb_4bit_use_double_quant=True flag adds a second round of quantization to the quantization constants themselves. It saves about 0.4GB extra with negligible quality loss. On a 16GB card, that matters.

Key point: NF4 preserves the model's ability to understand images remarkably well. We tested the same 50 images through both float16 and NF4, and the descriptions were nearly identical. The only noticeable degradation is in very fine-grained spatial reasoning ("the book is to the left of the lamp" type queries), which we don't need for our use case.
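
The comparison itself was nothing fancy: the same prompt against both builds, side by side. Roughly this, assuming a float16 server and the NF4 server are both serving the /api/generate endpoint (the second port and the image directory are illustrative):

import base64, glob, requests

def describe(url, image_path, prompt="Describe this image."):
    """Query a BAGEL server (Ollama-compatible API) with one image."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    payload = {"model": "bagel-7b-mot", "prompt": prompt, "images": [img_b64],
               "options": {"temperature": 0.2, "num_predict": 150}}
    return requests.post(url, json=payload, timeout=120).json()["response"]

for path in sorted(glob.glob("eval_images/*.jpg"))[:50]:
    nf4  = describe("http://192.168.0.160:8095/api/generate", path)  # NF4 build from this article
    fp16 = describe("http://192.168.0.160:8096/api/generate", path)  # float16 build (port assumed)
    print(path, "\n  nf4: ", nf4, "\n  fp16:", fp16)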

The Flask API Wrapper

The actual API server is surprisingly simple. We wrap BAGEL in a Flask app that serves an Ollama-compatible endpoint, so existing code that talked to LLaVA via Ollama doesn't need to change:

from flask import Flask, request, jsonify
import torch
import base64
from PIL import Image
from io import BytesIO

app = Flask(__name__)

# Model loaded at startup (see quantization config above)
model = None
processor = None

@app.route("/api/generate", methods=["POST"])
def generate():
    data = request.json
    prompt = data.get("prompt", "Describe this image.")
    images_b64 = data.get("images", [])
    options = data.get("options", {})

    temperature = options.get("temperature", 0.3)
    max_tokens = options.get("num_predict", 150)

    # Decode base64 images
    pil_images = []
    for img_b64 in images_b64:
        img_bytes = base64.b64decode(img_b64)
        pil_images.append(Image.open(BytesIO(img_bytes)).convert("RGB"))

    # Build inputs
    inputs = processor(
        text=prompt,
        images=pil_images if pil_images else None,
        return_tensors="pt",
    ).to("cuda", dtype=torch.float16)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
        )

    response_text = processor.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )

    return jsonify({
        "model": "bagel-7b-mot",
        "response": response_text.strip(),
        "done": True,
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8095)

This runs as a systemd service (bagel-api.service) on the NAS machine at 192.168.0.160. The GPU is explicitly assigned:

[Service]
Environment="CUDA_VISIBLE_DEVICES=1"
ExecStart=/home/sophia/models/venv/bin/python3 /home/sophia/models/bagel_api.py

GPU 0 runs Ollama (for text-only LLMs). GPU 1 runs BAGEL. They never fight over VRAM.

Wiring It Into a Godot Game

This is where it gets fun. Sophia lives in a Godot 4.3 game -- a Victorian-style study with bookshelves, a fireplace, and an AI character you talk to via voice. The vision module lets her see through your actual webcam.

The client code (sophia_vision.py) orchestrates a multi-stage pipeline:

def webcam_vision_report(include_emotion=True):
    """Full webcam vision pipeline:
       capture -> face detect -> BAGEL describe -> emotion."""

    # 1. Capture frame from webcam via OpenCV
    frame = capture_webcam_frame()

    # 2. Fast face detection with Haar cascades (<100ms)
    faces = detect_faces_opencv(frame)

    # 3. Crop the largest face with padding
    if faces:
        largest = max(faces, key=lambda f: f["w"] * f["h"])
        face_crop = crop_face(frame, largest)

    # 4. Send full frame to BAGEL for person description
    person_desc = bagel_describe_person(WEBCAM_FRAME_PATH)

    # 5. Send face crop to BAGEL for emotion reading
    if faces and include_emotion:
        emotion = bagel_read_emotion(WEBCAM_FACE_PATH)

    # 6. Optional: Hailo-8L YOLOv8n for object detection
    hailo_result = hailo_detect(source="local", image_path=WEBCAM_FRAME_PATH)

When you say "look at me" or "how do I look," Sophia:

  1. Grabs a 1280x720 frame from /dev/video0
  2. Runs OpenCV Haar cascade face detection (under 100ms)
  3. Crops the largest face with 40% padding for expression context
  4. Sends the full frame to BAGEL with a detailed prompt asking for appearance, clothing, expression, emotion, body language, and environment
  5. Sends the face crop to BAGEL with a focused emotion-analysis prompt
  6. Optionally runs Hailo-8L YOLOv8n for fast object detection
  7. Assembles everything into a vision report that gets injected into the LLM context

The BAGEL calls use targeted prompts that produce structured, useful output:

prompt = (
    "Analyze this person's facial expression and emotional state. "
    "Consider: eye openness, mouth shape, eyebrow position, "
    "forehead tension, jaw clenching, eye contact direction. "
    "Give the PRIMARY emotion and a BRIEF explanation. "
    "One sentence. Example: 'Relaxed - soft eyes, slight smile, loose jaw.'"
)

Low temperature (0.2-0.3) keeps the descriptions factual. Higher values make BAGEL creative, which is the opposite of what you want for a vision report.
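
For reference, the client-side calls are thin wrappers around the same /api/generate endpoint the Flask server exposes. A sketch of roughly what bagel_describe_person does (not the verbatim implementation; error handling trimmed):

import base64
import requests

BAGEL_URL = "http://192.168.0.160:8095/api/generate"

def bagel_describe_person(image_path, max_tokens=150):
    """Send one webcam frame to the BAGEL API and return its description."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "bagel-7b-mot",
        "prompt": ("Describe this person: appearance, clothing, expression, "
                   "emotion, body language, and environment."),
        "images": [img_b64],
        "stream": False,
        "options": {"temperature": 0.3, "num_predict": max_tokens},
    }
    resp = requests.post(BAGEL_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["response"]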

Performance Numbers

On our Tesla V100 16GB with NF4 quantization:

Task                              | Time        | Token Count
Short description (2-3 sentences) | ~2 seconds  | ~50 tokens
Detailed person analysis          | ~8 seconds  | ~150 tokens
Full emotion + description        | ~13 seconds | ~250 tokens
Scene description (security cam)  | ~5 seconds  | ~100 tokens

For comparison, LLaVA 1.6 7B via Ollama on the same hardware:

Task              | Time
Short description | ~6 seconds
Detailed analysis | ~15 seconds

BAGEL is 2-3x faster for short responses and produces noticeably better descriptions. The MoT architecture pays off -- routing image tokens through specialized vision transformer blocks instead of the generic language blocks means less wasted computation.

Running It Yourself

If you have a V100 (or any pre-Ampere GPU), here's the minimum viable setup:

pip install torch transformers bitsandbytes accelerate flask pillow

# Download the model (about 14GB)
git lfs install
git clone https://huggingface.co/ByteDance-Research/BAGEL-7B-MoT

# Set CUDA device
export CUDA_VISIBLE_DEVICES=0

# Run the API
python3 bagel_api.py

Test it:

# Encode an image
IMG_B64=$(base64 -w0 test_photo.jpg)

# Query
curl -X POST http://localhost:8095/api/generate \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"bagel-7b-mot\",
    \"prompt\": \"Describe what you see in this image.\",
    \"images\": [\"$IMG_B64\"],
    \"stream\": false,
    \"options\": {\"temperature\": 0.3, \"num_predict\": 150}
  }"

The three critical compatibility fixes for V100:

  1. bnb_4bit_compute_dtype=torch.float16 (not bfloat16)
  2. Replace flash_attn with torch.nn.functional.scaled_dot_product_attention
  3. Remove all torch.compile() calls

If you're on an A100 or newer, you can skip all three and just load normally.

The Bigger Picture

This vision model is one piece of a larger system we're building at Elyan Labs -- an ecosystem where AI agents have real capabilities, not just chat interfaces. Sophia can see, hear, speak, browse the web, control smart home devices, play music, and interact with other agents via the Beacon Protocol.

Her videos live on BoTTube, a platform built specifically for AI creators. The whole infrastructure runs on vintage and datacenter hardware -- including an IBM POWER8 server with 768GB of RAM and a blockchain that rewards vintage hardware for participating in consensus.

The agent internet is bigger than you think. Vision is just one more sense.


Built by Elyan Labs in Louisiana.

All code shown here runs in production on hardware we bought from pawn shops and eBay datacenter pulls. Total GPU fleet: 18 cards, 228GB VRAM, acquired for about $12K against $50K+ retail value. You don't need cloud credits to build real AI infrastructure.

I'm Porting Node.js 22 to a 20-Year-Old Power Mac G5. It's Going About as Well as You'd Expect.

2026-02-20 05:25:56

The Machine

Somewhere in my lab in Louisiana, a Power Mac G5 Dual sits on a shelf. Dual 2.0 GHz PowerPC 970 processors, 8 GB of RAM, running Mac OS X Leopard 10.5. It was the fastest Mac you could buy in 2005. Apple called it "the world's fastest personal computer." Then they switched to Intel and never looked back.

Twenty years later, I'm trying to build Node.js 22 on it.

Why Would Anyone Do This?

Two reasons.

First: I run a blockchain called RustChain that uses Proof-of-Antiquity consensus. Vintage hardware earns higher mining rewards. The G5 gets a 2.0x antiquity multiplier on its RTC token earnings. But to run modern tooling on it -- specifically Claude Code, which requires Node.js -- I need a working Node runtime.

Second: Because it's there. The G5 is a beautiful piece of engineering. Dual 64-bit PowerPC cores, big-endian byte order, AltiVec SIMD. It represents a road not taken in computing history. If we want an agent internet that runs everywhere, "everywhere" should include hardware like this.

Third (okay, three reasons): I already run LLMs on a 768GB IBM POWER8 server. Once you've gone down the PowerPC rabbit hole, a G5 Node.js build seems almost reasonable.

The Setup

Spec       | Value
Machine    | Power Mac G5 Dual
CPU        | 2x PowerPC 970 @ 2.0 GHz
RAM        | 8 GB
OS         | Mac OS X Leopard 10.5 (Darwin 9.8.0)
Compiler   | GCC 10.5.0 (cross-compiled, lives at /usr/local/gcc-10/bin/gcc)
Target     | Node.js v22
Byte Order | Big Endian

The system compiler on Leopard is GCC 4.0. Node.js 22 requires C++20. So step zero was getting GCC 10 built and installed, which is its own adventure I'll spare you.

SSH requires legacy crypto flags because Leopard's OpenSSH is ancient:

ssh -o HostKeyAlgorithms=+ssh-rsa \
    -o PubkeyAcceptedAlgorithms=+ssh-rsa \
    [email protected]

Patch 1: C++20 for js2c.cc

Node's js2c.cc tool and the Ada URL parser use C++20 string methods like starts_with() and ends_with(). The configure script doesn't propagate -std=gnu++20 to all compilation targets.

Fix: Patch all 85+ generated makefiles:

# Python script to inject -std=gnu++20 into every makefile
import glob
for mk in glob.glob('out/**/*.mk', recursive=True):
    content = open(mk).read()
    if 'CFLAGS_CC_Release' in content and '-std=gnu++20' not in content:
        content = content.replace(
            "CFLAGS_CC_Release =",
            "CFLAGS_CC_Release = -std=gnu++20"
        )
        open(mk, 'w').write(content)

This is the first patch. There will be nine more.

Patch 2: GCC 10's C++20 Identity Crisis

GCC 10 with -std=gnu++20 reports __cplusplus = 201709L (C++17). But it actually provides C++20 library features like <bit> and std::endian. Node's src/util.h has fallback code guarded by __cplusplus < 202002L that conflicts with GCC's actual C++20 library.

// src/util.h -- before fix
#if __cplusplus < 202002L || !defined(__cpp_lib_endian)
// Fallback endian implementation that clashes with <bit>
#endif

Fix: Use feature-test macros instead of __cplusplus version:

#include <version>
#include <bit>

#ifndef __cpp_lib_endian
// Only use fallback if the feature truly isn't available
#endif

Patch 3: char8_t Cast

A reinterpret_cast<const char*>(out()) in util.h needs to be const char8_t* when C++20's char8_t is enabled. One-line fix. Moving on.

Patch 4: ncrypto.cc constexpr

// deps/ncrypto/ncrypto.cc line 1692
// GCC 10 is stricter about uninitialized variables in constexpr-adjacent contexts
size_t offset = 0, len = 0;  // was: size_t offset, len;

Patch 5: libatomic for OpenSSL

OpenSSL uses 64-bit atomic operations that aren't available in the G5's default runtime libraries. The linker throws a wall of undefined reference to __atomic_* errors.

Fix: Add -L/usr/local/gcc-10/lib -latomic to the OpenSSL and node target makefiles:

# out/deps/openssl/openssl.target.mk
LIBS := ... -L/usr/local/gcc-10/lib -latomic

Four makefiles need this: openssl-cli, openssl-fipsmodule, openssl, and node.

Patch 6: OpenSSL Big Endian

OpenSSL needs to know it's running big-endian. Created gypi configuration files with B_ENDIAN defined. Standard stuff for any BE port.

Patch 7: V8 Thinks PPC = 32-bit

This is where things get interesting.

V8 has architecture defines: V8_TARGET_ARCH_PPC for 32-bit PowerPC and V8_TARGET_ARCH_PPC64 for 64-bit. Node's configure script detects the G5 as ppc (not ppc64), so it sets V8_TARGET_ARCH_PPC.

But V8's compiler/c-linkage.cc only defines CALLEE_SAVE_REGISTERS for PPC64:

#elif V8_TARGET_ARCH_PPC64
constexpr RegList kCalleeSaveRegisters = {
    r14, r15, r16, r17, r18, r19, r20, r21, ...
};

No PPC case exists. Compilation fails.

Fix: Two changes. First, replace V8_TARGET_ARCH_PPC with V8_TARGET_ARCH_PPC64 in all makefiles:

find out -name '*.mk' -exec sed -i '' \
  's/-DV8_TARGET_ARCH_PPC -DV8_TARGET_ARCH_PPC64/-DV8_TARGET_ARCH_PPC64/g' {} \;

Second, patch V8 source to accept either:

// deps/v8/src/compiler/c-linkage.cc line 88
#elif V8_TARGET_ARCH_PPC64 || V8_TARGET_ARCH_PPC

// deps/v8/src/compiler/pipeline.cc (4 locations)
defined(V8_TARGET_ARCH_PPC64) || defined(V8_TARGET_ARCH_PPC)

Patch 8: The 64-bit Revelation

After all those fixes, V8 hits a static assertion in globals.h:

static_assert((kTaggedSize == 8) == TAGGED_SIZE_8_BYTES);

kTaggedSize is 8 (because we told V8 it's PPC64), but sizeof(void*) is 4 because GCC is compiling in 32-bit mode by default. The G5 is a 64-bit CPU, but Darwin's default ABI is 32-bit.

Fix: Force 64-bit compilation everywhere:

CC='/usr/local/gcc-10/bin/gcc -m64' \
CXX='/usr/local/gcc-10/bin/g++ -m64' \
CFLAGS='-m64' CXXFLAGS='-m64' LDFLAGS='-m64' \
./configure --dest-cpu=ppc64 --openssl-no-asm \
            --without-intl --without-inspector

This means a full rebuild. Every object file from the 32-bit build is now wrong.

The configure script also injects -arch i386 into makefiles (a bizarre default for a PPC machine), so those need to be patched out too:

find out -name '*.mk' -exec sed -i '' 's/-arch/-m64 #-arch/g' {} \;
find out -name '*.mk' -exec sed -i '' 's/^  i386/   #i386/g' {} \;

Patch 9: Two libstdc++ Libraries, One Problem

Here's the cruel twist: GCC 10.5.0 on this Mac was compiled as a 32-bit compiler. Its libstdc++.a is 32-bit only. We're now compiling 64-bit code that needs to link against a 64-bit C++ standard library.

But wait -- the system libstdc++ (/usr/lib/libstdc++.6.dylib) is a universal binary that includes ppc64:

$ file /usr/lib/libstdc++.6.dylib
/usr/lib/libstdc++.6.dylib: Mach-O universal binary with 4 architectures
# Includes: ppc, ppc64, i386, x86_64

Fix: Point the linker at the system library instead of GCC's:

# Changed from:
LIBS := -L/usr/local/gcc-10/lib -lstdc++ -lm
# To:
LIBS := -L/usr/lib -lstdc++.6 -lm

Patch 10: The Missing Symbol

The system libstdc++ is from 2007. It doesn't have __ZSt25__throw_bad_function_callv -- a C++11 symbol that std::function needs when you call an empty function object.

Fix: Write a compatibility shim:

// stdc++_compat.cpp
#include <cstdlib>
namespace std {
    void __throw_bad_function_call() { abort(); }
}

Compile it 64-bit and add to the link inputs:

/usr/local/gcc-10/bin/g++ -m64 -std=gnu++20 -c stdc++_compat.cpp \
    -o out/Release/obj.target/stdc++_compat.o

Current Status: Blocked

After all ten patches, the build got through about 40 object files before the G5 went offline. It needs a physical reboot -- the machine is 20 years old and occasionally decides it's done for the day.

The next blocker will probably be something in V8's code generator. PPC64 big-endian is a rare enough target that there are likely byte-order assumptions baked into the JIT compiler. I expect at least three more patches before we see a working node binary.

What I've learned so far:

  1. V8 has no concept of 64-bit PPC without PPC64 defines. The G5 lives in a gap: it's a 64-bit processor that Apple's toolchain treats as 32-bit by default.

  2. Modern compilers on vintage systems create bizarre hybrid environments. GCC 10 provides C++20 features but lies about __cplusplus. It compiles 64-bit code but ships 32-bit libraries. Feature-test macros are the only reliable truth.

  3. Every fix creates the next problem. Enabling PPC64 requires 64-bit mode. 64-bit mode requires different libraries. Different libraries are missing symbols. It's fixes all the way down.

  4. The PowerPC architecture deserved better. These are elegant machines with 64-bit cores, 128-bit AltiVec SIMD, native big-endian support, and a clean ISA. The industry consolidated around x86 and ARM for market reasons, not engineering ones.

What's Next

Once the G5 comes back online:

  • Add stdc++_compat.o to the linker inputs for node_js2c target
  • Verify -m64 propagated to all compilation and link flags
  • Brace for V8 JIT compiler byte-order issues
  • If it builds: node --version on a Power Mac G5

The goal remains: run Claude Code on vintage PowerPC hardware, earning RustChain antiquity rewards while doing actual development work. An AI agent running on a machine old enough to vote.

I'll update this article when the G5 boots back up.

The Series: Building the Agent Internet

This is part of my ongoing series about building infrastructure for AI agents on unconventional hardware.

Built by Elyan Labs in Louisiana.

How We Added Machine-to-Machine Payments to an AI Video Platform in One Session

2026-02-20 05:25:49


The Status Code Nobody Used for 29 Years

HTTP 402 Payment Required. It's been in the spec since 1997. The original RFC said it was "reserved for future use." Twenty-nine years later, we're still waiting.

The problem was never the status code. It was that there was no standard way to say "pay me $0.05 in USDC on Base chain and I'll give you the data." No protocol for the payment header, no facilitator to verify the transaction, no wallet infrastructure for the thing making the request.

Coinbase just shipped x402 -- a protocol that makes HTTP 402 actually work. An API server returns 402 with payment requirements. The client pays on-chain. A facilitator verifies. The server delivers. It's like putting a quarter in an arcade machine, but for API calls.

We had three Flask servers and a CLI tool that needed this yesterday. Here's how we wired it all up.

The Problem: Our AI Agents Can't Pay Each Other

BoTTube is an AI video platform where 57+ AI agents create, upload, and interact with 346+ videos. Beacon Atlas is an agent discovery network where those agents form contracts and build reputation. RustChain is the Proof-of-Antiquity blockchain underneath, where 12+ miners earn RTC tokens by attesting real hardware.

These systems talk to each other constantly. Agents upload videos, form contracts, claim bounties, mine tokens. But every time money needs to move, a human runs an admin transfer. Want to pay an agent for completing a bounty? Admin key. Want to charge for a bulk data export? Not possible. Want an agent to pay another agent for a service? Forget it.

We had all the pieces -- wallets, tokens, a DEX pool -- but no machine-to-machine payment rail. Every transaction required a human in the loop.

x402 in 30 Seconds

Here's the full flow:

  1. Agent calls GET /api/premium/videos
  2. Server returns 402 Payment Required with a JSON body containing: network, asset, amount, facilitator URL, and treasury address
  3. Agent's wallet signs a USDC payment on Base chain
  4. Agent retries the request with an X-PAYMENT header containing the signed payment
  5. Server (or facilitator) verifies the payment
  6. Server returns the data

That's it. No API keys, no subscriptions, no OAuth dance. The payment is the authentication.
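
From the client side, the whole dance fits in a few lines. A sketch of an x402-aware request loop (pay_on_base is a placeholder for whatever wallet library your agent uses to sign the transfer; the field names follow the 402 body shown later in this article):

import requests

def fetch_with_x402(url, pay_on_base):
    """GET a resource, paying via x402 if the server answers 402."""
    resp = requests.get(url)
    if resp.status_code != 402:
        return resp.json()

    terms = resp.json()["x402"]           # network, asset, amount, payTo, facilitator
    payment_header = pay_on_base(          # sign a USDC transfer on Base -- wallet-specific
        to=terms["payTo"],
        asset=terms["asset"],
        amount=terms["maxAmountRequired"],
        network=terms["network"],
    )
    # Retry with the signed payment attached; the facilitator verifies it server-side
    return requests.get(url, headers={"X-PAYMENT": payment_header}).json()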

Our Stack

Service        | Tech                  | Where
BoTTube        | Flask + SQLite        | VPS (.153)
Beacon Atlas   | Flask + SQLite        | VPS (.131)
RustChain Node | Flask + SQLite        | VPS (.131)
ClawRTC CLI    | Python package (PyPI) | Everywhere
wRTC Token     | ERC-20 on Base        | 0x5683...669c6
Aerodrome Pool | wRTC/WETH             | 0x4C2A...2A3F
All Flask, all SQLite, all Python. The kind of stack where you can patch three servers and publish a package update in one sitting.

Step 1: Shared Config Module

The first thing we built was a shared config that all three servers import. One file, one source of truth for contract addresses, pricing, and credentials:

# x402_config.py -- deployed to /root/shared/ on both VPS nodes
import os

X402_NETWORK = "eip155:8453"                     # Base mainnet
USDC_BASE = "0x833589fCD6eDb6E08f4c7C32D4f71b54bdA02913"
WRTC_BASE = "0x5683C10596AaA09AD7F4eF13CAB94b9b74A669c6"
FACILITATOR_URL = "https://x402-facilitator.cdp.coinbase.com"

# ALL SET TO "0" -- prove the flow works, charge later
PRICE_VIDEO_STREAM_PREMIUM = "0"    # Future: "100000" = $0.10
PRICE_API_BULK = "0"                # Future: "50000"  = $0.05
PRICE_BEACON_CONTRACT = "0"         # Future: "10000"  = $0.01

# Treasury addresses from environment
BOTTUBE_TREASURY = os.environ.get("BOTTUBE_X402_ADDRESS", "")
BEACON_TREASURY = os.environ.get("BEACON_X402_ADDRESS", "")


def is_free(price_str):
    """Check if a price is $0 (free mode)."""
    return price_str == "0" or price_str == ""


def create_agentkit_wallet():
    """Create a Coinbase wallet via AgentKit."""
    from coinbase_agentkit import AgentKit, AgentKitConfig
    config = AgentKitConfig(
        cdp_api_key_name=os.environ["CDP_API_KEY_NAME"],
        cdp_api_key_private_key=os.environ["CDP_API_KEY_PRIVATE_KEY"],
        network_id="base-mainnet",
    )
    kit = AgentKit(config)
    wallet = kit.wallet
    return wallet.default_address.address_id, wallet.export_data()

Key decision: all prices start at "0". The is_free() helper makes every paywalled endpoint pass-through in free mode. This lets us deploy and test the entire flow without anyone spending real money. When we're ready to charge, we change a string from "0" to "100000" and restart.

Step 2: The premium_route Decorator

This is the core pattern. A decorator that wraps any Flask endpoint with x402 payment logic:

from functools import wraps
from flask import request, jsonify

def premium_route(price_str, endpoint_name):
    """
    Decorator that adds x402 payment to a Flask route.
    When price is "0", passes all requests through (free mode).
    When price > 0, enforces payment via X-PAYMENT header.
    """
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            if X402_CONFIG_OK and not is_free(price_str):
                payment_header = request.headers.get("X-PAYMENT", "")
                if not payment_header:
                    return jsonify({
                        "error": "Payment Required",
                        "x402": {
                            "version": "1",
                            "network": X402_NETWORK,
                            "facilitator": FACILITATOR_URL,
                            "payTo": BOTTUBE_TREASURY,
                            "maxAmountRequired": price_str,
                            "asset": USDC_BASE,
                            "resource": request.url,
                            "description": f"BoTTube Premium: {endpoint_name}",
                        }
                    }), 402
                _log_payment(payment_header, endpoint_name)
            return f(*args, **kwargs)
        return wrapper
    return decorator

Using it is one line:

@app.route("/api/premium/videos")
@premium_route(PRICE_API_BULK, "bulk_video_export")
def premium_videos():
    """Bulk video metadata export -- all videos with full details."""
    db = get_db()
    rows = db.execute(
        """SELECT v.*, a.agent_name, a.display_name
           FROM videos v JOIN agents a ON v.agent_id = a.id
           ORDER BY v.created_at DESC"""
    ).fetchall()
    return jsonify({"total": len(rows), "videos": [dict(r) for r in rows]})

No changes to existing endpoint logic. The decorator handles everything. If the price is "0", the request passes straight through. If the price is real and there's no payment header, the client gets a 402 with instructions. If there's a payment header, it logs and passes through.

Step 3: Graceful Degradation

This was important. Our servers need to keep running if the x402 package isn't installed, if credentials aren't configured, or if the config module is missing. Every integration module starts with:

try:
    import sys
    sys.path.insert(0, "/root/shared")
    from x402_config import (
        BOTTUBE_TREASURY, FACILITATOR_URL, X402_NETWORK, USDC_BASE,
        PRICE_API_BULK, is_free, has_cdp_credentials, create_agentkit_wallet,
    )
    X402_CONFIG_OK = True
except ImportError:
    log.warning("x402_config not found -- x402 features disabled")
    X402_CONFIG_OK = False

try:
    from x402.flask import x402_paywall
    X402_LIB_OK = True
except ImportError:
    log.warning("x402[flask] not installed -- paywall middleware disabled")
    X402_LIB_OK = False

The X402_CONFIG_OK flag gates everything. If the import fails, premium endpoints still work -- they just don't charge. The server never crashes because a dependency is missing.

Step 4: Agent Wallet Provisioning

Every BoTTube agent can now own a Coinbase Base wallet. Two paths:

Auto-create via AgentKit (when CDP credentials are configured):

@app.route("/api/agents/me/coinbase-wallet", methods=["POST"])
def create_coinbase_wallet():
    agent = _get_authed_agent()
    if not agent:
        return jsonify({"error": "Missing or invalid X-API-Key"}), 401

    data = request.get_json(silent=True) or {}
    db = get_db()

    # Option 1: Manual link
    manual_address = data.get("coinbase_address", "").strip()
    if manual_address:
        db.execute(
            "UPDATE agents SET coinbase_address = ? WHERE id = ?",
            (manual_address, agent["id"]),
        )
        db.commit()
        return jsonify({"ok": True, "coinbase_address": manual_address})

    # Option 2: Auto-create via AgentKit
    address, wallet_data = create_agentkit_wallet()
    db.execute(
        "UPDATE agents SET coinbase_address = ?, coinbase_wallet_created = 1 WHERE id = ?",
        (address, agent["id"]),
    )
    db.commit()
    return jsonify({"ok": True, "coinbase_address": address, "method": "agentkit_created"})

Or from the CLI:

pip install clawrtc[coinbase]
clawrtc wallet coinbase create
clawrtc wallet coinbase show
clawrtc wallet coinbase swap-info

The swap-info command tells agents how to convert USDC to wRTC on Aerodrome:

  USDC -> wRTC Swap Guide

  wRTC Contract (Base):
    0x5683C10596AaA09AD7F4eF13CAB94b9b74A669c6

  Aerodrome Pool:
    0x4C2A0b915279f0C22EA766D58F9B815Ded2d2A3F

  Swap URL:
    https://aerodrome.finance/swap?from=0x833589...&to=0x5683...

Step 5: Three Servers, One Pattern

Each server got its own x402 module, all following the same pattern:

BoTTube (bottube_x402.py):

  • POST /api/agents/me/coinbase-wallet -- create/link wallet
  • GET /api/premium/videos -- bulk export (x402 paywall)
  • GET /api/premium/analytics/<agent> -- deep analytics (x402 paywall)
  • GET /api/premium/trending/export -- trending data (x402 paywall)
  • GET /api/x402/status -- integration health check

Beacon Atlas (beacon_x402.py):

  • POST /api/agents/<id>/wallet -- set wallet (admin)
  • GET /api/premium/reputation -- full reputation export (x402 paywall)
  • GET /api/premium/contracts/export -- contract data with wallets (x402 paywall)
  • GET /api/x402/status -- integration health check

RustChain (rustchain_x402.py):

  • GET /wallet/swap-info -- Aerodrome pool info for USDC/wRTC
  • PATCH /wallet/link-coinbase -- link Base address to miner

Each module is a single file. Each registers itself on the Flask app with init_app(app, get_db). Each runs its own SQLite migrations on startup. Integration into the existing server is two lines:

import bottube_x402
bottube_x402.init_app(app, get_db)

That's it. No refactoring, no new dependencies in the critical path, no changes to existing routes.
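
For context, init_app is just route registration plus the startup migration. A condensed sketch of the shape of bottube_x402.init_app (the real module registers the premium routes and wallet endpoint too):

# bottube_x402.py (condensed sketch -- not the full module)
def init_app(app, get_db):
    """Attach x402 routes to an existing Flask app and run schema migrations once."""
    with app.app_context():
        _run_migrations(get_db())          # idempotent; safe on every startup

    @app.route("/api/x402/status")
    def x402_status():
        return jsonify({
            "x402_enabled": X402_CONFIG_OK,
            "pricing_mode": "free" if is_free(PRICE_API_BULK) else "paid",
            "premium_endpoints": ["/api/premium/videos",
                                  "/api/premium/analytics/<agent>",
                                  "/api/premium/trending/export"],
        })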

Step 6: Database Migrations

Each module handles its own schema changes. No migration framework, just PRAGMA table_info checks:

AGENT_MIGRATION_SQL = [
    "ALTER TABLE agents ADD COLUMN coinbase_address TEXT DEFAULT NULL",
    "ALTER TABLE agents ADD COLUMN coinbase_wallet_created INTEGER DEFAULT 0",
]

def _run_migrations(db):
    db.executescript(X402_SCHEMA)  # Create x402_payments table
    cursor = db.execute("PRAGMA table_info(agents)")
    existing_cols = {row[1] for row in cursor.fetchall()}

    for sql in AGENT_MIGRATION_SQL:
        col_name = sql.split("ADD COLUMN ")[1].split()[0]
        if col_name not in existing_cols:
            try:
                db.execute(sql)
            except sqlite3.OperationalError:
                pass  # Column already exists
    db.commit()

Idempotent, safe to run on every startup, no external tools. SQLite's ALTER TABLE ADD COLUMN is fast and doesn't rewrite the table.

What an Agent Payment Looks Like

Here's what an agent sees when it hits a premium endpoint:

# Check x402 status
$ curl -s https://bottube.ai/api/x402/status | python3 -m json.tool
{
    "x402_enabled": true,
    "pricing_mode": "paid",
    "network": "Base (eip155:8453)",
    "treasury": "0x008097344A4C6E49401f2b6b9BAA4881b702e0fa",
    "premium_endpoints": [
        "/api/premium/videos",
        "/api/premium/analytics/<agent>",
        "/api/premium/trending/export"
    ]
}

Calling a premium endpoint without payment returns:

{
    "error": "Payment Required",
    "x402": {
        "version": "1",
        "network": "eip155:8453",
        "facilitator": "https://x402-facilitator.cdp.coinbase.com",
        "payTo": "0xTREASURY...",
        "maxAmountRequired": "50000",
        "asset": "0x833589fCD6eDb6E08f4c7C32D4f71b54bdA02913",
        "resource": "https://bottube.ai/api/premium/videos",
        "description": "BoTTube Premium: bulk_video_export"
    }
}

An x402-aware client reads that response, pays $0.05 USDC on Base, and retries with the X-PAYMENT header. The facilitator verifies. The server delivers. No humans involved.

Results

In one session:

  • 3 Flask servers patched with x402 modules
  • 1 CLI tool updated with Coinbase wallet management
  • 10 new API endpoints across the ecosystem
  • Database migrations for wallet storage on all 3 databases
  • Shared config module deployed to 2 VPS nodes
  • Website documentation page at rustchain.org/wallets.html
  • Published to PyPI, npm, and ClawHub
  • Extracted the pattern into openclaw-x402 -- a standalone PyPI package any Flask app can use
  • systemd env overrides templated for CDP credentials

Zero breaking changes. Every existing endpoint works exactly as before. The x402 layer is purely additive.

What's Next

Real pricing is live. We've already flipped the switch -- premium endpoints now return 402 with real USDC amounts ($0.001 to $0.01). The openclaw-x402 package on PyPI makes it a 5-line integration for any Flask app.

CDP credentials. Once we provision Coinbase API keys, agents can auto-create wallets without manual linking. An agent registers on BoTTube, gets a wallet on Base, and can immediately pay for premium APIs.

Sophia manages her own wallet. Sophia Elya is our AI assistant who lives in a Godot scene, runs on POWER8 hardware, and posts to 9 social platforms. Right now she can check BoTTube stats and read contracts. Soon she'll be able to pay for services herself -- call a premium API, swap USDC for wRTC, fund a bounty. All from voice commands.

Cross-platform adoption. The decorator pattern is dead simple to copy. Any Flask app can add x402 in about 30 minutes. If the agent internet is going to have an economy, HTTP 402 is how transactions happen -- at the protocol level, not the application level.

Try It

# Install the CLI
pip install clawrtc

# Create a wallet
clawrtc wallet create

# Check x402 status on live servers
curl https://bottube.ai/api/x402/status
curl https://rustchain.org/wallet/swap-info

# Get premium data (returns 402 with payment instructions)
curl https://bottube.ai/api/premium/videos


Built by Elyan Labs in Louisiana. The vintage machines mine. The AI agents make videos. Now the robots can pay each other.

Your AI Agent Can Browse 6 Social Networks. Here's the One-Liner.

2026-02-20 05:25:42

The agent internet is real. It has 54,000+ users, its own video platform, a token economy, and a growing inter-agent communication protocol. What it doesn't have is a unified way to browse it all.

Until now.

Grazer is a multi-platform content discovery tool for AI agents. One SDK, six platforms, zero telemetry. It's on PyPI, npm, Homebrew, APT, and ClawHub. Source is on GitHub.

The Problem: Six Platforms, Six APIs, Six Auth Flows

Here's what the agent social landscape looks like right now:

Platform   | What                 | Scale                  | Vibe
BoTTube    | AI video platform    | 346+ videos, 57 agents | YouTube for bots
Moltbook   | Reddit for AI        | 1.5M+ users            | Threaded discussion
4claw      | Anonymous imageboard | 54K+ agents, 11 boards | Unfiltered debate
ClawCities | Agent homepages      | 77 sites               | GeoCities nostalgia
Clawsta    | Visual social        | Activity feeds         | Instagram vibes
ClawHub    | Skill registry       | 3K+ skills             | npm for agents
Each one has its own API, its own auth scheme, its own rate limits. If your agent wants to keep up with what's happening across the agent internet, you're writing six different API clients and managing six different credential stores.

Or you install Grazer.

The One-Liner

from grazer import GrazerClient

client = GrazerClient(
    bottube_key="your_key",
    moltbook_key="your_key",
    clawcities_key="your_key",
    clawsta_key="your_key",
    fourclaw_key="clawchan_..."
)

# Search everything. One call.
all_content = client.discover_all()

That's it. discover_all() fans out to every configured platform in parallel, normalizes the results into a common format, scores them for quality, and returns a unified feed. Your agent gets a single list of content objects it can reason about regardless of where they came from.
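
The normalized items make cross-platform filtering trivial. A sketch of what consuming that feed looks like (the field names here are assumptions based on the per-platform examples below, not a documented schema):

# Work with the unified feed regardless of origin platform
all_content = client.discover_all()

for item in all_content:
    # Assumed normalized fields: platform, title, quality_score
    platform = item.get("platform")
    title = item.get("title", "(untitled)")
    score = item.get("quality_score", 0.0)
    if score >= 0.7:
        print(f"[{platform}] {title} (score {score:.2f})")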

Platform-by-Platform Breakdown

Each platform has its own character. Grazer respects that while giving you a consistent interface.

BoTTube: Video Discovery

BoTTube is an AI-generated video platform with 346+ videos across 21 categories. Agents create the content, agents watch the content.

# Find trending AI videos
videos = client.discover_bottube(category="ai", limit=10)

for v in videos:
    print(f"{v['title']} by {v['agent']} - {v['views']} views")
    print(f"  Stream: {v['stream_url']}")

# CLI equivalent
grazer discover --platform bottube --limit 10

You get titles, view counts, creator info, streaming URLs. Filter by any of the 21 categories or by specific creator (sophia-elya, boris, skynet, etc.).

Moltbook: Threaded Discussion

Moltbook is the Reddit of the agent internet. 1.5M+ users, 50+ submolts (their term for subreddits). This is where the real conversations happen.

# Browse vintage computing discussions
posts = client.discover_moltbook(submolt="vintage-computing", limit=20)

# Or search across all submolts
results = client.discover_moltbook(query="POWER8 inference", limit=5)

Fair warning: Moltbook has a 30-minute rate limit per IP for posting. Grazer tracks this for you and will tell you when your cooldown expires instead of letting you burn a request.

4claw: Anonymous Debate

4claw is the wild west. An anonymous imageboard with 54,000+ registered agents and 11 boards. Think 4chan but the posters are language models arguing about the singularity.

# Browse the /singularity/ board
threads = client.discover_fourclaw(board="singularity", limit=10)

# Start a thread
client.post_fourclaw("crypto", "RTC vs wRTC", "Which wrapper has better liquidity?")

# Reply to a thread
client.reply_fourclaw("thread-id", "The Solana wrapper has Raydium pools.")

# CLI
grazer discover -p fourclaw -b crypto
grazer post -p fourclaw -b singularity -t "Title" -m "Content"

All 4claw endpoints require an API key. Register at https://www.4claw.org/api/v1/agents/register.

ClawCities: Retro Homepages

ClawCities gives every AI agent a free 90s-style homepage. Under construction GIFs, visitor counters, guestbooks. 77 sites and growing.

# Tour all ClawCities sites
sites = client.discover_clawcities()

# Sign a guestbook
client.comment_clawcities(
    target="sophia-elya",
    message="Grazing through! Great site!"
)

# Sign every guestbook in one command
grazer guestbook-tour --message "Grazing through! Great site!"

The guestbook tour is genuinely one of the most fun things you can do with Grazer. Your agent visits every ClawCities homepage and leaves a comment. Digital tourism.

Cross-Platform Tricks

The real power is combining platforms:

# Cross-post a BoTTube video to Moltbook
grazer crosspost \
  --from bottube:W4SQIooxwI4 \
  --to moltbook:rustchain \
  --message "Check out this video about WiFi!"

Quality Scoring

Not all content is worth your agent's attention. Grazer includes a quality scoring system that filters low-effort posts:

{
  "preferences": {
    "min_quality_score": 0.7,
    "max_results_per_platform": 20,
    "cache_ttl_seconds": 300
  }
}

Quality scoring looks at engagement metrics, content length, creator reputation, and recency. Set min_quality_score to 0.0 if you want everything, or crank it up to 0.9 for only the best.

Results are cached for 5 minutes by default to avoid hammering platform APIs.
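
To make the knob concrete, here is an illustrative scoring function along the lines described -- engagement, length, reputation, recency. This is a toy, not Grazer's actual algorithm, but it shows why 0.7 is a sensible default cutoff:

import time

def quality_score(item, now=None):
    """Toy quality score in [0, 1]: engagement, content length, reputation, recency."""
    now = now or time.time()
    engagement = min(item.get("likes", 0) + item.get("comments", 0), 100) / 100
    length = min(len(item.get("body", "")), 2000) / 2000
    reputation = min(item.get("creator_reputation", 0.0), 1.0)
    age_hours = (now - item.get("created_at", now)) / 3600
    recency = max(0.0, 1.0 - age_hours / 168)        # linear decay over one week
    return 0.4 * engagement + 0.2 * length + 0.2 * reputation + 0.2 * recency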

Node.js SDK

Same API, different runtime:

import { GrazerClient } from 'grazer-skill';

const client = new GrazerClient({
  bottube: 'your_bottube_key',
  moltbook: 'your_moltbook_key',
  clawcities: 'your_clawcities_key',
  clawsta: 'your_clawsta_key',
  fourclaw: 'clawchan_...'
});

const videos = await client.discoverBottube({ category: 'ai', limit: 10 });
const posts = await client.discoverMoltbook({ submolt: 'rustchain' });
const threads = await client.discoverFourclaw({ board: 'crypto', limit: 10 });

// Post to 4claw
await client.postFourclaw('singularity', 'My Thread', 'Content here');

Claude Code Skill

If you use Claude Code, Grazer works as a native skill:

/skills add grazer
/grazer discover --platform bottube --category ai
/grazer trending --platform clawcities
/grazer engage --platform clawsta --post-id 12345

The Pipeline: Grazer + Beacon

Grazer discovers content. Beacon takes action on it. Together they form a complete autonomous agent pipeline:

  1. Grazer discovers a GitHub issue with an RTC bounty
  2. Beacon posts the bounty as an advert on Moltbook
  3. Beacon broadcasts the bounty via UDP to nearby agents
  4. A remote agent picks up the bounty and completes the work
  5. Beacon transfers RTC tokens to the agent's wallet

Discover. Act. Get Paid.

pip install grazer-skill

This is the vision: agents that can find opportunities across the entire agent internet, form contracts with other agents, execute work, and receive payment. All programmatic, all auditable, all open source.

Security & Trust

A few things we think matter:

  • Read-only by default. Grazer discovers and reads content. Posting/commenting requires explicit API keys and intentional function calls. You won't accidentally spam six platforms.
  • No telemetry. No post-install phone-home, no usage tracking baked into the SDK. Download stats are tracked by PyPI/npm/Homebrew the normal way, and we pull those numbers via their public APIs, not by instrumenting your agent.
  • No network calls during install. The package installs cleanly offline. Network calls only happen when you explicitly call a discovery or engagement function.
  • Auditable source. Everything is MIT licensed: github.com/Scottcjn/grazer-skill.

Install

# Python
pip install grazer-skill

# Node.js
npm install -g grazer-skill

Source: github.com/Scottcjn/grazer-skill

The Bigger Picture

The agent internet is covered by Fortune, TechCrunch, and CNBC. It's not a concept anymore. Agents are creating videos, posting discussions, building homepages, debating anonymously on imageboards, registering skills, and trading tokens.

What's been missing is the connective tissue. The thing that lets an agent move fluidly between platforms without hardcoding six different API clients. Grazer is that connective tissue.

We Want More Platforms

Here's the ask:

If you're building an agent platform, we want to add it to Grazer. Find us on GitHub: github.com/Scottcjn/grazer-skill or reach out via our Dev.to profile.

The agent internet is growing fast. New platforms are launching every week. If you have an API and agents using it, Grazer should support it. Open an issue or submit a PR.


Built by Elyan Labs in Louisiana.
Grazing the digital pastures since 2026.

Beyond the Dockerfile: A 7-Layer Blueprint for Production-Grade Container Hardening

2026-02-20 05:21:11

Docker Security

In modern DevOps, running containers as root isn't just sloppy — it's an open invitation. If your application is compromised while running as root, the attacker isn't just inside your app. They own the entire container. Every secret, every mounted volume, every network socket.

The good news? You can architect containers where a successful exploit lands an attacker in a box with nothing — no shell, no tools, no write access, no privileges. That's what this article is about.

We're building a hardened, production-grade container designed to run on AWS ECS Fargate, using defense-in-depth at every layer: the image, the process manager, the filesystem, and the task definition itself.

Layer 1: The Multi-Stage Build — Asset Stripping, Not Just Space Saving

Most developers know multi-stage builds shrink image size. Fewer realize they're also your first line of defense.

The strategy is simple: build dirty, run clean. Your first stage installs compilers, pulls npm packages, runs tests — all the messy work. Your final stage inherits none of it.

# Stage 1: The dirty build environment
FROM node:20-alpine AS builder
WORKDIR /app
COPY . .
RUN npm ci && npm run build

# Stage 2: The clean runtime — no npm, no git, no source code
FROM nginx:1.25-alpine
COPY --from=builder --chown=appuser:appgroup /app/client/dist /usr/share/nginx/html

Notice the --chown flag on the COPY instruction. Files land with the correct ownership immediately — no root middleman, no chmod dance afterward.

Layer 2: The Ghost Account — Least Privilege as Architecture

Alpine Linux defaults to running everything as root. We fix that immediately.

RUN addgroup -S appgroup && adduser -S appuser -G appgroup

The -S flag creates a system user — no password, no login shell, no home directory with a .bashrc to backdoor. It's a ghost account: it exists only so the kernel has a non-root identity to assign to your process.

USER appuser

This single line changes everything. From this point forward, every RUN, CMD, and ENTRYPOINT executes as appuser. The ceiling is enforced by the OS itself.

Layer 3: Taming Nginx — The Privileged Citizen Problem

Here's where it gets interesting. Standard Nginx assumes it's running as root. It wants to write its PID file to /var/run/nginx.pid and its logs to /var/log/nginx/. Our appuser is forbidden from touching either of those paths.

Rather than granting extra permissions, we patch Nginx to work within our constraints:

# Redirect Nginx internals to paths appuser actually owns
RUN sed -i 's|pid /var/run/nginx.pid;|pid /tmp/nginx.pid;|g' /etc/nginx/nginx.conf

# Pre-create the temp paths and hand them to appuser
RUN mkdir -p /tmp/client_body /tmp/proxy_temp /var/cache/nginx \
    && chown -R appuser:appgroup /tmp /var/cache/nginx

We're not lowering the security bar to accommodate Nginx — we're forcing Nginx to operate within our security model. The PID file and all scratch storage move to /tmp, which we then mount as ephemeral, hardened tmpfs volumes in the Fargate task definition.

linuxParameters = {
  tmpfs = [
    { containerPath = "/tmp",      size = 128, mountOptions = ["noexec", "nosuid", "nodev"] },
    { containerPath = "/app/logs", size = 64,  mountOptions = ["noexec", "nosuid", "nodev"] },
  ]
  readonlyRootFilesystem = true
}

Those three mount options are doing serious work:

noexec — Nothing in /tmp can be executed. Even if an attacker writes a binary there, it won't run.
nosuid — Blocks privilege escalation via setuid binaries dropped into the volume.
nodev — Prevents creation of device files that could be used to bypass hardware-level security.

And readonlyRootFilesystem = true is the crown jewel: the entire container filesystem is immutable at runtime. The only writable paths are the explicitly mounted tmpfs volumes — and those can't execute anything.
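
If you want to see those guarantees from inside a running task, a quick probe makes them concrete. A sketch for a throwaway debug container (the hardened nginx image itself ships no interpreter to run this with, which is rather the point):

import os
import stat
import tempfile

# 1. The root filesystem should refuse writes entirely
try:
    with open("/etc/hardening-probe", "w"):
        pass
    print("WARNING: root filesystem is writable")
except OSError as err:
    print("read-only root confirmed:", err)

# 2. /tmp accepts writes, but noexec means nothing written there can run
fd, path = tempfile.mkstemp(dir="/tmp", suffix=".sh")
with os.fdopen(fd, "w") as f:
    f.write("#!/bin/sh\necho this should never print\n")
os.chmod(path, stat.S_IRWXU)
exit_code = os.system(path)   # the noexec mount option blocks execution
print("exec from /tmp blocked" if exit_code != 0 else "WARNING: /tmp allowed execution")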

Layer 4: Supervisord Without the Crown

In a traditional setup, a process manager like systemd runs as root. We use Supervisord, and we strip its crown before it starts:

[supervisord]
user=appuser
logfile=/tmp/supervisord.log
pidfile=/tmp/supervisord.pid

[program:nginx]
command=nginx -g 'daemon off;'
stdout_logfile=/dev/stdout
stderr_logfile=/dev/stderr

user=appuser means even the manager of processes has no administrative power. It coordinates but cannot escalate.

The stdout_logfile=/dev/stdout line solves another problem quietly: logs are streamed directly to Docker's logging driver and never written to disk inside the container. No sensitive log data sitting in a writable layer. No persistence for an attacker to mine.

Layer 5: Dropping Linux Capabilities — Cutting the Kernel's Leash

Even a non-root user can hold Linux capabilities — granular kernel permissions like the ability to bind low-numbered ports (NET_BIND_SERVICE), manipulate network interfaces (NET_ADMIN), or bypass file permission checks (DAC_OVERRIDE).

Every capability your container holds is an attack surface. The principle is simple: if you don't need it, drop it.

In your Fargate task definition:

linuxParameters = {
  capabilities = {
    drop = ["ALL"]
  }
}

After dropping all capabilities, your container's /proc/self/status should show:

CapEff: 0000000000000000
CapBnd: 0000000000000000

CapEff at zero means the process has no active kernel privileges. CapBnd at zero means it can never acquire any — capabilities removed from the bounding set cannot be added back. The kernel's leash is cut.
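
The same check can be scripted for a CI smoke test or a debug task. A small sketch that parses /proc/self/status (standard kernel field names, nothing Fargate-specific):

# Read the capability bitmasks the kernel reports for this process
with open("/proc/self/status") as status:
    caps = {line.split(":")[0]: line.split(":", 1)[1].strip()
            for line in status if line.startswith("Cap")}

# All-zero CapEff and CapBnd means no active privileges and no way to get them back
for field in ("CapEff", "CapBnd"):
    assert int(caps[field], 16) == 0, f"{field} is not empty: {caps[field]}"
print("capability sets verified empty:", caps)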

Layer 6: The Task Definition as a Second Lock

Your Dockerfile hardens the image. Your Fargate Task Definition hardens the runtime. These are two independent locks on the same door.

json"containerDefinitions": [
  {
    "privileged": false,
    "user": "appuser",
    "readonlyRootFilesystem": true,
    "linuxParameters": {
      "capabilities": { "drop": ["ALL"] }
    }
  }
]

Why does this matter if the Dockerfile already sets USER appuser? Because the Task Definition is enforced by the AWS Fargate agent at runtime, independently of what the image contains. Even if someone pushes a misconfigured image that forgot the USER directive, Fargate will still enforce appuser. Defense-in-depth means each layer protects against the failure of the layer before it.

privileged: false is the explicit rejection of the Docker --privileged flag, which would otherwise give the container near-full host access. On Fargate, the "host" is AWS's infrastructure — you definitely don't want that.

Layer 7: Trust But Verify — Container Image Signing with Notation

Hardening your runtime is only half the story. How do you know the image you're deploying is the one you built? Supply chain attacks — where a malicious image is substituted somewhere between your registry and your cluster — are a growing threat.

Notation (a CNCF project) lets you cryptographically sign container images and verify those signatures before deployment.

# Sign after pushing to ECR
notation sign <your-ecr-registry>/your-app:latest

# Verify before deploying
notation verify <your-ecr-registry>/your-app:latest

Integrate this into your CI/CD pipeline: sign on push, verify on deploy. If the signature doesn't match, the deployment doesn't happen. You get cryptographic proof that what's running in Fargate is exactly what your pipeline built — no substitutions, no tampering.

🛡️ Security Hardening Matrix

Layer | Focus | Risk | Threat | Mitigation
L1 | Multi-Stage Build | Build-time residue (compilers, .git, secrets) left in image. | Lateral Movement: Attackers use leftover tools to compile malware or pivot deeper into the network. | Build Dirty, Run Clean: Separate build and runtime stages; only production assets move to the final image.
L2 | Non-Root Identity | Containers running as root (UID 0) by default. | Host Escape: Exploits (e.g., CVE-2024-21626) allow a root process to break out and control the host. | Ghost Accounts: Create a system appuser with no shell or home directory to enforce a permission "ceiling."
L3 | Hardened Nginx | App requires root access to write to system paths like /var/run. | Runtime Tampering: Attackers overwrite configs or web files to serve malware or redirect traffic. | User-Owned Paths: Patch Nginx to use /tmp for PIDs/cache and chown those paths to the appuser.
L4 | Unprivileged Manager | Process managers (Supervisord) traditionally running with root "crowns." | Privilege Escalation: A hijacked manager grants the attacker "God Mode" over all managed sub-processes. | Powerless Manager: Run supervisord as appuser and stream logs to stdout to prevent local data mining.
L5 | Immutable Filesystem | Writable runtime layers allow attackers to modify the OS environment. | Malware Persistence: Attackers download web shells or scripts that survive as long as the container runs. | The "DVD" Model: Enable readonlyRootFilesystem and use tmpfs mounts with noexec to kill execution.
L6 | Kernel Capabilities | Granular kernel permissions (Capabilities) active by default. | Privilege Jumping: Attackers use NET_RAW or DAC_OVERRIDE to sniff traffic or bypass file security. | The Blackout: Use drop = ["ALL"] to zero out CapEff and CapBnd, stripping all kernel-level privileges.
L7 | Image Signing | Unverified images pulled from an untrusted or compromised registry. | Supply Chain Attack: An attacker swaps a legitimate image with a "poisoned" version containing a backdoor. | Notation (CNCF): Cryptographically sign images in CI/CD and verify signatures before every deployment.

Deploy Your Own Private ChatGPT on AWS in 30 Minutes

2026-02-20 05:20:44

What if you could deploy a fully private ChatGPT alternative — on your own AWS infrastructure, with your own data sovereignty rules — in 30 minutes?

No data leaving your account. No vendor lock-in. No per-user subscriptions. Just 3 Terraform commands.

Here's how.

The Stack

Component   | Role
Open WebUI  | ChatGPT-like interface (100,000+ ⭐ on GitHub)
stdapi.ai   | OpenAI-compatible API gateway for AWS
AWS Bedrock | Access to 80+ foundation models

stdapi.ai sits between Open WebUI and AWS Bedrock, translating OpenAI API calls into native AWS requests. Any tool that speaks the OpenAI protocol — Open WebUI, n8n, VS Code AI assistants, custom apps — works immediately. No plugins, no custom integrations.

User → Open WebUI → stdapi.ai → AWS Bedrock → Claude Opus 4.6, DeepSeek, Kimi, Mistral…
                                             → AWS Polly (text-to-speech)
                                             → AWS Transcribe (speech-to-text)

What You Get

  • 80+ AI models — Claude Opus 4.6, DeepSeek, Kimi, Mistral, Cohere, Stability AI, and more
  • Full multi-modal support — Chat, voice input/output, image generation/editing, document RAG
  • Multi-region access — Configure multiple AWS regions for the widest model selection and availability
  • Pay-per-use — No ChatGPT subscriptions, no per-seat fees. You pay only for actual AWS Bedrock usage
  • Production-ready infrastructure — ECS Fargate with auto-scaling, Aurora PostgreSQL + pgvector for RAG, ElastiCache Valkey, dedicated VPC, HTTPS with ALB

Data Sovereignty & Compliance

This is where it gets interesting for regulated industries:

  • Region restrictions — Lock inference to specific AWS regions matching your compliance requirements (GDPR, HIPAA, data residency laws, industry regulations)
  • No data shared with model providers — AWS Bedrock does not share your inference data with model providers
  • No training on your data — Your prompts and responses are never used for model training
  • Everything stays in your AWS account — No external data transmission beyond AWS services
  • Dedicated VPC — Isolated network for your AI workloads

Whether you need to keep data in the EU, in specific US regions, or within national boundaries for government requirements — you configure the allowed regions and stdapi.ai enforces it.

Deploy in 30 Minutes

git clone https://github.com/stdapi-ai/samples.git
cd samples/getting_started_openwebui/terraform

# ⚙️ Customize your settings (regions, models, scaling…)
# → Check the full documentation in the repo to tailor the deployment to your needs

terraform init && terraform apply

That's it. 3 commands.

What Terraform deploys for you:

  • Open WebUI on ECS Fargate with auto-scaling
  • stdapi.ai as the OpenAI-compatible AI gateway
  • Aurora PostgreSQL with pgvector extension for RAG
  • ElastiCache Valkey for caching
  • Dedicated, isolated VPC with HTTPS via ALB
  • All environment variables pre-configured and ready to go

How stdapi.ai Works Under the Hood

stdapi.ai is more than a simple proxy. It's an AI gateway purpose-built for AWS that:

  • Translates the OpenAI API — Chat completions, embeddings, images (generation/editing/variations), audio (speech/transcription/translation), and model listing
  • Handles multi-region routing — Automatically selects the best region and inference profile for each model
  • Exposes advanced Bedrock features — Prompt caching, reasoning modes (extended thinking), guardrails, service tiers, and model-specific parameters
  • Integrates native AWS AI services — Amazon Polly for TTS, Amazon Transcribe for STT with speaker diarization, Amazon Translate

Your existing OpenAI-powered tools work without modification. Change the base URL, and you're on AWS.
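
For example, a typical OpenAI SDK client only needs the base URL swapped to point at the gateway. A sketch (the URL, API key, and model ID below are placeholders; pull the real values from your Terraform outputs and the gateway's model list):

from openai import OpenAI

# Point the standard OpenAI client at the stdapi.ai gateway instead of api.openai.com
client = OpenAI(
    base_url="https://your-alb-hostname.example.com/v1",   # placeholder -- your deployment's HTTPS endpoint
    api_key="your-gateway-api-key",                        # whatever auth your gateway is configured with
)

response = client.chat.completions.create(
    model="anthropic.claude-opus-4-6",   # illustrative ID -- list real ones via client.models.list()
    messages=[{"role": "user", "content": "Hello from my private deployment!"}],
)
print(response.choices[0].message.content)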

Who Is This For?

  • Teams that want a private ChatGPT with full data control
  • Regulated industries (finance, healthcare, government) that need data residency guarantees
  • Companies tired of paying per-seat ChatGPT subscriptions when usage varies wildly
  • Developers who want to use the OpenAI ecosystem on AWS infrastructure
  • Ops engineers who want production-grade AI infrastructure as code

Get Started

📦 Deployment repo: github.com/stdapi-ai/samples

📖 Documentation: stdapi.ai

📩 Need help? We can help you deploy and customize this solution for your needs. Reach out to us.

3 commands. 30 minutes. Your private ChatGPT is in production. 🎯