
Git Worktrees for AI Coding: Run Multiple Agents in Parallel

2026-02-23 17:44:15

Last Tuesday I had Claude Code fixing a pagination bug in my API layer. While it worked, I sat there. Waiting. Watching it think. For eleven minutes.

Meanwhile, three other tasks sat in my backlog: a Blazor component needed refactoring, a new endpoint needed tests, and the SCSS build pipeline had a caching issue. All independent. All blocked behind my single terminal.

I thought: I have 5 monitors and a machine that could run a small country. Why am I running one agent at a time?

Then I discovered that Claude Code shipped built-in worktree support, and everything changed. I went from sequential AI coding to running five agents in parallel, each on its own branch, none stepping on each other's files. My throughput didn't just double. It went up roughly 5x.

Here's exactly how I set it up, the .NET-specific gotchas I hit, and why I think worktrees are the single biggest productivity unlock for AI-assisted development right now.

Table of Contents

  • What Are Git Worktrees (And Why Should You Care Now)
  • The Problem: One Repo, One Agent, One Branch
  • Setting Up Your First Worktree
  • Running Multiple AI Agents in Parallel
  • The .NET Worktree Survival Guide
  • My 5-Agent Workflow
  • Common Worktree Pain Points (And How to Fix Them)
  • When Worktrees Don't Make Sense
  • Frequently Asked Questions
  • Stop Waiting, Start Parallelizing

What Are Git Worktrees

A git worktree is a second (or third, or fifth) working directory linked to the same repository. Each worktree checks out a different branch, but they all share the same .git history, refs, and objects.

Think of it this way: instead of cloning your repo five times (and wasting disk space on five copies of your git history), you create five lightweight checkouts that share one .git folder.

# Your main repo
C:\code\MyApp\                    # on branch: master

# Your worktrees (separate folders, same repo)
C:\code\MyApp-worktrees\fix-pagination\    # on branch: fix/pagination
C:\code\MyApp-worktrees\add-tests\         # on branch: feature/api-tests
C:\code\MyApp-worktrees\refactor-blazor\   # on branch: refactor/blazor-grid

Git introduced worktrees in version 2.5 (July 2015). They've been around for over a decade. Most developers have never used them because, until AI coding agents, there was rarely a reason to work on five branches simultaneously.

Now there is.

The Problem: One Repo, One Agent, One Branch

Here's the typical AI coding workflow in 2026:

  1. Open terminal. Start Claude Code (or Cursor, or Copilot).
  2. Describe a task. Watch the agent work.
  3. Wait 5-15 minutes while it reads files, writes code, runs tests.
  4. Review the changes. Commit.
  5. Start the next task.

Steps 1-4 are sequential. You're blocked. Your machine is doing maybe 10% of what it could.

"But I can just open another terminal and start a second agent."

No, you can't. Not safely. Two agents editing the same working directory is a recipe for corrupted state. Agent A writes to OrderService.cs while Agent B is reading it. Agent A runs dotnet build while Agent B is mid-refactor. Merge conflicts happen in real-time, inside your working directory, with no version control to save you.

Worktrees fix this. Each agent gets its own directory, its own branch, its own isolated workspace. They can all build, test, and modify files simultaneously without interference.

Setting Up Your First Worktree

The syntax is simple:

# Create a worktree with a new branch
git worktree add ../MyApp-worktrees/fix-pagination -b fix/pagination

# Create a worktree from an existing branch
git worktree add ../MyApp-worktrees/fix-pagination fix/pagination

# List all worktrees
git worktree list

# Remove a worktree when you're done
git worktree remove ../MyApp-worktrees/fix-pagination

I keep my worktrees in a sibling directory to avoid cluttering the main repo:

C:\code\
├── MyApp\                        # Main working directory
└── MyApp-worktrees\              # All worktrees live here
    ├── fix-pagination\
    ├── add-tests\
    └── refactor-blazor\

One critical rule: you cannot check out the same branch in two worktrees. Git enforces this by default. If your main directory is on master, no worktree can also be on master. You can override this with git worktree add -f, but don't. It prevents two workspaces from stomping on each other's state. The restriction is a feature, not a bug.

Running Multiple AI Agents in Parallel

Here's where it gets interesting. Once you have worktrees set up, you can launch an AI agent in each one.

With Claude Code

Claude Code has built-in worktree support with a --worktree (-w) CLI flag that starts a session in an isolated worktree automatically. You can also create worktrees manually and point Claude Code at them:

# Terminal 1: Main repo - fixing the pagination bug
cd C:\code\MyApp
claude "Fix the pagination bug in OrdersController where offset is off by one"

# Terminal 2: Worktree - adding API tests
cd C:\code\MyApp-worktrees\add-tests
claude "Add integration tests for all endpoints in OrdersController"

# Terminal 3: Worktree - refactoring Blazor component
cd C:\code\MyApp-worktrees\refactor-blazor
claude "Refactor the OrderGrid component to use virtualization"

# Terminal 4: Worktree - fixing SCSS
cd C:\code\MyApp-worktrees\fix-scss
claude "Fix the SCSS compilation caching issue in the build pipeline"

# Terminal 5: Worktree - documentation
cd C:\code\MyApp-worktrees\update-docs
claude "Update the API documentation for the Orders endpoint"

Five terminals. Five agents. Five branches. Zero conflicts.
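If you want to script the fan-out instead of juggling terminals, the pattern is just "one subprocess per worktree directory." Here's a minimal Python sketch; it assumes a `claude` CLI that accepts a prompt as its argument (as in the commands above), and the task names and prompts are illustrative:

```python
# Sketch: launch one agent process per worktree. Assumes a `claude`
# CLI invoked as `claude "<prompt>"`; tasks/paths are illustrative.
import subprocess
from pathlib import Path

TASKS = {
    "fix-pagination": "Fix the pagination bug in OrdersController",
    "add-tests": "Add integration tests for OrdersController",
    "refactor-blazor": "Refactor OrderGrid to use virtualization",
}

def build_commands(worktree_root: str, tasks: dict) -> list:
    """Return (working_dir, argv) pairs, one per worktree."""
    return [
        (Path(worktree_root) / name, ["claude", prompt])
        for name, prompt in tasks.items()
    ]

def launch_all(worktree_root: str, tasks: dict) -> list:
    # Popen returns immediately, so all agents run concurrently.
    return [
        subprocess.Popen(argv, cwd=cwd)
        for cwd, argv in build_commands(worktree_root, tasks)
    ]
```

Each process gets its own `cwd`, which is the whole trick: the isolation comes from the worktree, not from the script.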

Claude Code also supports spawning subagents in worktrees internally using isolation: "worktree" in agent definitions, where each subagent works in isolation and the changes get merged back. Boris Cherny, Creator and Head of Claude Code at Anthropic, called worktrees his number one productivity tip — he runs 3-5 worktrees simultaneously and described it as particularly useful for "1-shotting large batch changes like codebase-wide code migrations."

With Other AI Tools

The same pattern works with any AI coding tool:

# Cursor - open each worktree as a separate workspace
code C:\code\MyApp-worktrees\fix-pagination

# GitHub Copilot CLI - run in each worktree directory
cd C:\code\MyApp-worktrees\add-tests && gh copilot suggest "..."

The worktree is just a directory. Any tool that operates on a directory works.

The .NET Worktree Survival Guide

This is where generic worktree guides fall short. .NET projects have specific pain points that will bite you if you're not prepared.

Pain Point 1: NuGet Package Restore

Each worktree needs its own bin/ and obj/ directories. The good news: dotnet restore handles this automatically. The bad news: your first build in each worktree takes longer because it's restoring packages from scratch.

# After creating a worktree, always restore first
cd C:\code\MyApp-worktrees\fix-pagination
dotnet restore

The NuGet global packages cache (%userprofile%\.nuget\packages on Windows, ~/.nuget/packages on Mac/Linux) is shared across all worktrees. So the packages aren't downloaded again — they're just linked. Fast enough.

Pain Point 2: Port Conflicts in launchSettings.json

This one will get you. If all your worktrees use the same launchSettings.json, they'll all try to bind to the same port. Two Kestrel instances on port 5001 means one of them crashes.

Fix it with environment variables or override the port at launch:

# In worktree terminal, override the port
dotnet run --urls "https://localhost:5011"

# Or set it via environment variable
ASPNETCORE_URLS=https://localhost:5011 dotnet run

One gotcha: if you have Kestrel endpoints configured explicitly in appsettings.json, those override ASPNETCORE_URLS. The --urls flag is safer because it takes highest precedence.

I usually don't bother with any of this — most of the time the AI agent doesn't need to run the app, just build and test it.
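When agents do need to run the app, a deterministic port assignment avoids collisions without any per-worktree config editing. A minimal sketch; the 5100 base is an arbitrary choice, not anything ASP.NET Core requires:

```python
# Sketch: give each worktree a stable, unique port so parallel
# `dotnet run` instances don't collide. Base 5100 is arbitrary.
def assign_ports(worktrees: list, base: int = 5100) -> dict:
    """Map each worktree name to its own port: base, base + 1, ..."""
    return {name: base + i for i, name in enumerate(sorted(worktrees))}

# Then launch each agent's app with its assigned port, e.g.:
#   dotnet run --urls "https://localhost:5101"
```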

Pain Point 3: User Secrets and appsettings.Development.json

User secrets are stored by UserSecretsId (set in your .csproj) under %APPDATA%\Microsoft\UserSecrets\<UserSecretsId>\secrets.json on Windows (~/.microsoft/usersecrets/ on Mac/Linux). They live outside the repo entirely. So they're shared automatically across worktrees. This is usually what you want.

appsettings.Development.json is typically tracked in git, so it exists in every worktree automatically. (If your team gitignores it instead, copy it over as part of your worktree setup script.) No issues here.

Pain Point 4: Database Migrations Running in Parallel

If two agents both try to run dotnet ef database update against the same database at the same time, you'll get lock contention or worse.

My rule: only one worktree touches the database at a time. If a task involves migrations, it gets its own dedicated slot and the other agents work on code-only changes.

Or better: use a separate database per worktree for integration tests. Your docker-compose.yml can spin up isolated Postgres instances:

# docker-compose.worktree-tests.yml
services:
  db-pagination:
    image: postgres:17
    ports: ["5433:5432"]
    environment:
      POSTGRES_DB: myapp_pagination

  db-tests:
    image: postgres:17
    ports: ["5434:5432"]
    environment:
      POSTGRES_DB: myapp_tests
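If you're spinning these up per worktree, generating the service definition and the matching connection string from one function keeps the host ports in sync. A sketch mirroring the compose file above; the image, database naming, and 5433 base port are illustrative:

```python
# Sketch: one Postgres service + .NET connection string per worktree.
# Host-port base 5433 and the myapp_ naming are illustrative.
def db_service(worktree: str, index: int, host_base: int = 5433):
    """Return (compose service dict, connection string) for a worktree."""
    port = host_base + index
    service = {
        "image": "postgres:17",
        "ports": [f"{port}:5432"],
        "environment": {"POSTGRES_DB": f"myapp_{worktree}"},
    }
    conn = f"Host=localhost;Port={port};Database=myapp_{worktree}"
    return service, conn
```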

Pain Point 5: Shared Global Tools and SDK

The .NET SDK is machine-wide. global.json in your repo pins the version. Since all worktrees share the same repo, they all use the same SDK version. No issues here — this just works.

My 5-Agent Workflow

Here's my actual daily workflow. I've been running this for a few weeks and it's settled into a rhythm.

Morning planning (10 minutes):

  1. Check the backlog. Pick 4-5 independent tasks.
  2. "Independent" means: different files, different concerns, no shared migration paths.
  3. Create worktrees and branches:
#!/bin/bash
# Quick script I keep handy (run from Git Bash; forward slashes keep the paths portable)
REPO="C:/code/MyApp"
TREES="C:/code/MyApp-worktrees"

cd "$REPO"
for branch in "$@"; do
    git worktree add "$TREES/$branch" -b "$branch" 2>/dev/null || \
    git worktree add "$TREES/$branch" "$branch"
    echo "Created worktree: $TREES/$branch"
done
# Usage
./create-worktrees.sh fix/pagination feature/api-tests refactor/blazor fix/scss update/docs

Parallel execution (1-2 hours):

  1. Open 5 terminals (I use Windows Terminal with tabs).
  2. Launch Claude Code in each worktree with a clear, scoped prompt.
  3. Monitor. Most tasks complete in 5-15 minutes.
  4. Review each agent's work as it finishes.

Merge back (15 minutes):

  1. Review diffs. Run tests in each worktree.
  2. Merge completed branches back to master:
git checkout master
git merge fix/pagination
git merge feature/api-tests
# ... and so on
  3. Clean up worktrees:
git worktree remove ../MyApp-worktrees/fix-pagination
git worktree remove ../MyApp-worktrees/add-tests
# Or nuke them all (skip the first entry, which is the main worktree)
git worktree list | tail -n +2 | awk '{print $1}' | xargs -I{} git worktree remove {}
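For anything scripted, `git worktree list --porcelain` is easier to parse reliably than the human-readable output: each entry starts with a `worktree <path>` line, entries are blank-line separated, and the first entry is always the main working directory (which can't be removed). A sketch of the parsing step:

```python
# Sketch: parse `git worktree list --porcelain` output and return the
# paths of linked worktrees, skipping the main working directory.
def removable_worktrees(porcelain: str) -> list:
    """Return linked-worktree paths; the main worktree is excluded."""
    paths = [
        line.split(" ", 1)[1]
        for line in porcelain.splitlines()
        if line.startswith("worktree ")
    ]
    return paths[1:]  # first entry is always the main worktree
```

Feed the result to `git worktree remove`, one path at a time.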

Results: What used to take a full day of sequential agent sessions now takes about 2 hours including review time.

Task Selection Matters

Not every task is a good worktree candidate. The ideal task for parallel AI execution:

| Good for worktrees | Bad for worktrees |
| --- | --- |
| Bug fix in isolated file | Database schema migration |
| Adding tests for existing code | Renaming a shared model class |
| New endpoint (separate controller) | Refactoring shared base classes |
| UI component work | Changing DI registration order |
| Documentation updates | Anything that touches Program.cs |

The rule of thumb: if two tasks would cause a merge conflict, don't run them in parallel.
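You can turn that rule of thumb into a pre-flight check: list the files each task is expected to touch, then flag any pair with overlap before launching agents. A sketch (the task-to-files mapping is something you'd fill in by hand during morning planning):

```python
# Sketch: flag task pairs whose planned file sets overlap and would
# therefore likely conflict on merge. Mapping is filled in by hand.
from itertools import combinations

def conflicting_pairs(tasks: dict) -> list:
    """Return every pair of tasks that plans to touch a common file."""
    return [
        (a, b)
        for a, b in combinations(sorted(tasks), 2)
        if tasks[a] & tasks[b]
    ]
```

An empty result doesn't guarantee clean merges (agents wander), but a non-empty one is a reliable "don't parallelize these" signal.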

Common Worktree Pain Points

The criticisms are real. Let me address them honestly.

"I have to npm install in every worktree."

True for Node projects. For .NET, dotnet restore is fast because the global package cache is shared. If you're in a monorepo with both Node and .NET, install node_modules per worktree — it takes 30 seconds with a warm cache.

"Pre-commit hooks don't install automatically."

If you use Husky or similar, run the install command after creating the worktree. For .NET projects using dotnet format as a pre-commit hook, it works automatically since the tool is restored via dotnet tool restore.

"I have to copy env files."

Write a setup script. Seriously. If you're creating worktrees regularly, spending 20 minutes on a setup-worktree.sh script will save you hours:

#!/bin/bash
WORKTREE_DIR=$1
cp .env "$WORKTREE_DIR/.env"
cd "$WORKTREE_DIR"
dotnet restore
dotnet tool restore
echo "Worktree ready: $WORKTREE_DIR"

"Ports conflict."

Pass --urls to override the port. For ASP.NET Core integration tests, port conflicts aren't even an issue — WebApplicationFactory<T> uses an in-memory test server with no actual port binding. Multiple test suites can run simultaneously without stepping on each other.

These are all solvable problems. The throughput gain is worth the 30-minute setup cost.

When Worktrees Don't Make Sense

I'm not going to pretend worktrees are always the answer. Skip them when:

  • Your task list has sequential dependencies (task B needs task A's output)
  • You're working on a single large feature that touches every layer
  • Your repo is small enough that the agent finishes in under 3 minutes anyway
  • You're on a machine with less than 16GB RAM (each agent + build process eats memory)
  • The codebase has heavy shared state — a single God.cs file that everything imports

For a focused 30-minute bug fix, just use your main directory. Worktrees shine when you have 3+ hours of independent tasks and the machine to run them.

Frequently Asked Questions

What is a git worktree?

A git worktree is an additional working directory linked to an existing repository. It lets you check out a different branch in a separate folder while sharing the same git history and objects. Created with git worktree add <path> <branch>, worktrees have been available since Git 2.5 (July 2015).

Can I use git worktrees with Visual Studio?

Yes. Visual Studio 2022 and later can open a worktree folder as a project. Solution files, project references, and NuGet packages all work normally. The only caveat is that Solution Explorer shows the worktree path, not the main repo path. JetBrains Rider also handles worktrees well.

How many git worktrees can I run at once?

Git imposes no hard limit. The practical limit is your machine's RAM and CPU. Each worktree with an AI agent running dotnet build consumes roughly 2-4GB of RAM. On a 32GB machine, 5-6 concurrent worktrees with active builds is comfortable. On 64GB, you can push to 10+.

Do git worktrees share the NuGet cache?

Yes. The NuGet global packages folder (~/.nuget/packages) is machine-wide, not per-repository. When you run dotnet restore in a worktree, packages are resolved from the global cache. Only packages not already cached will be downloaded. This makes the first restore in a new worktree fast — usually under 10 seconds for a typical .NET solution.

Are git worktrees better than multiple git clones?

For AI-assisted parallel development, yes. Worktrees share git history, refs, and the object database. Five worktrees use a fraction of the disk space of five full clones. Commits made in any worktree are immediately visible to all others (same .git directory). The only advantage of separate clones is full isolation — useful if you need different git configs or hooks per copy.

How do I resolve merge conflicts from parallel worktree branches?

Merge each branch back to your main branch one at a time. If branches touched different files (which they should if you planned well), merges are clean. For conflicts, resolve them using your normal merge workflow. The key is task selection: if you chose truly independent tasks, merge conflicts are rare. I've been running 5 parallel branches daily for weeks and hit fewer than 3 conflicts total.

Stop Waiting, Start Parallelizing

The era of watching a single AI agent grind through your tasks one by one is over. Git worktrees give you isolated workspaces in seconds. AI coding tools give you agents that can fill each one.

The math is simple. If one agent takes 10 minutes per task and you have 5 tasks, that's 50 minutes sequential. With 5 worktrees, it's 10 minutes plus review time.

Set up a few worktrees. Pick independent tasks. Launch your agents. Go make coffee.

When you come back, five branches will be waiting for review.

Now if you'll excuse me, I have 4 agents running and one of them just finished refactoring my Blazor grid component. Time to review.

About the Author

I'm Mashrul Haque, a Systems Architect with over 15 years of experience building enterprise applications with .NET, Blazor, ASP.NET Core, and SQL Server. I specialize in Azure cloud architecture, AI integration, and performance optimization.

When production catches fire at 2 AM, I'm the one they call.

Follow me here on dev.to for more .NET and AI coding content

GitHub Trending: visual-explainer

2026-02-23 17:41:16

visual-explainer

Agent skill + prompt templates that generate rich HTML pages for visual diff reviews, architecture overviews, plan audits, data tables, and project recaps


Quick Start

git clone https://github.com/nicobailon/visual-explainer
cd visual-explainer

Tags

github trending opensource html

Follow for more!

Run AI Locally: Converting an Android phone into a personal LLM server

2026-02-23 17:38:41

Cloud AI is convenient. It is also expensive and dependent on internet access.

Over the weekend, I tried something different: I converted an old 4GB Android phone into a local LLM server and routed it to my PC.

The goal was simple. Run AI offline. No subscriptions. No API costs.

Here is what worked, what did not, and what this experiment reveals about the future of edge AI.

The Stack

The setup was minimal:

• Termux for a Linux-like environment on Android
• Ollama for running local language models
• Qwen2 (0.5b variant) as the lightweight model
• One old Android device with 4GB RAM

Below are the exact steps that worked for me when setting up Termux and Ollama on Android.

  1. Install Termux

Download the latest APK from the official GitHub releases page of Termux. After installation, grant storage and network permissions when prompted.

  2. Update packages and install Ollama

Inside Termux:

pkg update && pkg install ollama

This installs Ollama directly in the Termux environment.

  3. Start the Ollama server

Expose it to your local network:

export OLLAMA_HOST=0.0.0.0:11434
ollama serve &

Setting 0.0.0.0 allows other devices on the same network to connect.

  4. Pull a lightweight model

For low-RAM devices, I used:

ollama pull qwen2:0.5b

This pulls the 0.5B parameter variant of Qwen2, which is small enough to run on constrained hardware. If download speed is an issue, using alternative mirrors can help.

  5. Run the model and test with a prompt
ollama run qwen2:0.5b

Note: On some setups, Ollama may throw an error about a missing serve executable. Creating a symbolic link fixes it:

ln -s $PREFIX/bin/ollama $PREFIX/bin/serve

This maps the expected command to the correct binary.

  6. Access from your PC

From your computer, send a request to the phone’s local IP:

curl http://[phone-ip]:11434/api/generate -d '{"model": "qwen2:0.5b", "prompt": "Test"}'

If everything is configured correctly, the phone responds with generated text.

At this point, your Android device is functioning as a local LLM server.
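The same call works from Python if you'd rather script it than use curl. Ollama's `/api/generate` endpoint streams one JSON object per line, each carrying a `response` fragment and a `done` flag, so the client stitches the fragments together. A minimal stdlib-only sketch (replace the host with your phone's local IP):

```python
# Sketch: query the phone's Ollama server from Python. /api/generate
# streams one JSON object per line; stitch() joins the fragments.
import json
import urllib.request

def stitch(lines) -> str:
    """Join streamed "response" fragments until a "done" chunk arrives."""
    chunks = []
    for line in lines:
        part = json.loads(line)
        chunks.append(part.get("response", ""))
        if part.get("done"):
            break
    return "".join(chunks)

def generate(host: str, model: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"http://{host}:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return stitch(resp)  # HTTP response objects iterate line by line

# e.g. generate("192.168.1.50", "qwen2:0.5b", "Test")
```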

The Android device I used has a weak mobile CPU and limited RAM. Inference times were noticeably slow. Large prompts required patience.

There were additional bottlenecks:

• Termux introduces slight I/O latency since it runs a Linux environment on top of Android.
• The phone throttled performance to manage heat and battery health.
• Sustained loads caused noticeable slowdowns.

Phones are not designed to behave like servers. Thermal limits are very real.

Still, the system remained functional.

What This Actually Proves

The interesting part is not performance. It is feasibility.

A few years ago, running a language model required serious hardware. Now, even a retired Android phone can serve a lightweight LLM.

This experiment highlights three shifts:

Model compression is improving rapidly.

Edge AI is becoming practical.

Personal AI infrastructure is possible without cloud dependence.

This was not about replacing high-performance systems. It was about exploring autonomy.

Offline AI changes the equation. No network dependency. No usage limits. No recurring costs.

Is It Practical?

For production workloads? Nope, not really.

For experimentation, learning, and private local tooling, yes.

If you are building tools that require lightweight inference or offline capabilities, small models running on edge devices are increasingly viable.

The tradeoff is speed.

The benefit is independence.

I hope you enjoyed it!

Claude Code Changed How I Write Software. Here's My Setup.

2026-02-23 17:33:39

Software development demands efficiency and precision. Integrating AI into the development loop promises significant productivity gains. Many developers experiment with AI chatbots, but true transformation requires a structured approach. Claude Code offers this structure, moving beyond simple chat interfaces to an AI-native development environment.

For months, I have embedded Claude Code into my daily workflow. This isn't about using a large language model (LLM) as a glorified search engine or a quick code snippet generator. It's about leveraging an AI that understands project context, adheres to architectural patterns, and automates complex tasks. My setup has dramatically increased my output, reducing boilerplate and accelerating iteration cycles. Here is the exact configuration and workflow that made me 10x more productive.

What Claude Code Is

Claude Code is not merely a chatbot with a code interpreter. It is an agentic coding tool that runs in your terminal, designed from the ground up to collaborate with Anthropic's Claude models. It provides specific features that allow the AI to operate within a defined project context, interact with files, and execute custom commands.

It differentiates itself through its deep understanding of project structure and its ability to maintain persistent context across sessions. This allows for complex, multi-step tasks that traditional chat interfaces struggle with. Claude Code functions as an intelligent co-pilot, not just a suggestion engine.

The Power of CLAUDE.md

The cornerstone of an effective Claude Code setup is the CLAUDE.md file. This isn't just a README; it's the project's constitution for the AI. It provides Claude with a comprehensive understanding of the project's architecture, goals, constraints, and preferred coding styles.

CLAUDE.md acts as a dynamic prompt, ensuring Claude always operates with the latest, most relevant project context. This eliminates the need to repeatedly provide background information.

I place CLAUDE.md at the root of every project. It contains sections for high-level goals, architectural decisions, technology stack, coding standards, and even specific modules or files the AI should prioritize. Updating this file updates Claude's understanding of the entire project automatically.

Here is a typical CLAUDE.md structure I use:

# Project: User Management Service

## [CONTEXT]
This service manages user authentication, authorization, and profile data. It integrates with an existing API Gateway and a PostgreSQL database. All communication is RESTful JSON. Security and performance are paramount.

## [GOALS]
- Implement robust user registration and login flows.
- Provide endpoints for user profile management (CRUD operations).
- Ensure all API endpoints are secured with JWT tokens.
- Maintain high test coverage (>90%).
- Deliver a scalable and maintainable codebase.

## [ARCHITECTURAL_GUIDELINES]
- Microservice architecture.
- Stateless service design.
- Event-driven patterns for asynchronous tasks (e.g., email verification).
- Use dependency injection for all services and repositories.

## [TECHNOLOGY_STACK]
- **Language:** Python 3.10+
- **Framework:** FastAPI
- **Database:** PostgreSQL (via SQLAlchemy 2.0 ORM)
- **Authentication:** PyJWT
- **Testing:** Pytest, httpx
- **Linting/Formatting:** Black, Pylint

## [CODING_STANDARDS]
- Adhere to PEP 8.
- Use type hints extensively.
- Docstrings for all functions, classes, and modules.
- Prefer explicit over implicit.
- Error handling must be explicit and informative.
- Avoid global state.

## [IMPORTANT_FILES_OR_MODULES]
- `app/main.py`: Main FastAPI application entry point.
- `app/schemas/`: Pydantic models for request/response validation.
- `app/crud/`: Database interaction logic.
- `app/services/`: Business logic.
- `app/api/v1/endpoints/`: API route definitions.
- `app/core/security.py`: JWT handling and password hashing.

## [CONSTRAINTS]
- Response times for critical endpoints must be under 50ms.
- Database queries must be optimized; avoid N+1 problems.
- All sensitive data must be encrypted at rest and in transit.
- No external dependencies without explicit approval.

## [PREVIOUS_DECISIONS]
- Chosen UUIDs for primary keys in all database tables.
- Implemented a custom rate-limiting middleware.
- Using `loguru` for structured logging.

This detailed CLAUDE.md provides Claude with a complete operational blueprint. When I ask Claude to "implement user registration," it immediately understands the technology stack, architectural patterns, and even specific file locations. This drastically reduces the back-and-forth common with less structured AI interactions.

Custom Slash Commands for Repetitive Tasks

Repetitive development tasks are prime candidates for automation. Claude Code allows defining custom slash commands, which map to predefined prompts or sequences of actions. These commands streamline common operations, ensuring consistency and saving significant time.

I configure these commands in a claude_config.json file, typically located in my user's Claude Code configuration directory. Each command specifies a name, a description, and the underlying prompt template or script to execute.

Here are some of the custom slash commands I use daily:

  • /test_suite: Runs all tests in the current directory, then analyzes failures and suggests fixes.
  • /refactor_file <filename>: Analyzes a specified file for potential refactorings (readability, performance, adherence to standards) and proposes changes.
  • /generate_docs <module_name>: Creates or updates Sphinx/MkDocs-style documentation for a given Python module.
  • /optimize_query <sql_query>: Analyzes a SQL query for performance bottlenecks and suggests indexing strategies or query rewrites.
  • /create_endpoint <resource_name>: Generates boilerplate for a new FastAPI CRUD endpoint, including Pydantic schemas, CRUD operations, and route definitions.

Consider the /create_endpoint command. Instead of manually creating files and writing boilerplate, I type /create_endpoint product. Claude Code then leverages the CLAUDE.md context (FastAPI, SQLAlchemy, Pydantic) to generate the necessary files.

An example entry in claude_config.json for a custom command might look like this:

{
  "commands": [
    {
      "name": "create_endpoint",
      "description": "Generates a new FastAPI CRUD endpoint boilerplate.",
      "template": "Based on the CLAUDE.md context, generate a complete FastAPI CRUD endpoint for the resource '{{resource_name}}'. Include Pydantic schemas (request and response), SQLAlchemy CRUD operations, and the API router definitions. Ensure type hints and docstrings are present. Provide the code for app/schemas/{{resource_name}}.py, app/crud/{{resource_name}}.py, and app/api/v1/endpoints/{{resource_name}}.py. Use UUIDs for IDs.",
      "args": [
        {
          "name": "resource_name",
          "type": "string",
          "description": "The name of the resource (e.g., 'user', 'product')."
        }
      ]
    },
    {
      "name": "refactor_file",
      "description": "Analyzes a file for refactoring opportunities.",
      "template": "Analyze the file '{{filename}}' for code smells, potential performance improvements, and adherence to the coding standards defined in CLAUDE.md. Propose specific, actionable refactorings, showing both the original and modified code snippets. Focus on readability, maintainability, and efficiency.",
      "args": [
        {
          "name": "filename",
          "type": "string",
          "description": "The path to the file to refactor."
        }
      ]
    }
  ]
}

This configuration allows me to invoke commands like /create_endpoint product or /refactor_file app/services/user_service.py. Claude Code parses the command, substitutes arguments into the template, and executes the refined prompt against the current project context. This automation ensures consistency and reduces manual effort significantly.
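The substitution step itself is simple string templating. A sketch of how the `{{arg}}` placeholders from the config example above could be filled in, assuming that double-brace syntax:

```python
# Sketch: fill {{name}} placeholders with command arguments, assuming
# the double-brace syntax used in the config example.
import re

def render(template: str, args: dict) -> str:
    """Replace each {{name}} placeholder with its argument value."""
    def sub(match):
        name = match.group(1)
        if name not in args:
            raise KeyError(f"missing argument: {name}")
        return args[name]
    return re.sub(r"\{\{(\w+)\}\}", sub, template)
```

Raising on a missing argument (rather than leaving the placeholder in place) keeps a typo in a command invocation from silently producing a malformed prompt.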

MCP Servers for External Tool Integration

Claude Code's real power extends beyond internal code generation through Model Context Protocol (MCP) servers. MCP servers are lightweight services (typically reached over stdio or HTTP) that act as bridges, allowing Claude Code to interact with external tools, APIs, databases, or even local system commands. This integrates the AI into a broader ecosystem, enabling it to perform actions beyond just generating text.

MCP servers empower Claude Code to "do" things in the real world, not just "suggest" them.

I use MCP servers to query internal knowledge bases, trigger CI/CD pipelines, interact with cloud provider APIs, or even perform database migrations. Each MCP server exposes a simple API that Claude Code can call with structured requests.

Here's an example of a simple Python-based MCP server that allows Claude Code to look up documentation for Python packages:

# mcp_doc_server.py
from flask import Flask, request, jsonify
import subprocess

app = Flask(__name__)

@app.route('/package_docs', methods=['POST'])
def get_package_docs():
    data = request.json
    package_name = data.get('package_name')
    if not package_name:
        return jsonify({"error": "package_name is required"}), 400

    try:
        # Example: Use pip show to get basic package info
        # For full documentation, this would integrate with a more sophisticated system
        result = subprocess.run(['pip', 'show', package_name], capture_output=True, text=True, check=True)
        return jsonify({"package_name": package_name, "docs": result.stdout}), 200
    except subprocess.CalledProcessError as e:
        return jsonify({"error": f"Could not find documentation for {package_name}: {e.stderr}"}), 404
    except Exception as e:
        return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500

if __name__ == '__main__':
    # Run the MCP server on a specific port
    app.run(port=5001, debug=False)

To enable Claude Code to use this, I would configure it in claude_config.json to define the MCP server and an associated slash command:

{
  "mcp_servers": [
    {
      "name": "doc_lookup_service",
      "url": "http://localhost:5001",
      "description": "Provides documentation lookup for Python packages."
    }
  ],
  "commands": [
    {
      "name": "get_package_docs",
      "description": "Retrieves documentation for a specified Python package.",
      "template": {
        "mcp_server": "doc_lookup_service",
        "endpoint": "/package_docs",
        "method": "POST",
        "payload": {
          "package_name": "{{package_name}}"
        }
      },
      "args": [
        {
          "name": "package_name",
          "type": "string",
          "description": "The name of the Python package."
        }
      ]
    }
  ]
}

Now, I can type /get_package_docs fastapi directly within Claude Code. Claude Code sends a POST request to http://localhost:5001/package_docs with {"package_name": "fastapi"}. The MCP server processes this, retrieves the pip show output, and returns it to Claude Code. Claude then incorporates this information into its responses, for example, by summarizing the package details or suggesting usage examples based on the retrieved documentation.

This integration is powerful. It moves Claude Code from a purely generative tool to an actionable agent within my development environment.

Session Management for Long-Running Projects

Maintaining context over long development cycles is critical. Claude Code's session management ensures that the AI retains its understanding of the project, conversation history, and active tasks across multiple interactions and even days. Losing context means repeatedly re-explaining the project state, which wastes time and dilutes efficiency.

Persistent sessions prevent context drift, allowing Claude Code to pick up exactly where it left off, even after a break.

I manage sessions by creating a new session for each major feature or bug fix branch. When I start a new task, I load the relevant session or create a new one. This keeps the AI's focus narrow and relevant to the current work.

The process involves:

  1. Saving a session: When a task is paused or completed, I save the current Claude Code session. This includes the entire conversation history, any temporary files Claude created, and its current internal state regarding the project.
  2. Loading a session: When returning to a task, I load the corresponding session. Claude Code immediately restores its context, remembering previous discussions, code snippets, and decisions.

This capability is particularly useful for large projects with complex interdependencies. Claude remembers architectural nuances, design choices, and even past refactoring discussions, ensuring continuity throughout the development lifecycle. It prevents "AI amnesia" that plagues many other LLM interactions.
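At its core, session persistence amounts to serializing conversation state to disk and restoring it later. Here is a minimal sketch of the idea; this is an illustration only, not Claude Code's actual storage format, and the file layout and field names are my own assumptions:

```python
import json
from pathlib import Path

def save_session(path: str, messages: list[dict], state: dict) -> None:
    """Serialize conversation history and task state to a JSON file."""
    Path(path).write_text(json.dumps({"messages": messages, "state": state}, indent=2))

def load_session(path: str) -> dict:
    """Restore a previously saved session; return an empty one if none exists."""
    p = Path(path)
    if not p.exists():
        return {"messages": [], "state": {}}
    return json.loads(p.read_text())

# Save on pause, reload on return
save_session(
    "feature-user-profile.json",
    messages=[{"role": "user", "content": "Implement update_user_profile"}],
    state={"branch": "feature/user-profile-editing", "step": "tests"},
)
restored = load_session("feature-user-profile.json")
print(restored["state"]["step"])  # tests
```

The point is simply that a session is data: anything the AI needs to resume (history, decisions, progress markers) gets written out and read back, which is why resuming feels seamless.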

My Exact Daily Workflow

My daily workflow with Claude Code is highly structured, leveraging CLAUDE.md, custom commands, MCP servers, and session management. This integrated approach allows me to tackle complex tasks with unprecedented speed and consistency.

Here's a step-by-step breakdown of a typical development day:

  1. Morning Setup and Session Load:

    • I start Claude Code and load the session corresponding to my current Git branch or feature. If it's a new feature, I create a new session named after the branch (e.g., feature/user-profile-editing).
    • Claude Code automatically loads the project's CLAUDE.md file, providing immediate context.
  2. Task Definition and Initial Brainstorming:

    • I articulate the task to Claude Code. For example: "Implement the update_user_profile endpoint. It should allow users to change their name and email. Ensure email uniqueness and proper validation."
    • Claude Code, referencing CLAUDE.md, suggests the relevant files to modify (schemas, crud, services, endpoints) and potential security considerations. We iterate on the API design.
  3. Code Generation and Iteration:

    • I use custom commands to generate boilerplate. For instance, /create_endpoint user_profile might be too broad, so I'd ask Claude directly to generate specific Pydantic models for the update request.
    • Claude generates the initial code for the Pydantic schemas, service logic, and endpoint definition.
    • I review the generated code, making minor adjustments. I then ask Claude to "Refine app/services/user_service.py to handle unique email constraint errors gracefully."
  4. Testing and Debugging:

    • Once the code is generated, I initiate testing. I use the /test_suite command. Claude Code executes the tests, then analyzes the output.
    • If tests fail, Claude Code highlights the failing tests and proposes specific fixes within the relevant files. For example: "The test test_update_user_profile_invalid_email failed. The current validation in app/services/user_service.py does not correctly handle existing email addresses. Here's a proposed fix:"
    • I iterate with Claude, asking it to "Generate unit tests for the update_user_profile function in app/services/user_service.py."
  5. Documentation and Refinement:

    • After the feature is functional and tested, I use /generate_docs app/services/user_service.py to automatically update the documentation for the new functions.
    • I then use /refactor_file app/api/v1/endpoints/user.py to ensure the new endpoint adheres to all coding standards and is as clean as possible.
  6. External Interactions (MCP Servers):

    • During development, if I need to look up a specific package's usage or query a staging database for existing user data patterns, I use MCP server commands like /get_package_docs pydantic or a custom /query_db <sql_statement> command to interact with a read-only staging database via an MCP server. This provides real-time data or context without leaving the Claude Code environment.
  7. Saving Session and Committing:

    • Once the feature is complete and all checks pass, I save the current Claude Code session.
    • I then commit my code to Git, often asking Claude to "Generate a concise Git commit message summarizing the changes for the 'update user profile' feature."

This workflow ensures that Claude Code is deeply integrated into every stage of development. It acts as an intelligent assistant, from initial design to final documentation, constantly informed by the project's context and capable of executing complex tasks.

Takeaway and Next Steps

Claude Code, with its structured approach to AI-assisted development, transforms how engineers build software. The combination of a definitive CLAUDE.md for context, customizable slash commands for automation, MCP servers for real-world integration, and robust session management creates an environment where the AI is a true co-developer. It is not merely a tool for generating snippets, but a partner that understands, executes, and learns within your project's ecosystem.

To start leveraging Claude Code effectively, begin by:

  1. Defining your CLAUDE.md: Invest time in clearly articulating your project's context, goals, and constraints. This is the most critical step for effective AI interaction.
  2. Identifying repetitive tasks: Pinpoint common actions in your workflow that can be automated with custom slash commands. Start with simple ones like generating tests or boilerplate.
  3. Exploring MCP server potential: Consider what external tools or data sources would benefit from direct AI interaction. Even a simple local script can unlock significant value.

Embrace Claude Code's structured environment. It's a fundamental shift in how you interact with AI, moving from ad-hoc prompting to a systematic, context-aware development partnership.

Why Your AI Agent Forgets Everything (And How to Fix It)

2026-02-23 17:32:47

Most AI agents operate with a severe handicap: they forget everything. Every interaction starts from zero. Your agent might perfectly answer a question about a product, then draw a blank when you ask a follow-up about that same product's warranty, simply because the prior context is gone. This stateless behavior cripples agent capabilities, making them frustratingly ineffective for anything beyond single-turn queries.

Building truly useful AI agents requires persistent, intelligent memory. This article demonstrates how to implement robust memory systems, moving beyond simple chat history to structured knowledge, ensuring your agents remember what matters.

The Agent Memory Problem

Large Language Models (LLMs) are inherently stateless. Each API call is a fresh request. To maintain context, developers typically pass the entire conversation history with every prompt. This approach works for short chats but quickly becomes unsustainable and inefficient.

The primary limitation of simply passing chat history is the context window. As conversations lengthen, the prompt size grows, incurring higher token costs and potentially exceeding the LLM's maximum input length. More critically, simply re-feeding raw text does not provide structured knowledge or enable complex reasoning across turns.

Agents need to recall specific facts, understand relationships, and retrieve relevant information from a vast knowledge base. Basic chat history fails at these requirements. It lacks semantic understanding and the ability to selectively retrieve information.
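The cost of the naive approach is easy to see: because every call must re-send the full transcript, the prompt grows with each turn. A toy illustration (word counts stand in for tokens; a real tokenizer would give different numbers but the same trend):

```python
history = []

def build_prompt(history: list, user_message: str) -> str:
    """Stateless LLM call: the entire history is re-sent every turn."""
    history.append({"role": "user", "content": user_message})
    return "\n".join(m["content"] for m in history)

sizes = []
for turn in range(1, 6):
    prompt = build_prompt(history, f"This is message number {turn} with some extra words")
    # Simulate the model's reply being appended to the transcript
    history.append({"role": "assistant", "content": "A reply of comparable length goes here"})
    sizes.append(len(prompt.split()))  # rough proxy for token count

print(sizes)  # strictly increasing: every turn pays for all previous turns
```

Each prompt contains all earlier prompts plus the new message, so cost and latency climb linearly until the context window is exhausted.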

Level 1: File-Based Memory (Simple State Management)

The simplest form of persistent memory involves storing explicit pieces of information in structured files or databases. This level is suitable for discrete facts, user preferences, or task-specific variables.

Description: File-based memory stores data as key-value pairs, JSON objects, or rows in a lightweight database like SQLite. The agent explicitly stores and retrieves information by a predefined key.

Use Cases:

  • Storing a user's name, preferred language, or default settings.
  • Tracking a specific task ID or progress status.
  • Remembering a temporary variable for a multi-step process.

Pros:

  • Simplicity: Easy to implement and understand.
  • Performance: Fast retrieval for exact matches.
  • Cost-effective: Minimal overhead.

Cons:

  • No Semantic Understanding: The agent only retrieves what was explicitly stored. It cannot infer or generalize.
  • Scalability Limitations: Becomes unwieldy for complex, interconnected data.
  • Manual Management: Requires explicit code to store, update, and retrieve.

Implementation Example (Conceptual):
An agent stores a user's preferred product category.

import json

class SimpleFileMemory:
    def __init__(self, filename="agent_memory.json"):
        self.filename = filename
        self.memory = self._load_memory()

    def _load_memory(self):
        try:
            with open(self.filename, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save_memory(self):
        with open(self.filename, 'w') as f:
            json.dump(self.memory, f, indent=4)

    def get(self, key, default=None):
        return self.memory.get(key, default)

    def set(self, key, value):
        self.memory[key] = value
        self._save_memory()

    def delete(self, key):
        if key in self.memory:
            del self.memory[key]
            self._save_memory()

# Example Usage
memory = SimpleFileMemory()

# Agent remembers user preference
user_id = "user_123"
memory.set(f"{user_id}_preferred_category", "Electronics")
print(f"User's preferred category: {memory.get(f'{user_id}_preferred_category')}")

# Agent remembers a task state
task_id = "task_001"
memory.set(f"{task_id}_status", "pending")
print(f"Task {task_id} status: {memory.get(f'{task_id}_status')}")

# Clear some memory
memory.delete(f"{task_id}_status")
print(f"Task {task_id} status after deletion: {memory.get(f'{task_id}_status')}")

This simple file-based memory provides immediate persistence for explicit facts. For more complex, unstructured data, a different approach is necessary.

Level 2: Vector Store Memory (Semantic Retrieval)

When agents need to recall information based on meaning rather than exact keywords, vector store memory becomes essential. This is the foundation of Retrieval-Augmented Generation (RAG).

Description: Vector store memory converts text chunks into numerical representations called embeddings. These embeddings capture the semantic meaning of the text. When the agent needs to recall information, it converts the query into an embedding and searches the vector store for semantically similar embeddings. The corresponding text chunks are then retrieved and provided to the LLM as context.

Use Cases:

  • Long-term Knowledge Base: Storing vast amounts of documentation, articles, or past conversations.
  • Semantic Search: Retrieving relevant information even if the query uses different phrasing.
  • Contextual Recall: Remembering key points from previous, lengthy interactions without re-feeding the entire transcript.
  • Chatbot Memory: Enabling a chatbot to recall previous topics or user preferences discussed implicitly across multiple sessions.

Pros:

  • Semantic Understanding: Retrieves information based on meaning, not just keywords.
  • Scalability: Handles large volumes of unstructured text efficiently.
  • Reduces Context Window Pressure: Only relevant chunks are retrieved, keeping prompt sizes manageable.
  • Extensible: Easy to add new knowledge by embedding and storing new documents.

Cons:

  • Embedding Model Dependency: Requires a robust embedding model, which incurs cost and latency.
  • Retrieval Limitations: The quality of retrieval depends heavily on the embedding model and the chunking strategy.
  • No Explicit Relationships: Does not inherently understand relationships between pieces of information; it's a "bag of facts."
  • Freshness: Stale embeddings require re-indexing to reflect updated information.
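Before reaching for a framework, it helps to see the core mechanism in isolation: semantic retrieval is just nearest-neighbor search over vectors. Here is a minimal sketch with hand-made 3-dimensional "embeddings" (real models produce hundreds of dimensions; these toy vectors are invented purely for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings": texts with similar meanings get similar vectors
store = {
    "My dog's name is Buddy":         [0.9, 0.1, 0.0],
    "The project deadline is Friday": [0.0, 0.2, 0.9],
    "My favorite color is blue":      [0.1, 0.9, 0.1],
}

# Pretend embedding of the query "What is my pet called?"
query_vec = [0.85, 0.15, 0.05]

best = max(store, key=lambda text: cosine_similarity(store[text], query_vec))
print(best)  # My dog's name is Buddy
```

A vector store does exactly this, at scale and with approximate-nearest-neighbor indexes, and an embedding model replaces the hand-made vectors.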

Implementation Example (LangChain with ChromaDB):
We use LangChain's VectorStoreRetrieverMemory with a local, on-disk ChromaDB instance for demonstration. This allows the agent to semantically recall information it previously "learned."

from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationChain
from langchain_community.llms import OpenAI
from langchain.memory import VectorStoreRetrieverMemory

import os
# Set your OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Ensure API key is set
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY environment variable not set.")

# 1. Initialize Embeddings and Vector Store
# Using a temporary directory for ChromaDB to store embeddings
# In a real application, you might persist this to disk or use a hosted solution.
vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="./chroma_db_memory")
retriever = vectorstore.as_retriever(search_kwargs={"k": 2}) # Retrieve top 2 most relevant documents

# 2. Initialize VectorStoreRetrieverMemory
# This memory type uses a retriever to fetch relevant documents based on the current input.
memory = VectorStoreRetrieverMemory(retriever=retriever)

# 3. Initialize the LLM
llm = OpenAI(temperature=0) # Using a low temperature for consistent responses

# 4. Create a Conversation Chain with the VectorStoreRetrieverMemory
# The chain will automatically add retrieved documents to the prompt.
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True # Set to True to see the prompt with retrieved context
)

# --- Agent Learning Phase ---
print("--- Agent Learning Phase ---")

# Simulate the agent "learning" facts by adding them to memory
# These facts will be embedded and stored in the vector store.
memory.save_context({"input": "My favorite color is blue."}, {"output": "Okay, I'll remember that your favorite color is blue."})
memory.save_context({"input": "I like to hike on weekends."}, {"output": "Hiking sounds like a great weekend activity!"})
memory.save_context({"input": "My name is Alice and I work as a software engineer."}, {"output": "Nice to meet you, Alice. A software engineer, interesting!"})
memory.save_context({"input": "The project deadline is next Friday."}, {"output": "Got it, next Friday is the deadline."})
memory.save_context({"input": "My dog's name is Buddy."}, {"output": "Buddy, what a cute name for a dog!"})


# --- Agent Recall Phase ---
print("\n--- Agent Recall Phase ---")

# Query the agent with questions related to the stored facts.
# The memory will retrieve semantically similar information to provide context.

print("\nUser: What is my dog's name?")
response = conversation.predict(input="What is my dog's name?")
print(f"Agent: {response}")
# Expected: Agent recalls "Buddy" because "dog's name" is semantically similar to "My dog's name is Buddy."

print("\nUser: What do I do for a living?")
response = conversation.predict(input="What do I do for a living?")
print(f"Agent: {response}")
# Expected: Agent recalls "software engineer" because "do for a living" is semantically similar to "work as a software engineer."

print("\nUser: What is my favorite hue?")
response = conversation.predict(input="What is my favorite hue?")
print(f"Agent: {response}")
# Expected: Agent recalls "blue" because "hue" is semantically similar to "color."

print("\nUser: When is the project due?")
response = conversation.predict(input="When is the project due?")
print(f"Agent: {response}")
# Expected: Agent recalls "next Friday" because "project due" is semantically similar to "project deadline."

# Example of a new piece of information that will be added to memory
print("\nUser: I also enjoy reading sci-fi novels.")
response = conversation.predict(input="I also enjoy reading sci-fi novels.")
print(f"Agent: {response}")

print("\nUser: What kind of books do I read?")
response = conversation.predict(input="What kind of books do I read?")
print(f"Agent: {response}")
# Expected: Agent recalls "sci-fi novels"

# Clean up ChromaDB directory
import shutil
if os.path.exists("./chroma_db_memory"):
    shutil.rmtree("./chroma_db_memory")

To run this code, install langchain-community, langchain, openai, chromadb. Replace YOUR_OPENAI_API_KEY with your actual key.

The VectorStoreRetrieverMemory automatically embeds the current input and queries the vector store for relevant past interactions or facts. It then adds these retrieved documents to the LLM's prompt, allowing the LLM to generate a contextually aware response. This significantly enhances the agent's ability to "remember" details from a large body of information.

Level 3: Knowledge Graph Memory (Structured Relationships)

For agents that need to perform complex reasoning, understand relationships between entities, and answer multi-hop questions, knowledge graph memory provides a powerful solution.

Description: A knowledge graph represents information as a network of interconnected entities (nodes) and their relationships (edges). Instead of just storing facts, it stores how facts relate to each other. An LLM can extract these entities and relationships (triples: subject-predicate-object) from text, which are then stored in a graph database (e.g., Neo4j). When querying, the agent can traverse the graph to find indirect connections and infer new information.

Use Cases:

  • Complex Reasoning: Answering questions like "What projects did Alice work on, and what skills are required for those projects?"
  • User Profiling: Building a rich profile of a user, including their preferences, roles, projects, and how these elements are connected.
  • Domain-Specific Knowledge: Representing intricate relationships in fields like healthcare (drug-disease interactions), finance (company-subsidiary relationships), or supply chain (product-component-supplier relationships).
  • Multi-Hop Question Answering: Finding answers that require combining information from multiple distinct facts.

Pros:

  • Rich Relationships: Explicitly models how entities are connected.
  • Complex Querying: Supports powerful graph queries for deep insights.
  • Inference Capabilities: Can infer new facts or relationships based on existing ones.
  • Structured Knowledge: Provides a clear, human-readable representation of knowledge.

Cons:

  • Complexity: More challenging to build and maintain than other memory types. Requires entity and relationship extraction.
  • Resource Intensive: Graph databases can be more resource-intensive.
  • Overkill for Simple Tasks: Not necessary for basic fact recall or simple conversations.
  • Requires LLM for Extraction: Often relies on an LLM to parse text into graph triples, adding latency and cost.
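Stripped of the framework, graph memory boils down to storing (subject, predicate, object) triples and following edges across them. A minimal sketch of multi-hop traversal over a plain dictionary (the triples mirror the conversation in the example below; a real graph database adds indexing, query languages, and persistence on top of this idea):

```python
from collections import defaultdict

# Store triples as subject -> list of (predicate, object) edges
graph = defaultdict(list)

def add_triple(subject: str, predicate: str, obj: str) -> None:
    graph[subject].append((predicate, obj))

add_triple("Charlie", "works at", "Acme Corp")
add_triple("Acme Corp", "develops", "AI software")
add_triple("Acme Corp", "based in", "New York")

def multi_hop(subject: str, *predicates: str):
    """Follow a chain of predicates, e.g. Charlie -works at-> ? -develops-> ?"""
    node = subject
    for predicate in predicates:
        matches = [obj for (p, obj) in graph[node] if p == predicate]
        if not matches:
            return None
        node = matches[0]
    return node

# "What does my company do?" requires two hops:
print(multi_hop("Charlie", "works at", "develops"))  # AI software
```

Answering "what does my company do?" from a vector store would depend on the two facts landing in the same retrieved chunks; the graph makes the connection explicit and traversable.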

Implementation Example (LangChain with ConversationKGMemory):
LangChain's ConversationKGMemory uses an LLM to extract knowledge triples from the conversation and stores them in a simple in-memory graph.

from langchain.chains import ConversationChain
from langchain_community.llms import OpenAI
from langchain.memory import ConversationKGMemory

import os
# Set your OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Ensure API key is set
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY environment variable not set.")

# 1. Initialize the LLM
llm = OpenAI(temperature=0)

# 2. Initialize ConversationKGMemory
# This memory extracts knowledge triples (subject, predicate, object) from the conversation
# and stores them. When the LLM is prompted, relevant triples are added to the context.
memory = ConversationKGMemory(llm=llm)

# 3. Create a Conversation Chain with the Knowledge Graph Memory
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True # Set to True to see the prompt with retrieved KG context
)

# --- Agent Learning Phase ---
print("--- Agent Learning Phase ---")

print("\nUser: My name is Charlie. I work at Acme Corp.")
response = conversation.predict(input="My name is Charlie. I work at Acme Corp.")
print(f"Agent: {response}")
# Memory will extract: (Charlie, is, name), (Charlie, works at, Acme Corp)

print("\nUser: Acme Corp develops AI software and is based in New York.")
response = conversation.predict(input="Acme Corp develops AI software and is based in New York.")
print(f"Agent: {response}")
# Memory will extract: (Acme Corp, develops, AI software), (Acme Corp, based in, New York)

print("\nUser: I have a colleague named David, who also works on AI projects.")
response = conversation.predict(input="I have a colleague named David, who also works on AI projects.")
print(f"Agent: {response}")
# Memory will extract: (Charlie, has colleague, David), (David, works on, AI projects)

# --- Agent Recall and Reasoning Phase ---
print("\n--- Agent Recall and Reasoning Phase ---")

# Query the agent with questions that require traversing the graph.

print("\nUser: Where is Acme Corp located?")
response = conversation.predict(input="Where is Acme Corp located?")
print(f"Agent: {response}")
# Expected: Agent uses the triple (Acme Corp, based in, New York) to answer.

print("\nUser: What does my company do?")
response = conversation.predict(input="What does my company do?")
print(f"Agent: {response}")
# Expected: Agent connects Charlie to Acme Corp, then Acme Corp to developing AI software.

print("\nUser: Who is David and what does he do?")
response = conversation.predict(input="Who is David and what does he do?")
print(f"Agent: {response}")
# Expected: Agent connects Charlie to David (colleague), then David to working on AI projects.

print("\nUser: Tell me about yourself, Charlie.")
response = conversation.predict(input="Tell me about yourself, Charlie.")
print(f"Agent: {response}")
# Expected: Agent combines multiple facts about Charlie and his company.

# You can also manually add triples to the memory
print("\n--- Manually Adding Knowledge ---")
from langchain_community.graphs.networkx_graph import KnowledgeTriple
memory.kg.add_triple(KnowledgeTriple("Charlie", "lives in", "Brooklyn"))
memory.kg.add_triple(KnowledgeTriple("Brooklyn", "is a borough of", "New York"))
print("Manually added: Charlie lives in Brooklyn, Brooklyn is a borough of New York")

print("\nUser: Does Charlie live in New York?")
response = conversation.predict(input="Does Charlie live in New York?")
print(f"Agent: {response}")
# Expected: Agent infers this from (Charlie, lives in, Brooklyn) and (Brooklyn, is a borough of, New York)

To run this code, install langchain-community, langchain, openai, and networkx (used by the in-memory knowledge graph). Replace YOUR_OPENAI_API_KEY with your actual key.

Notice how ConversationKGMemory automatically extracts and stores the relationships. When prompted, it queries this internal graph for relevant facts and includes them in the LLM's context, enabling more sophisticated reasoning beyond simple keyword matching or semantic similarity.

When to Use Which Memory Level

Choosing the right memory type depends on the complexity of your agent's task and the nature of the information it needs to recall.

  • File-Based Memory (Level 1):

    • Use when: Storing explicit, small, and structured facts that require exact retrieval.
    • Examples: User ID, current task status, boolean flags, simple preferences.
    • Benefit: Lowest overhead, easiest to implement.
    • Avoid when: Information requires semantic understanding or complex relationships.
  • Vector Store Memory (Level 2):

    • Use when: Dealing with large volumes of unstructured text where semantic meaning is crucial for retrieval. Ideal for RAG applications.
    • Examples: Long-term knowledge base, detailed chat histories where specific topics need recall, document search, contextual Q&A.
    • Benefit: Scales well for large text data, provides semantic recall.
    • Avoid when: You need to understand explicit relationships between entities or perform complex, multi-hop reasoning.
  • Knowledge Graph Memory (Level 3):

    • Use when: Your agent needs to understand relationships between entities, perform complex reasoning, infer new facts, or answer multi-hop questions.
    • Examples: Building rich user profiles, connecting disparate pieces of information, navigating complex domain knowledge, planning and decision-making agents.
    • Benefit: Enables sophisticated reasoning and structured knowledge representation.
    • Avoid when: Tasks are simple, data is purely unstructured, or the overhead of graph extraction and management outweighs the benefits.

Start with the simplest memory solution that meets your requirements. Only increase complexity when the problem demands it. Over-engineering memory can introduce unnecessary latency, cost, and maintenance burden.
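The decision guidance above can be condensed into a small helper. This function is only a rule-of-thumb encoding of this section's advice, not an established heuristic:

```python
def choose_memory_level(needs_semantic_recall: bool,
                        needs_entity_relationships: bool) -> str:
    """Pick the simplest memory level that satisfies the stated requirements."""
    if needs_entity_relationships:
        return "knowledge graph (Level 3)"
    if needs_semantic_recall:
        return "vector store (Level 2)"
    return "file-based (Level 1)"

print(choose_memory_level(False, False))  # file-based (Level 1)
print(choose_memory_level(True, False))   # vector store (Level 2)
print(choose_memory_level(True, True))    # knowledge graph (Level 3)
```

The ordering matters: relationship requirements dominate because a knowledge graph subsumes semantic recall for its stored facts, while the reverse is not true.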

Conclusion

Building AI agents that truly work means equipping them with more than just a fleeting short-term memory. By understanding the limitations of basic chat history and implementing layered memory solutions—from simple file-based storage to powerful vector stores and knowledge graphs—you empower your agents to retain context, recall relevant information, and perform sophisticated reasoning.

Each memory level addresses a different facet of the forgetting problem, offering a spectrum of capabilities. Choose the right tool for the job, progressively adding complexity as your agent's needs grow. This structured approach to memory design transforms stateless LLM wrappers into intelligent, persistent agents capable of engaging in meaningful, long-term interactions.

Next Steps

  • Experiment with different embedding models: Compare performance and cost for your specific use case.
  • Explore persistent vector stores: Integrate with solutions like Pinecone, Weaviate, or Qdrant for production-grade vector memory.
  • Dive deeper into knowledge graph databases: Learn how to integrate Neo4j or other graph databases with LangChain for more robust graph memory management.
  • Implement memory purging strategies: Develop methods to manage and prune old or irrelevant memories to optimize performance and cost.

Building a RAG pipeline with Kreuzberg and LangChain

2026-02-23 17:28:56

Most discussions about retrieval-augmented generation (RAG) focus on choosing the right model, tuning prompts, or experimenting with vector databases. In practice, these are rarely the hardest parts. The real bottleneck appears much earlier: getting clean, reliable text out of messy documents.

There is a real challenge in ingestion, chunking, and embeddings. PDFs preserve visual layout rather than logical structure, Office files rely on completely different internal formats, and scanned documents require OCR before any text exists at all. Metadata is often incomplete or inconsistent, and small problems at this stage propagate downstream. If the extraction quality is poor, retrieval becomes unreliable, and the language model begins to produce weak or misleading answers.

This is where Kreuzberg plays a central role, covering the entire early-stage data flow: document ingestion, text chunking, and embedding generation. A typical RAG pipeline can combine Kreuzberg for ingestion, chunking, and embeddings with LangChain as the orchestration layer, alongside a vector database and an LLM. While the architecture is fairly standard, the quality of the early steps determines everything that follows.

Embeddings are numerical vector representations of text. An embedding model converts a piece of text, such as a sentence, paragraph, or document, into a list of numbers that captures its semantic meaning. Texts with similar meanings end up close to each other in this high-dimensional vector space, making it possible to search by meaning rather than exact keywords. If you haven’t seen this before, the TensorFlow Embedding Projector is a useful way to visualize how embeddings cluster similar concepts together.

Here are the steps to a RAG pipeline with Kreuzberg and LangChain:

  1. Extract text from a sample PDF and DOCX using Kreuzberg
  2. Inspect the raw output and metadata to understand what high-quality extraction looks like
  3. Chunk the text using a concrete strategy (recursive splitting with overlap) with Kreuzberg
  4. Generate embeddings with Kreuzberg and store them in a vector database such as Chroma or FAISS
  5. Wire everything together with LangChain and run a query end-to-end

In the examples below, we'll use the Python version of Kreuzberg.

Begin by installing dependencies.

pip install kreuzberg langchain langchain-community chromadb sentence-transformers

Then, extract text from your document.

from kreuzberg import extract

# Extract from a PDF
pdf_result = extract("sample.pdf")

# Extract from a DOCX
docx_result = extract("sample.docx")

print(pdf_result.text[:500])
print(pdf_result.metadata)

At this stage, you receive:

  • Clean extracted text
  • Structured metadata
  • Page-level and document-level information

After that, chunk the extracted text. Instead of manually splitting strings, use Kreuzberg’s built-in chunking configuration.

from kreuzberg import extract, ChunkingConfig

result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50
    )
)

# Access generated chunks
for chunk in result.chunks[:3]:
    print(chunk.content)
    print(chunk.metadata)

Embeddings with Kreuzberg are the next step.

from kreuzberg import extract, ChunkingConfig, EmbeddingConfig

result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50
    ),
    embedding=EmbeddingConfig(
        preset="sentence-transformers/all-MiniLM-L6-v2"
    )
)

# Each chunk now contains an embedding vector
first_chunk = result.chunks[0]

print(len(first_chunk.embedding))  # vector dimension

Next, store the embeddings in a vector database (for example, Chroma).

import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(anonymized_telemetry=False))
collection = client.create_collection("documents")

for chunk in result.chunks:
    collection.add(
        documents=[chunk.content],
        metadatas=[chunk.metadata],
        embeddings=[chunk.embedding],
        ids=[chunk.id]
    )

And query with LangChain. LangChain orchestrates retrieval and generation.

from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA

# Reuse the same chromadb client created during indexing, and pair the store
# with the same embedding model used for the chunks, so that queries are
# embedded consistently with the stored vectors.
vectorstore = Chroma(
    client=client,
    collection_name="documents",
    embedding_function=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
)

retriever = vectorstore.as_retriever()

llm = ChatOpenAI(model="gpt-4o-mini")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever
)

response = qa_chain.run("What is this document about?")
print(response)

LangChain connects:

  • The retriever (vector database)
  • The prompt template
  • The LLM
  • The final response pipeline

What You Just Built

You now have:
Document ingestion (Kreuzberg)
Structured chunking (Kreuzberg)
Embedding generation (Kreuzberg)
Vector storage (Chroma)
Retrieval orchestration (LangChain)
Answer synthesis (LLM)

This is a complete, production-ready RAG pipeline.

Why Document Processing Can Be the Hardest Part of RAG

Many tutorials focus heavily on embeddings and prompting, but teams that deploy real systems quickly discover that data preparation is the bottleneck. Production pipelines must deal with complex layouts, multiple file formats, scanned documents, large batches, and multilingual content.

Kreuzberg is designed specifically for this layer. It transforms heterogeneous documents into clean, structured outputs that downstream systems can reliably use. In a typical RAG pipeline, Kreuzberg sits at the beginning, extracting text, structuring metadata, chunking content, and generating embeddings in a consistent and unified way.

A useful way to visualize the flow is as a sequence of transformations: documents are extracted, divided into smaller segments, converted into embeddings, stored in a vector database, retrieved in response to a query, and finally synthesized by a language model. Every stage depends on the quality of the one before it.

The Architecture of a RAG Pipeline

Although implementations differ, most pipelines follow the same logical progression. Documents are first ingested and normalized. The extracted text is then split into chunks of manageable size, after which embeddings are generated and stored in a searchable index. When a user asks a question, the system retrieves the most relevant chunks and passes them to an LLM for synthesis.

One of the strengths of the RAG pattern is that each stage can be swapped independently. The ingestion engine, embedding model, database, and LLM can all be replaced without redesigning the entire system. Keeping these concerns separated makes pipelines easier to evolve.
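This swap-ability comes from giving each stage a narrow interface. The sketch below illustrates the idea with toy stand-ins; none of the function names belong to any library:

```python
from typing import Callable

# Each stage is just a function with a narrow contract, so any one of
# them can be replaced without touching the others.
Extractor = Callable[[str], str]                      # path -> raw text
Chunker = Callable[[str], list[str]]                  # text -> chunks
Embedder = Callable[[list[str]], list[list[float]]]   # chunks -> vectors


def build_pipeline(extract: Extractor, chunk: Chunker, embed: Embedder):
    """Compose the ingestion stages into a single callable."""
    def ingest(path: str) -> list[tuple[str, list[float]]]:
        text = extract(path)
        chunks = chunk(text)
        vectors = embed(chunks)
        return list(zip(chunks, vectors))
    return ingest


# Toy stand-ins to show the composition; in a real system you would
# plug in Kreuzberg for extraction and a real embedding model here.
pipeline = build_pipeline(
    extract=lambda path: "alpha beta gamma delta",
    chunk=lambda text: text.split(),
    embed=lambda chunks: [[float(len(c))] for c in chunks],
)

records = pipeline("sample.pdf")
print(records[0])  # ('alpha', [5.0])
```

Because `build_pipeline` only depends on the three function signatures, replacing the extractor or the embedder never forces a redesign of the rest.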

Extracting Text from Documents

The first stage is always extraction. In practice, this involves reading files in multiple formats, detecting whether text is embedded or must be recovered through OCR, and preserving structural or metadata information whenever possible.

After this step, the system has clean text, document metadata, and often page-level or structural information. This output becomes the foundation for everything that follows, and in Kreuzberg’s case, it directly feeds into chunking and embedding generation.

Chunking and Embeddings

Once text has been extracted, it must be divided into smaller segments. Large documents cannot be embedded or retrieved efficiently as a single block. The goal of chunking is not only to reduce size but also to preserve meaning. Splitting in the wrong place can destroy context and reduce retrieval accuracy.

This step is especially critical because the semantic models used in RAG systems are designed to capture relationships across sequences of text. Many models effectively learn patterns in both directions, allowing them to understand context beyond individual tokens. The way text is chunked directly affects how well these relationships are preserved in the resulting embeddings.

After chunking, each segment is converted into a vector representation. At this point, each chunk becomes a structured record consisting of text, metadata, and an embedding vector. Kreuzberg handles both chunking and embedding generation, reducing complexity and ensuring consistency across the pipeline.
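The size/overlap trade-off is easiest to see in a minimal sliding-window chunker. This is a simplified stand-in for illustration, not Kreuzberg's implementation:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters, where each
    window shares `overlap` characters with the previous one, so a
    sentence cut at a boundary still appears intact in some chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
    return chunks


doc = "abcdefghij" * 3  # 30 characters
pieces = chunk_text(doc, chunk_size=12, overlap=4)

# Each consecutive pair of chunks shares its last/first 4 characters.
print(pieces)
```

Real chunkers split on sentence or paragraph boundaries rather than raw character offsets, but the overlap principle is the same.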

Retrieval and Answer Generation

When a user submits a query, the pipeline converts it into an embedding and searches the vector database for similar entries. In practice, this means finding the chunks whose representations are closest to the query in semantic space.

Frameworks like LangChain orchestrate this process, connecting retrieval, prompting, and generation into a single workflow. They also make it possible to refine retrieval, for example, through filtering, ranking, or hybrid search, so that the most relevant context is passed to the language model.

An important detail is that the model never sees the entire dataset. It only receives a carefully selected subset of chunks. The quality of this selection determines the quality of the final answer.
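At its core, this "closest in semantic space" step is just a nearest-neighbor search over vectors. A pure-Python sketch with a toy two-dimensional index makes the mechanics concrete (real systems use the same embedding model for queries and documents, and a vector database instead of a list):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float], index, top_k: int = 2) -> list[str]:
    """Return the top_k chunks whose embeddings are closest to the query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:top_k]]


# Toy index of (chunk text, embedding) pairs.
index = [
    ("refund policy", [0.9, 0.1]),
    ("shipping times", [0.1, 0.9]),
    ("return window", [0.8, 0.2]),
]

# A query vector pointing along the first axis retrieves the two
# chunks oriented the same way, and skips the unrelated one.
print(retrieve([1.0, 0.0], index, top_k=2))  # ['refund policy', 'return window']
```

Only those top-k chunks are placed into the prompt, which is why retrieval quality bounds answer quality.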

Scaling a RAG Pipeline

Once a pipeline works on a small dataset, real-world deployments introduce additional requirements. Ingestion must handle large volumes of files and often run in parallel. Retrieval systems benefit from metadata filtering and hybrid search strategies, and generation layers often include structured prompts or citation mechanisms.

At scale, another challenge emerges: as data grows, it becomes increasingly difficult to understand or navigate the information at all. Large document collections quickly exceed what humans can manually organize or search effectively. This is exactly where RAG systems become so important: they make massive, unstructured datasets usable.
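Parallel ingestion itself needs very little machinery. A hedged sketch using the standard library, with a fabricated `extract_one` standing in for a real extraction call:

```python
from concurrent.futures import ThreadPoolExecutor


def extract_one(path: str) -> dict:
    """Stand-in for a real extraction call (e.g. Kreuzberg); here we
    fabricate a result so the sketch runs without any files."""
    return {"path": path, "chars": len(path) * 100}


paths = [f"doc_{i}.pdf" for i in range(8)]

# Threads work well when extraction is I/O-bound (reading files,
# calling services); switch to ProcessPoolExecutor for CPU-bound
# workloads such as local OCR.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract_one, paths))

print(len(results), results[0]["path"])
```

`pool.map` preserves input order, so downstream chunking and indexing can rely on results lining up with their source paths.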

Common Mistakes

One of the most frequent mistakes is treating ingestion as a trivial preprocessing step. Teams often invest heavily in prompt engineering while overlooking extraction quality, only to discover that retrieval accuracy is limited by poor source data. Inconsistent chunking and missing metadata create similar issues.

A good rule of thumb is to design this early stage carefully. Because extraction, chunking, and embedding happen at the beginning, mistakes here propagate forward. Poor extraction leads to weaker chunking, lower-quality embeddings, less accurate retrieval, and ultimately worse answers.

Final Thoughts

RAG systems succeed or fail based on the quality of their data pipeline. Reliable document parsing, chunking, and consistent embedding generation form the foundation on which retrieval and generation depend.

Kreuzberg fits naturally into this architecture because it addresses the first part of the workflow: turning messy, real-world documents into clean, structured, and semantically meaningful data ready for retrieval and generation. LangChain provides the glue between components, letting you compose retrieval, prompts, and LLMs into a single, production-ready pipeline.

Don't hesitate to submit issues or make contributions to Kreuzberg on GitHub.