2026-02-23 17:44:15
Last Tuesday I had Claude Code fixing a pagination bug in my API layer. While it worked, I sat there. Waiting. Watching it think. For eleven minutes.
Meanwhile, three other tasks sat in my backlog: a Blazor component needed refactoring, a new endpoint needed tests, and the SCSS build pipeline had a caching issue. All independent. All blocked behind my single terminal.
I thought: I have 5 monitors and a machine that could run a small country. Why am I running one agent at a time?
Then I discovered that Claude Code shipped built-in worktree support, and everything changed. I went from sequential AI coding to running five agents in parallel, each on its own branch, none stepping on each other's files. My throughput didn't just double. It went up roughly 5x.
Here's exactly how I set it up, the .NET-specific gotchas I hit, and why I think worktrees are the single biggest productivity unlock for AI-assisted development right now.
A git worktree is a second (or third, or fifth) working directory linked to the same repository. Each worktree checks out a different branch, but they all share the same .git history, refs, and objects.
Think of it this way: instead of cloning your repo five times (and wasting disk space on five copies of your git history), you create five lightweight checkouts that share one .git folder.
# Your main repo
C:\code\MyApp\ # on branch: master
# Your worktrees (separate folders, same repo)
C:\code\MyApp-worktrees\fix-pagination\ # on branch: fix/pagination
C:\code\MyApp-worktrees\add-tests\ # on branch: feature/api-tests
C:\code\MyApp-worktrees\refactor-blazor\ # on branch: refactor/blazor-grid
Git introduced worktrees in version 2.5 (July 2015). They've been around for over a decade. Most developers have never used them because, until AI coding agents, there was rarely a reason to work on five branches simultaneously.
Now there is.
Here's the typical AI coding workflow in 2026: (1) write a prompt, (2) launch the agent, (3) wait for it to finish, (4) review the diff, then merge and start the next task.
Steps 1-4 are sequential. You're blocked. Your machine is doing maybe 10% of what it could.
"But I can just open another terminal and start a second agent."
No, you can't. Not safely. Two agents editing the same working directory is a recipe for corrupted state. Agent A writes to OrderService.cs while Agent B is reading it. Agent A runs dotnet build while Agent B is mid-refactor. Merge conflicts happen in real-time, inside your working directory, with no version control to save you.
Worktrees fix this. Each agent gets its own directory, its own branch, its own isolated workspace. They can all build, test, and modify files simultaneously without interference.
The syntax is simple:
# Create a worktree with a new branch
git worktree add ../MyApp-worktrees/fix-pagination -b fix/pagination
# Create a worktree from an existing branch
git worktree add ../MyApp-worktrees/fix-pagination fix/pagination
# List all worktrees
git worktree list
# Remove a worktree when you're done
git worktree remove ../MyApp-worktrees/fix-pagination
I keep my worktrees in a sibling directory to avoid cluttering the main repo:
C:\code\
├── MyApp\                # Main working directory
└── MyApp-worktrees\      # All worktrees live here
    ├── fix-pagination\
    ├── add-tests\
    └── refactor-blazor\
One critical rule: you cannot check out the same branch in two worktrees. Git enforces this by default. If your main directory is on master, no worktree can also be on master. You can override this with git worktree add -f, but don't: the restriction prevents two workspaces from stomping on each other's state. It's a feature, not a bug.
Here's where it gets interesting. Once you have worktrees set up, you can launch an AI agent in each one.
Claude Code has built-in worktree support with a --worktree (-w) CLI flag that starts a session in an isolated worktree automatically. You can also create worktrees manually and point Claude Code at them:
# Terminal 1: Main repo - fixing the pagination bug
cd C:\code\MyApp
claude "Fix the pagination bug in OrdersController where offset is off by one"
# Terminal 2: Worktree - adding API tests
cd C:\code\MyApp-worktrees\add-tests
claude "Add integration tests for all endpoints in OrdersController"
# Terminal 3: Worktree - refactoring Blazor component
cd C:\code\MyApp-worktrees\refactor-blazor
claude "Refactor the OrderGrid component to use virtualization"
# Terminal 4: Worktree - fixing SCSS
cd C:\code\MyApp-worktrees\fix-scss
claude "Fix the SCSS compilation caching issue in the build pipeline"
# Terminal 5: Worktree - documentation
cd C:\code\MyApp-worktrees\update-docs
claude "Update the API documentation for the Orders endpoint"
Five terminals. Five agents. Five branches. Zero conflicts.
Claude Code also supports spawning subagents in worktrees internally using isolation: "worktree" in agent definitions, where each subagent works in isolation and the changes get merged back. Boris Cherny, Creator and Head of Claude Code at Anthropic, called worktrees his number one productivity tip — he runs 3-5 worktrees simultaneously and described it as particularly useful for "1-shotting large batch changes like codebase-wide code migrations."
The same pattern works with any AI coding tool:
# Cursor - open each worktree as a separate workspace
code C:\code\MyApp-worktrees\fix-pagination
# GitHub Copilot CLI - run in each worktree directory
cd C:\code\MyApp-worktrees\add-tests && gh copilot suggest "..."
The worktree is just a directory. Any tool that operates on a directory works.
This is where generic worktree guides fall short. .NET projects have specific pain points that will bite you if you're not prepared.
Each worktree needs its own bin/ and obj/ directories. The good news: dotnet restore handles this automatically. The bad news: your first build in each worktree takes longer because it's restoring packages from scratch.
# After creating a worktree, always restore first
cd C:\code\MyApp-worktrees\fix-pagination
dotnet restore
The NuGet global packages cache (%userprofile%\.nuget\packages on Windows, ~/.nuget/packages on Mac/Linux) is shared across all worktrees. So the packages aren't downloaded again — they're just linked. Fast enough.
This one will get you. If all your worktrees use the same launchSettings.json, they'll all try to bind to the same port. Two Kestrel instances on port 5001 means one of them crashes.
Fix it with environment variables or override the port at launch:
# In worktree terminal, override the port
dotnet run --urls "https://localhost:5011"
# Or set it via environment variable
ASPNETCORE_URLS=https://localhost:5011 dotnet run
One gotcha: if you have Kestrel endpoints configured explicitly in appsettings.json, those override ASPNETCORE_URLS. The --urls flag is safer because it takes highest precedence.
I usually don't bother with any of this — most of the time the AI agent doesn't need to run the app, just build and test it.
User secrets are stored by UserSecretsId (set in your .csproj) under %APPDATA%\Microsoft\UserSecrets\<UserSecretsId>\secrets.json on Windows (~/.microsoft/usersecrets/ on Mac/Linux). They live outside the repo entirely. So they're shared automatically across worktrees. This is usually what you want.
appsettings.Development.json is tracked in git in most setups, so it exists in every worktree automatically. If yours is gitignored instead, copy it over in your worktree setup script. Either way, no real issues here.
If two agents both try to run dotnet ef database update against the same database at the same time, you'll get lock contention or worse.
My rule: only one worktree touches the database at a time. If a task involves migrations, it gets its own dedicated slot and the other agents work on code-only changes.
Or better: use a separate database per worktree for integration tests. Your docker-compose.yml can spin up isolated Postgres instances:
# docker-compose.worktree-tests.yml
services:
  db-pagination:
    image: postgres:17
    ports: ["5433:5432"]
    environment:
      POSTGRES_DB: myapp_pagination
  db-tests:
    image: postgres:17
    ports: ["5434:5432"]
    environment:
      POSTGRES_DB: myapp_tests
The .NET SDK is machine-wide. global.json in your repo pins the version. Since all worktrees share the same repo, they all use the same SDK version. No issues here — this just works.
Here's my actual daily workflow. I've been running this for a few weeks and it's settled into a rhythm.
Morning planning (10 minutes):
# Quick script I keep handy
#!/bin/bash
# create-worktrees.sh: pass a list of branch names
REPO="C:/code/MyApp"
TREES="C:/code/MyApp-worktrees"

cd "$REPO" || exit 1  # git worktree commands must run inside the repo

for branch in "$@"; do
    git worktree add "$TREES/$branch" -b "$branch" 2>/dev/null || \
        git worktree add "$TREES/$branch" "$branch"
    echo "Created worktree: $TREES/$branch"
done
# Usage
./create-worktrees.sh fix/pagination feature/api-tests refactor/blazor fix/scss update/docs
Parallel execution (1-2 hours):
Merge back (15 minutes):
git checkout master
git merge fix/pagination
git merge feature/api-tests
# ... and so on
git worktree remove ../MyApp-worktrees/fix-pagination
git worktree remove ../MyApp-worktrees/add-tests
# Or nuke them all (skip the first line, which is the main working tree)
git worktree list | tail -n +2 | awk '{print $1}' | xargs -I{} git worktree remove {}
Results: What used to take a full day of sequential agent sessions now takes about 2 hours including review time.
Not every task is a good worktree candidate. The ideal task for parallel AI execution:
| Good for worktrees | Bad for worktrees |
|---|---|
| Bug fix in isolated file | Database schema migration |
| Adding tests for existing code | Renaming a shared model class |
| New endpoint (separate controller) | Refactoring shared base classes |
| UI component work | Changing DI registration order |
| Documentation updates | Anything that touches Program.cs |
The rule of thumb: if two tasks would cause a merge conflict, don't run them in parallel.
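One way to sanity-check independence before launching agents is to compare each task's expected file footprint. A minimal sketch (the helper name and the idea of feeding it `git diff --name-only` output are my own illustration, not from a specific tool):

```python
def overlapping_files(task_files_a, task_files_b):
    """Return files both tasks would touch; a non-empty result means likely merge conflicts."""
    return sorted(set(task_files_a) & set(task_files_b))

# File lists could come from planning notes or `git diff --name-only master...branch`
print(overlapping_files(
    ["Controllers/OrdersController.cs", "Services/OrderService.cs"],
    ["Components/OrderGrid.razor"],
))  # → []
```

An empty intersection doesn't guarantee a clean merge (semantic conflicts exist), but a non-empty one is a reliable signal to serialize those two tasks.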
The criticisms are real. Let me address them honestly.
"I have to npm install in every worktree."
True for Node projects. For .NET, dotnet restore is fast because the global package cache is shared. If you're in a monorepo with both Node and .NET, install node_modules per worktree — it takes 30 seconds with a warm cache.
"Pre-commit hooks don't install automatically."
If you use Husky or similar, run the install command after creating the worktree. For .NET projects using dotnet format as a pre-commit hook, it works automatically since the tool is restored via dotnet tool restore.
"I have to copy env files."
Write a setup script. Seriously. If you're creating worktrees regularly, spending 20 minutes on a setup-worktree.sh script will save you hours:
#!/bin/bash
WORKTREE_DIR=$1
cp .env "$WORKTREE_DIR/.env"
cd "$WORKTREE_DIR"
dotnet restore
dotnet tool restore
echo "Worktree ready: $WORKTREE_DIR"
"Ports conflict."
Pass --urls to override the port. For ASP.NET Core integration tests, port conflicts aren't even an issue — WebApplicationFactory<T> uses an in-memory test server with no actual port binding. Multiple test suites can run simultaneously without stepping on each other.
These are all solvable problems. The throughput gain is worth the 30-minute setup cost.
I'm not going to pretend worktrees are always the answer. Skip them when your tasks are tightly coupled, for example when everything depends on a single God.cs file that everything imports. For a focused 30-minute bug fix, just use your main directory. Worktrees shine when you have 3+ hours of independent tasks and the machine to run them.
What is a git worktree? A git worktree is an additional working directory linked to an existing repository. It lets you check out a different branch in a separate folder while sharing the same git history and objects. Created with git worktree add <path> <branch>, worktrees have been available since Git 2.5 (July 2015).
Does Visual Studio work with worktrees? Yes. Visual Studio 2022 and later can open a worktree folder as a project. Solution files, project references, and NuGet packages all work normally. The only caveat is that Solution Explorer shows the worktree path, not the main repo path. JetBrains Rider also handles worktrees well.
How many worktrees can I run at once? Git imposes no hard limit. The practical limit is your machine's RAM and CPU. Each worktree with an AI agent running dotnet build consumes roughly 2-4GB of RAM. On a 32GB machine, 5-6 concurrent worktrees with active builds is comfortable. On 64GB, you can push to 10+.
Do worktrees share the NuGet package cache? Yes. The NuGet global packages folder (~/.nuget/packages) is machine-wide, not per-repository. When you run dotnet restore in a worktree, packages are resolved from the global cache. Only packages not already cached will be downloaded. This makes the first restore in a new worktree fast — usually under 10 seconds for a typical .NET solution.
Are worktrees better than multiple clones? For AI-assisted parallel development, yes. Worktrees share git history, refs, and the object database. Five worktrees use a fraction of the disk space of five full clones. Commits made in any worktree are immediately visible to all others (same .git directory). The only advantage of separate clones is full isolation — useful if you need different git configs or hooks per copy.
How do I merge everything back? Merge each branch back to your main branch one at a time. If branches touched different files (which they should if you planned well), merges are clean. For conflicts, resolve them using your normal merge workflow. The key is task selection: if you chose truly independent tasks, merge conflicts are rare. I've been running 5 parallel branches daily for weeks and hit fewer than 3 conflicts total.
The era of watching a single AI agent grind through your tasks one by one is over. Git worktrees give you isolated workspaces in seconds. AI coding tools give you agents that can fill each one.
The math is simple. If one agent takes 10 minutes per task and you have 5 tasks, that's 50 minutes sequential. With 5 worktrees, it's 10 minutes plus review time.
Set up a few worktrees. Pick independent tasks. Launch your agents. Go make coffee.
When you come back, five branches will be waiting for review.
Now if you'll excuse me, I have 4 agents running and one of them just finished refactoring my Blazor grid component. Time to review.
I'm Mashrul Haque, a Systems Architect with over 15 years of experience building enterprise applications with .NET, Blazor, ASP.NET Core, and SQL Server. I specialize in Azure cloud architecture, AI integration, and performance optimization.
When production catches fire at 2 AM, I'm the one they call.
Follow me here on dev.to for more .NET and AI coding content
2026-02-23 17:41:16
Agent skill + prompt templates that generate rich HTML pages for visual diff reviews, architecture overviews, plan audits, data tables, and project recaps
git clone https://github.com/nicobailon/visual-explainer
cd visual-explainer
Feel free to follow!
2026-02-23 17:38:41
Cloud AI is convenient. It is also expensive and dependent on internet access.
Over the weekend, I tried something different: I converted an old 4GB Android phone into a local LLM server and routed it to my PC.
The goal was simple. Run AI offline. No subscriptions. No API costs.
Here is what worked, what did not, and what this experiment reveals about the future of edge AI.
The Stack
The setup was minimal:
• Termux for a Linux-like environment on Android
• Ollama for running local language models
• Qwen2 (0.5b variant) as the lightweight model
• One old Android device with 4GB RAM
Below are the exact steps that worked for me when setting up Termux and Ollama on Android.
Download the latest APK from the official GitHub releases page of Termux. After installation, grant storage and network permissions when prompted.
Inside Termux:
pkg update && pkg install ollama
This installs Ollama directly in the Termux environment.
Expose it to your local network:
export OLLAMA_HOST=0.0.0.0:11434
ollama serve &
Setting 0.0.0.0 allows other devices on the same network to connect.
For low-RAM devices, I used:
ollama pull qwen2:0.5b
This pulls the 0.5B parameter variant of Qwen2, which is small enough to run on constrained hardware. If download speed is an issue, using alternative mirrors can help.
ollama run qwen2:0.5b
Note: On some setups, Ollama may throw an error about a missing serve executable. Creating a symbolic link fixes it:
ln -s $PREFIX/bin/ollama $PREFIX/bin/serve
This maps the expected command to the correct binary.
From your computer, send a request to the phone’s local IP:
curl http://[phone-ip]:11434/api/generate -d '{"model": "qwen2:0.5b", "prompt": "Test"}'
If everything is configured correctly, the phone responds with generated text.
At this point, your Android device is functioning as a local LLM server.
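The same endpoint is easy to script from code. By default Ollama streams its reply as one JSON object per line, so a client has to stitch the `response` fields together. A minimal sketch (the host is a placeholder for your phone's IP; only the offline stream-joining logic is demonstrated here):

```python
import json
from urllib.request import Request, urlopen

def join_stream(ndjson_lines):
    """Each streamed line is a JSON object; concatenate its 'response' fields."""
    return "".join(json.loads(line)["response"] for line in ndjson_lines if line.strip())

def generate(host, model, prompt):
    """Call Ollama's /api/generate on the phone and return the full generated text."""
    body = json.dumps({"model": model, "prompt": prompt}).encode()
    with urlopen(Request(f"http://{host}:11434/api/generate", data=body)) as resp:
        return join_stream(line.decode() for line in resp)

# Offline demo of the joining logic, using two chunks shaped like Ollama's stream:
chunks = ['{"response": "Hel", "done": false}', '{"response": "lo", "done": true}']
print(join_stream(chunks))  # → Hello
```

Calling `generate("192.168.1.42", "qwen2:0.5b", "Test")` from the PC would then mirror the curl request above.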
The Android device I used has a weak mobile CPU and limited RAM. Inference times were noticeably slow. Large prompts required patience.
There were additional bottlenecks:
• Termux introduces slight I/O latency since it runs a Linux environment on top of Android.
• The phone throttled performance to manage heat and battery health.
• Sustained loads caused noticeable slowdowns.
Phones are not designed to behave like servers. Thermal limits are very real.
Still, the system remained functional.
What This Actually Proves
The interesting part is not performance. It is feasibility.
A few years ago, running a language model required serious hardware. Now, even a retired Android phone can serve a lightweight LLM.
This experiment highlights three shifts:
Model compression is improving rapidly.
Edge AI is becoming practical.
Personal AI infrastructure is possible without cloud dependence.
This was not about replacing high-performance systems. It was about exploring autonomy.
Offline AI changes the equation. No network dependency. No usage limits. No recurring costs.
Is It Practical?
For production workloads? No, not really.
For experimentation, learning, and private local tooling, yes.
If you are building tools that require lightweight inference or offline capabilities, small models running on edge devices are increasingly viable.
The tradeoff is speed.
The benefit is independence.
I hope you enjoyed it!
2026-02-23 17:33:39
Software development demands efficiency and precision. Integrating AI into the development loop promises significant productivity gains. Many developers experiment with AI chatbots, but true transformation requires a structured approach. Claude Code offers this structure, moving beyond simple chat interfaces to an AI-native development environment.
For months, I have embedded Claude Code into my daily workflow. This isn't about using a large language model (LLM) as a glorified search engine or a quick code snippet generator. It's about leveraging an AI that understands project context, adheres to architectural patterns, and automates complex tasks. My setup has dramatically increased my output, reducing boilerplate and accelerating iteration cycles. Here is the exact configuration and workflow that made me 10x more productive.
Claude Code is not merely a chatbot with a code interpreter. It is a terminal-based, AI-native development environment designed from the ground up around Anthropic's Claude LLM. It provides specific features that allow the AI to operate within a defined project context, interact with files, and execute custom commands.
It differentiates itself through its deep understanding of project structure and its ability to maintain persistent context across sessions. This allows for complex, multi-step tasks that traditional chat interfaces struggle with. Claude Code functions as an intelligent co-pilot, not just a suggestion engine.
The cornerstone of an effective Claude Code setup is the CLAUDE.md file. This isn't just a README; it's the project's constitution for the AI. It provides Claude with a comprehensive understanding of the project's architecture, goals, constraints, and preferred coding styles.
CLAUDE.md acts as a dynamic prompt, ensuring Claude always operates with the latest, most relevant project context. This eliminates the need to repeatedly provide background information.
I place CLAUDE.md at the root of every project. It contains sections for high-level goals, architectural decisions, technology stack, coding standards, and even specific modules or files the AI should prioritize. Updating this file updates Claude's understanding of the entire project automatically.
Here is a typical CLAUDE.md structure I use:
# Project: User Management Service
## [CONTEXT]
This service manages user authentication, authorization, and profile data. It integrates with an existing API Gateway and a PostgreSQL database. All communication is RESTful JSON. Security and performance are paramount.
## [GOALS]
- Implement robust user registration and login flows.
- Provide endpoints for user profile management (CRUD operations).
- Ensure all API endpoints are secured with JWT tokens.
- Maintain high test coverage (>90%).
- Deliver a scalable and maintainable codebase.
## [ARCHITECTURAL_GUIDELINES]
- Microservice architecture.
- Stateless service design.
- Event-driven patterns for asynchronous tasks (e.g., email verification).
- Use dependency injection for all services and repositories.
## [TECHNOLOGY_STACK]
- **Language:** Python 3.10+
- **Framework:** FastAPI
- **Database:** PostgreSQL (via SQLAlchemy 2.0 ORM)
- **Authentication:** PyJWT
- **Testing:** Pytest, httpx
- **Linting/Formatting:** Black, Pylint
## [CODING_STANDARDS]
- Adhere to PEP 8.
- Use type hints extensively.
- Docstrings for all functions, classes, and modules.
- Prefer explicit over implicit.
- Error handling must be explicit and informative.
- Avoid global state.
## [IMPORTANT_FILES_OR_MODULES]
- `app/main.py`: Main FastAPI application entry point.
- `app/schemas/`: Pydantic models for request/response validation.
- `app/crud/`: Database interaction logic.
- `app/services/`: Business logic.
- `app/api/v1/endpoints/`: API route definitions.
- `app/core/security.py`: JWT handling and password hashing.
## [CONSTRAINTS]
- Response times for critical endpoints must be under 50ms.
- Database queries must be optimized; avoid N+1 problems.
- All sensitive data must be encrypted at rest and in transit.
- No external dependencies without explicit approval.
## [PREVIOUS_DECISIONS]
- Chosen UUIDs for primary keys in all database tables.
- Implemented a custom rate-limiting middleware.
- Using `loguru` for structured logging.
This detailed CLAUDE.md provides Claude with a complete operational blueprint. When I ask Claude to "implement user registration," it immediately understands the technology stack, architectural patterns, and even specific file locations. This drastically reduces the back-and-forth common with less structured AI interactions.
Repetitive development tasks are prime candidates for automation. Claude Code allows defining custom slash commands, which map to predefined prompts or sequences of actions. These commands streamline common operations, ensuring consistency and saving significant time.
I configure these commands in a claude_config.json file, typically located in my user's Claude Code configuration directory. Each command specifies a name, a description, and the underlying prompt template or script to execute.
Here are some of the custom slash commands I use daily:
- /test_suite: Runs all tests in the current directory, then analyzes failures and suggests fixes.
- /refactor_file <filename>: Analyzes a specified file for potential refactorings (readability, performance, adherence to standards) and proposes changes.
- /generate_docs <module_name>: Creates or updates Sphinx/MkDocs-style documentation for a given Python module.
- /optimize_query <sql_query>: Analyzes a SQL query for performance bottlenecks and suggests indexing strategies or query rewrites.
- /create_endpoint <resource_name>: Generates boilerplate for a new FastAPI CRUD endpoint, including Pydantic schemas, CRUD operations, and route definitions.
Consider the /create_endpoint command. Instead of manually creating files and writing boilerplate, I type /create_endpoint product. Claude Code then leverages the CLAUDE.md context (FastAPI, SQLAlchemy, Pydantic) to generate the necessary files.
An example entry in claude_config.json for a custom command might look like this:
{
  "commands": [
    {
      "name": "create_endpoint",
      "description": "Generates a new FastAPI CRUD endpoint boilerplate.",
      "template": "Based on the CLAUDE.md context, generate a complete FastAPI CRUD endpoint for the resource '{{resource_name}}'. Include Pydantic schemas (request and response), SQLAlchemy CRUD operations, and the API router definitions. Ensure type hints and docstrings are present. Provide the code for app/schemas/{{resource_name}}.py, app/crud/{{resource_name}}.py, and app/api/v1/endpoints/{{resource_name}}.py. Use UUIDs for IDs.",
      "args": [
        {
          "name": "resource_name",
          "type": "string",
          "description": "The name of the resource (e.g., 'user', 'product')."
        }
      ]
    },
    {
      "name": "refactor_file",
      "description": "Analyzes a file for refactoring opportunities.",
      "template": "Analyze the file '{{filename}}' for code smells, potential performance improvements, and adherence to the coding standards defined in CLAUDE.md. Propose specific, actionable refactorings, showing both the original and modified code snippets. Focus on readability, maintainability, and efficiency.",
      "args": [
        {
          "name": "filename",
          "type": "string",
          "description": "The path to the file to refactor."
        }
      ]
    }
  ]
}
This configuration allows me to invoke commands like /create_endpoint product or /refactor_file app/services/user_service.py. Claude Code parses the command, substitutes arguments into the template, and executes the refined prompt against the current project context. This automation ensures consistency and reduces manual effort significantly.
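At its core, this kind of command expansion is just placeholder substitution into the prompt template. A rough illustration of the idea (my own sketch, not Claude Code's actual implementation):

```python
def expand_template(template, args):
    """Replace {{name}} placeholders in a command template with supplied argument values."""
    for name, value in args.items():
        template = template.replace("{{" + name + "}}", value)
    return template

prompt = expand_template(
    "Generate a complete FastAPI CRUD endpoint for the resource '{{resource_name}}'.",
    {"resource_name": "product"},
)
print(prompt)  # → Generate a complete FastAPI CRUD endpoint for the resource 'product'.
```

The expanded prompt is then what actually gets sent to the model, alongside the project context.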
Claude Code's real power extends beyond internal code generation through Model Context Protocol (MCP) servers. MCP servers are lightweight services that act as bridges, allowing Claude Code to interact with external tools, APIs, databases, or even local system commands. This integrates the AI into a broader ecosystem, enabling it to perform actions beyond just generating text.
MCP servers empower Claude Code to "do" things in the real world, not just "suggest" them.
I use MCP servers to query internal knowledge bases, trigger CI/CD pipelines, interact with cloud provider APIs, or even perform database migrations. Each MCP server exposes a simple API that Claude Code can call with structured requests.
Here's an example of a simple Python-based MCP server that allows Claude Code to look up documentation for Python packages:
# mcp_doc_server.py
from flask import Flask, request, jsonify
import subprocess

app = Flask(__name__)

@app.route('/package_docs', methods=['POST'])
def get_package_docs():
    data = request.json
    package_name = data.get('package_name')
    if not package_name:
        return jsonify({"error": "package_name is required"}), 400
    try:
        # Example: use pip show to get basic package info.
        # Full documentation would require a more sophisticated backend.
        result = subprocess.run(['pip', 'show', package_name],
                                capture_output=True, text=True, check=True)
        return jsonify({"package_name": package_name, "docs": result.stdout}), 200
    except subprocess.CalledProcessError as e:
        return jsonify({"error": f"Could not find documentation for {package_name}: {e.stderr}"}), 404
    except Exception as e:
        return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500

if __name__ == '__main__':
    # Run the MCP server on a specific port
    app.run(port=5001, debug=False)
To enable Claude Code to use this, I would configure it in claude_config.json to define the MCP server and an associated slash command:
{
  "mcp_servers": [
    {
      "name": "doc_lookup_service",
      "url": "http://localhost:5001",
      "description": "Provides documentation lookup for Python packages."
    }
  ],
  "commands": [
    {
      "name": "get_package_docs",
      "description": "Retrieves documentation for a specified Python package.",
      "template": {
        "mcp_server": "doc_lookup_service",
        "endpoint": "/package_docs",
        "method": "POST",
        "payload": {
          "package_name": "{{package_name}}"
        }
      },
      "args": [
        {
          "name": "package_name",
          "type": "string",
          "description": "The name of the Python package."
        }
      ]
    }
  ]
}
Now, I can type /get_package_docs fastapi directly within Claude Code. Claude Code sends a POST request to http://localhost:5001/package_docs with {"package_name": "fastapi"}. The MCP server processes this, retrieves the pip show output, and returns it to Claude Code. Claude then incorporates this information into its responses, for example, by summarizing the package details or suggesting usage examples based on the retrieved documentation.
This integration is powerful. It moves Claude Code from a purely generative tool to an actionable agent within my development environment.
Maintaining context over long development cycles is critical. Claude Code's session management ensures that the AI retains its understanding of the project, conversation history, and active tasks across multiple interactions and even days. Losing context means repeatedly re-explaining the project state, which wastes time and dilutes efficiency.
Persistent sessions prevent context drift, allowing Claude Code to pick up exactly where it left off, even after a break.
I manage sessions by creating a new session for each major feature or bug fix branch. When I start a new task, I load the relevant session or create a new one. This keeps the AI's focus narrow and relevant to the current work.
The process involves saving the session whenever I step away from a task and reloading it when I return, so the conversation history, design decisions, and active context come back intact.
This capability is particularly useful for large projects with complex interdependencies. Claude remembers architectural nuances, design choices, and even past refactoring discussions, ensuring continuity throughout the development lifecycle. It prevents "AI amnesia" that plagues many other LLM interactions.
My daily workflow with Claude Code is highly structured, leveraging CLAUDE.md, custom commands, MCP servers, and session management. This integrated approach allows me to tackle complex tasks with unprecedented speed and consistency.
Here's a step-by-step breakdown of a typical development day:
1. Morning Setup and Session Load: I load or create the session for the branch I'm working on (e.g., feature/user-profile-editing). Claude Code reads the CLAUDE.md file, providing immediate context.
2. Task Definition and Initial Brainstorming: I describe the task, e.g., "Implement an update_user_profile endpoint. It should allow users to change their name and email. Ensure email uniqueness and proper validation." Claude, informed by CLAUDE.md, suggests the relevant files to modify (schemas, crud, services, endpoints) and potential security considerations. We iterate on the API design.
3. Code Generation and Iteration: A generic /create_endpoint user_profile might be too broad, so I'd ask Claude directly to generate specific Pydantic models for the update request, then to update app/services/user_service.py to handle unique email constraint errors gracefully.
4. Testing and Debugging: I invoke the /test_suite command. Claude Code executes the tests, then analyzes the output, for example: "test_update_user_profile_invalid_email failed. The current validation in app/services/user_service.py does not correctly handle existing email addresses. Here's a proposed fix:" followed by a patch to the update_user_profile function in app/services/user_service.py.
5. Documentation and Refinement: I run /generate_docs app/services/user_service.py to automatically update the documentation for the new functions, then /refactor_file app/api/v1/endpoints/user.py to ensure the new endpoint adheres to all coding standards and is as clean as possible.
6. External Interactions (MCP Servers): When I need outside information, I use /get_package_docs pydantic or a custom /query_db <sql_statement> command to interact with a read-only staging database via an MCP server. This provides real-time data or context without leaving the Claude Code environment.
7. Saving Session and Committing: I save the session so the next work block resumes with full context, then commit the reviewed changes.
This workflow ensures that Claude Code is deeply integrated into every stage of development. It acts as an intelligent assistant, from initial design to final documentation, constantly informed by the project's context and capable of executing complex tasks.
Claude Code, with its structured approach to AI-assisted development, transforms how engineers build software. The combination of a definitive CLAUDE.md for context, customizable slash commands for automation, MCP servers for real-world integration, and robust session management creates an environment where the AI is a true co-developer. It is not merely a tool for generating snippets, but a partner that understands, executes, and learns within your project's ecosystem.
To start leveraging Claude Code effectively, begin by:
Writing a thorough CLAUDE.md: Invest time in clearly articulating your project's context, goals, and constraints. This is the most critical step for effective AI interaction.
Embracing Claude Code's structured environment: It's a fundamental shift in how you interact with AI, moving from ad-hoc prompting to a systematic, context-aware development partnership.
2026-02-23 17:32:47
Most AI agents operate with a severe handicap: they forget everything. Every interaction starts from zero. Your agent might perfectly answer a question about a product, then draw a blank when you ask a follow-up about that same product's warranty, simply because the prior context is gone. This stateless behavior cripples agent capabilities, making them frustratingly ineffective for anything beyond single-turn queries.
Building truly useful AI agents requires persistent, intelligent memory. This article demonstrates how to implement robust memory systems, moving beyond simple chat history to structured knowledge, ensuring your agents remember what matters.
Large Language Models (LLMs) are inherently stateless. Each API call is a fresh request. To maintain context, developers typically pass the entire conversation history with every prompt. This approach works for short chats but quickly becomes unsustainable and inefficient.
The primary limitation of simply passing chat history is the context window. As conversations lengthen, the prompt size grows, incurring higher token costs and potentially exceeding the LLM's maximum input length. More critically, simply re-feeding raw text does not provide structured knowledge or enable complex reasoning across turns.
Agents need to recall specific facts, understand relationships, and retrieve relevant information from a vast knowledge base. Basic chat history fails at these requirements. It lacks semantic understanding and the ability to selectively retrieve information.
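To make the cost concrete, here is a minimal, framework-free sketch of the naive approach (message format and roles are invented for illustration): the full history is re-sent on every call, so the prompt grows with every exchange.

```python
# Naive context management: re-send the entire history each turn.
def build_prompt(history, user_message):
    """Concatenate the whole conversation into a single prompt string."""
    lines = [f"{role}: {text}" for role, text in history]
    lines.append(f"user: {user_message}")
    return "\n".join(lines)

history = []
prompt_sizes = []
for turn in range(1, 6):
    msg = f"Question number {turn} about the same product."
    prompt = build_prompt(history, msg)
    prompt_sizes.append(len(prompt))
    # Pretend the LLM answered; both sides are appended to the history.
    history.append(("user", msg))
    history.append(("assistant", f"Answer to question {turn}."))

print(prompt_sizes)  # strictly increasing: every turn pays for all prior turns
```

Token costs track prompt length, so this growth translates directly into money and, eventually, a context-window overflow.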
The simplest form of persistent memory involves storing explicit pieces of information in structured files or databases. This level is suitable for discrete facts, user preferences, or task-specific variables.
Description: File-based memory stores data as key-value pairs, JSON objects, or rows in a lightweight database like SQLite. The agent explicitly stores and retrieves information by a predefined key.
Use Cases:
Pros:
Cons:
Implementation Example (Conceptual):
An agent stores a user's preferred product category.
import json

class SimpleFileMemory:
    def __init__(self, filename="agent_memory.json"):
        self.filename = filename
        self.memory = self._load_memory()

    def _load_memory(self):
        try:
            with open(self.filename, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save_memory(self):
        with open(self.filename, 'w') as f:
            json.dump(self.memory, f, indent=4)

    def get(self, key, default=None):
        return self.memory.get(key, default)

    def set(self, key, value):
        self.memory[key] = value
        self._save_memory()

    def delete(self, key):
        if key in self.memory:
            del self.memory[key]
            self._save_memory()

# Example Usage
memory = SimpleFileMemory()

# Agent remembers a user preference
user_id = "user_123"
memory.set(f"{user_id}_preferred_category", "Electronics")
print(f"User's preferred category: {memory.get(f'{user_id}_preferred_category')}")

# Agent remembers a task state
task_id = "task_001"
memory.set(f"{task_id}_status", "pending")
print(f"Task {task_id} status: {memory.get(f'{task_id}_status')}")

# Clear some memory
memory.delete(f"{task_id}_status")
print(f"Task {task_id} status after deletion: {memory.get(f'{task_id}_status')}")
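The description above also mentions SQLite as a lightweight backend. A possible sketch of the same key-value idea backed by SQLite looks like this (the table and column names are illustrative, not a standard):

```python
import sqlite3

# Key-value memory backed by SQLite instead of a JSON file. SQLite gives
# atomic writes and survives concurrent readers better than a flat file.
class SqliteMemory:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT)"
        )

    def set(self, key, value):
        # Upsert: insert, or overwrite the existing value for this key.
        self.conn.execute(
            "INSERT INTO memory (key, value) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
            (key, value),
        )
        self.conn.commit()

    def get(self, key, default=None):
        row = self.conn.execute(
            "SELECT value FROM memory WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else default

mem = SqliteMemory()
mem.set("user_123_preferred_category", "Electronics")
print(mem.get("user_123_preferred_category"))  # Electronics
```

Pass a file path instead of `:memory:` to persist across process restarts.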
This simple file-based memory provides immediate persistence for explicit facts. For more complex, unstructured data, a different approach is necessary.
When agents need to recall information based on meaning rather than exact keywords, vector store memory becomes essential. This is the foundation of Retrieval-Augmented Generation (RAG).
Description: Vector store memory converts text chunks into numerical representations called embeddings. These embeddings capture the semantic meaning of the text. When the agent needs to recall information, it converts the query into an embedding and searches the vector store for semantically similar embeddings. The corresponding text chunks are then retrieved and provided to the LLM as context.
Use Cases:
Pros:
Cons:
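Before reaching for a framework, the core retrieval mechanism is easy to see in isolation. The sketch below substitutes a toy bag-of-words "embedding" for a learned model (the stored facts and query are invented) to show how cosine similarity selects the closest stored fact; real embedding models capture meaning beyond literal word overlap.

```python
import math
from collections import Counter

# Toy embedding: word-count vector. Real systems use learned dense vectors.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

facts = [
    "My dog's name is Buddy.",
    "I work as a software engineer.",
    "The project deadline is next Friday.",
]
store = [(fact, embed(fact)) for fact in facts]  # the "vector store"

# Retrieval: embed the query, return the most similar stored fact.
query = embed("what is my dog called")
best = max(store, key=lambda item: cosine(query, item[1]))
print(best[0])  # My dog's name is Buddy.
```

A real vector store does exactly this at scale, with approximate nearest-neighbor indexes replacing the linear scan.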
Implementation Example (LangChain with ChromaDB):
We use LangChain's VectorStoreRetrieverMemory with a locally persisted ChromaDB for demonstration. This allows the agent to semantically recall information it previously "learned."
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationChain
from langchain_community.llms import OpenAI
from langchain.memory import VectorStoreRetrieverMemory
import os
# Set your OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# Ensure API key is set
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY environment variable not set.")
# 1. Initialize Embeddings and Vector Store
# Using a temporary directory for ChromaDB to store embeddings
# In a real application, you might persist this to disk or use a hosted solution.
vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="./chroma_db_memory")
retriever = vectorstore.as_retriever(search_kwargs={"k": 2}) # Retrieve top 2 most relevant documents
# 2. Initialize VectorStoreRetrieverMemory
# This memory type uses a retriever to fetch relevant documents based on the current input.
memory = VectorStoreRetrieverMemory(retriever=retriever)
# 3. Initialize the LLM
llm = OpenAI(temperature=0) # Using a low temperature for consistent responses
# 4. Create a Conversation Chain with the VectorStoreRetrieverMemory
# The chain will automatically add retrieved documents to the prompt.
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True  # Set to True to see the prompt with retrieved context
)
# --- Agent Learning Phase ---
print("--- Agent Learning Phase ---")
# Simulate the agent "learning" facts by adding them to memory
# These facts will be embedded and stored in the vector store.
memory.save_context({"input": "My favorite color is blue."}, {"output": "Okay, I'll remember that your favorite color is blue."})
memory.save_context({"input": "I like to hike on weekends."}, {"output": "Hiking sounds like a great weekend activity!"})
memory.save_context({"input": "My name is Alice and I work as a software engineer."}, {"output": "Nice to meet you, Alice. A software engineer, interesting!"})
memory.save_context({"input": "The project deadline is next Friday."}, {"output": "Got it, next Friday is the deadline."})
memory.save_context({"input": "My dog's name is Buddy."}, {"output": "Buddy, what a cute name for a dog!"})
# --- Agent Recall Phase ---
print("\n--- Agent Recall Phase ---")
# Query the agent with questions related to the stored facts.
# The memory will retrieve semantically similar information to provide context.
print("\nUser: What is my dog's name?")
response = conversation.predict(input="What is my dog's name?")
print(f"Agent: {response}")
# Expected: Agent recalls "Buddy" because "dog's name" is semantically similar to "My dog's name is Buddy."
print("\nUser: What do I do for a living?")
response = conversation.predict(input="What do I do for a living?")
print(f"Agent: {response}")
# Expected: Agent recalls "software engineer" because "do for a living" is semantically similar to "work as a software engineer."
print("\nUser: What is my favorite hue?")
response = conversation.predict(input="What is my favorite hue?")
print(f"Agent: {response}")
# Expected: Agent recalls "blue" because "hue" is semantically similar to "color."
print("\nUser: When is the project due?")
response = conversation.predict(input="When is the project due?")
print(f"Agent: {response}")
# Expected: Agent recalls "next Friday" because "project due" is semantically similar to "project deadline."
# Example of a new piece of information that will be added to memory
print("\nUser: I also enjoy reading sci-fi novels.")
response = conversation.predict(input="I also enjoy reading sci-fi novels.")
print(f"Agent: {response}")
print("\nUser: What kind of books do I read?")
response = conversation.predict(input="What kind of books do I read?")
print(f"Agent: {response}")
# Expected: Agent recalls "sci-fi novels"
# Clean up ChromaDB directory
import shutil
if os.path.exists("./chroma_db_memory"):
    shutil.rmtree("./chroma_db_memory")
To run this code, install langchain-community, langchain, openai, and chromadb. Replace YOUR_OPENAI_API_KEY with your actual key.
The VectorStoreRetrieverMemory automatically embeds the current input and queries the vector store for relevant past interactions or facts. It then adds these retrieved documents to the LLM's prompt, allowing the LLM to generate a contextually aware response. This significantly enhances the agent's ability to "remember" details from a large body of information.
For agents that need to perform complex reasoning, understand relationships between entities, and answer multi-hop questions, knowledge graph memory provides a powerful solution.
Description: A knowledge graph represents information as a network of interconnected entities (nodes) and their relationships (edges). Instead of just storing facts, it stores how facts relate to each other. An LLM can extract these entities and relationships (triples: subject-predicate-object) from text, which are then stored in a graph database (e.g., Neo4j). When querying, the agent can traverse the graph to find indirect connections and infer new information.
Use Cases:
Pros:
Cons:
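The triple structure described above can be sketched without a graph database. The following toy (entities and relations are invented) stores subject-predicate-object triples in a plain list and answers a multi-hop question by chaining two lookups, which no single triple could answer alone:

```python
# Facts as (subject, predicate, object) triples.
triples = [
    ("Charlie", "works at", "Acme Corp"),
    ("Acme Corp", "develops", "AI software"),
    ("Acme Corp", "based in", "New York"),
    ("Charlie", "has colleague", "David"),
]

def objects(subject, predicate):
    """Return all objects linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Multi-hop question: "What does Charlie's company develop?"
# Hop 1: Charlie -> works at -> Acme Corp
# Hop 2: Acme Corp -> develops -> AI software
company = objects("Charlie", "works at")[0]
print(objects(company, "develops")[0])  # AI software
```

Graph databases generalize this with indexed traversal and query languages such as Cypher, but the reasoning pattern is the same chaining of edges.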
Implementation Example (LangChain with ConversationKGMemory):
LangChain's ConversationKGMemory uses an LLM to extract knowledge triples from the conversation and stores them in a simple in-memory graph.
from langchain.chains import ConversationChain
from langchain_community.llms import OpenAI
from langchain.memory import ConversationKGMemory
import os
# Set your OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# Ensure API key is set
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY environment variable not set.")
# 1. Initialize the LLM
llm = OpenAI(temperature=0)
# 2. Initialize ConversationKGMemory
# This memory extracts knowledge triples (subject, predicate, object) from the conversation
# and stores them. When the LLM is prompted, relevant triples are added to the context.
memory = ConversationKGMemory(llm=llm)
# 3. Create a Conversation Chain with the Knowledge Graph Memory
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True  # Set to True to see the prompt with retrieved KG context
)
# --- Agent Learning Phase ---
print("--- Agent Learning Phase ---")
print("\nUser: My name is Charlie. I work at Acme Corp.")
response = conversation.predict(input="My name is Charlie. I work at Acme Corp.")
print(f"Agent: {response}")
# Memory will extract: (Charlie, is, name), (Charlie, works at, Acme Corp)
print("\nUser: Acme Corp develops AI software and is based in New York.")
response = conversation.predict(input="Acme Corp develops AI software and is based in New York.")
print(f"Agent: {response}")
# Memory will extract: (Acme Corp, develops, AI software), (Acme Corp, based in, New York)
print("\nUser: I have a colleague named David, who also works on AI projects.")
response = conversation.predict(input="I have a colleague named David, who also works on AI projects.")
print(f"Agent: {response}")
# Memory will extract: (Charlie, has colleague, David), (David, works on, AI projects)
# --- Agent Recall and Reasoning Phase ---
print("\n--- Agent Recall and Reasoning Phase ---")
# Query the agent with questions that require traversing the graph.
print("\nUser: Where is Acme Corp located?")
response = conversation.predict(input="Where is Acme Corp located?")
print(f"Agent: {response}")
# Expected: Agent uses the triple (Acme Corp, based in, New York) to answer.
print("\nUser: What does my company do?")
response = conversation.predict(input="What does my company do?")
print(f"Agent: {response}")
# Expected: Agent connects Charlie to Acme Corp, then Acme Corp to developing AI software.
print("\nUser: Who is David and what does he do?")
response = conversation.predict(input="Who is David and what does he do?")
print(f"Agent: {response}")
# Expected: Agent connects Charlie to David (colleague), then David to working on AI projects.
print("\nUser: Tell me about yourself, Charlie.")
response = conversation.predict(input="Tell me about yourself, Charlie.")
print(f"Agent: {response}")
# Expected: Agent combines multiple facts about Charlie and his company.
# You can also manually add triples to the memory
print("\n--- Manually Adding Knowledge ---")
from langchain_community.graphs.networkx_graph import KnowledgeTriple
memory.kg.add_triple(KnowledgeTriple("Charlie", "lives in", "Brooklyn"))
memory.kg.add_triple(KnowledgeTriple("Brooklyn", "is a borough of", "New York"))
print("Manually added: Charlie lives in Brooklyn, Brooklyn is a borough of New York")
print("\nUser: Does Charlie live in New York?")
response = conversation.predict(input="Does Charlie live in New York?")
print(f"Agent: {response}")
# Expected: Agent infers this from (Charlie, lives in, Brooklyn) and (Brooklyn, is a borough of, New York)
To run this code, install langchain-community, langchain, openai, and networkx. Replace YOUR_OPENAI_API_KEY with your actual key.
Notice how ConversationKGMemory automatically extracts and stores the relationships. When prompted, it queries this internal graph for relevant facts and includes them in the LLM's context, enabling more sophisticated reasoning beyond simple keyword matching or semantic similarity.
Choosing the right memory type depends on the complexity of your agent's task and the nature of the information it needs to recall.
File-Based Memory (Level 1):
Vector Store Memory (Level 2):
Knowledge Graph Memory (Level 3):
Start with the simplest memory solution that meets your requirements. Only increase complexity when the problem demands it. Over-engineering memory can introduce unnecessary latency, cost, and maintenance burden.
Building AI agents that truly work means equipping them with more than just a fleeting short-term memory. By understanding the limitations of basic chat history and implementing layered memory solutions—from simple file-based storage to powerful vector stores and knowledge graphs—you empower your agents to retain context, recall relevant information, and perform sophisticated reasoning.
Each memory level addresses a different facet of the forgetting problem, offering a spectrum of capabilities. Choose the right tool for the job, progressively adding complexity as your agent's needs grow. This structured approach to memory design transforms stateless LLM wrappers into intelligent, persistent agents capable of engaging in meaningful, long-term interactions.
2026-02-23 17:28:56
Most discussions about retrieval-augmented generation (RAG) focus on choosing the right model, tuning prompts, or experimenting with vector databases. In practice, these are rarely the hardest parts. The real bottleneck appears much earlier: getting clean, reliable text out of messy documents.
There is a real challenge in ingestion, chunking, and embeddings. PDFs preserve visual layout rather than logical structure, Office files rely on completely different internal formats, and scanned documents require OCR before any text exists at all. Metadata is often incomplete or inconsistent, and small problems at this stage propagate downstream. If the extraction quality is poor, retrieval becomes unreliable, and the language model begins to produce weak or misleading answers.
This is where Kreuzberg plays a central role, covering the entire early-stage data flow: document ingestion, text chunking, and embedding generation. A typical RAG pipeline can combine Kreuzberg for ingestion, chunking, and embeddings with LangChain as the orchestration layer, alongside a vector database and an LLM. While the architecture is fairly standard, the quality of the early steps determines everything that follows.
Embeddings are numerical vector representations of text. An embedding model converts a piece of text, such as a sentence, paragraph, or document, into a list of numbers that captures its semantic meaning. Texts with similar meanings end up close to each other in this high-dimensional vector space, making it possible to search by meaning rather than exact keywords. If you haven’t seen this before, the TensorFlow Embedding Projector is a useful way to visualize how embeddings cluster similar concepts together.
Here are the steps to a RAG pipeline with Kreuzberg and LangChain:
In the examples, we'll use the Kreuzberg Python library.
Begin by installing dependencies.
pip install kreuzberg langchain langchain-community chromadb openai
Then, extract text from your document.
from kreuzberg import extract
# Extract from a PDF
pdf_result = extract("sample.pdf")
# Extract from a DOCX
docx_result = extract("sample.docx")
print(pdf_result.text[:500])
print(pdf_result.metadata)
At this stage, you receive:
Clean extracted text
Structured metadata
Page-level and document-level information
After that, chunk the extracted text. Instead of manually splitting strings, use Kreuzberg’s built-in chunking configuration.
from kreuzberg import extract, ChunkingConfig
result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50
    )
)

# Access generated chunks
for chunk in result.chunks[:3]:
    print(chunk.content)
    print(chunk.metadata)
Embeddings with Kreuzberg are the next step.
from kreuzberg import extract, ChunkingConfig, EmbeddingConfig
result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50
    ),
    embedding=EmbeddingConfig(
        preset="sentence-transformers/all-MiniLM-L6-v2"
    )
)

# Each chunk now contains an embedding vector
first_chunk = result.chunks[0]
print(len(first_chunk.embedding))  # vector dimension
Next, store the embeddings in a vector database (for example, Chroma).
import chromadb
from chromadb.config import Settings
client = chromadb.Client(Settings(anonymized_telemetry=False))
collection = client.create_collection("documents")
for chunk in result.chunks:
    collection.add(
        documents=[chunk.content],
        metadatas=[chunk.metadata],
        embeddings=[chunk.embedding],
        ids=[chunk.id]
    )
Finally, query with LangChain, which orchestrates retrieval and generation.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# The retriever must embed queries with the same model used at ingestion,
# so pass a matching embedding function rather than None, and reuse the
# same Chroma client that holds the collection.
vectorstore = Chroma(
    client=client,
    collection_name="documents",
    embedding_function=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
)
retriever = vectorstore.as_retriever()
llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever
)
response = qa_chain.run("What is this document about?")
print(response)
LangChain connects:
The retriever (vector database)
The prompt template
The LLM
The final response pipeline
You now have:
Document ingestion (Kreuzberg)
Structured chunking (Kreuzberg)
Embedding generation (Kreuzberg)
Vector storage (Chroma)
Retrieval orchestration (LangChain)
Answer synthesis (LLM)
This is a complete, production-ready RAG pipeline.
Many tutorials focus heavily on embeddings and prompting, but teams that deploy real systems quickly discover that data preparation is the bottleneck. Production pipelines must deal with complex layouts, multiple file formats, scanned documents, large batches, and multilingual content.
Kreuzberg is designed specifically for this layer. It transforms heterogeneous documents into clean, structured outputs that downstream systems can reliably use. In a typical RAG pipeline, Kreuzberg sits at the beginning, extracting text, structuring metadata, chunking content, and generating embeddings in a consistent and unified way.
A useful way to visualize the flow is as a sequence of transformations: documents are extracted, divided into smaller segments, converted into embeddings, stored in a vector database, retrieved in response to a query, and finally synthesized by a language model. Every stage depends on the quality of the one before it.
Although implementations differ, most pipelines follow the same logical progression. Documents are first ingested and normalized. The extracted text is then split into chunks of manageable size, after which embeddings are generated and stored in a searchable index. When a user asks a question, the system retrieves the most relevant chunks and passes them to an LLM for synthesis.
One of the strengths of the RAG pattern is that each stage can be swapped independently. The ingestion engine, embedding model, database, and LLM can all be replaced without redesigning the entire system. Keeping these concerns separated makes pipelines easier to evolve.
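One way to picture that separation is each stage as a plain function with a narrow interface; any stage can then be swapped without touching the others. All implementations below are deliberately trivial stand-ins, not real extractors or embedders:

```python
# Each pipeline stage is an independent function. Swapping, say, the chunker
# means replacing one function; the rest of the pipeline is untouched.
def ingest(doc: str) -> str:
    """Stand-in extractor: in reality, Kreuzberg or similar would run here."""
    return doc.strip()

def chunk(text: str, size: int = 40) -> list[str]:
    """Stand-in chunker: fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text: str) -> list[int]:
    """Stand-in embedder: character-frequency vector over a tiny alphabet."""
    return [chunk_text.count(c) for c in "aeiou "]

def run_pipeline(doc: str):
    text = ingest(doc)
    chunks = chunk(text)
    index = [(c, embed(c)) for c in chunks]  # the "vector store"
    return index

index = run_pipeline("  Kreuzberg extracts text; LangChain orchestrates retrieval.  ")
print(len(index), len(index[0][1]))
```

Because each stage only consumes the previous stage's output, upgrading the embedding model or switching vector databases is a local change rather than a redesign.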
The first stage is always extraction. In practice, this involves reading files in multiple formats, detecting whether text is embedded or must be recovered through OCR, and preserving structural or metadata information whenever possible.
After this step, the system has clean text, document metadata, and often page-level or structural information. This output becomes the foundation for everything that follows, and in Kreuzberg’s case, it directly feeds into chunking and embedding generation.
Once text has been extracted, it must be divided into smaller segments. Large documents cannot be embedded or retrieved efficiently as a single block. The goal of chunking is not only to reduce size but also to preserve meaning. Splitting in the wrong place can destroy context and reduce retrieval accuracy.
This step is especially critical because the semantic models used in RAG systems are designed to capture relationships across sequences of text. Many models effectively learn patterns in both directions, allowing them to understand context beyond individual tokens. The way text is chunked directly affects how well these relationships are preserved in the resulting embeddings.
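A common way to limit the damage of a bad split point is overlapping chunks: consecutive chunks share a margin of text, so content cut at one boundary survives intact in a neighbor. Here is a minimal character-based sketch (production chunkers usually split on sentence or token boundaries instead):

```python
# Sliding-window chunker: consecutive chunks share `overlap` characters.
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # advance by less than the window to create overlap
    return [text[i:i + size] for i in range(0, len(text), step) if text[i:i + size]]

text = "a" * 120
chunks = chunk_with_overlap(text, size=50, overlap=10)
print([len(c) for c in chunks])  # [50, 50, 40]
```

The overlap trades some storage and token redundancy for robustness: a sentence severed at one chunk's edge still appears whole in the next.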
After chunking, each segment is converted into a vector representation. At this point, each chunk becomes a structured record consisting of text, metadata, and an embedding vector. Kreuzberg handles both chunking and embedding generation, reducing complexity and ensuring consistency across the pipeline.
When a user submits a query, the pipeline converts it into an embedding and searches the vector database for similar entries. In practice, this means finding the chunks whose representations are closest to the query in semantic space.
Frameworks like LangChain orchestrate this process, connecting retrieval, prompting, and generation into a single workflow. They also make it possible to refine retrieval, for example, through filtering, ranking, or hybrid search, so that the most relevant context is passed to the language model.
An important detail is that the model never sees the entire dataset. It only receives a carefully selected subset of chunks. The quality of this selection determines the quality of the final answer.
Once a pipeline works on a small dataset, real-world deployments introduce additional requirements. Ingestion must handle large volumes of files and often run in parallel. Retrieval systems benefit from metadata filtering and hybrid search strategies, and generation layers often include structured prompts or citation mechanisms.
At scale, another challenge emerges: as data grows, it becomes increasingly difficult to understand or navigate the information at all. Large document collections quickly exceed what humans can manually organize or search effectively. This is exactly where RAG systems become so important: they make massive, unstructured datasets usable.
One of the most frequent mistakes is treating ingestion as a trivial preprocessing step. Teams often invest heavily in prompt engineering while overlooking extraction quality, only to discover that retrieval accuracy is limited by poor source data. Inconsistent chunking and missing metadata create similar issues.
A good rule of thumb is to design this early stage carefully. Because extraction, chunking, and embedding happen at the beginning, mistakes here propagate forward. Poor extraction leads to weaker chunking, lower-quality embeddings, less accurate retrieval, and ultimately worse answers.
RAG systems succeed or fail based on the quality of their data pipeline. Reliable document parsing, chunking, and consistent embedding generation form the foundation on which retrieval and generation depend.
Kreuzberg fits naturally into this architecture because it addresses the first part of the workflow: turning messy, real-world documents into clean, structured, and semantically meaningful data ready for retrieval and generation. LangChain provides the glue between components, letting you compose retrieval, prompts, and LLMs into a single, production-ready pipeline.
Don't hesitate to submit issues or make contributions to Kreuzberg on GitHub.