A Deep Dive of LangGraph Mechanisms & Agent Design Patterns

2026-01-05 21:38:33

Introduction

While building a CVE assessment agent, I ran into an orchestration issue that looked trivial at first—but turned out to be instructive.

The agent was implemented with LangGraph and (conceptually) structured like this:

---
config:
  theme: 'base'
  themeVariables:
    primaryColor: '#BB2528'
    primaryTextColor: '#fff'
    primaryBorderColor: '#7C0000'
    lineColor: '#F8B229'
    secondaryColor: '#006100'
    tertiaryColor: '#fff'
---
graph TD
    Start((__start__)) --> GetCVE[get_cve_data]
    Start --> GetCVSS[get_cvss_data]
    GetCVE --> GenASD[generate_asd_data]
    GetCVSS --> GetStmt[get_cvss_statement_data]
    GenASD --> Normalize[normalize_cvss_data]
    GetStmt --> Normalize
    Normalize --> GenVector[generate_cvss_vector]
    GenVector --> End((__end__))

Then the logs started to feel… off:

2025-12-24 12:07:16|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:504|_generate_cvss_vector|开始完成CVE风险评估并生成CVSS向量
2025-12-24 12:07:16|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:401|_normalize_cvss_data|开始归一化CVSS向量数据
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:527|_generate_cvss_vector|Generated CVSS Vector Response
2025-12-24 12:07:26|x-sec|DEBUG|./core/agents/mimora/cvss_vector_agent.py:528|_generate_cvss_vector|Response content: {
    "cvss_vector": "CVSS:3.1/AV:L/AC:L
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/tools/tool_utils.py:19|wrapper|开始执行工具: _calculate_cvss_score, 参数: ('CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H',), {}
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/tools/cvss_tool.py:107|_calculate_cvss_score|开始计算CVSS评分: version=3.0, vector=CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H...
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/tools/cvss_tool.py:170|_calculate_cvss_score|成功计算CVSS评分: version=3.1, base_score=7.8, base_severity=High
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/tools/tool_utils.py:22|wrapper|工具 _calculate_cvss_score 执行成功,耗时: 0.00s
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:504|_generate_cvss_vector|开始完成CVE风险评估并生成CVSS向量
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:568|_generate_cvss_severity|开始生成CVSS严重性等级
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:527|_generate_cvss_vector|Generated CVSS Vector Response
2025-12-24 12:07:35|x-sec|DEBUG|./core/agents/mimora/cvss_vector_agent.py:528|_generate_cvss_vector|Response content: {
    "cvss_vector": "CVSS:3.1/AV:L/AC:L
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/tools/tool_utils.py:19|wrapper|开始执行工具: _calculate_cvss_score, 参数: ('CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H',), {}
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/tools/cvss_tool.py:107|_calculate_cvss_score|开始计算CVSS评分: version=3.0, vector=CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H...
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/tools/cvss_tool.py:170|_calculate_cvss_score|成功计算CVSS评分: version=3.1, base_score=7.8, base_severity=High
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/tools/tool_utils.py:22|wrapper|工具 _calculate_cvss_score 执行成功,耗时: 0.00s
2025-12-24 12:07:42|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:596|_generate_cvss_severity|Generated CVSS Severity Response

The output was unstable. My first instinct was the usual scapegoat—LLM hallucination.
But a graph runtime is supposed to reduce this kind of unpredictability, not amplify it.

After tracing the execution path, I found the culprit: _generate_cvss_vector was being scheduled twice. That directly contradicted my intended topology.

I’ll skip the play-by-play debugging here. What matters is what the anomaly triggered: a deeper look into agent orchestration—and the design patterns that fall out of it.

Rethinking Agent Orchestration


Where today’s orchestration starts to crack

As systems evolve from “generative AI” (single-shot text) into autonomous agents, architecture becomes the real stability lever—more than prompts, more than model choice.

Early paradigms favored chains: linear prompt sequences that work well for small, bounded tasks.
But once an agent needs to plan, call tools, reflect, and iterate, a linear DAG (Directed Acyclic Graph) becomes a poor fit.

An agent is not a clean input–output pipeline. It is a loop of Perception → Reasoning → Action → Observation, repeated until termination—if termination exists at all. That cyclic nature violates the “acyclic” assumption. Meanwhile, many systems are drifting toward multi-agent setups: planners, executors, critics, and retrievers collaborating in parallel, all sharing and mutating context.

At that point, you inherit the problems of distributed systems: race conditions, state consistency, cyclic dependencies, and fault tolerance.

So the question becomes: what orchestration model can represent cycles and parallel collaboration without turning the runtime into a guessing game?

LangGraph’s bet is to bring the BSP (Bulk Synchronous Parallel) model—battle-tested in HPC and big-data graph computing—into agent orchestration.

Why graph computing models?

Traditional software models systems as services or objects. An agent system behaves closer to a state machine traversing a graph, where state is the asset and transitions are the work.

  1. Cycles are the default, not the exception
    ReAct is basically Think → Act → Observe → Think. DAGs can express this only indirectly (recursion, outer loops, manual re-entry), which tends to complicate call stacks and context handling. BSP treats cycles naturally: a loop is simply an ongoing sequence of supersteps.
  2. State is the center of gravity
    In agent systems, context is not “data passing through”—it is the system. Decisions are functions of the current state. BSP forces explicit state management and versioning, which aligns unusually well with LLM-based workflows.
  3. Parallelism needs a first-class synchronization primitive
    Patterns like Map-Reduce fan-out or supervisor/worker collaboration require parallel work that later converges. BSP’s barrier gives you that synchronization point natively—without ad-hoc asyncio.gather, locks, or fragile ordering assumptions.

Google Pregel & the BSP model

The Pregel framework

Pregel can be summarized in three ideas:

  • How it computes: a vertex state machine — decide whether to work or to sleep
  • How it runs: the BSP execution model — decide how the system synchronizes
  • How it propagates: message passing — move values across edges

This is the core intuition behind “think like a vertex.” Each vertex is in one of two states, with a wake-up rule:

  • Active: the vertex runs compute(), processes incoming messages, updates its value, and sends messages to neighbors.
  • Inactive (halted): the vertex “sleeps” after it votes to halt.
  • Wake-up: receiving a message brings a halted vertex back to Active.

On a cluster, computation is sliced into supersteps:

  • Compute: all active vertices run in parallel (read messages from step S-1 → compute → send messages for step S+1)
  • Messages: values are in flight
  • Barrier: everyone must finish step S—and messages must be delivered—before anyone enters step S+1

No one runs ahead; no one is left behind. That rhythm eliminates a large class of race conditions.

Example: spreading the maximum value (6) across a graph (a runnable sketch follows the steps below).

  1. Superstep 0: Node 1 holds the value 6.
  2. Message: Node 1 tells Node 2: “I have a 6.”
  3. Superstep 1: Node 2 receives 6, compares it with its own value (3), updates to 6, and propagates further.
  4. Result: the maximum spreads through the graph like a contagion.
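
To make the superstep rhythm concrete, here is a minimal, framework-free Python sketch of this max-propagation example. The toy adjacency list and starting values are my own; only node 1's value of 6 comes from the example above.

# Pregel-style max propagation: each vertex runs once per superstep, reading
# messages from step S-1 and emitting messages for step S+1; halted vertices
# wake up only when a new message arrives.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}   # toy undirected graph
values = {1: 6, 2: 3, 3: 2, 4: 1}

inbox = {v: [] for v in graph}
active = set(graph)
superstep = 0

while active:
    outbox = {v: [] for v in graph}
    for v in list(active):
        new_value = max([values[v], *inbox[v]])
        if superstep == 0 or new_value > values[v]:
            values[v] = new_value
            for n in graph[v]:
                outbox[n].append(new_value)   # message for superstep S+1
        active.discard(v)                     # vote to halt
    # Barrier: deliver messages, wake every vertex that received one
    inbox = outbox
    active = {v for v, msgs in inbox.items() if msgs}
    superstep += 1

print(values)   # {1: 6, 2: 6, 3: 6, 4: 6}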

The BSP model

Proposed by Leslie Valiant, Bulk Synchronous Parallel (BSP) divides execution into sequential supersteps. In each superstep, three things happen:

  1. Local computation: each processor computes independently on local data.
  2. Communication: processors send messages, but those messages are not visible until the next step.
  3. Barrier synchronization: everyone waits until computation and communication complete.

This tames the chaos: because messages are only visible after the barrier, every unit observes a globally consistent state from the previous step. For the programmer, the mental model is simpler: write logic that alternates between compute and communicate, bounded by a barrier.


Decoding the LangGraph runtime

So how does LangGraph implement BSP? The core engine is the PregelLoop.

StateGraph & message passing

Everything begins with state. You define a schema (often a TypedDict or a Pydantic model) representing the data that flows through the graph.

from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list[str], operator.add]
    summary: str

The key detail is Annotated[list[str], operator.add]: it defines a channel and its reducer.

Channels: decoupling reads from writes

In BSP, nodes don’t mutate shared memory directly. They publish updates to channels.

  • LastValue (default): keep the latest value (good for overwrites).
  • BinaryOperatorAggregate: the backbone of safe parallel updates. A binary operator (e.g. operator.add) merges updates at the barrier. If multiple nodes emit updates in the same superstep, the runtime aggregates them deterministically—no lost updates, no races (see the sketch after this list).
  • Topic: a pub/sub-like channel for transient events.
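
A minimal sketch of the reducer in action, reusing the AgentState schema defined above (node names are illustrative): two nodes fan out from START in the same superstep and both write to messages; at the barrier the operator.add reducer merges the two buffered updates instead of letting one overwrite the other.

from langgraph.graph import StateGraph, START, END

def node_a(state: AgentState):
    return {"messages": ["update from A"]}

def node_b(state: AgentState):
    return {"messages": ["update from B"]}

builder = StateGraph(AgentState)
builder.add_node("node_a", node_a)
builder.add_node("node_b", node_b)
builder.add_edge(START, "node_a")   # both nodes fan out from START...
builder.add_edge(START, "node_b")   # ...and run in the same superstep
builder.add_edge("node_a", END)
builder.add_edge("node_b", END)

app = builder.compile()
result = app.invoke({"messages": ["start"], "summary": ""})
# At the barrier the reducer merges both buffered writes instead of overwriting:
# ["start", "update from A", "update from B"] (relative order of A/B may vary)
print(result["messages"])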

PregelLoop: the lifecycle of a superstep

The heartbeat is PregelLoop.tick.

Phase 1: Plan

At the start of a superstep, the runtime checks channel versions.

  • It’s data-driven: if a node subscribes to a channel updated in the previous step, that node becomes active.
  • If the previous step ended on a conditional edge, the routing function decides which nodes are activated next.

Phase 2: Execute (local computation)

Active nodes run in parallel.

  • Read isolation: each node reads a snapshot of state captured at the start of the step. Even if Node A emits updates, Node B (running concurrently) still sees the old snapshot.
  • Write buffering: node outputs are buffered; they are not applied immediately.

Phase 3: Update & barrier

Once all active nodes finish:

  • collect buffered writes
  • apply reducers (e.g. old_messages + new_A + new_B)
  • increment channel versions
  • checkpoint: serialize the full state into storage

Only after this does the barrier lift and the next superstep begin.

LangGraph source code (conceptual)

State and channels

State behavior is defined by the underlying channel type.

| Channel class | Update logic | Typical use |
| --- | --- | --- |
| LastValue | value = new_value (overwrite) | flags, latest query |
| BinaryOperatorAggregate | value = reducer(value, new_value) | chat history (add_messages), parallel results |
| Topic | append to a queue | pub/sub, event streams |

# BinaryOperatorAggregate (reducer channel type)
class BinaryOperatorAggregate(BaseChannel):
    def __init__(self, operator, initial_value):
        self.operator = operator  # e.g., operator.add
        self.value = initial_value

    def update(self, values):
        if not values:
            return False

        for new_val in values:
            if isinstance(new_val, Overwrite):
                self.value = new_val.value
            else:
                # Apply reducer: old + new -> updated
                self.value = self.operator(self.value, new_val)
        return True

Pregel loop and supersteps (simplified)

class PregelLoop:
    async def execute(self, initial_state):
        # 1. Initialize channels
        self.channels = self.initialize_channels(initial_state)

        # 2. Superstep loop
        while not self.is_terminated():

            # --- Phase A: Plan ---
            tasks = []
            for node in self.nodes:
                # Trigger: input channel updated in the previous step
                if self.check_trigger(node, self.channels):
                    # Read snapshot (immutable)
                    input_snapshot = self.read_channels(node.inputs)
                    tasks.append((node, input_snapshot))

            if not tasks:
                break

            # --- Phase B: Execute (parallel) ---
            # Nodes cannot observe each other's writes within the same step
            results = await parallel_execute(tasks)

            # --- Phase C: Update (barrier) ---
            for node, result in results:
                writes = self.parse_writes(node, result)
                for channel, values in writes:
                    self.channels[channel].update(values)

            # --- Phase D: Checkpoint ---
            self.checkpointer.put(self.channels.snapshot())

            self.step += 1

Checkpointer and “time travel”

A checkpoint is not just a save file; it’s a logical clock.

It stores both channel_values (user data) and channel_versions (synchronization metadata). That enables “time travel”: load any previous checkpoint, replay execution, or fork a new branch from a past state. For debugging multi-step agent behavior, this is not a nice-to-have—it changes what is possible.
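
As a sketch of how this is typically used (reusing the small two-node graph from the channels example above, recompiled with an in-memory checkpointer; the thread id is arbitrary):

from langgraph.checkpoint.memory import MemorySaver

app = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "demo-1"}}

app.invoke({"messages": ["start"], "summary": ""}, config)

# Every superstep left a checkpoint behind; walk them newest-first.
history = list(app.get_state_history(config))
for snapshot in history:
    print(snapshot.config["configurable"]["checkpoint_id"], snapshot.next)

# "Time travel": re-run (or fork) from an earlier checkpoint by passing its id back in.
past_config = history[-2].config   # some earlier checkpoint on this thread
app.invoke(None, past_config)

Passing None as the input tells the runtime to continue from whatever state the chosen checkpoint already contains.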

Interrupts

In standard Python, pausing mid-await and serializing the suspended execution context is painful.

In BSP, the barrier between supersteps is a natural pause point. When an interrupt is configured (e.g. interrupt_before=["node_A"]), the runtime simply stops scheduling at the barrier, persists state, and exits. Resuming is just: reload checkpoint → continue with the next superstep.
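
A small sketch of the static-interrupt flavor (the graph, node, and thread names below are made up for illustration): compile with interrupt_before, run until the runtime stops at the barrier, optionally edit the paused state, then resume by invoking with None.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class ReviewState(TypedDict):
    draft: str

def write_draft(state: ReviewState):
    return {"draft": "proposed change"}

def apply_change(state: ReviewState):
    return {"draft": state["draft"] + " (applied)"}

builder2 = StateGraph(ReviewState)
builder2.add_node("write_draft", write_draft)
builder2.add_node("apply_change", apply_change)
builder2.add_edge(START, "write_draft")
builder2.add_edge("write_draft", "apply_change")
builder2.add_edge("apply_change", END)

# Stop scheduling at the barrier just before apply_change; state is persisted there.
app2 = builder2.compile(checkpointer=MemorySaver(), interrupt_before=["apply_change"])
config = {"configurable": {"thread_id": "review-1"}}

app2.invoke({"draft": ""}, config)
print(app2.get_state(config).next)   # ('apply_change',)

# A human can edit the frozen state; resuming is just "reload checkpoint and continue".
app2.update_state(config, {"draft": "human-edited change"})
app2.invoke(None, config)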

Framework comparison

| Feature | LangGraph (BSP) | Native asyncio | Notes |
| --- | --- | --- | --- |
| Control flow | step-wise: read → run → write → sync | continuous callbacks / awaits | BSP is structured and easier to reason about; asyncio can be faster but harder to audit |
| Consistency | strong: reducers resolve conflicts at the barrier | fragile: easy to introduce races | BSP reduces the need for locks |
| Debugging | time travel: replay from any step | logs only | snapshots make global reconstruction feasible |
| Runaway loops | explicit guardrails (e.g., recursion / step limits) | implicit (hangs / starvation) | BSP makes "termination policy" a first-class concern |

Versus other agent frameworks:

  • CrewAI: great for high-level “role-playing teams,” but harder to control granular state or implement rigorous rollback.
  • AutoGen: conversation-centric; state is often scattered across agent histories rather than centralized, which makes global undo and replay harder.

Advanced patterns

BSP unlocks patterns that are awkward in other architectures.

1) Map-Reduce (dynamic fan-out)

When batch size is unknown until runtime:

  • Map (step 1): a dispatcher emits Send objects
  • Process (step 2): the runtime spawns $N$ parallel workers dynamically
  • Reduce (step 3): a reducer triggers only when all parallel outputs have arrived and been aggregated at the barrier
---
config:
  theme: 'base'
  themeVariables:
    primaryColor: '#BB2528'
    primaryTextColor: '#fff'
    primaryBorderColor: '#7C0000'
    lineColor: '#F8B229'
    secondaryColor: '#006100'
    tertiaryColor: '#fff'
---
graph LR
    A[Planner Node] -->|Generate| B(Send Packet 1)
    A -->|Generate| C(Send Packet 2)
    A -->|Generate| D(Send Packet 3)
    B -.->|Dynamic Spawn| W1
    C -.->|Dynamic Spawn| W2
    D -.->|Dynamic Spawn| W3
    W1 -->|Write to| R
    W2 -->|Write to| R
    W3 -->|Write to| R
    R -->|Trigger| S

import operator
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import Send

# 1. Define the state
class OverallState(TypedDict):
    topic: str
    sections: list[str]                              # sub-tasks produced by the planner
    sub_results: Annotated[list[str], operator.add]  # aggregates all worker outputs
    final_summary: str

class WorkerState(TypedDict):
    section: str

# 2. Define the nodes
def planner(state: OverallState):
    # Dynamically generate 3 sub-tasks (placeholder values)
    return {"sections": ["background", "impact", "mitigation"]}

def fan_out(state: OverallState):
    # Return a list of Send objects. They do not run immediately;
    # they are scheduled to run in parallel in the next superstep.
    return [Send("worker_node", {"section": s}) for s in state["sections"]]

def worker(state: WorkerState):
    # Logic executed in parallel, one instance per Send packet
    return {"sub_results": [f"Finished section: {state['section']}"]}

def reducer(state: OverallState):
    # Triggered once all workers have finished; because sub_results uses
    # operator.add, the fully aggregated list is visible here
    return {"final_summary": "\n".join(state["sub_results"])}

# 3. Build the graph
graph = StateGraph(OverallState)
graph.add_node("planner", planner)
graph.add_node("worker_node", worker)
graph.add_node("reducer", reducer)

# Dynamic fan-out: the routing function returns the Send list
graph.add_conditional_edges("planner", fan_out, ["worker_node"])
graph.add_edge("worker_node", "reducer")  # Fan-in: reducer fires after all workers write
graph.add_edge("reducer", END)
graph.set_entry_point("planner")

app = graph.compile()

2) Subgraphs (fractal composition)

A graph can be wrapped as a node inside another graph. The parent graph pauses while the child graph advances through its own supersteps. This supports modularity and isolation—useful when you want complex agents without a monolith.

---
config:
  theme: 'base'
  themeVariables:
    primaryColor: '#BB2528'
    primaryTextColor: '#fff'
    primaryBorderColor: '#7C0000'
    lineColor: '#F8B229'
    secondaryColor: '#006100'
    tertiaryColor: '#fff'
    background: '#f4f4f4'
---
graph LR
    subgraph Parent Graph
        Start --> Router
        Router -->|Complexity High| SubGraphNode
        Router -->|Complexity Low| SimpleNode
        SubGraphNode --> End
        SimpleNode --> End
    end
    subgraph SubGraphNode [Child Graph execution]
        direction LR
        S_Start((Start)) --> Agent1
        Agent1 --> Critiques
        Critiques -->|Reject| Agent1
        Critiques -->|Approve| S_End((End))
    end

# Define the child graph
# (call_model, router_node, route_logic, simple_node and ParentState are assumed to be defined elsewhere)
child_builder = StateGraph(MessagesState)
child_builder.add_node("child_agent", call_model)
child_builder.add_edge(START, "child_agent")
child_builder.add_edge("child_agent", END)
child_graph = child_builder.compile()

# Define the parent graph
parent_builder = StateGraph(ParentState)
parent_builder.add_node("router", router_node)
parent_builder.add_node("simple_node", simple_node)

# Key point: the compiled child graph is added to the parent graph as a node.
# To the BSP runtime it is just an ordinary node that happens to take longer.
parent_builder.add_node("nested_workflow", child_graph)

parent_builder.add_edge(START, "router")
parent_builder.add_conditional_edges(
    "router",
    route_logic,
    {"complex": "nested_workflow", "simple": "simple_node"},
)
parent_builder.add_edge("nested_workflow", END)
parent_builder.add_edge("simple_node", END)

parent_app = parent_builder.compile()

3) Human-in-the-loop (HITL)

Because state is decoupled from execution, you can “freeze” the world, let a human edit state (e.g., correct a bank transfer amount), and then resume as if the world had always been consistent.

# Demo: an agent handling a sensitive money-transfer request:
# - Input analysis: extract the amount and recipient.
# - Risk assessment: amounts > 1000 require human approval.
# - Execution: call the bank API.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import interrupt
from langgraph.checkpoint.memory import MemorySaver

# State
class State(TypedDict):
    amount: int
    recipient: str
    status: str

# Node 1: risk check
def risk_check(state: State):
    if state["amount"] > 1000:
        # Trigger an interrupt; on resume, interrupt() returns the injected value
        decision = interrupt(f"Approve transfer of {state['amount']}?")
        if decision != "approve":
            return {"status": "rejected"}
    return {"status": "approved"}

# Node 2: execution
def execute_transfer(state: State):
    if state["status"] == "approved":
        print(f"Transferring to {state['recipient']}")
    return {}

# Build the graph (a checkpointer is required for interrupts)
workflow = StateGraph(State)
workflow.add_node("risk_check", risk_check)
workflow.add_node("execute_transfer", execute_transfer)
workflow.add_edge(START, "risk_check")
workflow.add_edge("risk_check", "execute_transfer")
workflow.add_edge("execute_transfer", END)

app = workflow.compile(checkpointer=MemorySaver())
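
Driving this graph looks roughly like the following sketch (the thread id and amounts are placeholder values; the resume value "approve" matches what risk_check checks for):

from langgraph.types import Command

config = {"configurable": {"thread_id": "transfer-001"}}

# First run: the amount exceeds 1000, so risk_check calls interrupt() and the
# graph pauses at the barrier with its state checkpointed.
app.invoke({"amount": 5000, "recipient": "Alice", "status": ""}, config)
print(app.get_state(config).next)    # still waiting, e.g. ('risk_check',)

# A human approves; interrupt() now returns the injected resume value instead of pausing.
final = app.invoke(Command(resume="approve"), config)
print(final["status"])               # "approved"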

Back to reality: the Agent implementation

Returning to the original bug, I applied these ideas in the agent development.

Speed and isolation

I used parallel execution for data fetching (get_cve_data, get_cvss_data) to reduce latency.
To avoid context pollution—where a large context from one branch (e.g., ASD generation) bleeds into another—I used subgraphs to isolate execution contexts.

class CVSSVectorAgent:
    """CVSS Vector Agent"""

    def __init__(self):
        self.data_agent = CVEDataAgent()
        self.asd_agent = MitreASDAgent()
        self.llm = ChatTongyi(name="cvss-vector-agent-llm", model="qwen3-max")
        self.prompt_manager = CVSSVectorPrompts()
        self.memory = MemorySaver()
        self.agent = self._build_graph()
        self.logger = get_logger()

    def _build_graph(self):
        ...
        # Add edges
        # get_cve_data, get_cvss_data, generate_asd_data are parallel nodes that speed up agent execution
        builder.add_edge(START, "get_cve_data")
        builder.add_edge(START, "get_cvss_data")
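
The context-isolation half of this lives in the node registrations of the same _build_graph (comments translated from the original write-up): the data and risk-modeling sub-agents are attached as subgraph-backed nodes, so their large intermediate contexts stay out of the parent agent's state.

        # Add nodes
        # The data sub-agent (CVEDataAgent) is attached via a subgraph:
        # CVSSVectorAgent and CVEDataAgent run in isolated contexts, coupled only through input/output data
        builder.add_node("get_cve_data", self._get_cve_data)
        builder.add_node("get_cvss_data", self._get_cvss_data)
        # The risk-modeling sub-agent (MitreASDAgent) is isolated the same way;
        # ASD generation alone occupies roughly 8000 tokens of context and easily hits token limits
        builder.add_node("generate_asd_data", self._generate_asd_data)
        builder.add_node("generate_cvss_vector", self._generate_cvss_vector)
        builder.add_node("generate_cvss_severity", self._generate_cvss_severity)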

An explicit synchronization barrier

To resolve the scheduling/synchronization issue, I added a no-op barrier node.

# No-op node to synchronize paths
def sync_barrier(state: CVSSVectorState):
    return {}

builder.add_node("sync_barrier", sync_barrier)

# ... route conditional edges to sync_barrier ...

# Only proceed after the barrier
builder.add_edge("sync_barrier", "normalize_cvss_data")

By making the topology explicitly respect the BSP rhythm, the “double execution” vanished. The runtime returned to a predictable cadence: compute, wait, advance.

Closing thoughts

“Knowing the tool” is the first step. “Knowing the model behind the tool” is where leverage comes from.

Moving from chains to graphs is not just a syntax upgrade—it changes how we think about time, state, and consistency in agent systems. Once you see the barrier as a clock, many problems stop being mysterious.

References

  1. Pregel: a system for large-scale graph processing
  2. Graph API overview - Docs by LangChain
  3. Pregel | LangGraph.js API Reference - GitHub Pages
  4. LangGraph runtime - Docs by LangChain
  5. Building AI Agents Using LangGraph: Part 8 — Understanding Reducers and State Updates | by HARSHA J S
  6. LangGraph overview - Docs by LangChain
  7. Use the graph API - Docs by LangChain
  8. Application structure - Docs by LangChain
  9. CompiledStateGraph | LangGraph.js API Reference - GitHub Pages
  10. StateGraph | LangGraph.js API Reference - GitHub Pages
  11. LangGraph 101: Let's Build A Deep Research Agent | Towards Data Science
  12. Building Event-Driven Multi-Agent Workflows with Triggers in LangGraph - Medium
  13. Channels | LangChain Reference
  14. if there are two nodes(one node has a prenode) go to same one 4th node , then that 4th node will run twice · Issue #5979 · langchain-ai/langgraph - GitHub
  15. Duplicate node execution when using conditionals - LangGraph - LangChain Forum
  16. Graph execution goes back to a previous node - LangGraph - LangChain Forum
  17. Graph execution goes back to a previous node - #3 by ignacio - LangChain Forum
  18. The Evolution of Graph Processing: From Pregel to LangGraph | by ...
  19. LangGraph: Multi-Agent Workflows - LangChain Blog
  20. How Build.inc used LangGraph to launch a Multi-Agent Architecture for automating critical CRE workflows for Data Center Development. - LangChain Blog
  21. Building LangGraph: Designing an Agent Runtime from first principles - LangChain Blog
  22. Pregel | LangChain Reference - LangChain Docs
  23. LangGraph Execution Semantics. | by Christoph Bussler - Medium
  24. 基于LangGraph开发复杂智能体一则 - 博客园
  25. Sink node issue, if multiple subgraphs are used in parallel · Issue #1964 · langchain-ai/langgraph - GitHub
  26. Mastering LangGraph State Management in 2025 - Sparkco
  27. LangGraph Multi-Agent Orchestration: Complete Framework Guide + Architecture Analysis 2025 - Latenode
  28. Functional API overview - Docs by LangChain
  29. My experience using Langgraph for deterministic workflow : r/LangChain - Reddit
  30. Building Smarter Agents with LangGraph: Tools, Memory & Workflows - GoPenAI
  31. Comparing AI agent frameworks: CrewAI, LangGraph, and BeeAI - IBM Developer
  32. LangGraph vs CrewAI: Let's Learn About the Differences - ZenML Blog
  33. Leveraging LangGraph's Send API for Dynamic and Parallel Workflow Execution
  34. LangGraph's Execution Model is Trickier Than You Might Think - Atomic Spin
  35. How does state work in LangGraph subgraphs? - LangChain Forum

LangGraph 机制深度解析与Agent模式设计

2026-01-04 23:13:13

引子

我在开发一个CVE相关的Agent的时候,碰到一个很有意思的Agent编排问题,Agent采用LangGraph框架开发的,具体Agent结构如下所示:

注意:...为Graph的条件边

Agent运行的结果不稳定,一开始我以为是Agent常见的幻觉问题,但是基于Graph编排就是为了避免幻觉问题,这很奇怪。在排查tracing和调用日志之后,我发现了一个很奇怪的现象:_generate_cvss_vector 执行了两次,具体日志如下所示:

2025-12-24 12:07:16|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:504|_generate_cvss_vector|开始完成CVE风险评估并生成CVSS向量
2025-12-24 12:07:16|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:401|_normalize_cvss_data|开始归一化CVSS向量数据
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:527|_generate_cvss_vector|Generated CVSS Vector Response
2025-12-24 12:07:26|x-sec|DEBUG|./core/agents/mimora/cvss_vector_agent.py:528|_generate_cvss_vector|Response content: {
    "cvss_vector": "CVSS:3.1/AV:L/AC:L
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/tools/tool_utils.py:19|wrapper|开始执行工具: _calculate_cvss_score, 参数: ('CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H',), {}
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/tools/cvss_tool.py:107|_calculate_cvss_score|开始计算CVSS评分: version=3.0, vector=CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H...
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/tools/cvss_tool.py:170|_calculate_cvss_score|成功计算CVSS评分: version=3.1, base_score=7.8, base_severity=High
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/tools/tool_utils.py:22|wrapper|工具 _calculate_cvss_score 执行成功,耗时: 0.00s
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:504|_generate_cvss_vector|开始完成CVE风险评估并生成CVSS向量
2025-12-24 12:07:26|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:568|_generate_cvss_severity|开始生成CVSS严重性等级
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:527|_generate_cvss_vector|Generated CVSS Vector Response
2025-12-24 12:07:35|x-sec|DEBUG|./core/agents/mimora/cvss_vector_agent.py:528|_generate_cvss_vector|Response content: {
    "cvss_vector": "CVSS:3.1/AV:L/AC:L
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/tools/tool_utils.py:19|wrapper|开始执行工具: _calculate_cvss_score, 参数: ('CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H',), {}
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/tools/cvss_tool.py:107|_calculate_cvss_score|开始计算CVSS评分: version=3.0, vector=CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H...
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/tools/cvss_tool.py:170|_calculate_cvss_score|成功计算CVSS评分: version=3.1, base_score=7.8, base_severity=High
2025-12-24 12:07:35|x-sec|INFO|./core/agents/mimora/tools/tool_utils.py:22|wrapper|工具 _calculate_cvss_score 执行成功,耗时: 0.00s
2025-12-24 12:07:42|x-sec|INFO|./core/agents/mimora/cvss_vector_agent.py:596|_generate_cvss_severity|Generated CVSS Severity Response

这与Agent的Graph设计编排动作并不符合,定位过程这里不细说,但是这引起了我对Agent设计模式的好奇,这篇文章我就来探索一下这个问题。

Agent编排问题思考


Agent编排有什么问题

在生成式人工智能(Generative AI)从单纯的文本生成向自主智能体(Autonomous Agents)演进的历史进程中,编排框架的架构设计成为了决定系统稳定性、可扩展性和复杂度的核心变量。早期的开发范式主要围绕“链式”(Chain)结构展开,这种线性的 Prompt 序列在处理单一、短程任务时表现出色。然而,随着需求转向能够自主决策、使用工具、自我反思并进行长期规划的“智能体”,线性有向无环图(DAG)的局限性暴露无遗。

Agent系统的本质不再是简单的输入-输出管道,而是一个包含了感知(Perception)、决策(Reasoning)、行动(Action)和观察(Observation)的无限循环。这种循环性(Cycles)打破了传统的 DAG 假设。更为关键的是,Agent系统往往不再是单打独斗,而是演变为多智体系统(Multi-Agent Systems),其中多个专注于不同领域的智能体(如规划者、执行者、审查者)需要并行工作并共享上下文 。

随着Agent系统的演进,在编排Agent系统的时候会碰到很多问题,例如需要解决多智体协作中的竞态条件(Race Conditions)、状态一致性(State Consistency)、循环推理(Cyclic Reasoning)以及容错恢复(Fault Tolerance)等,此时开发者不禁会思考能否有一个图灵完备的编排模型来解决这个问题呢?

面对这样的Agent编排的问题,LangGraph将已经在高性能计算(HPC)和大数据处理领域验证过的 BSP 模型引入到了 AI Agent 的编排中。

为什么是图计算模型

传统的软件工程倾向于将系统建模为服务(Services)或对象(Objects),而 AI Agent 的行为模式更接近于状态机(State Machine) 在图上的随机游走。

  • 非线性与循环: Agent 的核心特征是循环(Looping)。例如,ReAct 模式(Reasoning + Acting)本质上是一个 Think -> Act -> Observe -> Think 的闭环。DAG(有向无环图)无法原生表达这种循环,通常需要通过递归调用或外部循环来实现,这会导致调用栈溢出或上下文管理混乱。BSP 模型天生支持循环——循环仅仅是无限的超步序列而已。
  • 状态的中心地位: 在 Agent 系统中,Context(上下文/状态)是核心资产。所有的决策都基于当前的 State。BSP 模型强制要求显式的状态管理和版本控制,这与 Agent 对上下文依赖的需求不谋而合。
  • 并发与协作: 现代 Agent 系统往往是多角色的(Map-Reduce pattern, Supervisor pattern)。多个 Agent 需要并行工作并汇聚结果。BSP 的栅栏机制天然解决了并行任务的同步与汇聚问题,无需开发者手动编写复杂的 asyncio.gather 或锁机制 。

Google Pregel&BSP模型

Google Pregel框架

Pregel 的核心可以用三个图来概括:

  • 怎么算: “顶点状态机” —— 决定节点是工作还是休息。
  • 怎么跑: “BSP模型” —— 决定整个集群如何同步。
  • 怎么传: “最大值传播示例” —— 演示一个具体算法在图上的流动。

如上图所示,这是 Pregel "Think Like a Vertex"(像顶点一样思考) 的核心。 每个顶点只有两种状态:活跃 (Active) 和 不活跃 (Inactive/Halted)。

  • Active (活跃): 顶点正在计算。它可以处理收到的消息,更新自己的值,并向邻居发送新消息。
  • Inactive (不活跃): 顶点“睡着了”。如果它觉得自己没活干了(比如计算结果收敛了),就投票休眠(Vote to Halt)。
  • 被唤醒: 哪怕顶点睡着了,只要它收到了新消息,系统会立刻把它强制唤醒(切换回 Active),让它处理消息。

如上图所示,这是 Pregel 在分布式集群上的宏观运行方式。 计算被切分成一个个 Superstep(超步),所有机器必须“齐步走”。

  • 计算 (Compute): 所有顶点并行处理自己的逻辑(读上一轮消息 -> 算 -> 发下一轮消息)。
  • 通信 (Messages): 这一轮发出的消息,会在网络中飞一会儿。
  • 路障 (Barrier): 这是关键。 所有顶点必须都跑完 Superstep S,且消息都传到了,才能一起进入 Superstep S+1。不允许有的跑得快,有的跑得慢。

这种“走一步、停一步、等一等”的模式,解决了分布式系统中极其复杂的死锁和竞态条件问题。

如上图所示,假设我们要在一个图里找到最大的数字(在这个例子中是 6)并传给所有人。

  • Superstep 0: 大家都有初始值。节点 1 拿着最大值 6。
  • 消息传递: 节点 1 发现自己值是 6,告诉邻居节点 2:“嘿,我有 6”。
  • Superstep 1: 节点 2 收到了“6”,对比自己原来的“3”,发现 6 更大,于是更新自己为 6,并在下一轮继续传播。
  • 结果: 就像病毒扩散一样,最大值会在几次 Superstep 后覆盖全图。

BSP模型

Bulk Synchronous Parallel (BSP) 模型是一种整体同步并行计算模型,由计算机科学家 Leslie Valiant 提出。它将并行计算划分为一系列 超级步(Superstep) 顺序执行。在每个超级步内,所有处理单元都执行以下三个阶段:

  1. 本地计算阶段:每个处理单元(例如处理器或节点)使用当前可用的数据执行计算。各处理单元彼此独立、并行地进行局部运算。
  2. 消息传递阶段:处理单元将本超级步产生的输出发送为消息给其他处理单元。这些消息在本超级步内不会被目标立即处理,而是累积起来供下一个超级步使用。
  3. 全局同步屏障阶段:所有处理单元在此同步点等待,直到每个单元都完成了本超级步的前两阶段。同步屏障确保没有单元抢先进入下一超级步。

以上三个阶段严格串行发生:只有当所有处理单元完成本地计算后,才进行统一的通信,然后才能执行同步。同步屏障标志着一个超步的结束和下一个超步的开始。整个 BSP 程序由若干连续的超步构成,重复“计算->通信->同步”的流程,直到满足终止条件。由于通信中的消息仅在同步后才可见,这保证了每个超步各处理单元看到的是上一超步结束时的全局一致状态。BSP 模型具有易于编程、性能可预测且不易出现死锁等特点。从程序员视角来看,BSP 提供了一种简洁的并行语义:把并发逻辑写成在同步栅栏之间交替进行的计算和通信步骤,从而降低了思维复杂度。


LangGraph运行时框架解析

这一节主要是研究清楚LangGraph是如何实现BSP模型的(langgraph == 0.3.27),LangGraph运行时框架的核心引擎是 PregelLoop 类 。

状态图(StateGraph)与消息传递

在 LangGraph 中,一切皆始于状态(State)。开发者首先定义一个 StateSchema(通常是 TypedDict 或 Pydantic Model),它规定了图中流动的数据结构。

from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list[str], operator.add]
    summary: str

这里的 Annotated[list[str], operator.add] 是理解 LangGraph BSP 实现的关键。它定义了一个通道(Channel)及其归约器(Reducer)。

通道(Channels):解耦读与写

在 BSP 模型中,节点不直接写入内存位置。相反,它们向“通道”发送更新。通道是管理状态变更的中间层。LangGraph 提供了多种类型的通道 :

  • LastValue Channel(默认): 存储最后一次接收到的值。如果在一个超步中有多个节点向此通道发送数据,通常只有最后一个(或随机一个,取决于具体实现细节)会被保留。这适用于那些全量替换的状态字段。
  • BinaryOperatorAggregate Channel: 这是 BSP 并行能力的基石。它允许定义一个二元操作符(如 operator.add)。在栅栏同步阶段,系统会将该超步内所有发往此通道的更新值,以及当前通道的旧值,通过这个操作符进行聚合。多个 Agent 可以并行生成消息,系统会自动将它们追加(Append)到历史记录中,而不会发生覆盖。
  • Topic Channel: 类似于消息队列的主题,用于传递瞬时的事件流,不保留历史状态。

PregelLoop:Superstep的生命周期

PregelLoop.tick 方法是 LangGraph 运行时的心跳,每一次 tick 代表一个超步的执行。我们可以将其逻辑分解为以下几个微观阶段 :

阶段一:计划(Plan)

当一个新的超步开始时,运行时首先检查当前的通道版本(Channel Versions)。

  • LangGraph 维护着每个通道的数据版本号(通常是递增的整数)。
  • 每个节点都订阅了一个或多个输入通道。
  • 触发逻辑(Triggering): 如果一个节点订阅的通道在上一轮超步中发生了版本更新(即接收到了新数据),该节点就会被标记为“待执行”。
  • 条件边解析: 如果上一轮结束于一个条件边,系统会执行路由函数(Routing Function),确定下一轮应该激活哪些节点。

这一阶段对应于 BSP 模型的“调度”逻辑。重要的是,这种触发是数据驱动的(Data-driven),而非传统的控制流驱动。

阶段二:执行(Execute)—— 局部计算

一旦确定了本轮需要运行的节点集合(例如 Node A 和 Node B),LangGraph 会并行启动它们的执行。

  • 读隔离: 每个节点在执行时,读取的是本超步开始时的状态快照(Snapshot)。即使 Node A 在执行过程中“修改”了状态,并行运行的 Node B 看到的仍然是旧状态。这保证了并行任务之间的隔离性。
  • 写缓冲: 节点执行完毕后,返回一个字典(如 {"messages": ["Hello"]})。这个返回值不会立即应用到全局 State 中。它被放入一个临时的“写入缓冲区”(Write Buffer)。

这对应于 BSP 模型的“并发计算”阶段。在这一阶段,系统中不存在共享内存的竞争,因为所有的读操作都是基于快照的,所有的写操作都是缓冲的 。

阶段三:更新与栅栏(Update & Barrier)

当本超步内的所有节点都完成执行后,系统进入栅栏同步阶段。这是 LangGraph 发挥魔法的地方:

  • 收集写入: 运行时从缓冲区中取出所有节点产生的更新。
  • 执行归约(Reduce): 对于每一个通道,运行时应用预定义的 Reducer 函数。例如,如果 Node A 返回 msg1,Node B 返回 msg2,且通道配置为 add,则新状态计算为 State + msg1 + msg2。
  • 版本递增: 更新后的通道版本号加 1。
  • 持久化: 如果配置了 Checkpointer,此时系统会将更新后的完整 State 序列化并存储到数据库中。

只有在这一系列原子操作完成后,系统才会解除栅栏,准备进入下一个超步。

LangGraph 源码解析

State && Channel

LangGraph 的 State 在底层被编译为一组 Channel 对象,所以 State 的运行逻辑可以通过 channel 的源码来理解。

| 通道类型 (Class) | 对应 State 标注 | 更新逻辑 (update method) | 获取逻辑 (get method) | 典型应用场景 |
| --- | --- | --- | --- | --- |
| LastValue | int, str (无注解) | value = new_value (覆盖) | 返回当前存储的值 | 状态机流转标志、最新查询词、单一结果 |
| BinaryOperatorAggregate | Annotated | value = reducer(value, new_value) (归约) | 返回归约后的累积值 | 聊天历史 (add_messages)、并行分析结果汇总 |
| EphemeralValue | (内部使用或特殊配置) | 接受更新 | 读取一次后即清空 (Reset after read) | 信号传递、触发器、无需持久化的中间数据 |
| Topic | Topic (显式配置) | 追加到队列 | 返回本轮新增的消息列表 | Pub/Sub 模式、日志流、事件广播 |

# channel 基类抽象
class BaseChannel(ABC):
    @abstractmethod
    def update(self, values: Sequence[Any]) -> bool:
        """接收更新值,修改内部状态。返回 True 表示状态已变更。"""
        pass

    @abstractmethod
    def get(self) -> Any:
        """获取当前值 (供 Node 读取)。"""
        pass
        
    @abstractmethod
    def checkpoint(self) -> Any:
        """序列化当前状态。"""
        pass
        
# channel 实现逻辑
# 1. LastValue(默认channel类型)
class LastValue(BaseChannel):
    def update(self, values):
        if not values:
            return False
        # 直接覆盖:如果有多个 Node 同时写入,保留列表中的最后一个
        # (通常由并行执行完成的顺序决定,或由图拓扑顺序决定)
        self.value = values[-1] 
        return True

# 2. BinaryOperatorAggregate(Reducer channel类型)
class BinaryOperatorAggregate(BaseChannel):
    def __init__(self, operator, initial_value):
        self.operator = operator # 例如: operator.add 或 add_messages
        self.value = initial_value

    def update(self, values):
        if not values:
            return False
        
        # 遍历所有待应用的更新
        for new_val in values:
            # 支持特殊指令:Overwrite
            if isinstance(new_val, Overwrite):
                self.value = new_val.value
            else:
                # 应用 Reducer: old + new -> new_old
                # 例如: list + list -> extended list
                self.value = self.operator(self.value, new_val)
        return True
        
# 3. Topic(PubSub channel类型)
class Topic(BaseChannel[Sequence[Value], Value | list[Value], list[Value]]):
    """
    一个可配置的 PubSub Topic 通道。
    """
    def __init__(self, typ: type[Value], accumulate: bool = False):
        self.typ = typ
        self.accumulate = accumulate
        self.unique = False # 注意:标准实现中通常没有显式的 unique 参数,需通过逻辑推导
        #... 初始化内部存储结构
    def update(self, writes: Sequence[Value]) -> bool:
	    # 如果不累积,直接用新值覆盖旧值
	    if not self.accumulate:
	        self.values = list(writes) # 替换旧状态

	    # 如果累积,将新值追加到旧状态
	    if self.accumulate:
	        self.values.extend(writes)
	    return True

Pregel Loop && Superstep

我们可以通过一个简化的伪代码模型来理解 LangGraph 的 _step(单步执行)逻辑。这部分逻辑主要位于 langgraph/pregel/__init__.py 与 langgraph/pregel/loop.py 中(GitHub源码地址)。

# 简化的 LangGraph 运行时逻辑模型 (伪代码)
# 具体实现在 class Pregel(PregelProtocol) --> stream()/astream()

class PregelLoop:
    async def execute(self, initial_state):
        # 1. 初始化通道 (Channels)
        # 将输入状态写入对应的通道 (如 'messages', 'count' 等)
        self.channels = self.initialize_channels(initial_state)
        
        # 2. 超步循环 (Super-step Loop)
        # 只要有待处理的任务,就继续循环
        while not self.is_terminated():
            
            # --- 阶段 A: 计划 (Plan) / 触发器逻辑 ---
            # 检查哪些节点订阅的通道在上一轮发生了更新
            tasks = []
            for node_name, node in self.nodes.items():
                # Trigger: 如果节点的输入通道有新数据,则激活该节点
                if self.check_trigger(node, self.channels):
                    # 准备任务:读取当前状态快照 (不可变)
                    input_snapshot = self.read_channels(node.inputs)
                    tasks.append((node_name, node.func, input_snapshot))
            
            if not tasks:
                break # 没有任务,图执行结束
            
            # --- 阶段 B: 执行 (Execute) / 并行计算 ---
            # 并行运行所有激活节点的函数
            # 注意:节点内部无法看到其他节点本轮产生的更新
            results = await parallel_execute(tasks)
            
            # --- 阶段 C: 更新 (Update) / 栅栏同步 ---
            # 这是 BSP 的核心:统一应用所有更新
            checkpoint_writes = []
            for node_name, result in results:
                # 解析节点返回值,确定要更新哪些通道
                writes = self.parse_writes(node_name, result)
                
                # 将更新应用到通道 (应用 Reducer)
                for channel_name, value in writes:
                    # 例如: channels['messages'].update(new_msg)
                    # 如果是 add_messages reducer,这里会执行 list append
                    self.channels[channel_name].update(value)
                    
            # --- 阶段 D: 持久化 (Checkpoint) ---
            # 保存当前所有通道的状态到数据库 (支持时间旅行)
            self.checkpointer.put(self.channels.snapshot())
            
            # 增加步数,准备下一轮
            self.step += 1

在每一轮 Superstep 开始时,运行时需要决定哪些 Node 应该被激活。

  • 源码逻辑:系统检查所有 Channel 的 version(版本号)。
  • 每个 Node 都有一个订阅列表(Input Channels)。
  • 逻辑判断:if max(channel.version for channel in node.inputs) > node.last_seen_version:
  • 如果条件满足,说明该 Node 的输入数据发生了变化,Node 被加入 tasks 队列。
  • 特殊处理:对于 START 节点或被 Send API 动态调用的节点,它们会被无条件或基于特定规则加入队列。

Checkpointer

中断机制的基石是状态的持久化。没有检查点,图就是无状态的,中断后无法恢复。

检查点的数据结构

一个检查点不仅仅是用户定义的 State 字典。它包含:

  • Config (Thread ID): 类似于会话 ID。
  • Channel Values: 所有通道的当前值。
  • Version Information: 逻辑时钟,用于冲突检测。
  • Pending Sends: 尚未处理的消息。
  • Next Tasks: 下一步计划执行的任务列表(如果是中断状态)。
# checkpoint 数据结构
# channel_versions 和 versions_seen 是增量计算的核心,
# LangGraph 依靠比对这两个字典来决定下一轮激活哪些节点,而不是全量扫描。
checkpoint = {
    "v": 1, # 协议版本
    "id": "uuid-...", # Checkpoint ID
    "ts": "2023-10-27...", # 时间戳
    "channel_values": { # 用户状态
        "messages": [...], 
        "count": 5
    },
    "channel_versions": { # 逻辑时钟 (关键!)
        "messages": 12,
        "count": 5
    },
    "versions_seen": { # 每个节点上次看到的版本
        "agent_node": {"messages": 11, "count": 5}
    }
}

存储后端与序列化

LangGraph 支持多种 Checkpointer 实现:

  • InMemorySaver: 仅用于测试,进程重启即丢失。
  • PostgresSaver / AsyncSqliteSaver: 生产环境标准。

默认情况下,检查点使用 pickle 进行序列化。这支持了复杂的 Python 对象(如自定义类),但也带来了安全性和兼容性问题。如果代码更新导致类定义改变,旧的检查点可能无法加载。生产环境建议尽量使用 JSON 兼容的基础数据类型,或实现自定义的序列化协议。

class BaseCheckpointSaver:
    def put(self, config, checkpoint, metadata, new_versions):
        # 1. 序列化
        serialized_checkpoint = self.serde.dumps(checkpoint)
        
        # 2. 写入数据库 (伪 SQL)
        # INSERT INTO checkpoints (thread_id, checkpoint_id, data) 
        # VALUES (?,?,?)
        # ON CONFLICT DO NOTHING;
        
        # 3. 更新最新指针
        # UPDATE threads SET latest_checkpoint_id =? WHERE thread_id =?

线程级隔离

Thread ID 是实现多用户并发的关键。每个 Thread ID 对应一条独立的状态演进链。中断和恢复必须严格匹配同一个 Thread ID。源代码中,checkpointer.get(config) 方法利用这个 ID 来检索最新的状态快照。

故障恢复与重放 (Replay)

当用户调用 graph.invoke(..., config={"thread_id": "1"}) 时:

  • checkpointer.get(config) 从数据库查出最新的 Checkpoint
  • PregelLoop 将 checkpoint["channel_values"] 恢复到内存中的 self.channels
  • PregelLoop 将 checkpoint["versions_seen"] 恢复到 Node 状态。
  • 无缝继续: 循环继续运行,仿佛从未停止过。因为所有上下文(包括逻辑时钟)都已完美复原。

Interrupt

# langgraph.types.interrupt 伪代码
def interrupt(value):
    # 1. 检查当前是否在 Pregel 循环中
    if not _is_in_pregel_loop():
        raise RuntimeError("interrupt can only be called inside a node")
    
    # 2. 抛出特殊异常,携带 Payload
    # 这会立即中止当前 Node 函数的执行
    raise GraphInterrupt(value)
    
    
# interrupt 运行时捕获
# PregelLoop.run_task 内部逻辑
try:
    result = node.func(input)
except GraphInterrupt as e:
    # 捕获中断请求
    # 1. 将任务标记为 "interrupted"
    # 2. 保存中断产生的值 (e.value) 到 Checkpoint
    self.save_checkpoint(...)
    # 3. 停止整个 Superstep,不进行 Update
    return CreateInterrupt(e.value)
    
    
# 恢复 (Resume) 与值注入
# interrupt 恢复有两种方式:
# 1. Command(resume="approved")
# 2. graph.update_state(thread_config, {"input": "hi"})
def interrupt(value):
    # 检查是否有 resume 值注入
    if _has_resume_value():
        # 直接返回注入的值,不再抛出异常!
        return _get_resume_value()

    #... (抛出异常逻辑)

LangGraph vs. 其他框架

智能体编排范式对比

| 维度 | LangGraph (BSP) | 原生 Asyncio (事件驱动) | 核心差异分析 |
| --- | --- | --- | --- |
| 执行流 | 分步式 (Step-wise): 严格的 Read -> Process -> Write -> Sync 循环 | 连续流 (Continuous): 回调链、Promise 链,任务一旦完成立即触发下一个 | BSP 提供了更清晰的逻辑结构,易于理解和预测;Asyncio 理论延迟更低,但逻辑难以追踪 |
| 状态一致性 | 强一致性: 归约器在栅栏处解决冲突,所有节点在下一轮看到的都是一致的合并状态 | 最终一致性: 容易出现竞态条件,需要复杂的锁机制 | BSP 避免了复杂的并发锁,降低了开发风险 |
| 调试体验 | 时光倒流: 支持从任意历史超步恢复及重放 | 日志追踪: 依赖散落在各处的日志,难以还原全局状态 | BSP 的“快照”特性是调试神技 |
| 死锁处理 | 显式检测: 框架可以检测到循环超步限制(Recursion Limit) | 隐式死锁: await 可能无限挂起,难以检测 | BSP 强制设置最大超步数,防止无限循环 |

框架对比

我对几个使用过的Agent框架进行对比:

  • CrewAI 采用了更高级的抽象,通常基于角色的顺序执行或简单的并行。它更像是一个基于“团队”隐喻的封装层。相比之下,LangGraph 更底层 。CrewAI 往往难以处理精细的状态回滚和复杂的条件跳转,而 LangGraph 的 BSP 模型允许开发者控制每一个超步的细节。
  • AutoGen 采用了基于“对话”的多 Agent 模式。Agent 之间的交互通过消息流驱动。虽然也具备并发能力,但其状态管理通常分散在各个 Agent 的内部历史中,缺乏一个全局的、版本控制的 State 对象。这使得在 AutoGen 中实现全局一致的“撤销”或“状态修改”比 LangGraph 困难。

高阶Agent设计模式

BSP 架构不仅仅是为了LangGraph解决基础问题,它还解锁了一系列高级设计模式,使得构建能够处理现实世界复杂度的 Agent 成为可能。

Map-Reduce 与动态任务分发(Send API)

在处理文档批量分析等任务时,我们通常不知道文档的确切数量。传统的静态图结构难以应对这种动态性。利用 BSP 的批处理特性,结合 Send API,优雅地实现了 Map-Reduce 模式。

场景: 用户上传了一个包含未知数量文件的文件夹,要求“总结每个文件,然后生成总报告”。

  1. Map 阶段(超步 1): Dispatcher 节点运行。它读取输入列表,针对列表中的每一项,生成一个 Send("process_file", {"file": item}) 对象。在 BSP 视角下,这相当于在当前超步结束时,动态向图中注入了 $N$ 个并发任务。
  2. Process 阶段(超步 2): 系统检测到 process_file 节点收到了 $N$ 个独立的消息包。于是,系统并行启动 $N$ 个 process_file 节点实例。由于 BSP 的隔离性,这 $N$ 个实例互不干扰。每个实例处理完后,返回 {"summaries": [summary]}
  3. Reduce 阶段(超步 3): Summarizer 节点订阅了 summaries 通道(配置为 append 归约器)。在超步 2 结束的栅栏处,所有 $N$ 个摘要被自动聚合成一个大列表。Summarizer 在超步 3 被触发一次,接收到完整的列表,生成总报告。
---
config:
  theme: 'base'
  themeVariables:
    primaryColor: '#BB2528'
    primaryTextColor: '#fff'
    primaryBorderColor: '#7C0000'
    lineColor: '#F8B229'
    secondaryColor: '#006100'
    tertiaryColor: '#fff'
    background: '#f4f4f4'
---
graph LR
    A[Planner Node] -->|Generate| B(Send Packet 1)
    A -->|Generate| C(Send Packet 2)
    A -->|Generate| D(Send Packet 3)
    B -.->|Dynamic Spawn| W1
    C -.->|Dynamic Spawn| W2
    D -.->|Dynamic Spawn| W3
    W1 -->|Write to| R
    W2 -->|Write to| R
    W3 -->|Write to| R
    R -->|Trigger| S

代码示例:

import operator
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import Send

# 1. 定义状态
class OverallState(TypedDict):
    topic: str
    sections: list[str]                              # planner 生成的子任务列表
    sub_results: Annotated[list[str], operator.add]  # 聚合所有 Worker 的结果
    final_summary: str

class WorkerState(TypedDict):
    section: str

# 2. 定义节点
def planner(state: OverallState):
    # 动态生成 3 个子任务(示例值)
    return {"sections": ["background", "impact", "mitigation"]}

def fan_out(state: OverallState):
    # 返回 Send 对象列表。这不会立即运行,而是安排在下一超步并行运行。
    return [Send("worker_node", {"section": s}) for s in state["sections"]]

def worker(state: WorkerState):
    # 并行执行的逻辑
    return {"sub_results": [f"Finished section: {state['section']}"]}

def reducer(state: OverallState):
    # 当所有 Worker 完成后,本节点被触发
    # 由于 sub_results 是 operator.add,这里能看到完整的列表
    return {"final_summary": "\n".join(state["sub_results"])}

# 3. 构建图
graph = StateGraph(OverallState)
graph.add_node("planner", planner)
graph.add_node("worker_node", worker)
graph.add_node("reducer", reducer)

# 动态扇出:条件边的路由函数返回 Send 列表
graph.add_conditional_edges("planner", fan_out, ["worker_node"])
graph.add_edge("worker_node", "reducer") # Fan-in: 所有 worker 写完后触发 reducer
graph.add_edge("reducer", END)
graph.set_entry_point("planner")

app = graph.compile()

这种模式的精妙之处在于隐式同步。开发者不需要编写任何“等待所有 Worker 完成”的代码(如 Promise.all),BSP 的栅栏机制保证了 Summarizer 只有在所有并行的 Map 任务都完成后(即所有消息都已处理并归约)才会被调度。

子图(Subgraphs)

分形的 BSP随着系统复杂度增加,单层图变得难以管理。LangGraph 支持将一个图封装为另一个图的节点,称为子图。在 BSP 模型下,子图的执行表现为嵌套的超步循环。

  • 当父图执行到子图节点时,父图的超步时钟“挂起”。
  • 子图启动自己的 PregelLoop,拥有独立的 State、独立的超步计数器。
  • 子图在内部运行多个超步,直到完成。
  • 子图的最终结果作为父图当前超步的输出,父图恢复,进入下一个超步。
---
config:
  theme: 'base'
  themeVariables:
    primaryColor: '#BB2528'
    primaryTextColor: '#fff'
    primaryBorderColor: '#7C0000'
    lineColor: '#F8B229'
    secondaryColor: '#006100'
    tertiaryColor: '#fff'
    background: '#f4f4f4'
---
graph LR
    subgraph Parent Graph
        Start --> Router
        Router -->|Complexity High| SubGraphNode
        Router -->|Complexity Low| SimpleNode
        SubGraphNode --> End
        SimpleNode --> End
    end
    subgraph SubGraphNode [Child Graph execution]
        direction LR
        S_Start((Start)) --> Agent1
        Agent1 --> Critiques
        Critiques -->|Reject| Agent1
        Critiques -->|Approve| S_End((End))
    end

代码示例:

# 定义子图 (Child Graph)
child_builder = StateGraph(MessagesState)
child_builder.add_node("child_agent", call_model)
child_builder.add_edge(START, "child_agent")
child_builder.add_edge("child_agent", END)
child_graph = child_builder.compile()

# 定义父图 (Parent Graph)
parent_builder = StateGraph(ParentState)
parent_builder.add_node("router", router_node)

#!!! 关键点:将编译后的子图作为节点加入父图!!!
# 在 BSP 运行时看来,这只是一个耗时较长的普通节点
parent_builder.add_node("nested_workflow", child_graph) 

parent_builder.add_edge(START, "router")
parent_builder.add_conditional_edges(
    "router", 
    route_logic, 
    {"complex": "nested_workflow", "simple": "simple_node"}
)

这种设计保证了模块化隔离。子图内部的中间状态(Intermediate State)不会污染父图的全局状态,除非显式地作为结果返回。这使得团队可以并行开发不同的 Agent 模块,最后像搭积木一样组装起来,而不用担心状态命名冲突或版本混乱。

中断与人工介入(Human-in-the-Loop)

这是 BSP 模型相对于连续流模型最“杀手级”的应用场景。在涉及敏感操作(如转账、发送合同)时,Agent 必须暂停并寻求人类确认。在异步函数执行过程中(例如 await llm.invoke(...) 正在等待网络响应时),要“暂停”程序并把状态序列化到磁盘是非常困难的。程序的状态分散在堆栈帧、闭包变量和 Event Loop 的句柄中。

而在 BSP 模型中,每个超步之间的栅栏是天然的、完美的暂停点。

  • LangGraph 允许在节点定义中指定 interrupt_before=["approve_node"]
  • 当图执行即将进入 approve_node 超步时,运行时检测到中断信号。
  • 系统在栅栏处“冻结”:保存当前所有通道的状态,停止调度,释放内存和计算资源。
  • 人类介入: 管理员通过 API 查询当前状态,发现 Agent 拟定的合同金额有误。管理员发送一个 update_state 请求,修改了内存中的金额字段。
  • 恢复(Resume): 管理员发送“继续”指令。系统加载被修改后的状态,就像什么都没发生一样,进入 approve_node 超步。这种 “冻结-修改-继续” 的能力,完全依赖于 BSP 模型将状态(Data)与执行(Control)解耦的特性。

HITL Agent举例:

# Demo 说明:
# Agent负责处理敏感的转账请求:
# - 输入分析: 提取金额和收款人。
# - 风险评估: 如果金额 > 1000,需要人工审批。
# - 执行转账: 调用银行 API。

# Demo 代码示意实现:
## 定义状态
class State(TypedDict):
    amount: int
    recipient: str
    status: str

## 节点 1: 风险检查
def risk_check(state: State):
    if state["amount"] > 1000:
        # 触发中断
        decision = interrupt(f"Approve transfer of {state['amount']}?")
        if decision != "approve":
            return {"status": "rejected"}
    return {"status": "approved"}

## 节点 2: 执行
def execute_transfer(state: State):
    if state["status"] == "approved":
        print(f"Transferring to {state['recipient']}")
    return {}

## 构建图
workflow = StateGraph(State)
workflow.add_node("risk_check", risk_check)
workflow.add_node("execute_transfer", execute_transfer)
workflow.add_edge(START, "risk_check")
workflow.add_edge("risk_check", "execute_transfer")
workflow.add_edge("execute_transfer", END)

app = workflow.compile(checkpointer=MemorySaver())
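
一个简化的运行与恢复(resume)示意(thread_id 与金额为演示用的假设值),利用 Command(resume=...) 把审批结果注入被中断的 interrupt() 调用:

from langgraph.types import Command

config = {"configurable": {"thread_id": "transfer-001"}}

# 第一次执行:金额 > 1000,risk_check 触发 interrupt,图在栅栏处暂停并持久化状态
app.invoke({"amount": 5000, "recipient": "Alice", "status": ""}, config)
print(app.get_state(config).next)    # 仍停留在 ('risk_check',)

# 人工审批后恢复:interrupt() 此时直接返回注入的 resume 值,不再暂停
final = app.invoke(Command(resume="approve"), config)
print(final["status"])               # "approved"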

一些Agent设计模式

Agent执行加速

class CVSSVectorAgent:
    """CVSS Vector Agent"""

    def __init__(self):
        self.data_agent = CVEDataAgent()
        self.asd_agent = MitreASDAgent()
        self.llm = ChatTongyi(name="cvss-vector-agent-llm", model="qwen3-max")
        self.prompt_manager = CVSSVectorPrompts()
        self.memory = MemorySaver()
        self.agent = self._build_graph()
        self.logger = get_logger()

    def _build_graph(self):
	    ...
	    # 添加边
        # get_cve_data, get_cvss_data, generate_asd_data 是并行节点用于加速agent执行
        builder.add_edge(START, "get_cve_data")
        builder.add_edge(START, "get_cvss_data")

Agent Context隔离

# 添加节点
# 通过子图的方式添加数据sub-agent:CVEDataAgent
# CVSSVectorAgent与CVEDataAgent执行的context相互隔离,通过input/output数据耦合
builder.add_node("get_cve_data", self._get_cve_data)
builder.add_node("get_cvss_data", self._get_cvss_data)
# 风险建模Sub-Agent:MitreASDAgent,采用同样的思路进行context隔离
# ASD Agent大约占用8000token的context,非常容易触发token limit
builder.add_node("generate_asd_data", self._generate_asd_data)
builder.add_node("generate_cvss_vector", self._generate_cvss_vector)
builder.add_node("generate_cvss_severity", self._generate_cvss_severity)

Agent并发管理

	# 空操作节点,只是用来同步所有路径
    def sync_barrier(state: CVSSVectorState):
        """同步屏障节点 - 等待所有前驱完成后再继续"""
        return {}  # 不做任何操作,只是等待
    
    builder.add_node("sync_barrier", sync_barrier)
    
    # 并行启动
    builder.add_edge(START, "get_cve_data")
    builder.add_edge(START, "get_cvss_data")
    
    # get_cve_data 的条件边 - 始终经过中间节点
    builder.add_conditional_edges(
        "get_cve_data",
        self._is_cve_data_empty,
        {
            "generate_asd_data": "generate_asd_data",
            "skip_asd": "sync_barrier",  # 条件不满足时也走 barrier
        }
    )
    
    # get_cvss_data 的条件边 - 始终经过中间节点
    builder.add_conditional_edges(
        "get_cvss_data",
        self._cvss_data_check,
        {
            "get_cvss_statement_data": "get_cvss_statement_data",
            "skip_statement": "sync_barrier",  # 条件不满足时也走 barrier
        }
    )
    
    # 中间节点都指向 sync_barrier
    builder.add_edge("generate_asd_data", "sync_barrier")
    builder.add_edge("get_cvss_statement_data", "sync_barrier")
    
    # sync_barrier 之后才是 normalize_cvss_data
    builder.add_edge("sync_barrier", "normalize_cvss_data")

引子问题定位分析(vibe coding版)

引子中提到的这个问题其实是一个Agent编排设计的时候并发问题,大致的问题产生过程如下所示:


有了这个问题之后就很好解决了,我直接指导Cursor来完成问题分析与修复的,与Cursor的交互记录参考下面的附件文件:

References

  1. Pregel: a system for large-scale graph processing
  2. Graph API overview - Docs by LangChain
  3. Pregel | LangGraph.js API Reference - GitHub Pages
  4. LangGraph runtime - Docs by LangChain
  5. Building AI Agents Using LangGraph: Part 8 — Understanding Reducers and State Updates | by HARSHA J S
  6. LangGraph overview - Docs by LangChain
  7. Use the graph API - Docs by LangChain
  8. Application structure - Docs by LangChain
  9. CompiledStateGraph | LangGraph.js API Reference - GitHub Pages
  10. StateGraph | LangGraph.js API Reference - GitHub Pages
  11. LangGraph 101: Let's Build A Deep Research Agent | Towards Data Science
  12. Building Event-Driven Multi-Agent Workflows with Triggers in LangGraph - Medium
  13. Channels | LangChain Reference
  14. if there are two nodes(one node has a prenode) go to same one 4th node , then that 4th node will run twice · Issue #5979 · langchain-ai/langgraph - GitHub
  15. Duplicate node execution when using conditionals - LangGraph - LangChain Forum
  16. Graph execution goes back to a previous node - LangGraph - LangChain Forum
  17. Graph execution goes back to a previous node - #3 by ignacio - LangChain Forum
  18. The Evolution of Graph Processing: From Pregel to LangGraph | by ...
  19. LangGraph: Multi-Agent Workflows - LangChain Blog
  20. How Build.inc used LangGraph to launch a Multi-Agent Architecture for automating critical CRE workflows for Data Center Development. - LangChain Blog
  21. Building LangGraph: Designing an Agent Runtime from first principles - LangChain Blog
  22. Pregel | LangChain Reference - LangChain Docs
  23. LangGraph Execution Semantics. | by Christoph Bussler - Medium
  24. 基于LangGraph开发复杂智能体一则 - 博客园
  25. Sink node issue, if multiple subgraphs are used in parallel · Issue #1964 · langchain-ai/langgraph - GitHub
  26. Mastering LangGraph State Management in 2025 - Sparkco
  27. LangGraph Multi-Agent Orchestration: Complete Framework Guide + Architecture Analysis 2025 - Latenode
  28. Functional API overview - Docs by LangChain
  29. My experience using Langgraph for deterministic workflow : r/LangChain - Reddit
  30. Building Smarter Agents with LangGraph: Tools, Memory & Workflows - GoPenAI
  31. Comparing AI agent frameworks: CrewAI, LangGraph, and BeeAI - IBM Developer
  32. LangGraph vs CrewAI: Let's Learn About the Differences - ZenML Blog
  33. Leveraging LangGraph's Send API for Dynamic and Parallel Workflow Execution
  34. LangGraph's Execution Model is Trickier Than You Might Think - Atomic Spin
  35. How does state work in LangGraph subgraphs? - LangChain Forum

The Physics of Inference – A Deep Dive into KV and Prompt Caching

2025-12-14 21:46:59

Introduction

Viewed through an engineering lens, as the "Scaling Laws" face increasing scrutiny, I find myself agreeing with the growing consensus: Large Language Models (LLMs) are entering a "middle age" of calculated efficiency—a time for harvesting fruits rather than just planting forests.

In his Thanksgiving letter, Andrew Ng noted that while there may be bubbles in AI, they are certainly not in the application layer:

  • AI Application Layer: Underinvested. The potential here far exceeds common perception.
  • AI Inference Infrastructure: Still requires significant investment.
  • AI Training Infrastructure: I remain cautiously optimistic, though this is where a bubble might exist.

Context

As Generative AI transitions from experimental labs to large-scale commercial deployment, inference efficiency has become the critical variable determining economic viability. In the current landscape dominated by the Transformer architecture, the marginal cost of inference is constrained not by pure compute (FLOPs), but by the "Memory Wall."

As context windows expand from the early 4k tokens to 128k, 1M, and even 10M, managing the Key-Value (KV) Cache has emerged as the primary bottleneck for system throughput and latency.

This analysis spans from underlying physical principles to high-level application strategies. We begin by dissecting the mathematics of the KV Cache during decoding and its consumption of memory bandwidth. We then trace the architectural evolution from Multi-Head Attention (MHA) to Grouped-Query Attention (GQA), and finally to the Multi-Head Latent Attention (MLA) pioneered by DeepSeek. MLA, in particular, achieves extreme compression through the decoupling of low-rank matrix decomposition and Rotary Positional Embeddings (RoPE), laying the physical foundation for "disk-level caching."

On the system software front, we examine how vLLM’s PagedAttention borrows paging concepts from operating systems to solve fragmentation, and how SGLang’s RadixAttention utilizes Radix Trees for dynamic KV reuse. We also touch upon StreamingLLM, which exploits the "Attention Sink" phenomenon to bypass window limits for infinite streaming.

Finally, we survey the market implementation of Prompt Caching (Google, Anthropic, OpenAI, DeepSeek, Alibaba), contrasting the "High-Performance Memory" route against the "Architecture-Driven Low-Cost" route.

1. The Physical Bottleneck: Seeing Through the KV Cache

Before discussing optimization, we must understand—from first principles—why the KV Cache is the Achilles' heel of large model inference. It is not merely a question of capacity, but a conflict between Memory Bandwidth and Arithmetic Intensity.

1.1 The Autoregressive Nature of Transformer Decoding

Inference in Transformers occurs in two distinct phases:

  1. Prefill Phase: The model processes all input tokens in parallel. Because this is highly parallelizable, it is usually Compute-bound. GPU utilization is high.
  2. Decoding Phase: The model generates subsequent tokens one by one. This is an Autoregressive process; generating the $t$-th token depends on the internal state of the previous $t-1$ tokens.

In standard Self-Attention, the calculation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here, $Q$ (Query) is the vector for the current step, while $K$ (Key) and $V$ (Value) hold information from all history tokens. To avoid recalculating the $K$ and $V$ projections for the entire history at every new step, the system stores these vectors in VRAM. This is the KV Cache.

1.2 The Math of VRAM Consumption

KV Cache size is a linear function of sequence length, multiplying with layers, heads, and dimensions. For a standard Transformer, it can be calculated as:

$$\text{Size}_{KV} = 2 \times L_{seq} \times B_{batch} \times N_{layers} \times H_{heads} \times D_{head} \times P_{prec}$$

Where:

  • $2$: Represents the two matrices, Key and Value.
  • $L_{seq}$: Current sequence length (context window).
  • $B_{batch}$: Batch size of concurrent requests.
  • $N_{layers}$: Number of layers.
  • $H_{heads}$: Number of attention heads.
  • $D_{head}$: Dimension per head.
  • $P_{prec}$: Precision (2 bytes for FP16).

Case Study: Llama-2 70B
Assuming FP16 precision, a sequence length of 4096, and a Batch Size of 1:

  • $N_{layers} = 80$
  • $H_{heads} = 64$
  • $D_{head} = 128$

The KV Cache for a single request is:
$$2 \times 4096 \times 1 \times 80 \times 64 \times 128 \times 2 \approx 10.7 \text{ GB}$$

If we extend the context to 100k tokens, this swells to 260 GB. This far exceeds the capacity of a single NVIDIA A100 (80GB) or H100. Consequently, memory capacity limits Batch Size, preventing the GPU cores from being fully utilized, driving up unit costs.
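
As a sanity check, the arithmetic is easy to reproduce; the helper below is a few lines written here for illustration, with the Llama-2 70B shapes from the case study, and the 8-KV-head variant reflecting the GQA configuration discussed in Section 2.3.

def kv_cache_bytes(seq_len, batch, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    # The leading 2 accounts for the separate Key and Value tensors
    return 2 * seq_len * batch * n_layers * n_kv_heads * d_head * bytes_per_elem

GB = 1e9

# MHA-style accounting used in the case study (64 KV heads)
print(kv_cache_bytes(4096, 1, 80, 64, 128) / GB)      # ~10.7 GB
print(kv_cache_bytes(100_000, 1, 80, 64, 128) / GB)   # ~262 GB

# Llama-2 70B actually ships with GQA (8 KV heads), an 8x reduction
print(kv_cache_bytes(4096, 1, 80, 8, 128) / GB)       # ~1.3 GB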

1.3 The Memory Wall

Beyond capacity, bandwidth is the silent killer. During decoding, for every token generated, the GPU must move the entire KV Cache from High Bandwidth Memory (HBM) to the on-chip SRAM for calculation.

  • Compute (FLOPs): Grows linearly.
  • Data Transfer (Bytes): Also grows linearly.

However, because the matrix multiplication degenerates into a vector operation (Query vector), the Arithmetic Intensity (FLOPs/Bytes ratio) is extremely low. Even with an H100's massive bandwidth (~3.35 TB/s), the GPU spends most of its time waiting for data. This is the definition of a Memory-bound scenario.
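
A back-of-envelope roofline estimate makes "waiting for data" concrete. This is a sketch with assumed numbers: 70B FP16 weights of roughly 140 GB plus the 10.7 GB cache from above, streamed at the ~3.35 TB/s HBM bandwidth quoted, at batch size 1 and ignoring multi-GPU parallelism.

# Per decoded token, the GPU must stream the model weights and the full KV cache from HBM.
weights_bytes = 70e9 * 2       # 70B parameters in FP16
kv_bytes = 10.7e9              # KV cache from the 4k-context case study above
hbm_bandwidth = 3.35e12        # quoted H100 HBM bandwidth, bytes/s

min_seconds_per_token = (weights_bytes + kv_bytes) / hbm_bandwidth
print(min_seconds_per_token * 1000)   # ~45 ms per token: a bandwidth floor, regardless of FLOPs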

2. Architectural Evolution: From MHA to MLA

To shrink the KV Cache, architects have performed surgery on the heart of the Transformer.

2.1 Multi-Head Attention (MHA): The Expensive Baseline

In the original Attention Is All You Need, the model has $H$ Query Heads and $H$ Key/Value Heads.

  • Mechanism: Each Query Head has a unique KV pair. Maximum expressiveness.
  • Cost: Size is proportional to $H$. In the long-context era, this became unsustainable.

2.2 Multi-Query Attention (MQA): Radical Compression

Proposed by Noam Shazeer (2019).

  • Mechanism: All Query Heads share one Key Head and one Value Head.
  • Compression: $H : 1$. (e.g., 64x reduction).
  • Trade-off: Radical memory savings, but the model loses the ability to "attend" to different nuances simultaneously, often degrading perplexity. Used in PaLM and Falcon.

2.3 Grouped-Query Attention (GQA): The Golden Mean

Introduced with Llama-2, GQA became the standard for open-source models (Llama-3, Mistral, Qwen).

  • Mechanism: Query Heads are divided into $G$ groups. Each group shares a KV Head.
  • Example: Llama-2 70B uses 8 KV Heads for 64 Query Heads (8:1 compression).
  • Result: It sits on the Pareto Frontier—delivering performance near MHA with efficiency near MQA.
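Since the only knob these variants turn is the number of KV heads actually stored, the savings reduce to a simple ratio. A quick sketch with Llama-2-70B-like dimensions (assumed values):

# KV Cache bytes per token per layer = 2 (K,V) * n_kv_heads * d_head * bytes_per_elem
def per_token_layer_bytes(n_kv_heads, d_head=128, bytes_per_elem=2):
    return 2 * n_kv_heads * d_head * bytes_per_elem

mha = per_token_layer_bytes(64)  # 64 KV heads -> 32768 B
mqa = per_token_layer_bytes(1)   #  1 KV head  ->   512 B (64x smaller)
gqa = per_token_layer_bytes(8)   #  8 KV heads ->  4096 B ( 8x smaller)
print(mha // mqa, mha // gqa)    # 64 8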

2.4 Multi-Head Latent Attention (MLA): DeepSeek's Revolution

DeepSeek-V2 (and V3) introduced MLA, which is not just a grouping strategy, but a fundamental reconstruction of storage.

2.4.1 Low-Rank Compression

Instead of storing the full $d_{model} \times L$ matrices, MLA assumes redundancy. It projects the input into a low-dimensional "Latent Vector" ($c_{KV}$) and stores only this compressed version. During computation, it projects this vector back up to the full dimension.
This reduces memory footprint from $O(H \times d_{head})$ to $O(d_{latent})$.

2.4.2 Decoupled RoPE

The challenge with compression is Rotary Positional Embeddings (RoPE). RoPE is geometrically sensitive; applying it to a compressed vector destroys position information.

DeepSeek's solution: Decoupling.

  1. Content Head: Captures semantics, uses low-rank compression (No RoPE).
  2. Position Head: A separate, tiny vector specifically carrying RoPE info.
  3. Concatenation: They are joined only during the attention score calculation.

This allows the KV Cache to be 1/5th the size of GQA models. Crucially, it makes moving the cache to SSD/RAM feasible because the bandwidth requirement drops drastically.
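A shape-level sketch of the idea is below; the dimensions are illustrative stand-ins rather than DeepSeek's published configuration, the RoPE rotation itself is stubbed out, and this is emphatically not the real implementation:

import numpy as np

d_model, d_latent, d_rope = 7168, 512, 64      # illustrative sizes only
n_heads, d_head = 128, 128

x = np.random.randn(d_model)                   # hidden state of one token
W_down = np.random.randn(d_latent, d_model)    # down-projection (content path)
W_rope = np.random.randn(d_rope, d_model)      # tiny projection for the position part
W_up_k = np.random.randn(n_heads * d_head, d_latent)  # up-projection at compute time
W_up_v = np.random.randn(n_heads * d_head, d_latent)

c_kv   = W_down @ x      # what actually goes into the KV Cache (d_latent values)
k_rope = W_rope @ x      # RoPE would be applied to this small vector only (stubbed here)
# At attention time, content keys/values are reconstructed and the position part is
# concatenated onto each head's key before the score is computed.
k_content = (W_up_k @ c_kv).reshape(n_heads, d_head)
v         = (W_up_v @ c_kv).reshape(n_heads, d_head)
print(c_kv.shape, k_rope.shape)  # cache cost per token ≈ 512 + 64 values vs 2*128*128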

| Feature | MHA (Llama-1) | MQA (Falcon) | GQA (Llama-3) | MLA (DeepSeek-V3) |
| --- | --- | --- | --- | --- |
| KV Heads | = Query Heads ($H$) | 1 | Groups ($G$) | Virtual/Dynamic |
| VRAM Usage | High (100%) | Very Low (~1-2%) | Medium (~12-25%) | Extreme compression (~5-10%) |
| Performance | Baseline | Lossy | Near Lossless | Lossless/Better |
| RoPE | Native | Native | Native | Decoupled |

3. System-Level Management: OS Concepts Reborn

If architecture defines the "theoretical minimum," system software determines how we place that data on hardware.

3.1 PagedAttention (vLLM)

Before vLLM, memory was allocated statically based on "Max Sequence Length," leading to fragmentation and 60-80% waste.

3.1.1 The Principle

Inspired by Virtual Memory paging:

  1. KV Block: Data is sliced into fixed blocks (e.g., 16 tokens).
  2. Non-contiguous: Blocks can live anywhere in physical memory.
  3. Block Table: Maps logical flow to physical blocks.

Impact:

  • Zero Waste: Internal fragmentation is limited to the last block.
  • Memory Sharing: Multiple requests sharing a System Prompt ("You are a helpful assistant...") point to the same physical blocks. Copy-on-Write is triggered only when they diverge. This is the foundation of Prompt Caching.
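A toy block table captures the bookkeeping; vLLM's real allocator adds reference counting, copy-on-write, and GPU kernels on top of this idea (everything below is an illustrative sketch):

BLOCK_SIZE = 16  # tokens per KV block

class ToyBlockTable:
    """Maps each request's logical token stream onto scattered physical blocks."""
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.tables = {}   # request_id -> list of physical block ids
        self.lengths = {}  # request_id -> number of tokens written

    def append_token(self, request_id):
        n = self.lengths.get(request_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full (or this is the first token)
            self.tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.lengths[request_id] = n + 1

    def physical_location(self, request_id, token_idx):
        block = self.tables[request_id][token_idx // BLOCK_SIZE]
        return block, token_idx % BLOCK_SIZE  # (physical block id, offset inside block)

bt = ToyBlockTable(num_physical_blocks=1024)
for _ in range(40):
    bt.append_token("req-1")
print(bt.tables["req-1"], bt.physical_location("req-1", 37))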

3.2 RadixAttention (SGLang)

vLLM handled allocation; SGLang handles discovery.

3.2.1 Radix Tree Structure

SGLang views the KV Cache not as a linear array, but as a Radix Tree.

  • Nodes: KV Cache states.
  • Edges: Token sequences.

3.2.2 Automatic Reuse

Scenario: User asks A, gets B. User asks C. The system sees the path A->B and resumes calculation from there.

  • LRU Eviction: When memory fills, leaves are pruned first.
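A simplified sketch of the lookup, using a plain token trie in place of a true radix tree and a string handle in place of real KV blocks:

class TrieNode:
    def __init__(self):
        self.children = {}     # token id -> TrieNode
        self.kv_handle = None  # stands in for a reference to cached KV blocks

def insert(root, tokens, kv_handle):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())
    node.kv_handle = kv_handle

def longest_prefix(root, tokens):
    """Return (number of tokens whose KV can be reused, last cached handle seen)."""
    node, matched, handle = root, 0, None
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        matched += 1
        handle = node.kv_handle or handle
    return matched, handle

root = TrieNode()
insert(root, [1, 2, 3, 4], kv_handle="turn-1")              # history: question A + answer B
matched, handle = longest_prefix(root, [1, 2, 3, 4, 9, 9])  # new turn appends question C
print(matched, handle)  # 4 'turn-1' -> only the 2 new tokens need prefill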

3.3 StreamingLLM

For infinite streams (e.g., digital humans), simple sliding windows break the model.
MIT researchers discovered Attention Sinks: The first few tokens (usually 4) anchor the entire attention mechanism. StreamingLLM keeps these "sink tokens" permanently and slides the rest, allowing infinite length with stable perplexity.
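In cache terms the policy is simply "pin the first few positions, slide the rest". A minimal sketch (it ignores the positional re-indexing the paper also performs):

def visible_cache_positions(total_len, n_sink=4, window=2044):
    """Positions whose KV entries a StreamingLLM-style eviction policy would keep."""
    if total_len <= n_sink + window:
        return list(range(total_len))
    return list(range(n_sink)) + list(range(total_len - window, total_len))

pos = visible_cache_positions(1_000_000)
print(len(pos), pos[:6], pos[-2:])  # 2048 [0, 1, 2, 3, 997956, 997957] [999998, 999999]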

4. Extreme Compression: Quantization

  • FP8: Supported by H100, halves memory usage with negligible loss.
  • INT4: Difficult due to "Outliers" in the Key/Value matrices. Techniques like SmoothQuant and KIVI migrate outliers to weights or keep them in high precision to make INT4 viable.
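Most of these schemes share one building block: per-group scaling, so that a single outlier only damages its own group. A generic symmetric-quantization sketch in NumPy (not SmoothQuant or KIVI themselves):

import numpy as np

def quantize_groups(x, bits=4, group=64):
    """Symmetric per-group quantization of a 1-D tensor; returns codes and scales."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    x = x.reshape(-1, group)
    scales = np.abs(x).max(axis=1, keepdims=True) / qmax
    codes = np.clip(np.round(x / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales

def dequantize_groups(codes, scales):
    return (codes * scales).reshape(-1)

k = np.random.randn(4096).astype(np.float32)
k[100] = 40.0                                        # a typical "outlier" value
codes, scales = quantize_groups(k, bits=4, group=64)
err = np.abs(dequantize_groups(codes, scales) - k).mean()
print(err)  # the error concentrates in the outlier's group instead of polluting all of k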

5. Market Landscape: The Battle of Caching

2025 marks the era of "Context Caching" as a standard product.

5.1 DeepSeek: The Price Butcher

Leveraging MLA, DeepSeek moves cache to Disk (SSD).

  • Price: $0.014 / million tokens (Hit). This is ~0.5% of OpenAI's price.
  • Storage: Free.
  • TTL: Hours to days. Ideal for long-tail knowledge retrieval.

5.2 Google Gemini: TPU Scale

  • Implicit: Automatic for Flash models.
  • Explicit: For Pro models. A "Lease" model—you pay a storage fee per hour. Only economical for high-frequency queries.

5.3 Anthropic Claude: High-Speed RAM Lease

Targeted at coding and high-interaction tasks.

  • TTL: 5 minutes.
  • Mechanism: Explicit breakpoints.
  • Economics: You pay a premium (1.25x) to write to cache. You must reuse it within 5 minutes to break even.
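The break-even condition is easy to write down. A small calculator, assuming a base input price of $3 per million tokens (Sonnet-class pricing) and the multipliers above:

def anthropic_cache_cost(prompt_tokens, n_calls, base_per_m=3.00,
                         write_mult=1.25, read_mult=0.10):
    m = prompt_tokens / 1e6
    cached   = base_per_m * m * (write_mult + read_mult * (n_calls - 1))
    uncached = base_per_m * m * n_calls
    return cached, uncached

for n in (1, 2, 5):
    print(n, anthropic_cache_cost(50_000, n))
# n=1: caching costs 1.25x with no payback; from the 2nd call onward it already wins,
# provided every reuse lands inside the 5-minute TTL.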

5.4 OpenAI & Alibaba

  • OpenAI: Conservative. 50% discount on hits. No write premium.
  • Alibaba (Qwen): Mixed mode. Strong support for long contexts (10M tokens).

| Vendor | Mechanism | Storage Medium | TTL | Write Cost | Read Cost | Storage Fee |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek | Implicit | SSD/Disk | Long | 1.0x | ~0.05x | Free |
| Anthropic | Explicit | HBM | 5 min | 1.25x | 0.10x | Included |
| Google | Hybrid | TPU HBM | 1 hour+ | 1.0x | ~0.25x | Hourly |
| OpenAI | Implicit | HBM | Dynamic | 1.0x | 0.50x | Free |

6. Semantic Caching

Complementary to Prompt Caching (Server-side), Semantic Caching (Client-side) uses Embeddings (Vector DBs like Milvus) to match intent.

  • If a user asks "Price of apple?" and later "How much is an apple?", Semantic Cache returns the saved answer without hitting the LLM.
  • Tools: GPTCache.
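The mechanism reduces to "embed the query, return a stored answer if a past query is similar enough". A minimal sketch with a stand-in embed() function; in practice GPTCache plus a vector database replaces both pieces:

import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g., an OpenAI or BGE embedding call)."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold, self.entries = threshold, []   # [(embedding, answer)]

    def lookup(self, query):
        q = embed(query)
        for e, answer in self.entries:
            if float(q @ e) >= self.threshold:         # cosine similarity (unit vectors)
                return answer
        return None

    def insert(self, query, answer):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.insert("Price of apple?", "About $1 each.")
print(cache.lookup("Price of apple?"))  # exact repeat hits; with a real embedding model,
# paraphrases like "How much is an apple?" would also land above the threshold.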

7. Case Study: Prompt Cache in Agent Dev

Code Example

# Enabling Prompt Caching in Qwen (via LangChain message content blocks)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage

# app_prompt_template / app_user_prompt_template / vars / input_data are assumed
# to be defined elsewhere by the application.
prompt = ChatPromptTemplate.from_messages([
  SystemMessage(
    content=[
      {
        "type": "text",
        "text": app_prompt_template.format(vars),
        "cache_control": {"type": "ephemeral"},  # the explicit cache flag
      }
    ]
  ),
  HumanMessage(content=app_user_prompt_template.format(input_data)),
])

# Enabling Prompt Caching in OpenAI
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    # OpenAI can retain the prompt cache for 24h via this parameter (model-dependent)
    prompt_cache_retention="24h",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me a joke."},
    ],
)

Summary

The history of Large Model inference is, essentially, a history of struggling against memory bandwidth.

  1. Architecture & Hardware: DeepSeek's MLA proves that algorithmic innovation (low-rank compression) can unlock hardware potential (SSD storage), completely upending the business model.
  2. Stateful APIs: The HTTP stateless protocol is no longer sufficient. LLMs are becoming "Stateful Operating Systems," and developers must manage "Context Lifecycle" just as they manage database connections.
  3. The Cost Cliff: With prices hitting $0.014/M tokens, the bottleneck for RAG shifts from "how to retrieve less to save money" to "how much context can the model handle without hallucinating." Full Context is replacing sliced retrieval.
  4. For developers, the strategy is clear: Use Anthropic/vLLM for high-frequency, low-latency tasks (coding assistants), and leverage DeepSeek's disk caching for massive knowledge analysis where cost is the primary constraint.

References

  1. Optimizing Transformer Inference with Grouped Query Attention | Towards AI, accessed November 27, 2025, https://towardsai.net/p/machine-learning/optimizing-transformer-inference-with-grouped-query-attention
  2. arXiv:2305.13245v3 [cs.CL] 23 Dec 2023, accessed November 27, 2025, https://arxiv.org/pdf/2305.13245
  3. Attention Mechanisms in Transformers: Comparing MHA, MQA, and GQA | Yue Shui Blog, accessed November 27, 2025, https://syhya.github.io/posts/2025-01-16-group-query-attention/
  4. Understanding Multi-Head Latent Attention, accessed November 27, 2025, https://planetbanatt.net/articles/mla.html
  5. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, accessed November 27, 2025, https://arxiv.org/html/2405.04434v2
  6. DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, accessed November 27, 2025, https://api-docs.deepseek.com/news/news0802
  7. MLA: Redefining KV-Cache Through Low-Rank Projections and On-Demand Decompression - Hugging Face, accessed November 27, 2025, https://huggingface.co/blog/NormalUhr/mla-explanation
  8. How PagedAttention resolves memory waste of LLM systems - Red Hat Developer, accessed November 27, 2025, https://developers.redhat.com/articles/2025/07/24/how-pagedattention-resolves-memory-waste-llm-systems
  9. Introduction to vLLM and PagedAttention | Runpod Blog, accessed November 27, 2025, https://www.runpod.io/blog/introduction-to-vllm-and-pagedattention
  10. vLLM and PagedAttention: A Comprehensive Overview | by Abonia Sojasingarayar | Medium, accessed November 27, 2025, https://medium.com/@abonia/vllm-and-pagedattention-a-comprehensive-overview-20046d8d0c61
  11. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention - arXiv, accessed November 27, 2025, https://arxiv.org/html/2405.04437v1
  12. When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse, accessed November 27, 2025, https://www.runpod.io/blog/sglang-vs-vllm-kv-cache
  13. SGLang: Efficient Execution of Structured Language Model Programs - arXiv, accessed November 27, 2025, https://arxiv.org/pdf/2312.07104
  14. Fast and Expressive LLM Inference with RadixAttention and SGLang | LMSYS Org, accessed November 27, 2025, https://lmsys.org/blog/2024-01-17-sglang/
  15. Arxiv Dives - Efficient Streaming Language Models with Attention Sinks - Oxen.ai, accessed November 27, 2025, https://ghost.oxen.ai/arxiv-dives-efficient-streaming-language-models-with-attention-sinks/
  16. Efficient Streaming Language Models with Attention Sinks - arXiv, accessed November 27, 2025, https://arxiv.org/html/2309.17453v4
  17. [2309.17453] Efficient Streaming Language Models with Attention Sinks - arXiv, accessed November 27, 2025, https://arxiv.org/abs/2309.17453
  18. Attention Sinks for LLM - Endless Generation - Analytics Vidhya, accessed November 27, 2025, https://www.analyticsvidhya.com/blog/2023/12/attention-sinks-for-llm/
  19. FP8 E5M2 KV Cache - vLLM, accessed November 27, 2025, https://docs.vllm.ai/en/v0.6.3.post1/quantization/fp8_e5m2_kvcache.html
  20. FP8 quantization with AMD Quark for vLLM — Tutorials for AI developers 8.0, accessed November 27, 2025, https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/gpu_dev_optimize/fp8_quantization_quark_vllm.html
  21. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, accessed November 27, 2025, https://www.stat.berkeley.edu/~mmahoney/pubs/neurips-2024-kvquant.pdf
  22. AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations - ACL Anthology, accessed November 27, 2025, https://aclanthology.org/2025.coling-main.158.pdf
  23. FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration, accessed November 27, 2025, https://arxiv.org/html/2505.20839v1
  24. NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics, accessed November 27, 2025, https://arxiv.org/html/2505.16210v1
  25. Gemini Developer API pricing, accessed November 27, 2025, https://ai.google.dev/gemini-api/docs/pricing
  26. Context caching overview | Generative AI on Vertex AI - Google Cloud Documentation, accessed November 27, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview
  27. Context Caching In Google Gemini: Better Than RAG For Memory - Empathy First Media, accessed November 27, 2025, https://empathyfirstmedia.com/context-caching-google-gemini/
  28. Prompt Caching Support in Spring AI with Anthropic Claude, accessed November 27, 2025, https://spring.io/blog/2025/10/27/spring-ai-anthropic-prompt-caching-blog
  29. Prompt caching - Claude Docs, accessed November 27, 2025, https://platform.claude.com/docs/en/build-with-claude/prompt-caching
  30. Prompt Caching is a Must! How I Went From Spending $720 to $72 Monthly on API Costs | by Du'An Lightfoot | Medium, accessed November 27, 2025, https://medium.com/@labeveryday/prompt-caching-is-a-must-how-i-went-from-spending-720-to-72-monthly-on-api-costs-3086f3635d63
  31. Prompt caching - OpenAI API, accessed November 27, 2025, https://platform.openai.com/docs/guides/prompt-caching
  32. Prompt Caching in the API - OpenAI, accessed November 27, 2025, https://openai.com/index/api-prompt-caching/
  33. How does Prompt Caching work? - OpenAI Developer Community, accessed November 27, 2025, https://community.openai.com/t/how-does-prompt-caching-work/992307
  34. Context Cache feature of Qwen models - Alibaba Cloud Model Studio, accessed November 27, 2025, https://www.alibabacloud.com/help/en/model-studio/context-cache
  35. Qwen context window: token limits, memory policy, and 2025 rules - Data Studios, accessed November 27, 2025, https://www.datastudios.org/post/qwen-context-window-token-limits-memory-policy-and-2025-rules
  36. Semantic Cache: How to Speed Up LLM and RAG Applications - Medium, accessed November 27, 2025, https://medium.com/@svosh2/semantic-cache-how-to-speed-up-llm-and-rag-applications-79e74ce34d1d
  37. Semantic Cache: Accelerating AI with Lightning-Fast Data Retrieval - Qdrant, accessed November 27, 2025, https://qdrant.tech/articles/semantic-cache-ai-data-retrieval/
  38. Semantic caching for faster, smarter LLM apps - Redis, accessed November 27, 2025, https://redis.io/blog/what-is-semantic-caching/
  39. zilliztech/GPTCache: Semantic cache for LLMs. Fully integrated with LangChain and llama_index. - GitHub, accessed November 27, 2025, https://github.com/zilliztech/GPTCache
  40. Reducing LLM Costs and Latency via Semantic Embedding Caching - arXiv, accessed November 27, 2025, https://arxiv.org/html/2411.05276v2
  41. If your app process many similar queries, use Semantic Caching to reduce your cost and latency : r/LangChain - Reddit, accessed November 27, 2025, https://www.reddit.com/r/LangChain/comments/1f4rlx0/if_your_app_process_many_similar_queries_use/
  42. 图解vLLM Automatic Prefix Cache(RadixAttention), https://zhuanlan.zhihu.com/p/693556044
  43. Gemini 3技术是跳蛙式超越 https://www.youtube.com/watch?v=EMQxQwoFSb4
  44. Andrew Ng, The Batch issue 329 | deeplearning.ai, https://www.deeplearning.ai/the-batch/issue-329/
  45. The Architecture Behind vLLM: How PagedAttention Improves Memory Utilization https://medium.com/@mandeep0405/the-architecture-behind-vllm-how-pagedattention-improves-memory-utilization-2f9b25272110
  46. Low Rank Decompositions of Matrices - YouTube https://www.youtube.com/watch?v=_FmolBCUo9M
  47. The Inner Workings of DeepSeek-V3 · Chris McCormick https://mccormickml.com/2025/02/12/the-inner-workings-of-deep-seek-v3/
  48. Decoupled RoPE with MHLA: Blending Rotary Positional Encoding and Latent Attention Like a Pro https://medium.com/@drpester001/decoupled-rope-with-mhla-blending-rotary-positional-encoding-and-latent-attention-like-a-pro-d13842134f4d
  49. Efficient Streaming Language Models with Attention Sinks https://github.com/mit-han-lab/streaming-llm

从KV Cache到Prompt Cache的应用

2025-11-30 20:40:41

引子

Screenshot from YouTube

从工程师的视角来观察,随着Scaling Law失效问题被更多的人提起,我越来越认同LLM正在逐渐进入「精打细算,收割果实的平庸时代」。Andrew Ng在他的感恩节给读者的来信中提到,AI可能存在泡沫但是一定不是在AI应用开发:

  • AI 应用层: 投资不足。其潜力远超大多数人的认知。
  • AI 推理基础设施: 仍需大量投资。
  • AI 模型训练基础设施: 我对这一领域仍持谨慎乐观态度,但可能存在泡沫。

1. 大模型推理的物理瓶颈:透视KV Cache

在探讨具体的优化技术之前,必须从第一性原理出发,理解为何KV Cache会成为大模型推理的阿喀琉斯之踵。这不仅是显存容量的问题,更是显存带宽(Memory Bandwidth)与计算强度(Arithmetic Intensity)之间矛盾的体现。

1.1 Transformer解码的自回归特性

Transformer模型的推理过程分为两个阶段:预填充(Prefill)和解码(Decoding)。

  1. 预填充阶段(Prefill):模型并行处理输入的所有token。由于可以并行计算,这一阶段主要受限于GPU的计算能力(Compute-bound)。此时,GPU的利用率通常很高。
  2. 解码阶段(Decoding):模型逐个生成后续token。这是一个自回归(Autoregressive)过程,即生成第 $t$ 个token需要依赖于前 $t-1$ 个token的内部状态。

在标准的注意力机制(Self-Attention)中,计算公式为:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
其中,$Q$(Query)是当前时间步的查询向量,而 $K$(Key)和 $V$(Value)则包含了所有历史token的信息。为了避免在生成每一个新token时都重新计算前面所有token的 $K$ 和 $V$ 投影,系统会将这些计算好的向量存储在显存中,这就是 KV Cache。

1.2 显存占用的数学推导

KV Cache的显存占用量是序列长度的线性函数,且随层数、头数和隐藏层维度倍增。对于一个标准的Transformer模型,其KV Cache的大小可以通过以下公式精确计算:

$$\text{Size}_{KV} = 2 \times L_{seq} \times B_{batch} \times N_{layers} \times H_{heads} \times D_{head} \times P_{prec}$$

其中:

  • $2$:代表Key和Value两个矩阵。
  • $L_{seq}$:当前的序列长度(上下文窗口大小)。
  • $B_{batch}$:并发请求的批处理大小(Batch Size)。
  • $N_{layers}$:Transformer的层数。
  • $H_{heads}$:注意力头的数量。
  • $D_{head}$:每个注意力头的维度(通常为 $D_{model} / H_{heads}$)。
  • $P_{prec}$:数据精度(FP16为2字节,FP32为4字节)。

案例分析:Llama-2 70B模型
假设我们使用FP16精度(2字节),序列长度为4096,Batch Size为1。

  • $N_{layers} = 80$
  • $H_{heads} = 64$ (GQA之前)
  • $D_{head} = 128$

单次请求的KV Cache大小为:

$$2 \times 4096 \times 1 \times 80 \times 64 \times 128 \times 2 \approx 10.7 \text{ GB}$$
如果我们将上下文扩展到100k tokens(如Claude或GPT-4 Turbo常见场景),单次请求的KV Cache将膨胀至 260 GB。这已经远远超过了单张NVIDIA A100 (80GB) 甚至 H100 (80GB) 的显存容量。这意味着,在长文本场景下,显存容量直接限制了Batch Size,导致GPU的算力无法被填满,推理成本急剧上升。

1.3 内存墙与带宽瓶颈

除了容量限制,更致命的是带宽限制。在解码阶段,每生成一个token,GPU都需要从高带宽内存(HBM)中读取整个KV Cache到片上SRAM进行计算。

  • 计算量(FLOPs):随着KV Cache的增长,计算量仅线性增长。
  • 数据传输量(Bytes):数据传输量也线性增长,但由于矩阵乘法中的一维退化为向量(Query vector),算术强度(Arithmetic Intensity,即FLOPs/Bytes比率)极低。

现代GPU(如H100)具有极高的算力(~1000 TFLOPS FP16)和极高的带宽(~3.35 TB/s)。然而,在解码阶段,由于每次矩阵向量乘法(GEMV)都需要搬运庞大的KV Cache,GPU大部分时间都在等待数据从HBM传输,导致计算单元闲置。这就是所谓的“内存受限”(Memory-bound)场景。

为了缓解这一问题,业界在过去两年中经历了从算法架构改革到系统工程优化的剧烈演变。

2. 注意力机制的架构演进:从MHA到MLA

为了从根本上减小KV Cache的体积,模型架构师们对Transformer的核心——注意力机制进行了多次手术。这一演进路径清晰地展示了从追求极致性能到追求效率与性能平衡的过程。

2.1 多头注意力(MHA):昂贵的基准

在《Attention Is All You Need》原论文中提出的多头注意力(Multi-Head Attention, MHA)机制中,模型拥有 $H$ 个查询头(Query Heads),同时也对应拥有 $H$ 个键值头(Key/Value Heads)。

  • 机制:每个Query Head都有自己独立的Key和Value投影矩阵。这意味着模型可以从 $H$ 个不同的子空间(Subspaces)捕捉信息,理论上表达能力最强。
  • 代价:正如第1节中的计算所示,KV Cache的大小与头数 $H$ 成正比。对于大模型,这意味着巨大的显存开销。
  • 现状:早期的BERT、GPT-2、GPT-3以及Llama-1均采用MHA。但在长上下文时代,MHA已成为不可承受之重。

2.2 多查询注意力(MQA):激进的压缩

为了解决推理性价比问题,Noam Shazeer在2019年提出了多查询注意力(Multi-Query Attention, MQA)。

  • 机制:无论有多少个Query Heads,整个层只保留1个 Key Head 和 1个 Value Head。所有的Query Heads共享这唯一的KV对。
  • 压缩比:$H : 1$。如果模型有64个头,KV Cache的大小直接缩小64倍。
  • 优势:极大地减少了显存占用和数据搬运量,使得推理速度显著提升,且能支持更大的Batch Size。
  • 劣势:这种压缩过于激进,导致模型在处理复杂任务时,无法同时关注输入序列的不同方面,造成性能(Perplexity)下降和训练不稳定性。
  • 应用:Google的PaLM模型和Falcon系列采用了MQA,但在开源社区并未立刻成为主流。

2.3 分组查询注意力(GQA):中庸之道的胜利

在Llama-2发布时,Meta引入了分组查询注意力(Grouped-Query Attention, GQA),这一机制迅速成为当今开源大模型(如Llama-3, Mistral, Qwen)的实际标准。

  • 机制:GQA是MHA和MQA的折中方案。它将Query Heads分为 $G$ 个组,每个组内的Query Heads共享一个KV Head。
  • 配置:例如,Llama-2 70B使用了8个KV Heads($G=8$),而Query Heads为64个。这意味着每8个Query Heads共享1个KV Head。
  • 压缩比:$H : G$。在上述例子中,压缩比为8:1。
  • 效果:研究表明,GQA在显存占用和推理速度上接近MQA,而在模型效果(Accuracy/Perplexity)上几乎等同于MHA。它成功地在帕累托前沿(Pareto Frontier)上找到了最佳平衡点。

2.4 多头潜在注意力(MLA):DeepSeek的架构革命

DeepSeek-V2(及其后的V3)引入的MLA(Multi-Head Latent Attention)不仅仅是分组策略的调整,而是对KV Cache存储方式的根本性重构。这是DeepSeek能够提供极低API价格的核心技术支撑。

2.4.1 低秩矩阵压缩(Low-Rank Compression)原理

传统的注意力机制直接存储投影后的 $K$ 和 $V$ 矩阵,维度为 $d_{model} \times L$。MLA认为这些高维矩阵中存在大量的冗余信息,可以通过低秩分解来压缩。MLA不直接存储 $K$ 和 $V$,而是将输入的隐藏状态投影到一个低维的“潜在向量”(Latent Vector, $c_{KV}$)。

  • 压缩:输入向量首先经过一个下投影矩阵,变为极低维度的潜在向量(例如,压缩比可达数十倍)。
  • 存储:在KV Cache中,只存储这个压缩后的潜在向量。
  • 还原:在计算注意力分数时,通过一个上投影矩阵,将潜在向量实时还原为用于计算的Key和Value。

这种方法将KV Cache的显存占用从 $O(H \times d_{head})$ 降低到了 $O(d_{latent})$,其中 $d_{latent}$ 远小于前者。

2.4.2 解耦旋转位置编码(Decoupled RoPE)

低秩压缩的一个巨大挑战是如何兼容旋转位置编码(RoPE)。RoPE对向量的旋转操作具有几何敏感性,直接在压缩向量上应用RoPE会破坏其位置信息,或者在还原时引入巨大误差。

DeepSeek创造性地提出了“解耦RoPE”策略:

  1. 内容头(Content Head):负责捕捉语义信息,采用上述的低秩压缩(不带RoPE)。
  2. 位置头(Position Head):一个单独的、维度很小的向量,专门用于携带RoPE位置信息。
  3. 拼接(Concatenation):在计算Attention Score时,将还原后的内容头与位置头拼接,共同参与计算。

通过这种方式,MLA既实现了极致的KV Cache压缩(仅存储压缩内容+少量位置信息),又完美保留了长上下文所需的位置感知能力。根据DeepSeek的报告,MLA使得KV Cache的大小在同等参数规模下只有GQA模型的1/5甚至更低,这使得将KV Cache放入显存之外的介质(如内存或SSD)成为可能,因为数据传输的带宽压力被大幅减轻了。

| 特性 | MHA (Llama-1) | MQA (Falcon) | GQA (Llama-3) | MLA (DeepSeek-V3) |
| --- | --- | --- | --- | --- |
| KV头数量 | 等于Query头数 ($H$) | 1 | 分组数 ($G$, 如8) | 虚拟/动态生成 |
| 显存占用 | 极高 (100%) | 极低 (~1-2%) | 中等 (~12-25%) | 极致压缩 (5-10%) |
| 模型性能 | 基准 (高) | 有损 | 接近无损 | 无损甚至更优 |
| 推理速度 | 慢 (受限于带宽) | 极快 | 极快 | |
| RoPE兼容性 | 原生支持 | 原生支持 | 原生支持 | 需解耦设计 |

3. 系统级显存管理与优化:从分页到流式

如果说Transformer架构决定了KV Cache的“理论最小体积”,那么系统软件则决定了如何在物理硬件上高效地“摆放”这些数据。2023年以来,以vLLM为代表的推理框架通过引入操作系统领域的经典思想,彻底改变了显存管理的范式。

3.1 显存碎片化与PagedAttention (vLLM)

在vLLM出现之前,主流推理框架(如FasterTransformer)采用的是静态显存分配。对于一个请求,系统必须按照其“最大可能长度”(Max Sequence Length)预先分配一块连续的显存空间。

  • 内部碎片(Internal Fragmentation):如果预分配了2048长度,但用户只生成了50个token,剩余的空间全部被浪费。
  • 外部碎片(External Fragmentation):不同请求的显存块大小不一,导致显存中出现许多无法被利用的空隙。

据统计,这种方式导致的显存浪费率高达60%-80%。

3.1.1 PagedAttention的原理

vLLM团队受操作系统虚拟内存(Virtual Memory)分页机制的启发,提出了PagedAttention。

  1. KV Block:将KV Cache切分为固定大小的块(Block),例如每块存储16个token的KV数据。
  2. 非连续存储:这些Block在物理显存(HBM)中不需要连续存放,可以分散在任意位置。
  3. 页表(Block Table):系统维护一张映射表,记录每个请求的逻辑token顺序对应哪些物理Block。
  4. 按需分配:只有当新的token生成填满当前Block时,系统才申请下一个Block。

优势:

  • 零浪费:内部碎片仅存在于最后一个未填满的Block中,浪费率降至4%以下。
  • 显存共享(Memory Sharing):这是PagedAttention最强大的特性,机制上类似Python中的引用计数:如果多个请求共享相同的System Prompt(例如“你是一个有用的助手...”),vLLM只需在物理显存中存储一份该Prompt的KV Block,所有请求的页表都指向这一份数据。只有当各自生成不同的后续内容时,才触发“写时复制”(Copy-on-Write)。这为Prompt Cache奠定了系统基础。
  • 并行采样(Parallel Sampling):在 Parallel Sampling 中,同一个 prompt 会生成多个候选输出,便于用户从多个备选中选择最佳响应,常用于内容生成或模型对比测试。在 vLLM 中,这些采样序列共享相同的 prompt,其对应的 KV Cache 也可以共用同一组物理块。PagedAttention 通过引用计数和 block-level 的 copy-on-write 机制实现共享与隔离的平衡:只有当序列出现不同分支时,才会触发复制操作。

3.2 动态前缀复用与RadixAttention (SGLang)

虽然vLLM解决了分配问题,但在处理复杂的、非线性的对话历史时,如何自动发现可复用的KV Cache仍是难题。SGLang框架提出了RadixAttention,将缓存管理提升到了一个新的维度。

3.2.1 Radix Tree(基数树)结构

SGLang不再将KV Cache视为线性的数组,而是将其维护为一个基数树(Radix Tree)。

  • 节点:树的边代表token序列,节点代表KV Cache的状态。
  • 路径:从根节点到叶子节点的路径代表一个完整的对话历史。

Hash RadixAttention 代码走读:

3.2.2 自动复用机制

当一个新的请求到达时,系统将Prompt作为搜索键在Radix Tree中进行最长前缀匹配(Longest Prefix Match)。

  • 场景A(多轮对话):用户问“A”,模型答“B”。用户接着问“C”。RadixAttention自动匹配到“A->B”的路径,直接复用其KV Cache,只需计算“C”。
  • 场景B(Few-Shot Learning):多个请求使用相同的Few-Shot示例,但在最后的问题上不同。RadixAttention自动锁定公共前缀节点,无需人工干预。
  • LRU淘汰:当显存不足时,系统根据最近最少使用(LRU)原则,从叶子节点开始剪枝,释放显存。

prefix hash 代码走读:

与vLLM早期的前缀缓存相比,RadixAttention无需用户手动配置,且能处理更复杂的分支结构(如Tree-of-Thoughts推理),显著提高了复杂Agent任务的吞吐量。

3.3 无限流式生成与StreamingLLM

对于需要7x24小时运行的数字人或长期助理,KV Cache理论上会无限增长直至显存溢出。简单的“滑动窗口”(Sliding Window,只保留最近N个token)会导致模型崩溃,因为Transformer在训练时并未适应这种信息的突然截断。

3.3.1 注意力汇聚(Attention Sink)现象

MIT的研究人员发现,Transformer模型在推理时,会倾向于将大量的注意力分数分配给序列开头的几个token(通常是前4个)。这些token充当了“锚点”(Anchor),稳定了后续层的注意力计算,即使它们本身可能没有太多语义信息。

3.3.2 StreamingLLM机制

基于上面的发现,StreamingLLM提出了一种特殊的缓存策略:

  1. 保留汇聚点:永久保留序列开头的几个token(Attention Sinks)的KV Cache。
  2. 滑动窗口:对后续的token使用滑动窗口,只保留最近的N个。
  3. 位置编码调整:对RoPE进行相对位置的平移,使模型感知到正确的距离。

这种方法使得模型可以在有限的显存(例如只缓存2048个token)下,处理长度达到400万甚至无限的输入流,且困惑度(Perplexity)不发生爆炸。

4. 极致压缩:KV Cache量化技术

除了架构和系统优化,降低数据本身的精度是另一个维度的压缩手段。KV Cache量化(Quantization)正从研究走向生产环境。

4.1 精度格式的演变

  • FP16/BF16:传统的基准,每个参数2字节。
  • FP8 (E4M3/E5M2):NVIDIA H100原生支持。将KV Cache压缩到1字节/参数。vLLM和TensorRT-LLM已经支持FP8 KV Cache,通常能带来2倍的吞吐量提升,且精度损失微乎其微。

4.2 激进量化:INT4与非均匀分布挑战

将KV Cache压缩到4-bit(INT4)可以将显存占用减少4倍,但这面临巨大挑战。

4.2.1 异常值(Outliers)问题

研究发现,Key和Value矩阵中的数值分布并非均匀的高斯分布。特定的通道(Channels)或Token会出现数值极大的异常值。如果使用标准的均匀量化(Uniform Quantization),这些异常值会拉大量化范围(Range),导致大部分小数值的精度被严重吞噬,模型彻底失效。

解决方案:

  1. SmoothQuant / Atom:引入平滑因子,将激活值(Activation)中的异常值迁移到权重(Weight)中,或者在量化前对通道进行缩放(Per-channel scaling),使得分布更平滑,适合INT8/INT4量化。
  2. KIVI / KVQuant:采用非均匀量化策略,或者将少量的异常值单独以高精度(FP16)存储,而对绝大部分数据进行INT4甚至2-bit量化。
  3. 分组量化:类似于GQA,对每128个或64个元素进行独立的量化统计,减小异常值的影响范围。

目前,INT4 KV Cache在部分长文本场景下已可投入使用,但在高精度要求的逻辑推理任务中仍需谨慎评估。

5. 各大厂商Prompt Cache支持情况深度评测

2025年以来,随着上述技术的成熟,各大模型厂商纷纷推出了面向开发者的“Context Caching”或“Prompt Caching”服务。这一功能被视为RAG(检索增强生成)和Agent(智能体)应用的经济基石。

5.1 DeepSeek:磁盘缓存与价格屠夫

DeepSeek(深度求索)是目前市场上最具颠覆性的玩家。依托于其独特的MLA架构,DeepSeek实现了真正的磁盘级上下文缓存。

  • 技术原理:由于MLA将KV Cache压缩到了极小(约为MHA的1/10甚至更低),数据量小到足以从SSD硬盘阵列中实时读取,而不会造成严重的延迟瓶颈。这打破了“KV Cache必须在显存”的铁律[6]。
  • 缓存策略(Implicit):自动触发。无需用户显式创建缓存ID,系统自动识别重复的前缀。
  • 价格体系:
    • 缓存命中(Cache Hit):$0.014 / 百万tokens。这是一个惊人的数字。相比之下,GPT-4o的输入价格约为$2.50,DeepSeek的价格仅为OpenAI的 0.5%
    • 缓存未命中(Cache Miss): 约 $0.14 - $0.27 / 百万tokens(取决于具体版本),依然远低于竞品。
    • 存储费:免费。得益于磁盘存储的低成本,DeepSeek不收取额外的时间存储费。
  • 持久性:由于存储在磁盘,其TTL(生存时间)远长于基于显存的竞品,理论上可达数小时甚至数天(取决于系统调度),非常适合低频但长尾的知识库查询。

5.2 Google Gemini:TPU加持下的灵活双模

Google Vertex AI利用其TPU Pod的庞大显存池,提供了最为灵活的“隐式+显式”双轨制。

  • 隐式缓存(Implicit):针对Gemini 2.5 Flash等模型。自动检测,无需代码更改。
    • 门槛:1024或2048 tokens以上。
    • 价格:命中时约为标准输入价格的25%(即75%折扣)。无存储费。
  • 显式缓存(Explicit):针对企业级长文档。
    • 机制:用户调用API创建CachedContent对象,获得一个ID。
    • 价格结构:包含两部分。
      • 计算费:命中缓存的Token价格极低(甚至在某些层级接近免费)。
      • 存储费:按小时计费。例如,Gemini 2.5 Pro约为$4.50 / 100万tokens / 小时。
    • 适用场景:这是一种“租赁显存”的模式。只有当你的查询频率非常高(例如每小时数百次查询同一份文档),节省的计算费才能覆盖存储费。如果只是偶尔查询,显式缓存反而更贵。

5.3 Anthropic Claude:极速流转的显存租赁

Anthropic的策略非常激进,专注于“高频会话”场景,尤其是配合Claude 4 Sonnet强大的编码能力。

  • 显式断点:用户需在API参数中设置 cache_control: {"type": "ephemeral"} 断点。
  • TTL(5分钟):这是最大的争议点与特点。缓存仅保存5分钟。每次命中(Read),TTL重置为5分钟。这意味着它不适合长期存储,只适合连续不断的对话。
  • 价格:
    • 写入(Write):1.25倍标准输入价格。你需要支付溢价来创建缓存。
    • 读取(Read):0.1倍(10%)标准输入价格。90%的折扣。
  • 盈亏平衡点:由于有25%的写入溢价,用户必须在5分钟内至少复用一次缓存(即第二次提问),才能开始省钱。如果写入后5分钟内没有再次提问,缓存失效,用户的写入溢价就“白花了”。

5.4 OpenAI:保守的黑盒策略

OpenAI在Prompt Cache上显得相对保守和不透明。

  • 机制:纯隐式。自动匹配1024 tokens以上的块。
  • 支持模型:GPT-4、5等系列。
  • 价格:
    • 写入:原价(无溢价)。
    • 读取:50%折扣。
  • 分析:50%的折扣力度远小于DeepSeek (95%+) 或 Anthropic (90%)。但由于没有写入溢价,这是一种“无风险”的优惠——即使没命中也不会亏。
  • TTL:不透明,通常为5分钟-24小时,受系统负载影响极大。这使得开发者很难依赖它来做严格的成本控制。

5.5 阿里云 Qwen (通义千问):混合模式

阿里云Model Studio紧跟国际步伐,提供了类似的机制。

  • 隐式:命中时按标准价的20%计费(80%折扣)。
  • 显式:
    • 写入:1.25倍价格。
    • 读取:20%价格。
  • Qwen-Long:支持文件ID引用的长上下文,本质上是一种持久化的上下文缓存,支持高达10M tokens,适合超长文档分析。

5.6 厂商对比汇总表

| 厂商 | 机制类型 | 最小Token限制 | 存储介质推测 | TTL (生存时间) | 写入成本 (Write) | 命中成本 (Read) | 存储费用 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek | 隐式 (自动) | 无/低 | SSD/磁盘 | 长 (小时/天) | 1.0x (原价) | ~0.05x ($0.014) | 免费 |
| Anthropic | 显式 (断点) | 1024 | 显存 (HBM) | 5分钟 (刷新) | 1.25x (溢价) | 0.10x (一折) | 包含在写入溢价中 |
| Google | 显式 + 隐式 | 1024/2048 | TPU HBM | 1小时 (显式, 可续) | 1.0x | ~0.25x | 按小时收费 (显式) |
| OpenAI | 隐式 (自动) | 1024 | 显存 (HBM) | 动态 (短) | 1.0x | 0.50x (五折) | 免费 |
| Alibaba | 显式 + 隐式 | 256/1024 | 显存 | 5分钟/动态 | 1.25x (显式) | 0.10x - 0.20x | 免费 |

5.7 成本情景模拟:法律文档分析

场景:上传一本50,000 tokens的法律法典,并在接下来的2小时内进行100次问答查询。

  1. Anthropic (Claude 3.5 Sonnet):
    • 首单(写入):50k × $3.75/M = $0.1875
    • 后续99次(读取):99 × 50k × $0.30/M = $1.485
    • 总计:~$1.67
    • 风险:如果中间停顿超过5分钟,需重新支付写入费。
  2. Google (Gemini 2.5 Pro - 显式缓存):
    • 写入:50k × $1.25/M = $0.0625(假设基础价)
    • 存储:2小时 × 0.05M tokens × $4.50/M/hr = $0.45
    • 读取:100 × 50k × $0.30/M = $1.50
    • 总计:~$2.01
    • 优势:即使2小时内没人提问,缓存也在。
  3. OpenAI (GPT-4o):
    • 首单:50k × $2.50/M = $0.125
    • 后续99次:99 × 50k × $1.25/M = $6.19
    • 总计:~$6.31
    • 劣势:读取折扣力度不够。
  4. DeepSeek (V3):
    • 首单:50k × $0.14/M = $0.007
    • 后续99次:99 × 50k × $0.014/M = $0.069
    • 总计:~$0.076
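上面的情景计算可以抽象成一个极简的估算函数,便于代入自己的流量模型(以下单价均为上文的假设值,并非实时报价):

def scenario_cost(prompt_tokens, n_calls, write_per_m, read_per_m,
                  storage_per_m_hour=0.0, hours=0.0):
    """粗略估算:首单写入 + 后续读取 + (可选)按小时存储费,单位为美元。"""
    m = prompt_tokens / 1e6
    return (write_per_m * m
            + read_per_m * m * (n_calls - 1)
            + storage_per_m_hour * m * hours)

# 50k tokens 法典,2 小时内 100 次问答(价格取自上文假设)
print(scenario_cost(50_000, 100, write_per_m=3.75, read_per_m=0.30))    # ≈1.67  Anthropic
print(scenario_cost(50_000, 100, write_per_m=1.25, read_per_m=0.30,
                    storage_per_m_hour=4.50, hours=2))                  # ≈2.0   Gemini 显式缓存
print(scenario_cost(50_000, 100, write_per_m=2.50, read_per_m=1.25))    # ≈6.31  GPT-4o
print(scenario_cost(50_000, 100, write_per_m=0.14, read_per_m=0.014))   # ≈0.076 DeepSeek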

6. 语义缓存与应用层优化

除了依赖模型厂商的Prompt Cache,开发者在应用层(Client-side)还可以通过语义缓存(Semantic Caching)进一步降低成本。这与Prompt Cache是互补关系。

6.1 语义缓存的原理

传统的缓存(如Redis)基于Key-Value精确匹配。如果用户问“苹果的价格?”和“苹果多少钱?”,传统缓存会视为两个请求。
语义缓存引入了向量嵌入(Embedding)技术:

  1. Embedding: 将用户Query转化为向量。
  2. 向量搜索: 在向量数据库(如Milvus, Qdrant, Redis Vector)中搜索历史Query。
  3. 相似度阈值: 如果发现一个历史Query的余弦相似度大于阈值(如0.95),则直接返回之前缓存的LLM回答[36]。

6.2 开源工具:GPTCache

GPTCache是目前最流行的开源语义缓存库,支持LangChain集成。

  • 模块化设计: 支持多种Embedding模型(OpenAI, HuggingFace)和多种向量存储(Redis, FAISS)。
  • 后处理: 甚至支持对缓存的回答进行评估,确保时效性。
  • 对比:
    • Prompt Cache (服务端): 解决的是“长Context复用”的问题。输入是旧的(书),问题是新的。
    • Semantic Cache (客户端): 解决的是“高频重复问题”的问题。输入是新的(问题),但意义是旧的。
  • 最佳实践: 只有当用户的提问具有高重复性(如客服系统常见问题)时,语义缓存才有意义。对于开放式分析任务,Prompt Cache更为关键。

7. Prompt Cache在X-Sec中的应用

7.1 Prompt Cache

# Qwen3 等支持 Cache 的 LLM 使能 Prompt Caching
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage

# app_prompt_template / app_user_prompt_template / vars / input_data 为业务侧已定义的模板与变量
prompt = ChatPromptTemplate.from_messages([
  SystemMessage(
    # Qwen启用显式prompt cache能力
    content=[
      {
        "type": "text",
        "text": app_prompt_template.format(vars),
        "cache_control": {"type": "ephemeral"},
      }
    ]
  ),
  HumanMessage(content=app_user_prompt_template.format(input_data)),
])

# OpenAI 使能 Prompt Caching
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    # openai 支持 24h 保存 cache(prompt_cache_retention 参数,取决于模型支持情况)
    prompt_cache_retention="24h",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me a joke."},
    ],
)

7.2 Semantic Cache

语义cache需要Memory系统的支持(langchain checkpoint支持vector store),所以当前还没有进行深入探索,后续Agent应用开发进入到这个阶段再行总结和探索。Stay Tuned!

8. 总结

大模型推理的演进史,本质上是一部与显存带宽和容量搏斗的抗争史。

  1. 架构与硬件的协同:DeepSeek MLA的成功证明了算法创新(低秩压缩)可以解锁硬件潜力(SSD存储),从而颠覆商业模式。这预示着未来模型设计将更多地考虑底层硬件特性,而非单纯追求参数量。
  2. 有状态API(Stateful API)的兴起:HTTP无状态协议已不再适应LLM应用。Prompt Cache的普及标志着LLM服务正演变为“有状态”的操作系统。开发者必须学会管理“Context 生命周期”,像管理数据库连接池一样管理Prompt Cache。
  3. 成本的断崖式下降:随着$0.014/M tokens这种价格的出现,RAG应用的瓶颈将从“检索多少文档省钱”转变为“模型能处理多少文档不幻觉”。全量知识库注入(Full Context)正在取代部分切片检索,成为新的技术趋势。
  4. 对于Agent应用开发者而言,当下的最优策略是:对于高频低延时任务(如代码助手)选择Anthropic或vLLM自建服务;对于海量数据分析和知识库问答,DeepSeek的磁盘缓存方案提供了目前无法匹敌的性价比优势。

References

  1. Optimizing Transformer Inference with Grouped Query Attention | Towards AI, accessed November 27, 2025, https://towardsai.net/p/machine-learning/optimizing-transformer-inference-with-grouped-query-attention
  2. arXiv:2305.13245v3 [cs.CL] 23 Dec 2023, accessed November 27, 2025, https://arxiv.org/pdf/2305.13245
  3. Attention Mechanisms in Transformers: Comparing MHA, MQA, and GQA | Yue Shui Blog, accessed November 27, 2025, https://syhya.github.io/posts/2025-01-16-group-query-attention/
  4. Understanding Multi-Head Latent Attention, accessed November 27, 2025, https://planetbanatt.net/articles/mla.html
  5. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, accessed November 27, 2025, https://arxiv.org/html/2405.04434v2
  6. DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, accessed November 27, 2025, https://api-docs.deepseek.com/news/news0802
  7. MLA: Redefining KV-Cache Through Low-Rank Projections and On-Demand Decompression - Hugging Face, accessed November 27, 2025, https://huggingface.co/blog/NormalUhr/mla-explanation
  8. How PagedAttention resolves memory waste of LLM systems - Red Hat Developer, accessed November 27, 2025, https://developers.redhat.com/articles/2025/07/24/how-pagedattention-resolves-memory-waste-llm-systems
  9. Introduction to vLLM and PagedAttention | Runpod Blog, accessed November 27, 2025, https://www.runpod.io/blog/introduction-to-vllm-and-pagedattention
  10. vLLM and PagedAttention: A Comprehensive Overview | by Abonia Sojasingarayar | Medium, accessed November 27, 2025, https://medium.com/@abonia/vllm-and-pagedattention-a-comprehensive-overview-20046d8d0c61
  11. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention - arXiv, accessed November 27, 2025, https://arxiv.org/html/2405.04437v1
  12. When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse, accessed November 27, 2025, https://www.runpod.io/blog/sglang-vs-vllm-kv-cache
  13. SGLang: Efficient Execution of Structured Language Model Programs - arXiv, accessed November 27, 2025, https://arxiv.org/pdf/2312.07104
  14. Fast and Expressive LLM Inference with RadixAttention and SGLang | LMSYS Org, accessed November 27, 2025, https://lmsys.org/blog/2024-01-17-sglang/
  15. Arxiv Dives - Efficient Streaming Language Models with Attention Sinks - Oxen.ai, accessed November 27, 2025, https://ghost.oxen.ai/arxiv-dives-efficient-streaming-language-models-with-attention-sinks/
  16. Efficient Streaming Language Models with Attention Sinks - arXiv, accessed November 27, 2025, https://arxiv.org/html/2309.17453v4
  17. [2309.17453] Efficient Streaming Language Models with Attention Sinks - arXiv, accessed November 27, 2025, https://arxiv.org/abs/2309.17453
  18. Attention Sinks for LLM - Endless Generation - Analytics Vidhya, accessed November 27, 2025, https://www.analyticsvidhya.com/blog/2023/12/attention-sinks-for-llm/
  19. FP8 E5M2 KV Cache - vLLM, accessed November 27, 2025, https://docs.vllm.ai/en/v0.6.3.post1/quantization/fp8_e5m2_kvcache.html
  20. FP8 quantization with AMD Quark for vLLM — Tutorials for AI developers 8.0, accessed November 27, 2025, https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/gpu_dev_optimize/fp8_quantization_quark_vllm.html
  21. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, accessed November 27, 2025, https://www.stat.berkeley.edu/~mmahoney/pubs/neurips-2024-kvquant.pdf
  22. AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations - ACL Anthology, accessed November 27, 2025, https://aclanthology.org/2025.coling-main.158.pdf
  23. FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration, accessed November 27, 2025, https://arxiv.org/html/2505.20839v1
  24. NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics, accessed November 27, 2025, https://arxiv.org/html/2505.16210v1
  25. Gemini Developer API pricing, accessed November 27, 2025, https://ai.google.dev/gemini-api/docs/pricing
  26. Context caching overview | Generative AI on Vertex AI - Google Cloud Documentation, accessed November 27, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview
  27. Context Caching In Google Gemini: Better Than RAG For Memory - Empathy First Media, accessed November 27, 2025, https://empathyfirstmedia.com/context-caching-google-gemini/
  28. Prompt Caching Support in Spring AI with Anthropic Claude, accessed November 27, 2025, https://spring.io/blog/2025/10/27/spring-ai-anthropic-prompt-caching-blog
  29. Prompt caching - Claude Docs, accessed November 27, 2025, https://platform.claude.com/docs/en/build-with-claude/prompt-caching
  30. Prompt Caching is a Must! How I Went From Spending $720 to $72 Monthly on API Costs | by Du'An Lightfoot | Medium, accessed November 27, 2025, https://medium.com/@labeveryday/prompt-caching-is-a-must-how-i-went-from-spending-720-to-72-monthly-on-api-costs-3086f3635d63
  31. Prompt caching - OpenAI API, accessed November 27, 2025, https://platform.openai.com/docs/guides/prompt-caching
  32. Prompt Caching in the API - OpenAI, accessed November 27, 2025, https://openai.com/index/api-prompt-caching/
  33. How does Prompt Caching work? - OpenAI Developer Community, accessed November 27, 2025, https://community.openai.com/t/how-does-prompt-caching-work/992307
  34. Context Cache feature of Qwen models - Alibaba Cloud Model Studio, accessed November 27, 2025, https://www.alibabacloud.com/help/en/model-studio/context-cache
  35. Qwen context window: token limits, memory policy, and 2025 rules - Data Studios, accessed November 27, 2025, https://www.datastudios.org/post/qwen-context-window-token-limits-memory-policy-and-2025-rules
  36. Semantic Cache: How to Speed Up LLM and RAG Applications - Medium, accessed November 27, 2025, https://medium.com/@svosh2/semantic-cache-how-to-speed-up-llm-and-rag-applications-79e74ce34d1d
  37. Semantic Cache: Accelerating AI with Lightning-Fast Data Retrieval - Qdrant, accessed November 27, 2025, https://qdrant.tech/articles/semantic-cache-ai-data-retrieval/
  38. Semantic caching for faster, smarter LLM apps - Redis, accessed November 27, 2025, https://redis.io/blog/what-is-semantic-caching/
  39. zilliztech/GPTCache: Semantic cache for LLMs. Fully integrated with LangChain and llama_index. - GitHub, accessed November 27, 2025, https://github.com/zilliztech/GPTCache
  40. Reducing LLM Costs and Latency via Semantic Embedding Caching - arXiv, accessed November 27, 2025, https://arxiv.org/html/2411.05276v2
  41. If your app process many similar queries, use Semantic Caching to reduce your cost and latency : r/LangChain - Reddit, accessed November 27, 2025, https://www.reddit.com/r/LangChain/comments/1f4rlx0/if_your_app_process_many_similar_queries_use/
  42. 图解vLLM Automatic Prefix Cache(RadixAttention), https://zhuanlan.zhihu.com/p/693556044
  43. Gemini 3技术是跳蛙式超越 https://www.youtube.com/watch?v=EMQxQwoFSb4
  44. Andrew Ng, The Batch issue 329 | deeplearning.ai, https://www.deeplearning.ai/the-batch/issue-329/

Before Making AI Agent Systems Smarter, First Make Them Trustworthy

2025-10-26 15:58:59

Introduction: An Illusion of "Simplicity"

A narrative has recently become common within the team: "Building an Agent is simple now. You can just piece it together with LangChain, BaiLian, or Flowise, and it runs."

At first glance, this statement is hard to refute—frameworks have indeed lowered the barrier to entry. But that "simplicity" is more of an illusion, a facade created after the complexity has been temporarily absorbed by the platform. From a technical standpoint, Agent development involves:

  • Orchestration and task planning;
  • Context and Memory management;
  • Domain knowledge fusion (RAG);
  • And the "agentification" of business logic.

These steps are not accomplished just by writing a few prompts. When developers feel it's "simple," it's because the complexity has been absorbed by the platform. The difficulty of Agents lies not in getting a demo to run, but in making it operate reliably, controllably, and sustainably over the long term.

Why Is Agent Development Mistakenly Seen as "Simple"?

On the surface, we are in an era of explosive AI growth, with platforms and tools emerging endlessly. It's true that by writing a few prompts and connecting a few chains, a "functional" Agent can be born. But this doesn't mean the complexity has vanished. Instead, the complexity has been relocated.

I break this "simplicity" down into three illusions:

1. Encapsulated Complexity

Frameworks help you string prompts and trim context, shielding developers from the details. But the underlying mechanics—debugging, tracing, and state recovery—are still burdens you must bear alone.

Take LangChain as an example. A "question-answering" Agent can be created with just a few lines of code:

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)  # serpapi requires a SERPAPI_API_KEY
# Note: the legacy initialize_agent API takes `agent=`, not `agent_type=`
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

agent.run("What is the current weather in Singapore, and convert it to Celsius?")

This code hides almost all complexity:

  • Prompt assembly, call chains, and context management are encapsulated internally.
  • But if the task fails (e.g., API rate limiting, tool failure), the Agent defaults to neither retrying nor logging a trace.

What looks like a "simple run" actually means sacrificing the interfaces for observability and debugging.

2. Outsourced Complexity

Memory, RAG, and Embeddings are all handed over to the platform for custody. The price is the loss of the ability to intervene and explain.

In LangChain, you can quickly add "memory":

from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history")

But this is just a short-term memory buffer. It doesn't handle:

  • Conflicts with old information;
  • State drift over multiple turns;
  • Or context truncation issues due to excessive length.

As the Agent scales, memory consistency and state cleanup become new sources of system complexity.

3. Postponed Complexity

It doesn't disappear; it just reappears during the execution phase:

  • Output drift
  • Inability to reproduce results
  • Collapse of correctness and stability

Being able to run does not equal being able to run correctly over the long term. What we call simplicity is often just us temporarily avoiding the confrontation with complexity.

The Three Layers of Agent System Complexity

1. Agent Complexity

The complexity of an Agent system manifests in its ability to be run, reproduced, and evolved. Most current Agent frameworks have solved "runnability," but "reproducibility" and "evolvability" remain significant system engineering challenges.

| Level | Core Objective | Engineering Keywords | LangChain Example Explanation |
| --- | --- | --- | --- |
| Runnability (Run) | Enable the Agent to start and execute tasks | prompt, context, tool calls, execution flow | Rapidly assembling an executable chain via initialize_agent |
| Reproducibility (Reproduce) | Make behavior controllable and debuggable | memory, state, logs, versioning | No built-in version tracking; Memory state drift requires manual management |
| Evolvability (Evolve) | Allow the system to continuously improve | RAG, feedback loops, collaboration, safety boundaries | Supports vector retrieval, but lacks self-assessment and reinforcement learning mechanisms |

At the "Runnability" level, the abstractions designed by frameworks like LangChain are indeed efficient. But to make an Agent's behavior stable, explainable, and continuously optimizable, additional infrastructure—such as logging systems, prompt version management, and feedback loops—is still required.

From a system engineering perspective, the difficulty of an Agent lies not in "generation" but in "execution." All platforms will eventually expose their costs along these two lifecycles.

| Dimension | Definition | Common Issues | Essence |
| --- | --- | --- | --- |
| Correctness | Is each decision correct? | Hallucinations, incorrect tool calls, logical deviations | The output logic is wrong |
| Stability | Is the system reproducible? | State drift, infinite loops, random fluctuations | The behavior is uncertain |

In the implementation phase, stability is often more critical than correctness. Only when stability exists can correctness even be verified and optimized.

Intelligence's uncertainty must be underpinned by engineering's certainty. Stability and observability are the prerequisites for an Agent to be truly evolvable.

2. The Agent Amplification Effect

As shown in the image above, the same model (qwen-max), the same time, and the same prompt produce different results. This is the amplification effect that LLM uncertainty brings to Agents. Compared to the traditional software systems developers are most familiar with, the complexity and difficulty of Agents stem from this uncertainty, amplified at each semantic level by the LLM.

If a single LLM interaction has a 90% correctness rate, an Agent system requiring 10 LLM interactions will have its correctness drop to just 35%. If it requires 20 interactions, the correctness plummets to 12%.
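The compounding is just exponential decay of per-step reliability: end-to-end success is roughly $p^n$ for $n$ chained calls at per-call accuracy $p$. A two-line check:

p = 0.90
print(p ** 10, p ** 20)  # ≈0.35 and ≈0.12, matching the 35% and 12% figures above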

Memory's Uncertainty Amplification

Traditional software state management is deterministic (e.g., what's in the database is what's in the database). An Agent's memory, however, relies on LLM parsing, embedding, and retrieval. The results are highly uncertain. Therefore, memory is not a storage/retrieval problem, but a semantic consistency problem. This is unique to Agents.

Orchestration's Dynamic Amplification

In traditional systems, orchestration (workflow) is a fixed, predefined process. In an Agent, the orchestration—which tool to call next, and how—is often dynamically decided by the LLM. This means the orchestration problem isn't just about "sequence/concurrency"; it's about an explosion of the decision space, making testing, monitoring, and optimization far more complex.

Testability's Unpredictability Amplification

Traditional software is predictable: given input → expected output. An Agent's output is a probability distribution (a stream of tokens from the LLM); there is no strict determinism. Therefore, testing cannot rely solely on unit tests. It must incorporate replay testing, baseline comparison testing, and simulation environment testing, which is far beyond the difficulty of standard application testing.

3. From "Runnable" to "Usable"

The "'It Runs, Doesn't It?' Fallacy"

Some might say, "I can get the Agent to work just by modifying the prompts. Am I amplifying the problem myself, rather than the Agent?"

"Getting it to run by tweaking prompts" essentially means: Short-term goal + High tolerance = Good enough.
The goal of an Agent system, however, is: Long-term goal + Engineering-grade reliability = Drastic increase in difficulty.

Let's first look at why tweaking prompts seems to work. Many Agent Demos or POCs (Proofs of Concept) aim for one-off tasks, like "write a summary for me" or "call this API." In these low-requirement scenarios, the raw power of the LLM masks many underlying issues:

  • Memory can be passed purely through context (long-term persistence is never really tested).
  • Orchestration can be hard-coded or hinted at in the prompt.
  • Testability is irrelevant; if it gets the right answer once, it's a win.

The problem is that when the requirement shifts from a "Demo" to a "Sustainably Usable System," these issues are rapidly amplified:

  • Prompt Modification ≠ Reliability Guarantee. Changing a prompt might fix the immediate bug, but it doesn't guarantee the same class of problem won't reappear in another case. You haven't established reproducible, maintainable decision logic; you've just engaged in "black-box tweaking."
  • Prompt Modification ≠ Scalability. Prompt hacking works for a single-task Agent. But in a multi-tool, multi-scenario Agent, the prompt's complexity grows exponentially and eventually becomes uncontrollable.
  • Prompt Modification ≠ Engineering Controllability. Traditional software can be covered by test cases to ensure logical coverage. Prompts can only partially mitigate the LLM's probabilistic fluctuations; they cannot provide strong guarantees.

This is why, ultimately, we need more structured methods for memory, orchestration, and testing—which is to say, Agent systematization.

Limitations of Agent Frameworks

Let's use the LangChain framework as an example to see if frameworks can solve the three layers of Agent complexity. LangChain provides a basic CallbackManager and LangSmith integration for tracing an Agent's execution. This functionality is often overlooked, but it is key to understanding "reproducibility" and "observability."

from langchain.callbacks import StdOutCallbackHandler, CallbackManager
from langchain.llms import OpenAI
from langchain.agents import AgentType, initialize_agent, load_tools

# Create a simple callback manager
# (newer LangChain versions pass handlers via callbacks=[handler] instead)
handler = StdOutCallbackHandler()
manager = CallbackManager([handler])

llm = OpenAI(temperature=0, callback_manager=manager)
tools = load_tools(["llm-math"], llm=llm)

agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

agent.run("Calculate (15 + 9) * 2")

When executed, LangChain will output every Thought and Action to the console:

Thought: I need to use the calculator tool.
Action: Calculator
Action Input: (15 + 9) * 2
Observation: 48
Thought: I now know the final answer.
Final Answer: 48

This seemingly simple output reveals three important facts:

  1. The Agent's internal decision process is traceable (this is the prerequisite for reproducibility).
  2. The CallbackManager must be actively enabled by the engineer (it doesn't log by default).
  3. The granularity of observation is limited (it cannot directly trace context trimming, memory overwrites, etc.).

LangSmith provides a more complete visual trace, but it remains an external observation tool. The Agent framework itself still lacks built-in verification mechanisms. In other words, the framework gives you the ability to "see," but it doesn't solve the problem of "control" for you.

Although frameworks like LangChain are making interesting attempts to solve the complex problems in Agent systems, we must admit that most engineering dimensions remain unresolved. (In short, frameworks solve the problem of "using an LLM to do things," but not the problem of "making the LLM do things in a way that is controllable, sustainable, and scalable like a system"):

| Module | What Frameworks Provide | What Remains Uncovered / Needs Engineering |
| --- | --- | --- |
| Inference Layer (LLM Layer) | Model calls, Prompt templates | Output stability, task context constraints, hallucination detection |
| Tools Layer | API calls, Function routing | Secure tool sandbox, permission control, error recovery |
| Memory Layer | Basic vector retrieval, session context | Long-term memory compression, conflict detection, memory decay strategy |
| Orchestration Layer | Simple loops or chained calls | Multi-task scheduling, plan optimization, inter-agent dependency graphs |
| Evaluation Layer | Some tracing, benchmarks | Automated metrics (success rate, controllability, cost monitoring) |
| Safety & Compliance | Almost non-existent | Execution boundaries, permission models, audit logs, sandboxed execution |
| Deployment & Ops | Some SDKs, CLI tools | Persistence, elastic scaling, version management, A/B testing mechanisms |

| Framework | Runnability | Reproducibility | Evolvability | Notes |
| --- | --- | --- | --- | --- |
| LangChain | ✅ Mature chain calls | ⚙️ Partially observable | ⚙️ Manual tuning | Many tools, but state is unstable |
| AutoGen | ✅ Multi-Agent collaboration | ⚙️ Rudimentary memory | ❌ Lacks learning mechanism | Flexible but hard to reproduce |
| CrewAI | ✅ Easy task orchestration | ⚙️ State instability | ❌ No feedback optimization | Strong interaction, weak control |
| AliCloud BaiLian | ✅ Drag-and-drop building | ⚙️ Platform logs | ⚙️ Built-in knowledge center | Platform absorbs complexity, but is a major black box with limited control |
  • ✅ Runnability: Generally well-supported (low barrier to entry)
  • ⚙️ Reproducibility: Only partially supported (requires self-built state and observation layers)
  • ❌ Evolvability: Still relies on manual effort and system design

LangChain makes Agents "buildable," but it makes the system lose its "explainability." Complexity didn't disappear; it just migrated from the code layer to the runtime.

Let's delve deeper into runtime complexity. The new problem Agent systems bring is that they don't just "run"; they must "continuously think," and the side effect of thinking is instability. This is not "traditional code complexity" but "system uncertainty introduced by intelligent behavior." It makes Agent engineering feel more like managing a complex adaptive system than a linear, controllable piece of software.

| New Dimension of Complexity | Description | Example Scenario |
| --- | --- | --- |
| Context Drift | The model misunderstands or forgets key task objectives during multi-turn reasoning | An Agent deviates from the task's semantics during a long conversation, executing irrelevant actions |
| Semantic Non-determinism | The same input may produce different outputs, making processes non-replayable | Prompt debugging results are unstable; automated testing is hard to cover |
| Task Decomposition & Planning | The quality of plans generated by the LLM is unstable; task boundaries are vague | In AutoGen's "plan+execute" model, sub-tasks overflow or loop |
| Memory Pollution | Long-term stored context introduces noise or conflicting information | The Agent "learns" incorrect knowledge, causing future execution deviations |
| Control Ambiguity | The boundary between the Agent's execution and the human/system control layer is unclear | Manual instructions are overridden, tasks are repeated, resources are abused |
| Self-Adaptation Drift | The Agent learns incorrect patterns or behaviors based on feedback | Reinforcing a hallucinatory response during an RLHF/reflection loop |
| Multi-Agent Coordination | Communication, role assignment, and conflict resolution between Agents | Task duplication or conflicts in multi-role systems like CrewAI |

The Only Solution for Agents is Systematization

  1. Prompt Hacking fails when the problem scales. For a single, simple scenario, tweaking a prompt works. But as task complexity and the number of scenarios increase, the prompt becomes bloated and uncontrollable (e.g., one prompt stuffed with dozens of rules). It's like concatenating strings to build SQL queries: it runs at first, but inevitably leads to injection vulnerabilities and a maintenance disaster. Systematization helps by providing structured constraints and automated orchestration, rather than manual prompt tuning.
  2. Uncertainty demands controllability. Getting it right once is a win for a demo. But in a production environment, you need 99% correctness (or 100%). Even a 1% hallucination rate will accumulate into a disaster. For example, a log analysis Agent that misses or false-reports an issue just once could lead to an undiscovered online incident. Systematization ensures controllability through testing, monitoring, and replay verification, rather than gambling on luck every time.
  3. Knowledge persistence vs. repeating mistakes. Today, an Agent's bug is fixed by changing a prompt. Tomorrow, a new requirement comes in, and the exploration starts all over again. Knowledge isn't retained. The Agent can't remember or reuse past solutions, leading to constant redundant labor. A colleague complained that in one business system, prompt modification commits made up over a third of all code commits. Yet, when another colleague tried to reuse that prompt for a similar problem, it was completely non-transferable and had to be hacked again from scratch. Systematization, through Memory + Knowledge Bases, ensures an Agent can learn and accumulate knowledge, not reinvent the wheel every time.

Prompt Hacking / Demo Agents solve "small problems." Only Systematized Agents can solve the problems of "scalability, reliability, and persistence." These issues might not be obvious now, but they will inevitably explode as usage time and scope expand.

A Demo Agent can solve today's problem. A Systematized Agent can solve tomorrow's and the day after's.

| Dimension | Demo Agent (Can run) | Systematized Agent (Can run sustainably) |
| --- | --- | --- |
| Goal | Single task / POC success | Continuous, repeatable, multi-dependent business processes |
| Memory/Knowledge | Raw chat history; occasional vector retrieval | Layered memory (session/short-term/long-term + RAG strategy); consistency & versioning |
| Orchestration/State | Sequential calls / simple ReAct; no explicit state | Explicit state machine / graph (concurrency, rollback, retry, timeout, fault tolerance) |
| Reliability & Testing | "Passes the example" is the standard; non-reproducible | Replay sets / baseline comparison / fuzz testing; SLOs & failure mode design |
| Observability | A few log lines | End-to-end tracing, call chains, metrics, auditing |
| Security/Permissions | Constraints written in the prompt | Fine-grained permissions / sandbox / audit trails / anti-privilege-escalation |
| Scalability | Prompt becomes uncontrollable as scenarios grow | Modular components / model routing / tool governance |
| Cost Curve | Fast/cheap initially; maintenance costs skyrocket later | Upfront engineering investment; stable and scalable long-term |

From "Smart" to "Reliable"

Some Real-World Agent Cases

Looking at history, we can understand rise and fall. Looking at others, we can understand our own successes and failures. The problems I've encountered in Agent system development are surely not mine alone. I asked ChatGPT to search Reddit, GitHub, and blogs for Agent development cases, hoping to use others' experiences to validate my own thinking and reflections:

1. Typical Failures of Toy-Level Agents

  • Auto-GPT community feedback: looping, getting stuck, unable to complete tasks (the classic early example of "runnable but not reliable"). Auto-GPT seems nearly unusable
  • Developer questioning if agents can go to production, noting severe step-skipping/hallucinations in multi-step tasks (system prompt + function calling isn't enough). Seriously, can LLM agents REALLY work in production?
  • OpenAI Realtime Agents official example repo issue: Even the "simple demo" has too many hallucinations to be usable non-demo contexts. Lots of hallucinations?

2. Engineering Problems Exposed After Production (Not solvable by prompt changes)

3. Industry/Big-Tech Postmortems: Why "Systematization" is Needed

4. Positive Case: Treating it with a "Distributed Systems Mindset"

5. Community Reality: People are using it in production, but focus on "de-complexing + limited agents"

  • Developer feedback on LangGraph being production-viable: Migrated from LangChain's Agent Executor; the prototype→streamline→retain-necessities path is more robust (de-hallucinate/de-fancy, retain control). Anyone Using Langchain Agents in production?

The Four Stages of Agent Development

Over more than a year of Agent development, I've gone through a cognitive shift from "Agents are simple" to "Agents are truly complex." At first, I treated frameworks as black boxes, writing prompts and piecing things together to run a demo. As the complexity of the scenarios increased and I needed to go deeper into Agent system R&D, the difficulties gradually revealed themselves. I've tried to break down this "simple → truly hard" process:

Stage 1: The "Hello World" Stage (Looks simple)

Using frameworks like LangChain / AutoGen / CrewAI, you can get something running in a few lines of code. Most people stop at "it can chat" or "it can call tools," so they feel "Agent development is just this."

Stage 2: The Scene Adaptation Stage (Starting to hit pitfalls)

As the complexity of the problems the Agent solves increases, you gradually run into the LLM context window limit, requiring trimming, compression, or selection (i.e., context management problems). You find that vector retrieval results are often irrelevant, leading to non-answers, requiring optimization of preprocessing and query rewriting (RAG knowledge management). It runs in simple scenarios, but falls into traps in slightly more complex ones.

Stage 3: The Systematization Stage (Complexity explodes)

Going further, as tool calls and context management increase, the Agent must ensure consistency across sessions and tasks. You must consider persistence, version control, and conflict resolution. A single Agent can't adapt to complex tasks; you need multi-Agent collaboration. At this point, you must solve deadlock, task conflicts, and state rollbacks. When task complexity rises, debugging the Agent flow can't be solved by tweaking prompts; you must add tracing and observability tools.

Stage 4: The Engineering Landing Stage (The real hard part)

  • Agentifying Business Logic: How to test it? How to guarantee controllability and stability?
  • Security & Compliance: Permissions, privilege escalation, data leakage. Strict security boundaries are a must.
  • Monitoring & SLOs: Like operating microservices, you need monitoring, alerting, and failure recovery.

In summary, frameworks like LangChain lowered the "barrier to entry" for Agents, but they did not lower the "barrier to implementation."

My Cognitive Evolution in Agent Development

I have been developing an Agent system focused on vulnerability and security assessment in my own work. As I experienced the four stages of Agent development mentioned above, my thinking and understanding of Agents also changed:

Level 0: The Framework Illusion Layer

  • Typical Behavior: Install LangChain / AutoGen / CrewAI, run an official demo, modify a prompt.
  • Cognitive Trait: Believes "Agent Development = Writing Prompts." The barrier to entry is extremely low, similar to writing a script.
  • Misconception: Thinks the framework solves all complexity, ignoring memory, orchestration, testing, and security.

Level 1: The Scene Splicing Layer

  • Typical Behavior: Can stitch together RAG, tool calls, and simple multi-agent orchestration to build a seemingly viable prototype.
  • Cognitive Trait: Begins to realize the importance of context management and RAG strategies.
  • Pain Points: Encounters "irrelevant answers," "memory corruption," and "tasks failing to complete reliably."
  • Misconception: Tries to use prompt hacking to solve all problems, ignoring underlying information management and system design.

Level 2: The System Design Layer

  • Typical Behavior: Treats the Agent as a microservices system, needing to consider architecture, observability, and state management.
  • Cognitive Trait: Understands that memory is essentially a database/knowledge-base problem, and orchestration is more like workflow scheduling than a chat.
  • Pain Points: Debugging costs are extremely high; requires tracing, logging, and metrics monitoring.
  • Key Challenge: How to ensure the Agent is robust, controllable, and reproducible.

Level 3: The Engineering Landing Layer

  • Typical Behavior: Deploys the Agent into a production business environment.
  • Cognitive Trait: Treats Agent development as an engineering discipline, just like SRE / Security / Distributed Systems.
  • Pain Points:
    • Testability: The non-determinism of LLMs makes it impossible to guarantee stability with traditional unit tests.
    • Security: Permission management, privilege escalation, prompt injection protection.
    • Monitoring & SLOs: The Agent must be observable and recoverable, just like a service.
  • Key Challenge: How to make the Agent reliable enough to carry critical business functions.

Level 4: The Intelligent Evolution Layer (Frontier Exploration)

  • Typical Behavior: Attempting to build an Agent system with long-term memory, autonomous learning, and evolvability.
  • Cognitive Trait: No longer sees the Agent as an LLM wrapper, but as a new type of distributed intelligent system.
  • Challenges:
    • Memory becomes a knowledge graph + adaptive learning problem.
    • Orchestration involves game theory, collaboration, and even emergent behavior.
    • Security requires "AI sandboxes" to prevent loss of control.
  • Status: Most are not at this stage; it is primarily research and experimentation.

Based on my current understanding of Agents, I now position them as system components rather than intelligent robots. My goal is not "occasional brilliance" but "sustained reliability."

Basic Principles:

  1. Principles:
    • Stable first, smart second.
    • Observable first, optimized second.
  2. Functionality:
    • Establish a replayable mechanism for state and logs (see the sketch after this list).
    • Implement version tracking for Prompts / Memory / RAG.
    • Introduce observability metrics (success rate, drift rate, redundant calls).
    • Clearly define the boundaries and permission scope for each Agent.
    • Designate "error recovery" pathways in the architecture.
  3. Boundaries:
    • If the Agent is only for one-off tasks or exploratory experiments, complexity control can be relaxed.
    • If used for production tasks (monitoring, automated operations), stability and security boundaries take precedence.
    • The deeper the framework's encapsulation, the more an external explainability layer is needed.
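
To make the "replayable mechanism" and "version tracking" items above concrete, here is a deliberately minimal sketch; the record fields, the PROMPT_VERSION tag, and the run_agent stub are assumptions for illustration rather than an existing API:

import json
import time
import uuid
from dataclasses import asdict, dataclass
from pathlib import Path

PROMPT_VERSION = "cvss-assessor/2025-10-11"   # hypothetical prompt tag


@dataclass
class RunRecord:
    """One replayable agent run: enough to re-execute and diff later."""
    run_id: str
    prompt_version: str
    model: str
    inputs: dict
    output: str
    started_at: float
    duration_s: float


def run_agent(inputs: dict) -> str:
    # Stand-in for the real agent call; returns a deterministic dummy answer.
    return f"assessment for {inputs.get('cve_id', 'unknown')}"


def record_run(inputs: dict, log_dir: Path = Path("runs")) -> RunRecord:
    start = time.time()
    output = run_agent(inputs)
    rec = RunRecord(
        run_id=uuid.uuid4().hex,
        prompt_version=PROMPT_VERSION,
        model="qwen-max",
        inputs=inputs,
        output=output,
        started_at=start,
        duration_s=time.time() - start,
    )
    log_dir.mkdir(exist_ok=True)
    (log_dir / f"{rec.run_id}.json").write_text(json.dumps(asdict(rec), ensure_ascii=False))
    return rec


def replay(path: Path) -> bool:
    """Re-run a stored record and report whether the output drifted."""
    rec = json.loads(path.read_text())
    return run_agent(rec["inputs"]) == rec["output"]

Even this much is enough to compute the success, drift, and redundancy metrics mentioned above from the stored records.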

The Path to Agent Intelligence

Someone said 2025 might be the "Year of the Agent." After nearly a year of technical iteration, Agents have also seen considerable development from an engineering perspective. LangChain has essentially become the preferred backend option for Agent systems, and Agent R&D has evolved from prompt engineering → context engineering (as shown in the figure below).

1. Agent Development Philosophy

Agents are not a panacea. The key is to choose the appropriate automation stage for tasks of different complexities. I believe we can see from the five evolutionary stages of Agents:

  1. Complex ≠ Better
    • Don't blindly chase the "strongest Agent architecture"; suitability is key.
    • Using a complex system for a simple task only increases cost and risk.
  2. The Real Challenge is "Human"
    • Many failed cases stem from the designer choosing the wrong architecture or lacking phased thinking.
    • The model and workflow are not the problem; the human is.
  3. The Importance of Design Thinking
    • First, assess the task's complexity and automation potential.
    • Then, decide the required level of intelligence (Script → LLM → RPA → Agent → Multi-Agent).
    • Finally, match the appropriate tool, don't use a "one-size-fits-all" approach.

2. Agent Design Patterns

  • 1️⃣ ReAct Pattern (Reasoning + Acting)
    • Structure: Divided into Reasoning and Acting phases.
    • Mechanism:
      • LLM1: Understands context, plans which tool/API to call.
      • LLM2: Executes the action, returns the result.
    • Pros: Decouples reasoning and action, clear structure (a minimal sketch follows after this list).
    • Applications: Q&A, multi-step tasks, tool-driven workflows.
  • 2️⃣ CodeAct Pattern
    • Flow:
      • User → Plan: User gives a task, Agent plans the steps.
      • Code → Feedback: Generates and executes code, corrects based on results.
    • Feature: Introduces a feedback loop (code execution → result → reflection).
    • Applications: Verifiable tasks (data processing, analysis, API calls).
    • Represents: AI acting through code.
  • 3️⃣ Tool Use Pattern
    • Core Concept: Upgrades from single API calls to a unified protocol (MCP) for managing tools.
    • Features:
      • Tool abstraction and standardization.
      • Supports multi-modal, multi-source tool access.
    • Significance: Greatly improves the Agent's ecosystem compatibility and extensibility.
  • 4️⃣ Self-Reflection / Reflexion Pattern
    • Architecture:
      • Main LLM: Executes the main task.
      • Critique LLM(s): Criticizes/reviews the main model's output.
      • Generator: Combines feedback to produce the final answer.
    • Advantages:
      • Introduces a "self-reflection" mechanism.
      • Reduces hallucination rates, improves logic and quality consistency.
    • Applications: Scientific research, content generation, high-risk decision scenarios.
  • 5️⃣ Multi-Agent Workflow
    • Structure:
      • Core Agent: Coordinates task allocation.
      • Sub-Agents: Each focuses on a specific function/domain.
      • Aggregator: Integrates outputs from sub-agents.
    • Features:
      • Simulates real team collaboration.
      • Supports complex, cross-functional tasks.
    • Applications: Enterprise-level systems, automated programming, cross-departmental processes.
  • 6️⃣ Agentic RAG Pattern
    • Flow:
      • Agent uses tools to perform Web / Vector retrieval.
      • Main Agent fuses retrieval results with its own reasoning.
      • Generator produces the final answer.
    • Features:
      • Dynamic retrieval + reasoning.
      • Agent can autonomously decide "if, when, and how" to retrieve.
    • Significance: From static RAG → intelligent, decision-making Agentic RAG.
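
To ground pattern 1️⃣ above, here is a minimal ReAct-style loop in plain Python. The llm() stub, the toy tool registry, and the "Action:" / "Final Answer:" line format are illustrative assumptions, not any particular framework's API:

import re

# Toy tool registry: name -> callable taking a single string argument.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only
}


def llm(prompt: str) -> str:
    """Stand-in for a real model call; a real LLM would emit the Thought/Action text."""
    if "Observation:" not in prompt:
        return "Thought: I should compute this.\nAction: calculator: (15 + 9) * 2"
    return "Thought: I have the result.\nFinal Answer: 48"


def react(question: str, max_steps: int = 5) -> str:
    scratchpad = f"Question: {question}"
    for _ in range(max_steps):
        reply = llm(scratchpad)
        scratchpad += "\n" + reply
        if "Final Answer:" in reply:                     # reasoning finished
            return reply.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+):\s*(.+)", reply)
        if not match:                                    # model produced no action
            break
        tool, arg = match.group(1), match.group(2).strip()
        observation = TOOLS[tool](arg)                   # acting phase
        scratchpad += f"\nObservation: {observation}"    # feed the result back in
    return "no answer within step budget"


print(react("What is (15 + 9) * 2?"))

Everything the pattern promises (the reasoning trace, the action parsing, the observation fed back in) fits in a dozen lines, which is also why it is the first thing to outgrow once tasks get longer.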

3. Latest Agent Progress

Finally, I want to summarize the latest engineering progress in Agents and the most recent engineering experiences worth learning from:

  • Agentic Design Pattern (by Google's Antonio Gulli), PDF
  • Build agentic AI systems (by Andrew Ng), Course

Below are some takeaways from Agent development. Those interested can look up how various Agent players are planning their strategies.

Perhaps future frameworks will absorb even more of this complexity. But the role of the engineer will not disappear. What we must do is to re-establish order in the places where complexity has been hidden—to make intelligence not just callable, but tamable.

Before Making the Agent System Smarter, Make It Trustworthy (让Agent系统更聪明之前,先让它能被信任)

2025-10-11 16:04:20

Prologue: An Illusion of "Simplicity"

A line keeps coming up inside the team lately: "Building an Agent is easy now. Wire something up with LangChain, Bailian, or Flowise and it runs."

At first hearing this is hard to refute: frameworks really have lowered the barrier. But that kind of "simple" is more an illusion created by platforms temporarily absorbing the complexity. At the technical level, Agent development involves:

  • Orchestration and task planning;
  • Context and Memory management;
  • Domain knowledge integration (RAG);
  • And turning business logic into agents.

None of these are solved by writing a few prompts. When developers feel it is "simple," the real reason is that the complexity has been absorbed by the platform. The hard part of an Agent is not getting a demo to run; it is making it run long-term, stably, and controllably.

Why Is Agent Development Mistaken for "Simple"?

On the surface, we are living through an AI explosion, with new platforms and tools appearing constantly. It is true that writing a few prompts and stitching together a few chain layers gives birth to an Agent that "moves." But that is not a sign the complexity has disappeared; the complexity has simply moved somewhere else.

I break this "simplicity" down into three illusions:

1. Encapsulated Complexity

The framework assembles prompts and trims context for you, keeping developers away from the details, but nobody carries the underlying skeleton of debugging, tracing, and state recovery for you.

Taking LangChain as an example, a few lines of code create an Agent that "can answer questions":

# Classic (pre-LCEL) LangChain agent API, kept here because it is exactly what the point is about.
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)
# Note: initialize_agent takes `agent=`, not `agent_type=`.
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

agent.run("Look up the current weather in Singapore and convert it to Celsius")

This code hides almost all of the complexity:

  • Prompt assembly, the call chain, and context management are all encapsulated internally;
  • But if the task fails (API rate limiting, a tool error), the Agent by default will not retry or record a trace.

It looks like it "just runs," but in reality you have lost the hooks for observability and debugging.

2. Outsourced Complexity

Memory, RAG, and embeddings are all handed to the platform to manage; the price is losing the ability to intervene and to explain. In LangChain you can add "memory" very quickly:

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history")

But this is only a short-term memory buffer. It does not handle:

  • Conflicts with stale information;
  • Multi-turn state drift;
  • Or the trimming problem once the context grows too long.

As the Agent scales up, memory consistency and state cleanup become a new source of system complexity.
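
A minimal sketch of the kind of trimming logic the framework leaves to you; the 4-characters-per-token estimate and the summarize() stub are assumptions, and a real system would use the model's tokenizer and an actual summarization call:

from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "user", "assistant", or "system"
    content: str


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (roughly 4 characters per token).
    return max(1, len(text) // 4)


def summarize(turns: list) -> str:
    # Stand-in for an LLM summarization call over the evicted turns.
    return f"[summary of {len(turns)} earlier turns]"


def trim_history(history: list, budget: int = 2000) -> list:
    """Keep the most recent turns inside the token budget; fold the rest into a summary."""
    kept = []
    used = 0
    for turn in reversed(history):                 # walk from newest to oldest
        cost = estimate_tokens(turn.content)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    evicted = history[: len(history) - len(kept)]
    if evicted:                                    # preserve a trace of what was dropped
        kept.insert(0, Turn("system", summarize(evicted)))
    return kept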

3. Deferred Complexity

It does not disappear; it simply resurfaces at runtime:

  • Output drift
  • Non-reproducibility
  • Collapse of correctness and stability

Being able to run is not the same as being able to keep running correctly. What we call simple really means we do not have to face the complexity yet.

The Three Layers of Agent System Complexity

1. Agent Complexity

The complexity of an Agent system shows up as runnability, reproducibility, and evolvability. Today's Agent frameworks have mostly solved "runnability," but "reproducibility" and "evolvability" remain systems-engineering problems.

| Layer | Core Goal | Engineering Keywords | LangChain Example |
| --- | --- | --- | --- |
| Runnability (Run) | Let the Agent start up and execute tasks | prompt, context, tool calls, execution flow | Quickly assemble an executable chain via initialize_agent |
| Reproducibility (Reproduce) | Make behavior controllable and debuggable | memory, state, logs, versioning | No built-in version tracking; Memory state drift must be managed manually |
| Evolvability (Evolve) | Let the system keep getting smarter | RAG, feedback loops, collaboration, safety boundaries | Supports vector retrieval, but lacks self-evaluation and reinforcement-learning mechanisms |

At the "runnability" layer, the abstractions of frameworks like LangChain are genuinely efficient. But to make Agent behavior stable, explainable, and continuously optimizable, you still need to bring in extra infrastructure: a logging system, prompt version management, feedback loops, and so on.

From a systems-engineering perspective, the hard part of an Agent is not "generation" but "execution." Every platform eventually pays the price along these two lifelines.

| Dimension | Definition | Common Problems | Essence |
| --- | --- | --- | --- |
| Correctness | Is each decision correct? | Hallucination, wrong tool calls, logical deviation | The output logic is wrong |
| Stability | Is the system reproducible? | State drift, infinite loops, random fluctuation | The behavior is non-deterministic |

At the deployment stage, stability is often more critical than correctness. Only when stability exists does correctness have any chance of being verified and optimized.

The uncertainty of intelligence must rest on the certainty of engineering. Stability and observability are the prerequisites for an Agent to truly evolve.

2. The Agent Amplification Effect

As the figure above shows, the same model (qwen-max), at the same time, with the same prompt, still produces different results. This is the amplification effect that LLM non-determinism brings to an Agent. Compared with the traditional software systems developers know best, the complexity and difficulty of Agents come from being amplified, level by level, by the LLM's non-determinism and by the semantic layer. Assume a single LLM interaction is correct 90% of the time: an Agent system that needs 10 LLM interactions is then only about 35% correct end to end, and one that needs 20 interactions is only about 12% correct.
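
The arithmetic behind those numbers is simply independent per-step error rates compounding multiplicatively:

# Each LLM step assumed independently correct with probability 0.9;
# the end-to-end success rate of an n-step pipeline is then 0.9 ** n.
for n in (1, 10, 20):
    print(n, round(0.9 ** n, 2))   # 1 -> 0.9, 10 -> 0.35, 20 -> 0.12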

Memory: uncertainty amplified

Compared with state management in traditional software (which is deterministic: whatever is in the database is what you get), an Agent's memory depends on LLM parsing, embeddings, and retrieval, and the results are highly uncertain. Memory is therefore not a storage-and-retrieval problem but a semantic-consistency problem, and that is unique to Agents.

Orchestration: dynamism amplified

In traditional systems, orchestration (workflow) is a fixed, predefined process. In an Agent, orchestration is often the LLM dynamically deciding which tool to call next and how to call it. That means the orchestration problem is no longer just about sequencing and concurrency; it is a decision-space explosion, which makes testing, monitoring, and optimization all harder.

Testability: unpredictability amplified

Traditional software is predictable: given an input, you expect an output. An Agent's output is a probability distribution (the LLM emits a token stream) with no strict determinism. Testing therefore cannot rely on unit tests alone; you need replay tests, baseline-comparison tests, and simulated-environment tests, which is far harder than testing an ordinary application.
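
As one concrete shape such a test can take, here is a minimal baseline-comparison sketch assuming pytest and a hypothetical run_agent() entry point; rather than asserting exact text, it compares only the structured fields the business depends on:

import json
from pathlib import Path

import pytest

BASELINE = Path("baselines/cvss_cases.json")   # hypothetical golden file


def run_agent(case: dict) -> dict:
    # Stand-in for the real agent; it must return structured output, not free text.
    return {"cvss_vector": "CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H",
            "base_score": 7.8}


def load_cases() -> list:
    return json.loads(BASELINE.read_text()) if BASELINE.exists() else []


@pytest.mark.parametrize("case", load_cases())
def test_against_baseline(case: dict) -> None:
    """Replay a recorded input and compare only the fields we promise to keep stable."""
    got = run_agent(case["input"])
    assert got["cvss_vector"] == case["expected"]["cvss_vector"]
    # Numeric fields get a tolerance instead of exact string matching.
    assert abs(got["base_score"] - case["expected"]["base_score"]) <= 0.1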

3. From "It Runs" to "It's Usable"

"It runs, doesn't it? What more do you want?"

Some people say: "When I build an Agent, tweaking the prompt also gets me to the goal. Maybe I am the one amplifying the problem, rather than the Agent amplifying the uncertainty described above."

"Tweak the prompt and it works" is essentially saying: short-term goal + high tolerance = good enough. The goal of an Agent system, however, is: long-term goal + engineering-grade reliability = difficulty soars.

First, why does prompt tweaking work at all? Many Agent demos or POCs (Proof of Concept) target one-off tasks, such as "write me a summary" or "call this API." In such low-demand scenarios, the raw capability of the LLM masks a lot of problems:

  • Memory can rely on context passing alone (long-horizon behavior is never really tested);
  • Orchestration can be hard-coded or hinted at in the prompt;
  • Testability does not matter; if one run matches the answer, you call it a win.

Whether I am amplifying the problem or the Agent system is amplifying it really depends on the requirements, because once the requirement shifts from "Demo" to "continuously usable system," the problems blow up quickly:

  • Prompt edits ≠ reliability guarantees. Changing the prompt may fix the bug in front of you, but there is no guarantee the same class of problem will not reappear in another case. You have not built reproducible, maintainable decision logic; you are doing parameter-tweaking "alchemy."
  • Prompt edits ≠ scalability. For a single-task Agent, prompt hacks work. In a multi-tool, multi-scenario Agent, prompt complexity grows exponentially and eventually spins out of control.
  • Prompt edits ≠ engineering controllability. Traditional software can write test cases to guarantee logic coverage; a prompt can only partially dampen the LLM's probabilistic fluctuation and cannot provide strong guarantees.

So in the end you need more structured memory, orchestration, and testing: Agent systematization.

The Limits of Agent Frameworks

Take the LangChain framework as an example and ask whether a framework can solve the three layers of Agent complexity. LangChain provides a basic CallbackManager and LangSmith (or Langfuse) integration for tracing an Agent's execution. This part is usually ignored, yet it is the key to understanding "reproducibility" and "observability."

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.callbacks import StdOutCallbackHandler
from langchain.llms import OpenAI

# A simple stdout callback handler; older examples wire this up through a
# CallbackManager, but either way it has to be enabled explicitly.
handler = StdOutCallbackHandler()

llm = OpenAI(temperature=0, callbacks=[handler])
tools = load_tools(["llm-math"], llm=llm)

agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                         verbose=True)

agent.run("What is (15 + 9) * 2?")

When it runs, LangChain prints every Thought and Action to the terminal:

Thought: I need to use the calculator tool.
Action: Calculator
Action Input: (15 + 9) * 2
Observation: 48
Thought: I now know the final answer.
Final Answer: 48

This seemingly trivial output actually reveals three important facts:

  1. The Agent's internal decision process can be traced (a prerequisite for reproducibility);
  2. Callbacks / the CallbackManager must be enabled by the engineer (nothing is recorded by default);
  3. The observation granularity is limited (you cannot directly trace details like context trimming or memory overwrites).

LangSmith offers a more complete visual trace, but it is still an external observation tool; the Agent framework itself has no built-in verification mechanism. In other words, the framework gives you the ability to "see," but it does not take care of "control" for you.
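
One low-effort way to claw back a little of that "control" is to record the events yourself and persist them for later replay. A minimal sketch using LangChain's classic BaseCallbackHandler interface (which fields are worth keeping is my own assumption):

import json
import time

from langchain.callbacks.base import BaseCallbackHandler


class TraceRecorder(BaseCallbackHandler):
    """Collects LLM and tool events into a list that can be dumped and replayed later."""

    def __init__(self):
        self.events = []

    def _log(self, kind, payload):
        self.events.append({"ts": time.time(), "kind": kind, **payload})

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._log("llm_start", {"prompts": prompts})

    def on_llm_end(self, response, **kwargs):
        self._log("llm_end", {"generations": str(response.generations)})

    def on_tool_start(self, serialized, input_str, **kwargs):
        self._log("tool_start", {"tool": serialized.get("name"), "input": input_str})

    def on_tool_end(self, output, **kwargs):
        self._log("tool_end", {"output": str(output)})

    def dump(self, path):
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.events, fh, ensure_ascii=False, indent=2)


# Usage with the agent built above:
#   recorder = TraceRecorder()
#   agent.run("What is (15 + 9) * 2?", callbacks=[recorder])
#   recorder.dump("trace.json")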

Frameworks like LangChain have started to tackle the complex problems inside Agent systems, but it has to be admitted that most engineering dimensions remain unsolved today. In short, the framework solves "getting an LLM to do things," but not "getting the LLM to do things in a way that is controllable, sustainable, and scalable like a system":

| Module | What current frameworks provide | Not covered / still to be engineered |
| --- | --- | --- |
| Reasoning layer (LLM Layer) | Model calls, prompt templates | Output stability, task-context constraints, hallucination detection |
| Tools layer | API calls, function routing | Tool-call sandboxing, permission control, error recovery |
| Memory layer | Basic vector retrieval, session context | Long-term memory compression, conflict detection, memory-decay strategies |
| Orchestration | Simple loops or chained calls | Multi-task scheduling, plan optimization, inter-Agent dependency graphs |
| Evaluation | A little tracing, benchmarks | Automated metric systems (success rate, controllability, cost monitoring) |
| Safety & Compliance | Almost absent | Execution boundaries, permission models, audit logs, sandboxed execution |
| Deployment & Ops | A few SDK / CLI tools | Persistence, elastic scaling, version management, A/B testing |

| Framework | Runnability | Reproducibility | Evolvability | Notes |
| --- | --- | --- | --- | --- |
| LangChain | ✅ Mature chained calls | ⚙️ Partially observable | ⚙️ Manual tuning | Many tools, but unstable state |
| AutoGen | ✅ Multi-Agent collaboration | ⚙️ Rudimentary memory | ❌ No learning mechanism | Flexible but hard to reproduce |
| CrewAI | ✅ Convenient task orchestration | ⚙️ Unstable state | ❌ No feedback optimization | Strong interaction, weak governance |
| Alibaba Cloud Bailian | ✅ Drag-and-drop building | ⚙️ Platform logs | ⚙️ Built-in knowledge center | The platform absorbs complexity, but it is a heavy black box with limited control granularity |

  • ✅ Runnability: generally well supported (low barrier to entry)
  • ⚙️ Reproducibility: only partially supported (you must build your own state and observation layers)
  • ❌ Evolvability: still depends on humans and system design

LangChain makes an Agent "buildable," but it costs the system its "explainability." The complexity has not disappeared; it has merely migrated from the code layer to runtime.

Let's dig a little deeper into runtime complexity, the new class of problems an Agent system introduces: it does not just run, it has to "keep thinking," and the side effect of thinking is instability. These complexities are not "traditional code complexity" but "system uncertainty created by intelligent behavior." They make Agent engineering feel more like managing a complex adaptive system than a linear, controllable piece of software.

| New complexity dimension | Description | Example scenario |
| --- | --- | --- |
| Context Drift | The model misunderstands or forgets key task goals across multi-turn reasoning | In a multi-turn conversation the Agent drifts from the task semantics and performs irrelevant actions |
| Semantic Non-determinism | The same input can produce different outputs, so the flow cannot be replayed | Results stay unstable after prompt debugging; automated tests struggle to cover them |
| Decomposition & Planning | The quality of LLM-generated plans is unstable and task boundaries are fuzzy | In AutoGen's "plan + execute" mode, subtasks overflow or loop |
| Memory Pollution | Long-term stored context introduces noise or conflicting information | The Agent learns wrong knowledge, biasing later execution |
| Control Ambiguity | The boundary between the Agent's execution and the human/system control layer is unclear | Human instructions get overridden, tasks get repeated, resources get abused |
| Self-Adaptation Drift | The Agent learns wrong patterns or behaviors from feedback | RLHF / reflection loops reinforce hallucinated responses |
| Multi-Agent Coordination | Inter-Agent communication, role assignment, conflict resolution | Duplicate or conflicting tasks in multi-role systems like CrewAI |

The Only Way Out for Agents Is Systematization

  1. Prompt hacks stop working once the problem scale grows. For a single problem scenario, editing the prompt gets you through, but as task complexity and the number of scenarios increase, the prompt becomes bloated and uncontrollable (think of one prompt stuffed with dozens of rules). It is like building SQL by string concatenation: it runs at first, and it ends in injection plus a maintenance disaster. Systematization gives the Agent structured constraints and automated orchestration instead of hand-tuned prompts.
  2. Uncertainty needs controllability. A single successful run counts as a win in a demo, but production must be 99% (even 100%) correct; even a 1% hallucination rate accumulates into disaster. For a log-analysis Agent, for example, one false positive or missed alert can mean an online incident goes undetected. Systematization uses testing, monitoring, and replay verification to ensure control instead of gambling on luck every time.
  3. Accumulated knowledge vs. stepping into the same pit again. A prompt edit fixes today's bug; tomorrow's new requirement means groping around all over again. Without accumulation the Agent cannot remember or reuse, and the labor repeats endlessly. A colleague complained that prompt-change commits made up more than a third of all code commits on one business system, yet when another colleague hit the same class of problem and tried to reuse that prompt, it simply would not transfer and had to be re-hacked. Systematization uses Memory plus a knowledge base to make sure what the Agent learns is retained, instead of reinventing the wheel every time.

Prompt hacks and Demo Agents solve "small problems." Only a systematized Agent can solve the problems of scalability, reliability, and knowledge retention. These problems may not be obvious now, but as usage time and scope expand, they will inevitably erupt.

A Demo Agent does solve problems, but only today's; a systematized Agent solves tomorrow's and the day after's.

| Dimension | Demo Agent (can run) | Systematized Agent (can run sustainably) |
| --- | --- | --- |
| Goal | Single task / POC success | Continuous, repeatable business processes that multiple people depend on |
| Memory/Knowledge | Chat history stuffed in directly; occasional vector retrieval | Layered memory (session/short-term/long-term + RAG strategy); consistency and versioning |
| Orchestration/State | Sequential calls / simple ReAct; no explicit state | Explicit state machine / graph (concurrency, rollback, retry, timeout, fault tolerance) |
| Reliability & Testing | "Passes the examples" is the bar; not reproducible | Replay sets / baseline comparison / fuzz testing; SLOs and failure-mode design |
| Observability | A few log lines | End-to-end tracing, call chains, metrics, auditing |
| Security/Permissions | Constraints written in the prompt | Fine-grained permissions / sandboxing / audit trails / anti-privilege-escalation |
| Scalability | Prompt spins out of control as scenarios multiply | Modular components / model routing / tool governance |
| Cost Curve | Fast and cheap early; maintenance costs soar later | Engineering investment up front; stable and scalable afterwards |

The Agent: From "Smart" to "Reliable"

Some Real-World Agent Cases

With history as a mirror one can understand rise and fall; with other people as a mirror one can see one's own gains and losses. The problems I have hit while developing an Agent system are surely not mine alone, so I asked ChatGPT to search Reddit, GitHub, and blogs for Agent development cases, hoping to use other people's experience to check whether my own thinking and reflection hold up:

1. Typical Failures of Toy-Level Agents

2. Engineering Problems Exposed in Production (not fixable by prompt edits)

3. Industry / Big-Tech Public Postmortems: Why "Systematized Capability" Is Needed

4. Positive Case: Doing Orchestration / Fault Tolerance with a "Distributed-Systems Mindset"

5. Community Reality: People Do Use It in Production, but Everyone Talks About "De-complexing + Limited Agents"

  • Developer feedback that LangGraph is production-viable: migrated from LangChain's Agent Executor; the prototype → streamline → keep-only-what's-necessary route is more robust (drop the hallucination-prone and fancy parts, keep control). Anyone Using Langchai Agents in production?

The Four Stages of Agent Development

Over more than a year of Agent development, I went through a cognitive shift from "Agents are simple" to "Agents are genuinely complex." At the beginning I treated the framework as a black box, wrote prompts, patched things together, and got a demo running. As scenario complexity grew and I had to go deeper into Agent system R&D, the hard parts surfaced one by one. I have tried to break down this "simple → genuinely hard" process:

Stage 1: The "Hello World" Stage (looks simple)

With frameworks like LangChain / CrewAI, a few lines of code get something running. Most people stop at "it can chat" or "it can call tools," and conclude that "AI Agent development is no more than this."

Stage 2: The Scenario Adaptation Stage (the pitfalls begin)

As the complexity of the problems the Agent solves increases, you gradually run into the LLM context window not being big enough, requiring trimming, compression, and selection (the context-management problem). Then you find that vector retrieval and data fetching often return irrelevant results and non-answers, requiring better preprocessing and query rewriting (RAG knowledge management). Bit by bit I realized that simple scenarios run fine, but anything slightly more complex falls into a pit.

Stage 3: The Systematization Stage (complexity explodes)

Going further, as tool calls and context management grow (for example in complex security threat modeling), the Agent must guarantee consistency across sessions and tasks, which forces you to consider persistence, version control, and conflict resolution. A single Agent cannot handle complex tasks (for example a complex vulnerability risk assessment), so you need multi-Agent collaboration, and at that point you must solve deadlocks, task conflicts, and state rollback. Once task complexity rises, debugging the Agent flow is no longer something prompt tweaks can fix; you have to add tracing and observability tooling.

Stage 4: The Engineering Landing Stage (the truly hard part)

  • Agentifying business logic: the business workflow is built, but how do you test it? How do you guarantee controllability and stability? There are still no best practices or answers.
  • Security and compliance: permissions, privilege escalation, data leakage. Strict security boundaries are a must; right now this is not addressed at all, which is a very large hidden risk for a production system.
  • Monitoring and SLOs: like operating microservices, you need monitoring, alerting, and failure recovery.

Frameworks like LangChain and platforms like Bailian really do lower the Agent's "barrier to getting started," but they do not lower the "barrier to landing in production." For many of the problems in stages three and four, I am still crossing the river by feeling for the stones, and I cannot yet distill any firm experience or conclusions.

How My Understanding of Agent Development Evolved

My Agent system work has always centered on the vulnerability security assessment involved in my own job. As I went through the four stages of Agent development described above, my thinking about and understanding of Agents changed as well:

Layer 1: The Framework Illusion Layer

  • Typical behavior: install LangChain / AutoGen / CrewAI, run an official demo, tweak a prompt.
  • Cognitive trait: believes "Agent development = writing prompts"; the barrier is extremely low, about the same as writing a script.
  • Misconception: assumes the framework has solved all the complexity, ignoring memory, orchestration, testing, and security.

Layer 2: The Scenario Splicing Layer

  • Typical behavior: can stitch RAG, tool calls, and simple multi-Agent orchestration into a seemingly usable prototype.
  • Cognitive trait: begins to realize how important context management and RAG strategy are.
  • Pain points: runs into "irrelevant answers," "scrambled memory," and "tasks that won't complete reliably."
  • Misconception: tries to solve every problem with prompt hacks, ignoring the underlying information management and system design.

Layer 3: The System Design Layer

  • Typical behavior: treats the Agent as a microservices system and has to think about architecture, observability, and state management.
  • Cognitive trait: understands that memory is essentially a database/knowledge-base problem, and that orchestration is more like workflow scheduling than chatting.
  • Pain points: debugging costs are extremely high; tracing, logging, and metrics monitoring are required.
  • Key challenge: how to ensure the Agent is robust, controllable, and reproducible.

Layer 4: The Engineering Landing Layer

  • Typical behavior: putting the Agent into the production business environment.
  • Cognitive trait: treats Agent development as an engineering discipline just like SRE / security / distributed systems.
  • Pain points:
    • Testability: LLM non-determinism means traditional unit tests cannot guarantee stability.
    • Security: permission management, privilege escalation, prompt-injection protection.
    • Monitoring and SLOs: the Agent must be observable and recoverable, just like a service.
  • Key challenge: how to make the Agent reliable enough to carry critical business.

Layer 5: The Intelligent Evolution Layer (frontier exploration)

  • Typical behavior: attempting to build an Agent system with long-term memory, autonomous learning, and the ability to evolve.
  • Cognitive trait: no longer sees the Agent as an LLM wrapper, but as a new kind of distributed intelligent system.
  • Challenges:
    • Memory becomes a knowledge-graph plus adaptive-learning problem.
    • Orchestration involves game theory, collaboration, and even emergent behavior.
    • Security needs "AI sandboxes" to prevent loss of control.
  • Status: most people are not at this stage yet; it is mainly research and experimentation.

Based on my current understanding, I now position the Agent as a system component rather than an intelligent robot. My goal is not "occasional brilliance" but "sustained reliability." Basic principles:

  1. Principles:
    • Stable first, smart second.
    • Observable first, optimized second.
  2. Capabilities:
    • Establish a replayable mechanism for state and logs.
    • Track versions for Prompts / Memory / RAG.
    • Introduce observation metrics (success rate, drift rate, redundant calls).
    • Make every Agent's boundaries and permission scope explicit.
    • Reserve "error recovery" paths in the design.
  3. Boundaries:
    • If the Agent is only used for one-off tasks or exploratory experiments, complexity control can be relaxed.
    • If it is used for production tasks (monitoring, automated operations), stability and security boundaries come first.
    • The deeper the framework's encapsulation, the more an extra explainability layer is needed.

The Road to Agent Intelligence (Latest Progress)

Someone apparently said 2025 would be the "Year of the Agent." After nearly a year of technical iteration, Agents have made considerable progress from an engineering standpoint. LangChain has essentially become the preferred backend option for Agent systems, and Agent R&D has gone through the evolution from prompt engineering to context engineering (as shown in the figure below).

Finally, I want to share the latest engineering progress on Agents and the most recent engineering experience worth learning from:

  • Agentic Design Pattern (by Google's Antonio Gulli), PDF
  • Build agentic AI systems (by Andrew Ng), Course

In closing: perhaps future frameworks will absorb even more of this complexity. But the engineer's role will not disappear because of it. What we must do is re-establish order in the places where complexity has been hidden, so that intelligence is not merely callable, but tamable.