datawhale-agent-learning-hub

Datawhale Agent Learning Hub


Skill by ara.so — AI Agent Skills collection.

A curated AI Agent learning roadmap and resource hub maintained by Datawhale. This project provides a structured learning path from basic agent loops to production-ready agent systems, emphasizing modern patterns like agent harnesses, skills, MCP (Model Context Protocol), and evaluation.


What This Project Provides


Structured Learning Path: 7-stage todo list from basic agent loops to browser/computer-use agents
Curated Resources: Official docs, papers, and proven open-source projects
Modern Focus: Prioritizes Claude Code, OpenClaw, skills, MCP, A2A over legacy role-play frameworks
Project Ladder: Real-world agent projects you can build at each stage
Current Best Practices: What to learn now vs. what's outdated


Installation & Access


This is a learning resource repository, not a package to install:

```bash
# Clone the repository
git clone https://github.com/datawhalechina/Agent-Learning-Hub.git
cd Agent-Learning-Hub

# Read the README
cat README.md

# Use it as reference while building agents
```

Key Learning Stages


Stage 0: Understand What An Agent Is


Core Concept: Distinguish chatbot vs workflow vs agent vs multi-agent.

Required Reading:

Deliverable: One-page note answering "Why does my use case need an agent instead of a workflow?"


Stage 1: Build A Minimal Agent Loop


Core Pattern: observe → think → act → observe

```python
# Minimal agent loop example (Python + OpenAI)
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

tools = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform basic arithmetic",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Math expression like '2+2'",
                    }
                },
                "required": ["expression"],
            },
        },
    }
]


def calculate(expression: str) -> str:
    """Evaluate an arithmetic expression with builtins disabled.

    eval() is acceptable for a demo only; it is not a real sandbox.
    """
    try:
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"Error: {e}"


def run_agent(user_message: str, max_steps: int = 5):
    messages = [{"role": "user", "content": user_message}]

    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )

        message = response.choices[0].message
        messages.append(message)

        # Check if done
        if not message.tool_calls:
            return message.content

        # Execute tool calls
        for tool_call in message.tool_calls:
            if tool_call.function.name == "calculate":
                args = json.loads(tool_call.function.arguments)
                result = calculate(args["expression"])
            else:
                # Every tool call needs a reply, even for an unknown tool
                result = f"Error: unknown tool {tool_call.function.name}"

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Max steps reached"


# Usage
result = run_agent("What is 25 * 4 + 10?")
print(result)
```

**Deliverable**: 50-150 line agent that can choose tools, execute them, and return a final answer.

Stage 2: Tool Use, RAG, and Memory


Recommended Projects to Study:

| Project | Focus Area |
| --- | --- |
| GPT Researcher | Search → scrape → filter → cite → generate report |
| STORM | Multi-perspective research writing with outline |
| Khoj | Personal second brain with semantic search |
| mem0 | Adding long-term memory to agents |

```python
# RAG-enhanced agent example (using LlamaIndex)
# The default LLM and embedding model expect OPENAI_API_KEY in the environment.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Load and index documents
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create query engine tool
query_engine = index.as_query_engine()
query_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="doc_search",
        description="Search company documentation. Use this when user asks about policies, procedures, or technical specs.",
    ),
)

# Create agent with tools
# ReActAgent.from_tools is the classic agent API; check it against your
# installed llama-index version, since newer releases may differ.
agent = ReActAgent.from_tools(
    tools=[query_tool],
    verbose=True,
)

# Run agent with citation requirement
response = agent.chat(
    "What is our company's remote work policy? Please cite sources."
)
print(response)
```

**Deliverable**: Research assistant that searches, filters, summarizes, and outputs citations.

Stage 3: Study One Modern Agent Harness


Key Systems to Learn:

| System | Learn This For |
| --- | --- |
| Claude Code | Real coding agent: CLI, tools, permissions, hooks, subagents, MCP |
| learn-claude-code | From-scratch harness implementation |
| claw0 | Building session, gateway, memory, heartbeat, delivery, resilience |
| OpenClaw | Local-first personal agent with skills and system tools |
| LangGraph | Stateful graph orchestration |

What to Look For in a Harness:

Agent loop implementation
Tool registry and permission gates
Session/state store
Context compaction strategy
Trace/logging system
Error handling and recovery
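To make the trace/logging item in the list above concrete, here is a minimal sketch of an in-memory trace that serializes to JSON Lines. The names `TraceEvent`, `TraceLog`, and the event kinds are hypothetical and chosen only for illustration; real harnesses usually add run ids, redaction, and persistent storage.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class TraceEvent:
    step: int                 # Agent loop iteration the event belongs to
    kind: str                 # Hypothetical labels, e.g. "tool_call", "final"
    payload: dict[str, Any]   # Free-form details for this event
    timestamp: float = field(default_factory=time.time)


class TraceLog:
    """Collects events in memory and serializes them as JSON Lines."""

    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    def record(self, step: int, kind: str, **payload: Any) -> None:
        self.events.append(TraceEvent(step=step, kind=kind, payload=payload))

    def to_jsonl(self) -> str:
        # One JSON object per line keeps the log easy to grep and replay
        return "\n".join(json.dumps(asdict(event)) for event in self.events)


trace = TraceLog()
trace.record(0, "tool_call", name="calculate", args={"expression": "2+2"})
trace.record(0, "tool_result", name="calculate", result="4")
trace.record(1, "final", content="The answer is 4.")
print(trace.to_jsonl())
```

Recording one event per tool call and per model reply is usually enough to reconstruct why a run failed.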

```python
# Example: Understanding tool permission gate pattern
from typing import Callable


class ToolRegistry:
    def __init__(self):
        self.tools = {}
        self.permissions = {}

    def register(self, name: str, func: Callable, requires_approval: bool = False):
        """Register tool with optional approval gate."""
        self.tools[name] = func
        self.permissions[name] = {
            "requires_approval": requires_approval,
            "allowed_domains": [],  # Could expand to domain restrictions
        }

    def execute(self, name: str, args: dict, auto_approve: bool = False):
        """Execute tool with permission check."""
        if name not in self.tools:
            raise ValueError(f"Tool {name} not found")

        if self.permissions[name]["requires_approval"] and not auto_approve:
            # In real system, this would trigger user confirmation
            print(f"⚠️  Tool {name} requires approval. Args: {args}")
            confirm = input("Approve? (y/n): ")
            if confirm.lower() != "y":
                return "Tool execution denied by user"

        return self.tools[name](**args)


# Usage
registry = ToolRegistry()
registry.register("search_web", lambda query: f"Results for {query}", requires_approval=False)
registry.register("send_email", lambda to, body: f"Email sent to {to}", requires_approval=True)
```

**Deliverable**: Working agent harness demo with README, example runs, and failure logs.

Stage 4: Multi-Agent Coordination


Core Principle: Multi-agent is coordination, not magic. Use supervisor patterns or graphs, not random chat.
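The supervisor pattern named above can be sketched without any framework. In this minimal illustration the workers are plain functions standing in for LLM-backed agents, and `route` is a hand-written rule standing in for the supervisor's decision; all names here are hypothetical.

```python
from typing import Callable, Optional

State = dict


def research_worker(state: State) -> State:
    # Stand-in for an LLM-backed researcher
    return {**state, "notes": f"notes on {state['topic']}"}


def writing_worker(state: State) -> State:
    # Stand-in for an LLM-backed writer that consumes the notes
    return {**state, "draft": f"draft based on {state['notes']}"}


WORKERS: dict[str, Callable[[State], State]] = {
    "research": research_worker,
    "write": writing_worker,
}


def route(state: State) -> Optional[str]:
    """Supervisor decision: pick the next worker from the shared state."""
    if "notes" not in state:
        return "research"
    if "draft" not in state:
        return "write"
    return None  # Nothing left to do


def run_supervisor(topic: str, max_turns: int = 5) -> State:
    state: State = {"topic": topic, "history": []}
    for _ in range(max_turns):
        worker = route(state)
        if worker is None:
            break
        state = WORKERS[worker](state)
        state["history"].append(worker)  # Explicit record of who ran when
    return state


result = run_supervisor("agent evaluation")
print(result["history"], result["draft"])
```

The point of the sketch is that every hand-off is an explicit, inspectable decision with a turn limit, in contrast to agents exchanging free-form messages.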

```python
# LangGraph multi-agent example
from typing import List, TypedDict

from langgraph.graph import StateGraph, END


class ResearchState(TypedDict):
    topic: str
    outline: List[str]
    research: dict
    draft: str
    review: str
    final: str


def planner(state: ResearchState) -> ResearchState:
    """Create outline for research."""
    # Call LLM to generate outline
    state["outline"] = ["Introduction", "Key Findings", "Conclusion"]
    return state


def researcher(state: ResearchState) -> ResearchState:
    """Research each section."""
    research = {}
    for section in state["outline"]:
        # Call search API and summarize
        research[section] = f"Research for {section}..."
    state["research"] = research
    return state


def writer(state: ResearchState) -> ResearchState:
    """Write draft from research."""
    state["draft"] = "Draft based on research..."
    return state


def reviewer(state: ResearchState) -> ResearchState:
    """Review and suggest improvements."""
    state["review"] = "Needs more citations in section 2"
    return state


def reviser(state: ResearchState) -> ResearchState:
    """Revise based on review."""
    state["final"] = "Final version with improvements..."
    return state


# Build graph
workflow = StateGraph(ResearchState)
workflow.add_node("planner", planner)
workflow.add_node("researcher", researcher)
workflow.add_node("writer", writer)
workflow.add_node("reviewer", reviewer)
workflow.add_node("reviser", reviser)

workflow.set_entry_point("planner")
workflow.add_edge("planner", "researcher")
workflow.add_edge("researcher", "writer")
workflow.add_edge("writer", "reviewer")
workflow.add_edge("reviewer", "reviser")
workflow.add_edge("reviser", END)

app = workflow.compile()
```

**Deliverable**: Multi-agent system with clear roles (e.g., research → write → review → revise).

Stage 5: Skills, MCP, and Capability Packaging


Key Concepts:

Skill: Reusable procedural knowledge (how to do X)
Tool: Callable interface (function/API)
MCP: Model Context Protocol for connecting external tools/data
A2A: Agent-to-Agent protocol
ACP: Agent Client Protocol

Skill File Structure (Claude Code style):

```markdown
# SKILL.md

## Name
code-review

## Description
Perform thorough code review following team standards

## When to Use
- User asks "review this code"
- PR is opened (via webhook)
- Code changes detected in staging branch

## Steps
1. Read code changes (use git diff or file_read tool)
2. Check against style guide in `.code-standards.md`
3. Run linter: `npm run lint` or `python -m pylint`
4. Check for common issues:
   - Hardcoded secrets
   - Missing error handling
   - Unhandled edge cases
   - Performance anti-patterns
5. Generate structured feedback with severity levels

## Tools Required
- file_read
- execute_command
- (optional) github_api for posting comments

## Acceptance Criteria
- All files reviewed
- At least 3 specific suggestions
- Severity level assigned (blocker/major/minor)
- Code style compliance checked
```


**MCP Server Example**:

```typescript
// MCP server for custom tools
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  {
    name: "my-tools-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Register tool (handlers are keyed by request schema, not by method string)
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "search_codebase",
        description: "Search company codebase using semantic search",
        inputSchema: {
          type: "object",
          properties: {
            query: { type: "string" },
            language: { type: "string", enum: ["python", "typescript", "all"] }
          },
          required: ["query"]
        }
      }
    ]
  };
});

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "search_codebase") {
    const { query, language } = request.params.arguments as {
      query: string;
      language?: string;
    };
    // Implement search logic (language can narrow the search)
    return {
      content: [{ type: "text", text: `Results for ${query}...` }]
    };
  }
  // Fail loudly instead of returning undefined for an unknown tool
  throw new Error(`Unknown tool: ${request.params.name}`);
});

// Start server
const transport = new StdioServerTransport();
await server.connect(transport);
```

**Deliverable**: Reusable skill (e.g., code-review, research-report, migration-helper) with clear structure.

Stage 6: Browser and Computer-Use Agents


Key Patterns:

DOM observation and element selection
Click/type/scroll actions
Screenshot-based fallback
Safety boundaries (no sensitive logins, respect robots.txt)
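As one way to enforce the robots.txt boundary listed above, the sketch below checks a URL against robots.txt rules with the standard library before the agent navigates. The helper name `is_allowed`, the user agent string, and the sample rules are assumptions for illustration; a real agent would fetch the site's actual robots.txt and cache it.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # Parse rules without any network access
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, inlined so the example runs offline
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

public_ok = is_allowed(SAMPLE_ROBOTS, "my-agent", "https://example.com/news")
private_ok = is_allowed(SAMPLE_ROBOTS, "my-agent", "https://example.com/private/data")
print(public_ok, private_ok)
```

Running this check inside the tool that performs navigation, not in the prompt, keeps the boundary enforced even when the model ignores instructions.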

```python
# Browser agent using browser-use
# Passing a LangChain chat model works with older browser-use releases;
# check the LLM wrapper your installed version expects.
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI


async def main():
    agent = Agent(
        task="Go to news.ycombinator.com and get the top 5 story titles",
        llm=ChatOpenAI(model="gpt-4o"),
    )

    result = await agent.run()
    print(result)


asyncio.run(main())
```

**Anthropic Computer Use Example**:

```python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# The computer_20241022 tool is a beta feature, so the request goes through
# the beta Messages endpoint with the matching beta flag.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Open a browser and search for 'AI agent frameworks'"
        }
    ],
    betas=["computer-use-2024-10-22"],
)

print(response)
```

Safety Checklist:

No login to sensitive accounts
No financial transactions without explicit confirmation
Respect robots.txt and rate limits
Log all actions with screenshots
Human-in-the-loop for risky actions

**Deliverable**: Browser agent that operates on public pages (e.g., extract info, generate summary).

Stage 7: Evaluation, Observability, and Safety


Core Metrics:

Success rate
Failure reason distribution
Tool call count
Cost per task
Latency (p50, p95, p99)
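Two of the metrics above, latency percentiles and the failure reason distribution, can be computed with the standard library alone. The sketch below uses the nearest-rank percentile definition (one of several in common use) and made-up sample data; the failure labels are hypothetical.

```python
import math
from collections import Counter


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up measurements for illustration only
latencies_ms = [120.0, 95.0, 310.0, 150.0, 2200.0, 130.0, 140.0, 180.0, 160.0, 110.0]
failures = ["timeout", "bad_tool_args", "timeout", "max_steps"]

summary = {
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "failure_distribution": dict(Counter(failures)),
}
print(summary)
```

With only ten samples, p95 and p99 both land on the single slowest run, which is a reminder that tail percentiles need a reasonably large test set to be meaningful.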

```python
# Evaluation harness example
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class TestCase:
    id: str
    input: str
    expected_output: str
    max_steps: int = 10


@dataclass
class EvalResult:
    test_id: str
    success: bool
    actual_output: str
    steps_taken: int
    cost_usd: float
    latency_ms: float
    failure_reason: Optional[str] = None
    trace: Optional[List[Dict]] = None


class AgentEvaluator:
    def __init__(self, agent_fn, test_cases: List[TestCase]):
        self.agent_fn = agent_fn
        self.test_cases = test_cases
        self.results: List[EvalResult] = []

    def run_evaluation(self) -> Dict:
        """Run all test cases and collect metrics."""
        for test in self.test_cases:
            start_time = datetime.now()

            try:
                result = self.agent_fn(test.input, max_steps=test.max_steps)
                success = self._check_success(result, test.expected_output)

                latency = (datetime.now() - start_time).total_seconds() * 1000

                eval_result = EvalResult(
                    test_id=test.id,
                    success=success,
                    actual_output=result["output"],
                    steps_taken=result["steps"],
                    cost_usd=result["cost"],
                    latency_ms=latency,
                    # Record a reason for non-exception failures too
                    failure_reason=None if success else "Output mismatch",
                    trace=result.get("trace"),
                )
            except Exception as e:
                eval_result = EvalResult(
                    test_id=test.id,
                    success=False,
                    actual_output="",
                    steps_taken=0,
                    cost_usd=0,
                    latency_ms=0,
                    failure_reason=str(e),
                )

            self.results.append(eval_result)

        return self._compute_metrics()

    def _check_success(self, result: Dict, expected: str) -> bool:
        """Check if output matches expected (implement your logic)."""
        return expected.lower() in result["output"].lower()

    def _compute_metrics(self) -> Dict:
        """Aggregate metrics across all tests."""
        total = len(self.results)
        successful = sum(1 for r in self.results if r.success)

        return {
            "success_rate": successful / total if total > 0 else 0,
            "total_tests": total,
            "total_cost_usd": sum(r.cost_usd for r in self.results),
            "avg_latency_ms": sum(r.latency_ms for r in self.results) / total if total > 0 else 0,
            "failure_reasons": [r.failure_reason for r in self.results if not r.success],
        }

    def export_report(self, filename: str):
        """Export detailed report as JSON."""
        report = {
            "metrics": self._compute_metrics(),
            "results": [
                {
                    "test_id": r.test_id,
                    "success": r.success,
                    "steps": r.steps_taken,
                    "cost": r.cost_usd,
                    "latency_ms": r.latency_ms,
                    "failure_reason": r.failure_reason,
                }
                for r in self.results
            ],
        }

        with open(filename, "w") as f:
            json.dump(report, f, indent=2)


# Usage
test_cases = [
    TestCase(
        id="research-basic",
        input="Research recent AI agent frameworks",
        expected_output="langchain",
    ),
    TestCase(
        id="research-citation",
        input="Find papers on agent evaluation",
        expected_output="citation",
    ),
]

# my_agent_function is the agent under test: it takes (input, max_steps=...) and
# returns a dict with "output", "steps", "cost", and optionally "trace".
evaluator = AgentEvaluator(agent_fn=my_agent_function, test_cases=test_cases)
metrics = evaluator.run_evaluation()
evaluator.export_report("eval_results.json")

print(f"Success rate: {metrics['success_rate']:.2%}")
print(f"Total cost: ${metrics['total_cost_usd']:.4f}")
```


**Safety Patterns**:

```python
# Dangerous tool approval gate
import json

DANGEROUS_TOOLS = ["delete_file", "send_email", "make_payment", "publish_content"]

# Maps tool names to callables; populate it with your own tools
tools_registry = {}


def execute_tool_with_approval(tool_name: str, args: dict):
    """Execute tool with human-in-the-loop for dangerous actions."""
    if tool_name in DANGEROUS_TOOLS:
        print(f"\n⚠️ APPROVAL REQUIRED")
        print(f"Tool: {tool_name}")
        print(f"Args: {json.dumps(args, indent=2)}")

        approval = input("\nApprove this action? (yes/no): ")
        if approval.lower() != "yes":
            return {"status": "rejected", "reason": "User denied approval"}

    # Execute tool
    return tools_registry[tool_name](**args)
```

**Deliverable**: Eval suite with fixed test set, success rate tracking, and cost/latency metrics.

Common Patterns


When to Use Agents vs. Workflows


Use Agent When:

Task requires dynamic tool selection
Steps depend on runtime information
Need to handle unexpected situations
Task involves exploration or research

Use Workflow When:

Steps are predictable
Process is well-defined
Speed and cost matter more than flexibility
You need guarantees about execution path
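The distinction drawn in the two lists above can be sketched in a few lines: a workflow fixes the execution path in code, while an agent chooses each step from what it has observed so far. Everything here is hypothetical; `scripted_policy` is a deterministic stand-in for the model's tool choice so the example runs offline.

```python
from typing import Callable, Optional


def fetch(text: str) -> str:
    return f"fetched({text})"


def summarize(text: str) -> str:
    return f"summary({text})"


TOOLS: dict[str, Callable[[str], str]] = {"fetch": fetch, "summarize": summarize}


def run_workflow(query: str) -> tuple[str, list[str]]:
    """Workflow: the path is fixed in code, so it is the same on every run."""
    path: list[str] = []
    data = query
    for name in ("fetch", "summarize"):
        data = TOOLS[name](data)
        path.append(name)
    return data, path


def scripted_policy(observation: str) -> Optional[str]:
    """Stand-in for an LLM: choose the next tool from the current observation."""
    if "summary(" in observation:
        return None  # Done
    if "fetched(" in observation:
        return "summarize"
    return "fetch"


def run_agent(query: str, max_steps: int = 5) -> tuple[str, list[str]]:
    """Agent: the path is decided at runtime, one step at a time."""
    path: list[str] = []
    observation = query
    for _ in range(max_steps):
        tool = scripted_policy(observation)
        if tool is None:
            break
        observation = TOOLS[tool](observation)
        path.append(tool)
    return observation, path


print(run_workflow("agent frameworks"))
print(run_agent("agent frameworks"))
print(run_agent("fetched(cached page)"))  # Skips the fetch step
```

The last call shows the trade-off: the agent adapts to input that is already fetched, but its path can no longer be read off the code in advance.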


Context Management


```python
# Simple context compaction strategy
class ContextManager:
    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, message: dict):
        """Add message and compact if needed."""
        self.messages.append(message)

        # Estimate tokens (rough: 1 token ≈ 4 chars)
        total_chars = sum(len(str(m)) for m in self.messages)
        estimated_tokens = total_chars // 4

        if estimated_tokens > self.max_tokens:
            self._compact()

    def _compact(self):
        """Keep system messages and recent history; summarize the rest."""
        system = [m for m in self.messages if m["role"] == "system"]
        others = [m for m in self.messages if m["role"] != "system"]
        recent = others[-10:]  # Keep last 10 non-system messages
        dropped = len(others) - len(recent)

        # Summarize older messages (in production, use LLM)
        if dropped > 0:
            summary = {
                "role": "system",
                "content": f"[Previous conversation summarized: {dropped} messages]",
            }
            self.messages = system + [summary] + recent
```

Troubleshooting


Agent Loops Forever


Cause: No max_steps limit or unclear stopping criteria.

Solution:

```python
def run_agent_with_limits(agent, task: str, max_steps: int = 10, max_cost: float = 1.0):
    # `agent` is a hypothetical object whose step(task) returns a result with
    # `cost` and `is_final` attributes; adapt this to your own harness.
    total_cost = 0.0
    current_state = None  # Last step result, returned as the partial result

    for step in range(max_steps):
        if total_cost > max_cost:
            return {"error": "Cost limit exceeded", "partial_result": current_state}

        # Run agent step
        result = agent.step(task)
        total_cost += result.cost
        current_state = result

        if result.is_final:
            return result

    return {"error": "Max steps exceeded", "partial_result": current_state}
```

Tools Return Empty/Error Frequently


Cause: Input validation issues or tool design mismatch.

Solution:

Add input schema validation
Provide clear tool descriptions
Show examples in tool metadata
Add retry logic with backoff
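The retry item in the list above might look like the following minimal sketch of exponential backoff. The helper name `call_with_retry`, its defaults, and the flaky demo tool are assumptions for illustration; the `sleep` parameter is injectable so the demo runs without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with delays of base_delay * 2**n."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Hypothetical tool that fails twice before succeeding
calls = {"count": 0}


def flaky_search() -> dict:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("search backend timed out")
    return {"results": ["doc-1"]}


delays: list[float] = []
result = call_with_retry(flaky_search, sleep=delays.append)  # Record delays instead of sleeping
print(result, delays)
```

In practice it is worth retrying only errors that are plausibly transient (timeouts, rate limits) and returning validation errors to the model immediately so it can correct its arguments.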

```python
from pydantic import BaseModel, Field


class SearchInput(BaseModel):
    query: str = Field(..., min_length=3, description="Search query, at least 3 characters")
    max_results: int = Field(5, ge=1, le=20, description="Number of results, 1-20")


def search_tool(input: SearchInput) -> dict:
    """Type-safe search tool."""
    # Pydantic validates when SearchInput is constructed, before this function runs
    return {"results": [...]}
```

Agent Hallucinates Citations


Cause: No grounding mechanism, so the agent invents sources.

Solution:

Return citations with every retrieval
Use structured output for citations
Post-process to verify citation validity
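One way to read "structured output for citations" in the list above is to have the agent return claims that carry source ids, so grounding becomes a set-membership check instead of text parsing. The `Source` and `Claim` types, the ids, and the example.com URLs below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    id: str
    url: str
    snippet: str


@dataclass(frozen=True)
class Claim:
    text: str
    source_ids: tuple[str, ...]  # Ids of the retrieved sources backing this claim


def find_ungrounded(claims: list[Claim], sources: list[Source]) -> list[Claim]:
    """Return claims that cite nothing or cite an id that was never retrieved."""
    known = {source.id for source in sources}
    return [
        claim for claim in claims
        if not claim.source_ids or not set(claim.source_ids) <= known
    ]


# Hypothetical retrieval results and agent output
sources = [
    Source(id="s1", url="https://example.com/policy", snippet="Remote work is allowed."),
    Source(id="s2", url="https://example.com/handbook", snippet="Core hours are 10-16."),
]
claims = [
    Claim(text="Remote work is allowed.", source_ids=("s1",)),
    Claim(text="Employees get a home office budget.", source_ids=("s9",)),
    Claim(text="Fridays are meeting-free.", source_ids=()),
]

ungrounded = find_ungrounded(claims, sources)
print([claim.text for claim in ungrounded])
```

This only confirms that a cited source exists; checking that the snippet actually supports the claim needs a separate step, such as an entailment check or human review.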

```python
import re
from typing import List


def verify_citations(text: str, sources: List[str]) -> bool:
    """Check if all citations in text exist in sources."""
    # Citations like [0], [1] are read as zero-based indices into sources
    cited = re.findall(r'\[(\d+)\]', text)
    max_source_idx = len(sources) - 1

    for citation in cited:
        if int(citation) > max_source_idx:
            return False

    return True
```

Related Resources

Best Practices


Start Simple: Build minimal loop before adding frameworks
Add Safety Early: Approval gates, logging, cost limits from day one
Evaluate Continuously: Fixed test set, track regressions
Study Harnesses: Learn from Claude Code, OpenClaw, not just toy examples
Prefer Skills Over Prompts: Package reusable knowledge formally
Use MCP for Integration: Standard protocol beats custom tool wrappers
Log Everything: Traces are essential for debugging agent failures
Human-in-Loop for Risk: Never auto-approve delete/send/publish


When to Use This Skill


An AI coding agent should use this skill when:

User asks "how do I learn AI agents"
User wants structured agent learning path
User needs modern agent architecture guidance
User asks about specific stage (tool use, RAG, multi-agent, evaluation)
User wants curated agent resources or project recommendations
User needs code examples for agent loops, tools, or harnesses
