datawhale-agent-learning-hub
Datawhale Agent Learning Hub
Skill by ara.so — AI Agent Skills collection.
A curated AI Agent learning roadmap and resource hub maintained by Datawhale. This project provides a structured learning path from basic agent loops to production-ready agent systems, emphasizing modern patterns like agent harnesses, skills, MCP (Model Context Protocol), and evaluation.
What This Project Provides
- Structured Learning Path: 7-stage todo list from basic agent loops to browser/computer-use agents
- Curated Resources: Official docs, papers, and proven open-source projects
- Modern Focus: Prioritizes Claude Code, OpenClaw, skills, MCP, A2A over legacy role-play frameworks
- Project Ladder: Real-world agent projects you can build at each stage
- Current Best Practices: What to learn now vs. what's outdated
Installation & Access
This is a learning resource repository, not a package to install:
```bash
# Clone the repository
git clone https://github.com/datawhalechina/Agent-Learning-Hub.git
cd Agent-Learning-Hub

# Read the README
cat README.md

# Use it as a reference while building agents
```
Key Learning Stages
Stage 0: Understand What An Agent Is
Core Concept: Distinguish chatbot vs workflow vs agent vs multi-agent.
Required Reading:
Deliverable: One-page note answering "Why does my use case need an agent instead of a workflow?"
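The workflow/agent distinction can be made concrete with a toy sketch. Everything here is hypothetical and not taken from the repository: the two lambda "tools", the function names, and especially `choose_action`, which is a rule-based stand-in for the decision a real agent would delegate to an LLM. The point is only where control lives: in the workflow the developer fixes the step order, while in the agent the next step is chosen at runtime.

```python
from typing import Callable, Dict, List

# Two toy "tools"; real ones would call search APIs or an LLM.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda text: f"search results for '{text}'",
    "summarize": lambda text: f"summary of ({text})",
}


def run_workflow(query: str) -> str:
    """Workflow: the order of steps is fixed in code."""
    found = TOOLS["search"](query)
    return TOOLS["summarize"](found)


def choose_action(query: str, history: List[str]) -> str:
    """Hypothetical stand-in for the model's decision about what to do next."""
    if not history:
        return "search"
    if len(history) == 1:
        return "summarize"
    return "finish"


def run_agent(query: str, max_steps: int = 5) -> str:
    """Agent: the next step is picked at runtime from what has been observed so far."""
    history: List[str] = []
    for _ in range(max_steps):
        action = choose_action(query, history)
        if action == "finish":
            break
        observation = TOOLS[action](history[-1] if history else query)
        history.append(observation)
    return history[-1] if history else ""
```

If both functions always produce the same trace for a use case, as they do here, that is a hint the workflow is sufficient and the agent adds cost without adding flexibility.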
Stage 1: Build A Minimal Agent Loop
Core Pattern: observe → think → act → observe
```python
# Minimal agent loop example (Python + OpenAI)
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

tools = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform basic arithmetic",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression like '2+2'"}
                },
                "required": ["expression"]
            }
        }
    }
]


def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression."""
    # Accept only digits, whitespace, and arithmetic characters. This keeps the
    # demo simple; it is not a hardened sandbox.
    if not set(expression) <= set("0123456789+-*/(). "):
        return "Error: unsupported characters in expression"
    try:
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"Error: {e}"


def run_agent(user_message: str, max_steps: int = 5):
    messages = [{"role": "user", "content": user_message}]
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools
        )
        message = response.choices[0].message
        messages.append(message)

        # Check if done: no tool calls means the model produced a final answer
        if not message.tool_calls:
            return message.content

        # Execute tool calls; every tool call must receive a tool message in reply
        for tool_call in message.tool_calls:
            if tool_call.function.name == "calculate":
                args = json.loads(tool_call.function.arguments)
                result = calculate(args["expression"])
            else:
                result = f"Error: unknown tool {tool_call.function.name}"
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })
    return "Max steps reached"


# Usage
result = run_agent("What is 25 * 4 + 10?")
print(result)
```
**Deliverable**: A 50-150 line agent that can choose tools, execute them, and return a final answer.
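Because the `calculate` tool receives model-generated text, it is worth seeing one way to evaluate arithmetic without `eval`. The sketch below is an assumption about how such a tool could be hardened, not code from the repository: `safe_calculate` and its helper names are hypothetical, and only `+`, `-`, `*`, `/`, parentheses, and numeric literals are accepted.

```python
import ast
import operator

# Whitelisted operators; any other syntax node is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _evaluate(node: ast.AST):
    """Recursively evaluate a parsed expression made only of whitelisted nodes."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def safe_calculate(expression: str) -> str:
    """Return the result as a string, or an 'Error: ...' string the model can read."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_evaluate(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        return f"Error: {exc}"
```

Returning the error as a string rather than raising keeps the loop alive: the model sees the failure as an observation and can retry with a corrected expression.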
Stage 2: Tool Use, RAG, and Memory
Recommended Projects to Study:
| Project | Focus Area |
|---|---|
| GPT Researcher | Search → scrape → filter → cite → generate report |
| STORM | Multi-perspective research writing with outline |
| Khoj | Personal second brain with semantic search |
| mem0 | Adding long-term memory to agents |
```python
# RAG-enhanced agent example (using LlamaIndex)
# Written against the legacy ReActAgent.from_tools interface; newer LlamaIndex
# releases may expose a different agent API. The default LLM and embedding
# model expect OPENAI_API_KEY in the environment.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Load and index documents
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create query engine tool
query_engine = index.as_query_engine()
query_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="doc_search",
        description="Search company documentation. Use this when user asks about policies, procedures, or technical specs."
    )
)

# Create agent with tools
agent = ReActAgent.from_tools(
    tools=[query_tool],
    verbose=True
)

# Run agent with citation requirement
response = agent.chat(
    "What is our company's remote work policy? Please cite sources."
)
print(response)
```
**Deliverable**: Research assistant that searches, filters, summarizes, and outputs citations.
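The stage title also covers memory, which the RAG example does not show. As a rough illustration of the idea behind long-term memory, here is a toy store that saves facts and recalls them by word overlap. This is not how mem0 or any listed project works internally; `MemoryStore`, `build_prompt`, and the scoring rule are hypothetical, and a real system would use embeddings and persistence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryStore:
    """Toy long-term memory: stores facts and retrieves them by shared words."""
    facts: List[str] = field(default_factory=list)

    def remember(self, fact: str) -> None:
        if fact not in self.facts:
            self.facts.append(fact)

    def recall(self, query: str, top_k: int = 2) -> List[str]:
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(f.lower().split())), f) for f in self.facts]
        relevant = [pair for pair in scored if pair[0] > 0]
        ranked = sorted(relevant, key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in ranked[:top_k]]


def build_prompt(memory: MemoryStore, user_message: str) -> str:
    """Prepend recalled facts so the model sees them on every turn."""
    recalled = memory.recall(user_message)
    context = "\n".join(f"- {fact}" for fact in recalled) or "- (nothing relevant)"
    return f"Known facts about the user:\n{context}\n\nUser: {user_message}"
```

The design choice worth noticing is that memory is injected into the prompt at read time, so the agent loop itself stays unchanged.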
Stage 3: Study One Modern Agent Harness
Key Systems to Learn:
| System | Learn This For |
|---|---|
| Claude Code | Real coding agent: CLI, tools, permissions, hooks, subagents, MCP |
| learn-claude-code | From-scratch harness implementation |
| claw0 | Building session, gateway, memory, heartbeat, delivery, resilience |
| OpenClaw | Local-first personal agent with skills and system tools |
| LangGraph | Stateful graph orchestration |
What to Look For in a Harness:
- Agent loop implementation
- Tool registry and permission gates
- Session/state store
- Context compaction strategy
- Trace/logging system
- Error handling and recovery
```python
# Example: Understanding tool permission gate pattern
from typing import Any, Callable, Dict


class ToolRegistry:
    def __init__(self):
        self.tools: Dict[str, Callable[..., Any]] = {}
        self.permissions: Dict[str, dict] = {}

    def register(self, name: str, func: Callable[..., Any], requires_approval: bool = False):
        """Register tool with optional approval gate."""
        self.tools[name] = func
        self.permissions[name] = {
            "requires_approval": requires_approval,
            "allowed_domains": []  # Could expand to domain restrictions
        }

    def execute(self, name: str, args: dict, auto_approve: bool = False):
        """Execute tool with permission check."""
        if name not in self.tools:
            raise ValueError(f"Tool {name} not found")
        if self.permissions[name]["requires_approval"] and not auto_approve:
            # In a real system, this would trigger user confirmation
            print(f"⚠️ Tool {name} requires approval. Args: {args}")
            confirm = input("Approve? (y/n): ")
            if confirm.lower() != 'y':
                return "Tool execution denied by user"
        return self.tools[name](**args)


# Usage
registry = ToolRegistry()
registry.register("search_web", lambda query: f"Results for {query}", requires_approval=False)
registry.register("send_email", lambda to, body: f"Email sent to {to}", requires_approval=True)
```
**Deliverable**: Working agent harness demo with README, example runs, and failure logs.
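The harness checklist above also names a trace/logging system. One possible shape for it, sketched under assumptions, is a wrapper that records every tool call with its arguments, outcome, and duration. `TraceLog`, its record fields, and the JSON Lines output are illustrative choices, not the format used by Claude Code or any other harness listed.

```python
import json
import time
from typing import Any, Callable, Dict, List


class TraceLog:
    """Collects one record per tool call; dump() writes them as JSON Lines."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def traced(self, name: str, func: Callable[..., Any]) -> Callable[..., Any]:
        """Wrap a tool so each call is recorded, whether it succeeds or fails."""
        def wrapper(**kwargs: Any) -> Any:
            record: Dict[str, Any] = {"tool": name, "args": kwargs, "ok": True}
            started = time.perf_counter()
            try:
                result = func(**kwargs)
                record["result"] = repr(result)[:200]  # truncate large outputs
                return result
            except Exception as exc:
                record["ok"] = False
                record["error"] = f"{type(exc).__name__}: {exc}"
                raise
            finally:
                record["duration_ms"] = round((time.perf_counter() - started) * 1000, 3)
                self.records.append(record)
        return wrapper

    def dump(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as handle:
            for record in self.records:
                handle.write(json.dumps(record, default=str) + "\n")
```

Wrapping tools at registration time, for example `log.traced("search_web", search_fn)`, means failure logs for the deliverable come for free rather than being added after something breaks.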
Stage 4: Multi-Agent Coordination
Core Principle: Multi-agent is coordination, not magic. Use supervisor patterns or graphs, not random chat.
```python
# LangGraph multi-agent example
from typing import List, TypedDict

from langgraph.graph import StateGraph, END


class ResearchState(TypedDict):
    topic: str
    outline: List[str]
    research: dict
    draft: str
    review: str
    final: str


def planner(state: ResearchState) -> ResearchState:
    """Create outline for research."""
    # Call LLM to generate outline
    state["outline"] = ["Introduction", "Key Findings", "Conclusion"]
    return state


def researcher(state: ResearchState) -> ResearchState:
    """Research each section."""
    research = {}
    for section in state["outline"]:
        # Call search API and summarize
        research[section] = f"Research for {section}..."
    state["research"] = research
    return state


def writer(state: ResearchState) -> ResearchState:
    """Write draft from research."""
    state["draft"] = "Draft based on research..."
    return state


def reviewer(state: ResearchState) -> ResearchState:
    """Review and suggest improvements."""
    state["review"] = "Needs more citations in section 2"
    return state


def reviser(state: ResearchState) -> ResearchState:
    """Revise based on review."""
    state["final"] = "Final version with improvements..."
    return state


# Build graph
workflow = StateGraph(ResearchState)
workflow.add_node("planner", planner)
workflow.add_node("researcher", researcher)
workflow.add_node("writer", writer)
workflow.add_node("reviewer", reviewer)
workflow.add_node("reviser", reviser)

workflow.set_entry_point("planner")
workflow.add_edge("planner", "researcher")
workflow.add_edge("researcher", "writer")
workflow.add_edge("writer", "reviewer")
workflow.add_edge("reviewer", "reviser")
workflow.add_edge("reviser", END)

app = workflow.compile()
```
**Deliverable**: Multi-agent system with clear roles (e.g., research → write → review → revise).
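The supervisor pattern mentioned in the core principle can be sketched without any framework. In this hypothetical example a `supervisor` function inspects shared state and names the next role, and a revision cap guarantees termination. The role functions are placeholders (the reviewer simply approves from the second revision onward); in a real system each would call an LLM.

```python
from typing import Callable, Dict

State = Dict[str, object]


def writer(state: State) -> State:
    """Produce a new draft and clear any earlier verdict."""
    revision = int(state.get("revision", 0)) + 1
    return {**state, "revision": revision, "draft": f"draft v{revision}", "approved": None}


def reviewer(state: State) -> State:
    """Placeholder rule: approve from the second revision onward."""
    return {**state, "approved": int(state["revision"]) >= 2}


def supervisor(state: State, max_revisions: int = 3) -> str:
    """Decide which role runs next; 'done' stops the loop."""
    if "draft" not in state:
        return "writer"
    if state.get("approved") is None:
        return "reviewer"
    if state["approved"] or int(state["revision"]) >= max_revisions:
        return "done"
    return "writer"


ROLES: Dict[str, Callable[[State], State]] = {"writer": writer, "reviewer": reviewer}


def run(topic: str, max_revisions: int = 3) -> State:
    state: State = {"topic": topic}
    while (next_role := supervisor(state, max_revisions)) != "done":
        state = ROLES[next_role](state)
    return state
```

All routing decisions live in one function, so the coordination logic can be unit-tested separately from the roles themselves.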
Stage 5: Skills, MCP, and Capability Packaging
Key Concepts:
- Skill: Reusable procedural knowledge (how to do X)
- Tool: Callable interface (function/API)
- MCP: Model Context Protocol for connecting external tools/data
- A2A: Agent-to-Agent protocol
- ACP: Agent Client Protocol
Skill File Structure (Claude Code style):
```markdown
# SKILL.md

## Name
code-review

## Description
Perform thorough code review following team standards

## When to Use
- User asks "review this code"
- PR is opened (via webhook)
- Code changes detected in staging branch

## Steps
- Read code changes (use git diff or file_read tool)
- Check against style guide in `.code-standards.md`
- Run linter: `npm run lint` or `python -m pylint`
- Check for common issues:
  - Hardcoded secrets
  - Missing error handling
  - Unhandled edge cases
  - Performance anti-patterns
- Generate structured feedback with severity levels

## Tools Required
- file_read
- execute_command
- (optional) github_api for posting comments

## Acceptance Criteria
- All files reviewed
- At least 3 specific suggestions
- Severity level assigned (blocker/major/minor)
- Code style compliance checked
```
**MCP Server Example**:
```typescript
// MCP server for custom tools
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  {
    name: "my-tools-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Register tool: handlers are keyed by request schema, not by method string
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "search_codebase",
        description: "Search company codebase using semantic search",
        inputSchema: {
          type: "object",
          properties: {
            query: { type: "string" },
            language: { type: "string", enum: ["python", "typescript", "all"] }
          },
          required: ["query"]
        }
      }
    ]
  };
});

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "search_codebase") {
    const args = (request.params.arguments ?? {}) as { query?: string; language?: string };
    // Implement search logic
    return {
      content: [{ type: "text", text: `Results for ${args.query}...` }]
    };
  }
  // Report unknown tools instead of returning undefined
  throw new Error(`Unknown tool: ${request.params.name}`);
});

// Start server
const transport = new StdioServerTransport();
await server.connect(transport);
```
**Deliverable**: Reusable skill (e.g., code-review, research-report, migration-helper) with clear structure.
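A skill file is only reusable if a harness can load and check it. The following sketch shows one way that could look, assuming sections are introduced by `## ` headings. The function names and the list of required sections are hypothetical, and real skill formats (including Claude Code's) may differ.

```python
from typing import Dict, List, Optional, Sequence


def parse_skill(markdown_text: str) -> Dict[str, List[str]]:
    """Split a skill file into {section title: non-empty body lines}."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


def missing_sections(
    sections: Dict[str, List[str]],
    required: Sequence[str] = ("Name", "Description", "Steps"),
) -> List[str]:
    """Return required section titles that are absent or empty."""
    return [title for title in required if not sections.get(title)]
```

Running `missing_sections` before a skill is registered catches incomplete files early, which is cheaper than discovering a missing `Steps` section mid-task.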
Stage 6: Browser and Computer-Use Agents
Key Patterns:
- DOM observation and element selection
- Click/type/scroll actions
- Screenshot-based fallback
- Safety boundaries (no sensitive logins, respect robots.txt)
```python
# Browser agent using browser-use
# The imports follow the interface shown here; browser-use has changed its
# LLM wrappers between releases, so check the version you install.
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI


async def main():
    agent = Agent(
        task="Go to news.ycombinator.com and get the top 5 story titles",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()
    print(result)


asyncio.run(main())
```
**Anthropic Computer Use Example**:
```python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Computer use is a beta tool, so the request goes through client.beta.messages
# with the matching beta flag.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Open a browser and search for 'AI agent frameworks'"
        }
    ],
    betas=["computer-use-2024-10-22"],
)
print(response)
```
**Safety Checklist**:
- No login to sensitive accounts
- No financial transactions without explicit confirmation
- Respect robots.txt and rate limits
- Log all actions with screenshots
- Human-in-the-loop for risky actions

**Deliverable**: Browser agent that operates on public pages (e.g., extract info, generate summary).
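Two of the safety boundaries above, staying on approved sites and respecting robots.txt, can be checked before every navigation. This is a minimal sketch using only the standard library. `ALLOWED_HOSTS`, the `learning-agent` user agent string, and the function name are hypothetical, and the robots.txt text is passed in so the check itself makes no network call.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist: the agent may only visit these hosts.
ALLOWED_HOSTS = {"news.ycombinator.com"}


def is_navigation_allowed(url: str, robots_txt: str, user_agent: str = "learning-agent") -> bool:
    """Allow a URL only if its host is allowlisted and robots.txt permits it."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_HOSTS:
        return False
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a browser agent this check would sit in front of the navigation action, with the site's robots.txt fetched once per host and cached.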
Stage 7: Evaluation, Observability, and Safety
Core Metrics:
- Success rate
- Failure reason distribution
- Tool call count
- Cost per task
- Latency (p50, p95, p99)
```python
# Evaluation harness example
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class TestCase:
    id: str
    input: str
    expected_output: str
    max_steps: int = 10


@dataclass
class EvalResult:
    test_id: str
    success: bool
    actual_output: str
    steps_taken: int
    cost_usd: float
    latency_ms: float
    failure_reason: Optional[str] = None
    trace: Optional[List[Dict]] = None


class AgentEvaluator:
    def __init__(self, agent_fn, test_cases: List[TestCase]):
        # agent_fn(input, max_steps=...) must return a dict with
        # "output", "steps", and "cost" keys (and optionally "trace").
        self.agent_fn = agent_fn
        self.test_cases = test_cases
        self.results: List[EvalResult] = []

    def run_evaluation(self) -> Dict:
        """Run all test cases and collect metrics."""
        for test in self.test_cases:
            start_time = datetime.now()
            try:
                result = self.agent_fn(test.input, max_steps=test.max_steps)
                success = self._check_success(result, test.expected_output)
                latency = (datetime.now() - start_time).total_seconds() * 1000
                eval_result = EvalResult(
                    test_id=test.id,
                    success=success,
                    actual_output=result["output"],
                    steps_taken=result["steps"],
                    cost_usd=result["cost"],
                    latency_ms=latency,
                    trace=result.get("trace")
                )
            except Exception as e:
                eval_result = EvalResult(
                    test_id=test.id,
                    success=False,
                    actual_output="",
                    steps_taken=0,
                    cost_usd=0,
                    latency_ms=0,
                    failure_reason=str(e)
                )
            self.results.append(eval_result)
        return self._compute_metrics()

    def _check_success(self, result: Dict, expected: str) -> bool:
        """Check if output matches expected (implement your logic)."""
        return expected.lower() in result["output"].lower()

    def _compute_metrics(self) -> Dict:
        """Aggregate metrics across all tests."""
        total = len(self.results)
        successful = sum(1 for r in self.results if r.success)
        return {
            "success_rate": successful / total if total > 0 else 0,
            "total_tests": total,
            "total_cost_usd": sum(r.cost_usd for r in self.results),
            "avg_latency_ms": sum(r.latency_ms for r in self.results) / total if total > 0 else 0,
            "failure_reasons": [r.failure_reason for r in self.results if not r.success]
        }

    def export_report(self, filename: str):
        """Export detailed report as JSON."""
        report = {
            "metrics": self._compute_metrics(),
            "results": [
                {
                    "test_id": r.test_id,
                    "success": r.success,
                    "steps": r.steps_taken,
                    "cost": r.cost_usd,
                    "latency_ms": r.latency_ms,
                    "failure_reason": r.failure_reason
                }
                for r in self.results
            ]
        }
        with open(filename, "w") as f:
            json.dump(report, f, indent=2)


# Usage
test_cases = [
    TestCase(
        id="research-basic",
        input="Research recent AI agent frameworks",
        expected_output="langchain"
    ),
    TestCase(
        id="research-citation",
        input="Find papers on agent evaluation",
        expected_output="citation"
    )
]

# my_agent_function is your own agent entry point; it is not defined here.
evaluator = AgentEvaluator(agent_fn=my_agent_function, test_cases=test_cases)
metrics = evaluator.run_evaluation()
evaluator.export_report("eval_results.json")
print(f"Success rate: {metrics['success_rate']:.2%}")
print(f"Total cost: ${metrics['total_cost_usd']:.4f}")
```
**Safety Patterns**:
```python
# Dangerous tool approval gate
import json

DANGEROUS_TOOLS = ["delete_file", "send_email", "make_payment", "publish_content"]

# Maps tool name -> callable; populate with your own tools.
tools_registry = {}


def execute_tool_with_approval(tool_name: str, args: dict):
    """Execute tool with human-in-the-loop for dangerous actions."""
    if tool_name in DANGEROUS_TOOLS:
        print(f"\n⚠️ APPROVAL REQUIRED")
        print(f"Tool: {tool_name}")
        print(f"Args: {json.dumps(args, indent=2)}")
        approval = input("\nApprove this action? (yes/no): ")
        if approval.lower() != "yes":
            return {"status": "rejected", "reason": "User denied approval"}
    # Execute tool
    return tools_registry[tool_name](**args)
```
**Deliverable**: Eval suite with fixed test set, success rate tracking, and cost/latency metrics.
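The core metrics list names latency percentiles and a failure reason distribution, which a plain average does not capture. A small sketch of both follows, using the nearest-rank percentile definition. The function names are hypothetical, and labelling a missing reason as "unknown" is an assumption.

```python
from collections import Counter
from typing import Dict, List, Optional


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


def latency_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Return p50/p95/p99 latency for a list of per-task latencies."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


def failure_distribution(reasons: List[Optional[str]]) -> Dict[str, int]:
    """Count how often each failure reason occurs; None is reported as 'unknown'."""
    return dict(Counter(reason or "unknown" for reason in reasons))
```

Tail percentiles matter for agents because a few runs that hit the step limit can dominate user-perceived latency while barely moving the mean.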
Common Patterns
When to Use Agents vs. Workflows
Use Agent When:
- Task requires dynamic tool selection
- Steps depend on runtime information
- Need to handle unexpected situations
- Task involves exploration or research
Use Workflow When:
- Steps are predictable
- Process is well-defined
- Speed and cost matter more than flexibility
- You need guarantees about execution path
Context Management
```python
# Simple context compaction strategy
class ContextManager:
    SUMMARY_PREFIX = "[Previous conversation summarized"

    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, message: dict):
        """Add message and compact if needed."""
        self.messages.append(message)
        # Estimate tokens (rough: 1 token ≈ 4 chars)
        total_chars = sum(len(str(m)) for m in self.messages)
        estimated_tokens = total_chars // 4
        if estimated_tokens > self.max_tokens:
            self._compact()

    def _compact(self):
        """Keep system messages and recent history; replace older turns with a summary."""
        # Original system messages, excluding summaries left by earlier compactions
        system = [
            m for m in self.messages
            if m["role"] == "system" and not str(m["content"]).startswith(self.SUMMARY_PREFIX)
        ]
        others = [m for m in self.messages if m["role"] != "system"]
        recent = others[-10:]  # Keep last 10 non-system messages
        dropped = len(others) - len(recent)
        # Summarize older messages (in production, use an LLM)
        if dropped > 0:
            summary = {
                "role": "system",
                "content": f"{self.SUMMARY_PREFIX}: {dropped} messages]"
            }
            self.messages = system + [summary] + recent
```
Troubleshooting
Agent Loops Forever
Cause: No max_steps limit or unclear stopping criteria.
Solution:
```python
def run_agent_with_limits(agent, task: str, max_steps: int = 10, max_cost: float = 1.0):
    """Run an agent under step and cost limits.

    `agent` is assumed to expose step(), returning an object with
    `.cost` and `.is_final` attributes.
    """
    total_cost = 0.0
    last_result = None
    for step in range(max_steps):
        if total_cost > max_cost:
            return {"error": "Cost limit exceeded", "partial_result": last_result}
        # Run agent step
        last_result = agent.step()
        total_cost += last_result.cost
        if last_result.is_final:
            return last_result
    return {"error": "Max steps exceeded", "partial_result": last_result}
```
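Step and cost limits stop a runaway loop eventually, but an unclear stopping criterion often shows up earlier as the same tool call repeated with the same arguments. A hypothetical detector for that case is sketched below. `RepeatDetector` and its default threshold are illustrative, not part of any listed harness.

```python
import json
from collections import Counter
from typing import Any, Dict


class RepeatDetector:
    """Flags an agent that keeps issuing the same tool call with the same arguments."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self.seen: Counter = Counter()

    def should_stop(self, tool_name: str, args: Dict[str, Any]) -> bool:
        """Record the call and return True once it has occurred max_repeats times."""
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} count as the same call
        key = f"{tool_name}:{json.dumps(args, sort_keys=True, default=str)}"
        self.seen[key] += 1
        return self.seen[key] >= self.max_repeats
```

When the detector fires, a harness can either abort or feed the model an observation such as "you have already tried this call" so it changes strategy.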
Tools Return Empty/Error Frequently
Cause: Input validation issues or tool design mismatch.
Solution:
- Add input schema validation
- Provide clear tool descriptions
- Show examples in tool metadata
- Add retry logic with backoff
```python
from pydantic import BaseModel, Field


class SearchInput(BaseModel):
    query: str = Field(..., min_length=3, description="Search query, at least 3 characters")
    max_results: int = Field(5, ge=1, le=20, description="Number of results, 1-20")


def search_tool(input: SearchInput) -> dict:
    """Type-safe search tool."""
    # Pydantic validates when SearchInput is constructed, so `input` is already
    # valid here; build it from the model's raw arguments with SearchInput(**args).
    return {"results": [...]}
```
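The last item in the solution list, retry logic with backoff, can be sketched as a small wrapper. The name `call_with_retry`, the default delays, and the 10% jitter are assumptions. The `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay / 10))  # jitter avoids synchronized retries
    raise AssertionError("unreachable")
```

Retrying every exception is a simplification; in practice only transient errors such as timeouts and rate limits should be retried, while validation errors should go straight back to the model.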
Agent Hallucinates Citations
Cause: No grounding mechanism, agent invents sources.
Solution:
- Return citations with every retrieval
- Use structured output for citations
- Post-process to verify citation validity
```python
import re
from typing import List


def verify_citations(text: str, sources: List[str]) -> bool:
    """Check that every [n] citation in text refers to an entry in sources."""
    # Markers are treated as 0-based indices into `sources`; shift by one
    # if your citations are numbered from 1.
    cited = re.findall(r'\[(\d+)\]', text)
    max_source_idx = len(sources) - 1
    for citation in cited:
        if int(citation) > max_source_idx:
            return False
    return True
```
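The second solution item, structured output for citations, might look like the following sketch: the model returns claims that reference source IDs, and the harness rejects any claim that points at nothing retrieved. The `Source` and `Claim` shapes and the function name are hypothetical, and the example URL is a placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Source:
    source_id: str
    title: str
    url: str


@dataclass(frozen=True)
class Claim:
    text: str
    source_ids: List[str]


def ungrounded_claims(claims: List[Claim], sources: List[Source]) -> List[Claim]:
    """Return claims that cite nothing, or cite an ID that was never retrieved."""
    known = {source.source_id for source in sources}
    return [
        claim for claim in claims
        if not claim.source_ids or not set(claim.source_ids) <= known
    ]


# Usage with placeholder data
sources = [Source("s1", "Remote work policy", "https://example.com/policy")]
claims = [
    Claim("Employees may work remotely two days a week.", ["s1"]),
    Claim("The policy was updated last month.", ["s9"]),  # s9 was never retrieved
    Claim("Everyone likes it.", []),
]
flagged = ungrounded_claims(claims, sources)
```

Checking IDs is easier to make reliable than parsing bracketed numbers out of prose, because the set of valid IDs is fixed by the retrieval step.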
Related Resources
Official Documentation:
Key Papers:
- ReAct: Synergizing Reasoning and Acting in Language Models
- WebArena: A Realistic Web Environment for Building Autonomous Agents
- ToolBench: Tool Learning with Foundation Models
Recommended Open Source Projects (curated in repo):
- GPT Researcher, STORM, Khoj (RAG/research)
- learn-claude-code, claw0, OpenClaw (agent harnesses)
- browser-use (browser agents)
- mem0, Letta (memory systems)
Best Practices
- Start Simple: Build minimal loop before adding frameworks
- Add Safety Early: Approval gates, logging, cost limits from day one
- Evaluate Continuously: Fixed test set, track regressions
- Study Harnesses: Learn from Claude Code, OpenClaw, not just toy examples
- Prefer Skills Over Prompts: Package reusable knowledge formally
- Use MCP for Integration: Standard protocol beats custom tool wrappers
- Log Everything: Traces are essential for debugging agent failures
- Human-in-Loop for Risk: Never auto-approve delete/send/publish
When to Use This Skill
An AI coding agent should use this skill when:
- User asks "how do I learn AI agents"
- User wants a structured agent learning path
- User needs guidance on modern agent architecture
- User asks about a specific stage (tool use, RAG, multi-agent, evaluation)
- User wants curated agent resources or project recommendations
- User needs code examples for agent loops, tools, or harnesses