Datawhale Agent Learning Hub


Skill by ara.so — AI Agent Skills collection.
A curated AI Agent learning roadmap and resource hub maintained by Datawhale. This project provides a structured learning path from basic agent loops to production-ready agent systems, emphasizing modern patterns like agent harnesses, skills, MCP (Model Context Protocol), and evaluation.

What This Project Provides


  • Structured Learning Path: 7-stage todo list from basic agent loops to browser/computer-use agents
  • Curated Resources: Official docs, papers, and proven open-source projects
  • Modern Focus: Prioritizes Claude Code, OpenClaw, skills, MCP, A2A over legacy role-play frameworks
  • Project Ladder: Real-world agent projects you can build at each stage
  • Current Best Practices: What to learn now vs. what's outdated

Installation & Access


This is a learning resource repository, not a package to install:
```bash
# Clone the repository, then read the README
cat README.md

# Use it as a reference while building agents
```

Key Learning Stages


Stage 0: Understand What An Agent Is


Core Concept: Distinguish chatbot vs workflow vs agent vs multi-agent.
Required Reading:
Deliverable: One-page note answering "Why does my use case need an agent instead of a workflow?"
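
The workflow-versus-agent distinction can be made concrete with a small sketch. The tools (`search`, `summarize`) and the `scripted_policy` stand-in for an LLM are hypothetical names invented for illustration; the point is only where the decision about the next step lives.

```python
from typing import Callable, Dict, List


# Hypothetical tools used only for illustration.
def search(query: str) -> str:
    return f"results for {query}"


def summarize(text: str) -> str:
    return f"summary of {text}"


TOOLS: Dict[str, Callable[[str], str]] = {"search": search, "summarize": summarize}


def run_workflow(query: str) -> str:
    """Workflow: the developer fixes the order of steps in code."""
    return summarize(search(query))


def run_agent(query: str, choose_next: Callable[[List[str]], str], max_steps: int = 5) -> str:
    """Agent: a policy (normally an LLM) picks the next tool at run time."""
    history = [query]
    for _ in range(max_steps):
        action = choose_next(history)
        if action == "finish":
            break
        history.append(TOOLS[action](history[-1]))
    return history[-1]


def scripted_policy(history: List[str]) -> str:
    """Stand-in for an LLM: search first, summarize next, then stop."""
    return ["search", "summarize", "finish"][min(len(history) - 1, 2)]


print(run_workflow("agents"))
print(run_agent("agents", scripted_policy))
```

Both calls produce the same result here because the scripted policy happens to follow the workflow's order. With a real model as the policy, the agent could skip, repeat, or reorder steps, which is the flexibility (and the unpredictability) a use case has to justify.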

Stage 1: Build A Minimal Agent Loop


Core Pattern: observe → think → act → observe

```python
# Minimal agent loop example (Python + OpenAI)
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

tools = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform basic arithmetic",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Math expression like '2+2'",
                    }
                },
                "required": ["expression"],
            },
        },
    }
]


def calculate(expression: str) -> str:
    """Evaluate a math expression with builtins disabled.

    Demo only: eval is not a real sandbox, so do not expose it to untrusted input.
    """
    try:
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"Error: {e}"


def run_agent(user_message: str, max_steps: int = 5):
    messages = [{"role": "user", "content": user_message}]

    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )

        message = response.choices[0].message
        messages.append(message)

        # Check if done: no tool calls means the model produced its final answer
        if not message.tool_calls:
            return message.content

        # Execute tool calls; every tool call needs a matching tool message
        for tool_call in message.tool_calls:
            if tool_call.function.name == "calculate":
                args = json.loads(tool_call.function.arguments)
                result = calculate(args["expression"])
            else:
                result = f"Error: unknown tool {tool_call.function.name}"

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Max steps reached"


# Usage
result = run_agent("What is 25 * 4 + 10?")
print(result)
```

**Deliverable**: 50-150 line agent that can choose tools, execute them, and return a final answer.

Stage 2: Tool Use, RAG, and Memory


Recommended Projects to Study:

| Project | Focus Area |
| --- | --- |
| GPT Researcher | Search → scrape → filter → cite → generate report |
| STORM | Multi-perspective research writing with outline |
| Khoj | Personal second brain with semantic search |
| mem0 | Adding long-term memory to agents |

```python
# RAG-enhanced agent example (using LlamaIndex)
# Note: ReActAgent.from_tools is the classic constructor; newer LlamaIndex
# releases may expose a different agent API, so check your installed version.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Load and index documents
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create query engine tool
query_engine = index.as_query_engine()
query_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="doc_search",
        description=(
            "Search company documentation. Use this when user asks about "
            "policies, procedures, or technical specs."
        ),
    ),
)

# Create agent with tools
agent = ReActAgent.from_tools(
    tools=[query_tool],
    verbose=True,
)

# Run agent with citation requirement
response = agent.chat(
    "What is our company's remote work policy? Please cite sources."
)
print(response)
```

**Deliverable**: Research assistant that searches, filters, summarizes, and outputs citations.

Stage 3: Study One Modern Agent Harness


Key Systems to Learn:

| System | Learn This For |
| --- | --- |
| Claude Code | Real coding agent: CLI, tools, permissions, hooks, subagents, MCP |
| learn-claude-code | From-scratch harness implementation |
| claw0 | Building session, gateway, memory, heartbeat, delivery, resilience |
| OpenClaw | Local-first personal agent with skills and system tools |
| LangGraph | Stateful graph orchestration |

What to Look For in a Harness:
  • Agent loop implementation
  • Tool registry and permission gates
  • Session/state store
  • Context compaction strategy
  • Trace/logging system
  • Error handling and recovery
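
Two of the items above, the trace/logging system and error handling, can be sketched together in a few lines. This is a minimal illustration, not how any of the listed harnesses implement it; `Tracer` and `TraceEvent` are hypothetical names.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class TraceEvent:
    tool: str
    args: Dict[str, Any]
    ok: bool
    output: str
    duration_ms: float


@dataclass
class Tracer:
    """Records every tool call, including failures, for later inspection."""
    events: List[TraceEvent] = field(default_factory=list)

    def call(self, tool: str, func: Callable[..., Any], **args: Any) -> str:
        start = time.perf_counter()
        try:
            output, ok = str(func(**args)), True
        except Exception as exc:
            # Return the error as text so the agent loop can react instead of crashing.
            output, ok = f"Error: {exc}", False
        duration_ms = (time.perf_counter() - start) * 1000
        self.events.append(TraceEvent(tool, args, ok, output, duration_ms))
        return output


tracer = Tracer()
tracer.call("add", lambda a, b: a + b, a=1, b=2)
tracer.call("divide", lambda a, b: a / b, a=1, b=0)
for event in tracer.events:
    print(event.tool, event.ok, event.output)
```

Returning the error text to the caller rather than raising is one design choice among several; it lets the model see the failure and retry, while the trace keeps the evidence for the failure logs.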

```python
# Example: Understanding tool permission gate pattern
from typing import Callable


class ToolRegistry:
    def __init__(self):
        self.tools = {}
        self.permissions = {}

    def register(self, name: str, func: Callable, requires_approval: bool = False):
        """Register tool with optional approval gate."""
        self.tools[name] = func
        self.permissions[name] = {
            "requires_approval": requires_approval,
            "allowed_domains": [],  # Could expand to domain restrictions
        }

    def execute(self, name: str, args: dict, auto_approve: bool = False):
        """Execute tool with permission check."""
        if name not in self.tools:
            raise ValueError(f"Tool {name} not found")

        if self.permissions[name]["requires_approval"] and not auto_approve:
            # In a real system, this would trigger user confirmation
            print(f"⚠️  Tool {name} requires approval. Args: {args}")
            confirm = input("Approve? (y/n): ")
            if confirm.lower() != "y":
                return "Tool execution denied by user"

        return self.tools[name](**args)


# Usage
registry = ToolRegistry()
registry.register("search_web", lambda query: f"Results for {query}", requires_approval=False)
registry.register("send_email", lambda to, body: f"Email sent to {to}", requires_approval=True)
```

**Deliverable**: Working agent harness demo with README, example runs, and failure logs.

Stage 4: Multi-Agent Coordination


Core Principle: Multi-agent is coordination, not magic. Use supervisor patterns or graphs, not random chat.
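
As a framework-free illustration of the supervisor pattern, the sketch below has a supervisor inspect shared state and pick the next worker explicitly. The worker functions and state keys are hypothetical; in a real system each worker would wrap an LLM call, and the supervisor could itself be a model.

```python
from typing import Callable, Dict

State = Dict[str, str]


# Hypothetical workers; each returns only the keys it adds to the shared state.
def research_worker(state: State) -> State:
    return {"notes": f"notes on {state['topic']}"}


def writing_worker(state: State) -> State:
    return {"draft": f"draft from {state['notes']}"}


WORKERS: Dict[str, Callable[[State], State]] = {
    "research": research_worker,
    "write": writing_worker,
}


def supervisor(state: State) -> str:
    """Decide who works next by looking at what the state still lacks."""
    if "notes" not in state:
        return "research"
    if "draft" not in state:
        return "write"
    return "done"


def run_team(topic: str, max_rounds: int = 10) -> State:
    state: State = {"topic": topic}
    for _ in range(max_rounds):
        role = supervisor(state)
        if role == "done":
            break
        state.update(WORKERS[role](state))
    return state


print(run_team("MCP"))
```

Every handoff goes through one routing function and a bounded loop, so the coordination is inspectable and cannot degrade into agents chatting indefinitely.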

```python
# LangGraph multi-agent example
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class ResearchState(TypedDict):
    topic: str
    outline: List[str]
    research: dict
    draft: str
    review: str
    final: str


def planner(state: ResearchState) -> ResearchState:
    """Create outline for research."""
    # Call LLM to generate outline
    state["outline"] = ["Introduction", "Key Findings", "Conclusion"]
    return state


def researcher(state: ResearchState) -> ResearchState:
    """Research each section."""
    research = {}
    for section in state["outline"]:
        # Call search API and summarize
        research[section] = f"Research for {section}..."
    state["research"] = research
    return state


def writer(state: ResearchState) -> ResearchState:
    """Write draft from research."""
    state["draft"] = "Draft based on research..."
    return state


def reviewer(state: ResearchState) -> ResearchState:
    """Review and suggest improvements."""
    state["review"] = "Needs more citations in section 2"
    return state


def reviser(state: ResearchState) -> ResearchState:
    """Revise based on review."""
    state["final"] = "Final version with improvements..."
    return state


# Build graph
workflow = StateGraph(ResearchState)
workflow.add_node("planner", planner)
workflow.add_node("researcher", researcher)
workflow.add_node("writer", writer)
workflow.add_node("reviewer", reviewer)
workflow.add_node("reviser", reviser)

workflow.set_entry_point("planner")
workflow.add_edge("planner", "researcher")
workflow.add_edge("researcher", "writer")
workflow.add_edge("writer", "reviewer")
workflow.add_edge("reviewer", "reviser")
workflow.add_edge("reviser", END)

app = workflow.compile()
```

**Deliverable**: Multi-agent system with clear roles (e.g., research → write → review → revise).

Stage 5: Skills, MCP, and Capability Packaging


Key Concepts:
  • Skill: Reusable procedural knowledge (how to do X)
  • Tool: Callable interface (function/API)
  • MCP: Model Context Protocol for connecting external tools/data
  • A2A: Agent-to-Agent protocol
  • ACP: Agent Client Protocol
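
One way to picture the skill/tool split is as two small data structures: a tool is something callable, while a skill is a written procedure that names the tools it depends on. The classes below are a hypothetical sketch for illustration, not the format any particular harness uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    """A callable interface: a name plus the function behind it."""
    name: str
    func: Callable[..., str]


@dataclass
class Skill:
    """Procedural knowledge: ordered steps plus the tools those steps need."""
    name: str
    steps: List[str]
    required_tools: List[str]


def missing_tools(skill: Skill, registry: Dict[str, Tool]) -> List[str]:
    """Return the tools a skill needs that the current agent does not have."""
    return [name for name in skill.required_tools if name not in registry]


registry = {"file_read": Tool("file_read", lambda path: f"contents of {path}")}
review = Skill(
    name="code-review",
    steps=["Read code changes", "Run linter", "Generate structured feedback"],
    required_tools=["file_read", "execute_command"],
)
print(missing_tools(review, registry))
```

Checking a skill's requirements against the tool registry before loading it is an assumption of this sketch, but it shows why the two concepts are kept separate: the same skill text can be reused across agents that expose different tools.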
Skill File Structure (Claude Code style):

```markdown
# SKILL.md

## Name
code-review

## Description
Perform thorough code review following team standards

## When to Use
- User asks "review this code"
- PR is opened (via webhook)
- Code changes detected in staging branch

## Steps
1. Read code changes (use git diff or file_read tool)
2. Check against style guide in `.code-standards.md`
3. Run linter: `npm run lint` or `python -m pylint`
4. Check for common issues:
   - Hardcoded secrets
   - Missing error handling
   - Unhandled edge cases
   - Performance anti-patterns
5. Generate structured feedback with severity levels

## Tools Required
- file_read
- execute_command
- (optional) github_api for posting comments

## Acceptance Criteria
- All files reviewed
- At least 3 specific suggestions
- Severity level assigned (blocker/major/minor)
- Code style compliance checked
```

**MCP Server Example**:

```typescript
// MCP server for custom tools
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  {
    name: "my-tools-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Register tool: handlers are keyed by request schema, not by method string
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "search_codebase",
        description: "Search company codebase using semantic search",
        inputSchema: {
          type: "object",
          properties: {
            query: { type: "string" },
            language: { type: "string", enum: ["python", "typescript", "all"] }
          },
          required: ["query"]
        }
      }
    ]
  };
});

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "search_codebase") {
    const { query, language } = (request.params.arguments ?? {}) as {
      query: string;
      language?: string;
    };
    // Implement search logic (language is available as an optional filter)
    return {
      content: [{ type: "text", text: `Results for ${query}...` }]
    };
  }
  // Fail loudly on tools this server does not provide
  throw new Error(`Unknown tool: ${request.params.name}`);
});

// Start server
const transport = new StdioServerTransport();
await server.connect(transport);
```

Deliverable: Reusable skill (e.g., code-review, research-report, migration-helper) with clear structure.

Stage 6: Browser and Computer-Use Agents


Key Patterns:
  • DOM observation and element selection
  • Click/type/scroll actions
  • Screenshot-based fallback
  • Safety boundaries (no sensitive logins, respect robots.txt)
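
The robots.txt boundary in the last item can be enforced before the agent navigates anywhere. The sketch below uses only the standard library's `urllib.robotparser`; the `is_allowed` helper, the sample rules, and the `my-agent` user-agent string are illustrative assumptions, and a real agent would fetch the site's actual robots.txt rather than use an inline string.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules that have already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only.
ROBOTS = """User-agent: *
Disallow: /private/
"""

for url in ("https://example.com/public/page", "https://example.com/private/data"):
    verdict = "allowed" if is_allowed(ROBOTS, "my-agent", url) else "blocked"
    print(url, verdict)
```

Running the check inside the navigation tool itself, rather than relying on the prompt, means the model cannot talk its way past the boundary.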

```python
# Browser agent using browser-use
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI


async def main():
    agent = Agent(
        task="Go to news.ycombinator.com and get the top 5 story titles",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()
    print(result)


asyncio.run(main())
```

**Anthropic Computer Use Example**:

```python
import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# The computer_20241022 tool is a beta feature, so the request goes through
# the beta endpoint with the matching beta flag.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Open a browser and search for 'AI agent frameworks'"
        }
    ],
    betas=["computer-use-2024-10-22"],
)

# The response contains tool_use requests; your code must execute each action
# and send the result back in a follow-up message.
print(response)
```

Safety Checklist:
  • No login to sensitive accounts
  • No financial transactions without explicit confirmation
  • Respect robots.txt and rate limits
  • Log all actions with screenshots
  • Human-in-the-loop for risky actions
Deliverable: Browser agent that operates on public pages (e.g., extract info, generate summary).

Stage 7: Evaluation, Observability, and Safety


Core Metrics:
  • Success rate
  • Failure reason distribution
  • Tool call count
  • Cost per task
  • Latency (p50, p95, p99)
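
The latency percentiles in the last item are easy to compute from recorded per-task timings. The sketch below uses the nearest-rank definition, which is one of several common conventions (others interpolate between samples); `percentile` and `latency_summary` are illustrative helper names.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Summarize task latencies at the percentiles listed above."""
    return {f"p{pct}": percentile(latencies_ms, pct) for pct in (50, 95, 99)}


# Illustrative data: 100 tasks taking 1 ms, 2 ms, ..., 100 ms.
print(latency_summary([float(ms) for ms in range(1, 101)]))
```

Tail percentiles matter for agents because a few runs that loop or retry can dominate user-perceived latency while barely moving the average.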

```python
# Evaluation harness example
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class TestCase:
    id: str
    input: str
    expected_output: str
    max_steps: int = 10


@dataclass
class EvalResult:
    test_id: str
    success: bool
    actual_output: str
    steps_taken: int
    cost_usd: float
    latency_ms: float
    failure_reason: Optional[str] = None
    trace: Optional[List[Dict]] = None


class AgentEvaluator:
    def __init__(self, agent_fn, test_cases: List[TestCase]):
        self.agent_fn = agent_fn
        self.test_cases = test_cases
        self.results: List[EvalResult] = []

    def run_evaluation(self) -> Dict:
        """Run all test cases and collect metrics."""
        for test in self.test_cases:
            start_time = datetime.now()

            try:
                result = self.agent_fn(test.input, max_steps=test.max_steps)
                success = self._check_success(result, test.expected_output)

                latency = (datetime.now() - start_time).total_seconds() * 1000

                eval_result = EvalResult(
                    test_id=test.id,
                    success=success,
                    actual_output=result["output"],
                    steps_taken=result["steps"],
                    cost_usd=result["cost"],
                    latency_ms=latency,
                    # Record a reason for wrong answers too, not only for crashes
                    failure_reason=None if success else "Expected text not found in output",
                    trace=result.get("trace"),
                )
            except Exception as e:
                eval_result = EvalResult(
                    test_id=test.id,
                    success=False,
                    actual_output="",
                    steps_taken=0,
                    cost_usd=0,
                    latency_ms=0,
                    failure_reason=str(e),
                )

            self.results.append(eval_result)

        return self._compute_metrics()

    def _check_success(self, result: Dict, expected: str) -> bool:
        """Check if output matches expected (implement your logic)."""
        return expected.lower() in result["output"].lower()

    def _compute_metrics(self) -> Dict:
        """Aggregate metrics across all tests."""
        total = len(self.results)
        successful = sum(1 for r in self.results if r.success)

        return {
            "success_rate": successful / total if total > 0 else 0,
            "total_tests": total,
            "total_cost_usd": sum(r.cost_usd for r in self.results),
            "avg_latency_ms": sum(r.latency_ms for r in self.results) / total if total > 0 else 0,
            "failure_reasons": [r.failure_reason for r in self.results if not r.success],
        }

    def export_report(self, filename: str):
        """Export detailed report as JSON."""
        report = {
            "metrics": self._compute_metrics(),
            "results": [
                {
                    "test_id": r.test_id,
                    "success": r.success,
                    "steps": r.steps_taken,
                    "cost": r.cost_usd,
                    "latency_ms": r.latency_ms,
                    "failure_reason": r.failure_reason,
                }
                for r in self.results
            ],
        }

        with open(filename, "w") as f:
            json.dump(report, f, indent=2)


# Usage
# my_agent_function is your own agent: it takes (input, max_steps) and returns
# a dict with "output", "steps", "cost", and optionally "trace".
test_cases = [
    TestCase(
        id="research-basic",
        input="Research recent AI agent frameworks",
        expected_output="langchain",
    ),
    TestCase(
        id="research-citation",
        input="Find papers on agent evaluation",
        expected_output="citation",
    ),
]

evaluator = AgentEvaluator(agent_fn=my_agent_function, test_cases=test_cases)
metrics = evaluator.run_evaluation()
evaluator.export_report("eval_results.json")

print(f"Success rate: {metrics['success_rate']:.2%}")
print(f"Total cost: ${metrics['total_cost_usd']:.4f}")
```

**Safety Patterns**:

```python
# Dangerous tool approval gate
import json

DANGEROUS_TOOLS = ["delete_file", "send_email", "make_payment", "publish_content"]

# Maps tool names to callables; populate with your own tools
tools_registry = {}


def execute_tool_with_approval(tool_name: str, args: dict):
    """Execute tool with human-in-the-loop for dangerous actions."""
    if tool_name in DANGEROUS_TOOLS:
        print("\n⚠️ APPROVAL REQUIRED")
        print(f"Tool: {tool_name}")
        print(f"Args: {json.dumps(args, indent=2)}")

        approval = input("\nApprove this action? (yes/no): ")
        if approval.lower() != "yes":
            return {"status": "rejected", "reason": "User denied approval"}

    # Execute tool
    return tools_registry[tool_name](**args)
```

**Deliverable**: Eval suite with fixed test set, success rate tracking, and cost/latency metrics.

Common Patterns


When to Use Agents vs. Workflows


Use Agent When:
  • Task requires dynamic tool selection
  • Steps depend on runtime information
  • Need to handle unexpected situations
  • Task involves exploration or research
Use Workflow When:
  • Steps are predictable
  • Process is well-defined
  • Speed and cost matter more than flexibility
  • You need guarantees about execution path

Context Management


python
# Simple context compaction strategy
class ContextManager:
    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, message: dict):
        """Add message and compact if needed."""
        self.messages.append(message)

        # Estimate tokens (rough: 1 token ≈ 4 chars)
        total_chars = sum(len(str(m)) for m in self.messages)
        estimated_tokens = total_chars // 4

        if estimated_tokens > self.max_tokens:
            self._compact()

    def _compact(self):
        """Keep system messages and recent history; summarize the rest."""
        system = [m for m in self.messages if m["role"] == "system"]
        others = [m for m in self.messages if m["role"] != "system"]
        recent = others[-10:]  # Keep last 10 non-system messages
        dropped = len(others) - len(recent)

        # Summarize middle messages (in production, use LLM)
        if dropped > 0:
            summary = {
                "role": "system",
                "content": f"[Previous conversation summarized: {dropped} messages]",
            }
            self.messages = system + [summary] + recent

Troubleshooting

Agent Loops Forever

Cause: No max_steps limit or unclear stopping criteria.
Solution:
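python
def run_agent_with_limits(agent, task: str, max_steps: int = 10, max_cost: float = 1.0):
    """Run the agent until it finishes or hits the step or cost limit."""
    total_cost = 0.0
    last_result = None  # most recent step result, returned as the partial result

    for step in range(max_steps):
        if total_cost > max_cost:
            return {"error": "Cost limit exceeded", "partial_result": last_result}

        # Run agent step; `agent` is assumed to be already set up for `task`
        # and to return an object with `cost` and `is_final` attributes
        last_result = agent.step()
        total_cost += last_result.cost

        if last_result.is_final:
            return last_result

    return {"error": "Max steps exceeded", "partial_result": last_result}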

Tools Return Empty/Error Frequently

Cause: Input validation issues or tool design mismatch.
Solution:
  • Add input schema validation
  • Provide clear tool descriptions
  • Show examples in tool metadata
  • Add retry logic with backoff
python
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str = Field(..., min_length=3, description="Search query, at least 3 characters")
    max_results: int = Field(5, ge=1, le=20, description="Number of results, 1-20")

def search_tool(params: SearchInput) -> dict:
    """Type-safe search tool."""
    # Pydantic validated the fields when SearchInput was constructed,
    # so params.query and params.max_results are safe to use here
    return {"results": [...]}
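The "retry logic with backoff" item from the list above might look like the following sketch. `call_with_retry` and `flaky_tool` are hypothetical names, and the exponential-delay-plus-jitter schedule is one common choice rather than something the source prescribes.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T],
                    max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on failure with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # in real code, catch only transient error types
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable")


# Usage: a hypothetical tool that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_tool() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("transient failure")
    return "ok"


delays: List[float] = []  # record delays instead of actually sleeping
result = call_with_retry(flaky_tool, max_attempts=3, base_delay=0.5,
                         sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Validation errors are usually not worth retrying, because the same input will fail again.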

Agent Hallucinates Citations

Cause: No grounding mechanism, so the agent invents sources.
Solution:
  • Return citations with every retrieval
  • Use structured output for citations
  • Post-process to verify citation validity
python
import re
from typing import List

def verify_citations(text: str, sources: List[str]) -> bool:
    """Check if all citations in text exist in sources."""
    # Citations look like [0], [1], ... and are treated as 0-based indexes into sources
    cited = re.findall(r'\[(\d+)\]', text)
    max_source_idx = len(sources) - 1

    for citation in cited:
        if int(citation) > max_source_idx:
            return False

    return True
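For the "structured output for citations" item, one possible shape is sketched below. `Citation`, `GroundedAnswer`, and `find_unsupported` are hypothetical names, and the check assumes the agent quotes its sources verbatim, which is a strong assumption that a real system may need to relax.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Citation:
    source_id: str  # id of a retrieved document
    quote: str      # passage the claim relies on (assumed verbatim)


@dataclass
class GroundedAnswer:
    text: str
    citations: List[Citation]


def find_unsupported(answer: GroundedAnswer,
                     retrieved: Dict[str, str]) -> List[Citation]:
    """Return citations whose source is unknown or whose quote is not in it."""
    unsupported = []
    for citation in answer.citations:
        source_text = retrieved.get(citation.source_id)
        if source_text is None or citation.quote not in source_text:
            unsupported.append(citation)
    return unsupported


retrieved_docs = {"doc1": "ReAct interleaves reasoning and acting."}
answer = GroundedAnswer(
    text="ReAct interleaves reasoning and acting, and it won an award.",
    citations=[
        Citation("doc1", "interleaves reasoning and acting"),
        Citation("doc9", "won an award"),  # invented source
    ],
)
bad_citations = find_unsupported(answer, retrieved_docs)
```

Checking the quoted passage, not just the source id, catches the case where a real source is cited for a claim it never makes.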

Related Resources

Official Documentation:
Key Papers:
  • ReAct: Synergizing Reasoning and Acting in Language Models
  • WebArena: A Realistic Web Environment for Building Autonomous Agents
  • ToolBench: Tool Learning with Foundation Models
Recommended Open Source Projects (curated in repo):
  • GPT Researcher, STORM, Khoj (RAG/research)
  • learn-claude-code, claw0, OpenClaw (agent harnesses)
  • browser-use (browser agents)
  • mem0, Letta (memory systems)

Best Practices

  1. Start Simple: Build minimal loop before adding frameworks
  2. Add Safety Early: Approval gates, logging, cost limits from day one
  3. Evaluate Continuously: Fixed test set, track regressions
  4. Study Harnesses: Learn from Claude Code, OpenClaw, not just toy examples
  5. Prefer Skills Over Prompts: Package reusable knowledge formally
  6. Use MCP for Integration: Standard protocol beats custom tool wrappers
  7. Log Everything: Traces are essential for debugging agent failures
  8. Human-in-Loop for Risk: Never auto-approve delete/send/publish
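As an illustration of the "Log Everything" practice, a trace can be as simple as one JSON object per line. The sketch below is an assumption about what such a trace might contain; `log_event` and its field names are hypothetical, and an in-memory buffer stands in for a log file.

```python
import io
import json
import time
from typing import Any, TextIO


def log_event(stream: TextIO, run_id: str, step: int, kind: str, payload: Any) -> None:
    """Append one trace event as a single JSON line (JSONL)."""
    record = {
        "ts": time.time(),
        "run_id": run_id,
        "step": step,
        "kind": kind,  # e.g. "tool_call", "tool_result", "approval"
        "payload": payload,
    }
    # default=str keeps logging from crashing on non-JSON-serializable values.
    stream.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")


# Usage: in practice the stream would be a file opened in append mode.
buffer = io.StringIO()
log_event(buffer, "run-001", 0, "tool_call",
          {"tool": "search", "args": {"query": "MCP"}})
log_event(buffer, "run-001", 0, "tool_result", {"status": "ok", "n_results": 3})

events = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Because every line is self-contained, a failed run can be replayed or filtered by `run_id` with ordinary text tools.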

When to Use This Skill

An AI coding agent should use this skill when:
  • User asks "how do I learn AI agents"
  • User wants structured agent learning path
  • User needs modern agent architecture guidance
  • User asks about specific stage (tool use, RAG, multi-agent, evaluation)
  • User wants curated agent resources or project recommendations
  • User needs code examples for agent loops, tools, or harnesses