ai-engineering


AI Engineering


Overview


Build effective agentic systems using proven patterns. Start simple, add complexity only when needed.
For specialized prompt design guidance (techniques, patterns, examples for agentic systems), see the prompt-engineering skill.

Core Principle


Find the simplest solution first. Agentic systems trade latency and cost for better task performance. Only increase complexity when simpler solutions fall short.
  1. Start with optimized single LLM calls (retrieval, in-context examples)
  2. Add workflows for predictable, multi-step tasks
  3. Use agents when flexibility and autonomous decision-making are required

When to Build an Agent


Before committing to an agent, validate that your use case truly requires agentic capabilities. Consider alternatives first—deterministic solutions are simpler, faster, and more reliable.
Use agents when workflows involve:
| Criteria | Description | Example |
|---|---|---|
| Complex decision-making | Nuanced judgment, exceptions, context-sensitive decisions | Refund approval with edge cases |
| Brittle rule systems | Rulesets that are unwieldy, costly to maintain, or error-prone | Vendor security reviews |
| Unstructured data | Interpreting natural language, documents, or conversational input | Processing insurance claims |
If your use case doesn't clearly fit these criteria, a deterministic or simple LLM solution may suffice.

Agentic System Taxonomy


Understanding the spectrum of agentic capabilities helps you choose the right level of complexity for your use case.
| Level | Name | Description | Use Case |
|---|---|---|---|
| Level 0 | Core Reasoning System | LM operates in isolation, responding based on pre-trained knowledge only | Explaining concepts, general knowledge |
| Level 1 | Connected Problem-Solver | LM connects to external tools to retrieve real-time information and take actions | Answering "What's the score?", querying databases |
| Level 2 | Strategic Problem-Solver | Agent actively curates context, plans multi-step tasks, and engineers focused queries for each step | "Find coffee shops halfway between two locations" |
| Level 3 | Collaborative Multi-Agent System | Multiple specialized agents coordinate under a central manager or through peer handoffs | Product launch with research, marketing, and web dev agents |
| Level 4 | Self-Evolving System | Agents can dynamically create new tools or agents to fill capability gaps | Agent creates sentiment analysis agent when needed |
Progression guidance: Start at Level 0 or 1. Only increase levels when the current level cannot handle your use case effectively.

Prompt Engineering


Effective prompts are critical to agentic system performance. When designing or refining prompts for LLM calls, workflows, or agents, leverage the prompt-engineering skill if available. It provides specialized guidance for crafting prompts that produce reliable, high-quality outputs.

Context Engineering


Context engineering is the practice of dynamically assembling and managing information within an LLM's context window to enable stateful, intelligent agents. It represents an evolution from prompt engineering—while prompts focus on static instructions, context engineering addresses the entire payload dynamically.
Key principles:
  • Curate attention: Prevent context overload by including only relevant information for each step
  • Dynamic filtering: Transform previous outputs into focused queries for the next step
  • Progressive refinement: Each step should produce a distilled, actionable input for the next
Example: Instead of passing an entire document to summarize, extract key entities first, then retrieve only relevant context about those entities.
For comprehensive guidance on sessions, memory, and context management, see references/context-engineering.md.
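The extract-then-retrieve example above can be sketched as a two-step pipeline. `call_llm` is a hypothetical stand-in for any chat-completion call, stubbed here so the pipeline shape is visible:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: a real system would call a model API here. The stub fakes
    # entity extraction so the progressive-refinement flow is observable.
    if prompt.startswith("Extract"):
        return "Acme Corp, 2024-03-01"
    return f"[summary of context for: {prompt.splitlines()[-1]}]"

def summarize_with_refinement(document: str) -> str:
    # Step 1: distill the document into key entities (curate attention)
    entities = call_llm(f"Extract the key entities from:\n{document}")
    # Step 2: the previous output becomes the focused query for the next
    # step (dynamic filtering), instead of re-sending the whole document
    return call_llm(f"Retrieve and summarize context about:\n{entities}")
```

Each step hands the next a distilled input, so the context window carries only what the current step needs.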

Agentic Problem-Solving Process


All autonomous agents operate on a continuous cyclical process. Understanding this loop is fundamental to building effective agents.
The 5-Step Loop:
  1. Get the Mission - Receive a high-level goal from user or automated trigger
  2. Scan the Scene - Gather context from available resources: instructions, session history, available tools, long-term memory
  3. Think It Through - Analyze mission against scene, devise a plan using chain-of-reasoning
  4. Take Action - Execute the first concrete step by invoking a tool or generating response
  5. Observe and Iterate - Observe the outcome, add to context/memory, loop back to step 3
This "Think, Act, Observe" cycle continues until the mission is complete or an exit condition is reached.
Code example (Think, Act, Observe with tools):
```python
import anthropic

client = anthropic.Anthropic()

def agent_loop(mission: str, max_iterations: int = 10):
    """Run the Think-Act-Observe loop until mission complete."""
    context = f"Mission: {mission}\nAvailable tools: search, read_page, finish"

    for i in range(max_iterations):
        # THINK: LLM analyzes current state and plans next action
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1024,
            messages=[{"role": "user", "content": context}],
            tools=[search_tool, read_page_tool, finish_tool]
        )

        # Extract the model's reasoning and intended action
        for block in response.content:
            if block.type == "text":
                print(f"Thought: {block.text}")
            elif block.type == "tool_use" and block.name == "search":
                # ACT: Execute the tool
                result = search(block.input["query"])
                # OBSERVE: Add result to context, loop continues
                context += f"\nObservation: {result}"
            elif block.type == "tool_use" and block.name == "read_page":
                # ACT + OBSERVE: same pattern for the second tool
                result = read_page(block.input["url"])
                context += f"\nObservation: {result}"
            elif block.type == "tool_use" and block.name == "finish":
                # EXIT: Mission complete
                return block.input["summary"]

    return "Max iterations reached"
```

Pattern Selection Guide


| Pattern | Use When | Key Benefit |
|---|---|---|
| Augmented LLM | Single task needing external data/tools | Retrieval, tools, memory |
| Prompt Chaining | Task decomposes into fixed subtasks | Trade latency for accuracy |
| Routing | Distinct categories need separate handling | Separation of concerns |
| Parallelization | Subtasks are independent OR multiple attempts needed | Speed OR confidence |
| Orchestrator-Workers | Subtasks unpredictable, input-dependent | Dynamic task breakdown |
| Evaluator-Optimizer | Clear evaluation criteria, iteration adds value | Iterative refinement |
| Autonomous Agent | Open-ended problems, unpredictable steps | Flexibility at scale |

Decision Framework


```
Is the task solvable with a single well-crafted prompt?
├─ Yes → Optimize with retrieval/examples → Done
└─ No → Are subtasks fixed and predictable?
    ├─ Yes → Use Workflow (chaining/routing/parallelization)
    └─ No → Are subtasks input-dependent?
        ├─ Yes → Use Orchestrator-Workers
        └─ No → Is the problem open-ended with unpredictable steps?
            ├─ Yes → Use Autonomous Agent
            └─ No → Reconsider approach
```

Workflow Patterns


For detailed workflow implementations with code examples, see references/workflows.md.
When to use workflows: Tasks with a predictable, multi-step structure where subtasks are fixed or input-dependent.
Quick reference:
  • Prompt Chaining - Sequential LLM calls, each processing previous output
  • Routing - Classify input and direct to specialized handler
  • Parallelization - Sectioning (independent subtasks) or Voting (multiple attempts)
  • Orchestrator-Workers - Central LLM breaks down tasks, delegates to workers, synthesizes results
  • Evaluator-Optimizer - One LLM generates, another evaluates and provides feedback in a loop
Code example (Orchestrator-Workers):
```python
# Orchestrator breaks down task
subtasks = llm(f"Break down: {task}")

# Workers execute in parallel
results = [execute(s) for s in subtasks]

# Orchestrator synthesizes
final = llm(f"Synthesize results: {results}")
```
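The Routing pattern from the quick reference can be sketched in the same minimal style. `classify` stands in for a cheap LLM classification call, and the handlers are hypothetical specialists:

```python
def classify(query: str) -> str:
    # Placeholder classifier; a real system would use a small, fast model.
    if "refund" in query.lower():
        return "billing"
    if "password" in query.lower():
        return "account"
    return "general"

# Each category gets its own specialized handler (separation of concerns).
HANDLERS = {
    "billing": lambda q: f"[billing specialist handles: {q}]",
    "account": lambda q: f"[account specialist handles: {q}]",
    "general": lambda q: f"[general assistant handles: {q}]",
}

def route(query: str) -> str:
    return HANDLERS[classify(query)](query)
```

Because each handler sees only its own category, its prompt can stay narrow and reliable.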
**Code example (Prompt Chaining - complete):**
```python
import anthropic

client = anthropic.Anthropic()

def analyze_document(text: str) -> str:
    """Complete prompt chaining: extract → summarize → recommend."""

    # STEP 1: Extract key entities
    step1 = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Extract all entities (people, orgs, dates) from:\n{text}"
        }]
    )
    entities = step1.content[0].text

    # STEP 2: Summarize using extracted entities
    step2 = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Summarize this document using these entities: {entities}\n\nDocument: {text}"
        }]
    )
    summary = step2.content[0].text

    # STEP 3: Generate recommendations based on summary
    step3 = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on this summary, provide 3 actionable recommendations:\n{summary}"
        }]
    )

    return step3.content[0].text
```


Error Handling & Guardrails


Guardrails are a layered defense. No single layer is sufficient—combine multiple specialized checks for resilient agents.
Layered Defense Pattern:
```
Input → Relevance Check → Safety Filter → Agent → Tool Safeguards → Output Validation → Response
            ↓block          ↓block                  ↓risk-rating          ↓block
```
For a complete implementation with code examples and tests, see references/agent-design.md.
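The layered defense can be sketched as a pipeline of narrow checks, where the first failing layer blocks. The check functions here are illustrative placeholders, not production filters:

```python
def relevance_check(text: str) -> bool:
    # Placeholder heuristic; a real system might use a classifier model.
    return "off-topic" not in text

def safety_filter(text: str) -> bool:
    # Placeholder heuristic; a real system might use a guard model.
    return "forbidden" not in text

def run_with_guardrails(user_input: str, agent) -> str:
    # Input layers: block at the first failing check.
    for check, reason in [(relevance_check, "irrelevant"),
                          (safety_filter, "unsafe input")]:
        if not check(user_input):
            return f"blocked: {reason}"
    output = agent(user_input)
    # Output validation is its own layer: never trust the agent's output blindly.
    if not safety_filter(output):
        return "blocked: unsafe output"
    return output
```

No single check is load-bearing; each layer stays small enough to test in isolation.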

Agent Design


For comprehensive agent design patterns, characteristics, and best practices, see references/agent-design.md.
Core agent characteristics:
  1. Explicit Role & Responsibility - Clearly defined mandate
  2. Single-Purpose Focus - Narrow scope, high performance
  3. Minimal, Purpose-Built Tooling - Only necessary tools
  4. Deterministic Orchestration - Clear execution structure
  5. Cooperation & Delegation - Structured interaction
  6. Self-Constraint & Guardrails - Prevents scope creep
  7. State Awareness - Session memory for tasks
  8. Long-Term Memory - Curated, retrievable knowledge
  9. Observability - Inspectable decisions and outcomes
  10. Failure Awareness - Graceful recovery
Key topics:
  • Autonomous Agents and the Run Loop - The "Think, Act, Observe" cycle with exit conditions
  • Guardrails - Layered defense: relevance classifiers, safety filters, PII filters, tool safeguards
  • Multi-Agent Patterns - Manager (agents as tools), Decentralized (handoffs), Sequential, Iterative Refinement
  • Real-World Examples - Customer support agents, coding agents with test verification
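The Manager multi-agent pattern (agents as tools) can be sketched as a coordinator that calls specialist agents and synthesizes their results. The specialists and the hard-coded plan are hypothetical stubs; a real manager would have an LLM produce the plan from the goal:

```python
def research_agent(task: str) -> str:
    # Hypothetical specialist stub.
    return f"[research findings for: {task}]"

def marketing_agent(task: str) -> str:
    # Hypothetical specialist stub.
    return f"[marketing copy for: {task}]"

SPECIALISTS = {"research": research_agent, "marketing": marketing_agent}

def manager(goal: str, plan: list[tuple[str, str]]) -> str:
    # `plan` pairs each subtask with the specialist that should handle it.
    results = [SPECIALISTS[name](subtask) for name, subtask in plan]
    # The manager synthesizes specialist outputs into one deliverable.
    return f"Goal: {goal}\n" + "\n".join(results)
```

Treating specialists as callables keeps each one single-purpose while the manager owns orchestration.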

Agent-Computer Interface (ACI)


Tool design matters as much as prompt engineering. For comprehensive tool design patterns, see references/aci.md.
Core principles:
  • Give tokens to think - Don't force the model into corners
  • Keep formats natural - Match patterns from training data
  • Minimize overhead - Avoid line counting, escape sequences
  • Publish tasks, not APIs - Tools should encapsulate user-facing actions
Key patterns:
  • Tool Types - Information Retrieval, Action/Execution, System/API Integration, Human-in-the-Loop
  • Output Design - Return references for large data, descriptive error messages for recovery
  • Input Validation - Schema validation for runtime checks and LLM guidance
  • Documentation - Clear descriptions, examples, edge cases, parameter constraints
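Putting these principles together, a tool definition might look like the following sketch (Anthropic-style `input_schema` shown; the `search` tool, its parameters, and its constraints are hypothetical examples):

```python
# A tool definition with a clear description, documented parameters,
# and schema constraints that guide both runtime validation and the LLM.
search_tool = {
    "name": "search",
    "description": (
        "Search the product knowledge base. Use this before answering any "
        "factual question about products. Returns up to `max_results` hits."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural-language query, e.g. 'return policy'",
            },
            "max_results": {
                "type": "integer",
                "minimum": 1,
                "maximum": 10,
                "description": "Cap on the number of results (default 5)",
            },
        },
        "required": ["query"],
    },
}
```

The description publishes a task ("search the knowledge base, use it before answering") rather than an API surface, and the schema bounds keep malformed calls out.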

Model Context Protocol (MCP)


MCP is an open standard for connecting AI applications to external tools and data sources. For comprehensive coverage, see references/mcp.md.
What it solves: The "N×M integration problem" - without a standard, every model-tool pairing requires custom connectors.
Core architecture:
  • Host - Manages UX, orchestrates tools, enforces security
  • Client - Maintains server connections, manages sessions
  • Server - Advertises tools, executes commands, handles governance
Key capabilities:
  • Tools - Standardized function definitions with JSON Schema
  • Resources - Static data access (validate trusted sources only)
  • Prompts - Reusable prompt templates (use rarely - security risk)
  • Sampling - Server can request LLM completion from client
  • Elicitation - Server can request user input via client UI
When to use MCP:
  • Multi-environment deployments
  • Sharing tools across applications
  • Dynamic tool discovery needs
  • Ecosystem participation
Security considerations:
  • Dynamic Capability Injection, Tool Shadowing, Confused Deputy
  • Requires multi-layered defense: HIL → API Gateway → SDK Allowlists → Schema Validation
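The N×M claim is easy to make concrete: without a standard, connector count grows multiplicatively with models and tools; with MCP each model ships one client and each tool one server, so it grows additively:

```python
def connectors_without_standard(n_models: int, m_tools: int) -> int:
    # Every model-tool pairing needs its own custom connector.
    return n_models * m_tools

def connectors_with_mcp(n_models: int, m_tools: int) -> int:
    # Each model implements one MCP client; each tool one MCP server.
    return n_models + m_tools
```

At 5 models and 20 tools that is 100 custom connectors versus 25 protocol implementations, and the gap widens as either side grows.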

Implementation Guidance


For practical implementation guidance including model selection, task decomposition, and debugging, see references/implementation.md.
Quick start:
```python
# Single call with retrieval
response = claude.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": query}],
    tools=[search_tool, database_tool]
)
```

**Key topics:**
- **Start Simple** - Optimize single calls first, add complexity only when needed
- **Framework Considerations** - Claude Agent SDK, Agno, CrewAI, LangChain (or direct APIs)
- **Model Selection** - Prototype with best, optimize cost/latency with smaller models
- **Task Decomposition** - Break down until each step is automatable or human-gated
- **Performance & Scalability** - Context window management, dynamic tool loading, state management
- **Debugging** - Common issues: tool usage, loops, edge cases, compounding errors

Operations & Security


For production operations, security, and agent learning patterns, see references/operations.md.
Agent Ops (GenAIOps):
  • Evaluation Strategy - Define success metrics first, use LM as Judge, metrics-driven development
  • Observability - OpenTelemetry traces for full trajectory: prompts, reasoning, tool calls, observations
  • Human Feedback Loop - Collect failures, convert to test cases, "close the loop" on error classes
Agent Identity & Security:
  • Agent as Principal - Distinct from users and service accounts, requires verifiable identity with least privilege
  • Security Layers - Deterministic guardrails (rules) + Reasoning-based defenses (guard models)
  • Tool Security Threats - Dynamic Capability Injection, Tool Shadowing, Confused Deputy, Malicious Definitions
Multi-Layered Defense:
Human-in-the-Loop → API Gateway → SDK Allowlists → Schema Validation → Secure Design
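The "close the loop" idea can be sketched as failures flowing into a regression suite, so every fixed error class stays fixed. The data shapes and the `agent` callable are illustrative:

```python
# A logged production failure: what came in, what should have happened,
# and what the agent actually did.
failures = [
    {"input": "refund for order #42", "expected": "escalate", "got": "auto-approve"},
]

def to_test_case(failure: dict) -> dict:
    # Each failure becomes a permanent regression test case.
    return {"input": failure["input"], "expected": failure["expected"]}

def run_regression(agent, test_cases: list) -> float:
    # Fraction of previously failing inputs the agent now handles correctly.
    passed = sum(1 for tc in test_cases if agent(tc["input"]) == tc["expected"])
    return passed / len(test_cases)
```

Run the suite on every prompt or model change; a drop below 1.0 means a previously closed error class has reopened.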

Quality & Evaluation


For comprehensive agent quality frameworks, evaluation strategies, and observability practices, see references/quality-evaluation.md.
Four Pillars of Agent Quality:
  • Effectiveness - Goal completion, accuracy, instruction following
  • Efficiency - Latency, cost per interaction, token usage
  • Robustness - Edge case handling, error recovery, consistency
  • Safety - Guardrails, content filtering, policy compliance
Evaluation Hierarchy:
  • End-to-End (Black Box) - Measure final outputs against golden dataset
  • Trajectory (Glass Box) - Inspect intermediate steps, tool calls, reasoning
Evaluators:
  • Automated Metrics - Exact match, similarity scores, rule-based checks
  • LLM-as-a-Judge - Use powerful model to assess against rubric
  • Agent-as-a-Judge - Specialized evaluator agent critiques outputs
  • Human-in-the-Loop - Authoritative feedback for edge cases
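An LLM-as-a-Judge check can be sketched as below. `judge_llm` is a hypothetical stand-in for a strong model prompted to return a JSON verdict against the rubric:

```python
import json

RUBRIC = "Score 1-5 for accuracy and completeness; pass only if both are >= 4."

def judge_llm(prompt: str) -> str:
    # Placeholder: a real judge would be a strong model emitting structured JSON.
    return '{"accuracy": 5, "completeness": 4}'

def passes_rubric(output: str) -> bool:
    # Structured verdicts make the judge's decision auditable and thresholdable.
    verdict = json.loads(judge_llm(f"{RUBRIC}\n\nOutput to grade:\n{output}"))
    return verdict["accuracy"] >= 4 and verdict["completeness"] >= 4
```

Keeping the rubric explicit in the prompt (rather than implied) is what makes judge scores comparable across runs.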

Resources


  • Workflows Reference - Detailed workflow patterns with code examples
  • Context Engineering - Sessions, memory, and context management
  • Agent Design - Agent characteristics, ACI, guardrails, multi-agent patterns
  • Implementation Guide - Practical implementation guidance and debugging
  • Operations & Security - Production operations, security, and agent learning
  • Quality & Evaluation - Agent quality frameworks, evaluation strategies, observability
  • ACI Guide - Agent-Computer Interface deep dive with tool design patterns
  • MCP Guide - Model Context Protocol for tool interoperability
  • Examples - Real-world implementations and case studies