agent-orchestration-improve-agent

Agent Performance Optimization Workflow

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.
[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]

Use this skill when

  • Improving an existing agent's performance or reliability
  • Analyzing failure modes, prompt quality, or tool usage
  • Running structured A/B tests or evaluation suites
  • Designing iterative optimization workflows for agents

Do not use this skill when

  • You are building a brand-new agent from scratch
  • There are no metrics, feedback, or test cases available
  • The task is unrelated to agent performance or prompt quality

Instructions

  1. Establish baseline metrics and collect representative examples.
  2. Identify failure modes and prioritize high-impact fixes.
  3. Apply prompt and workflow improvements with measurable goals.
  4. Validate with tests and roll out changes in controlled stages.

Safety

  • Avoid deploying prompt changes without regression testing.
  • Roll back quickly if quality or safety metrics regress.

Phase 1: Performance Analysis and Baseline Metrics

Comprehensive analysis of agent performance using context-manager for historical data collection.

1.1 Gather Performance Data

Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30
Collect metrics including:
  • Task completion rate (successful vs failed tasks)
  • Response accuracy and factual correctness
  • Tool usage efficiency (correct tools, call frequency)
  • Average response time and token consumption
  • User satisfaction indicators (corrections, retries)
  • Hallucination incidents and error patterns
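The collected metrics can be rolled up from raw task logs into the baseline figures used later in this phase. A minimal sketch, assuming each log entry records the task outcome, retries, tokens, and latency (the `TaskLog` field names are illustrative, not part of the context-manager output format):

```python
from dataclasses import dataclass

@dataclass
class TaskLog:
    """One completed task, as recorded over the 30-day analysis window."""
    succeeded: bool
    retries: int            # user-initiated retries, a satisfaction proxy
    tokens_in: int
    tokens_out: int
    latency_ms: float

def summarize(logs: list[TaskLog]) -> dict:
    """Aggregate task logs into the Phase 1 baseline metrics."""
    n = len(logs)
    return {
        "task_success_rate": sum(t.succeeded for t in logs) / n,
        "avg_retries": sum(t.retries for t in logs) / n,
        "avg_latency_ms": sum(t.latency_ms for t in logs) / n,
        # input:output token ratio, one way to express token efficiency
        "token_ratio": sum(t.tokens_in for t in logs)
        / max(1, sum(t.tokens_out for t in logs)),
    }
```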

1.2 User Feedback Pattern Analysis

Identify recurring patterns in user interactions:
  • Correction patterns: Where users consistently modify outputs
  • Clarification requests: Common areas of ambiguity
  • Task abandonment: Points where users give up
  • Follow-up questions: Indicators of incomplete responses
  • Positive feedback: Successful patterns to preserve
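Tallying these interaction signals is a simple aggregation, and per-task rates make it easy to rank which pattern to fix first. A sketch assuming each interaction event has been tagged with one of the pattern labels above (the event schema is illustrative):

```python
from collections import Counter

PATTERNS = ("correction", "clarification", "abandonment", "follow_up", "positive")

def feedback_pattern_rates(events: list[str], total_tasks: int) -> dict[str, float]:
    """Per-task rate of each feedback pattern; unknown labels are ignored."""
    counts = Counter(e for e in events if e in PATTERNS)
    return {p: counts[p] / total_tasks for p in PATTERNS}
```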

1.3 Failure Mode Classification

Categorize failures by root cause:
  • Instruction misunderstanding: Role or task confusion
  • Output format errors: Structure or formatting issues
  • Context loss: Long conversation degradation
  • Tool misuse: Incorrect or inefficient tool selection
  • Constraint violations: Safety or business rule breaches
  • Edge case handling: Unusual input scenarios
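These categories lend themselves to a first-pass automated triage before manual review. A rough sketch, assuming each failure record carries a free-text error note; the keyword lists are illustrative and deliberately incomplete, and anything unmatched falls through to edge-case handling for a human to classify:

```python
FAILURE_CATEGORIES = {
    "instruction_misunderstanding": ["wrong task", "misread", "role confusion"],
    "output_format_error": ["invalid json", "malformed", "wrong format"],
    "context_loss": ["forgot", "contradicts earlier", "lost context"],
    "tool_misuse": ["wrong tool", "unnecessary call", "tool error"],
    "constraint_violation": ["policy", "unsafe", "disallowed"],
}

def classify_failure(note: str) -> str:
    """Map a free-text failure note to a root-cause bucket; unmatched
    notes fall into 'edge_case' for manual review."""
    lowered = note.lower()
    for category, keywords in FAILURE_CATEGORIES.items():
        if any(k in lowered for k in keywords):
            return category
    return "edge_case"
```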

1.4 Baseline Performance Report

Generate quantitative baseline metrics:
Performance Baseline:
- Task Success Rate: [X%]
- Average Corrections per Task: [Y]
- Tool Call Efficiency: [Z%]
- User Satisfaction Score: [1-10]
- Average Response Latency: [Xms]
- Token Efficiency Ratio: [X:Y]
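The report can be rendered directly from the aggregated metrics. A minimal sketch that fills in the template above (the dict keys are illustrative and would come from whatever aggregation step precedes this):

```python
def render_baseline(metrics: dict) -> str:
    """Render the Phase 1 baseline report from aggregated metrics."""
    return "\n".join([
        "Performance Baseline:",
        f"- Task Success Rate: {metrics['task_success_rate']:.0%}",
        f"- Average Corrections per Task: {metrics['avg_corrections']:.1f}",
        f"- Average Response Latency: {metrics['avg_latency_ms']:.0f}ms",
    ])
```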

Phase 2: Prompt Engineering Improvements

Apply advanced prompt optimization techniques using prompt-engineer agent.

2.1 Chain-of-Thought Enhancement

Implement structured reasoning patterns:
Use: prompt-engineer
Technique: chain-of-thought-optimization
  • Add explicit reasoning steps: "Let's approach this step-by-step..."
  • Include self-verification checkpoints: "Before proceeding, verify that..."
  • Implement recursive decomposition for complex tasks
  • Add reasoning trace visibility for debugging
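The reasoning scaffold can live as a reusable prompt fragment that gets spliced onto an agent's existing system prompt. A sketch, with illustrative wording for the steps and checkpoints:

```python
COT_SCAFFOLD = """Let's approach this step-by-step:
1. Restate the task in your own words.
2. Break it into sub-problems and solve them in order.
3. Before proceeding, verify that each intermediate result is consistent.
4. Show your reasoning trace, then give the final answer."""

def with_cot(system_prompt: str) -> str:
    """Append the chain-of-thought scaffold to an existing system prompt,
    keeping the reasoning trace visible for debugging."""
    return f"{system_prompt.rstrip()}\n\n{COT_SCAFFOLD}"
```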

2.2 Few-Shot Example Optimization

Curate high-quality examples from successful interactions:
  • Select diverse examples covering common use cases
  • Include edge cases that previously failed
  • Show both positive and negative examples with explanations
  • Order examples from simple to complex
  • Annotate examples with key decision points
Example structure:
Good Example:
Input: [User request]
Reasoning: [Step-by-step thought process]
Output: [Successful response]
Why this works: [Key success factors]

Bad Example:
Input: [Similar request]
Output: [Failed response]
Why this fails: [Specific issues]
Correct approach: [Fixed version]
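The example structure above maps naturally onto a small record type plus a renderer, so curated examples can be stored as data and ordered simple-to-complex before assembly. A sketch (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class FewShotExample:
    input: str
    output: str
    reasoning: str = ""     # step-by-step thought process, if shown
    is_positive: bool = True
    note: str = ""          # "why this works" / "why this fails" annotation

def format_examples(examples: list[FewShotExample]) -> str:
    """Render examples in the given order; callers pre-sort simple to complex."""
    blocks = []
    for ex in examples:
        label = "Good Example" if ex.is_positive else "Bad Example"
        block = [f"{label}:", f"Input: {ex.input}"]
        if ex.reasoning:
            block.append(f"Reasoning: {ex.reasoning}")
        block.append(f"Output: {ex.output}")
        if ex.note:
            key = "Why this works" if ex.is_positive else "Why this fails"
            block.append(f"{key}: {ex.note}")
        blocks.append("\n".join(block))
    return "\n\n".join(blocks)
```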

2.3 Role Definition Refinement

Strengthen agent identity and capabilities:
  • Core purpose: Clear, single-sentence mission
  • Expertise domains: Specific knowledge areas
  • Behavioral traits: Personality and interaction style
  • Tool proficiency: Available tools and when to use them
  • Constraints: What the agent should NOT do
  • Success criteria: How to measure task completion

2.4 Constitutional AI Integration

Implement self-correction mechanisms:
Constitutional Principles:
1. Verify factual accuracy before responding
2. Self-check for potential biases or harmful content
3. Validate output format matches requirements
4. Ensure response completeness
5. Maintain consistency with previous responses
Add critique-and-revise loops:
  • Initial response generation
  • Self-critique against principles
  • Automatic revision if issues detected
  • Final validation before output
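The critique-and-revise loop is a small driver around separate generate, critique, and revise calls, with a cap on revision rounds so it cannot loop forever. A sketch where `generate`, `critique`, and `revise` are hypothetical callables standing in for real model calls, not a specific API:

```python
from typing import Callable

def critique_and_revise(
    task: str,
    generate: Callable[[str], str],        # task -> initial draft
    critique: Callable[[str], list[str]],  # draft -> principle violations found
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 2,
) -> str:
    """Generate, self-critique against the constitutional principles,
    and revise until the critique comes back clean or rounds run out."""
    draft = generate(task)
    for _ in range(max_rounds):
        issues = critique(draft)
        if not issues:
            break  # final validation passed
        draft = revise(draft, issues)
    return draft
```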

2.5 Output Format Tuning

Optimize response structure:
  • Structured templates for common tasks
  • Dynamic formatting based on complexity
  • Progressive disclosure for detailed information
  • Markdown optimization for readability
  • Code block formatting with syntax highlighting
  • Table and list generation for data presentation

Phase 3: Testing and Validation

Comprehensive testing framework with A/B comparison.

3.1 Test Suite Development

Create representative test scenarios:
Test Categories:
1. Golden path scenarios (common successful cases)
2. Previously failed tasks (regression testing)
3. Edge cases and corner scenarios
4. Stress tests (complex, multi-step tasks)
5. Adversarial inputs (potential breaking points)
6. Cross-domain tasks (combining capabilities)

3.2 A/B Testing Framework

Compare original vs improved agent:
Use: parallel-test-runner
Config:
  - Agent A: Original version
  - Agent B: Improved version
  - Test set: 100 representative tasks
  - Metrics: Success rate, speed, token usage
  - Evaluation: Blind human review + automated scoring
Statistical significance testing:
  • Minimum sample size: 100 tasks per variant
  • Confidence level: 95% (p < 0.05)
  • Effect size calculation (Cohen's d)
  • Power analysis for future tests
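The significance check on success rates is a standard two-proportion z-test, which needs nothing beyond the standard library (the normal CDF can be built from `math.erf`). A self-contained sketch:

```python
import math

def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: variants A and B have equal success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variance, no evidence of a difference
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

A result clears the 95% bar when the returned p-value is below 0.05. Note that with only 100 tasks per variant this test reliably detects only fairly large differences, which is why the effect-size and power-analysis steps above matter.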

3.3 Evaluation Metrics

Comprehensive scoring framework:
Task-Level Metrics:
  • Completion rate (binary success/failure)
  • Correctness score (0-100% accuracy)
  • Efficiency score (steps taken vs optimal)
  • Tool usage appropriateness
  • Response relevance and completeness
Quality Metrics:
  • Hallucination rate (factual errors per response)
  • Consistency score (alignment with previous responses)
  • Format compliance (matches specified structure)
  • Safety score (constraint adherence)
  • User satisfaction prediction
Performance Metrics:
  • Response latency (time to first token)
  • Total generation time
  • Token consumption (input + output)
  • Cost per task (API usage fees)
  • Memory/context efficiency

3.4 Human Evaluation Protocol

Structured human review process:
  • Blind evaluation (evaluators don't know version)
  • Standardized rubric with clear criteria
  • Multiple evaluators per sample (inter-rater reliability)
  • Qualitative feedback collection
  • Preference ranking (A vs B comparison)

Phase 4: Version Control and Deployment

Safe rollout with monitoring and rollback capabilities.

4.1 Version Management

Systematic versioning strategy:
Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1

MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustments
Maintain version history:
  • Git-based prompt storage
  • Changelog with improvement details
  • Performance metrics per version
  • Rollback procedures documented
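The MAJOR/MINOR/PATCH rules can be mechanized so version bumps are never hand-edited. A sketch that parses the `agent-name-v[MAJOR].[MINOR].[PATCH]` format shown above and bumps the requested component, resetting the lower ones:

```python
def bump(version: str, part: str) -> str:
    """Bump 'agent-name-vMAJOR.MINOR.PATCH' per the versioning rules above."""
    prefix, _, nums = version.rpartition("-v")
    major, minor, patch = (int(x) for x in nums.split("."))
    if part == "major":       # significant capability changes
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":     # prompt improvements, new examples
        minor, patch = minor + 1, 0
    elif part == "patch":     # bug fixes, minor adjustments
        patch += 1
    else:
        raise ValueError(f"unknown part: {part}")
    return f"{prefix}-v{major}.{minor}.{patch}"
```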

4.2 Staged Rollout

Progressive deployment strategy:
  1. Alpha testing: Internal team validation (5% traffic)
  2. Beta testing: Selected users (20% traffic)
  3. Canary release: Gradual increase (20% → 50% → 100%)
  4. Full deployment: After success criteria met
  5. Monitoring period: 7-day observation window
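Canary traffic splits are typically implemented with deterministic hash bucketing, so a given user stays on the same variant as the percentage ramps from 20% to 50% to 100%. A minimal sketch of one common approach:

```python
import hashlib

def routes_to_new_version(user_id: str, rollout_pct: int) -> bool:
    """Deterministically assign a user to the canary from a stable hash,
    so the same user always sees the same version at a given percentage."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_pct
```

Because bucketing is stable, users admitted at 20% remain on the new version at 50%, keeping each user's experience consistent across the ramp.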

4.3 Rollback Procedures

Quick recovery mechanism:
Rollback Triggers:
- Success rate drops >10% from baseline
- Critical errors increase >5%
- User complaints spike
- Cost per task increases >20%
- Safety violations detected

Rollback Process:
1. Detect issue via monitoring
2. Alert team immediately
3. Switch to previous stable version
4. Analyze root cause
5. Fix and re-test before retry
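The numeric triggers above can be checked automatically by the monitoring step of the rollback process. A sketch, assuming rates are fractions and the success-rate and error thresholds are read as percentage points (the metric keys are illustrative; the complaint-spike trigger is omitted as it needs its own detector):

```python
def should_rollback(baseline: dict, current: dict) -> list[str]:
    """Return the rollback triggers that fired, per the thresholds above."""
    fired = []
    if current["success_rate"] < baseline["success_rate"] - 0.10:
        fired.append("success rate dropped >10% from baseline")
    if current["critical_error_rate"] > baseline["critical_error_rate"] + 0.05:
        fired.append("critical errors increased >5%")
    if current["cost_per_task"] > baseline["cost_per_task"] * 1.20:
        fired.append("cost per task increased >20%")
    if current.get("safety_violations", 0) > 0:
        fired.append("safety violations detected")
    return fired
```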

4.4 Continuous Monitoring

Real-time performance tracking:
  • Dashboard with key metrics
  • Anomaly detection alerts
  • User feedback collection
  • Automated regression testing
  • Weekly performance reports

Success Criteria

Agent improvement is successful when:
  • Task success rate improves by ≥15%
  • User corrections decrease by ≥25%
  • No increase in safety violations
  • Response time remains within 10% of baseline
  • Cost per task doesn't increase >5%
  • Positive user feedback increases
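The quantitative criteria can be checked in one place at the end of a cycle. A sketch, assuming the ≥15% and ≥25% thresholds are relative to baseline (the metric keys are illustrative; the qualitative feedback criterion is left to human review):

```python
def improvement_succeeded(base: dict, new: dict) -> bool:
    """Check the numeric success criteria above against baseline metrics."""
    return (
        new["success_rate"] >= base["success_rate"] * 1.15        # +15% or more
        and new["corrections_per_task"] <= base["corrections_per_task"] * 0.75
        and new["safety_violations"] <= base["safety_violations"]  # no increase
        and new["latency_ms"] <= base["latency_ms"] * 1.10         # within 10%
        and new["cost_per_task"] <= base["cost_per_task"] * 1.05   # at most +5%
    )
```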

Post-Deployment Review

After 30 days of production use:
  1. Analyze accumulated performance data
  2. Compare against baseline and targets
  3. Identify new improvement opportunities
  4. Document lessons learned
  5. Plan next optimization cycle

Continuous Improvement Cycle

Establish regular improvement cadence:
  • Weekly: Monitor metrics and collect feedback
  • Monthly: Analyze patterns and plan improvements
  • Quarterly: Major version updates with new capabilities
  • Annually: Strategic review and architecture updates
Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.