agent-orchestration-improve-agent

Agent Performance Optimization Workflow

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.
[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]

Use this skill when

  • Improving an existing agent's performance or reliability
  • Analyzing failure modes, prompt quality, or tool usage
  • Running structured A/B tests or evaluation suites
  • Designing iterative optimization workflows for agents

Do not use this skill when

  • You are building a brand-new agent from scratch
  • There are no metrics, feedback, or test cases available
  • The task is unrelated to agent performance or prompt quality

Instructions

  1. Establish baseline metrics and collect representative examples.
  2. Identify failure modes and prioritize high-impact fixes.
  3. Apply prompt and workflow improvements with measurable goals.
  4. Validate with tests and roll out changes in controlled stages.

Safety

  • Avoid deploying prompt changes without regression testing.
  • Roll back quickly if quality or safety metrics regress.

Phase 1: Performance Analysis and Baseline Metrics

Comprehensive analysis of agent performance using context-manager for historical data collection.

1.1 Gather Performance Data

Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30
Collect metrics including:
  • Task completion rate (successful vs failed tasks)
  • Response accuracy and factual correctness
  • Tool usage efficiency (correct tools, call frequency)
  • Average response time and token consumption
  • User satisfaction indicators (corrections, retries)
  • Hallucination incidents and error patterns
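The collected metrics can be rolled up from raw task logs into the baseline figures used later in this phase. A minimal sketch, assuming each log entry records the task outcome, retries, tokens, and latency (the `TaskLog` field names are illustrative, not part of the context-manager output format):

```python
from dataclasses import dataclass

@dataclass
class TaskLog:
    """One completed task, as recorded over the 30-day analysis window."""
    succeeded: bool
    retries: int            # user-initiated retries, a satisfaction proxy
    tokens_in: int
    tokens_out: int
    latency_ms: float

def summarize(logs: list[TaskLog]) -> dict:
    """Aggregate task logs into the Phase 1 baseline metrics."""
    n = len(logs)
    return {
        "task_success_rate": sum(t.succeeded for t in logs) / n,
        "avg_retries": sum(t.retries for t in logs) / n,
        "avg_latency_ms": sum(t.latency_ms for t in logs) / n,
        # input:output token ratio, one way to express token efficiency
        "token_ratio": sum(t.tokens_in for t in logs)
        / max(1, sum(t.tokens_out for t in logs)),
    }
```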

1.2 User Feedback Pattern Analysis

Identify recurring patterns in user interactions:
  • Correction patterns: Where users consistently modify outputs
  • Clarification requests: Common areas of ambiguity
  • Task abandonment: Points where users give up
  • Follow-up questions: Indicators of incomplete responses
  • Positive feedback: Successful patterns to preserve
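Tallying these interaction signals is a simple aggregation, and per-task rates make it easy to rank which pattern to fix first. A sketch assuming each interaction event has been tagged with one of the pattern labels above (the event schema is illustrative):

```python
from collections import Counter

PATTERNS = ("correction", "clarification", "abandonment", "follow_up", "positive")

def feedback_pattern_rates(events: list[str], total_tasks: int) -> dict[str, float]:
    """Per-task rate of each feedback pattern; unknown labels are ignored."""
    counts = Counter(e for e in events if e in PATTERNS)
    return {p: counts[p] / total_tasks for p in PATTERNS}
```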

1.3 Failure Mode Classification

Categorize failures by root cause:
  • Instruction misunderstanding: Role or task confusion
  • Output format errors: Structure or formatting issues
  • Context loss: Long conversation degradation
  • Tool misuse: Incorrect or inefficient tool selection
  • Constraint violations: Safety or business rule breaches
  • Edge case handling: Unusual input scenarios
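These categories lend themselves to a first-pass automated triage before manual review. A rough sketch, assuming each failure record carries a free-text error note; the keyword lists are illustrative and deliberately incomplete, and anything unmatched falls through to edge-case handling for a human to classify:

```python
FAILURE_CATEGORIES = {
    "instruction_misunderstanding": ["wrong task", "misread", "role confusion"],
    "output_format_error": ["invalid json", "malformed", "wrong format"],
    "context_loss": ["forgot", "contradicts earlier", "lost context"],
    "tool_misuse": ["wrong tool", "unnecessary call", "tool error"],
    "constraint_violation": ["policy", "unsafe", "disallowed"],
}

def classify_failure(note: str) -> str:
    """Map a free-text failure note to a root-cause bucket; unmatched
    notes fall into 'edge_case' for manual review."""
    lowered = note.lower()
    for category, keywords in FAILURE_CATEGORIES.items():
        if any(k in lowered for k in keywords):
            return category
    return "edge_case"
```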

1.4 Baseline Performance Report

Generate quantitative baseline metrics:
Performance Baseline:
- Task Success Rate: [X%]
- Average Corrections per Task: [Y]
- Tool Call Efficiency: [Z%]
- User Satisfaction Score: [1-10]
- Average Response Latency: [Xms]
- Token Efficiency Ratio: [X:Y]
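The report can be rendered directly from the aggregated metrics. A minimal sketch that fills in the template above (the dict keys are illustrative and would come from whatever aggregation step precedes this):

```python
def render_baseline(metrics: dict) -> str:
    """Render the Phase 1 baseline report from aggregated metrics."""
    return "\n".join([
        "Performance Baseline:",
        f"- Task Success Rate: {metrics['task_success_rate']:.0%}",
        f"- Average Corrections per Task: {metrics['avg_corrections']:.1f}",
        f"- Average Response Latency: {metrics['avg_latency_ms']:.0f}ms",
    ])
```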

Phase 2: Prompt Engineering Improvements

Apply advanced prompt optimization techniques using prompt-engineer agent.

2.1 Chain-of-Thought Enhancement

Implement structured reasoning patterns:
Use: prompt-engineer
Technique: chain-of-thought-optimization
  • Add explicit reasoning steps: "Let's approach this step-by-step..."
  • Include self-verification checkpoints: "Before proceeding, verify that..."
  • Implement recursive decomposition for complex tasks
  • Add reasoning trace visibility for debugging
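The reasoning scaffold can live as a reusable prompt fragment that gets spliced onto an agent's existing system prompt. A sketch, with illustrative wording for the steps and checkpoints:

```python
COT_SCAFFOLD = """Let's approach this step-by-step:
1. Restate the task in your own words.
2. Break it into sub-problems and solve them in order.
3. Before proceeding, verify that each intermediate result is consistent.
4. Show your reasoning trace, then give the final answer."""

def with_cot(system_prompt: str) -> str:
    """Append the chain-of-thought scaffold to an existing system prompt,
    keeping the reasoning trace visible for debugging."""
    return f"{system_prompt.rstrip()}\n\n{COT_SCAFFOLD}"
```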

2.2 Few-Shot Example Optimization

Curate high-quality examples from successful interactions:
  • Select diverse examples covering common use cases
  • Include edge cases that previously failed
  • Show both positive and negative examples with explanations
  • Order examples from simple to complex
  • Annotate examples with key decision points
Example structure:
Good Example:
Input: [User request]
Reasoning: [Step-by-step thought process]
Output: [Successful response]
Why this works: [Key success factors]

Bad Example:
Input: [Similar request]
Output: [Failed response]
Why this fails: [Specific issues]
Correct approach: [Fixed version]
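The example structure above maps naturally onto a small record type plus a renderer, so curated examples can be stored as data and ordered simple-to-complex before assembly. A sketch (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class FewShotExample:
    input: str
    output: str
    reasoning: str = ""     # step-by-step thought process, if shown
    is_positive: bool = True
    note: str = ""          # "why this works" / "why this fails" annotation

def format_examples(examples: list[FewShotExample]) -> str:
    """Render examples in the given order; callers pre-sort simple to complex."""
    blocks = []
    for ex in examples:
        label = "Good Example" if ex.is_positive else "Bad Example"
        block = [f"{label}:", f"Input: {ex.input}"]
        if ex.reasoning:
            block.append(f"Reasoning: {ex.reasoning}")
        block.append(f"Output: {ex.output}")
        if ex.note:
            key = "Why this works" if ex.is_positive else "Why this fails"
            block.append(f"{key}: {ex.note}")
        blocks.append("\n".join(block))
    return "\n\n".join(blocks)
```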

2.3 Role Definition Refinement

Strengthen agent identity and capabilities:
  • Core purpose: Clear, single-sentence mission
  • Expertise domains: Specific knowledge areas
  • Behavioral traits: Personality and interaction style
  • Tool proficiency: Available tools and when to use them
  • Constraints: What the agent should NOT do
  • Success criteria: How to measure task completion

2.4 Constitutional AI Integration

Implement self-correction mechanisms:
Constitutional Principles:
1. Verify factual accuracy before responding
2. Self-check for potential biases or harmful content
3. Validate output format matches requirements
4. Ensure response completeness
5. Maintain consistency with previous responses
Add critique-and-revise loops:
  • Initial response generation
  • Self-critique against principles
  • Automatic revision if issues detected
  • Final validation before output
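The critique-and-revise loop is a small driver around separate generate, critique, and revise calls, with a cap on revision rounds so it cannot loop forever. A sketch where `generate`, `critique`, and `revise` are hypothetical callables standing in for real model calls, not a specific API:

```python
from typing import Callable

def critique_and_revise(
    task: str,
    generate: Callable[[str], str],        # task -> initial draft
    critique: Callable[[str], list[str]],  # draft -> principle violations found
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 2,
) -> str:
    """Generate, self-critique against the constitutional principles,
    and revise until the critique comes back clean or rounds run out."""
    draft = generate(task)
    for _ in range(max_rounds):
        issues = critique(draft)
        if not issues:
            break  # final validation passed
        draft = revise(draft, issues)
    return draft
```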

2.5 Output Format Tuning

Optimize response structure:
  • Structured templates for common tasks
  • Dynamic formatting based on complexity
  • Progressive disclosure for detailed information
  • Markdown optimization for readability
  • Code block formatting with syntax highlighting
  • Table and list generation for data presentation

Phase 3: Testing and Validation

Comprehensive testing framework with A/B comparison.

3.1 Test Suite Development

Create representative test scenarios:
Test Categories:
1. Golden path scenarios (common successful cases)
2. Previously failed tasks (regression testing)
3. Edge cases and corner scenarios
4. Stress tests (complex, multi-step tasks)
5. Adversarial inputs (potential breaking points)
6. Cross-domain tasks (combining capabilities)

3.2 A/B Testing Framework

Compare original vs improved agent:
Use: parallel-test-runner
Config:
  - Agent A: Original version
  - Agent B: Improved version
  - Test set: 100 representative tasks
  - Metrics: Success rate, speed, token usage
  - Evaluation: Blind human review + automated scoring
Statistical significance testing:
  • Minimum sample size: 100 tasks per variant
  • Confidence level: 95% (p < 0.05)
  • Effect size calculation (Cohen's d)
  • Power analysis for future tests
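The significance check on success rates is a standard two-proportion z-test, which needs nothing beyond the standard library (the normal CDF can be built from `math.erf`). A self-contained sketch:

```python
import math

def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: variants A and B have equal success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variance, no evidence of a difference
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

A result clears the 95% bar when the returned p-value is below 0.05. Note that with only 100 tasks per variant this test reliably detects only fairly large differences, which is why the effect-size and power-analysis steps above matter.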

3.3 Evaluation Metrics

Comprehensive scoring framework:
Task-Level Metrics:
  • Completion rate (binary success/failure)
  • Correctness score (0-100% accuracy)
  • Efficiency score (steps taken vs optimal)
  • Tool usage appropriateness
  • Response relevance and completeness
Quality Metrics:
  • Hallucination rate (factual errors per response)
  • Consistency score (alignment with previous responses)
  • Format compliance (matches specified structure)
  • Safety score (constraint adherence)
  • User satisfaction prediction
Performance Metrics:
  • Response latency (time to first token)
  • Total generation time
  • Token consumption (input + output)
  • Cost per task (API usage fees)
  • Memory/context efficiency

3.4 Human Evaluation Protocol

Structured human review process:
  • Blind evaluation (evaluators don't know version)
  • Standardized rubric with clear criteria
  • Multiple evaluators per sample (inter-rater reliability)
  • Qualitative feedback collection
  • Preference ranking (A vs B comparison)

Phase 4: Version Control and Deployment

Safe rollout with monitoring and rollback capabilities.

4.1 Version Management

Systematic versioning strategy:
Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1

MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustments
Maintain version history:
  • Git-based prompt storage
  • Changelog with improvement details
  • Performance metrics per version
  • Rollback procedures documented
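The MAJOR/MINOR/PATCH rules can be mechanized so version bumps are never hand-edited. A sketch that parses the `agent-name-v[MAJOR].[MINOR].[PATCH]` format shown above and bumps the requested component, resetting the lower ones:

```python
def bump(version: str, part: str) -> str:
    """Bump 'agent-name-vMAJOR.MINOR.PATCH' per the versioning rules above."""
    prefix, _, nums = version.rpartition("-v")
    major, minor, patch = (int(x) for x in nums.split("."))
    if part == "major":       # significant capability changes
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":     # prompt improvements, new examples
        minor, patch = minor + 1, 0
    elif part == "patch":     # bug fixes, minor adjustments
        patch += 1
    else:
        raise ValueError(f"unknown part: {part}")
    return f"{prefix}-v{major}.{minor}.{patch}"
```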

4.2 Staged Rollout

Progressive deployment strategy:
  1. Alpha testing: Internal team validation (5% traffic)
  2. Beta testing: Selected users (20% traffic)
  3. Canary release: Gradual increase (20% → 50% → 100%)
  4. Full deployment: After success criteria met
  5. Monitoring period: 7-day observation window
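Canary traffic splits are typically implemented with deterministic hash bucketing, so a given user stays on the same variant as the percentage ramps from 20% to 50% to 100%. A minimal sketch of one common approach:

```python
import hashlib

def routes_to_new_version(user_id: str, rollout_pct: int) -> bool:
    """Deterministically assign a user to the canary from a stable hash,
    so the same user always sees the same version at a given percentage."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_pct
```

Because bucketing is stable, users admitted at 20% remain on the new version at 50%, keeping each user's experience consistent across the ramp.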

4.3 Rollback Procedures

Quick recovery mechanism:
Rollback Triggers:
- Success rate drops >10% from baseline
- Critical errors increase >5%
- User complaints spike
- Cost per task increases >20%
- Safety violations detected

Rollback Process:
1. Detect issue via monitoring
2. Alert team immediately
3. Switch to previous stable version
4. Analyze root cause
5. Fix and re-test before retry
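The numeric triggers above can be checked automatically by the monitoring step of the rollback process. A sketch, assuming rates are fractions and the success-rate and error thresholds are read as percentage points (the metric keys are illustrative; the complaint-spike trigger is omitted as it needs its own detector):

```python
def should_rollback(baseline: dict, current: dict) -> list[str]:
    """Return the rollback triggers that fired, per the thresholds above."""
    fired = []
    if current["success_rate"] < baseline["success_rate"] - 0.10:
        fired.append("success rate dropped >10% from baseline")
    if current["critical_error_rate"] > baseline["critical_error_rate"] + 0.05:
        fired.append("critical errors increased >5%")
    if current["cost_per_task"] > baseline["cost_per_task"] * 1.20:
        fired.append("cost per task increased >20%")
    if current.get("safety_violations", 0) > 0:
        fired.append("safety violations detected")
    return fired
```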

4.4 Continuous Monitoring

Real-time performance tracking:
  • Dashboard with key metrics
  • Anomaly detection alerts
  • User feedback collection
  • Automated regression testing
  • Weekly performance reports

Success Criteria

Agent improvement is successful when:
  • Task success rate improves by ≥15%
  • User corrections decrease by ≥25%
  • No increase in safety violations
  • Response time remains within 10% of baseline
  • Cost per task doesn't increase >5%
  • Positive user feedback increases
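The quantitative criteria can be checked in one place at the end of a cycle. A sketch, assuming the ≥15% and ≥25% thresholds are relative to baseline (the metric keys are illustrative; the qualitative feedback criterion is left to human review):

```python
def improvement_succeeded(base: dict, new: dict) -> bool:
    """Check the numeric success criteria above against baseline metrics."""
    return (
        new["success_rate"] >= base["success_rate"] * 1.15        # +15% or more
        and new["corrections_per_task"] <= base["corrections_per_task"] * 0.75
        and new["safety_violations"] <= base["safety_violations"]  # no increase
        and new["latency_ms"] <= base["latency_ms"] * 1.10         # within 10%
        and new["cost_per_task"] <= base["cost_per_task"] * 1.05   # at most +5%
    )
```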

Post-Deployment Review

After 30 days of production use:
  1. Analyze accumulated performance data
  2. Compare against baseline and targets
  3. Identify new improvement opportunities
  4. Document lessons learned
  5. Plan next optimization cycle

Continuous Improvement Cycle

Establish regular improvement cadence:
  • Weekly: Monitor metrics and collect feedback
  • Monthly: Analyze patterns and plan improvements
  • Quarterly: Major version updates with new capabilities
  • Annually: Strategic review and architecture updates
Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.