agent-orchestration-improve-agent
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseAgent Performance Optimization Workflow
Agent性能优化工作流
Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.
[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]
通过性能分析、提示工程和持续迭代,系统性地改进现有Agent。
[拓展思考:Agent优化需要结合性能指标、用户反馈分析和先进提示工程技术的数据驱动方法。成功取决于系统性评估、针对性改进以及具备回滚能力的严格测试,以保障生产环境安全。]
Use this skill when
适用场景
- Improving an existing agent's performance or reliability
- Analyzing failure modes, prompt quality, or tool usage
- Running structured A/B tests or evaluation suites
- Designing iterative optimization workflows for agents
- 提升现有Agent的性能或可靠性
- 分析故障模式、提示质量或工具使用情况
- 运行结构化A/B测试或评估套件
- 设计Agent的迭代优化工作流
Do not use this skill when
不适用场景
- You are building a brand-new agent from scratch
- There are no metrics, feedback, or test cases available
- The task is unrelated to agent performance or prompt quality
- 从零开始构建全新Agent
- 没有可用的指标、反馈或测试用例
- 任务与Agent性能或提示质量无关
Instructions
操作步骤
- Establish baseline metrics and collect representative examples.
- Identify failure modes and prioritize high-impact fixes.
- Apply prompt and workflow improvements with measurable goals.
- Validate with tests and roll out changes in controlled stages.
- 建立基准指标并收集代表性示例。
- 识别故障模式并优先处理高影响的修复工作。
- 应用可衡量目标的提示和工作流改进方案。
- 通过测试验证并分阶段受控推出变更。
Safety
安全注意事项
- Avoid deploying prompt changes without regression testing.
- Roll back quickly if quality or safety metrics regress.
- 避免在未进行回归测试的情况下部署提示变更。
- 如果质量或安全指标出现退化,立即回滚。
Phase 1: Performance Analysis and Baseline Metrics
第一阶段:性能分析与基准指标
Comprehensive analysis of agent performance using context-manager for historical data collection.
使用context-manager进行历史数据收集,全面分析Agent性能。
1.1 Gather Performance Data
1.1 收集性能数据
Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30Collect metrics including:
- Task completion rate (successful vs failed tasks)
- Response accuracy and factual correctness
- Tool usage efficiency (correct tools, call frequency)
- Average response time and token consumption
- User satisfaction indicators (corrections, retries)
- Hallucination incidents and error patterns
Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30收集的指标包括:
- 任务完成率(成功任务vs失败任务)
- 响应准确性和事实正确性
- 工具使用效率(正确工具、调用频率)
- 平均响应时间和Token消耗
- 用户满意度指标(修正、重试)
- 幻觉事件和错误模式
1.2 User Feedback Pattern Analysis
1.2 用户反馈模式分析
Identify recurring patterns in user interactions:
- Correction patterns: Where users consistently modify outputs
- Clarification requests: Common areas of ambiguity
- Task abandonment: Points where users give up
- Follow-up questions: Indicators of incomplete responses
- Positive feedback: Successful patterns to preserve
识别用户交互中的重复模式:
- 修正模式:用户持续修改输出的场景
- 澄清请求:常见的模糊领域
- 任务放弃:用户终止操作的节点
- 跟进问题:响应不完整的指标
- 正面反馈:需要保留的成功模式
1.3 Failure Mode Classification
1.3 故障模式分类
Categorize failures by root cause:
- Instruction misunderstanding: Role or task confusion
- Output format errors: Structure or formatting issues
- Context loss: Long conversation degradation
- Tool misuse: Incorrect or inefficient tool selection
- Constraint violations: Safety or business rule breaches
- Edge case handling: Unusual input scenarios
按根本原因对故障进行分类:
- 指令误解:角色或任务混淆
- 输出格式错误:结构或格式问题
- 上下文丢失:长对话性能退化
- 工具误用:工具选择错误或效率低下
- 约束违反:违反安全或业务规则
- 边缘情况处理:异常输入场景
1.4 Baseline Performance Report
1.4 基准性能报告
Generate quantitative baseline metrics:
Performance Baseline:
- Task Success Rate: [X%]
- Average Corrections per Task: [Y]
- Tool Call Efficiency: [Z%]
- User Satisfaction Score: [1-10]
- Average Response Latency: [Xms]
- Token Efficiency Ratio: [X:Y]生成量化基准指标:
Performance Baseline:
- Task Success Rate: [X%]
- Average Corrections per Task: [Y]
- Tool Call Efficiency: [Z%]
- User Satisfaction Score: [1-10]
- Average Response Latency: [Xms]
- Token Efficiency Ratio: [X:Y]Phase 2: Prompt Engineering Improvements
第二阶段:提示工程改进
Apply advanced prompt optimization techniques using prompt-engineer agent.
使用prompt-engineer Agent应用先进的提示优化技术。
2.1 Chain-of-Thought Enhancement
2.1 思维链增强
Implement structured reasoning patterns:
Use: prompt-engineer
Technique: chain-of-thought-optimization- Add explicit reasoning steps: "Let's approach this step-by-step..."
- Include self-verification checkpoints: "Before proceeding, verify that..."
- Implement recursive decomposition for complex tasks
- Add reasoning trace visibility for debugging
实施结构化推理模式:
Use: prompt-engineer
Technique: chain-of-thought-optimization- 添加明确的推理步骤:"Let's approach this step-by-step..."
- 包含自我验证检查点:"Before proceeding, verify that..."
- 为复杂任务实现递归分解
- 添加推理轨迹可见性以方便调试
2.2 Few-Shot Example Optimization
2.2 少样本示例优化
Curate high-quality examples from successful interactions:
- Select diverse examples covering common use cases
- Include edge cases that previously failed
- Show both positive and negative examples with explanations
- Order examples from simple to complex
- Annotate examples with key decision points
Example structure:
Good Example:
Input: [User request]
Reasoning: [Step-by-step thought process]
Output: [Successful response]
Why this works: [Key success factors]
Bad Example:
Input: [Similar request]
Output: [Failed response]
Why this fails: [Specific issues]
Correct approach: [Fixed version]从成功交互中精选高质量示例:
- 选择多样化示例,覆盖常见用例
- 包含之前失败的边缘情况
- 展示正面和负面示例并附带解释
- 按从简单到复杂的顺序排列示例
- 为示例标注关键决策点
示例结构:
Good Example:
Input: [User request]
Reasoning: [Step-by-step thought process]
Output: [Successful response]
Why this works: [Key success factors]
Bad Example:
Input: [Similar request]
Output: [Failed response]
Why this fails: [Specific issues]
Correct approach: [Fixed version]2.3 Role Definition Refinement
2.3 角色定义细化
Strengthen agent identity and capabilities:
- Core purpose: Clear, single-sentence mission
- Expertise domains: Specific knowledge areas
- Behavioral traits: Personality and interaction style
- Tool proficiency: Available tools and when to use them
- Constraints: What the agent should NOT do
- Success criteria: How to measure task completion
强化Agent的身份和能力:
- 核心目标:清晰的单句使命
- 专业领域:特定知识范围
- 行为特征:个性和交互风格
- 工具熟练度:可用工具及使用场景
- 约束条件:Agent不应执行的操作
- 成功标准:任务完成的衡量方式
2.4 Constitutional AI Integration
2.4 宪法AI集成
Implement self-correction mechanisms:
Constitutional Principles:
1. Verify factual accuracy before responding
2. Self-check for potential biases or harmful content
3. Validate output format matches requirements
4. Ensure response completeness
5. Maintain consistency with previous responsesAdd critique-and-revise loops:
- Initial response generation
- Self-critique against principles
- Automatic revision if issues detected
- Final validation before output
实施自我校正机制:
Constitutional Principles:
1. Verify factual accuracy before responding
2. Self-check for potential biases or harmful content
3. Validate output format matches requirements
4. Ensure response completeness
5. Maintain consistency with previous responses添加批评与修订循环:
- 初始响应生成
- 根据原则进行自我批评
- 如果检测到问题则自动修订
- 输出前进行最终验证
2.5 Output Format Tuning
2.5 输出格式调优
Optimize response structure:
- Structured templates for common tasks
- Dynamic formatting based on complexity
- Progressive disclosure for detailed information
- Markdown optimization for readability
- Code block formatting with syntax highlighting
- Table and list generation for data presentation
优化响应结构:
- 结构化模板用于常见任务
- 动态格式基于复杂度调整
- 渐进式披露展示详细信息
- Markdown优化提升可读性
- 代码块格式化支持语法高亮
- 表格和列表生成用于数据展示
Phase 3: Testing and Validation
第三阶段:测试与验证
Comprehensive testing framework with A/B comparison.
包含A/B对比的全面测试框架。
3.1 Test Suite Development
3.1 测试套件开发
Create representative test scenarios:
Test Categories:
1. Golden path scenarios (common successful cases)
2. Previously failed tasks (regression testing)
3. Edge cases and corner scenarios
4. Stress tests (complex, multi-step tasks)
5. Adversarial inputs (potential breaking points)
6. Cross-domain tasks (combining capabilities)创建代表性测试场景:
Test Categories:
1. Golden path scenarios (common successful cases)
2. Previously failed tasks (regression testing)
3. Edge cases and corner scenarios
4. Stress tests (complex, multi-step tasks)
5. Adversarial inputs (potential breaking points)
6. Cross-domain tasks (combining capabilities)3.2 A/B Testing Framework
3.2 A/B测试框架
Compare original vs improved agent:
Use: parallel-test-runner
Config:
- Agent A: Original version
- Agent B: Improved version
- Test set: 100 representative tasks
- Metrics: Success rate, speed, token usage
- Evaluation: Blind human review + automated scoringStatistical significance testing:
- Minimum sample size: 100 tasks per variant
- Confidence level: 95% (p < 0.05)
- Effect size calculation (Cohen's d)
- Power analysis for future tests
对比原始版本与改进后的Agent:
Use: parallel-test-runner
Config:
- Agent A: Original version
- Agent B: Improved version
- Test set: 100 representative tasks
- Metrics: Success rate, speed, token usage
- Evaluation: Blind human review + automated scoring统计显著性测试:
- 最小样本量:每个变体100个任务
- 置信水平:95% (p < 0.05)
- 效应量计算(Cohen's d)
- 未来测试的功效分析
3.3 Evaluation Metrics
3.3 评估指标
Comprehensive scoring framework:
Task-Level Metrics:
- Completion rate (binary success/failure)
- Correctness score (0-100% accuracy)
- Efficiency score (steps taken vs optimal)
- Tool usage appropriateness
- Response relevance and completeness
Quality Metrics:
- Hallucination rate (factual errors per response)
- Consistency score (alignment with previous responses)
- Format compliance (matches specified structure)
- Safety score (constraint adherence)
- User satisfaction prediction
Performance Metrics:
- Response latency (time to first token)
- Total generation time
- Token consumption (input + output)
- Cost per task (API usage fees)
- Memory/context efficiency
综合评分框架:
任务级指标:
- 完成率(二元成功/失败)
- 正确率(0-100%准确性)
- 效率得分(实际步骤vs最优步骤)
- 工具使用恰当性
- 响应相关性和完整性
质量指标:
- 幻觉率(每个响应中的事实错误)
- 一致性得分(与之前响应的对齐度)
- 格式合规性(符合指定结构)
- 安全得分(遵守约束条件)
- 用户满意度预测
性能指标:
- 响应延迟(首个Token生成时间)
- 总生成时间
- Token消耗(输入+输出)
- 单任务成本(API使用费用)
- 内存/上下文效率
3.4 Human Evaluation Protocol
3.4 人工评估流程
Structured human review process:
- Blind evaluation (evaluators don't know version)
- Standardized rubric with clear criteria
- Multiple evaluators per sample (inter-rater reliability)
- Qualitative feedback collection
- Preference ranking (A vs B comparison)
结构化人工评审流程:
- 盲态评估(评审者不知道版本)
- 带有明确标准的标准化评分表
- 每个样本由多名评审者评估(评分者间信度)
- 收集定性反馈
- 偏好排名(A与B对比)
Phase 4: Version Control and Deployment
第四阶段:版本控制与部署
Safe rollout with monitoring and rollback capabilities.
具备监控和回滚能力的安全发布策略。
4.1 Version Management
4.1 版本管理
Systematic versioning strategy:
Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1
MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustmentsMaintain version history:
- Git-based prompt storage
- Changelog with improvement details
- Performance metrics per version
- Rollback procedures documented
系统化版本策略:
Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1
MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustments维护版本历史:
- 基于Git的提示存储
- 包含改进细节的变更日志
- 每个版本的性能指标
- 记录回滚流程
4.2 Staged Rollout
4.2 分阶段发布
Progressive deployment strategy:
- Alpha testing: Internal team validation (5% traffic)
- Beta testing: Selected users (20% traffic)
- Canary release: Gradual increase (20% → 50% → 100%)
- Full deployment: After success criteria met
- Monitoring period: 7-day observation window
渐进式部署策略:
- Alpha测试:内部团队验证(5%流量)
- Beta测试:精选用户测试(20%流量)
- 金丝雀发布:逐步扩大范围(20% → 50% → 100%)
- 全面部署:达到成功标准后
- 监控期:7天观察窗口
4.3 Rollback Procedures
4.3 回滚流程
Quick recovery mechanism:
Rollback Triggers:
- Success rate drops >10% from baseline
- Critical errors increase >5%
- User complaints spike
- Cost per task increases >20%
- Safety violations detected
Rollback Process:
1. Detect issue via monitoring
2. Alert team immediately
3. Switch to previous stable version
4. Analyze root cause
5. Fix and re-test before retry快速恢复机制:
Rollback Triggers:
- Success rate drops >10% from baseline
- Critical errors increase >5%
- User complaints spike
- Cost per task increases >20%
- Safety violations detected
Rollback Process:
1. Detect issue via monitoring
2. Alert team immediately
3. Switch to previous stable version
4. Analyze root cause
5. Fix and re-test before retry4.4 Continuous Monitoring
4.4 持续监控
Real-time performance tracking:
- Dashboard with key metrics
- Anomaly detection alerts
- User feedback collection
- Automated regression testing
- Weekly performance reports
实时性能跟踪:
- 关键指标仪表盘
- 异常检测警报
- 用户反馈收集
- 自动化回归测试
- 每周性能报告
Success Criteria
成功标准
Agent improvement is successful when:
- Task success rate improves by ≥15%
- User corrections decrease by ≥25%
- No increase in safety violations
- Response time remains within 10% of baseline
- Cost per task doesn't increase >5%
- Positive user feedback increases
Agent改进成功的标志:
- 任务成功率提升≥15%
- 用户修正次数减少≥25%
- 安全违规事件未增加
- 响应时间保持在基准值的10%以内
- 单任务成本增幅不超过5%
- 正面用户反馈增加
Post-Deployment Review
部署后评审
After 30 days of production use:
- Analyze accumulated performance data
- Compare against baseline and targets
- Identify new improvement opportunities
- Document lessons learned
- Plan next optimization cycle
生产环境使用30天后:
- 分析累积的性能数据
- 与基准值和目标进行对比
- 识别新的改进机会
- 记录经验教训
- 规划下一个优化周期
Continuous Improvement Cycle
持续改进循环
Establish regular improvement cadence:
- Weekly: Monitor metrics and collect feedback
- Monthly: Analyze patterns and plan improvements
- Quarterly: Major version updates with new capabilities
- Annually: Strategic review and architecture updates
Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.
建立定期改进节奏:
- 每周:监控指标并收集反馈
- 每月:分析模式并规划改进
- 每季度:重大版本更新,添加新功能
- 每年:战略评审和架构更新
请记住:Agent优化是一个迭代过程。每个周期都基于之前的经验积累,在保持稳定性和安全性的同时逐步提升性能。