performance-monitor

Performance Monitor

Purpose

Provides expertise in monitoring, benchmarking, and optimizing AI agent performance. Specializes in token usage tracking, latency analysis, cost optimization, and implementing quality evaluation metrics (evals) for AI systems.

When to Use

  • Tracking token usage and costs for AI agents
  • Measuring and optimizing agent latency
  • Implementing evaluation metrics (evals)
  • Benchmarking agent quality and accuracy
  • Optimizing agent cost efficiency
  • Building observability for AI pipelines
  • Analyzing agent conversation patterns
  • Setting up A/B testing for agents

Quick Start

Invoke this skill when:
  • Optimizing AI agent costs and token usage
  • Measuring agent latency and performance
  • Implementing evaluation frameworks
  • Building observability for AI systems
  • Benchmarking agent quality
Do NOT invoke when:
  • General application performance → use /performance-engineer
  • Infrastructure monitoring → use /sre-engineer
  • ML model training optimization → use /ml-engineer
  • Prompt design → use /prompt-engineer

Decision Framework

Optimization Goal?
├── Cost Reduction
│   ├── Token usage → Prompt optimization
│   └── API calls → Caching, batching
├── Latency
│   ├── Time to first token → Streaming
│   └── Total response time → Model selection
├── Quality
│   ├── Accuracy → Evals with ground truth
│   └── Consistency → Multiple run analysis
└── Reliability
    └── Error rates, retry patterns

Core Workflows

1. Token Usage Tracking

  1. Instrument API calls to capture usage
  2. Track input vs output tokens separately
  3. Aggregate by agent, task, user
  4. Calculate costs per operation
  5. Build dashboards for visibility
  6. Set alerts for anomalous usage
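The steps above can be sketched as a small in-process tracker. This is a minimal sketch, not a definitive implementation: the model name (`example-model`), the per-million-token prices in `PRICING`, and the agent/task names are all illustrative assumptions, and in a real system the token counts would come from the provider's API response rather than being passed in by hand.

```python
from collections import defaultdict

# Hypothetical per-1M-token prices; real prices vary by provider and model.
PRICING = {"example-model": {"input": 3.00, "output": 15.00}}


class UsageTracker:
    """Aggregates input/output tokens and cost per (agent, task) pair (steps 2-4)."""

    def __init__(self):
        self.totals = defaultdict(
            lambda: {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
        )

    def record(self, agent, task, model, input_tokens, output_tokens):
        # Track input and output tokens separately: they are usually priced differently.
        price = PRICING[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        entry = self.totals[(agent, task)]
        entry["input_tokens"] += input_tokens
        entry["output_tokens"] += output_tokens
        entry["cost_usd"] += cost
        return cost


tracker = UsageTracker()
# In practice these counts come from the API response's usage metadata.
tracker.record("support-bot", "triage", "example-model", 1200, 300)
tracker.record("support-bot", "triage", "example-model", 900, 250)
print(tracker.totals[("support-bot", "triage")])
```

The aggregated totals are what a dashboard or alerting rule (steps 5-6) would read; a threshold check on `cost_usd` per agent is enough for a first anomaly alert.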

2. Eval Framework Setup

  1. Define evaluation criteria
  2. Create test dataset with expected outputs
  3. Implement scoring functions
  4. Run automated eval pipeline
  5. Track scores over time
  6. Use for regression testing
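A minimal version of this pipeline pairs a test dataset with a scoring function and reports an aggregate score. The sketch below assumes an exact-match criterion and a hypothetical `stub_agent` standing in for a real agent call; real eval suites typically add fuzzier scorers (semantic similarity, LLM-as-judge) alongside exact match.

```python
def exact_match(expected, actual):
    """Scoring function (step 3): 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_evals(dataset, agent_fn, scorer):
    """Run the eval pipeline (step 4): score every case and return the mean."""
    scores = [scorer(case["expected"], agent_fn(case["input"])) for case in dataset]
    return sum(scores) / len(scores)


# Test dataset with expected outputs (step 2).
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def stub_agent(prompt):
    # Stand-in for a real agent call; returns canned answers for the demo.
    return {"2+2": "4", "capital of France": "paris"}[prompt]


score = run_evals(dataset, stub_agent, exact_match)
print(f"eval score: {score}")
```

Storing `score` per commit or per prompt version gives the over-time tracking and regression testing of steps 5-6.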

3. Latency Optimization

  1. Measure baseline latency
  2. Identify bottlenecks (model, network, parsing)
  3. Implement streaming where applicable
  4. Optimize prompt length
  5. Consider model size tradeoffs
  6. Add caching for repeated queries
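Steps 1 and 6 can be combined in a small sketch: a timing decorator to measure baseline latency, plus `functools.lru_cache` as a stand-in for a response cache. The `ask_model` function and its `time.sleep` body are assumptions simulating a model API call; caching raw completions this way is only safe for queries whose answers are stable.

```python
import functools
import time


def timed(fn):
    """Record wall-clock latency per call (step 1); a real setup would also
    capture time-to-first-token for streaming responses."""
    latencies = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        latencies.append(time.perf_counter() - start)
        return result

    wrapper.latencies = latencies
    return wrapper


@timed
@functools.lru_cache(maxsize=256)  # cache repeated queries (step 6)
def ask_model(prompt):
    time.sleep(0.05)  # stand-in for a model API round trip
    return f"answer to: {prompt}"


ask_model("same question")
ask_model("same question")  # cache hit: skips the simulated API call
print(ask_model.latencies)
```

The second recorded latency is near zero because the cached result is returned without touching the "API", which is the effect step 6 is after.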

Best Practices

  • Track tokens separately from API call counts
  • Implement evals before optimizing
  • Use percentiles (p50, p95, p99) not averages for latency
  • Log prompt and response for debugging
  • Set cost budgets and alerts
  • Version prompts and track performance per version
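The percentile advice is easy to see with numbers. The sketch below uses a hypothetical latency sample with one slow tail request and a simple nearest-rank percentile; the mean looks moderate while p95 exposes the tail.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]


# Illustrative latency samples in milliseconds, with one slow tail request.
latencies_ms = [120, 135, 150, 160, 180, 210, 250, 400, 900, 2500]

mean = sum(latencies_ms) / len(latencies_ms)  # 500.5 ms -- looks moderate
p50 = percentile(latencies_ms, 50)            # typical request
p95 = percentile(latencies_ms, 95)            # the tail the mean hides
print(f"mean={mean} p50={p50} p95={p95}")
```

Here p50 is 180 ms while the mean is dragged to 500.5 ms by a single 2.5 s outlier, and p95 shows the tail directly; alerting on percentiles catches regressions that averages smooth over.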

Anti-Patterns

Anti-Pattern              | Problem                 | Correct Approach
No token tracking         | Surprise costs          | Instrument all calls
Optimizing without evals  | Quality regression      | Measure before optimizing
Average-only latency      | Hides tail latency      | Use percentiles
No prompt versioning      | Can't correlate changes | Version and track
Ignoring caching          | Repeated costs          | Cache stable responses