performance-monitor

Performance Monitor

Purpose

Provides expertise in monitoring, benchmarking, and optimizing AI agent performance. Specializes in token usage tracking, latency analysis, cost optimization, and implementing quality evaluation metrics (evals) for AI systems.

When to Use

  • Tracking token usage and costs for AI agents
  • Measuring and optimizing agent latency
  • Implementing evaluation metrics (evals)
  • Benchmarking agent quality and accuracy
  • Optimizing agent cost efficiency
  • Building observability for AI pipelines
  • Analyzing agent conversation patterns
  • Setting up A/B testing for agents

Quick Start

Invoke this skill when:
  • Optimizing AI agent costs and token usage
  • Measuring agent latency and performance
  • Implementing evaluation frameworks
  • Building observability for AI systems
  • Benchmarking agent quality
Do NOT invoke when:
  • General application performance → use /performance-engineer
  • Infrastructure monitoring → use /sre-engineer
  • ML model training optimization → use /ml-engineer
  • Prompt design → use /prompt-engineer

Decision Framework

Optimization Goal?
├── Cost Reduction
│   ├── Token usage → Prompt optimization
│   └── API calls → Caching, batching
├── Latency
│   ├── Time to first token → Streaming
│   └── Total response time → Model selection
├── Quality
│   ├── Accuracy → Evals with ground truth
│   └── Consistency → Multiple run analysis
└── Reliability
    └── Error rates, retry patterns

Core Workflows

1. Token Usage Tracking

  1. Instrument API calls to capture usage
  2. Track input vs output tokens separately
  3. Aggregate by agent, task, user
  4. Calculate costs per operation
  5. Build dashboards for visibility
  6. Set alerts for anomalous usage
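The steps above can be sketched as a small in-process tracker. This is a minimal sketch, not a definitive implementation: the model name (`example-model`), the per-million-token prices in `PRICING`, and the agent/task names are all illustrative assumptions, and in a real system the token counts would come from the provider's API response rather than being passed in by hand.

```python
from collections import defaultdict

# Hypothetical per-1M-token prices; real prices vary by provider and model.
PRICING = {"example-model": {"input": 3.00, "output": 15.00}}


class UsageTracker:
    """Aggregates input/output tokens and cost per (agent, task) pair (steps 2-4)."""

    def __init__(self):
        self.totals = defaultdict(
            lambda: {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
        )

    def record(self, agent, task, model, input_tokens, output_tokens):
        # Track input and output tokens separately: they are usually priced differently.
        price = PRICING[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        entry = self.totals[(agent, task)]
        entry["input_tokens"] += input_tokens
        entry["output_tokens"] += output_tokens
        entry["cost_usd"] += cost
        return cost


tracker = UsageTracker()
# In practice these counts come from the API response's usage metadata.
tracker.record("support-bot", "triage", "example-model", 1200, 300)
tracker.record("support-bot", "triage", "example-model", 900, 250)
print(tracker.totals[("support-bot", "triage")])
```

The aggregated totals are what a dashboard or alerting rule (steps 5-6) would read; a threshold check on `cost_usd` per agent is enough for a first anomaly alert.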

2. Eval Framework Setup

  1. Define evaluation criteria
  2. Create test dataset with expected outputs
  3. Implement scoring functions
  4. Run automated eval pipeline
  5. Track scores over time
  6. Use for regression testing
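A minimal version of this pipeline pairs a test dataset with a scoring function and reports an aggregate score. The sketch below assumes an exact-match criterion and a hypothetical `stub_agent` standing in for a real agent call; real eval suites typically add fuzzier scorers (semantic similarity, LLM-as-judge) alongside exact match.

```python
def exact_match(expected, actual):
    """Scoring function (step 3): 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_evals(dataset, agent_fn, scorer):
    """Run the eval pipeline (step 4): score every case and return the mean."""
    scores = [scorer(case["expected"], agent_fn(case["input"])) for case in dataset]
    return sum(scores) / len(scores)


# Test dataset with expected outputs (step 2).
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def stub_agent(prompt):
    # Stand-in for a real agent call; returns canned answers for the demo.
    return {"2+2": "4", "capital of France": "paris"}[prompt]


score = run_evals(dataset, stub_agent, exact_match)
print(f"eval score: {score}")
```

Storing `score` per commit or per prompt version gives the over-time tracking and regression testing of steps 5-6.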

3. Latency Optimization

  1. Measure baseline latency
  2. Identify bottlenecks (model, network, parsing)
  3. Implement streaming where applicable
  4. Optimize prompt length
  5. Consider model size tradeoffs
  6. Add caching for repeated queries
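Steps 1 and 6 can be combined in a small sketch: a timing decorator to measure baseline latency, plus `functools.lru_cache` as a stand-in for a response cache. The `ask_model` function and its `time.sleep` body are assumptions simulating a model API call; caching raw completions this way is only safe for queries whose answers are stable.

```python
import functools
import time


def timed(fn):
    """Record wall-clock latency per call (step 1); a real setup would also
    capture time-to-first-token for streaming responses."""
    latencies = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        latencies.append(time.perf_counter() - start)
        return result

    wrapper.latencies = latencies
    return wrapper


@timed
@functools.lru_cache(maxsize=256)  # cache repeated queries (step 6)
def ask_model(prompt):
    time.sleep(0.05)  # stand-in for a model API round trip
    return f"answer to: {prompt}"


ask_model("same question")
ask_model("same question")  # cache hit: skips the simulated API call
print(ask_model.latencies)
```

The second recorded latency is near zero because the cached result is returned without touching the "API", which is the effect step 6 is after.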

Best Practices

  • Track tokens separately from API call counts
  • Implement evals before optimizing
  • Use percentiles (p50, p95, p99) not averages for latency
  • Log prompt and response for debugging
  • Set cost budgets and alerts
  • Version prompts and track performance per version
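The percentile advice is easy to see with numbers. The sketch below uses a hypothetical latency sample with one slow tail request and a simple nearest-rank percentile; the mean looks moderate while p95 exposes the tail.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]


# Illustrative latency samples in milliseconds, with one slow tail request.
latencies_ms = [120, 135, 150, 160, 180, 210, 250, 400, 900, 2500]

mean = sum(latencies_ms) / len(latencies_ms)  # 500.5 ms -- looks moderate
p50 = percentile(latencies_ms, 50)            # typical request
p95 = percentile(latencies_ms, 95)            # the tail the mean hides
print(f"mean={mean} p50={p50} p95={p95}")
```

Here p50 is 180 ms while the mean is dragged to 500.5 ms by a single 2.5 s outlier, and p95 shows the tail directly; alerting on percentiles catches regressions that averages smooth over.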

Anti-Patterns

Anti-Pattern              | Problem                 | Correct Approach
No token tracking         | Surprise costs          | Instrument all calls
Optimizing without evals  | Quality regression      | Measure before optimizing
Average-only latency      | Hides tail latency      | Use percentiles
No prompt versioning      | Can't correlate changes | Version and track
Ignoring caching          | Repeated costs          | Cache stable responses