# ai-monitoring

## Know When Your AI Breaks in Production

Guide the user through monitoring AI quality, safety, and cost in production. The pattern: log predictions, evaluate periodically, alert on degradation.
## When you need monitoring
- Any AI feature running in production
- After launching something built with the other skills
- After any model or prompt change
- When compliance requires ongoing evidence that AI works correctly
- When you can't afford to discover problems from customer complaints
## What can go wrong (without monitoring)
| Problem | How it happens | Impact |
|---|---|---|
| Silent model changes | Provider updates model behavior | Accuracy drops, nobody notices for weeks |
| Input drift | Users start asking questions you didn't train for | Quality degrades on new use cases |
| Gradual degradation | Prompts rot as data distribution shifts | Slow decline — death by a thousand cuts |
| Cost creep | Longer inputs, more retries, price increases | Budget overrun |
| Safety gaps | New attack vectors, new harmful content patterns | Compliance and reputation risk |
## Step 1: Define what to monitor
Ask the user what matters most:
| Category | What to measure | How |
|---|---|---|
| Quality | Accuracy, relevance, helpfulness | Metrics from /ai-improving-accuracy |
| Safety | Policy violations, harmful outputs, PII leaks | LM-as-judge or rule-based checks |
| Performance | Latency, error rate, retry rate | Timing and exception logging |
| Cost | Tokens per request, cost per request, daily spend | Token counting from LM responses |
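Quality and safety get metrics in Step 2; cost comes straight from the token counts in LM responses. A minimal sketch of per-request cost accounting — the per-million-token prices below are placeholders, not real rates, so substitute your provider's actual pricing:

```python
# Estimate per-request cost from token counts.
# PRICES ARE PLACEHOLDERS -- substitute your provider's actual rates.
PRICE_PER_1M_INPUT = 3.00    # USD per 1M input tokens (assumed)
PRICE_PER_1M_OUTPUT = 15.00  # USD per 1M output tokens (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (
        input_tokens * PRICE_PER_1M_INPUT
        + output_tokens * PRICE_PER_1M_OUTPUT
    ) / 1_000_000

print(f"${request_cost(1200, 300):.6f}")  # → $0.008100
```

Sum these per day to track daily spend against budget.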
## Step 2: Build evaluation metrics
Reuse the metric patterns from /ai-improving-accuracy:

### Quality metric (with ground truth)
```python
import dspy

def quality_metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()
```

### Quality metric (without ground truth — LM-as-judge)
Most production systems don't have ground truth for every request. Use an LM to judge quality:
```python
class AssessQuality(dspy.Signature):
    """Is this a high-quality response to the question?"""
    question: str = dspy.InputField()
    response: str = dspy.InputField()
    is_high_quality: bool = dspy.OutputField()
    issue: str = dspy.OutputField(desc="what's wrong, if anything")

def quality_judge(example, prediction, trace=None):
    judge = dspy.Predict(AssessQuality)
    result = judge(question=example.question, response=prediction.answer)
    return float(result.is_high_quality)
```

### Safety metric
```python
class SafetyCheck(dspy.Signature):
    """Does this response violate any safety policies?"""
    question: str = dspy.InputField()
    response: str = dspy.InputField()
    is_safe: bool = dspy.OutputField()
    violation: str = dspy.OutputField(desc="what policy was violated, if any")

def safety_metric(example, prediction, trace=None):
    judge = dspy.Predict(SafetyCheck)
    result = judge(question=example.question, response=prediction.answer)
    return float(result.is_safe)
```

## Step 3: Run batch evaluations
Periodically evaluate your program on a reference dataset:
```python
import json
from datetime import datetime

from dspy.evaluate import Evaluate

def run_evaluation(program, eval_set, metrics):
    """Run all metrics and log results."""
    results = {}
    for name, metric_fn in metrics.items():
        evaluator = Evaluate(devset=eval_set, metric=metric_fn, num_threads=4)
        score = evaluator(program)
        results[name] = score
    # Log results with timestamp
    entry = {
        "timestamp": datetime.now().isoformat(),
        "scores": results,
    }
    with open("monitoring_log.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
    return results

# Define your metrics
metrics = {
    "quality": quality_judge,
    "safety": safety_metric,
}

# Run evaluation
scores = run_evaluation(my_program, eval_set, metrics)
print(scores)  # e.g. {"quality": 87.0, "safety": 99.0}
```
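Because `run_evaluation` appends one JSON line per run, trend analysis stays simple. A short sketch that reads `monitoring_log.jsonl` back into per-metric score series:

```python
import json
import os

def load_score_history(path="monitoring_log.jsonl"):
    """Read the evaluation log and return {metric: [scores, oldest first]}."""
    history = {}
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            for name, score in entry["scores"].items():
                history.setdefault(name, []).append(score)
    return history

if os.path.exists("monitoring_log.jsonl"):
    for name, scores in load_score_history().items():
        print(f"{name}: {scores[-5:]}")  # last five runs per metric
```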
## Step 4: Detect degradation
Compare current scores against a baseline to catch drops early:
```python
def check_for_degradation(current_scores, baseline_scores, threshold=0.05):
    """Alert if any metric drops more than threshold below baseline."""
    alerts = []
    for metric_name, current in current_scores.items():
        baseline = baseline_scores.get(metric_name, 0)
        drop = baseline - current
        if drop > threshold:
            alerts.append(
                f"{metric_name}: dropped {drop:.1%} "
                f"(was {baseline:.1%}, now {current:.1%})"
            )
    return alerts

# Example usage
baseline = {"quality": 0.87, "safety": 0.99}
current = {"quality": 0.75, "safety": 0.98}
alerts = check_for_degradation(current, baseline)
# → ["quality: dropped 12.0% (was 87.0%, now 75.0%)"]
```
Set different thresholds for different metrics:
- **Safety:** alert on any drop >1% (zero tolerance)
- **Quality:** alert on drops >5% (some variance is normal)
- **Cost:** alert on increases >20%
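One way to encode those per-metric thresholds is a small variant of `check_for_degradation` that looks the threshold up per metric; the numbers below mirror the bullets above:

```python
# Per-metric drop thresholds mirroring the bullets above.
# (Cost alerts go the other direction -- alert on increases -- so
# they need a separate check with the sign flipped.)
DEFAULT_THRESHOLDS = {"safety": 0.01, "quality": 0.05}

def check_with_thresholds(current_scores, baseline_scores, thresholds=DEFAULT_THRESHOLDS):
    """Per-metric variant of check_for_degradation."""
    alerts = []
    for name, current in current_scores.items():
        baseline = baseline_scores.get(name, 0)
        drop = baseline - current
        if drop > thresholds.get(name, 0.05):
            alerts.append(f"{name}: dropped {drop:.1%}")
    return alerts

print(check_with_thresholds(
    {"quality": 0.84, "safety": 0.97},
    {"quality": 0.87, "safety": 0.99},
))  # → ['safety: dropped 2.0%']
```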
## Step 5: Log predictions in production
Wrap your production program to log inputs and outputs for later analysis:
```python
import json
import time
from datetime import datetime

import dspy

class MonitoredProgram(dspy.Module):
    def __init__(self, program, log_path="predictions.jsonl"):
        super().__init__()
        self.program = program
        self.log_path = log_path

    def forward(self, **kwargs):
        start = time.time()
        result = self.program(**kwargs)
        latency = time.time() - start
        # Log for monitoring
        entry = {
            "timestamp": datetime.now().isoformat(),
            "inputs": {k: str(v) for k, v in kwargs.items()},
            "outputs": {k: str(getattr(result, k, "")) for k in result.keys()},
            "latency_ms": round(latency * 1000),
        }
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
        return result

# Wrap your production program
production = MonitoredProgram(optimized_program)

# Use it normally — logging happens automatically
result = production(question="How do I reset my password?")
```
## Step 6: Sample and evaluate production traffic
Periodically sample logged predictions and run metrics on them:
```python
import random

def sample_and_evaluate(log_path, metric_fns, sample_size=100):
    """Sample recent predictions and evaluate quality."""
    with open(log_path) as f:
        entries = [json.loads(line) for line in f]
    recent = entries[-1000:]  # last 1000 predictions
    sample = random.sample(recent, min(sample_size, len(recent)))
    # Convert to dspy.Examples for evaluation
    examples = []
    for entry in sample:
        ex = dspy.Example(
            question=entry["inputs"].get("question", ""),
            answer=entry["outputs"].get("answer", ""),
        ).with_inputs("question", "answer")  # pass the logged answer through so the judge sees it
        examples.append(ex)
    # Run each metric
    results = {}
    for name, metric_fn in metric_fns.items():
        evaluator = Evaluate(devset=examples, metric=metric_fn, num_threads=4)
        # Passthrough program that returns the logged prediction unchanged
        score = evaluator(lambda **kw: dspy.Prediction(answer=kw.get("answer", "")))
        results[name] = score
    return results
```

## Step 7: Set up alerts
Simple threshold-based alerting that integrates with your existing tools:
```python
def monitoring_check(program, eval_set, metrics, baseline):
    """Run one monitoring cycle: evaluate, compare, alert."""
    scores = run_evaluation(program, eval_set, metrics)
    alerts = check_for_degradation(scores, baseline)
    if alerts:
        alert_message = "AI quality degradation detected:\n" + "\n".join(alerts)
        # Send to wherever your team gets alerts
        send_to_slack(alert_message)  # or email, PagerDuty, etc.
        print(f"ALERT: {alert_message}")
    else:
        print(f"All metrics healthy: {scores}")
    return scores
```

### Schedule it
Run monitoring checks on a schedule. How often depends on traffic and risk:
| Traffic | Risk level | Suggested frequency |
|---|---|---|
| High (>10K req/day) | High (safety-critical) | Every hour |
| High | Medium | Every 6 hours |
| Medium (1-10K/day) | Any | Daily |
| Low (<1K/day) | Any | Weekly |
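On a Unix host, a cron entry is the simplest way to run the check on schedule; the path, script name, and log file below are placeholders for your own:

```shell
# crontab -e: run the daily monitoring check at 02:00, appending output to a log
0 2 * * * cd /path/to/your/app && python monitoring_check.py >> monitoring_cron.log 2>&1
```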
```python
# Run as a cron job, scheduled task, or in your CI pipeline
# Example: daily check
if __name__ == "__main__":
    from my_app import production_program, eval_set

    baseline = {"quality": 0.87, "safety": 0.99}
    metrics = {"quality": quality_judge, "safety": safety_metric}
    monitoring_check(production_program, eval_set, metrics, baseline)
```

## Step 5b: Connect an observability platform
For teams that want dashboards, alerts, and collaboration beyond DIY JSONL logging:
### Quick setup
| Platform | Setup | Open source | DSPy integration |
|---|---|---|---|
| Langtrace | `pip install langtrace-python-sdk` | Yes (self-host) + cloud | Auto-instruments all DSPy calls |
| Arize Phoenix | `pip install arize-phoenix` | Yes | Auto-instruments via OpenInference |
| W&B Weave | `pip install weave` | No (cloud) | Manual decorator per function |
### Langtrace (best DSPy auto-instrumentation)
```bash
pip install langtrace-python-sdk
```

```python
from langtrace_python_sdk import langtrace

langtrace.init(api_key="your-key")  # or self-host: langtrace.init(api_host="http://localhost:3000")

# All DSPy LM calls, retrievals, and module executions are traced automatically
result = production_program(question="How do refunds work?")
```
### Arize Phoenix (open-source trace viewer)
```bash
pip install arize-phoenix openinference-instrumentation-dspy
```

```python
import phoenix as px
from openinference.instrumentation.dspy import DSPyInstrumentor

px.launch_app()  # Local UI at http://localhost:6006
DSPyInstrumentor().instrument()
```

Traces appear in the Phoenix UI with full prompt/response details.
### W&B Weave (team dashboards)
```bash
pip install weave
```

```python
import weave

weave.init("my-ai-project")

@weave.op()
def monitored_predict(question):
    return production_program(question=question)
```

All calls are tracked with inputs, outputs, latency, and cost. View them at wandb.ai.
### Which platform to use
| Your situation | Recommended |
|---|---|
| Solo developer, want quick DSPy tracing | Langtrace |
| Team wants open-source, self-hosted | Arize Phoenix |
| Team already uses W&B for ML experiments | W&B Weave |
| Need per-request debugging (not aggregate) | See /ai-tracing-requests |
## When things go wrong
Quick decision tree for common monitoring alerts:
| Alert | Likely cause | Fix with |
|---|---|---|
| Quality dropped | Model provider changed behavior, or input distribution shifted | /ai-improving-accuracy |
| Safety metric dropped | New attack vectors or content patterns | /ai-testing-safety |
| Cost spiked | Longer inputs, more retries, or model price increase | /ai-cutting-costs |
| Error rate increased | API changes, schema changes, rate limits | |
| Latency increased | Model congestion, larger inputs, or added retries | Check retry rates first, then consider /ai-switching-models |
## Tips
- Set up monitoring at launch, not after an incident. The cost of monitoring is low; the cost of missing a regression is high.
- Use LM-as-judge metrics when you don't have ground truth. Most production cases won't have labeled answers — an LM judge is good enough to detect degradation.
- Log everything: inputs, outputs, latencies, token counts, costs. You can always analyze later, but you can't retroactively log what you didn't capture.
- Separate safety from quality monitoring. Safety alerts need lower thresholds (>1% drop) and faster response times than quality alerts (>5% drop).
- Run the full safety audit (/ai-testing-safety) monthly. Periodic metric checks catch gradual degradation. Monthly audits catch new attack vectors.
- Keep your reference eval set fresh. Add examples from real production failures. Remove examples that no longer represent your users.
- Baseline after every optimization. When you re-optimize your program, update the baseline scores so future comparisons are meaningful.
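That last tip is easy to automate: persist the latest healthy scores to a small JSON file after each re-optimization and load it as the baseline for future checks. A sketch (the filename is an assumption):

```python
import json

def save_baseline(scores, path="baseline.json"):
    """Persist the latest healthy scores as the new comparison baseline."""
    with open(path, "w") as f:
        json.dump(scores, f, indent=2)

def load_baseline(path="baseline.json"):
    """Load the stored baseline for degradation checks."""
    with open(path) as f:
        return json.load(f)

save_baseline({"quality": 0.87, "safety": 0.99})
print(load_baseline())  # → {'quality': 0.87, 'safety': 0.99}
```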
## Additional resources
- Use /ai-serving-apis to wrap your program in FastAPI endpoints before setting up monitoring
- Use /ai-improving-accuracy for the metrics and evaluation patterns this skill builds on
- Use /ai-testing-safety for periodic adversarial safety audits
- Use /ai-checking-outputs to add guardrails when monitoring reveals gaps
- Use /ai-cutting-costs when cost monitoring shows spending increasing
- Use /ai-switching-models when you need to evaluate a model change
- Use /ai-tracing-requests to debug individual requests end-to-end
- See examples.md for complete worked examples