ai-monitoring

Know When Your AI Breaks in Production

Guide the user through monitoring AI quality, safety, and cost in production. The pattern: log predictions, evaluate periodically, alert on degradation.

When you need monitoring

  • Any AI feature running in production
  • After launching something built with the other skills
  • After any model or prompt change
  • When compliance requires ongoing evidence that AI works correctly
  • When you can't afford to discover problems from customer complaints

What can go wrong (without monitoring)

| Problem | How it happens | Impact |
| --- | --- | --- |
| Silent model changes | Provider updates model behavior | Accuracy drops, nobody notices for weeks |
| Input drift | Users start asking questions you didn't train for | Quality degrades on new use cases |
| Gradual degradation | Prompts rot as data distribution shifts | Slow decline — death by a thousand cuts |
| Cost creep | Longer inputs, more retries, price increases | Budget overrun |
| Safety gaps | New attack vectors, new harmful content patterns | Compliance and reputation risk |

Step 1: Define what to monitor

Ask the user what matters most:

| Category | What to measure | How |
| --- | --- | --- |
| Quality | Accuracy, relevance, helpfulness | Metrics from /ai-improving-accuracy |
| Safety | Policy violations, harmful outputs, PII leaks | LM-as-judge or rule-based checks |
| Performance | Latency, error rate, retry rate | Timing and exception logging |
| Cost | Tokens per request, cost per request, daily spend | Token counting from LM responses |

Step 2: Build evaluation metrics

Reuse the metric patterns from /ai-improving-accuracy:

Quality metric (with ground truth)

```python
import dspy

def quality_metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()
```

Quality metric (without ground truth — LM-as-judge)

Most production systems don't have ground truth for every request. Use an LM to judge quality:

```python
class AssessQuality(dspy.Signature):
    """Is this a high-quality response to the question?"""
    question: str = dspy.InputField()
    response: str = dspy.InputField()
    is_high_quality: bool = dspy.OutputField()
    issue: str = dspy.OutputField(desc="what's wrong, if anything")

def quality_judge(example, prediction, trace=None):
    judge = dspy.Predict(AssessQuality)
    result = judge(question=example.question, response=prediction.answer)
    return float(result.is_high_quality)
```

Safety metric

```python
class SafetyCheck(dspy.Signature):
    """Does this response violate any safety policies?"""
    question: str = dspy.InputField()
    response: str = dspy.InputField()
    is_safe: bool = dspy.OutputField()
    violation: str = dspy.OutputField(desc="what policy was violated, if any")

def safety_metric(example, prediction, trace=None):
    judge = dspy.Predict(SafetyCheck)
    result = judge(question=example.question, response=prediction.answer)
    return float(result.is_safe)
```
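A rule-based check can complement the LM judge for narrow policies like PII leaks. A minimal regex sketch (the patterns and the `pii_check`/`pii_metric` names are illustrative; real PII detection should use a dedicated library):

```python
import re

# Illustrative patterns only; not exhaustive
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_check(text):
    """Return the names of PII patterns found in a response."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

def pii_metric(example, prediction, trace=None):
    """1.0 when the response leaks no matched PII."""
    return float(not pii_check(prediction.answer))
```

Rule-based checks are cheap enough to run on every request, so they pair well with a sampled LM judge.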

Step 3: Run batch evaluations

Periodically evaluate your program on a reference dataset:

```python
import json
from datetime import datetime

from dspy.evaluate import Evaluate

def run_evaluation(program, eval_set, metrics):
    """Run all metrics and log results."""
    results = {}
    for name, metric_fn in metrics.items():
        evaluator = Evaluate(devset=eval_set, metric=metric_fn, num_threads=4)
        score = evaluator(program)
        results[name] = score

    # Log results with timestamp
    entry = {
        "timestamp": datetime.now().isoformat(),
        "scores": results,
    }
    with open("monitoring_log.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

    return results
```

```python
# Define your metrics
metrics = {"quality": quality_judge, "safety": safety_metric}

# Run evaluation
scores = run_evaluation(my_program, eval_set, metrics)
print(scores)
# {"quality": 87.0, "safety": 99.0}
```

Step 4: Detect degradation

Compare current scores against a baseline to catch drops early:

```python
def check_for_degradation(current_scores, baseline_scores, threshold=0.05):
    """Alert if any metric drops more than threshold below baseline."""
    alerts = []
    for metric_name, current in current_scores.items():
        baseline = baseline_scores.get(metric_name, 0)
        drop = baseline - current
        if drop > threshold:
            alerts.append(
                f"{metric_name}: dropped {drop:.1%} "
                f"(was {baseline:.1%}, now {current:.1%})"
            )
    return alerts
```

```python
# Example usage
baseline = {"quality": 0.87, "safety": 0.99}
current = {"quality": 0.75, "safety": 0.98}

alerts = check_for_degradation(current, baseline)
# ["quality: dropped 12.0% (was 87.0%, now 75.0%)"]
```

Set different thresholds for different metrics:
- **Safety:** alert on any drop >1% (zero tolerance)
- **Quality:** alert on drops >5% (some variance is normal)
- **Cost:** alert on increases >20%

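The single-threshold check above can be extended to honor per-metric thresholds like these. A sketch (`per_metric_check` and the threshold values are illustrative):

```python
# Per-metric alert thresholds: safety is near zero tolerance,
# quality tolerates normal variance (values are illustrative)
THRESHOLDS = {"quality": 0.05, "safety": 0.01}

def per_metric_check(current_scores, baseline_scores, thresholds=THRESHOLDS):
    """Alert when a metric drops more than its own threshold below baseline."""
    alerts = []
    for name, current in current_scores.items():
        baseline = baseline_scores.get(name, 0)
        drop = baseline - current
        if drop > thresholds.get(name, 0.05):
            alerts.append(f"{name}: dropped {drop:.1%}")
    return alerts
```

With these thresholds, a 2% safety drop fires an alert while the same quality drop would not.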

Step 5: Log predictions in production

Wrap your production program to log inputs and outputs for later analysis:

```python
import json
import time
from datetime import datetime

import dspy

class MonitoredProgram(dspy.Module):
    def __init__(self, program, log_path="predictions.jsonl"):
        super().__init__()
        self.program = program
        self.log_path = log_path

    def forward(self, **kwargs):
        start = time.time()
        result = self.program(**kwargs)
        latency = time.time() - start

        # Log for monitoring
        entry = {
            "timestamp": datetime.now().isoformat(),
            "inputs": {k: str(v) for k, v in kwargs.items()},
            "outputs": {k: str(getattr(result, k, "")) for k in result.keys()},
            "latency_ms": round(latency * 1000),
        }
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

        return result
```

```python
# Wrap your production program
production = MonitoredProgram(optimized_program)

# Use it normally — logging happens automatically
result = production(question="How do I reset my password?")
```

Step 6: Sample and evaluate production traffic

Periodically sample logged predictions and run metrics on them:

```python
import random

def sample_and_evaluate(log_path, metric_fns, sample_size=100):
    """Sample recent predictions and evaluate quality."""
    with open(log_path) as f:
        entries = [json.loads(line) for line in f]

    recent = entries[-1000:]  # last 1000 predictions
    sample = random.sample(recent, min(sample_size, len(recent)))

    # Convert to dspy.Examples. The logged answer is marked as an input so the
    # passthrough program below can echo it back as the prediction to score.
    examples = []
    for entry in sample:
        ex = dspy.Example(
            question=entry["inputs"].get("question", ""),
            answer=entry["outputs"].get("answer", ""),
        ).with_inputs("question", "answer")
        examples.append(ex)

    # Run each metric over a passthrough program that returns the logged prediction
    results = {}
    for name, metric_fn in metric_fns.items():
        evaluator = Evaluate(devset=examples, metric=metric_fn, num_threads=4)
        score = evaluator(lambda **kw: dspy.Prediction(answer=kw.get("answer", "")))
        results[name] = score

    return results
```

Step 7: Set up alerts

Simple threshold-based alerting that integrates with your existing tools:

```python
def monitoring_check(program, eval_set, metrics, baseline):
    """Run one monitoring cycle: evaluate, compare, alert."""
    scores = run_evaluation(program, eval_set, metrics)
    alerts = check_for_degradation(scores, baseline)

    if alerts:
        alert_message = "AI quality degradation detected:\n" + "\n".join(alerts)
        # Send to wherever your team gets alerts
        send_to_slack(alert_message)     # or email, PagerDuty, etc.
        print(f"ALERT: {alert_message}")
    else:
        print(f"All metrics healthy: {scores}")

    return scores
```
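`send_to_slack` is left undefined above; one minimal way to implement it is a Slack incoming webhook, which accepts a JSON body of the form `{"text": ...}`. A sketch (the webhook URL is a placeholder for your own):

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def slack_payload(message):
    """JSON body for a Slack incoming-webhook POST."""
    return json.dumps({"text": message}).encode("utf-8")

def send_to_slack(message, webhook_url=SLACK_WEBHOOK_URL):
    """POST an alert message to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=slack_payload(message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200
```

Swap the body for your team's channel of choice (email, PagerDuty, etc.); only the transport changes.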

Schedule it

Run monitoring checks on a schedule. How often depends on traffic and risk:

| Traffic | Risk level | Suggested frequency |
| --- | --- | --- |
| High (>10K req/day) | High (safety-critical) | Every hour |
| High | Medium | Every 6 hours |
| Medium (1-10K/day) | Any | Daily |
| Low (<1K/day) | Any | Weekly |

```python
# Run as a cron job, scheduled task, or in your CI pipeline
# Example: daily check
if __name__ == "__main__":
    from my_app import production_program, eval_set

    baseline = {"quality": 0.87, "safety": 0.99}
    metrics = {"quality": quality_judge, "safety": safety_metric}

    monitoring_check(production_program, eval_set, metrics, baseline)
```
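If cron or a scheduler isn't available, the same check can run in a long-lived loop. A sketch (`run_periodically` is an illustrative name; the `max_cycles` guard exists only to keep the loop testable):

```python
import time

def run_periodically(check, interval_seconds, max_cycles=None):
    """Run `check` on a fixed interval; `max_cycles` bounds the loop for testing."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        check()
        cycles += 1
        time.sleep(interval_seconds)
```

For a daily check: `run_periodically(lambda: monitoring_check(production_program, eval_set, metrics, baseline), 24 * 3600)`.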

Step 5b: Connect an observability platform

For teams that want dashboards, alerts, and collaboration beyond DIY JSONL logging:

Quick setup

| Platform | Setup | Open source | DSPy integration |
| --- | --- | --- | --- |
| Langtrace | `langtrace.init(api_key="...")` | Yes (self-host) + cloud | Auto-instruments all DSPy calls |
| Arize Phoenix | `px.launch_app()` + `DSPyInstrumentor().instrument()` | Yes | Auto-instruments via OpenInference |
| W&B Weave | `weave.init("project")` + `@weave.op()` decorator | No (cloud) | Manual decorator per function |

Langtrace (best DSPy auto-instrumentation)

```bash
pip install langtrace-python-sdk
```

```python
from langtrace_python_sdk import langtrace

langtrace.init(api_key="your-key")  # or self-host: langtrace.init(api_host="http://localhost:3000")
```

All DSPy LM calls, retrievals, and module executions are traced automatically:

```python
result = production_program(question="How do refunds work?")
```

Arize Phoenix (open-source trace viewer)

```bash
pip install arize-phoenix openinference-instrumentation-dspy
```

```python
import phoenix as px
from openinference.instrumentation.dspy import DSPyInstrumentor

px.launch_app()  # Local UI at http://localhost:6006
DSPyInstrumentor().instrument()
```

Traces appear in the Phoenix UI with full prompt/response details.

W&B Weave (team dashboards)

```bash
pip install weave
```

```python
import weave

weave.init("my-ai-project")

@weave.op()
def monitored_predict(question):
    return production_program(question=question)
```

All calls tracked with inputs, outputs, latency, and cost. View at wandb.ai.

Which platform to use

| Your situation | Recommended |
| --- | --- |
| Solo developer, want quick DSPy tracing | Langtrace |
| Team wants open-source, self-hosted | Arize Phoenix |
| Team already uses W&B for ML experiments | W&B Weave |
| Need per-request debugging (not aggregate) | See /ai-tracing-requests |

When things go wrong

Quick decision tree for common monitoring alerts:

| Alert | Likely cause | Fix with |
| --- | --- | --- |
| Quality dropped | Model provider changed behavior, or input distribution shifted | /ai-improving-accuracy — re-evaluate and re-optimize |
| Safety metric dropped | New attack vectors or content patterns | /ai-testing-safety — run adversarial audit, then fix with /ai-checking-outputs |
| Cost spiked | Longer inputs, more retries, or model price increase | /ai-cutting-costs — investigate and optimize |
| Error rate increased | API changes, schema changes, rate limits | /ai-fixing-errors — diagnose and fix |
| Latency increased | Model congestion, larger inputs, or added retries | Check retry rates first, then consider /ai-switching-models |

Tips

  • Set up monitoring at launch, not after an incident. The cost of monitoring is low; the cost of missing a regression is high.
  • Use LM-as-judge metrics when you don't have ground truth. Most production cases won't have labeled answers — an LM judge is good enough to detect degradation.
  • Log everything: inputs, outputs, latencies, token counts, costs. You can always analyze later, but you can't retroactively log what you didn't capture.
  • Separate safety from quality monitoring. Safety alerts need lower thresholds (>1% drop) and faster response times than quality alerts (>5% drop).
  • Run the full safety audit monthly. Periodic metric checks catch gradual degradation; monthly /ai-testing-safety audits catch new attack vectors.
  • Keep your reference eval set fresh. Add examples from real production failures. Remove examples that no longer represent your users.
  • Baseline after every optimization. When you re-optimize your program, update the baseline scores so future comparisons are meaningful.

Additional resources

  • Use /ai-serving-apis to wrap your program in FastAPI endpoints before setting up monitoring
  • Use /ai-improving-accuracy for the metrics and evaluation patterns this skill builds on
  • Use /ai-testing-safety for periodic adversarial safety audits
  • Use /ai-checking-outputs to add guardrails when monitoring reveals gaps
  • Use /ai-cutting-costs when cost monitoring shows spending increasing
  • Use /ai-switching-models when you need to evaluate a model change
  • Use /ai-tracing-requests to debug individual requests end-to-end
  • See examples.md for complete worked examples