ai-monitoring

Know When Your AI Breaks in Production

Guide the user through monitoring AI quality, safety, and cost in production. The pattern: log predictions, evaluate periodically, alert on degradation.

When you need monitoring

  • Any AI feature running in production
  • After launching something built with the other skills
  • After any model or prompt change
  • When compliance requires ongoing evidence that AI works correctly
  • When you can't afford to discover problems from customer complaints

What can go wrong (without monitoring)

| Problem | How it happens | Impact |
| --- | --- | --- |
| Silent model changes | Provider updates model behavior | Accuracy drops, nobody notices for weeks |
| Input drift | Users start asking questions you didn't train for | Quality degrades on new use cases |
| Gradual degradation | Prompts rot as data distribution shifts | Slow decline — death by a thousand cuts |
| Cost creep | Longer inputs, more retries, price increases | Budget overrun |
| Safety gaps | New attack vectors, new harmful content patterns | Compliance and reputation risk |

Step 1: Define what to monitor

Ask the user what matters most:

| Category | What to measure | How |
| --- | --- | --- |
| Quality | Accuracy, relevance, helpfulness | Metrics from /ai-improving-accuracy |
| Safety | Policy violations, harmful outputs, PII leaks | LM-as-judge or rule-based checks |
| Performance | Latency, error rate, retry rate | Timing and exception logging |
| Cost | Tokens per request, cost per request, daily spend | Token counting from LM responses |

Step 2: Build evaluation metrics

Reuse the metric patterns from /ai-improving-accuracy:

Quality metric (with ground truth)

```python
import dspy

def quality_metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()
```

Quality metric (without ground truth — LM-as-judge)

Most production systems don't have ground truth for every request. Use an LM to judge quality:

```python
class AssessQuality(dspy.Signature):
    """Is this a high-quality response to the question?"""
    question: str = dspy.InputField()
    response: str = dspy.InputField()
    is_high_quality: bool = dspy.OutputField()
    issue: str = dspy.OutputField(desc="what's wrong, if anything")

def quality_judge(example, prediction, trace=None):
    judge = dspy.Predict(AssessQuality)
    result = judge(question=example.question, response=prediction.answer)
    return float(result.is_high_quality)
```

Safety metric

```python
class SafetyCheck(dspy.Signature):
    """Does this response violate any safety policies?"""
    question: str = dspy.InputField()
    response: str = dspy.InputField()
    is_safe: bool = dspy.OutputField()
    violation: str = dspy.OutputField(desc="what policy was violated, if any")

def safety_metric(example, prediction, trace=None):
    judge = dspy.Predict(SafetyCheck)
    result = judge(question=example.question, response=prediction.answer)
    return float(result.is_safe)
```
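A rule-based check can complement the LM judge for narrow policies like PII leaks. A minimal regex sketch (the patterns and the `pii_check`/`pii_metric` names are illustrative; real PII detection should use a dedicated library):

```python
import re

# Illustrative patterns only; not exhaustive
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_check(text):
    """Return the names of PII patterns found in a response."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

def pii_metric(example, prediction, trace=None):
    """1.0 when the response leaks no matched PII."""
    return float(not pii_check(prediction.answer))
```

Rule-based checks are cheap enough to run on every request, so they pair well with a sampled LM judge.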

Step 3: Run batch evaluations

Periodically evaluate your program on a reference dataset:

```python
import json
from datetime import datetime

from dspy.evaluate import Evaluate

def run_evaluation(program, eval_set, metrics):
    """Run all metrics and log results."""
    results = {}
    for name, metric_fn in metrics.items():
        evaluator = Evaluate(devset=eval_set, metric=metric_fn, num_threads=4)
        score = evaluator(program)
        results[name] = score

    # Log results with timestamp
    entry = {
        "timestamp": datetime.now().isoformat(),
        "scores": results,
    }
    with open("monitoring_log.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

    return results
```

```python
# Define your metrics
metrics = {"quality": quality_judge, "safety": safety_metric}

# Run evaluation
scores = run_evaluation(my_program, eval_set, metrics)
print(scores)
# {"quality": 87.0, "safety": 99.0}
```

Step 4: Detect degradation

Compare current scores against a baseline to catch drops early:

```python
def check_for_degradation(current_scores, baseline_scores, threshold=0.05):
    """Alert if any metric drops more than threshold below baseline."""
    alerts = []
    for metric_name, current in current_scores.items():
        baseline = baseline_scores.get(metric_name, 0)
        drop = baseline - current
        if drop > threshold:
            alerts.append(
                f"{metric_name}: dropped {drop:.1%} "
                f"(was {baseline:.1%}, now {current:.1%})"
            )
    return alerts
```

```python
# Example usage
baseline = {"quality": 0.87, "safety": 0.99}
current = {"quality": 0.75, "safety": 0.98}

alerts = check_for_degradation(current, baseline)
# ["quality: dropped 12.0% (was 87.0%, now 75.0%)"]
```

Set different thresholds for different metrics:
- **Safety:** alert on any drop >1% (zero tolerance)
- **Quality:** alert on drops >5% (some variance is normal)
- **Cost:** alert on increases >20%

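The single-threshold check above can be extended to honor per-metric thresholds like these. A sketch (`per_metric_check` and the threshold values are illustrative):

```python
# Per-metric alert thresholds: safety is near zero tolerance,
# quality tolerates normal variance (values are illustrative)
THRESHOLDS = {"quality": 0.05, "safety": 0.01}

def per_metric_check(current_scores, baseline_scores, thresholds=THRESHOLDS):
    """Alert when a metric drops more than its own threshold below baseline."""
    alerts = []
    for name, current in current_scores.items():
        baseline = baseline_scores.get(name, 0)
        drop = baseline - current
        if drop > thresholds.get(name, 0.05):
            alerts.append(f"{name}: dropped {drop:.1%}")
    return alerts
```

With these thresholds, a 2% safety drop fires an alert while the same quality drop would not.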

Step 5: Log predictions in production

Wrap your production program to log inputs and outputs for later analysis:

```python
import json
import time
from datetime import datetime

import dspy

class MonitoredProgram(dspy.Module):
    def __init__(self, program, log_path="predictions.jsonl"):
        super().__init__()
        self.program = program
        self.log_path = log_path

    def forward(self, **kwargs):
        start = time.time()
        result = self.program(**kwargs)
        latency = time.time() - start

        # Log for monitoring
        entry = {
            "timestamp": datetime.now().isoformat(),
            "inputs": {k: str(v) for k, v in kwargs.items()},
            "outputs": {k: str(getattr(result, k, "")) for k in result.keys()},
            "latency_ms": round(latency * 1000),
        }
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

        return result
```

```python
# Wrap your production program
production = MonitoredProgram(optimized_program)

# Use it normally — logging happens automatically
result = production(question="How do I reset my password?")
```

Step 6: Sample and evaluate production traffic

Periodically sample logged predictions and run metrics on them:

```python
import random

def sample_and_evaluate(log_path, metric_fns, sample_size=100):
    """Sample recent predictions and evaluate quality."""
    with open(log_path) as f:
        entries = [json.loads(line) for line in f]

    recent = entries[-1000:]  # last 1000 predictions
    sample = random.sample(recent, min(sample_size, len(recent)))

    # Convert to dspy.Examples. The logged answer is marked as an input so the
    # passthrough program below can echo it back as the prediction to score.
    examples = []
    for entry in sample:
        ex = dspy.Example(
            question=entry["inputs"].get("question", ""),
            answer=entry["outputs"].get("answer", ""),
        ).with_inputs("question", "answer")
        examples.append(ex)

    # Run each metric over a passthrough program that returns the logged prediction
    results = {}
    for name, metric_fn in metric_fns.items():
        evaluator = Evaluate(devset=examples, metric=metric_fn, num_threads=4)
        score = evaluator(lambda **kw: dspy.Prediction(answer=kw.get("answer", "")))
        results[name] = score

    return results
```

Step 7: Set up alerts

Simple threshold-based alerting that integrates with your existing tools:

```python
def monitoring_check(program, eval_set, metrics, baseline):
    """Run one monitoring cycle: evaluate, compare, alert."""
    scores = run_evaluation(program, eval_set, metrics)
    alerts = check_for_degradation(scores, baseline)

    if alerts:
        alert_message = "AI quality degradation detected:\n" + "\n".join(alerts)
        # Send to wherever your team gets alerts
        send_to_slack(alert_message)     # or email, PagerDuty, etc.
        print(f"ALERT: {alert_message}")
    else:
        print(f"All metrics healthy: {scores}")

    return scores
```
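`send_to_slack` is left undefined above; one minimal way to implement it is a Slack incoming webhook, which accepts a JSON body of the form `{"text": ...}`. A sketch (the webhook URL is a placeholder for your own):

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def slack_payload(message):
    """JSON body for a Slack incoming-webhook POST."""
    return json.dumps({"text": message}).encode("utf-8")

def send_to_slack(message, webhook_url=SLACK_WEBHOOK_URL):
    """POST an alert message to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=slack_payload(message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200
```

Swap the body for your team's channel of choice (email, PagerDuty, etc.); only the transport changes.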

Schedule it

Run monitoring checks on a schedule. How often depends on traffic and risk:

| Traffic | Risk level | Suggested frequency |
| --- | --- | --- |
| High (>10K req/day) | High (safety-critical) | Every hour |
| High | Medium | Every 6 hours |
| Medium (1-10K/day) | Any | Daily |
| Low (<1K/day) | Any | Weekly |

```python
# Run as a cron job, scheduled task, or in your CI pipeline
# Example: daily check
if __name__ == "__main__":
    from my_app import production_program, eval_set

    baseline = {"quality": 0.87, "safety": 0.99}
    metrics = {"quality": quality_judge, "safety": safety_metric}

    monitoring_check(production_program, eval_set, metrics, baseline)
```
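If cron or a scheduler isn't available, the same check can run in a long-lived loop. A sketch (`run_periodically` is an illustrative name; the `max_cycles` guard exists only to keep the loop testable):

```python
import time

def run_periodically(check, interval_seconds, max_cycles=None):
    """Run `check` on a fixed interval; `max_cycles` bounds the loop for testing."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        check()
        cycles += 1
        time.sleep(interval_seconds)
```

For a daily check: `run_periodically(lambda: monitoring_check(production_program, eval_set, metrics, baseline), 24 * 3600)`.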

Step 5b: Connect an observability platform

For teams that want dashboards, alerts, and collaboration beyond DIY JSONL logging:

Quick setup

| Platform | Setup | Open source | DSPy integration |
| --- | --- | --- | --- |
| Langtrace | `langtrace.init(api_key="...")` | Yes (self-host) + cloud | Auto-instruments all DSPy calls |
| Arize Phoenix | `px.launch_app()` + `DSPyInstrumentor().instrument()` | Yes | Auto-instruments via OpenInference |
| W&B Weave | `weave.init("project")` + `@weave.op()` decorator | No (cloud) | Manual decorator per function |

Langtrace (best DSPy auto-instrumentation)

```bash
pip install langtrace-python-sdk
```

```python
from langtrace_python_sdk import langtrace

langtrace.init(api_key="your-key")  # or self-host: langtrace.init(api_host="http://localhost:3000")
```

All DSPy LM calls, retrievals, and module executions are traced automatically:

```python
result = production_program(question="How do refunds work?")
```

Arize Phoenix (open-source trace viewer)

```bash
pip install arize-phoenix openinference-instrumentation-dspy
```

```python
import phoenix as px
from openinference.instrumentation.dspy import DSPyInstrumentor

px.launch_app()  # Local UI at http://localhost:6006
DSPyInstrumentor().instrument()
```

Traces appear in the Phoenix UI with full prompt/response details.

W&B Weave (team dashboards)

```bash
pip install weave
```

```python
import weave

weave.init("my-ai-project")

@weave.op()
def monitored_predict(question):
    return production_program(question=question)
```

All calls tracked with inputs, outputs, latency, and cost. View at wandb.ai.

Which platform to use

| Your situation | Recommended |
| --- | --- |
| Solo developer, want quick DSPy tracing | Langtrace |
| Team wants open-source, self-hosted | Arize Phoenix |
| Team already uses W&B for ML experiments | W&B Weave |
| Need per-request debugging (not aggregate) | See /ai-tracing-requests |

When things go wrong

Quick decision tree for common monitoring alerts:

| Alert | Likely cause | Fix with |
| --- | --- | --- |
| Quality dropped | Model provider changed behavior, or input distribution shifted | /ai-improving-accuracy — re-evaluate and re-optimize |
| Safety metric dropped | New attack vectors or content patterns | /ai-testing-safety — run adversarial audit, then fix with /ai-checking-outputs |
| Cost spiked | Longer inputs, more retries, or model price increase | /ai-cutting-costs — investigate and optimize |
| Error rate increased | API changes, schema changes, rate limits | /ai-fixing-errors — diagnose and fix |
| Latency increased | Model congestion, larger inputs, or added retries | Check retry rates first, then consider /ai-switching-models |

Tips

  • Set up monitoring at launch, not after an incident. The cost of monitoring is low; the cost of missing a regression is high.
  • Use LM-as-judge metrics when you don't have ground truth. Most production cases won't have labeled answers — an LM judge is good enough to detect degradation.
  • Log everything: inputs, outputs, latencies, token counts, costs. You can always analyze later, but you can't retroactively log what you didn't capture.
  • Separate safety from quality monitoring. Safety alerts need lower thresholds (>1% drop) and faster response times than quality alerts (>5% drop).
  • Run the full safety audit monthly. Periodic metric checks catch gradual degradation; monthly /ai-testing-safety audits catch new attack vectors.
  • Keep your reference eval set fresh. Add examples from real production failures. Remove examples that no longer represent your users.
  • Baseline after every optimization. When you re-optimize your program, update the baseline scores so future comparisons are meaningful.

Additional resources

  • Use /ai-serving-apis to wrap your program in FastAPI endpoints before setting up monitoring
  • Use /ai-improving-accuracy for the metrics and evaluation patterns this skill builds on
  • Use /ai-testing-safety for periodic adversarial safety audits
  • Use /ai-checking-outputs to add guardrails when monitoring reveals gaps
  • Use /ai-cutting-costs when cost monitoring shows spending increasing
  • Use /ai-switching-models when you need to evaluate a model change
  • Use /ai-tracing-requests to debug individual requests end-to-end
  • See examples.md for complete worked examples