Observability and Instrumentation

Overview

Code you can't observe is code you can't operate. Observability is the ability to answer "what is the system doing and why?" from the outside, using the telemetry the code emits. Instrumentation is not a post-launch add-on — it's written alongside the feature, the same way tests are. If a feature ships without telemetry, the first user-reported bug becomes archaeology instead of a query.

When to Use

  • Building any feature that will run in production
  • Adding a new service, endpoint, background job, or external integration
  • A production incident took too long to diagnose ("we couldn't tell what happened")
  • Setting up or reviewing alerting rules
  • Reviewing a PR that adds I/O, retries, queues, or cross-service calls
NOT for:
  • Diagnosing a failure happening right now: use the debugging-and-error-recovery skill (observability is what makes that skill fast next time)
  • Profiling and optimizing measured slowness: use the performance-optimization skill
  • Launch-day monitoring checklists and rollback triggers: see the shipping-and-launch skill; this skill covers the instrumentation that feeds them

Process

1. Define "working" before instrumenting

Telemetry without a question is noise. Before adding any instrumentation, write down 2–4 questions an on-call engineer will ask about this feature:
FEATURE: checkout payment retry
QUESTIONS ON-CALL WILL ASK:
1. What fraction of payments succeed on first attempt vs after retry?
2. When a payment fails permanently, why? (provider error? timeout? validation?)
3. Is the payment provider slower than usual?
→ Every signal below must help answer one of these.
If you can't name the questions, you're not ready to instrument — you'll log everything and learn nothing.

2. Pick the right signal for each question

| Signal | Answers | Cost profile | Example |
| --- | --- | --- | --- |
| Structured log | "What happened in this specific case?" | Per-event; grows with traffic | payment_failed with provider error code |
| Metric | "How often / how fast, in aggregate?" | Fixed per series; cheap to query | p99 latency of provider calls |
| Trace | "Where did time go across services?" | Per-request; usually sampled | One slow checkout, broken down by hop |

Rule of thumb: metrics tell you that something is wrong, traces tell you where, logs tell you why.

3. Structured logging

Log events, not prose. Every log line is a JSON object with a stable event name and machine-readable fields:
typescript
// BAD: string interpolation — unqueryable, inconsistent
logger.info(`Payment ${id} failed for user ${userId} after ${n} retries`);

// GOOD: stable event name + structured fields
logger.warn({
  event: 'payment_failed',
  paymentId: id,
  provider: 'stripe',
  errorCode: err.code,
  attempt: n,
}, 'payment failed');
Log levels — use them consistently:
| Level | Meaning | On-call action |
| --- | --- | --- |
| error | Invariant broken; someone may need to act | Investigate |
| warn | Degraded but handled (retry succeeded, fallback used) | Watch for trends |
| info | Significant business event (order placed, job finished) | None |
| debug | Diagnostic detail | Off in production by default |

Correlation IDs are mandatory. Generate (or accept) a request ID at the system boundary and attach it to every log line, span, and outbound call. Without it, you cannot reconstruct a single request from interleaved logs:
typescript
// Express: child logger per request, ID propagated downstream
app.use((req, res, next) => {
  req.id = req.headers['x-request-id'] ?? crypto.randomUUID();
  req.log = logger.child({ requestId: req.id });
  res.setHeader('x-request-id', req.id);
  next();
});
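
The middleware above covers log lines and the response header, but not outbound calls. One possible way to carry the ID further is Node's built-in AsyncLocalStorage. The sketch below is an assumption about how this could be wired, not part of the original middleware; `withRequestId` and `outboundHeaders` are hypothetical helper names.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Hypothetical per-request context; only the request ID is carried here.
interface RequestContext {
  requestId: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Run `fn` with the request ID bound to the current async call chain.
function withRequestId<T>(requestId: string, fn: () => T): T {
  return requestContext.run({ requestId }, fn);
}

// Build headers for an outbound call, adding x-request-id when a request is in scope.
function outboundHeaders(base: Record<string, string> = {}): Record<string, string> {
  const ctx = requestContext.getStore();
  return ctx ? { ...base, 'x-request-id': ctx.requestId } : { ...base };
}
```

In an Express middleware this would likely mean calling `withRequestId(req.id, next)` instead of a bare `next()`, so that any HTTP client wrapper further down can call `outboundHeaders()` without the ID being passed through every function signature.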
Never log secrets, tokens, passwords, or full PII. This is a hard rule from the security-and-hardening skill: telemetry pipelines are a classic data-leak path. Allowlist fields; don't log whole request bodies.
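
As a minimal sketch of what "allowlist fields" can look like in practice, the helper below copies only named keys into the object that gets logged. The helper name `pickLoggable`, the `PAYMENT_LOG_FIELDS` list, and the sample body are all hypothetical.

```typescript
// Hypothetical allowlist for a payment event; define one per event type.
const PAYMENT_LOG_FIELDS: readonly string[] = ['paymentId', 'provider', 'errorCode', 'attempt'];

// Copy only allowlisted keys, so a field added to the input later
// never reaches the log pipeline by accident.
function pickLoggable(
  input: Record<string, unknown>,
  allowed: readonly string[],
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of allowed) {
    if (Object.prototype.hasOwnProperty.call(input, key)) {
      out[key] = input[key];
    }
  }
  return out;
}

// Made-up request body containing a field that must never be logged.
const requestBody = {
  paymentId: 'pay_1',
  provider: 'stripe',
  attempt: 2,
  cardNumber: '4242424242424242',
};
const safeFields = pickLoggable(requestBody, PAYMENT_LOG_FIELDS);
// safeFields holds paymentId, provider, and attempt; cardNumber is dropped.
```

An allowlist fails closed: forgetting to add a field loses a little detail, whereas a denylist that misses a field leaks it.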

4. Metrics

For request-driven services, instrument RED on every endpoint and every external dependency: Rate (requests/sec), Errors (failure rate), Duration (latency histogram, not average). For resources (queues, pools, hosts), use USE: Utilization, Saturation, Errors.
As with tracing, the vendor-neutral path is the OpenTelemetry metrics API (same SDK and context as step 5). The example below uses prom-client, the Prometheus client library for Node.js. It is one common choice, not the only one; the RED/USE and cardinality rules are identical either way.
typescript
import { Histogram } from 'prom-client';

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status_class'],  // '2xx', not '200'
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});
Cardinality is the failure mode. Every unique label combination is a separate time series. Labels must come from small, fixed sets (route template, status class, provider name). Never use user IDs, raw URLs, error messages, or other unbounded values as labels — that belongs in logs and traces.
OK as label:    route="/api/tasks/:id"   status_class="5xx"   provider="stripe"
NEVER a label:  user_id, email, request_id, full URL, error message text
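
One way to keep label sets bounded is to normalize values before they reach the metrics client. The helpers below are a sketch under assumptions: the names `statusClass` and `routeLabel` and the `unmatched` fallback value are hypothetical, not part of any library.

```typescript
// Collapse an HTTP status code into one of six possible label values.
function statusClass(status: number): string {
  const hundreds = Math.floor(status / 100);
  return hundreds >= 1 && hundreds <= 5 ? `${hundreds}xx` : 'other';
}

// Hypothetical fallback for requests that matched no route template.
const UNMATCHED_ROUTE = 'unmatched';

// Use the route template when the router matched one; never fall back to the raw URL,
// because a scanner probing random paths would create one series per path.
function routeLabel(routeTemplate: string | undefined): string {
  return routeTemplate ?? UNMATCHED_ROUTE;
}
```

In Express the matched template is typically available as `req.route?.path`, so the label for a request to `/api/tasks/42` would be `/api/tasks/:id` rather than the concrete URL.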
Track averages never, percentiles always: an average hides the 1% of users having a terrible time. Use histograms and read p50/p95/p99.
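
A small worked example, using made-up latencies, shows how far an average can sit from the tail. The nearest-rank `percentile` helper is a simplification for illustration; a real metrics backend estimates percentiles from histogram buckets instead.

```typescript
// Nearest-rank percentile over a raw sample; p is in (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)] ?? NaN;
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Made-up data: 98 requests at 0.1 s and 2 requests at 8 s.
const latencies: number[] = [...new Array<number>(98).fill(0.1), 8, 8];

const avg = mean(latencies);           // roughly 0.26 s, which looks healthy
const p50 = percentile(latencies, 50); // 0.1 s
const p99 = percentile(latencies, 99); // 8 s, the experience the average hides
```

The average suggests a service responding in about a quarter of a second, while the p99 shows that some users wait eight seconds.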

5. Distributed tracing

Use OpenTelemetry — it's the vendor-neutral standard, and auto-instrumentation covers HTTP, gRPC, and common DB clients with near-zero code:
typescript
// tracing.ts — must be imported before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'checkout-service',
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
Add manual spans only around meaningful internal units of work (e.g., applyDiscounts, chargeProvider) and attach the attributes on-call will filter by. Propagate context across every async boundary (HTTP headers, queue message metadata) or the trace dies at the gap. Sample head-based at a low rate by default; keep 100% of errors if your backend supports tail sampling.
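
To make "propagate context through queue message metadata" concrete, the sketch below formats and parses a W3C Trace Context `traceparent` value, which is the header format OpenTelemetry propagates by default. In a real service the OpenTelemetry propagation API would normally do this for you; the code only illustrates what has to cross the boundary. The `TraceContext` shape and the function names are hypothetical.

```typescript
// Fields carried by a traceparent value: version-traceId-spanId-flags, lowercase hex.
interface TraceContext {
  traceId: string; // 32 hex characters
  spanId: string;  // 16 hex characters
  sampled: boolean;
}

// Producer side: serialize the current context into message metadata.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

// Consumer side: recover the context, or undefined if the value is malformed.
function parseTraceparent(value: string): TraceContext | undefined {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(value);
  if (!match) return undefined;
  const [, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Hypothetical queue message: the trace context travels next to the payload.
const queueMessage = {
  body: { orderId: 'order_1' },
  metadata: {
    traceparent: formatTraceparent({
      traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
      spanId: '00f067aa0ba902b7',
      sampled: true,
    }),
  },
};
```

If the consumer drops `metadata.traceparent`, it starts a fresh trace and the checkout appears as two unrelated fragments in the tracing UI.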

6. Alerting

Alert on symptoms users feel, not on causes:
SYMPTOM (page-worthy):           CAUSE (dashboard, not a page):
error rate > 1% for 5 min        CPU at 85%
p99 latency > 2s                 one pod restarted
queue age > 10 min               disk at 70%
Cause-based alerts fire when nothing is wrong and miss failures you didn't predict. Symptom-based alerts fire exactly when users are hurt, regardless of the cause.
Rules for every alert you create:
  1. It must be actionable. If the response is "ignore it, it self-heals", delete the alert.
  2. It links to a runbook — even three lines: what it means, first query to run, escalation path.
  3. It has a threshold and duration justified by the SLO or by historical data, not by a guess.
  4. Use two severities only: page (user-facing, act now) and ticket (degradation, act this week). A third tier becomes noise that trains people to ignore everything.
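
The "threshold and duration" rule can be sketched as a pure function. Alerting systems express this in their own rule language; the TypeScript below only illustrates the logic of "error rate > 1% for 5 min", and the `RateWindow` shape and `shouldPage` name are hypothetical.

```typescript
// One evaluation window, for example one minute of traffic.
interface RateWindow {
  requests: number;
  errors: number;
}

// Fire only when every one of the most recent `durationWindows` windows
// breaches the threshold. A single bad window is a blip, not a page.
function shouldPage(
  windows: RateWindow[],
  threshold: number,
  durationWindows: number,
): boolean {
  if (windows.length < durationWindows) return false;
  return windows
    .slice(-durationWindows)
    .every((w) => w.requests > 0 && w.errors / w.requests > threshold);
}

// Made-up data: a one-minute spike versus five sustained minutes at 2% errors.
const healthy: RateWindow = { requests: 1000, errors: 2 };
const failing: RateWindow = { requests: 1000, errors: 20 };

const spike = shouldPage([healthy, healthy, healthy, healthy, failing], 0.01, 5);
const sustained = shouldPage([failing, failing, failing, failing, failing], 0.01, 5);
// spike is false, sustained is true
```

The duration requirement is what separates a self-healing blip from a sustained failure that deserves a page.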

7. Verify the telemetry itself

Instrumentation is code; it can be wrong. Before calling the work done, trigger the paths and look at the actual output:
  • Force an error in staging → find it in the logs by requestId, confirm fields are structured (not [object Object])
  • Send test traffic → confirm metric series appear with the expected labels and sane values
  • Follow one request across services in the tracing UI → no broken spans
  • Fire each new alert once (lower the threshold temporarily) → confirm it reaches the right channel and the runbook link works
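
Part of the first check can be automated. The sketch below validates one captured log line; `checkLogLine` is a hypothetical helper, and it assumes logs are emitted as one JSON object per line.

```typescript
// Return a list of problems found in a single captured log line.
function checkLogLine(line: string, requiredFields: string[]): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return ['not valid JSON'];
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return ['not a JSON object'];
  }
  const record = parsed as Record<string, unknown>;
  const problems: string[] = [];
  for (const field of requiredFields) {
    if (!(field in record)) problems.push(`missing field: ${field}`);
  }
  // An object interpolated into a string shows up as "[object Object]".
  for (const [key, value] of Object.entries(record)) {
    if (typeof value === 'string' && value.includes('[object Object]')) {
      problems.push(`stringified object in field: ${key}`);
    }
  }
  return problems;
}
```

Running this over a sample of staging output, with `['event', 'requestId']` as the required fields, gives a quick signal before anyone reads the logs by eye.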

Common Rationalizations

| Rationalization | Reality |
| --- | --- |
| "I'll add logging after it works" | "After" becomes "after the first incident", which is the most expensive moment to discover you're blind. Instrument as you build. |
| "More logs = more observability" | Unstructured noise makes incidents slower, not faster. Three queryable events beat three hundred prose lines. |
| "console.log is fine for now" | Unstructured output can't be filtered, correlated, or alerted on. The structured logger costs five extra minutes once. |
| "We can just look at the dashboards when something breaks" | Dashboards built without defined questions show you everything except the answer. Start from on-call questions. |
| "Alert on everything important, we'll tune later" | A noisy pager trains people to ignore it. The tuning never happens; the missed real page does. |
| "User ID as a metric label makes debugging easier" | It also makes your metrics backend fall over. High-cardinality lookups belong in logs and traces. |
| "Tracing is overkill for our two services" | Two services already means cross-service latency questions logs can't answer. Auto-instrumentation makes the cost trivial. |

Red Flags

  • A feature PR with retries, queues, or external calls and zero new telemetry
  • Log lines built by string interpolation instead of structured fields
  • No correlation/request ID — each log line is an orphan
  • Metrics labeled with user IDs, raw URLs, or error message text (cardinality bomb)
  • Latency tracked as an average with no percentiles
  • Alerts that fire daily and get acknowledged without action
  • Alerts on causes (CPU, memory) paging humans while user-facing error rate is unmonitored
  • Secrets, tokens, or full request bodies appearing in logs
  • "It works on my machine" as the only evidence a production feature is healthy

Verification

After instrumenting a feature, confirm:
  • The on-call questions for this feature are written down, and each signal maps to one
  • All log output is structured (JSON), with stable event names and a correlation ID on every line
  • No secrets, tokens, or unredacted PII in any log line (spot-check actual output)
  • RED metrics exist for every new endpoint and every external dependency, with bounded label sets
  • Latency is a histogram; p95/p99 are queryable
  • A single request can be followed end-to-end in the tracing UI without broken spans
  • Every new alert is symptom-based, has a runbook link, and was test-fired once
  • An induced failure in staging was located via telemetry alone, without reading the source