qa-observability

QA Observability and Performance Engineering

Use telemetry (logs, metrics, traces, profiles) as a QA signal and a debugging substrate.

Core references (see `data/sources.json`): OpenTelemetry, W3C Trace Context, and SLO practices (Google SRE).

Quick Start (Default)

If key context is missing, ask for: critical user journeys, service/dependency inventory, environments (local/staging/prod), current telemetry stack, and current SLO/SLA commitments (if any).

Establish the minimum bar: correlation IDs + structured logs + traces + golden metrics (latency, traffic, errors, saturation).
Verify propagation: confirm `traceparent` (and your request ID) flow across boundaries end-to-end.
Make failures diagnosable: every test failure captures a trace link (or trace ID) plus the correlated logs.
Define SLIs/SLOs and error budget policy; wire burn-rate alerts (prefer multi-window burn rates).
Produce artifacts: a readiness checklist plus an SLO definition and alert rules (use `assets/checklists/template-observability-readiness-checklist.md` and `assets/monitoring/slo/*`).
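
The propagation check above can be made concrete with a small validator. The following is a minimal sketch, assuming the version `00` header layout from W3C Trace Context (`version-traceid-parentid-flags`, lowercase hex); the function names `parse_traceparent` and `same_trace` are hypothetical helpers for illustration, not part of any library.

```python
import re
from typing import Dict, Optional

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[Dict[str, object]]:
    """Return the parsed fields, or None if the header is not a valid value."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    fields: Dict[str, object] = dict(match.groupdict())
    # Version ff and all-zero trace or parent IDs are invalid.
    if fields["version"] == "ff":
        return None
    if set(fields["trace_id"]) == {"0"} or set(fields["parent_id"]) == {"0"}:
        return None
    # The lowest flag bit marks the trace as sampled.
    fields["sampled"] = bool(int(str(fields["flags"]), 16) & 0x01)
    return fields


def same_trace(inbound: str, outbound: str) -> bool:
    """True when both headers are valid and carry the same trace ID."""
    a, b = parse_traceparent(inbound), parse_traceparent(outbound)
    return a is not None and b is not None and a["trace_id"] == b["trace_id"]
```

In an integration test, one way to use this is to capture the `traceparent` a service receives and the one it sends downstream, then assert `same_trace(inbound, outbound)`; the parent ID is expected to differ between hops while the trace ID stays constant.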

Default QA stance

Treat telemetry as part of acceptance criteria (especially for integration/E2E tests).
Require correlation: request_id + trace_id (traceparent) across boundaries.
Prefer SLO-based release gating and burn-rate alerting over raw infra thresholds.
Budget overhead: sampling, cardinality, retention, and cost are quality constraints.
Redact PII/secrets by default (logs and attributes).
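
As an illustration of the correlation and redaction points above, here is a minimal sketch of a JSON log formatter built on Python's standard `logging` module. The `SENSITIVE_KEYS` set, the `attributes` field name, and the `[REDACTED]` marker are assumptions chosen for the example; a real deployment would need its own key list and policy.

```python
import json
import logging

# Illustrative key list only; extend it to match your own data classification.
SENSITIVE_KEYS = {"password", "token", "authorization", "email"}


def redact(value):
    """Recursively mask values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record with correlation IDs attached."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation fields are expected to arrive via `extra=...`.
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
            "attributes": redact(getattr(record, "attributes", {})),
        }
        return json.dumps(payload, sort_keys=True)
```

A caller would attach the formatter to a handler and log with something like `logger.info("login failed", extra={"request_id": rid, "trace_id": tid, "attributes": {...}})`. Redacting by key name is a coarse default: it does not catch secrets embedded in free-text messages, so those still need review.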

Core workflows

Establish the minimum bar (logs + metrics + traces + correlation).
Instrument with OpenTelemetry (auto-instrument first, then add manual spans for key paths).
Verify context propagation across service boundaries (traceparent in/out).
Define SLIs/SLOs and error budget policy; wire burn-rate alerts.
Make failures diagnosable: capture a trace link + key logs on every test failure.
Profile and load test only after telemetry is reliable; validate against baselines.
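
The burn-rate step above can be sketched in a few lines. This is a simplified illustration, assuming burn rate is defined as the observed error rate divided by the error budget (`1 - SLO target`), and that a multi-window alert fires only when both a long and a short window exceed the same threshold. The function names and the example threshold of 14.4 (a value often paired with a 1-hour window for a 30-day SLO) are assumptions for the example, not values taken from the assets referenced here.

```python
from typing import Tuple


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if total == 0:
        return 0.0  # no traffic means no budget consumed
    budget = 1.0 - slo_target
    return (errors / total) / budget


def should_alert(
    long_window: Tuple[int, int],
    short_window: Tuple[int, int],
    slo_target: float,
    threshold: float,
) -> bool:
    """Fire only when both windows burn at or above the threshold.

    Each window is an (errors, total) pair. The long window shows the burn
    is significant; the short window shows it is still happening.
    """
    long_burn = burn_rate(long_window[0], long_window[1], slo_target)
    short_burn = burn_rate(short_window[0], short_window[1], slo_target)
    return long_burn >= threshold and short_burn >= threshold
```

For example, with a 99.9% target, 20 errors in 1,000 requests is a 2% error rate against a 0.1% budget, so the burn rate is about 20. If the short window has already recovered to zero errors, `should_alert` stays quiet, which is the property that reduces alerts on incidents that have already ended.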

Quick reference

| Task | Recommended default | Notes |
| --- | --- | --- |
| Tracing | OpenTelemetry + Jaeger/Tempo | Prefer OTLP exporters via Collector when possible |
| Metrics | Prometheus + Grafana | Use histograms for latency; watch cardinality |
| Logging | Structured JSON + correlation IDs | Never log secrets/PII; redact aggressively |
| Reliability gates | SLOs + error budgets + burn-rate alerts | Gate releases on sustained burn/regressions |
| Performance | Profiling + load tests + budgets | Add continuous profiling for intermittent issues |
| Zero-code visibility | eBPF (OpenTelemetry zero-code) + continuous profiling (Parca/Pyroscope) | Use when code changes are not feasible |
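
To make the "watch cardinality" note in the table more tangible, the sketch below counts distinct label combinations in a sample of metric label sets and ranks labels by how many distinct values they take. It is a rough offline check under the assumption that each time series is identified by its full label set; the function names are hypothetical and no metrics backend API is involved.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def series_count(label_sets: List[Dict[str, str]]) -> int:
    """Number of distinct label combinations, i.e. distinct series."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


def distinct_values_per_label(
    label_sets: List[Dict[str, str]],
) -> List[Tuple[str, int]]:
    """Labels ranked by distinct value count, highest first."""
    values = defaultdict(set)
    for labels in label_sets:
        for name, value in labels.items():
            values[name].add(value)
    # Sort by descending count, then by name for a stable order.
    return sorted(
        ((name, len(seen)) for name, seen in values.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )
```

Labels that rank at the top with unbounded value sets (user IDs, raw URLs, request IDs) are the usual candidates to move out of metric labels and into logs or trace attributes.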

Navigation

| If the user needs... | Read | Also use |
| --- | --- | --- |
| A minimal, production-ready baseline | `references/core-observability-patterns.md` | `assets/checklists/template-observability-readiness-checklist.md` |
| Node/Python instrumentation setup | `references/opentelemetry-best-practices.md` | `assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md`, `assets/opentelemetry/python/opentelemetry-python-setup.md` |
| Working trace propagation across services | `references/distributed-tracing-patterns.md` | `assets/checklists/template-observability-readiness-checklist.md` |
| SLOs, burn-rate alerts, and release gates | `references/slo-design-guide.md` | `assets/monitoring/slo/slo-definition.yaml`, `assets/monitoring/slo/prometheus-alert-rules.yaml` |
| Profiling/load testing with evidence | `references/performance-profiling-guide.md` | `assets/load-testing/load-testing-k6.js`, `assets/load-testing/template-load-test-artillery.yaml` |
| A maturity model and roadmap | `references/observability-maturity-model.md` | `assets/checklists/template-observability-readiness-checklist.md` |
| What to avoid and how to fix it | `references/anti-patterns-best-practices.md` | `assets/checklists/template-observability-readiness-checklist.md` |
| Alert design and fatigue reduction | `references/alerting-strategies.md` | `assets/monitoring/slo/prometheus-alert-rules.yaml` |
| Dashboard hierarchy and layout | `references/dashboard-design-patterns.md` | `assets/monitoring/grafana/template-grafana-dashboard-observability.json` |
| Structured logging and cost control | `references/log-aggregation-patterns.md` | `assets/observability/template-logging-setup.md` |

Implementation guides (deep dives):

`references/core-observability-patterns.md`

`references/opentelemetry-best-practices.md`

`references/distributed-tracing-patterns.md`

`references/slo-design-guide.md`

`references/performance-profiling-guide.md`

`references/observability-maturity-model.md`

`references/anti-patterns-best-practices.md`

`references/alerting-strategies.md`

`references/dashboard-design-patterns.md`

`references/log-aggregation-patterns.md`

Templates (copy/paste):

`assets/checklists/template-observability-readiness-checklist.md`

`assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md`

`assets/opentelemetry/python/opentelemetry-python-setup.md`

`assets/monitoring/slo/slo-definition.yaml`

`assets/monitoring/slo/prometheus-alert-rules.yaml`

`assets/monitoring/grafana/grafana-dashboard-slo.json`

`assets/monitoring/grafana/template-grafana-dashboard-observability.json`

`assets/load-testing/load-testing-k6.js`

`assets/load-testing/template-load-test-artillery.yaml`

`assets/performance/frontend/template-lighthouse-ci.json`

`assets/performance/backend/template-nodejs-profiling-config.js`

Curated sources:

`data/sources.json`

Scope boundaries (handoffs)

Pure infrastructure monitoring (Kubernetes, Docker, CI/CD): `../ops-devops-platform/SKILL.md`
Database query optimization (SQL tuning, indexing): `../data-sql-optimization/SKILL.md`
Application-level debugging (stack traces, breakpoints): `../qa-debugging/SKILL.md`
Test strategy design (coverage, test pyramids): `../qa-testing-strategy/SKILL.md`
Resilience patterns (retries, circuit breakers): `../qa-resilience/SKILL.md`
Architecture decisions (microservices, event-driven): `../software-architecture-design/SKILL.md`

Tool selection notes (2026)

Default to OpenTelemetry + OTLP + Collector where possible.
Prefer burn-rate alerting against SLOs over alerting on raw infra metrics.
Treat sampling, cardinality, and retention as part of quality (not an afterthought).
When asked to pick vendors/tools, start from `data/sources.json` and validate time-sensitive claims with current docs/releases if the environment allows it.

Fact-Checking

Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
Prefer primary sources; report source links and dates for volatile information.
If web access is unavailable, state the limitation and mark guidance as unverified.
