qa-observability

QA Observability and Performance Engineering

Use telemetry (logs, metrics, traces, profiles) as a QA signal and a debugging substrate.
Core references (see data/sources.json): OpenTelemetry, W3C Trace Context, and SLO practices (Google SRE).

Quick Start (Default)

If key context is missing, ask for: critical user journeys, service/dependency inventory, environments (local/staging/prod), current telemetry stack, and current SLO/SLA commitments (if any).
  1. Establish the minimum bar: correlation IDs + structured logs + traces + golden metrics (latency, traffic, errors, saturation).
  2. Verify propagation: confirm traceparent (and your request ID) flow across boundaries end-to-end.
  3. Make failures diagnosable: every test failure captures a trace link (or trace ID) plus the correlated logs.
  4. Define SLIs/SLOs and error budget policy; wire burn-rate alerts (prefer multi-window burn rates).
  5. Produce artifacts: a readiness checklist plus an SLO definition and alert rules (use assets/checklists/template-observability-readiness-checklist.md and assets/monitoring/slo/*).
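
The propagation check in step 2 can be sketched with the standard library alone. This is a minimal illustration, assuming the W3C Trace Context header layout (version, 32-hex trace-id, 16-hex parent-id, 2-hex flags) and handling only version 00; the helper names (new_traceparent, parse_traceparent, child_traceparent, propagation_ok) are hypothetical and not part of any SDK. In a real service the OpenTelemetry propagators would do this work.

```python
import re
import secrets

# Assumption: only version "00" of the traceparent header is handled here.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: random trace-id and parent-id, sampled flag optional."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the Trace Context specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Simulate one hop: keep the trace-id, mint a new parent-id."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()  # broken context: a new trace starts here
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"


def propagation_ok(inbound, outbound):
    """True when both headers are valid and share a trace-id across the hop."""
    a, b = parse_traceparent(inbound), parse_traceparent(outbound)
    return (
        a is not None
        and b is not None
        and a["trace_id"] == b["trace_id"]
        and a["parent_id"] != b["parent_id"]
    )
```

A test harness could send a request with a known traceparent, read the header the downstream service received, and assert propagation_ok on the pair; a False result points at a boundary that drops or regenerates context.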

Default QA stance

  • Treat telemetry as part of acceptance criteria (especially for integration/E2E tests).
  • Require correlation: request_id + trace_id (traceparent) across boundaries.
  • Prefer SLO-based release gating and burn-rate alerting over raw infra thresholds.
  • Budget overhead: sampling, cardinality, retention, and cost are quality constraints.
  • Redact PII/secrets by default (logs and attributes).
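
The correlation and redaction points above can be illustrated with Python's standard logging module. This is a sketch under stated assumptions: the SENSITIVE_KEYS set, the logger name, and the "fields" extra key are all hypothetical choices for this example, and a key-name denylist is only one possible redaction strategy.

```python
import io
import json
import logging

# Hypothetical denylist; a real project would maintain its own.
SENSITIVE_KEYS = {"password", "token", "authorization", "email"}


def redact(fields):
    """Replace values of sensitive keys before they reach the log sink."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, always carrying correlation IDs."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
            "fields": redact(getattr(record, "fields", {})),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("qa-observability-sketch")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info(
        "login failed",
        extra={
            "request_id": "req-1",
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "fields": {"password": "hunter2", "status": 401},
        },
    )
    print(buffer.getvalue(), end="")
```

Keeping request_id and trace_id as top-level keys makes it straightforward to join a failing test's log lines with its trace in whatever backend is in use.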

Core workflows

  1. Establish the minimum bar (logs + metrics + traces + correlation).
  2. Instrument with OpenTelemetry (auto-instrument first, then add manual spans for key paths).
  3. Verify context propagation across service boundaries (traceparent in/out).
  4. Define SLIs/SLOs and error budget policy; wire burn-rate alerts.
  5. Make failures diagnosable: capture a trace link + key logs on every test failure.
  6. Profile and load test only after telemetry is reliable; validate against baselines.
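
The burn-rate alerting in step 4 reduces to a small calculation. The sketch below assumes a request-based SLI and a two-window rule (page only when both a long and a short window exceed the threshold). The default threshold of 14.4 is an assumption taken from the commonly cited example of spending 2% of a 30-day budget in one hour; the Window type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window."""

    errors: int
    total: int

    def error_ratio(self):
        return self.errors / self.total if self.total else 0.0


def burn_rate(window, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return window.error_ratio() / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Multi-window rule: both windows must burn at or above the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )
```

For a 99.9% SLO, a long window of Window(errors=200, total=10000) burns at roughly 20x. Paired with a short window that has already recovered, should_page returns False, which is the behaviour that keeps pages from lingering after an incident ends.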

Quick reference

| Task | Recommended default | Notes |
| --- | --- | --- |
| Tracing | OpenTelemetry + Jaeger/Tempo | Prefer OTLP exporters via Collector when possible |
| Metrics | Prometheus + Grafana | Use histograms for latency; watch cardinality |
| Logging | Structured JSON + correlation IDs | Never log secrets/PII; redact aggressively |
| Reliability gates | SLOs + error budgets + burn-rate alerts | Gate releases on sustained burn/regressions |
| Performance | Profiling + load tests + budgets | Add continuous profiling for intermittent issues |
| Zero-code visibility | eBPF (OpenTelemetry zero-code) + continuous profiling (Parca/Pyroscope) | Use when code changes are not feasible |
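
To illustrate the "use histograms for latency" note, the sketch below records latencies into a fixed set of buckets and estimates a quantile by linear interpolation inside the bucket that contains the target rank. It is a simplified stand-in for what a metrics backend does, not a reimplementation of any particular one; the LatencyHistogram class and the bucket bounds used in the example are assumptions.

```python
import bisect


class LatencyHistogram:
    """Fixed-bucket histogram; the series count stays bounded by the bucket list."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds)
        # One counter per finite bucket plus a final overflow (+Inf) bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, seconds):
        # bisect_left finds the first bucket whose upper bound is >= the value.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1

    def quantile(self, q):
        """Estimate the q-quantile (0 <= q <= 1); None when nothing was observed."""
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be between 0 and 1")
        total = sum(self.counts)
        if total == 0:
            return None
        rank = q * total
        cumulative = 0
        for index, count in enumerate(self.counts):
            if count and cumulative + count >= rank:
                if index == len(self.bounds):
                    # Overflow bucket has no upper bound: report the largest finite one.
                    return self.bounds[-1]
                lower = self.bounds[index - 1] if index > 0 else 0.0
                upper = self.bounds[index]
                return lower + (upper - lower) * (rank - cumulative) / count
            cumulative += count
        return self.bounds[-1]
```

The estimate is only as precise as the bucket layout, so bucket bounds are best chosen around the latency thresholds the SLO cares about. Because the bucket list is fixed, adding more observations never adds more series, which is the cardinality property the table points at.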

Navigation

Open these guides when needed:
| If the user needs... | Read | Also use |
| --- | --- | --- |
| A minimal, production-ready baseline | references/core-observability-patterns.md | assets/checklists/template-observability-readiness-checklist.md |
| Node/Python instrumentation setup | references/opentelemetry-best-practices.md | assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md, assets/opentelemetry/python/opentelemetry-python-setup.md |
| Working trace propagation across services | references/distributed-tracing-patterns.md | assets/checklists/template-observability-readiness-checklist.md |
| SLOs, burn-rate alerts, and release gates | references/slo-design-guide.md | assets/monitoring/slo/slo-definition.yaml, assets/monitoring/slo/prometheus-alert-rules.yaml |
| Profiling/load testing with evidence | references/performance-profiling-guide.md | assets/load-testing/load-testing-k6.js, assets/load-testing/template-load-test-artillery.yaml |
| A maturity model and roadmap | references/observability-maturity-model.md | assets/checklists/template-observability-readiness-checklist.md |
| What to avoid and how to fix it | references/anti-patterns-best-practices.md | assets/checklists/template-observability-readiness-checklist.md |
| Alert design and fatigue reduction | references/alerting-strategies.md | assets/monitoring/slo/prometheus-alert-rules.yaml |
| Dashboard hierarchy and layout | references/dashboard-design-patterns.md | assets/monitoring/grafana/template-grafana-dashboard-observability.json |
| Structured logging and cost control | references/log-aggregation-patterns.md | assets/observability/template-logging-setup.md |
Implementation guides (deep dives):
  • references/core-observability-patterns.md
  • references/opentelemetry-best-practices.md
  • references/distributed-tracing-patterns.md
  • references/slo-design-guide.md
  • references/performance-profiling-guide.md
  • references/observability-maturity-model.md
  • references/anti-patterns-best-practices.md
  • references/alerting-strategies.md
  • references/dashboard-design-patterns.md
  • references/log-aggregation-patterns.md
Templates (copy/paste):
  • assets/checklists/template-observability-readiness-checklist.md
  • assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md
  • assets/opentelemetry/python/opentelemetry-python-setup.md
  • assets/monitoring/slo/slo-definition.yaml
  • assets/monitoring/slo/prometheus-alert-rules.yaml
  • assets/monitoring/grafana/grafana-dashboard-slo.json
  • assets/monitoring/grafana/template-grafana-dashboard-observability.json
  • assets/load-testing/load-testing-k6.js
  • assets/load-testing/template-load-test-artillery.yaml
  • assets/performance/frontend/template-lighthouse-ci.json
  • assets/performance/backend/template-nodejs-profiling-config.js
Curated sources:
  • data/sources.json

Scope boundaries (handoffs)

  • Pure infrastructure monitoring (Kubernetes, Docker, CI/CD): ../ops-devops-platform/SKILL.md
  • Database query optimization (SQL tuning, indexing): ../data-sql-optimization/SKILL.md
  • Application-level debugging (stack traces, breakpoints): ../qa-debugging/SKILL.md
  • Test strategy design (coverage, test pyramids): ../qa-testing-strategy/SKILL.md
  • Resilience patterns (retries, circuit breakers): ../qa-resilience/SKILL.md
  • Architecture decisions (microservices, event-driven): ../software-architecture-design/SKILL.md

Tool selection notes (2026)

  • Default to OpenTelemetry + OTLP + Collector where possible.
  • Prefer burn-rate alerting against SLOs over alerting on raw infra metrics.
  • Treat sampling, cardinality, and retention as part of quality (not an afterthought).
  • When asked to pick vendors/tools, start from data/sources.json and validate time-sensitive claims with current docs/releases if the environment allows it.

Fact-Checking

  • Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
  • Prefer primary sources; report source links and dates for volatile information.
  • If web access is unavailable, state the limitation and mark guidance as unverified.