vtex-io-observability-and-ops

Observability & Operational Readiness

When this skill applies

Use this skill when a VTEX IO service needs better production visibility, troubleshooting behavior, or operational safety.

Adding metrics to important client calls or flows
Improving logs for routes, workers, or integrations
Surfacing failures clearly for operations and support
Reviewing whether a service is ready for production
Monitoring rate-limit-sensitive integrations

Do not use this skill for:

app policy declaration
trust-boundary modeling
frontend analytics or browser monitoring
route contract design by itself

Decision rules

Log enough structured context to debug failures, but do not log secrets or sensitive payloads.
Use `ctx.vtex.logger` with appropriate log levels such as `info`, `warn`, and `error` instead of `console.log`, so logs are properly collected and searchable in the VTEX logging stack.
Treat `ctx.vtex.logger` as the native platform logging mechanism. If a partner needs to forward logs to its own logging system, prefer doing that through a dedicated integration app or client instead of replacing the VTEX logger pattern inside every service.
Use client-level metrics on important downstream calls so integration behavior is visible below the handler layer.
Choose metric names that reflect the integration and operation, such as `partner-get-order` or `partner-sync-catalog`, so counts, latency, and error rates can be tracked over time.
Make failures observable at the point where they happen. Do not swallow errors silently in routes, events, or workers.
For rate-limit-sensitive APIs, combine short timeouts, backoff-aware retries, and caching of frequent reads to reduce burst pressure and avoid hitting hard limits.
Review whether expensive or fragile flows expose enough operational signals before releasing them.
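
The rate-limit rule above mentions backoff-aware retries. As a rough illustration only, the sketch below shows one way to express capped exponential backoff in plain TypeScript. `withBackoff`, `backoffDelay`, and `RetryOptions` are hypothetical names invented for this example and are not part of the VTEX platform API; the platform client may already offer its own retry settings, which should be preferred when they fit.

```typescript
// Hypothetical helper for illustration; not part of any VTEX package.
interface RetryOptions {
  retries: number // additional attempts after the first call
  baseDelayMs: number // delay before the first retry
  maxDelayMs: number // upper bound for any single delay
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt))
}

function sleepMs(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms))
}

async function withBackoff<T>(
  call: () => Promise<T>,
  opts: RetryOptions,
  onRetry: (attempt: number, error: unknown) => void = () => {}
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call()
    } catch (error) {
      // Out of retries: rethrow so the failure stays visible to the caller.
      if (attempt >= opts.retries) throw error
      // Give the caller a chance to log or count the retry before waiting.
      onRetry(attempt, error)
      await sleepMs(backoffDelay(attempt, opts))
    }
  }
}
```

The `onRetry` hook is a natural place to call `ctx.vtex.logger.warn` with the attempt number, so that retry pressure shows up as an operational signal rather than disappearing inside the helper.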

Hard constraints

Constraint: Important failures must be visible in logs, metrics, or durable state

Routes, event handlers, and workers MUST NOT hide important failures from operators.

Why this matters

If failures disappear silently, the service becomes impossible to diagnose under real traffic and retries.

Detection

If an error is caught and ignored without logging, metric emission, or explicit failure state, STOP and surface the failure.

Correct

```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (error) {
  ctx.vtex.logger.error({
    message: 'Failed to send order to partner',
    orderId,
    account: ctx.vtex.account,
    routeId: ctx.vtex.route?.id,
  })
  throw error
}
```

Wrong

```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (_) {
  return
}
```
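
Rethrowing, as in the Correct example, is not the only way to keep a failure visible. When a worker processes a batch and one bad item should not abort the rest, a possible alternative is to log each failure and return it as explicit failure state. The sketch below assumes a hypothetical `sendOrders` function, `BatchResult` shape, and `ErrorLogger` interface; only the `error()` call mirrors the `ctx.vtex.logger` usage shown above.

```typescript
// Hypothetical shapes for illustration only.
interface ErrorLogger {
  error(entry: Record<string, unknown>): void
}

interface BatchResult {
  sent: string[]
  failed: string[]
}

async function sendOrders(
  orderIds: string[],
  sendOrder: (orderId: string) => Promise<void>,
  logger: ErrorLogger
): Promise<BatchResult> {
  const result: BatchResult = { sent: [], failed: [] }
  for (const orderId of orderIds) {
    try {
      await sendOrder(orderId)
      result.sent.push(orderId)
    } catch (error) {
      // Log the failure where it happens, then record it instead of hiding it.
      logger.error({
        message: 'Failed to send order to partner',
        orderId,
        reason: error instanceof Error ? error.message : String(error),
      })
      result.failed.push(orderId)
    }
  }
  return result
}
```

The caller can then decide whether a non-empty `failed` list should fail the whole job, be persisted for a later retry, or be emitted as a metric. What matters for this constraint is that the failure is never silently dropped.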

Constraint: Metrics should be attached to important integration calls

Client calls that are operationally important SHOULD include `metric` so request behavior can be tracked consistently.

Why this matters

Without metrics, integration failures and latency patterns are much harder to isolate from generic route behavior.

Detection

If a key downstream integration call has no `metric` and operations depend on it, STOP and add a meaningful metric name.

Correct

```typescript
return this.http.get(`/orders/${id}`, {
  metric: 'partner-get-order',
})
```

Wrong

```typescript
return this.http.get(`/orders/${id}`)
```
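
To show how the snippets above might sit inside a client, here is a minimal sketch in which every operation carries its own metric name. `HttpLike` is a stand-in interface written for this example so the code is self-contained; it models only a `metric` option and is not the platform's actual HTTP client type. `PartnerOrdersClient`, `PartnerOrder`, and the `/catalog/sync` path are likewise assumptions.

```typescript
// Minimal stand-in for the HTTP client used above (assumption, not the real type).
interface HttpLike {
  get<T>(url: string, config?: { metric?: string }): Promise<T>
  post<T>(url: string, data: unknown, config?: { metric?: string }): Promise<T>
}

// Hypothetical response shape.
interface PartnerOrder {
  id: string
  status: string
}

class PartnerOrdersClient {
  private readonly http: HttpLike

  constructor(http: HttpLike) {
    this.http = http
  }

  // One metric name per operation keeps latency and error rates separable.
  getOrder(id: string): Promise<PartnerOrder> {
    return this.http.get<PartnerOrder>(`/orders/${encodeURIComponent(id)}`, {
      metric: 'partner-get-order',
    })
  }

  // The path is hypothetical; the metric name follows the same naming scheme.
  syncCatalog(items: unknown[]): Promise<void> {
    return this.http.post<void>('/catalog/sync', { items }, {
      metric: 'partner-sync-catalog',
    })
  }
}
```

Keeping the metric name next to the call, rather than in the route handler, means the same name is emitted no matter which route, event, or worker triggers the request.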

Constraint: Logs must stay useful without leaking sensitive data

Logs MUST contain enough context to debug production behavior, but MUST NOT include secrets, tokens, or unnecessarily sensitive payloads.

Why this matters

Operational logs are only valuable if they are safe to retain and inspect. Sensitive logging creates security risk while still failing to guarantee useful diagnosis.

Detection

If a log line includes tokens, auth headers, raw personal payloads, or entire downstream responses, STOP and sanitize the log.

Correct

```typescript
ctx.vtex.logger.info({
  message: 'Partner sync started',
  orderId,
  account: ctx.vtex.account,
})
```

Wrong

```typescript
ctx.vtex.logger.info({
  message: 'Partner sync started',
  body: ctx.request.body,
  auth: ctx.request.header.authorization,
})
```
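
Logging a few explicitly chosen fields, as in the Correct example, remains the safest approach. Where a service still needs to log a larger object, a redaction step can act as a safety net. The sketch below is one possible shape for such a step; `redactForLog`, `isSensitiveKey`, and the key list are hypothetical and would need to be adapted to the payloads a given service actually handles.

```typescript
// Hypothetical key fragments; extend for the payloads your service sees.
const SENSITIVE_KEY_PARTS = ['authorization', 'cookie', 'password', 'token', 'secret', 'apikey']

// Matches case-insensitively and ignores separators, so "api-key" and "apiKey" both hit.
function isSensitiveKey(key: string): boolean {
  const normalized = key.toLowerCase().replace(/[^a-z0-9]/g, '')
  return SENSITIVE_KEY_PARTS.some((part) => normalized.includes(part))
}

// Returns a copy of `value` with sensitive keys replaced; the input is not mutated.
function redactForLog(value: unknown, depth = 0): unknown {
  // Stop descending on very deep structures to keep log entries bounded.
  if (depth > 5) return '[truncated]'
  if (Array.isArray(value)) {
    return value.map((item) => redactForLog(item, depth + 1))
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {}
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = isSensitiveKey(key) ? '[redacted]' : redactForLog(inner, depth + 1)
    }
    return out
  }
  return value
}
```

A key-based filter like this cannot recognize personal data stored under innocuous key names, so it complements an allow-list of fields and does not replace one.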

Preferred pattern

Common failure modes

Catching and ignoring errors in async flows.
Logging too little context to diagnose production incidents.
Logging too much sensitive data.
Omitting metrics from important integration calls.
Treating rate-limit failures as isolated bugs instead of operational signals.

Review checklist

Reference

Using Node Clients - Client usage patterns relevant to metrics and retries
Best practices for avoiding rate-limit errors - Operational guidance for stable integrations
