vtex-io-observability-and-ops
Observability & Operational Readiness
When this skill applies
Use this skill when a VTEX IO service needs better production visibility, troubleshooting behavior, or operational safety.
- Adding metrics to important client calls or flows
- Improving logs for routes, workers, or integrations
- Surfacing failures clearly for operations and support
- Reviewing whether a service is ready for production
- Monitoring rate-limit-sensitive integrations
Do not use this skill for:
- app policy declaration
- trust-boundary modeling
- frontend analytics or browser monitoring
- route contract design by itself
Decision rules
- Log enough structured context to debug failures, but do not log secrets or sensitive payloads.
- Use `ctx.vtex.logger` with appropriate log levels such as `info`, `warn`, and `error` instead of `console.log`, so logs are properly collected and searchable in the VTEX logging stack.
- Treat `ctx.vtex.logger` as the native platform logging mechanism. If a partner needs to forward logs to its own logging system, prefer doing that through a dedicated integration app or client instead of replacing the VTEX logger pattern inside every service.
- Use client-level metrics on important downstream calls so integration behavior is visible below the handler layer.
- Choose metric names that reflect the integration and operation, such as `partner-get-order` or `partner-sync-catalog`, so counts, latency, and error rates can be tracked over time.
- Make failures observable at the point where they happen. Do not swallow errors silently in routes, events, or workers.
- For rate-limit-sensitive APIs, combine short timeouts, backoff-aware retries, and caching of frequent reads to reduce burst pressure and avoid hitting hard limits.
- Review whether expensive or fragile flows expose enough operational signals before releasing them.
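The backoff-aware retry rule above can be sketched as a small helper. This is a minimal illustration, not a VTEX IO API: `retryWithBackoff`, `backoffDelay`, `RetryOptions`, and the `status` / `retryAfterMs` fields on the error are hypothetical names chosen for the example. It covers only the retry part of the rule; timeouts and caching of frequent reads would still need to be configured on the client itself.

```typescript
// Hypothetical helper: none of these names come from a VTEX library.
interface RetryOptions {
  maxAttempts: number // total attempts, including the first one
  baseDelayMs: number // delay before the first retry
  maxDelayMs: number // upper bound for any single delay
}

// Assumed error shape: the status code and an optional server retry hint.
interface HttpLikeError {
  status?: number
  retryAfterMs?: number
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Exponential backoff: baseDelayMs, then 2x, 4x, ... capped at maxDelayMs.
// A server-provided hint, when present, replaces the computed delay (still capped).
function backoffDelay(attempt: number, opts: RetryOptions, retryAfterMs?: number): number {
  const exponential = opts.baseDelayMs * 2 ** (attempt - 1)
  return Math.min(retryAfterMs ?? exponential, opts.maxDelayMs)
}

async function retryWithBackoff<T>(
  call: () => Promise<T>,
  opts: RetryOptions,
  onRetry: (attempt: number, delayMs: number, status?: number) => void = () => {}
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call()
    } catch (error) {
      const { status, retryAfterMs } = (error ?? {}) as HttpLikeError
      // Retry only rate-limit (429) and server-side (5xx) failures.
      const retryable = status === 429 || (status !== undefined && status >= 500)
      if (!retryable || attempt >= opts.maxAttempts) {
        throw error // surface the failure instead of hiding it
      }
      const delayMs = backoffDelay(attempt, opts, retryAfterMs)
      onRetry(attempt, delayMs, status) // hook for a warn-level log or a counter
      await sleep(delayMs)
    }
  }
}
```

The `onRetry` hook is where a service could emit a `warn` log through `ctx.vtex.logger`, so that retries caused by rate limiting stay visible rather than being absorbed silently.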
Hard constraints
Constraint: Important failures must be visible in logs, metrics, or durable state
Routes, event handlers, and workers MUST NOT hide important failures from operators.
Why this matters
If failures disappear silently, the service becomes impossible to diagnose under real traffic and retries.
Detection
If an error is caught and ignored without logging, metric emission, or explicit failure state, STOP and surface the failure.
Correct
```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (error) {
  // Log identifiers that help locate the failure, then rethrow it.
  ctx.vtex.logger.error({
    message: 'Failed to send order to partner',
    orderId,
    account: ctx.vtex.account,
    routeId: ctx.vtex.route?.id,
  })
  throw error
}
```
Wrong
```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (_) {
  // The error is dropped here: nothing is logged, measured, or rethrown.
  return
}
```
Constraint: Metrics should be attached to important integration calls
Client calls that are operationally important SHOULD include `metric` so request behavior can be tracked consistently.
Why this matters
Without metrics, integration failures and latency patterns are much harder to isolate from generic route behavior.
Detection
If a key downstream integration call has no `metric` and operations depend on it, STOP and add a meaningful metric name.
Correct
```typescript
return this.http.get(`/orders/${id}`, {
  metric: 'partner-get-order',
})
```
Wrong
```typescript
return this.http.get(`/orders/${id}`)
```
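To keep metric names consistent across clients, the naming convention can be captured in a small helper. The sketch below is only an illustration: `metricName` and `toKebab` are hypothetical names, and the kebab-case `integration-operation` format is inferred from the examples in this document rather than required by the platform.

```typescript
// Hypothetical helper: normalizes free-form text to kebab-case.
function toKebab(value: string): string {
  return value
    .trim()
    .replace(/([a-z0-9])([A-Z])/g, '$1-$2') // split camelCase boundaries
    .replace(/[^a-zA-Z0-9]+/g, '-') // collapse separators into one dash
    .replace(/^-+|-+$/g, '') // drop leading and trailing dashes
    .toLowerCase()
}

// Builds "<integration>-<operation>", for example "partner-get-order".
function metricName(integration: string, operation: string): string {
  const parts = [toKebab(integration), toKebab(operation)]
  if (parts.some((part) => part.length === 0)) {
    throw new Error('metric names need both an integration and an operation')
  }
  return parts.join('-')
}
```

A client method could then pass `metric: metricName('partner', 'getOrder')` in its request options, which keeps names uniform when several developers add calls to the same integration.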
Constraint: Logs must stay useful without leaking sensitive data
Logs MUST contain enough context to debug production behavior, but MUST NOT include secrets, tokens, or unnecessarily sensitive payloads.
Why this matters
Operational logs are only valuable if they are safe to retain and inspect. Logging sensitive data creates security risk without guaranteeing a more useful diagnosis.
Detection
If a log line includes tokens, auth headers, raw personal payloads, or entire downstream responses, STOP and sanitize the log.
Correct
```typescript
// Identifiers only: enough to trace the flow, nothing sensitive.
ctx.vtex.logger.info({
  message: 'Partner sync started',
  orderId,
  account: ctx.vtex.account,
})
```
Wrong
```typescript
// Leaks the raw request body and the authorization header into logs.
ctx.vtex.logger.info({
  message: 'Partner sync started',
  body: ctx.request.body,
  auth: ctx.request.header.authorization,
})
```
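When a log entry has to carry part of a payload, a redaction step can act as a safety net. The following is a minimal sketch under stated assumptions: `redact`, `isSensitiveKey`, and the key list are hypothetical and illustrative, not an official or complete list of sensitive fields.

```typescript
// Hypothetical list: extend it to match the payloads your service handles.
const SENSITIVE_KEYS = ['authorization', 'cookie', 'token', 'password', 'secret', 'apikey', 'appkey']

// Matches keys regardless of case and separators, e.g. "X-VTEX-API-AppKey".
const isSensitiveKey = (key: string): boolean => {
  const normalized = key.toLowerCase().replace(/[^a-z0-9]/g, '')
  return SENSITIVE_KEYS.some((sensitive) => normalized.indexOf(sensitive) !== -1)
}

// Returns a copy with sensitive fields replaced; the input is left untouched.
function redact(value: unknown, depth = 0): unknown {
  if (depth > 5) return '[truncated]' // guard against very deep or cyclic objects
  if (Array.isArray(value)) return value.map((item) => redact(item, depth + 1))
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>
    const result: Record<string, unknown> = {}
    for (const key of Object.keys(source)) {
      result[key] = isSensitiveKey(key) ? '[redacted]' : redact(source[key], depth + 1)
    }
    return result
  }
  return value
}
```

Key-based redaction does not catch personal data stored under neutral field names, so logging explicit identifiers such as `orderId`, as in the correct example above, remains the safer default.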
Preferred pattern
Operationally healthy VTEX IO services should:
- emit metrics for important client calls so counts, latency, and error rates are visible
- log failures with enough structured context such as domain IDs, account, and `routeId`
- avoid silent error swallowing
- sanitize sensitive data before logging
- review retries, caching, and throughput with rate-limit behavior in mind
Use observability to shorten diagnosis time, not just to create more logs.
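The logging parts of this pattern can be combined in one wrapper. The sketch below is an assumption-laden illustration: `LoggerLike` is a minimal stand-in for `ctx.vtex.logger`, and `OperationContext`, `failureEntry`, and `observed` are hypothetical names that do not come from any VTEX library.

```typescript
// Minimal stand-in for ctx.vtex.logger; only the method this sketch needs.
interface LoggerLike {
  error(entry: Record<string, unknown>): void
}

// Hypothetical context shape: account and routeId mirror the fields logged
// elsewhere in this document, ids carries domain identifiers such as orderId.
interface OperationContext {
  operation: string
  account: string
  routeId?: string
  ids?: Record<string, string>
}

// Builds a structured entry from identifiers only: no bodies or headers.
function failureEntry(error: unknown, context: OperationContext): Record<string, unknown> {
  const err = error instanceof Error ? error : new Error(String(error))
  return {
    message: `${context.operation} failed`,
    errorName: err.name,
    errorMessage: err.message,
    account: context.account,
    routeId: context.routeId,
    ...context.ids,
  }
}

// Runs an operation, logs a failure once with context, then rethrows it.
async function observed<T>(
  logger: LoggerLike,
  context: OperationContext,
  run: () => Promise<T>
): Promise<T> {
  try {
    return await run()
  } catch (error) {
    logger.error(failureEntry(error, context))
    throw error // keep the failure visible to the caller and to retries
  }
}
```

A handler might call `observed(ctx.vtex.logger, { operation: 'partner-send-order', account: ctx.vtex.account, ids: { orderId } }, () => ctx.clients.partnerApi.sendOrder(orderId))`. Error messages from downstream systems can themselves contain sensitive values, so it is worth checking what `errorMessage` may carry before adopting something like this.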
Common failure modes
- Catching and ignoring errors in async flows.
- Logging too little context to diagnose production incidents.
- Logging too much sensitive data.
- Omitting metrics from important integration calls.
- Treating rate-limit failures as isolated bugs instead of operational signals.
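One way to treat rate-limit failures as operational signals is to classify them explicitly in the log entry, so they can be counted per integration instead of being read one by one. This is a hedged sketch: `classifyFailure`, `failureSignal`, and the category names are hypothetical, and logging rate-limit responses at `warn` is just one possible convention.

```typescript
// Hypothetical classification: the category names are illustrative.
type FailureKind = 'rate-limited' | 'upstream-error' | 'client-error' | 'unknown'

function classifyFailure(httpStatus?: number): FailureKind {
  if (httpStatus === undefined) return 'unknown'
  if (httpStatus === 429) return 'rate-limited'
  if (httpStatus >= 500) return 'upstream-error'
  if (httpStatus >= 400) return 'client-error'
  return 'unknown'
}

// Builds a log entry that carries the classification, so rate-limit hits
// can be aggregated per integration over time.
function failureSignal(integration: string, httpStatus?: number) {
  const kind = classifyFailure(httpStatus)
  return {
    level: kind === 'rate-limited' ? 'warn' : 'error',
    entry: { message: `${integration} call failed`, integration, kind, httpStatus },
  }
}
```

A rising count of `rate-limited` entries for one integration then points toward reviewing retries, caching, and throughput together rather than patching a single call site.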
Review checklist
- Are important failures visible to operators?
- Do key integrations emit useful metrics?
- Are logs structured and safe?
- Are retries, caching, and rate-limit behavior considered together?
- Would someone on call be able to diagnose this flow from the available signals?
Reference
- Using Node Clients - Client usage patterns relevant to metrics and retries
- Best practices for avoiding rate-limit errors - Operational guidance for stable integrations