observability-designer

Observability Designer (POWERFUL)

Category: Engineering
Tier: POWERFUL
Description: Design comprehensive observability strategies for production systems including SLI/SLO frameworks, alerting optimization, and dashboard generation.

Overview

Observability Designer enables you to create production-ready observability strategies that provide deep insights into system behavior, performance, and reliability. This skill combines the three pillars of observability (metrics, logs, traces) with proven frameworks like SLI/SLO design, golden signals monitoring, and alert optimization to create comprehensive observability solutions.

Core Competencies

SLI/SLO/SLA Framework Design

  • Service Level Indicators (SLI): Define measurable signals that indicate service health
  • Service Level Objectives (SLO): Set reliability targets based on user experience
  • Service Level Agreements (SLA): Establish customer-facing commitments with consequences
  • Error Budget Management: Calculate and track error budget consumption
  • Burn Rate Alerting: Multi-window burn rate alerts for proactive SLO protection
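
The burn-rate bullets above can be made concrete with a small calculation. This is an illustrative sketch, assuming a request-based 99.9% availability SLO and the commonly cited 14.4× one-hour paging threshold; the function names are hypothetical:

```python
# Multi-window burn-rate check (illustrative; 99.9% availability SLO assumed).
# Burn rate = observed error rate / error budget rate. A burn rate of 14.4
# sustained for one hour consumes ~2% of a 30-day budget, a common paging bar.

def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than budget-neutral the budget is being spent."""
    return error_rate / error_budget(slo_target)

def should_page(err_1h: float, err_5m: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold;
    the short window keeps the alert from firing long after recovery."""
    return (burn_rate(err_1h, slo_target) > threshold and
            burn_rate(err_5m, slo_target) > threshold)
```

A 2% hourly error rate against a 0.1% budget is a burn rate of 20, so `should_page(0.02, 0.03)` pages, while `should_page(0.02, 0.0005)` does not, because the short window already shows recovery.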

Three Pillars of Observability

Metrics

  • Golden Signals: Latency, traffic, errors, and saturation monitoring
  • RED Method: Rate, Errors, and Duration for request-driven services
  • USE Method: Utilization, Saturation, and Errors for resource monitoring
  • Business Metrics: Revenue, user engagement, and feature adoption tracking
  • Infrastructure Metrics: CPU, memory, disk, network, and custom resource metrics
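
For request-driven services, the RED bullets above reduce to a small rollup. A minimal sketch, assuming raw request records of the hypothetical shape `(duration_seconds, http_status)`:

```python
def red_summary(requests, window_seconds: float) -> dict:
    """Rate, Errors, Duration for one request-driven service over one window."""
    n = len(requests)
    error_count = sum(1 for _, status in requests if status >= 500)
    total_duration = sum(duration for duration, _ in requests)
    return {
        "rate_rps": n / window_seconds,
        "error_ratio": error_count / n if n else 0.0,
        # Dashboards usually show percentiles too; the mean keeps this short.
        "mean_duration_s": total_duration / n if n else 0.0,
    }
```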

Logs

  • Structured Logging: JSON-based log formats with consistent fields
  • Log Aggregation: Centralized log collection and indexing strategies
  • Log Levels: Appropriate use of DEBUG, INFO, WARN, ERROR, FATAL levels
  • Correlation IDs: Request tracing through distributed systems
  • Log Sampling: Volume management for high-throughput systems
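
The structured-logging and correlation-ID bullets combine naturally in a few lines. A sketch using only the standard library; the JSON field names are illustrative:

```python
import contextvars
import json
import logging
import uuid

# Correlation id carried in a contextvar, so every log line emitted while
# serving one request shares the same request_id without explicit plumbing.
request_id = contextvars.ContextVar("request_id", default="-")

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record with a consistent field set."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id.get(),
        })

def handle_request(logger):
    # Set once at the edge (e.g. middleware); attached automatically below.
    request_id.set(uuid.uuid4().hex)
    logger.info("request started")
```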

Traces

  • Distributed Tracing: End-to-end request flow visualization
  • Span Design: Meaningful span boundaries and metadata
  • Trace Sampling: Intelligent sampling strategies for performance and cost
  • Service Maps: Automatic dependency discovery through traces
  • Root Cause Analysis: Trace-driven debugging workflows
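
To make span boundaries and parent-child metadata concrete, here is a deliberately hand-rolled illustration; a real system would use an instrumentation library such as OpenTelemetry rather than this toy tracer:

```python
import contextvars
import time
import uuid

current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []

class Span:
    """Toy span: trace id inherited from the parent, fresh span id per span."""
    def __init__(self, name):
        parent = current_span.get()
        self.name = name
        self.span_id = uuid.uuid4().hex[:16]
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        self.parent_id = parent.span_id if parent else None
    def __enter__(self):
        self.start = time.monotonic()
        self._token = current_span.set(self)
        return self
    def __exit__(self, *exc):
        self.duration_s = time.monotonic() - self.start
        current_span.reset(self._token)
        finished_spans.append(self)

with Span("handle_request"):
    with Span("db_query"):
        pass  # meaningful work would go here
```

Child spans close before their parents, which is why trace backends can rebuild the call tree from parent ids alone.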

Dashboard Design Principles

Information Architecture

  • Hierarchy: Overview → Service → Component → Instance drill-down paths
  • Golden Ratio: 80% operational metrics, 20% exploratory metrics
  • Cognitive Load: Maximum 7±2 panels per dashboard screen
  • User Journey: Role-based dashboard personas (SRE, Developer, Executive)

Visualization Best Practices

  • Chart Selection: Time series for trends, heatmaps for distributions, gauges for status
  • Color Theory: Red for critical, amber for warning, green for healthy states
  • Reference Lines: SLO targets, capacity thresholds, and historical baselines
  • Time Ranges: Default to meaningful windows (4h for incidents, 7d for trends)

Panel Design

  • Metric Queries: Efficient Prometheus/InfluxDB queries with proper aggregation
  • Alerting Integration: Visual alert state indicators on relevant panels
  • Interactive Elements: Template variables, drill-down links, and annotation overlays
  • Performance: Sub-second render times through query optimization
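
A sketch of programmatic panel generation in the spirit of Grafana's dashboard JSON model. The metric names and PromQL expressions are placeholders, and a real Grafana export carries many more fields than shown here:

```python
import json

def golden_signal_panels(service: str):
    """One timeseries panel per golden signal; expressions are illustrative."""
    signals = [
        ("Latency p95",
         f'histogram_quantile(0.95, sum(rate({service}_request_duration_seconds_bucket[5m])) by (le))'),
        ("Traffic", f'sum(rate({service}_requests_total[5m]))'),
        ("Errors", f'sum(rate({service}_requests_total{{code=~"5.."}}[5m]))'),
        ("Saturation", f'max({service}_queue_depth)'),
    ]
    return [{"title": title, "type": "timeseries", "targets": [{"expr": expr}]}
            for title, expr in signals]

def dashboard(service: str) -> str:
    """Serialize a minimal dashboard spec for one service."""
    return json.dumps({"title": f"{service} overview",
                       "panels": golden_signal_panels(service)}, indent=2)
```

Generating dashboards from code rather than hand-editing them keeps the four signals consistently covered across every service.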

Alert Design and Optimization

Alert Classification

  • Severity Levels:
    • Critical: Service down, SLO burn rate high
    • Warning: Approaching thresholds, non-user-facing issues
    • Info: Deployment notifications, capacity planning alerts
  • Actionability: Every alert must have a clear response action
  • Alert Routing: Escalation policies based on severity and team ownership

Alert Fatigue Prevention

  • Signal vs Noise: High precision (few false positives) over high recall
  • Hysteresis: Different thresholds for firing and resolving alerts
  • Suppression: Dependent alert suppression during known outages
  • Grouping: Related alerts grouped into single notifications
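
The hysteresis bullet deserves a concrete example, since a single shared threshold is the most common cause of flapping alerts. A minimal sketch:

```python
class HysteresisAlert:
    """Fire above one threshold, resolve below a lower one, so a metric
    hovering near a single threshold cannot flap the alert on and off."""
    def __init__(self, fire_above: float, resolve_below: float):
        assert resolve_below < fire_above
        self.fire_above = fire_above
        self.resolve_below = resolve_below
        self.firing = False

    def observe(self, value: float) -> bool:
        if not self.firing and value > self.fire_above:
            self.firing = True
        elif self.firing and value < self.resolve_below:
            self.firing = False
        return self.firing
```

With `HysteresisAlert(0.9, 0.7)`, a utilization reading of 0.85 neither fires a quiet alert nor resolves a firing one.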

Alert Rule Design

  • Threshold Selection: Statistical methods for threshold determination
  • Window Functions: Appropriate averaging windows and percentile calculations
  • Alert Lifecycle: Clear firing conditions and automatic resolution criteria
  • Testing: Alert rule validation against historical data
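
Threshold selection and historical validation can be sketched together: derive a candidate threshold statistically, then replay history to estimate how often the rule would have fired. Illustrative only; real systems should also account for seasonality:

```python
from statistics import mean, stdev

def suggest_threshold(history, k: float = 3.0) -> float:
    """Candidate static threshold at mean + k standard deviations."""
    return mean(history) + k * stdev(history)

def backtest(history, threshold: float) -> float:
    """Fraction of historical samples that would have breached the threshold."""
    return sum(1 for v in history if v > threshold) / len(history)
```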

Runbook Generation and Incident Response

Runbook Structure

  • Alert Context: What the alert means and why it fired
  • Impact Assessment: User-facing vs internal impact evaluation
  • Investigation Steps: Ordered troubleshooting procedures with time estimates
  • Resolution Actions: Common fixes and escalation procedures
  • Post-Incident: Follow-up tasks and prevention measures

Incident Detection Patterns

  • Anomaly Detection: Statistical methods for detecting unusual patterns
  • Composite Alerts: Multi-signal alerts for complex failure modes
  • Predictive Alerts: Capacity and trend-based forward-looking alerts
  • Canary Monitoring: Early detection through progressive deployment monitoring

Golden Signals Framework

Latency Monitoring

  • Request Latency: P50, P95, P99 response time tracking
  • Queue Latency: Time spent waiting in processing queues
  • Network Latency: Inter-service communication delays
  • Database Latency: Query execution and connection pool metrics
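
Percentile tracking from raw samples can be done with the standard library; production systems typically use streaming sketches (t-digest, HDR histograms) rather than retaining every sample:

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """P50/P95/P99 from raw latency samples. The 'inclusive' method treats
    the samples as the whole population, which matches dashboard intuition."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```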

Traffic Monitoring

  • Request Rate: Requests per second with burst detection
  • Bandwidth Usage: Network throughput and capacity utilization
  • User Sessions: Active user tracking and session duration
  • Feature Usage: API endpoint and feature adoption metrics

Error Monitoring

  • Error Rate: 4xx and 5xx HTTP response code tracking
  • Error Budget: SLO-based error rate targets and consumption
  • Error Distribution: Error type classification and trending
  • Silent Failures: Detection of processing failures without HTTP errors
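
For a request-based SLO, the error-budget bullet reduces to simple arithmetic. A sketch, assuming a fixed evaluation window such as 30 days:

```python
def budget_report(total_requests: int, failed_requests: int,
                  slo_target: float = 0.999) -> dict:
    """Error budget consumption for one evaluation window."""
    allowed_failures = total_requests * (1.0 - slo_target)
    consumed = (failed_requests / allowed_failures
                if allowed_failures else float("inf"))
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,              # 1.0 means fully exhausted
        "budget_remaining": max(0.0, 1.0 - consumed),
    }
```

At 99.9%, one million requests permit 1,000 failures; 500 failures means half the budget is spent.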

Saturation Monitoring

  • Resource Utilization: CPU, memory, disk, and network usage
  • Queue Depth: Processing queue length and wait times
  • Connection Pools: Database and service connection saturation
  • Rate Limiting: API throttling and quota exhaustion tracking

Distributed Tracing Strategies

Trace Architecture

  • Sampling Strategy: Head-based, tail-based, and adaptive sampling
  • Trace Propagation: Context propagation across service boundaries
  • Span Correlation: Parent-child relationship modeling
  • Trace Storage: Retention policies and storage optimization
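
Head-based sampling can be made coordination-free by deriving the keep/drop decision deterministically from the trace ID, so every hop in the call path reaches the same verdict. A sketch:

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: hash the trace id into [0, 1) and compare.
    Every service sampling the same id makes the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```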

Service Instrumentation

  • Auto-Instrumentation: Framework-based automatic trace generation
  • Manual Instrumentation: Custom span creation for business logic
  • Baggage Handling: Cross-cutting concern propagation
  • Performance Impact: Instrumentation overhead measurement and optimization

Log Aggregation Patterns

Collection Architecture

  • Agent Deployment: Log shipping agent strategies (push vs pull)
  • Log Routing: Topic-based routing and filtering
  • Parsing Strategies: Structured vs unstructured log handling
  • Schema Evolution: Log format versioning and migration

Storage and Indexing

  • Index Design: Optimized field indexing for common query patterns
  • Retention Policies: Time and volume-based log retention
  • Compression: Log data compression and archival strategies
  • Search Performance: Query optimization and result caching

Cost Optimization for Observability

Data Management

  • Metric Retention: Tiered retention based on metric importance
  • Log Sampling: Intelligent sampling to reduce ingestion costs
  • Trace Sampling: Cost-effective trace collection strategies
  • Data Archival: Cold storage for historical observability data

Resource Optimization

  • Query Efficiency: Optimized metric and log queries
  • Storage Costs: Appropriate storage tiers for different data types
  • Ingestion Rate Limiting: Controlled data ingestion to manage costs
  • Cardinality Management: High-cardinality metric detection and mitigation
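
Cardinality problems usually trace back to a single label key (user IDs, raw URLs) taking unbounded values. A detection sketch over a set of per-series label dicts; the limit is an arbitrary assumption:

```python
from collections import defaultdict

def high_cardinality_labels(series_labels, limit: int = 1000):
    """series_labels: one label dict per time series; returns offending keys,
    i.e. label keys whose distinct-value count exceeds the limit."""
    values_per_key = defaultdict(set)
    for labels in series_labels:
        for key, value in labels.items():
            values_per_key[key].add(value)
    return sorted(k for k, vals in values_per_key.items() if len(vals) > limit)
```

The usual mitigation is to drop or bucket the offending label (e.g. replace a raw URL path with its route template) before ingestion.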

Scripts Overview

This skill includes three powerful Python scripts for comprehensive observability design:

1. SLO Designer (slo_designer.py)

Generates complete SLI/SLO frameworks based on service characteristics:
  • Input: Service description JSON (type, criticality, dependencies)
  • Output: SLI definitions, SLO targets, error budgets, burn rate alerts, SLA recommendations
  • Features: Multi-window burn rate calculations, error budget policies, alert rule generation

2. Alert Optimizer (alert_optimizer.py)

Analyzes and optimizes existing alert configurations:
  • Input: Alert configuration JSON with rules, thresholds, and routing
  • Output: Optimization report and improved alert configuration
  • Features: Noise detection, coverage gaps, duplicate identification, threshold optimization

3. Dashboard Generator (dashboard_generator.py)

Creates comprehensive dashboard specifications:
  • Input: Service/system description JSON
  • Output: Grafana-compatible dashboard JSON and documentation
  • Features: Golden signals coverage, RED/USE methods, drill-down paths, role-based views

Integration Patterns

Monitoring Stack Integration

  • Prometheus: Metric collection and alerting rule generation
  • Grafana: Dashboard creation and visualization configuration
  • Elasticsearch/Kibana: Log analysis and dashboard integration
  • Jaeger/Zipkin: Distributed tracing configuration and analysis

CI/CD Integration

  • Pipeline Monitoring: Build, test, and deployment observability
  • Deployment Correlation: Release impact tracking and rollback triggers
  • Feature Flag Monitoring: A/B test and feature rollout observability
  • Performance Regression: Automated performance monitoring in pipelines

Incident Management Integration

  • PagerDuty/VictorOps: Alert routing and escalation policies
  • Slack/Teams: Notification and collaboration integration
  • JIRA/ServiceNow: Incident tracking and resolution workflows
  • Post-Mortem: Automated incident analysis and improvement tracking

Advanced Patterns

Multi-Cloud Observability

  • Cross-Cloud Metrics: Unified metrics across AWS, GCP, Azure
  • Network Observability: Inter-cloud connectivity monitoring
  • Cost Attribution: Cloud resource cost tracking and optimization
  • Compliance Monitoring: Security and compliance posture tracking

Microservices Observability

  • Service Mesh Integration: Istio/Linkerd observability configuration
  • API Gateway Monitoring: Request routing and rate limiting observability
  • Container Orchestration: Kubernetes cluster and workload monitoring
  • Service Discovery: Dynamic service monitoring and health checks

Machine Learning Observability

  • Model Performance: Accuracy, drift, and bias monitoring
  • Feature Store Monitoring: Feature quality and freshness tracking
  • Pipeline Observability: ML pipeline execution and performance monitoring
  • A/B Test Analysis: Statistical significance and business impact measurement

Best Practices

Organizational Alignment

  • SLO Setting: Collaborative target setting between product and engineering
  • Alert Ownership: Clear escalation paths and team responsibilities
  • Dashboard Governance: Centralized dashboard management and standards
  • Training Programs: Team education on observability tools and practices

Technical Excellence

  • Infrastructure as Code: Observability configuration version control
  • Testing Strategy: Alert rule testing and dashboard validation
  • Performance Monitoring: Observability system performance tracking
  • Security Considerations: Access control and data privacy in observability

Continuous Improvement

  • Metrics Review: Regular SLI/SLO effectiveness assessment
  • Alert Tuning: Ongoing alert threshold and routing optimization
  • Dashboard Evolution: User feedback-driven dashboard improvements
  • Tool Evaluation: Regular assessment of observability tool effectiveness

Success Metrics

Operational Metrics

  • Mean Time to Detection (MTTD): How quickly issues are identified
  • Mean Time to Resolution (MTTR): Time from detection to resolution
  • Alert Precision: Percentage of actionable alerts
  • SLO Achievement: Percentage of SLO targets met consistently

Business Metrics

  • System Reliability: Overall uptime and user experience quality
  • Engineering Velocity: Development team productivity and deployment frequency
  • Cost Efficiency: Observability cost as percentage of infrastructure spend
  • Customer Satisfaction: User-reported reliability and performance satisfaction

This comprehensive observability design skill enables organizations to build robust, scalable monitoring and alerting systems that provide actionable insights while maintaining cost efficiency and operational excellence.