service-mesh-observability
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseService Mesh Observability
服务网格可观测性
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
适用于Istio、Linkerd和服务网格部署的可观测性模式完整指南。
When to Use This Skill
何时使用该技能
- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity
- 在服务间设置分布式追踪
- 实现服务网格指标与仪表盘
- 调试延迟与错误问题
- 为服务通信定义SLO
- 可视化服务依赖关系
- 排查网格连通性问题
Core Concepts
核心概念
1. Three Pillars of Observability
1. 可观测性三大支柱
┌─────────────────────────────────────────────────────┐
│ Observability │
├─────────────────┬─────────────────┬─────────────────┤
│ Metrics │ Traces │ Logs │
│ │ │ │
│ • Request rate │ • Span context │ • Access logs │
│ • Error rate │ • Latency │ • Error details │
│ • Latency P50 │ • Dependencies │ • Debug info │
│ • Saturation │ • Bottlenecks │ • Audit trail │
└─────────────────┴─────────────────┴─────────────────┘┌─────────────────────────────────────────────────────┐
│ Observability │
├─────────────────┬─────────────────┬─────────────────┤
│ Metrics │ Traces │ Logs │
│ │ │ │
│ • Request rate │ • Span context │ • Access logs │
│ • Error rate │ • Latency │ • Error details │
│ • Latency P50 │ • Dependencies │ • Debug info │
│ • Saturation │ • Bottlenecks │ • Audit trail │
└─────────────────┴─────────────────┴─────────────────┘2. Golden Signals for Mesh
2. 网格黄金指标
| Signal | Description | Alert Threshold |
|---|---|---|
| Latency | Request duration P50, P99 | P99 > 500ms |
| Traffic | Requests per second | Anomaly detection |
| Errors | 5xx error rate | > 1% |
| Saturation | Resource utilization | > 80% |
| 指标 | 描述 | 告警阈值 |
|---|---|---|
| Latency | 请求时长P50、P99 | P99 > 500ms |
| Traffic | 每秒请求数 | 异常检测 |
| Errors | 5xx错误率 | > 1% |
| Saturation | 资源利用率 | > 80% |
Templates
模板
Template 1: Istio with Prometheus & Grafana
模板1:搭配Prometheus & Grafana的Istio配置
yaml
undefinedyaml
undefinedInstall Prometheus
Install Prometheus
apiVersion: v1 kind: ConfigMap metadata: name: prometheus namespace: istio-system data: prometheus.yml: | global: scrape_interval: 15s scrape_configs: - job_name: 'istio-mesh' kubernetes_sd_configs: - role: endpoints namespaces: names: - istio-system relabel_configs: - source_labels: [__meta_kubernetes_service_name] action: keep regex: istio-telemetry
apiVersion: v1 kind: ConfigMap metadata: name: prometheus namespace: istio-system data: prometheus.yml: | global: scrape_interval: 15s scrape_configs: - job_name: 'istio-mesh' kubernetes_sd_configs: - role: endpoints namespaces: names: - istio-system relabel_configs: - source_labels: [__meta_kubernetes_service_name] action: keep regex: istio-telemetry
ServiceMonitor for Prometheus Operator
ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: istio-mesh
namespace: istio-system
spec:
selector:
matchLabels:
app: istiod
endpoints:
- port: http-monitoring
interval: 15s
undefinedapiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: istio-mesh
namespace: istio-system
spec:
selector:
matchLabels:
app: istiod
endpoints:
- port: http-monitoring
interval: 15s
undefinedTemplate 2: Key Istio Metrics Queries
模板2:关键Istio指标查询语句
promql
undefinedpromql
undefinedRequest rate by service
Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
Error rate (5xx)
Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
P99 latency
P99 latency
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
TCP connections
TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
Request size
Request size
histogram_quantile(0.99,
sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
undefinedhistogram_quantile(0.99,
sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
undefinedTemplate 3: Jaeger Distributed Tracing
模板3:Jaeger分布式追踪配置
yaml
undefinedyaml
undefinedJaeger installation for Istio
Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1 kind: IstioOperator spec: meshConfig: enableTracing: true defaultConfig: tracing: sampling: 100.0 # 100% in dev, lower in prod zipkin: address: jaeger-collector.istio-system:9411
apiVersion: install.istio.io/v1alpha1 kind: IstioOperator spec: meshConfig: enableTracing: true defaultConfig: tracing: sampling: 100.0 # 100% in dev, lower in prod zipkin: address: jaeger-collector.istio-system:9411
Jaeger deployment
Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: istio-system
spec:
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.50
ports:
- containerPort: 5775 # UDP
- containerPort: 6831 # Thrift
- containerPort: 6832 # Thrift
- containerPort: 5778 # Config
- containerPort: 16686 # UI
- containerPort: 14268 # HTTP
- containerPort: 14250 # gRPC
- containerPort: 9411 # Zipkin
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
undefinedapiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: istio-system
spec:
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.50
ports:
- containerPort: 5775 # UDP
- containerPort: 6831 # Thrift
- containerPort: 6832 # Thrift
- containerPort: 5778 # Config
- containerPort: 16686 # UI
- containerPort: 14268 # HTTP
- containerPort: 14250 # gRPC
- containerPort: 9411 # Zipkin
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
undefinedTemplate 4: Linkerd Viz Dashboard
模板4:Linkerd Viz仪表盘使用
bash
undefinedbash
undefinedInstall Linkerd viz extension
Install Linkerd viz extension
linkerd viz install | kubectl apply -f -
linkerd viz install | kubectl apply -f -
Access dashboard
Access dashboard
linkerd viz dashboard
linkerd viz dashboard
CLI commands for observability
CLI commands for observability
Top requests
Top requests
linkerd viz top deploy/my-app
linkerd viz top deploy/my-app
Per-route metrics
Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend
linkerd viz routes deploy/my-app --to deploy/backend
Live traffic inspection
Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend
linkerd viz tap deploy/my-app --to deploy/backend
Service edges (dependencies)
Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
undefinedlinkerd viz edges deployment -n my-namespace
undefinedTemplate 5: Grafana Dashboard JSON
模板5:Grafana仪表盘JSON配置
json
{
"dashboard": {
"title": "Service Mesh Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{ "value": 0, "color": "green" },
{ "value": 1, "color": "yellow" },
{ "value": 5, "color": "red" }
]
}
}
}
},
{
"title": "P99 Latency",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Service Topology",
"type": "nodeGraph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
}
]
}
]
}
}json
{
"dashboard": {
"title": "Service Mesh Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{ "value": 0, "color": "green" },
{ "value": 1, "color": "yellow" },
{ "value": 5, "color": "red" }
]
}
}
}
},
{
"title": "P99 Latency",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Service Topology",
"type": "nodeGraph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
}
]
}
]
}
}Template 6: Kiali Service Mesh Visualization
模板6:Kiali服务网格可视化配置
yaml
undefinedyaml
undefinedKiali installation
Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
name: kiali
namespace: istio-system
spec:
auth:
strategy: anonymous # or openid, token
deployment:
accessible_namespaces:
- "**"
external_services:
prometheus:
url: http://prometheus.istio-system:9090
tracing:
url: http://jaeger-query.istio-system:16686
grafana:
url: http://grafana.istio-system:3000
undefinedapiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
name: kiali
namespace: istio-system
spec:
auth:
strategy: anonymous # or openid, token
deployment:
accessible_namespaces:
- "**"
external_services:
prometheus:
url: http://prometheus.istio-system:9090
tracing:
url: http://jaeger-query.istio-system:16686
grafana:
url: http://grafana.istio-system:3000
undefinedTemplate 7: OpenTelemetry Integration
模板7:OpenTelemetry集成配置
yaml
undefinedyaml
undefinedOpenTelemetry Collector for mesh
OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
zipkin:
endpoint: 0.0.0.0:9411
processors:
batch:
timeout: 10s
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp, zipkin]
processors: [batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
zipkin:
endpoint: 0.0.0.0:9411
processors:
batch:
timeout: 10s
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp, zipkin]
processors: [batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]Istio Telemetry v2 with OTel
Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: otel
randomSamplingPercentage: 10
undefinedapiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: otel
randomSamplingPercentage: 10
undefinedAlerting Rules
告警规则
yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: mesh-alerts
namespace: istio-system
spec:
groups:
- name: mesh.rules
rules:
- alert: HighErrorRate
expr: |
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
/ sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.destination_service_name }}"
- alert: HighLatency
expr: |
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
by (le, destination_service_name)) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency for {{ $labels.destination_service_name }}"
- alert: MeshCertExpiring
expr: |
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
labels:
severity: warning
annotations:
summary: "Mesh certificate expiring in less than 7 days"yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: mesh-alerts
namespace: istio-system
spec:
groups:
- name: mesh.rules
rules:
- alert: HighErrorRate
expr: |
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
/ sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.destination_service_name }}"
- alert: HighLatency
expr: |
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
by (le, destination_service_name)) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency for {{ $labels.destination_service_name }}"
- alert: MeshCertExpiring
expr: |
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
labels:
severity: warning
annotations:
summary: "Mesh certificate expiring in less than 7 days"Best Practices
最佳实践
Do's
建议做法
- Sample appropriately - 100% in dev, 1-10% in prod
- Use trace context - Propagate headers consistently
- Set up alerts - For golden signals
- Correlate metrics/traces - Use exemplars
- Retain strategically - Hot/cold storage tiers
- 合理设置采样率 - 开发环境100%采样,生产环境1-10%
- 传递追踪上下文 - 持续传播追踪头信息
- 配置告警规则 - 针对黄金指标设置告警
- 关联指标与追踪 - 使用示例关联数据
- 分层存储数据 - 采用热/冷存储分层策略
Don'ts
避免事项
- Don't over-sample - Storage costs add up
- Don't ignore cardinality - Limit label values
- Don't skip dashboards - Visualize dependencies
- Don't forget costs - Monitor observability costs
- 不要过度采样 - 避免存储成本过高
- 不要忽略基数问题 - 限制标签值数量
- 不要跳过仪表盘配置 - 可视化服务依赖关系
- 不要忽视成本 - 监控可观测性相关成本