Service Mesh Observability

A guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

When to Use This Skill

Setting up distributed tracing across services
Implementing service mesh metrics and dashboards
Debugging latency and error issues
Defining SLOs for service communication
Visualizing service dependencies
Troubleshooting mesh connectivity

Core Concepts

1. Three Pillars of Observability

┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘

2. Golden Signals for Mesh

Signal	Description	Alert Threshold
Latency	Request duration P50, P99	P99 > 500ms
Traffic	Requests per second	Anomaly detection
Errors	5xx error rate	> 1%
Saturation	Resource utilization	> 80%
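
The thresholds in the table above can be written down as a small check. This is a minimal sketch, assuming the measurements have already been fetched from the monitoring backend; the `MeshSnapshot` type and its field names are hypothetical. The Traffic row is left out because anomaly detection needs a baseline that the table does not define.

```python
from dataclasses import dataclass


@dataclass
class MeshSnapshot:
    """Hypothetical container for one service's current measurements."""
    p99_latency_ms: float
    error_rate_percent: float
    saturation_percent: float


def breached_signals(snapshot: MeshSnapshot) -> list[str]:
    """Return the golden signals that exceed the thresholds from the table."""
    breached = []
    if snapshot.p99_latency_ms > 500:      # Latency: P99 > 500ms
        breached.append("Latency")
    if snapshot.error_rate_percent > 1:    # Errors: 5xx error rate > 1%
        breached.append("Errors")
    if snapshot.saturation_percent > 80:   # Saturation: utilization > 80%
        breached.append("Saturation")
    return breached


if __name__ == "__main__":
    print(breached_signals(MeshSnapshot(620.0, 0.4, 85.0)))
```

All comparisons are strict, matching the ">" in the table, so a value sitting exactly on a threshold does not count as a breach.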

Templates

Template 1: Istio with Prometheus & Grafana

yaml

# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          # NOTE: istio-telemetry is the Mixer-era telemetry service. Istio
          # releases without Mixer expose metrics from istiod and the Envoy
          # sidecars instead, so adjust this regex to match your installation.
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
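
The `keep` action in the scrape config above drops every discovered target whose source label does not match the regex. The following is a rough sketch of that filtering, assuming Prometheus's documented behaviour that relabel regexes are anchored at both ends. The `keep_targets` helper and the sample targets are made up for illustration, and Python's `re` stands in for the RE2 syntax Prometheus actually uses.

```python
import re


def keep_targets(targets, source_label, regex):
    """Mimic a relabel rule with `action: keep` on a single source label."""
    pattern = re.compile(regex)
    # fullmatch mirrors the anchoring Prometheus applies to relabel regexes.
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical discovery results for the istio-system namespace.
discovered = [
    {"__meta_kubernetes_service_name": "istio-telemetry"},
    {"__meta_kubernetes_service_name": "istio-telemetry-canary"},
    {"__meta_kubernetes_service_name": "istiod"},
]

kept = keep_targets(discovered, "__meta_kubernetes_service_name", "istio-telemetry")
print([t["__meta_kubernetes_service_name"] for t in kept])
```

Because of the anchoring, `istio-telemetry-canary` is dropped even though it contains the pattern as a prefix.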

Template 2: Key Istio Metrics Queries

promql

# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

# Request size
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
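
The P99 queries above rely on `histogram_quantile`, which estimates a quantile from cumulative `le` buckets by interpolating linearly inside the bucket that contains the target rank. The sketch below illustrates that idea in simplified form for 0 < q <= 1; it is not Prometheus's implementation, and the bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper = buckets[idx][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0.0
    in_bucket = buckets[idx][1] - below
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Invented per-second bucket rates for one service (bounds in milliseconds).
example_buckets = [
    (100.0, 90.0),
    (250.0, 97.0),
    (500.0, 99.5),
    (1000.0, 100.0),
    (math.inf, 100.0),
]

if __name__ == "__main__":
    print(histogram_quantile(0.99, example_buckets))
```

With these numbers the 99th-percentile rank lands in the 250 to 500 ms bucket, so the estimate can never be more precise than that bucket's width. This is why bucket boundaries should bracket the latency thresholds being alerted on.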

Template 3: Jaeger Distributed Tracing

yaml

# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0  # 100% in dev, lower in prod
        zipkin:
          # Requires a Service named jaeger-collector in istio-system that
          # exposes port 9411 of the Deployment below.
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775   # UDP
              protocol: UDP
            - containerPort: 6831   # Thrift
              protocol: UDP
            - containerPort: 6832   # Thrift
              protocol: UDP
            - containerPort: 5778   # Config
            - containerPort: 16686  # UI
            - containerPort: 14268  # HTTP
            - containerPort: 14250  # gRPC
            - containerPort: 9411   # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
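
The `sampling` value above is a percentage of requests to trace. As a rough illustration of head-based sampling in general (not a description of how the Istio proxy is implemented), the decision can be sketched as below. The `should_sample` function and its parameters are hypothetical.

```python
import random


def should_sample(sampling_percent, incoming_decision=None, rng=random.random):
    """Decide whether to record a trace.

    An upstream decision, if present, is honoured so that all spans of one
    trace are either kept or dropped together. Otherwise a fresh random
    draw is compared against the configured percentage.
    """
    if incoming_decision is not None:
        return incoming_decision
    # rng() returns a float in [0, 1), so 100.0 always samples and 0.0 never does.
    return rng() * 100.0 < sampling_percent


if __name__ == "__main__":
    random.seed(7)
    kept = sum(should_sample(10.0) for _ in range(10_000))
    print(f"kept {kept} of 10000 traces at 10% sampling")
```

Making the decision once at the edge and carrying it downstream is what keeps traces complete when the rate is lowered in production.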

Template 4: Linkerd Viz Dashboard

bash

# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace

Template 5: Grafana Dashboard JSON

json

{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
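
The `thresholds.steps` list in the Error Rate panel maps a reading to a colour: the panel takes the colour of the highest step whose `value` the reading has reached. Below is a small sketch of that lookup, assuming absolute (not percentage) thresholds; `threshold_color` is a hypothetical helper, not a Grafana API.

```python
def threshold_color(value, steps):
    """Return the colour of the highest step whose value is <= the reading."""
    ordered = sorted(steps, key=lambda step: step["value"])
    color = ordered[0]["color"]  # base colour for readings below every step
    for step in ordered:
        if value >= step["value"]:
            color = step["color"]
    return color


# Steps copied from the Error Rate panel above.
ERROR_RATE_STEPS = [
    {"value": 0, "color": "green"},
    {"value": 1, "color": "yellow"},
    {"value": 5, "color": "red"},
]

if __name__ == "__main__":
    for reading in (0.4, 1.0, 7.2):
        print(reading, threshold_color(reading, ERROR_RATE_STEPS))
```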

Template 6: Kiali Service Mesh Visualization

yaml

# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous  # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000

Template 7: OpenTelemetry Integration

yaml

# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 10s

    exporters:
      # NOTE: newer Collector releases have dropped the jaeger exporter; on
      # those versions, use the otlp exporter against Jaeger's OTLP endpoint.
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        # "otel" must match a provider defined under meshConfig.extensionProviders.
        - name: otel
      randomSamplingPercentage: 10

Alerting Rules

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
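
The `for: 5m` clause in the rules above means an alert only fires once its expression has been true at every evaluation for five minutes; before that it is pending, and a single false evaluation resets the clock. The following is a simplified sketch of that state machine, with an invented series of error-ratio samples. The `alert_states` helper is hypothetical and ignores details such as evaluation jitter.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) for each evaluation of a threshold alert.

    `samples` is a list of (timestamp_seconds, value) pairs in time order.
    """
    active_since = None
    states = []
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            active_since = None    # any false evaluation resets the timer
            state = "inactive"
        states.append((ts, state))
    return states


# Invented error ratios sampled once a minute; HighErrorRate uses > 0.05 for 5m.
samples = [(0, 0.02), (60, 0.08), (120, 0.09), (180, 0.03)] + [
    (ts, 0.07) for ts in range(240, 600, 60)
]

if __name__ == "__main__":
    for ts, state in alert_states(samples, threshold=0.05, for_seconds=300):
        print(ts, state)
```

In this series the brief spike at 60 to 120 seconds never fires because it clears before five minutes have passed. Only the sustained breach starting at 240 seconds reaches the firing state.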

Best Practices

Do's

Sample appropriately - 100% in dev, 1-10% in prod
Use trace context - Propagate headers consistently
Set up alerts - For golden signals
Correlate metrics/traces - Use exemplars
Retain strategically - Hot/cold storage tiers
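
For the "Use trace context" item: the sidecar cannot tell which outbound call belongs to which inbound request, so the application has to copy the tracing headers across. Here is a minimal sketch, assuming the B3 and W3C header names that Istio's tracing documentation lists. The `propagate_trace_headers` helper is hypothetical and independent of any HTTP framework.

```python
# Header names to forward from the inbound request to outbound calls.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(incoming):
    """Return the tracing headers to copy onto an outgoing request.

    Header names are matched case-insensitively; unrelated headers are dropped.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    inbound = {
        "X-Request-Id": "abc-123",
        "X-B3-TraceId": "463ac35c9f6413ad",
        "X-B3-Sampled": "1",
        "Authorization": "Bearer example",
    }
    print(propagate_trace_headers(inbound))
```

The returned dictionary can be merged into the headers of every downstream call made while handling the request.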
