grafana-dashboards

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Grafana Dashboards

Grafana 仪表盘

Create and manage production-ready Grafana dashboards for comprehensive system observability.

创建并管理适用于生产环境的Grafana仪表盘，实现全面的系统可观测性。

Purpose

用途

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

设计高效的Grafana仪表盘，用于监控应用、基础设施和业务指标。

When to Use

适用场景

Visualize Prometheus metrics
Create custom dashboards
Implement SLO dashboards
Monitor infrastructure
Track business KPIs

可视化Prometheus指标
创建自定义仪表盘
实现SLO仪表盘
监控基础设施
追踪业务KPI

Dashboard Design Principles

仪表盘设计原则

1. Hierarchy of Information

1. 信息层级

┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘

2. RED Method (Services)

2. RED方法（服务监控）

Rate - Requests per second
Errors - Error rate
Duration - Latency/response time

速率 - 每秒请求数
错误 - 错误率
耗时 - 延迟/响应时间

3. USE Method (Resources)

3. USE方法（资源监控）

Utilization - % time resource is busy
Saturation - Queue length/wait time
Errors - Error count

利用率 - 资源繁忙时间占比
饱和度 - 队列长度/等待时间
错误 - 错误数量

Dashboard Structure

仪表盘结构

API Monitoring Dashboard

API监控仪表盘

json

{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "type": "query"
            }
          ]
        },
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      }
    ]
  }
}

Reference: See

assets/api-dashboard.json

json

{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "type": "query"
            }
          ]
        },
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      }
    ]
  }
}

参考： 查看

assets/api-dashboard.json

Panel Types

面板类型

1. Stat Panel (Single Value)

1. 统计面板（单值）

json

{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}

json

{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}

2. Time Series Graph

2. 时间序列图

json

{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "yaxes": [
    { "format": "percent", "max": 100, "min": 0 },
    { "format": "short" }
  ]
}

json

{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "yaxes": [
    { "format": "percent", "max": 100, "min": 0 },
    { "format": "short" }
  ]
}

3. Table Panel

3. 表格面板

json

{
  "type": "table",
  "title": "Service Status",
  "targets": [
    {
      "expr": "up",
      "format": "table",
      "instant": true
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}

json

{
  "type": "table",
  "title": "Service Status",
  "targets": [
    {
      "expr": "up",
      "format": "table",
      "instant": true
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}

4. Heatmap

4. 热力图

json

{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}

json

{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}

Variables

变量

Query Variables

查询变量

json

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}

json

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}

Use Variables in Queries

在查询中使用变量

sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))

sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))

Alerts in Dashboards

仪表盘告警

json

{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": { "type": "and" },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-channel" }]
  }
}

json

{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": { "type": "and" },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-channel" }]
  }
}

Dashboard Provisioning

仪表盘配置管理

dashboards.yml:

yaml

apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "General"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards

dashboards.yml:

yaml

apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "General"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards

Common Dashboard Patterns

常见仪表盘模式

Infrastructure Dashboard

基础设施仪表盘

Key Panels:

CPU utilization per node
Memory usage per node
Disk I/O
Network traffic
Pod count by namespace
Node status

Reference: See

assets/infrastructure-dashboard.json

核心面板：

单节点CPU利用率
单节点内存使用率
磁盘I/O
网络流量
按命名空间统计的Pod数量
节点状态

参考： 查看

assets/infrastructure-dashboard.json

Database Dashboard

数据库仪表盘

Key Panels:

Queries per second
Connection pool usage
Query latency (P50, P95, P99)
Active connections
Database size
Replication lag
Slow queries

Reference: See

assets/database-dashboard.json

核心面板：

每秒查询数
连接池使用率
查询延迟（P50、P95、P99）
活跃连接数
数据库大小
复制延迟
慢查询

参考： 查看

assets/database-dashboard.json

Application Dashboard

应用仪表盘

Key Panels:

Request rate
Error rate
Response time (percentiles)
Active users/sessions
Cache hit rate
Queue length

核心面板：

请求速率
错误率
响应时间（分位数）
活跃用户/会话数
缓存命中率
队列长度

Best Practices

最佳实践

Start with templates (Grafana community dashboards)
Use consistent naming for panels and variables
Group related metrics in rows
Set appropriate time ranges (default: Last 6 hours)
Use variables for flexibility
Add panel descriptions for context
Configure units correctly
Set meaningful thresholds for colors
Use consistent colors across dashboards
Test with different time ranges

从模板开始（使用Grafana社区仪表盘）
使用一致的命名 命名面板和变量
分组相关指标 在同一行中
设置合适的时间范围（默认：最近6小时）
使用变量 提升灵活性
添加面板描述 提供上下文信息
正确配置单位
设置有意义的颜色阈值
在所有仪表盘中使用一致的颜色
测试不同的时间范围

Dashboard as Code

即代码化仪表盘

Terraform Provisioning

使用Terraform配置

hcl

resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}

hcl

resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}

Ansible Provisioning

使用Ansible配置

yaml

- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana

yaml

- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana

grafana-dashboards

Original

Translation

Grafana Dashboards

Grafana 仪表盘

Purpose

用途

When to Use

适用场景

Dashboard Design Principles

仪表盘设计原则

1. Hierarchy of Information

1. 信息层级

2. RED Method (Services)

2. RED方法（服务监控）

3. USE Method (Resources)

3. USE方法（资源监控）

Dashboard Structure

仪表盘结构

API Monitoring Dashboard

API监控仪表盘

Panel Types

面板类型

1. Stat Panel (Single Value)

1. 统计面板（单值）

2. Time Series Graph

2. 时间序列图

3. Table Panel

3. 表格面板

4. Heatmap

4. 热力图

Variables

变量

Query Variables

查询变量

Use Variables in Queries

在查询中使用变量

Alerts in Dashboards

仪表盘告警

Dashboard Provisioning

仪表盘配置管理

Common Dashboard Patterns

常见仪表盘模式

Infrastructure Dashboard

基础设施仪表盘

Database Dashboard

数据库仪表盘

Application Dashboard

应用仪表盘

Best Practices

最佳实践

Dashboard as Code

即代码化仪表盘

Terraform Provisioning

使用Terraform配置

Ansible Provisioning

使用Ansible配置

Reference Files

参考文件

Related Skills

相关技能